Chapter 15 — Decision Matrix

Data Analytics Decision Matrix

The core of the handbook. Stop guessing. Use these decision trees to choose the right method for missing values, outliers, encoding, scaling, feature selection, and metrics.

I have this data and this goal. What should I do next?

This is the chapter you open mid-project. Each section gives a decision tree, a comparison table, and a professional recommendation — so you choose with reasons, not habit.

Missing values · Outliers · Encoding · Scaling · Feature selection · Metrics

15.1 Missing value decision guide

decision tree

Missing Values?
│
├── Numeric
│   ├── Normal distribution ───────► Mean
│   ├── Skewed / has outliers ─────► Median
│   └── Time series / ordered ─────► Forward / backward fill
│
├── Categorical
│   ├── One dominant value ────────► Mode
│   └── Missingness is meaningful ─► New "Unknown" category
│
└── > 40% missing in a column
    └── Low predictive value ──────► Drop the column

Method	Use when	Avoid when	Why
Mean	Symmetric numeric data	Skewed data / outliers	Outliers drag the mean and distort fills
Median	Skewed numeric data	Exact average behaviour needed	Robust to extreme values
Mode	Categorical columns	Continuous numeric data	Keeps category structure intact
Forward / back fill	Time series, ordered logs	Shuffled / unrelated rows	Uses real neighbouring values
KNN / model impute	Strong feature correlations	Large data / tight deadlines	More accurate but slow and leak-prone
Drop rows	<5% rows affected, random	Missingness is patterned	Dropping a pattern biases the data
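The tree above can be turned into a small helper that suggests a fill strategy for a single column. This is a minimal sketch, not a definitive rule set: the function name suggest_fill, the skewness cutoff of 1.0, and the returned labels are assumptions made for illustration. Only the 40% threshold comes from the tree.

```python
import numpy as np
import pandas as pd


def suggest_fill(series: pd.Series, ordered: bool = False,
                 skew_cutoff: float = 1.0) -> str:
    """Hypothetical helper that mirrors the decision tree for one column.

    skew_cutoff is an assumed threshold for "skewed"; tune it for your data.
    """
    # More than 40% missing: the tree says to consider dropping the column
    if series.isna().mean() > 0.40:
        return "consider dropping the column"
    # Time series or ordered logs: use neighbouring values
    if ordered:
        return "forward / backward fill"
    # Categorical branch
    if not pd.api.types.is_numeric_dtype(series):
        return "mode or a new category"
    # Numeric branch: median when the distribution is skewed, mean otherwise
    return "median" if abs(series.skew()) > skew_cutoff else "mean"


symmetric = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, np.nan])
skewed = pd.Series([1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 1000.0, np.nan])
mostly_missing = pd.Series([1.0, np.nan, np.nan])
labels = pd.Series(["a", "b", None], dtype=object)

for name, col in [("symmetric", symmetric), ("skewed", skewed),
                  ("mostly_missing", mostly_missing), ("labels", labels)]:
    print(name, "->", suggest_fill(col))
```

Treat the output as a starting suggestion. Whether missingness is meaningful, or whether a sparse column still has predictive value, needs domain judgement that a helper like this cannot supply.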

Professional recommendation

For most tabular projects, impute numeric columns with the median and categorical columns with a "Missing" category. Add a binary was_missing flag when the fact of missingness might carry signal. Always split into train and test sets first, then fit the imputer on the training set only, so that test statistics never leak into the fill values.

python

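# Median for numeric, "Missing" category for categorical, in leak-safe order
from sklearn.impute import SimpleImputer

# num_cols / cat_cols: lists of numeric and categorical column names
num_imputer = SimpleImputer(strategy='median')
cat_imputer = SimpleImputer(strategy='constant', fill_value='Missing')

# fit on TRAIN only, then transform both
X_train[num_cols] = num_imputer.fit_transform(X_train[num_cols])
X_test[num_cols]  = num_imputer.transform(X_test[num_cols])
X_train[cat_cols] = cat_imputer.fit_transform(X_train[cat_cols])
X_test[cat_cols]  = cat_imputer.transform(X_test[cat_cols])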
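The was_missing flag from the recommendation can be produced by the imputer itself. The sketch below assumes scikit-learn's SimpleImputer with add_indicator=True, which appends one indicator column for each feature that had missing values during fit. The income column and its values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (hypothetical): one skewed numeric column with gaps
train = pd.DataFrame({"income": [30.0, np.nan, 50.0, 1000.0]})
test = pd.DataFrame({"income": [np.nan, 40.0]})

# Median fill plus a was_missing indicator column
imputer = SimpleImputer(strategy="median", add_indicator=True)

# Fit on TRAIN only; the test set reuses the training median
train_out = imputer.fit_transform(train)
test_out = imputer.transform(test)

# Column 0 is the filled value, column 1 is the was_missing flag
print(pd.DataFrame(train_out, columns=["income", "income_was_missing"]))
print(pd.DataFrame(test_out, columns=["income", "income_was_missing"]))
```

The missing test value is filled with the training median (50.0), not a median computed on the test rows.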


15.2 Outlier handling guide

decision tree

Outlier detected?
│
├── Data error (typo, wrong unit, impossible value)
│   └──────────────────────────────► Remove / correct
│
└── Real event (genuine extreme value)
    ├── Goal = describe / report ──► Keep, report separately
    ├── Goal = linear model ───────► Cap (winsorize) or log-transform
    └── Goal = tree model ─────────► Keep (trees are robust)

Action	Use when	Avoid when
Remove	Confirmed data-entry error	Value is a real rare event
Cap / winsorize	Modeling with linear / distance models	Extreme values are the target of interest (fraud, churn spikes)
Log / sqrt transform	Right-skewed positive values	Negative values present (with zeros, use log1p or sqrt instead of log)
Keep as-is	Tree models, anomaly detection	Distance-based models (KNN, SVM)

Never delete outliers just because they look big. In fraud, churn, and quality control the outliers are the signal you are paid to find.
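For the linear-model branch, capping and log-transforming can be sketched as follows. The 1.5 × IQR fence used to set the caps is a common convention assumed here, not something this chapter prescribes. The helper name iqr_bounds and the sample values are illustrative.

```python
import numpy as np
import pandas as pd


def iqr_bounds(s: pd.Series, k: float = 1.5) -> tuple:
    """Return (lower, upper) fences at k * IQR beyond the quartiles.

    k = 1.5 is an assumed convention; adjust it to your tolerance.
    """
    q1 = s.quantile(0.25)
    q3 = s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical training column with one genuine extreme value
train = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 300.0])

# Compute the fences on training data only
lower, upper = iqr_bounds(train)

# Flag first, so the extreme rows can still be reported separately
is_outlier = (train < lower) | (train > upper)

# Option 1: cap (winsorize) to the fences
capped = train.clip(lower=lower, upper=upper)

# Option 2: log-transform; log1p is defined at zero
logged = np.log1p(train)

print("fences:", lower, upper)
print("flagged rows:", int(is_outlier.sum()))
print("max after capping:", capped.max())
```

Keeping the is_outlier flag alongside the capped column preserves the information that the row was extreme, which matters when the outliers may be the signal.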

15.3 Encoding selection guide

decision tree

Categorical variable?
│
├── Ordinal (has order: low < mid < high) ─► Label / Ordinal encoding
│
├── Nominal (no order)
│   ├── Low cardinality (< ~15) ──────────────► One-Hot encoding
│   └── High cardinality (zip, city, id) ──────► Target / frequency encoding
│
└── Tree model + high cardinality ──────────► Native categorical (LightGBM / CatBoost)

Encoding	Use when	Avoid when	Risk
Label / Ordinal	True order exists	No order (nominal)	Fake order misleads linear models
One-Hot	Few categories, nominal	Hundreds of categories	Column explosion
Target encoding	High cardinality	Small data	Leakage if not cross-fitted
Frequency	Category frequency itself carries signal	Many categories share the same frequency	Categories with equal counts collide and become indistinguishable
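The first two rows of the table map onto scikit-learn's OrdinalEncoder and OneHotEncoder. The sketch below uses made-up category values. Passing the category order explicitly is the important detail for ordinal data, because the default order is alphabetical.

```python
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Ordinal: state the order explicitly (low < mid < high)
ordinal = OrdinalEncoder(categories=[["low", "mid", "high"]])
sizes = ordinal.fit_transform([["low"], ["high"], ["mid"]])
print(sizes.ravel())

# Nominal: one column per category; unseen categories become all zeros
onehot = OneHotEncoder(handle_unknown="ignore")
onehot.fit([["red"], ["blue"], ["red"]])

seen = onehot.transform([["red"]]).toarray()
unseen = onehot.transform([["green"]]).toarray()

print(onehot.categories_)
print(seen)
print(unseen)
```

With handle_unknown="ignore", a category that appears only at prediction time does not raise an error. It is encoded as a row of zeros.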

Professional recommendation

Default to One-Hot for nominal columns under ~15 categories. For high-cardinality columns use target encoding with K-fold cross-fitting to avoid leakage, or switch to CatBoost / LightGBM which handle categories natively.
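K-fold cross-fitting means each row is encoded with target means computed from the other folds, so a row never sees its own label. The function below is a hand-rolled sketch under that assumption. The name cross_fit_target_encode is hypothetical, it applies no smoothing, and categories unseen in the fitting folds fall back to the global mean.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def cross_fit_target_encode(cat: pd.Series, y: pd.Series,
                            n_splits: int = 5, seed: int = 0) -> pd.Series:
    """Hypothetical helper: out-of-fold target mean per category."""
    encoded = pd.Series(np.nan, index=cat.index, dtype=float)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in folds.split(cat):
        # Category means come from the fitting folds only
        means = y.iloc[fit_idx].groupby(cat.iloc[fit_idx]).mean()
        # Encode the held-out fold with those means
        encoded.iloc[enc_idx] = cat.iloc[enc_idx].map(means).to_numpy()
    # Categories never seen in the fitting folds get the global mean
    return encoded.fillna(y.mean())


# Toy training data (hypothetical): two common cities and one rare one
city = pd.Series(["a"] * 10 + ["b"] * 10 + ["rare"])
target = pd.Series([1] * 8 + [0] * 2 + [0] * 8 + [1] * 2 + [1])

city_encoded = cross_fit_target_encode(city, target)
print(city_encoded.round(3).tolist())
```

The single "rare" row is encoded with the global mean rather than its own label, which is exactly the leakage that cross-fitting prevents. Apply this to training data; encode the test set with means learned from the full training set.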

15.4 Scaling selection guide

decision tree

Which model?
│
├── Distance / gradient based
│   Linear Reg · Logistic Reg · SVM · KNN · K-Means · Neural Net · PCA
│   └──────────────────────────────► SCALING REQUIRED
│       ├── Outliers present ──────► RobustScaler
│       ├── Roughly normal ────────► StandardScaler
│       └── Bounded range needed ──► MinMaxScaler
│
└── Tree based
    Decision Tree · Random Forest · XGBoost · LightGBM
    └──────────────────────────────► SCALING NOT REQUIRED

Scaler	Use when	Avoid when
StandardScaler	Features ~ normal, no big outliers	Heavy outliers present
MinMaxScaler	Need [0,1] range (neural nets, images)	Outliers present (they compress the rest of the data)
RobustScaler	Outliers present	Clean, normal data (overkill)
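The effect of a single outlier on each scaler is easy to see on a toy column. The values below are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Four ordinary values and one outlier
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

minmax = MinMaxScaler().fit_transform(x).ravel()
standard = StandardScaler().fit_transform(x).ravel()
robust = RobustScaler().fit_transform(x).ravel()

# MinMax: the outlier takes 1.0 and squeezes everything else near 0
print("MinMaxScaler  ", np.round(minmax, 3))
# Standard: mean and standard deviation are both pulled by the outlier
print("StandardScaler", np.round(standard, 3))
# Robust: centred on the median, scaled by the IQR, so the bulk stays spread out
print("RobustScaler  ", np.round(robust, 3))
```

RobustScaler does not remove the outlier. It only stops the outlier from dictating the scale of the remaining values.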

Leakage alert: fit the scaler on training data only, then transform test data. Fitting on the full dataset leaks test statistics into training. See Chapter 20.
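One way to make the fit-on-train rule hard to break is to put the scaler inside a scikit-learn Pipeline. During cross-validation the pipeline refits the scaler on each training fold, so held-out rows never influence the scaling statistics. The synthetic dataset and the choice of LogisticRegression are assumptions for the sketch.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a real feature matrix
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# The scaler is a pipeline step, so it is refit inside every CV fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scores = cross_val_score(pipe, X, y, cv=5)
print("fold accuracies:", scores.round(3))
```

Scaling the full X before calling cross_val_score would look similar in code but would leak each fold's validation statistics into training.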

15.5 Feature selection guide

decision tree

Too many features?
│
├── Remove redundancy
│   ├── Pairwise correlation > 0.9 ──► Drop one of the pair
│   └── Multicollinearity (VIF > 5) ─► Drop / combine
│
├── Rank relevance
│   ├── Any relationship ────────────► Mutual Information
│   └── Linear only ─────────────────► Correlation with target
│
└── Model driven
    ├── Need sparsity + speed ───────► LASSO (L1)
    └── Need explainability ─────────► SHAP importance

Professional recommendation

Start by removing redundant features (pairwise correlation > 0.9 or VIF > 5). Then rank what remains with Mutual Information, which captures non-linear links, and validate the top features with SHAP on a baseline model. Any selection step that uses the target must be fitted on the training split only; selecting on the full dataset before splitting leaks test information.
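The first two steps of that recommendation can be sketched with pandas and scikit-learn. The synthetic columns are assumptions for illustration, and the VIF and SHAP steps are left out because they need extra libraries. On a real project, run this on the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Synthetic training data: one useful feature, a near-copy, and pure noise
rng = np.random.default_rng(0)
n = 500
signal = rng.normal(size=n)
X = pd.DataFrame({
    "signal": signal,
    "signal_copy": signal + rng.normal(scale=0.01, size=n),
    "noise": rng.normal(size=n),
})
y = (signal > 0).astype(int)

# Step 1: drop one feature from every pair with |correlation| > 0.9.
# Keeping only the upper triangle means each pair is checked once.
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
X_reduced = X.drop(columns=to_drop)

# Step 2: rank the remaining features by mutual information with the target
mi = pd.Series(
    mutual_info_classif(X_reduced, y, random_state=0),
    index=X_reduced.columns,
).sort_values(ascending=False)

print("dropped:", to_drop)
print(mi.round(3))
```

Which member of a correlated pair gets dropped depends on column order here. If one of the two is cheaper to collect or easier to explain, choose it deliberately instead.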

15.6 Metric selection guide

decision tree

Problem type?
│
├── Classification
│   ├── Balanced classes ──────────► Accuracy
│   └── Imbalanced classes
│       ├── False negatives costly ► Recall
│       ├── False positives costly ► Precision
│       ├── Need balance ──────────► F1 score
│       └── Rank quality ──────────► ROC-AUC / PR-AUC
│
└── Regression
    ├── Penalise big errors ───────► RMSE
    ├── Robust to outliers ────────► MAE
    └── % error / forecasting ─────► MAPE

Metric	Use when	Avoid when
Accuracy	Classes are balanced	Imbalanced (99% one class)
Recall	Missing a positive is costly (cancer, fraud)	False alarms are very expensive
Precision	False alarms are costly (spam block)	Missing positives is dangerous
F1 / ROC-AUC	Imbalanced, need overall quality	Stakeholder needs a simple % story
RMSE	Large errors must be punished	Many legitimate outliers
MAE	Robust, interpretable error	Big errors must dominate
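The difference between RMSE and MAE shows up as soon as one prediction is badly wrong. The numbers below are made up: four errors of 1 and one error of 10. RMSE is computed as the square root of MSE.

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
)

y_true = np.array([100.0, 100.0, 100.0, 100.0, 100.0])
y_pred = np.array([101.0, 99.0, 101.0, 99.0, 110.0])

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
# scikit-learn returns MAPE as a fraction, not a percentage
mape = mean_absolute_percentage_error(y_true, y_pred)

print(f"MAE  = {mae:.2f}")
print(f"RMSE = {rmse:.2f}")
print(f"MAPE = {mape:.1%}")
```

The single large error lifts RMSE well above MAE. That gap is the "penalise big errors" behaviour from the tree.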

Pick the metric before training. The metric encodes the business cost of being wrong — choosing it after seeing results is how teams fool themselves.
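A quick way to see why accuracy fails on imbalanced classes is to score a model that always predicts the majority class. The 95/5 split below is an invented example.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
)

# 95 negatives, 5 positives, and a "model" that always predicts negative
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
# zero_division=0 avoids a warning when no positives are predicted
precision = precision_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)

print(f"accuracy  = {accuracy:.2f}")
print(f"recall    = {recall:.2f}")
print(f"precision = {precision:.2f}")
print(f"f1        = {f1:.2f}")
```

Accuracy looks excellent while recall shows that every positive case was missed.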

Common mistakes to avoid

Filling skewed numeric columns with the mean instead of the median
Deleting real outliers that are actually the signal you need
Choosing the evaluation metric after seeing the results

Quick cheatsheet

SimpleImputer(strategy='median') -> Robust numeric imputation

OneHotEncoder(handle_unknown="ignore") -> Safe nominal encoding

RobustScaler() -> Scale data that has outliers

mutual_info_classif() -> Rank feature relevance (non-linear)

roc_auc_score() -> Ranking metric for imbalanced classes (average_precision_score() for PR-AUC)
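The cheatsheet entries can be combined into one leak-safe pipeline. This is a sketch under stated assumptions, not a reference implementation: the amount and plan columns, the synthetic data, and the LogisticRegression classifier are all chosen for illustration. Every preprocessing step lives inside the pipeline, so it is fitted on the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Synthetic data (hypothetical columns): skewed numeric + nominal category
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "amount": rng.lognormal(mean=3.0, sigma=1.0, size=n),
    "plan": pd.Series(rng.choice(["basic", "plus", "pro"], size=n), dtype=object),
})
y = (df["amount"] > df["amount"].median()).astype(int)

# Knock out 10% of each column to simulate missing values
df.loc[rng.choice(n, size=40, replace=False), "amount"] = np.nan
df.loc[rng.choice(n, size=40, replace=False), "plan"] = np.nan

# Split first; everything below is fitted on the training rows only
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.25, stratify=y, random_state=0
)

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", RobustScaler()),
])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="Missing")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
prep = ColumnTransformer([
    ("num", numeric, ["amount"]),
    ("cat", categorical, ["plan"]),
])
model = Pipeline([("prep", prep), ("clf", LogisticRegression(max_iter=1000))])

model.fit(X_train, y_train)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"test ROC-AUC = {auc:.3f}")
```

Because the imputers, scaler, and encoder are pipeline steps, the same object can be passed to cross_val_score or a grid search without reintroducing leakage.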