Data Analytics Decision Matrix
The core of the handbook. Stop guessing. Use these decision trees to choose the right method for missing values, outliers, encoding, scaling, feature selection, and metrics.
I have this data and this goal. What should I do next?
This is the chapter you open mid-project. Each section gives a decision tree, a comparison table, and a professional recommendation — so you choose with reasons, not habit.
Missing Values?
│
├── Numeric
│   ├── Normal distribution ───────► Mean
│   ├── Skewed / has outliers ─────► Median
│   └── Time series / ordered ─────► Forward / backward fill
│
├── Categorical
│   ├── One dominant value ────────► Mode
│   └── Missingness is meaningful ─► New "Unknown" category
│
└── > 40% missing in a column
    └── Low predictive value ──────► Drop the column
| Method | Use when | Avoid when | Why |
|---|---|---|---|
| Mean | Symmetric numeric data | Skewed data / outliers | Outliers drag the mean and distort fills |
| Median | Skewed numeric data | Exact average behaviour needed | Robust to extreme values |
| Mode | Categorical columns | Continuous numeric data | Keeps category structure intact |
| Forward / back fill | Time series, ordered logs | Shuffled / unrelated rows | Uses real neighbouring values |
| KNN / model impute | Strong feature correlations | Large data / tight deadlines | More accurate but slow and leak-prone |
| Drop rows | <5% rows affected, random | Missingness is patterned | Dropping a pattern biases the data |
Professional recommendation
For most tabular projects, impute numeric columns with the median and categorical columns with a "Missing" category. Add a binary was_missing flag when the fact of missingness might carry signal. Always split first: fit the imputer on the training set only, then apply it to both train and test.
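# Median for numeric, "Missing" category for categorical (leak-safe order)
from sklearn.impute import SimpleImputer

num_imputer = SimpleImputer(strategy='median')
cat_imputer = SimpleImputer(strategy='constant', fill_value='Missing')

# fit on TRAIN only, then transform both
# (num_cols / cat_cols are lists of numeric / categorical column names)
X_train[num_cols] = num_imputer.fit_transform(X_train[num_cols])
X_test[num_cols] = num_imputer.transform(X_test[num_cols])
X_train[cat_cols] = cat_imputer.fit_transform(X_train[cat_cols])
X_test[cat_cols] = cat_imputer.transform(X_test[cat_cols])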
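The was_missing flag from the recommendation can be produced by the imputer itself. The following is a minimal sketch, assuming scikit-learn's `SimpleImputer` with `add_indicator=True`; the toy `income` data and the output column name `income_was_missing` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: 'income' is skewed (one extreme value) and has gaps.
train = pd.DataFrame({"income": [30.0, 32.0, np.nan, 35.0, 400.0]})
test = pd.DataFrame({"income": [np.nan, 31.0]})

# add_indicator=True appends one 0/1 column for each feature that had
# missing values during fit, which plays the role of the was_missing flag.
imputer = SimpleImputer(strategy="median", add_indicator=True)

# Fit on TRAIN only, then reuse the learned median on TEST.
train_out = imputer.fit_transform(train)
test_out = imputer.transform(test)

cols = ["income", "income_was_missing"]  # illustrative column names
train_imp = pd.DataFrame(train_out, columns=cols)
test_imp = pd.DataFrame(test_out, columns=cols)

print(test_imp)
```

The missing test value is filled with the training median, which the extreme value 400 barely moves, whereas a mean fill would be pulled far above the typical incomes.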
Outlier detected?
│
├── Data error (typo, wrong unit, impossible value)
│   └──────────────────────────────► Remove / correct
│
└── Real event (genuine extreme value)
    ├── Goal = describe / report ──► Keep, report separately
    ├── Goal = linear model ───────► Cap (winsorize) or log-transform
    └── Goal = tree model ─────────► Keep (trees are robust)
| Action | Use when | Avoid when |
|---|---|---|
| Remove | Confirmed data-entry error | Value is a real rare event |
| Cap / winsorize | Modeling with linear / distance models | Extreme values are the target of interest (fraud, churn spikes) |
| Log / sqrt transform | Right-skewed positive values | Negatives present; zeros break a plain log (use log1p or sqrt) |
| Keep as-is | Tree models, anomaly detection | Distance-based models (KNN, SVM) |
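
Capping and log-transforming can be sketched in a few lines of NumPy. This is an illustrative sketch, not a prescribed recipe: the 1st/99th percentile bounds and the helper names `fit_caps` and `winsorize` are assumptions. The bounds are learned on the training data so the same caps can be reused on test data.

```python
import numpy as np

def fit_caps(x, lower_q=0.01, upper_q=0.99):
    """Learn capping bounds from training data only (quantiles are assumed defaults)."""
    return np.quantile(x, lower_q), np.quantile(x, upper_q)

def winsorize(x, caps):
    """Clip values to the previously learned (low, high) bounds."""
    low, high = caps
    return np.clip(x, low, high)

rng = np.random.default_rng(0)
# 500 ordinary values plus one genuine extreme value.
x_train = np.append(rng.normal(50, 5, size=500), [500.0])

caps = fit_caps(x_train)
x_train_capped = winsorize(x_train, caps)

# log1p(x) = log(1 + x) stays defined at zero, unlike a plain log.
x_log = np.log1p(np.array([0.0, 9.0, 99.0]))

print(x_train.max(), x_train_capped.max())
```

Capping keeps every row, so it suits linear and distance-based models; when the extremes are the target of interest, skip this step and keep the raw values.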
Categorical variable?
│
├── Ordinal (has order: low < mid < high) ─► Label / Ordinal encoding
│
├── Nominal (no order)
│   ├── Low cardinality (< ~15) ───────────► One-Hot encoding
│   └── High cardinality (zip, city, id) ──► Target / frequency encoding
│
└── Tree model + high cardinality ─────────► Native categorical (LightGBM / CatBoost)
| Encoding | Use when | Avoid when | Risk |
|---|---|---|---|
| Label / Ordinal | True order exists | No order (nominal) | Fake order misleads linear models |
| One-Hot | Few categories, nominal | Hundreds of categories | Column explosion |
| Target encoding | High cardinality | Small data | Leakage if not cross-fitted |
| Frequency | Category frequency itself carries signal | Many categories share the same frequency | Categories with equal counts collide and become indistinguishable |
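
For the One-Hot row, a short sketch of what "safe" nominal encoding can look like with scikit-learn's `OneHotEncoder(handle_unknown="ignore")`. The colour values are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical nominal column; 'purple' never appears in training.
train_colors = np.array([["red"], ["green"], ["blue"], ["green"]])
test_colors = np.array([["green"], ["purple"]])

# handle_unknown="ignore" encodes unseen categories as an all-zero row
# instead of raising an error at transform time.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_colors)

# The default output is a sparse matrix; convert it for display.
test_encoded = encoder.transform(test_colors).toarray()
categories = list(encoder.categories_[0])

print(categories)
print(test_encoded)
```

The unseen category produces a row of zeros, so a pipeline fitted on the training set keeps working when new categories show up later.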
Professional recommendation
Default to One-Hot for nominal columns under ~15 categories. For high-cardinality columns use target encoding with K-fold cross-fitting to avoid leakage, or switch to CatBoost / LightGBM which handle categories natively.
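One way to cross-fit target encoding by hand is sketched below. It assumes a single categorical column and a binary or numeric target; the function name `target_encode_cross_fit` and the toy `city` data are hypothetical. Each row is encoded with category means computed on the other folds only, so a row never sees its own target.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def target_encode_cross_fit(cat, y, n_splits=5, seed=0):
    """Out-of-fold mean-target encoding for one categorical column (sketch)."""
    cat = pd.Series(cat).reset_index(drop=True)
    y = pd.Series(y, dtype=float).reset_index(drop=True)
    encoded = pd.Series(np.nan, index=cat.index, dtype=float)

    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(cat):
        # Category means come from the other folds only.
        means = y.iloc[fit_idx].groupby(cat.iloc[fit_idx]).mean()
        # Categories unseen in those folds fall back to their overall mean.
        fallback = y.iloc[fit_idx].mean()
        encoded.iloc[enc_idx] = cat.iloc[enc_idx].map(means).fillna(fallback).to_numpy()
    return encoded

# Hypothetical training data: 'd' occurs only once.
city = ["a", "a", "a", "b", "b", "b", "c", "c", "c", "d"]
y = [1, 1, 0, 0, 0, 1, 1, 1, 1, 0]

encoded = target_encode_cross_fit(city, y)
print(encoded.round(2).tolist())
```

At prediction time, test rows would be encoded with category means computed on the full training set. The single `d` row shows why cross-fitting matters: a naive encoding would assign it exactly its own target value.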
Which model?
│
├── Distance / gradient based
│   Linear Reg · Logistic Reg · SVM · KNN · K-Means · Neural Net · PCA
│   └──────────────────────────────► SCALING REQUIRED
│       ├── Outliers present ──────► RobustScaler
│       ├── Roughly normal ────────► StandardScaler
│       └── Bounded range needed ──► MinMaxScaler
│
└── Tree based
    Decision Tree · Random Forest · XGBoost · LightGBM
    └──────────────────────────────► SCALING NOT REQUIRED
| Scaler | Use when | Avoid when |
|---|---|---|
| StandardScaler | Features ~ normal, no big outliers | Heavy outliers present |
| MinMaxScaler | Need [0,1] range (neural nets, images) | Outliers present (they compress the rest of the data) |
| RobustScaler | Outliers present | Clean, normal data (overkill) |
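
The effect of an outlier on each scaler can be checked directly. The sketch below uses a made-up single-feature array with one extreme value and measures how much spread the five ordinary points keep after scaling.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical feature: five ordinary values and one extreme value.
x_train = np.array([[10.0], [11.0], [12.0], [13.0], [14.0], [1000.0]])

scaled = {}
for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    # In a real project, fit on the training split only and reuse on test.
    scaler.fit(x_train)
    scaled[type(scaler).__name__] = scaler.transform(x_train).ravel()

# Range covered by the five ordinary points after scaling.
spread = {name: vals[:5].max() - vals[:5].min() for name, vals in scaled.items()}

for name, value in spread.items():
    print(f"{name}: {value:.4f}")
```

With StandardScaler and MinMaxScaler the ordinary points are squeezed into a tiny interval because the outlier dominates the mean, standard deviation, and range. RobustScaler centres on the median and scales by the interquartile range, so those points stay well separated.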
Too many features?
│
├── Remove redundancy
│   ├── Pairwise correlation > 0.9 ──► Drop one of the pair
│   └── Multicollinearity (VIF > 5) ─► Drop / combine
│
├── Rank relevance
│   ├── Any relationship ────────────► Mutual Information
│   └── Linear only ─────────────────► Correlation with target
│
└── Model driven
    ├── Need sparsity + speed ───────► LASSO (L1)
    └── Need explainability ─────────► SHAP importance
Professional recommendation
Start by removing near-duplicate features (pairwise correlation > 0.9 or VIF > 5). Then rank what remains with Mutual Information, which also captures non-linear links, and validate the top features with SHAP on a baseline model. Never select features using the target before splitting: fit any target-aware selection on the training set only.
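The first two steps can be sketched as follows. The helper `drop_correlated` and the synthetic columns are hypothetical; the data is built so that the target depends on `signal` in a non-linear way, which plain correlation with the target would miss. The SHAP validation step is not shown.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

def drop_correlated(X, threshold=0.9):
    """Drop the later column of every pair whose |correlation| exceeds threshold."""
    corr = X.corr().abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop), to_drop

# Synthetic TRAINING data (assumed to be already split from the test set).
rng = np.random.default_rng(0)
n = 400
signal = rng.normal(size=n)
X_train = pd.DataFrame({
    "signal": signal,
    "signal_copy": signal + rng.normal(scale=0.01, size=n),  # near-duplicate
    "noise": rng.normal(size=n),
})
y_train = (signal ** 2 > 1.0).astype(int)  # non-linear link to 'signal'

X_reduced, dropped = drop_correlated(X_train)

mi = pd.Series(
    mutual_info_classif(X_reduced, y_train, random_state=0),
    index=X_reduced.columns,
).sort_values(ascending=False)

linear_corr = np.corrcoef(signal, y_train)[0, 1]
print(dropped)
print(mi)
print(f"correlation of signal with target: {linear_corr:.3f}")
```

The near-duplicate column is removed first, and Mutual Information then ranks `signal` above `noise` even though its linear correlation with the target is close to zero.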
Problem type?
│
├── Classification
│   ├── Balanced classes ──────────► Accuracy
│   └── Imbalanced classes
│       ├── False negatives costly ► Recall
│       ├── False positives costly ► Precision
│       ├── Need balance ──────────► F1 score
│       └── Rank quality ──────────► ROC-AUC / PR-AUC
│
└── Regression
    ├── Penalise big errors ───────► RMSE
    ├── Robust to outliers ────────► MAE
    └── % error / forecasting ─────► MAPE
| Metric | Use when | Avoid when |
|---|---|---|
| Accuracy | Classes are balanced | Imbalanced (99% one class) |
| Recall | Missing a positive is costly (cancer, fraud) | False alarms are very expensive |
| Precision | False alarms are costly (spam block) | Missing positives is dangerous |
| F1 / ROC-AUC | Imbalanced, need overall quality | Stakeholder needs a simple % story |
| RMSE | Large errors must be punished | Many legitimate outliers |
| MAE | Robust, interpretable error | Big errors must dominate |
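
Two small made-up examples show why the table's warnings matter: a classifier that always predicts the majority class on a 95/5 split, and a regression with one large miss.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    mean_absolute_error,
    mean_squared_error,
    recall_score,
)

# Imbalanced classification: 95 negatives, 5 positives.
y_true = np.array([0] * 95 + [1] * 5)
y_lazy = np.zeros(100, dtype=int)  # "model" that always predicts negative

acc = accuracy_score(y_true, y_lazy)                 # 0.95
rec = recall_score(y_true, y_lazy, zero_division=0)  # 0.0
f1 = f1_score(y_true, y_lazy, zero_division=0)       # 0.0

# Regression: three perfect predictions and one large miss.
y_reg = np.array([10.0, 10.0, 10.0, 10.0])
y_pred = np.array([10.0, 10.0, 10.0, 18.0])

mae = mean_absolute_error(y_reg, y_pred)           # 2.0
rmse = np.sqrt(mean_squared_error(y_reg, y_pred))  # 4.0

print(f"accuracy={acc:.2f} recall={rec:.2f} f1={f1:.2f}")
print(f"MAE={mae:.1f} RMSE={rmse:.1f}")
```

Accuracy looks excellent while recall and F1 reveal that no positive case was found. On the regression side, the single large error doubles RMSE relative to MAE, which is the behaviour you want only when big errors must dominate.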
- Filling skewed numeric columns with the mean instead of the median
- Deleting real outliers that are actually the signal you need
- Choosing the evaluation metric after seeing the results
SimpleImputer(strategy='median') -> Robust numeric imputation
OneHotEncoder(handle_unknown="ignore") -> Safe nominal encoding
RobustScaler() -> Scale data that has outliers
mutual_info_classif() -> Rank feature relevance (non-linear)
roc_auc_score() -> Metric for imbalanced classes