Chapter 15 — Decision Matrix

Data Analytics Decision Matrix

The core of the handbook. Stop guessing. Use these decision trees to choose the right method for missing values, outliers, encoding, scaling, feature selection, and metrics.

I have this data and this goal. What should I do next?

This is the chapter you open mid-project. Each section gives a decision tree, a comparison table, and a professional recommendation — so you choose with reasons, not habit.

15.1 Missing value decision guide
decision tree
Missing Values?
│
├── Numeric
│   ├── Normal distribution ───────► Mean
│   ├── Skewed / has outliers ─────► Median
│   └── Time series / ordered ─────► Forward / backward fill
│
├── Categorical
│   ├── One dominant value ────────► Mode
│   └── Missingness is meaningful ─► New "Unknown" category
│
└── > 40% missing in a column
    └── Low predictive value ──────► Drop the column

| Method | Use when | Avoid when | Why |
|---|---|---|---|
| Mean | Symmetric numeric data | Skewed data / outliers | Outliers drag the mean and distort fills |
| Median | Skewed numeric data | Exact average behaviour needed | Robust to extreme values |
| Mode | Categorical columns | Continuous numeric data | Keeps category structure intact |
| Forward / back fill | Time series, ordered logs | Shuffled / unrelated rows | Uses real neighbouring values |
| KNN / model impute | Strong feature correlations | Large data / tight deadlines | More accurate but slow and leak-prone |
| Drop rows | <5% rows affected, random | Missingness is patterned | Dropping a pattern biases the data |

Professional recommendation

For most tabular projects, impute numeric columns with the median and categorical columns with a "Missing" category. Add a binary was_missing flag when the fact of missingness might carry signal. Always impute after the train/test split.

python
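# Median for numeric, "Missing" category for categorical, in leak-safe order.
# Assumes X_train / X_test are DataFrames and num_cols / cat_cols are lists of column names.
from sklearn.impute import SimpleImputer

num_imputer = SimpleImputer(strategy='median')
cat_imputer = SimpleImputer(strategy='constant', fill_value='Missing')

# fit on TRAIN only, then transform both
X_train[num_cols] = num_imputer.fit_transform(X_train[num_cols])
X_test[num_cols]  = num_imputer.transform(X_test[num_cols])
X_train[cat_cols] = cat_imputer.fit_transform(X_train[cat_cols])
X_test[cat_cols]  = cat_imputer.transform(X_test[cat_cols])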
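
The recommendation also mentions a binary was_missing flag. One possible way to get it, sketched here on hypothetical toy data (the `income` column and the `income_was_missing` name are illustrative assumptions), is the `add_indicator` option of `SimpleImputer`:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: one numeric column with gaps
X_train = pd.DataFrame({"income": [30.0, np.nan, 50.0, 70.0, np.nan]})
X_test = pd.DataFrame({"income": [np.nan, 40.0]})

# add_indicator=True appends one 0/1 column for each feature that had
# missing values when the imputer was fitted
imputer = SimpleImputer(strategy="median", add_indicator=True)
train_filled = imputer.fit_transform(X_train)  # fit on TRAIN only
test_filled = imputer.transform(X_test)        # reuses the train median

cols = ["income", "income_was_missing"]  # illustrative column names
train_out = pd.DataFrame(train_filled, columns=cols)
test_out = pd.DataFrame(test_filled, columns=cols)
```

The test row with a gap is filled with the median learned from the training rows, and the flag column records that it was originally missing.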
15.2 Outlier handling guide
decision tree
Outlier detected?
│
├── Data error (typo, wrong unit, impossible value)
│   └──────────────────────────────► Remove / correct
│
└── Real event (genuine extreme value)
    ├── Goal = describe / report ──► Keep, report separately
    ├── Goal = linear model ───────► Cap (winsorize) or log-transform
    └── Goal = tree model ─────────► Keep (trees are robust)

| Action | Use when | Avoid when |
|---|---|---|
| Remove | Confirmed data-entry error | Value is a real rare event |
| Cap / winsorize | Modeling with linear / distance models | Extreme values are the target of interest (fraud, churn spikes) |
| Log / sqrt transform | Right-skewed positive values | Zeros / negatives present |
| Keep as-is | Tree models, anomaly detection | Distance-based models (KNN, SVM) |

Never delete outliers just because they look big. In fraud, churn, and quality control the outliers are the signal you are paid to find.
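
As a rough illustration of the "cap (winsorize)" branch, the sketch below clips values to IQR-based fences. The 1.5 × IQR rule, the `iqr_caps` helper, and the toy arrays are assumptions chosen for the example, not something the guide prescribes. The fences are learned from training values only and then reused on test values.

```python
import numpy as np

def iqr_caps(train_values, k=1.5):
    """Return (low, high) fences from the training data (hypothetical helper)."""
    q1, q3 = np.percentile(train_values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy data: one extreme value in train, one in test
train = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 14.0, 200.0])
test = np.array([9.0, 12.0, 500.0])

low, high = iqr_caps(train)                 # learn fences on TRAIN only
train_capped = np.clip(train, low, high)    # extreme values pulled to the fence
test_capped = np.clip(test, low, high)      # same fences applied to test
```

Capping keeps every row, which suits the linear-model case in the tree; when the extremes are the signal, skip this step.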
15.3 Encoding selection guide
decision tree
Categorical variable?
│
├── Ordinal (has order: low < mid < high) ─► Label / Ordinal encoding
│
├── Nominal (no order)
│   ├── Low cardinality (< ~15) ──────────────► One-Hot encoding
│   └── High cardinality (zip, city, id) ──────► Target / frequency encoding
│
└── Tree model + high cardinality ──────────► Native categorical (LightGBM / CatBoost)

| Encoding | Use when | Avoid when | Risk |
|---|---|---|---|
| Label / Ordinal | True order exists | No order (nominal) | Fake order misleads linear models |
| One-Hot | Few categories, nominal | Hundreds of categories | Column explosion |
| Target encoding | High cardinality | Small data | Leakage if not cross-fitted |
| Frequency | Category frequency carries signal | Many categories share the same frequency | Categories with equal counts collide and lose information |

Professional recommendation

Default to One-Hot for nominal columns under ~15 categories. For high-cardinality columns use target encoding with K-fold cross-fitting to avoid leakage, or switch to CatBoost / LightGBM which handle categories natively.
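
A minimal sketch of target encoding with K-fold cross-fitting follows. The helper name `target_encode_oof`, the smoothing constant, and the toy `city` / `label` data are assumptions made for illustration. Each training row is encoded using target means computed from the other folds only, so a row never sees its own label.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def target_encode_oof(cat, y, n_splits=5, smoothing=10.0, seed=0):
    """Out-of-fold mean-target encoding for one training column (sketch)."""
    cat = pd.Series(cat).reset_index(drop=True)
    y = pd.Series(y, dtype=float).reset_index(drop=True)
    encoded = pd.Series(np.nan, index=cat.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(cat):
        prior = y.iloc[fit_idx].mean()  # global mean of the fitting folds
        stats = y.iloc[fit_idx].groupby(cat.iloc[fit_idx]).agg(["sum", "count"])
        # Shrink rare categories toward the prior
        smooth = (stats["sum"] + smoothing * prior) / (stats["count"] + smoothing)
        # Categories unseen in the fitting folds fall back to the prior
        encoded.iloc[enc_idx] = cat.iloc[enc_idx].map(smooth).fillna(prior).to_numpy()
    return encoded

# Toy usage on hypothetical data
city = ["A", "A", "B", "B", "C", "A", "B", "C", "A", "B"]
label = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
city_te = target_encode_oof(city, label, n_splits=5)
```

For the test set, the usual companion step is to compute the smoothed means once on the full training data and map them onto the test column.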

15.4 Scaling selection guide
decision tree
Which model?
│
├── Distance / gradient based
│   Linear Reg · Logistic Reg · SVM · KNN · K-Means · Neural Net · PCA
│   └──────────────────────────────► SCALING REQUIRED
│       ├── Outliers present ──────► RobustScaler
│       ├── Roughly normal ────────► StandardScaler
│       └── Bounded range needed ──► MinMaxScaler
│
└── Tree based
    Decision Tree · Random Forest · XGBoost · LightGBM
    └──────────────────────────────► SCALING NOT REQUIRED

| Scaler | Use when | Avoid when |
|---|---|---|
| StandardScaler | Features ~ normal, no big outliers | Heavy outliers present |
| MinMaxScaler | Need [0,1] range (neural nets, images) | Outliers present (they compress the rest of the data) |
| RobustScaler | Outliers present | Clean, normal data (overkill) |

Leakage alert: fit the scaler on training data only, then transform test data. Fitting on the full dataset leaks test statistics into training. See Chapter 20.
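
One common way to enforce the leakage rule is to put the scaler inside a pipeline, so it can only ever be fitted on whatever data the pipeline is fitted on. The sketch below uses a synthetic dataset and a logistic regression as stand-ins; both are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# The scaler is a pipeline step, so fit() learns its mean and scale
# from X_train only; score() merely transforms X_test with them
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

scaler = model.named_steps["standardscaler"]  # fitted on train rows only
```

The same pipeline object can be passed to cross-validation, where the scaler is refitted inside each fold.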
15.5 Feature selection guide
decision tree
Too many features?
│
├── Remove redundancy
│   ├── Pairwise correlation > 0.9 ──► Drop one of the pair
│   └── Multicollinearity (VIF > 5) ─► Drop / combine
│
├── Rank relevance
│   ├── Any relationship ────────────► Mutual Information
│   └── Linear only ─────────────────► Correlation with target
│
└── Model driven
    ├── Need sparsity + speed ───────► LASSO (L1)
    └── Need explainability ─────────► SHAP importance

Professional recommendation

Start by removing near-duplicate features (pairwise correlation > 0.9 or VIF > 5). Then rank what remains with Mutual Information, which also captures non-linear links, and validate the top features with SHAP on a baseline model. Run any selection step that uses the target on the training split only; selecting on the full dataset before splitting leaks test information.
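
The first two steps can be sketched as follows, on synthetic training data. The `drop_correlated` helper and the column names are hypothetical, and the VIF and SHAP steps are left out because they need extra libraries.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

def drop_correlated(df, threshold=0.9):
    """Drop the later column of every pair with |corr| > threshold (sketch)."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is checked once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Synthetic TRAINING data: one informative feature, a near-copy, and noise
rng = np.random.default_rng(0)
n = 400
signal = rng.normal(size=n)
X_train = pd.DataFrame({
    "signal": signal,
    "signal_copy": signal + rng.normal(scale=0.01, size=n),
    "noise": rng.normal(size=n),
})
y_train = (signal > 0).astype(int)

X_reduced, dropped = drop_correlated(X_train)  # redundancy filter
mi = mutual_info_classif(X_reduced, y_train, random_state=0)
ranking = pd.Series(mi, index=X_reduced.columns).sort_values(ascending=False)
```

Here the near-copy is removed by the correlation filter, and the informative feature ranks above the noise column.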

15.6 Metric selection guide
decision tree
Problem type?
│
├── Classification
│   ├── Balanced classes ──────────► Accuracy
│   └── Imbalanced classes
│       ├── False negatives costly ► Recall
│       ├── False positives costly ► Precision
│       ├── Need balance ──────────► F1 score
│       └── Rank quality ──────────► ROC-AUC / PR-AUC
│
└── Regression
    ├── Penalise big errors ───────► RMSE
    ├── Robust to outliers ────────► MAE
    └── % error / forecasting ─────► MAPE

| Metric | Use when | Avoid when |
|---|---|---|
| Accuracy | Classes are balanced | Imbalanced (99% one class) |
| Recall | Missing a positive is costly (cancer, fraud) | False alarms are very expensive |
| Precision | False alarms are costly (spam block) | Missing positives is dangerous |
| F1 / ROC-AUC | Imbalanced, need overall quality | Stakeholder needs a simple % story |
| RMSE | Large errors must be punished | Many legitimate outliers |
| MAE | Robust, interpretable error | Big errors must dominate |

Pick the metric before training. The metric encodes the business cost of being wrong — choosing it after seeing results is how teams fool themselves.
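
A small illustration of why the metric matters, using made-up labels and predictions: a classifier that always predicts the majority class looks strong on accuracy but has zero recall, and a single large regression error moves RMSE more than MAE.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    mean_absolute_error,
    mean_squared_error,
    recall_score,
)

# Imbalanced toy labels: 95 negatives, 5 positives
y_true = [0] * 95 + [1] * 5
y_pred_lazy = [0] * 100  # always predicts the majority class

accuracy = accuracy_score(y_true, y_pred_lazy)           # high despite no skill
recall = recall_score(y_true, y_pred_lazy)               # no positive is found
f1 = f1_score(y_true, y_pred_lazy, zero_division=0)      # 0 when nothing is flagged

# Regression toy case: three small errors and one large one
y_true_reg = [10.0, 10.0, 10.0, 10.0]
y_pred_reg = [11.0, 11.0, 11.0, 19.0]
mae = mean_absolute_error(y_true_reg, y_pred_reg)
rmse = mean_squared_error(y_true_reg, y_pred_reg) ** 0.5  # larger than MAE here
```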
Common mistakes to avoid
Quick cheatsheet
SimpleImputer(strategy='median') -> Robust numeric imputation
OneHotEncoder(handle_unknown="ignore") -> Safe nominal encoding
RobustScaler() -> Scale data that has outliers
mutual_info_classif() -> Rank feature relevance (non-linear)
roc_auc_score() -> Ranking metric for imbalanced classes (average_precision_score() for PR-AUC)
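
The first three cheatsheet entries can be combined into a single preprocessing step. The sketch below is one possible arrangement, assuming a hypothetical frame with a numeric `age` column and a nominal `plan` column; `sparse_threshold=0.0` is set only to force a dense array for easy inspection.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

num_cols = ["age"]   # hypothetical numeric column
cat_cols = ["plan"]  # hypothetical nominal column

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", RobustScaler()),
])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="Missing")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer(
    [("num", numeric, num_cols), ("cat", categorical, cat_cols)],
    sparse_threshold=0.0,  # always return a dense array
)

X_train = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "plan": ["basic", "pro", np.nan, "basic"],
})
X_test = pd.DataFrame({"age": [np.nan], "plan": ["enterprise"]})

train_matrix = preprocess.fit_transform(X_train)  # fit on TRAIN only
test_matrix = preprocess.transform(X_test)        # unseen "enterprise" is ignored
```

Because the encoder ignores unknown categories, the unseen `enterprise` plan becomes an all-zero one-hot row instead of raising an error.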