Chapter 30 — Imbalanced Data
Imbalanced Learning
When one class is rare (fraud, churn, disease), accuracy lies and naive models predict the majority. This chapter covers resampling, class weights, decision thresholds, and the right metrics.
If 99% of transactions are legitimate, a model that always says "legit" is 99% accurate and 100% useless. Imbalanced problems need different metrics, training tricks, and decision thresholds.
30.1 Why accuracy fails
the trap
1,000,000 transactions, 1,000 fraud (0.1%)
│
└── "always predict legit" → 99.9% accuracy, catches 0 fraud
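The arithmetic behind the trap can be checked directly. The sketch below uses only the counts from the example above and assumes a model that never flags anything as fraud; the variable names are illustrative.

```python
# Counts taken from the example above: 1,000,000 transactions, 1,000 fraud.
n_total = 1_000_000
n_fraud = 1_000
n_legit = n_total - n_fraud

# An "always predict legit" model flags nothing as fraud.
true_positives = 0            # fraud cases caught
false_negatives = n_fraud     # fraud cases missed
true_negatives = n_legit      # legit transactions correctly passed
false_positives = 0           # legit transactions wrongly flagged

accuracy = (true_positives + true_negatives) / n_total
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy = {accuracy:.1%}")
print(f"recall   = {recall:.1%}")
```

Accuracy rewards the 999,000 easy negatives, while recall looks only at the 1,000 cases that matter.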
Use recall, precision, and PR-AUC, never plain accuracy.
30.2 Strategy decision tree
decision tree
Imbalanced classes?
│
├── Try first (cheapest) ─────────► class_weight / scale_pos_weight
├── Still poor recall ────────────► Oversample minority (SMOTE)
├── Huge majority, plenty of data ► Undersample majority
├── Extreme (<0.1%) / novelty ────► Anomaly detection
└── Always ───────────────────────► Tune the decision threshold
30.3 The techniques
| Technique | Use when | Risk |
|---|---|---|
| Class weights | First thing to try — no data change | May not be enough alone |
| SMOTE (synthetic oversample) | Moderate imbalance, tabular | Can create unrealistic points; leaks if applied before the split |
| Random undersample | Majority is huge and redundant | Throws away information |
| Anomaly detection | Extreme rarity, few labels | Different evaluation |
Resample inside the CV fold, after the split — never on the whole dataset. SMOTE on full data leaks synthetic neighbours into the test set and inflates every score.
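To make the class-weight row concrete, here is a minimal sketch of the weights that scikit-learn's `class_weight='balanced'` heuristic produces, `n_samples / (n_classes * class_count)`, alongside the negative-to-positive ratio that is commonly used as a starting value for `scale_pos_weight`. The helper `balanced_class_weights` and the toy labels are hypothetical, written in plain Python so the arithmetic is visible.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight per class = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Toy labels (hypothetical): 990 legit (0) and 10 fraud (1), i.e. 1% positives.
y = [0] * 990 + [1] * 10

weights = balanced_class_weights(y)
# Ratio of negatives to positives, a common starting value for scale_pos_weight.
scale_pos_weight = y.count(0) / y.count(1)

print(weights)
print(scale_pos_weight)
```

Each fraud row ends up counting roughly 99 times as much as a legit row in the loss, without adding or removing any data.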
30.4 The leak-safe pipeline
python
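# imblearn's Pipeline applies SMOTE only to each training fold
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_val_score

# X, y: feature matrix and binary labels, assumed to be loaded already
pipe = Pipeline([
    ('smote', SMOTE(random_state=42)),  # resamples during fit only, never at predict time
    # after SMOTE the fold is already balanced, so class_weight='balanced' changes little
    ('clf', LGBMClassifier(class_weight='balanced')),
])
scores = cross_val_score(pipe, X, y, cv=5, scoring='average_precision')  # PR-AUC
print(scores.mean())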
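What "resample inside the fold" means can be shown without any library. The sketch below is an illustration under stated assumptions, not the imblearn implementation: random oversampling by duplication stands in for SMOTE, and the helpers `stratified_folds` and `oversample_minority` are hypothetical. The point is only the ordering: split first, then resample the training indices, and leave the held-out fold alone.

```python
import random

def stratified_folds(labels, n_splits, rng):
    """Assign each index to one of n_splits folds, class by class."""
    folds = [[] for _ in range(n_splits)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % n_splits].append(i)
    return folds

def oversample_minority(train_idx, labels, rng, minority=1):
    """Duplicate minority rows (with replacement) until classes are even."""
    pos = [i for i in train_idx if labels[i] == minority]
    neg = [i for i in train_idx if labels[i] != minority]
    extra = [rng.choice(pos) for _ in range(len(neg) - len(pos))]
    return train_idx + extra

rng = random.Random(42)
y = [0] * 95 + [1] * 5          # hypothetical toy labels, 5% positives
folds = stratified_folds(y, n_splits=5, rng=rng)

for k, test_idx in enumerate(folds):
    train_idx = [i for j, f in enumerate(folds) if j != k for i in f]
    train_bal = oversample_minority(train_idx, y, rng)
    # The held-out fold is untouched: no test row, or copy of one, is in training.
    assert not set(train_bal) & set(test_idx)
    n_pos = sum(y[i] for i in train_bal)
    print(f"fold {k}: train size {len(train_bal)}, positives {n_pos}, test size {len(test_idx)}")
```

Resampling the whole dataset before splitting would let copies (or, with SMOTE, synthetic neighbours) of held-out rows land in the training data, which is exactly the leak described above.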
30.5 Tune the threshold to the business cost
The default 0.5 cutoff is rarely right. Choose the threshold that optimizes the real trade-off (e.g. catch 90% of fraud while keeping false alarms manageable).
python
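from sklearn.metrics import precision_recall_curve
import numpy as np

# proba: predicted probability of the positive class,
# e.g. from a fitted classifier's predict_proba(X_test)[:, 1]
prec, rec, thr = precision_recall_curve(y_test, proba)

# prec and rec have one more entry than thr; recall falls as the threshold rises.
# Pick the highest threshold that still meets the recall target, e.g. recall ≥ 0.90.
idx = np.flatnonzero(rec[:-1] >= 0.90)[-1]
print(f"threshold={thr[idx]:.3f} precision={prec[idx]:.2f} recall={rec[idx]:.2f}")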
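When the business cost can be put into numbers, the threshold can also be chosen by minimizing total cost rather than by hitting a recall target. The following is a minimal sketch under assumptions: the labels, probabilities, and the 50:1 cost ratio are made up for illustration, and `expected_cost` and `best_threshold` are hypothetical helpers.

```python
def expected_cost(y_true, scores, threshold, cost_fn, cost_fp):
    """Total cost of flagging every score >= threshold as positive."""
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    return fn * cost_fn + fp * cost_fp

def best_threshold(y_true, scores, cost_fn, cost_fp):
    """Try each distinct score as a cutoff; return (threshold, cost) with lowest cost."""
    candidates = sorted(set(scores))
    return min(
        ((t, expected_cost(y_true, scores, t, cost_fn, cost_fp)) for t in candidates),
        key=lambda pair: pair[1],
    )

# Hypothetical validation labels and predicted probabilities.
y_val = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
proba_val = [0.02, 0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.60, 0.90]

# Assumed costs: a missed fraud costs 50x a false alarm.
thr_best, cost_best = best_threshold(y_val, proba_val, cost_fn=50, cost_fp=1)
cost_at_half = expected_cost(y_val, proba_val, 0.5, cost_fn=50, cost_fp=1)
print(f"best threshold={thr_best:.2f} cost={cost_best}  vs  cost at 0.5={cost_at_half}")
```

On this toy data the default 0.5 cutoff misses the positive scored 0.35, so a lower cutoff is much cheaper. It is usually safer to pick the threshold on a validation split and keep the final test set for reporting only.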
30.6 Metrics that tell the truth
| Metric | Tells you |
|---|---|
| Recall | % of real positives caught |
| Precision | % of flagged that are real |
| F1 | Balance of the two |
| PR-AUC (avg precision) | Best summary for rare positives |
| ROC-AUC | Ranking quality (can look rosy under extreme imbalance) |
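A worked example shows how these metrics can disagree. The confusion counts below are hypothetical, chosen to match the 1,000,000-transaction scenario from section 30.1. They illustrate why ROC-based numbers can look good under extreme imbalance: the false positive rate divides by the huge negative class, while precision does not.

```python
# Hypothetical confusion counts for a detector on 1,000,000 transactions
# with 1,000 fraud cases: it flags 10,000 transactions and 900 are real fraud.
tp = 900
fp = 10_000 - tp          # 9,100 false alarms
fn = 1_000 - tp           # 100 missed fraud cases
tn = 1_000_000 - tp - fp - fn

recall = tp / (tp + fn)               # share of real positives caught
precision = tp / (tp + fp)            # share of flagged cases that are real
f1 = 2 * precision * recall / (precision + recall)
false_positive_rate = fp / (fp + tn)  # x-axis of the ROC curve

print(f"recall={recall:.2f} precision={precision:.2f} f1={f1:.3f}")
print(f"false positive rate={false_positive_rate:.4f}")
```

A false positive rate below 1% sounds excellent, yet fewer than one in ten flagged transactions is actually fraud. Precision and PR-AUC expose that gap; ROC-AUC on its own can hide it.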
Professional recommendation
Start: class_weight='balanced'
Resample: SMOTE inside the pipeline
Judge by: PR-AUC + Recall
Deploy: Cost-tuned threshold
Common mistakes to avoid
- Reporting accuracy on an imbalanced problem
- Applying SMOTE before the train/test split (leakage)
- Leaving the threshold at 0.5 instead of tuning to cost
- Undersampling so hard you discard useful signal
- Trusting ROC-AUC alone when positives are <1%
Quick cheatsheet
class_weight='balanced' -> reweight classes
imblearn SMOTE in Pipeline -> leak-safe oversample
average_precision -> PR-AUC scoring
precision_recall_curve -> pick threshold
scale_pos_weight -> XGBoost/LightGBM imbalance