Chapter 30 — Imbalanced Data

Imbalanced Learning

When one class is rare (fraud, churn, disease), accuracy lies and naive models predict the majority. This chapter covers resampling, class weights, decision thresholds, and the right metrics.

If 99% of transactions are legitimate, a model that always says "legit" is 99% accurate and 100% useless. Imbalanced problems need different metrics, training tricks, and decision thresholds.
30.1 Why accuracy fails
the trap
1,000,000 transactions, 1,000 fraud (0.1%)
   │
   └── "always predict legit" → 99.9% accuracy, catches 0 fraud
       Use Recall, Precision, PR-AUC — never plain accuracy.
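
The arithmetic behind the trap can be checked by hand. The sketch below is a minimal illustration using the transaction counts from the example above; the variable names are illustrative, not part of any library.

```python
# Confusion counts for an "always predict legit" model on the example above.
total = 1_000_000
fraud = 1_000

tp = 0                # fraud correctly flagged: none, the model never flags anything
fn = fraud            # every fraud case is missed
fp = 0                # no false alarms either
tn = total - fraud    # every legitimate transaction counts as "correct"

accuracy = (tp + tn) / total
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.3%}  recall={recall:.0%}")
```

Accuracy is dominated by the 999,000 true negatives, while recall looks only at the 1,000 fraud cases and exposes the failure.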
30.2 Strategy decision tree
decision tree
Imbalanced classes?
│
├── Try first (cheapest) ─────────► class_weight / scale_pos_weight
├── Still poor recall ────────────► Oversample minority (SMOTE)
├── Huge majority, plenty of data ► Undersample majority
├── Extreme (<0.1%) / novelty ────► Anomaly detection
└── Always ───────────────────────► Tune the decision threshold
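
As a rough guide to what the first branch does, the sketch below computes the per-class weights by hand. It assumes the formula scikit-learn documents for `class_weight='balanced'`, n_samples / (n_classes * class_count), and the negatives-to-positives ratio commonly recommended for `scale_pos_weight`. The helper name `balanced_class_weights` is hypothetical.

```python
def balanced_class_weights(counts):
    """Weight per class as n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}

# Class counts from the fraud example: 0 = legit, 1 = fraud
counts = {0: 999_000, 1: 1_000}

weights = balanced_class_weights(counts)
scale_pos_weight = counts[0] / counts[1]   # negatives / positives

print(weights)
print(scale_pos_weight)
```

Each fraud row ends up counting roughly a thousand times more than a legitimate row in the loss, without changing the data itself.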
30.3 The techniques
| Technique | Use when | Risk |
|---|---|---|
| Class weights | First thing to try, no data change | May not be enough alone |
| SMOTE (synthetic oversample) | Moderate imbalance, tabular | Can create unrealistic points; leak if done before split |
| Random undersample | Majority is huge and redundant | Throws away information |
| Anomaly detection | Extreme rarity, few labels | Different evaluation |
Resample inside the CV fold, after the split — never on the whole dataset. SMOTE on full data leaks synthetic neighbours into the test set and inflates every score.
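
The ordering rule can be sketched with random undersampling in plain Python: split first, then resample the training part only. This is a minimal sketch on made-up toy data; the helper `undersample_majority` and its `ratio` parameter are hypothetical, and the split is a simple random one rather than a stratified one.

```python
import random

def undersample_majority(rows, label_of, ratio=1.0, seed=42):
    """Keep every minority row and a random subset of majority rows.

    ratio is the number of majority rows kept per minority row.
    """
    rng = random.Random(seed)
    minority = [r for r in rows if label_of(r) == 1]
    majority = [r for r in rows if label_of(r) == 0]
    keep = min(len(majority), int(ratio * len(minority)))
    out = minority + rng.sample(majority, keep)
    rng.shuffle(out)
    return out

# Toy data: (feature, label) pairs, 1,000 rows with 2% positives
rng = random.Random(0)
data = [(rng.random(), 1 if i < 20 else 0) for i in range(1000)]
rng.shuffle(data)

# 1) Split first, so the test set keeps the real class ratio
cut = int(0.8 * len(data))
train, test = data[:cut], data[cut:]

# 2) Resample the training part only; the test set is never touched
train_bal = undersample_majority(train, label_of=lambda r: r[1])

print(len(train), len(train_bal), len(test))
```

The same ordering applies to SMOTE: synthetic points are built from training rows only, so nothing derived from a test row can end up in the training set.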
30.4 The leak-safe pipeline
python
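# imblearn's Pipeline runs SMOTE only when fitting, so each CV training
# fold is resampled and each validation fold is left untouched
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_val_score

# X, y: feature matrix and binary labels, assumed to be loaded already
pipe = Pipeline([
    ('smote', SMOTE(random_state=42)),
    ('clf',   LGBMClassifier(class_weight='balanced')),
])
# an integer cv with a classifier uses stratified folds
scores = cross_val_score(pipe, X, y, cv=5, scoring='average_precision')  # PR-AUC
print(scores.mean())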
30.5 Tune the threshold to the business cost

The default 0.5 cutoff is rarely right. Choose the threshold that optimizes the real trade-off (e.g. catch 90% of fraud while keeping false alarms manageable).

python
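from sklearn.metrics import precision_recall_curve
import numpy as np

# y_test, proba: true labels and positive-class scores,
# e.g. proba = model.predict_proba(X_test)[:, 1]
prec, rec, thr = precision_recall_curve(y_test, proba)
# thr is increasing and rec falls as it rises; prec and rec carry one extra
# trailing point (precision 1, recall 0) that has no threshold
# pick the highest threshold that still meets a recall target, e.g. recall ≥ 0.90
idx = np.where(rec[:-1] >= 0.90)[0][-1]
print(f"threshold={thr[idx]:.3f}  precision={prec[idx]:.2f}  recall={rec[idx]:.2f}")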
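
When the costs of the two error types can be estimated, the threshold can also be chosen by minimising total cost directly. The sketch below is a minimal pure-Python illustration. The labels, scores, and the 50:1 cost ratio are made-up assumptions, and `total_cost` and `best_threshold` are hypothetical helpers, not library functions.

```python
def total_cost(y_true, scores, threshold, cost_fn, cost_fp):
    """Cost of flagging every row whose score is >= threshold."""
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    return cost_fn * fn + cost_fp * fp

def best_threshold(y_true, scores, cost_fn, cost_fp):
    """Try each observed score as a cutoff and return the cheapest one."""
    candidates = sorted(set(scores))
    return min(candidates,
               key=lambda t: total_cost(y_true, scores, t, cost_fn, cost_fp))

# Made-up validation labels and model scores
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.55, 0.60, 0.90]

# Assumption: a missed fraud costs 50x as much as a false alarm
threshold = best_threshold(y_true, scores, cost_fn=50, cost_fp=1)
tuned_cost = total_cost(y_true, scores, threshold, cost_fn=50, cost_fp=1)
default_cost = total_cost(y_true, scores, 0.5, cost_fn=50, cost_fp=1)

print(threshold, tuned_cost, default_cost)
```

On this toy data the cheapest cutoff sits below 0.5, because one extra false alarm is far cheaper than one missed positive. In practice the threshold is better chosen on a validation set than on the final test set, so the reported test score stays honest.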
30.6 Metrics that tell the truth
| Metric | Tells you |
|---|---|
| Recall | % of real positives caught |
| Precision | % of flagged that are real |
| F1 | Balance of the two |
| PR-AUC (avg precision) | Best summary for rare positives |
| ROC-AUC | Ranking quality (can look rosy under extreme imbalance) |
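
The first three metrics follow directly from confusion counts. The sketch below uses made-up counts for a hypothetical fraud model on 1,000,000 transactions with 1,000 fraud cases. It also computes the false positive rate, the x-axis of the ROC curve, to show why ROC-based numbers can look flattering when negatives are abundant.

```python
# Made-up confusion counts: 1,000 fraud cases among 1,000,000 transactions
tp, fn = 900, 100          # fraud caught / fraud missed
fp, tn = 2_100, 996_900    # false alarms / legit correctly passed

recall = tp / (tp + fn)                          # share of real fraud caught
precision = tp / (tp + fp)                       # share of flags that are real fraud
f1 = 2 * precision * recall / (precision + recall)
false_positive_rate = fp / (fp + tn)             # x-axis of the ROC curve

print(f"recall={recall:.2f}  precision={precision:.2f}  f1={f1:.2f}")
print(f"false positive rate={false_positive_rate:.4f}")
```

Here the false positive rate is tiny because it is divided by nearly a million negatives, yet only 30% of the flagged transactions are actually fraud. Precision and recall make that visible.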

Professional recommendation

Start:     class_weight='balanced'
Resample:  SMOTE inside the pipeline
Judge by:  PR-AUC + Recall
Deploy:    Cost-tuned threshold
Common mistakes to avoid
Quick cheatsheet
class_weight='balanced' -> reweight classes
imblearn SMOTE in Pipeline -> leak-safe oversample
average_precision -> PR-AUC scoring
precision_recall_curve -> pick threshold
scale_pos_weight -> XGBoost/LightGBM imbalance