Chapter 30 — Imbalanced Data

Imbalanced Learning

When one class is rare (fraud, churn, disease), accuracy lies and naive models predict the majority. This chapter covers resampling, class weights, decision thresholds, and the metrics that expose the problem.

If 99% of transactions are legitimate, a model that always says "legit" is 99% accurate and 100% useless. Imbalanced problems need different metrics, training tricks, and decision thresholds.

30.1 Why accuracy fails

the trap

1,000,000 transactions, 1,000 fraud (0.1%)
   │
   └── "always predict legit" → 99.9% accuracy, catches 0 fraud
       Use Recall, Precision, PR-AUC — never plain accuracy.
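
The arithmetic behind the trap can be checked by hand. A minimal sketch using only the counts from the example above; the variable names are illustrative:

```python
# Counts taken from the example above: 1,000,000 transactions, 1,000 fraud.
n_total = 1_000_000
n_fraud = 1_000

# An "always predict legit" model flags nothing, so:
true_pos = 0                    # fraud correctly flagged
false_neg = n_fraud             # fraud missed
true_neg = n_total - n_fraud    # legit correctly passed
false_pos = 0                   # legit wrongly flagged

accuracy = (true_pos + true_neg) / n_total
recall = true_pos / (true_pos + false_neg)

print(f"accuracy={accuracy:.1%}  recall={recall:.1%}")
# prints: accuracy=99.9%  recall=0.0%
```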

30.2 Strategy decision tree

decision tree

Imbalanced classes?
│
├── Try first (cheapest) ─────────► class_weight / scale_pos_weight
├── Still poor recall ────────────► Oversample minority (SMOTE)
├── Huge majority, plenty of data ► Undersample majority
├── Extreme (<0.1%) / novelty ────► Anomaly detection
└── Always ───────────────────────► Tune the decision threshold
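
For the cheapest branch, it helps to see what the weights actually are. The sketch below reproduces the formula scikit-learn documents for `class_weight='balanced'`, n_samples / (n_classes * class_count), and the negative-to-positive ratio that is commonly used as a starting value for `scale_pos_weight`. The helper names and the toy labels are assumptions made for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

def pos_weight_ratio(labels, positive=1):
    """Negative-to-positive count ratio, a common starting value for scale_pos_weight."""
    n_pos = sum(1 for y in labels if y == positive)
    n_neg = len(labels) - n_pos
    return n_neg / n_pos

# Toy labels (hypothetical): 990 legit (0) and 10 fraud (1).
y = [0] * 990 + [1] * 10
weights = balanced_class_weights(y)
print(weights)              # the rare class gets the large weight
print(pos_weight_ratio(y))  # ratio of negatives to positives
```

With these toy labels each fraud row counts as much as roughly a hundred legit rows in the loss, which is why reweighting needs no change to the data itself.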

30.3 The techniques

Technique	Use when	Risk
Class weights	First thing to try — no data change	May not be enough alone
SMOTE (synthetic oversample)	Moderate imbalance, tabular	Can create unrealistic points; leaks if applied before the split
Random undersample	Majority is huge and redundant	Throws away information
Anomaly detection	Extreme rarity, few labels	Different evaluation
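
The "unrealistic points" risk in the SMOTE row is easier to see from how synthetic rows are built: each one is a random point on the line between a minority sample and one of its nearest minority neighbours. The following is a simplified sketch of that idea, not imblearn's implementation; `smote_like_samples` and the toy points are hypothetical.

```python
import random

def smote_like_samples(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours (simplified SMOTE)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority points
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        # random point on the segment between base and neighbour
        t = rng.random()
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

# Three clustered minority points and one outlier (hypothetical data).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (5.0, 5.0)]
new_points = smote_like_samples(minority, n_new=5)
print(new_points)
```

Whenever the outlier (5.0, 5.0) is drawn as the base point or as a neighbour, the synthetic row lands in the empty space between it and the cluster, a region where no real minority case was observed.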

Resample inside each CV fold, after the split, never on the whole dataset. SMOTE on the full data builds synthetic training points from rows that later land in the test set (and places synthetic rows in the test set itself), which inflates every score.
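
The split-then-resample order can be sketched without any library. The example below uses plain random oversampling rather than SMOTE to keep it short; `oversample_minority`, the toy data, and the fixed holdout are assumptions made for illustration.

```python
import random

def oversample_minority(rows, labels, seed=0):
    """Duplicate random minority rows until both classes are the same size."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    keep = list(range(len(rows))) + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]

# Toy data: row ids double as features so leakage is easy to check.
rows = list(range(100))
labels = [1 if i % 10 == 0 else 0 for i in rows]   # 10% positives

# 1) Split first (here: a fixed holdout of the last 20 rows).
train_rows, test_rows = rows[:80], rows[80:]
train_labels, test_labels = labels[:80], labels[80:]

# 2) Resample the training part only; the test part keeps its real class ratio.
bal_rows, bal_labels = oversample_minority(train_rows, train_labels)

leaked = set(bal_rows) & set(test_rows)            # stays empty with this order
print(len(leaked), sum(bal_labels), len(bal_labels) - sum(bal_labels))
```

Reversing the two steps would let copies of a test row end up in the training data, which is the leak described above.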

30.4 The leak-safe pipeline

python

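# imblearn Pipeline applies SMOTE only to each training fold
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_val_score

# X, y: feature matrix and binary labels, assumed to be loaded already
pipe = Pipeline([
    ('smote', SMOTE(random_state=42)),   # resamples during fit only, never at predict time
    ('clf',   LGBMClassifier(class_weight='balanced')),
])
scores = cross_val_score(pipe, X, y, cv=5, scoring='average_precision')  # PR-AUC
print(scores.mean())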

30.5 Tune the threshold to the business cost

The default 0.5 cutoff is rarely right. Choose the threshold that optimizes the real trade-off (e.g. catch 90% of fraud while keeping false alarms manageable).

python

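from sklearn.metrics import precision_recall_curve
import numpy as np

# y_test: true labels; proba: predicted probability of the positive class
prec, rec, thr = precision_recall_curve(y_test, proba)
# rec falls as thr rises, and prec/rec have one more entry than thr.
# Pick the highest threshold that still meets the recall target (recall ≥ 0.90).
idx = np.where(rec[:-1] >= 0.90)[0][-1]
print(f"threshold={thr[idx]:.3f}  precision={prec[idx]:.2f}  recall={rec[idx]:.2f}")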
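
A recall target is one way to choose the cutoff. When rough costs for a missed positive and a false alarm are known, the threshold can instead be chosen to minimize total cost. The sketch below assumes hypothetical costs and toy scores; `best_threshold_by_cost` is an illustrative helper, not a library function.

```python
def best_threshold_by_cost(y_true, scores, cost_fn, cost_fp):
    """Return (threshold, total_cost) minimising cost_fn*FN + cost_fp*FP.

    A case is flagged when its score is >= threshold.
    """
    best = None
    # Every distinct score is a candidate; inf means "flag nothing".
    for thr in sorted(set(scores)) + [float("inf")]:
        fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < thr)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= thr)
        cost = cost_fn * fn + cost_fp * fp
        if best is None or cost < best[1]:
            best = (thr, cost)
    return best

# Toy scores; the costs are hypothetical (a missed fraud costs 50x a false alarm).
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.55, 0.70, 0.90]
thr, cost = best_threshold_by_cost(y_true, scores, cost_fn=50, cost_fp=1)
print(thr, cost)
```

On this toy data the cheapest cutoff sits below 0.5, because a cutoff of 0.5 would miss the fraud case scored 0.40. Swapping the costs so that false alarms are the expensive error pushes the chosen threshold up instead.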

30.6 Metrics that tell the truth

Metric	Tells you
Recall	% of real positives caught
Precision	% of flagged that are real
F1	Balance of the two
PR-AUC (avg precision)	Best summary for rare positives
ROC-AUC	Ranking quality (can look rosy under extreme imbalance)
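
Why ROC-AUC can look rosy becomes clear from one operating point. The sketch below reuses the 1,000,000-transaction example and assumes a hypothetical model that catches 900 of the 1,000 frauds while wrongly flagging 1% of legitimate transactions; the helper name is illustrative.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical operating point: 900 of 1,000 frauds caught,
# 1% of the 999,000 legit transactions wrongly flagged.
tp, fn = 900, 100
fp = 9_990
tn = 999_000 - fp

precision, recall, f1 = precision_recall_f1(tp, fp, fn)
false_positive_rate = fp / (fp + tn)   # the x-axis of the ROC curve

print(f"recall={recall:.2f}  FPR={false_positive_rate:.2f}  "
      f"precision={precision:.3f}  F1={f1:.3f}")
```

A 90% true positive rate at a 1% false positive rate is an excellent point on a ROC curve, yet fewer than one in ten flagged transactions is real fraud. Precision and PR-AUC surface that; ROC-AUC does not.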

Professional recommendation

Start: class_weight='balanced'

Resample: SMOTE inside the pipeline

Judge by: PR-AUC + Recall

Deploy: Cost-tuned threshold

Common mistakes to avoid

Reporting accuracy on an imbalanced problem
Applying SMOTE before the train/test split (leakage)
Leaving the threshold at 0.5 instead of tuning to cost
Undersampling so hard you discard useful signal
Trusting ROC-AUC alone when positives are <1%

Quick cheatsheet

class_weight='balanced' -> reweight classes

imblearn SMOTE in Pipeline -> leak-safe oversample

average_precision -> PR-AUC scoring

precision_recall_curve -> pick threshold

scale_pos_weight -> XGBoost/LightGBM imbalance