Chapter 29 — Validation & Tuning
Model Validation & Hyperparameter Tuning
How to estimate true performance and tune without fooling yourself: cross-validation strategies, hyperparameter search methods, nested CV, and probability calibration.
A single train/test split gives a noisy, often optimistic score. Proper validation is what separates a model that works in production from one that only looked good once.
29.1 Why one split isn't enough
k-fold cross-validation
Data split into k folds (e.g. k=5):
Fold1 [TEST ][train][train][train][train]
Fold2 [train][TEST ][train][train][train]
...
each fold is the test set once
Score = average of k test scores ── lower-variance estimate
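The fold rotation in the diagram above can be sketched by hand. This is a minimal illustration, assuming scikit-learn is installed; the synthetic dataset and the logistic regression model are placeholders chosen for the example, not part of the original text.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, train_test_split

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Single split, repeated with different seeds: the score moves with the split.
single_scores = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    single_scores.append(model.score(X_te, y_te))

# k-fold: each row lands in the test fold exactly once.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
test_counts = np.zeros(len(X), dtype=int)
for train_idx, test_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))
    test_counts[test_idx] += 1

cv_score = float(np.mean(fold_scores))
print("single-split accuracies:", np.round(single_scores, 3))
print(f"5-fold mean accuracy: {cv_score:.3f}")
```

The spread across the single-split scores shows how much one split can mislead, while the k-fold mean averages over five different test sets.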
29.2 Choosing a CV strategy
decision tree
Your data is...
│
├── plain tabular ───────────────► KFold
├── classification (esp. imbalanced) ► StratifiedKFold
├── grouped (same user many rows) ──► GroupKFold
├── time-ordered ──────────────────► TimeSeriesSplit
└── grouped + imbalanced ──────────► StratifiedGroupKFold
python
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Assumes `pipe` is a scikit-learn Pipeline (preprocessing + model) and X, y are already loaded.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring='roc_auc')
print(f"AUC {scores.mean():.3f} ± {scores.std():.3f}")
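For the grouped and time-ordered branches of the decision tree, the sketch below checks the property each splitter is meant to guarantee. The toy arrays and the `groups` vector (four hypothetical users with three rows each) are assumptions made up for the example.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

# Hypothetical toy data: 12 rows belonging to 4 users, 3 rows each.
X = np.arange(24).reshape(12, 2)
y = np.array([0, 1] * 6)
groups = np.repeat([0, 1, 2, 3], 3)

# GroupKFold: a user's rows never sit on both sides of a split.
group_overlaps = []
for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, groups=groups):
    shared = set(groups[train_idx]) & set(groups[test_idx])
    group_overlaps.append(len(shared))

# TimeSeriesSplit: rows are assumed to be sorted by time,
# and every training index precedes every test index.
time_ordered = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    time_ordered.append(bool(train_idx.max() < test_idx.min()))

print("groups shared between train and test, per fold:", group_overlaps)
print("train precedes test, per fold:", time_ordered)
```

Both splitter objects can be passed as `cv=` to `cross_val_score`; for `GroupKFold` the `groups` array also has to be passed to `cross_val_score`.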
29.3 Hyperparameter search methods
| Method | Use when | Weakness |
|---|---|---|
| Grid search | Few params, small grid | Explodes combinatorially |
| Random search | Many params, limited budget | No learning between trials |
| Bayesian (Optuna) | Expensive models, want efficiency | More setup |
| Hyperband / ASHA | Can early-stop bad trials | Needs a partial-fit signal |
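As a rough illustration of the random search row: a grid over 4 parameters with 5 values each already needs 5^4 = 625 fits per fold, whereas random search runs a fixed number of trials regardless of how many parameters are sampled. The sketch below assumes scikit-learn and SciPy are available; the dataset, the model, and the range for `C` are placeholders chosen for the example.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    # Sample C on a log scale instead of enumerating a grid.
    param_distributions={"C": loguniform(1e-3, 1e2)},
    n_iter=10,  # fixed budget: exactly 10 sampled configurations
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X, y)

print("best params:", search.best_params_)
print(f"best inner-CV AUC: {search.best_score_:.3f}")
```

Note that `best_score_` is a tuning score, so it should not be reported as the final performance.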
python
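# Optuna: efficient, modern default
import optuna
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    # Assumes X, y are already loaded; each trial samples one candidate configuration.
    params = {
        'num_leaves': trial.suggest_int('num_leaves', 15, 255),
        'learning_rate': trial.suggest_float('learning_rate', 1e-3, 0.3, log=True),
    }
    # Mean 5-fold AUC is the value Optuna maximizes.
    return cross_val_score(LGBMClassifier(**params), X, y, cv=5, scoring='roc_auc').mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)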
29.4 Nested CV — the honest score
If you tune on the same folds you evaluate on, the reported score is optimistic. Nested CV keeps an outer loop for evaluation and an inner loop for tuning.
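One common way to get the two loops with scikit-learn is to wrap a search object in `cross_val_score`: the search runs the inner loop, and `cross_val_score` runs the outer one. This is a minimal sketch under that assumption; the dataset, the pipeline, and the grid for `C` are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: tune C using only the data handed to the search.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
tuner = GridSearchCV(
    pipe,
    param_grid={"logisticregression__C": [0.01, 0.1, 1.0, 10.0]},
    scoring="roc_auc",
    cv=inner_cv,
)

# Outer loop: for each outer fold the whole search is refit on the
# training part, then the selected model is scored on the held-out part.
nested_scores = cross_val_score(tuner, X, y, cv=outer_cv, scoring="roc_auc")
print(f"nested AUC {nested_scores.mean():.3f} ± {nested_scores.std():.3f}")
```

The hyperparameters chosen can differ between outer folds. The nested score estimates how well the whole tuning procedure performs, not one fixed configuration.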
nested cv
Outer fold (evaluate)
└── Inner CV (tune hyperparameters)
└── pick best params
└── score best model on the untouched outer test fold
Repeat → unbiased performance estimate
29.5 Probability calibration
Many models output scores, not true probabilities. If you act on the probability (expected value, thresholds), calibrate it.
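A minimal sketch of calibrating and then checking the result, assuming scikit-learn. `GaussianNB` is used here only because it often produces overconfident scores; the dataset is synthetic. `method="isotonic"` selects isotonic regression, and `method="sigmoid"` would select Platt scaling instead.

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic data with redundant features, used only for illustration.
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=2, n_redundant=10, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Raw model versus the same model wrapped in a cross-validated calibrator.
raw = GaussianNB().fit(X_tr, y_tr)
calibrated = CalibratedClassifierCV(GaussianNB(), method="isotonic", cv=5)
calibrated.fit(X_tr, y_tr)

p_raw = raw.predict_proba(X_te)[:, 1]
p_cal = calibrated.predict_proba(X_te)[:, 1]

# Brier score: mean squared error of the predicted probability (lower is better).
brier_raw = brier_score_loss(y_te, p_raw)
brier_cal = brier_score_loss(y_te, p_cal)
print(f"Brier raw: {brier_raw:.4f}  calibrated: {brier_cal:.4f}")

# Reliability curve: observed positive rate per bin of predicted probability.
frac_pos, mean_pred = calibration_curve(y_te, p_cal, n_bins=10)
for predicted, observed in zip(mean_pred, frac_pos):
    print(f"predicted {predicted:.2f} -> observed {observed:.2f}")
```

For a well-calibrated model the predicted and observed columns track each other closely. The calibrator is fit on training data only, so the test-set check stays honest.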
Uncalibrated
Model says "0.9" but only 60% of those are positive. Decisions based on the number are wrong.
Calibrated
Platt scaling (sigmoid) or isotonic regression maps scores to honest probabilities. Check with a reliability curve + Brier score.
Professional recommendation
Default CV: StratifiedKFold, k=5
Tuning: Optuna, fixed budget
Final score: Nested CV or held-out set
Probabilities: Calibrate + Brier score
Common mistakes to avoid
- Reporting the tuning score as the final performance (optimistic)
- Random KFold on grouped or time-ordered data
- Preprocessing outside the CV loop (leakage — use a Pipeline)
- Grid-searching a huge space when random/Bayesian is far cheaper
- Treating raw model scores as calibrated probabilities
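The leakage item in the list above is easy to demonstrate with feature selection on pure noise. In this sketch, which assumes scikit-learn and NumPy, the labels are unrelated to the features, so an honest estimate should sit near chance (0.5). Selecting features on the full dataset before cross-validation typically inflates the score well above that. The data sizes and `k=20` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Pure noise: 100 rows, 5000 features, labels independent of the features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5000))
y = np.repeat([0, 1], 50)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Leaky: features are selected using all rows, including future test folds.
X_selected = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_score = cross_val_score(
    LogisticRegression(max_iter=1000), X_selected, y, cv=cv
).mean()

# Leak-safe: selection sits inside the Pipeline, so it is refit per training fold.
pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000))
honest_score = cross_val_score(pipe, X, y, cv=cv).mean()

print(f"selection outside CV (leaky): {leaky_score:.3f}")
print(f"selection inside Pipeline:    {honest_score:.3f}")
```

The same reasoning applies to scalers, imputers, and encoders: anything fit on the data belongs inside the Pipeline passed to `cross_val_score`.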
Quick cheatsheet
StratifiedKFold(shuffle=True) -> classification CV
cross_val_score(pipe, ...) -> leak-safe scoring
optuna.create_study() -> efficient tuning
CalibratedClassifierCV -> honest probabilities
nested CV -> unbiased estimate