Chapter 29 — Validation & Tuning

Model Validation & Hyperparameter Tuning

How to estimate true performance and tune without fooling yourself: cross-validation strategies, hyperparameter search methods, nested CV, and probability calibration.

A single train/test split gives a noisy score, and that score turns optimistic as soon as you pick models or hyperparameters against it. Proper validation is what separates a model that works in production from one that only looked good once.
29.1 Why one split isn't enough
k-fold cross-validation
Data split into k folds (e.g. k=5):
Fold1  [TEST ][train][train][train][train]
Fold2  [train][TEST ][train][train][train]
...    each fold is the test set once
Score = average of k test scores  ── lower-variance estimate
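The diagram above can be written out as an explicit loop. This is a minimal sketch, assuming synthetic data from `make_classification` and a `LogisticRegression` model as stand-ins (neither comes from the chapter); the point is only to show each fold serving as the test set once and the scores being averaged.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, test_idx in kf.split(X):
    # Fit on four folds, score on the one fold that was held out.
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))  # accuracy

fold_scores = np.array(fold_scores)
print("per-fold accuracy:", fold_scores.round(3))
print(f"mean {fold_scores.mean():.3f}, std {fold_scores.std():.3f}")
```

The spread across the per-fold scores is a rough indication of how much a single split could have misled you.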
29.2 Choosing a CV strategy
decision tree
Your data is...
│
├── plain tabular ───────────────► KFold
├── classification (esp. imbalanced) ► StratifiedKFold
├── grouped (same user many rows) ──► GroupKFold
├── time-ordered ──────────────────► TimeSeriesSplit
└── grouped + imbalanced ──────────► StratifiedGroupKFold
python
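from sklearn.model_selection import cross_val_score, StratifiedKFold

# pipe: an sklearn Pipeline (preprocessing + model); X, y: features and labels, defined elsewhere
# Stratified, shuffled folds keep the class ratio roughly equal in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring='roc_auc')
print(f"AUC {scores.mean():.3f} ± {scores.std():.3f}")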
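For the grouped and time-ordered branches of the decision tree, it can help to check what the splitters actually guarantee. The sketch below uses a toy array and hypothetical user ids in `groups` (both are assumptions for illustration) and verifies that `GroupKFold` never puts the same group on both sides of a split and that `TimeSeriesSplit` never trains on rows that come after the test rows.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

X = np.arange(24).reshape(12, 2)       # 12 rows of toy features, in time order
y = np.array([0, 1] * 6)
groups = np.repeat([0, 1, 2, 3], 3)    # hypothetical user ids, 3 rows per user

# GroupKFold: every user lands entirely in train or entirely in test.
group_leak = False
for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, groups=groups):
    if set(groups[train_idx]) & set(groups[test_idx]):
        group_leak = True

# TimeSeriesSplit: training rows always precede the test rows.
future_leak = False
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    if train_idx.max() >= test_idx.min():
        future_leak = True

print("group leak:", group_leak, "| future leak:", future_leak)
```

Either splitter can be passed as `cv=` to `cross_val_score`; for `GroupKFold` the `groups=` argument must be passed as well.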
29.3 Hyperparameter search methods
Method              Use when                            Weakness
Grid search         Few params, small grid              Explodes combinatorially
Random search       Many params, limited budget         No learning between trials
Bayesian (Optuna)   Expensive models, want efficiency   More setup
Hyperband / ASHA    Can early-stop bad trials           Needs a partial-fit signal
python
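# Optuna: efficient, modern default
# Assumes X, y are already loaded; requires the optuna and lightgbm packages.
import optuna
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    # Search space: tree complexity plus a log-uniform learning rate
    params = {
        'num_leaves': trial.suggest_int('num_leaves', 15, 255),
        'learning_rate': trial.suggest_float('learning_rate', 1e-3, 0.3, log=True),
    }
    # Mean 5-fold AUC is the value Optuna maximizes
    return cross_val_score(LGBMClassifier(**params), X, y, cv=5, scoring='roc_auc').mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
print(study.best_params, study.best_value)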
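For comparison, the random-search row of the table can be sketched with scikit-learn alone. This is an illustrative example, not a recommendation over Optuna: the synthetic data, the `LogisticRegression` model, and the range for `C` are all assumptions. `n_iter` is the fixed budget, and each trial is sampled independently of the others, which is the "no learning between trials" weakness.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-3, 1e2)},  # sampled on a log scale
    n_iter=10,                                          # fixed trial budget
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Note that `best_score_` here is a tuning score, chosen as the maximum over trials, so it should not be reported as the final performance estimate.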
29.4 Nested CV — the honest score

If you tune on the same folds you evaluate on, the reported score is optimistic. Nested CV keeps an outer loop for evaluation and an inner loop for tuning.

nested cv
Outer fold (evaluate)
   └── Inner CV (tune hyperparameters)
          └── pick best params
   └── score best model on the untouched outer test fold
Repeat for every outer fold and average → nearly unbiased performance estimate
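In scikit-learn, one common way to express this structure is to pass a search object to `cross_val_score`: the search runs the inner loop, and `cross_val_score` runs the outer one. The sketch below assumes synthetic data, a scaled `LogisticRegression` pipeline, and a small illustrative grid for `C`; none of these come from the chapter.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)  # tuning
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # evaluation

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
tuner = GridSearchCV(
    pipe,
    param_grid={"logisticregression__C": [0.01, 0.1, 1.0, 10.0]},
    scoring="roc_auc",
    cv=inner_cv,
)

# For each outer fold: tune on the outer-train part only, then score the
# refit best model on the untouched outer-test part.
nested_scores = cross_val_score(tuner, X, y, cv=outer_cv, scoring="roc_auc")
print(f"nested AUC {nested_scores.mean():.3f} ± {nested_scores.std():.3f}")
```

The nested score estimates the performance of the whole procedure (tuning included), so different outer folds may select different hyperparameters.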
29.5 Probability calibration

Many models output scores, not true probabilities. If you act on the probability (expected value, thresholds), calibrate it.

Uncalibrated
Model says "0.9" but only 60% of those are positive. Decisions based on the number are wrong.
Calibrated
Platt scaling (sigmoid) or isotonic regression maps scores to honest probabilities. Check with a reliability curve + Brier score.
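A minimal sketch of that check, assuming synthetic data and a `GaussianNB` base model chosen only because naive Bayes often produces poorly calibrated scores when features are redundant. It wraps the model in `CalibratedClassifierCV`, then compares Brier scores and computes the points of a reliability curve on a held-out set.

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic data with redundant features (assumption for illustration).
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=2, n_redundant=10, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)

raw = GaussianNB().fit(X_train, y_train)
# method="sigmoid" is Platt scaling; method="isotonic" is the alternative.
calibrated = CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)

p_raw = raw.predict_proba(X_test)[:, 1]
p_cal = calibrated.predict_proba(X_test)[:, 1]

# Brier score: mean squared error of the predicted probability (lower is better).
brier_raw = brier_score_loss(y_test, p_raw)
brier_cal = brier_score_loss(y_test, p_cal)
print(f"Brier raw {brier_raw:.3f} | calibrated {brier_cal:.3f}")

# Reliability curve: observed positive rate vs. mean predicted probability per bin.
frac_pos, mean_pred = calibration_curve(y_test, p_cal, n_bins=10)
```

Plotting `mean_pred` against `frac_pos` gives the reliability curve; a well-calibrated model stays close to the diagonal.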

Professional recommendation

Default CV      StratifiedKFold, k=5
Tuning          Optuna, fixed budget
Final score     Nested CV or held-out set
Probabilities   Calibrate + Brier score
Quick cheatsheet
StratifiedKFold(shuffle=True) -> classification CV
cross_val_score(pipe, ...) -> leak-safe scoring
optuna.create_study() -> efficient tuning
CalibratedClassifierCV -> honest probabilities
nested CV -> unbiased estimate