Chapter 29 — Validation & Tuning

Model Validation & Hyperparameter Tuning

How to estimate true performance and tune without fooling yourself: cross-validation strategies, the search methods, nested CV, and probability calibration.

A single train/test split gives a noisy score, and that score turns optimistic as soon as you tune against it. Proper validation is what separates a model that works in production from one that only looked good once.

29.1 Why one split isn't enough

k-fold cross-validation

Data split into k folds (e.g. k=5):
Fold1  [TEST ][train][train][train][train]
Fold2  [train][TEST ][train][train][train]
...    each fold is the test set once
Score = average of k test scores  ── lower-variance estimate
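The diagram above can be written out as an explicit loop. This is a minimal sketch, assuming scikit-learn is installed; the synthetic dataset and the logistic regression model are placeholders chosen for illustration, not part of the original text.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
times_tested = np.zeros(len(X), dtype=int)  # how often each row lands in a test fold

for train_idx, test_idx in kf.split(X):
    # Fit on four folds, score on the one held out.
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))
    times_tested[test_idx] += 1

fold_scores = np.array(fold_scores)
print("per-fold accuracy:", np.round(fold_scores, 3))
print(f"mean {fold_scores.mean():.3f}, std {fold_scores.std():.3f}")
```

The spread across folds shows how much a single split could have misled you; `times_tested` confirms that every row is used for testing exactly once.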

29.2 Choosing a CV strategy

decision tree

Your data is...
│
├── plain tabular ───────────────► KFold
├── classification (esp. imbalanced) ► StratifiedKFold
├── grouped (same user many rows) ──► GroupKFold
├── time-ordered ──────────────────► TimeSeriesSplit
└── grouped + imbalanced ──────────► StratifiedGroupKFold
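Two branches of the tree are easy to get wrong, so here is a small sketch of what GroupKFold and TimeSeriesSplit guarantee. The toy data (12 rows, four hypothetical users `u1` to `u4`) is invented for illustration only.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

# Hypothetical toy data: 12 rows belonging to 4 users, 3 rows each.
X = np.arange(12).reshape(-1, 1)
y = np.tile([0, 1, 0], 4)
groups = np.repeat(["u1", "u2", "u3", "u4"], 3)

# GroupKFold: count users that appear on both sides of each split.
group_overlaps = []
for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, groups):
    shared = set(groups[train_idx]) & set(groups[test_idx])
    group_overlaps.append(len(shared))

# TimeSeriesSplit: check that every training index precedes every test index.
time_ordered = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    time_ordered.append(bool(train_idx.max() < test_idx.min()))

print("shared groups per fold:", group_overlaps)
print("train precedes test:", time_ordered)
```

A plain shuffled KFold on the same data would give neither guarantee, which is how grouped or time-ordered data leaks into the score.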

python

from sklearn.model_selection import cross_val_score, StratifiedKFold

# Assumes `pipe` (an sklearn Pipeline or estimator), `X`, and `y` are already defined.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring='roc_auc')
print(f"AUC {scores.mean():.3f} ± {scores.std():.3f}")

29.3 Hyperparameter search methods

Method	Use when	Weakness
Grid search	Few params, small grid	Explodes combinatorially
Random search	Many params, limited budget	No learning between trials
Bayesian (Optuna)	Expensive models, want efficiency	More setup
Hyperband / ASHA	Can early-stop bad trials	Needs a partial-fit signal

python

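# Optuna: efficient, modern default
import optuna
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_val_score

# Assumes X and y are already defined.
def objective(trial):
    # Sample one candidate configuration per trial.
    params = {
        'num_leaves': trial.suggest_int('num_leaves', 15, 255),
        'learning_rate': trial.suggest_float('learning_rate', 1e-3, 0.3, log=True),
    }
    # The study maximizes the mean cross-validated AUC returned here.
    return cross_val_score(LGBMClassifier(**params), X, y, cv=5, scoring='roc_auc').mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
print(study.best_params, study.best_value)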
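For comparison, the random search row of the table can be sketched with scikit-learn's `RandomizedSearchCV`. The dataset, the logistic regression model, and the range for `C` are assumptions made for illustration; the point is the fixed trial budget (`n_iter`) and sampling from a distribution instead of enumerating a grid.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

search = RandomizedSearchCV(
    LogisticRegression(max_iter=2000),
    # Sample C on a log scale, as is usual for regularization strengths.
    param_distributions={"C": loguniform(1e-3, 1e2)},
    n_iter=10,  # fixed budget: exactly 10 configurations are tried
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Note that `best_score_` here is a tuning score: it was used to pick the winner, so it should not be reported as final performance.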

29.4 Nested CV — the honest score

If you tune on the same folds you evaluate on, the reported score is optimistic. Nested CV keeps an outer loop for evaluation and an inner loop for tuning.

nested cv

Outer fold (evaluate)
   └── Inner CV (tune hyperparameters)
          └── pick best params
   └── score best model on the untouched outer test fold
Repeat → approximately unbiased performance estimate
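In scikit-learn the two loops can be composed by passing a search object to `cross_val_score`. The following is a minimal sketch under assumed choices (synthetic data, logistic regression, a four-value grid for `C`), not a prescribed setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)  # tuning
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # evaluation

# Inner loop: the search tunes C using only the data it is fitted on.
tuner = GridSearchCV(
    LogisticRegression(max_iter=2000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="roc_auc",
    cv=inner_cv,
)

# Outer loop: each outer training set is tuned from scratch, then the
# refitted best model is scored on the untouched outer test fold.
nested_scores = cross_val_score(tuner, X, y, cv=outer_cv, scoring="roc_auc")
print(f"nested AUC {nested_scores.mean():.3f} ± {nested_scores.std():.3f}")
```

The result estimates the performance of the whole procedure (tuning included), not of one specific hyperparameter setting; different outer folds may select different values of `C`.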

29.5 Probability calibration

Many models output scores, not true probabilities. If you act on the probability (expected value, thresholds), calibrate it.

Uncalibrated

The model says "0.9", but only 60% of the cases it scores at 0.9 are actually positive. Any decision that treats the number as a probability is wrong.

Calibrated

Platt scaling (sigmoid) or isotonic regression maps scores to honest probabilities. Check with a reliability curve + Brier score.
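A minimal sketch of that workflow with scikit-learn's `CalibratedClassifierCV` follows. The synthetic data and the choice of Gaussian naive Bayes as the base model are assumptions for illustration; whether calibration helps, and by how much, depends on the model and the data.

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Raw model versus the same model wrapped in isotonic calibration.
# Use method="sigmoid" for Platt scaling instead.
raw = GaussianNB().fit(X_train, y_train)
calibrated = CalibratedClassifierCV(GaussianNB(), method="isotonic", cv=5)
calibrated.fit(X_train, y_train)

p_raw = raw.predict_proba(X_test)[:, 1]
p_cal = calibrated.predict_proba(X_test)[:, 1]

# Brier score: mean squared error of the predicted probabilities (lower is better).
brier_raw = brier_score_loss(y_test, p_raw)
brier_cal = brier_score_loss(y_test, p_cal)
print(f"Brier raw {brier_raw:.4f}, calibrated {brier_cal:.4f}")

# Reliability curve: observed positive rate per bin of predicted probability.
frac_pos, mean_pred = calibration_curve(y_test, p_cal, n_bins=10)
```

Plotting `mean_pred` against `frac_pos` gives the reliability curve; a well-calibrated model stays close to the diagonal.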

Professional recommendation

Default CV: StratifiedKFold, k=5

Tuning: Optuna, fixed budget

Final score: Nested CV or held-out set

Probabilities: Calibrate + Brier score

Common mistakes to avoid

Reporting the tuning score as the final performance (optimistic)
Random KFold on grouped or time-ordered data
Preprocessing outside the CV loop (leakage — use a Pipeline)
Grid-searching a huge space when random/Bayesian is far cheaper
Treating raw model scores as calibrated probabilities
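The leakage item in the list above is the easiest to fix in code. The following sketch assumes a scaler plus logistic regression on synthetic data; the `pipe` it builds is one possible definition, not necessarily the one used elsewhere in this chapter.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# The scaler lives inside the pipeline, so in every fold it is fitted on the
# training rows only and merely applied to the held-out rows.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")
print(f"AUC {scores.mean():.3f} ± {scores.std():.3f}")
```

Calling `StandardScaler().fit_transform(X)` on the full dataset before cross-validation would let test-fold statistics influence training, which is the leak the pipeline avoids.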

Quick cheatsheet

StratifiedKFold(shuffle=True) -> classification CV

cross_val_score(pipe, ...) -> leak-safe scoring

optuna.create_study() -> efficient tuning

CalibratedClassifierCV -> honest probabilities

nested CV -> approximately unbiased estimate