Chapter 20 — Data Leakage

Data Leakage Prevention Guide

Leakage is one of the most common reasons a model looks brilliant offline and fails in production. This chapter covers the four classic leaks and the correct pattern for each, shown side by side as wrong vs. right.

What is leakage? Leakage occurs when information that would not be available at prediction time sneaks into training. The result is an inflated score that collapses in the real world. If your test score looks too good to be true, suspect leakage first.
20.1 Leak #1 — Preprocessing before the split
Wrong
Scale / impute on the whole dataset, then split. Statistics computed with the test rows included (mean, std) leak into training.
scaler.fit(X) # all data
Correct
Split first, fit on train, transform test.
scaler.fit(X_train)
scaler.transform(X_test)
python
# The leak-proof pattern: Pipeline fits preprocessing inside each CV fold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale',  StandardScaler()),
    ('model',  LogisticRegression(max_iter=1000)),
])
# cross_val_score refits the whole pipe per fold — no leakage
cross_val_score(pipe, X_train, y_train, cv=5)
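Once cross-validation looks reasonable, the same pipeline can be fit once on the training set and scored once on the held-out test set. The sketch below assumes synthetic data from make_classification as a stand-in for your own X and y; the split sizes and random seeds are illustrative choices, not values from this guide.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; replace with your own X, y).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Split first, so the test rows never influence any fitted statistic.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# fit() learns medians, means and standard deviations from X_train only.
pipe.fit(X_train, y_train)

# Sanity check: the scaler's stored means come from the training rows.
# (The toy data has no missing values, so the imputer leaves it unchanged.)
assert np.allclose(pipe.named_steps["scale"].mean_, X_train.mean(axis=0))

# score() only transforms X_test with the already-fitted steps, then predicts.
test_accuracy = pipe.score(X_test, y_test)
print(f"held-out accuracy: {test_accuracy:.3f}")
```

Scoring the test set only once, at the very end, keeps it from quietly turning into a second validation set.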
20.2 Leak #2 — Target information inside features
Wrong
A feature is a proxy for the target or is computed from it: for example, using total_paid (a value only known after the outcome) to predict defaulted, or target-encoding on the full data before splitting.
Correct
Only use information known before the outcome. Cross-fit target encoding (encode each fold using statistics fit on the other folds), and audit every engineered feature by asking: "Would I know this at prediction time?"
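Cross-fitted target encoding can be sketched in a few lines of pandas. The helper name cross_fit_target_encode, the city column, and the toy values are hypothetical; only the defaulted target name comes from the example above. This is a minimal illustration that assumes a plain mean encoding with no smoothing, and it is meant to be applied to the training data only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def cross_fit_target_encode(df, column, target, n_splits=5, seed=0):
    """Out-of-fold mean-target encoding (hypothetical helper)."""
    encoded = np.empty(len(df), dtype=float)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in folds.split(df):
        fit_part = df.iloc[fit_idx]
        # Category means come from the other folds only, so a row's own
        # target never contributes to its encoded value.
        means = fit_part.groupby(column)[target].mean()
        values = df.iloc[enc_idx][column].map(means)
        # Categories unseen in the fitting folds fall back to the
        # overall target mean of those fitting folds.
        encoded[enc_idx] = values.fillna(fit_part[target].mean()).to_numpy()
    return pd.Series(encoded, index=df.index, name=f"{column}_te")


# Toy training frame (made-up values for illustration).
train = pd.DataFrame({
    "city": ["a", "a", "b", "b", "a", "b", "c", "a", "b", "a"],
    "defaulted": [1, 0, 0, 0, 1, 1, 0, 1, 0, 0],
})
train["city_te"] = cross_fit_target_encode(train, "city", "defaulted")
print(train)
```

For the test set, the usual approach is to compute the category means once on the full training set and map them onto the test rows, so no test target is ever read.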
20.3 Leak #3 — Future data in training (time series)
Wrong
Random K-fold on time-ordered data — the model trains on the future to predict the past.
KFold(shuffle=True)
Correct
Time-aware validation — always train on past, validate on future.
TimeSeriesSplit(n_splits=5)
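The forward-in-time property is easy to check by inspecting the indices each fold produces. The sketch below uses a hypothetical toy array of twelve rows. Note that TimeSeriesSplit splits by row position, so it assumes the rows are already sorted by time.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve observations, assumed to be sorted by time (toy data).
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=5)
splits = list(tscv.split(X))

for fold, (train_idx, val_idx) in enumerate(splits):
    # Every training row precedes every validation row.
    assert train_idx.max() < val_idx.min()
    print(
        f"fold {fold}: train rows 0-{train_idx.max()}, "
        f"validate rows {val_idx.min()}-{val_idx.max()}"
    )
```

The training window grows with each fold while the validation window moves forward, which mirrors how the model would be retrained and used over time.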
20.4 Leak #4 — Duplicates / groups across the split
Wrong
The same customer (or near-duplicate rows) appears in both train and test, so the model "recognises" test rows.
Correct
Split by group so an entity is entirely in train or test.
GroupShuffleSplit / StratifiedGroupKFold
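A group-aware split can be sketched as follows. The customer_id array and the toy values are hypothetical; the point is only that the groups argument keeps each customer on one side of the split.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy data: 10 rows belonging to 5 customers (made-up values).
customer_id = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5])
X = np.arange(10).reshape(-1, 1)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# test_size is a fraction of the groups, not of the rows.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=customer_id))

train_customers = set(customer_id[train_idx])
test_customers = set(customer_id[test_idx])

# No customer appears on both sides of the split.
assert train_customers.isdisjoint(test_customers)
print("train customers:", sorted(train_customers))
print("test customers:", sorted(test_customers))
```

For cross-validation rather than a single split, StratifiedGroupKFold takes the same groups argument in its split() call and additionally tries to keep class proportions similar across folds.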
20.5 Leakage audit checklist

Professional recommendation

Wrap all preprocessing in an sklearn Pipeline and evaluate with cross-validation. The pipeline structurally prevents the most common leak (fitting preprocessing on held-out data) because every transformer is refit on the training portion of each fold.

Common mistakes to avoid
Quick cheatsheet
Pipeline([...]) -> Refit preprocessing inside each fold
scaler.fit(X_train) -> Fit on train, transform test
TimeSeriesSplit() -> Forward-in-time validation
GroupShuffleSplit() -> Keep entities on one side
df.duplicated().sum() -> Catch leakage via duplicates
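The duplicate check in the last cheatsheet row can be extended to look for rows shared between train and test. The sketch below uses two hypothetical toy frames with made-up amount and term columns. It only catches exact duplicates; near-duplicates need a domain-specific key or a group-based split.

```python
import pandas as pd

# Toy frames (made-up columns and values for illustration).
train = pd.DataFrame({"amount": [10, 20, 30, 40], "term": [12, 24, 36, 12]})
test = pd.DataFrame({"amount": [20, 50], "term": [24, 12]})

# Exact duplicate rows inside the training frame.
n_dupes_in_train = int(train.duplicated().sum())

# Test rows that also occur in train: with no `on` argument, merge()
# joins on all columns the two frames share.
overlap = test.merge(train.drop_duplicates(), how="inner")

print("duplicates within train:", n_dupes_in_train)
print("test rows also present in train:", len(overlap))
```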