Chapter 20 — Data Leakage

Data Leakage Prevention Guide

Leakage is one of the most common reasons a model looks brilliant offline and fails in production. This chapter covers the four classic leaks and the correct pattern for each, shown side by side as wrong vs. right.

What is leakage? Leakage occurs when information that would not be available at prediction time sneaks into training. The result is an inflated offline score that collapses in the real world. If your test score looks too good, suspect leakage first.

20.1 Leak #1 — Preprocessing before the split

Wrong

Scale / impute on the whole dataset, then split. Test statistics (mean, std) leak into training.
scaler.fit(X) # all data

Correct

Split first, fit on train, transform test.
scaler.fit(X_train)
scaler.transform(X_test)

python

# The leak-proof pattern: Pipeline fits preprocessing inside each CV fold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale',  StandardScaler()),
    ('model',  LogisticRegression(max_iter=1000)),
])
# X_train, y_train are assumed to be the training split created beforehand.
# cross_val_score clones and refits the whole pipe on each fold's training part,
# so the imputer and scaler never see that fold's validation rows.
scores = cross_val_score(pipe, X_train, y_train, cv=5)
print(scores.mean())
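
For the plain hold-out case described under "Correct" above, the split-first order can be sketched as follows. This is a minimal illustration on synthetic data; the dataset from make_classification and the variable names are assumptions made for the example, not part of the original guide.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for this sketch).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# 1. Split first, before any statistic is computed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# 2. Fit the scaler on the training rows only.
scaler = StandardScaler().fit(X_train)

# 3. Apply the same fitted transform to both sides.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# The stored mean matches the training rows, not the full dataset.
print(np.allclose(scaler.mean_, X_train.mean(axis=0)))
```

The test rows are transformed with training statistics, so their scaled mean will generally not be exactly zero. That is expected and mirrors what happens at prediction time.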

20.2 Leak #2 — Target information inside features

Wrong

A feature is a proxy for the target: using total_paid to predict defaulted, or target-encoding on the full data before splitting.

Correct

Only use information known before the outcome. Use cross-fit target encoding (each row is encoded with statistics fit on the other folds), and audit every engineered feature with one question: "would I know this at prediction time?"
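
Cross-fit target encoding can be sketched with pandas and KFold. The helper name cross_fit_target_encode, the toy city and defaulted values, and the fallback to the fitting-fold mean for unseen categories are all assumptions made for illustration. Recent scikit-learn releases also provide a TargetEncoder transformer, which may be preferable in practice.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def cross_fit_target_encode(categories, target, n_splits=5, seed=0):
    """Out-of-fold mean encoding: each row is encoded from the other folds only."""
    categories = pd.Series(categories).reset_index(drop=True)
    target = pd.Series(target, dtype=float).reset_index(drop=True)
    encoded = pd.Series(np.nan, index=categories.index, dtype=float)

    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(categories):
        # Per-category target means computed on the fitting folds only.
        fold_means = target.iloc[fit_idx].groupby(categories.iloc[fit_idx]).mean()
        # Categories unseen in the fitting folds fall back to the fitting-fold mean.
        fallback = target.iloc[fit_idx].mean()
        encoded.iloc[enc_idx] = (
            categories.iloc[enc_idx].map(fold_means).fillna(fallback).to_numpy()
        )
    return encoded


# Hypothetical toy data: a categorical feature and a binary target.
cities = ["a", "a", "b", "b", "a", "b", "a", "b", "a", "b"]
defaulted = [1, 0, 0, 0, 1, 0, 1, 1, 0, 0]
encoded = cross_fit_target_encode(cities, defaulted, n_splits=5)
print(encoded.round(2).tolist())
```

Because a row's own target never enters the statistic used to encode it, changing that row's label leaves its encoded value unchanged. The same encoder should be applied only within the training data. Test rows are encoded from statistics fit on the full training set.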

20.3 Leak #3 — Future data in training (time series)

Wrong

Random K-fold on time-ordered data — the model trains on the future to predict the past.
KFold(shuffle=True)

Correct

Time-aware validation — always train on past, validate on future.
TimeSeriesSplit(n_splits=5)
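
A short sketch of what TimeSeriesSplit produces, assuming the rows are already sorted by time (the splitter works on row position and does not look at timestamps). The 12-row array is a made-up stand-in for a real series.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Stand-in for 12 observations already sorted from oldest to newest.
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X))

for train_idx, val_idx in folds:
    # Every training position precedes every validation position.
    print("train:", train_idx.tolist(), "validate:", val_idx.tolist())
```

Each fold trains on an expanding window of earlier rows and validates on the block that follows it, so no fold ever uses later observations to predict earlier ones.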

20.4 Leak #4 — Duplicates / groups across the split

Wrong

The same customer (or near-duplicate rows) appears in both train and test, so the model "recognises" test rows.

Correct

Split by group so an entity is entirely in train or test.
GroupShuffleSplit / StratifiedGroupKFold
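
A minimal sketch of a group-aware split, assuming a hypothetical customer_id column identifies the entity. The toy arrays are made up for the example.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical data: two rows per customer.
customer_id = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5])
X = np.arange(len(customer_id)).reshape(-1, 1)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# test_size here is the fraction of groups, not of rows.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=customer_id))

# No customer should appear on both sides of the split.
shared = set(customer_id[train_idx]) & set(customer_id[test_idx])
print("customers on both sides:", shared)
```

The groups argument only helps with entities that carry an explicit identifier. Near-duplicate rows without a shared key need a separate deduplication step before splitting.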

20.5 Leakage audit checklist

Did every fit (scaler, imputer, encoder, selector) see only training data?
Could each feature realistically be known at prediction time?
For time data, is validation strictly forward in time?
Are duplicates and rows from the same entity confined to one side of the split?
Is the test set untouched until the very final evaluation?
Does any single feature have suspiciously high importance? Investigate it.

Professional recommendation

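Wrap all preprocessing in an sklearn Pipeline and evaluate with cross-validation. The pipeline structurally prevents the most common leak (fitting preprocessing on held-out data) because every transformer is refit on the training portion of each fold. It does not by itself address target, temporal, or group leaks, which still need the right features and the right splitter.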

Common mistakes to avoid

Fitting scalers or encoders before splitting the data
Using random K-fold on time-ordered data
Letting the same entity appear in both train and test

Quick cheatsheet

Pipeline([...]) -> Refit preprocessing inside each fold

scaler.fit(X_train) -> Fit on train, transform test

TimeSeriesSplit() -> Forward-in-time validation

GroupShuffleSplit() -> Keep entities on one side

df.duplicated().sum() -> Count exact duplicate rows, a common source of leakage
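
The last cheatsheet entry counts duplicates within a single frame. To check whether the same row sits on both sides of a split, one option is an inner merge on all shared columns. The train and test frames below are hypothetical toy data.

```python
import pandas as pd

# Hypothetical train and test frames with the same columns.
train = pd.DataFrame({"age": [25, 40, 31], "income": [50, 80, 62]})
test = pd.DataFrame({"age": [40, 58], "income": [80, 90]})

# Duplicates inside the training frame alone.
within_train = int(train.duplicated().sum())

# Rows that appear in both frames: with no `on` argument, merge joins on
# every column the two frames have in common.
shared = train.drop_duplicates().merge(test.drop_duplicates(), how="inner")

print("duplicates within train:", within_train)
print("rows present in both train and test:", len(shared))
```

This catches exact matches only. Near-duplicates (for example, the same record with a rounded value) need a fuzzier comparison or a group key.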