Chapter 20 — Data Leakage
Data Leakage Prevention Guide
Leakage is one of the most common reasons a model looks brilliant offline and fails in production. Learn the four classic leaks and the correct pattern for each, side by side, wrong vs right.
What is leakage? Leakage occurs when information that would not be available at prediction time sneaks into training. The result is an inflated score that collapses in the real world. If your test score looks too good, suspect leakage first.
20.1 Leak #1 — Preprocessing before the split
Wrong
Scale / impute on the whole dataset, then split. Test statistics (mean, std) leak into training.
    scaler.fit(X)  # all data
Correct
Split first, fit on train, transform test.
    scaler.fit(X_train)
    scaler.transform(X_test)
```python
# The leak-proof pattern: Pipeline fits preprocessing inside each CV fold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000)),
])

# cross_val_score refits the whole pipe per fold, so no leakage
cross_val_score(pipe, X_train, y_train, cv=5)
```
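To see how much a pre-split fit can inflate a score, the sketch below runs the wrong and right patterns on pure noise. It swaps the scaler for a feature selector (`SelectKBest`), an assumption made only because the effect is larger and easier to see. The data, the seed, and the choice of 20 features are illustrative, not taken from this guide.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))               # pure noise features
y = rng.permutation(np.repeat([0, 1], 50))     # random labels: nothing to learn

# Wrong: the selector is fit on every row, including rows later used for validation.
X_selected = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_score = cross_val_score(
    LogisticRegression(max_iter=1000), X_selected, y, cv=5
).mean()

# Correct: the selector lives in the pipeline and is refit on each fold's training rows.
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("model", LogisticRegression(max_iter=1000)),
])
honest_score = cross_val_score(pipe, X, y, cv=5).mean()

print(f"leaky CV accuracy:  {leaky_score:.2f}")
print(f"honest CV accuracy: {honest_score:.2f}")
```

Because the labels are random, any accuracy well above chance in the first number comes from the selector having seen the validation rows.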
20.2 Leak #2 — Target information inside features
Wrong
A feature is a proxy for the target: using total_paid to predict defaulted, or target-encoding on the full data before splitting.
Correct
Only use information known before the outcome. Cross-fit target encoding (fit on other folds) and audit every engineered feature: "would I know this at prediction time?"
20.3 Leak #3 — Future data in training (time series)
Wrong
Random K-fold on time-ordered data: the model trains on the future to predict the past.
    KFold(shuffle=True)
Correct
Time-aware validation: always train on past, validate on future.
    TimeSeriesSplit(n_splits=5)
20.4 Leak #4 — Duplicates / groups across the split
Wrong
The same customer (or near-duplicate rows) appears in both train and test, so the model "recognises" test rows.
Correct
Split by group so an entity is entirely in train or test.
    GroupShuffleSplit / StratifiedGroupKFold
20.5 Leakage audit checklist
- Did every fit (scaler, imputer, encoder, selector) see only training data?
- Could each feature realistically be known at prediction time?
- For time data, is validation strictly forward in time?
- Are duplicates and rows from the same entity kept on one side of the split only?
- Is the test set untouched until the very final evaluation?
- Does any single feature have suspiciously high importance? Investigate it.
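The first checklist item applies to target encoders as well as scalers. Below is a minimal sketch of the cross-fitted target encoding mentioned in 20.2, assuming a single categorical column and a binary target. The helper name `cross_fit_target_encode` and the `city` column are hypothetical; `defaulted` follows the example above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def cross_fit_target_encode(categories, target, n_splits=5, seed=0):
    """Out-of-fold mean target encoding (hypothetical helper, training data only)."""
    categories = pd.Series(categories).reset_index(drop=True)
    target = pd.Series(target, dtype=float).reset_index(drop=True)
    encoded = pd.Series(np.nan, index=categories.index, dtype=float)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in folds.split(categories):
        # Category means come only from rows outside the fold being encoded.
        means = target.iloc[fit_idx].groupby(categories.iloc[fit_idx]).mean()
        fallback = target.iloc[fit_idx].mean()  # used for categories unseen in the fit rows
        encoded.iloc[enc_idx] = (
            categories.iloc[enc_idx].map(means).fillna(fallback).to_numpy()
        )
    return encoded

df = pd.DataFrame({
    "city": ["a", "a", "a", "b", "b", "b", "c", "c"],
    "defaulted": [1, 1, 0, 0, 0, 1, 1, 0],
})
df["city_te"] = cross_fit_target_encode(df["city"], df["defaulted"], n_splits=4)
print(df)
```

No row's encoding depends on its own label, which is the property that removes the leak. Test rows would be encoded with category means computed from the full training set. Recent scikit-learn releases also include a `TargetEncoder` whose `fit_transform` cross-fits internally, which may be preferable to a hand-rolled helper.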
Professional recommendation
Wrap all preprocessing in an sklearn Pipeline and evaluate with cross-validation. The pipeline structurally prevents the most common leak (fitting on test data) because every transformer is refit on the training portion of each fold.
Common mistakes to avoid
- Fitting scalers or encoders before splitting the data
- Using random K-fold on time-ordered data
- Letting the same entity appear in both train and test
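The last two mistakes can be checked mechanically. The sketch below uses toy data, assuming rows are already sorted by time and that `groups` holds a hypothetical customer id per row. It confirms that `TimeSeriesSplit` only validates on later rows and that `GroupShuffleSplit` keeps each customer on one side.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)         # 20 rows, assumed sorted by time
groups = np.repeat(np.arange(5), 4)      # hypothetical customer ids, 4 rows each

# Forward-in-time validation: every training index precedes every validation index.
for train_idx, val_idx in TimeSeriesSplit(n_splits=4).split(X):
    assert train_idx.max() < val_idx.min()

# Group-aware split: no customer id appears on both sides.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.4, random_state=0)
train_idx, test_idx = next(splitter.split(X, groups=groups))
shared = set(groups[train_idx]) & set(groups[test_idx])
print("customers on both sides:", shared)
```

`TimeSeriesSplit` splits by row position, so the forward-in-time guarantee holds only if the data is sorted chronologically before splitting.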
Quick cheatsheet
Pipeline([...]) -> Refit preprocessing inside each fold
scaler.fit(X_train) -> Fit on train, transform test
TimeSeriesSplit() -> Forward-in-time validation
GroupShuffleSplit() -> Keep entities on one side
df.duplicated().sum() -> Catch leakage via duplicates
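The `df.duplicated().sum()` check only looks inside one frame. A small sketch of extending it across an existing train/test split follows. The helper `audit_split` and the column names are hypothetical, and it compares exact feature values only, so near-duplicates would need a fuzzier comparison.

```python
import pandas as pd

def audit_split(train, test, feature_cols, entity_col):
    """Hypothetical helper: count rows and entities shared by train and test."""
    # Exact duplicate feature rows that occur in both frames.
    shared_rows = train[feature_cols].drop_duplicates().merge(
        test[feature_cols].drop_duplicates(), on=feature_cols, how="inner"
    )
    shared_entities = set(train[entity_col]) & set(test[entity_col])
    return {
        "duplicates_within_train": int(train.duplicated(subset=feature_cols).sum()),
        "rows_in_both": len(shared_rows),
        "entities_in_both": sorted(shared_entities),
    }

train = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "amount": [10.0, 10.0, 25.0, 40.0],
    "n_orders": [1, 1, 3, 2],
})
test = pd.DataFrame({
    "customer_id": [3, 4],
    "amount": [40.0, 55.0],
    "n_orders": [2, 5],
})

report = audit_split(train, test, ["amount", "n_orders"], "customer_id")
print(report)
```

Any non-zero `rows_in_both` or non-empty `entities_in_both` is a reason to redo the split by group before trusting the test score.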