Chapter 18 — Case Studies
Real Industry Case Studies
End-to-end recommended pipelines for the five most common analytics projects. Each shows the goal, the recommended stack, the metric professionals report, and the traps to avoid.
Each case follows the same shape: goal → pipeline → metric → watch-outs. Copy the pipeline as a starting blueprint, then adapt to your data.
Case 1 — Customer Churn Prediction
Goal: flag customers likely to leave so retention can intervene. Classic imbalanced binary classification.
- Median impute
- One-Hot encode
- Feature engineering
- XGBoost
- ROC-AUC + Recall
- SHAP
Recommended setup
| Setting | Recommendation |
|---|---|
| Missing values | Median + "Missing" flag |
| Encoding | One-Hot (low cardinality) |
| Model | XGBoost / LightGBM |
| Metric | ROC-AUC + Recall |
| Explain | SHAP per customer |
| Imbalance | class_weight / scale_pos_weight |
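The setup above can be sketched end to end with scikit-learn alone. This is a minimal illustration, not a tuned model: the data is synthetic, the column names are hypothetical, and `GradientBoostingClassifier` with balanced sample weights stands in for XGBoost with `scale_pos_weight` so that the example needs no extra dependency. The SHAP step is left out.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.utils.class_weight import compute_sample_weight

rng = np.random.default_rng(0)
n = 2000
# Synthetic stand-in data; all column names are hypothetical.
df = pd.DataFrame({
    "monthly_charge": rng.normal(70, 20, n),
    "support_calls": rng.poisson(2, n).astype(float),
    "contract": rng.choice(["monthly", "annual", "two_year"], n),
})
# Churn is the minority class, driven by support calls and monthly contracts.
logit = -3.5 + 0.8 * df["support_calls"] + 1.0 * (df["contract"] == "monthly")
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
# Knock out some values so the imputer has work to do.
df.loc[rng.random(n) < 0.1, "monthly_charge"] = np.nan

numeric = ["monthly_charge", "support_calls"]
categorical = ["contract"]
preprocess = ColumnTransformer([
    # Median impute and append a binary "was missing" indicator column.
    ("num", SimpleImputer(strategy="median", add_indicator=True), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])
model = Pipeline([
    ("prep", preprocess),
    ("clf", GradientBoostingClassifier(random_state=0)),  # stand-in for XGBoost
])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.25, stratify=y, random_state=0
)
# Balanced sample weights play the role of class_weight / scale_pos_weight.
weights = compute_sample_weight("balanced", y_train)
model.fit(X_train, y_train, clf__sample_weight=weights)

proba = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, proba)
recall = recall_score(y_test, (proba >= 0.5).astype(int))
print(f"ROC-AUC: {auc:.3f}  Recall at 0.5: {recall:.3f}")
```

Keeping the imputer and encoder inside the `Pipeline` means they are fitted on the training rows only, so the test rows do not influence the medians or the category list.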
Watch-out: tenure-derived features can leak the outcome (a churned customer's last month looks different). Engineer features only from data available before the prediction window.
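One way to enforce this is to filter the raw event log by a cutoff date before any aggregation. The sketch below assumes a hypothetical event table with `customer_id`, `event_date`, and `support_calls` columns and an assumed cutoff of 2024-03-01; none of these names come from a specific dataset.

```python
import pandas as pd

# Hypothetical event log: one row per customer interaction.
events = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "event_date": pd.to_datetime(
        ["2024-01-05", "2024-02-10", "2024-03-20", "2024-01-15", "2024-03-02"]
    ),
    "support_calls": [1, 2, 5, 0, 1],
})

cutoff = pd.Timestamp("2024-03-01")  # assumed start of the prediction window

# Keep only what was known before the cutoff, then aggregate per customer.
history = events[events["event_date"] < cutoff]
features = history.groupby("customer_id").agg(
    calls_before_cutoff=("support_calls", "sum"),
    last_event=("event_date", "max"),
)
features["days_since_last_event"] = (cutoff - features["last_event"]).dt.days
print(features)
```

The March rows never reach the aggregation, so a burst of support calls in a customer's final month cannot leak into the features.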
Case 2 — House Price Prediction
Goal: predict a continuous sale price. Regression with mixed numeric and high-cardinality location features.
- Median impute
- Target encode location
- Feature engineering
- XGBoost
- RMSE
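The "target encode location" step replaces each location with a summary of the target for that location. A minimal sketch follows, under two assumptions that the pipeline above does not spell out: the encoding is learned from training rows only, and small categories are shrunk toward the global mean with a hypothetical `smoothing` parameter. The helper names are illustrative.

```python
import pandas as pd

def fit_target_encoding(train, col, target, smoothing=10.0):
    """Learn a smoothed per-category target mean from training rows only."""
    global_mean = train[target].mean()
    stats = train.groupby(col)[target].agg(["mean", "count"])
    # Shrink categories with few rows toward the global mean.
    mapping = (stats["count"] * stats["mean"] + smoothing * global_mean) / (
        stats["count"] + smoothing
    )
    return mapping, global_mean

def apply_target_encoding(frame, col, mapping, global_mean):
    # Categories unseen in training fall back to the global mean.
    return frame[col].map(mapping).fillna(global_mean)

train = pd.DataFrame({
    "neighbourhood": ["A", "A", "A", "B"],
    "log_price": [12.0, 12.2, 12.4, 13.0],
})
test = pd.DataFrame({"neighbourhood": ["A", "B", "C"]})

mapping, global_mean = fit_target_encoding(
    train, "neighbourhood", "log_price", smoothing=1.0
)
test["neighbourhood_te"] = apply_target_encoding(
    test, "neighbourhood", mapping, global_mean
)
print(test)
```

Because the encoding is computed from the target, applying it to the same rows it was learned from can overfit; computing it out-of-fold within the training set is a common safeguard.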
Recommended setup
| Setting | Recommendation |
|---|---|
| Missing values | Median |
| Encoding | Target encode neighbourhood |
| Target | log1p(price), then expm1 back |
| Model | XGBoost / LightGBM |
| Metric | RMSE (or RMSLE) |
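The target transform and the two metrics in the setup above can be sketched with scikit-learn's `TransformedTargetRegressor`, which applies the forward function before fitting and the inverse after predicting. The data here is synthetic, `area` is a hypothetical feature, and `LinearRegression` is only a stand-in for a boosted model.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n = 500
area = rng.uniform(50, 250, n)  # hypothetical floor area
# Right-skewed synthetic prices: multiplicative noise on an exponential trend.
price = 50_000 * np.exp(0.008 * area + rng.normal(0, 0.2, n))
X = area.reshape(-1, 1)

model = TransformedTargetRegressor(
    regressor=LinearRegression(),  # stand-in for XGBoost / LightGBM
    func=np.log1p,                 # fit on log1p(price)
    inverse_func=np.expm1,         # predictions come back in price units
)
model.fit(X[:400], price[:400])
pred = model.predict(X[400:])

rmse = np.sqrt(mean_squared_error(price[400:], pred))
rmsle = np.sqrt(np.mean((np.log1p(pred) - np.log1p(price[400:])) ** 2))
print(f"RMSE: {rmse:,.0f}  RMSLE: {rmsle:.3f}")
```

RMSE is in currency units and is dominated by expensive houses; RMSLE measures relative error, which is what the log transform optimises.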
Log-transform a right-skewed price target with np.log1p so that errors are penalised proportionally rather than in absolute terms, then invert predictions with np.expm1.
Case 3 — Fraud Detection
Goal: catch rare fraudulent transactions. Extreme class imbalance (often <1% positive).
- SMOTE / class weight
- XGBoost
- Recall first
- Precision-Recall AUC
- Threshold tuning
Recommended setup
| Setting | Recommendation |
|---|---|
| Imbalance | SMOTE or scale_pos_weight |
| Model | XGBoost / Isolation Forest |
| Primary metric | Recall |
| Secondary | PR-AUC + Precision |
| Decision | Tune threshold to cost |
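Tuning the threshold to cost means picking the score cutoff that minimises the total cost of missed fraud and false alarms, rather than defaulting to 0.5. The sketch below assumes hypothetical per-case costs (`cost_fn` for a missed fraud, `cost_fp` for a false alarm) and a handful of made-up scores.

```python
import numpy as np

def best_threshold(y_true, scores, cost_fn, cost_fp):
    """Return (threshold, cost) with the lowest total misclassification cost."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    best_t, best_cost = None, np.inf
    for t in np.unique(scores):  # candidate cutoffs: every observed score
        flagged = scores >= t
        fn = np.sum((y_true == 1) & ~flagged)  # fraud that slipped through
        fp = np.sum((y_true == 0) & flagged)   # legitimate cases flagged
        cost = cost_fn * fn + cost_fp * fp
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost

y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.70, 0.90]
# Assumed costs: a missed fraud is 20x as expensive as a false alarm.
t, cost = best_threshold(y_true, scores, cost_fn=100, cost_fp=5)
print(t, cost)
```

On this toy data the chosen cutoff is 0.35 with a total cost of 5, whereas a 0.5 cutoff would miss one fraud and cost 105. In practice the threshold should be chosen on a validation set, not on the test set.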
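Watch-out: apply SMOTE inside each cross-validation fold, after the split, and only to the training portion. Never apply it to the full dataset: synthetic rows interpolated from points that end up in the test set leak information across the split and inflate scores.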
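The fold-level discipline can be sketched as follows. To stay self-contained, the example uses a simplified, hand-written SMOTE-style oversampler (`smote_like`, a hypothetical helper) rather than a library implementation, logistic regression as a stand-in model, and synthetic data with 5% positives to keep the toy small. Average precision is used as the PR-AUC summary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import NearestNeighbors

def smote_like(X_min, n_new, k=5, rng=None):
    """Simplified SMOTE: interpolate between a minority point and one of
    its k nearest minority neighbours."""
    rng = np.random.default_rng(rng)
    k = min(k, len(X_min) - 1)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]  # drop self
    base = rng.integers(0, len(X_min), n_new)
    picked = neighbours[base, rng.integers(0, k, n_new)]
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])

rng = np.random.default_rng(0)
n_neg, n_pos = 1900, 100
X = np.vstack([rng.normal(0, 1, (n_neg, 4)), rng.normal(1.5, 1, (n_pos, 4))])
y = np.array([0] * n_neg + [1] * n_pos)

scores = []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in cv.split(X, y):
    X_tr, y_tr = X[train_idx], y[train_idx]
    # Oversample using training rows only; the test fold is never touched.
    X_min = X_tr[y_tr == 1]
    n_new = int((y_tr == 0).sum() - len(X_min))
    X_syn = smote_like(X_min, n_new, rng=rng)
    X_bal = np.vstack([X_tr, X_syn])
    y_bal = np.concatenate([y_tr, np.ones(n_new, dtype=int)])

    clf = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
    proba = clf.predict_proba(X[test_idx])[:, 1]
    scores.append(average_precision_score(y[test_idx], proba))

pr_auc = float(np.mean(scores))
print(f"Mean PR-AUC over folds: {pr_auc:.3f}")
```

Each test fold contains only real, untouched rows, so the reported score reflects performance on the true class balance.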
Case 4 — Sales Forecasting
Goal: forecast future demand from historical time series. Order and seasonality matter.
- Time features
- Lag / rolling
- Prophet / LightGBM
- Time-aware CV
- MAPE
Recommended setup
| Setting | Recommendation |
|---|---|
| Features | Lags, rolling means, calendar |
| Baseline | Prophet / seasonal naive |
| Challenger | LightGBM on lag features |
| Validation | Rolling / expanding window |
| Metric | MAPE / sMAPE |
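The setup above can be sketched on a synthetic daily series. The example builds lag, rolling, and calendar features from past values only, validates with scikit-learn's `TimeSeriesSplit` (an expanding window), and compares a model against a seasonal-naive baseline on MAPE. `LinearRegression` is a stand-in for LightGBM, and the series and feature names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (y_true must be non-zero)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

rng = np.random.default_rng(0)
days = pd.date_range("2023-01-01", periods=365, freq="D")
# Synthetic sales with a weekly cycle; strictly positive so MAPE is defined.
sales = 200 + 30 * np.sin(2 * np.pi * np.arange(365) / 7) + rng.normal(0, 5, 365)
df = pd.DataFrame({"sales": sales}, index=days)

# Shift before rolling so every feature uses past values only.
df["lag_1"] = df["sales"].shift(1)
df["lag_7"] = df["sales"].shift(7)
df["roll_mean_7"] = df["sales"].shift(1).rolling(7).mean()
df["dayofweek"] = df.index.dayofweek
df = df.dropna()

features = ["lag_1", "lag_7", "roll_mean_7", "dayofweek"]
model_scores, naive_scores = [], []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(df):
    train, test = df.iloc[train_idx], df.iloc[test_idx]
    assert train.index.max() < test.index.min()  # validation is in the future
    model = LinearRegression().fit(train[features], train["sales"])
    model_scores.append(mape(test["sales"], model.predict(test[features])))
    # Seasonal-naive baseline: predict the value from the same weekday last week.
    naive_scores.append(mape(test["sales"], test["lag_7"]))

print(f"Model MAPE: {np.mean(model_scores):.2f}%")
print(f"Seasonal-naive MAPE: {np.mean(naive_scores):.2f}%")
```

A challenger is only worth deploying if it beats the seasonal-naive number on the same folds.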
Watch-out: never use random K-fold on time series. Use TimeSeriesSplit so the model is always validated on the future, never the past.
Case 5 — Recommendation System
Goal: suggest relevant items to users. Sparse user–item interaction data.
- Build interaction matrix
- Collaborative filtering
- Matrix factorization
- Embeddings
- Precision@K
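The first steps of this pipeline can be sketched with NumPy alone: build the interaction matrix, factorize it, and score unseen items from the resulting user and item embeddings. The interaction log is made up, and plain truncated SVD on a zero-filled matrix is a simplification: it treats missing entries as zeros, which methods such as ALS avoid.

```python
import numpy as np

# Hypothetical interaction log: (user, item, strength).
interactions = [
    (0, 0, 5), (0, 1, 4),
    (1, 0, 4), (1, 1, 5), (1, 2, 1),
    (2, 2, 5), (2, 3, 4),
    (3, 2, 4), (3, 3, 5),
]
n_users, n_items = 4, 4

# Step 1: user-item interaction matrix (zeros mean "no interaction").
R = np.zeros((n_users, n_items))
for u, i, r in interactions:
    R[u, i] = r

# Step 2: truncated SVD keeps k latent factors per user and per item.
k = 2
U, s, Vt = np.linalg.svd(R, full_matrices=False)
user_factors = U[:, :k] * s[:k]   # one k-dimensional embedding per user
item_factors = Vt[:k].T           # one k-dimensional embedding per item

# Step 3: predicted affinity is the dot product of the embeddings.
scores = user_factors @ item_factors.T

def recommend(user, top_n=2):
    """Rank items the user has not interacted with, best first."""
    seen = R[user] > 0
    ranked = np.argsort(-np.where(seen, -np.inf, scores[user]))
    return [int(i) for i in ranked[:top_n]]

print(recommend(0))
```

Masking already-seen items before ranking keeps the recommendations from simply echoing the user's history.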
Recommended setup
| Setting | Recommendation |
|---|---|
| Cold start | Content-based fallback |
| Core method | Matrix factorization (ALS/SVD) |
| Modern | Two-tower embeddings |
| Metric | Precision@K / NDCG / MAP |
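Two of the ranking metrics above are short enough to write out. The sketch assumes binary relevance (an item is either relevant or not) and uses made-up item IDs; graded relevance would need a different gain term in NDCG.

```python
import math

def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    top_k = recommended[:k]
    return sum(1 for item in top_k if item in relevant) / k

def ndcg_at_k(recommended, relevant, k):
    """Binary-relevance NDCG: hits near the top count more."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

recommended = ["a", "b", "c", "d", "e"]  # model output, best first
relevant = {"a", "c", "f"}               # items the user actually engaged with

p5 = precision_at_k(recommended, relevant, 5)  # 2 hits in 5 slots = 0.4
n3 = ndcg_at_k(recommended, relevant, 3)
print(p5, round(n3, 4))
```

Precision@K ignores the order within the top K, while NDCG rewards placing relevant items earlier, so the two can disagree about which model is better.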
18.6 Cross-case summary
| Project | Type | Go-to model | Metric |
|---|---|---|---|
| Churn | Imbalanced classification | XGBoost | ROC-AUC + Recall |
| House price | Regression | XGBoost | RMSE |
| Fraud | Rare-event classification | XGBoost + SMOTE | Recall + PR-AUC |
| Sales forecast | Time series | Prophet / LightGBM | MAPE |
| Recommender | Ranking | Matrix factorization | Precision@K |
Common mistakes to avoid
- Copying a pipeline without checking it fits your data and goal
- Engineering features from data unavailable at prediction time
- Applying SMOTE before the train/test split
Quick cheatsheet
- df.info() -> Structure and non-null counts
- df.describe() -> Numeric summary statistics
- df.isnull().sum() -> Missing-value counts by column
- df.groupby() -> Segmented aggregation
- pd.merge() -> Join multiple datasets
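A quick pass with these calls on two tiny, made-up tables (the `orders` and `customers` frames and their columns are hypothetical) might look like this:

```python
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "amount": [20.0, None, 35.0, 15.0],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "segment": ["retail", "retail", "business"],
})

orders.info()                      # structure and non-null counts
summary = orders.describe()        # numeric summary statistics
missing = orders.isnull().sum()    # missing-value counts by column

# Join the two tables, then aggregate by segment.
joined = pd.merge(orders, customers, on="customer_id", how="left")
by_segment = joined.groupby("segment")["amount"].sum()
print(missing)
print(by_segment)
```

Note that `sum()` skips missing values by default, so the one missing `amount` is silently excluded from its segment total; checking `isnull().sum()` first makes that visible.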