Chapter 18 — Case Studies

Real Industry Case Studies

This chapter gives end-to-end recommended pipelines for five common analytics projects. Each case states the goal, the recommended stack, the metric professionals report, and the traps to avoid.

The cases share one shape: goal → pipeline → metric → watch-outs. Copy the pipeline as a starting blueprint, then adapt it to your data.

Case 1 — Customer Churn Prediction

Goal: flag customers likely to leave so retention can intervene. Classic imbalanced binary classification.

Pipeline: Median impute → One-Hot encode → Feature engineering → XGBoost → ROC-AUC + F1 → SHAP

Recommended setup

Missing values: Median + "Missing" flag
Encoding: One-Hot (low cardinality)
Model: XGBoost / LightGBM
Metric: ROC-AUC + Recall
Explain: SHAP per customer
Imbalance: class_weight / scale_pos_weight

Watch-out: tenure-derived features can leak the outcome (a churned customer's last month looks different). Engineer features only from data available before the prediction window.

Case 2 — House Price Prediction

Goal: predict a continuous sale price. Regression with mixed numeric and high-cardinality location features.

Pipeline: Median impute → Target encode location → Feature engineering → XGBoost → RMSE

Recommended setup

Missing values: Median
Encoding: Target encode neighbourhood
Target: log1p(price), then expm1 back
Model: XGBoost / LightGBM
Metric: RMSE (or RMSLE)

Log-transform a right-skewed price target so that errors are penalised proportionally rather than in absolute terms: fit on np.log1p(price), then invert predictions with np.expm1, which is the exact inverse of np.log1p.
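A small numeric sketch of the target transform follows. The prices and predictions are made-up illustrative values, and rmse is a helper defined here for the example, not a library function.

```python
import numpy as np

# Hypothetical sale prices and model predictions (illustrative values only).
price_true = np.array([120_000.0, 250_000.0, 1_500_000.0])
price_pred = np.array([100_000.0, 300_000.0, 1_200_000.0])

# Train on the transformed target: log1p(x) = log(1 + x).
y_log = np.log1p(price_true)

# Invert with expm1; the round trip recovers the original prices.
price_back = np.expm1(y_log)
print(np.allclose(price_back, price_true))


def rmse(a, b):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((a - b) ** 2)))


# RMSE in price units is dominated by the most expensive house;
# RMSE on the log1p scale (RMSLE) weighs relative errors instead.
rmse_value = rmse(price_true, price_pred)
rmsle_value = rmse(np.log1p(price_true), np.log1p(price_pred))
print("RMSE :", rmse_value)
print("RMSLE:", rmsle_value)
```

Here all three predictions are off by roughly 20%, so they contribute similar amounts to RMSLE, while the RMSE is driven almost entirely by the 300,000 miss on the most expensive house.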

Case 3 — Fraud Detection

Goal: catch rare fraudulent transactions. Extreme class imbalance (often <1% positive).

Pipeline: SMOTE / class weight → XGBoost → Recall first → Precision-Recall AUC → Threshold tuning

Recommended setup

Imbalance: SMOTE or scale_pos_weight
Model: XGBoost / Isolation Forest
Primary metric: Recall
Secondary: PR-AUC + Precision
Decision: Tune threshold to cost

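Watch-out: apply SMOTE only to the training portion of each cross-validation fold, after the split. Resampling the full dataset first lets synthetic rows interpolated from held-out points end up in training (and synthetic rows end up in the test set), which inflates scores.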
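"Tune threshold to cost" can be made concrete with a short sketch. The function best_threshold, the labels, the scores, and the cost figures below are all hypothetical; in practice the scores would come from a held-out validation set and the costs from the business (for example, the loss from a missed fraud versus the cost of reviewing a false alarm).

```python
import numpy as np


def best_threshold(y_true, scores, cost_fn, cost_fp):
    """Return (threshold, total_cost) minimising misclassification cost.

    A transaction is flagged when score >= threshold. Ties are resolved
    in favour of the lowest threshold.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    # Candidate thresholds: each observed score, plus "flag nothing".
    candidates = np.append(np.unique(scores), np.inf)
    best_t, best_cost = None, np.inf
    for t in candidates:
        pred = scores >= t
        fn = np.sum(~pred & (y_true == 1))  # missed fraud
        fp = np.sum(pred & (y_true == 0))   # false alarms
        cost = cost_fn * fn + cost_fp * fp
        if cost < best_cost:
            best_t, best_cost = float(t), float(cost)
    return best_t, best_cost


# Hypothetical validation labels (1 = fraud) and model scores.
y_val = [0, 0, 0, 0, 1, 0, 1, 1]
s_val = [0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.70, 0.90]

# Missed fraud is 50x as costly as a false alarm: a low threshold wins.
recall_first = best_threshold(y_val, s_val, cost_fn=50, cost_fp=1)
# False alarms are 5x as costly as missed fraud: the threshold moves up.
precision_first = best_threshold(y_val, s_val, cost_fn=1, cost_fp=5)
print(recall_first, precision_first)
```

The same scores yield different operating points under different cost ratios, which is why the threshold is a business decision rather than a fixed 0.5.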

Case 4 — Sales Forecasting

Goal: forecast future demand from historical time series. Order and seasonality matter.

Pipeline: Time features → Lag / rolling → Prophet / LightGBM → Time-aware CV → MAPE

Recommended setup

Features: Lags, rolling means, calendar
Baseline: Prophet / seasonal naive
Challenger: LightGBM on lag features
Validation: Rolling / expanding window
Metric: MAPE / sMAPE

Watch-out: never use random K-fold on time series. Use TimeSeriesSplit so the model is always validated on the future, never the past.
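The sketch below builds lag and rolling features and scores a seasonal-naive baseline with TimeSeriesSplit. The daily series is synthetic and the column names are hypothetical; the baseline simply predicts the value from seven days earlier.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical daily sales series with a weekly pattern (synthetic data).
rng = np.random.default_rng(0)
dates = pd.date_range("2023-01-01", periods=120, freq="D")
sales = 100 + 10 * np.sin(2 * np.pi * np.arange(120) / 7) + rng.normal(0, 2, 120)
df = pd.DataFrame({"sales": sales}, index=dates)

# Lag and rolling features use only past values: shift(1) before rolling
# keeps the current day's sales out of its own feature.
df["lag_1"] = df["sales"].shift(1)
df["lag_7"] = df["sales"].shift(7)
df["roll_mean_7"] = df["sales"].shift(1).rolling(7).mean()
df["dayofweek"] = df.index.dayofweek
df = df.dropna()

tscv = TimeSeriesSplit(n_splits=4)  # expanding training window
folds = list(tscv.split(df))
mapes = []
for train_idx, test_idx in folds:
    test = df.iloc[test_idx]
    # Seasonal-naive baseline: predict the value from one week earlier.
    mape = float(np.mean(np.abs(test["sales"] - test["lag_7"]) / np.abs(test["sales"])))
    mapes.append(mape)
    # Every training row precedes every validation row.
    print(df.index[train_idx[-1]].date(), "<", df.index[test_idx[0]].date(),
          "MAPE:", round(mape, 4))
```

A challenger model such as LightGBM would be fitted on df.iloc[train_idx] inside the same loop and compared against these baseline scores fold by fold.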

Case 5 — Recommendation System

Goal: suggest relevant items to users. Sparse user–item interaction data.

Pipeline: Build interaction matrix → Collaborative filtering → Matrix factorization → Embeddings → Precision@K

Recommended setup

Cold start: Content-based fallback
Core method: Matrix factorization (ALS/SVD)
Modern: Two-tower embeddings
Metric: Precision@K / NDCG / MAP
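Precision@K is simple enough to compute by hand. The function below is an illustrative helper with hypothetical item IDs; it uses the common convention of dividing by K even when fewer than K items were recommended.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are relevant.

    recommended: ranked list of item IDs, best first.
    relevant: set of item IDs the user actually interacted with.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k


# Hypothetical ranking for one user and the items they later engaged with.
recommended = ["A", "B", "C", "D", "E"]
relevant = {"B", "D", "F"}

p_at_3 = precision_at_k(recommended, relevant, 3)  # 1 hit (B) in the top 3
p_at_5 = precision_at_k(recommended, relevant, 5)  # 2 hits (B, D) in the top 5
print(p_at_3, p_at_5)
```

In an evaluation this value is averaged over all users. Unlike NDCG, it ignores the order of hits within the top K.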

18.6 Cross-case summary

Project	Type	Go-to model	Metric
Churn	Imbalanced classification	XGBoost	ROC-AUC + Recall
House price	Regression	XGBoost	RMSE
Fraud	Rare-event classification	XGBoost + SMOTE	Recall + PR-AUC
Sales forecast	Time series	Prophet / LightGBM	MAPE
Recommender	Ranking	Matrix factorization	Precision@K

Common mistakes to avoid

Copying a pipeline without checking it fits your data and goal
Engineering features from data unavailable at prediction time
Applying SMOTE before the train/test split

Quick cheatsheet

df.info() -> Structure and non-null counts

df.describe() -> Numeric summary statistics

df.isnull().sum() -> Missing-value counts by column

df.groupby() -> Segmented aggregation

pd.merge() -> Join multiple datasets
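The five calls above can be tried on a tiny example. The two tables and their column names are hypothetical.

```python
import pandas as pd

# Hypothetical orders and customers tables (illustrative values only).
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "amount": [20.0, None, 35.0, 50.0],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "segment": ["retail", "retail", "business"],
})

orders.info()                    # structure and non-null counts
print(orders.describe())         # numeric summary statistics
missing = orders.isnull().sum()  # missing-value counts by column
print(missing)

# Join the two datasets, then aggregate per segment.
merged = pd.merge(orders, customers, on="customer_id", how="left")
by_segment = merged.groupby("segment")["amount"].sum()  # NaN is skipped by sum
print(by_segment)
```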