Chapter 18 — Case Studies

Real Industry Case Studies

End-to-end recommended pipelines for the five most common analytics projects. Each shows the goal, the recommended stack, the metric professionals report, and the traps to avoid.

Each case follows the same shape: goal → pipeline → metric → watch-outs. Copy the pipeline as a starting blueprint, then adapt to your data.
Case 1 — Customer Churn Prediction

Goal: flag customers likely to leave so retention can intervene. Classic imbalanced binary classification.

Recommended setup

Missing values: Median + "Missing" flag
Encoding: One-hot (low cardinality)
Model: XGBoost / LightGBM
Metric: ROC-AUC + Recall
Explain: SHAP per customer
Imbalance: class_weight / scale_pos_weight
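
The setup above can be sketched with scikit-learn alone. This is a minimal sketch on synthetic data, not a production pipeline: the column names (monthly_charge, support_calls, contract) and the data-generating rule are invented for illustration, and RandomForestClassifier with class_weight="balanced" stands in for XGBoost / LightGBM so the example needs no extra packages. The SHAP step is left out.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data: columns and distributions are assumptions.
rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "monthly_charge": rng.normal(70, 20, n),
    "support_calls": rng.poisson(2, n).astype(float),
    "contract": rng.choice(["monthly", "annual", "two_year"], n),
})
# Churn is the minority class and depends on support calls and contract type.
logit = -3.0 + 0.5 * df["support_calls"] + 1.0 * (df["contract"] == "monthly")
df["churned"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
# Blank out some values so the imputers have work to do.
df.loc[rng.random(n) < 0.10, "monthly_charge"] = np.nan
df.loc[rng.random(n) < 0.05, "contract"] = np.nan

numeric = ["monthly_charge", "support_calls"]
categorical = ["contract"]

preprocess = ColumnTransformer([
    # Median imputation plus a binary "was missing" indicator column.
    ("num", SimpleImputer(strategy="median", add_indicator=True), numeric),
    # Missing categories become their own "Missing" level, then one-hot.
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="Missing")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

model = Pipeline([
    ("prep", preprocess),
    ("clf", RandomForestClassifier(
        n_estimators=100, min_samples_leaf=5,
        class_weight="balanced", random_state=0)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df[numeric + categorical], df["churned"],
    test_size=0.25, stratify=df["churned"], random_state=0)

model.fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, proba)
recall = recall_score(y_test, model.predict(X_test))
print(f"ROC-AUC={auc:.3f}  recall={recall:.3f}")
```

With XGBoost the imbalance knob would instead be scale_pos_weight, commonly set near the ratio of negative to positive training rows.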
Watch-out: tenure-derived features can leak the outcome (a churned customer's last month looks different). Engineer features only from data available before the prediction window.
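
One way to enforce the "before the prediction window" rule is to filter the raw event log by a cutoff date before any aggregation. The sketch below assumes a hypothetical event table (customer_id, event_date, amount) and a hypothetical cutoff; only rows dated before the cutoff feed the features.

```python
import pandas as pd

# Hypothetical event log: names, dates, and amounts are invented.
events = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 3],
    "event_date": pd.to_datetime([
        "2024-01-05", "2024-02-10", "2024-03-20",
        "2024-01-15", "2024-02-01", "2024-03-25",
    ]),
    "amount": [20.0, 25.0, 5.0, 40.0, 35.0, 15.0],
})
cutoff = pd.Timestamp("2024-03-01")  # predictions are made as of this date

# Keep only what was known before the cutoff, then aggregate.
history = events[events["event_date"] < cutoff]
features = history.groupby("customer_id").agg(
    total_spend=("amount", "sum"),
    n_events=("amount", "size"),
    last_event=("event_date", "max"),
)
features["days_since_last"] = (cutoff - features["last_event"]).dt.days
features = features.drop(columns="last_event")
print(features)
```

Customer 3 has no events before the cutoff, so it does not appear in the feature table at all; the March rows for customers 1 and 3 never influence a feature.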
Case 2 — House Price Prediction

Goal: predict a continuous sale price. Regression with mixed numeric and high-cardinality location features.

Recommended setup

Missing values: Median
Encoding: Target encode neighbourhood
Target: log1p(price), then expm1 back
Model: XGBoost / LightGBM
Metric: RMSE (or RMSLE)
Log-transform a right-skewed price target with np.log1p so errors are penalised proportionally, then invert predictions with np.expm1 (the exact inverse of np.log1p).
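
A minimal sketch of the log-target pattern, assuming synthetic prices and scikit-learn's GradientBoostingRegressor as a stand-in for XGBoost / LightGBM. TransformedTargetRegressor applies np.log1p before fitting and np.expm1 after predicting, so the inverse step cannot be forgotten. The features (area, rooms) and the price formula are invented for illustration.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, mean_squared_log_error
from sklearn.model_selection import train_test_split

# Synthetic right-skewed prices: multiplicative (log-normal) noise.
rng = np.random.default_rng(0)
n = 1000
area = rng.uniform(40, 250, n)
rooms = rng.integers(1, 7, n)
price = 1500 * area * (1 + 0.05 * rooms) * np.exp(rng.normal(0, 0.25, n))
X = np.column_stack([area, rooms])

X_train, X_test, y_train, y_test = train_test_split(X, price, random_state=0)

# The regressor is trained on log1p(price); predictions come back in
# price units because inverse_func is applied automatically.
model = TransformedTargetRegressor(
    regressor=GradientBoostingRegressor(random_state=0),
    func=np.log1p,
    inverse_func=np.expm1,
)
model.fit(X_train, y_train)
pred = model.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, pred))        # in price units
rmsle = np.sqrt(mean_squared_log_error(y_test, pred))   # relative error
print(f"RMSE={rmse:,.0f}  RMSLE={rmsle:.3f}")
```

RMSE is dominated by the most expensive houses, while RMSLE treats a 10% miss on a cheap house and on an expensive one about equally, which is why the two are often reported together.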
Case 3 — Fraud Detection

Goal: catch rare fraudulent transactions. Extreme class imbalance (often <1% positive).

Recommended setup

Imbalance: SMOTE or scale_pos_weight
Model: XGBoost / Isolation Forest
Primary metric: Recall
Secondary: PR-AUC + Precision
Decision: Tune threshold to cost
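
"Tune threshold to cost" can be sketched as a sweep over candidate thresholds that picks the one with the lowest total misclassification cost on a validation set. The two cost figures below are hypothetical placeholders, the data is synthetic, and LogisticRegression stands in for XGBoost to keep the example dependency-free.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 1% positives.
X, y = make_classification(
    n_samples=20000, n_features=10, n_informative=5,
    weights=[0.99, 0.01], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)
scores = clf.predict_proba(X_val)[:, 1]

# PR-AUC (average precision) is threshold-free; report it first.
pr_auc = average_precision_score(y_val, scores)

COST_FN = 200.0  # hypothetical cost of a missed fraud
COST_FP = 5.0    # hypothetical cost of reviewing a false alarm

def total_cost(threshold):
    """Total cost on the validation set at a given decision threshold."""
    flagged = scores >= threshold
    false_neg = np.sum((y_val == 1) & ~flagged)
    false_pos = np.sum((y_val == 0) & flagged)
    return COST_FN * false_neg + COST_FP * false_pos

thresholds = np.linspace(0.01, 0.99, 99)
costs = np.array([total_cost(t) for t in thresholds])
best_t = thresholds[costs.argmin()]

best_pred = (scores >= best_t).astype(int)
best_recall = recall_score(y_val, best_pred)
best_precision = precision_score(y_val, best_pred, zero_division=0)
print(f"PR-AUC={pr_auc:.3f}  threshold={best_t:.2f}  "
      f"recall={best_recall:.3f}  precision={best_precision:.3f}")
```

The chosen threshold should be confirmed on a separate test set, since it was selected on the validation scores.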
Watch-out: apply SMOTE only to the training portion of each cross-validation fold, after the split. Never resample the full dataset, or synthetic rows derived from test points leak into the test set and inflate scores.
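
The fold-isolation rule can be illustrated as follows. To avoid an extra dependency, this sketch swaps SMOTE for plain random oversampling of the minority class; the point is where the resampling happens, not which resampler is used. With the imbalanced-learn package, placing SMOTE as a step in imblearn.pipeline.Pipeline should give the same isolation. Data and model are synthetic stand-ins.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.97, 0.03], random_state=0)

def oversample_minority(X_part, y_part, seed):
    """Duplicate minority rows (with replacement) until classes are balanced."""
    minority = y_part == 1
    X_up, y_up = resample(
        X_part[minority], y_part[minority],
        replace=True, n_samples=int((~minority).sum()), random_state=seed)
    X_bal = np.vstack([X_part[~minority], X_up])
    y_bal = np.concatenate([y_part[~minority], y_up])
    return X_bal, y_bal

fold_scores = []
test_positive_rates = []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(cv.split(X, y)):
    # Resample the training rows only; the test rows stay untouched.
    X_bal, y_bal = oversample_minority(X[train_idx], y[train_idx], seed=fold)
    clf = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
    scores = clf.predict_proba(X[test_idx])[:, 1]
    fold_scores.append(average_precision_score(y[test_idx], scores))
    test_positive_rates.append(y[test_idx].mean())

print("PR-AUC per fold:", np.round(fold_scores, 3))
print("positive rate in test folds:", np.round(test_positive_rates, 3))
```

Each test fold keeps the original class imbalance, so the reported PR-AUC reflects the distribution the model will face in production.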
Case 4 — Sales Forecasting

Goal: forecast future demand from historical time series. Order and seasonality matter.

Recommended setup

Features: Lags, rolling means, calendar
Baseline: Prophet / seasonal naive
Challenger: LightGBM on lag features
Validation: Rolling / expanding window
Metric: MAPE / sMAPE
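
The feature and baseline rows above can be sketched with pandas. The series is synthetic daily sales with a weekly cycle (an assumption for illustration), the seasonal naive forecast simply repeats the value from the same weekday one week earlier, and smape / mape are small helper functions written here rather than library calls.

```python
import numpy as np
import pandas as pd

# Synthetic daily sales: level + mild trend + weekly cycle + noise.
rng = np.random.default_rng(0)
t = np.arange(200)
sales = 100 + 0.1 * t + 15 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 3, 200)
df = pd.DataFrame(
    {"sales": sales},
    index=pd.date_range("2023-01-01", periods=200, freq="D"))

# Lag features: every feature uses only values from earlier days.
df["lag_1"] = df["sales"].shift(1)
df["lag_7"] = df["sales"].shift(7)
# Shift before rolling so the window ends at yesterday, not today.
df["roll_mean_7"] = df["sales"].shift(1).rolling(7).mean()
df["dayofweek"] = df.index.dayofweek  # calendar feature
df = df.dropna()

def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return 100 * np.mean(np.abs((actual - forecast) / actual))

def smape(actual, forecast):
    """Symmetric MAPE, in percent (0 to 200)."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    denom = np.abs(actual) + np.abs(forecast)
    return 100 * np.mean(2 * np.abs(forecast - actual) / denom)

# Seasonal naive baseline on the last four weeks: forecast = value 7 days ago.
test = df.iloc[-28:]
baseline_mape = mape(test["sales"], test["lag_7"])
baseline_smape = smape(test["sales"], test["lag_7"])
print(f"seasonal naive: MAPE={baseline_mape:.2f}%  sMAPE={baseline_smape:.2f}%")
```

Any challenger model has to beat these baseline numbers on the same test window to justify its extra complexity.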
Watch-out: never use random K-fold on time series. Use TimeSeriesSplit so the model is always validated on the future, never the past.
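
A short sketch of the split itself, assuming a synthetic series and GradientBoostingRegressor as a stand-in for LightGBM. TimeSeriesSplit yields expanding training windows, and in every fold all training rows precede all test rows.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import TimeSeriesSplit

# Synthetic daily series with a weekly cycle.
rng = np.random.default_rng(0)
t = np.arange(300)
y = 100 + 0.1 * t + 15 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 3, 300)

# Row i predicts y[7 + i] from yesterday's value and last week's value.
X = np.column_stack([y[6:-1], y[:-7]])  # lag 1, lag 7
target = y[7:]

tscv = TimeSeriesSplit(n_splits=5)
splits = list(tscv.split(X))

fold_mape = []
for train_idx, test_idx in splits:
    model = GradientBoostingRegressor(random_state=0)
    model.fit(X[train_idx], target[train_idx])
    pred = model.predict(X[test_idx])
    # scikit-learn returns MAPE as a fraction, not a percentage.
    fold_mape.append(mean_absolute_percentage_error(target[test_idx], pred))

for (train_idx, test_idx), score in zip(splits, fold_mape):
    print(f"train rows 0-{train_idx.max()}  "
          f"test rows {test_idx.min()}-{test_idx.max()}  MAPE={score:.3f}")
```

The printed ranges make the ordering visible: each test block starts right after its training block ends, and the training block grows from fold to fold.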
Case 5 — Recommendation System

Goal: suggest relevant items to users. Sparse user–item interaction data.

Recommended setup

Cold start: Content-based fallback
Core method: Matrix factorization (ALS/SVD)
Modern: Two-tower embeddings
Metric: Precision@K / NDCG / MAP
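
As a rough illustration of the core method and two of the metrics, the sketch below factorizes a tiny interaction matrix with a truncated SVD in NumPy and defines Precision@K and NDCG@K for binary relevance. The matrix values are invented, the helper names are hypothetical, and plain SVD on a dense matrix is a simplification: production systems typically use ALS or another solver that handles sparse, implicit feedback.

```python
import numpy as np

# Toy user-item interactions (rows: users, columns: items); values are invented.
R = np.array([
    [1, 1, 0, 0, 1],
    [1, 0, 0, 1, 1],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
], dtype=float)

def svd_scores(interactions, k):
    """Rank-k reconstruction of the interaction matrix via truncated SVD."""
    U, s, Vt = np.linalg.svd(interactions, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

def recommend(scores, interactions, user, top_k):
    """Top-k items for a user, excluding items the user already interacted with."""
    user_scores = scores[user].copy()
    user_scores[interactions[user] > 0] = -np.inf
    return np.argsort(-user_scores)[:top_k]

def precision_at_k(recommended, relevant, k):
    """Share of the first k recommendations that are relevant."""
    return len(set(recommended[:k]) & set(relevant)) / k

def ndcg_at_k(recommended, relevant, k):
    """NDCG@k with binary relevance and log2 position discount."""
    gains = np.array([1.0 if item in relevant else 0.0 for item in recommended[:k]])
    dcg = float(np.sum(gains / np.log2(np.arange(2, len(gains) + 2))))
    ideal_hits = min(len(relevant), k)
    idcg = float(np.sum(1.0 / np.log2(np.arange(2, ideal_hits + 2))))
    return dcg / idcg if idcg > 0 else 0.0

scores = svd_scores(R, k=2)
recs_user0 = recommend(scores, R, user=0, top_k=2)
print("recommendations for user 0:", recs_user0)
print("P@3:", precision_at_k([3, 1, 4], {1, 4}, 3))
print("NDCG@3:", round(ndcg_at_k([3, 1, 4], {1, 4}, 3), 4))
```

Precision@K ignores order within the top K, while NDCG rewards placing relevant items earlier, so the same list can score well on one and poorly on the other.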
18.6 Cross-case summary
Project        | Type                      | Go-to model          | Metric
Churn          | Imbalanced classification | XGBoost              | ROC-AUC + Recall
House price    | Regression                | XGBoost              | RMSE
Fraud          | Rare-event classification | XGBoost + SMOTE      | Recall + PR-AUC
Sales forecast | Time series               | Prophet / LightGBM   | MAPE
Recommender    | Ranking                   | Matrix factorization | Precision@K
Common mistakes to avoid
Quick cheatsheet
df.info() -> Structure and non-null counts
df.describe() -> Numeric summary statistics
df.isnull().sum() -> Missing-value counts by column
df.groupby() -> Segmented aggregation
pd.merge() -> Join multiple datasets
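
The five calls above can be tried together on a toy dataset. The two tables and their column names below are invented for illustration.

```python
import pandas as pd

# Hypothetical toy tables.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": [10, 10, 20, 30],
    "amount": [25.0, None, 40.0, 15.0],
})
customers = pd.DataFrame({
    "customer_id": [10, 20, 30],
    "segment": ["retail", "retail", "wholesale"],
})

orders.info()                      # structure and non-null counts
print(orders.describe())           # numeric summary statistics
missing = orders.isnull().sum()    # missing-value counts by column
print(missing)

# Join the two tables, then aggregate by segment.
merged = pd.merge(orders, customers, on="customer_id", how="left")
by_segment = merged.groupby("segment")["amount"].sum()
print(by_segment)
```

Note that sum() skips the missing amount by default, so a quick isnull().sum() check before aggregating shows how much data the totals silently leave out.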