Chapter 17 — Model Selection

Machine Learning Model Selection Guide

Which algorithm should you actually use? Match your target, dataset size, and need for explainability to the right model — with honest trade-off ratings for speed, accuracy, and interpretability.

17.0 Model selection decision tree

Target variable?
│
├── Numeric (Regression)
│   ├── Need explainability ─────────► Linear / Ridge / Lasso
│   ├── Small, simple data ──────────► Linear Regression
│   └── Large, complex, max accuracy ► XGBoost / LightGBM
│
├── Categorical (Classification)
│   ├── Need explainability ─────────► Logistic Regression / Decision Tree
│   ├── Tabular, max accuracy ───────► XGBoost / LightGBM
│   ├── Small data, quick baseline ──► Logistic Regression
│   └── Images / text / audio ───────► Neural Network
│
└── No target (Unsupervised)
    ├── Group similar rows ──────────► K-Means / DBSCAN
    └── Reduce dimensions ───────────► PCA / UMAP
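
The tree above can also be read as a small lookup function. The sketch below is one possible encoding, not a definitive rule set: the function name `suggest_model`, its parameters, and the order in which overlapping branches are checked (for example, unstructured data before explainability) are assumptions made for illustration.

```python
from typing import Optional

# Data types the tree routes to a neural network
UNSTRUCTURED = {"images", "text", "audio"}


def suggest_model(
    target: Optional[str],
    *,
    explain: bool = False,
    small_data: bool = False,
    modality: str = "tabular",
    goal: str = "group",
) -> str:
    """Return the tree's recommendation for a given situation.

    target: "numeric", "categorical", or None for unsupervised work.
    goal:   only used when target is None ("group" or "reduce").
    """
    if target is None:
        if goal == "group":
            return "K-Means / DBSCAN"
        if goal == "reduce":
            return "PCA / UMAP"
        raise ValueError(f"unknown unsupervised goal: {goal!r}")

    if target == "numeric":
        if explain:
            return "Linear / Ridge / Lasso"
        if small_data:
            return "Linear Regression"
        return "XGBoost / LightGBM"

    if target == "categorical":
        # Assumed precedence: data type first, then explainability, then size
        if modality in UNSTRUCTURED:
            return "Neural Network"
        if explain:
            return "Logistic Regression / Decision Tree"
        if small_data:
            return "Logistic Regression"
        return "XGBoost / LightGBM"

    raise ValueError(f"unknown target type: {target!r}")


print(suggest_model("categorical", small_data=True))
print(suggest_model(None, goal="reduce"))
```

When several conditions hold at once, the answer depends on the check order, so adjust the precedence to match your own priorities.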

Golden rule: always train a simple baseline first (Logistic / Linear Regression). If a complex model can't clearly beat it, ship the simple one — it's faster, cheaper, and easier to explain.
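
What counts as "clearly beat" is a judgment call. One way to make it explicit is sketched below, assuming both models were scored on the same cross-validation folds. The helper name `choose_model`, the `min_gain` threshold, and the "gain must exceed fold-to-fold noise" test are illustrative assumptions, not a standard procedure.

```python
from statistics import mean, stdev


def choose_model(baseline_scores, challenger_scores, min_gain=0.01):
    """Pick "challenger" only if it clearly beats the baseline.

    Both lists must hold scores from the same folds, in the same order.
    min_gain is a hypothetical threshold; set it from what a gain is
    worth to your project.
    """
    if len(baseline_scores) != len(challenger_scores):
        raise ValueError("scores must come from the same folds")
    if len(baseline_scores) < 2:
        raise ValueError("need at least two folds to estimate noise")

    # Per-fold improvement of the challenger over the baseline
    gains = [c - b for b, c in zip(baseline_scores, challenger_scores)]
    avg_gain = mean(gains)
    noise = stdev(gains)

    # Ship the complex model only when the gain is both large enough
    # and larger than the fold-to-fold variation
    if avg_gain >= min_gain and avg_gain > noise:
        return "challenger"
    return "baseline"


print(choose_model([0.80, 0.81, 0.79, 0.80, 0.80],
                   [0.86, 0.87, 0.85, 0.86, 0.85]))
```

A stricter comparison would use a paired statistical test; the threshold version is just easier to explain to stakeholders.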

17.1 Model trade-off ratings

Model	Accuracy	Speed	Explainability	Memory efficiency
Linear / Logistic Regression	★★★	★★★★★	★★★★★	★★★★★
Decision Tree	★★★	★★★★	★★★★★	★★★★
Random Forest	★★★★	★★★	★★★	★★
XGBoost	★★★★★	★★	★★	★★★
LightGBM	★★★★★	★★★★	★★	★★★★
SVM	★★★★	★★	★★	★★
KNN	★★★	★★	★★★	★
Neural Network	★★★★★	★	★	★
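
To turn the ratings into a shortlist, you can weight each column by how much it matters for your project. The sketch below copies the star counts from the table; the `rank_models` helper and the weighting scheme are assumptions for illustration, and the stars themselves are rough relative ratings rather than measurements.

```python
# Star counts from the table above (more stars = better;
# for memory, more stars = smaller footprint)
RATINGS = {
    "Linear / Logistic Regression": {"accuracy": 3, "speed": 5, "explainability": 5, "memory": 5},
    "Decision Tree": {"accuracy": 3, "speed": 4, "explainability": 5, "memory": 4},
    "Random Forest": {"accuracy": 4, "speed": 3, "explainability": 3, "memory": 2},
    "XGBoost": {"accuracy": 5, "speed": 2, "explainability": 2, "memory": 3},
    "LightGBM": {"accuracy": 5, "speed": 4, "explainability": 2, "memory": 4},
    "SVM": {"accuracy": 4, "speed": 2, "explainability": 2, "memory": 2},
    "KNN": {"accuracy": 3, "speed": 2, "explainability": 3, "memory": 1},
    "Neural Network": {"accuracy": 5, "speed": 1, "explainability": 1, "memory": 1},
}


def rank_models(weights, ratings=RATINGS):
    """Return (model, weighted average stars) pairs, best first.

    weights maps a criterion name to its relative importance;
    criteria left out of weights count as zero.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    scores = {
        model: sum(weights.get(name, 0) * stars for name, stars in row.items()) / total
        for model, row in ratings.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Example: accuracy and speed matter equally, nothing else does
for model, score in rank_models({"accuracy": 1, "speed": 1})[:3]:
    print(f"{model}: {score:.2f}")
```

Treat the ranking as a starting point for which models to benchmark, not as a replacement for measuring them on your data.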

17.2 When to use / when to avoid

Model	Use when	Avoid when
Linear / Logistic	You need a transparent baseline and coefficients to explain	Relationships are strongly non-linear
Random Forest	Solid accuracy with little tuning, mixed feature types	You need the absolute top score or low latency
XGBoost / LightGBM	Tabular data, competitions, top accuracy on structured data	Tiny data, or you must fully explain every prediction
SVM	Small/medium data with clear margins, high dimensions	Large datasets (training is slow)
KNN	Small data, simple intuitive baseline	Large or high-dimensional data
Neural Network	Images, text, audio, very large datasets	Small tabular data (trees usually win)

Professional recommendation

Structured / tabular: LightGBM or XGBoost

Need to explain: Logistic Regression + SHAP

Quick baseline: Logistic / Linear Regression

Images / text: Neural network / transfer learning

python

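# Baseline first, then challenger: same split, same metric.
# Assumes X (features) and y (binary labels) are already loaded.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
import lightgbm as lgb

# One stratified split, reused by both models (split size and seed are illustrative)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Baseline AUC:", roc_auc_score(y_test, baseline.predict_proba(X_test)[:, 1]))

challenger = lgb.LGBMClassifier(n_estimators=400, learning_rate=0.05).fit(X_train, y_train)
print("LightGBM AUC:", roc_auc_score(y_test, challenger.predict_proba(X_test)[:, 1]))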

17.3 Common mistakes

Jumping straight to a neural network on small tabular data
Choosing a black-box model when the stakeholder needs reasons, not just predictions
Comparing models on different splits or different metrics
Tuning hyperparameters on the test set (use cross-validation)
Ignoring training/inference cost — the best offline model can be too slow to deploy
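
As a sketch of how to avoid the test-set tuning mistake: tune with cross-validation on the training portion only, then score the held-out test set once at the end. The data here is synthetic and the parameter grid is an arbitrary illustration, not a recommended search space.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data; replace with your own X and y
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold out a test set that tuning never sees
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hyperparameters are chosen by 5-fold CV inside the training data only
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # illustrative grid
    scoring="roc_auc",
    cv=5,
)
search.fit(X_train, y_train)

# Single, final evaluation on the untouched test set
test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])

print("Best C:", search.best_params_["C"])
print(f"CV AUC: {search.best_score_:.3f}  Test AUC: {test_auc:.3f}")
```

If the test AUC is far below the cross-validated AUC, suspect leakage or an unrepresentative split before trusting either number.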

Common mistakes to avoid

Skipping business context before running technical steps
Not writing assumptions and limitations explicitly
Treating one metric as the full story

Quick cheatsheet

LogisticRegression() -> Explainable baseline classifier

LGBMClassifier() -> Top accuracy on tabular data

cross_val_score() -> Compare models fairly

shap.TreeExplainer() -> Explain predictions of tree models (e.g., LightGBM, XGBoost)

GridSearchCV() -> Tune hyperparameters via cross-validation, never on the test set