Chapter 17 — Model Selection

Machine Learning Model Selection Guide

Which algorithm should you actually use? Match your target, dataset size, and need for explainability to the right model — with honest trade-off ratings for speed, accuracy, and interpretability.

17.0 Model selection decision tree
Target variable?
│
├── Numeric (Regression)
│   ├── Need explainability ─────────► Linear / Ridge / Lasso
│   ├── Small, simple data ──────────► Linear Regression
│   └── Large, complex, max accuracy ► XGBoost / LightGBM
│
├── Categorical (Classification)
│   ├── Need explainability ─────────► Logistic Regression / Decision Tree
│   ├── Tabular, max accuracy ───────► XGBoost / LightGBM
│   ├── Small data, quick baseline ──► Logistic Regression
│   └── Images / text / audio ───────► Neural Network
│
└── No target (Unsupervised)
    ├── Group similar rows ──────────► K-Means / DBSCAN
    └── Reduce dimensions ───────────► PCA / UMAP
Golden rule: always train a simple baseline first (Logistic / Linear Regression). If a complex model can't clearly beat it, ship the simple one — it's faster, cheaper, and easier to explain.
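The decision tree above can also be written down as a small lookup table, which is handy when the choice has to be documented or reused. The sketch below is only one possible encoding: the names `DECISION_TREE` and `suggest_model` and the key strings ("explainable", "small_simple", and so on) are hypothetical labels chosen for illustration, while the recommended models are copied from the tree.

```python
# Hypothetical lookup table mirroring the decision tree.
# The key names are illustrative; the recommendations come from the tree.
DECISION_TREE = {
    "numeric": {
        "explainable": "Linear / Ridge / Lasso",
        "small_simple": "Linear Regression",
        "max_accuracy": "XGBoost / LightGBM",
    },
    "categorical": {
        "explainable": "Logistic Regression / Decision Tree",
        "max_accuracy": "XGBoost / LightGBM",
        "small_simple": "Logistic Regression",
        "unstructured": "Neural Network",
    },
    "none": {
        "cluster": "K-Means / DBSCAN",
        "reduce_dimensions": "PCA / UMAP",
    },
}


def suggest_model(target: str, priority: str) -> str:
    """Return the tree's recommendation for a target type and a priority."""
    try:
        return DECISION_TREE[target][priority]
    except KeyError:
        raise ValueError(
            f"No branch for target={target!r}, priority={priority!r}"
        ) from None


print(suggest_model("categorical", "max_accuracy"))  # XGBoost / LightGBM
```

A combination that the tree does not cover, such as `suggest_model("numeric", "cluster")`, raises a `ValueError` rather than guessing.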
17.1 Model trade-off ratings
Model | Accuracy | Speed | Explainability | Memory
Linear / Logistic Regression★★★★★★★★★★★★★★★★★★
Decision Tree★★★★★★★★★★★★★★★★
Random Forest★★★★★★★★★★★★
XGBoost★★★★★★★★★★★★
LightGBM★★★★★★★★★★★★★★★
SVM★★★★★★★★★★
KNN★★★★★★★★
Neural Network★★★★★
17.2 When to use / when to avoid
Model | Use when | Avoid when
Linear / Logistic | You need a transparent baseline and coefficients to explain | Relationships are strongly non-linear
Random Forest | Solid accuracy with little tuning, mixed feature types | You need the absolute top score or low latency
XGBoost / LightGBM | Tabular data, competitions, top accuracy on structured data | Tiny data, or you must fully explain every prediction
SVM | Small/medium data with clear margins, high dimensions | Large datasets (training is slow)
KNN | Small data, simple intuitive baseline | Large or high-dimensional data
Neural Network | Images, text, audio, very large datasets | Small tabular data (trees usually win)
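To make the "coefficients to explain" point for linear and logistic models concrete, here is a minimal sketch that fits a logistic regression and ranks features by the size of their coefficients. It assumes scikit-learn is installed and uses its bundled breast-cancer dataset purely as a stand-in; the scaling step is an added assumption so that coefficient magnitudes are comparable across features.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset (an assumption for this sketch, not from the guide)
data = load_breast_cancer()
X, y = data.data, data.target

# Standardize features so coefficient sizes can be compared with each other
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)

# make_pipeline names each step after its lowercased class name
coefs = model.named_steps["logisticregression"].coef_[0]

# Rank features by absolute coefficient; the sign shows the direction:
# a positive weight pushes the prediction toward class 1 (data.target_names[1])
ranked = sorted(zip(data.feature_names, coefs), key=lambda pair: abs(pair[1]), reverse=True)
for name, weight in ranked[:5]:
    print(f"{name:<25} {weight:+.2f}")
```

Each printed weight is the change in log-odds per one standard deviation of the feature, which is the kind of statement a stakeholder can follow without a separate explanation tool.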

Professional recommendation

Structured / tabular -> LightGBM or XGBoost
Need to explain -> Logistic Regression + SHAP
Quick baseline -> Logistic / Linear Regression
Images / text -> Neural network / transfer learning
python
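# Baseline first, then challenger: same split, same metric.
# Assumes X_train, X_test, y_train, y_test come from one earlier
# train/test split of a binary classification dataset.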
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
import lightgbm as lgb

baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Baseline AUC:", roc_auc_score(y_test, baseline.predict_proba(X_test)[:, 1]))

challenger = lgb.LGBMClassifier(n_estimators=400, learning_rate=0.05).fit(X_train, y_train)
print("LightGBM AUC:", roc_auc_score(y_test, challenger.predict_proba(X_test)[:, 1]))
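A single train/test split can flatter either model by chance, so the same baseline-versus-challenger comparison is often repeated across cross-validation folds. The sketch below is one way to do that and then apply the "ship the simple one" rule. Several things in it are assumptions for illustration: the synthetic data from `make_classification`, the random forest swapped in for LightGBM so that only scikit-learn is needed, and the `MIN_GAIN` threshold, which is a hypothetical value that should be set per project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption); replace with your own X, y
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)

# One shared splitter so both models are scored on identical folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

baseline_auc = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="roc_auc"
)
# Random forest stands in for LightGBM here (assumption)
challenger_auc = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0), X, y, cv=cv, scoring="roc_auc"
)

MIN_GAIN = 0.01  # hypothetical threshold for "clearly beats the baseline"
gain = challenger_auc.mean() - baseline_auc.mean()
choice = "challenger" if gain > MIN_GAIN else "baseline"

print(f"Baseline AUC:   {baseline_auc.mean():.3f} +/- {baseline_auc.std():.3f}")
print(f"Challenger AUC: {challenger_auc.mean():.3f} +/- {challenger_auc.std():.3f}")
print("Ship:", choice)
```

Comparing the gain against the fold-to-fold spread, and not only against a fixed threshold, gives a rough sense of whether the difference is more than noise.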
17.3 Common mistakes
Common mistakes to avoid
Quick cheatsheet
LogisticRegression() -> Explainable baseline classifier
LGBMClassifier() -> Top accuracy on tabular data
cross_val_score() -> Compare models fairly
shap.TreeExplainer() -> Explain model predictions
GridSearchCV() -> Tune via cross-validation only
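As an illustration of the last cheatsheet entry, the sketch below reads "tune via cross-validation only" as: choose hyperparameters with cross-validation on the training split, and score the held-out test set once at the end. That reading, the synthetic data, and the grid of `C` values are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption); replace with your own X, y
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# The search only ever sees the training split
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # illustrative grid
    scoring="roc_auc",
    cv=5,
)
search.fit(X_train, y_train)

# The test set is used exactly once, after tuning is finished.
# With the default refit=True, the search predicts with the best model.
test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])

print("Best params:", search.best_params_)
print("CV AUC:     ", round(search.best_score_, 3))
print("Test AUC:   ", round(test_auc, 3))
```

If the test score were used to pick between grid settings, it would stop being an unbiased estimate, which is why the search is fitted on `X_train` alone.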