Chapter 17 — Model Selection
Machine Learning Model Selection Guide
Which algorithm should you actually use? Match your target, dataset size, and need for explainability to the right model — with honest trade-off ratings for speed, accuracy, and interpretability.
17.0 Model selection decision tree
```text
Target variable?
│
├── Numeric (Regression)
│   ├── Need explainability ─────────► Linear / Ridge / Lasso
│   ├── Small, simple data ──────────► Linear Regression
│   └── Large, complex, max accuracy ► XGBoost / LightGBM
│
├── Categorical (Classification)
│   ├── Need explainability ─────────► Logistic Regression / Decision Tree
│   ├── Tabular, max accuracy ───────► XGBoost / LightGBM
│   ├── Small data, quick baseline ──► Logistic Regression
│   └── Images / text / audio ───────► Neural Network
│
└── No target (Unsupervised)
    ├── Group similar rows ──────────► K-Means / DBSCAN
    └── Reduce dimensions ───────────► PCA / UMAP
```
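The same tree can be written as a small lookup function, which is handy for documenting a team's default choices. This is only a sketch: the function name, its parameters, and the order in which the branches are checked are assumptions, since the tree itself does not say which condition wins when several apply.

```python
def recommend_model(target, *, explain=False, size="small",
                    modality="tabular", goal="cluster"):
    """Return the model family the decision tree suggests.

    target:   "numeric", "categorical", or None (no target, unsupervised)
    explain:  True when explainability is a hard requirement
    size:     "small" or "large" (a coarse, hypothetical bucket)
    modality: "tabular" or anything else (images / text / audio)
    goal:     "cluster" or "reduce", used only when target is None
    """
    if target is None:
        return "K-Means / DBSCAN" if goal == "cluster" else "PCA / UMAP"
    if target == "numeric":
        if explain:
            return "Linear / Ridge / Lasso"
        if size == "small":
            return "Linear Regression"
        return "XGBoost / LightGBM"
    if target == "categorical":
        # Assumption: unstructured inputs are checked before anything else.
        if modality != "tabular":
            return "Neural Network"
        if explain:
            return "Logistic Regression / Decision Tree"
        if size == "small":
            return "Logistic Regression"
        return "XGBoost / LightGBM"
    raise ValueError(f"unknown target type: {target!r}")


print(recommend_model("categorical", size="large"))
print(recommend_model("numeric", explain=True))
print(recommend_model(None, goal="reduce"))
```

Treat the return value as a starting point for the baseline-versus-challenger comparison, not as a final answer.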
Golden rule: always train a simple baseline first (Logistic / Linear Regression). If a complex model can't clearly beat it, ship the simple one — it's faster, cheaper, and easier to explain.
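"Clearly beat" is easier to apply when it is written down as a rule. One possible rule, sketched below, compares per-fold scores from the same cross-validation folds and requires the average gain to exceed both a practical minimum and the fold-to-fold spread of the gains. The function name, the `min_gain` default, and the score lists are hypothetical; pick thresholds that match your own cost of complexity.

```python
from statistics import mean, stdev


def clearly_beats(baseline_scores, challenger_scores, min_gain=0.01):
    """Return True if the challenger wins by more than fold noise.

    Both lists must hold per-fold scores from the SAME folds, in the same order.
    """
    if len(baseline_scores) != len(challenger_scores) or len(baseline_scores) < 2:
        raise ValueError("need paired scores from at least two folds")
    gains = [c - b for b, c in zip(baseline_scores, challenger_scores)]
    avg_gain = mean(gains)
    # Assumed rule: the gain must be practically useful AND larger than its own spread.
    return avg_gain > min_gain and avg_gain > stdev(gains)


# Hypothetical per-fold AUC values, for illustration only.
baseline = [0.81, 0.80, 0.82, 0.79, 0.81]
small_gain = [0.82, 0.79, 0.83, 0.80, 0.80]
large_gain = [0.88, 0.87, 0.90, 0.86, 0.88]

print("small gain ships the complex model:", clearly_beats(baseline, small_gain))
print("large gain ships the complex model:", clearly_beats(baseline, large_gain))
```

With these numbers the first challenger gains about 0.002 on average, so the simple model ships; the second gains about 0.07 with little spread, so the complex model has earned its place.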
17.1 Model trade-off ratings
| Model | Accuracy | Speed | Explainability | Memory |
|---|---|---|---|---|
| Linear / Logistic Regression | ★★★ | ★★★★★ | ★★★★★ | ★★★★★ |
| Decision Tree | ★★★ | ★★★★ | ★★★★★ | ★★★★ |
| Random Forest | ★★★★ | ★★★ | ★★★ | ★★ |
| XGBoost | ★★★★★ | ★★ | ★★ | ★★★ |
| LightGBM | ★★★★★ | ★★★★ | ★★ | ★★★★ |
| SVM | ★★★★ | ★★ | ★★ | ★★ |
| KNN | ★★★ | ★★ | ★★★ | ★ |
| Neural Network | ★★★★★ | ★ | ★ | ★ |
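The speed ratings are relative, so it is worth measuring prediction latency on your own hardware and batch sizes. The sketch below times any callable; the helper name and the two stand-in "models" are hypothetical placeholders for something like a fitted model's `predict` method.

```python
import time
from statistics import median


def median_latency(fn, *args, repeats=20):
    """Return the median wall-clock seconds of fn(*args) over several runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        timings.append(time.perf_counter() - start)
    return median(timings)


# Stand-in predictors: any callable that maps a batch to predictions works here.
def cheap_predict(batch):
    return [x * 0.5 for x in batch]


def costly_predict(batch):
    return [sum(x * w for w in range(200)) for x in batch]


batch = list(range(1000))
cheap = median_latency(cheap_predict, batch)
costly = median_latency(costly_predict, batch)
print(f"cheap: {cheap * 1e3:.3f} ms, costly: {costly * 1e3:.3f} ms per batch")
```

The median is used rather than the mean so that one slow run (a garbage-collection pause, a cold cache) does not distort the comparison.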
17.2 When to use / when to avoid
| Model | Use when | Avoid when |
|---|---|---|
| Linear / Logistic | You need a transparent baseline and coefficients to explain | Relationships are strongly non-linear |
| Random Forest | Solid accuracy with little tuning, mixed feature types | You need the absolute top score or low latency |
| XGBoost / LightGBM | Tabular data, competitions, top accuracy on structured data | Tiny data, or you must fully explain every prediction |
| SVM | Small/medium data with clear margins, high dimensions | Large datasets (training is slow) |
| KNN | Small data, simple intuitive baseline | Large or high-dimensional data |
| Neural Network | Images, text, audio, very large datasets | Small tabular data (trees usually win) |
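The "coefficients to explain" in the Linear / Logistic row have a concrete reading: in logistic regression, each coefficient is a change in log-odds, so its exponential is the factor by which the odds change for a one-unit increase in that feature, holding the others fixed. A minimal sketch, with made-up feature names and coefficient values:

```python
import math


def odds_ratios(coefficients):
    """Map each logistic-regression coefficient to its odds ratio, exp(coef)."""
    return {name: math.exp(coef) for name, coef in coefficients.items()}


# Hypothetical fitted coefficients (log-odds per one-unit increase).
coefs = {"tenure_years": -0.40, "support_tickets": 0.25}

for name, ratio in odds_ratios(coefs).items():
    print(f"+1 {name} multiplies the odds by {ratio:.2f}")
```

With scikit-learn, the fitted values live in the `coef_` attribute of a `LogisticRegression` model. Ratios are only comparable across features when the features are on comparable scales.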
Professional recommendation
- Structured / tabular: LightGBM or XGBoost
- Need to explain: Logistic Regression + SHAP
- Quick baseline: Logistic / Linear Regression
- Images / text: Neural network / transfer learning
```python
# Baseline first, then challenger: same split, same metric.
# Assumes X_train, X_test, y_train, y_test come from one earlier
# train/test split and that the labels are binary.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
import lightgbm as lgb

baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Baseline AUC:", roc_auc_score(y_test, baseline.predict_proba(X_test)[:, 1]))

challenger = lgb.LGBMClassifier(n_estimators=400, learning_rate=0.05).fit(X_train, y_train)
print("LightGBM AUC:", roc_auc_score(y_test, challenger.predict_proba(X_test)[:, 1]))
```
17.3 Common mistakes
- Jumping straight to a neural network on small tabular data
- Choosing a black-box model when the stakeholder needs reasons, not just predictions
- Comparing models on different splits or different metrics
- Tuning hyperparameters on the test set (use cross-validation)
- Ignoring training/inference cost — the best offline model can be too slow to deploy
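Two of these mistakes, mismatched splits and tuning on the test set, can be avoided with one pattern: hold out the test set once, reuse a single seeded splitter for every model, and tune inside cross-validation on the training data only. The sketch below assumes scikit-learn is installed and uses synthetic data plus a Random Forest challenger as stand-ins for your own dataset and models.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    GridSearchCV,
    StratifiedKFold,
    cross_val_score,
    train_test_split,
)

# Synthetic stand-in data; replace with your own features and labels.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Hold out a test set once; it is not touched until the very end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# One splitter with a fixed seed gives every model identical folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}
cv_scores = {
    name: cross_val_score(model, X_train, y_train, cv=cv, scoring="roc_auc")
    for name, model in models.items()
}
for name, scores in cv_scores.items():
    print(f"{name}: mean AUC {scores.mean():.3f} over {len(scores)} folds")

# Tune on the training folds only, never on the test set.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=cv,
    scoring="roc_auc",
)
search.fit(X_train, y_train)

# Single final check on the untouched test set, using the same metric.
test_auc = search.score(X_test, y_test)
print("Best C:", search.best_params_["C"], "| test AUC:", round(test_auc, 3))
```

The grid of `C` values is arbitrary and only illustrates the mechanics; the point is that the test set is scored exactly once, after every modelling decision has been made.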
Common mistakes to avoid
- Skipping business context before running technical steps
- Not writing assumptions and limitations explicitly
- Treating one metric as the full story
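The last point is easiest to see on imbalanced data. In the sketch below, a hypothetical test set has 990 negatives and 10 positives, and the model predicts "negative" for every row. The helper name and the counts are made up for illustration.

```python
def summarize(tp, fp, tn, fn):
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        # Convention assumed here: report 0.0 when the denominator is empty.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# The model never predicts the positive class: tp = 0, fp = 0.
report = summarize(tp=0, fp=0, tn=990, fn=10)
print(report)
# {'accuracy': 0.99, 'precision': 0.0, 'recall': 0.0}
```

Accuracy alone says the model is 99% right, while recall shows it finds none of the cases that matter.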
Quick cheatsheet
- LogisticRegression() -> Explainable baseline classifier
- LGBMClassifier() -> Top accuracy on tabular data
- cross_val_score() -> Compare models fairly
- shap.TreeExplainer() -> Explain model predictions
- GridSearchCV() -> Tune via cross-validation only