Chapter 10 — Evaluation
Model Evaluation
Measure model performance correctly. The right metric depends on your problem type.
10.0 Metric selection guide
| Business objective | Primary metric | Why | What to avoid |
|---|---|---|---|
| General balanced classification | Accuracy + F1 | Balanced view of correctness and class performance | Do not rely on accuracy alone when classes are imbalanced |
| Missed positive is expensive (fraud, disease) | Recall | Minimizes false negatives | Do not optimize precision alone |
| False positive is expensive | Precision | Reduces false alarms | Do not optimize recall alone |
| Ranking quality by score | ROC-AUC / PR-AUC | Evaluates probability ordering | Do not rely only on hard-threshold metrics |
| Regression forecast error cost | MAE / RMSE / MAPE | MAE is robust to outliers, RMSE penalizes large misses, MAPE is easy to interpret | Skip MAPE when true values can be zero |
Report at least one threshold-free metric (AUC) and one threshold-based metric (Precision/Recall/F1) for classification. For regression, report both MAE and RMSE.
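The difference between a threshold-free and a threshold-based metric can be sketched in plain Python. This is a minimal illustration, not scikit-learn's implementation: the helper names `roc_auc_pairwise` and `precision_recall_at` and the toy data are hypothetical. The AUC is computed once from the score ordering, while precision and recall change as the cutoff moves.

```python
def roc_auc_pairwise(y_true, scores):
    """Threshold-free: share of (positive, negative) pairs where the positive
    example gets the higher score. Ties count as half a win."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_recall_at(y_true, scores, threshold):
    """Threshold-based: convert scores to hard labels, then count outcomes."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy data: 1 = positive class, scores = predicted probabilities.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.10, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90]

auc = roc_auc_pairwise(y_true, scores)                       # one number, no cutoff
low = precision_recall_at(y_true, scores, threshold=0.30)    # favors recall
high = precision_recall_at(y_true, scores, threshold=0.65)   # favors precision

print(f"AUC: {auc:.2f}")
print(f"threshold 0.30 -> precision {low[0]:.2f}, recall {low[1]:.2f}")
print(f"threshold 0.65 -> precision {high[0]:.2f}, recall {high[1]:.2f}")
```

Lowering the cutoff catches every positive at the cost of more false alarms, while raising it does the opposite. The AUC does not move, which is why the two kinds of metric are worth reporting together.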
10.1 Classification metrics
python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, classification_report, confusion_matrix, roc_auc_score
)
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred, average='weighted')
rec = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')
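# y_prob: predicted probability of the positive class (binary case),
# e.g. model.predict_proba(X_test)[:, 1]. For multiclass problems, pass the
# full probability matrix and set multi_class='ovr'.
auc = roc_auc_score(y_test, y_prob)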
print(f"Accuracy: {acc:.4f}")
print(f"Precision: {prec:.4f}")
print(f"Recall: {rec:.4f}")
print(f"F1 Score: {f1:.4f}")
print(f"ROC-AUC: {auc:.4f}")
print("\n--- Full Report ---")
print(classification_report(y_test, y_pred))
10.2 Confusion matrix
python
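import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Pass labels explicitly so rows and columns match the tick labels.
cm = confusion_matrix(y_test, y_pred, labels=model.classes_)
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=model.classes_,
            yticklabels=model.classes_)
plt.xlabel('Predicted'); plt.ylabel('Actual')
plt.title('Confusion Matrix'); plt.show()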
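To make the link between the confusion matrix and the headline metrics explicit, here is a small sketch for the binary case in plain Python. The function names `confusion_counts` and `metrics_from_counts` and the sample labels are hypothetical, chosen only for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def metrics_from_counts(tp, fp, fn, tn):
    """Derive accuracy, precision, recall and F1 from the four cells."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical labels: 1 = positive, 0 = negative.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
metrics = metrics_from_counts(*counts)
print(counts)
print(metrics)
```

Reading the four cells first makes it easier to see which kind of error (false positive or false negative) is driving a weak precision or recall score.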
10.3 Regression metrics
python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
# % error; undefined if any value in y_test is zero
mape = np.mean(np.abs((y_test - y_pred) / y_test)) * 100

print(f"RMSE: {rmse:.2f}")
print(f"MAE: {mae:.2f}")
print(f"R²: {r2:.4f}")
print(f"MAPE: {mape:.2f}%")
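The following sketch shows why MAE and RMSE are worth reporting together. It uses plain Python rather than scikit-learn, and the helper names and numbers are hypothetical. Two forecasts have the same MAE, but the one with a single large miss has a higher RMSE. The `mape` helper returns `None` when a true value is zero, which is one possible way to handle that case, not the only one.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: every unit of error counts equally."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: squaring makes large misses dominate."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def mape(y_true, y_pred):
    """Mean absolute percentage error, or None if any true value is zero."""
    if any(t == 0 for t in y_true):
        return None  # division by zero: MAPE is undefined here
    total = sum(abs((t - p) / t) for t, p in zip(y_true, y_pred))
    return 100 * total / len(y_true)


# Hypothetical data: both forecasts are off by 10 units on average.
actual = [100, 100, 100, 100]
even_misses = [90, 110, 90, 110]      # four misses of 10
one_big_miss = [100, 100, 100, 140]   # one miss of 40

for name, forecast in [("even", even_misses), ("spiky", one_big_miss)]:
    print(f"{name}: MAE={mae(actual, forecast):.1f} "
          f"RMSE={rmse(actual, forecast):.1f} "
          f"MAPE={mape(actual, forecast):.1f}%")
```

If large individual misses are costly for the business, the gap between RMSE and MAE is a useful warning sign.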
10.4 Feature importance
python
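import pandas as pd
import matplotlib.pyplot as plt

# feature_importances_ is exposed by tree-based models
# (e.g. random forests, gradient boosting).
importance = pd.Series(model.feature_importances_, index=X.columns)
importance.sort_values(ascending=True).tail(15).plot(
    kind='barh', figsize=(8, 6), color='#4a90d9'
)
plt.title('Top 15 Most Important Features')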
plt.tight_layout(); plt.show()

| Metric | Use when... | Range |
|---|---|---|
| Accuracy | Balanced classes | 0–1 |
| F1 Score | Imbalanced classes | 0–1 |
| ROC-AUC | Ranking by probability | 0–1 (0.5 = random) |
| RMSE | Regression, penalizes large errors | 0–∞ (lower = better) |
| R² | Regression, share of variance explained | ≤ 1 (negative if worse than predicting the mean) |
| MAPE | Regression, % error (interpretable) | 0–∞% (lower = better) |
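The reason accuracy is paired with balanced classes and F1 with imbalanced ones can be seen in a small sketch. The data and helper names below are hypothetical: with 5 positives in 100 examples, a model that never predicts the positive class still reaches 95% accuracy, while its F1 for the positive class is zero.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(1 for y, p in zip(y_true, y_pred) if y == p) / len(y_true)


def f1_positive(y_true, y_pred, positive=1):
    """F1 for the positive class, written as 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == positive and p == positive)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y != positive and p == positive)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical imbalanced dataset: 5 positives followed by 95 negatives.
y_true = [1] * 5 + [0] * 95

# Model A never predicts the positive class.
always_negative = [0] * 100

# Model B finds 3 of the 5 positives but raises 4 false alarms.
catches_some = [1, 1, 1, 0, 0] + [1] * 4 + [0] * 91

for name, y_pred in [("always negative", always_negative),
                     ("catches some", catches_some)]:
    print(f"{name}: accuracy={accuracy(y_true, y_pred):.2f} "
          f"F1={f1_positive(y_true, y_pred):.2f}")
```

Model B has slightly lower accuracy than Model A, yet it is the only one that finds any positives, and F1 reflects that.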
Common mistakes to avoid
- Skipping business context before running technical steps
- Not writing assumptions and limitations explicitly
- Treating one metric as the full story
Quick cheatsheet
df.info() -> Structure and non-null counts
df.describe() -> Numeric summary statistics
df.isnull().sum() -> Missing-value counts by column
df.groupby() -> Segmented aggregation
pd.merge() -> Join multiple datasets