Chapter 10 — Evaluation

Model Evaluation

Measure model performance correctly. The right metric depends on your problem type (classification or regression) and on which kind of error costs the most.

10.0 Metric selection guide

Business objective	Primary metric	Why	What to avoid
General balanced classification	Accuracy + F1	Balanced view of correctness and class performance	Do not rely on accuracy alone if classes are imbalanced
Missed positive is expensive (fraud, disease)	Recall	Minimizes false negatives	Do not optimize precision alone
False positive is expensive	Precision	Reduces false alarms	Do not optimize recall alone
Ranking quality by score	ROC-AUC / PR-AUC	Evaluates probability ordering	Do not rely on hard-threshold metrics alone
Regression forecast error cost	MAE / RMSE / MAPE	MAE is robust to outliers, RMSE penalizes large misses, MAPE reads as a percentage	Skip MAPE when true values can be zero or close to zero

Report at least one threshold-free metric (AUC) and one threshold-based metric (Precision/Recall/F1) for classification. For regression, report both MAE and RMSE.
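
To see why accuracy alone misleads under class imbalance, here is a minimal pure-Python sketch. The toy labels and the always-negative "model" are invented for illustration and are not taken from any real dataset.

```python
# Hypothetical toy data: 95 negatives and 5 positives.
y_true = [0] * 95 + [1] * 5
# A useless "model" that always predicts the majority class.
y_pred = [0] * 100

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)

accuracy = correct / len(y_true)
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"Accuracy: {accuracy:.2f}")  # 0.95
print(f"Recall:   {recall:.2f}")    # 0.00
```

The model looks strong on accuracy yet finds none of the positives, which is why a recall-style metric belongs next to accuracy whenever positives are rare.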

10.1 Classification metrics

python

from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, classification_report, confusion_matrix, roc_auc_score
)

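# Assumes y_pred holds hard class labels and y_prob holds the predicted
# probability of the positive class (binary case).
acc  = accuracy_score(y_test, y_pred)
# average='weighted' averages per-class scores weighted by class support;
# the default average='binary' scores only the positive class.
prec = precision_score(y_test, y_pred, average='weighted')
rec  = recall_score(y_test, y_pred, average='weighted')
f1   = f1_score(y_test, y_pred, average='weighted')
# For multiclass, pass the full probability matrix and multi_class='ovr'.
auc  = roc_auc_score(y_test, y_prob)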

print(f"Accuracy:  {acc:.4f}")
print(f"Precision: {prec:.4f}")
print(f"Recall:    {rec:.4f}")
print(f"F1 Score:  {f1:.4f}")
print(f"ROC-AUC:   {auc:.4f}")

print("\n--- Full Report ---")
print(classification_report(y_test, y_pred))
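
As a rough illustration of what these numbers mean, the sketch below recomputes positive-class precision, recall, F1 and ROC-AUC by hand for a binary problem. The helper names (`precision_recall_f1`, `roc_auc`) and the toy scores are made up for this example; it covers only the binary positive-class case, not the `average='weighted'` variant used above.

```python
def precision_recall_f1(y_true, y_pred):
    """Positive-class precision, recall and F1 for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def roc_auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities.
y_true = [0, 0, 1, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.6, 0.7, 0.9]

# Threshold-based metrics change with the cut-off ...
for threshold in (0.3, 0.5):
    y_pred = [1 if s >= threshold else 0 for s in y_prob]
    p, r, f = precision_recall_f1(y_true, y_pred)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")

# ... while ROC-AUC depends only on how the scores are ordered.
print(f"ROC-AUC: {roc_auc(y_true, y_prob):.3f}")
```

Lowering the threshold from 0.5 to 0.3 raises recall and lowers precision, while the AUC stays the same. That is the reason to report one metric of each kind.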

10.2 Confusion matrix

python

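import matplotlib.pyplot as plt
import seaborn as sns

# Rows are actual classes, columns are predicted classes.
# Passing labels keeps the matrix order in sync with the tick labels.
cm = confusion_matrix(y_test, y_pred, labels=model.classes_)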
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=model.classes_,
            yticklabels=model.classes_)
plt.xlabel('Predicted'); plt.ylabel('Actual')
plt.title('Confusion Matrix'); plt.show()
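
To make the layout of the matrix concrete, here is a small hand-rolled version using only the standard library. The labels and predictions are invented, and the rows-are-actual, columns-are-predicted orientation is chosen to match the heatmap above.

```python
from collections import Counter

# Hypothetical ground truth and predictions.
y_true = ["cat", "cat", "cat", "dog", "dog", "dog"]
y_pred = ["cat", "cat", "dog", "dog", "dog", "dog"]

labels = sorted(set(y_true) | set(y_pred))
counts = Counter(zip(y_true, y_pred))  # (actual, predicted) -> count

# matrix[i][j] = number of samples with actual labels[i] predicted as labels[j]
matrix = [[counts[(actual, predicted)] for predicted in labels]
          for actual in labels]

print("actual \\ predicted", labels)
for label, row in zip(labels, matrix):
    print(label, row)
```

The diagonal holds the correct predictions. The single off-diagonal count in the first row is a cat that was predicted as a dog.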

10.3 Regression metrics

python

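import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Here y_pred holds continuous predictions from a regression model.
mse  = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae  = mean_absolute_error(y_test, y_pred)
r2   = r2_score(y_test, y_pred)
# MAPE divides by y_test, so it is undefined when y_test contains zeros.
mape = np.mean(np.abs((y_test - y_pred) / y_test)) * 100  # % error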

print(f"RMSE: {rmse:.2f}")
print(f"MAE:  {mae:.2f}")
print(f"R²:   {r2:.4f}")
print(f"MAPE: {mape:.2f}%")
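
The sketch below shows, on invented numbers, why it helps to report MAE and RMSE together, and one possible way to guard MAPE against zero targets. Skipping rows whose true value is zero is an assumption made for this example, not the only reasonable choice. The function names are illustrative.

```python
import math


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mape(y_true, y_pred):
    """MAPE in percent, computed only over rows whose true value is nonzero."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != 0]
    if not pairs:
        raise ValueError("MAPE is undefined when every true value is zero")
    return 100 * sum(abs((t - p) / t) for t, p in pairs) / len(pairs)


# Hypothetical targets and two sets of predictions with the same total error.
y_true = [100, 200, 300, 400]
steady = [110, 190, 310, 390]   # four misses of 10
spiky  = [100, 200, 300, 440]   # one miss of 40

for name, y_pred in (("steady", steady), ("spiky", spiky)):
    print(f"{name}: MAE={mae(y_true, y_pred):.1f} RMSE={rmse(y_true, y_pred):.1f}")

# A zero target is skipped instead of causing a division by zero.
print(f"MAPE: {mape([0, 50, 100], [5, 45, 110]):.1f}%")
```

Both prediction sets have the same MAE, but the single large miss doubles the RMSE. A gap between the two metrics is a hint that a few large errors dominate.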

10.4 Feature importance

python

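import pandas as pd
import matplotlib.pyplot as plt

# feature_importances_ is exposed by tree-based models
# (e.g. random forests, gradient boosting), not by every estimator.
importance = pd.Series(model.feature_importances_, index=X.columns)
# Ascending sort + tail(15) keeps the 15 largest, with the largest bar on top.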
importance.sort_values(ascending=True).tail(15).plot(
    kind='barh', figsize=(8, 6), color='#4a90d9'
)
plt.title('Top 15 Most Important Features')
plt.tight_layout(); plt.show()

Metric	Use when...	Range
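`Accuracy`	Balanced classes	0–1 (higher=better)
`F1 Score`	Imbalanced classes	0–1 (higher=better)
`ROC-AUC`	Ranking by probability	0–1 (0.5=random, higher=better)
`RMSE`	Regression, penalizes large errors	0–∞ (lower=better)
`R²`	Regression, fraction of variance explained	≤1 (higher=better; negative when worse than predicting the mean)
`MAPE`	Regression, % error (interpretable)	0–∞% (lower=better)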
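
One point about R² is easy to miss: on held-out data it is not bounded below by 0. The sketch below computes it by hand on invented numbers, using the usual definition R² = 1 − SS_res / SS_tot. The function name `r2` is illustrative.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical targets.
y_true = [1.0, 2.0, 3.0]

print(r2(y_true, [1.0, 2.0, 3.0]))  # perfect predictions
print(r2(y_true, [2.0, 2.0, 2.0]))  # always predicting the mean
print(r2(y_true, [3.0, 2.0, 1.0]))  # worse than predicting the mean
```

A perfect model scores 1, the mean baseline scores 0, and a model that does worse than the baseline goes negative.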

Common mistakes to avoid

Skipping business context before running technical steps
Not writing assumptions and limitations explicitly
Treating one metric as the full story

Quick cheatsheet

df.info() -> Structure and non-null counts

df.describe() -> Numeric summary statistics

df.isnull().sum() -> Missing-value counts by column

df.groupby() -> Segmented aggregation

pd.merge() -> Join multiple datasets