Chapter 10 — Evaluation

Model Evaluation

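Measure model performance with metrics that match the problem. Classification and regression call for different metrics, and the cost of each kind of error decides which one to prioritize.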

10.0 Metric selection guide
Business objective | Primary metric | Why | What to avoid
General balanced classification | Accuracy + F1 | Balanced view of correctness and class performance | Do not rely on accuracy alone if classes are imbalanced
Missed positive is expensive (fraud, disease) | Recall | Minimizes false negatives | Do not optimize precision alone
False positive is expensive | Precision | Reduces false alarms | Do not optimize recall alone
Ranking quality by score | ROC-AUC / PR-AUC | Evaluates probability ordering | Avoid hard-threshold metrics only
Regression forecast error cost | MAE / RMSE / MAPE | MAE robust, RMSE penalizes large misses, MAPE interpretable | Skip MAPE when true values can be zero
Report at least one threshold-free metric (AUC) and one threshold-based metric (Precision/Recall/F1) for classification. For regression, report both MAE and RMSE.
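To make the warning about accuracy under class imbalance concrete, here is a minimal sketch in plain Python. The data is made up for illustration (95 negatives, 5 positives), and the helper `binary_metrics` is a hypothetical name, not a library function.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # Guard against empty denominators (no predicted or no actual positives)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Illustrative labels: 95 negatives followed by 5 positives
y_true = [0] * 95 + [1] * 5

# Baseline that always predicts the majority class
always_negative = [0] * 100

# Detector that raises 10 false alarms but catches 4 of the 5 positives
detector = [1] * 10 + [0] * 85 + [1] * 4 + [0] * 1

baseline_scores = binary_metrics(y_true, always_negative)
detector_scores = binary_metrics(y_true, detector)

print(baseline_scores)
print(detector_scores)
```

In this example the always-negative baseline has the higher accuracy (0.95 against 0.89) yet finds none of the positives, while the detector reaches a recall of 0.8. That is the situation in which recall or F1 should lead the report.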
10.1 Classification metrics
python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, classification_report, confusion_matrix, roc_auc_score
)

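# Assumes y_pred holds predicted labels and y_prob holds predicted
# probabilities of the positive class, e.g. model.predict_proba(X_test)[:, 1]
acc  = accuracy_score(y_test, y_pred)
# average='weighted' weights each class's score by its number of true samples
prec = precision_score(y_test, y_pred, average='weighted')
rec  = recall_score(y_test, y_pred, average='weighted')
f1   = f1_score(y_test, y_pred, average='weighted')
# Binary case; multiclass needs the full probability matrix and multi_class='ovr'
auc  = roc_auc_score(y_test, y_prob)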

print(f"Accuracy:  {acc:.4f}")
print(f"Precision: {prec:.4f}")
print(f"Recall:    {rec:.4f}")
print(f"F1 Score:  {f1:.4f}")
print(f"ROC-AUC:   {auc:.4f}")

print("\n--- Full Report ---")
print(classification_report(y_test, y_pred))
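ROC-AUC is easier to interpret once you see it as a ranking measure: it is the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below computes it that way in plain Python. The function name `pairwise_auc` and the four data points are illustrative assumptions, and the double loop is for explanation only; use `roc_auc_score` in practice.

```python
def pairwise_auc(y_true, scores):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Illustrative labels and predicted scores
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc_value = pairwise_auc(y_true, scores)
print(f"Pairwise AUC: {auc_value:.2f}")  # 3 of 4 pairs ranked correctly -> 0.75
```

Because only the ordering of the scores matters, no classification threshold is involved, which is why AUC complements threshold-based metrics such as precision and recall.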
10.2 Confusion matrix
python
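import matplotlib.pyplot as plt
import seaborn as sns

# Rows are actual classes, columns are predicted; labels= keeps the order aligned with the ticks
cm = confusion_matrix(y_test, y_pred, labels=model.classes_)
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=model.classes_,
            yticklabels=model.classes_)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()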
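Reading the matrix is simpler once you know how it is built. The following plain-Python sketch counts the cells by hand, using the same layout as the heatmap (rows are actual classes, columns are predicted classes), then derives per-class recall from the rows and per-class precision from the columns. The labels, the data, and the helper name `confusion_counts` are made up for illustration.

```python
def confusion_counts(y_true, y_pred, labels):
    """Build a confusion matrix as a list of rows: matrix[actual][predicted]."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return matrix

labels = ["cat", "dog"]
y_true = ["cat", "cat", "cat", "dog", "dog", "dog", "dog", "dog"]
y_pred = ["cat", "cat", "dog", "dog", "dog", "dog", "cat", "cat"]

cm = confusion_counts(y_true, y_pred, labels)

# Recall: correct predictions divided by the row total (all actual members)
recall_per_class = {
    label: cm[i][i] / sum(cm[i]) for i, label in enumerate(labels)
}
# Precision: correct predictions divided by the column total (all predicted members)
precision_per_class = {
    label: cm[i][i] / sum(row[i] for row in cm)
    for i, label in enumerate(labels)
}

print(cm)                   # [[2, 1], [2, 3]]
print(recall_per_class)
print(precision_per_class)
```

The diagonal holds the correct predictions. Large off-diagonal cells show which pairs of classes the model confuses, which a single summary score cannot tell you.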
10.3 Regression metrics
python
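import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

mse  = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae  = mean_absolute_error(y_test, y_pred)
r2   = r2_score(y_test, y_pred)

# MAPE is undefined where the true value is zero, so exclude those rows
y_true_arr = np.asarray(y_test, dtype=float)
y_pred_arr = np.asarray(y_pred, dtype=float)
nonzero = y_true_arr != 0
mape = np.mean(np.abs((y_true_arr[nonzero] - y_pred_arr[nonzero]) / y_true_arr[nonzero])) * 100  # % error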

print(f"RMSE: {rmse:.2f}")
print(f"MAE:  {mae:.2f}")
print(f"R²:   {r2:.4f}")
print(f"MAPE: {mape:.2f}%")
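The reason to report MAE and RMSE together shows up when errors are unevenly distributed. This plain-Python sketch compares two sets of made-up forecasts with the same MAE: one spreads the error evenly, the other concentrates it in a single large miss. The helper name `regression_errors` is hypothetical, and skipping zero-valued targets in MAPE is one possible convention, assumed here for illustration.

```python
import math

def regression_errors(y_true, y_pred):
    """Return (MAE, RMSE, MAPE in percent) for two equal-length sequences."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    # MAPE is undefined for zero targets, so those points are skipped
    pct = [abs(e / t) for e, t in zip(errors, y_true) if t != 0]
    mape = 100 * sum(pct) / len(pct) if pct else float("nan")
    return mae, rmse, mape

y_true = [100, 100, 100, 100]
small_misses = [90, 110, 90, 110]    # four errors of 10
one_big_miss = [100, 100, 100, 60]   # a single error of 40

mae_a, rmse_a, mape_a = regression_errors(y_true, small_misses)
mae_b, rmse_b, mape_b = regression_errors(y_true, one_big_miss)

print(f"Even errors:  MAE={mae_a:.1f}  RMSE={rmse_a:.1f}  MAPE={mape_a:.1f}%")
print(f"One big miss: MAE={mae_b:.1f}  RMSE={rmse_b:.1f}  MAPE={mape_b:.1f}%")
```

Both forecasts have an MAE of 10, but RMSE rises from 10 to 20 for the forecast with the single large miss. A wide gap between RMSE and MAE is therefore a sign that a few large errors dominate.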
10.4 Feature importance
python
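import pandas as pd
import matplotlib.pyplot as plt

# feature_importances_ is available on tree-based models; X is assumed to be the feature DataFrame
importance = pd.Series(model.feature_importances_, index=X.columns)
# Ascending sort + tail(15) keeps the 15 largest values, with the largest bar at the top
importance.sort_values(ascending=True).tail(15).plot(
    kind='barh', figsize=(8, 6), color='#4a90d9'
)
plt.title('Top 15 Most Important Features')
plt.tight_layout()
plt.show()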
Metric | Use when... | Range
Accuracy | Balanced classes | 0–1
F1 Score | Imbalanced classes | 0–1
ROC-AUC | Ranking by probability | 0–1 (0.5 = random ranking)
RMSE | Regression, penalizes large errors | 0–∞ (lower=better)
R² | Regression, % variance explained | up to 1 (negative if worse than predicting the mean)
MAPE | Regression, % error (interpretable) | 0–∞%
Common mistakes to avoid
Quick cheatsheet
df.info() -> Structure and non-null counts
df.describe() -> Numeric summary statistics
df.isnull().sum() -> Missing-value counts by column
df.groupby() -> Segmented aggregation
pd.merge() -> Join multiple datasets