Chapter 09 — ML Basics
Machine Learning Basics
Build, train, and evaluate predictive models. Learn the workflow that applies to every ML project.
9.0 Model choice framework
| Situation | Start with | Why | Skip / caution |
|---|---|---|---|
| Binary classification baseline | Logistic Regression | Fast, interpretable baseline | Skip as final model if strong non-linearity exists |
| Mixed tabular features | Random Forest / Gradient Boosting | Handles interactions with little tuning | Skip if dataset is tiny and interpretability is strict |
| Continuous target | Linear Regression baseline + tree regressor | Compare simple vs non-linear | Skip single-model conclusions without baseline comparison |
| Imbalanced classes | Class weights + stratified split | Prevents majority-class bias | Do not rely on accuracy alone |
| Very small data | Cross-validation + simple model | Reduces overfitting risk | Skip complex high-variance models |
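The "Imbalanced classes" row can be illustrated with a minimal sketch. The data here is synthetic (generated with `make_classification`, roughly 95% negatives), and the choice of `LogisticRegression` and of balanced accuracy as the second metric are assumptions made for illustration, not a recommendation from this chapter.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, illustrative data: roughly 95% negatives, 5% positives
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=42
)

# Stratify so train and test keep the same class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Compare an unweighted model with a class-weighted one
for class_weight in (None, "balanced"):
    clf = LogisticRegression(class_weight=class_weight, max_iter=1000)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(
        f"class_weight={class_weight}: "
        f"accuracy={accuracy_score(y_test, y_pred):.3f}, "
        f"balanced accuracy={balanced_accuracy_score(y_test, y_pred):.3f}"
    )
```

Plain accuracy can look high on data like this even when the minority class is mostly missed, which is why the table warns against relying on it alone.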
Never tune hyperparameters on the test set. Keep the test data untouched until the final evaluation; otherwise the performance you report will be optimistically biased.
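One way to follow this rule, sketched under assumptions: split the data three ways (train, validation, test), compare settings on the validation set, and score the test set exactly once. The synthetic data, the 60/20/20 proportions, and the `max_depth` candidates are illustrative choices only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data for illustration
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# First carve off the test set, then split the rest into train/validation
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42, stratify=y_rest
)  # 0.25 of the remaining 80% = 20% of the full data

# Compare candidate settings on the validation set only
val_scores = {}
for max_depth in (3, 5, 10):
    candidate = RandomForestClassifier(max_depth=max_depth, random_state=42)
    candidate.fit(X_train, y_train)
    val_scores[max_depth] = accuracy_score(y_val, candidate.predict(X_val))

best_depth = max(val_scores, key=val_scores.get)

# Touch the test set once, with the chosen setting
final_model = RandomForestClassifier(max_depth=best_depth, random_state=42)
final_model.fit(X_rest, y_rest)  # refit on train + validation
test_accuracy = accuracy_score(y_test, final_model.predict(X_test))
print(f"Chosen max_depth={best_depth}, test accuracy={test_accuracy:.3f}")
```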
9.1 Prepare data for ML
python
# 1. Define features (X) and target (y)
X = df.drop(columns=['target', 'id', 'date'])
y = df['target']

# 2. Keep only numeric columns
X = X.select_dtypes(include='number')

# 3. Remove rows with any missing values
mask = X.notna().all(axis=1) & y.notna()
X, y = X[mask], y[mask]

# 4. Split into train and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,      # 20% for testing
    random_state=42,    # for reproducibility
    stratify=y          # keep class balance (classification only)
)
print(f"Train: {X_train.shape}, Test: {X_test.shape}")
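To see what steps 1 to 3 do to a concrete table, here is a small sketch on a hypothetical toy DataFrame. The column names (`age`, `plan`, `spend`) and values are made up for illustration; only `target`, `id`, and `date` come from the snippet above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are made up for illustration
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "date": pd.date_range("2024-01-01", periods=6),
    "age": [25, 32, np.nan, 41, 38, 29],
    "plan": ["basic", "pro", "pro", "basic", "pro", "basic"],
    "spend": [10.0, 55.5, 42.0, np.nan, 61.2, 12.3],
    "target": [0, 1, 1, 0, 1, 0],
})

# 1. Define features and target
X = df.drop(columns=["target", "id", "date"])
y = df["target"]

# 2. Keep only numeric columns (this drops the text column 'plan')
X = X.select_dtypes(include="number")

# 3. Remove rows where any feature or the target is missing
mask = X.notna().all(axis=1) & y.notna()
X, y = X[mask], y[mask]

print(f"Rows kept: {len(X)} of {len(df)}")
print(f"Columns kept: {list(X.columns)}")
```

Note that step 2 discards the categorical column `plan` without warning, so it is worth checking which columns survive before training.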
9.2 Classification — predict categories
python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Choose and train a model
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]  # probability of the positive class (binary case)

print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
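Accuracy is a single number; a per-class breakdown often tells more. The sketch below uses synthetic data (an assumption, since `df` is not defined here) and the same model settings as above, then prints a confusion matrix and `classification_report()`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic binary classification data for illustration
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Precision, recall, and F1 for each class
print(classification_report(y_test, y_pred))
```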
9.3 Regression — predict numbers
python
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
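The snippet above stops at prediction. A possible way to evaluate a regressor, and to compare it with a linear baseline as the model choice table suggests, is sketched below. The synthetic data and the choice of MAE, RMSE, and R² as metrics are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data for illustration
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

results = {}
for name, reg in models.items():
    reg.fit(X_train, y_train)
    y_pred = reg.predict(X_test)
    results[name] = {
        "MAE": mean_absolute_error(y_test, y_pred),
        "RMSE": np.sqrt(mean_squared_error(y_test, y_pred)),
        "R2": r2_score(y_test, y_pred),
    }
    print(name, {metric: round(value, 3) for metric, value in results[name].items()})
```

MAE and RMSE are in the units of the target, so they are easier to explain to stakeholders than R².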
9.4 Cross-validation (more reliable)
python
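from sklearn.model_selection import cross_val_score

# 5-fold cross-validation
# Assumes `model` is a classifier (as in 9.2); for a regressor, use a
# regression metric such as scoring='r2' instead of 'accuracy'.
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"CV Scores: {scores}")
print(f"Mean: {scores.mean():.3f} ± {scores.std():.3f}")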
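When `cv=5` is passed with a classifier, scikit-learn uses stratified folds without shuffling. If the rows are ordered (for example by date or by class), it can help to make the splitter explicit and shuffle it. A minimal sketch on synthetic data, with the fold count and seed chosen only for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
model = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=42)

# Explicit splitter: stratified, shuffled, reproducible
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print(f"CV Scores: {scores}")
print(f"Mean: {scores.mean():.3f} ± {scores.std():.3f}")
```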
9.5 Hyperparameter tuning
python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 10, None],
}
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42), param_grid, cv=5, scoring='accuracy'
)
grid_search.fit(X_train, y_train)  # tune on the training data only
print("Best params:", grid_search.best_params_)
best_model = grid_search.best_estimator_
ML starter flow (copy template)
Classification predicts categories (churn yes/no, fraud yes/no). Regression predicts numbers (sales, price, demand). Start with a simple baseline before complex models.
python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Train-test split
X = df.drop(columns=['target'])
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Model
model = LogisticRegression()
model.fit(X_train, y_train)

# Prediction
y_pred = model.predict(X_test)

# Evaluation
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
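To judge whether a model has learned anything, its score needs a reference point. One option, sketched here as an assumption rather than part of the template, is scikit-learn's `DummyClassifier`, which always predicts the most frequent class. The data is synthetic with roughly a 70/30 class split.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data for illustration: roughly 70% class 0, 30% class 1
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.7, 0.3], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Reference point: always predict the majority class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))

# Simple real model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
model_acc = accuracy_score(y_test, model.predict(X_test))

print(f"Majority-class baseline: {baseline_acc:.3f}")
print(f"Logistic regression:     {model_acc:.3f}")
```

A model that barely beats the majority-class baseline is not yet useful, whatever its raw accuracy looks like.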
Common mistakes to avoid
- Skipping business context before running technical steps
- Not writing assumptions and limitations explicitly
- Treating one metric as the full story
Quick cheatsheet
train_test_split() -> Create holdout set before training
model.fit() -> Train model on training data
model.predict() -> Generate predicted labels/values
classification_report() -> Precision/recall/F1 breakdown
cross_val_score() -> Cross-validated performance