Chapter 09 — ML Basics

Machine Learning Basics

Build, train, and evaluate predictive models. Learn the workflow that most supervised ML projects share: prepare the data, split it, train a model, evaluate it, and tune it.

9.0 Model choice framework

Situation	Start with	Why	Skip / caution
Binary classification baseline	Logistic Regression	Fast, interpretable baseline	Skip as final model if strong non-linearity exists
Mixed tabular features	Random Forest / Gradient Boosting	Handles interactions with little tuning	Skip if dataset is tiny and interpretability is strict
Continuous target	Linear Regression baseline + tree regressor	Compare simple vs non-linear	Do not draw conclusions from a single model without a baseline comparison
Imbalanced classes	Class weights + stratified split	Prevents majority-class bias	Do not rely on accuracy alone
Very small data	Cross-validation + simple model	Reduces overfitting risk	Skip complex high-variance models

Never tune hyperparameters on the test set. Keep test data untouched until final evaluation, or your reported performance will be optimistic.
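The "Imbalanced classes" row above can be sketched as follows. This is a minimal illustration on synthetic data, not a recommended final setup: the dataset from `make_classification`, the roughly 5% positive rate, and the names `X_demo`, `y_demo`, and `scores_demo` are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, illustrative data with roughly 5% positives (assumption)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=42
)

# Stratified split keeps the class ratio similar in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

candidates = {
    "plain": LogisticRegression(max_iter=1000),
    "class-weighted": LogisticRegression(max_iter=1000, class_weight="balanced"),
}

scores_demo = {}
for name, clf in candidates.items():
    pred = clf.fit(X_tr, y_tr).predict(X_te)
    scores_demo[name] = {
        "accuracy": accuracy_score(y_te, pred),
        "minority_recall": recall_score(y_te, pred),  # recall for class 1
    }
    print(name, scores_demo[name])
```

Reading accuracy next to minority-class recall shows why accuracy alone can mislead: a model that rarely predicts the minority class can still score high accuracy on data like this.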

9.1 Prepare data for ML

python

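# 1. Define features (X) and target (y)
# Assumes df is a pandas DataFrame with 'target', 'id', and 'date' columns;
# adjust the names to match your data.
X = df.drop(columns=['target', 'id', 'date'])
y = df['target']

# 2. Keep only numeric columns (categorical columns are dropped, not encoded)
X = X.select_dtypes(include='number')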

# 3. Remove rows with any missing values
mask = X.notna().all(axis=1) & y.notna()
X, y = X[mask], y[mask]

# 4. Split into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,       # 20% for testing
    random_state=42,     # for reproducibility
    stratify=y           # keep class balance (classification only)
)

print(f"Train: {X_train.shape}, Test: {X_test.shape}")
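To see what steps 1 to 3 actually remove, here is a small sketch on a made-up DataFrame. The frame `df_demo` and its columns (`age`, `income`, `city`) are hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two missing values and one non-numeric column
df_demo = pd.DataFrame({
    "id": range(1, 9),
    "date": pd.date_range("2024-01-01", periods=8, freq="D"),
    "age": [25, 32, np.nan, 41, 38, 29, 55, 47],
    "income": [40.0, 52.5, 61.0, np.nan, 58.0, 45.5, 80.0, 72.0],
    "city": ["A", "B", "A", "C", "B", "A", "C", "B"],
    "target": [0, 1, 0, 1, 1, 0, 1, 0],
})

# Step 1: separate features and target
X_demo = df_demo.drop(columns=["target", "id", "date"])
y_demo = df_demo["target"]

# Step 2: keep numeric columns only ('city' is dropped here)
X_demo = X_demo.select_dtypes(include="number")

# Step 3: drop rows with a missing feature or target value
mask = X_demo.notna().all(axis=1) & y_demo.notna()
X_demo, y_demo = X_demo[mask], y_demo[mask]

print(list(X_demo.columns))              # ['age', 'income']
print(len(df_demo), "->", len(X_demo))   # 8 -> 6
```

Dropping rows and columns this way is simple but discards information, so it is worth checking how many rows survive before training.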

9.2 Classification — predict categories

python

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Choose and train a model (any of the classifiers imported above can be swapped in)
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]   # probability of the positive class (binary target)

print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
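Accuracy is a single number, and the probability scores computed above are not used by it. The sketch below shows one way to look further, using `classification_report`, `confusion_matrix`, and `roc_auc_score` from `sklearn.metrics`. The synthetic dataset and the names `X_demo`, `y_demo`, `clf`, `cm`, and `auc` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (illustrative only)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

clf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
clf.fit(X_tr, y_tr)

pred = clf.predict(X_te)               # hard labels
prob = clf.predict_proba(X_te)[:, 1]   # positive-class probabilities

# Per-class precision, recall, and F1
print(classification_report(y_te, pred))

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, pred)
print(cm)

# ROC AUC is computed from probabilities, not hard labels
auc = roc_auc_score(y_te, prob)
print(f"ROC AUC: {auc:.3f}")
```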

9.3 Regression — predict numbers

python

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# Assumes y is continuous; for regression, make the split in 9.1 without stratify=y
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
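The regression block stops at prediction. A possible next step, following the framework's advice to compare a linear baseline with a tree regressor, is sketched below with MAE, RMSE, and R². The data comes from `make_regression`, which is linear by construction, so the baseline is expected to do well here; the names `X_demo`, `y_demo`, and `results` are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data (illustrative only; no stratify for continuous targets)
X_demo, y_demo = make_regression(n_samples=500, n_features=8, noise=15.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

candidates = {
    "linear baseline": LinearRegression(),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

results = {}
for name, reg in candidates.items():
    pred = reg.fit(X_tr, y_tr).predict(X_te)
    results[name] = {
        "MAE": mean_absolute_error(y_te, pred),
        "RMSE": np.sqrt(mean_squared_error(y_te, pred)),
        "R2": r2_score(y_te, pred),
    }
    print(name, {k: round(v, 3) for k, v in results[name].items()})
```

MAE and RMSE are in the units of the target, while R² is unitless, so reporting at least one of each kind makes results easier to interpret.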

9.4 Cross-validation (more reliable than a single split)

python

from sklearn.model_selection import cross_val_score

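# 5-fold cross-validation on the training data only, so the test set stays untouched
# `model` must be a classifier here because scoring='accuracy';
# for a regressor use a regression scorer such as scoring='r2'
scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')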
print(f"CV Scores: {scores}")
print(f"Mean: {scores.mean():.3f} ± {scores.std():.3f}")
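The same call works for regression with a different scorer. One detail that often surprises: scikit-learn's error-based scorers are negated (for example `neg_mean_absolute_error`) so that higher is always better, which means the sign has to be flipped before reporting. A minimal sketch on synthetic data, where `X_demo`, `y_demo`, and `mae` are assumed names:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (illustrative only)
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)

# Shuffled 5-fold split with a fixed seed for reproducibility
cv = KFold(n_splits=5, shuffle=True, random_state=42)

neg_mae = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=cv, scoring="neg_mean_absolute_error"
)
mae = -neg_mae  # flip the sign to get ordinary MAE values

print(f"MAE per fold: {mae.round(2)}")
print(f"Mean MAE: {mae.mean():.2f} ± {mae.std():.2f}")
```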

9.5 Hyperparameter tuning

python

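from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 10, None]
}
# random_state fixed so repeated searches give the same result
grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, scoring='accuracy')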
grid_search.fit(X_train, y_train)

print("Best params:", grid_search.best_params_)
best_model = grid_search.best_estimator_
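Once the search has picked its parameters using cross-validation on the training data, the held-out test set can be used once for the final number. The sketch below runs that sequence end to end on synthetic data; the smaller grid and the names `X_demo`, `y_demo`, `search`, and `test_acc` are assumptions chosen to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data (illustrative only)
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Tuning sees only the training portion; CV folds are carved out of it
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

print("Best params:", search.best_params_)
print(f"Mean CV accuracy of best params: {search.best_score_:.3f}")

# Final, one-time evaluation on the untouched test set
test_acc = accuracy_score(y_te, search.best_estimator_.predict(X_te))
print(f"Held-out test accuracy: {test_acc:.3f}")
```

By default `GridSearchCV` refits the best configuration on the full training data, so `best_estimator_` is ready to predict without another call to `fit`.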

ML starter flow (copy template)

Classification predicts categories (churn yes/no, fraud yes/no). Regression predicts numbers (sales, price, demand). Start with a simple baseline before complex models.

python

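from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Train-test split (assumes df is an all-numeric DataFrame with a 'target' column)
X = df.drop(columns='target')
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Model: a simple baseline; max_iter is raised to reduce convergence warnings
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Prediction
y_pred = model.predict(X_test)

# Evaluation
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")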

Common mistakes to avoid

Skipping business context before running technical steps
Not writing assumptions and limitations explicitly
Treating one metric as the full story

Quick cheatsheet

train_test_split() -> Create holdout set before training

model.fit() -> Train model on training data

model.predict() -> Generate predicted labels/values

classification_report() -> Precision/recall/F1 breakdown

cross_val_score() -> Cross-validated performance