Chapter 09 — ML Basics

Machine Learning Basics

Build, train, and evaluate predictive models using a workflow that carries over to most ML projects.

9.0 Model choice framework
Situation | Start with | Why | Skip / caution
Binary classification baseline | Logistic Regression | Fast, interpretable baseline | Skip as final model if strong non-linearity exists
Mixed tabular features | Random Forest / Gradient Boosting | Handles interactions with little tuning | Skip if dataset is tiny and interpretability is strict
Continuous target | Linear Regression baseline + tree regressor | Compare simple vs non-linear | Skip single-model conclusions without baseline comparison
Imbalanced classes | Class weights + stratified split | Prevents majority-class bias | Do not rely on accuracy alone
Very small data | Cross-validation + simple model | Reduces overfitting risk | Skip complex high-variance models
Never tune hyperparameters on the test set. Keep test data untouched until final evaluation, or your reported performance will be optimistic.
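The "Imbalanced classes" row of the framework can be sketched as follows. This is a minimal illustration, not a recommended recipe: the data is synthetic (generated with `make_classification`), and the names `candidates` and `results` are made up for this example. It compares an unweighted logistic regression with one using `class_weight="balanced"`, and reports minority-class recall next to accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 90% of rows in class 0 (illustration only)
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=42
)

# stratify=y keeps the class ratio the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Hypothetical candidates: same model, with and without class weights
candidates = {
    "unweighted": LogisticRegression(max_iter=1000),
    "balanced": LogisticRegression(max_iter=1000, class_weight="balanced"),
}

results = {}
for name, clf in candidates.items():
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "minority_recall": recall_score(y_test, y_pred, pos_label=1),
    }
    print(name, results[name])
```

Accuracy can look high for a model that rarely predicts the minority class, which is why a per-class metric such as recall is worth reading alongside it.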
9.1 Prepare data for ML
python
# 1. Define features (X) and target (y)
X = df.drop(columns=['target', 'id', 'date'])
y = df['target']

# 2. Keep only numeric columns
X = X.select_dtypes(include='number')

# 3. Remove rows with any missing values
mask = X.notna().all(axis=1) & y.notna()
X, y = X[mask], y[mask]

# 4. Split into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,       # 20% for testing
    random_state=42,     # for reproducibility
    stratify=y           # keep class balance (classification only)
)

print(f"Train: {X_train.shape}, Test: {X_test.shape}")
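To see what steps 1 to 3 above actually remove, here is a small worked example on a made-up DataFrame. The column names (`age`, `city`, `income`) and values are invented for illustration; only the `target`, `id`, and `date` names come from the snippet above.

```python
import numpy as np
import pandas as pd

# Tiny, made-up dataset with one text column and a few missing values
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "date": pd.to_datetime(["2024-01-01"] * 6),
    "age": [25, 32, np.nan, 41, 38, 29],
    "city": ["A", "B", "A", "C", "B", "A"],        # non-numeric
    "income": [40.0, 52.5, 61.0, np.nan, 48.0, 55.0],
    "target": [0, 1, 0, 1, np.nan, 1],
})

# 1. Features and target
X = df.drop(columns=["target", "id", "date"])
y = df["target"]

# 2. Keep only numeric columns: 'city' is dropped, 'age' and 'income' remain
X = X.select_dtypes(include="number")

# 3. Keep rows where every feature and the target are present
mask = X.notna().all(axis=1) & y.notna()
X, y = X[mask], y[mask]

print(list(X.columns))                      # columns that survived
print(f"Rows kept: {len(X)} of {len(df)}")  # rows that survived
```

Here three of the six rows are dropped (one for a missing `age`, one for a missing `income`, one for a missing `target`), so it is worth checking how much data this simple approach discards before relying on it.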
9.2 Classification — predict categories
python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Choose and train a model
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]   # probability score

print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
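Accuracy is a single number; the snippet above also computes `y_prob` without using it. The sketch below shows one possible way to look further, using `confusion_matrix`, `classification_report`, and `roc_auc_score` from `sklearn.metrics`. The data is synthetic and stands in for your own `X` and `y`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (illustration only)
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)               # hard labels
y_prob = model.predict_proba(X_test)[:, 1]   # probability of class 1

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Precision, recall, and F1 per class
print(classification_report(y_test, y_pred))

# ROC AUC uses the probability scores, not the hard labels
auc = roc_auc_score(y_test, y_prob)
print(f"ROC AUC: {auc:.3f}")
```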
9.3 Regression — predict numbers
python
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
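The regression snippet stops at prediction. A minimal sketch of the evaluation step, assuming synthetic data from `make_regression` and a hypothetical `results` dictionary, could compare a linear baseline against the random forest using MAE, RMSE, and R²:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data (illustration only)
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

regressors = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

results = {}
for name, reg in regressors.items():
    reg.fit(X_train, y_train)
    y_pred = reg.predict(X_test)
    results[name] = {
        "mae": mean_absolute_error(y_test, y_pred),
        # RMSE computed as the square root of MSE
        "rmse": float(np.sqrt(mean_squared_error(y_test, y_pred))),
        "r2": r2_score(y_test, y_pred),
    }
    print(name, {k: round(v, 3) for k, v in results[name].items()})
```

MAE and RMSE are in the units of the target, while R² is unitless; which model wins depends on the data, so treat the comparison itself as the point rather than any particular outcome.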
9.4 Cross-validation (more reliable)
python
from sklearn.model_selection import cross_val_score

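# 5-fold cross-validation on the training data only (the test set stays untouched).
# `model` here is the classifier from 9.2; for a regressor, use a regression
# metric such as scoring='r2' instead of 'accuracy'.
scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')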
print(f"CV Scores: {scores}")
print(f"Mean: {scores.mean():.3f} ± {scores.std():.3f}")
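To make clear what `cross_val_score` is doing, the sketch below reproduces it by hand: split the data into folds, fit a fresh copy of the model on each training part, and score it on the held-out part. It assumes synthetic data and a logistic regression; `manual_scores` and `auto_scores` are names invented for this example. For classifiers, an integer `cv` uses stratified folds, so `StratifiedKFold` is used explicitly here.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data (illustration only)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
model = LogisticRegression(max_iter=1000)

# Five stratified folds, no shuffling
cv = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, val_idx in cv.split(X, y):
    fold_model = clone(model)                    # fresh, unfitted copy per fold
    fold_model.fit(X[train_idx], y[train_idx])   # train on 4 folds
    # .score() on a classifier returns accuracy on the held-out fold
    manual_scores.append(fold_model.score(X[val_idx], y[val_idx]))
manual_scores = np.array(manual_scores)

# The same folds passed to the helper give the same scores
auto_scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')

print("Manual:", manual_scores.round(3))
print("Helper:", auto_scores.round(3))
```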
9.5 Hyperparameter tuning
python
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 10, None]
}
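# RandomForestClassifier is imported in 9.2; random_state makes the search repeatable
grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, scoring='accuracy')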
grid_search.fit(X_train, y_train)

print("Best params:", grid_search.best_params_)
best_model = grid_search.best_estimator_
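The tuning step connects back to the rule of keeping the test set untouched: the grid search only ever sees the training data, and the test set is scored once at the end. A minimal end-to-end sketch, assuming synthetic data and a deliberately small grid so it runs quickly:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data (illustration only)
X, y = make_classification(n_samples=400, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Small grid chosen for speed, not as a recommendation
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [3, None],
}
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid, cv=3, scoring='accuracy'
)

# Tuning uses cross-validation inside the training set only
grid_search.fit(X_train, y_train)
cv_accuracy = grid_search.best_score_        # mean CV accuracy of the best setting
best_model = grid_search.best_estimator_     # refit on all of X_train

# Final evaluation: the test set is used exactly once
test_accuracy = accuracy_score(y_test, best_model.predict(X_test))

print("Best params:", grid_search.best_params_)
print(f"CV accuracy:   {cv_accuracy:.3f}")
print(f"Test accuracy: {test_accuracy:.3f}")
```

If the test accuracy is noticeably lower than the cross-validated accuracy, that gap is a more honest signal than either number alone.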
ML starter flow (copy template)
Classification predicts categories (churn yes/no, fraud yes/no). Regression predicts numbers (sales, price, demand). Start with a simple baseline before complex models.
python
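# Imports
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Features and target (assumes the remaining columns are numeric with no missing values)
X = df.drop(columns='target')
y = df['target']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Model (max_iter raised to reduce convergence warnings on unscaled features)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Prediction
y_pred = model.predict(X_test)

# Evaluation
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")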
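"Start with a simple baseline" can be taken one step further: before trusting any accuracy figure, compare it with a model that ignores the features entirely. The sketch below uses scikit-learn's `DummyClassifier` with `strategy='most_frequent'` on synthetic data; the 70/30 class ratio is an assumption chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 70% of rows in class 0 (illustration only)
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.7, 0.3], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Baseline: always predicts the most frequent class seen in training
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
dummy_acc = accuracy_score(y_test, dummy.predict(X_test))

# Simple real model for comparison
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
model_acc = accuracy_score(y_test, model.predict(X_test))

print(f"Dummy baseline accuracy: {dummy_acc:.3f}")
print(f"Logistic regression:     {model_acc:.3f}")
```

The dummy score equals the share of the majority class in the test set, so a model that does not clearly beat it has learned little from the features.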
Common mistakes to avoid
Quick cheatsheet
train_test_split() -> Create holdout set before training
model.fit() -> Train model on training data
model.predict() -> Generate predicted labels/values
classification_report() -> Precision/recall/F1 breakdown
cross_val_score() -> Cross-validated performance