Chapter 34 — Deep Learning

Deep Learning Fundamentals

When (and when not) to use neural networks. Neurons, training loop, the main architectures, transfer learning, and the practical knobs that decide success.

Deep learning dominates images, audio, and language, but it usually loses to gradient-boosted trees on tabular data. Knowing when to reach for it is more valuable than knowing how to build a network.

34.1 When to use deep learning

decision tree

Your data is...
│
├── Tabular / structured ────────► Use XGBoost/LightGBM (usually wins)
├── Images ────────────────────► CNN / vision transformer
├── Text / sequence ───────────► Transformer (Ch 22, 33)
├── Audio / speech ────────────► CNN / RNN / transformer
└── Tiny dataset, any type ────► Classical ML — DL will overfit

34.2 A neuron & a network

forward pass

inputs x ──► weighted sum (w·x + b) ──► activation (ReLU) ──► output
   stack many neurons → layer
   stack many layers  → deep network
   final layer: softmax (classify) or linear (regress)
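
The forward pass above can be sketched in plain Python, without a framework. This is a minimal illustration, not how a library implements it: the helper names (`linear`, `dense_relu`, `softmax`) and all weights are made up for the example.

```python
import math

def linear(x, weights, biases):
    """Weighted sum w·x + b for each neuron in the layer."""
    return [sum(wi * xi for wi, xi in zip(w, x)) + b
            for w, b in zip(weights, biases)]

def dense_relu(x, weights, biases):
    """A hidden layer: weighted sums followed by ReLU."""
    return [max(0.0, z) for z in linear(x, weights, biases)]

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical input and weights: 2 inputs -> 2 hidden neurons -> 2 classes.
x = [1.0, -2.0]
hidden = dense_relu(x, weights=[[0.5, 0.25], [-1.0, 0.5]], biases=[0.1, 0.0])
logits = linear(hidden, weights=[[1.0, 1.0], [-1.0, 1.0]], biases=[0.0, 0.0])
probs = softmax(logits)  # final layer for classification
```

The second hidden neuron has a negative weighted sum, so ReLU clamps its output to zero. For regression you would return `logits` directly instead of applying `softmax`.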

34.3 The training loop

Forward pass
Compute loss
Backprop gradients
Optimizer step
Repeat (epochs)

python

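# PyTorch training loop skeleton
# Assumes model, loader, criterion, optimizer, and EPOCHS are defined elsewhere.
model.train()                           # enable training behavior (dropout, batch norm)
for epoch in range(EPOCHS):
    for xb, yb in loader:
        optimizer.zero_grad()           # clear gradients from the previous step
        pred = model(xb)                # forward pass
        loss = criterion(pred, yb)      # compute loss
        loss.backward()                 # backprop: compute gradients
        optimizer.step()                # update weights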
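
To see what each step does underneath, here is a framework-free sketch of the same loop for a one-weight model. It assumes toy data generated from y = 2x, a squared-error loss, and plain gradient descent with a hand-derived gradient. The data and learning rate are made up for illustration.

```python
# Toy data from y = 2x, so the ideal weight is 2.0.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

w = 0.0      # the single trainable weight
lr = 0.05    # learning rate
losses = []  # mean loss per epoch

for epoch in range(50):
    epoch_loss = 0.0
    for x, y in data:
        pred = w * x                 # forward pass
        loss = (pred - y) ** 2       # compute loss (squared error)
        grad = 2 * (pred - y) * x    # backprop by hand: d(loss)/dw
        w -= lr * grad               # optimizer step (plain SGD)
        epoch_loss += loss
    losses.append(epoch_loss / len(data))
```

In PyTorch, `loss.backward()` computes `grad` for every parameter automatically, and `optimizer.step()` applies the update.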

34.4 Architectures by data type

Architecture	Best for	Why
MLP (dense)	Generic vectors	Fully connected baseline
CNN	Images, grids	Learns local spatial patterns
RNN / LSTM	Sequences (legacy)	Carries state over time
Transformer	Text, increasingly vision/audio	Attention captures long-range context
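
One way to see why a CNN suits images is to count parameters. The sketch below compares a dense layer and a 3×3 convolution on a hypothetical 32×32 RGB input with 64 output units or channels. The sizes are assumptions chosen for illustration.

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Every unit has one weight per input, plus one bias."""
    return n_inputs * n_units + n_units

def conv2d_params(in_channels: int, out_channels: int, kernel_size: int) -> int:
    """Each output channel has one small kernel, reused at every position, plus one bias."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels

image_inputs = 32 * 32 * 3                   # flattened 32x32 RGB image
dense_count = dense_params(image_inputs, 64)
conv_count = conv2d_params(in_channels=3, out_channels=64, kernel_size=3)

print(dense_count)  # 196672
print(conv_count)   # 1792
```

The convolution needs roughly 100 times fewer parameters because it applies the same local filter across the whole image instead of learning a separate weight for every pixel.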

34.5 Transfer learning — don't train from scratch

Pretrained models have already learned general features from huge datasets. Fine-tuning one on your small dataset costs a fraction of training from scratch and usually gives better accuracy.

From scratch

Needs massive data + compute, easily overfits a small set, slow.

Fine-tune pretrained

Start from ResNet, BERT, or a similar model and adapt the last layers to your task. In favorable cases training time drops from days to minutes, and small datasets become workable.
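
A common way to adapt the last layers is to freeze the pretrained backbone and train only a new head. The sketch below shows that pattern in PyTorch. The tiny `backbone` is a placeholder standing in for real pretrained weights so the example runs without downloading anything, and the layer sizes and 3-class task are hypothetical.

```python
import torch
from torch import nn

# Placeholder for a pretrained backbone (in practice, load real weights).
backbone = nn.Sequential(
    nn.Linear(16, 32), nn.ReLU(),
    nn.Linear(32, 8), nn.ReLU(),
)

# 1. Freeze the backbone so its weights are not updated.
for p in backbone.parameters():
    p.requires_grad = False

# 2. Attach a fresh head sized for the new task (3 classes, hypothetical).
head = nn.Linear(8, 3)
model = nn.Sequential(backbone, head)

# 3. Give the optimizer only the parameters that are still trainable.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)

# One training step on random placeholder data.
xb = torch.randn(4, 16)
yb = torch.tensor([0, 1, 2, 0])
frozen_before = backbone[0].weight.clone()

optimizer.zero_grad()
loss = nn.functional.cross_entropy(model(xb), yb)
loss.backward()
optimizer.step()
```

Only the head's weight and bias receive gradients here. Once the head has settled, some workflows unfreeze part of the backbone and continue with a lower learning rate.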

34.6 The knobs that decide success

Knob	Effect	Guidance
Learning rate	Most important hyperparameter	Use a finder / warmup; too high diverges, too low trains slowly
Batch size	Stability vs speed	As large as memory allows; retune the learning rate if you change it
Regularization	Fights overfitting	Dropout, weight decay, early stopping
Epochs + early stop	Under/overfitting	Stop when validation loss stops improving
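
Warmup, mentioned in the learning-rate row, can be as simple as a linear ramp. The function below is a minimal sketch: the name `warmup_lr` and the step counts are hypothetical, and real schedules often add decay after the ramp.

```python
def warmup_lr(step: int, base_lr: float, warmup_steps: int) -> float:
    """Linear warmup: ramp up to base_lr over warmup_steps, then hold it constant."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

# Learning rate for the first 8 optimizer steps with a 5-step warmup.
schedule = [warmup_lr(s, base_lr=1e-3, warmup_steps=5) for s in range(8)]
```

Starting small keeps the first updates, taken on randomly initialized or freshly attached layers, from pushing the loss into divergence.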

Deep nets overfit small data fast. Use a validation set + early stopping, augment data where possible, and don't add depth you don't need.
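
Early stopping takes only a few lines to write by hand. The sketch below assumes you compute one validation loss per epoch. The class name `EarlyStopper`, the `patience` value, and the loss values are illustrative, not taken from a specific library.

```python
class EarlyStopper:
    """Signal a stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta   # smallest change that counts as an improvement
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss: float) -> bool:
        if val_loss < self.best - self.min_delta:
            self.best = val_loss     # new best: reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1     # no improvement this epoch
        return self.bad_epochs >= self.patience

# Made-up validation losses: improving at first, then rising (overfitting).
val_losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.70, 0.80]
stopper = EarlyStopper(patience=2)
stopped_at = None
for epoch, val_loss in enumerate(val_losses):
    if stopper.should_stop(val_loss):
        stopped_at = epoch
        break

print(stopped_at, stopper.best)  # 4 0.6
```

In a real run you would also save a checkpoint whenever `best` improves, then restore it after stopping so you keep the weights from the best epoch instead of the last one.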

Professional recommendation

Tabular: Trees first, not DL

Images/text: Fine-tune a pretrained model

Top knob: Tune learning rate

Small data: Augment + early stop

Common mistakes to avoid

Using deep learning on small tabular data where trees win
Training from scratch instead of fine-tuning a pretrained model
Not tuning the learning rate (the highest-impact knob)
No early stopping — training into severe overfitting
Reporting train accuracy and ignoring the validation gap

Quick cheatsheet

loss.backward(); optimizer.step() -> training step

transfer learning -> fine-tune pretrained

ReLU + Adam -> sane defaults

EarlyStopping -> stop on val loss rise

dropout / weight decay -> regularize