Chapter 34 — Deep Learning

Deep Learning Fundamentals

When (and when not) to use neural networks. Neurons, training loop, the main architectures, transfer learning, and the practical knobs that decide success.

Deep learning dominates images, audio, and language — but it usually loses to gradient-boosted trees on tabular data. Knowing when to reach for it is more valuable than knowing how to build one.
34.1 When to use deep learning
decision tree
Your data is...
│
├── Tabular / structured ────────► Use XGBoost/LightGBM (usually wins)
├── Images ────────────────────► CNN / vision transformer
├── Text / sequence ───────────► Transformer (Ch 22, 33)
├── Audio / speech ────────────► CNN / RNN / transformer
└── Tiny dataset, any type ────► Classical ML — DL will overfit
34.2 A neuron & a network
forward pass
inputs x ──► weighted sum (w·x + b) ──► activation (ReLU) ──► output
   stack many neurons → layer
   stack many layers  → deep network
   final layer: softmax (classify) or linear (regress)
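
To make the diagram concrete, here is a minimal pure-Python sketch of one hidden layer of ReLU neurons followed by a linear output layer and softmax. The function names (`weighted_sum`, `neuron`, `softmax`) and all weights are made up for illustration; a real network would learn the weights and use a tensor library.

```python
import math

def weighted_sum(x, w, b):
    """Compute w·x + b for one neuron."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def neuron(x, w, b):
    """Weighted sum followed by a ReLU activation."""
    return max(0.0, weighted_sum(x, w, b))

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

x = [1.0, 2.0]                               # inputs

# Hidden layer: two neurons reading the same inputs (illustrative weights)
W1, b1 = [[0.5, -1.0], [1.0, 1.0]], [0.0, -1.0]
hidden = [neuron(x, w, b) for w, b in zip(W1, b1)]

# Output layer: linear scores, then softmax for classification
W2, b2 = [[1.0, 0.5], [-1.0, 1.0]], [0.0, 0.0]
logits = [weighted_sum(hidden, w, b) for w, b in zip(W2, b2)]
probs = softmax(logits)
```

The first hidden neuron sees a negative weighted sum, so ReLU clamps it to zero; only the second neuron contributes to the output scores.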
34.3 The training loop
python
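# PyTorch training loop skeleton
import torch

# Placeholders so the skeleton runs; swap in your own model and DataLoader
model = torch.nn.Linear(10, 2)
loader = [(torch.randn(8, 10), torch.randint(0, 2, (8,)))]
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
EPOCHS = 3
for epoch in range(EPOCHS):
    for xb, yb in loader:
        optimizer.zero_grad()           # clear gradients from the previous step
        pred = model(xb)                # forward pass
        loss = criterion(pred, yb)      # compare predictions with targets
        loss.backward()                 # compute gradients
        optimizer.step()                # update weights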
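
For intuition about what `zero_grad()`, `loss.backward()` and `optimizer.step()` do in the loop above, the sketch below hand-codes the same three steps for a one-feature linear model with mean squared error and plain gradient descent. The toy data, `lr`, and the epoch count are assumptions for illustration, and the whole dataset is treated as a single batch; PyTorch derives the gradients automatically and usually pairs them with a more elaborate optimizer.

```python
# Toy data generated from y = 2x + 1 (illustrative values)
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = 0.0, 0.0      # parameters to learn
lr = 0.05            # learning rate (assumed value)
n = len(xs)

for epoch in range(2000):
    # "zero_grad": start each step with fresh gradients
    grad_w, grad_b = 0.0, 0.0
    for x, y in zip(xs, ys):
        pred = w * x + b                 # forward pass
        err = pred - y
        # "backward": gradient of the mean squared error w.r.t. w and b
        grad_w += 2 * err * x / n
        grad_b += 2 * err / n
    # "step": move the parameters against the gradient
    w -= lr * grad_w
    b -= lr * grad_b
```

After training, `w` and `b` sit very close to the slope and intercept that generated the data.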
34.4 Architectures by data type
Architecture | Best for                        | Why
MLP (dense)  | Generic vectors                 | Fully connected baseline
CNN          | Images, grids                   | Learns local spatial patterns
RNN / LSTM   | Sequences (legacy)              | Carries state over time
Transformer  | Text, increasingly vision/audio | Attention captures long-range context
34.5 Transfer learning — don't train from scratch

Pretrained models have already learned general features from huge datasets. Fine-tuning one on your small dataset costs a fraction of training from scratch and usually gives far better accuracy.

From scratch
Needs massive data + compute, easily overfits a small set, slow.
Fine-tune pretrained
Start from ResNet/BERT/etc., adapt the last layers to your task. Days → minutes, small data works.
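
A minimal PyTorch sketch of that recipe: freeze the pretrained layers and train only a new final layer. To stay self-contained, `backbone` here is a small randomly initialized stand-in, not a real pretrained network; in practice you would load a ResNet, BERT, or similar checkpoint in its place. The layer sizes, `num_classes`, and the random batch are assumed values.

```python
import torch
from torch import nn

# Stand-in for a pretrained feature extractor (hypothetical; real code would
# load pretrained weights here instead of random ones)
backbone = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 16), nn.ReLU())

# Freeze the backbone so fine-tuning does not update its weights
for p in backbone.parameters():
    p.requires_grad = False

num_classes = 3                       # assumed number of classes in the new task
head = nn.Linear(16, num_classes)     # new task-specific final layer
model = nn.Sequential(backbone, head)

# Hand the optimizer only the parameters that are still trainable
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)

# One training step on a random placeholder batch
xb = torch.randn(8, 32)
yb = torch.randint(0, num_classes, (8,))
frozen_before = backbone[0].weight.clone()

optimizer.zero_grad()
loss = nn.functional.cross_entropy(model(xb), yb)
loss.backward()
optimizer.step()
```

Only the head's weight and bias receive gradients, which is why this kind of fine-tuning is cheap. A common next step, once the head has settled, is to unfreeze some backbone layers and continue with a lower learning rate.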
34.6 The knobs that decide success
Knob                | Effect                        | Guidance
Learning rate       | Most important hyperparameter | Use a finder / warmup; too high diverges
Batch size          | Stability vs speed            | As large as memory allows
Regularization      | Fights overfitting            | Dropout, weight decay, early stopping
Epochs + early stop | Under/overfitting             | Stop when val loss rises
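
The "too high diverges" guidance for the learning rate can be seen on the simplest possible loss, f(w) = w², whose gradient is 2w. The helper name `descend` and the two learning rates are chosen purely for illustration.

```python
def descend(lr, steps=20, w=1.0):
    """Run plain gradient descent on f(w) = w**2 and return the final |w|."""
    for _ in range(steps):
        grad = 2 * w          # derivative of w**2
        w = w - lr * grad     # gradient descent update
    return abs(w)

small = descend(lr=0.1)   # each step multiplies w by 0.8, so w shrinks toward 0
large = descend(lr=1.1)   # each step multiplies w by -1.2, so |w| keeps growing
```

Real loss surfaces are far less regular, but the same mechanism applies: past a certain step size, each update overshoots the minimum by more than it corrected.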
Deep nets overfit small data fast. Use a validation set + early stopping, augment data where possible, and don't add depth you don't need.
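
Early stopping comes down to a small amount of bookkeeping. The sketch below is one common patience-based variant; the class name `EarlyStopper`, the `patience` value, and the validation losses are hypothetical, and libraries such as Keras and PyTorch Lightning ship their own callbacks with more options.

```python
class EarlyStopper:
    """Signal a stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=2):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss      # new best: reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1      # no improvement this epoch
        return self.bad_epochs >= self.patience

# Made-up validation losses: they improve, then start rising (overfitting)
val_losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.70]

stopper = EarlyStopper(patience=2)
stopped_at = None
for epoch, val_loss in enumerate(val_losses):
    if stopper.should_stop(val_loss):
        stopped_at = epoch
        break
```

In a real training loop you would also save the model weights whenever a new best appears and restore them after stopping, so that you keep the best checkpoint and not the last one.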

Professional recommendation

Tabular: Trees first, not DL
Images/text: Fine-tune a pretrained model
Top knob: Tune learning rate
Small data: Augment + early stop
Quick cheatsheet
optimizer.zero_grad(); loss.backward(); optimizer.step() -> training step
transfer learning -> fine-tune pretrained
ReLU + Adam -> sane defaults
early stopping (e.g. an EarlyStopping callback in Keras or Lightning) -> stop on val loss rise
dropout / weight decay -> regularize