Chapter 34 — Deep Learning
Deep Learning Fundamentals
When (and when not) to use neural networks. Neurons, training loop, the main architectures, transfer learning, and the practical knobs that decide success.
Deep learning dominates images, audio, and language, but it usually loses to gradient-boosted trees on tabular data. Knowing when to reach for it is more valuable than knowing how to build a network.
34.1 When to use deep learning
decision tree
Your data is...
│
├── Tabular / structured ────────► Use XGBoost/LightGBM (usually wins)
├── Images ────────────────────► CNN / vision transformer
├── Text / sequence ───────────► Transformer (Ch 22, 33)
├── Audio / speech ────────────► CNN / RNN / transformer
└── Tiny dataset, any type ────► Classical ML (DL will overfit)
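The decision tree can also be written as a small lookup, which is handy as a checklist in a project template. This is only a sketch: the function name `suggest_model_family` and the `tiny_threshold` cutoff are hypothetical, since the text gives no number for what counts as a tiny dataset.

```python
def suggest_model_family(data_type: str, n_samples: int, tiny_threshold: int = 1000) -> str:
    """Return the starting point suggested by the decision tree.

    tiny_threshold is an assumed cutoff for "tiny dataset"; the chapter
    does not state one, so adjust it to your problem.
    """
    # The "tiny dataset, any type" branch takes priority over the data type.
    if n_samples < tiny_threshold:
        return "classical ML"

    suggestions = {
        "tabular": "gradient-boosted trees (XGBoost/LightGBM)",
        "image": "CNN or vision transformer",
        "text": "transformer",
        "audio": "CNN, RNN, or transformer",
    }
    if data_type not in suggestions:
        raise ValueError(f"unknown data type: {data_type!r}")
    return suggestions[data_type]


print(suggest_model_family("tabular", n_samples=50_000))
print(suggest_model_family("image", n_samples=200))
```

Treat the result as a first thing to try, not a verdict. A quick baseline comparison on your own validation set is still the deciding test.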
34.2 A neuron & a network
forward pass
inputs x ──► weighted sum (w·x + b) ──► activation (ReLU) ──► output

stack many neurons → layer
stack many layers → deep network
final layer: softmax (classify) or linear (regress)
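The forward pass is small enough to write out in plain Python. The sketch below builds one hidden layer of ReLU neurons, a linear output layer, and a softmax. All weights, biases, and inputs are made-up illustrative values, and the helper names are not from any library.

```python
import math


def linear(x, w, b):
    """Weighted sum w·x + b for one neuron."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def relu(z):
    """ReLU activation: pass positives through, clamp negatives to zero."""
    return max(0.0, z)


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative values only.
x = [1.0, 2.0]
W1, b1 = [[0.5, -1.0], [1.0, 1.0]], [0.0, -1.0]   # hidden layer: 2 neurons
W2, b2 = [[1.0, 0.5], [-1.0, 1.0]], [0.0, 0.0]    # output layer: 2 classes

# Hidden layer: weighted sum followed by ReLU, one value per neuron.
hidden = [relu(linear(x, w, b)) for w, b in zip(W1, b1)]
# Output layer: linear only; softmax converts the logits to class probabilities.
logits = [linear(hidden, w, b) for w, b in zip(W2, b2)]
probs = softmax(logits)

print(hidden, logits, probs)
```

For regression you would stop at the linear output and skip the softmax.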
34.3 The training loop
- Forward pass
- Compute loss
- Backprop gradients
- Optimizer step
- Repeat (epochs)
python
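# PyTorch training loop skeleton
# (assumes model, loader, criterion, optimizer, and EPOCHS are already defined)
for epoch in range(EPOCHS):
    for xb, yb in loader:
        optimizer.zero_grad()        # clear gradients from the previous step
        pred = model(xb)             # forward pass
        loss = criterion(pred, yb)   # compute loss
        loss.backward()              # compute gradients
        optimizer.step()             # update weights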
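To see what each step does without a framework, here is a minimal sketch that fits a line with hand-derived gradients. The toy data, learning rate, and epoch count are assumptions chosen for illustration. It uses full-batch gradient descent, whereas the PyTorch skeleton iterates over mini-batches.

```python
# Toy data generated from y = 2x + 1 (illustrative only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

w, b = 0.0, 0.0          # model parameters, initialised at zero
LEARNING_RATE = 0.05     # assumed value; real problems need tuning
EPOCHS = 2000
n = len(xs)

for epoch in range(EPOCHS):
    # 1. Forward pass: predictions from the current parameters.
    preds = [w * x + b for x in xs]

    # 2. Loss: mean squared error.
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / n

    # 3. Gradients of the loss with respect to w and b
    #    (this is what loss.backward() computes automatically).
    grad_w = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
    grad_b = sum(2 * (p - y) for p, y in zip(preds, ys)) / n

    # 4. Optimizer step: plain gradient descent update.
    w -= LEARNING_RATE * grad_w
    b -= LEARNING_RATE * grad_b

print(f"w={w:.3f}, b={b:.3f}, loss={loss:.6f}")
```

The parameters should approach the values used to generate the data. Raising `LEARNING_RATE` well beyond this setting makes the updates overshoot and diverge, which is the failure mode the learning-rate guidance warns about.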
34.4 Architectures by data type
| Architecture | Best for | Why |
|---|---|---|
| MLP (dense) | Generic vectors | Fully connected baseline |
| CNN | Images, grids | Learns local spatial patterns |
| RNN / LSTM | Sequences (legacy) | Carries state over time |
| Transformer | Text, increasingly vision/audio | Attention captures long-range context |
34.5 Transfer learning — don't train from scratch
Pretrained models have already learned general features from huge datasets. Fine-tuning one on your small dataset typically costs a fraction of training from scratch and usually gives better accuracy.
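A common way to fine-tune is to freeze the pretrained layers and train only a new output head. The sketch below shows that mechanic in PyTorch. The `backbone` here is a small randomly initialised stand-in, not a real pretrained model, and `NUM_CLASSES`, the layer sizes, and the random batch are assumptions for illustration.

```python
import torch
from torch import nn

# Hypothetical stand-in for a pretrained backbone. In practice you would
# load real pretrained weights (for example a ResNet or BERT checkpoint).
backbone = nn.Sequential(
    nn.Linear(32, 64),
    nn.ReLU(),
    nn.Linear(64, 16),
    nn.ReLU(),
)

# Freeze the backbone so its weights are not updated during training.
for param in backbone.parameters():
    param.requires_grad = False

# New task-specific head; NUM_CLASSES is an assumed value.
NUM_CLASSES = 3
head = nn.Linear(16, NUM_CLASSES)
model = nn.Sequential(backbone, head)

# Hand only the trainable parameters (the head) to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One training step on random stand-in data.
xb = torch.randn(8, 32)
yb = torch.randint(0, NUM_CLASSES, (8,))
optimizer.zero_grad()
loss = criterion(model(xb), yb)
loss.backward()
optimizer.step()
```

Once the head has settled, a typical next step is to unfreeze some of the later backbone layers and continue with a smaller learning rate, though whether that helps depends on the dataset.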
From scratch
Needs massive data and compute, overfits a small dataset easily, and trains slowly.
Fine-tune pretrained
Start from ResNet/BERT/etc. and adapt the last layers to your task. Training time can drop from days to minutes, and small datasets become workable.
34.6 The knobs that decide success
| Knob | Effect | Guidance |
|---|---|---|
| Learning rate | Most important hyperparameter | Use an LR finder and warmup; too high diverges, too low trains slowly |
| Batch size | Stability vs speed | Start as large as memory allows; retune the learning rate when you change it |
| Regularization | Fights overfitting | Dropout, weight decay, early stopping |
| Epochs + early stop | Under/overfitting | Stop when validation loss stops improving |
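The table recommends warmup for the learning rate. One simple form is a linear ramp from a small value up to the target rate, then a constant rate. The sketch below assumes that shape; the function name `lr_at_step` and the numbers are illustrative, and many real schedules add a decay phase afterwards.

```python
def lr_at_step(step: int, base_lr: float, warmup_steps: int) -> float:
    """Linear warmup to base_lr over warmup_steps, then hold base_lr."""
    if warmup_steps <= 0 or step >= warmup_steps:
        return base_lr
    # step is 0-indexed, so the final warmup step reaches base_lr exactly.
    return base_lr * (step + 1) / warmup_steps


# Learning rate for the first six optimizer steps with a 4-step warmup.
schedule = [lr_at_step(s, base_lr=1e-3, warmup_steps=4) for s in range(6)]
print(schedule)
```

Starting small avoids the large, destabilising updates that a full-size learning rate can cause while the weights are still near their random initialisation.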
Deep nets overfit small data fast. Use a validation set + early stopping, augment data where possible, and don't add depth you don't need.
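Early stopping is usually implemented with a patience counter, so one noisy epoch does not end training too early. The class below is a minimal sketch, not a library API: the name `EarlyStopping`, its parameters, and the validation losses are all made up for illustration.

```python
class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience      # epochs to wait without improvement
        self.min_delta = min_delta    # minimum decrease that counts as improvement
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss: float) -> bool:
        if val_loss < self.best_loss - self.min_delta:
            # New best: remember it and reset the counter.
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Illustrative validation losses: improvement stalls after epoch 2.
val_losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.61, 0.70]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, val_loss in enumerate(val_losses):
    if stopper.should_stop(val_loss):
        stopped_at = epoch
        break

print(stopped_at, stopper.best_loss)
```

In a real training loop you would also save the model weights whenever `best_loss` improves and restore that checkpoint after stopping, so the final model is the best one rather than the last one.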
Professional recommendation
- Tabular: Trees first, not DL
- Images/text: Fine-tune a pretrained model
- Top knob: Tune learning rate
- Small data: Augment + early stop
Common mistakes to avoid
- Using deep learning on small tabular data where trees win
- Training from scratch instead of fine-tuning a pretrained model
- Not tuning the learning rate (the highest-impact knob)
- No early stopping — training into severe overfitting
- Reporting train accuracy and ignoring the validation gap
Quick cheatsheet
- loss.backward(); optimizer.step() -> training step
- transfer learning -> fine-tune pretrained
- ReLU + Adam -> sane defaults
- EarlyStopping -> stop on val loss rise
- dropout / weight decay -> regularize