Chapter 33 — NLP
NLP Fundamentals
Working with text: preprocessing, representations from bag-of-words to embeddings, the core tasks, and choosing between classical ML and transformers.
Most enterprise data is unstructured text: tickets, reviews, emails, docs. NLP turns it into something you can classify, search, and summarize. This chapter bridges classical ML and the LLM chapter (Chapter 22).
33.1 The text pipeline
- Clean
- Tokenize
- Normalize
- Represent (vectors)
- Model
- Evaluate
33.2 Preprocessing
| Step | Does | Use when |
|---|---|---|
| Lowercasing | Unifies case | Most tasks (not NER) |
| Tokenization | Split into words/subwords | Always |
| Stopword removal | Drop "the", "is" | Bag-of-words; skip for transformers |
| Stemming/Lemmatization | "running"→"run" | Classical models; skip for transformers |
Modern transformer models do their own subword tokenization and need little manual preprocessing — heavy stemming/stopword removal can even hurt them. Heavy cleaning is mainly for bag-of-words / TF-IDF.
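To make the table concrete, here is a minimal standard-library sketch of the classical cleaning steps. The stopword set and the `naive_stem` helper are toy assumptions for illustration only; a real project would more likely use a full stopword list and a proper stemmer or lemmatizer (for example from NLTK or spaCy).

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "is", "a", "an", "and", "of", "to", "in", "are"}

# Toy suffix list standing in for a real stemmer such as Porter.
SUFFIXES = ("ing", "ed", "s")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def naive_stem(token):
    """Strip one common suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            stem = token[: -len(suffix)]
            # Collapse a doubled final consonant: "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return token


def preprocess(text, remove_stopwords=True, stem=True):
    """Classical cleaning: tokenize, then optionally drop stopwords and stem."""
    tokens = tokenize(text)
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    if stem:
        tokens = [naive_stem(t) for t in tokens]
    return tokens


classical = preprocess("The dogs are running in the park")
light = preprocess("The dogs are running in the park",
                   remove_stopwords=False, stem=False)
```

Here `classical` is the heavily cleaned form suited to bag-of-words features, while `light` keeps every word. For a transformer you would typically skip even this and pass the raw string to the model's own tokenizer.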
33.3 Representations — text to numbers
evolution of text features
Bag-of-Words ──► counts, ignores order
│
TF-IDF ─────────► weights rare informative words
│
Word2Vec/GloVe ─► dense word vectors, capture meaning
│
Transformer embeddings ─► context-aware (BERT, sentence-transformers)

```python
# Strong classical baseline: TF-IDF + linear model
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

clf = make_pipeline(
    # Unigrams and bigrams; ignore terms seen in fewer than 5 documents.
    TfidfVectorizer(ngram_range=(1, 2), min_df=5, stop_words='english'),
    LogisticRegression(max_iter=1000),
)
# train_texts: list of raw strings; train_labels: one label per text.
clf.fit(train_texts, train_labels)
```
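To show what "weights rare informative words" means in the progression above, here is a small sketch that computes TF-IDF by hand using the textbook form tf × log(N / df). The three example documents are made up. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from this sketch even though the ranking intuition is the same.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Uses the textbook form: (count / doc length) * log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights


# Made-up support tickets, already tokenized by whitespace.
docs = [
    "refund my order please".split(),
    "cancel my order".split(),
    "where is my refund".split(),
]
w = tfidf(docs)
```

The word "my" appears in every document, so its weight is zero. "please" appears in only one document, so it outweighs "refund" and "order", which each appear in two.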
33.4 Core NLP tasks
| Task | Example | Typical approach |
|---|---|---|
| Text classification | Spam, sentiment, topic | TF-IDF+LR baseline → fine-tuned BERT |
| Named entity recognition | Extract names, dates, orgs | spaCy / token-classification model |
| Semantic search | Find similar docs | Sentence embeddings + vector DB |
| Summarization / Q&A | Condense, answer | LLM / seq2seq (see Ch 22) |
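The semantic search row comes down to ranking documents by vector similarity. The sketch below shows that ranking step with cosine similarity. The 3-dimensional vectors and document names are hypothetical stand-ins; in practice the vectors would come from a sentence-embedding model and the lookup would usually be handled by a vector database.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def search(query_vec, doc_vecs, top_k=2):
    """Return (doc_id, score) pairs, most similar first."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical 3-d vectors; real sentence embeddings have hundreds of dimensions.
doc_vecs = {
    "billing_faq": [0.9, 0.1, 0.0],
    "shipping_faq": [0.1, 0.9, 0.1],
    "returns_policy": [0.6, 0.5, 0.1],
}
query_vec = [1.0, 0.2, 0.0]  # stands in for an embedded billing question
results = search(query_vec, doc_vecs)
```

A brute-force scan like this is fine for a few thousand documents. Larger collections are where an approximate nearest-neighbour index or vector DB starts to pay off.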
33.5 Classical vs transformer
which to use?
Text task: pick an approach
│
├── Small data, need speed/cheap ──► TF-IDF + LogisticRegression
├── Need top accuracy, have GPU ───► Fine-tune BERT/DistilBERT
├── Similarity / search ───────────► Sentence embeddings
└── Generation / reasoning ────────► LLM (Chapter 22)
Professional recommendation
Always build the TF-IDF + linear baseline first. It takes minutes, often reaches 80–90% of the accuracy a fine-tuned transformer would, and tells you whether the problem is learnable at all. Reach for a fine-tuned transformer only when that baseline isn't good enough and the accuracy gain is worth the cost and latency.
33.6 Evaluating text models
- Classification: precision/recall/F1 per class (text data is often imbalanced)
- Always inspect misclassified examples — labels are often the problem
- Hold out by document/author, not random rows, to avoid leakage
- Watch for shortcut features (a template phrase that gives away the label)
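The first point is easiest to see with numbers. The sketch below computes per-class precision, recall, and F1 from scratch on made-up, imbalanced labels. In a real project `sklearn.metrics.classification_report` reports the same per-class quantities, and the helper name `per_class_report` here is just an illustration.

```python
from collections import Counter


def per_class_report(y_true, y_pred):
    """Return {label: {"precision", "recall", "f1", "support"}}."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gets a false positive
            fn[truth] += 1  # true label gets a false negative
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        report[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": r_den,
        }
    return report


# Made-up imbalanced data: 8 "ham" messages, 2 "spam".
y_true = ["ham"] * 8 + ["spam"] * 2
y_pred = ["ham"] * 8 + ["ham", "spam"]

report = per_class_report(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Overall accuracy is 0.9, which looks healthy, yet recall on "spam" is only 0.5: half of the minority class is missed. That gap is why per-class metrics matter on imbalanced text data.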
Common mistakes to avoid
- Jumping to a transformer before trying a TF-IDF baseline
- Over-cleaning text fed to transformers (they want raw-ish input)
- Random splitting when the same author/template appears in train and test
- Ignoring class imbalance in sentiment/topic tasks
- Trusting accuracy without reading the actual misclassifications
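The random-splitting mistake can be avoided by splitting on groups instead of rows. Below is a minimal standard-library sketch: the `group_split` helper and the author ids are hypothetical. scikit-learn offers the same idea through `GroupShuffleSplit` and `GroupKFold` in `sklearn.model_selection`, which are likely the better choice in a real pipeline.

```python
import random


def group_split(groups, test_fraction=0.2, seed=0):
    """Split row indices so every group lands wholly in train or in test."""
    unique = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique)
    # Hold out whole groups, always at least one.
    n_test = max(1, round(len(unique) * test_fraction))
    test_groups = set(unique[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx


# Hypothetical author ids, one per document.
authors = ["ann", "ann", "bob", "cara", "cara", "cara", "dev", "eli", "eli", "bob"]
train_idx, test_idx = group_split(authors, test_fraction=0.4)

train_authors = {authors[i] for i in train_idx}
test_authors = {authors[i] for i in test_idx}
```

Because `train_authors` and `test_authors` never overlap, the model cannot score well simply by recognising an author's style or a shared template. Note that the fraction applies to groups, so the share of rows in the test set can differ from `test_fraction` when group sizes are uneven.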
Quick cheatsheet
TfidfVectorizer(ngram_range=(1,2)) -> strong baseline features
spaCy -> tokenize, NER, POS
sentence-transformers -> semantic embeddings
HuggingFace Trainer -> fine-tune BERT
classification_report -> per-class P/R/F1