Chapter 33 — NLP

NLP Fundamentals

Working with text: preprocessing, representations from bag-of-words to embeddings, the core tasks, and choosing between classical ML and transformers.

Most enterprise data is unstructured text: tickets, reviews, emails, docs. NLP turns it into something you can classify, search, and summarize. This chapter bridges classical ML and the LLM chapter (Chapter 22).

33.1 The text pipeline

Clean
Tokenize
Normalize
Represent (vectors)
Model
Evaluate

33.2 Preprocessing

Step	What it does	Use when
Lowercasing	Unifies case	Most tasks (not NER)
Tokenization	Split into words/subwords	Always
Stopword removal	Drop "the", "is"	Bag-of-words; skip for transformers
Stemming/Lemmatization	"running"→"run"	Classical models; skip for transformers

Modern transformer models do their own subword tokenization and need little manual preprocessing — heavy stemming/stopword removal can even hurt them. Heavy cleaning is mainly for bag-of-words / TF-IDF.
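As a rough illustration of the classical cleaning steps in the table above, here is a minimal sketch using only the standard library. The stopword set and the `preprocess_for_bow` helper are hypothetical, made up for this example; a real project would more likely use a library's tokenizer and stopword list. This kind of cleaning is meant for bag-of-words / TF-IDF features, not for transformer input.

```python
import re

# Tiny illustrative stopword list (an assumption for this sketch);
# real projects typically rely on a library-provided list.
STOPWORDS = {"the", "is", "a", "an", "and", "of", "to", "in", "it", "was"}


def preprocess_for_bow(text: str) -> list[str]:
    """Lowercase, tokenize on letters/digits/apostrophes, drop stopwords."""
    lowered = text.lower()
    tokens = re.findall(r"[a-z0-9']+", lowered)
    return [tok for tok in tokens if tok not in STOPWORDS]


raw = "The delivery was late and the box is damaged."
tokens = preprocess_for_bow(raw)
print(tokens)
```

For a transformer, you would normally pass `raw` to the model's own tokenizer unchanged instead of running a function like this.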

33.3 Representations — text to numbers

evolution of text features

Bag-of-Words ──► counts, ignores order
   │
TF-IDF ─────────► weights rare informative words
   │
Word2Vec/GloVe ─► dense word vectors, capture meaning
   │
Transformer embeddings ─► context-aware (BERT, sentence-transformers)
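The first two steps of this progression can be worked by hand. The sketch below builds bag-of-words counts and a plain, unsmoothed TF-IDF weight (tf × log(N / df)) on three made-up documents. It is meant to show the idea only: scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ.

```python
import math
from collections import Counter

# Hypothetical mini-corpus for illustration.
docs = [
    "the app crashes on login",
    "the app is slow",
    "the invoice total is wrong",
]
tokenized = [doc.split() for doc in docs]

# Bag-of-words: per-document term counts, word order discarded.
bow = [Counter(tokens) for tokens in tokenized]

# Document frequency: in how many documents does each term appear?
df = Counter(term for tokens in tokenized for term in set(tokens))
n_docs = len(docs)


def tfidf(term: str, doc_index: int) -> float:
    """Plain tf * log(N / df); libraries use smoothed variants."""
    tf = bow[doc_index][term]
    return tf * math.log(n_docs / df[term])


# "the" appears in every document, so its weight collapses to zero;
# "crashes" appears in only one document and keeps the largest weight.
for term in ("the", "app", "crashes"):
    print(term, round(tfidf(term, 0), 3))
```

The ordering of the weights (rare term above common term) is the property that makes TF-IDF a better feature than raw counts for most classification tasks.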

python

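# Strong classical baseline: TF-IDF + linear model
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# train_texts: list of raw strings; train_labels: one label per text
# (both assumed to be loaded elsewhere).
clf = make_pipeline(
    # Unigrams + bigrams; min_df=5 drops terms seen in fewer than 5 documents.
    TfidfVectorizer(ngram_range=(1, 2), min_df=5, stop_words='english'),
    LogisticRegression(max_iter=1000),
)
clf.fit(train_texts, train_labels)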
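To see the baseline end to end, the sketch below fits the same kind of pipeline on a hypothetical eight-example sentiment set and scores two new texts. The data is invented for illustration, and `min_df` is lowered to 1 only because the toy corpus is far too small for `min_df=5`. It assumes scikit-learn is installed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; a real baseline needs far more examples.
train_texts = [
    "great product, love it",
    "works great, very happy",
    "love the fast delivery",
    "happy with the quality",
    "terrible, arrived broken",
    "broken screen, want a refund",
    "awful support, never again",
    "refund requested, terrible quality",
]
train_labels = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg"]

clf = make_pipeline(
    # min_df=1 here only because the corpus is tiny.
    TfidfVectorizer(ngram_range=(1, 2), min_df=1, stop_words="english"),
    LogisticRegression(max_iter=1000),
)
clf.fit(train_texts, train_labels)

new_texts = ["love it, great delivery", "broken and awful, refund please"]
predictions = clf.predict(new_texts)
# One row per text; columns follow the order of clf.classes_.
probabilities = clf.predict_proba(new_texts)
print(list(zip(new_texts, predictions)))
```

Because the vectorizer and the classifier live in one pipeline, `predict` accepts raw strings and applies the fitted vocabulary automatically, which avoids fitting TF-IDF on test data by accident.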

33.4 Core NLP tasks

Task	Example	Typical approach
Text classification	Spam, sentiment, topic	TF-IDF+LR baseline → fine-tuned BERT
Named entity recognition	Extract names, dates, orgs	spaCy / token-classification model
Semantic search	Find similar docs	Sentence embeddings + vector DB
Summarization / Q&A	Condense, answer	LLM / seq2seq (see Ch 22)
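The semantic search row comes down to ranking documents by vector similarity. The sketch below shows that ranking step with cosine similarity over hypothetical 3-dimensional vectors. These hand-written numbers are stand-ins: in practice the vectors would come from a sentence-embedding model, and a vector database would do the nearest-neighbour lookup at scale.

```python
import math

# Hypothetical "embeddings" written by hand for illustration only;
# real sentence embeddings have hundreds of dimensions.
doc_vectors = {
    "How do I reset my password?": [0.9, 0.1, 0.0],
    "Steps to change your login credentials": [0.6, 0.4, 0.2],
    "Quarterly revenue report": [0.0, 0.1, 0.9],
}
query_vector = [0.85, 0.15, 0.05]  # stand-in for an embedded user query


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Rank documents from most to least similar to the query.
ranked = sorted(
    doc_vectors,
    key=lambda doc: cosine(query_vector, doc_vectors[doc]),
    reverse=True,
)
for doc in ranked:
    print(round(cosine(query_vector, doc_vectors[doc]), 3), doc)
```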

33.5 Classical vs transformer

which to use?

Text task — pick approach?
│
├── Small data, speed/cost matter ─► TF-IDF + LogisticRegression
├── Need top accuracy, have GPU ───► Fine-tune BERT/DistilBERT
├── Similarity / search ───────────► Sentence embeddings
└── Generation / reasoning ────────► LLM (Chapter 22)

Professional recommendation

Always build the TF-IDF + linear baseline first: it takes minutes, often reaches 80–90% of the accuracy a fine-tuned transformer would, and tells you whether the problem is learnable at all. Reach for a fine-tuned transformer only when that baseline isn't good enough and the accuracy gain is worth the extra cost and latency.

33.6 Evaluating text models

Classification: precision/recall/F1 per class (text data is often imbalanced)
Always inspect misclassified examples — labels are often the problem
Hold out by document/author, not random rows, to avoid leakage
Watch for shortcut features (a template phrase that gives away the label)
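Holding out by author can be sketched in a few lines: split the set of authors, then assign every row to whichever side its author landed on. The records and the `group_split` helper below are hypothetical. If you already use scikit-learn, `GroupShuffleSplit` or `GroupKFold` in `sklearn.model_selection` cover the same idea.

```python
import random

# Hypothetical records: (author_id, text, label).
records = [
    ("alice", "refund not received", "billing"),
    ("alice", "charged twice this month", "billing"),
    ("bob", "app crashes on start", "bug"),
    ("bob", "crash after the update", "bug"),
    ("carol", "cannot log in", "account"),
    ("dave", "invoice shows wrong total", "billing"),
]


def group_split(rows, test_fraction=0.25, seed=0):
    """Assign whole groups (here: authors) to either train or test."""
    groups = sorted({row[0] for row in rows})
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [row for row in rows if row[0] not in test_groups]
    test = [row for row in rows if row[0] in test_groups]
    return train, test


train_rows, test_rows = group_split(records)
train_authors = {row[0] for row in train_rows}
test_authors = {row[0] for row in test_rows}
# No author appears on both sides, so author-specific phrasing cannot leak.
print(sorted(train_authors), sorted(test_authors))
```

A random row-level split of the same data could put one of Alice's tickets in train and the other in test, letting the model score well by recognizing her wording rather than the topic.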

Common mistakes to avoid

Jumping to a transformer before trying a TF-IDF baseline
Over-cleaning text fed to transformers (they want raw-ish input)
Random splitting when the same author/template appears in train and test
Ignoring class imbalance in sentiment/topic tasks
Trusting accuracy without reading the actual misclassifications
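The last two mistakes are easy to demonstrate. In the sketch below, a hypothetical model that always predicts the majority class reaches 90% accuracy on an invented imbalanced test set while never finding the minority class. The `per_class_prf` helper is written out for illustration; scikit-learn's `classification_report` gives the same per-class view.

```python
from collections import Counter

# Hypothetical imbalanced test set: 9 "ok" tickets, 1 "urgent" ticket.
y_true = ["ok"] * 9 + ["urgent"]
# A degenerate model that always predicts the majority class.
y_pred = ["ok"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_prf(y_true, y_pred):
    """Return {label: (precision, recall, f1)}, using 0.0 when undefined."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = Counter(zip(y_true, y_pred))
    scores = {}
    for label in labels:
        tp = pairs[(label, label)]
        predicted = sum(n for (t, p), n in pairs.items() if p == label)
        actual = sum(n for (t, p), n in pairs.items() if t == label)
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = (precision, recall, f1)
    return scores


scores = per_class_prf(y_true, y_pred)
print("accuracy:", accuracy)
for label, (p, r, f1) in scores.items():
    print(label, round(p, 3), round(r, 3), round(f1, 3))
```

The headline accuracy looks fine, but recall for "urgent" is zero, which is exactly the failure a per-class report surfaces.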

Quick cheatsheet

TfidfVectorizer(ngram_range=(1,2)) -> strong baseline features

spaCy -> tokenize, NER, POS

sentence-transformers -> semantic embeddings

HuggingFace Trainer -> fine-tune BERT

classification_report -> per-class P/R/F1