Chapter 33 — NLP

NLP Fundamentals

Working with text: preprocessing, representations from bag-of-words to embeddings, the core tasks, and choosing between classical ML and transformers.

Much enterprise data is unstructured text: tickets, reviews, emails, docs. NLP turns it into something you can classify, search, and summarize. This chapter bridges classical ML and the LLM chapter (Chapter 22).
33.1 The text pipeline
33.2 Preprocessing
Step                   | What it does              | Use when
Lowercasing            | Unifies case              | Most tasks (not NER)
Tokenization           | Split into words/subwords | Always
Stopword removal       | Drop "the", "is"          | Bag-of-words; skip for transformers
Stemming/Lemmatization | "running" → "run"         | Classical models; skip for transformers
Modern transformer models do their own subword tokenization and need little manual preprocessing — heavy stemming/stopword removal can even hurt them. Heavy cleaning is mainly for bag-of-words / TF-IDF.
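A minimal sketch of the light cleaning that suits a bag-of-words or TF-IDF model, using only the standard library. The `preprocess` function and the tiny `STOPWORDS` set are illustrative assumptions, not part of any library; stemming or lemmatization is left out because it would normally come from a dedicated NLP library.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "is", "a", "an", "and", "of", "to", "in"}

def preprocess(text, remove_stopwords=True):
    """Lowercase, split into word tokens, and optionally drop stopwords."""
    # Lowercasing first means the regex only needs to match a-z and digits.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens

tokens = preprocess("The server is DOWN and the API is timing out")
print(tokens)  # ['server', 'down', 'api', 'timing', 'out']
```

For a transformer you would skip this step and pass the raw string to the model's own tokenizer.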
33.3 Representations — text to numbers
evolution of text features
Bag-of-Words ──► counts, ignores order
   │
TF-IDF ─────────► weights rare informative words
   │
Word2Vec/GloVe ─► dense word vectors, capture meaning
   │
Transformer embeddings ─► context-aware (BERT, sentence-transformers)
python
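# Strong classical baseline: TF-IDF + linear model
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# train_texts: list of raw strings; train_labels: matching class labels
# (both assumed to be defined elsewhere)
clf = make_pipeline(
    # Unigrams + bigrams; ignore terms that appear in fewer than 5 documents
    TfidfVectorizer(ngram_range=(1, 2), min_df=5, stop_words='english'),
    # Raised max_iter gives the solver room to converge on sparse features
    LogisticRegression(max_iter=1000),
)
clf.fit(train_texts, train_labels)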
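To make the step from bag-of-words counts to TF-IDF weights concrete, here is a hand-rolled sketch using the textbook formula tf × log(N / df). The three sample documents and the helper names are made up for illustration. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Made-up support-ticket snippets, already lowercased.
docs = [
    "refund not received",
    "refund request pending",
    "app crashes on login",
]
tokenized = [d.split() for d in docs]

def bag_of_words(tokens):
    """Raw term counts; word order is discarded."""
    return Counter(tokens)

def idf(term, corpus):
    """Textbook inverse document frequency: log(N / df).
    Only call this for terms that occur somewhere in the corpus."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)

def tfidf(tokens, corpus):
    """Weight each term count by how rare the term is across the corpus."""
    counts = bag_of_words(tokens)
    return {term: count * idf(term, corpus) for term, count in counts.items()}

weights = tfidf(tokenized[0], tokenized)
for term, weight in weights.items():
    print(f"{term}: {weight:.3f}")
```

"refund" appears in two of the three documents, so it gets a lower weight than "received", which appears in only one.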
33.4 Core NLP tasks
Task                     | Example                    | Typical approach
Text classification      | Spam, sentiment, topic     | TF-IDF+LR baseline → fine-tuned BERT
Named entity recognition | Extract names, dates, orgs | spaCy / token-classification model
Semantic search          | Find similar docs          | Sentence embeddings + vector DB
Summarization / Q&A      | Condense, answer           | LLM / seq2seq (see Ch 22)
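The semantic search row boils down to ranking documents by how close their embedding is to the query's embedding. The sketch below shows that ranking step with cosine similarity. The 3-dimensional vectors are invented for illustration; in practice they would come from a sentence-embedding model and live in a vector DB, and `search` is a hypothetical helper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, made up for this example.
doc_vectors = {
    "reset your password": [0.9, 0.1, 0.0],
    "update billing details": [0.1, 0.9, 0.1],
    "recover account access": [0.8, 0.2, 0.1],
}

def search(query_vector, top_k=2):
    """Return the top_k documents ranked by cosine similarity to the query."""
    scored = [(cosine(query_vector, vec), doc) for doc, vec in doc_vectors.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc for _, doc in scored[:top_k]]

# Stands in for the embedding of a query such as "forgot my login".
query_vector = [1.0, 0.0, 0.0]
results = search(query_vector)
print(results)  # ['reset your password', 'recover account access']
```

A brute-force scan like this is fine for a few thousand documents; a vector DB replaces the loop with an approximate nearest-neighbor index.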
33.5 Classical vs transformer
which to use?
Text task — pick approach?
│
├── Small data, need fast/cheap ───► TF-IDF + LogisticRegression
├── Need top accuracy, have GPU ───► Fine-tune BERT/DistilBERT
├── Similarity / search ───────────► Sentence embeddings
└── Generation / reasoning ────────► LLM (Chapter 22)

Professional recommendation

Always build the TF-IDF + linear baseline first: it takes minutes, often gets you 80–90% of the accuracy a fine-tuned transformer would reach, and tells you whether the problem is even learnable. Reach for a fine-tuned transformer only when that baseline isn't good enough and the accuracy gain is worth the extra cost and latency.

33.6 Evaluating text models
Common mistakes to avoid
Quick cheatsheet
TfidfVectorizer(ngram_range=(1,2)) -> strong baseline features
spaCy -> tokenize, NER, POS
sentence-transformers -> semantic embeddings
HuggingFace Trainer -> fine-tune BERT
classification_report -> per-class P/R/F1
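As a rough illustration of the per-class precision, recall, and F1 that the last cheatsheet entry refers to, here is a standard-library sketch of how those numbers are computed. The `per_class_prf` helper and the toy labels are assumptions for this example; it is not scikit-learn's implementation.

```python
from collections import Counter

def per_class_prf(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this label, but it was wrong
            fn[truth] += 1  # missed the true label
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        report[label] = (precision, recall, f1)
    return report

# Toy labels, made up for illustration.
y_true = ["spam", "spam", "ham", "ham", "ham"]
y_pred = ["spam", "ham", "ham", "ham", "spam"]
report = per_class_prf(y_true, y_pred)
for label, (p, r, f1) in report.items():
    print(f"{label}: P={p:.2f} R={r:.2f} F1={f1:.2f}")
```

Per-class numbers matter most on imbalanced text data, where overall accuracy can look fine while a rare class is mostly missed.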