Chapter 22 — AI / LLM
AI & LLM Analytics
Modern analysts increasingly work with embeddings, vector search, and LLMs. This chapter covers RAG, embeddings, vector databases, prompt analytics, and how to evaluate an LLM system.
LLMs don't replace data skills — they add a new layer. The analyst's edge is still evaluation, measurement, and grounding outputs in real data.
22.1 Embeddings — turning text into vectors
An embedding maps text (or images) to a numeric vector where semantic similarity becomes geometric closeness. Embeddings power search, clustering, classification, and recommendations on unstructured data.
```python
# Local sentence embeddings, no API key needed
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')
vecs = model.encode(['refund policy', 'how do I get my money back'])

# Cosine similarity: dot product divided by the product of the vector norms
cos = vecs[0] @ vecs[1] / (np.linalg.norm(vecs[0]) * np.linalg.norm(vecs[1]))
print(f"similarity: {cos:.2f}")  # high: same meaning, different words
```
22.2 Vector databases
| Tool | Use when |
|---|---|
| FAISS | Local, fast, in-memory similarity search |
| Chroma | Lightweight local RAG prototyping |
| Pinecone / Weaviate / Qdrant | Managed, scalable production vector search |
| pgvector | You already run Postgres and want vectors alongside SQL |
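Whichever tool is chosen, the core operation is the same: given a query vector, return the stored vectors closest to it. The sketch below shows exact (brute-force) cosine search in plain Python. The document ids and 3-dimensional vectors are made up for illustration, and the helper names `cosine` and `top_k` are hypothetical, not part of any listed tool. Real vector stores typically layer approximate indexes, persistence, and metadata filtering on top of this idea.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def top_k(query, index, k=2):
    """Return the k (doc_id, score) pairs most similar to the query."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy index: hypothetical ids mapped to made-up 3-d "embeddings".
index = {
    "refund_policy": [0.9, 0.1, 0.0],
    "shipping_times": [0.1, 0.9, 0.1],
    "money_back_faq": [0.8, 0.2, 0.1],
}

results = top_k([1.0, 0.0, 0.0], index, k=2)
print([doc_id for doc_id, _ in results])  # ['refund_policy', 'money_back_faq']
```

Brute force scans every vector, so cost grows linearly with corpus size. That is usually fine for a prototype and is the main reason larger corpora move to an indexed store.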
22.3 RAG — Retrieval-Augmented Generation
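RAG grounds an LLM in your own documents: relevant passages are retrieved at query time and passed to the model as context, so answers draw on those facts and the model is less likely to hallucinate. It reduces hallucination but does not eliminate it, which is why faithfulness still has to be measured. It is one of the most common enterprise LLM patterns.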
- Chunk docs
- Embed chunks
- Store in vector DB
- Embed query
- Retrieve top-K
- LLM answers with context
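The steps above can be sketched end to end in a few lines. This is a minimal illustration, not a production pipeline: `embed` is a toy bag-of-words stand-in for a real embedding model, a plain Python list plays the role of the vector DB, the sample documents are invented, and the final LLM call is left out so the code runs without any API key. All function names here are hypothetical.

```python
import math
import re
from collections import Counter

def chunk(text, max_words=8):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, store, k=2):
    """Embed the query and return the text of the k most similar chunks."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(query, contexts):
    """Assemble the grounded prompt that would be sent to the LLM."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return f"Answer using only the context below.\n\nContext:\n{context_block}\n\nQuestion: {query}"

docs = [
    "Refunds are issued within 14 days of purchase. Contact support to start a refund.",
    "Standard shipping takes 3 to 5 business days. Express shipping takes 1 day.",
]

# Chunk, embed, and store: each entry is (chunk_text, chunk_vector).
store = [(c, embed(c)) for d in docs for c in chunk(d)]

# Embed the query and retrieve the top-K chunks.
question = "How many days do refunds take?"
contexts = retrieve(question, store, k=1)

# The assembled prompt is what an LLM of your choice would answer from.
prompt = build_prompt(question, contexts)
print(prompt)
```

Swapping `embed` for a real model and the list for one of the stores in section 22.2 changes the quality of retrieval, not the shape of the pipeline.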
RAG vs alternatives
Need the model to know your private/fresh data?
│
├── Facts change often, large corpus ──► RAG (retrieve at query time)
├── Fixed style / format / behaviour ──► Fine-tuning
└── Small, stable context ───────────► Just put it in the prompt
22.4 Evaluating an LLM system
| Dimension | Question | How to measure |
|---|---|---|
| Faithfulness | Is the answer grounded in retrieved context? | LLM-as-judge, RAGAS faithfulness |
| Relevance | Does it answer the question asked? | Answer-relevance score, human rating |
| Retrieval quality | Were the right chunks fetched? | Context precision / recall, hit@K |
| Correctness | Does it match a known answer? | Exact match / F1 vs gold set |
| Safety | Toxic, biased, or leaking PII? | Guardrail classifiers, red-teaming |
| Cost / latency | Affordable and fast enough? | Tokens per query, p95 latency |
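Several of the measures in the table are simple enough to compute by hand, which helps when checking what a framework reports. The sketch below implements exact match, token-level F1, hit@K, and recall@K in plain Python. The normalisation rule (lowercase, alphanumeric tokens only) and the chunk ids are assumptions made for this example, and `recall_at_k` is a simplified stand-in, not the RAGAS context-recall metric.

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def exact_match(prediction, gold):
    """True if prediction and gold answer match after normalisation."""
    return normalize(prediction) == normalize(gold)

def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall against the gold answer."""
    pred, ref = normalize(prediction), normalize(gold)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def hit_at_k(retrieved_ids, relevant_ids, k):
    """1 if any relevant chunk appears in the top k results, else 0."""
    return int(any(doc_id in relevant_ids for doc_id in retrieved_ids[:k]))

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunks that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)

retrieved = ["c3", "c1", "c7"]   # hypothetical ranked retrieval output
relevant = {"c1", "c9"}          # hypothetical gold chunk ids

print(exact_match("14 Days.", "14 days"))
print(round(token_f1("Refunds take 14 days", "14 days"), 3))
print(hit_at_k(retrieved, relevant, k=1), hit_at_k(retrieved, relevant, k=2))
print(recall_at_k(retrieved, relevant, k=3))
```

Token F1 rewards partially correct answers that exact match would score as zero, so the two are usually reported together.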
Never ship an LLM feature without an evaluation set. Build a fixed set of question–answer pairs and score every prompt or model change against it — "it looked good in a few tries" is not evaluation.
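One way to make that rule enforceable is a small regression gate: score the current system and the candidate on the same fixed set, and block the change if the candidate scores lower. The sketch below assumes a two-item evaluation set and two stand-in "systems" implemented as lookup tables. In practice each would wrap a prompt plus a model call, and the set would be much larger. The names `run_eval` and `passes_gate` are hypothetical.

```python
def run_eval(answer_fn, eval_set, score_fn):
    """Mean score of answer_fn over a fixed list of (question, gold) pairs."""
    scores = [score_fn(answer_fn(question), gold) for question, gold in eval_set]
    return sum(scores) / len(scores)

def passes_gate(candidate_score, baseline_score, tolerance=0.0):
    """Allow the change only if the candidate is not worse than the baseline."""
    return candidate_score >= baseline_score - tolerance

def exact(prediction, gold):
    """1.0 for a case-insensitive exact match, else 0.0."""
    return float(prediction.strip().lower() == gold.strip().lower())

# Hypothetical fixed evaluation set; keep it under version control.
EVAL_SET = [
    ("What is the refund window?", "14 days"),
    ("How long does express shipping take?", "1 day"),
]

# Stand-ins for two prompt versions; each would call an LLM in practice.
def baseline_system(question):
    return {"What is the refund window?": "14 days"}.get(question, "unknown")

def candidate_system(question):
    answers = {
        "What is the refund window?": "14 days",
        "How long does express shipping take?": "1 day",
    }
    return answers.get(question, "unknown")

baseline = run_eval(baseline_system, EVAL_SET, exact)
candidate = run_eval(candidate_system, EVAL_SET, exact)
print(baseline, candidate, passes_gate(candidate, baseline))  # 0.5 1.0 True
```

A nonzero `tolerance` can absorb run-to-run noise when the scorer itself is an LLM judge, at the cost of letting small regressions through.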
22.5 Prompt analytics
- Version prompts like code — track which prompt produced which output
- Log inputs, outputs, token counts, latency, and cost per call
- A/B test prompt variants against your evaluation set, not vibes
- Monitor for drift in output quality as models are updated by the provider
- Cache frequent queries to cut cost and latency
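The logging and caching points above fit in one thin wrapper around the model call. The sketch below is illustrative only: `fake_llm` stands in for a real provider client, `count_tokens` is a whitespace word count and not a real tokenizer, and `price_per_1k_tokens` is a made-up figure. Hashing the prompt template gives a version id that changes whenever the prompt text changes.

```python
import hashlib
import json
import time

PROMPT_TEMPLATE = "Answer briefly: {question}"
# Short content hash: any edit to the template yields a new version id.
PROMPT_VERSION = hashlib.sha256(PROMPT_TEMPLATE.encode("utf-8")).hexdigest()[:8]

call_log = []   # in practice, write to a table or log pipeline
cache = {}      # prompt text -> model output

def fake_llm(prompt):
    """Stand-in for a real model call: returns the prompt's last word uppercased."""
    return prompt.split()[-1].upper()

def count_tokens(text):
    """Rough proxy: whitespace-separated words, not a real tokenizer."""
    return len(text.split())

def ask(question, llm=fake_llm, price_per_1k_tokens=0.002):
    """Call the model (or the cache) and record one log row per call."""
    prompt = PROMPT_TEMPLATE.format(question=question)
    cached = prompt in cache
    start = time.perf_counter()
    if cached:
        output = cache[prompt]
    else:
        output = llm(prompt)
        cache[prompt] = output
    latency = time.perf_counter() - start
    tokens = count_tokens(prompt) + count_tokens(output)
    call_log.append({
        "prompt_version": PROMPT_VERSION,
        "input": question,
        "output": output,
        "tokens": tokens,
        "latency_s": latency,
        # Cache hits are logged with zero cost because no model call is made.
        "cost": 0.0 if cached else tokens / 1000 * price_per_1k_tokens,
        "cache_hit": cached,
    })
    return output

ask("What is the refund window?")
ask("What is the refund window?")  # second call is served from the cache
print(json.dumps(call_log, indent=2))
```

With rows like these, cost per prompt version, cache hit rate, and latency percentiles become ordinary group-by queries.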
Professional recommendation
- Prototype: Chroma + MiniLM + RAG
- Production: Managed vector DB + eval gate
- Quality: RAGAS / LLM-as-judge
- Customise: RAG before fine-tuning
Common mistakes to avoid
- Shipping an LLM feature with no fixed evaluation set
- Trusting RAG answers without checking faithfulness to the context
- Fine-tuning when retrieval (RAG) would have solved it cheaper
Quick cheatsheet
SentenceTransformer.encode() -> Text to embeddings
faiss / chroma -> Vector similarity search
retrieve top-K + prompt -> RAG grounding
RAGAS faithfulness -> Score grounded answers
log tokens + latency -> Prompt cost analytics