Chapter 22 — AI / LLM

AI & LLM Analytics

Modern analysts increasingly work with embeddings, vector search, and LLMs. This chapter covers embeddings, vector databases, RAG, LLM system evaluation, and prompt analytics.

LLMs don't replace data skills — they add a new layer. The analyst's edge is still evaluation, measurement, and grounding outputs in real data.

22.1 Embeddings — turning text into vectors

An embedding maps text (or images) to a numeric vector so that semantic similarity becomes geometric closeness. Embeddings power search, clustering, classification, and recommendations on unstructured data.

python

# Local sentence embeddings — no API key needed
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')
vecs = model.encode(['refund policy', 'how do I get my money back'])

cos = vecs[0] @ vecs[1] / (np.linalg.norm(vecs[0]) * np.linalg.norm(vecs[1]))
print(f"similarity: {cos:.2f}")  # high — same meaning, different words

22.2 Vector databases

Tool	Use when
FAISS	Local, fast, in-memory similarity search
Chroma	Lightweight local RAG prototyping
Pinecone / Weaviate / Qdrant	Managed, scalable production vector search
pgvector	You already run Postgres and want vectors alongside SQL
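
To make the table concrete, here is a minimal sketch of the operation every tool above performs: given a query vector, return the K most similar stored vectors. It is a brute-force stand-in written with the standard library only, not the API of FAISS, Chroma, or any other product. The document IDs and 3-dimensional vectors are hypothetical; real embeddings have hundreds of dimensions, and real vector databases use approximate indexes to avoid scanning everything.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, index, k=2):
    """Return the k (doc_id, score) pairs most similar to the query."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical toy index: doc_id -> embedding vector.
index = {
    "refund_policy": [0.9, 0.1, 0.0],
    "shipping_times": [0.1, 0.9, 0.1],
    "return_window": [0.8, 0.2, 0.1],
}

results = top_k([1.0, 0.0, 0.0], index, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.2f}")
```

A full scan like this is fine for a few thousand vectors. The tools in the table matter once the corpus is large enough that scanning every vector per query becomes too slow.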

22.3 RAG — Retrieval-Augmented Generation

RAG grounds an LLM in your own documents so that it answers from retrieved facts, which reduces (but does not eliminate) hallucination. It is one of the most common enterprise LLM patterns.

1. Chunk docs
2. Embed chunks
3. Store in vector DB
4. Embed query
5. Retrieve top-K
6. LLM answers with context
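
The steps above can be sketched end to end in a few lines. This is an illustration under loud assumptions, not a production pipeline: `embed` is a toy word-count embedding standing in for a real model such as the MiniLM one used earlier, the "vector DB" is a plain list, and the final LLM call is left out, so the sketch stops at the assembled prompt. All function names and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter

def chunk(text, size=20):
    """Split text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text):
    """Toy embedding (assumption): lowercase word counts, not a real model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, store, k=1):
    """Embed the query and return the k most similar chunk texts."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(query, contexts):
    """Assemble the prompt the LLM would receive."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context_block}\n\nQuestion: {query}"
    )

# Hypothetical documents.
docs = [
    "Refunds are issued within 14 days of purchase when the item is unused.",
    "Standard shipping takes 3 to 5 business days within the country.",
]

# Steps 1-3: chunk, embed, store.
store = [(c, embed(c)) for d in docs for c in chunk(d)]

# Steps 4-5: embed the query and retrieve the top-K chunks.
query = "When are refunds issued?"
contexts = retrieve(query, store, k=1)

# Step 6: this prompt would be sent to the LLM.
prompt = build_prompt(query, contexts)
print(prompt)
```

The word-count embedding only matches exact words, which is precisely the limitation real embeddings remove: swapping `embed` for a sentence-embedding model leaves the rest of the flow unchanged.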

RAG vs alternatives

Need the model to know your private/fresh data?
│
├── Facts change often, large corpus ──► RAG (retrieve at query time)
├── Fixed style / format / behaviour ──► Fine-tuning
└── Small, stable context ───────────► Just put it in the prompt

22.4 Evaluating an LLM system

Dimension	Question	How to measure
Faithfulness	Is the answer grounded in retrieved context?	LLM-as-judge, RAGAS faithfulness
Relevance	Does it answer the question asked?	Answer-relevance score, human rating
Retrieval quality	Were the right chunks fetched?	Context precision / recall, hit@K
Correctness	Does it match a known answer?	Exact match / F1 vs gold set
Safety	Toxic, biased, or leaking PII?	Guardrail classifiers, red-teaming
Cost / latency	Affordable and fast enough?	Tokens per query, p95 latency
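
Several of the measures in the table are simple enough to compute by hand. The sketch below shows exact match, token-level F1 against a gold answer, and hit@K for retrieval. The normalization (lowercasing and whitespace splitting) is a simplifying assumption; published benchmarks usually also strip punctuation and articles. The chunk IDs are hypothetical.

```python
from collections import Counter

def normalize(text):
    """Lowercase and split on whitespace (a deliberately simple normalizer)."""
    return text.lower().split()

def exact_match(pred, gold):
    """True when prediction and gold answer have identical normalized tokens."""
    return normalize(pred) == normalize(gold)

def token_f1(pred, gold):
    """Harmonic mean of token precision and recall against the gold answer."""
    p, g = normalize(pred), normalize(gold)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)

def hit_at_k(retrieved_ids, relevant_ids, k):
    """True when at least one relevant chunk appears in the top k results."""
    return any(doc_id in relevant_ids for doc_id in retrieved_ids[:k])

print(exact_match("14 days", "14 Days"))
print(round(token_f1("within 14 days", "14 days"), 2))
print(hit_at_k(["c7", "c2", "c9"], {"c2"}, k=2))
```

Faithfulness and relevance are harder to reduce to string overlap, which is why the table points to LLM-as-judge and human rating for those dimensions.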

Never ship an LLM feature without an evaluation set. Build a fixed set of question–answer pairs and score every prompt or model change against it — "it looked good in a few tries" is not evaluation.
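
One way to enforce this is a small regression gate: score the current system and the proposed change on the same fixed set, and block the change if the score drops. The sketch below assumes exact-match scoring and uses stand-in functions in place of a real LLM pipeline. The eval set, `baseline_system`, and `candidate_system` are all hypothetical.

```python
def run_eval(answer_fn, eval_set):
    """Return the fraction of eval items answered with the gold answer."""
    correct = sum(
        answer_fn(item["question"]).strip().lower() == item["gold"].strip().lower()
        for item in eval_set
    )
    return correct / len(eval_set)

def passes_gate(candidate_score, baseline_score, tolerance=0.0):
    """Allow a change only if it does not score below baseline minus tolerance."""
    return candidate_score >= baseline_score - tolerance

# Hypothetical fixed evaluation set.
eval_set = [
    {"question": "refund window?", "gold": "14 days"},
    {"question": "shipping time?", "gold": "3 to 5 business days"},
]

def baseline_system(question):
    """Stand-in for the current prompt/model; a real one would call the LLM."""
    return {"refund window?": "14 days"}.get(question, "unknown")

def candidate_system(question):
    """Stand-in for the proposed prompt/model change."""
    answers = {"refund window?": "14 days", "shipping time?": "3 to 5 business days"}
    return answers.get(question, "unknown")

baseline_score = run_eval(baseline_system, eval_set)
candidate_score = run_eval(candidate_system, eval_set)
print(baseline_score, candidate_score, passes_gate(candidate_score, baseline_score))
```

In practice the eval set would hold dozens to hundreds of pairs, and the scoring function would be whichever measure from the table fits the feature.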

22.5 Prompt analytics

Version prompts like code — track which prompt produced which output
Log inputs, outputs, token counts, latency, and cost per call
A/B test prompt variants against your evaluation set, not vibes
Monitor for drift in output quality as models are updated by the provider
Cache frequent queries to cut cost and latency
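
The practices above can be combined in a thin wrapper around each LLM call. The sketch below is one possible shape, not a standard API: it derives a prompt version from a hash of the template, logs tokens, latency, and cost per call, and serves repeated prompts from an in-memory cache. The price constant, the whitespace token count, and `fake_llm` are assumptions for illustration; a real system would use the provider's tokenizer, pricing, and client.

```python
import hashlib
import time

PRICE_PER_1K_TOKENS = 0.002  # hypothetical price, not a real provider rate

_cache = {}    # prompt text -> model output
call_log = []  # one record per call

def prompt_version(template):
    """Short, stable ID for a prompt template, so outputs can be traced to it."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:8]

def count_tokens(text):
    """Crude whitespace count; real tokenizers give different numbers."""
    return len(text.split())

def logged_call(template, question, llm_fn):
    """Call llm_fn (or the cache) and append a log record for the call."""
    prompt = template.format(question=question)
    cached = prompt in _cache
    start = time.perf_counter()
    if not cached:
        _cache[prompt] = llm_fn(prompt)
    output = _cache[prompt]
    latency_s = time.perf_counter() - start
    # Cache hits are treated as free: no tokens are sent to the provider.
    tokens = 0 if cached else count_tokens(prompt) + count_tokens(output)
    call_log.append({
        "prompt_version": prompt_version(template),
        "input": question,
        "output": output,
        "tokens": tokens,
        "latency_s": latency_s,
        "cost": tokens / 1000 * PRICE_PER_1K_TOKENS,
        "cached": cached,
    })
    return output

def fake_llm(prompt):
    """Stand-in for a real model call."""
    return "Refunds are issued within 14 days."

template = "Answer briefly: {question}"
logged_call(template, "What is the refund window?", fake_llm)
logged_call(template, "What is the refund window?", fake_llm)  # served from cache

for record in call_log:
    print(record["prompt_version"], record["tokens"], record["cached"])
```

Because the version ID is derived from the template text, any edit to the prompt produces a new ID, so records from different prompt variants can be grouped and compared against the evaluation set.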

Professional recommendation

Prototype: Chroma + MiniLM + RAG

Production: Managed vector DB + eval gate

Quality: RAGAS / LLM-as-judge

Customise: RAG before fine-tuning

Common mistakes to avoid

Shipping an LLM feature with no fixed evaluation set
Trusting RAG answers without checking faithfulness to the context
Fine-tuning when retrieval (RAG) would have solved the problem more cheaply

Quick cheatsheet

SentenceTransformer.encode() -> Text to embeddings

faiss / chroma -> Vector similarity search

retrieve top-K + prompt -> RAG grounding

RAGAS faithfulness -> Score grounded answers

log tokens + latency -> Prompt cost analytics