Chapter 22 — AI / LLM

AI & LLM Analytics

Modern analysts increasingly work with embeddings, vector search, and LLMs. This chapter covers RAG, embeddings, vector databases, prompt analytics, and how to evaluate an LLM system.

LLMs don't replace data skills — they add a new layer. The analyst's edge is still evaluation, measurement, and grounding outputs in real data.
22.1 Embeddings — turning text into vectors

An embedding maps text (or images) to a numeric vector in which semantic similarity becomes geometric closeness. Embeddings power search, clustering, classification, and recommendations on unstructured data.

python
# Local sentence embeddings — no API key needed
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')
vecs = model.encode(['refund policy', 'how do I get my money back'])

cos = vecs[0] @ vecs[1] / (np.linalg.norm(vecs[0]) * np.linalg.norm(vecs[1]))
print(f"similarity: {cos:.2f}")  # high — same meaning, different words
22.2 Vector databases
Tool | Use when
FAISS | Local, fast, in-memory similarity search
Chroma | Lightweight local RAG prototyping
Pinecone / Weaviate / Qdrant | Managed, scalable production vector search
pgvector | You already run Postgres and want vectors alongside SQL
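
All of these tools answer the same core question: given a query vector, which stored vectors are closest? The sketch below shows that operation as a brute-force top-K cosine search in plain Python. It is only an illustration of the idea, not how any of the listed tools is implemented; the `index` contents, the 3-dimensional vectors, and the function names `cosine` and `top_k` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, index, k=2):
    """Return the k (doc_id, score) pairs most similar to the query."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical 3-dimensional "embeddings"; real ones have hundreds of dimensions.
index = {
    "refund_policy": [0.9, 0.1, 0.0],
    "shipping_times": [0.1, 0.9, 0.1],
    "password_reset": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.1]

results = top_k(query_vec, index, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.2f}")
```

Brute force compares the query against every stored vector, which is fine for a few thousand items. Dedicated vector stores exist largely to make this lookup fast at much larger scale.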
22.3 RAG — Retrieval-Augmented Generation

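RAG grounds an LLM in your own documents: relevant passages are retrieved at query time and placed in the prompt, so the model answers from supplied facts rather than from memory alone. This reduces hallucination but does not eliminate it. It is one of the most common enterprise LLM patterns.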
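
The retrieve-then-prompt flow can be sketched in a few lines. This is a minimal illustration under stated assumptions: retrieval here uses simple word overlap as a stand-in for embedding similarity, the documents and the names `retrieve` and `build_prompt` are hypothetical, and the final call to an LLM is deliberately left out.

```python
def retrieve(question, docs, k=2):
    """Rank docs by number of shared lowercase words with the question.

    Toy stand-in for vector search; a real system would compare embeddings.
    """
    q_words = set(question.lower().split())
    ranked = sorted(
        docs,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(question, chunks):
    """Place the retrieved chunks in the prompt and instruct the model to stay within them."""
    context = "\n".join(f"- {c}" for c in chunks)
    return (
        "Answer using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

# Hypothetical knowledge base
docs = [
    "Refunds are issued within 14 days of purchase.",
    "Standard shipping takes 3 to 5 business days.",
    "Passwords can be reset from the account settings page.",
]

question = "How many days do refunds take?"
chunks = retrieve(question, docs, k=1)
prompt = build_prompt(question, chunks)
print(prompt)  # this string is what would be sent to the LLM
```

The instruction to say "don't know" when the context lacks the answer is one common way to discourage the model from falling back on unsupported guesses.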

RAG vs alternatives
Need the model to know your private/fresh data?
│
├── Facts change often, large corpus ──► RAG (retrieve at query time)
├── Fixed style / format / behaviour ──► Fine-tuning
└── Small, stable context ───────────► Just put it in the prompt
22.4 Evaluating an LLM system
Dimension | Question | How to measure
Faithfulness | Is the answer grounded in retrieved context? | LLM-as-judge, RAGAS faithfulness
Relevance | Does it answer the question asked? | Answer-relevance score, human rating
Retrieval quality | Were the right chunks fetched? | Context precision / recall, hit@K
Correctness | Does it match a known answer? | Exact match / F1 vs gold set
Safety | Toxic, biased, or leaking PII? | Guardrail classifiers, red-teaming
Cost / latency | Affordable and fast enough? | Tokens per query, p95 latency
Never ship an LLM feature without an evaluation set. Build a fixed set of question–answer pairs and score every prompt or model change against it — "it looked good in a few tries" is not evaluation.
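
A fixed evaluation set can be scored with very little code. The sketch below computes three of the measures from the table (exact match, token-level F1, and hit@K) over a tiny gold set. The gold set, the system outputs, and the function names are hypothetical; in practice the outputs would come from running each question through your pipeline.

```python
from collections import Counter

def normalize(text):
    """Lowercase and split on whitespace."""
    return text.lower().strip().split()

def exact_match(pred, gold):
    return float(normalize(pred) == normalize(gold))

def token_f1(pred, gold):
    """Harmonic mean of token precision and recall against the gold answer."""
    p, g = normalize(pred), normalize(gold)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)

def hit_at_k(retrieved_ids, relevant_id, k):
    """1.0 if the relevant document appears in the first k retrieved ids."""
    return float(relevant_id in retrieved_ids[:k])

# Hypothetical gold set and system outputs, aligned by position
gold_set = [
    {"question": "Refund window?", "gold": "14 days", "relevant": "doc_refund"},
    {"question": "Shipping time?", "gold": "3 to 5 business days", "relevant": "doc_ship"},
]
outputs = [
    {"answer": "14 days", "retrieved": ["doc_refund", "doc_ship"]},
    {"answer": "5 business days", "retrieved": ["doc_faq", "doc_ship"]},
]

def evaluate(gold_set, outputs, k=1):
    """Average each metric over the whole gold set."""
    n = len(gold_set)
    pairs = list(zip(gold_set, outputs))
    return {
        "exact_match": sum(exact_match(o["answer"], g["gold"]) for g, o in pairs) / n,
        "f1": sum(token_f1(o["answer"], g["gold"]) for g, o in pairs) / n,
        f"hit@{k}": sum(hit_at_k(o["retrieved"], g["relevant"], k) for g, o in pairs) / n,
    }

scores = evaluate(gold_set, outputs, k=1)
print(scores)
```

Re-running `evaluate` after every prompt or model change and comparing the scores against the previous run is what turns the gold set into a regression gate.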
22.5 Prompt analytics
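
One way to approach prompt analytics, in line with the cost / latency row of the evaluation table, is to log token counts and latency for every call and compare prompt versions. The sketch below is a minimal illustration: the log records, the field names (`prompt_version`, `prompt_tokens`, `completion_tokens`, `latency_ms`), and the helper functions are all hypothetical, and the percentile uses the simple nearest-rank method.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical call log; in practice one record is written per LLM request.
calls = [
    {"prompt_version": "v1", "prompt_tokens": 420, "completion_tokens": 80, "latency_ms": 900},
    {"prompt_version": "v1", "prompt_tokens": 450, "completion_tokens": 120, "latency_ms": 1500},
    {"prompt_version": "v2", "prompt_tokens": 300, "completion_tokens": 60, "latency_ms": 700},
    {"prompt_version": "v2", "prompt_tokens": 310, "completion_tokens": 70, "latency_ms": 800},
]

def summarize(calls):
    """Group calls by prompt version and report tokens per query and p95 latency."""
    by_version = {}
    for call in calls:
        by_version.setdefault(call["prompt_version"], []).append(call)

    summary = {}
    for version, rows in by_version.items():
        tokens = [r["prompt_tokens"] + r["completion_tokens"] for r in rows]
        summary[version] = {
            "calls": len(rows),
            "avg_tokens_per_query": sum(tokens) / len(tokens),
            "p95_latency_ms": percentile([r["latency_ms"] for r in rows], 95),
        }
    return summary

summary = summarize(calls)
for version, stats in summary.items():
    print(version, stats)
```

With only a handful of records per version the p95 is simply the slowest call; the figure becomes meaningful once each version has a few hundred logged requests.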

Professional recommendation

Prototype | Chroma + MiniLM + RAG
Production | Managed vector DB + eval gate
Quality | RAGAS / LLM-as-judge
Customise | RAG before fine-tuning
Common mistakes to avoid
Quick cheatsheet
SentenceTransformer.encode() -> Text to embeddings
faiss / chroma -> Vector similarity search
retrieve top-K + prompt -> RAG grounding
RAGAS faithfulness -> Score grounded answers
log tokens + latency -> Prompt cost analytics