Chapter 24 — SQL
SQL for Analysts
SQL is the #1 daily tool for analysts. Query order, joins, aggregation, window functions, CTEs, and the patterns you actually use in real work.
Most analytics happens in the warehouse, not in pandas. Pulling clean, aggregated data with SQL first means smaller, faster Python later. This chapter is the 20% of SQL that covers 80% of the job.
24.1 Logical order of execution
SQL is written in one order but executed in another. Knowing this explains why you can't use a SELECT alias in WHERE, but can in ORDER BY.
execution order
1. FROM / JOIN ── assemble the rows 2. WHERE ── filter rows (no aggregates yet) 3. GROUP BY ── collapse into groups 4. HAVING ── filter groups (aggregates allowed) 5. SELECT ── choose columns / compute 6. DISTINCT 7. ORDER BY ── sort 8. LIMIT ── cut
24.2 Joins — choose the right one
which join?
Want rows from... │ ├── both tables, only matches ─────► INNER JOIN ├── all left + matches ────────────► LEFT JOIN ├── all right + matches ───────────► RIGHT JOIN ├── everything, matched or not ────► FULL OUTER JOIN └── rows in A with NO match in B ──► LEFT JOIN ... WHERE b.id IS NULL
| Join | Use when | Watch out |
|---|---|---|
| INNER | Only care about matched records | Silently drops unmatched rows |
| LEFT | Keep all customers, attach orders if any | NULLs for non-matches — handle them |
| FULL OUTER | Reconcile two sources | Can explode if keys not unique |
A join on a non-unique key causes a fan-out: row counts multiply and sums double. Always confirm the join key is unique on at least one side (
COUNT(*) vs COUNT(DISTINCT key)).24.3 Aggregation & GROUP BY
sql
SELECT
customer_id,
COUNT(*) AS orders,
SUM(amount) AS revenue,
AVG(amount) AS avg_order,
MAX(order_date) AS last_order
FROM orders
WHERE order_date >= '2026-01-01'
GROUP BY customer_id
HAVING SUM(amount) > 1000 -- filter the GROUPS, not rows
ORDER BY revenue DESC
LIMIT 100;24.4 Window functions — the analyst superpower
Windows compute across rows without collapsing them — running totals, rankings, per-group comparisons, and period-over-period change.
sql
SELECT
customer_id, order_date, amount,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date) AS order_seq,
SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS running_total,
LAG(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS prev_amount,
amount - LAG(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS delta
FROM orders;| Function | Does | Use for |
|---|---|---|
ROW_NUMBER() | Unique rank 1,2,3 | First/latest row per group |
RANK()/DENSE_RANK() | Rank with ties | Leaderboards |
LAG()/LEAD() | Previous/next row value | Period-over-period change |
SUM() OVER | Running / windowed total | Cumulative revenue |
NTILE(n) | Split into n buckets | Deciles, quartiles |
24.5 CTEs — readable, step-by-step queries
sql
-- Build logic in named steps instead of nested subqueries
WITH monthly AS (
SELECT DATE_TRUNC('month', order_date) AS month, SUM(amount) AS rev
FROM orders GROUP BY 1
),
with_growth AS (
SELECT month, rev,
LAG(rev) OVER (ORDER BY month) AS prev_rev
FROM monthly
)
SELECT month, rev, ROUND(100.0*(rev-prev_rev)/prev_rev, 1) AS pct_growth
FROM with_growth
ORDER BY month;24.6 NULL logic & common gotchas
NULL = NULLis NOT true — useIS NULL/IS NOT NULLCOUNT(col)ignores NULLs;COUNT(*)counts all rowsWHERE col != 'x'drops NULL rows — addOR col IS NULLif needed- Integer division:
1/2 = 0— cast:1.0 * a / b - Filtering an aggregate goes in
HAVING, notWHERE
Professional recommendation
Push filtering and aggregation into SQL, pull a small tidy result into pandas for modeling/plots. Build complex logic with CTEs (readable, debuggable) over nested subqueries. Learn window functions early — they replace dozens of clumsy self-joins.
Common mistakes to avoid
- Joining on a non-unique key and double-counting (fan-out)
- Putting aggregate filters in WHERE instead of HAVING
- Forgetting NULLs in
!=filters and comparisons - SELECT * on huge tables instead of the columns you need
- Doing in pandas what one GROUP BY would do in the warehouse
Quick cheatsheet
ROW_NUMBER() OVER (PARTITION BY .. ORDER BY ..) -> latest/first per groupWITH x AS (...) -> CTE building blockDATE_TRUNC('month', d) -> bucket by periodCOUNT(DISTINCT key) -> check uniquenessHAVING SUM(x) > n -> filter groups