Chapter 24 — SQL

SQL for Analysts

SQL is the #1 daily tool for analysts. Query order, joins, aggregation, window functions, CTEs, and the patterns you actually use in real work.

Most analytics happens in the warehouse, not in pandas. Pulling clean, aggregated data with SQL first means smaller, faster Python later. This chapter is the 20% of SQL that covers 80% of the job.

24.1 Logical order of execution

SQL is written in one order but executed in another. Knowing this explains why you can't use a SELECT alias in WHERE, but can in ORDER BY.

execution order

1. FROM / JOIN     ── assemble the rows
2. WHERE          ── filter rows (no aggregates yet)
3. GROUP BY       ── collapse into groups
4. HAVING         ── filter groups (aggregates allowed)
5. SELECT         ── choose columns / compute
6. DISTINCT
7. ORDER BY       ── sort
8. LIMIT          ── cut

24.2 Joins — choose the right one

which join?

Want rows from...
│
├── both tables, only matches ─────► INNER JOIN
├── all left + matches ────────────► LEFT JOIN
├── all right + matches ───────────► RIGHT JOIN
├── everything, matched or not ────► FULL OUTER JOIN
└── rows in A with NO match in B ──► LEFT JOIN ... WHERE b.id IS NULL

Join	Use when	Watch out
INNER	Only care about matched records	Silently drops unmatched rows
LEFT	Keep all customers, attach orders if any	NULLs for non-matches — handle them
FULL OUTER	Reconcile two sources	Can explode if keys not unique

A join on a non-unique key causes a fan-out: row counts multiply and sums double. Always confirm the join key is unique on at least one side (COUNT(*) vs COUNT(DISTINCT key)).

24.3 Aggregation & GROUP BY

sql

SELECT
    customer_id,
    COUNT(*)              AS orders,
    SUM(amount)           AS revenue,
    AVG(amount)           AS avg_order,
    MAX(order_date)       AS last_order
FROM orders
WHERE order_date >= '2026-01-01'
GROUP BY customer_id
HAVING SUM(amount) > 1000      -- filter the GROUPS, not rows
ORDER BY revenue DESC
LIMIT 100;

24.4 Window functions — the analyst superpower

Windows compute across rows without collapsing them — running totals, rankings, per-group comparisons, and period-over-period change.

sql

SELECT
    customer_id, order_date, amount,
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date)      AS order_seq,
    SUM(amount)  OVER (PARTITION BY customer_id ORDER BY order_date)      AS running_total,
    LAG(amount)  OVER (PARTITION BY customer_id ORDER BY order_date)      AS prev_amount,
    amount - LAG(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS delta
FROM orders;

Function	Does	Use for
`ROW_NUMBER()`	Unique rank 1,2,3	First/latest row per group
`RANK()/DENSE_RANK()`	Rank with ties	Leaderboards
`LAG()/LEAD()`	Previous/next row value	Period-over-period change
`SUM() OVER`	Running / windowed total	Cumulative revenue
`NTILE(n)`	Split into n buckets	Deciles, quartiles

24.5 CTEs — readable, step-by-step queries

sql

-- Build logic in named steps instead of nested subqueries
WITH monthly AS (
    SELECT DATE_TRUNC('month', order_date) AS month, SUM(amount) AS rev
    FROM orders GROUP BY 1
),
with_growth AS (
    SELECT month, rev,
           LAG(rev) OVER (ORDER BY month) AS prev_rev
    FROM monthly
)
SELECT month, rev, ROUND(100.0*(rev-prev_rev)/prev_rev, 1) AS pct_growth
FROM with_growth
ORDER BY month;

24.6 NULL logic & common gotchas

NULL = NULL is NOT true — use IS NULL / IS NOT NULL
COUNT(col) ignores NULLs; COUNT(*) counts all rows
WHERE col != 'x' drops NULL rows — add OR col IS NULL if needed
Integer division: 1/2 = 0 — cast: 1.0 * a / b
Filtering an aggregate goes in HAVING, not WHERE

Professional recommendation

Push filtering and aggregation into SQL, pull a small tidy result into pandas for modeling/plots. Build complex logic with CTEs (readable, debuggable) over nested subqueries. Learn window functions early — they replace dozens of clumsy self-joins.

Common mistakes to avoid

Joining on a non-unique key and double-counting (fan-out)
Putting aggregate filters in WHERE instead of HAVING
Forgetting NULLs in != filters and comparisons
SELECT * on huge tables instead of the columns you need
Doing in pandas what one GROUP BY would do in the warehouse

Quick cheatsheet

ROW_NUMBER() OVER (PARTITION BY .. ORDER BY ..) -> latest/first per group

WITH x AS (...) -> CTE building block

DATE_TRUNC('month', d) -> bucket by period

COUNT(DISTINCT key) -> check uniqueness

HAVING SUM(x) > n -> filter groups