Chapter 24 — SQL

SQL for Analysts

SQL is the #1 daily tool for analysts. Query order, joins, aggregation, window functions, CTEs, and the patterns you actually use in real work.

Most analytics happens in the warehouse, not in pandas. Pulling clean, aggregated data with SQL first means smaller, faster Python later. This chapter is the 20% of SQL that covers 80% of the job.
24.1 Logical order of execution

SQL is written in one order but executed in another. Knowing this explains why you can't use a SELECT alias in WHERE, but can in ORDER BY.

execution order
1. FROM / JOIN     ── assemble the rows
2. WHERE          ── filter rows (no aggregates yet)
3. GROUP BY       ── collapse into groups
4. HAVING         ── filter groups (aggregates allowed)
5. SELECT         ── choose columns / compute
6. DISTINCT
7. ORDER BY       ── sort
8. LIMIT          ── cut
24.2 Joins — choose the right one
which join?
Want rows from...
│
├── both tables, only matches ─────► INNER JOIN
├── all left + matches ────────────► LEFT JOIN
├── all right + matches ───────────► RIGHT JOIN
├── everything, matched or not ────► FULL OUTER JOIN
└── rows in A with NO match in B ──► LEFT JOIN ... WHERE b.id IS NULL
JoinUse whenWatch out
INNEROnly care about matched recordsSilently drops unmatched rows
LEFTKeep all customers, attach orders if anyNULLs for non-matches — handle them
FULL OUTERReconcile two sourcesCan explode if keys not unique
A join on a non-unique key causes a fan-out: row counts multiply and sums double. Always confirm the join key is unique on at least one side (COUNT(*) vs COUNT(DISTINCT key)).
24.3 Aggregation & GROUP BY
sql
SELECT
    customer_id,
    COUNT(*)              AS orders,
    SUM(amount)           AS revenue,
    AVG(amount)           AS avg_order,
    MAX(order_date)       AS last_order
FROM orders
WHERE order_date >= '2026-01-01'
GROUP BY customer_id
HAVING SUM(amount) > 1000      -- filter the GROUPS, not rows
ORDER BY revenue DESC
LIMIT 100;
24.4 Window functions — the analyst superpower

Windows compute across rows without collapsing them — running totals, rankings, per-group comparisons, and period-over-period change.

sql
SELECT
    customer_id, order_date, amount,
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date)      AS order_seq,
    SUM(amount)  OVER (PARTITION BY customer_id ORDER BY order_date)      AS running_total,
    LAG(amount)  OVER (PARTITION BY customer_id ORDER BY order_date)      AS prev_amount,
    amount - LAG(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS delta
FROM orders;
FunctionDoesUse for
ROW_NUMBER()Unique rank 1,2,3First/latest row per group
RANK()/DENSE_RANK()Rank with tiesLeaderboards
LAG()/LEAD()Previous/next row valuePeriod-over-period change
SUM() OVERRunning / windowed totalCumulative revenue
NTILE(n)Split into n bucketsDeciles, quartiles
24.5 CTEs — readable, step-by-step queries
sql
-- Build logic in named steps instead of nested subqueries
WITH monthly AS (
    SELECT DATE_TRUNC('month', order_date) AS month, SUM(amount) AS rev
    FROM orders GROUP BY 1
),
with_growth AS (
    SELECT month, rev,
           LAG(rev) OVER (ORDER BY month) AS prev_rev
    FROM monthly
)
SELECT month, rev, ROUND(100.0*(rev-prev_rev)/prev_rev, 1) AS pct_growth
FROM with_growth
ORDER BY month;
24.6 NULL logic & common gotchas

Professional recommendation

Push filtering and aggregation into SQL, pull a small tidy result into pandas for modeling/plots. Build complex logic with CTEs (readable, debuggable) over nested subqueries. Learn window functions early — they replace dozens of clumsy self-joins.

Common mistakes to avoid
Quick cheatsheet
ROW_NUMBER() OVER (PARTITION BY .. ORDER BY ..) -> latest/first per group
WITH x AS (...) -> CTE building block
DATE_TRUNC('month', d) -> bucket by period
COUNT(DISTINCT key) -> check uniqueness
HAVING SUM(x) > n -> filter groups