Chapter 23 — Foundations

Probability & Statistics Foundations

The math under every model and test: distributions, the Central Limit Theorem, confidence intervals, bootstrapping, and what a p-value actually means.

You don't need heavy math to be a great analyst — but you do need these intuitions. They explain why the methods in the rest of the handbook work, and stop you trusting numbers that lie.

23.1 Populations, samples & sampling error

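You almost never measure the whole population; you measure a sample and infer. The gap between your sample estimate and the truth is sampling error. For a random sample its typical size shrinks in proportion to 1/√n, so quadrupling the sample size roughly halves it. A larger sample does not fix a biased one.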

core idea

Population (unknown truth: mean μ, proportion p)
        │  draw a random sample
        ▼
Sample statistic (x̄, p̂)  ── estimates ──►  Population parameter
        │
        └── uncertainty quantified by the standard error
            SE = s / √n   (smaller as n grows)
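
A minimal sketch of the diagram above, assuming a made-up right-skewed population of order values (the gamma parameters and the `standard_error` helper are illustrative, not from the handbook). It draws samples of increasing size and prints SE = s / √n for each.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the sketch is repeatable
# Hypothetical population: 1,000,000 right-skewed order values
population = rng.gamma(shape=2.0, scale=25.0, size=1_000_000)

def standard_error(sample):
    """SE of the mean: sample standard deviation divided by sqrt(n)."""
    return sample.std(ddof=1) / np.sqrt(len(sample))

results = {}
for n in (100, 400, 1_600, 6_400):
    sample = rng.choice(population, size=n, replace=False)
    results[n] = (sample.mean(), standard_error(sample))
    print(f"n={n:>5}  mean={results[n][0]:7.2f}  SE={results[n][1]:.3f}")
```

Each step multiplies n by four, so the printed SE should roughly halve from one line to the next.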

23.2 Distributions you must recognise

Distribution	Shape	Models	Example
Normal (Gaussian)	Symmetric bell	Sums/averages of many effects	Heights, measurement error
Binomial	Discrete counts	k successes in n trials	Conversions out of N visitors
Poisson	Right-skewed counts	Rare events per interval	Support tickets per hour
Log-normal	Right-skewed positive	Multiplicative growth	Income, prices, session time
Uniform	Flat	Equal-likelihood values	Random IDs, dice
Exponential	Decaying	Time between events	Time to next purchase

Before any mean-based method, plot a histogram. If a column is right-skewed (for example log-normal), report the median rather than the mean and consider a log transform before modelling.
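
A quick numeric version of that check, as a sketch: the `session_time` data is simulated from a log-normal distribution with made-up parameters, and sample skewness stands in for eyeballing the histogram.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical session times in seconds, drawn from a log-normal distribution
session_time = rng.lognormal(mean=4.0, sigma=1.0, size=10_000)

mean, median = session_time.mean(), np.median(session_time)
skew_raw = stats.skew(session_time)            # well above 0 for a long right tail
skew_log = stats.skew(np.log(session_time))    # close to 0 once logged

print(f"mean={mean:.1f}  median={median:.1f}")
print(f"skewness raw={skew_raw:.2f}  after log={skew_log:.2f}")
```

The mean sits well above the median because a few very long sessions pull it up. After the log transform the skewness is near zero, which is when mean-based methods become reasonable again.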

23.3 The Central Limit Theorem (CLT)

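The single most useful theorem in applied statistics: for independent observations with finite variance, the distribution of the sample mean is approximately normal for large n, whatever the shape of the underlying data. This is why t-tests and confidence intervals work reasonably well even on non-normal data. The more skewed or heavy-tailed the data, the larger n must be before the approximation is good.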

python

# Even a skewed population yields a normal-looking distribution of sample means
import numpy as np
rng = np.random.default_rng(0)                             # seeded for repeatable output
pop = rng.exponential(scale=2.0, size=1_000_000)           # very skewed, true mean = 2
means = [rng.choice(pop, 50).mean() for _ in range(5000)]  # 5,000 samples of n = 50
print(np.mean(means), np.std(means))   # centred near 2, spread near 2/√50 ≈ 0.28
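
To see how large "large n" needs to be, the sketch below repeats the simulation at several sample sizes and tracks the skewness of the sample means. The sample sizes and seed are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
scale = 2.0  # exponential population: mean 2, standard deviation 2, skewness 2

skew_by_n = {}
for n in (2, 10, 50, 500):
    # 5,000 independent samples of size n, one mean per row
    means = rng.exponential(scale=scale, size=(5_000, n)).mean(axis=1)
    skew_by_n[n] = stats.skew(means)
    print(f"n={n:>3}  sd of means={means.std(ddof=1):.3f}  "
          f"theory={scale / np.sqrt(n):.3f}  skewness={skew_by_n[n]:.2f}")
```

The spread of the means tracks σ/√n at every n, while the skewness fades more slowly. At n = 50 the means are still slightly right-skewed, which is why the normal approximation is a rule of thumb rather than a guarantee.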

23.4 Confidence intervals — report ranges, not points

A 95% confidence interval means: if you repeated the study many times, ~95% of such intervals would contain the true value. Always prefer "12.4% ± 1.1%" to a bare "12.4%".

python

from scipy import stats
import numpy as np
data = np.array([...])                     # your numeric observations
mean = data.mean()
se = stats.sem(data)                       # standard error: s / √n, with ddof=1
ci = stats.t.interval(0.95, len(data)-1, loc=mean, scale=se)   # t-based 95% CI, n-1 degrees of freedom
print(f"mean={mean:.2f}  95% CI={ci}")
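
The "repeat the study many times" reading can be checked by simulation. This sketch assumes a normal population with a known true mean (all numbers are hypothetical) and counts how often the t-interval covers it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
true_mean, n, n_studies = 10.0, 40, 2_000  # hypothetical values for the simulation

covered = 0
for _ in range(n_studies):
    sample = rng.normal(loc=true_mean, scale=3.0, size=n)
    lo, hi = stats.t.interval(0.95, n - 1, loc=sample.mean(), scale=stats.sem(sample))
    covered += int(lo <= true_mean <= hi)   # did this study's interval catch the truth?

coverage = covered / n_studies
print(f"{coverage:.1%} of the intervals contain the true mean")
```

The 95% describes the procedure across many studies. Any single interval either contains the true value or it does not.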

23.5 Bootstrapping — CIs for anything

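When a metric has no neat formula (median, a ratio, an AUC), resample with replacement thousands of times and read the percentiles. This needs no assumption about the shape of the distribution, only that the sample is representative of the population and its rows are independent.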

bootstrap flow

Original sample (n rows)
   │  repeat B = 10,000 times
   ▼
Resample n rows WITH replacement  ──►  compute statistic
   │
   └── collect B statistics ──► 2.5th & 97.5th percentile = 95% CI

python

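import numpy as np
rng = np.random.default_rng(0)             # seeded for repeatable output
# `data` is the 1-D array of observations from 23.4
boot = [np.median(rng.choice(data, len(data), replace=True))
        for _ in range(10_000)]
lo, hi = np.percentile(boot, [2.5, 97.5])  # middle 95% of the bootstrap medians
print(f"median 95% CI: [{lo:.2f}, {hi:.2f}]")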
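
The same flow extends to a ratio metric, provided whole rows are resampled so numerator and denominator stay paired. The sketch below is one way to write it. The `bootstrap_ci` helper, the `revenue_per_session` metric and the simulated per-user data are all hypothetical.

```python
import numpy as np

def bootstrap_ci(rows, statistic, n_boot=5_000, level=0.95, seed=0):
    """Percentile bootstrap CI for statistic(rows), resampling whole rows."""
    rng = np.random.default_rng(seed)
    n = len(rows)
    boot = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)      # n row indices WITH replacement
        boot[b] = statistic(rows[idx])
    tail = (1 - level) / 2 * 100
    return tuple(np.percentile(boot, [tail, 100 - tail]))

# Hypothetical per-user data: column 0 = sessions, column 1 = revenue
rng = np.random.default_rng(3)
sessions = rng.poisson(lam=5, size=500) + 1
revenue = sessions * rng.lognormal(mean=0.0, sigma=0.8, size=500)
users = np.column_stack([sessions, revenue])

def revenue_per_session(rows):
    """Ratio of totals: total revenue divided by total sessions."""
    return rows[:, 1].sum() / rows[:, 0].sum()

point = revenue_per_session(users)
lo, hi = bootstrap_ci(users, revenue_per_session)
print(f"revenue/session = {point:.3f}  95% CI = [{lo:.3f}, {hi:.3f}]")
```

Resampling users rather than individual sessions matches the unit that was sampled in the first place, which is the assumption the percentile interval relies on.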

23.6 What a p-value really is (and isn't)

It IS

The probability of seeing data at least this extreme if the null hypothesis were true. A small p means the data would be surprising under "no effect".

It is NOT

The probability that the hypothesis (null or alternative) is true. NOT the size or importance of the effect. NOT "p=0.04 means 96% sure the effect is real".

Statistical significance ≠ practical significance. With huge n, a meaningless 0.1% lift can be "significant". Always pair the p-value with an effect size and a confidence interval.
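
A small worked example of that point, as a sketch: a pooled two-proportion z-test written by hand (the `two_proportion_ztest` helper and the conversion counts are hypothetical), applied to the same 0.1 percentage-point lift at two sample sizes.

```python
import numpy as np
from scipy import stats

def two_proportion_ztest(x_a, n_a, x_b, n_b):
    """Two-sided z-test for a difference in proportions (pooled standard error)."""
    p_a, p_b = x_a / n_a, x_b / n_b
    pooled = (x_a + x_b) / (n_a + n_b)
    se = np.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * stats.norm.sf(abs(z))
    return p_b - p_a, z, p_value

# Hypothetical counts: 10.0% vs 10.1% conversion at two sample sizes per arm
scenarios = [(1_000, 10_000, 1_010, 10_000),
             (500_000, 5_000_000, 505_000, 5_000_000)]
results = {}
for x_a, n_a, x_b, n_b in scenarios:
    results[n_a] = two_proportion_ztest(x_a, n_a, x_b, n_b)
    diff, z, p = results[n_a]
    print(f"n per arm={n_a:>9,}  lift={diff:.3%}  z={z:.2f}  p={p:.2g}")
```

The effect size is identical in both rows. Only the sample size changes, and with it the p-value, which is why the p-value alone says nothing about whether the lift matters.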

23.7 Type I / Type II errors & power

	Reality: no effect	Reality: real effect
Test says effect	Type I error (α, false positive)	Correct (power = 1−β)
Test says nothing	Correct	Type II error (β, false negative)
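
The cells of the table can be estimated by simulation. This sketch assumes normally distributed data, a two-sample t-test, and hypothetical settings (50 per group, a true effect of 0.5 standard deviations). It reports how often the test "says effect" in each reality.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
alpha, n, n_sims = 0.05, 50, 2_000   # hypothetical settings

def rejection_rate(true_effect):
    """Share of simulated experiments where a two-sample t-test gives p < alpha."""
    a = rng.normal(0.0, 1.0, size=(n_sims, n))
    b = rng.normal(true_effect, 1.0, size=(n_sims, n))
    p_values = stats.ttest_ind(a, b, axis=1).pvalue   # one test per simulated experiment
    return np.mean(p_values < alpha)

type_i_rate = rejection_rate(0.0)   # no real effect: every rejection is a false positive
power = rejection_rate(0.5)         # real effect of 0.5 SD: rejections are correct
print(f"Type I rate ~ {type_i_rate:.3f}  power ~ {power:.3f}  "
      f"Type II rate ~ {1 - power:.3f}")
```

The Type I rate lands near α by construction. Power depends on the effect size and n, so it has to be planned rather than assumed.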

Professional recommendation

Report	Estimate + 95% CI + effect size
Skewed metric	Bootstrap the CI
Significance	α = 0.05, but pre-register it
Power target	80%+ (plan n up front)
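
Planning n up front can be sketched with the standard normal-approximation formula for comparing two means, n per group ≈ 2(z₁₋α/₂ + z_power)² / d², where d is the expected difference in standard-deviation units. The `n_per_group` helper is hypothetical, and the approximation slightly understates what a t-test needs at small n.

```python
import math
from scipy import stats

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample comparison of means.

    effect_size is the difference in means divided by the standard deviation
    (Cohen's d). Uses the normal approximation rather than the t distribution.
    """
    z_alpha = stats.norm.ppf(1 - alpha / 2)   # critical value for the two-sided test
    z_power = stats.norm.ppf(power)           # quantile matching the power target
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)

for d in (0.8, 0.5, 0.2, 0.1):
    print(f"d={d}: about {n_per_group(d)} per group")
```

Halving the effect you want to detect quadruples the required sample, so agreeing on the smallest effect worth detecting is the most important input.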

Common mistakes to avoid

Reporting a point estimate with no uncertainty (no CI)
Reading a p-value as "probability the hypothesis is true"
Confusing statistical significance with business importance
Using mean + std on a clearly skewed (log-normal) variable
Running until significant ("peeking") instead of fixing n in advance
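
The cost of peeking is easy to simulate. In this sketch both arms come from the same distribution, so every "significant" result is a false positive. The checkpoint schedule and sample sizes are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
alpha, n_sims = 0.05, 1_000
checkpoints = range(100, 1_001, 100)   # hypothetical: look after every 100 users per arm

peeking_hits = fixed_hits = 0
for _ in range(n_sims):
    # Both arms share one distribution, so there is no real effect to find
    a = rng.normal(size=1_000)
    b = rng.normal(size=1_000)
    p_at_look = [stats.ttest_ind(a[:k], b[:k]).pvalue for k in checkpoints]
    peeking_hits += int(min(p_at_look) < alpha)   # would have stopped at some look
    fixed_hits += int(p_at_look[-1] < alpha)      # test once, at the planned n

peeking_rate = peeking_hits / n_sims
fixed_rate = fixed_hits / n_sims
print(f"false-positive rate with peeking: {peeking_rate:.1%}")
print(f"false-positive rate with fixed n: {fixed_rate:.1%}")
```

With a fixed n the false-positive rate stays near α. Stopping at the first significant look pushes it several times higher, even though each individual test uses the same threshold.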

Quick cheatsheet

stats.sem(x) -> standard error of the mean

stats.t.interval(0.95, n-1, loc=mean, scale=se) -> t-based confidence interval for the mean

np.percentile(boot, [2.5, 97.5]) -> bootstrap CI

stats.norm / binom / poisson -> distribution objects

effect size + CI -> always report alongside p
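
As an illustration of the distribution objects in the cheatsheet, the sketch below answers three small questions with `stats.binom`, `stats.poisson` and `stats.norm`. The visitor count, conversion rate and ticket rate are hypothetical parameters chosen only to show the calls.

```python
from scipy import stats

# Hypothetical parameters chosen only to illustrate the calls
visitors, conv_rate = 200, 0.05
tickets_per_hour = 3.0

# Binomial: probability of at least 15 conversions out of 200 visitors
# (sf(14) is P(X > 14), which equals P(X >= 15) for a discrete count)
p_at_least_15 = stats.binom.sf(14, visitors, conv_rate)

# Poisson: probability of exactly 0 support tickets in an hour
p_no_tickets = stats.poisson.pmf(0, tickets_per_hour)

# Normal: share of values within 1.96 standard deviations of the mean
p_within = stats.norm.cdf(1.96) - stats.norm.cdf(-1.96)

print(f"P(conversions >= 15) = {p_at_least_15:.3f}")
print(f"P(no tickets)        = {p_no_tickets:.3f}")
print(f"P(|z| <= 1.96)       = {p_within:.3f}")
```

Discrete distributions expose `pmf`, continuous ones expose `pdf`, and both share `cdf`, `sf` (1 − cdf), `ppf` (quantiles) and `rvs` (random draws).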