Probability & Statistics Foundations
The math under every model and test: distributions, the Central Limit Theorem, confidence intervals, bootstrapping, and what a p-value actually means.
You almost never measure the whole population — you measure a sample and infer. The gap between your sample estimate and the truth is sampling error, and it shrinks as sample size grows.
Population (unknown truth: mean μ, proportion p)
│ draw a random sample
▼
Sample statistic (x̄, p̂) ── estimates ──► Population parameter
│
└── uncertainty quantified by the standard error
SE = s / √n (smaller as n grows)

| Distribution | Shape | Models | Example |
|---|---|---|---|
| Normal (Gaussian) | Symmetric bell | Sums/averages of many effects | Heights, measurement error |
| Binomial | Discrete counts | k successes in n trials | Conversions out of N visitors |
| Poisson | Right-skewed counts | Rare events per interval | Support tickets per hour |
| Log-normal | Right-skewed positive | Multiplicative growth | Income, prices, session time |
| Uniform | Flat | Equal-likelihood values | Random IDs, dice |
| Exponential | Decaying | Time between events | Time to next purchase |
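
As a rough sketch of how the table maps onto code, the snippet below builds three of these distributions as `scipy.stats` objects and asks each one a probability question. All parameter values (1000 visitors, a 5% conversion rate, 3 tickets per hour, a mean wait of 2.0) are hypothetical and chosen only for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical parameters, for illustration only.
conversions = stats.binom(n=1000, p=0.05)  # conversions out of 1000 visitors
tickets = stats.poisson(mu=3.0)            # support tickets per hour
wait = stats.expon(scale=2.0)              # time between events, mean 2.0

# sf(x) is the survival function P(X > x), so sf(59) = P(X >= 60) for counts.
p_at_least_60 = conversions.sf(59)
p_zero_tickets = tickets.pmf(0)   # equals exp(-3)
p_wait_over_5 = wait.sf(5.0)      # equals exp(-5 / 2)

print(f"P(>= 60 conversions) = {p_at_least_60:.4f}")
print(f"P(no tickets in an hour) = {p_zero_tickets:.4f}")
print(f"P(wait > 5) = {p_wait_over_5:.4f}")
```

Freezing the parameters in an object like this keeps `pmf`, `cdf`, `sf`, and `rvs` calls consistent with one another.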
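The single most useful theorem in applied statistics is the Central Limit Theorem: for independent draws from a population with finite variance, the distribution of the sample mean is approximately normal for large n, whatever the shape of the underlying data. This is why t-tests and confidence intervals for means work reasonably well even on non-normal data, although heavily skewed or heavy-tailed data need a larger n before the approximation holds.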
    # Even a skewed population yields a normal-looking mean distribution
    import numpy as np

    pop = np.random.exponential(scale=2.0, size=1_000_000)  # very skewed
    means = [np.random.choice(pop, 50).mean() for _ in range(5000)]
    print(np.mean(means), np.std(means))  # ~normal, centred on the true mean
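
The same simulation can also tie the theorem back to the standard error: the spread of the sample means should be close to σ / √n. The sketch below is one way to check that. It is an assumption-laden variation on the snippet above, not a replacement for it: it draws directly from the exponential distribution rather than from a finite population array, and the sample sizes 10, 50, and 200 are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0  # an exponential with scale 2.0 has standard deviation 2.0

results = {}
for n in (10, 50, 200):
    # 5000 independent samples of size n, reduced to one mean per sample
    means = rng.exponential(scale=2.0, size=(5000, n)).mean(axis=1)
    observed = means.std(ddof=1)       # empirical spread of the sample means
    theoretical = sigma / np.sqrt(n)   # standard error predicted by sigma / sqrt(n)
    results[n] = (observed, theoretical)
    print(f"n={n:>3}  observed={observed:.3f}  sigma/sqrt(n)={theoretical:.3f}")
```

Quadrupling n should roughly halve the spread, which is the "smaller as n grows" behaviour in concrete terms.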
A 95% confidence interval means: if you repeated the study many times, ~95% of such intervals would contain the true value. Always prefer "12.4% ± 1.1%" to a bare "12.4%".
    from scipy import stats
    import numpy as np

    data = np.array([...])  # placeholder: your observations
    mean = data.mean()
    se = stats.sem(data)  # standard error
    ci = stats.t.interval(0.95, len(data)-1, loc=mean, scale=se)
    print(f"mean={mean:.2f} 95% CI={ci}")
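
The "repeat the study many times" reading can be checked by simulation. The sketch below assumes a known, made-up truth (normal data with mean 10.0 and standard deviation 3.0), runs 2000 hypothetical studies of 30 observations each, builds a t-based 95% interval for every study, and counts how often the interval covers the true mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_mean = 10.0           # hypothetical truth, known only because we simulate
n, n_studies = 30, 2000

# One row per simulated study
samples = rng.normal(loc=true_mean, scale=3.0, size=(n_studies, n))
means = samples.mean(axis=1)
ses = samples.std(axis=1, ddof=1) / np.sqrt(n)   # standard error per study

t_crit = stats.t.ppf(0.975, df=n - 1)            # two-sided 95% critical value
lo = means - t_crit * ses
hi = means + t_crit * ses

# Fraction of intervals that contain the true mean
coverage = np.mean((lo <= true_mean) & (true_mean <= hi))
print(f"coverage: {coverage:.3f}")
```

The coverage should land near 0.95. Any single interval either contains the truth or it does not, so the 95% describes the procedure, not one interval.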
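When a metric has no neat formula for its standard error (a median, a ratio, an AUC), resample the data with replacement thousands of times, recompute the metric each time, and read the percentiles of the results. This needs no assumption about the shape of the distribution, only that the sample is representative and its rows are independent.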
Original sample (n rows)
│ repeat B = 10,000 times
▼
Resample n rows WITH replacement ──► compute statistic
│
└── collect B statistics ──► 2.5th & 97.5th percentile = 95% CI

    import numpy as np

    boot = [
        np.median(np.random.choice(data, len(data), replace=True))
        for _ in range(10_000)
    ]
    lo, hi = np.percentile(boot, [2.5, 97.5])
    print(f"median 95% CI: [{lo:.2f}, {hi:.2f}]")
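
The same recipe extends to a ratio metric, where whole rows have to be resampled so that numerator and denominator stay paired. The sketch below is one possible shape for that. The helper `bootstrap_ci`, the `ctr` statistic, and the synthetic clicks and impressions data are all hypothetical names and values introduced here for illustration.

```python
import numpy as np

def bootstrap_ci(rows, statistic, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for statistic(rows), resampling whole rows."""
    rng = np.random.default_rng(seed)
    n = len(rows)
    boot = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # n row indices, WITH replacement
        boot[b] = statistic(rows[idx])
    tail = (1 - level) / 2 * 100           # 2.5 for a 95% interval
    return tuple(np.percentile(boot, [tail, 100 - tail]))

# Synthetic data: one row per page, columns = (clicks, impressions)
rng = np.random.default_rng(42)
impressions = rng.integers(50, 500, size=300)
clicks = rng.binomial(impressions, 0.04)
rows = np.column_stack([clicks, impressions])

def ctr(r):
    # Ratio of sums: total clicks divided by total impressions
    return r[:, 0].sum() / r[:, 1].sum()

point = ctr(rows)
lo, hi = bootstrap_ci(rows, ctr)
print(f"CTR = {point:.4f}, 95% CI [{lo:.4f}, {hi:.4f}]")
```

Resampling the two columns separately would break the pairing between them and give a misleading interval.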
It IS
The probability of seeing data at least this extreme if the null hypothesis were true. A small p means the data would be surprising under "no effect".

It is NOT
The probability that the hypothesis is true. NOT the size or importance of the effect. NOT "p=0.04 means 96% sure".

| | Reality: no effect | Reality: real effect |
|---|---|---|
| Test says effect | Type I error (α, false positive) | Correct (power = 1−β) |
| Test says nothing | Correct | Type II error (β, false negative) |
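
Both error rates in the table can be estimated by simulation. The sketch below assumes a two-sample t-test at α = 0.05 with 50 observations per group, and a hypothetical true effect of 0.5 standard deviations for the "real effect" case. It reports how often the test rejects in each reality.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
alpha, n, n_sims = 0.05, 50, 2000

def rejection_rate(true_diff):
    """Share of simulated experiments in which p < alpha."""
    a = rng.normal(0.0, 1.0, size=(n_sims, n))
    b = rng.normal(true_diff, 1.0, size=(n_sims, n))
    p = stats.ttest_ind(a, b, axis=1).pvalue   # one p-value per simulated experiment
    return np.mean(p < alpha)

type1_rate = rejection_rate(0.0)   # no real effect: rejections are false positives
power = rejection_rate(0.5)        # real effect: rejections are correct detections

print(f"Type I error rate ~ {type1_rate:.3f} (target alpha = {alpha})")
print(f"Power ~ {power:.3f}, so Type II error rate ~ {1 - power:.3f}")
```

Running the same function over a grid of n values is a simple way to choose a sample size before the experiment starts.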
Professional recommendation: avoid these common mistakes.
- Reporting a point estimate with no uncertainty (no CI)
- Reading a p-value as "probability the hypothesis is true"
- Confusing statistical significance with business importance
- Using mean + std on a clearly skewed (log-normal) variable
- Running until significant ("peeking") instead of fixing n in advance
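
The cost of peeking is easy to see in a simulation. The sketch below assumes there is no real effect at all, checks a t-test after every 20 observations per group up to 200 (a hypothetical schedule), and compares "stop at the first p < 0.05" against a single test at the pre-fixed n.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
alpha, n_sims, n_max = 0.05, 2000, 200
looks = range(20, n_max + 1, 20)   # peek at n = 20, 40, ..., 200 per group

# Both groups come from the same distribution, so every rejection is a false positive.
a = rng.normal(size=(n_sims, n_max))
b = rng.normal(size=(n_sims, n_max))

# p-value at each look, one column per look
p_at_looks = np.column_stack(
    [stats.ttest_ind(a[:, :k], b[:, :k], axis=1).pvalue for k in looks]
)

peeking_rate = np.mean(p_at_looks.min(axis=1) < alpha)  # significant at ANY look
fixed_rate = np.mean(p_at_looks[:, -1] < alpha)         # single test at n = 200

print(f"false positive rate with peeking: {peeking_rate:.3f}")
print(f"false positive rate at fixed n:   {fixed_rate:.3f}")
```

The fixed-n rate should stay near α, while the peeking rate comes out several times higher.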
stats.sem(x) -> standard error of the mean
stats.t.interval(0.95, n-1, loc, scale) -> confidence interval
np.percentile(boot, [2.5, 97.5]) -> bootstrap CI
stats.norm / binom / poisson -> distribution objects
effect size + CI -> always report alongside p
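
As a sketch of the last line in practice, the snippet below reports a difference in means together with its 95% CI, a standardized effect size, and the p-value. The control and treatment data are synthetic, and the choices of Welch's unequal-variance t-test and Cohen's d are assumptions made here for illustration, not something the text prescribes.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
control = rng.normal(100.0, 15.0, size=400)     # synthetic data
treatment = rng.normal(103.0, 15.0, size=400)   # synthetic data

diff = treatment.mean() - control.mean()

# Welch standard error and Welch-Satterthwaite degrees of freedom
v_t = treatment.var(ddof=1) / len(treatment)
v_c = control.var(ddof=1) / len(control)
se_diff = np.sqrt(v_t + v_c)
df = (v_t + v_c) ** 2 / (
    v_t**2 / (len(treatment) - 1) + v_c**2 / (len(control) - 1)
)
t_crit = stats.t.ppf(0.975, df)
ci = (diff - t_crit * se_diff, diff + t_crit * se_diff)

# Cohen's d with a pooled SD (simple average of variances, valid for equal n)
pooled_sd = np.sqrt((treatment.var(ddof=1) + control.var(ddof=1)) / 2)
cohens_d = diff / pooled_sd

p_value = stats.ttest_ind(treatment, control, equal_var=False).pvalue

print(f"difference = {diff:.2f}, 95% CI [{ci[0]:.2f}, {ci[1]:.2f}]")
print(f"Cohen's d = {cohens_d:.2f}, p = {p_value:.4f}")
```

The CI and the p-value come from the same Welch statistic, so the interval excludes zero exactly when p < 0.05.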