Chapter 26 — Experimentation
A/B Testing & Experimentation
Design experiments that give trustworthy answers: hypotheses, power and sample size, running the test, reading results, and the traps that produce false wins.
A/B testing is how product and growth decisions are actually made. The hard part isn't the math — it's designing the test so the result means something.
26.1 The experiment lifecycle
- Hypothesis
- Pick metric
- Power & sample size
- Randomize
- Run (fixed horizon)
- Analyze
- Decide
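The "Randomize" step is often implemented as deterministic bucketing, so a returning user always sees the same variant. The sketch below is one common way to do that, not a prescribed method: `assign_variant`, the experiment name `cta-green`, and the `user-<i>` IDs are all hypothetical names chosen for illustration.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, variants=("A", "B")) -> str:
    """Deterministically map a user to a variant (hypothetical helper)."""
    # Salt the hash with the experiment name so the same user can land in
    # different arms across unrelated experiments.
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]

# Illustrative check on made-up user IDs: the split should be close to 50/50.
counts = {"A": 0, "B": 0}
for i in range(10_000):
    counts[assign_variant(f"user-{i}", "cta-green")] += 1
print(counts)
```

Hashing on a stable user ID rather than drawing a fresh random number per request keeps assignment consistent across sessions, which matters because the unit of analysis is the user.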
26.2 A good hypothesis
Weak
"Let's make the button green and see what happens." No metric, no expected effect, no decision rule.
Strong
"Changing the CTA to green will increase signup conversion. We'll ship if it lifts conversion by ≥2 percentage points (primary metric) with p<0.05."
26.3 Power & sample size: decide n BEFORE running
Four quantities are linked: effect size, sample size, significance (α), and power. Fix three, solve the fourth. Underpowered tests waste traffic and miss real effects.
Drivers of required sample size

Bigger n is needed when you want:
- a smaller detectable effect (MDE ↓ ⇒ n ↑↑)
- higher confidence (lower α)
- higher power (e.g. 80% → 90%)

and when the metric has higher variance.
```python
# Sample size per arm for a two-sided test on proportions
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline, target = 0.10, 0.12  # 10% -> 12% (MDE = 2pp)

# Cohen's h effect size for the two proportions
es = proportion_effectsize(target, baseline)

# Fix effect size, alpha, and power; solve for n
n = NormalIndPower().solve_power(
    effect_size=es, alpha=0.05, power=0.8, alternative='two-sided'
)
print(f"need ~{n:.0f} users per arm")
```
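To see why a smaller MDE drives n up so sharply, it can help to write the calculation out by hand. The sketch below uses the textbook normal-approximation formula for two proportions with unpooled variance; `n_per_arm` is a hypothetical helper, and its result will differ slightly from the statsmodels figure, which works from Cohen's h.

```python
from math import ceil
from statistics import NormalDist

def n_per_arm(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users per arm for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    variance = p1 * (1 - p1) + p2 * (1 - p2)       # sum of the Bernoulli variances
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)

n_2pp = n_per_arm(0.10, 0.12)  # MDE = 2pp
n_1pp = n_per_arm(0.10, 0.11)  # MDE = 1pp
print(f"2pp MDE: ~{n_2pp} per arm; 1pp MDE: ~{n_1pp} per arm ({n_1pp / n_2pp:.1f}x)")
```

Because the MDE enters as a squared term in the denominator, halving it roughly quadruples the required sample size.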
26.4 Analyzing the result
| Metric type | Test | Report |
|---|---|---|
| Conversion (yes/no) | Two-proportion z-test / chi-square | Lift + 95% CI on the difference |
| Continuous (revenue, time) | Welch's t-test (or Mann-Whitney if skewed) | Mean diff + CI |
| Skewed revenue | Bootstrap / trimmed mean | CI on the difference |
```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

conv = np.array([1180, 1310])   # conversions A, B
n = np.array([10000, 10000])    # users A, B

# Two-sided two-proportion z-test
stat, p = proportions_ztest(conv, n)

# Absolute lift of B over A, in percentage points
lift_pp = (conv[1] / n[1] - conv[0] / n[0]) * 100
print(f"lift={lift_pp:.2f}pp p={p:.4f}")
```
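The table above asks for a lift plus a 95% CI on the difference, not only a p-value. A minimal sketch of a Wald interval on the same counts, using only the standard library, might look like this. `diff_ci` is a hypothetical helper, and the Wald interval is just one of several reasonable choices for a difference in proportions.

```python
from math import sqrt
from statistics import NormalDist

def diff_ci(conv_a, n_a, conv_b, n_b, level=0.95):
    """Wald confidence interval for p_b - p_a (unpooled standard error)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    return diff, diff - z * se, diff + z * se

diff, low, high = diff_ci(1180, 10_000, 1310, 10_000)
print(f"lift={diff * 100:.2f}pp  95% CI=[{low * 100:.2f}, {high * 100:.2f}]pp")
```

An interval that excludes zero agrees with a significant z-test, and its width shows how precisely the lift is known, which is what the ship decision usually depends on.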
26.5 Traps that manufacture false wins
Peeking
Checking daily and stopping when p<0.05. Every extra look is another chance to cross the threshold by luck, so the false-positive rate climbs well above the nominal 5%.
Fix
Pre-commit to a sample size / end date, or use sequential / Bayesian methods built for early stopping.
Multiple metrics
Testing 20 metrics, celebrating the one that's significant.
Fix
One primary metric. Correct for multiple comparisons (Bonferroni / FDR) on the rest.

| Trap | Symptom | Prevention |
|---|---|---|
| Peeking | Stopped early on a lucky day | Fixed horizon or sequential test |
| Sample ratio mismatch | Observed split deviates from the planned ratio (e.g. 50/50) | Chi-square SRM check before analysis |
| Novelty effect | Early lift fades | Run a full business cycle (≥1–2 weeks) |
| Simpson's paradox | Overall ≠ per-segment | Check key segments separately |
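The peeking trap is easy to demonstrate with an A/A simulation, where both arms share the same true conversion rate and any "win" is a false positive. The sketch below is illustrative only: the traffic (100 users per arm per day), the 14-day horizon, and the 10% rate are made-up assumptions, and `simulate` and `p_value` are hypothetical helpers.

```python
import random
from math import sqrt
from statistics import NormalDist

def p_value(conv_a, conv_b, n):
    """Two-sided pooled two-proportion z-test with n users in each arm."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return 1.0
    z = (conv_b - conv_a) / n / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def simulate(sims=1000, days=14, users_per_day=100, rate=0.10, seed=0):
    """Return (false-positive rate with daily peeking, rate at fixed horizon)."""
    rng = random.Random(seed)
    peek_hits = fixed_hits = 0
    for _ in range(sims):
        conv_a = conv_b = 0
        any_significant = False
        for day in range(1, days + 1):
            # Both arms draw from the same rate: there is no real effect.
            conv_a += sum(rng.random() < rate for _ in range(users_per_day))
            conv_b += sum(rng.random() < rate for _ in range(users_per_day))
            p = p_value(conv_a, conv_b, day * users_per_day)
            if p < 0.05:
                any_significant = True  # a peeker would stop and ship here
        peek_hits += any_significant
        fixed_hits += p < 0.05          # only the final-day p-value counts
    return peek_hits / sims, fixed_hits / sims

peek_rate, fixed_rate = simulate()
print(f"false positives with daily peeking: {peek_rate:.1%}")
print(f"false positives at fixed horizon:   {fixed_rate:.1%}")
```

The fixed-horizon rate should sit near the nominal 5%, while the peeking rate should come out several times higher, even though nothing differs between the arms.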
Professional recommendation
- Before: fix metric, MDE, n, duration
- During: don't peek; check SRM
- Variance: CUPED to detect smaller effects
- After: lift + CI, then decide
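As a rough illustration of the CUPED item, the adjustment subtracts the part of the metric that a pre-experiment covariate already predicts. The sketch below runs on synthetic data; `cuped_adjust` is a hypothetical helper and the data-generating numbers are made up. In practice θ would typically be estimated once on pooled data from both arms, and the adjusted values then compared between arms.

```python
import random
from statistics import fmean, variance

def cuped_adjust(y, x):
    """Return CUPED-adjusted metric values given a pre-period covariate x."""
    x_bar, y_bar = fmean(x), fmean(y)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov_xy / variance(x)  # regression slope of y on x
    # Remove the variation explained by x; the mean of y is unchanged.
    return [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]

# Synthetic example: in-experiment spend correlates with pre-period spend.
rng = random.Random(1)
x = [rng.gauss(50, 10) for _ in range(5000)]   # pre-period metric (assumed)
y = [0.8 * xi + rng.gauss(0, 5) for xi in x]   # in-experiment metric (assumed)

y_adj = cuped_adjust(y, x)
print(f"variance before: {variance(y):.1f}, after CUPED: {variance(y_adj):.1f}")
```

The stronger the correlation between the pre-period covariate and the metric, the larger the variance reduction, and the smaller the effect a test of the same size can detect.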
Common mistakes to avoid
- Peeking at results and stopping when it looks significant
- Not computing sample size — running underpowered tests
- Cherry-picking the one metric (of many) that won
- Ending the test before a full weekly cycle (novelty/day-of-week effects)
- Ignoring a sample ratio mismatch that signals a broken split
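The cherry-picking mistake above is what multiple-comparison corrections guard against. A minimal sketch of Bonferroni and Benjamini-Hochberg (FDR) on a list of secondary-metric p-values follows. The p-values are invented for illustration, and both functions are hypothetical helpers rather than a library API.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only p-values at or below alpha / m (controls family-wise error)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank with
    p_(k) <= k / m * alpha (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            max_rank = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        rejected[i] = rank <= max_rank
    return rejected

# Made-up p-values for six secondary metrics.
p_vals = [0.001, 0.012, 0.030, 0.041, 0.20, 0.64]
naive = [p < 0.05 for p in p_vals]
print("naive p<0.05:      ", sum(naive), "significant")
print("Bonferroni:        ", sum(bonferroni(p_vals)), "significant")
print("Benjamini-Hochberg:", sum(benjamini_hochberg(p_vals)), "significant")
```

Bonferroni is the stricter of the two. Benjamini-Hochberg tolerates a controlled share of false discoveries and so keeps more power when many metrics are examined.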
Quick cheatsheet
- solve_power(effect_size, alpha, power) -> required n
- proportions_ztest() -> conversion test
- ttest_ind(equal_var=False) -> Welch's t-test
- SRM chi-square -> validate the split
- CUPED -> cut variance using pre-period data
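For the SRM chi-square item, a two-arm check can be sketched with the standard library alone, since a chi-square statistic with one degree of freedom has a closed-form tail probability. `srm_p_value` is a hypothetical helper, the user counts are made up, and the p < 0.001 alert threshold is a commonly used convention assumed here, not something this chapter specifies.

```python
from math import erfc, sqrt

def srm_p_value(n_a: int, n_b: int, expected_share_a: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value for a two-arm split (1 degree of freedom)."""
    total = n_a + n_b
    exp_a = total * expected_share_a
    exp_b = total - exp_a
    chi2 = (n_a - exp_a) ** 2 / exp_a + (n_b - exp_b) ** 2 / exp_b
    # For 1 df, P(chi2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    return erfc(sqrt(chi2 / 2))

for n_a, n_b in [(10_000, 10_100), (10_000, 10_500)]:
    p = srm_p_value(n_a, n_b)
    flag = "SRM suspected" if p < 0.001 else "ok"
    print(f"{n_a} vs {n_b}: p={p:.4f} -> {flag}")
```

A flagged split usually points to a bug in assignment or logging, so the result metrics should not be trusted until the cause is found.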