Chapter 26 — Experimentation

A/B Testing & Experimentation

Design experiments that give trustworthy answers: hypotheses, power and sample size, running the test, reading results, and the traps that produce false wins.

A/B testing is how product and growth decisions are actually made. The hard part isn't the math — it's designing the test so the result means something.
26.1 The experiment lifecycle
26.2 A good hypothesis
Weak
"Let's make the button green and see what happens." No metric, no expected effect, no decision rule.
Strong
"Changing the CTA to green will increase signup conversion. We'll ship if it lifts conversion by ≥2 percentage points (primary metric) with p<0.05."
26.3 Power & sample size — decide n BEFORE running

Four quantities are linked: effect size, sample size, significance (α), and power. Fix three, solve the fourth. Underpowered tests waste traffic and miss real effects.

drivers of required sample size
Bigger n needed when...
│
├── smaller effect you want to detect (MDE ↓ ⇒ n ↑↑)
├── higher confidence (lower α)
├── higher power (e.g. 80% → 90%)
└── higher metric variance
python
# Sample size per arm for a proportion test
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline, target = 0.10, 0.12          # 10% -> 12% (MDE = 2pp)
es = proportion_effectsize(target, baseline)
n = NormalIndPower().solve_power(effect_size=es, alpha=0.05, power=0.8, alternative='two-sided')
print(f"need ~{n:.0f} users per arm")
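To see how strongly the required n depends on the MDE, the textbook normal-approximation formula can be evaluated directly. This is a minimal sketch using only the standard library; `n_per_arm` is a hypothetical helper name, and the numbers can differ slightly from the statsmodels result above because the formula uses a different approximation for the effect size.

```python
from math import ceil
from statistics import NormalDist

def n_per_arm(p_base: float, p_target: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Per-arm sample size for a two-sided two-proportion test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for significance
    z_power = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance = p_base * (1 - p_base) + p_target * (1 - p_target)
    return ceil((z_alpha + z_power) ** 2 * variance / (p_target - p_base) ** 2)

# Shrinking the MDE from 2pp to 1pp to 0.5pp on a 10% baseline
for target in (0.12, 0.11, 0.105):
    print(f"10% -> {target:.1%}: ~{n_per_arm(0.10, target)} users per arm")
```

Because the MDE enters the formula squared in the denominator, halving it roughly quadruples the required sample size.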
26.4 Analyzing the result
| Metric type | Test | Report |
| --- | --- | --- |
| Conversion (yes/no) | Two-proportion z-test / chi-square | Lift + 95% CI on the difference |
| Continuous (revenue, time) | Welch's t-test (or Mann-Whitney if skewed) | Mean diff + CI |
| Skewed revenue | Bootstrap / trimmed mean | CI on the difference |
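For the skewed-revenue row, a percentile bootstrap is one common way to get a CI on the difference in means. The sketch below is illustrative only: `bootstrap_diff_ci` is a hypothetical helper, and the revenue data is synthetic (log-normal draws), not a real result.

```python
import random

def bootstrap_diff_ci(a, b, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for mean(b) - mean(a)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each arm with replacement, independently of the other arm.
        resampled_a = rng.choices(a, k=len(a))
        resampled_b = rng.choices(b, k=len(b))
        diffs.append(sum(resampled_b) / len(b) - sum(resampled_a) / len(a))
    diffs.sort()
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return diffs[lo_idx], diffs[hi_idx]

# Synthetic, heavy-tailed "revenue per user" (made-up data for illustration).
rng = random.Random(42)
revenue_a = [rng.lognormvariate(3.0, 1.0) for _ in range(500)]
revenue_b = [rng.lognormvariate(3.1, 1.0) for _ in range(500)]

low, high = bootstrap_diff_ci(revenue_a, revenue_b)
print(f"95% CI for mean difference: [{low:.2f}, {high:.2f}]")
```

If the interval excludes zero, the difference is unlikely to be noise alone; with heavy tails the interval is often wide, which is itself useful to report.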
python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

conv = np.array([1180, 1310])      # conversions A, B
n    = np.array([10000, 10000])    # users A, B
stat, p = proportions_ztest(conv, n)                # two-sided two-proportion z-test
lift_pp = (conv[1] / n[1] - conv[0] / n[0]) * 100   # absolute lift in percentage points
print(f"lift={lift_pp:.2f}pp  p={p:.4f}")
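The table above asks for a 95% CI on the difference alongside the lift, which the z-test alone does not give. A minimal sketch of a Wald interval, using only the standard library and the same counts; `diff_ci` is a hypothetical helper, and other interval methods (for example Newcombe's) may be preferable at small sample sizes.

```python
from math import sqrt
from statistics import NormalDist

def diff_ci(conv_a, n_a, conv_b, n_b, level=0.95):
    """Wald confidence interval for the difference in conversion rates (B - A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)   # unpooled standard error
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)              # e.g. ~1.96 for 95%
    diff = p_b - p_a
    return diff - z * se, diff + z * se

low, high = diff_ci(1180, 10000, 1310, 10000)
print(f"lift 95% CI: [{low * 100:.2f}pp, {high * 100:.2f}pp]")
```

Comparing the lower bound of the interval against the pre-registered minimum lift is a stricter check than the p-value alone.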
26.5 Traps that manufacture false wins
Peeking
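Checking results daily and stopping the first time p<0.05. Each extra look is another chance for noise to cross the threshold, so the false-positive rate climbs well above the nominal 5%.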
Fix
Pre-commit to a sample size / end date, or use sequential / Bayesian methods built for early stopping.
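The effect of peeking can be checked with a small A/A simulation, where no true difference exists. This sketch assumes each look adds an equal-sized batch of data and that the batch-level test statistic is approximately standard normal under the null; `false_positive_rates` and its defaults are illustrative choices.

```python
import random
from math import sqrt
from statistics import NormalDist

def false_positive_rates(n_experiments=2000, n_looks=20, alpha=0.05, seed=1):
    """Simulate A/A tests: stop at any significant look vs. test once at the end."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    peeking_hits = fixed_hits = 0
    for _ in range(n_experiments):
        running_sum = 0.0
        ever_significant = False
        for look in range(1, n_looks + 1):
            running_sum += rng.gauss(0.0, 1.0)   # one more batch of data, no true effect
            z = running_sum / sqrt(look)         # z-statistic on all data seen so far
            if abs(z) > z_crit:
                ever_significant = True          # a peeker would have stopped here
        peeking_hits += ever_significant
        fixed_hits += abs(z) > z_crit            # fixed horizon: only the final look counts
    return peeking_hits / n_experiments, fixed_hits / n_experiments

peeking_rate, fixed_rate = false_positive_rates()
print(f"false positives with peeking at every look: {peeking_rate:.1%}")
print(f"false positives with a fixed horizon:       {fixed_rate:.1%}")
```

The fixed-horizon rate should stay near α, while the peeking rate grows with the number of looks.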
Multiple metrics
Testing 20 metrics and celebrating the one that's significant. With 20 independent metrics at α=0.05, the chance of at least one false positive is about 64%.
Fix
One primary metric. Correct for multiple comparisons (Bonferroni / FDR) on the rest.
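Both corrections named above are short enough to write out. A minimal sketch, assuming a list of p-values for the secondary metrics; the p-values here are made up for illustration, and the function names are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only p-values at or below alpha / m (controls family-wise error)."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    # Reject every hypothesis ranked at or below that k.
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            reject[idx] = True
    return reject

secondary_p = [0.001, 0.012, 0.025, 0.045, 0.20]   # illustrative values only
print("Bonferroni:", bonferroni(secondary_p))
print("BH (FDR):  ", benjamini_hochberg(secondary_p))
```

Bonferroni is the more conservative of the two; FDR control accepts a bounded share of false discoveries in exchange for more power.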
| Trap | Symptom | Prevention |
| --- | --- | --- |
| Peeking | Stopped early on a lucky day | Fixed horizon or sequential test |
| Sample ratio mismatch | Split isn't 50/50 | Chi-square SRM check before analysis |
| Novelty effect | Early lift fades | Run a full business cycle (≥1–2 weeks) |
| Simpson's paradox | Overall ≠ per-segment | Check key segments separately |
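The SRM check in the table is a chi-square goodness-of-fit test on the arm counts. A minimal standard-library sketch for two arms; `srm_p_value` is a hypothetical helper, and the 0.001 alert threshold is an assumed, commonly used conservative choice rather than something this chapter specifies.

```python
from math import erfc, sqrt

def srm_p_value(n_a: int, n_b: int, expected_share_a: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for a two-arm split."""
    total = n_a + n_b
    exp_a = total * expected_share_a
    exp_b = total - exp_a
    chi2 = (n_a - exp_a) ** 2 / exp_a + (n_b - exp_b) ** 2 / exp_b
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    return erfc(sqrt(chi2 / 2))

for n_a, n_b in [(10000, 10050), (10000, 10500)]:
    p = srm_p_value(n_a, n_b)
    flag = "possible SRM, investigate before analysis" if p < 0.001 else "split looks fine"
    print(f"{n_a} vs {n_b}: p={p:.4f} -> {flag}")
```

A failed SRM check usually points to a bug in assignment or logging, so the lift should not be trusted until the cause is found.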

Professional recommendation

Before: Fix metric, MDE, n, duration
During: Don't peek; check SRM
Variance: CUPED to detect smaller effects
After: Lift + CI, then decide
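The CUPED step can be sketched in a few lines: subtract from each user's outcome the part that is predictable from their pre-experiment value. The data below is synthetic and the helper names are hypothetical. In a real test, θ is typically estimated on data pooled across arms and then applied to both arms.

```python
import random

def cuped_adjust(y, x):
    """Return CUPED-adjusted outcomes: y - theta * (x - mean(x))."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    theta = cov_xy / var_x   # regression slope of y on the pre-period covariate
    return [yi - theta * (xi - mean_x) for xi, yi in zip(x, y)]

def variance(values):
    """Sample variance."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

# Synthetic users whose in-experiment metric correlates with their pre-period metric.
rng = random.Random(7)
pre = [rng.gauss(100, 20) for _ in range(5000)]
post = [0.8 * x + rng.gauss(0, 5) for x in pre]

adjusted = cuped_adjust(post, pre)
print(f"variance before: {variance(post):.1f}, after CUPED: {variance(adjusted):.1f}")
```

The adjustment leaves the mean unchanged and removes the variance explained by the pre-period covariate, so the gain depends on how strongly the two are correlated.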
Common mistakes to avoid
Quick cheatsheet
solve_power(effect_size, alpha, power) -> required n
proportions_ztest() -> conversion test
ttest_ind(equal_var=False) -> Welch's t-test
SRM chi-square -> validate the split
CUPED -> cut variance using pre-period data