Chapter 26 — Experimentation

A/B Testing & Experimentation

Design experiments that give trustworthy answers: hypotheses, power and sample size, running the test, reading results, and the traps that produce false wins.

A/B testing is how product and growth decisions are actually made. The hard part isn't the math — it's designing the test so the result means something.

26.1 The experiment lifecycle

Hypothesis
Pick metric
Power & sample size
Randomize
Run (fixed horizon)
Analyze
Decide
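
The Randomize step is commonly implemented by hashing a stable user ID, so a returning user always sees the same variant. The sketch below assumes string user IDs and a per-experiment salt; the helper name `assign_variant` and the experiment name `cta-color` are hypothetical and not taken from any particular platform.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, variants=("A", "B")) -> str:
    """Deterministically map a user to a variant (hypothetical helper)."""
    # Salting with the experiment name keeps assignments independent across experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]

# The same user always lands in the same arm; across many users the split is close to even.
counts = {"A": 0, "B": 0}
for i in range(10_000):
    counts[assign_variant(f"user-{i}", "cta-color")] += 1
print(counts)
```

Hashing instead of drawing a random number at request time means the assignment can be recomputed later for analysis without storing it.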

26.2 A good hypothesis

Weak

"Let's make the button green and see what happens." No metric, no expected effect, no decision rule.

Strong

"Changing the CTA to green will increase signup conversion. We'll ship if it lifts conversion (primary metric) by at least 2 percentage points with p<0.05."

26.3 Power & sample size — decide n BEFORE running

Four quantities are linked: effect size, sample size, significance level (α), and power. Fix three and solve for the fourth. Underpowered tests waste traffic and miss real effects.

drivers of required sample size

Bigger n needed when...
│
├── smaller effect you want to detect (MDE ↓ ⇒ n ↑↑)
├── higher confidence (lower α)
├── higher power (e.g. 80% → 90%)
└── higher metric variance

python

# Sample size per arm for a proportion test
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline, target = 0.10, 0.12          # 10% -> 12% (MDE = 2pp)
es = proportion_effectsize(target, baseline)
n = NormalIndPower().solve_power(effect_size=es, alpha=0.05, power=0.8, alternative='two-sided')
print(f"need ~{n:.0f} users per arm")
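
To see why a smaller MDE is the most expensive driver, the sketch below recomputes the per-arm sample size for several MDEs at the same 10% baseline. It uses the normal-approximation closed form with Cohen's h rather than `solve_power`, so the numbers should agree closely but not necessarily exactly; the helper name `n_per_arm` is an assumption for illustration.

```python
from math import asin, sqrt, ceil
from statistics import NormalDist

def n_per_arm(baseline, target, alpha=0.05, power=0.8):
    """Approximate users per arm for a two-sided two-proportion test."""
    h = 2 * asin(sqrt(target)) - 2 * asin(sqrt(baseline))   # Cohen's h effect size
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)            # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)                     # quantile for the target power
    return ceil(2 * (z_alpha + z_beta) ** 2 / h ** 2)

baseline = 0.10
for mde_pp in (4, 2, 1, 0.5):
    n = n_per_arm(baseline, baseline + mde_pp / 100)
    print(f"MDE {mde_pp}pp -> ~{n} users per arm")
```

Because n scales with 1/h², halving the MDE roughly quadruples the required traffic.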

26.4 Analyzing the result

Metric type	Test	Report
Conversion (yes/no)	Two-proportion z-test / chi-square	Lift + 95% CI on the difference
Continuous (revenue, time)	Welch's t-test (or Mann-Whitney if skewed)	Mean diff + CI
Skewed revenue	Bootstrap / trimmed mean	CI on the difference
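
For the continuous and skewed-revenue rows, a minimal sketch of Welch's t-test plus a percentile bootstrap CI on the difference in means. The revenue arrays are synthetic log-normal draws used only for illustration, and the choice of 2000 resamples is an assumption, not a recommendation from the text.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic, illustrative data only: skewed revenue per user in each arm.
rev_a = rng.lognormal(mean=3.0, sigma=1.0, size=5000)
rev_b = rng.lognormal(mean=3.05, sigma=1.0, size=5000)

# Welch's t-test: does not assume equal variances in the two arms.
t_stat, p_welch = stats.ttest_ind(rev_b, rev_a, equal_var=False)

# Percentile bootstrap CI on the difference in means (B - A).
boot = np.empty(2000)
for i in range(boot.size):
    a = rng.choice(rev_a, size=rev_a.size, replace=True)
    b = rng.choice(rev_b, size=rev_b.size, replace=True)
    boot[i] = b.mean() - a.mean()
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])

diff = rev_b.mean() - rev_a.mean()
print(f"mean diff={diff:.2f}  95% CI=({ci_low:.2f}, {ci_high:.2f})  Welch p={p_welch:.4f}")
```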

python

import numpy as np
from statsmodels.stats.proportion import proportions_ztest

conv = np.array([1180, 1310])     # conversions A, B
n    = np.array([10000, 10000])   # users A, B
stat, p = proportions_ztest(conv, n)                 # two-sided two-proportion z-test
lift_pp = (conv[1] / n[1] - conv[0] / n[0]) * 100    # absolute lift in percentage points
print(f"lift={lift_pp:.2f}pp  p={p:.4f}")
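
The table asks for a 95% CI on the difference, which the p-value alone does not give. A minimal sketch using the same conversion counts and a Wald (normal-approximation) interval; other interval methods exist and may behave better at small n or extreme rates.

```python
import numpy as np
from scipy.stats import norm

conv = np.array([1180, 1310])      # conversions A, B
n = np.array([10000, 10000])       # users A, B
p_a, p_b = conv / n
diff = p_b - p_a                   # absolute difference in conversion rate

# Unpooled standard error of the difference between two proportions.
se = np.sqrt(p_a * (1 - p_a) / n[0] + p_b * (1 - p_b) / n[1])
z = norm.ppf(0.975)                # two-sided 95% critical value
ci_low, ci_high = diff - z * se, diff + z * se
print(f"diff={diff*100:.2f}pp  95% CI=({ci_low*100:.2f}pp, {ci_high*100:.2f}pp)")
```

An interval that excludes zero agrees with a significant test, and its width shows how precisely the lift is actually known.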

26.5 Traps that manufacture false wins

Peeking

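Checking results daily and stopping as soon as p<0.05. Every extra look is another chance to cross the threshold by luck, so the real false-positive rate ends up well above the nominal 5%.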

Fix

Pre-commit to a sample size / end date, or use sequential / Bayesian methods built for early stopping.
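
The cost of peeking can be checked with an A/A simulation, where both arms share the same true rate and any "win" is a false positive. The sketch below assumes 14 daily looks, 500 users per arm per day, and a 10% conversion rate; all of these settings are illustrative.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
n_sims, days, users_per_day, rate = 2000, 14, 500, 0.10

# Cumulative conversions per arm after each day; both arms have the same true rate (A/A).
conv_a = rng.binomial(users_per_day, rate, size=(n_sims, days)).cumsum(axis=1)
conv_b = rng.binomial(users_per_day, rate, size=(n_sims, days)).cumsum(axis=1)
n_cum = users_per_day * np.arange(1, days + 1)     # users per arm after each day

# Pooled two-proportion z-test at every daily look.
p_pool = (conv_a + conv_b) / (2 * n_cum)
se = np.sqrt(p_pool * (1 - p_pool) * 2 / n_cum)
z = (conv_b - conv_a) / n_cum / se
p_values = 2 * norm.sf(np.abs(z))

fpr_peeking = (p_values < 0.05).any(axis=1).mean()   # stop at the first "significant" day
fpr_fixed = (p_values[:, -1] < 0.05).mean()          # look only once, at the end
print(f"false-positive rate: peeking={fpr_peeking:.1%}  fixed horizon={fpr_fixed:.1%}")
```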

Multiple metrics

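Testing 20 metrics, celebrating the one that's significant. At α=0.05, 20 independent metrics with no real effect still produce at least one "significant" result about 64% of the time.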

Fix

One primary metric. Correct for multiple comparisons (Bonferroni / FDR) on the rest.
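
A minimal sketch of both corrections using `multipletests` from statsmodels. The p-values are hypothetical secondary-metric results made up for illustration.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Chance of at least one false positive across 20 independent null metrics at alpha=0.05.
family_wise = 1 - 0.95 ** 20
print(f"P(at least one false win) = {family_wise:.0%}")

# Hypothetical p-values for secondary metrics (illustrative only).
p_values = np.array([0.003, 0.012, 0.030, 0.045, 0.200, 0.650])

reject_bonf, p_bonf, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
reject_fdr, p_fdr, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

for p, rb, rf in zip(p_values, reject_bonf, reject_fdr):
    bonf = "significant" if rb else "not significant"
    fdr = "significant" if rf else "not significant"
    print(f"p={p:.3f}  Bonferroni: {bonf:<15}  FDR (BH): {fdr}")
```

Bonferroni controls the chance of any false positive and is the stricter of the two; Benjamini-Hochberg controls the expected share of false positives among the metrics flagged, so it usually keeps more of them.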

Trap	Symptom	Prevention
Peeking	Stopped early on a lucky day	Fixed horizon or sequential test
Sample ratio mismatch	Observed split deviates from the planned allocation (e.g. 50/50)	Chi-square SRM check before analysis
Novelty effect	Early lift fades	Run at least 1–2 full weeks and check that the lift holds over time
Simpson's paradox	Overall ≠ per-segment	Check key segments separately
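
The SRM check in the table is a chi-square goodness-of-fit test of observed assignment counts against the planned allocation. A minimal sketch with hypothetical counts; the 0.001 alert threshold is an assumption chosen so ordinary noise is rarely flagged, not a value from the text.

```python
from scipy.stats import chisquare

observed = [10000, 10500]                 # hypothetical users actually assigned to A, B
total = sum(observed)
expected = [total * 0.5, total * 0.5]     # planned 50/50 allocation

stat, p = chisquare(f_obs=observed, f_exp=expected)
# Assumed alert threshold of 0.001: stricter than 0.05 to limit false alarms.
if p < 0.001:
    print(f"SRM detected (chi2={stat:.2f}, p={p:.2e}): fix assignment/logging before reading results")
else:
    print(f"no SRM detected (chi2={stat:.2f}, p={p:.3f})")
```

A failed SRM check points to a bug in assignment, logging, or filtering, so the metric comparison should not be trusted until the cause is found.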

Professional recommendation

Before: Fix metric, MDE, n, duration

During: Don't peek; check SRM

Variance: CUPED to detect smaller effects

After: Lift + CI, then decide
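
CUPED subtracts the part of the metric that pre-experiment data already predicts, which shrinks the variance without biasing the treatment effect. A minimal sketch on synthetic data, assuming a single pre-period covariate and a made-up true effect of 0.5; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000
# Synthetic data: pre-period spend (covariate) correlated with in-experiment spend (metric).
pre = rng.gamma(shape=2.0, scale=10.0, size=n)
treated = rng.integers(0, 2, size=n).astype(bool)
post = 0.8 * pre + rng.normal(0, 5, size=n) + 0.5 * treated   # true effect = 0.5

# theta = cov(metric, covariate) / var(covariate), estimated on the pooled data.
theta = np.cov(post, pre)[0, 1] / np.var(pre, ddof=1)
adjusted = post - theta * (pre - pre.mean())

def diff_and_se(y):
    """Difference in means (treated - control) and its standard error."""
    a, b = y[~treated], y[treated]
    se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    return b.mean() - a.mean(), se

raw_diff, raw_se = diff_and_se(post)
cuped_diff, cuped_se = diff_and_se(adjusted)
print(f"raw:   diff={raw_diff:.3f}  se={raw_se:.3f}")
print(f"CUPED: diff={cuped_diff:.3f}  se={cuped_se:.3f}")
```

The covariate must be measured before assignment; adjusting with anything the treatment could influence would bias the estimate.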

Common mistakes to avoid

Peeking at results and stopping when it looks significant
Not computing sample size — running underpowered tests
Cherry-picking the one metric (of many) that won
Ending the test before a full weekly cycle (novelty/day-of-week effects)
Ignoring a sample ratio mismatch that signals a broken split

Quick cheatsheet

solve_power(effect_size, alpha, power) -> required n

proportions_ztest() -> conversion test

ttest_ind(equal_var=False) -> Welch's t-test

SRM chi-square -> validate the split

CUPED -> cut variance using pre-period data