Chapter 35 — Reproducibility

Reproducibility, Testing & Tooling

Work others (and future-you) can rerun and trust: environments, seeds, data validation, testing data pipelines, version control, and project structure.

"It works on my machine" is where analysis goes to die. Reproducibility is the difference between a one-off notebook and an asset the team can rerun, audit, and build on.
35.1 The reproducibility stack
layers
Reproducible project =
   Code        version-controlled (git)
 + Environment pinned deps (venv + lockfile, or Docker)
 + Data        versioned / snapshotted (DVC, dated extracts)
 + Randomness  fixed seeds
 + Config      params in files, not hard-coded
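One way to tie these layers together is to record, for every run, which parameters and which data snapshot produced the output. The following is a minimal sketch under stated assumptions: the `build_manifest` and `file_sha256` helpers, the `seed` key in the config, and the idea of a JSON config file are all hypothetical illustrations, not part of any particular tool.

```python
import hashlib
import json
import platform
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Fingerprint a data file so a changed extract is detectable."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_manifest(config_path: Path, data_path: Path) -> dict:
    """Collect what is needed to rerun: params, data fingerprint, interpreter."""
    params = json.loads(config_path.read_text())  # params live in a file, not in code
    return {
        "params": params,
        "seed": params["seed"],                   # assumed key in the config file
        "data_file": data_path.name,
        "data_sha256": file_sha256(data_path),
        "python": platform.python_version(),
    }
```

Saving the returned dict next to each output (for example with `json.dumps(manifest, indent=2)`) gives a later reader enough to check whether they are rerunning against the same inputs.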
35.2 Environments — pin everything
Fragile
Running pip install pandas with no version pin. Six months later the install fails, or behavior has changed silently.
Reproducible
A virtual env plus a lockfile of exact versions, or a Dockerfile. Anyone can rebuild the same environment.
terminal
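# Isolated, pinned environment
python -m venv .venv && source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt
pip freeze > requirements.lock      # exact versions of everything installed
# Later, rebuild from the lock: pip install -r requirements.lock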
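A lockfile only helps if the running environment still matches it. As a rough sketch, the check below compares `name==version` pins against what is actually installed, using the standard library's `importlib.metadata`. The helper names `parse_lock` and `find_mismatches` are hypothetical, and the parser deliberately skips anything that is not a plain `==` pin.

```python
from importlib import metadata


def parse_lock(text: str) -> dict:
    """Parse 'name==version' lines, as written by pip freeze."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue  # skip blanks, comments, and non-pinned entries
        name, version = line.split("==", 1)
        pins[name.lower()] = version
    return pins


def find_mismatches(pins: dict) -> dict:
    """Return {package: (pinned, installed_or_None)} for every drifted package."""
    drift = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # pinned but not installed at all
        if installed != pinned:
            drift[name] = (pinned, installed)
    return drift
```

A pipeline entry point could call `find_mismatches(parse_lock(Path("requirements.lock").read_text()))` and stop if the result is non-empty.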
35.3 Fix the seeds
python
import random
import numpy as np

SEED = 42
random.seed(SEED)       # Python's built-in random module
np.random.seed(SEED)    # NumPy's legacy global generator
# sklearn: pass random_state=SEED   |   torch: torch.manual_seed(SEED)
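Global seeds depend on call order: one extra random draw anywhere in the program shifts every result after it. A common alternative is to give each step its own generator built from an explicit seed. The sketch below uses only the standard library, and `sample_rows` is a hypothetical helper. NumPy supports the same pattern through `np.random.default_rng(seed)`.

```python
import random


def sample_rows(n_rows: int, k: int, seed: int) -> list:
    """Pick k row indices reproducibly using a generator local to this call."""
    rng = random.Random(seed)          # independent of the global random state
    return sorted(rng.sample(range(n_rows), k))


first = sample_rows(1000, 5, seed=42)
random.random()                        # unrelated global draw in between
second = sample_rows(1000, 5, seed=42)
assert first == second                 # same seed, same sample
```

Because the generator is created inside the function, the result depends only on the arguments, not on what ran earlier in the session.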
35.4 Validate data, not just code

Data breaks silently — a renamed column, a unit change, a flood of nulls. Assert your assumptions so a bad input fails loudly instead of producing a wrong number.

python
# Pandera schema: data fails fast if assumptions break
import pandera as pa
schema = pa.DataFrameSchema({
    'age':    pa.Column(int, pa.Check.in_range(0, 120)),
    'email':  pa.Column(str, pa.Check.str_contains('@')),  # str_matches anchors at the start, so r'@' would reject 'a@b.com'
    'amount': pa.Column(float, pa.Check.ge(0)),
})
schema.validate(df)   # raises on violation
Tool                 Use for
Pandera              In-code DataFrame schema checks
Great Expectations   Pipeline data-quality suites + docs
dbt tests            Warehouse table assertions (unique, not null)
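The same fail-loudly idea can be shown without any library. The sketch below mirrors the three assumptions from the schema above (age range, an `@` in the email, non-negative amount) on plain dictionaries. `validate_rows` is a hypothetical helper for illustration, not a substitute for the tools in the table.

```python
def validate_rows(rows: list) -> None:
    """Raise ValueError listing every row that breaks an assumption."""
    problems = []
    for i, row in enumerate(rows):
        age, email, amount = row.get("age"), row.get("email"), row.get("amount")
        if not isinstance(age, int) or not 0 <= age <= 120:
            problems.append(f"row {i}: age out of range or not int: {age!r}")
        if not isinstance(email, str) or "@" not in email:
            problems.append(f"row {i}: email missing '@': {email!r}")
        if not isinstance(amount, float) or amount < 0:
            problems.append(f"row {i}: amount negative or not float: {amount!r}")
    if problems:
        # Report all violations at once so one run reveals every bad row.
        raise ValueError("data validation failed:\n" + "\n".join(problems))
```

Collecting all problems before raising is a design choice: stopping at the first failure is simpler, but a full list saves repeated runs when an upstream change breaks several columns at once.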
35.5 Test your pipeline
what to test
Tests for data work?
│
├── Unit ──────► transform functions give expected output on tiny inputs
├── Data ──────► schema, ranges, uniqueness, row counts
└── Regression ► outputs/metrics don't change unexpectedly between runs
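The three kinds of test in the diagram can be sketched as plain functions that pytest would collect. Everything here is illustrative: `add_revenue`, `TINY_INPUT`, and the baseline value are hypothetical stand-ins for a real transform and a real recorded metric.

```python
def add_revenue(rows: list) -> list:
    """Transform under test: revenue = price * quantity, rounded to cents."""
    return [{**r, "revenue": round(r["price"] * r["quantity"], 2)} for r in rows]


TINY_INPUT = [
    {"id": 1, "price": 2.50, "quantity": 4},
    {"id": 2, "price": 10.00, "quantity": 0},
]


def test_unit_expected_output():
    # Unit: known tiny input, hand-computed expected output.
    out = add_revenue(TINY_INPUT)
    assert [r["revenue"] for r in out] == [10.0, 0.0]


def test_data_invariants():
    # Data: properties that must hold for any valid output.
    out = add_revenue(TINY_INPUT)
    assert len(out) == len(TINY_INPUT)                 # row count preserved
    assert len({r["id"] for r in out}) == len(out)     # ids stay unique
    assert all(r["revenue"] >= 0 for r in out)         # range check


def test_regression_total():
    # Regression: baseline recorded from a trusted run; update it only deliberately.
    baseline_total = 10.0
    total = sum(r["revenue"] for r in add_revenue(TINY_INPUT))
    assert abs(total - baseline_total) < 1e-9
```

Placed in a file such as `tests/test_transforms.py`, these run with a bare `pytest` command.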
35.6 Project structure & version control
layout
project/
├── data/        raw (read-only) + processed (gitignored)
├── notebooks/   exploration only
├── src/         reusable, importable code
├── tests/       pytest
├── requirements.txt / lockfile
├── README.md    goal, setup, run order
└── .gitignore   never commit data/secrets
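To start new projects from the same skeleton, the layout above can be created by a short script. This is a sketch only: the `scaffold` helper, the `data/raw` and `data/processed` subfolder names, and the specific `.gitignore` entries are assumptions chosen to match the layout, and should be adjusted to the project.

```python
from pathlib import Path

LAYOUT_DIRS = ["data/raw", "data/processed", "notebooks", "src", "tests"]
GITIGNORE = "data/processed/\n.venv/\n.env\n"   # assumed entries: derived data, env, secrets


def scaffold(root: Path) -> None:
    """Create the project skeleton; existing files are left untouched."""
    for d in LAYOUT_DIRS:
        (root / d).mkdir(parents=True, exist_ok=True)
    starter_files = [
        (".gitignore", GITIGNORE),
        ("README.md", "# Project\n\nGoal, setup, run order.\n"),
        ("requirements.txt", ""),
    ]
    for name, text in starter_files:
        path = root / name
        if not path.exists():          # never overwrite existing work
            path.write_text(text)
```

Calling `scaffold(Path("project"))` twice is safe, since directories use `exist_ok=True` and files are written only when missing.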

Professional recommendation

Env     venv + lockfile (or Docker)
Seeds   Fix everywhere
Data    Validate with Pandera/GE
Code    git + pytest + README
Common mistakes to avoid
Quick cheatsheet
python -m venv .venv -> isolated env
pip freeze > requirements.lock -> pin versions
np.random.seed(42) -> reproducible randomness
pandera / great_expectations -> data validation
pytest -> test transforms