Chapter 35 — Reproducibility
Reproducibility, Testing & Tooling
Work others (and future-you) can rerun and trust: environments, seeds, data validation, testing data pipelines, version control, and project structure.
"It works on my machine" is where analysis goes to die. Reproducibility is the difference between a one-off notebook and an asset the team can rerun, audit, and build on.
35.1 The reproducibility stack
layers
Reproducible project =
  Code         version-controlled (git)
+ Environment  pinned deps (venv + lockfile, or Docker)
+ Data         versioned / snapshotted (DVC, dated extracts)
+ Randomness   fixed seeds
+ Config       params in files, not hard-coded
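The "Config" layer can be as small as one JSON file that the code reads at startup. The sketch below is illustrative only: the `Config` fields, the `load_config` helper, the file name `config.json`, and the path `data/raw/extract.csv` are all hypothetical names chosen for the example, not part of any library or of this project.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Config:
    """Hypothetical run parameters; adapt the fields to your project."""
    seed: int
    input_path: str
    test_size: float


def load_config(path: Path) -> Config:
    """Read parameters from a JSON file; missing or unknown keys raise TypeError."""
    with path.open(encoding="utf-8") as fh:
        return Config(**json.load(fh))


# Demo: write an example config to a temporary directory, then load it.
with tempfile.TemporaryDirectory() as tmp:
    cfg_path = Path(tmp) / "config.json"
    cfg_path.write_text(
        json.dumps({"seed": 42, "input_path": "data/raw/extract.csv", "test_size": 0.2}),
        encoding="utf-8",
    )
    cfg = load_config(cfg_path)

print(cfg)
```

Because the dataclass is frozen, parameters cannot be changed by accident mid-run, and committing the config file alongside the code records which settings produced which result.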
35.2 Environments — pin everything
Fragile
pip install pandas with no version pins. Six months later nothing installs, or behavior has changed silently.
Reproducible
A virtual env + a lockfile (exact versions), or a Dockerfile. Anyone can rebuild the same environment.
terminal
# Isolated, pinned environment
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
pip freeze > requirements.lock   # exact versions
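A lockfile only helps if the active environment actually matches it. As a rough sketch, the check below compares `name==version` pins against installed packages using the standard-library `importlib.metadata`. The helper names `parse_lock` and `find_mismatches` and the placeholder package name are assumptions made for this example. Lines that are not plain `==` pins are skipped.

```python
from importlib import metadata


def parse_lock(text: str) -> dict[str, str]:
    """Parse 'name==version' lines as written by `pip freeze`; skip everything else."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def find_mismatches(pins: dict[str, str]) -> list[str]:
    """Compare pinned versions against what is installed in the active environment."""
    problems = []
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed (lockfile pins {wanted})")
            continue
        if installed != wanted:
            problems.append(f"{name}: installed {installed}, lockfile pins {wanted}")
    return problems


# Demo with an inline lockfile; in a project you would read requirements.lock.
example_lock = """
# produced by: pip freeze > requirements.lock
definitely-not-a-real-package==1.0.0
"""
pins = parse_lock(example_lock)
print(find_mismatches(pins))
```

Running a check like this at the top of a pipeline turns "behavior changed silently" into an explicit list of drifted packages.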
35.3 Fix the seeds
python
import random
import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
# sklearn: random_state=SEED | torch: torch.manual_seed(SEED)
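To confirm that seeding has the intended effect, reseed and check that the same draws come back. This is a minimal sketch: `seed_everything` and `draw_sample` are hypothetical helper names, and NumPy is treated as optional so the example also runs without it.

```python
import random

try:
    import numpy as np  # optional: only seeded if installed
except ImportError:
    np = None

SEED = 42


def seed_everything(seed: int) -> None:
    """Seed the stdlib and (if present) NumPy global generators."""
    random.seed(seed)
    if np is not None:
        np.random.seed(seed)


def draw_sample(n: int = 5) -> list[float]:
    """Stand-in for any code that consumes randomness."""
    return [random.random() for _ in range(n)]


seed_everything(SEED)
first = draw_sample()
seed_everything(SEED)
second = draw_sample()
print(first == second)  # True: same seed, same draws

if np is not None:
    # A local Generator avoids hidden global state; pass it to functions explicitly.
    rng_a = np.random.default_rng(SEED)
    rng_b = np.random.default_rng(SEED)
    print(bool((rng_a.random(3) == rng_b.random(3)).all()))
```

For new NumPy code, a `default_rng(SEED)` generator passed into functions is usually easier to reason about than the global `np.random.seed`, since no other library call can advance it behind your back.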
35.4 Validate data, not just code
Data breaks silently — a renamed column, a unit change, a flood of nulls. Assert your assumptions so a bad input fails loudly instead of producing a wrong number.
python
# Pandera schema: data fails fast if assumptions break
import pandera as pa

schema = pa.DataFrameSchema({
    'age': pa.Column(int, pa.Check.in_range(0, 120)),
    # str_matches anchors at the start of the string, so use str_contains for '@'
    'email': pa.Column(str, pa.Check.str_contains('@')),
    'amount': pa.Column(float, pa.Check.ge(0)),
})
schema.validate(df)  # raises on violation
| Tool | Use for |
|---|---|
| Pandera | In-code DataFrame schema checks |
| Great Expectations | Pipeline data-quality suites + docs |
| dbt tests | Warehouse table assertions (unique, not null) |
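The same idea works without any library when the data is small or already in plain Python. The sketch below is not Pandera. It is a hand-rolled stand-in that applies the same three assumptions (age range, an `@` in the email, non-negative amount) to a list of dict records and collects every violation instead of stopping at the first. The name `validate_rows` and the sample records are made up for illustration.

```python
def validate_rows(rows: list[dict]) -> list[str]:
    """Return human-readable violations; an empty list means the data passed."""
    errors = []
    for i, row in enumerate(rows):
        age = row.get("age")
        if not isinstance(age, int) or not 0 <= age <= 120:
            errors.append(f"row {i}: age={age!r} outside 0..120")
        email = row.get("email")
        if not isinstance(email, str) or "@" not in email:
            errors.append(f"row {i}: email={email!r} has no '@'")
        amount = row.get("amount")
        if not isinstance(amount, (int, float)) or amount < 0:
            errors.append(f"row {i}: amount={amount!r} is negative or missing")
    return errors


rows = [
    {"age": 34, "email": "a@example.com", "amount": 19.99},
    {"age": 150, "email": "no-at-sign", "amount": -5.0},  # three violations
]
problems = validate_rows(rows)
for p in problems:
    print(p)
```

In a pipeline you would typically raise an exception when the list is non-empty, so that a bad input stops the run rather than flowing into a report.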
35.5 Test your pipeline
what to test
Tests for data work?
│
├── Unit ──────► transform functions give expected output on tiny inputs
├── Data ──────► schema, ranges, uniqueness, row counts
└── Regression ► outputs/metrics don't change unexpectedly between runs
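The three kinds of test can each be a few lines. The sketch below uses plain `assert` statements in functions named `test_*`, which pytest collects automatically. The transform `add_revenue`, the tiny input, and the baseline value are hypothetical and exist only to make the example runnable.

```python
def add_revenue(rows: list[dict]) -> list[dict]:
    """Hypothetical transform: revenue = price * quantity, inputs left untouched."""
    return [{**r, "revenue": r["price"] * r["quantity"]} for r in rows]


TINY_INPUT = [
    {"price": 2.0, "quantity": 3},
    {"price": 0.5, "quantity": 4},
]


def test_unit_expected_output():
    # Unit: known input, hand-computed expected output.
    out = add_revenue(TINY_INPUT)
    assert [r["revenue"] for r in out] == [6.0, 2.0]


def test_data_row_count_and_ranges():
    # Data: no rows dropped or duplicated, values within a sane range.
    out = add_revenue(TINY_INPUT)
    assert len(out) == len(TINY_INPUT)
    assert all(r["revenue"] >= 0 for r in out)


def test_regression_total_is_stable():
    # Regression: baseline recorded from a trusted run; update it only deliberately.
    baseline_total = 8.0
    total = sum(r["revenue"] for r in add_revenue(TINY_INPUT))
    assert abs(total - baseline_total) < 1e-9
```

Saved as a file under `tests/`, these run with a bare `pytest` command. A regression failure then means either a bug or an intentional change that needs a new baseline.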
35.6 Project structure & version control
layout
project/
├── data/                        raw (read-only) + processed (gitignored)
├── notebooks/                   exploration only
├── src/                         reusable, importable code
├── tests/                       pytest
├── requirements.txt / lockfile
├── README.md                    goal, setup, run order
└── .gitignore                   never commit data/secrets
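Since raw data is kept out of git, one lightweight way to enforce "read-only" is to record a checksum per raw file and compare it on later runs. This is a sketch under assumptions, not a replacement for a tool such as DVC: `sha256_of` and `build_manifest` are hypothetical helpers, and a temporary directory stands in for `data/raw`.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large extracts do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(raw_dir: Path) -> dict[str, str]:
    """Map each file under raw_dir (relative path) to its SHA-256 digest."""
    return {
        p.relative_to(raw_dir).as_posix(): sha256_of(p)
        for p in sorted(raw_dir.rglob("*"))
        if p.is_file()
    }


# Demo in a temporary directory standing in for data/raw.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp)
    (raw / "extract.csv").write_text("id,amount\n1,9.5\n", encoding="utf-8")
    before = build_manifest(raw)
    (raw / "extract.csv").write_text("id,amount\n1,95\n", encoding="utf-8")  # silent edit
    after = build_manifest(raw)

changed = [name for name in before if before[name] != after.get(name)]
print(changed)
```

The manifest itself is small text, so it can be committed to git even though the data cannot.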
Professional recommendation
- Env: venv + lockfile (or Docker)
- Seeds: Fix everywhere
- Data: Validate with Pandera/GE
- Code: git + pytest + README
Common mistakes to avoid
- Installing packages with no version pinning
- Hard-coding paths, params, and secrets in notebooks
- Committing raw data or credentials to git
- No fixed random seed — results change every run
- Testing code but never validating the incoming data
Quick cheatsheet
- python -m venv .venv -> isolated env
- pip freeze > lock -> pin versions
- np.random.seed(42) -> reproducible randomness
- pandera / great_expectations -> data validation
- pytest -> test transforms