Chapter 35 — Reproducibility

Reproducibility, Testing & Tooling

Work others (and future-you) can rerun and trust: environments, seeds, data validation, testing data pipelines, version control, and project structure.

"It works on my machine" is where analysis goes to die. Reproducibility is the difference between a one-off notebook and an asset the team can rerun, audit, and build on.

35.1 The reproducibility stack

layers

Reproducible project =
   Code        version-controlled (git)
 + Environment pinned deps (venv + lockfile, or Docker)
 + Data        versioned / snapshotted (DVC, dated extracts)
 + Randomness  fixed seeds
 + Config      params in files, not hard-coded
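
The "Config" layer can be sketched as follows: parameters live in a JSON file and the code reads them at run time instead of embedding them. This is a minimal sketch, not a prescribed layout; the function name `load_params` and the keys `seed` and `test_size` are hypothetical, chosen only for illustration.

```python
import json
from pathlib import Path

# Hypothetical keys; a real project defines its own.
REQUIRED_KEYS = {"seed", "test_size"}

def load_params(path):
    """Read run parameters from a JSON file and check required keys."""
    params = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = REQUIRED_KEYS - set(params)
    if missing:
        # Fail loudly rather than fall back to hidden defaults.
        raise KeyError(f"missing config keys: {sorted(missing)}")
    return params
```

Because the file is committed alongside the code, a change to a parameter shows up in the git history like any other change.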

35.2 Environments — pin everything

Fragile

Running pip install pandas with no version constraint. Six months later the install fails or, worse, behavior has changed silently.

Reproducible

A virtual environment plus a lockfile of exact versions, or a Dockerfile. Anyone can rebuild the same Python dependencies; Docker also pins the operating-system layer.

terminal

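# Isolated, pinned environment (macOS/Linux; on Windows run .venv\Scripts\activate)
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
pip freeze > requirements.lock      # record exact installed versions
pip install -r requirements.lock    # later: recreate the same versions elsewhere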
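
A lockfile only helps if the running environment actually matches it. The sketch below compares `name==version` pins with what is installed, using the standard library's `importlib.metadata`. The helper names `parse_lock` and `find_mismatches` are hypothetical, and the parser assumes plain `pip freeze` output; lines in other forms (editable installs, direct URLs) are skipped.

```python
from importlib import metadata

def parse_lock(text):
    """Return {package: version} for every 'name==version' line."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue  # skip blanks, comments, and non-pinned entries
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins

def find_mismatches(pins):
    """Map each package whose installed version differs to (wanted, installed)."""
    problems = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # not installed at all
        if installed != wanted:
            problems[name] = (wanted, installed)
    return problems
```

A pipeline entry point could call `find_mismatches(parse_lock(open("requirements.lock").read()))` and stop if the result is non-empty.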

35.3 Fix the seeds

python

import random
import numpy as np

SEED = 42
random.seed(SEED)       # Python's built-in RNG
np.random.seed(SEED)    # NumPy's legacy global RNG
# sklearn: random_state=SEED   |   torch: torch.manual_seed(SEED)
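
Global seeds are easy to forget or to reset by accident somewhere else in the code. One alternative is to give each function its own seeded generator. The sketch below uses only the standard library; `sample_rows` is a hypothetical helper, not part of any package.

```python
import random

def sample_rows(rows, k, seed):
    """Draw k rows with a private generator, leaving global RNG state untouched."""
    rng = random.Random(seed)
    return rng.sample(rows, k)

rows = list(range(100))
first = sample_rows(rows, 5, seed=42)
second = sample_rows(rows, 5, seed=42)
assert first == second  # same seed, same draw
```

NumPy supports the same pattern through `np.random.default_rng(seed)`, which returns a generator object that can be passed to the functions that need it.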

35.4 Validate data, not just code

Data breaks silently — a renamed column, a unit change, a flood of nulls. Assert your assumptions so a bad input fails loudly instead of producing a wrong number.

python

# Pandera schema: data fails fast if assumptions break
import pandera as pa
schema = pa.DataFrameSchema({
    'age':    pa.Column(int, pa.Check.in_range(0, 120)),
    'email':  pa.Column(str, pa.Check.str_contains('@')),   # '@' anywhere in the string
    'amount': pa.Column(float, pa.Check.ge(0)),
})
schema.validate(df)   # raises on violation

Tool	Use for
Pandera	In-code DataFrame schema checks
Great Expectations	Pipeline data-quality suites + docs
dbt tests	Warehouse table assertions (unique, not null)
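
The same fail-fast idea can be sketched without any library, which is sometimes enough for a small script. The function `validate_rows` is hypothetical, and its rules mirror the age, email, and amount assumptions in the schema above; rows are assumed to be plain dictionaries.

```python
def validate_rows(rows):
    """Raise ValueError listing every row that breaks an assumption."""
    errors = []
    for i, row in enumerate(rows):
        age, email, amount = row.get("age"), row.get("email"), row.get("amount")
        if not isinstance(age, int) or not 0 <= age <= 120:
            errors.append(f"row {i}: bad age {age!r}")
        if not isinstance(email, str) or "@" not in email:
            errors.append(f"row {i}: bad email {email!r}")
        if not isinstance(amount, float) or amount < 0:
            errors.append(f"row {i}: bad amount {amount!r}")
    if errors:
        raise ValueError("; ".join(errors))
    return rows  # returned unchanged so the call can sit inline in a pipeline
```

Collecting all errors before raising means one run reports every problem in the input, not just the first.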

35.5 Test your pipeline

what to test

Tests for data work?
│
├── Unit ──────► transform functions give expected output on tiny inputs
├── Data ──────► schema, ranges, uniqueness, row counts
└── Regression ► outputs/metrics don't change unexpectedly between runs
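
The three kinds of test above can be sketched as plain pytest-style functions. Everything here is illustrative: `add_total` is a hypothetical transform, the orders are made-up inputs, and the baseline stands in for a value that would normally be saved from an approved run.

```python
def add_total(order):
    """Hypothetical transform: price * quantity, rounded to cents."""
    return {**order, "total": round(order["price"] * order["quantity"], 2)}

def test_unit_add_total():
    # Unit: tiny hand-checked input.
    assert add_total({"price": 2.5, "quantity": 4})["total"] == 10.0

def test_data_unique_ids():
    # Data: ids are unique and the row count is as expected.
    orders = [{"id": 1}, {"id": 2}, {"id": 3}]
    ids = [o["id"] for o in orders]
    assert len(ids) == len(set(ids)) == 3

def test_regression_mean_total():
    # Regression: the metric stays within tolerance of a stored baseline.
    baseline = 7.5  # hypothetical value saved from an approved run
    orders = [{"price": 2.5, "quantity": 4}, {"price": 5.0, "quantity": 1}]
    totals = [add_total(o)["total"] for o in orders]
    assert abs(sum(totals) / len(totals) - baseline) < 1e-9
```

Saved under tests/, functions named `test_*` are collected automatically when pytest runs.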

35.6 Project structure & version control

layout

project/
├── data/        raw (read-only) + processed (gitignored)
├── notebooks/   exploration only
├── src/         reusable, importable code
├── tests/       pytest
├── requirements.txt / lockfile
├── README.md    goal, setup, run order
└── .gitignore   never commit data/secrets
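
Treating raw data as read-only is easier to enforce with a recorded checksum. The sketch below hashes a file with SHA-256 and refuses to continue if it no longer matches. The helper names `sha256_of` and `verify_raw` are hypothetical; where the expected checksum is stored (README, config file, a manifest) is left to the project.

```python
import hashlib
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large extracts need not fit in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_raw(path, expected_hex):
    """Fail loudly if a raw file no longer matches its recorded checksum."""
    actual = sha256_of(path)
    if actual != expected_hex:
        raise RuntimeError(f"{path} changed: expected {expected_hex}, got {actual}")
```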

Professional recommendation

Env: venv + lockfile (or Docker)

Seeds: Fix everywhere

Data: Validate with Pandera/GE

Code: git + pytest + README

Common mistakes to avoid

Installing packages with no version pinning
Hard-coding paths, params, and secrets in notebooks
Committing raw data or credentials to git
No fixed random seed — results change every run
Testing code but never validating the incoming data

Quick cheatsheet

python -m venv .venv -> isolated env

pip freeze > requirements.lock -> pin versions

np.random.seed(42) -> reproducible randomness

pandera / great_expectations -> data validation

pytest -> test transforms