19 Reproducible Analysis
Reproducibility means others (including future you) can run your analysis and get the same results. This chapter covers practices that ensure reproducibility.
19.1 Why Reproducibility Matters
- Verification: Others can check your work
- Building on work: You or others can extend the analysis
- Debugging: Easier to find and fix problems
- Publication: Increasingly required by journals
- Future you: Six months from now, you’ll thank yourself
19.2 The Three Pillars
19.2.1 1. Version Control (Git)
Track every change to your code:
- What changed
- When it changed
- Why it changed
With Git, you can always return to a working state.
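One way to tie a result back to the exact code state that produced it is to record the current commit hash next to your outputs. The sketch below is a minimal illustration, assuming git is installed and on the PATH; the function name current_commit is hypothetical, not part of any library.

```python
import subprocess
from pathlib import Path
from typing import Optional


def current_commit(repo_dir: Path = Path(".")) -> Optional[str]:
    """Return the HEAD commit hash of repo_dir, or None if it cannot be read."""
    try:
        proc = subprocess.run(
            ["git", "-C", str(repo_dir), "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
            check=False,
        )
    except FileNotFoundError:
        # git is not installed on this machine
        return None
    if proc.returncode != 0:
        # not a Git repository, or the repository has no commits yet
        return None
    return proc.stdout.strip()


if __name__ == "__main__":
    print(f"Analysis run at commit: {current_commit()}")
```

Writing this hash into a log file or a results header makes it possible to check out the matching code later.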
19.2.2 2. Environment Management (Conda + renv)
Lock down exact package versions:
# environment.yml
dependencies:
  - python=3.11.5
  - pandas=2.0.3
  - numpy=1.24.3

// renv.lock
"ggplot2": {
  "Version": "3.4.2"
}

Different package versions can give different results.
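A lock file only helps if the running environment actually matches it. As a rough sketch, the pinned versions can be compared with what is installed using the standard library's importlib.metadata. The function name find_mismatches and its lookup parameter are illustrative assumptions, and the pins are copied from the environment.yml example.

```python
from importlib import metadata
from typing import Callable, Dict, List


def find_mismatches(
    pinned: Dict[str, str],
    lookup: Callable[[str], str] = metadata.version,
) -> List[str]:
    """Compare pinned versions against the installed ones and describe any gaps."""
    problems = []
    for name, wanted in pinned.items():
        try:
            installed = lookup(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed (expected {wanted})")
            continue
        if installed != wanted:
            problems.append(f"{name}: installed {installed}, expected {wanted}")
    return problems


if __name__ == "__main__":
    # Pins copied from the environment.yml example
    for line in find_mismatches({"pandas": "2.0.3", "numpy": "1.24.3"}):
        print(line)
```

An empty result means the environment matches the pins. Running a check like this at the top of an analysis catches version drift before it affects results.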
19.2.3 3. Documentation
Explain:
- How to set up the environment
- How to run the analysis
- What the data looks like
- What decisions you made and why
19.3 Reproducibility Checklist
Claude Code can audit your scripts for common reproducibility pitfalls.
Can you check analysis/01_qc.qmd for reproducibility issues? Look for hardcoded paths, missing package declarations, missing seed settings, or code that might behave differently on another machine.
Claude will scan the file and flag issues like absolute paths, missing library() calls, unseeded random operations, and platform-dependent code.
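To give a feel for what such an audit looks for, here is a deliberately crude sketch that flags two of these pitfalls with regular expressions. It is not how Claude Code works; the patterns, the CHECKS table, and the function name audit are illustrative assumptions and will miss many cases.

```python
import re
from typing import List, Tuple

# (label, pattern) pairs checked line by line. Illustrative heuristics only.
CHECKS = [
    ("absolute path", re.compile(r"""["'](?:/Users/|/home/|[A-Za-z]:\\)""")),
    ("setwd() call", re.compile(r"\bsetwd\(")),
]
# Calls that draw random numbers (R and Python), and calls that set a seed.
RANDOM_CALL = re.compile(
    r"\b(?:sample|rnorm|runif)\(|np\.random\.|random\.(?:choice|sample|shuffle)\("
)
SEED_CALL = re.compile(r"\bset\.seed\(|random\.seed\(|default_rng\(\s*\d")


def audit(source: str) -> List[Tuple[int, str]]:
    """Return (line_number, issue) pairs found in a script's text."""
    findings = []
    lines = source.splitlines()
    for number, line in enumerate(lines, start=1):
        for label, pattern in CHECKS:
            if pattern.search(line):
                findings.append((number, label))
    # Flag the first random call only if no seed is set anywhere in the file.
    random_lines = [n for n, line in enumerate(lines, start=1) if RANDOM_CALL.search(line)]
    if random_lines and not SEED_CALL.search(source):
        findings.append((random_lines[0], "random call without a seed"))
    return findings
```

A pattern-based check like this is cheap to run on every script, but it cannot judge context, which is where a reviewer (human or Claude) adds value.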
19.3.1 Environment
19.3.2 Data
19.3.3 Code
19.3.4 Documentation
19.4 Using Relative Paths
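Bad (hardcoded absolute path):
library(readr)
data <- read_csv("/Users/jm284/projects/analysis/data/data.csv")

Good (relative path with here):
library(here)
library(readr)
data <- read_csv(here("data", "data.csv"))

Python (use pathlib):
from pathlib import Path
import pandas as pd
# Assumes this script lives one directory below the project root (e.g. scripts/)
PROJECT_ROOT = Path(__file__).resolve().parent.parent
data = pd.read_csv(PROJECT_ROOT / "data" / "data.csv")

The here package (R) locates the project root for you. In Python, building paths with pathlib from the script's own location (__file__) has the same effect. Either way, paths resolve correctly regardless of where the script is run from.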
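Counting parent directories breaks as soon as a script moves to a different depth. A possible alternative, loosely modelled on how here searches for a project root, is to walk upward until a marker file appears. This is a sketch under assumptions: the function name find_project_root and the default markers (.git, environment.yml) are illustrative choices.

```python
from pathlib import Path
from typing import Iterable


def find_project_root(
    start: Path,
    markers: Iterable[str] = (".git", "environment.yml"),
) -> Path:
    """Walk upward from start until a directory containing a marker is found."""
    start = start.resolve()
    markers = tuple(markers)
    for candidate in (start, *start.parents):
        if any((candidate / marker).exists() for marker in markers):
            return candidate
    raise FileNotFoundError(f"No project root marker {markers} found above {start}")


# Typical use inside a script:
#   PROJECT_ROOT = find_project_root(Path(__file__).parent)
#   data_path = PROJECT_ROOT / "data" / "data.csv"
```

The script can then sit at any depth inside the project without its paths changing.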
19.5 Setting Random Seeds
Many analyses involve randomness (bootstrapping, sampling, train/test splits). Set seeds for reproducibility:
R:
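set.seed(42)
# data is assumed to be a vector with at least 100 elements
result <- sample(data, 100)

Python:
import random
import numpy as np
random.seed(42)
np.random.seed(42)
# Note: unlike R's sample(), np.random.choice samples with replacement by default
result = np.random.choice(data, 100)

Document why you chose that seed (or just pick a number consistently).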
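Seeding the global generator works, but any other code that draws random numbers in between shifts the sequence. A possible refinement, sketched here with only the standard library, is to give each analysis step its own seeded generator. The function name bootstrap_means and the sample values are illustrative.

```python
import random


def bootstrap_means(values, n_resamples, seed):
    """Return the mean of each bootstrap resample, using a private seeded generator."""
    rng = random.Random(seed)  # local generator, unaffected by other random calls
    means = []
    for _ in range(n_resamples):
        resample = rng.choices(values, k=len(values))  # sample with replacement
        means.append(sum(resample) / len(resample))
    return means


if __name__ == "__main__":
    data = [2.1, 3.4, 1.9, 5.6, 4.2]
    first = bootstrap_means(data, n_resamples=3, seed=42)
    second = bootstrap_means(data, n_resamples=3, seed=42)
    print(first == second)  # same seed, same resamples
```

Passing the seed as an argument also makes it visible in the call, which helps when documenting which seed produced which result.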
19.6 Documenting Data
In your README or a separate DATA.md:
# Data Description
## Data
### data/counts.csv
- Source: GEO accession GSE12345
- Downloaded: 2024-01-15
- Format: CSV, 20,000 genes × 12 samples
- Columns: gene_id, sample1, sample2, ...
### data/metadata.csv
- Source: Provided by collaborator
- Format: CSV, 12 rows
- Columns: sample_id, condition, batch, replicate

19.7 Running the Full Analysis
Create a master script or Makefile that runs everything:
Shell script (run_analysis.sh):
#!/bin/bash
set -e # Exit on error
echo "Setting up environment..."
# Make "conda activate" available in a non-interactive script
eval "$(conda shell.bash hook)"
conda activate my-analysis
echo "Step 1: Cleaning data..."
Rscript scripts/01_clean_data.R
echo "Step 2: Analysis..."
python scripts/02_analyze.py
echo "Step 3: Figures..."
Rscript scripts/03_figures.R
echo "Done!"

Quarto (analysis.qmd):
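Rendering a Quarto document runs all of its code chunks from top to bottom, so a single render reruns the full analysis. That guards against stale or out-of-order results, though it still relies on a pinned environment and seeded randomness.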
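Python (run_analysis.py):
Where a shell script is awkward (on Windows, for example), the same stop-on-first-error behaviour can be sketched in Python. The file name run_analysis.py, the function run_steps, and the PIPELINE list are illustrative; the script paths are the ones from the shell example.

```python
import subprocess
from typing import Sequence

# Commands to run, in order. Paths match the shell script example.
PIPELINE = [
    ["Rscript", "scripts/01_clean_data.R"],
    ["python", "scripts/02_analyze.py"],
    ["Rscript", "scripts/03_figures.R"],
]


def run_steps(steps: Sequence[Sequence[str]]) -> int:
    """Run each command in order and return how many completed.

    Stops at the first failure: check=True raises CalledProcessError
    when a command exits with a non-zero status.
    """
    completed = 0
    for command in steps:
        print(f"Running: {' '.join(command)}")
        subprocess.run(list(command), check=True)
        completed += 1
    return completed


# To run the whole analysis:
#   run_steps(PIPELINE)
```

Like set -e in the shell version, raising on the first failed step prevents later steps from running on incomplete inputs.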
Make (Makefile):
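# Recipe lines (the quarto commands) must be indented with a tab, not spaces.
all: outs/03_figures/figure1.pdf outs/02_analysis/results.csv
# A rule that builds outs/02_analysis/results.csv is also needed; it is not shown here.
outs/03_figures/figure1.pdf: scripts/03_figures.qmd outs/01_clean_data/clean.csv
	quarto render scripts/03_figures.qmd
outs/01_clean_data/clean.csv: scripts/01_clean_data.qmd data/data.csv
	quarto render scripts/01_clean_data.qmd

19.8 Testing Reproducibility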
19.8.1 Fresh Environment Test
- Clone your repo to a new location
- Create environment from scratch
- Run the full analysis
- Compare results
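The last step, comparing results, can be automated rather than done by eye. The sketch below compares two output directories file by file using SHA-256 checksums from the standard library. The function names checksum_tree and differing_files are illustrative assumptions.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def checksum_tree(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            digests[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def differing_files(original: Path, rerun: Path) -> List[str]:
    """List files that are missing on one side or whose bytes differ."""
    a, b = checksum_tree(original), checksum_tree(rerun)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))


# Example: differing_files(Path("outs"), Path("test-project/outs"))
```

Byte-level comparison suits plain-text outputs such as CSV files. Formats that embed a creation timestamp, which PDFs often do, can differ even when the analysis is identical, so those may need a more tolerant comparison.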
# In a temporary directory
git clone https://github.com/user/project.git test-project
cd test-project
conda env create -f environment.yml
conda activate project-name
./run_analysis.sh

19.8.2 Docker (Advanced)
For maximum reproducibility, package everything in a Docker container:
FROM rocker/tidyverse:4.3.1
COPY . /analysis
WORKDIR /analysis
RUN R -e "renv::restore()"
CMD ["Rscript", "run_all.R"]

This captures:
- Operating system
- R version
- All packages
19.9 Common Reproducibility Problems
19.9.1 “It works on my machine”
Cause: Different package versions, OS differences, or missing dependencies.
Solution: Use conda/renv and test in a fresh environment.
19.9.2 Missing files
Cause: Data files not included or paths are wrong.
Solution: Document data sources, use relative paths.
19.9.3 Different results each run
Cause: Unseeded random number generation.
Solution: Set random seeds.
19.9.4 Manual steps
Cause: Analysis requires clicking buttons or manual edits.
Solution: Script everything. If you must have manual steps, document them precisely.
19.10 Tools That Help
| Tool | Purpose |
|---|---|
| here (R) | Reliable relative paths |
| pathlib (Python) | Path handling |
| renv | R package management |
| conda | Python environment management |
| Quarto | Literate programming |
| Git | Version control |
| Docker | Full environment capture |
19.11 Quick Reference
For every project:
- Use Git for version control
- Lock environments with conda/renv
- Use relative paths
- Set random seeds
- Document everything
- Test in a fresh environment before sharing