19  Reproducible Analysis

Reproducibility means others (including future you) can run your analysis and get the same results. This chapter covers practices that ensure reproducibility.

19.1 Why Reproducibility Matters

  • Verification: Others can check your work
  • Building on work: You or others can extend the analysis
  • Debugging: Easier to find and fix problems
  • Publication: Increasingly required by journals
  • Future you: Six months from now, you’ll thank yourself

19.2 The Three Pillars

19.2.1 1. Version Control (Git)

Track every change to your code:

  • What changed
  • When it changed
  • Why it changed

With Git, you can always return to a working state.

19.2.2 2. Environment Management (Conda + renv)

Lock down exact package versions:

# environment.yml
dependencies:
  - python=3.11.5
  - pandas=2.0.3
  - numpy=1.24.3

// renv.lock (excerpt)
"ggplot2": {
  "Version": "3.4.2"
}

Different package versions can give different results.
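One way to catch a drifted environment early is to compare installed versions against your pins at startup. A minimal Python sketch, assuming the `importlib.metadata` standard library; the `version_mismatches` helper and the `pins` dict are illustrative, not part of any lockfile tooling:

```python
from importlib import metadata

def version_mismatches(expected):
    """Compare installed package versions against a pinned spec (name -> version).

    Returns a list of (package, wanted, installed) tuples for any mismatch.
    """
    mismatches = []
    for pkg, want in expected.items():
        have = metadata.version(pkg)  # raises if the package is not installed
        if have != want:
            mismatches.append((pkg, want, have))
    return mismatches

# Versions here mirror the environment.yml above; adjust to your own lockfile.
pins = {"pandas": "2.0.3", "numpy": "1.24.3"}
# version_mismatches(pins) returns [] when the environment matches the pins.
```

Running this at the top of an analysis script turns a silent version drift into a visible report.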

19.2.3 3. Documentation

Explain:

  • How to set up the environment
  • How to run the analysis
  • What the data looks like
  • What decisions you made and why

19.3 Reproducibility Checklist

Claude Code

Claude Code can audit your scripts for common reproducibility pitfalls.

Can you check analysis/01_qc.qmd for reproducibility issues? Look for hardcoded paths, missing package declarations, missing seed settings, or code that might behave differently on another machine.

Claude will scan the file and flag issues like absolute paths, missing library() calls, unseeded random operations, and platform-dependent code.

19.3.1 Environment

  • environment.yml and/or renv.lock committed to the repository
  • Language and package versions pinned exactly

19.3.2 Data

  • Sources and download dates documented
  • Raw data kept read-only, separate from generated outputs

19.3.3 Code

  • All paths relative to the project root
  • Random seeds set for every stochastic step
  • The full analysis runs from a single entry point

19.3.4 Documentation

  • README explains environment setup and how to run the analysis
  • Key analytical decisions recorded with their rationale

19.4 Using Relative Paths

Bad — hardcoded absolute paths:

data <- read_csv("/Users/jm284/projects/analysis/data/data.csv")

Good — relative paths with here:

library(here)
data <- read_csv(here("data", "data.csv"))

Python — use pathlib:

from pathlib import Path
import pandas as pd

PROJECT_ROOT = Path(__file__).parent.parent
data = pd.read_csv(PROJECT_ROOT / "data" / "data.csv")

The here package (R) resolves paths from the project root, which it locates automatically; in Python, pathlib plus an anchor such as __file__ achieves the same effect. Either way, scripts find their files no matter which directory they are run from.
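In Python you can also reproduce here's marker-search behaviour directly: walk up from the current file until you hit a directory containing a project marker. A minimal sketch; `find_project_root` and the marker names are illustrative choices, not a standard API:

```python
from pathlib import Path

def find_project_root(start: Path, markers=(".git", "environment.yml")) -> Path:
    """Walk upward from `start` until a directory containing a marker is found.

    Mimics what R's here package does with its .here/.git markers.
    """
    for candidate in [start, *start.parents]:
        if any((candidate / m).exists() for m in markers):
            return candidate
    raise FileNotFoundError(f"No project root marker found above {start}")

# Typical use inside a script:
# PROJECT_ROOT = find_project_root(Path(__file__).parent)
# data = pd.read_csv(PROJECT_ROOT / "data" / "data.csv")
```

This removes the fragile assumption that the script always lives exactly two levels below the root.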

19.5 Setting Random Seeds

Many analyses involve randomness (bootstrapping, sampling, train/test splits). Set seeds for reproducibility:

R:

set.seed(42)
result <- sample(data, 100)

Python:

import random
import numpy as np

random.seed(42)
np.random.seed(42)
result = np.random.choice(data, 100)

Document why you chose that seed (or just pick a number consistently).
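For NumPy specifically, the modern pattern is a local Generator rather than the global np.random.seed(); a seeded generator keeps reproducibility scoped to one analysis step. A short sketch:

```python
import numpy as np

# A local Generator keeps the seed scoped to this step,
# rather than mutating NumPy's global random state.
rng = np.random.default_rng(42)
sample_a = rng.choice(np.arange(1000), size=100, replace=False)

# Re-creating the generator with the same seed reproduces the draw exactly.
rng2 = np.random.default_rng(42)
sample_b = rng2.choice(np.arange(1000), size=100, replace=False)
assert (sample_a == sample_b).all()
```

Pass the generator into functions that need randomness instead of relying on module-level state; that makes each step's randomness explicit and independently reproducible.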

19.6 Documenting Data

In your README or a separate DATA.md:

# Data Description

## Data

### data/counts.csv
- Source: GEO accession GSE12345
- Downloaded: 2024-01-15
- Format: CSV, 20,000 genes × 12 samples
- Columns: gene_id, sample1, sample2, ...

### data/metadata.csv
- Source: Provided by collaborator
- Format: CSV, 12 rows
- Columns: sample_id, condition, batch, replicate
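Data documentation like this can double as a check: a small validation function that asserts the file still matches what DATA.md promises. A sketch, assuming pandas; `check_metadata` and the column list are taken from the metadata.csv description above, but the function itself is illustrative:

```python
import pandas as pd

def check_metadata(df: pd.DataFrame) -> None:
    """Raise if the metadata table is missing columns documented in DATA.md."""
    expected_cols = ["sample_id", "condition", "batch", "replicate"]
    missing = [c for c in expected_cols if c not in df.columns]
    if missing:
        raise ValueError(f"metadata.csv is missing documented columns: {missing}")

# Typical use:
# check_metadata(pd.read_csv("data/metadata.csv"))
```

Running such checks at the top of the pipeline catches a silently changed input file before it corrupts downstream results.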

19.7 Running the Full Analysis

Create a master script or Makefile that runs everything:

Shell script (run_analysis.sh):

#!/bin/bash
set -e  # Exit on error

echo "Setting up environment..."
# 'conda activate' needs the shell hook in non-interactive scripts
eval "$(conda shell.bash hook)"
conda activate my-analysis

echo "Step 1: Cleaning data..."
Rscript scripts/01_clean_data.R

echo "Step 2: Analysis..."
python scripts/02_analyze.py

echo "Step 3: Figures..."
Rscript scripts/03_figures.R

echo "Done!"

Quarto (analysis.qmd):

A Quarto document that runs all code chunks in order is inherently reproducible — rendering the document runs the full analysis.

Make (Makefile):

all: outs/03_figures/figure1.pdf outs/02_analysis/results.csv

outs/03_figures/figure1.pdf: scripts/03_figures.qmd outs/01_clean_data/clean.csv
    quarto render scripts/03_figures.qmd

outs/02_analysis/results.csv: scripts/02_analyze.py outs/01_clean_data/clean.csv
    python scripts/02_analyze.py

outs/01_clean_data/clean.csv: scripts/01_clean_data.qmd data/data.csv
    quarto render scripts/01_clean_data.qmd

19.8 Testing Reproducibility

19.8.1 Fresh Environment Test

  1. Clone your repo to a new location
  2. Create environment from scratch
  3. Run the full analysis
  4. Compare results

# In a temporary directory
git clone https://github.com/user/project.git test-project
cd test-project
conda env create -f environment.yml
conda activate project-name
./run_analysis.sh
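Step 4, comparing results, can itself be scripted: hash the output files from both runs and report any that differ. A minimal Python sketch using only the standard library; `compare_outputs` and the glob pattern are illustrative:

```python
import hashlib
from pathlib import Path

def file_digest(path: Path) -> str:
    """SHA-256 of a file's bytes; equal digests mean byte-identical outputs."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def compare_outputs(dir_a: Path, dir_b: Path, pattern="**/*.csv"):
    """List files under dir_a whose counterpart in dir_b differs or is missing."""
    diffs = []
    for f in sorted(dir_a.glob(pattern)):
        twin = dir_b / f.relative_to(dir_a)
        if not twin.exists() or file_digest(f) != file_digest(twin):
            diffs.append(f.relative_to(dir_a))
    return diffs

# Typical use after both runs finish:
# compare_outputs(Path("outs"), Path("../project/outs")) returns [] on a clean match.
```

Note that byte-identical comparison is strict: outputs containing timestamps or unseeded randomness will differ even when the analysis is otherwise sound, which is itself useful information.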

19.8.2 Docker (Advanced)

For maximum reproducibility, package everything in a Docker container:

FROM rocker/tidyverse:4.3.1

COPY . /analysis
WORKDIR /analysis

RUN R -e "renv::restore()"

CMD ["Rscript", "run_all.R"]

This captures:

  • Operating system
  • R version
  • All packages

19.9 Common Reproducibility Problems

19.9.1 “It works on my machine”

Cause: Different package versions, OS differences, or missing dependencies.

Solution: Use conda/renv, test in fresh environment.

19.9.2 Missing files

Cause: Data files not included or paths are wrong.

Solution: Document data sources, use relative paths.

19.9.3 Different results each run

Cause: Unseeded random number generation.

Solution: Set random seeds.

19.9.4 Manual steps

Cause: Analysis requires clicking buttons or manual edits.

Solution: Script everything. If you must have manual steps, document them precisely.

19.10 Tools That Help

Tool               Purpose
here (R)           Reliable relative paths
pathlib (Python)   Path handling
renv               R package management
conda              Python environment management
Quarto             Literate programming
Git                Version control
Docker             Full environment capture

19.11 Quick Reference

For every project:

  1. Use Git for version control
  2. Lock environments with conda/renv
  3. Use relative paths
  4. Set random seeds
  5. Document everything
  6. Test in a fresh environment before sharing