19  Reproducible Analysis

Reproducibility means others (including future you) can run your analysis and get the same results. This chapter covers practices that ensure reproducibility.

19.1 Why Reproducibility Matters

  • Verification: Others can check your work
  • Building on work: You or others can extend the analysis
  • Debugging: Easier to find and fix problems
  • Publication: Increasingly required by journals
  • Future you: Six months from now, you’ll thank yourself

19.2 The Three Pillars

19.2.1 1. Version Control (Git)

Track every change to your code:

  • What changed
  • When it changed
  • Why it changed

With Git, you can always return to a working state.

19.2.2 2. Environment Management (Conda + renv)

Lock down exact package versions:

# environment.yml
dependencies:
  - python=3.11.5
  - pandas=2.0.3
  - numpy=1.24.3

// renv.lock (excerpt)
"ggplot2": {
  "Version": "3.4.2"
}

Different package versions can give different results.
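One way to catch a drifted environment early is to compare installed versions against your pins at startup. A minimal Python sketch, assuming the `importlib.metadata` standard library; the `version_mismatches` helper and the `pins` dict are illustrative, not part of any lockfile tooling:

```python
from importlib import metadata

def version_mismatches(expected):
    """Compare installed package versions against a pinned spec (name -> version).

    Returns a list of (package, wanted, installed) tuples for any mismatch.
    """
    mismatches = []
    for pkg, want in expected.items():
        have = metadata.version(pkg)  # raises if the package is not installed
        if have != want:
            mismatches.append((pkg, want, have))
    return mismatches

# Versions here mirror the environment.yml above; adjust to your own lockfile.
pins = {"pandas": "2.0.3", "numpy": "1.24.3"}
# version_mismatches(pins) returns [] when the environment matches the pins.
```

Running this at the top of an analysis script turns a silent version drift into a visible report.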

19.2.3 3. Documentation

Explain:

  • How to set up the environment
  • How to run the analysis
  • What the data looks like
  • What decisions you made and why

19.3 Reproducibility Checklist

Claude Code

Claude Code can audit your scripts for common reproducibility pitfalls.

Can you check analysis/01_qc.qmd for reproducibility issues? Look for hardcoded paths, missing package declarations, missing seed settings, or code that might behave differently on another machine.

Claude will scan the file and flag issues like absolute paths, missing library() calls, unseeded random operations, and platform-dependent code.

19.3.1 Environment

  • environment.yml and/or renv.lock committed to the repository
  • Language and package versions pinned exactly

19.3.2 Data

  • Sources and download dates documented
  • Raw data kept read-only, separate from generated outputs

19.3.3 Code

  • All paths relative to the project root
  • Random seeds set for every stochastic step
  • The full analysis runs from a single entry point

19.3.4 Documentation

  • README explains environment setup and how to run the analysis
  • Key analytical decisions recorded with their rationale

19.4 Using Relative Paths

Bad — hardcoded absolute paths:

data <- read_csv("/Users/jm284/projects/analysis/data/data.csv")

Good — relative paths with here:

library(here)
data <- read_csv(here("data", "data.csv"))

Python — use pathlib:

from pathlib import Path
import pandas as pd

PROJECT_ROOT = Path(__file__).parent.parent
data = pd.read_csv(PROJECT_ROOT / "data" / "data.csv")

The here package (R) resolves paths from the project root, which it locates automatically; in Python, pathlib plus an anchor such as __file__ achieves the same effect. Either way, scripts find their files no matter which directory they are run from.
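In Python you can also reproduce here's marker-search behaviour directly: walk up from the current file until you hit a directory containing a project marker. A minimal sketch; `find_project_root` and the marker names are illustrative choices, not a standard API:

```python
from pathlib import Path

def find_project_root(start: Path, markers=(".git", "environment.yml")) -> Path:
    """Walk upward from `start` until a directory containing a marker is found.

    Mimics what R's here package does with its .here/.git markers.
    """
    for candidate in [start, *start.parents]:
        if any((candidate / m).exists() for m in markers):
            return candidate
    raise FileNotFoundError(f"No project root marker found above {start}")

# Typical use inside a script:
# PROJECT_ROOT = find_project_root(Path(__file__).parent)
# data = pd.read_csv(PROJECT_ROOT / "data" / "data.csv")
```

This removes the fragile assumption that the script always lives exactly two levels below the root.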

19.5 Setting Random Seeds

Many analyses involve randomness (bootstrapping, sampling, train/test splits). Set seeds for reproducibility:

R:

set.seed(42)
result <- sample(data, 100)

Python:

import random
import numpy as np

random.seed(42)
np.random.seed(42)
result = np.random.choice(data, 100)

Document why you chose that seed (or just pick a number consistently).
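For NumPy specifically, the modern pattern is a local Generator rather than the global np.random.seed(); a seeded generator keeps reproducibility scoped to one analysis step. A short sketch:

```python
import numpy as np

# A local Generator keeps the seed scoped to this step,
# rather than mutating NumPy's global random state.
rng = np.random.default_rng(42)
sample_a = rng.choice(np.arange(1000), size=100, replace=False)

# Re-creating the generator with the same seed reproduces the draw exactly.
rng2 = np.random.default_rng(42)
sample_b = rng2.choice(np.arange(1000), size=100, replace=False)
assert (sample_a == sample_b).all()
```

Pass the generator into functions that need randomness instead of relying on module-level state; that makes each step's randomness explicit and independently reproducible.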

19.6 Documenting Data

In your README or a separate DATA.md:

# Data Description

## Data

### data/counts.csv
- Source: GEO accession GSE12345
- Downloaded: 2024-01-15
- Format: CSV, 20,000 genes × 12 samples
- Columns: gene_id, sample1, sample2, ...

### data/metadata.csv
- Source: Provided by collaborator
- Format: CSV, 12 rows
- Columns: sample_id, condition, batch, replicate
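Data documentation like this can double as a check: a small validation function that asserts the file still matches what DATA.md promises. A sketch, assuming pandas; `check_metadata` and the column list are taken from the metadata.csv description above, but the function itself is illustrative:

```python
import pandas as pd

def check_metadata(df: pd.DataFrame) -> None:
    """Raise if the metadata table is missing columns documented in DATA.md."""
    expected_cols = ["sample_id", "condition", "batch", "replicate"]
    missing = [c for c in expected_cols if c not in df.columns]
    if missing:
        raise ValueError(f"metadata.csv is missing documented columns: {missing}")

# Typical use:
# check_metadata(pd.read_csv("data/metadata.csv"))
```

Running such checks at the top of the pipeline catches a silently changed input file before it corrupts downstream results.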

19.7 Running the Full Analysis

Create a master script or Makefile that runs everything:

Shell script (run_analysis.sh):

#!/bin/bash
set -e  # Exit on error

echo "Setting up environment..."
# 'conda activate' needs the shell hook in non-interactive scripts
eval "$(conda shell.bash hook)"
conda activate my-analysis

echo "Step 1: Cleaning data..."
Rscript scripts/01_clean_data.R

echo "Step 2: Analysis..."
python scripts/02_analyze.py

echo "Step 3: Figures..."
Rscript scripts/03_figures.R

echo "Done!"

Quarto (analysis.qmd):

A Quarto document that runs all code chunks in order is inherently reproducible — rendering the document runs the full analysis.

Make (Makefile):

all: outs/03_figures/figure1.pdf outs/02_analysis/results.csv

outs/03_figures/figure1.pdf: scripts/03_figures.qmd outs/01_clean_data/clean.csv
    quarto render scripts/03_figures.qmd

outs/02_analysis/results.csv: scripts/02_analyze.py outs/01_clean_data/clean.csv
    python scripts/02_analyze.py

outs/01_clean_data/clean.csv: scripts/01_clean_data.qmd data/data.csv
    quarto render scripts/01_clean_data.qmd

19.8 Testing Reproducibility

19.8.1 Fresh Environment Test

  1. Clone your repo to a new location
  2. Create environment from scratch
  3. Run the full analysis
  4. Compare results

# In a temporary directory
git clone https://github.com/user/project.git test-project
cd test-project
conda env create -f environment.yml
conda activate project-name
./run_analysis.sh
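Step 4, comparing results, can itself be scripted: hash the output files from both runs and report any that differ. A minimal Python sketch using only the standard library; `compare_outputs` and the glob pattern are illustrative:

```python
import hashlib
from pathlib import Path

def file_digest(path: Path) -> str:
    """SHA-256 of a file's bytes; equal digests mean byte-identical outputs."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def compare_outputs(dir_a: Path, dir_b: Path, pattern="**/*.csv"):
    """List files under dir_a whose counterpart in dir_b differs or is missing."""
    diffs = []
    for f in sorted(dir_a.glob(pattern)):
        twin = dir_b / f.relative_to(dir_a)
        if not twin.exists() or file_digest(f) != file_digest(twin):
            diffs.append(f.relative_to(dir_a))
    return diffs

# Typical use after both runs finish:
# compare_outputs(Path("outs"), Path("../project/outs")) returns [] on a clean match.
```

Note that byte-identical comparison is strict: outputs containing timestamps or unseeded randomness will differ even when the analysis is otherwise sound, which is itself useful information.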

19.8.2 Docker (Advanced)

For maximum reproducibility, package everything in a Docker container:

FROM rocker/tidyverse:4.3.1

COPY . /analysis
WORKDIR /analysis

RUN R -e "renv::restore()"

CMD ["Rscript", "run_all.R"]

This captures:

  • Operating system
  • R version
  • All packages

19.9 Common Reproducibility Problems

19.9.1 “It works on my machine”

Cause: Different package versions, OS differences, or missing dependencies.

Solution: Use conda/renv, test in fresh environment.

19.9.2 Missing files

Cause: Data files not included or paths are wrong.

Solution: Document data sources, use relative paths.

19.9.3 Different results each run

Cause: Unseeded random number generation.

Solution: Set random seeds.

19.9.4 Manual steps

Cause: Analysis requires clicking buttons or manual edits.

Solution: Script everything. If you must have manual steps, document them precisely.

19.10 Tools That Help

Tool               Purpose
here (R)           Reliable relative paths
pathlib (Python)   Path handling
renv               R package management
conda              Python environment management
Quarto             Literate programming
Git                Version control
Docker             Full environment capture

19.11 Quick Reference

For every project:

  1. Use Git for version control
  2. Lock environments with conda/renv
  3. Use relative paths
  4. Set random seeds
  5. Document everything
  6. Test in a fresh environment before sharing