5 Project Organization
Imagine opening a project you haven’t touched in six months. Or inheriting a colleague’s analysis when they leave the lab. You need to figure out: Where’s the raw data? Which script produces which output? What’s the current version of the analysis? Is that CSV file something I downloaded, or something a script generated?
A well-organized project answers these questions through its structure alone. You shouldn’t need to read code or ask someone—the folder layout and naming conventions should tell you what’s what. This chapter describes the conventions we use in the lab to make that possible. Every rule exists for a practical reason, and once you’ve internalized them, setting up a new project takes minutes.
In Your First R Project, you created a project with data/, scripts/, and outs/ folders. This chapter explains that structure formally and covers how it scales as projects grow.
5.1 The Two Rules
Every analysis project has two types of files, and the entire organizational system follows from keeping them separate:
- Inputs are sacred. Data that comes from outside the project—sequencing results, collaborator files, public datasets—goes in data/ and is never modified by your scripts. If you need a cleaned version, your script reads the original, transforms it, and saves the result as an output.
- Outputs are disposable. Everything your scripts produce—processed data, figures, tables, rendered reports—goes in outs/ and can always be regenerated by rerunning the script that created it.
This separation is the foundation. When you know that everything in data/ came from outside and everything in outs/ was generated by code, you can always trace where a file came from and how to recreate it.
5.2 Anatomy of a Lab Project
Here’s what a typical lab project looks like:
my-project/
├── .claude/ # Claude Code project config
│ └── CLAUDE.md
├── data/ # External inputs — scripts never write here
├── scripts/ # Quarto analysis scripts (.qmd)
│ └── exploratory/ # One-off analyses
├── outs/ # All generated outputs
├── R/ # Shared R helper functions
├── python/ # Shared Python helper functions
├── environment.yml # Conda environment
├── renv.lock # R package versions
├── .gitignore
└── README.md
If you’re using Claude Code (covered in Part 3), the /new-project command scaffolds this entire structure for you — creating directories, initializing Git, setting up conda and renv, and generating a .claude/CLAUDE.md with your project’s conventions. But it’s worth understanding what each piece does.
The .claude/ directory holds project-level configuration for Claude Code, including a CLAUDE.md file that describes the project’s purpose, environment, and conventions. As a project matures, you may also add planning documents here to track multi-session work. You’ll learn more about this in the Claude Code chapters.
Not every project needs every directory. An R-only project won’t have python/ or environment.yml. A Python-only project won’t have R/ or renv.lock. The /new-project command asks about your languages and creates only what’s needed.
5.3 data/ — Your Inputs Are Sacred
The data/ folder holds files that come from outside the project—things your scripts read but never write to:
data/
├── counts_matrix.csv # From the sequencing core
├── sample_metadata.xlsx # From a collaborator
├── reference_genome.fasta # Downloaded from NCBI
└── README.md # Documents where each file came from
This includes raw sequencing data from core facilities, spreadsheets from collaborators, downloaded public datasets, annotation files from databases, and metadata you received or compiled by hand.
The critical rule is that scripts never write to data/. If your code produces a file, it belongs in outs/, not here. This rule means you can always trust that files in data/ are the original, unmodified inputs—you never have to wonder whether something in data/ was accidentally overwritten by a script.
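The rule is a convention, not something the filesystem enforces. If you want an extra safeguard, one entirely optional trick is to strip write permission from the files in data/ so an accidental script write fails loudly. A sketch (the function name is illustrative, not a lab tool):

```python
# Optional safeguard (not part of the lab conventions): strip write permission
# from everything already in data/ so an accidental script write fails loudly.
import stat
from pathlib import Path

def protect_data(data_dir: Path) -> None:
    """Clear the write bits on every file under data_dir."""
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    for p in data_dir.rglob("*"):
        if p.is_file():
            p.chmod(p.stat().st_mode & ~write_bits)
```

Run it once after adding new inputs; when you genuinely need to add or replace a file, restore write permission with chmod u+w first.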
Document where each file came from. A data/README.md is the simplest approach—note the source, date received, and any relevant details for each file. Six months from now, you’ll be grateful you wrote down which version of the genome annotation you downloaded, or which email attachment that metadata spreadsheet came from.
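A minimal data/README.md might look like this (the entries and placeholders are illustrative, not a required format):

```markdown
# Data provenance

- counts_matrix.csv: raw counts from the sequencing core, received <date> (run <run-id>)
- sample_metadata.xlsx: experimental design from <collaborator>, received <date>
- reference_genome.fasta: downloaded from NCBI on <date>, release <version>
```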
5.4 scripts/ — Numbered Analysis Scripts
All analysis scripts live in scripts/, numbered to show their logical flow:
scripts/
├── 01_import_qc.qmd
├── 02_normalize.qmd
├── 03_differential.qmd
├── 04_volcano_plots.qmd
├── 05_heatmaps.qmd
└── exploratory/
5.4.1 Why .qmd Files?
In the lab, all data analysis scripts are Quarto documents (.qmd), not plain .R or .py scripts. Quarto lets you combine code, results, and narrative explanation in a single file—so your analysis documents what it does and why as it runs. When you render a .qmd file, it produces an HTML report with your figures, tables, and text woven together.
Use .py files only for standalone utilities, CLI tools, or library code in python/. Use .R files only for helper functions in R/. The analysis itself—the thing that reads data, transforms it, produces results—is always a .qmd. The Quarto chapter covers the syntax and workflow in detail.
Each script uses one language—either R or Python, never both in the same file. When R and Python scripts need to exchange data, they communicate through files in outs/, not shared memory.
5.4.2 Why Numbers?
The two-digit prefix (01_, 02_, …, 10_, 11_) serves a simple purpose: when you run ls or look at the file explorer, scripts appear in the order of your analysis pipeline. Script 01 imports and cleans the data. Script 02 normalizes it. Script 03 runs differential analysis. A new lab member can glance at the file list and understand the analysis flow.
The numbers indicate logical order, not strict dependencies. Script 05 might read outputs from scripts 01 and 03 directly—the numbering just helps you understand the overall structure at a glance. Dependencies are encoded by the file paths each script reads, not by the numbers.
When you add a new script, assign the next available number. If you archive or delete a script, don’t renumber the remaining ones—leave gaps. This avoids confusion with any downstream references or documentation that mention the old numbers.
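If you want to automate the "next available number" step, a small sketch (a hypothetical helper, not a lab tool) can scan scripts/ for existing prefixes:

```python
# Hypothetical helper: scan scripts/ for two-digit prefixes and report the next
# free number, leaving any gaps from archived scripts alone.
import re
from pathlib import Path

def next_script_number(scripts_dir: Path) -> str:
    nums = [
        int(m.group(1))
        for p in scripts_dir.glob("*.qmd")
        if (m := re.match(r"(\d{2})_", p.name))
    ]
    return f"{max(nums, default=0) + 1:02d}"
```

With 01_ and 03_ present this returns "04": one past the highest number, so gaps stay gaps.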
5.5 outs/ — Provenance Through Structure
Here’s the convention that makes everything traceable: every script gets a matching output folder. Script 01_import_qc.qmd writes all its outputs to outs/01_import_qc/:
outs/
├── 01_import_qc/
│ ├── filtered_counts.rds
│ ├── qc_summary.csv
│ ├── 01_import_qc.html # Rendered report
│ └── BUILD_INFO.txt
├── 02_normalize/
│ ├── normalized_counts.rds
│ ├── 02_normalize.html
│ └── BUILD_INFO.txt
└── 03_differential/
  ├── limma_results.rds
  ├── significant_hits.csv
  ├── 03_differential.html
  └── BUILD_INFO.txt
This structure encodes provenance automatically. When you see a file in outs/03_differential/, you know exactly which script created it—no need to search or guess. The rendered HTML report sits alongside the data outputs, keeping scripts/ clean and making it easy to view results.
Output ownership is strict: a script writes only to its own output folder, never to another script’s. If script 05 needs a modified version of something script 01 produced, it reads script 01’s output and saves its own version in outs/05_whatever/.
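In Python scripts, the same ownership convention can be captured with a small helper (a sketch; the function name is illustrative):

```python
# Python counterpart of the convention: each script writes only to
# outs/<its own name>/. Names here are illustrative.
from pathlib import Path

def own_out_dir(script_name: str, project_root: Path) -> Path:
    """Return (and create) this script's output folder, e.g. outs/03_differential."""
    d = project_root / "outs" / Path(script_name).stem
    d.mkdir(parents=True, exist_ok=True)
    return d
```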
5.5.1 BUILD_INFO.txt
Every numbered script writes a BUILD_INFO.txt to its output folder as its last action:
script: 03_differential.qmd
commit: a1b2c3d
date: 2026-02-14 15:30:00
This answers a question that comes up constantly: “When was this output last regenerated, and from what version of the code?” If downstream plots look wrong, you can check the upstream folder’s BUILD_INFO.txt to see whether it was generated from current code or something stale. The Quarto chapter has the R and Python code to generate this automatically.
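As a preview, a Python sketch of such a writer might look like this (the lab's official R and Python versions live in the Quarto chapter; names and paths here are illustrative):

```python
# Sketch: record which script, commit, and time produced an output folder.
import subprocess
from datetime import datetime
from pathlib import Path

def write_build_info(script_name: str, dir_out: Path) -> Path:
    try:
        commit = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        commit = "unknown"  # e.g. git missing, or not run inside a repository
    dir_out.mkdir(parents=True, exist_ok=True)
    info = dir_out / "BUILD_INFO.txt"
    info.write_text(
        f"script: {script_name}\n"
        f"commit: {commit}\n"
        f"date: {datetime.now():%Y-%m-%d %H:%M:%S}\n"
    )
    return info
```

Calling it as the last action of a script keeps the stamp honest: it only exists if the script ran to completion.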
5.5.2 Setting Up Output Paths
In your setup chunk, define paths that match this structure:
#| label: setup
#| include: false
library(tidyverse)
library(here)
# This script's output folder
dir_out <- here::here("outs", "03_differential")
dir.create(dir_out, recursive = TRUE, showWarnings = FALSE)
# Input paths
path_normalized <- here::here("outs", "02_normalize", "normalized_counts.rds")
path_metadata <- here::here("data", "sample_metadata.csv")

Then save all outputs to dir_out:
# Save results
saveRDS(results, file.path(dir_out, "limma_results.rds"))
write_csv(significant, file.path(dir_out, "significant_hits.csv"))
# Save figures
ggsave(file.path(dir_out, "volcano_plot.pdf"), p, width = 6, height = 4)

5.6 How Scripts Connect
Scripts read from two places: data/ (external inputs) and outs/ (outputs from other scripts). Dependencies between scripts are self-documenting through file paths—group all input reads at the top of each script, with comments distinguishing external data from other scripts’ outputs:
#| label: inputs
# --- Inputs (from other scripts) ---
normalized <- readRDS(here("outs/02_normalize/normalized_counts.rds"))
# --- Inputs (external data) ---
metadata <- read_csv(here("data/sample_metadata.csv"))
annotations <- read_tsv(here("data/gene_annotations.tsv"))

Reading the top of any script shows exactly what it depends on and where those files come from. No separate manifest or pipeline specification needed—the dependencies are right there in the code.
Dependencies don’t have to be strictly linear. A plotting script might read from the original data, from an early QC script, and from a later differential analysis. Here’s what that looks like as a dependency diagram:

flowchart LR
data["data/*"] --> s01["01_import_qc"]
s01 --> s02["02_normalize"]
s02 --> s03["03_differential"]
s03 --> s04["04_volcano_plots"]
s02 --> s05["05_heatmaps"]
s03 --> s05
data --> s05
Because dependencies are encoded as file paths, you can trace them with a simple search:
# Find which script produces a file
grep -r "limma_results.rds" scripts/
# Find all scripts that depend on script 02's outputs
grep -r "outs/02_normalize" scripts/

5.7 The Exploratory Directory
Not everything you write is part of the main analysis pipeline. Sometimes you need to test an idea, try a new visualization, or run a quick sanity check. That’s what scripts/exploratory/ is for:
scripts/
├── 01_import_qc.qmd
├── 02_normalize.qmd
├── exploratory/
│ ├── test_umap_parameters.qmd
│ └── compare_normalization.qmd
Exploratory scripts follow relaxed rules:
- No number prefixes or BUILD_INFO.txt required
- No other script depends on them—this is the critical rule. Exploratory scripts can read from any outs/ folder, but nothing outside of exploratory/ reads from exploratory outputs. This is a one-way dependency.
- Can be cleaned out periodically without breaking anything in the main pipeline
- Good candidates for promotion—if an exploratory analysis proves valuable, promote it to a numbered script in the main directory
The one-way dependency rule is what makes the exploratory directory safe to experiment in. You can write, modify, or delete anything in there without worrying about breaking the analysis pipeline.
5.8 Growing a Project: Flat vs. Sectioned
The structure shown above is a flat layout—all scripts in one directory, one numbering sequence. This works well for small to medium projects with a single analytical thread and fewer than about ten scripts.
When a project grows to include multiple distinct analyses—say, phosphoproteomics and transcriptomics from the same experiment—a flat layout gets unwieldy. That’s when you switch to a sectioned layout, where scripts, data, and outputs are organized into subdirectories by analytical thread:
project/
├── scripts/
│ ├── phosphoproteomics/
│ │ ├── 01_import_qc.qmd
│ │ ├── 02_normalize.qmd
│ │ └── 03_differential.qmd
│ ├── transcriptomics/
│ │ ├── 01_import.qmd
│ │ └── 02_pca.qmd
│ ├── combined/
│ │ └── 01_integration.qmd
│ └── exploratory/
├── data/
│ ├── phosphoproteomics/
│ └── transcriptomics/
└── outs/
  ├── phosphoproteomics/
  │ ├── 01_import_qc/
  │ ├── 02_normalize/
  │ └── 03_differential/
  ├── transcriptomics/
  │ ├── 01_import/
  │ └── 02_pca/
  └── combined/
    └── 01_integration/
A few things to notice:
- Numbering restarts at 01_ in each section. Each analytical thread has its own sequence.
- Data and outs mirror the section structure. scripts/phosphoproteomics/ has a corresponding data/phosphoproteomics/ and outs/phosphoproteomics/.
- Shared data that multiple sections use can live at the top level of data/ (e.g., data/gene_annotations/).
- Cross-section scripts like combined/01_integration.qmd can read from any section’s outs/ folder.
Use descriptive names for sections—names that describe what the analysis is about (phosphoproteomics, transcriptomics, figures) rather than generic labels. A new lab member should be able to look at the directory names and understand the project’s scope.
5.8.1 Alternative: Prefixes
For projects with just two or three analytical threads and only a few scripts each, subdirectories can be overkill. You can use prefixes instead:
scripts/
├── phospho_01_qc.qmd
├── phospho_02_normalize.qmd
├── trans_01_import.qmd
└── combined_01_integration.qmd
Output folders mirror the naming: outs/phospho_01_qc/, outs/trans_01_import/, etc. Either approach works—the key is that outputs always mirror the script organization so provenance is clear.
5.8.2 When to Switch
Start flat. Switch to sectioned when you find yourself with more than about ten scripts or two distinct analytical threads competing for number slots. You don’t need to plan for sectioning from the start—restructuring a flat project into sections is straightforward because each script’s output folder moves with it.
5.9 Script Lifecycle
As an analysis evolves, scripts move through stages. Track this with a status field in the YAML frontmatter of each .qmd file:
---
title: "Differential Expression"
status: development
---

| Status | Meaning | Location |
|---|---|---|
| development | In active development; outputs may change | scripts/ |
| finalized | Outputs are publication-ready; modify only with deliberate re-validation | scripts/ |
| deprecated | Superseded by a newer script; kept for reference | scripts/old/ |
Most scripts spend their life in development. When the results are solid and heading toward a paper, mark them finalized. This signals to collaborators (and to yourself) that changes should be deliberate—if you rerun a finalized script, you should check that the outputs still match what went into the manuscript.
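To see at a glance where every script stands, a small sketch (a hypothetical helper, not a lab tool) can pull the status field out of each script's frontmatter:

```python
# Hypothetical helper: report each top-level script's lifecycle status
# by reading the `status:` line from its YAML frontmatter.
import re
from pathlib import Path

def script_statuses(scripts_dir: Path) -> dict[str, str]:
    statuses = {}
    for qmd in sorted(scripts_dir.glob("*.qmd")):
        m = re.search(r"^status:\s*(\S+)", qmd.read_text(), flags=re.M)
        statuses[qmd.name] = m.group(1) if m else "(none)"  # field missing
    return statuses
```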
When a script is superseded, move it to scripts/old/ and add a deprecated_by field pointing to its replacement:
---
title: "Old Heatmaps"
status: deprecated
deprecated_by: 06_improved_heatmaps.qmd
---

This makes it clear which script replaced it, while Git preserves the full history. Don’t just delete old scripts if someone might want to reference them—the old/ directory keeps them visible without cluttering the main listing.
5.10 Keeping Things Clean
5.10.1 Naming Conventions
Good file names are lowercase, descriptive, and use underscores:
- Scripts: 01_import_qc.qmd, not 01_Analysis.qmd or 01 import qc.qmd
- Output files: normalized_counts.rds, not data.rds or output.csv
- Multiple similar files: use a consistent pattern like volcano_3min.pdf, volcano_15min.pdf
Avoid spaces (they cause problems in terminal commands), generic names (results.csv, figure1.pdf), and date prefixes on every file—let Git track versions instead.
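If you want to sanity-check names mechanically, a rough sketch of the character rules above (a simplification: it checks characters, not descriptiveness):

```python
# Rough mechanical check of the naming rules: lowercase letters, digits,
# underscores, and dots only. It cannot judge whether a name is descriptive.
import re

def follows_naming(filename: str) -> bool:
    return bool(re.fullmatch(r"[a-z0-9_.]+", filename))
```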
5.10.2 Cross-Language Data
When data produced by an R script needs to be read by a Python script (or vice versa), use Parquet format. CSV files lose type information — a column of integers might be read back as strings, or dates might be misinterpreted. Parquet avoids this by storing column types alongside the data, so what you save is exactly what you read back. Parquet files are also smaller and faster to read than CSV in both languages.
In R, Parquet support comes from the arrow package. Install it with renv::install("arrow") if you haven’t already:
# R: save as Parquet
arrow::write_parquet(results, file.path(dir_out, "results.parquet"))
# R: read Parquet
results <- arrow::read_parquet(here("outs/01_analysis/results.parquet"))

# Python: save as Parquet
results.to_parquet(out_dir / "results.parquet")
# Python: read Parquet
results = pd.read_parquet(PROJECT_ROOT / "outs/01_analysis/results.parquet")

Within a single language, use native formats—.rds for R objects, .pkl for Python objects. Parquet is specifically for data that crosses the language boundary.
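The type-loss problem is visible even with Python's standard library: CSV round-trips every value as text, which is exactly what Parquet's stored schema avoids.

```python
# CSV stores no column types: every value comes back as text.
import csv
import io

rows = [{"gene": "TP53", "count": 42}]
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["gene", "count"])
writer.writeheader()
writer.writerows(rows)

buf.seek(0)
back = list(csv.DictReader(buf))
print(type(back[0]["count"]))  # <class 'str'>, the integer came back as text
```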
5.10.3 Helper Functions
When you find yourself copying the same function between scripts, move it to a shared location:
Project-level helpers live in R/ and python/ at the project root:
# In any script, load project helpers with:
source(here("R/gene_helpers.R"))

# In any script, load project helpers with:
import sys
sys.path.insert(0, str(PROJECT_ROOT / "python")) # tell Python where to find your modules
from gene_helpers import normalize_name

Python doesn’t automatically know to look in your project’s python/ folder for modules — sys.path.insert adds that folder to the list of places Python searches when you import something. The R equivalent (source()) is simpler because it takes a direct file path.
Fix functions in place and let Git track the history—don’t version function names (make_gene_short_v2).
Cross-project helpers go in ~/lib/R/ and ~/lib/python/. When a project-level function proves useful across two or more projects, promote it to your personal library. This keeps project repositories clean while making reusable code accessible everywhere.
5.10.4 Version Control
Your .gitignore should generally include:
- outs/ — generated outputs can be regenerated from code
- *_files/ and .quarto/ — Quarto rendering artifacts
- renv/library/ and renv/staging/ — renv installs packages from the lock file
- .DS_Store, .vscode/, .positron/ — OS and IDE files
- .env, *.pem, credentials.json — secrets
Whether to commit data/ depends on file sizes. Small data files (a few MB) can be committed so the project is self-contained. Large files should be gitignored, with a data/README.md documenting where to get them. The Git & GitHub chapter covers version control in detail, and Appendix C has a complete .gitignore template.
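As a sketch, the core of such a .gitignore (Appendix C has the complete template):

```gitignore
outs/
*_files/
.quarto/
renv/library/
renv/staging/
.DS_Store
.vscode/
.positron/
.env
*.pem
credentials.json
```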
5.10.5 Handling Old Versions
The preferred approach is to use Git—delete old files and recover them from history if needed. For deprecated scripts you want to keep visible, use scripts/old/ with the deprecated status as described in Script Lifecycle. Only add dates to output subdirectories when you genuinely need multiple versions to coexist, like comparing runs with different parameters:
outs/03_differential/
├── 2025-01-15_strict_threshold/
└── 2025-01-20_relaxed_threshold/
5.11 A Complete Example
Here’s a well-organized phosphoproteomics project—the kind of structure you’d end up with after a few months of analysis:
tryptamine_phospho/
├── .claude/
│ ├── CLAUDE.md
│ └── PHOSPHO_PLAN.md # Planning doc tracking analysis progress
├── R/
│ └── gene_helpers.R
├── data/
│ ├── README.md # Documents where each file came from
│ ├── raw_counts.csv # From mass spec core
│ ├── sample_metadata.csv # Experimental design
│ └── gene_annotations.tsv # Downloaded from UniProt
├── scripts/
│ ├── 01_import_qc.qmd # status: finalized
│ ├── 02_normalize.qmd # status: finalized
│ ├── 03_differential.qmd # status: finalized
│ ├── 04_volcano_plots.qmd # status: development
│ ├── 05_heatmaps.qmd # status: development
│ ├── exploratory/
│ │ └── test_new_clustering.qmd
│ └── old/
│   └── 04_volcano_v1.qmd # status: deprecated, replaced by 04
├── outs/
│ ├── 01_import_qc/
│ │ ├── filtered_counts.rds
│ │ └── BUILD_INFO.txt
│ ├── 02_normalize/
│ │ ├── normalized_counts.rds
│ │ └── BUILD_INFO.txt
│ ├── 03_differential/
│ │ ├── limma_results.rds
│ │ ├── significant_hits.csv
│ │ └── BUILD_INFO.txt
│ ├── 04_volcano_plots/
│ │ ├── volcano_3min.pdf
│ │ └── volcano_15min.pdf
│ └── 05_heatmaps/
│   └── heatmap_all_clusters.pdf
├── environment.yml
├── renv.lock
├── .gitignore
└── README.md
From this structure, anyone can understand the project without reading a single line of code:
- Where did the data come from? Check data/README.md.
- What’s the analysis pipeline? Read the numbered scripts in order.
- Which script produced limma_results.rds? It’s in outs/03_differential/, so script 03 made it.
- Is the analysis finalized? Check the status field—scripts 01–03 are finalized, 04–05 are still in development.
- What code version produced these outputs? Check BUILD_INFO.txt in each output folder.
- How do I reproduce everything? Set up the environment (environment.yml + renv.lock), then run the scripts in order.
Claude Code can scaffold this entire structure for you and help maintain it as your project grows.
I’m starting a new project analyzing RNA-seq data from three conditions with two time points. I’ll use R for the analysis. Can you set up the project?
Claude will run /new-project to create the directory structure, initialize Git, set up renv, create a .claude/CLAUDE.md, and push to GitHub—all configured for your specific analysis.