2  Getting Started

From zero to running fMRIPrep. Quick start takes 10 minutes; full onboarding takes under 30.

Already set up? Jump to Run the Pipeline.


2.1 Quick Start (10 minutes)

2.1.1 Prerequisites

  • Git - Install Git
  • Python 3.9+ - Install Python or use your system’s package manager
  • Docker / Apptainer / Singularity - Docker, or use Apptainer/Singularity on HPC

2.1.2 1. Create Your Project (2 minutes)

Option A: From GitHub Template

  1. Go to CNClaboratory/Reproducible-fMRI
  2. Click “Use this template” -> “Create a new repository”
  3. Clone your new repository:
git clone https://github.com/YOUR-ORG/your-project.git
cd your-project

Option B: Direct Clone

git clone https://github.com/CNClaboratory/Reproducible-fMRI.git my-project
cd my-project
rm -rf .git && git init  # Start fresh git history

2.1.3 2. One-command bootstrap (3 minutes)

make setup

This auto-detects your site (UCI HPC3, NEU Explorer, UCR HPCC, or a local workstation), copies the matching preset to config/paths.toml + config/site.conf, runs uv sync to install Python dependencies, and runs preflight_check.sh --fix to create any missing data directories. On unknown hosts it falls back to the local preset so you always land on a working config.
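The detection step can be pictured as a hostname match with a guaranteed fallback. A minimal Python sketch (the substring patterns here are illustrative assumptions; the authoritative rules live in the setup scripts):

```python
import socket

# Hostname substrings below are illustrative assumptions, not the actual
# patterns -- the authoritative detection rules live in the setup scripts.
PRESETS = {
    "hpc3": "uci",       # UCI HPC3
    "explorer": "neu",   # NEU Explorer
    "hpcc": "ucr",       # UCR HPCC
}

def detect_preset(hostname=None):
    """Map a hostname to a preset name, falling back to 'local'."""
    hostname = (hostname or socket.gethostname()).lower()
    for pattern, preset in PRESETS.items():
        if pattern in hostname:
            return preset
    return "local"  # unknown host: always land on a working config
```

The fallback branch is what guarantees you "always land on a working config" even on an unrecognized machine.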

After make setup, edit the placeholder values in the two config files:

$EDITOR config/paths.toml config/site.conf

Re-verify once edits are done:

make preflight

Alternative: ./setup.sh runs an older interactive wizard that also installs uv + dependencies and walks you through DataLad/FreeSurfer setup. Use it if you prefer prompts over editing files directly. Both paths produce the same end state.

2.1.4 3. Add Your Data

Your data repository should follow BIDS format:

my-project-data/
├── rawdata/
│   ├── dataset_description.json
│   ├── participants.tsv
│   └── sub-01/
│       ├── anat/
│       │   └── sub-01_T1w.nii.gz
│       └── func/
│           ├── sub-01_task-rest_bold.nii.gz
│           └── sub-01_task-rest_events.tsv
├── derivatives/
├── behdata/
└── analysis-cache/

Converting to BIDS? We recommend ezBIDS - a web-based tool that makes BIDS conversion painless.
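A cheap structural check can catch missing top-level files before you submit anything. A sketch only (not a substitute for the full BIDS validator or ezBIDS):

```python
from pathlib import Path

def quick_bids_check(rawdata):
    """Cheap structural sanity check of a BIDS rawdata tree.
    A sketch only -- run the real BIDS validator for full coverage."""
    rawdata = Path(rawdata)
    problems = []
    if not (rawdata / "dataset_description.json").is_file():
        problems.append("missing dataset_description.json")
    if not (rawdata / "participants.tsv").is_file():
        problems.append("missing participants.tsv")
    if not any(p.is_dir() for p in rawdata.glob("sub-*")):
        problems.append("no sub-* directories found")
    return problems
```

An empty return list means the tree at least has the skeleton shown above.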

2.1.5 4. Run fMRIPrep

# One subject, preview only
DRY_RUN=1 make preprocess BATCH_LABEL=pilot SUBJECT=sub-01

# One subject for real
make preprocess BATCH_LABEL=pilot SUBJECT=sub-01

# Full pipeline: QC -> preprocess -> denoise -> GLM
make all BATCH_LABEL=batch1 MODEL=models/task-main.smdl.json

2.1.6 5. Find Your Results

After fMRIPrep completes, find outputs in your data repository:

my-project-data/
└── derivatives/
    └── fmriprep/
        ├── sub-01/
        │   ├── anat/
        │   │   └── sub-01_desc-preproc_T1w.nii.gz
        │   └── func/
        │       └── sub-01_task-rest_space-MNI152NLin2009cAsym_desc-preproc_bold.nii.gz
        └── sub-01.html  # Visual QC report

Open the .html report in your browser to visually inspect preprocessing quality.

2.1.7 Platform Notes

Platform Notes
Windows fMRIPrep requires Docker Desktop with WSL2 backend. Run from WSL2 terminal. Use forward slashes and absolute:: prefix in paths.toml.
macOS Docker Desktop works out of the box. Allocate at least 8 GB RAM in preferences.
Linux (no Docker) Use Singularity/Apptainer: singularity build fmriprep.sif docker://nipreps/fmriprep:24.0.1
Local (no HPC) run_fmriprep_local.sh auto-detects CPU/RAM. For < 20 subjects, local is simpler than HPC.

2.1.8 Common Issues

Problem Solution
“paths.toml not found” Run make setup (auto-picks preset)
“FreeSurfer license missing” Get free license at https://surfer.nmr.mgh.harvard.edu/registration.html
Docker permission denied sudo usermod -aG docker $USER then log out/in
“fmriprep not found” Set FMRIPREP_MODULE or CONTAINER_ROOT in config/site.conf
“sbatch not found” on laptop Leave SLURM_ACCOUNT + SLURM_PARTITION empty in site.conf

Ready to go deeper? Continue below for full onboarding details, or jump to:

  • Data Setup for DataLad/git-annex
  • HPC Guide for SLURM and container setup
  • Analysis for BIDS Stats Models and confound strategies



2.2 Full Onboarding (30 minutes)

The sections below explain what make setup does under the hood for when you need to customize or debug it.


2.3 TL;DR — One-command bootstrap

git clone https://github.com/CNClaboratory/<your-project>.git
cd <your-project>
make setup                          # auto-detects site, copies preset,
                                    # runs uv sync, runs preflight --fix
$EDITOR config/paths.toml config/site.conf   # replace placeholders
make preflight                      # re-verify after editing
make all BATCH_LABEL=my-study       # run the full pipeline

make setup recognizes UCI HPC3, NEU Explorer, UCR HPCC and common local workstations by hostname. On unknown hosts it falls back to the local preset so you always land on a working config. The preflight check runs in --fix mode so missing data directories are created automatically, and it auto-detects local mode from site.conf so laptop runs don’t fail on “sbatch not found”.



2.4 How Configuration Works

You need two config files — one for Python (where the data lives) and one for Bash (how to run on this machine):

File What it configures Who reads it
config/paths.toml Data locations (rawdata, derivatives, etc.) Python (libs/paths.py)
config/site.conf Execution environment (SLURM, modules, containers) Bash (pipeline scripts)

Both are git-ignored. You create them once from a preset (a matched pair for your site), then forget about them.

For machine-specific tweaks without editing the shared config, use config/paths.local.toml (overrides paths.toml keys).
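The override semantics — only the keys you specify are replaced — can be sketched as a shallow per-table merge. This is a simplification (libs/paths.py holds the real logic, and real TOML tables are nested rather than flat "a.b" keys):

```python
def merge_config(base, local):
    """Overlay local keys onto base, table by table. A simplified sketch of
    the paths.local.toml behavior -- libs/paths.py holds the real logic,
    and real TOML tables are nested rather than flat 'a.b' keys."""
    merged = {table: dict(values) for table, values in base.items()}
    for table, values in local.items():
        merged.setdefault(table, {}).update(values)
    return merged

base = {"paths.roots": {"dataset": "/dfs10/lab/data", "codebase": "/dfs10/lab/code"}}
local = {"paths.roots": {"dataset": "/scratch/me/data"}}  # only this key changes
merged = merge_config(base, local)
```

After the merge, `dataset` comes from the local file while `codebase` still inherits from the shared config.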


2.5 1) Clone and install

# Clone to lab storage (not home — quota is too small for data)
cd /path/to/lab/storage/
git clone https://github.com/CNClaboratory/<your-project>.git
cd <your-project>

# Install Python dependencies
curl -LsSf https://astral.sh/uv/install.sh | sh  # skip if uv already installed
source ~/.bashrc
uv sync

If your project has a separate data repo, clone it alongside:

git clone https://github.com/CNClaboratory/<your-project>-data.git
# Layout: lab-storage/<project>/  and  lab-storage/<project>-data/

2.6 2) Choose a preset

Each preset is a matched pair of paths.toml + site.conf:

Site Preset directory Description
UCI HPC3 config/presets/uci/ NeuroCommand modules, /dfs10/ storage, SLURM_CONSTRAINT="intel"
UCR HPCC config/presets/ucr/ Direct Singularity, /bigdata/ storage
NEU Explorer config/presets/neu/ Apptainer system-wide, /projects/ storage, CONTAINER_ROOT-based
Local config/presets/local/ Docker or direct install, sibling data folder

New to HPC? See HPC_GUIDE.md for the complete guide from zero (SSH keys, accounts, containers, everything).

Recommended: run make setup and let it pick the preset for you. It auto-detects known hosts and falls back to local on unknown systems:

make setup

Manual alternative (if you want to pick a different preset or make setup misidentified your site):

cp config/presets/uci/paths.toml config/paths.toml
cp config/presets/uci/site.conf  config/site.conf

Your site isn’t listed? Copy the closest match — neu/ for container-based HPC without NeuroCommand modules, uci/ for NeuroCommand-based HPC, local/ for a workstation. Then edit the fields documented in the comments.


2.7 3) Edit your config

Open config/paths.toml and replace placeholders:

nano config/paths.toml   # or: code config/paths.toml

At minimum, update [paths.roots]. The lab storage convention (UCI shown; NEU/UCR follow the same shape) is:

  • codebase — your per-user clone of this repo: /dfs10/meganakp_lab/<user>/repos/<repo>
  • dataset — the shared BIDS tree for one project + dataset: /dfs10/meganakp_lab/Projects/<project>/<dataset>/

[paths.roots]
codebase = "/dfs10/meganakp_lab/eolsson1/repos/Reproducible-fMRI"
dataset  = "/dfs10/meganakp_lab/Projects/lc-study/main-cohort"

Each researcher owns their own clone (so .venv + edits don’t collide), while rawdata and derivatives are shared under Projects/<project>/<dataset>/. A project can hold multiple datasets (e.g. pilot and main-cohort).

Open config/site.conf and fill in your SLURM values:

nano config/site.conf

If you don’t know your SLURM account or partitions, run:

# Find your SLURM account:
sacctmgr show associations user=$USER format=account%30

# Find available partitions:
sinfo -s

Pro tip: If you share paths.toml with your team (checked into git), create config/paths.local.toml for your personal overrides. Only the keys you specify are overridden.


2.8 4) Verify setup

make preflight

This runs 10 checks: Python version + uv, config files, path resolution, data directories, FreeSurfer license, SLURM scheduler, container runtime, container images, BIDS structure (with --bids-dir), and disk space.

Two things the preflight does automatically that used to trip people up:

  • Auto-local mode: If site.conf has no SLURM_ACCOUNT and no SLURM_PARTITION (i.e. you’re on a laptop), the SLURM and container checks are skipped automatically. You won’t see bogus sbatch not found failures on a local workstation.
  • --fix mode: make setup invokes preflight with --fix, which creates any missing data directories (rawdata/, derivatives/, etc.) on first run. Running make preflight directly does not pass --fix — add it yourself if you want auto-creation: bash scripts/setup/preflight_check.sh --fix.

Fix any FAIL items before proceeding. Warnings are informational but don’t block pipeline runs.
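The auto-local rule reduces to a single condition: both SLURM fields empty. A sketch of the documented behavior (the real check lives in preflight_check.sh):

```python
def is_local_mode(site_conf):
    """Auto-local detection: no SLURM account *and* no partition means a
    workstation, so SLURM/container checks are skipped. A sketch of the
    documented behavior -- the real check lives in preflight_check.sh."""
    return not site_conf.get("SLURM_ACCOUNT") and not site_conf.get("SLURM_PARTITION")
```

Setting either value is enough to re-enable the scheduler checks.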

2.8.1 FreeSurfer license

FreeSurfer tools (fMRIPrep, PyCortex integrations) need a valid license file.

  1. Register (free) at https://surfer.nmr.mgh.harvard.edu/registration.html.

  2. Save the text you receive to config/licenses/fs_license.txt (this path is gitignored but tracked via .gitkeep). Alternatively, place it at ~/.freesurfer/license.txt.

  3. Devcontainer users: The FS_LICENSE environment variable already points to /workspaces/Reproducible-FMRI/config/licenses/fs_license.txt. Drop the file at that location and restart your shell if FreeSurfer was already running.

  4. Outside the devcontainer: export the variable yourself:

    export FS_LICENSE="$PWD/config/licenses/fs_license.txt"

    You can also add this line to .env so uv run picks it up automatically.

Commands such as recon-all and fmriprep will fail fast if the license is missing, so confirm the file exists before launching long jobs.
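A small pre-job check for the license locations mentioned above can save a failed long job. The search order in this sketch ($FS_LICENSE, then the repo's config/licenses/, then ~/.freesurfer/) is an assumption based on the conventions described, not the project's actual code:

```python
import os
from pathlib import Path

def find_fs_license(repo_root):
    """Return the first FreeSurfer license file found. The search order
    ($FS_LICENSE, then config/licenses/, then ~/.freesurfer/) is an
    assumption based on the conventions described above."""
    candidates = []
    if os.environ.get("FS_LICENSE"):
        candidates.append(Path(os.environ["FS_LICENSE"]))
    candidates.append(Path(repo_root) / "config" / "licenses" / "fs_license.txt")
    candidates.append(Path.home() / ".freesurfer" / "license.txt")
    for cand in candidates:
        if cand.is_file():
            return cand
    raise FileNotFoundError("FreeSurfer license not found; see the registration link above")
```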

2.8.2 Common fixes

Failure Fix
paths.toml not found make setup (auto-picks preset) or copy one manually from config/presets/<site>/
FreeSurfer license not found See above — register and place fs_license.txt
Singularity not available module load singularity or point CONTAINER_ROOT at a dir with fmriprep*.sif
Containers not found bash scripts/setup/pull_containers.sh --dest "$CONTAINER_ROOT"
fmriprep not found after all resolution strategies Set FMRIPREP_MODULE (strategy 1) or CONTAINER_ROOT (strategy 2) in config/site.conf
Fieldmap JSONs missing B0FieldIdentifier and IntendedFor Add B0FieldIdentifier to your dcm2bids/heudiconv config — fMRIPrep silently skips SDC without it
sbatch not found (on a laptop) Leave SLURM_ACCOUNT and SLURM_PARTITION empty in site.conf — preflight will auto-skip SLURM checks
paths.toml contains placeholder values Edit config/paths.toml and replace <lab>, <user>, <project>, <group>, <username> with real values
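The fieldmap row above is worth automating, since the failure is silent. A hedged sketch of a pre-run scan (field names follow the BIDS spec; the glob pattern assumes the rawdata layout shown earlier):

```python
import json
from pathlib import Path

def check_fieldmap_sidecars(rawdata):
    """Flag fmap sidecars carrying neither B0FieldIdentifier nor IntendedFor,
    since fMRIPrep silently skips distortion correction without them.
    Field names follow the BIDS spec; the glob assumes the layout above."""
    flagged = []
    for sidecar in sorted(Path(rawdata).glob("sub-*/fmap/*.json")):
        meta = json.loads(sidecar.read_text())
        if "B0FieldIdentifier" not in meta and "IntendedFor" not in meta:
            flagged.append(sidecar.name)
    return flagged
```

Run it over rawdata/ before submitting a batch; any flagged file means SDC would be skipped for that subject.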

2.9 5) Run the pipeline

# See all available commands
make help

# Preview what would run (no actual jobs submitted)
DRY_RUN=1 make preprocess BATCH_LABEL=my-study SUBJECT=sub-01

# Run for real — one subject
make preprocess BATCH_LABEL=my-study SUBJECT=sub-01

# Run all subjects
make preprocess BATCH_LABEL=my-study

# Full pipeline: QC -> preprocess -> denoise -> statistics
make all BATCH_LABEL=my-study MODEL=models/task-main.smdl.json

2.9.1 Pipeline order + expected runtimes

Stage Command Time per subject Memory Notes
QC make qc 15-30 min 8 GB MRIQC. Lightweight; runs first
Preprocess make preprocess 4-12 hours ~32 GB exclusive fMRIPrep. The big one. Includes FreeSurfer recon-all on first run; rerun is much faster (1-2h) due to caching
Validate make validate-fmriprep < 5 min 1 GB Output gate — no point starting downstream stages until this passes
Denoise make denoise 30-90 min 128 GB (CIFTI+atlas) or 64 GB (minimal) XCP-D. Resting-state FC only; NOT for task GLM
GLMsingle make glmsingle 1-3 hours 32-64 GB Single-trial betas. Needed for MVPA/RSA
GLM make glm 1-2 hours 16 GB FitLins or nilearn. Reads .smdl.json
Reports make report SUBJECT=sub-01 < 1 min 2 GB HTML subject report
Group reports make group-report 5-10 min 4 GB Cohort dashboard

First-time deployment to a new HPC site: budget an additional ~30 min to pull containers (bash scripts/setup/pull_containers.sh --dest "$CONTAINER_ROOT") plus ~5 CPU-hours for the real-site smoke test (bash scripts/tests/run_new_site_smoke.sh).

2.9.2 Monitor jobs

make status                    # SLURM queue
sacct -j JOBID --format=State  # specific job
tail -f logs/fmriprep/*.out    # live output

2.10 What each config file does

2.10.1 paths.toml — “address book”

Tells Python where to find data. Two-tier design:

[paths.roots]
dataset = "/absolute/path/to/data"     # base for everything below
codebase = "/absolute/path/to/code"

[paths.locations]
rawdata_root = "dataset::rawdata"      # -> <dataset>/rawdata
derivatives_root = "dataset::derivatives"
# add custom locations here — no Python changes needed
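The two-tier lookup — root name before the ::, subpath after — can be sketched as a split-and-join (libs/paths.py is the authoritative implementation):

```python
from pathlib import PurePosixPath

def resolve_location(spec, roots):
    """Resolve a 'root::subpath' spec against [paths.roots]. A sketch of the
    two-tier design -- libs/paths.py is the authoritative implementation."""
    root_name, _, subpath = spec.partition("::")
    return PurePosixPath(roots[root_name]) / subpath

roots = {"dataset": "/data", "codebase": "/code"}
```

With these roots, `resolve_location("dataset::rawdata", roots)` yields /data/rawdata, which is why adding a custom location needs no Python changes.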

Override per-machine with paths.local.toml:

# Only this key is overridden; everything else inherits
[paths.roots]
dataset = "/different/machine/path"

2.10.2 site.conf — “machine manual”

Tells Bash scripts how to run on this machine:

# SLURM
SLURM_ACCOUNT="mylab"
SLURM_PARTITION="standard"
SLURM_PARTITION_FREE="free"
# Optional feature tag, appended to sbatch only when non-empty.
# Leave empty on clusters without feature tags (e.g. NEU Explorer).
SLURM_CONSTRAINT="intel"

# Tool resolution — HPC scripts try these in order:
#   1. module load                (NeuroCommand)
#   2. find_container <tool>      (scan $CONTAINER_ROOT -> $NEUROCOMMAND_PATH -> $CONTAINER_PATH)
#   3. tool already on $PATH
# First hit wins. Set whichever combination your site needs.

# 1. NeuroCommand (UCI HPC3 pattern)
MODULE_USE_PATH="/path/to/neurocommand/modules"
NEUROCOMMAND_PATH="/path/to/neurocommand"
SINGULARITY_MODULE="singularity/3.11.3"

# 2. CONTAINER_ROOT (NEU Explorer, UCR, anything without NeuroCommand modules)
# Scripts scan this directory for fmriprep*.sif, mriqc*.sif, xcp_d*.sif.
# Populate with: scripts/setup/pull_containers.sh --dest "$CONTAINER_ROOT"
CONTAINER_ROOT="/projects/mylab/containers"

# Tool versions (module names for strategy 1; filenames for strategy 2)
FMRIPREP_MODULE="fmriprep/25.2.3"
MRIQC_MODULE="mriqc/24.0.2"
XCPD_MODULE="xcpd/0.10.0"

FS_LICENSE="$HOME/.freesurfer/license.txt"

Legacy CONTAINER_PATH (single-file mode) still works as a last-resort fallback, but CONTAINER_ROOT is the preferred way to point at a directory of tool containers.
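Strategy 2's directory scan can be sketched as a first-hit glob over candidate roots (the real resolver is a Bash function; this Python sketch mirrors the documented behavior):

```python
from pathlib import Path

def find_container(tool, search_roots):
    """Scan candidate directories for <tool>*.sif; first hit wins. A Python
    sketch of strategy 2 -- the real resolver is a Bash function."""
    for root in search_roots:
        root = Path(root)
        if not root.is_dir():
            continue  # e.g. CONTAINER_ROOT unset on this site
        hits = sorted(root.glob(f"{tool}*.sif"))
        if hits:
            return hits[0]
    raise FileNotFoundError(f"no {tool}*.sif under {search_roots}")
```

The search list would be CONTAINER_ROOT, then NEUROCOMMAND_PATH, then CONTAINER_PATH, matching the order in the comments above.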

2.10.3 tooling.toml — “API keys and tools”

For optional integrations (GitHub, paper processing, Google Drive). This file is git-ignored; copy from the example and fill in your credentials:

cp config/tooling.example.toml config/tooling.toml

[github]
username = "your-username"
token = "ghp_..."
star_list = "reproducible-fmri"

[processing]
max_repos_to_fetch = 50
max_repos_to_process = 10
enable_onefilellm = false

[papers]
drive_folder_id = "0AM...share"
service_account_json = "FILE:/path/to/service-account.json"

2.11 Configuration Deep-Dive

2.11.1 Priority order

Configuration values are resolved in this order (first wins):

Priority Source Use case
1 CLI arguments One-off overrides
2 Environment variables CI, temporary shells
3 config/*.toml files Persistent local config
4 Defaults in code Fallbacks
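The first-wins order can be sketched as a short resolution chain (a simplification of the priority table; not the project's actual code):

```python
import os

def resolve(cli_value, env_var, toml_value, default):
    """First non-missing source wins: CLI > environment > config file > default.
    A sketch of the documented priority order, not the project's actual code."""
    if cli_value is not None:
        return cli_value
    if env_var in os.environ:
        return os.environ[env_var]
    if toml_value is not None:
        return toml_value
    return default
```

So exporting PATHS_DATASET_ROOT beats the value in paths.toml, but an explicit CLI argument beats both.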

2.11.2 Environment variable overrides

Any config value can be overridden via environment variables. The most common:

Variable Overrides
PATHS_DATASET_ROOT [paths.roots].dataset
PATHS_CODEBASE_ROOT [paths.roots].codebase
GITHUB_TOKEN [github].token
GITHUB_USERNAME [github].username

2.11.3 Python API for paths

Access configured paths in code:

from libs.paths import get_paths

paths = get_paths()
raw = paths.rawdata_root           # /path/to/data/rawdata
deriv = paths.derivatives("fmriprep", "sub-01")  # Build subpaths
custom = paths.location("my_custom_root")        # Any configured location

See libs/paths.py for all available methods.

2.11.4 CI / GitHub Actions

For automated workflows, use GitHub Secrets instead of config files:

env:
  GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
  PATHS_DATASET_ROOT: /tmp/test-data

Never commit tokens or machine-specific paths. The example files and presets provide patterns that each user customizes locally.


2.12 Data Repository Setup

If your project stores data in a separate DataLad-managed repository, mount it after cloning the code repo:

datalad clone git@github.com:YourOrg/your-project-data.git /path/to/data
datalad get derivatives/fmriprep

Then update config/paths.toml so [paths.roots].dataset points to the mounted dataset. Ensure the cache directory has write permissions.
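A quick pre-run check that the mounted dataset is reachable and the cache writable might look like the following. This is a sketch, not part of the preflight script; directory names follow the data layout from the quick start:

```python
import os
from pathlib import Path

def check_dataset_mount(dataset_root):
    """Verify the mounted dataset is reachable and its cache is writable.
    A pre-run sketch; directory names follow the quick-start data layout."""
    dataset_root = Path(dataset_root)
    if not dataset_root.is_dir():
        return [f"dataset root missing: {dataset_root}"]
    problems = []
    cache = dataset_root / "analysis-cache"
    if not cache.is_dir():
        problems.append("analysis-cache/ missing")
    elif not os.access(cache, os.W_OK):
        problems.append("analysis-cache/ not writable")
    return problems
```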


2.13 Literature Cache

The paper extraction pipeline writes the full converted library to a cache directory resolved by libs.paths.references_papers_dir() (configured via config/paths.toml under analysis_cache_root).

For figures, tables, and equations as actual images (recommended for complex visual reasoning), run:

uv run python scripts/papers/enrich.py --dpi 200

This writes cropped assets under <analysis_cache_root>/references/papers/assets/<slug>/ (e.g., figures/*.png, tables/*.png, equations/*.png plus an index *.md).

Keep only curated paper summaries in git under docs/references/papers/ so agents can cite them quickly without relying on external mounts.
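The read preference for agents — Markdown summaries first, JSON only for structured details — can be sketched per paper slug. In practice libs.paths.references_papers_dir() would supply the directory; it is passed in here so the sketch stays self-contained:

```python
from pathlib import Path

def preferred_sources(papers_dir):
    """For each cached paper, prefer the .md summary and fall back to .json.
    In practice the directory comes from libs.paths.references_papers_dir();
    it is passed in here so the sketch stays self-contained."""
    by_slug = {}
    for path in sorted(Path(papers_dir).iterdir()):
        if path.suffix not in (".md", ".json"):
            continue
        if path.stem not in by_slug or path.suffix == ".md":
            by_slug[path.stem] = path
    return by_slug
```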

2.13.1 Making the cache accessible in your editor

If the cache lives outside the git repo (recommended), you can make it frictionless to browse:

  • Multi-root workspace: Add <analysis_cache_root>/references/papers/ as an additional folder in your editor workspace.
  • Gitignored symlink: Create a local symlink in the repo root, e.g., ln -s <analysis_cache_root>/references/papers literature_cache, and keep it gitignored.

2.13.2 Agent prompt snippet

Paste this into your project agent’s system instructions so it knows where to find literature:

Literature is always available in two places:
1) Full extracted cache (preferred): use `libs.paths.references_papers_dir()`
   and read `*.md` first; use `*.json` only for structured details.
2) Curated in-repo summaries: `docs/references/papers/`.

If you need complex figures/tables/equations, also look in
`<analysis_cache_root>/references/papers/assets/<slug>/`.

When answering literature questions, ground claims in extracted summaries and
cite using project BibTeX (`docs/references.bib` or the manuscript Paperpile
bib if present).

2.14 Validate Access

Before starting work, confirm you can reach the services your project uses:

# Check GitHub authentication
uv run python scripts/repositories/starred_repos.py --limit 1

# Confirm Google Drive credentials (if Paperpile integration is needed)
uv run python scripts/papers/convert.py \
  --bib /path/to/test.bib \
  --drive-folder "$PAPERPILE_DRIVE_FOLDER_ID" \
  --limit 0

If either command fails, verify the relevant entries in config/tooling.toml or the corresponding environment variables (GITHUB_TOKEN, GOOGLE_SERVICE_ACCOUNT_JSON).


2.15 Developer Quality Checks

Run these commands before opening pull requests or after pulling major updates:

uv run python -m compileall libs experiments analyses stimuli preprocessing pipelines
uv run ruff check .
uv run pytest

2.16 Repository Structure

Every directory at the top level has a defined purpose:

Directory Purpose Typical contents
analyses/ Domain logic for scientific analyses behavioural/, fmri/, helpers/, Jupyter/Quarto notebooks
preprocessing/ BIDS validation, QC, data ingestion, format conversions Modality-specific subpackages (fmri/, physio/, etc.)
pipelines/ Orchestrators that chain preprocessing + analyses CLI entrypoints, tasks/ reusable steps, workflow definitions
stimuli/ Stimulus definitions and generation tooling generation/ scripts, static assets (images/, sounds/)
experiments/ Runtime code for data collection fmri/, lab/, online/ (PsychoPy/jsPsych projects)
libs/ Shared libraries paths.py, utility modules reused across pipelines
scripts/ Standalone CLIs and automation repositories/, papers/, maintenance scripts
config/ Configuration templates paths.example.toml, tooling.example.toml, presets
docs/ Human-facing documentation and curated references Markdown guides, specifications
tests/ Pytest suite Unit/integration tests

2.16.1 Layout principles

  1. Code-only repository: No raw data, derivatives, or large outputs are committed. Everything routes through libs.paths and lands in an external dataset.
  2. Modal mirroring: Each modality (behavioural, fmri, physio) keeps consistent naming across preprocessing/, analyses/, and experiments/.
  3. Infrastructure reuse: Shared logic (path helpers, validation utilities) lives in libs/ instead of ad-hoc utils.py files.
  4. Pipeline transparency: CLI entrypoints in pipelines/ should be thin orchestrators that import from preprocessing/ and analyses/.
  5. Documentation proximity: Place notebooks and guides next to the code they explain; link them from the documentation index.

2.16.2 Extending the template

  • Add new modalities by creating parallel subdirectories (e.g., analyses/meg/, experiments/meg/).
  • For bespoke experiments, include stimulus generation code under stimuli/generation/ and reference exported assets via config/paths.toml.
  • Use namespace packages or uv workspaces if you need to split code into installable components; keep entrypoints within pipelines/ and scripts/.

2.17 Troubleshooting

Issue Resolution
paths.toml not found make setup (auto-picks preset) or copy one from config/presets/<site>/.
FreeSurfer license not found Register at the FreeSurfer site and place the file at config/licenses/fs_license.txt or ~/.freesurfer/license.txt. See Step 4.
Singularity not available module load singularity / apptainer, or point CONTAINER_ROOT at a directory of *.sif files in config/site.conf.
Containers not found bash scripts/setup/pull_containers.sh --dest "$CONTAINER_ROOT"
Invalid feature specification at SLURM submit Clear SLURM_CONSTRAINT in config/site.conf — your cluster does not use feature tags (e.g. NEU Explorer).
sbatch not found (on a laptop) Leave both SLURM_ACCOUNT and SLURM_PARTITION empty in site.conf; preflight auto-detects local mode.
GitHub rate limiting Ensure config/tooling.toml contains a valid token or export GITHUB_TOKEN.
Missing Google Drive credentials Share the Paperpile folder with the service account email and set GOOGLE_SERVICE_ACCOUNT_JSON.
Pipelines cannot find data Confirm the data repository is mounted and paths in config/paths.toml are correct.

2.18 Testing Your Setup

The template has three testing layers. Knowing which to run when saves wasted compute.

Layer Speed When to run Command
Unit tests < 1 s Before every push uv run pytest tests/ -q
Mock E2E < 5 s Every CI push / PR uv run pytest tests/test_pipeline_end_to_end.py -v
Real-site smoke ~15-30 min Once when onboarding a new HPC bash scripts/tests/run_new_site_smoke.sh

Before pushing a PR:

uv run pytest tests/ -q
bash scripts/setup/test_container_resolution.sh

Before a collaborator starts on a new HPC site:

# After make setup:
bash scripts/tests/run_new_site_smoke.sh

If it exits 0, real subjects are unblocked. If non-zero, the log says which step broke.

Before submitting a real run on a validated site:

make preflight
DRY_RUN=1 make pipeline SUBJECT=sub-XX BATCH_LABEL=your-study MODEL=your.smdl.json

2.19 Reference

Guide When to read
HPC_GUIDE.md SLURM optimization, resource tuning, new-site validation
SCAN_LOGGING.md Tracking scans and anomalies
ANALYSIS.md BIDS Stats Models, confound strategies, pipeline DAG
TEMPLATE_MAINTENANCE.md Syncing updates from the template