4  HPC Guide for Reproducible-fMRI

Complete reference for deploying and running the Reproducible-fMRI pipeline on SLURM-based HPC clusters. Covers everything from zero (account requests, SSH keys, container setup) through daily operations and resource optimization.

Time to deploy: ~2 hours (mostly waiting for account approvals and container pulls).

Already have HPC access? Skip to Install the Pipeline.


4.1 Part 1: Setting Up a New Site

4.1.1 1.1 Request an HPC Account

Site          How to request                                                   Login node
UCI HPC3      https://rcic.uci.edu/account – PI must sponsor                   hpc3.rcic.uci.edu
UCR HPCC      https://hpcc.ucr.edu – PI must sponsor                           cluster.hpcc.ucr.edu
NEU Explorer  https://rc.northeastern.edu/getting-access – PI or self-request  login.explorer.northeastern.edu
Other SLURM   Contact your site’s research computing                           Ask your admin

4.1.2 1.2 Set Up SSH

Generate a key (skip if you already have one at ~/.ssh/id_ed25519):

# On your LOCAL machine (laptop/desktop), not the HPC
ssh-keygen -t ed25519 -C "your.email@university.edu"
# Press Enter for default path, set a passphrase (recommended)

Copy the key to the HPC:

# Replace <user> and <login-node> with your values
ssh-copy-id <user>@<login-node>
# Enter your HPC password when prompted

Create an SSH config for convenience (optional but recommended):

cat >> ~/.ssh/config << 'EOF'
# --- Lab HPC ---
Host hpc
    HostName hpc3.rcic.uci.edu
    User YOUR_USERNAME
    IdentityFile ~/.ssh/id_ed25519

Host hpc-ucr
    HostName cluster.hpcc.ucr.edu
    User YOUR_USERNAME
    IdentityFile ~/.ssh/id_ed25519

Host hpc-neu
    HostName login.explorer.northeastern.edu
    User YOUR_USERNAME
    IdentityFile ~/.ssh/id_ed25519
EOF

Then connect with just ssh hpc (or ssh hpc-neu, etc.).

4.1.3 1.3 Find Your SLURM Account and Partitions

Once logged into the HPC, run these to find the values you’ll need for config:

# Your SLURM account(s) -- note the non-default one (usually your PI's lab)
sacctmgr show associations user=$USER format=account%30

# Available partitions -- note the default (marked with *) and your main one
sinfo -s

# Example output:
# PARTITION    AVAIL  TIMELIMIT   NODES(A/I/O/T)
# standard*      up 14-00:00:0     117/59/5/181    <- UCI default
# free            up 3-00:00:00     141/66/5/212

How to read this:

  • account: your PI’s SLURM allocation name (e.g., meganakp_lab at UCI)
  • PARTITION: the queue name you’ll use (e.g., standard, epyc, short)
  • TIMELIMIT: max wall time per job (fMRIPrep needs ~6-12 hours)
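These two values go into every job you submit. A quick way to confirm the pair is valid (names shown are the UCI examples from above; substitute your own):

srun --account=meganakp_lab --partition=standard --time=00:05:00 hostname
# Prints a compute-node hostname if the account/partition combination works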

4.1.4 1.4 Request Lab Storage (If Needed)

Site          Storage path        How to request                    Default quota
UCI HPC3      /dfs10/<lab>/       Included with account             Shared lab allocation
UCR HPCC      /bigdata/<lab>/     Included with account             Shared lab allocation
NEU Explorer  /projects/<group>/  ServiceNow request via RC portal  35 TB per PI

Do NOT store data in your home directory – quotas are too small (50-100 GB) for neuroimaging data. Use shared lab storage.
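Before copying anything, compare home usage against the lab allocation. A minimal check (the /dfs10 path is the UCI example; use your site’s storage path from the table above):

du -sh "$HOME"              # how much of the small home quota is already used
df -h /dfs10/meganakp_lab   # free space on shared lab storage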

4.1.5 2.1 Clone the Repository

SSH into the HPC and clone into your per-user repos dir on lab storage:

ssh hpc  # or ssh hpc-neu, etc.

# Lab storage convention: <lab-root>/<user>/repos/<repo>
mkdir -p /dfs10/meganakp_lab/$USER/repos
cd       /dfs10/meganakp_lab/$USER/repos

# Clone the code repo (per-user clone — each researcher has their own)
git clone git@github.com:CNClaboratory/<your-project>.git
cd <your-project>

Shared rawdata and derivatives live separately under /dfs10/meganakp_lab/Projects/<project>/<dataset>/ — one BIDS tree per (project, dataset) pair. That directory is created by make setup + paths.local.toml in the next step; don’t clone anything into it.
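Putting the two conventions together, lab storage ends up looking roughly like the sketch below (names reuse this guide’s UCI examples; the rawdata/ and derivatives/ subdirectories are an assumption inferred from the commands later in this guide):

/dfs10/meganakp_lab/                       # lab root
├── eolsson1/
│   └── repos/
│       └── Reproducible-fMRI/             # per-user code clone
└── Projects/
    └── lc-study/
        └── main-cohort/                   # shared BIDS tree for one dataset
            ├── rawdata/                   # (assumed) BIDS raw data
            └── derivatives/               # (assumed) fmriprep, freesurfer, ...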

If your project has a separate data repo (like vividness), it is cloned via git-annex into the Projects/<project>/<dataset>/ tree, not next to your code clone:

# Example:
cd /dfs10/meganakp_lab/Projects/<your-project>
datalad clone https://github.com/CNClaboratory/<your-project>-data.git <dataset>

4.1.6 2.2 Install uv (Python Package Manager)

curl -LsSf https://astral.sh/uv/install.sh | sh
source ~/.bashrc   # or ~/.bash_profile on some systems
uv --version       # verify: should print uv 0.x.x

Important: Always use uv sync and uv add, never uv pip install. The latter bypasses the lockfile and creates irreproducible environments.
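Concretely (nibabel is just an illustrative package name):

# Right: records the dependency in pyproject.toml and uv.lock,
# so the next `uv sync` reproduces the exact same environment
uv add nibabel

# Wrong: installs into .venv without updating the lockfile,
# so your environment silently drifts from what's committed
uv pip install nibabel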

4.1.7 2.3 Install Python Dependencies

cd /dfs10/meganakp_lab/$USER/repos/<your-project>
uv sync

This creates a .venv/ and installs all locked dependencies. Takes ~30 seconds.

4.1.9 3.2 Manual Setup (If Auto-Detect Fails)

make setup normally auto-detects your site and copies the matching preset. If detection fails, copy the preset closest to your site yourself:

# Pick one:
cp config/presets/uci/* config/    # UCI HPC3
cp config/presets/ucr/* config/    # UCR HPCC
cp config/presets/neu/* config/    # NEU Explorer
cp config/presets/local/* config/  # Laptop/Docker

4.1.10 3.3 Edit paths.toml

nano config/paths.toml

Replace ALL placeholders (<lab>, <user>, <repo>, <project>, <dataset>, <group>). The canonical lab storage convention is:

  • codebase = <lab-root>/<user>/repos/<repo> — per-user clone
  • dataset = <lab-root>/Projects/<project>/<dataset> — shared BIDS tree
For example:

[paths.roots]
# CHANGE THESE to your actual paths:
codebase = "/dfs10/meganakp_lab/eolsson1/repos/Reproducible-fMRI"
dataset  = "/dfs10/meganakp_lab/Projects/lc-study/main-cohort"

Each project can hold multiple datasets (e.g. pilot, main-cohort, retest) — add one row in paths.local.toml per dataset.
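As a minimal sketch of pointing your checkout at a different dataset, reuse the paths.local.toml override mechanism from Section 3.5 (the pilot path is hypothetical; the exact multi-dataset schema is project-specific):

[paths.roots]
dataset = "/dfs10/meganakp_lab/Projects/lc-study/pilot"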

4.1.11 3.4 Edit site.conf

nano config/site.conf

Fill in from what you learned in Step 1.3:

SLURM_ACCOUNT="meganakp_lab"     # from: sacctmgr show associations
SLURM_PARTITION="standard"       # from: sinfo -s (the partition you want)

4.1.12 3.5 Multi-Machine Overrides (Optional)

If paths.toml is checked into git (shared with your team), create config/paths.local.toml for your personal machine-specific overrides:

cat > config/paths.local.toml << 'EOF'
# Only the keys you specify are overridden
[paths.roots]
dataset = "/different/path/on/my/machine"
EOF

4.1.13 4.1 Set Up Containers

Neuroimaging tools run inside containers (Singularity/Apptainer) for reproducibility. There are three ways to set them up, depending on your site:

4.1.13.1 Option A: NeuroCommand Modules (UCI HPC3 Only)

UCI has pre-built modules. No container pull needed.

# Add to ~/.bashrc (one-time):
source /dfs10/meganakp_lab/sw/setup-lab-tools.sh

# Verify:
module use /dfs10/meganakp_lab/sw/neurocommand/local/containers/modules
module avail fmriprep

In config/site.conf:

MODULE_USE_PATH="/dfs10/meganakp_lab/sw/neurocommand/local/containers/modules"
FMRIPREP_MODULE="fmriprep/25.2.3"
MRIQC_MODULE="mriqc/24.0.2"
XCPD_MODULE="xcpd/0.10.0"

4.1.13.2 Option B: Direct Container Pull (UCR, NEU, Any Other Site)

Pull container images to shared lab storage:

# Load the container runtime if needed (UCR: module load singularity)
# NEU Explorer: apptainer is system-wide, no module needed

# Pull all pipeline containers (~15-30 min, ~20 GB total)
bash scripts/setup/pull_containers.sh \
    --dest /path/to/lab/containers

# Or pull specific tools only
bash scripts/setup/pull_containers.sh \
    --dest /path/to/lab/containers \
    --tools fmriprep,mriqc

In config/site.conf:

# Point to the container directory
CONTAINER_DIR="/path/to/lab/containers"
CONTAINER_PATH="/path/to/lab/containers/fmriprep-25.2.3.sif"
# Leave MODULE_USE_PATH="" and FMRIPREP_MODULE="" empty

4.1.13.3 Option C: Docker (Local Laptop/Desktop Only)

For local development/testing, Docker works too. The run_fmriprep_local.sh script handles this automatically.

4.1.13.4 Verify Container Access

# The runtime is auto-detected (apptainer > singularity > module)
# Just check one works:
singularity --version 2>/dev/null || apptainer --version 2>/dev/null || echo "MISSING"

4.1.14 5.1 FreeSurfer License

fMRIPrep requires a (free) FreeSurfer license.

  1. Register at https://surfer.nmr.mgh.harvard.edu/registration.html (takes 2 minutes)
  2. Receive license.txt by email
  3. Place it where the pipeline can find it:
# Option A: In the repo config (recommended)
mkdir -p config/licenses
cp ~/Downloads/license.txt config/licenses/fs_license.txt

# Option B: In your home directory
mkdir -p ~/.freesurfer
cp ~/Downloads/license.txt ~/.freesurfer/license.txt

Both locations are auto-detected by the preflight check and HPC scripts.
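To confirm the license landed where the tooling looks for it:

ls -l config/licenses/fs_license.txt ~/.freesurfer/license.txt 2>/dev/null
# At least one of the two paths should be listed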

4.1.15 6.1 Validate Everything

make preflight

Expected output:

 Python 3.11 .............. PASS
 config/paths.toml ........ PASS
 Path resolution .......... PASS
 Key directories .......... PASS
 FreeSurfer license ....... PASS
 SLURM available .......... PASS
 Singularity/Apptainer .... PASS
 Container images ......... PASS
 BIDS structure ........... SKIP (no --bids-dir)
 Disk space ............... PASS

 9 passed, 0 failed, 1 skipped

Fix any FAIL items before proceeding. Common fixes:

Failure                       Fix
paths.toml not found          Copy and edit a preset (Section 3.1-3.3)
Path resolution failed        Check [paths.roots] values exist on disk
FreeSurfer license not found  Register and place license.txt (Section 5.1)
Singularity not available     module load singularity or module load apptainer
Containers not found          Pull containers or set CONTAINER_PATH (Section 4.1)
Disk space low                Move data to lab storage, clean scratch

4.1.16 7.1 Run the Pipeline

4.1.16.1 Preview First (Dry Run)

DRY_RUN=1 make preprocess BIDS_DIR=/path/to/rawdata SUBJECT=sub-01

Check:

  • Correct BIDS directory?
  • Correct output directory?
  • Correct SLURM account?
  • Correct container?

4.1.16.2 Process One Subject

make preprocess BIDS_DIR=/path/to/rawdata SUBJECT=sub-01

Monitor:

make status                           # SLURM queue
tail -f logs/fmriprep/fmriprep_*.out  # live output
sacct -j JOBID --format=State,Elapsed # when done

4.1.16.3 Verify Outputs

After the job completes (~6-12 hours for first subject with FreeSurfer):

# Check HTML report
ls derivatives/fmriprep/sub-01.html

# Check preprocessed BOLD
ls derivatives/fmriprep/sub-01/func/

# Check FreeSurfer reconstruction
ls derivatives/freesurfer/sub-01/

4.1.16.4 Batch Process All Subjects

# Run QC first
make qc BIDS_DIR=/path/to/rawdata

# Then preprocess all subjects
make preprocess BIDS_DIR=/path/to/rawdata

# Then post-processing
make denoise BIDS_DIR=/path/to/rawdata

# Then statistics (if you have a model)
make glm BIDS_DIR=/path/to/rawdata MODEL=models/task.smdl.json

Or run everything:

make all BIDS_DIR=/path/to/rawdata MODEL=models/task.smdl.json

4.2 Part 2: SLURM Best Practices

4.2.1 Philosophy: Maximum Resources for Maximum Speed

Our goal is fastest completion, not resource conservation.

  • Use ALL available CPUs – Query idle nodes, request the maximum
  • Use ALL memory (--mem=0) – Gets all memory on the allocated node
  • Parallelize ALWAYS – Use job arrays for independent subject processing
  • One subject per job – Array jobs complete faster than sequential processing

4.2.2 Pre-Submission Checklist (Mandatory)

NEVER submit a job without probing resources first.

# 1. Check cluster status and available nodes
sinfo -p standard -t idle -o "%n %c %m" | head -20

# 2. Get max available CPUs on idle nodes
AVAIL_CPUS=$(sinfo -p standard -t idle,mix -h -o "%c" | sort -n | tail -1)
echo "Max CPUs available: $AVAIL_CPUS"

# 3. Check your current job usage
squeue -u $USER

# 4. Recommended submission:
echo "sbatch --cpus-per-task=$AVAIL_CPUS --mem=0 your_script.sh"

4.2.3 Standard SLURM Header Template

All batch scripts should use this pattern:

#!/bin/bash
#SBATCH --job-name=<descriptive_name>
#SBATCH --account={{HPC_ACCOUNT}}       # REQUIRED: Lab account
#SBATCH --partition=standard            # Or free, gpu, highmem
#SBATCH --nodes=1
#SBATCH --cpus-per-task=48              # Use probed max
#SBATCH --mem=0                         # ALL memory on node
#SBATCH --time=4:00:00                  # Estimate conservatively
#SBATCH --output=logs/%x_%j.out
#SBATCH --error=logs/%x_%j.err

# Use ALL allocated CPUs in your code
uv run python script.py --n-jobs $SLURM_CPUS_PER_TASK

4.2.4 Parallelization Rules

4.2.4.1 ALWAYS Use Job Arrays When:

  • Processing multiple subjects independently
  • Running the same pipeline across different inputs
  • Each job doesn’t depend on others’ outputs

There are NO drawbacks to parallelizing independent subject processing.

4.2.4.2 DON’T Parallelize When:

  • Group-level analyses (needs all subjects first)
  • Sequential pipeline phases (Phase N needs Phase N-1 output)
  • Jobs would exceed memory limits together
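For the sequential cases, SLURM job dependencies make the ordering explicit without babysitting the queue. A minimal sketch (script names are hypothetical):

# Submit the per-subject array and capture its job ID
ARRAY_ID=$(sbatch --parsable subject_level.sh)

# Group analysis starts only after every array task exits successfully
sbatch --dependency=afterok:${ARRAY_ID} group_level.sh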

4.2.5 Job Array Pattern (Required for Multi-Subject Processing)

#!/bin/bash
#SBATCH --job-name=analysis_pipeline
#SBATCH --account={{HPC_ACCOUNT}}
#SBATCH --partition=standard
#SBATCH --array=0-11                    # One job per subject (0-indexed)
#SBATCH --cpus-per-task=48              # MAX available
#SBATCH --mem=0                         # ALL memory
#SBATCH --time=4:00:00
#SBATCH --output=logs/%x_%A_%a.out      # %A=array job ID, %a=task ID
#SBATCH --error=logs/%x_%A_%a.err

# Subject list
SUBJECTS=(sub-01 sub-02 sub-03 sub-04 sub-05 sub-06
          sub-07 sub-08 sub-09 sub-10 sub-11 sub-12)
SUBJECT=${SUBJECTS[$SLURM_ARRAY_TASK_ID]}

echo "Processing $SUBJECT with $SLURM_CPUS_PER_TASK CPUs"

# Pass CPU count to your script
uv run python analyses/fmri/run_analysis.py $SUBJECT \
    --n-jobs $SLURM_CPUS_PER_TASK

4.2.6 Resource Recommendations by Job Type

Job Type                 CPUs  Memory   Time  Array?
fMRIPrep (per subject)   MAX   --mem=0  24h   Yes
GLM fitting              MAX   --mem=0  4h    Yes
Mask generation          8-16  32-64G   2h    Yes
Retinotopy (neuropythy)  8     64G      2h    Yes
Group analysis           8-16  64G      4h    No
QC/Visualization         4-8   16G      1h    Optional
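For example, submitting a mask-generation array for 12 subjects with the values from the row above (script name and account are illustrative):

sbatch --account=meganakp_lab --partition=standard \
       --array=0-11 --cpus-per-task=16 --mem=64G --time=2:00:00 \
       generate_masks.sh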

4.2.7 Dynamic Resource Allocation Script

Use this helper to automatically submit with optimal resources:

#!/bin/bash
# submit_optimal.sh - Submit with maximum available resources
# Usage: ./submit_optimal.sh <script.sh> [extra_sbatch_args]

SCRIPT="$1"
shift
EXTRA_ARGS=("$@")   # keep extra sbatch args in an array to preserve quoting

# Get max available CPUs
AVAIL_CPUS=$(sinfo -p standard -t idle,mix -h -o "%c" 2>/dev/null | sort -n | tail -1)
AVAIL_CPUS=${AVAIL_CPUS:-32}  # Default to 32 if it can't be determined

echo "=== Submitting with optimal resources ==="
echo "Max CPUs detected: $AVAIL_CPUS"
echo "Script: $SCRIPT"

sbatch --cpus-per-task="$AVAIL_CPUS" \
       --mem=0 \
       "${EXTRA_ARGS[@]}" \
       "$SCRIPT"

4.2.8 Common Mistakes and Fixes

Mistake                      Consequence                Fix
Not probing resources        Suboptimal allocation      Always run sinfo first
Sequential subject loops     10x slower completion      Convert to a job array
Hardcoded --cpus-per-task=8  Underutilizing nodes       Use probed max or $SLURM_CPUS_PER_TASK
Hardcoded --mem=64G          Leaving memory unused      Use --mem=0 for all memory
Not passing --n-jobs         Single-threaded execution  Pass $SLURM_CPUS_PER_TASK to scripts
Wrong --account              Jobs pending forever       Always use the lab account

4.2.9 Checking Job Status

# Your running/pending jobs
squeue -u $USER --format="%.10i %.30j %.8T %.10M %.6D %.4C"

# Detailed job info
scontrol show job <job_id>

# Recent job history
sacct -u $USER --starttime=$(date -d '24 hours ago' +%Y-%m-%d) \
      --format=JobID,JobName,State,ExitCode,Elapsed,MaxRSS

4.2.10 Login Node Rules

The login node is for coordination, not computation.

ALLOWED                FORBIDDEN
git, sbatch, squeue    ANY data processing
Light file inspection  Running Python scripts
Job monitoring         Loops over files
module load            Heavy I/O operations

If you’re not sure – use SLURM.
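For anything heavier than a quick peek at a file, grab an interactive shell on a compute node instead (account/partition are the UCI examples):

srun --account=meganakp_lab --partition=standard \
     --cpus-per-task=4 --mem=16G --time=1:00:00 --pty bash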

4.2.11 Environment Variables

Always set these in your SLURM scripts:

# Path configuration (adjust for your project)
export PROJECT_PATHS_FILE="config/paths.toml"

# Use allocated resources
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MKL_NUM_THREADS=$SLURM_CPUS_PER_TASK

4.2.12 fMRIPrep on HPC

4.2.12.2 fMRIPrep 25.2.3 Default Configuration

The HPC script (run_fmriprep_hpc.sh) uses these optimized settings:

#SBATCH --exclusive        # Full node access
#SBATCH --mem=0            # All node memory
#SBATCH --constraint=intel # Best performance

4.2.12.3 Key Output Spaces

Space                      Purpose
T1w                        Native subject space
MNI152NLin2009cAsym:res-2  Standard volumetric space
fsnative                   Native FreeSurfer surface
fsaverage                  Standard surface
fsLR                       HCP-compatible surface (for CIFTI)
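These map onto fMRIPrep’s --output-spaces option. Assuming run_fmriprep_hpc.sh passes the table’s spaces verbatim, the relevant fragment would look like this (the --cifti-output line is inferred from the fsLR/CIFTI row, not confirmed against the script):

--output-spaces T1w MNI152NLin2009cAsym:res-2 fsnative fsaverage fsLR \
--cifti-output 91k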

4.2.12.4 Module Loading

module purge
module load singularity/3.11.3
module use /dfs10/meganakp_lab/sw/neurocommand/local/containers/modules
module load fmriprep/25.2.3

4.2.12.5 Cache Directories (on Lab Storage)

All cache directories use lab storage (/dfs10/meganakp_lab/Projects/...) to avoid quota issues:

  • TemplateFlow templates
  • Nipype cache
  • Python bytecode cache
  • XDG cache

Work directories use node-local NVMe ($TMPDIR) for I/O speed.
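In script form the split looks roughly like this (cache paths are illustrative; run_fmriprep_hpc.sh defines the real ones):

# Persistent caches on lab storage (shared, off the home quota)
export TEMPLATEFLOW_HOME=/dfs10/meganakp_lab/Projects/.cache/templateflow
export XDG_CACHE_HOME=/dfs10/meganakp_lab/Projects/.cache/xdg
export PYTHONPYCACHEPREFIX=/dfs10/meganakp_lab/Projects/.cache/pycache

# Scratch work directory on node-local NVMe (fast, wiped when the job ends)
WORK_DIR="$TMPDIR/fmriprep_work"
mkdir -p "$WORK_DIR"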

4.2.13 Singularity/Apptainer Container Cache

4.2.13.1 The Problem

Singularity defaults to ~/.singularity/cache/, consuming home directory quota.

4.2.13.2 Lab Configuration (UCI HPC3)

# Add to ~/.bashrc (one-time setup)
source /dfs10/meganakp_lab/sw/setup-lab-tools.sh

This sets:

Variable              Value                                      Purpose
SINGULARITY_CACHEDIR  /dfs10/meganakp_lab/sw/.singularity_cache  Shared cache for container layers
PATH                  Adds lab tools directory                   Access to git-annex, datalad

4.2.13.3 BeeGFS Limitation (Critical)

Do NOT set SINGULARITY_TMPDIR to /dfs10/. BeeGFS doesn’t support Singularity’s unprivileged symlink operations during container builds.

  • Cache directory (SINGULARITY_CACHEDIR) – OK on DFS (stores downloaded blobs)
  • Temp directory (SINGULARITY_TMPDIR) – MUST be local filesystem (/tmp or SLURM’s $TMPDIR)

SLURM jobs automatically get a node-local $TMPDIR, so no additional configuration is needed for batch jobs.

4.2.13.4 Pre-Built Lab Containers (UCI)

Container        Location
MRIQC 24.0.2     /dfs10/meganakp_lab/sw/containers/mriqc-24.0.2.sif
XCP-D 0.10.0     /dfs10/meganakp_lab/sw/containers/xcp_d-0.10.0.sif
fMRIPrep 25.2.3  Via module: module load fmriprep/25.2.3

4.2.13.5 Migrating from Home Directory Cache

If you previously used the default ~/.singularity/cache/, follow these steps:

1. Update shell configuration:

Add to your ~/.bashrc:

# Lab tools and Singularity cache configuration
source /dfs10/meganakp_lab/sw/setup-lab-tools.sh

Then reload:

source ~/.bashrc

2. Verify configuration:

# Check that SINGULARITY_CACHEDIR is set
echo "SINGULARITY_CACHEDIR: $SINGULARITY_CACHEDIR"
# Expected: /dfs10/meganakp_lab/sw/.singularity_cache

# SINGULARITY_TMPDIR should be unset (uses local /tmp)
echo "SINGULARITY_TMPDIR: ${SINGULARITY_TMPDIR:-<unset - correct>}"

# Verify cache is using new location
singularity cache list

3. Clean up old cache (optional):

Once verified, reclaim home directory quota:

# Check old cache size
du -sh ~/.singularity/cache/

# Remove old cache
rm -rf ~/.singularity/cache/

4. Test container operations:

# Test pulling a container (should use new cache location)
singularity pull docker://hello-world

# Verify it's in the lab cache
ls -la /dfs10/meganakp_lab/sw/.singularity_cache/

4.2.13.6 Checking Group Ownership

Lab storage should have meganakp_hpc group for correct quota billing:

# Check current ownership
ls -la /dfs10/meganakp_lab/sw/containers/

# Should show: meganakp_hpc group, not your personal username
# drwxrwsr-x 2 user meganakp_hpc 4096 Feb  4 12:00 containers

If you need to create a new directory with correct group:

mkdir /dfs10/meganakp_lab/sw/new_directory
chgrp meganakp_hpc /dfs10/meganakp_lab/sw/new_directory
chmod g+s /dfs10/meganakp_lab/sw/new_directory  # Inherit group for new files

4.2.13.7 Container Cache Troubleshooting

Issue                                   Cause                              Fix
“disk quota exceeded” during pull       Cache in home directory            Set SINGULARITY_CACHEDIR to lab storage
Symlink errors during build             SINGULARITY_TMPDIR on BeeGFS       Unset SINGULARITY_TMPDIR or set it to /tmp
“disk quota exceeded” on lab storage    Wrong group ownership              Create with chgrp meganakp_hpc
Old cache consuming quota               Legacy ~/.singularity/cache/       Run rm -rf ~/.singularity/cache/
Container not found after pull          Wrong cache directory              Verify $SINGULARITY_CACHEDIR is set correctly
“operation not permitted” during build  SINGULARITY_TMPDIR set to /dfs10/  Run unset SINGULARITY_TMPDIR

4.3 Site-Specific Reference

4.3.1 UCI HPC3

Setting     Value
Login       ssh hpc3.rcic.uci.edu
Storage     /dfs10/<lab>/Projects/
Home quota  50 GB
Containers  NeuroCommand modules (module use ...)
Runtime     Singularity (module load singularity/3.11.3)
Partition   standard (14 days), free (3 days, preemptible)
Account     <lab>_lab (e.g., meganakp_lab)

4.3.2 UCR HPCC

Setting     Value
Login       ssh cluster.hpcc.ucr.edu
Storage     /bigdata/<lab>/shared/
Home quota  20 GB
Containers  Direct Singularity pull
Runtime     module load singularity
Partition   epyc (AMD, 168h), intel (168h), highmem (48h)
Account     <lab>

4.3.3 NEU Explorer

Setting      Value
Login        ssh login.explorer.northeastern.edu
Storage      /projects/<group>/ (35 TB per PI, request via RC portal)
Scratch      /scratch/<user>/ (purged monthly)
Home quota   100 GB
Containers   Apptainer system-wide (no module load needed)
Runtime      apptainer (auto-detected)
Partition    short (48h, 1024 cores), express (60m), long (5d, needs approval)
Account      <project_name>
Pre-built    /shared/container_repository/explorer/
Bind mounts  -B "/projects:/projects,/scratch:/scratch" (automatic in our scripts)

4.4 Troubleshooting

4.4.1 SLURM Job Fails Immediately (Exits Within Seconds)

cat logs/fmriprep/fmriprep_JOBID.err

Common causes:

  • logs/fmriprep/ directory doesn’t exist – mkdir -p logs/fmriprep logs/mriqc
  • Wrong SLURM account – check sacctmgr show associations user=$USER
  • Container not found – check CONTAINER_PATH in site.conf

4.4.2 “module: command not found”

Your HPC doesn’t use environment modules, or you need to source the module system first. Check your HPC docs, or just use CONTAINER_PATH directly (Option B in Section 4.1).

4.4.3 “BIDSConflictingValuesError” from fMRIPrep

Your events.json sidecar has keys like "session" or "run" that conflict with BIDS entity names. Remove them from the JSON sidecar:

python3 -c "
import json, glob
for f in glob.glob('rawdata/**/func/*_events.json', recursive=True):
    with open(f) as fh: d = json.load(fh)
    changed = False
    for k in ['session', 'run', 'participant']:
        if k in d: d.pop(k); changed = True
    if changed:
        with open(f, 'w') as fh: json.dump(d, fh, indent=2)
        print(f'  Fixed {f}')
"

4.4.4 Path Resolution Fails

uv run python -c "from libs.paths import get_paths; p = get_paths(); print(p)"

If this errors, check config/paths.toml syntax and that directories exist.
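To isolate a pure TOML syntax error, parse the file with Python 3.11’s stdlib tomllib:

uv run python -c "import tomllib; tomllib.load(open('config/paths.toml', 'rb')); print('TOML OK')"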

4.4.5 Container Out of Memory (OOM)

Reduce fMRIPrep parallelism. Edit run_fmriprep_hpc.sh or pass env vars:

NTHREADS=4 MEM_MB=16000 make preprocess BIDS_DIR=...

4.5 Quick Reference Card

# === First-time setup (once per site) ===
make setup                # auto-detect, configure, validate
nano config/paths.toml    # edit storage paths
nano config/site.conf     # edit SLURM/container settings
make preflight            # verify everything works

# === Daily use ===
make help                 # see all commands
make status               # check SLURM jobs
DRY_RUN=1 make preprocess BIDS_DIR=/data  # preview
make preprocess BIDS_DIR=/data SUBJECT=sub-01  # one subject
make all BIDS_DIR=/data MODEL=models/task.smdl.json  # everything

# === SLURM commands ===
sinfo -p standard -t idle -o "%n %c %m"   # check available nodes
squeue -u $USER                           # your running jobs
sacct -j JOBID                            # completed job details
scancel <job_id>                          # cancel a job
scancel -u $USER                          # cancel all your jobs

# === Resource probing ===
AVAIL_CPUS=$(sinfo -p standard -t idle,mix -h -o "%c" | sort -n | tail -1)
echo "Submit with: sbatch --cpus-per-task=$AVAIL_CPUS --mem=0 script.sh"

# === Debugging ===
make preflight            # re-check environment
cat logs/fmriprep/*.err   # read error logs
scontrol show job <job_id>  # detailed job info

4.6 New Site Validation (Press-Go Checklist)

Run this checklist whenever a new site is onboarded, a new researcher takes their first run, or template changes land that touch _load_site_config.sh, Makefile, or any run_*_hpc.sh / run_*_batch.sh. Total time: under 10 minutes.

4.6.1 Prerequisites (manual confirmation)

4.6.2 Step 1 — Clone + bootstrap

ssh <your-cluster>
cd <your-lab-storage-dir>
git clone https://github.com/CNClaboratory/<your-project>.git
cd <your-project>
make setup

Expected: auto_detect prints your site name, preset is copied, uv sync installs deps, preflight_check.sh --fix runs. Some FAILs are expected (placeholder values).

4.6.3 Step 2 — Fill in placeholders

$EDITOR config/paths.toml config/site.conf

At minimum, replace <lab>, <user>, and <project> in paths.toml and set SLURM_ACCOUNT in site.conf. The first researcher on the cluster should also pull containers:

bash scripts/setup/pull_containers.sh --dest "$CONTAINER_ROOT"

4.6.4 Step 3 — Run the automated smoke test

bash scripts/setup/press_go_smoke_test.sh --subject sub-01 --verbose

The 7 checks:

#  Check                  Verifies
1  Auto-detection         Hostname recognized by auto_detect.sh
2  Preset directory       config/presets/<site>/ exists
3  Site config loaded     _load_site_config.sh sources cleanly
4  Container resolution   find_container finds tools
5  make setup idempotent  Re-running doesn’t clobber edited configs
6  Dry-run sbatch shape   Correct --account and --constraint
7  preflight_check.sh     All 10 environment checks pass

Green light (7/7) = proceed to real subjects. Red light = see failure table below.

4.6.5 Common failure modes

Failure                             Cause                          Fix
[3/7] SLURM vars empty              SLURM_ACCOUNT="" in site.conf  Set it to a real allocation
[4/7] container not found           No module or CONTAINER_ROOT    Run scripts/setup/pull_containers.sh
[6/7] --constraint= with empty var  Bug in batch launcher          Open an issue tagged press-go
[7/7] placeholder failures          Unreplaced <lab>/<user>        Edit paths.toml
[7/7] FreeSurfer license            Missing license file           Place at config/licenses/fs_license.txt

4.6.6 Step 4 — Real dry-run preview

DRY_RUN=1 make preprocess BATCH_LABEL=my-study SUBJECT=sub-01

Inspect the sbatch command — flags should match your site.conf.

4.6.7 Step 5 — Sign off


Consolidated from NEW_SITE_SETUP.md, HPC_BEST_PRACTICES.md, SINGULARITY_CACHE_MIGRATION.md, and press_go_validation.md. Last updated: 2026-04-13