4  HPC Guide for Reproducible-fMRI

Complete reference for deploying and running the Reproducible-fMRI pipeline on SLURM-based HPC clusters. Covers everything from zero (account requests, SSH keys, container setup) through daily operations and resource optimization.

Time to deploy: ~2 hours (mostly waiting for account approvals and container pulls).

Already have HPC access? Skip to Install the Pipeline.


4.1 Part 1: Setting Up a New Site

4.1.1 1.1 Request an HPC Account

Site          How to request                                                   Login node
UCI HPC3      https://rcic.uci.edu/account – PI must sponsor                   hpc3.rcic.uci.edu
UCR HPCC      https://hpcc.ucr.edu – PI must sponsor                           cluster.hpcc.ucr.edu
NEU Explorer  https://rc.northeastern.edu/getting-access – PI or self-request  login.explorer.northeastern.edu
Other SLURM   Contact your site’s research computing                           Ask your admin

4.1.2 1.2 Set Up SSH

Generate a key (skip if you already have one at ~/.ssh/id_ed25519):

# On your LOCAL machine (laptop/desktop), not the HPC
ssh-keygen -t ed25519 -C "your.email@university.edu"
# Press Enter for default path, set a passphrase (recommended)

Copy the key to the HPC:

# Replace <user> and <login-node> with your values
ssh-copy-id <user>@<login-node>
# Enter your HPC password when prompted

Create an SSH config for convenience (optional but recommended):

cat >> ~/.ssh/config << 'EOF'
# --- Lab HPC ---
Host hpc
    HostName hpc3.rcic.uci.edu
    User YOUR_USERNAME
    IdentityFile ~/.ssh/id_ed25519

Host hpc-ucr
    HostName cluster.hpcc.ucr.edu
    User YOUR_USERNAME
    IdentityFile ~/.ssh/id_ed25519

Host hpc-neu
    HostName login.explorer.northeastern.edu
    User YOUR_USERNAME
    IdentityFile ~/.ssh/id_ed25519
EOF

Then connect with just ssh hpc (or ssh hpc-neu, etc.).

4.1.3 1.3 Find Your SLURM Account and Partitions

Once logged into the HPC, run these to find the values you’ll need for config:

# Your SLURM account(s) -- note the non-default one (usually your PI's lab)
sacctmgr show associations user=$USER format=account%30

# Available partitions -- note the default (marked with *) and your main one
sinfo -s

# Example output:
# PARTITION    AVAIL  TIMELIMIT   NODES(A/I/O/T)
# standard*      up 14-00:00:0     117/59/5/181    <- UCI default
# free            up 3-00:00:00     141/66/5/212

How to read this:

  • account: your PI’s SLURM allocation name (e.g., meganakp_lab at UCI)
  • PARTITION: the queue name you’ll use (e.g., standard, epyc, short)
  • TIMELIMIT: max wall time per job (fMRIPrep needs ~6-12 hours)
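These two values go into every job you submit. A quick way to confirm the pair is valid (names shown are the UCI examples from above; substitute your own):

srun --account=meganakp_lab --partition=standard --time=00:05:00 hostname
# Prints a compute-node hostname if the account/partition combination works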

4.1.4 1.4 Request Lab Storage (If Needed)

Site          Storage path        How to request                    Default quota
UCI HPC3      /dfs10/<lab>/       Included with account             Shared lab allocation
UCR HPCC      /bigdata/<lab>/     Included with account             Shared lab allocation
NEU Explorer  /projects/<group>/  ServiceNow request via RC portal  35 TB per PI

Do NOT store data in your home directory – quotas are too small (50-100 GB) for neuroimaging data. Use shared lab storage.
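Before copying anything, compare home usage against the lab allocation. A minimal check (the /dfs10 path is the UCI example; use your site’s storage path from the table above):

du -sh "$HOME"              # how much of the small home quota is already used
df -h /dfs10/meganakp_lab   # free space on shared lab storage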

4.1.5 2.1 Clone the Repository

SSH into the HPC and clone into your per-user repos dir on lab storage:

ssh hpc  # or ssh hpc-neu, etc.

# Lab storage convention: <lab-root>/<user>/repos/<repo>
mkdir -p /dfs10/meganakp_lab/$USER/repos
cd       /dfs10/meganakp_lab/$USER/repos

# Clone the code repo (per-user clone — each researcher has their own)
git clone git@github.com:CNClaboratory/<your-project>.git
cd <your-project>

Shared rawdata and derivatives live separately under /dfs10/meganakp_lab/Projects/<project>/<dataset>/ — one BIDS tree per (project, dataset) pair. That directory is created by make setup + paths.local.toml in the next step; don’t clone anything into it.
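Putting the two conventions together, lab storage ends up looking roughly like the sketch below (names reuse this guide’s UCI examples; the rawdata/ and derivatives/ subdirectories are an assumption inferred from the commands later in this guide):

/dfs10/meganakp_lab/                       # lab root
├── eolsson1/
│   └── repos/
│       └── Reproducible-fMRI/             # per-user code clone
└── Projects/
    └── lc-study/
        └── main-cohort/                   # shared BIDS tree for one dataset
            ├── rawdata/                   # (assumed) BIDS raw data
            └── derivatives/               # (assumed) fmriprep, freesurfer, ...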

If your project has a separate data repo (like vividness), it is cloned via git-annex into the Projects/<project>/<dataset>/ tree, not next to your code clone:

# Example:
cd /dfs10/meganakp_lab/Projects/<your-project>
datalad clone https://github.com/CNClaboratory/<your-project>-data.git <dataset>

4.1.6 2.2 Install uv (Python Package Manager)

curl -LsSf https://astral.sh/uv/install.sh | sh
source ~/.bashrc   # or ~/.bash_profile on some systems
uv --version       # verify: should print uv 0.x.x

Important: Always use uv sync and uv add, never uv pip install. The latter bypasses the lockfile and creates irreproducible environments.
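Concretely (nibabel is just an illustrative package name):

# Right: records the dependency in pyproject.toml and uv.lock,
# so the next `uv sync` reproduces the exact same environment
uv add nibabel

# Wrong: installs into .venv without updating the lockfile,
# so your environment silently drifts from what's committed
uv pip install nibabel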

4.1.7 2.3 Install Python Dependencies

cd /dfs10/meganakp_lab/$USER/repos/<your-project>
uv sync

This creates a .venv/ and installs all locked dependencies. Takes ~30 seconds.

4.1.9 3.2 Manual Setup (If Auto-Detect Fails)

make setup normally auto-detects your site and copies the matching preset. If detection fails, copy the preset closest to your site yourself:

# Pick one:
cp config/presets/uci/* config/    # UCI HPC3
cp config/presets/ucr/* config/    # UCR HPCC
cp config/presets/neu/* config/    # NEU Explorer
cp config/presets/local/* config/  # Laptop/Docker

4.1.10 3.3 Edit paths.toml

nano config/paths.toml

Replace ALL placeholders (<lab>, <user>, <repo>, <project>, <dataset>, <group>). The canonical lab storage convention is:

  • codebase = <lab-root>/<user>/repos/<repo> — per-user clone
  • dataset = <lab-root>/Projects/<project>/<dataset> — shared BIDS tree
For example:

[paths.roots]
# CHANGE THESE to your actual paths:
codebase = "/dfs10/meganakp_lab/eolsson1/repos/Reproducible-fMRI"
dataset  = "/dfs10/meganakp_lab/Projects/lc-study/main-cohort"

Each project can hold multiple datasets (e.g. pilot, main-cohort, retest) — add one row in paths.local.toml per dataset.
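As a minimal sketch of pointing your checkout at a different dataset, reuse the paths.local.toml override mechanism from Section 3.5 (the pilot path is hypothetical; the exact multi-dataset schema is project-specific):

[paths.roots]
dataset = "/dfs10/meganakp_lab/Projects/lc-study/pilot"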

4.1.11 3.4 Edit site.conf

nano config/site.conf

Fill in from what you learned in Step 1.3:

SLURM_ACCOUNT="meganakp_lab"     # from: sacctmgr show associations
SLURM_PARTITION="standard"       # from: sinfo -s (the partition you want)

4.1.12 3.5 Multi-Machine Overrides (Optional)

If paths.toml is checked into git (shared with your team), create config/paths.local.toml for your personal machine-specific overrides:

cat > config/paths.local.toml << 'EOF'
# Only the keys you specify are overridden
[paths.roots]
dataset = "/different/path/on/my/machine"
EOF

4.1.13 4.1 Set Up Containers

Neuroimaging tools run inside containers (Singularity/Apptainer) for reproducibility. There are three ways to set them up, depending on your site:

4.1.13.1 Option A: NeuroCommand Modules (UCI HPC3 Only)

UCI has pre-built modules. No container pull needed.

# Add to ~/.bashrc (one-time):
source /dfs10/meganakp_lab/sw/setup-lab-tools.sh

# Verify:
module use /dfs10/meganakp_lab/sw/neurocommand/local/containers/modules
module avail fmriprep

In config/site.conf:

MODULE_USE_PATH="/dfs10/meganakp_lab/sw/neurocommand/local/containers/modules"
FMRIPREP_MODULE="fmriprep/25.2.3"
MRIQC_MODULE="mriqc/24.0.2"
XCPD_MODULE="xcpd/0.10.0"

4.1.13.2 Option B: Direct Container Pull (UCR, NEU, Any Other Site)

Pull container images to shared lab storage:

# Load the container runtime if needed (UCR: module load singularity)
# NEU Explorer: apptainer is system-wide, no module needed

# Pull all pipeline containers (~15-30 min, ~20 GB total)
bash scripts/setup/pull_containers.sh \
    --dest /path/to/lab/containers

# Or pull specific tools only
bash scripts/setup/pull_containers.sh \
    --dest /path/to/lab/containers \
    --tools fmriprep,mriqc

In config/site.conf:

# Point to the container directory
CONTAINER_DIR="/path/to/lab/containers"
CONTAINER_PATH="/path/to/lab/containers/fmriprep-25.2.3.sif"
# Leave MODULE_USE_PATH="" and FMRIPREP_MODULE="" empty

4.1.13.3 Option C: Docker (Local Laptop/Desktop Only)

For local development/testing, Docker works too. The run_fmriprep_local.sh script handles this automatically.

4.1.13.4 Verify Container Access

# The runtime is auto-detected (apptainer > singularity > module)
# Just check one works:
singularity --version 2>/dev/null || apptainer --version 2>/dev/null || echo "MISSING"

4.1.14 5.1 FreeSurfer License

fMRIPrep requires a (free) FreeSurfer license.

  1. Register at https://surfer.nmr.mgh.harvard.edu/registration.html (takes 2 minutes)
  2. Receive license.txt by email
  3. Place it where the pipeline can find it:
# Option A: In the repo config (recommended)
mkdir -p config/licenses
cp ~/Downloads/license.txt config/licenses/fs_license.txt

# Option B: In your home directory
mkdir -p ~/.freesurfer
cp ~/Downloads/license.txt ~/.freesurfer/license.txt

Both locations are auto-detected by the preflight check and HPC scripts.
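To confirm the license landed where the tooling looks for it:

ls -l config/licenses/fs_license.txt ~/.freesurfer/license.txt 2>/dev/null
# At least one of the two paths should be listed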

4.1.15 6.1 Validate Everything

make preflight

Expected output:

 Python 3.11 .............. PASS
 config/paths.toml ........ PASS
 Path resolution .......... PASS
 Key directories .......... PASS
 FreeSurfer license ....... PASS
 SLURM available .......... PASS
 Singularity/Apptainer .... PASS
 Container images ......... PASS
 BIDS structure ........... SKIP (no --bids-dir)
 Disk space ............... PASS

 9 passed, 0 failed, 1 skipped

Fix any FAIL items before proceeding. Common fixes:

Failure                       Fix
paths.toml not found          Copy and edit a preset (Section 3.1-3.3)
Path resolution failed        Check [paths.roots] values exist on disk
FreeSurfer license not found  Register and place license.txt (Section 5.1)
Singularity not available     module load singularity or module load apptainer
Containers not found          Pull containers or set CONTAINER_PATH (Section 4.1)
Disk space low                Move data to lab storage, clean scratch

4.1.16 7.1 Run the Pipeline

4.1.16.1 Preview First (Dry Run)

DRY_RUN=1 make preprocess BIDS_DIR=/path/to/rawdata SUBJECT=sub-01

Check:

  • Correct BIDS directory?
  • Correct output directory?
  • Correct SLURM account?
  • Correct container?

4.1.16.2 Process One Subject

make preprocess BIDS_DIR=/path/to/rawdata SUBJECT=sub-01

Monitor:

make status                           # SLURM queue
tail -f logs/fmriprep/fmriprep_*.out  # live output
sacct -j JOBID --format=State,Elapsed # when done

4.1.16.3 Verify Outputs

After the job completes (~6-12 hours for first subject with FreeSurfer):

# Check HTML report
ls derivatives/fmriprep/sub-01.html

# Check preprocessed BOLD
ls derivatives/fmriprep/sub-01/func/

# Check FreeSurfer reconstruction
ls derivatives/freesurfer/sub-01/

4.1.16.4 Batch Process All Subjects

# Run QC first
make qc BIDS_DIR=/path/to/rawdata

# Then preprocess all subjects
make preprocess BIDS_DIR=/path/to/rawdata

# Then post-processing
make denoise BIDS_DIR=/path/to/rawdata

# Then statistics (if you have a model)
make glm BIDS_DIR=/path/to/rawdata MODEL=models/task.smdl.json

Or run everything:

make all BIDS_DIR=/path/to/rawdata MODEL=models/task.smdl.json

4.2 Part 2: SLURM Best Practices

4.2.1 Philosophy: Maximum Resources for Maximum Speed

Our goal is fastest completion, not resource conservation.

  • Use ALL available CPUs – Query idle nodes, request the maximum
  • Use ALL memory (--mem=0) – Gets all memory on the allocated node
  • Parallelize ALWAYS – Use job arrays for independent subject processing
  • One subject per job – Array jobs complete faster than sequential processing

4.2.2 Pre-Submission Checklist (Mandatory)

NEVER submit a job without probing resources first.

# 1. Check cluster status and available nodes
sinfo -p standard -t idle -o "%n %c %m" | head -20

# 2. Get max available CPUs on idle nodes
AVAIL_CPUS=$(sinfo -p standard -t idle,mix -h -o "%c" | sort -n | tail -1)
echo "Max CPUs available: $AVAIL_CPUS"

# 3. Check your current job usage
squeue -u $USER

# 4. Recommended submission:
echo "sbatch --cpus-per-task=$AVAIL_CPUS --mem=0 your_script.sh"

4.2.3 Standard SLURM Header Template

All batch scripts should use this pattern:

#!/bin/bash
#SBATCH --job-name=<descriptive_name>
#SBATCH --account={{HPC_ACCOUNT}}       # REQUIRED: Lab account
#SBATCH --partition=standard            # Or free, gpu, highmem
#SBATCH --nodes=1
#SBATCH --cpus-per-task=48              # Use probed max
#SBATCH --mem=0                         # ALL memory on node
#SBATCH --time=4:00:00                  # Estimate conservatively
#SBATCH --output=logs/%x_%j.out
#SBATCH --error=logs/%x_%j.err

# Use ALL allocated CPUs in your code
uv run python script.py --n-jobs $SLURM_CPUS_PER_TASK

4.2.4 Parallelization Rules

4.2.4.1 ALWAYS Use Job Arrays When:

  • Processing multiple subjects independently
  • Running the same pipeline across different inputs
  • Each job doesn’t depend on others’ outputs

There are NO drawbacks to parallelizing independent subject processing.

4.2.4.2 DON’T Parallelize When:

  • Group-level analyses (needs all subjects first)
  • Sequential pipeline phases (Phase N needs Phase N-1 output)
  • Jobs would exceed memory limits together
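For the sequential cases, SLURM job dependencies make the ordering explicit without babysitting the queue. A minimal sketch (script names are hypothetical):

# Submit the per-subject array and capture its job ID
ARRAY_ID=$(sbatch --parsable subject_level.sh)

# Group analysis starts only after every array task exits successfully
sbatch --dependency=afterok:${ARRAY_ID} group_level.sh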

4.2.5 Job Array Pattern (Required for Multi-Subject Processing)

#!/bin/bash
#SBATCH --job-name=analysis_pipeline
#SBATCH --account={{HPC_ACCOUNT}}
#SBATCH --partition=standard
#SBATCH --array=0-11                    # One job per subject (0-indexed)
#SBATCH --cpus-per-task=48              # MAX available
#SBATCH --mem=0                         # ALL memory
#SBATCH --time=4:00:00
#SBATCH --output=logs/%x_%A_%a.out      # %A=array job ID, %a=task ID
#SBATCH --error=logs/%x_%A_%a.err

# Subject list
SUBJECTS=(sub-01 sub-02 sub-03 sub-04 sub-05 sub-06
          sub-07 sub-08 sub-09 sub-10 sub-11 sub-12)
SUBJECT=${SUBJECTS[$SLURM_ARRAY_TASK_ID]}

echo "Processing $SUBJECT with $SLURM_CPUS_PER_TASK CPUs"

# Pass CPU count to your script
uv run python analyses/fmri/run_analysis.py $SUBJECT \
    --n-jobs $SLURM_CPUS_PER_TASK

4.2.6 Resource Recommendations by Job Type

Job Type                 CPUs  Memory   Time  Array?
fMRIPrep (per subject)   MAX   --mem=0  24h   Yes
GLM fitting              MAX   --mem=0  4h    Yes
Mask generation          8-16  32-64G   2h    Yes
Retinotopy (neuropythy)  8     64G      2h    Yes
Group analysis           8-16  64G      4h    No
QC/Visualization         4-8   16G      1h    Optional
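For example, submitting a mask-generation array for 12 subjects with the values from the row above (script name and account are illustrative):

sbatch --account=meganakp_lab --partition=standard \
       --array=0-11 --cpus-per-task=16 --mem=64G --time=2:00:00 \
       generate_masks.sh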

4.2.7 Dynamic Resource Allocation Script

Use this helper to automatically submit with optimal resources:

#!/bin/bash
# submit_optimal.sh - Submit with maximum available resources
# Usage: ./submit_optimal.sh <script.sh> [extra_sbatch_args]

SCRIPT="$1"
shift
EXTRA_ARGS=("$@")   # keep extra sbatch args in an array to preserve quoting

# Get max available CPUs
AVAIL_CPUS=$(sinfo -p standard -t idle,mix -h -o "%c" 2>/dev/null | sort -n | tail -1)
AVAIL_CPUS=${AVAIL_CPUS:-32}  # Default to 32 if it can't be determined

echo "=== Submitting with optimal resources ==="
echo "Max CPUs detected: $AVAIL_CPUS"
echo "Script: $SCRIPT"

sbatch --cpus-per-task="$AVAIL_CPUS" \
       --mem=0 \
       "${EXTRA_ARGS[@]}" \
       "$SCRIPT"

4.2.8 Common Mistakes and Fixes

Mistake                      Consequence                Fix
Not probing resources        Suboptimal allocation      Always run sinfo first
Sequential subject loops     10x slower completion      Convert to a job array
Hardcoded --cpus-per-task=8  Underutilizing nodes       Use probed max or $SLURM_CPUS_PER_TASK
Hardcoded --mem=64G          Leaving memory unused      Use --mem=0 for all memory
Not passing --n-jobs         Single-threaded execution  Pass $SLURM_CPUS_PER_TASK to scripts
Wrong --account              Jobs pending forever       Always use the lab account

4.2.9 Checking Job Status

# Your running/pending jobs
squeue -u $USER --format="%.10i %.30j %.8T %.10M %.6D %.4C"

# Detailed job info
scontrol show job <job_id>

# Recent job history
sacct -u $USER --starttime=$(date -d '24 hours ago' +%Y-%m-%d) \
      --format=JobID,JobName,State,ExitCode,Elapsed,MaxRSS

4.2.10 Login Node Rules

The login node is for coordination, not computation.

ALLOWED                FORBIDDEN
git, sbatch, squeue    ANY data processing
Light file inspection  Running Python scripts
Job monitoring         Loops over files
module load            Heavy I/O operations

If you’re not sure – use SLURM.
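For anything heavier than a quick peek at a file, grab an interactive shell on a compute node instead (account/partition are the UCI examples):

srun --account=meganakp_lab --partition=standard \
     --cpus-per-task=4 --mem=16G --time=1:00:00 --pty bash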

4.2.11 Environment Variables

Always set these in your SLURM scripts:

# Path configuration (adjust for your project)
export PROJECT_PATHS_FILE="config/paths.toml"

# Use allocated resources
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MKL_NUM_THREADS=$SLURM_CPUS_PER_TASK

4.2.12 fMRIPrep on HPC

4.2.12.2 fMRIPrep 25.2.3 Default Configuration

The HPC script (run_fmriprep_hpc.sh) uses these optimized settings:

#SBATCH --exclusive        # Full node access
#SBATCH --mem=0            # All node memory
#SBATCH --constraint=intel # Best performance

4.2.12.3 Key Output Spaces

Space                      Purpose
T1w                        Native subject space
MNI152NLin2009cAsym:res-2  Standard volumetric space
fsnative                   Native FreeSurfer surface
fsaverage                  Standard surface
fsLR                       HCP-compatible surface (for CIFTI)
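These map onto fMRIPrep’s --output-spaces option. Assuming run_fmriprep_hpc.sh passes the table’s spaces verbatim, the relevant fragment would look like this (the --cifti-output line is inferred from the fsLR/CIFTI row, not confirmed against the script):

--output-spaces T1w MNI152NLin2009cAsym:res-2 fsnative fsaverage fsLR \
--cifti-output 91k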

4.2.12.4 Module Loading

module purge
module load singularity/3.11.3
module use /dfs10/meganakp_lab/sw/neurocommand/local/containers/modules
module load fmriprep/25.2.3

4.2.12.5 Cache Directories (on Lab Storage)

All cache directories use lab storage (/dfs10/meganakp_lab/Projects/...) to avoid quota issues:

  • TemplateFlow templates
  • Nipype cache
  • Python bytecode cache
  • XDG cache

Work directories use node-local NVMe ($TMPDIR) for I/O speed.
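In script form the split looks roughly like this (cache paths are illustrative; run_fmriprep_hpc.sh defines the real ones):

# Persistent caches on lab storage (shared, off the home quota)
export TEMPLATEFLOW_HOME=/dfs10/meganakp_lab/Projects/.cache/templateflow
export XDG_CACHE_HOME=/dfs10/meganakp_lab/Projects/.cache/xdg
export PYTHONPYCACHEPREFIX=/dfs10/meganakp_lab/Projects/.cache/pycache

# Scratch work directory on node-local NVMe (fast, wiped when the job ends)
WORK_DIR="$TMPDIR/fmriprep_work"
mkdir -p "$WORK_DIR"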

4.2.13 Singularity/Apptainer Container Cache

4.2.13.1 The Problem

Singularity defaults to ~/.singularity/cache/, consuming home directory quota.

4.2.13.2 Lab Configuration (UCI HPC3)

# Add to ~/.bashrc (one-time setup)
source /dfs10/meganakp_lab/sw/setup-lab-tools.sh

This sets:

Variable              Value                                      Purpose
SINGULARITY_CACHEDIR  /dfs10/meganakp_lab/sw/.singularity_cache  Shared cache for container layers
PATH                  Adds lab tools directory                   Access to git-annex, datalad

4.2.13.3 BeeGFS Limitation (Critical)

Do NOT set SINGULARITY_TMPDIR to /dfs10/. BeeGFS doesn’t support Singularity’s unprivileged symlink operations during container builds.

  • Cache directory (SINGULARITY_CACHEDIR) – OK on DFS (stores downloaded blobs)
  • Temp directory (SINGULARITY_TMPDIR) – MUST be local filesystem (/tmp or SLURM’s $TMPDIR)

SLURM jobs automatically get a node-local $TMPDIR, so no additional configuration is needed for batch jobs.

4.2.13.4 Pre-Built Lab Containers (UCI)

Container        Location
MRIQC 24.0.2     /dfs10/meganakp_lab/sw/containers/mriqc-24.0.2.sif
XCP-D 0.10.0     /dfs10/meganakp_lab/sw/containers/xcp_d-0.10.0.sif
fMRIPrep 25.2.3  Via module: module load fmriprep/25.2.3

4.2.13.5 Migrating from Home Directory Cache

If you previously used the default ~/.singularity/cache/, follow these steps:

1. Update shell configuration:

Add to your ~/.bashrc:

# Lab tools and Singularity cache configuration
source /dfs10/meganakp_lab/sw/setup-lab-tools.sh

Then reload:

source ~/.bashrc

2. Verify configuration:

# Check that SINGULARITY_CACHEDIR is set
echo "SINGULARITY_CACHEDIR: $SINGULARITY_CACHEDIR"
# Expected: /dfs10/meganakp_lab/sw/.singularity_cache

# SINGULARITY_TMPDIR should be unset (uses local /tmp)
echo "SINGULARITY_TMPDIR: ${SINGULARITY_TMPDIR:-<unset - correct>}"

# Verify cache is using new location
singularity cache list

3. Clean up old cache (optional):

Once verified, reclaim home directory quota:

# Check old cache size
du -sh ~/.singularity/cache/

# Remove old cache
rm -rf ~/.singularity/cache/

4. Test container operations:

# Test pulling a container (should use new cache location)
singularity pull docker://hello-world

# Verify it's in the lab cache
ls -la /dfs10/meganakp_lab/sw/.singularity_cache/

4.2.13.6 Checking Group Ownership

Lab storage should have meganakp_hpc group for correct quota billing:

# Check current ownership
ls -la /dfs10/meganakp_lab/sw/containers/

# Should show: meganakp_hpc group, not your personal username
# drwxrwsr-x 2 user meganakp_hpc 4096 Feb  4 12:00 containers

If you need to create a new directory with correct group:

mkdir /dfs10/meganakp_lab/sw/new_directory
chgrp meganakp_hpc /dfs10/meganakp_lab/sw/new_directory
chmod g+s /dfs10/meganakp_lab/sw/new_directory  # Inherit group for new files

4.2.13.7 Container Cache Troubleshooting

Issue                                   Cause                              Fix
“disk quota exceeded” during pull       Cache in home directory            Set SINGULARITY_CACHEDIR to lab storage
Symlink errors during build             SINGULARITY_TMPDIR on BeeGFS       Unset SINGULARITY_TMPDIR or set it to /tmp
“disk quota exceeded” on lab storage    Wrong group ownership              Create with chgrp meganakp_hpc
Old cache consuming quota               Legacy ~/.singularity/cache/       Run rm -rf ~/.singularity/cache/
Container not found after pull          Wrong cache directory              Verify $SINGULARITY_CACHEDIR is set correctly
“operation not permitted” during build  SINGULARITY_TMPDIR set to /dfs10/  Run unset SINGULARITY_TMPDIR

4.3 Site-Specific Reference

4.3.1 UCI HPC3

Setting     Value
Login       ssh hpc3.rcic.uci.edu
Storage     /dfs10/<lab>/Projects/
Home quota  50 GB
Containers  NeuroCommand modules (module use ...)
Runtime     Singularity (module load singularity/3.11.3)
Partition   standard (14 days), free (3 days, preemptible)
Account     <lab>_lab (e.g., meganakp_lab)

4.3.2 UCR HPCC

Setting     Value
Login       ssh cluster.hpcc.ucr.edu
Storage     /bigdata/<lab>/shared/
Home quota  20 GB
Containers  Direct Singularity pull
Runtime     module load singularity
Partition   epyc (AMD, 168h), intel (168h), highmem (48h)
Account     <lab>

4.3.3 NEU Explorer

Setting      Value
Login        ssh login.explorer.northeastern.edu
Storage      /projects/<group>/ (35 TB per PI, request via RC portal)
Scratch      /scratch/<user>/ (purged monthly)
Home quota   100 GB
Containers   Apptainer system-wide (no module load needed)
Runtime      apptainer (auto-detected)
Partition    short (48h, 1024 cores), express (60m), long (5d, needs approval)
Account      <project_name>
Pre-built    /shared/container_repository/explorer/
Bind mounts  -B "/projects:/projects,/scratch:/scratch" (automatic in our scripts)

4.4 Troubleshooting

4.4.1 SLURM Job Fails Immediately (Exits Within Seconds)

cat logs/fmriprep/fmriprep_JOBID.err

Common causes:

  • logs/fmriprep/ directory doesn’t exist – mkdir -p logs/fmriprep logs/mriqc
  • Wrong SLURM account – check sacctmgr show associations user=$USER
  • Container not found – check CONTAINER_PATH in site.conf

4.4.2 “module: command not found”

Your HPC doesn’t use environment modules, or you need to source the module system first. Check your HPC docs, or just use CONTAINER_PATH directly (Option B in Section 4.1).

4.4.3 “BIDSConflictingValuesError” from fMRIPrep

Your events.json sidecar has keys like "session" or "run" that conflict with BIDS entity names. Remove them from the JSON sidecar:

python3 -c "
import json, glob
for f in glob.glob('rawdata/**/func/*_events.json', recursive=True):
    with open(f) as fh: d = json.load(fh)
    changed = False
    for k in ['session', 'run', 'participant']:
        if k in d: d.pop(k); changed = True
    if changed:
        with open(f, 'w') as fh: json.dump(d, fh, indent=2)
        print(f'  Fixed {f}')
"

4.4.4 Path Resolution Fails

uv run python -c "from libs.paths import get_paths; p = get_paths(); print(p)"

If this errors, check config/paths.toml syntax and that directories exist.
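To isolate a pure TOML syntax error, parse the file with Python 3.11’s stdlib tomllib:

uv run python -c "import tomllib; tomllib.load(open('config/paths.toml', 'rb')); print('TOML OK')"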

4.4.5 Container Out of Memory (OOM)

Reduce fMRIPrep parallelism. Edit run_fmriprep_hpc.sh or pass env vars:

NTHREADS=4 MEM_MB=16000 make preprocess BIDS_DIR=...

4.5 Quick Reference Card

# === First-time setup (once per site) ===
make setup                # auto-detect, configure, validate
nano config/paths.toml    # edit storage paths
nano config/site.conf     # edit SLURM/container settings
make preflight            # verify everything works

# === Daily use ===
make help                 # see all commands
make status               # check SLURM jobs
DRY_RUN=1 make preprocess BIDS_DIR=/data  # preview
make preprocess BIDS_DIR=/data SUBJECT=sub-01  # one subject
make all BIDS_DIR=/data MODEL=models/task.smdl.json  # everything

# === SLURM commands ===
sinfo -p standard -t idle -o "%n %c %m"   # check available nodes
squeue -u $USER                           # your running jobs
sacct -j JOBID                            # completed job details
scancel <job_id>                          # cancel a job
scancel -u $USER                          # cancel all your jobs

# === Resource probing ===
AVAIL_CPUS=$(sinfo -p standard -t idle,mix -h -o "%c" | sort -n | tail -1)
echo "Submit with: sbatch --cpus-per-task=$AVAIL_CPUS --mem=0 script.sh"

# === Debugging ===
make preflight            # re-check environment
cat logs/fmriprep/*.err   # read error logs
scontrol show job <job_id>  # detailed job info

4.6 New Site Validation (Press-Go Checklist)

Run this checklist whenever a new site is onboarded, a new researcher takes their first run, or template changes land that touch _load_site_config.sh, Makefile, or any run_*_hpc.sh / run_*_batch.sh. Total time: under 10 minutes.

4.6.1 Prerequisites (manual confirmation)

4.6.2 Step 1 — Clone + bootstrap

ssh <your-cluster>
cd <your-lab-storage-dir>
git clone https://github.com/CNClaboratory/<your-project>.git
cd <your-project>
make setup

Expected: auto_detect prints your site name, preset is copied, uv sync installs deps, preflight_check.sh --fix runs. Some FAILs are expected (placeholder values).

4.6.3 Step 2 — Fill in placeholders

$EDITOR config/paths.toml config/site.conf

At minimum, replace <lab>, <user>, and <project> in paths.toml and set SLURM_ACCOUNT in site.conf. The first researcher on the cluster should also pull containers:

bash scripts/setup/pull_containers.sh --dest "$CONTAINER_ROOT"

4.6.4 Step 3 — Run the automated smoke test

bash scripts/setup/press_go_smoke_test.sh --subject sub-01 --verbose

The 7 checks:

#  Check                  Verifies
1  Auto-detection         Hostname recognized by auto_detect.sh
2  Preset directory       config/presets/<site>/ exists
3  Site config loaded     _load_site_config.sh sources cleanly
4  Container resolution   find_container finds tools
5  make setup idempotent  Re-running doesn’t clobber edited configs
6  Dry-run sbatch shape   Correct --account and --constraint
7  preflight_check.sh     All 10 environment checks pass

Green light (7/7) = proceed to real subjects. Red light = see failure table below.

4.6.5 Common failure modes

Failure                             Cause                          Fix
[3/7] SLURM vars empty              SLURM_ACCOUNT="" in site.conf  Set it to a real allocation
[4/7] container not found           No module or CONTAINER_ROOT    Run scripts/setup/pull_containers.sh
[6/7] --constraint= with empty var  Bug in batch launcher          Open an issue tagged press-go
[7/7] placeholder failures          Unreplaced <lab>/<user>        Edit paths.toml
[7/7] FreeSurfer license            Missing license file           Place at config/licenses/fs_license.txt

4.6.6 Step 4 — Real dry-run preview

DRY_RUN=1 make preprocess BATCH_LABEL=my-study SUBJECT=sub-01

Inspect the sbatch command — flags should match your site.conf.

4.6.7 Step 5 — Sign off


Consolidated from NEW_SITE_SETUP.md, HPC_BEST_PRACTICES.md, SINGULARITY_CACHE_MIGRATION.md, and press_go_validation.md. Last updated: 2026-04-13