4 HPC Guide for Reproducible-fMRI
Complete reference for deploying and running the Reproducible-fMRI pipeline on SLURM-based HPC clusters. Covers everything from zero (account requests, SSH keys, container setup) through daily operations and resource optimization.
Time to deploy: ~2 hours (mostly waiting for account approvals and container pulls).
Already have HPC access? Skip to Install the Pipeline.
4.1 Part 1: Setting Up a New Site
4.1.1 1.1 Request an HPC Account
| Site | How to request | Login node |
|---|---|---|
| UCI HPC3 | https://rcic.uci.edu/account – PI must sponsor | hpc3.rcic.uci.edu |
| UCR HPCC | https://hpcc.ucr.edu – PI must sponsor | cluster.hpcc.ucr.edu |
| NEU Explorer | https://rc.northeastern.edu/getting-access – PI or self-request | login.explorer.northeastern.edu |
| Other SLURM | Contact your site’s research computing | Ask your admin |
4.1.2 1.2 Set Up SSH
Generate a key (skip if you already have one at ~/.ssh/id_ed25519):
# On your LOCAL machine (laptop/desktop), not the HPC
ssh-keygen -t ed25519 -C "your.email@university.edu"
# Press Enter for default path, set a passphrase (recommended)
Copy the key to the HPC:
# Replace <user> and <login-node> with your values
ssh-copy-id <user>@<login-node>
# Enter your HPC password when prompted
Create an SSH config for convenience (optional but recommended):
cat >> ~/.ssh/config << 'EOF'
# --- Lab HPC ---
Host hpc
HostName hpc3.rcic.uci.edu
User YOUR_USERNAME
IdentityFile ~/.ssh/id_ed25519
Host hpc-ucr
HostName cluster.hpcc.ucr.edu
User YOUR_USERNAME
IdentityFile ~/.ssh/id_ed25519
Host hpc-neu
HostName login.explorer.northeastern.edu
User YOUR_USERNAME
IdentityFile ~/.ssh/id_ed25519
EOF
Then connect with just ssh hpc (or ssh hpc-neu, etc.).
4.1.3 1.3 Find Your SLURM Account and Partitions
Once logged into the HPC, run these to find the values you’ll need for config:
# Your SLURM account(s) -- note the non-default one (usually your PI's lab)
sacctmgr show associations user=$USER format=account%30
# Available partitions -- note the default (marked with *) and your main one
sinfo -s
# Example output:
# PARTITION AVAIL TIMELIMIT NODES(A/I/O/T)
# standard* up 14-00:00:0 117/59/5/181 <- UCI default
# free up 3-00:00:00 141/66/5/212
How to read this:
- account: Your PI’s SLURM allocation name (e.g., meganakp_lab at UCI)
- PARTITION: The queue name you’ll use (e.g., standard, epyc, short)
- TIMELIMIT: Max wall time per job (fMRIPrep needs ~6-12 hours)
4.1.4 1.4 Request Lab Storage (If Needed)
| Site | Storage path | How to request | Default quota |
|---|---|---|---|
| UCI HPC3 | /dfs10/<lab>/ | Included with account | Shared lab allocation |
| UCR HPCC | /bigdata/<lab>/ | Included with account | Shared lab allocation |
| NEU Explorer | /projects/<group>/ | ServiceNow request via RC portal | 35 TB per PI |
Do NOT store data in your home directory – quotas are too small (50-100 GB) for neuroimaging data. Use shared lab storage.
4.1.5 2.1 Clone the Repository
SSH into the HPC and clone into your per-user repos dir on lab storage:
ssh hpc # or ssh hpc-neu, etc.
# Lab storage convention: <lab-root>/<user>/repos/<repo>
mkdir -p /dfs10/meganakp_lab/$USER/repos
cd /dfs10/meganakp_lab/$USER/repos
# Clone the code repo (per-user clone — each researcher has their own)
git clone git@github.com:CNClaboratory/<your-project>.git
cd <your-project>
Shared rawdata and derivatives live separately under /dfs10/meganakp_lab/Projects/<project>/<dataset>/ — one BIDS tree per (project, dataset) pair. That directory is created by make setup + paths.local.toml in the next step; don’t clone anything into it.
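For orientation, the resulting layout looks roughly like this (illustrative sketch; substitute your own placeholders):

/dfs10/meganakp_lab/
├── <user>/repos/<your-project>/     <- per-user code clone (this step)
└── Projects/
    └── <project>/
        └── <dataset>/               <- shared BIDS tree (rawdata + derivatives)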
If your project has a separate data repo (like vividness), it is cloned via git-annex into the Projects/<project>/<dataset>/ tree, not next to your code clone:
# Example:
cd /dfs10/meganakp_lab/Projects/<your-project>
datalad clone https://github.com/CNClaboratory/<your-project>-data.git <dataset>
4.1.6 2.2 Install uv (Python Package Manager)
curl -LsSf https://astral.sh/uv/install.sh | sh
source ~/.bashrc # or ~/.bash_profile on some systems
uv --version # verify: should print uv 0.x.x
Important: Always use uv sync and uv add, never uv pip install. The latter bypasses the lockfile and creates irreproducible environments.
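For example, adding a new dependency the reproducible way (nibabel here is just an example package):

# Records the dependency in pyproject.toml and updates the lockfile
uv add nibabel
# Installs exactly what the lockfile specifies
uv sync
# uv pip install nibabel   <- WRONG: bypasses the lockfile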
4.1.7 2.3 Install Python Dependencies
cd /dfs10/meganakp_lab/$USER/repos/<your-project>
uv sync
This creates a .venv/ and installs all locked dependencies. Takes ~30 seconds.
4.1.8 3.1 Automatic Setup (Recommended)
make setup
This will:
1. Auto-detect your site from the hostname (UCI, UCR, NEU, or unknown)
2. Copy the matching preset (config/presets/<site>/paths.toml + site.conf)
3. Install Python dependencies
4. Run preflight validation
If auto-detection works, you only need to edit the placeholder values in the copied config files.
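A quick way to spot any placeholders you still need to edit:

# List unreplaced template placeholders in the copied configs
grep -rnE '<(lab|user|repo|project|dataset|group)>' config/paths.toml config/site.conf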
4.1.9 3.2 Manual Setup (If Auto-Detect Fails)
Copy the preset closest to your site:
# Pick one:
cp config/presets/uci/* config/ # UCI HPC3
cp config/presets/ucr/* config/ # UCR HPCC
cp config/presets/neu/* config/ # NEU Explorer
cp config/presets/local/* config/ # Laptop/Docker
4.1.10 3.3 Edit paths.toml
nano config/paths.toml
Replace ALL placeholders (<lab>, <user>, <repo>, <project>, <dataset>, <group>). The canonical lab storage convention is:
- codebase = <lab-root>/<user>/repos/<repo> — per-user clone
- dataset = <lab-root>/Projects/<project>/<dataset> — shared BIDS tree
[paths.roots]
# CHANGE THESE to your actual paths:
codebase = "/dfs10/meganakp_lab/eolsson1/repos/Reproducible-fMRI"
dataset = "/dfs10/meganakp_lab/Projects/lc-study/main-cohort"Each project can hold multiple datasets (e.g. pilot, main-cohort, retest) — add one row in paths.local.toml per dataset.
4.1.11 3.4 Edit site.conf
nano config/site.conf
Fill in from what you learned in Step 1.3:
SLURM_ACCOUNT="meganakp_lab" # from: sacctmgr show associations
SLURM_PARTITION="standard" # from: sinfo -s (the partition you want)4.1.12 3.5 Multi-Machine Overrides (Optional)
If paths.toml is checked into git (shared with your team), create config/paths.local.toml for your personal machine-specific overrides:
cat > config/paths.local.toml << 'EOF'
# Only the keys you specify are overridden
[paths.roots]
dataset = "/different/path/on/my/machine"
EOF
4.1.13 4.1 Set Up Containers
Neuroimaging tools run inside containers (Singularity/Apptainer) for reproducibility. There are three ways to set them up, depending on your site:
4.1.13.1 Option A: NeuroCommand Modules (UCI HPC3 Only)
UCI has pre-built modules. No container pull needed.
# Add to ~/.bashrc (one-time):
source /dfs10/meganakp_lab/sw/setup-lab-tools.sh
# Verify:
module use /dfs10/meganakp_lab/sw/neurocommand/local/containers/modules
module avail fmriprep
In config/site.conf:
MODULE_USE_PATH="/dfs10/meganakp_lab/sw/neurocommand/local/containers/modules"
FMRIPREP_MODULE="fmriprep/25.2.3"
MRIQC_MODULE="mriqc/24.0.2"
XCPD_MODULE="xcpd/0.10.0"
4.1.13.2 Option B: Direct Container Pull (UCR, NEU, Any Other Site)
Pull container images to shared lab storage:
# Load the container runtime if needed (UCR: module load singularity)
# NEU Explorer: apptainer is system-wide, no module needed
# Pull all pipeline containers (~15-30 min, ~20 GB total)
bash scripts/setup/pull_containers.sh \
--dest /path/to/lab/containers
# Or pull specific tools only
bash scripts/setup/pull_containers.sh \
--dest /path/to/lab/containers \
--tools fmriprep,mriqc
In config/site.conf:
# Point to the container directory
CONTAINER_DIR="/path/to/lab/containers"
CONTAINER_PATH="/path/to/lab/containers/fmriprep-25.0.0.sif"
# Leave MODULE_USE_PATH="" and FMRIPREP_MODULE="" empty
4.1.13.3 Option C: Docker (Local Laptop/Desktop Only)
For local development/testing, Docker works too. The run_fmriprep_local.sh script handles this automatically.
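Under the hood this is a standard BIDS-Apps invocation; a minimal sketch if you want to run it by hand (image tag and host paths are illustrative):

# Minimal manual fMRIPrep-in-Docker run -- adjust paths and version tag
docker run --rm \
  -v /path/to/rawdata:/data:ro \
  -v /path/to/derivatives:/out \
  -v "$PWD/config/licenses/fs_license.txt":/opt/freesurfer/license.txt:ro \
  nipreps/fmriprep:25.2.3 \
  /data /out participant --participant-label 01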
4.1.13.4 Verify Container Access
# The runtime is auto-detected (apptainer > singularity > module)
# Just check one works:
singularity --version 2>/dev/null || apptainer --version 2>/dev/null || echo "MISSING"
4.1.14 5.1 FreeSurfer License
fMRIPrep requires a (free) FreeSurfer license.
- Register at https://surfer.nmr.mgh.harvard.edu/registration.html (takes 2 minutes)
- Receive license.txt by email
- Place it where the pipeline can find it:
# Option A: In the repo config (recommended)
mkdir -p config/licenses
cp ~/Downloads/license.txt config/licenses/fs_license.txt
# Option B: In your home directory
mkdir -p ~/.freesurfer
cp ~/Downloads/license.txt ~/.freesurfer/license.txt
Both locations are auto-detected by the preflight check and HPC scripts.
4.1.15 6.1 Validate Everything
make preflight
Expected output:
Python 3.11 .............. PASS
config/paths.toml ........ PASS
Path resolution .......... PASS
Key directories .......... PASS
FreeSurfer license ....... PASS
SLURM available .......... PASS
Singularity/Apptainer ... PASS
Container images ......... PASS
BIDS structure ........... SKIP (no --bids-dir)
Disk space ............... PASS
8 passed, 0 failed, 1 skipped
Fix any FAIL items before proceeding. Common fixes:
| Failure | Fix |
|---|---|
| paths.toml not found | Copy and edit a preset (Section 3.1-3.3) |
| Path resolution failed | Check that [paths.roots] values exist on disk |
| FreeSurfer license not found | Register and place license.txt (Section 5.1) |
| Singularity not available | module load singularity or module load apptainer |
| Containers not found | Pull containers or set CONTAINER_PATH (Section 4.1) |
| Disk space low | Move data to lab storage, clean scratch |
4.1.16 7.1 Run the Pipeline
4.1.16.1 Preview First (Dry Run)
DRY_RUN=1 make preprocess BIDS_DIR=/path/to/rawdata SUBJECT=sub-01
Check:
- Correct BIDS directory?
- Correct output directory?
- Correct SLURM account?
- Correct container?
4.1.16.2 Process One Subject
make preprocess BIDS_DIR=/path/to/rawdata SUBJECT=sub-01
Monitor:
make status # SLURM queue
tail -f logs/fmriprep/fmriprep_*.out # live output
sacct -j JOBID --format=State,Elapsed # when done
4.1.16.3 Verify Outputs
After the job completes (~6-12 hours for first subject with FreeSurfer):
# Check HTML report
ls derivatives/fmriprep/sub-01.html
# Check preprocessed BOLD
ls derivatives/fmriprep/sub-01/func/
# Check FreeSurfer reconstruction
ls derivatives/freesurfer/sub-01/
4.1.16.4 Batch Process All Subjects
# Run QC first
make qc BIDS_DIR=/path/to/rawdata
# Then preprocess all subjects
make preprocess BIDS_DIR=/path/to/rawdata
# Then post-processing
make denoise BIDS_DIR=/path/to/rawdata
# Then statistics (if you have a model)
make glm BIDS_DIR=/path/to/rawdata MODEL=models/task.smdl.json
Or run everything:
make all BIDS_DIR=/path/to/rawdata MODEL=models/task.smdl.json
4.2 Part 2: SLURM Best Practices
4.2.1 Philosophy: Maximum Resources for Maximum Speed
Our goal is fastest completion, not resource conservation.
- Use ALL available CPUs – Query idle nodes, request the maximum
- Use ALL memory (--mem=0) – Gets all memory on the allocated node
- Parallelize ALWAYS – Use job arrays for independent subject processing
- One subject per job – Array jobs complete faster than sequential processing
4.2.2 Pre-Submission Checklist (Mandatory)
NEVER submit a job without probing resources first.
# 1. Check cluster status and available nodes
sinfo -p standard -t idle -o "%n %c %m" | head -20
# 2. Get max available CPUs on idle nodes
AVAIL_CPUS=$(sinfo -p standard -t idle,mix -h -o "%c" | sort -n | tail -1)
echo "Max CPUs available: $AVAIL_CPUS"
# 3. Check your current job usage
squeue -u $USER
# 4. Recommended submission:
echo "sbatch --cpus-per-task=$AVAIL_CPUS --mem=0 your_script.sh"4.2.3 Standard SLURM Header Template
All batch scripts should use this pattern:
#!/bin/bash
#SBATCH --job-name=<descriptive_name>
#SBATCH --account={{HPC_ACCOUNT}} # REQUIRED: Lab account
#SBATCH --partition=standard # Or free, gpu, highmem
#SBATCH --nodes=1
#SBATCH --cpus-per-task=48 # Use probed max
#SBATCH --mem=0 # ALL memory on node
#SBATCH --time=4:00:00 # Estimate conservatively
#SBATCH --output=logs/%x_%j.out
#SBATCH --error=logs/%x_%j.err
# Use ALL allocated CPUs in your code
uv run python script.py --n-jobs $SLURM_CPUS_PER_TASK
4.2.4 Parallelization Rules
4.2.4.1 ALWAYS Use Job Arrays When:
- Processing multiple subjects independently
- Running the same pipeline across different inputs
- Each job doesn’t depend on others’ outputs
There are NO drawbacks to parallelizing independent subject processing.
4.2.4.2 DON’T Parallelize When:
- Group-level analyses (needs all subjects first)
- Sequential pipeline phases (Phase N needs Phase N-1 output; chain these with job dependencies instead, as sketched after this list)
- Jobs would exceed memory limits together
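When phases must run in order, submit them together and let SLURM enforce the ordering (script names are illustrative):

# Phase 1: per-subject preprocessing as an array
jid=$(sbatch --parsable preprocess_array.sh)
# Phase 2: group analysis starts only after every array task succeeds
sbatch --dependency=afterok:${jid} group_analysis.sh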
4.2.5 Job Array Pattern (Required for Multi-Subject Processing)
#!/bin/bash
#SBATCH --job-name=analysis_pipeline
#SBATCH --account={{HPC_ACCOUNT}}
#SBATCH --partition=standard
#SBATCH --array=0-11 # One job per subject (0-indexed)
#SBATCH --cpus-per-task=48 # MAX available
#SBATCH --mem=0 # ALL memory
#SBATCH --time=4:00:00
#SBATCH --output=logs/%x_%A_%a.out # %A=array job ID, %a=task ID
#SBATCH --error=logs/%x_%A_%a.err
# Subject list
SUBJECTS=(sub-01 sub-02 sub-03 sub-04 sub-05 sub-06
sub-07 sub-08 sub-09 sub-10 sub-11 sub-12)
SUBJECT=${SUBJECTS[$SLURM_ARRAY_TASK_ID]}
echo "Processing $SUBJECT with $SLURM_CPUS_PER_TASK CPUs"
# Pass CPU count to your script
uv run python analyses/fmri/run_analysis.py $SUBJECT \
--n-jobs $SLURM_CPUS_PER_TASK
4.2.6 Resource Recommendations by Job Type
| Job Type | CPUs | Memory | Time | Array? |
|---|---|---|---|---|
| fMRIPrep (per subject) | MAX | --mem=0 | 24h | Yes |
| GLM fitting | MAX | --mem=0 | 4h | Yes |
| Mask generation | 8-16 | 32-64G | 2h | Yes |
| Retinotopy (neuropythy) | 8 | 64G | 2h | Yes |
| Group analysis | 8-16 | 64G | 4h | No |
| QC/Visualization | 4-8 | 16G | 1h | Optional |
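Reading a row of this table into a submission, e.g. mask generation for 12 subjects (script name illustrative):

# 16 CPUs / 64G / 2h per task, one task per subject
sbatch --account={{HPC_ACCOUNT}} --partition=standard \
       --cpus-per-task=16 --mem=64G --time=2:00:00 \
       --array=0-11 generate_masks.sh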
4.2.7 Dynamic Resource Allocation Script
Use this helper to automatically submit with optimal resources:
#!/bin/bash
# submit_optimal.sh - Submit with maximum available resources
# Usage: ./submit_optimal.sh <script.sh> [extra_sbatch_args]
SCRIPT=$1
shift
EXTRA_ARGS="$@"
# Get max available CPUs
AVAIL_CPUS=$(sinfo -p standard -t idle,mix -h -o "%c" 2>/dev/null | sort -n | tail -1)
AVAIL_CPUS=${AVAIL_CPUS:-32} # Default to 32 if can't determine
echo "=== Submitting with optimal resources ==="
echo "Max CPUs detected: $AVAIL_CPUS"
echo "Script: $SCRIPT"
sbatch --cpus-per-task=$AVAIL_CPUS \
--mem=0 \
$EXTRA_ARGS \
"$SCRIPT"4.2.8 Common Mistakes and Fixes
| Mistake | Consequence | Fix |
|---|---|---|
| Not probing resources | Suboptimal allocation | Always run sinfo first |
| Sequential subject loops | 10x slower completion | Convert to job array |
| Hardcoded --cpus-per-task=8 | Underutilizing nodes | Use probed max or $SLURM_CPUS_PER_TASK |
| Hardcoded --mem=64G | Leaving memory unused | Use --mem=0 for all memory |
| Not passing --n-jobs | Single-threaded execution | Pass $SLURM_CPUS_PER_TASK to scripts |
| Wrong --account | Jobs pending forever | Always use the lab account |
4.2.9 Checking Job Status
# Your running/pending jobs
squeue -u $USER --format="%.10i %.30j %.8T %.10M %.6D %.4C"
# Detailed job info
scontrol show job <job_id>
# Recent job history
sacct -u $USER --starttime=$(date -d '24 hours ago' +%Y-%m-%d) \
--format=JobID,JobName,State,ExitCode,Elapsed,MaxRSS
4.2.10 Login Node Rules
The login node is for coordination, not computation.
| ALLOWED | FORBIDDEN |
|---|---|
| git, sbatch, squeue | ANY data processing |
| Light file inspection | Running Python scripts |
| Job monitoring | Loops over files |
| module load | Heavy I/O operations |
If you’re not sure – use SLURM.
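If you do need to poke at data interactively, request a compute-node shell first (account/partition values come from your site.conf):

# Interactive shell on a compute node -- the safe place for ad-hoc work
srun --account={{HPC_ACCOUNT}} --partition=standard \
     --cpus-per-task=4 --mem=16G --time=2:00:00 --pty bash -i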
4.2.11 Environment Variables
Always set these in your SLURM scripts:
# Path configuration (adjust for your project)
export PROJECT_PATHS_FILE="config/paths.toml"
# Use allocated resources
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MKL_NUM_THREADS=$SLURM_CPUS_PER_TASK
4.2.12 fMRIPrep on HPC
4.2.12.1 Batch Submission (Recommended)
Use the batch launcher for all fMRIPrep jobs:
cd preprocessing/fmri/
# Process all subjects (one job per subject, parallel)
./run_fmriprep_batch.sh --batch-label study-20260101
# Single subject
./run_fmriprep_batch.sh --batch-label study-20260101 --subject sub-01
# Specific session
./run_fmriprep_batch.sh --batch-label study-20260101 --session ses-01
# Preview without submitting
./run_fmriprep_batch.sh --batch-label study-20260101 --dry-run
4.2.12.2 fMRIPrep 25.2.3 Default Configuration
The HPC script (run_fmriprep_hpc.sh) uses these optimized settings:
#SBATCH --exclusive # Full node access
#SBATCH --mem=0 # All node memory
#SBATCH --constraint=intel # Best performance
4.2.12.3 Key Output Spaces
| Space | Purpose |
|---|---|
| T1w | Native subject space |
| MNI152NLin2009cAsym:res-2 | Standard volumetric |
| fsnative | Native FreeSurfer surface |
| fsaverage | Standard surface |
| fsLR | HCP-compatible surface (for CIFTI) |
4.2.12.4 Module Loading
module purge
module load singularity/3.11.3
module use /dfs10/meganakp_lab/sw/neurocommand/local/containers/modules
module load fmriprep/25.2.3
4.2.12.5 Cache Directories (on Lab Storage)
All cache directories use lab storage (/dfs10/meganakp_lab/Projects/...) to avoid quota issues:
- TemplateFlow templates
- Nipype cache
- Python bytecode cache
- XDG cache
Work directories use node-local NVMe ($TMPDIR) for I/O speed.
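A sketch of the pattern (the pipeline scripts set these for you; TEMPLATEFLOW_HOME, XDG_CACHE_HOME, and PYTHONPYCACHEPREFIX are the standard environment hooks, and the exact cache paths here are illustrative):

# Persistent caches on lab storage; scratch work on node-local NVMe
export TEMPLATEFLOW_HOME=/dfs10/meganakp_lab/Projects/<project>/cache/templateflow
export XDG_CACHE_HOME=/dfs10/meganakp_lab/Projects/<project>/cache/xdg
export PYTHONPYCACHEPREFIX=/dfs10/meganakp_lab/Projects/<project>/cache/pycache
WORK_DIR="${TMPDIR:-/tmp}/work_${SLURM_JOB_ID}"
mkdir -p "$WORK_DIR"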
4.2.13 Singularity/Apptainer Container Cache
4.2.13.1 The Problem
Singularity defaults to ~/.singularity/cache/, consuming home directory quota.
4.2.13.2 Lab Configuration (UCI HPC3)
# Add to ~/.bashrc (one-time setup)
source /dfs10/meganakp_lab/sw/setup-lab-tools.sh
This sets:
| Variable | Value | Purpose |
|---|---|---|
| SINGULARITY_CACHEDIR | /dfs10/meganakp_lab/sw/.singularity_cache | Shared cache for container layers |
| PATH | Adds lab tools directory | Access to git-annex, datalad |
4.2.13.3 BeeGFS Limitation (Critical)
Do NOT set SINGULARITY_TMPDIR to /dfs10/. BeeGFS doesn’t support Singularity’s unprivileged symlink operations during container builds.
- Cache directory (SINGULARITY_CACHEDIR) – OK on DFS (stores downloaded blobs)
- Temp directory (SINGULARITY_TMPDIR) – MUST be a local filesystem (/tmp or SLURM’s $TMPDIR)
SLURM jobs automatically get a node-local $TMPDIR, so no additional configuration is needed for batch jobs.
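In short, a correct shell setup looks like:

# Cache on lab storage is fine; temp must stay on a local filesystem
export SINGULARITY_CACHEDIR=/dfs10/meganakp_lab/sw/.singularity_cache
unset SINGULARITY_TMPDIR   # falls back to /tmp, or SLURM's node-local $TMPDIR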
4.2.13.4 Pre-Built Lab Containers (UCI)
| Container | Location |
|---|---|
| MRIQC 24.0.2 | /dfs10/meganakp_lab/sw/containers/mriqc-24.0.2.sif |
| XCP-D 0.10.0 | /dfs10/meganakp_lab/sw/containers/xcp_d-0.10.0.sif |
| fMRIPrep 25.2.3 | Via module: module load fmriprep/25.2.3 |
4.2.13.5 Migrating from Home Directory Cache
If you previously used the default ~/.singularity/cache/, follow these steps:
1. Update shell configuration:
Add to your ~/.bashrc:
# Lab tools and Singularity cache configuration
source /dfs10/meganakp_lab/sw/setup-lab-tools.sh
Then reload:
source ~/.bashrc
2. Verify configuration:
# Check that SINGULARITY_CACHEDIR is set
echo "SINGULARITY_CACHEDIR: $SINGULARITY_CACHEDIR"
# Expected: /dfs10/meganakp_lab/sw/.singularity_cache
# SINGULARITY_TMPDIR should be unset (uses local /tmp)
echo "SINGULARITY_TMPDIR: ${SINGULARITY_TMPDIR:-<unset - correct>}"
# Verify cache is using new location
singularity cache list
3. Clean up old cache (optional):
Once verified, reclaim home directory quota:
# Check old cache size
du -sh ~/.singularity/cache/
# Remove old cache
rm -rf ~/.singularity/cache/
4. Test container operations:
# Test pulling a container (should use new cache location)
singularity pull docker://hello-world
# Verify it's in the lab cache
ls -la /dfs10/meganakp_lab/sw/.singularity_cache/
4.2.13.6 Checking Group Ownership
Lab storage should be owned by the meganakp_hpc group for correct quota billing:
# Check current ownership
ls -la /dfs10/meganakp_lab/sw/containers/
# Should show: meganakp_hpc group, not your personal username
# drwxrwsr-x 2 user meganakp_hpc 4096 Feb 4 12:00 containers
If you need to create a new directory with correct group:
mkdir /dfs10/meganakp_lab/sw/new_directory
chgrp meganakp_hpc /dfs10/meganakp_lab/sw/new_directory
chmod g+s /dfs10/meganakp_lab/sw/new_directory # Inherit group for new files4.2.13.7 Container Cache Troubleshooting
| Issue | Cause | Fix |
|---|---|---|
| “disk quota exceeded” during pull | Cache in home directory | Set SINGULARITY_CACHEDIR to lab storage |
| Symlink errors during build | SINGULARITY_TMPDIR on BeeGFS | Unset SINGULARITY_TMPDIR or set it to /tmp |
| “disk quota exceeded” on lab storage | Wrong group ownership | Create with chgrp meganakp_hpc |
| Old cache consuming quota | Legacy ~/.singularity/cache/ | Run rm -rf ~/.singularity/cache/ |
| Container not found after pull | Wrong cache directory | Verify $SINGULARITY_CACHEDIR is set correctly |
| “operation not permitted” during build | SINGULARITY_TMPDIR set to /dfs10/ | unset SINGULARITY_TMPDIR |
4.3 Site-Specific Reference
4.3.1 UCI HPC3
| Setting | Value |
|---|---|
| Login | ssh hpc3.rcic.uci.edu |
| Storage | /dfs10/<lab>/Projects/ |
| Home quota | 50 GB |
| Containers | NeuroCommand modules (module use ...) |
| Runtime | Singularity (module load singularity/3.11.3) |
| Partition | standard (14 days), free (3 days, preemptible) |
| Account | <lab>_lab (e.g., meganakp_lab) |
4.3.2 UCR HPCC
| Setting | Value |
|---|---|
| Login | ssh cluster.hpcc.ucr.edu |
| Storage | /bigdata/<lab>/shared/ |
| Home quota | 20 GB |
| Containers | Direct Singularity pull |
| Runtime | module load singularity |
| Partition | epyc (AMD, 168h), intel (168h), highmem (48h) |
| Account | <lab> |
4.3.3 NEU Explorer
| Setting | Value |
|---|---|
| Login | ssh login.explorer.northeastern.edu |
| Storage | /projects/<group>/ (35 TB per PI, request via RC portal) |
| Scratch | /scratch/<user>/ (purged monthly) |
| Home quota | 100 GB |
| Containers | Apptainer system-wide (no module load needed) |
| Runtime | apptainer (auto-detected) |
| Partition | short (48h, 1024 cores), express (60m), long (5d, needs approval) |
| Account | <project_name> |
| Pre-built | /shared/container_repository/explorer/ |
| Bind mounts | -B "/projects:/projects,/scratch:/scratch" (automatic in our scripts) |
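What that flag looks like in a raw invocation, e.g. for a quick manual smoke test (image path is a placeholder):

# Manual container test with the Explorer bind mounts
apptainer exec -B "/projects:/projects,/scratch:/scratch" \
    /path/to/containers/fmriprep.sif fmriprep --version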
4.4 Troubleshooting
4.4.1 SLURM Job Fails Immediately (Exit Code in Seconds)
cat logs/fmriprep/fmriprep_JOBID.err
Common causes:
- logs/fmriprep/ directory doesn’t exist – mkdir -p logs/fmriprep logs/mriqc
- Wrong SLURM account – check sacctmgr show associations user=$USER
- Container not found – check CONTAINER_PATH in site.conf
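A quick triage pass over all three:

# 1. Ensure log directories exist
mkdir -p logs/fmriprep logs/mriqc
# 2. Confirm which accounts you can actually submit under
sacctmgr show associations user=$USER format=account%30
# 3. Check what site.conf points the pipeline at
grep -E 'CONTAINER_PATH|SLURM_ACCOUNT' config/site.conf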
4.4.2 “module: command not found”
Your HPC doesn’t use environment modules, or you need to source the module system first. Check your HPC docs, or just use CONTAINER_PATH directly (Option B in Section 4.1).
4.4.3 “BIDSConflictingValuesError” from fMRIPrep
Your events.json sidecar has keys like "session" or "run" that conflict with BIDS entity names. Remove them from the JSON sidecar:
python3 -c "
import json, glob
for f in glob.glob('rawdata/**/func/*_events.json', recursive=True):
    with open(f) as fh: d = json.load(fh)
    changed = False
    for k in ['session', 'run', 'participant']:
        if k in d: d.pop(k); changed = True
    if changed:
        with open(f, 'w') as fh: json.dump(d, fh, indent=2)
        print(f'  Fixed {f}')
"
uv run python -c "from libs.paths import get_paths; p = get_paths(); print(p)"
If this errors, check config/paths.toml syntax and that directories exist.
4.4.5 Container Out of Memory (OOM)
Reduce fMRIPrep parallelism. Edit run_fmriprep_hpc.sh or pass env vars:
NTHREADS=4 MEM_MB=16000 make preprocess BIDS_DIR=...
4.5 Quick Reference Card
# === First-time setup (once per site) ===
make setup # auto-detect, configure, validate
nano config/paths.toml # edit storage paths
nano config/site.conf # edit SLURM/container settings
make preflight # verify everything works
# === Daily use ===
make help # see all commands
make status # check SLURM jobs
DRY_RUN=1 make preprocess BIDS_DIR=/data # preview
make preprocess BIDS_DIR=/data SUBJECT=sub-01 # one subject
make all BIDS_DIR=/data MODEL=models/task.smdl.json # everything
# === SLURM commands ===
sinfo -p standard -t idle -o "%n %c %m" # check available nodes
squeue -u $USER # your running jobs
sacct -j JOBID # completed job details
scancel <job_id> # cancel a job
scancel -u $USER # cancel all your jobs
# === Resource probing ===
AVAIL_CPUS=$(sinfo -p standard -t idle,mix -h -o "%c" | sort -n | tail -1)
echo "Submit with: sbatch --cpus-per-task=$AVAIL_CPUS --mem=0 script.sh"
# === Debugging ===
make preflight # re-check environment
cat logs/fmriprep/*.err # read error logs
scontrol show job <job_id> # detailed job info
4.6 New Site Validation (Press-Go Checklist)
Run this checklist whenever a new site is onboarded, a new researcher takes their first run, or template changes land that touch _load_site_config.sh, Makefile, or any run_*_hpc.sh / run_*_batch.sh. Total time: under 10 minutes.
4.6.1 Prerequisites (manual confirmation)
4.6.2 Step 1 — Clone + bootstrap
ssh <your-cluster>
cd <your-lab-storage-dir>
git clone https://github.com/CNClaboratory/<your-project>.git
cd <your-project>
make setup
Expected: auto_detect prints your site name, preset is copied, uv sync installs deps, preflight_check.sh --fix runs. Some FAILs are expected (placeholder values).
4.6.3 Step 2 — Fill in placeholders
$EDITOR config/paths.toml config/site.conf
At minimum: replace <lab>, <user>, <project> in paths.toml, and set SLURM_ACCOUNT in site.conf. The first researcher on the cluster should also pull containers:
bash scripts/setup/pull_containers.sh --dest "$CONTAINER_ROOT"4.6.4 Step 3 — Run the automated smoke test
bash scripts/setup/press_go_smoke_test.sh --subject sub-01 --verbose
The 7 checks:
| # | Check | Verifies |
|---|---|---|
| 1 | Auto-detection | Hostname recognized by auto_detect.sh |
| 2 | Preset directory | config/presets/<site>/ exists |
| 3 | Site config loaded | _load_site_config.sh sources cleanly |
| 4 | Container resolution | find_container finds tools |
| 5 | make setup idempotent | Re-running doesn’t clobber edited configs |
| 6 | Dry-run sbatch shape | Correct --account and --constraint |
| 7 | preflight_check.sh | All 10 environment checks pass |
Green light (7/7) = proceed to real subjects. Red light = see failure table below.
4.6.5 Common failure modes
| Failure | Cause | Fix |
|---|---|---|
| [3/7] SLURM vars empty | SLURM_ACCOUNT="" in site.conf | Set it to a real allocation |
| [4/7] container not found | No module or CONTAINER_ROOT | Run scripts/setup/pull_containers.sh |
| [6/7] --constraint= with empty var | Bug in batch launcher | Open an issue tagged press-go |
| [7/7] placeholder failures | Unreplaced <lab>/<user> | Edit paths.toml |
| [7/7] FreeSurfer license | Missing license file | Place it at config/licenses/fs_license.txt |
4.6.6 Step 4 — Real dry-run preview
DRY_RUN=1 make preprocess BATCH_LABEL=my-study SUBJECT=sub-01
Inspect the sbatch command — flags should match your site.conf.
4.6.7 Step 5 — Sign off
Consolidated from NEW_SITE_SETUP.md, HPC_BEST_PRACTICES.md, SINGULARITY_CACHE_MIGRATION.md, and press_go_validation.md. Last updated: 2026-04-13