11 Known Issues
Active bugs, workarounds, and limitations encountered during real deployments across UCI HPC3, UCR HPCC, NEU Explorer, and local workstations. Each entry is written in Symptom → Cause → Fix format so an agent or human hitting the same error can search for it.
For resolved issues that no longer apply, see git history (git log --grep).
11.1 fMRIPrep + Preprocessing
11.1.1 --cifti-output crashes fMRIPrep 25.2.3
- Symptom: fMRIPrep aborts during sub-cortical aparc resampling with KeyError or shape-mismatch traceback when --cifti-output 91k is set.
- Cause: Upstream regression in 25.2.3’s nibabel + grayordinate handling.
- Fix: Default CIFTI_OUTPUT=none in config/site.conf until upgrading to 25.2.4+. Re-enable with CIFTI_OUTPUT=91k make preprocess ... once the module is bumped.
11.1.2 Fieldmap silently skipped (no SDC applied)
- Symptom: fMRIPrep finishes successfully but the report shows “no susceptibility distortion correction” despite fmap/ files existing.
- Cause: Fieldmap JSONs missing B0FieldIdentifier and IntendedFor. fMRIPrep skips SDC silently rather than failing.
- Fix: Add B0FieldIdentifier to your dcm2bids/heudiconv config. Each fieldmap pair gets a unique tag (pepolar_run-01, etc.) referenced by the corresponding BOLD’s B0FieldSource. make preflight now warns on missing fieldmap metadata.
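A minimal sketch of the kind of check described above, assuming a standard BIDS layout with fmap/ directories directly under the subject or session folder. The function name fieldmaps_missing_metadata is hypothetical, and this is not the actual make preflight implementation; it only lists fieldmap sidecars that carry neither B0FieldIdentifier nor IntendedFor.

```python
import json
from pathlib import Path


def fieldmaps_missing_metadata(bids_root):
    """Return fmap JSON sidecars that have neither B0FieldIdentifier nor IntendedFor.

    Hypothetical helper for illustration; the real preflight check may differ.
    """
    missing = []
    # "**" also matches zero directories, so this covers both
    # sub-XX/fmap/ and sub-XX/ses-YY/fmap/ layouts.
    for sidecar in sorted(Path(bids_root).glob("sub-*/**/fmap/*.json")):
        meta = json.loads(sidecar.read_text())
        if "B0FieldIdentifier" not in meta and "IntendedFor" not in meta:
            missing.append(sidecar)
    return missing
```

Running it against the BIDS root before submitting fMRIPrep gives a list of sidecars to fix; an empty list means every fieldmap has at least one of the two keys.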
11.1.3 fMRIPrep rerun is much faster than first run (1h vs 5h)
- Symptom: fMRIPrep rerun completes in ~1 hour even though first run took 5+ hours. User worries something was skipped.
- Cause: FreeSurfer outputs (recon-all, hours) are cached; rerun reuses them.
- Fix: Not a bug. To force re-run: delete derivatives/freesurfer/<sub>/. To verify cache reuse: check timestamps in derivatives/freesurfer/<sub>/scripts/recon-all.log.
11.1.4 fmriprep not found after all resolution strategies
- Symptom: Batch launcher fails with “fmriprep not found” before submitting any subject.
- Cause: Site has neither a NeuroCommand module nor a container in $CONTAINER_ROOT.
- Fix: Set FMRIPREP_MODULE (strategy 1) OR CONTAINER_ROOT (strategy 2) in config/site.conf. Pull containers first: bash scripts/setup/pull_containers.sh --dest "$CONTAINER_ROOT".
11.2 SLURM / HPC
11.2.1 sbatch not found on a laptop
- Symptom: make preprocess fails immediately with “sbatch not found” even though everything else is set up.
- Cause: Preflight detects local mode by absence of SLURM_ACCOUNT AND SLURM_PARTITION. If only one is empty, SLURM checks still run.
- Fix: Leave both SLURM_ACCOUNT and SLURM_PARTITION empty in config/site.conf. Preflight will auto-skip SLURM checks in local mode.
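The AND condition is the easy part to get wrong, so here is a small illustration of the rule as stated above. The function name is_local_mode is hypothetical, and the real preflight logic lives in the pipeline scripts and may treat whitespace or unset values differently.

```python
def is_local_mode(slurm_account, slurm_partition):
    """Local mode only when BOTH settings are empty or unset (illustrative sketch)."""
    account_empty = not (slurm_account or "").strip()
    partition_empty = not (slurm_partition or "").strip()
    return account_empty and partition_empty
```

With only one of the two values cleared, this returns False, which matches the symptom: SLURM checks still run and sbatch is reported missing.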
11.2.2 Invalid feature specification at job submit
- Symptom: sbatch rejects every job with “Invalid feature specification”.
- Cause: SLURM_CONSTRAINT="intel" (or similar feature tag) on a cluster that doesn’t use feature tags (e.g. NEU Explorer).
- Fix: Set SLURM_CONSTRAINT="" in config/site.conf.
11.2.3 BASH_SOURCE resolves to SLURM spool dir
- Symptom: HPC scripts fail with “REPO_ROOT not found” or load wrong config when launched via sbatch.
- Cause: SLURM copies scripts to a spool dir before execution; bash’s BASH_SOURCE[0] then points to the spool, not the original repo.
- Fix: All HPC scripts now have a SLURM_SUBMIT_DIR fallback in their REPO_ROOT resolution. If you’re writing a new HPC script, use the same pattern (see preprocessing/fmri/run_fmriprep_hpc.sh:67-74).
11.2.4 DependencyNeverSatisfied — DAG leaves stuck PENDING
- Symptom: make pipeline-status shows downstream stages stuck PENDING with reason DependencyNeverSatisfied.
- Cause: An upstream stage failed, so its afterok: dependency can never resolve. SLURM holds children indefinitely.
- Fix: Cancel the stuck children with scancel <job_id>, then read the failed parent’s logs (logs/fmriprep/*.err or logs/validate/validate_fmriprep_*.out) for the real cause.
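To find the stuck children without reading squeue output by eye, something like the following may help. It is a sketch, not part of the template: both function names are hypothetical, and it assumes squeue is on PATH. It asks squeue for job ID and pending reason only (format %i %r) and keeps the IDs whose reason is DependencyNeverSatisfied. It deliberately does not call scancel for you.

```python
import subprocess


def never_satisfied_job_ids(squeue_text):
    """Extract job IDs whose reason is DependencyNeverSatisfied.

    Expects one job per line in "<job_id> <reason>" form, as produced by
    squeue -h -o "%i %r".
    """
    ids = []
    for line in squeue_text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[1].strip() == "DependencyNeverSatisfied":
            ids.append(parts[0])
    return ids


def list_stuck_jobs(user):
    """Query squeue for the user's pending jobs and return the stuck IDs."""
    result = subprocess.run(
        ["squeue", "-h", "-u", user, "-t", "PENDING", "-o", "%i %r"],
        check=True,
        capture_output=True,
        text=True,
    )
    return never_satisfied_job_ids(result.stdout)
```

Review the returned IDs, then pass them to scancel yourself once you have confirmed they belong to the failed pipeline.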
11.2.5 XCP-D 0.10+ OOM at 64 GB on full atlas + CIFTI
- Symptom: XCP-D job killed by SLURM with OUT_OF_MEMORY during parcellate_alff, even though earlier runs at 64 GB worked.
- Cause: XCP-D 0.10+ with full atlas parcellation (16+ atlases × N runs) exceeds 64 GB during ALFF parcellation.
- Fix: Template’s run_xcpd_hpc.sh now defaults to --mem=128G. For minimal-atlas runs, override with SBATCH_ARGS=--mem=64G to save allocation.
11.3 Configuration / Paths
11.3.1 paths.toml not found on fresh clone
- Symptom: make preflight errors immediately with “paths.toml not found”.
- Cause: config/paths.toml is git-ignored; fresh clones don’t have one.
- Fix: Run make setup (auto-detects site, copies preset). For unknown hosts it falls back to the local preset.
11.3.2 Placeholders survive into runtime (<lab>, <user>, etc.)
- Symptom: Pipeline submits jobs but they fail with “directory not found: /dfs10/<lab>/<user>/…”.
- Cause: The preset was copied but placeholders never edited.
- Fix: libs/paths.py now raises PathsNotConfiguredError at config load time if any <lab>, <user>, <project>, <group>, or <username> token survives. This catches the issue before any sbatch is run.
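For child repos that have not picked up this check yet, the idea can be sketched as below. This is an illustration under assumptions, not the code in libs/paths.py: the PathsNotConfiguredError class here is a local stand-in for the real one, and find_placeholders and check_configured are hypothetical names. It walks a parsed config dict and reports every key whose string value still contains one of the five tokens.

```python
import re

# The five placeholder tokens named in the entry above.
PLACEHOLDER = re.compile(r"<(?:lab|user|project|group|username)>")


class PathsNotConfiguredError(RuntimeError):
    """Local stand-in for the error of the same name in libs/paths.py."""


def find_placeholders(config, prefix=""):
    """Return dotted keys whose string values still contain a placeholder token."""
    hits = []
    for key, value in config.items():
        dotted = f"{prefix}{key}"
        if isinstance(value, dict):
            hits.extend(find_placeholders(value, prefix=f"{dotted}."))
        elif isinstance(value, str) and PLACEHOLDER.search(value):
            hits.append(dotted)
    return hits


def check_configured(config):
    """Raise at load time if any placeholder survived (hypothetical helper)."""
    hits = find_placeholders(config)
    if hits:
        raise PathsNotConfiguredError("unedited placeholders in: " + ", ".join(hits))
```

Failing at load time with the offending key names is cheaper than discovering the problem from a failed job's log.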
11.3.3 config/paths.local.toml overrides not applying
- Symptom: Edited paths.local.toml but the values aren’t used.
- Cause: Either the file isn’t at config/paths.local.toml (path is exact), or you’re using a child repo where this pattern wasn’t yet wired in.
- Fix: libs.paths._load_toml() deep-merges paths.local.toml over paths.toml if it exists. Verify with: python -c "from libs.paths import get_paths; print(get_paths().dataset_root)".
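"Deep-merge" here means nested tables are merged key by key instead of the local file replacing a whole table. The sketch below shows that behaviour on plain dicts; deep_merge is a hypothetical name and the real _load_toml() may differ in details such as list handling.

```python
def deep_merge(base, override):
    """Return a new dict where override wins, merging nested dicts recursively."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are tables: merge them rather than replacing wholesale.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged
```

So a paths.local.toml that sets only dataset_root keeps every other key from paths.toml intact, which is why a local file needs just the values that differ.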
11.3.4 Child repo’s from libs.paths import get_paths fails on bare HPC login
- Symptom: Bash wrapper script fails with ModuleNotFoundError: tomli when trying to read paths.toml.
- Cause: HPC login nodes ship Python 3.9 (no stdlib tomllib) and the default user environment lacks tomli.
- Fix: Pipeline scripts call run_python_inline() which prefers .venv/bin/python first, then uv run, then system python. Make sure uv sync has run on the login node, or activate the venv before submitting.
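The preference order can be sketched in Python as follows. This is only an illustration of the three-step fallback; run_python_inline() itself is a shell helper, and pick_python_command with its injectable which argument is a hypothetical name used here to make the order easy to test.

```python
import shutil
import sys
from pathlib import Path


def pick_python_command(repo_root, which=shutil.which):
    """Choose an interpreter command: project venv, then uv run, then system python."""
    venv_python = Path(repo_root) / ".venv" / "bin" / "python"
    if venv_python.exists():
        return [str(venv_python)]
    if which("uv"):
        return ["uv", "run", "python"]
    # Last resort: whatever interpreter is running this script.
    return [sys.executable]
```

If this falls through to the system interpreter on a login node with Python 3.9, the tomli import error above is the expected result, which is why the fix is to make the first or second option available.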
11.4 BIDS Stats Models / GLM
11.4.1 BIDS Stats Models Input values must be arrays
- Symptom: FitLins crashes with cryptic schema-validation error mentioning “expected array, got string”.
- Cause: Wrote "Task": "rest" instead of "Task": ["rest"].
- Fix: All Input filter values must be JSON arrays, even with one entry. libs.bids_statsmodels.validate_model() catches this before run.
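A quick way to spot the problem in a model file before handing it to FitLins is sketched below. It is not the validate_model() implementation; non_array_input_filters is a hypothetical helper that only checks the one rule from this entry, namely that every value under Input is a list.

```python
def non_array_input_filters(model):
    """Return the Input keys whose values are not JSON arrays (Python lists)."""
    return [
        key
        for key, value in model.get("Input", {}).items()
        if not isinstance(value, list)
    ]
```

For example, a model with "Input": {"Task": "rest"} yields ["Task"], while "Input": {"Task": ["rest"]} yields an empty list.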
11.4.2 Task GLM run on XCP-D denoised BOLD (silent confound double-removal)
- Symptom: Task GLM completes successfully but contrasts come out weak / null compared to literature; whole-brain map looks “shrunken toward zero”.
- Cause: GLM was run on _desc-denoised_bold (XCP-D output) instead of _desc-preproc_bold (fMRIPrep output). XCP-D pre-regressed motion confounds, then the GLM regressed them again, removing task variance shared with motion.
- Fix: Task GLMs and gPPI must run on fMRIPrep _desc-preproc_bold with confounds in the design matrix. See docs/ANALYSIS.md § “Task GLMs and XCP-D — IMPORTANT”. libs/confounds.load_task_confounds() raises ValueError if you pass a denoised file by accident.
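The guard amounts to a filename check. A minimal sketch, assuming the standard desc-denoised entity in XCP-D output names; reject_denoised_bold is a hypothetical name and the real load_task_confounds() may check more than this.

```python
from pathlib import Path


def reject_denoised_bold(bold_path):
    """Raise ValueError if the file looks like XCP-D denoised output."""
    path = Path(bold_path)
    if "desc-denoised" in path.name:
        raise ValueError(
            f"{path.name} is XCP-D denoised output; task GLMs need the "
            "fMRIPrep desc-preproc BOLD with confounds in the design matrix"
        )
    return path
```

Calling this at the top of any custom GLM script turns a silent double-removal into an immediate, readable error.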
11.4.3 Events.tsv missing → FitLins falls over
- Symptom: FitLins job fails with “no events for run X” or generates empty design matrix.
- Cause: func/sub-<id>_task-<x>_events.tsv not generated yet. Behavioral data lives elsewhere; events.tsv must be derived.
- Fix: Generate events.tsv from your behavioral data BEFORE running the GLM. Each child repo has its own conversion (twcf: analyses/behavioural/, vividness: experiments/fmri/). Template ships libs/event_utils.py for common patterns.
11.5 DataLad / Git-Annex
11.5.1 datalad get requests password 100+ times
- Symptom: datalad get derivatives/fmriprep prompts for SSH password on every file.
- Cause: SSH agent not forwarded or key not loaded.
- Fix: Run ssh-add ~/.ssh/<your_key> before the get. For HPC sessions, set ForwardAgent yes in your local ~/.ssh/config for the cluster.
11.5.2 git-annex content available but not “wanted” — never fetched
- Symptom: Files exist in git-annex but datalad get says they’re missing on every remote.
- Cause: Wanted-expression filter on the remote excludes those files.
- Fix: Check git annex wanted <remote>. Vividness uses site-specific wanted expressions (docs/DATA_SETUP.md § “git-annex”); UCI users shouldn’t request NEU sourcedata files and vice-versa.
11.6 Setup / Environment
11.6.1 uv pip install silently breaks the project venv
- Symptom: After uv pip install <pkg>, later uv sync removes the package and breaks imports.
- Cause: uv pip bypasses uv.lock; subsequent uv sync enforces lock and removes the unmanaged install.
- Fix: Always use uv add <pkg> (writes to pyproject.toml and re-locks). uv pip is only acceptable in throwaway envs: uv venv .venv-temp && source .venv-temp/bin/activate && uv pip install ....
11.6.2 Preflight passes locally but fMRIPrep fails on HPC
- Symptom: make preflight is green on the laptop, then sbatch fails on HPC with missing module / wrong path.
- Cause: Local config doesn’t match HPC config (e.g., missing MODULE_USE_PATH, wrong CONTAINER_ROOT).
- Fix: Run preflight on the HPC head node, not just the laptop: ssh hpc; cd <repo>; make preflight. Or use paths.local.toml to carry HPC-specific values (gitignored).
11.6.3 FreeSurfer license missing in containerized fMRIPrep
- Symptom: fMRIPrep aborts in early FreeSurfer step with “license file not found” even after registering with NMR.
- Cause: Container can’t see ~/.freesurfer/license.txt; needs an explicit bind mount or FS_LICENSE env var.
- Fix: Place license at config/licenses/fs_license.txt (gitignored via .gitkeep). Devcontainer auto-points FS_LICENSE there. For HPC, export FS_LICENSE="$PWD/config/licenses/fs_license.txt" before make preprocess.
11.7 Library / API
11.7.1 libs/paths.py is intentionally NOT synced from template
- Limitation: Each child repo customizes libs/paths.py (extra fields, custom env var names, fallback parsers). Template’s paths.py would clobber these.
- Workaround: sync_from_template.sh puts libs/paths.py in SYNC_WITH_CARE, requiring --include-paths opt-in. See docs/TEMPLATE_MAINTENANCE.md § “Customization Patterns” for the convergence path (migrate custom fields to [paths.locations]).
11.7.2 Shell scripts must work with both legacy module-callable and new dataclass paths.py
- Limitation: run_python_inline()’s resolve() function tries module-level callables first, then falls back to get_paths() dataclass attributes. Don’t assume only one.
- Workaround: Don’t refactor the resolve fallback chain in HPC scripts unless you’re committing to an API audit across all 4 child repos.
11.8 CI / Testing
11.8.1 Mock E2E test passes but real-site smoke fails
- Symptom: PR CI is green; first deployment to a new HPC site fails.
- Cause: Mock E2E (tests/test_pipeline_end_to_end.py) tests orchestration only: it generates a 2 MB synthetic BIDS, runs submit_subject_pipeline.sh --test-only, and never hits sbatch or real containers.
- Fix: Run the real-site smoke test once when onboarding a new HPC: bash scripts/tests/run_new_site_smoke.sh. Budget ~5 CPU-hours. Documented in docs/HPC_GUIDE.md § “New Site Validation”.
11.8.2 pytest.ini markers silently not registering
- Symptom: @pytest.mark.unit etc. emit PytestUnknownMarkWarning on every test run.
- Cause: pytest.ini header is [tool:pytest] (setup.cfg format) instead of [pytest] (pytest.ini format).
- Fix: Change [tool:pytest] → [pytest]. Already corrected in template; check child repos that may have inherited the bug.
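To audit child repos for the inherited header, a small check along these lines may be enough. It is a sketch with a hypothetical function name, and it reads the file contents as text so it can be pointed at any pytest.ini. Interpolation is disabled because pytest options such as log formats can legitimately contain % characters.

```python
import configparser


def pytest_ini_has_setup_cfg_header(ini_text):
    """True if the text uses [tool:pytest] and has no [pytest] section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    return parser.has_section("tool:pytest") and not parser.has_section("pytest")
```

Pass it the result of Path("pytest.ini").read_text() in each child repo; a True result means the markers in that file are not being registered.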
11.9 Cross-repo coordination
11.9.1 Child repo’s sync_from_template.sh is a stale snapshot
- Symptom: Running sync from a child repo shows a different DOC_FILES list than the template’s current one, or references the renamed sync_fmriprep_scripts.sh.
- Cause: Each child has its own snapshot of the sync script from when they last synced.
- Fix: Future syncs update the script itself (it’s in SAFE_INFRA). After the first sync from new template, subsequent syncs use the new flags (--include-paths, --include-shells, --exclude, --diff).
11.9.2 Numpy major version drift across child repos (v1 vs v2)
- Symptom: Code that works in vividness (numpy 2.3+) breaks in twcf (numpy <2.0) with cryptic dtype errors.
- Cause: Child repos pin different numpy major versions. Some dependencies (older nilearn) require v1; newer ones (xgi) require v2.
- Fix: Document each repo’s numpy version in pyproject.toml and call out incompatibilities in shared library code via runtime version checks. Audit findings (2026-04-26): twcf <2.0, vividness ≥2.3, Hypergraphsciousness ≥2.0.2, TI_DecNef ≥1.24.
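One possible shape for such a runtime version check is sketched below. The function names are hypothetical and nothing here is taken from the shared libraries; the check takes the version string as an argument so it can be tested without importing numpy.

```python
def numpy_major(version):
    """Return the major version from a string such as "2.3.1"."""
    return int(version.split(".", 1)[0])


def require_numpy_major(version, supported):
    """Raise RuntimeError unless the major version is in the supported set."""
    major = numpy_major(version)
    if major not in supported:
        wanted = ", ".join(str(m) for m in sorted(supported))
        raise RuntimeError(
            f"numpy {version} is not supported here; expected major version {wanted}"
        )
    return major
```

In shared code this would be called once at import time, for example require_numpy_major(np.__version__, {2}) in a module that relies on numpy 2 behaviour, so the failure names the version mismatch instead of surfacing later as a dtype error.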
11.10 Skills / AI agent infrastructure
11.10.1 Agent subagent_type mismatch
- Symptom: Agent({subagent_type: "code-simplifier"}) fails with “no such agent”.
- Cause: Plugin-namespaced agents have different subagent_type strings.
- Fix: Use exact match: general-purpose, Explore, Plan, etc. The code-simplifier subagent is code-simplifier:code-simplifier.
11.10.2 Multiple Infisical MCP servers connected — only first works
- Symptom: Configured 4 Infisical MCP entries (one per org), only the first one’s tools appear.
- Cause: Claude Code dedupes identical MCP tool schemas; subsequent servers are wasted processes.
- Fix: Use a single MCP entry; switch orgs via the infisical-get <org> ... CLI helper at ~/.local/bin/infisical-get.
11.11 Adding a new entry
When you hit (and fix) a real bug, add an entry in Symptom → Cause → Fix format above. The audience is a tired researcher hitting an error message they don’t recognize at 11pm — they should be able to grep this file and find the answer in 30 seconds.