11  Known Issues

Active bugs, workarounds, and limitations encountered during real deployments across UCI HPC3, UCR HPCC, NEU Explorer, and local workstations. Each entry is written in Symptom → Cause → Fix format so an agent or human hitting the same error can search for it.

For resolved issues that no longer apply, search the git history (git log --grep=<keyword>).


11.1 fMRIPrep + Preprocessing

11.1.1 --cifti-output crashes fMRIPrep 25.2.3

  • Symptom: fMRIPrep aborts during subcortical aparc resampling with a KeyError or shape-mismatch traceback when --cifti-output 91k is set.
  • Cause: Upstream regression in 25.2.3’s nibabel + grayordinate handling.
  • Fix: Leave CIFTI_OUTPUT=none in config/site.conf until you upgrade to 25.2.4+. Re-enable with CIFTI_OUTPUT=91k make preprocess ... once the module is bumped.

11.1.2 Fieldmap silently skipped (no SDC applied)

  • Symptom: fMRIPrep finishes successfully but the report shows “no susceptibility distortion correction” despite fmap/ files existing.
  • Cause: The fieldmap JSONs are missing both B0FieldIdentifier and IntendedFor, so fMRIPrep cannot link them to any BOLD run. It skips SDC silently rather than failing.
  • Fix: Add B0FieldIdentifier to your dcm2bids / heudiconv config. Each fieldmap pair gets a unique tag (pepolar_run-01, etc.) referenced by the corresponding BOLD’s B0FieldSource. make preflight now warns on missing fieldmap metadata.
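
A minimal sketch of the kind of check described above, assuming a standard BIDS layout with sidecars under sub-*/[ses-*/]fmap/. The function name find_unlinked_fieldmaps is hypothetical, and this is not the template's actual make preflight implementation.

```python
import json
from pathlib import Path

# Metadata keys that can link a fieldmap to a BOLD run.
LINK_KEYS = ("B0FieldIdentifier", "IntendedFor")


def find_unlinked_fieldmaps(bids_root):
    """Return fmap JSON sidecars that carry neither linking key.

    Hypothetical helper for illustration only.
    """
    unlinked = []
    # "**" also matches zero directories, so this covers datasets
    # with and without ses-* folders.
    for sidecar in sorted(Path(bids_root).glob("sub-*/**/fmap/*.json")):
        meta = json.loads(sidecar.read_text())
        if not any(meta.get(key) for key in LINK_KEYS):
            unlinked.append(sidecar)
    return unlinked
```

Calling find_unlinked_fieldmaps("<bids_root>") before submitting jobs returns an empty list when every fieldmap sidecar carries at least one of the two keys.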

11.1.3 fMRIPrep rerun is much faster than first run (1h vs 5h)

  • Symptom: An fMRIPrep rerun completes in ~1 hour even though the first run took 5+ hours. The user worries something was skipped.
  • Cause: FreeSurfer outputs (recon-all, hours) are cached; rerun reuses them.
  • Fix: Not a bug. To force re-run: delete derivatives/freesurfer/<sub>/. To verify cache reuse: check timestamps in derivatives/freesurfer/<sub>/scripts/recon-all.log.

11.1.4 fmriprep not found after all resolution strategies

  • Symptom: Batch launcher fails with “fmriprep not found” before submitting any subject.
  • Cause: Site has neither a NeuroCommand module nor a container in $CONTAINER_ROOT.
  • Fix: Set FMRIPREP_MODULE (strategy 1) OR CONTAINER_ROOT (strategy 2) in config/site.conf. Pull containers first: bash scripts/setup/pull_containers.sh --dest "$CONTAINER_ROOT".

11.2 SLURM / HPC

11.2.1 sbatch not found on a laptop

  • Symptom: make preprocess fails immediately with “sbatch not found” even though everything else is set up.
  • Cause: Preflight treats the run as local only when both SLURM_ACCOUNT and SLURM_PARTITION are empty. If only one of them is empty, the SLURM checks still run and look for sbatch.
  • Fix: Leave both SLURM_ACCOUNT and SLURM_PARTITION empty in config/site.conf. Preflight will auto-skip SLURM checks in local mode.
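
The both-empty rule can be sketched as follows. This is an illustration of the logic only, assuming the site.conf values have already been parsed into a dict; is_local_mode is a hypothetical name, not the preflight's real function.

```python
def is_local_mode(config):
    """Return True only when both SLURM settings are empty or unset.

    `config` is assumed to be a dict of already-parsed site.conf values.
    """
    account = (config.get("SLURM_ACCOUNT") or "").strip()
    partition = (config.get("SLURM_PARTITION") or "").strip()
    return not account and not partition
```

With this rule, a config that sets only SLURM_PARTITION is still treated as an HPC config, which is why a single leftover value triggers the sbatch check.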

11.2.2 Invalid feature specification at job submit

  • Symptom: sbatch rejects every job with “Invalid feature specification”.
  • Cause: SLURM_CONSTRAINT="intel" (or similar feature tag) on a cluster that doesn’t use feature tags (e.g. NEU Explorer).
  • Fix: Set SLURM_CONSTRAINT="" in config/site.conf.

11.2.3 BASH_SOURCE resolves to SLURM spool dir

  • Symptom: HPC scripts fail with “REPO_ROOT not found” or load wrong config when launched via sbatch.
  • Cause: SLURM copies scripts to a spool dir before execution; bash’s BASH_SOURCE[0] then points to the spool, not the original repo.
  • Fix: All HPC scripts now have a SLURM_SUBMIT_DIR fallback in their REPO_ROOT resolution — if you’re writing a new HPC script, use the same pattern (see preprocessing/fmri/run_fmriprep_hpc.sh:67-74).

11.2.4 DependencyNeverSatisfied — DAG leaves stuck PENDING

  • Symptom: make pipeline-status shows downstream stages stuck PENDING with reason DependencyNeverSatisfied.
  • Cause: An upstream stage failed, so its afterok: dependency can never resolve. SLURM holds children indefinitely.
  • Fix: Cancel the stuck children with scancel <job_id>, then read the failed parent’s logs (logs/fmriprep/*.err or logs/validate/validate_fmriprep_*.out) for the real cause.
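
To find the stuck children in bulk, one option is to parse squeue output. The sketch below assumes squeue is called with -h (no header) and -o "%i %T %r" (job ID, state, reason); the helper names are hypothetical and the template's make pipeline-status may work differently.

```python
import subprocess

STUCK_REASON = "DependencyNeverSatisfied"


def parse_stuck_jobs(squeue_text):
    """Return IDs of PENDING jobs whose reason is DependencyNeverSatisfied.

    Expects one job per line in the form "<job_id> <state> <reason>".
    """
    stuck = []
    for line in squeue_text.splitlines():
        fields = line.split(maxsplit=2)
        if len(fields) != 3:
            continue
        job_id, state, reason = fields
        # Tolerate a reason wrapped in parentheses.
        if state == "PENDING" and reason.strip().strip("()") == STUCK_REASON:
            stuck.append(job_id)
    return stuck


def list_stuck_jobs(user):
    """Query squeue for one user's jobs and return the stuck job IDs."""
    result = subprocess.run(
        ["squeue", "-h", "-u", user, "-o", "%i %T %r"],
        capture_output=True, text=True, check=True,
    )
    return parse_stuck_jobs(result.stdout)
```

The returned IDs can be passed to scancel one at a time once the failed parent's logs have been read.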

11.2.5 XCP-D 0.10+ OOM at 64 GB on full atlas + CIFTI

  • Symptom: XCP-D job killed by SLURM with OUT_OF_MEMORY during parcellate_alff, even though earlier runs at 64 GB worked.
  • Cause: XCP-D 0.10+ with full atlas parcellation (16+ atlases × N runs) exceeds 64 GB during ALFF parcellation.
  • Fix: Template’s run_xcpd_hpc.sh now defaults to --mem=128G. For minimal-atlas runs, override with SBATCH_ARGS=--mem=64G to save allocation.

11.3 Configuration / Paths

11.3.1 paths.toml not found on fresh clone

  • Symptom: make preflight errors immediately with “paths.toml not found”.
  • Cause: config/paths.toml is git-ignored; fresh clones don’t have one.
  • Fix: Run make setup (auto-detects site, copies preset). For unknown hosts it falls back to the local preset.

11.3.2 Placeholders survive into runtime (<lab>, <user>, etc.)

  • Symptom: Pipeline submits jobs but they fail with “directory not found: /dfs10///…”.
  • Cause: The preset was copied but placeholders never edited.
  • Fix: libs/paths.py now raises PathsNotConfiguredError at config load time if any <lab>, <user>, <project>, <group>, or <username> token survives. This catches the issue before any sbatch is run.
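
A minimal sketch of such a load-time check, assuming the config has been parsed into nested dicts. The exception class here is a local stand-in for the one in libs/paths.py, and check_placeholders is a hypothetical name; the real implementation may differ.

```python
import re

# The five placeholder tokens listed above.
PLACEHOLDER_RE = re.compile(r"<(lab|user|project|group|username)>")


class PathsNotConfiguredError(RuntimeError):
    """Stand-in for the exception of the same name in libs/paths.py."""


def check_placeholders(config, prefix=""):
    """Raise if any string value in a nested dict still holds a placeholder."""
    for key, value in config.items():
        dotted = f"{prefix}{key}"
        if isinstance(value, dict):
            check_placeholders(value, prefix=f"{dotted}.")
        elif isinstance(value, str):
            match = PLACEHOLDER_RE.search(value)
            if match:
                raise PathsNotConfiguredError(
                    f"{dotted} still contains {match.group(0)}: {value}"
                )
```

Failing at load time names the offending key in the error message, which is easier to act on than a "directory not found" from inside a batch job.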

11.3.3 config/paths.local.toml overrides not applying

  • Symptom: Edited paths.local.toml but the values aren’t used.
  • Cause: Either the file isn’t at config/paths.local.toml (path is exact), or you’re using a child repo where this pattern wasn’t yet wired in.
  • Fix: libs.paths._load_toml() deep-merges paths.local.toml over paths.toml if it exists. Verify with: python -c "from libs.paths import get_paths; print(get_paths().dataset_root)".
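
For readers unsure what "deep-merge" means here, the following sketch shows one common way to do it on already-parsed TOML dicts. It illustrates the technique and is not a copy of libs.paths._load_toml(); deep_merge is a hypothetical name.

```python
def deep_merge(base, override):
    """Return a new dict with `override` layered over `base`.

    Nested dicts are merged key by key; any other value in `override`
    replaces the value in `base`. Neither input is modified.
    """
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged
```

Under this scheme paths.local.toml only needs the keys it changes; sibling keys in the same table keep their values from paths.toml.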

11.3.4 Child repo’s from libs.paths import get_paths fails on bare HPC login

  • Symptom: Bash wrapper script fails with ModuleNotFoundError: tomli when trying to read paths.toml.
  • Cause: HPC login nodes ship Python 3.9 (no stdlib tomllib) and the default user environment lacks tomli.
  • Fix: Pipeline scripts call run_python_inline() which prefers .venv/bin/python first, then uv run, then system python. Make sure uv sync has run on the login node, or activate the venv before submitting.

11.4 BIDS Stats Models / GLM

11.4.1 BIDS Stats Models Input values must be arrays

  • Symptom: FitLins crashes with cryptic schema-validation error mentioning “expected array, got string”.
  • Cause: The model file contains "Task": "rest" (a string) instead of "Task": ["rest"] (an array).
  • Fix: All Input filter values must be JSON arrays, even with one entry. libs.bids_statsmodels.validate_model() catches this before run.
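
The rule can be checked in a few lines before FitLins ever runs. This is a simplified sketch, assuming the model JSON has been loaded into a dict with a top-level "Input" object; find_scalar_inputs is a hypothetical name and does not reproduce libs.bids_statsmodels.validate_model().

```python
def find_scalar_inputs(model):
    """Return the Input keys whose filter value is not a JSON array (list)."""
    inputs = model.get("Input", {})
    return sorted(
        key for key, value in inputs.items() if not isinstance(value, list)
    )
```

For the failing example above, find_scalar_inputs({"Input": {"Task": "rest"}}) reports "Task", while the corrected {"Input": {"Task": ["rest"]}} reports nothing.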

11.4.2 Task GLM run on XCP-D denoised BOLD (silent confound double-removal)

  • Symptom: Task GLM completes successfully but contrasts come out weak / null compared to literature; whole-brain map looks “shrunken toward zero”.
  • Cause: GLM was run on _desc-denoised_bold (XCP-D output) instead of _desc-preproc_bold (fMRIPrep output). XCP-D pre-regressed motion confounds, then the GLM regressed them again, removing task variance shared with motion.
  • Fix: Task GLMs and gPPI must run on fMRIPrep _desc-preproc_bold with confounds in the design matrix. See docs/ANALYSIS.md § “Task GLMs and XCP-D — IMPORTANT”. libs/confounds.load_task_confounds() raises ValueError if you pass a denoised file by accident.
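
A guard of this kind can be as simple as a file-name check. The sketch below is an assumption about how such a check might look, not the code in libs/confounds; reject_denoised_bold is a hypothetical name.

```python
from pathlib import Path


def reject_denoised_bold(bold_path):
    """Raise ValueError if the file name marks XCP-D denoised output.

    Returns the path unchanged (as a Path) when the check passes.
    """
    path = Path(bold_path)
    if "_desc-denoised_" in path.name:
        raise ValueError(
            f"{path.name} is XCP-D denoised output; task GLMs need the "
            "fMRIPrep _desc-preproc_bold file with confounds in the design matrix."
        )
    return path
```

Calling the guard at the top of any GLM entry point turns a silent loss of sensitivity into an immediate error.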

11.4.3 Events.tsv missing → FitLins falls over

  • Symptom: FitLins job fails with “no events for run X” or generates empty design matrix.
  • Cause: func/sub-<id>_task-<x>_events.tsv not generated yet. Behavioral data lives elsewhere; events.tsv must be derived.
  • Fix: Generate events.tsv from your behavioral data BEFORE running the GLM. Each child repo has its own conversion (twcf: analyses/behavioural/, vividness: experiments/fmri/). Template ships libs/event_utils.py for common patterns.
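
As a starting point, the sketch below writes a BIDS-style events.tsv with the required onset and duration columns (in seconds) plus trial_type. It assumes the behavioral data has already been reduced to a list of dicts with those three keys; write_events_tsv is a hypothetical helper and is not taken from libs/event_utils.py.

```python
import csv


def write_events_tsv(trials, out_path):
    """Write trials to a tab-separated events file, sorted by onset.

    Each trial is a dict with "onset" and "duration" in seconds and a
    "trial_type" label.
    """
    rows = sorted(trials, key=lambda trial: float(trial["onset"]))
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["onset", "duration", "trial_type"])
        for trial in rows:
            writer.writerow([
                f"{float(trial['onset']):.3f}",
                f"{float(trial['duration']):.3f}",
                trial["trial_type"],
            ])
```

Onsets must be expressed relative to the start of the BOLD acquisition, so any offset between the behavioral clock and the scanner trigger has to be subtracted before calling the helper.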

11.5 DataLad / Git-Annex

11.5.1 datalad get requests password 100+ times

  • Symptom: datalad get derivatives/fmriprep prompts for SSH password on every file.
  • Cause: SSH agent not forwarded or key not loaded.
  • Fix: Run ssh-add ~/.ssh/<your_key> before the get. For HPC sessions, set ForwardAgent yes in your local ~/.ssh/config for the cluster.

11.5.2 git-annex content available but not “wanted” — never fetched

  • Symptom: Files exist in git-annex but datalad get says they’re missing on every remote.
  • Cause: Wanted-expression filter on the remote excludes those files.
  • Fix: Check git annex wanted <remote>. Vividness uses site-specific wanted expressions (docs/DATA_SETUP.md § “git-annex”); UCI users shouldn’t request NEU sourcedata files and vice-versa.

11.6 Setup / Environment

11.6.1 uv pip install silently breaks the project venv

  • Symptom: After uv pip install <pkg>, later uv sync removes the package and breaks imports.
  • Cause: uv pip bypasses uv.lock; subsequent uv sync enforces lock and removes the unmanaged install.
  • Fix: Always use uv add <pkg> (writes to pyproject.toml and re-locks). uv pip is only acceptable in throwaway envs: uv venv .venv-temp && source .venv-temp/bin/activate && uv pip install ....

11.6.2 Preflight passes locally but fMRIPrep fails on HPC

  • Symptom: make preflight is green on the laptop, then sbatch fails on HPC with missing module / wrong path.
  • Cause: Local config doesn’t match HPC config (e.g., missing MODULE_USE_PATH, wrong CONTAINER_ROOT).
  • Fix: Run preflight on the HPC head node, not just the laptop: ssh hpc 'cd <repo> && make preflight'. Or use paths.local.toml (gitignored) to carry HPC-specific values.

11.6.3 FreeSurfer license missing in containerized fMRIPrep

  • Symptom: fMRIPrep aborts in early FreeSurfer step with “license file not found” even after registering with NMR.
  • Cause: Container can’t see ~/.freesurfer/license.txt; needs an explicit bind mount or FS_LICENSE env var.
  • Fix: Place the license at config/licenses/fs_license.txt (the file is gitignored; the directory is kept in the repo via .gitkeep). The devcontainer auto-points FS_LICENSE there. For HPC, export FS_LICENSE="$PWD/config/licenses/fs_license.txt" before make preprocess.

11.7 Library / API

11.7.1 libs/paths.py is intentionally NOT synced from template

  • Limitation: Each child repo customizes libs/paths.py (extra fields, custom env var names, fallback parsers). Template’s paths.py would clobber these.
  • Workaround: sync_from_template.sh puts libs/paths.py in SYNC_WITH_CARE, requiring --include-paths opt-in. See docs/TEMPLATE_MAINTENANCE.md § “Customization Patterns” for the convergence path (migrate custom fields to [paths.locations]).

11.7.2 Shell scripts must work with both legacy module-callable and new dataclass paths.py

  • Limitation: run_python_inline()’s resolve() function tries module-level callables first, then falls back to get_paths() dataclass attributes. Don’t assume only one.
  • Workaround: Don’t refactor the resolve fallback chain in HPC scripts unless you’re committing to an API audit across all 4 child repos.

11.8 CI / Testing

11.8.1 Mock E2E test passes but real-site smoke fails

  • Symptom: PR CI is green; first deployment to a new HPC site fails.
  • Cause: Mock E2E (tests/test_pipeline_end_to_end.py) tests orchestration only — generates a 2 MB synthetic BIDS, runs submit_subject_pipeline.sh --test-only, never hits sbatch or real containers.
  • Fix: Run the real-site smoke test once when onboarding a new HPC: bash scripts/tests/run_new_site_smoke.sh. Budget ~5 CPU-hours. Documented in docs/HPC_GUIDE.md § “New Site Validation”.

11.8.2 pytest.ini markers silently not registering

  • Symptom: @pytest.mark.unit etc. emit PytestUnknownMarkWarning on every test run.
  • Cause: pytest.ini header is [tool:pytest] (setup.cfg format) instead of [pytest] (pytest.ini format).
  • Fix: Change [tool:pytest] to [pytest]. Already corrected in template; check child repos that may have inherited the bug.

11.9 Cross-repo coordination

11.9.1 Child repo’s sync_from_template.sh is a stale snapshot

  • Symptom: Running sync from a child repo shows a different DOC_FILES list than the template’s current one, or references the renamed sync_fmriprep_scripts.sh.
  • Cause: Each child has its own snapshot of the sync script from when they last synced.
  • Fix: Future syncs update the script itself (it’s in SAFE_INFRA). After the first sync from the new template, subsequent syncs use the new flags (--include-paths, --include-shells, --exclude, --diff).

11.9.2 Numpy major version drift across child repos (v1 vs v2)

  • Symptom: Code that works in vividness (numpy 2.3+) breaks in twcf (numpy <2.0) with cryptic dtype errors.
  • Cause: Child repos pin different numpy major versions. Some dependencies (older nilearn) require v1; newer ones (xgi) require v2.
  • Fix: Document each repo’s numpy version in pyproject.toml and call out incompatibilities in shared library code via runtime version checks. Audit findings (2026-04-26): twcf <2.0, vividness ≥2.3, Hypergraphsciousness ≥2.0.2, TI_DecNef ≥1.24.
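
A runtime version check in shared library code might look like the sketch below. It takes the version string as an argument so it can be exercised without importing numpy; the function name require_numpy_major is hypothetical.

```python
def require_numpy_major(version, allowed_majors):
    """Raise RuntimeError unless the numpy major version is allowed.

    `version` is a string such as numpy.__version__; `allowed_majors`
    is a collection of ints, for example {2} or {1, 2}.
    """
    major = int(version.split(".")[0])
    if major not in allowed_majors:
        allowed = ", ".join(str(m) for m in sorted(allowed_majors))
        raise RuntimeError(
            f"numpy {version} is unsupported here; "
            f"this code path needs major version(s): {allowed}"
        )
    return major
```

In a shared module this would be called as require_numpy_major(numpy.__version__, {2}) near the import, so the incompatibility surfaces as a clear message instead of a dtype error deep in the call stack.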

11.10 Skills / AI agent infrastructure

11.10.1 Agent subagent_type mismatch

  • Symptom: Agent({subagent_type: "code-simplifier"}) fails with “no such agent”.
  • Cause: Plugin-namespaced agents have different subagent_type strings.
  • Fix: Use exact match: general-purpose, Explore, Plan, etc. The code-simplifier subagent is code-simplifier:code-simplifier.

11.10.2 Multiple Infisical MCP servers connected — only first works

  • Symptom: Configured 4 Infisical MCP entries (one per org), only the first one’s tools appear.
  • Cause: Claude Code dedupes identical MCP tool schemas; subsequent servers are wasted processes.
  • Fix: Use a single MCP entry; switch orgs via the infisical-get <org> ... CLI helper at ~/.local/bin/infisical-get.

11.11 Adding a new entry

When you hit (and fix) a real bug, add an entry to the matching subsection above in Symptom → Cause → Fix format. The audience is a tired researcher facing an unfamiliar error message at 11pm; they should be able to grep this file and find the answer in 30 seconds.