3  Data Setup

This template enforces a strict separation between analysis code and research data. Code lives in a lightweight Git repository; data (raw acquisitions, BIDS derivatives, caches) lives in a separate, potentially large repository managed with DataLad, git-annex, or cloud-synced storage. The two sides communicate only through filesystem paths resolved by libs.paths and configured in config/paths.toml.

This document covers the architectural rationale (Section 1), a full DataLad/git-annex implementation guide (Section 2), and an optional datalad-container provenance layer (Section 3). For moving data between sites with rclone, see guides/rclone_transfers.md.


3.1 1. Data/Code Separation

3.1.1 1.1 Why Separate?

Code repository:
  • Git-tracked and lightweight.
  • Houses analysis, preprocessing, and experiment code.
  • Configured via config/tooling.toml.

Data repository:
  • Potentially large; versioned with DataLad, git-annex, or rclone.
  • Stores raw acquisitions, BIDS derivatives, caches, and subject/session outputs.
  • Mounted paths referenced via config/paths.toml.

The two repositories communicate only through filesystem paths resolved by libs.paths.

3.1.3 1.3 Synchronising Outputs

  • For reproducible runs, stage generated artefacts in the data repository with DataLad (datalad save), not here; see the sketch below.
  • Configure CI to mount shared storage or use DataLad runners so automated jobs can resolve the same paths as local developers.
  • If sharing results publicly, export them from the data repo (e.g., via datalad push to a public sibling) and keep this template focused on code.
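
A minimal sketch of that first staging step, assuming the data repository is a DataLad dataset with a GitHub sibling named github as set up in Section 2 (the path and commit message are placeholders):

# In the data repository, not this code repo
cd /path/to/my-dataset
datalad save -m "Add group-level GLM maps"
datalad push --to github   # history and annex pointers go to GitHub; bulk content stays on annex remotes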

3.1.4 1.4 Common Patterns

  • Multiple studies share a stimuli library: store the assets in a dedicated stimuli dataset and point stimuli_root to its mount location.
  • Cloud execution: mount object storage (e.g., S3 via s3fs) at runtime and update paths.toml through environment variables or templating (see the sketch below).
  • Read-only collaborators: provide them with config/tooling.toml without secrets and a read-only paths.toml pointing to shared mounts.
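
For the cloud-execution pattern, one hedged approach is to keep a template copy of paths.toml and substitute environment variables at launch time; the template filename below is hypothetical, and envsubst comes from GNU gettext:

# Render config/paths.toml from a template before the job starts
export DATA_ROOT=/mnt/s3/my-dataset
envsubst < config/paths.toml.template > config/paths.toml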

By enforcing this separation, the code repository stays lightweight, enabling rapid cloning, review, and automated testing without transferring large datasets.


3.2 2. DataLad & git-annex

This section explains how to set up and manage neuroimaging data using DataLad and git-annex, providing the practical implementation of the separation described above.

3.2.1 2.1 Overview

DataLad combines Git’s version control with git-annex’s large file handling to create reproducible, portable datasets. The key insight:

  • On your primary storage (e.g., HPC): All files are real files. Work normally.
  • When cloned elsewhere: Files appear as lightweight pointers. Fetch on demand with datalad get.

This means no daily overhead at your compute location, but collaborators can clone the dataset without downloading terabytes.

3.2.2 2.2 Core Concepts

3.2.2.1 Git vs Git-Annex

  • Small text files (.tsv, .json, .csv, .txt): tracked directly by Git; always present in every clone.
  • Large binary files (.nii.gz, .mgz, images): tracked by git-annex; a pointer lives in git and content is fetched on demand.
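
A quick way to see which side a given file landed on: git annex lookupkey prints the annex key for annexed files and nothing for files stored in plain git (the paths below are illustrative):

git annex lookupkey rawdata/sub-01/anat/sub-01_T1w.nii.gz   # prints a key: annexed
git annex lookupkey rawdata/participants.tsv                # no output: plain git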

3.2.2.2 Remotes

  • Git remote: stores version history and annex pointers (e.g., GitHub).
  • Git-annex special remote: stores the actual file content (e.g., HPC storage, SharePoint via rclone, S3).

A dataset can have multiple content remotes. Git-annex tracks which remotes have which files.
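
To see that tracking in practice, git annex list prints a per-file presence matrix with one column per repository or remote:

# An X in a column means that repository holds the file's content
git annex list rawdata/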

3.2.3 2.3 Setting Up a New Dataset

3.2.3.1 Initialize on Primary Storage

Create the dataset where you’ll do most of your work (typically HPC):

cd /path/to/projects
datalad create -c text2git my-dataset
cd my-dataset

The -c text2git configuration automatically keeps small text files in git proper.

3.2.3.2 Configure Large File Rules

Refine what gets annexed:

git annex config --set annex.largefiles \
  'largerthan=1mb and not (include=*.tsv or include=*.csv or include=*.json or include=*.txt or include=*.md)'

This means:

  • Files larger than 1 MB are annexed (pointer in git, content tracked separately).
  • Text files (.tsv, .csv, .json, .txt, .md) always go into git directly.
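
Once some files have been saved (see the initial save step below), you can verify how the rule was applied; git annex find lists annexed files whose content is present locally, which on primary storage is all of them:

# Annexed files under rawdata/ (text files kept in plain git will not be listed)
git annex find rawdata/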

3.2.3.3 Create Directory Structure

For BIDS neuroimaging data:

mkdir -p sourcedata rawdata derivatives taskdata

  • sourcedata/: original scanner exports and DICOMs (annexed).
  • rawdata/: BIDS-converted NIfTI data (annexed).
  • derivatives/: pipeline outputs (fmriprep, etc.; annexed).
  • taskdata/: behavioral events and configs (small text files, kept in git).
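
Optionally, you can pin the taskdata/ rule in .gitattributes so nothing under that directory is ever annexed, regardless of size (a refinement beyond the global largefiles rule above):

# Force taskdata/ to stay in plain git
echo "taskdata/** annex.largefiles=nothing" >> .gitattributes
datalad save -m "Keep taskdata in plain git"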

3.2.3.4 Add Dataset Metadata

Create dataset_description.json:

{
  "Name": "My Dataset",
  "BIDSVersion": "1.9.0",
  "DatasetType": "raw",
  "License": "CC-BY-4.0",
  "Authors": ["Your Name"],
  "DatasetDOI": ""
}

3.2.3.5 Initial Save

datalad save -m "Initialize dataset structure"

3.2.4 2.4 Setting Up Remotes

3.2.4.1 GitHub (Git History Only)

Create an empty repo on GitHub, then:

datalad siblings add -s github \
  --url git@github.com:YourOrg/your-dataset.git

datalad push --to github

GitHub stores git history and annex pointers but not the actual large files.
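
You can confirm the sibling is registered (and see its URL) with:

datalad siblings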

3.2.4.2 SharePoint via Rclone

If you have an rclone remote configured (e.g., sharepoint_lab:):

git annex initremote sharepoint \
  type=external externaltype=rclone \
  target=sharepoint_lab \
  prefix=Projects/MyProject/data \
  chunk=50MiB \
  encryption=none

Parameters:

  • target: your rclone remote name.
  • prefix: the path within the remote.
  • chunk: split large files into chunks (helps with flaky connections).
  • encryption: use none for lab storage, shared for public clouds.

For more on rclone configuration, see guides/rclone_transfers.md.
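
If you want to sanity-check the new remote before trusting it with data, git annex testremote stores and retrieves disposable test keys (optional, and can be slow over SharePoint):

git annex testremote sharepoint --fast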

3.2.4.3 HPC via Rsync (for clones to access)

When others clone from GitHub, they need a way to fetch content from HPC:

git annex initremote hpc-lab \
  type=rsync \
  rsyncurl=user@hpc.university.edu:/path/to/dataset \
  encryption=none

3.2.5 2.5 Daily Workflows

3.2.5.1 Working on Primary Storage (HPC)

No special steps needed. Files are real files:

cd /path/to/my-dataset

# Run your analysis
python my_analysis.py

# Save changes
datalad save -m "Add GLM results for sub-01"

# Push history to GitHub
datalad push --to github

3.2.5.2 Cloning to Local Machine

# Clone (fast - just metadata)
datalad clone git@github.com:YourOrg/your-dataset.git ~/data/my-dataset
cd ~/data/my-dataset

# Enable the HPC remote
git annex enableremote hpc-lab

# Get specific data you need
datalad get rawdata/sub-01
datalad get derivatives/fmriprep/sub-01
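
When you no longer need the local copies, drop them to reclaim space; datalad drop refuses to remove content unless another copy is known to exist:

datalad drop rawdata/sub-01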

3.2.5.3 Archiving to SharePoint

Copy content to the archive remote:

# Single subject
git annex copy --to sharepoint derivatives/fmriprep/sub-01

# All derivatives
git annex copy --to sharepoint derivatives/

3.2.5.4 Checking Where Content Lives

# Which remotes have this file?
git annex whereis rawdata/sub-01/anat/sub-01_T1w.nii.gz

# What's missing locally?
git annex find --not --in here

# What's not yet archived?
git annex find --in here --not --in sharepoint

3.2.6 2.6 Preferred Content Rules

Automate what each remote should store:

# Run on the HPC clone: it wants everything (it's the workhorse)
git annex wanted . "include=*"

# SharePoint wants archived data
git annex wanted sharepoint "include=derivatives/* or include=rawdata/*"

# Run in each local clone: keep only what has been explicitly fetched
git annex wanted here "present"

Then use git annex sync --content to automatically move files to their preferred locations.
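
You can additionally ask git-annex to insist on a minimum number of copies, which pairs well with the archive remote:

# Dropping content is only allowed if at least two other copies remain
git annex numcopies 2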

3.2.7 2.7 Migration from Existing Data

3.2.7.1 Importing Files

cd my-dataset

# Copy files in (via rclone, rsync, cp, etc.)
rclone copy sharepoint_lab:OldProject/rawdata ./rawdata --progress

# Tell git-annex about them
datalad save -m "Import existing rawdata"

3.2.7.2 Registering Existing Copies

If files already exist on a remote and you don’t want to re-upload:

# After importing locally, register that sharepoint also has them
git annex registerurl SHA256E-... sharepoint:Projects/OldProject/rawdata/file.nii.gz

3.2.8 2.8 Changing Institutions

The beauty of this setup: GitHub holds the portable identity.

# Add new HPC as remote
git annex initremote new-hpc type=rsync rsyncurl=user@newhpc:/path/... encryption=none

# Copy content to new location
git annex copy --to new-hpc

# Mark old remote as gone (keeps history, prevents fetch attempts)
git annex dead old-hpc

3.2.9 2.9 Troubleshooting

3.2.9.1 “Unable to access remote”

# Check remote configuration
git annex info sharepoint

# Test rclone directly
rclone lsd sharepoint_lab:Projects/

3.2.9.2 Large Clone Sizes

If git clone is slow, the git history may have grown. Consider:

# Check sizes
git count-objects -vH

# For new clones, use shallow clone then unshallow
git clone --depth 1 ... && git fetch --unshallow

3.2.9.3 Pointer Files Appearing on HPC

If files turn into pointers unexpectedly:

# Unlock for editing
git annex unlock rawdata/sub-01/

# Or repair annex symlinks after files or the repository have been moved
git annex fix

3.2.10 2.10 Best Practices

  1. Commit often, push daily: Small commits are easier to review and revert.

  2. Use meaningful messages: datalad save -m "fmriprep sub-01 ses-01 complete" rather than datalad save -m "update".

  3. Archive completed subjects: Once QC passes, git annex copy --to sharepoint so you have a backup.

  4. Document your remotes: Keep a README listing remote names, paths, and access requirements.

  5. Test clone workflow: Periodically clone fresh to ensure collaborators can access data.

3.2.11 2.11 Integration with Analysis Code

Your analysis repository references the data repository via config/paths.toml:

[paths]
data_root = "/path/to/my-dataset"
rawdata_root = "${data_root}/rawdata"
derivatives_root = "${data_root}/derivatives"
taskdata_root = "${data_root}/taskdata"

The code repo stays lightweight (fast clones, easy CI), while the data repo handles the heavy lifting.

3.2.12 2.12 Further Reading

  • DataLad Handbook: https://handbook.datalad.org
  • git-annex documentation: https://git-annex.branchable.com

3.3 3. DataLad-Container Provenance (Optional)

For sites that want per-subject execution provenance — every input file hash, container SHA256 digest, and command line recorded as a git commit — the template integrates with the datalad-container extension. This is the lightweight alternative to BABS-style full DataLad RIA stores: you get the audit trail without restructuring how your data lives on disk.

3.3.1 3.1 What it does

When USE_DATALAD=1 and DATALAD_CONTAINER_ROOT is set in config/site.conf:

  • find_container <tool> in _load_site_config.sh consults datalad containers-list first (recording the SHA256 digest of the resolved image), falling back to the existing CONTAINER_ROOT -> NEUROCOMMAND_PATH -> CONTAINER_PATH chain if no DataLad image is found.
  • datalad_provenance_wrap wraps each pipeline tool invocation in datalad run, which records the inputs, outputs, container digest, and full command line as a per-subject git commit in the DataLad superdataset.
  • Each provenance commit triggers a CATEGORY_EXECUTION_PROVENANCE event in logs/guardrail_events.jsonl, queryable via make guardrail-summary.

3.3.2 3.2 One-time setup

# Install the extension
pip install datalad-container

# Create a DataLad superdataset to hold your container library
datalad create -c text2git $HOME/dl-containers
cd $HOME/dl-containers

# Pin each tool by version (records SHA256 digest automatically)
datalad containers-add fmriprep --url docker://nipreps/fmriprep:25.2.5
datalad containers-add mriqc    --url docker://nipreps/mriqc:24.0.2
datalad containers-add xcp_d    --url docker://pennlinc/xcp_d:0.10.0
datalad containers-add fitlins  --url docker://poldracklab/fitlins:0.11.0

datalad save -m "pin tool versions for project X"

3.3.3 3.3 Enable the integration

In config/site.conf:

USE_DATALAD="1"
DATALAD_CONTAINER_ROOT="$HOME/dl-containers"

That is the entire opt-in. All existing pipeline scripts now resolve containers via DataLad first and wrap their invocations in datalad run. When you re-run the same subject through the same template version, the run is reproduced from the recorded provenance rather than re-executed (cache hit on the DataLad commit).
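
For orientation, what find_container plus datalad_provenance_wrap achieve is roughly equivalent to a manual datalad containers-run call like the one below; the container name, paths, and fMRIPrep arguments are placeholders, and the template's wrapper issues the real call for you:

datalad containers-run -n fmriprep \
  -m "fmriprep sub-01" \
  --input rawdata/sub-01 \
  --output derivatives/fmriprep/sub-01 \
  -- rawdata derivatives participant --participant-label 01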

3.3.4 3.4 What this gets you

  • Container hash locking: every image is referenced by SHA256, not by :latest or :25.2.5 tags that might drift.
  • Per-subject audit trail: git log in the DataLad superdataset shows every fMRIPrep run with its exact inputs and command.
  • Bit-reproducible reruns: datalad rerun <commit> re-executes the recorded command with the recorded inputs, hash-verifying the container.
  • Cross-site portability: ship the DataLad superdataset to a collaborator and they reproduce your runs exactly.

This integration is opt-in: when USE_DATALAD=0 (the default), the template behaves exactly as before and DataLad is not even a required dependency. The choice is yours, made per-site in site.conf.

3.3.5 3.5 Relationship to BABS

BABS (Zhao et al. 2024, Imaging Neuroscience) is the gold standard for execution provenance in neuroimaging — it builds on the same DataLad substrate but goes much further, packaging each subject’s full input dataset, container, and outputs into a per-subject DataLad RIA branch. If you need that level of audit trail (e.g., for an HCP-scale dataset where you must regenerate any subject from scratch given only the commit hash), use BABS.

The template’s datalad-container integration covers a different use case: you want container-hash locking and a record of every run’s command and inputs, without committing to BABS’s data- management architecture for your study. The two are not mutually exclusive — a lab can adopt both.