Skip to content

NeuroJSON/siamize

Repository files navigation

siamize - native C++/ONNX port of SIAM v0.3 brain segmentation

ci


Table of Contents


Overview

A native, vendor-neutral port of SIAM v0.3 — the Segment It All Model for head/brain tissue segmentation — that runs without PyTorch, nnU-Net, or torchio at deployment time.

siamize ships:

  1. A slim Python reference (py/siam_ref.py) that reproduces SIAM's inference using only PyTorch + numpy + nibabel + scipy + dynamic_network_architectures. No nnU-Net, no torchio, no SimpleITK.
  2. An ONNX export pipeline (tools/onnx_export/) that converts each fold of the SIAM v0.3 ResEnc-UNet to fp16 .onnx, validating against (1).
  3. A C++ standalone binary (src/) — 232 KB executable + 23 MB libonnxruntime.so + per-fold 270 MB .onnx — drop-in for siam-pred with no Python at runtime.

Accuracy vs original SIAM on the bundled sub-01_T1w.nii.gz (5-fold ensemble, 18 classes):

Pipeline Voxel agreement Worst per-class Dice
py/siam_ref.py 99.989% 0.9990
tools/onnx_export/siam_ort.py 99.989% 0.9990
C++ binary (build/siamize) 99.715% 0.9697 (Anomalies, 17 voxels)

Quickstart

1. Fetch dependencies

scripts/fetch_deps.sh        # downloads ORT prebuilt + clones nifti_clib into third_party/
git submodule update --init  # pulls the bundled jsonlab under matlab/jsonlab (only needed for the MATLAB/Octave wrapper; see below)

The fetch script auto-detects the host (Linux x64, Linux aarch64, macOS x86_64, macOS arm64, Windows x64) and pulls the right ORT prebuilt. On Windows, run it from Git Bash (or any POSIX shell — Git for Windows ships bash, curl, and tar with .zip support out of the box). For native Windows users without a POSIX shell, the equivalent PowerShell script is also provided:

scripts\fetch_deps.ps1

The fp16 ONNX fold weights are not fetched up-front: siamize and its MATLAB/Octave wrapper auto-download any missing fold from NeuroJSON (URL prefix https://neurojson.org/io/stat.cgi?action=get&db=siam_v03&doc=dynshape&file=, overridable via SIAMIZE_WEIGHTS_BASE_URL) into a shared cache ($SIAMIZE_CACHE_DIR, default $HOME/.cache/siamize/models/ on POSIX or %LOCALAPPDATA%/siamize/models/ on Windows). One download serves both the CLI binary and the MEX. The default doc=dynshape variant has dynamic spatial axes (any patch size); doc=coreml ships an InstanceNorm-rewritten variant for Apple's Core ML EP (see the CoreML section below), and the resolver auto-picks the right one based on -c. If you want all five folds pre-staged before going offline, run:

scripts/fetch_weights.sh     # downloads the 5 fp16 .onnx folds (~1.35 GB) into models/

2. Build the C++ binary

CPU-only (default):

make            # convenience target -- wraps cmake configure + build
# or, equivalently:
cmake -S . -B build && cmake --build build -j

This produces build/siamize. libonnxruntime.so.1 is located by RPATH: the binary looks first in $ORIGIN (next to itself) and then in third_party/onnxruntime/lib/ (the development tree), so you can either drop the .so next to the binary for distribution or run from a fresh checkout without setting LD_LIBRARY_PATH.

Optional: NVIDIA GPU build (CUDA Execution Provider)

make cuda       # re-fetches GPU ORT prebuilt (only if needed) + configures + builds

That's the convenience shortcut. The equivalent explicit form:

rm -rf third_party/onnxruntime build
ORT_BUILD=gpu scripts/fetch_deps.sh                  # default = CUDA 12.x build
# or, if your NVIDIA driver is CUDA 13:
# ORT_BUILD=gpu ORT_CUDA=13 scripts/fetch_deps.sh
cmake -S . -B build -DSIAMIZE_GPU=cuda
cmake --build build -j

The binary accepts -c {auto,cpu,cuda,tensorrt,coreml} (default auto). On auto siamize prefers (in order): TensorRT EP (if compiled in), CUDA EP (if compiled in), Apple Core ML EP (if compiled in on macOS), then CPU. The EP probe is graceful — if the runtime libraries (libcudart, libcudnn, libcublasLt, ...) can't be loaded, siamize falls back to CPU with a [cuda] unavailable-style log line. Pass -c cuda / -c tensorrt / -c coreml to force a specific EP and fail loudly if it isn't available; -c cpu skips GPU even when the build includes it.

On a multi-GPU host, -G N selects the device; the index matches nvidia-smi -L.

CUDA runtime libraries are loaded via dlopen, so you may need to set LD_LIBRARY_PATH to include their location. With PyTorch-managed CUDA (pip's nvidia-* packages):

NV=$(python3 -c "import nvidia; print(nvidia.__path__[0])")
export LD_LIBRARY_PATH="$NV/cublas/lib:$NV/cuda_runtime/lib:$NV/cudnn/lib:$NV/cufft/lib:$NV/curand/lib:$NV/cuda_nvrtc/lib:$NV/nvjitlink/lib:$LD_LIBRARY_PATH"
build/siamize -i ... -c cuda ...

With a system CUDA install, point at it via the standard CUDA_HOME env var (set by the NVIDIA installer on most distros, otherwise default /usr/local/cuda):

export CUDA_HOME=${CUDA_HOME:-/usr/local/cuda}
export LD_LIBRARY_PATH="$CUDA_HOME/lib64:$LD_LIBRARY_PATH"
build/siamize -i ... -c cuda ...

ORT 1.26 requires cuDNN 9 with a kernel image for your GPU's compute capability — older GPUs (e.g., Volta sm_70) may need a cuDNN build that explicitly includes those kernels. If cuDNN was installed separately (the typical NVIDIA flow), make sure its lib64/ is on LD_LIBRARY_PATH too; the official installer drops it next to $CUDA_HOME/lib64/ so the line above usually covers it.

What the prebuilt CUDA bundle ships vs. what you must supply

The siamize-*-cuda.zip artifact produced by CI / make package-cuda contains only what's redistributable: the siamize binary, ORT 1.26 core, and ORT's CUDA EP plugin DLLs. CUDA/cuDNN themselves are not bundled — they're large (~1 GB combined for cuDNN 9 + cuBLAS + cuFFT), and cuDNN's license forbids third-party redistribution.

Component In the zip You install
siamize / siamize.exe
libonnxruntime.so.1 / onnxruntime.dll
libonnxruntime_providers_shared.so / .dll
libonnxruntime_providers_cuda.so / .dll
libcudart (CUDA runtime) CUDA Toolkit, or pip install nvidia-cuda-runtime-cu12
libcublas + libcublasLt CUDA Toolkit, or pip install nvidia-cublas-cu12
libcudnn (cuDNN 9 for ORT 1.26) NVIDIA cuDNN 9 installer, or pip install "nvidia-cudnn-cu12==9.*"
libcufft, libcurand, cuda_nvrtc, nvjitlink CUDA Toolkit, or matching nvidia-*-cu12 pip wheels
Required shared libraries by exact filename

ORT 1.26's CUDA EP plugin (libonnxruntime_providers_cuda.so) is dlopen'd at runtime. It in turn pulls in the following dependencies at the exact major version:

Library (Linux) Library (Windows) Provided by
libcudart.so.12 cudart64_12.dll CUDA 12 runtime
libcublas.so.12 cublas64_12.dll CUDA 12
libcublasLt.so.12 cublasLt64_12.dll CUDA 12
libcufft.so.11 cufft64_11.dll CUDA 12 (cuFFT is independently versioned)
libcurand.so.10 curand64_10.dll CUDA 12 (cuRAND ditto)
libcudnn.so.9 cudnn64_9.dll cuDNN 9
libcudnn_ops.so.9 cudnn_ops64_9.dll cuDNN 9
libcudnn_cnn.so.9 cudnn_cnn64_9.dll cuDNN 9
libcudnn_engines_precompiled.so.9 cudnn_engines_precompiled64_9.dll cuDNN 9
libnvrtc.so.12 nvrtc64_120_0.dll CUDA 12 nvrtc
libnvJitLink.so.12 nvJitLink_120_0.dll CUDA 12 nvjitlink

If any one of these is missing or its loader path is wrong, siamize's -c auto probe falls back to CPU and the [warn] ORT: line names the first missing file — e.g.:

[warn]     ORT: Failed to load library .../libonnxruntime_providers_cuda.so
           with error: libcublasLt.so.12: cannot open shared object file
[cuda]     unavailable (...); using CPU

Means cuBLAS isn't on LD_LIBRARY_PATH. The CUDA EP plugin loaded; the transitive cuBLAS dependency didn't. Add cuBLAS's directory to LD_LIBRARY_PATH (per the pip-wheels or CUDA_HOME one-liners above) and re-run. The probe will then surface the next missing library, if any — fix iteratively until -c auto reports [cuda] enabled (gpuid=0).

To preflight before running siamize, ldd the CUDA EP plugin:

ldd third_party/onnxruntime/lib/libonnxruntime_providers_cuda.so | grep -E "not found|=>"
# every line should resolve; any "not found" entry names a missing lib

On Windows, use dumpbin /dependents (Visual Studio Developer Prompt):

dumpbin /dependents third_party\onnxruntime\lib\onnxruntime_providers_cuda.dll
Windows: pointing siamize.exe at the CUDA runtime DLLs

On Windows the loader uses PATH (not LD_LIBRARY_PATH) to find DLLs. The CUDA Toolkit installer sets the CUDA_PATH env var (e.g. C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.x) and usually prepends %CUDA_PATH%\bin to PATH itself. If siamize.exe reports "cannot find cudart64_12.dll" after unzipping the bundle, force-add it:

# PowerShell
$env:PATH = "$env:CUDA_PATH\bin;" + $env:PATH

# cmd.exe equivalent:
# set PATH=%CUDA_PATH%\bin;%PATH%

.\siamize.exe -i input.nii.gz -o pred.nii.gz -M 0 -c cuda

cuDNN's Windows installer copies its DLLs into %CUDA_PATH%\bin (the default checkbox in the cuDNN MSI), so the same one-liner usually covers cuDNN too.

For a lighter-weight install via pip wheels (no CUDA Toolkit needed):

pip install nvidia-cuda-runtime-cu12 nvidia-cublas-cu12 `
            "nvidia-cudnn-cu12==9.*"   `
            nvidia-cufft-cu12 nvidia-curand-cu12 `
            nvidia-cuda-nvrtc-cu12 nvidia-nvjitlink-cu12

# Prepend the wheel DLL dirs to PATH (Windows equivalent of the
# LD_LIBRARY_PATH one-liner shown above for Linux).
$NV = (python -c "import nvidia; print(nvidia.__path__[0])")
$env:PATH = "$NV\cublas\bin;$NV\cuda_runtime\bin;$NV\cudnn\bin;" `
          + "$NV\cufft\bin;$NV\curand\bin;$NV\cuda_nvrtc\bin;" `
          + "$NV\nvjitlink\bin;" + $env:PATH

.\siamize.exe -i input.nii.gz -o pred.nii.gz -M 0 -c cuda

Note: the pip wheels put their DLLs under bin\ on Windows (vs. lib\ on Linux). Hardware-compatibility caveat is the same as Linux — the pip cuDNN/cuBLAS wheels target sm_75+; older GPUs need the official NVIDIA installer.

Optional: TensorRT EP (advanced, batch-processing workloads only)

For workloads that process hundreds of volumes with the same model/GPU combo, the TensorRT Execution Provider can shave ~35% off CUDA EP wall time. It's an opt-in build:

# Build with TRT enabled (gpu ORT prebuilt also has the TRT provider plugin).
make tensorrt
# equivalent explicit form:
# cmake -S . -B build -DSIAMIZE_GPU=tensorrt && cmake --build build -j

# Install the matching TensorRT Python wheel (ships libnvinfer + per-SM
# builder resources). Pin it to your CUDA runtime version.
pip install --user "tensorrt~=10.0"

# Make TRT libs visible alongside the CUDA libs.
TRT=$(python3 -c "import os, tensorrt_libs; print(os.path.dirname(tensorrt_libs.__file__))")
export LD_LIBRARY_PATH="$TRT:$LD_LIBRARY_PATH"

build/siamize -i input.nii.gz -o output.nii.gz \
    -M models/fold_0_fp16.onnx \
    -c tensorrt \
    --trt-cache-dir $HOME/.cache/siamize/trt

Cost model on a Turing RTX 2080 SUPER (single fold):

Mode Wall time Notes
CUDA EP 13.3 s warm
TRT EP, first run 962 s one-time engine build per fold/GPU/TRT-version
TRT EP, cached 8.7 s ~35 % faster than CUDA, every subsequent run

Correctness: TRT vs CUDA output → 99.97 % voxel agreement, worst per-class Dice 0.997 (fused-kernel rounding only).

Breakeven: amortizing one cold engine build (962 s) against the per-run savings (13.3 − 8.7 = 4.6 s) takes ~209 inferences per fold. For a 5-fold ensemble that's ~209 full-volume runs end-to-end.

Hidden costs:

  • TensorRT Python wheel: ~1 GB on disk (libnvinfer + per-arch builder resources for sm_75…sm_120).
  • Engine cache: 274 MB per fold (1.37 GB for the 5-fold ensemble).
  • Cache invalidation: any change to the ONNX model, the GPU compute capability, or the TRT minor version forces a fresh ~16 min/fold rebuild.

If you're not deploying to a batch server, stick with the default CUDA EP. The TRT path stays available for the lab that needs it.

Optional: Core ML EP (Apple Silicon)

On macOS / Apple Silicon, siamize can run inference through Apple's Core ML stack — CPU + Metal GPU + Neural Engine (ANE), selectable via MLComputeUnits. Core ML is statically linked into ORT 1.26's macOS dylib, so no separate provider plugin to fetch:

make coreml             # builds with -DSIAMIZE_GPU=coreml
build/siamize -i input.nii.gz -o pred.nii.gz -M 0 -c coreml
# or just:
build/siamize -i input.nii.gz -o pred.nii.gz -M 0       # -c auto picks CoreML on macOS

CoreML-relevant flags:

Flag Default Effect
--coreml-units {all|cpune|cpugpu|cpu} all Which hardware Core ML can route ops to. all = CPU + Metal GPU + ANE.
--coreml-cache-dir P ~/.cache/siamize/coreml Per-host .mlmodelc compile cache. First run takes ~10-30 s to compile; subsequent runs hit the cache.
--coreml-static-shapes 0|1 1 RequireStaticInputShapes knob. Default 1 pairs with the doc=coreml fixed-shape ONNX.

Important: SIAM v0.3's encoder/decoder uses 3D InstanceNorm, which Core ML's MLProgram format rejects (rank-5 input; spec supports only rank 3-4). siamize ships under doc=coreml an InstanceNorm-rewritten ONNX where each rank-5 IN is replaced with Reshape → InstanceNorm-rank-3 → Reshape. The math is bit-exact to the original. The siamize weight resolver auto-fetches the doc=coreml variant when CoreML EP is active.

Compile-time RAM peaks roughly:

--coreml-units mlcompilerd peak (estimated)
all (CPU + GPU + ANE) 6-8 GB
cpugpu 4-6 GB
cpu 1-2 GB

If your host is RAM-tight (< 14 GB free), siamize's auto-lowmem heuristic drops to --coreml-units cpu automatically. To benchmark real ANE acceleration, run on a host with ≥ 14 GB free RAM and pass --coreml-units all explicitly.

The tools/onnx_export/rewrite_for_coreml.py script applies the InstanceNorm rewrite to any fp16/fp32 SIAM ONNX; it's how the doc=coreml bundle was produced from the same trained weights as doc=dynshape. Useful if you want to re-export with different fold or quantization choices.

3. Run

# Full 5-fold ensemble (the digit shortcut expands to fold_<N>_fp16.onnx;
# any missing weight auto-downloads into the shared cache).
build/siamize -i input.nii.gz -o output.nii.gz -M 0,1,2,3,4 -v

# Single-fold prediction is also supported:
build/siamize -i input.nii.gz -o output.nii.gz -M 0 -v

# Explicit paths still work alongside shortcuts:
build/siamize -i input.nii.gz -o output.nii.gz \
    -M models/fold_0_fp16.onnx,models/fold_1_fp16.onnx

-t/--thread defaults to 0 = min(hardware_concurrency, 16). The 16-thread cap is empirical (see the "CPU thread tuning" section below for the Threadripper measurements that motivated it). Pass an explicit -t N to override.

Useful flags (CLI quick reference)

The most-used options beyond -i / -o / -M / -c:

Flag Effect
-v / --verbose progress messages (default ON since 0.1.0). Pair with -q / --quiet to silence.
-G N CUDA / TensorRT device id. Matches nvidia-smi -L indices (siamize sets CUDA_DEVICE_ORDER=PCI_BUS_ID).
-t N ORT intra-op thread count. Default min(hc, 16).
-P ZxYxX Sliding-window patch size, default 256x256x192. Smaller patches → lower peak memory, more tiles. Requires dynamic-axes ONNX.
-u S Target isotropic spacing in mm, default 0.75.
-C N|spm Output classes. -C 18 (default, SIAM v0.3) or -C spm to remap to SPM12's 6 TPM channels (GM, WM, CSF, Bone, Soft, Air).
--tpm [0|1] + --tpm-t T Write a 4D float32 tissue probability map (per-voxel softmax over classes) instead of a uint8 labelmap. --tpm-t > 1 softens the softmax.
--upsample Save at the network's internal 0.75 mm canonical-RAS grid instead of resampling back to the input grid.
--shuffle Apply JData _ArrayShuffle_=4 byte-shuffle before zlib on TPM .jnii/.bnii output. 1.5-2.5× smaller payload for spec-compliant readers.
--lowmem Force the full low-memory preset (smaller patch + tighter EP knobs). Auto-applied on hosts with < 14 GB free RAM / < 12 GB free VRAM.
--no-arena Disable ORT's CPU memory arena (saves ~16 GB peak RSS on the 18-class model at ~1.5× wall cost).
--cudnn-max-workspace 0 + --arena-extend same + --cudnn-algo heuristic Tight-VRAM GPU recipe (8 GB consumer cards).
--gpu-mem-limit N[K|M|G] Cap the CUDA EP arena.

Run build/siamize --help for the full list with longer descriptions.

JNIfTI containers (.jnii / .bnii)

In addition to NIfTI-1, siamize reads and writes JNIfTI — the JData/BJData wrapping of NIfTI metadata + voxel data used by the NeuroJSON ecosystem. Input format is inferred from the file extension; output format is selected via -F/--format:

# Text-JSON JNIfTI labelmap (human-readable; zlib + base64 payload)
build/siamize -i input.nii.gz -o labels.jnii -M 0 -F jnii

# BJData binary JNIfTI labelmap (compact binary; zlib raw bytes)
build/siamize -i input.nii.gz -o labels.bnii -M 0 -F bnii

# JNIfTI input (e.g. produced by jsonlab's savejd):
build/siamize -i preproc.bnii -o labels.bnii -M 0 -F bnii
-F Output container Payload Notes
nii (default) NIfTI-1 (.nii, gzipped if .gz) raw bytes Smallest for sparse uint8 labelmaps.
jnii JSON-text JNIfTI zlib + base64 Human-readable; pairs with loadjd / jsonlab.
bnii BJData binary JNIfTI zlib (raw bytes) Compact binary; pairs with the NeuroJSON ecosystem.

For label volumes specifically, .nii.gz is usually the tightest on disk — raw gzip already exploits per-voxel value redundancy efficiently. JNIfTI's value here is interoperability with JSON / JData tooling (jsonlab, jdataencode / jdatadecode, browser-based viewers, the NeuroJSON database), not raw compression. Voxel-identical round-tripping has been verified for .nii.gz.jnii.bnii inputs on the bundled sub-01_T1w.nii.gz (100 % agreement, 5-fold ensemble).

.jnii / .bnii labelmap and TPM outputs additionally carry a JGIFTI-style LabelTable at NIFTIHeader._DataInfo_.LabelTable — anatomical names + per-tissue RGBA colors keyed by label ID. Two presets are emitted automatically: the full SIAM v0.3 18-class dictionary (default), or the SPM12 6-class TPM dictionary (GM, WM, CSF, Bone, Soft, Air) when --classes spm is set. Viewers that honour the JGIFTI spec render tissue names and colors instead of an unlabeled colormap. NIfTI-1 output (.nii / .nii.gz) doesn't carry the LabelTable since the format has no extensible JSON header.

For TPM output (--tpm) on the .jnii / .bnii path, opt into byte-shuffle compression with --shuffle. The flag wraps the fp32 payload with a JData _ArrayShuffle_=4 annotation (blosc2-style per-byte plane regrouping before zlib), giving 1.5-2.5× smaller files at no decode cost on JData-spec-compliant readers (siamize itself; future jsonlab versions). Default OFF for interop with current jsonlab which doesn't yet implement the unshuffle.

4. Regression test (optional)

tests/run_regression.sh

Runs the bundled sample through build/siamize and reports voxel agreement vs tests/pred_ref_allfolds.nii.gz.

MATLAB / GNU Octave bindings

The same inference pipeline is callable from MATLAB and Octave through a thin MEX (siamex.mex*) wrapped by a pure-MATLAB dispatcher (matlab/siamize.m). MEX and CLI predictions are bit-identical (they share the siamize_core C++ sources).

Build

CPU-only MEX (works on every host):

# Octave (Linux/macOS):
make mex-octave              # -> matlab/siamex.mex

# MATLAB (Linux/macOS/Windows):
make mex-matlab              # -> matlab/siamex.mexa64 / .mexmaca64 / .mexw64

GPU-enabled MEX variants (same CLI capabilities as the corresponding siamize binary build):

# CUDA EP (NVIDIA, Linux/Windows):
make cudaoct                 # Octave MEX,  -DSIAMIZE_GPU=cuda
make cudamex                 # MATLAB MEX,  -DSIAMIZE_GPU=cuda

# CoreML EP (Apple Silicon, macOS):
make coremloct               # Octave MEX,  -DSIAMIZE_GPU=coreml
make coremlmex               # MATLAB MEX,  -DSIAMIZE_GPU=coreml

All targets drop the .mex* next to matlab/siamize.m so the wrapper finds it via its addpath auto-detection. Equivalent explicit CMake forms (any of the above translates to):

cmake -S . -B build -DSIAMIZE_BUILD_OCTAVE_MEX=ON  [-DSIAMIZE_GPU=cuda|coreml]
# or:
cmake -S . -B build -DSIAMIZE_BUILD_MATLAB_MEX=ON  [-DSIAMIZE_GPU=cuda|coreml]
cmake --build build -j

The bundled jsonlab submodule (matlab/jsonlab/) provides loadjd / savejd / loadnifti / jnii2nii / savejnifti / etc.; siamize.m adds it to the path automatically if it isn't already visible.

Calling format

siamize.m accepts flexible inputs and returns a single jnifti struct (nii.NIFTIHeader + nii.NIFTIData). The same struct is written to disk when an outputfile is supplied.

% one-shot file -> file (defaults: single-fold fold_0, auto-downloaded)
siamize('input.nii.gz', 'labels.nii.gz');

% cross-format: read .nii.gz, write binary JNIfTI, full 5-fold ensemble
siamize('input.nii.gz', 'labels.bnii', 0:4);

% struct input (jnifti or readnifti-style), in-memory labels
nii_in  = loadnifti('input.nii.gz');
nii_out = siamize(nii_in);        % nii_out.NIFTIData is uint8 3D labels

% bare 3D array, default centered affine inferred
nii_out = siamize(my_volume);
nii_out = siamize(my_volume, 0);                       % single fold by shortcut
nii_out = siamize(my_volume, '0,2,4', 'verbose', true);

% explicit affine + output file + ensemble + opts
siamize(my_volume, A, 'labels.nii.gz', 0:4, 'compute', 'cuda');

% multi-GPU box, pick the GPU 1 of N (matches `nvidia-smi -L` index):
siamize('in.nii.gz', 'lab.nii.gz', 0:4, 'compute', 'cuda', 'gpu', 1);

% CoreML EP (Apple Silicon; needs MEX built via `make coremlmex` / coremloct):
siamize('in.nii.gz', 'lab.nii.gz', 0, 'compute', 'coreml');

% TPM mode: nii_out.NIFTIData becomes 4D single (float32) [X, Y, Z, 18]:
nii_tpm = siamize('input.nii.gz', 0:4, 'tpm', true, 'tpm_t', 1.5);
siamize('input.nii.gz', 'tpm.nii.gz', 0:4, 'tpm', true);  % save TPM to disk

% SPM12-style 6-class output (GM, WM, CSF, Bone, Soft, Air):
siamize('input.nii.gz', 'spm.nii.gz', 0, 'classes', 'spm');

% Upsample mode: save at the network's 0.75 mm grid instead of the input grid.
siamize('input.nii.gz', 'pred_hires.nii.gz', 0, 'upsample', true);
First arg Interpretation
'file.{nii,nii.gz,jnii,bnii}' read via loadjd; affine taken from header
jnifti struct (.NIFTIData + .NIFTIHeader.Affine) passthrough
readnifti struct (.img + .hdr.srow_*) passthrough; affine from sform
3D numeric array identity rotation + centered translation synthesized when no affine follows

The models argument accepts numeric indices, char shortcuts, full paths, or mixes thereof: 0, 0:4, '0,2,4', {'0','fold_3_fp16.onnx'}. Output extension picks the writer (.nii[.gz]jnii2nii, .jnii/.bniisavejnifti). The shared weight cache ($SIAMIZE_CACHE_DIR) is reused so a single download serves both the MEX and the CLI binary. Full reference: matlab/README.md.

Tests

make mex-test
# equivalent: octave-cli --no-gui --eval "cd matlab/tests; run_tests('--exit')"

30 unit tests that stub the underlying MEX so they run in under a second and require no ORT or weight files. Covers argument-form dispatch, default-affine math, model-spec parsing, file-in/file-out across the four extensions, source-header preservation, and the error paths. CI runs the same suite on both Octave and MATLAB legs.

Layers, in dependency order

py/siam_ref.py          # slim PyTorch reference, used to validate (2)
                              │
                              v
tools/onnx_export/      # PyTorch → fp16 .onnx; uses py/siam_ref to verify
                              │
                              v
src/  +  CMakeLists.txt # C++ standalone with ONNX Runtime, uses .onnx from (2)
        │       │
        │       └───>  build/siamize          # CLI binary
        │
        └─────────>    build/siamex.mex*      # MATLAB / Octave MEX
                              │               (shares siamize_core sources)
                              v
                       matlab/siamize.m       # pure-MATLAB dispatcher
                       matlab/jsonlab/        # bundled NeuroJSON jsonlab (submodule)
                       matlab/tests/          # Octave + MATLAB unit tests

Footprint

Artifact Size
build/siamize binary (static-linked C++/zlib/OpenMP) ~2-3 MB
matlab/siamex.mex (Octave MEX, dynamic libstdc++) ~200 KB
matlab/siamex.mexa64 (MATLAB MEX, static libstdc++) ~3-4 MB
libonnxruntime.so.1.26.0 23 MB
One fold .onnx (fp16) 270 MB
Five folds 1.35 GB
Single-fold deployable bundle ≈295 MB

vs. the original SIAM stack: multi-GB PyTorch + nnU-Net + torchio install, plus 5.4 GB checkpoints.

Runtime shared-library dependencies of siamize

libonnxruntime.so.1    # the only non-glibc dep; bundled with the binary
libm.so.6              # glibc
libc.so.6              # glibc
ld-linux-x86-64.so.2   # glibc

libstdc++, libgcc, libgomp, libz and nifti_clib are all statically linked into the binary (CMake option SIAMIZE_STATIC_LINK=ON, the default). The libstdc++ / libgcc / libpthread etc. that show up in ldd output are transitive deps of libonnxruntime.so.1, not of siamize itself — verifiable via readelf -d build/siamize | grep NEEDED. ONNX Runtime ships only as a .so (Microsoft does not provide a static .a); building ORT from source statically is possible but a substantial undertaking and not done by default here.

To toggle the static linking, pass -DSIAMIZE_STATIC_LINK=OFF to CMake.

Platforms

The C++ code is portable C++17 and the build is CMake-driven. CI builds the binary on all three:

Host Toolchain Static-linked C/C++ runtime?
Linux x86_64 / aarch64 GCC (Apt) yes (-static-libstdc++ -static-libgcc, static libgomp.a, static libz.a)
macOS x86_64 / arm64 Apple clang partial: relies on libc++ (ABI-stable on macOS); OpenMP via Homebrew libomp
Windows x64 MSVC yes (/MT static CRT); onnxruntime.dll copied next to siamize.exe

On every platform the binary ships with libonnxruntime (.so / .dylib / .dll) sitting next to it; everything else statically linkable is statically linked by default. Set -DSIAMIZE_STATIC_LINK=OFF to keep things dynamic.

The MATLAB / Octave MEX (siamex.mex*) is exercised by CI on linux-octave, linux-matlab, and windows-matlab matrix legs; on Linux the MATLAB MEX statically embeds libstdc++ (to escape MATLAB's older bundled libstdc++.so.6) while the Octave MEX stays dynamic (static-linking would conflict with Octave's already-loaded C++ runtime).

Locally tested: Linux x86_64. macOS / Windows are exercised by CI (see .github/workflows/ci.yml); please open an issue if a host setup breaks.

Performance

All measurements use the bundled tests/sub-01_T1w.nii.gz (160×192×192, 1.0/1.333/1.333 mm) running siamize's 5-fold ensemble with models/fold_*_fp16.onnx.

CPU (TITAN V workstation, x86, 8 threads)

Run Time
C++ 5-fold ensemble (siamize -c cpu) 634 s (10.5 min)
C++ single fold 126 s
Python ORT 5-fold 781 s (13 min)
Original siam-pred 5-fold CPU (per upstream README) ~25 min

CPU thread tuning on many-core hosts (Zen2 / EPYC / Threadripper)

ORT's CPU Execution Provider does not scale linearly past ~16 threads on this workload. Profiling on an AMD Threadripper 3990X (Zen2, 64C/128T, single NUMA node) gives a clear shape:

-t Wall (s) CPU% Avg cores busy
4 (no auto) 300.2 331 % 3.3
8 206.4 540 % 5.4
12 171.6 694 % 6.9
16 158.9 840 % 8.4 ← optimum
32 160.4 1098 % 11.0
48 162.2 1372 % 13.7
64 165.1 1632 % 16.3
0 (=128) 178.4 2674 % 26.7

Zen2's 16-cores-per-CCD topology is the reason: -t 16 keeps the active hot set inside one CCD's 64 MB L3 and one memory controller. At 32+ threads the active set spills across CCDs through the I/O die, which costs roughly what the extra parallelism gains. At 128 the contention wins outright.

To match this, siamize -t 0 (the default) now resolves to min(hardware_concurrency, 16) rather than full hardware_concurrency. On laptops / small workstations (≤16 cores) this is a no-op; on big servers it stops siamize from oversubscribing ORT's thread pool. Pass -t N explicitly to override.

Memory-arena trade-off

siamize -c cpu keeps ORT's CPU memory arena + memory-pattern optimizer on by default. On the same Threadripper run this buys 1.5× wall-time (273.6 → 178.3 s on -t 0) at the cost of peak RSS growing from 12.4 GB → 28.3 GB for the 18-class network. Profiling traced the off-arena path's slowdown to 75 M minor page faults and a 43 % dTLB miss rate from per-op mmap/munmap churn.

If you'd rather trade speed for memory, pass --no-arena on the CLI (or engine_tuning.cpu_arena = false from the MEX). The -v header surfaces the choice (--no-arena is appended when the arena is disabled).

NVIDIA GPU (siamize built with -DSIAMIZE_GPU=cuda)

Run GPU Time vs CPU C++
Single fold RTX 2080 Super (Turing sm_75, 8 GB) 13.3 s (±0.04 s, n=3) 9.5×
5-fold ensemble RTX 2080 Super (Turing sm_75, 8 GB) 58.5 s ~11×
Single fold A100-PCIE-40GB (Ampere sm_80, 40 GB) 9.8 s ~13×

Correctness: the Turing 5-fold output matches the Phase-1 PyTorch reference at 99.7167% voxel agreement — identical to the CPU C++ result. Switching to the CUDA Execution Provider does not introduce additional numerical drift on top of fp16 ONNX + cubic-Hermite resampling.

GPU memory: the full 5-fold run fits on the 8 GB RTX 2080 Super with no OOM. Estimated peak total ≈ 4–6 GB (model weights + held activations + cuDNN workspace + output). For low-VRAM cards a gpu_mem_limit knob can be wired in; in practice 8 GB has been sufficient.

Engine choice / GPU portability

The C++ binary uses ONNX Runtime with one of four execution providers, selected by -DSIAMIZE_GPU=<backend> at build time and by -c at run time. All four are wired in and CI-tested:

-DSIAMIZE_GPU= -c Hardware Status
none (default) cpu any CPU always available
cuda cuda NVIDIA via ORT CUDA EP CI: Linux + Windows builds
tensorrt tensorrt NVIDIA via ORT TensorRT EP (+ CUDA fallback) opt-in build
coreml coreml Apple Silicon CPU + Metal GPU + ANE via ORT CoreML EP CI: macos-14 build

On a given host you build with the backend(s) you want; siamize's -c auto then picks the best available one at run time (TRT > CUDA > CoreML > CPU). MEX builds have matching cudamex / cudaoct / coremlmex / coremloct targets so MATLAB / Octave callers get the same EP coverage.

ORT does also offer DirectML (any DX12 GPU on Windows), OpenVINO (Intel CPU/GPU), and ROCm (AMD) EPs — siamize doesn't wire them up today, but adding any of them is a CMake + sliding.cpp probe-block change rather than a code-change. The CoreML wiring (sliding.cpp's #ifdef SIAMIZE_HAS_COREML block + CMakeLists.txt GPU-backend switch) is the cleanest template if you want to send a PR for one of these.

For a vendor-neutral GPU path on Linux (Vulkan / OpenCL) the same .onnx files can feed MNN (OpenCL) or TVM (Vulkan / OpenCL / SPIR-V). Initial exploration of ncnn found its Vulkan backend lacks 3D conv kernels for this model.

Known precision gap (~0.3% vs Python)

The C++ pipeline uses cubic Catmull-Rom (3rd-order Hermite) for the forward image resample. scipy/skimage use cubic B-spline (also 3rd-order, different basis with a pre-filter step). The two give visually identical output but differ at fp32-noise level on the network input, which propagates into ~0.27% boundary voxel disagreements after argmax. If sub-percent precision matters, a scipy-compatible cubic B-spline resampler is the obvious next upgrade (~150 lines of standard code).

Citation

If you use siamize in your work, please cite the original SIAM paper:

Valabregue, R., Khemir, I., Bardinet, E., Rousseau, F., Auzias, G., & Dorent, R. (2026). SIAM: Head and Brain MRI Segmentation from Few High-Quality Templates via Synthetic Training. arXiv:2605.02737. https://arxiv.org/abs/2605.02737

BibTeX:

@article{valabregue2026siam,
  title   = {SIAM: Head and Brain MRI Segmentation from Few High-Quality Templates via Synthetic Training},
  author  = {Valabregue, Romain and Khemir, Ikram and Bardinet, Eric and Rousseau, Francois and Auzias, Guillaume and Dorent, Reuben},
  year    = {2026},
  journal = {arXiv preprint arXiv:2605.02737},
  url     = {https://arxiv.org/abs/2605.02737}
}

Optional: citing this port

If you are required to also cite the specific software port (e.g., a journal that asks for the inference tool you used), you may additionally reference siamize:

@software{siamize,
  title  = {siamize: native C++/ONNX port of SIAM v0.3 brain segmentation},
  author = {Fang, Qianqian},
  year   = {2026},
  url    = {https://github.com/NeuroJSON/siamize}
}

This is secondary — please always cite the SIAM paper above first.

Credits

siamize is a port of SIAM v0.3 by Valabregue, Khemir, Bardinet, Rousseau, Auzias & Dorent (2026), and reuses the published SIAM v0.3 weights without modification.

Bundled third-party code

  • zmat by Qianqian Fang — the single-header amalgamation src/zmat/zmat.h provides all .nii.gz compression and decompression. zmat is part of the NeuroJSON project, supported by US NIH grant U24-NS124027. Upstream zmat is GPL-3.0; this single file has been dual-licensed under Apache-2.0 for siamize by the zmat author, as documented in the file's header. Inside zmat:
    • miniz by Rich Geldreich — public-domain (Unlicense) zlib-subset deflate/inflate.
  • nlohmann/json (NeuroJSON fork) — the single-header src/nlohmann/json.hpp provides the JSON / BJData parser and serializer used for .jnii / .bnii I/O. MIT-licensed. The NeuroJSON fork extends upstream nlohmann/json with BJData Draft 4 SOA (structure-of-arrays) support; the JData annotated-array spec (_ArrayType_, _ArrayZipData_, etc.) is implemented by src/jnifti_io.cpp on top of it. The header is vendored from the umcx project by Qianqian Fang — a single-source OpenMP / OpenACC Monte Carlo photon simulator — where the same header serves the same JData / BJData container role; keeping a single vendored copy across NeuroJSON-family C/C++ tools (umcx, siamize, …) avoids divergence.

Standards implemented

  • JNIfTI specification by Qianqian Fang — defines the JSON / BJData wrapping of NIfTI-1/2 metadata and voxel data used by siamize for .jnii / .bnii containers. JNIfTI is part of the NeuroJSON project; the underlying JData annotated-array encoding (_ArrayType_, _ArraySize_, _ArrayZipData_, …) and BJData binary container are implemented in src/jnifti_io.cpp on top of the vendored nlohmann/json header described above.

  • JGIFTI specification by Qianqian Fang — defines the JSON encoding of GIFTI surface and segmentation metadata, also part of the NeuroJSON project. siamize borrows JGIFTI's LabelTable schema (object form keyed by stringified integer label IDs, each entry holding Label + RGBA) to attach anatomical names + per-tissue colors to the .jnii / .bnii labelmap output at NIFTIHeader._DataInfo_.LabelTable. Two presets are emitted: the full SIAM v0.3 18-class dictionary, and the SPM12 6-class TPM dictionary when --classes spm is set.

Test data

The bundled test image tests/sub-01_T1w.nii.gz is the sub-01 anatomical T1-weighted scan from OpenNeuro ds000001 v1.0.0, redistributed here under its original CC0 public-domain dedication. See tests/README.md for details.

About

Native C++/ONNX port of SIAM v0.3 brain segmentation

Resources

License

Code of conduct

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors