- Copyright: (C) Qianqian Fang (2026) <q.fang at neu.edu>
- License: Apache License, Version 2.0
- Version: 0.1.0
- GitHub: https://github.com/NeuroJSON/siamize
- Upstream: https://github.com/romainVala/SIAM — SIAM v0.3 by Valabregue, Khemir, Bardinet, Rousseau, Auzias & Dorent (2026), arXiv:2605.02737
- Overview
- Quickstart
- MATLAB / GNU Octave bindings
- Layers, in dependency order
- Footprint
- Platforms
- Performance
- Engine choice / GPU portability
- Known precision gap (~0.3% vs Python)
- Citation
- Credits
A native, vendor-neutral port of SIAM v0.3 — the Segment It All Model for head/brain tissue segmentation — that runs without PyTorch, nnU-Net, or torchio at deployment time.
siamize ships:
1. A slim Python reference (py/siam_ref.py) that reproduces SIAM's inference using only PyTorch + numpy + nibabel + scipy + dynamic_network_architectures. No nnU-Net, no torchio, no SimpleITK.
2. An ONNX export pipeline (tools/onnx_export/) that converts each fold of the SIAM v0.3 ResEnc-UNet to fp16 .onnx, validating against (1).
3. A C++ standalone binary (src/): a 232 KB executable + 23 MB libonnxruntime.so + a 270 MB .onnx per fold. It is a drop-in for siam-pred with no Python at runtime.
Accuracy vs original SIAM on the bundled sub-01_T1w.nii.gz
(5-fold ensemble, 18 classes):
| Pipeline | Voxel agreement | Worst per-class Dice |
|---|---|---|
| py/siam_ref.py | 99.989% | 0.9990 |
| tools/onnx_export/siam_ort.py | 99.989% | 0.9990 |
| C++ binary (build/siamize) | 99.715% | 0.9697 (Anomalies, 17 voxels) |
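The two metrics in the table can be computed from a pair of label volumes. The sketch below is a minimal illustration, assuming both volumes have already been flattened to equal-length sequences of integer labels. The function names are hypothetical and are not part of siamize or SIAM.

```python
def voxel_agreement(ref, pred):
    """Fraction of voxels where the two labelmaps carry the same label."""
    if len(ref) != len(pred):
        raise ValueError("labelmaps must have the same number of voxels")
    same = sum(1 for r, p in zip(ref, pred) if r == p)
    return same / len(ref)


def per_class_dice(ref, pred, labels):
    """Dice = 2*|A and B| / (|A| + |B|) per label; None if absent from both."""
    scores = {}
    for k in labels:
        a = sum(1 for r in ref if r == k)
        b = sum(1 for p in pred if p == k)
        both = sum(1 for r, p in zip(ref, pred) if r == k and p == k)
        scores[k] = (2.0 * both / (a + b)) if (a + b) else None
    return scores


# Toy 8-voxel example (made-up labels, not SIAM output).
ref = [0, 0, 1, 1, 2, 2, 2, 2]
pred = [0, 0, 1, 2, 2, 2, 2, 2]
print(voxel_agreement(ref, pred))
print(per_class_dice(ref, pred, labels=[0, 1, 2]))
```

With a very small class, a handful of voxels moves Dice a lot, which is consistent with overall agreement staying above 99.7% while the worst per-class Dice drops to 0.9697.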
scripts/fetch_deps.sh # downloads ORT prebuilt + clones nifti_clib into third_party/
git submodule update --init # pulls the bundled jsonlab under matlab/jsonlab (only needed for the MATLAB/Octave wrapper; see below)

The fetch script auto-detects the host (Linux x64, Linux aarch64, macOS
x86_64, macOS arm64, Windows x64) and pulls the right ORT prebuilt. On
Windows, run it from Git Bash (or any POSIX shell — Git for Windows ships
bash, curl, and tar with .zip support out of the box). For native
Windows users without a POSIX shell, the equivalent PowerShell script is
also provided:
scripts\fetch_deps.ps1

The fp16 ONNX fold weights are not fetched up-front: siamize and
its MATLAB/Octave wrapper auto-download any missing fold from
NeuroJSON (URL prefix
https://neurojson.org/io/stat.cgi?action=get&db=siam_v03&doc=dynshape&file=,
overridable via SIAMIZE_WEIGHTS_BASE_URL) into a shared cache ($SIAMIZE_CACHE_DIR,
default $HOME/.cache/siamize/models/ on POSIX or
%LOCALAPPDATA%/siamize/models/ on Windows). One download serves both
the CLI binary and the MEX. The default doc=dynshape variant has
dynamic spatial axes (any patch size); doc=coreml ships an
InstanceNorm-rewritten variant for Apple's Core ML EP (see the CoreML
section below), and the resolver auto-picks the right one based on
-c. If you want all five folds pre-staged before going offline, run:
scripts/fetch_weights.sh # downloads the 5 fp16 .onnx folds (~1.35 GB) into models/

CPU-only (default):
make # convenience target -- wraps cmake configure + build
# or, equivalently:
cmake -S . -B build && cmake --build build -j

This produces build/siamize. libonnxruntime.so.1 is located by RPATH:
the binary looks first in $ORIGIN (next to itself) and then in
third_party/onnxruntime/lib/ (the development tree), so you can either
drop the .so next to the binary for distribution or run from a fresh
checkout without setting LD_LIBRARY_PATH.
make cuda # re-fetches GPU ORT prebuilt (only if needed) + configures + builds

That's the convenience shortcut. The equivalent explicit form:
rm -rf third_party/onnxruntime build
ORT_BUILD=gpu scripts/fetch_deps.sh # default = CUDA 12.x build
# or, if your NVIDIA driver is CUDA 13:
# ORT_BUILD=gpu ORT_CUDA=13 scripts/fetch_deps.sh
cmake -S . -B build -DSIAMIZE_GPU=cuda
cmake --build build -j

The binary accepts -c {auto,cpu,cuda,tensorrt,coreml} (default
auto). On auto siamize prefers (in order): TensorRT EP (if
compiled in), CUDA EP (if compiled in), Apple Core ML EP (if compiled
in on macOS), then CPU. The EP probe is graceful — if the runtime
libraries (libcudart, libcudnn, libcublasLt, ...) can't be
loaded, siamize falls back to CPU with a [cuda] unavailable-style
log line. Pass -c cuda / -c tensorrt / -c coreml to force a
specific EP and fail loudly if it isn't available; -c cpu skips
GPU even when the build includes it.
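The selection rules described above can be sketched as follows. This is an illustration only, not siamize's actual probe code: the function name and the two input sets (providers compiled in, providers whose runtime libraries load) are assumptions made for the example.

```python
# Priority order for "-c auto", as stated in the text.
PRIORITY = ["tensorrt", "cuda", "coreml", "cpu"]


def pick_provider(requested, compiled_in, loadable):
    """Return the execution provider to use.

    requested   -- value passed to -c: "auto" or one of PRIORITY
    compiled_in -- providers built into the binary (hypothetical input)
    loadable    -- providers whose runtime libraries load (hypothetical input)
    """
    usable = (set(compiled_in) & set(loadable)) | {"cpu"}  # CPU always works
    if requested == "auto":
        # Walk the priority list; fall back gracefully down to CPU.
        return next(ep for ep in PRIORITY if ep in usable)
    if requested not in PRIORITY:
        raise ValueError(f"unknown provider: {requested}")
    if requested not in usable:
        # A forced provider fails loudly instead of falling back.
        raise RuntimeError(f"{requested} requested but not available")
    return requested


print(pick_provider("auto", {"cuda"}, set()))               # CUDA libs missing
print(pick_provider("auto", {"tensorrt", "cuda"}, {"cuda"}))
```

Keeping "auto" forgiving and explicit choices strict means a batch job that asks for a GPU never silently runs on the CPU.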
On a multi-GPU host, -G N selects the device; the index matches
nvidia-smi -L.
CUDA runtime libraries are loaded via dlopen, so you may need to set
LD_LIBRARY_PATH to include their location. With PyTorch-managed CUDA
(pip's nvidia-* packages):
NV=$(python3 -c "import nvidia; print(nvidia.__path__[0])")
export LD_LIBRARY_PATH="$NV/cublas/lib:$NV/cuda_runtime/lib:$NV/cudnn/lib:$NV/cufft/lib:$NV/curand/lib:$NV/cuda_nvrtc/lib:$NV/nvjitlink/lib:$LD_LIBRARY_PATH"
build/siamize -i ... -c cuda ...

With a system CUDA install, point at it via the standard CUDA_HOME env
var (set by the NVIDIA installer on most distros, otherwise default
/usr/local/cuda):
export CUDA_HOME=${CUDA_HOME:-/usr/local/cuda}
export LD_LIBRARY_PATH="$CUDA_HOME/lib64:$LD_LIBRARY_PATH"
build/siamize -i ... -c cuda ...

ORT 1.26 requires cuDNN 9 with a kernel image for your GPU's compute
capability — older GPUs (e.g., Volta sm_70) may need a cuDNN build that
explicitly includes those kernels. If cuDNN was installed separately (the
typical NVIDIA flow), make sure its lib64/ is on LD_LIBRARY_PATH too;
the official installer drops it next to $CUDA_HOME/lib64/ so the line
above usually covers it.
The siamize-*-cuda.zip artifact produced by CI / make package-cuda
contains only what's redistributable: the siamize binary, ORT 1.26
core, and ORT's CUDA EP plugin DLLs. CUDA/cuDNN themselves are not
bundled — they're large (~1 GB combined for cuDNN 9 + cuBLAS + cuFFT),
and cuDNN's license forbids third-party redistribution.
| Component | In the zip | You install |
|---|---|---|
| siamize / siamize.exe | ✅ | - |
| libonnxruntime.so.1 / onnxruntime.dll | ✅ | - |
| libonnxruntime_providers_shared.so / .dll | ✅ | - |
| libonnxruntime_providers_cuda.so / .dll | ✅ | - |
| libcudart (CUDA runtime) | ❌ | CUDA Toolkit, or pip install nvidia-cuda-runtime-cu12 |
| libcublas + libcublasLt | ❌ | CUDA Toolkit, or pip install nvidia-cublas-cu12 |
| libcudnn (cuDNN 9 for ORT 1.26) | ❌ | NVIDIA cuDNN 9 installer, or pip install "nvidia-cudnn-cu12==9.*" |
| libcufft, libcurand, cuda_nvrtc, nvjitlink | ❌ | CUDA Toolkit, or matching nvidia-*-cu12 pip wheels |
ORT 1.26's CUDA EP plugin (libonnxruntime_providers_cuda.so) is
dlopen'd at runtime. It in turn pulls in the following dependencies
at the exact major version:
| Library (Linux) | Library (Windows) | Provided by |
|---|---|---|
| libcudart.so.12 | cudart64_12.dll | CUDA 12 runtime |
| libcublas.so.12 | cublas64_12.dll | CUDA 12 |
| libcublasLt.so.12 | cublasLt64_12.dll | CUDA 12 |
| libcufft.so.11 | cufft64_11.dll | CUDA 12 (cuFFT is independently versioned) |
| libcurand.so.10 | curand64_10.dll | CUDA 12 (cuRAND ditto) |
| libcudnn.so.9 | cudnn64_9.dll | cuDNN 9 |
| libcudnn_ops.so.9 | cudnn_ops64_9.dll | cuDNN 9 |
| libcudnn_cnn.so.9 | cudnn_cnn64_9.dll | cuDNN 9 |
| libcudnn_engines_precompiled.so.9 | cudnn_engines_precompiled64_9.dll | cuDNN 9 |
| libnvrtc.so.12 | nvrtc64_120_0.dll | CUDA 12 nvrtc |
| libnvJitLink.so.12 | nvJitLink_120_0.dll | CUDA 12 nvjitlink |
If any one of these is missing or its loader path is wrong, siamize's
-c auto probe falls back to CPU and the [warn] ORT: line names
the first missing file — e.g.:
[warn] ORT: Failed to load library .../libonnxruntime_providers_cuda.so
with error: libcublasLt.so.12: cannot open shared object file
[cuda] unavailable (...); using CPU
This log means cuBLAS isn't on LD_LIBRARY_PATH. The CUDA EP plugin loaded;
the transitive cuBLAS dependency didn't. Add cuBLAS's directory to
LD_LIBRARY_PATH (per the pip-wheels or CUDA_HOME one-liners
above) and re-run. The probe will then surface the next missing
library, if any — fix iteratively until -c auto reports
[cuda] enabled (gpuid=0).
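Because the probe names only the first missing file, it can save time to check all of the transitive dependencies in one pass. The following is a minimal sketch, assuming a Linux host and a subset of the sonames from the dependency table above. The helper name is hypothetical, and the check relies on Python's standard ctypes loader rather than on anything siamize provides.

```python
import ctypes

# Linux sonames taken from the dependency table (CUDA 12 / cuDNN 9 build).
CUDA_EP_DEPS = [
    "libcudart.so.12", "libcublas.so.12", "libcublasLt.so.12",
    "libcufft.so.11", "libcurand.so.10", "libcudnn.so.9",
    "libnvrtc.so.12", "libnvJitLink.so.12",
]


def missing_libraries(names, loader=ctypes.CDLL):
    """Return every name that the dynamic loader cannot open."""
    missing = []
    for name in names:
        try:
            loader(name)          # honours LD_LIBRARY_PATH, like dlopen
        except OSError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    for name in missing_libraries(CUDA_EP_DEPS):
        print("not loadable:", name)
```

Run it in the same shell (same LD_LIBRARY_PATH) that will launch siamize; an empty report suggests the loader paths are in place, although it does not check GPU or driver compatibility.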
To preflight before running siamize, ldd the CUDA EP plugin:
ldd third_party/onnxruntime/lib/libonnxruntime_providers_cuda.so | grep -E "not found|=>"
# every line should resolve; any "not found" entry names a missing lib

On Windows, use dumpbin /dependents (Visual Studio Developer Prompt):
dumpbin /dependents third_party\onnxruntime\lib\onnxruntime_providers_cuda.dll

On Windows the loader uses PATH (not LD_LIBRARY_PATH) to find DLLs.
The CUDA Toolkit installer sets the CUDA_PATH env var (e.g.
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.x) and usually
prepends %CUDA_PATH%\bin to PATH itself. If siamize.exe reports
"cannot find cudart64_12.dll" after unzipping the bundle, force-add it:
# PowerShell
$env:PATH = "$env:CUDA_PATH\bin;" + $env:PATH
# cmd.exe equivalent:
# set PATH=%CUDA_PATH%\bin;%PATH%
.\siamize.exe -i input.nii.gz -o pred.nii.gz -M 0 -c cuda

cuDNN's Windows installer copies its DLLs into %CUDA_PATH%\bin (the
default checkbox in the cuDNN MSI), so the same one-liner usually
covers cuDNN too.
For a lighter-weight install via pip wheels (no CUDA Toolkit needed):
pip install nvidia-cuda-runtime-cu12 nvidia-cublas-cu12 `
"nvidia-cudnn-cu12==9.*" `
nvidia-cufft-cu12 nvidia-curand-cu12 `
nvidia-cuda-nvrtc-cu12 nvidia-nvjitlink-cu12
# Prepend the wheel DLL dirs to PATH (Windows equivalent of the
# LD_LIBRARY_PATH one-liner shown above for Linux).
$NV = (python -c "import nvidia; print(nvidia.__path__[0])")
$env:PATH = "$NV\cublas\bin;$NV\cuda_runtime\bin;$NV\cudnn\bin;" `
+ "$NV\cufft\bin;$NV\curand\bin;$NV\cuda_nvrtc\bin;" `
+ "$NV\nvjitlink\bin;" + $env:PATH
.\siamize.exe -i input.nii.gz -o pred.nii.gz -M 0 -c cuda

Note: the pip wheels put their DLLs under bin\ on Windows (vs. lib\
on Linux). Hardware-compatibility caveat is the same as Linux — the pip
cuDNN/cuBLAS wheels target sm_75+; older GPUs need the official NVIDIA
installer.
For workloads that process hundreds of volumes with the same model/GPU combo, the TensorRT Execution Provider can shave ~35% off CUDA EP wall time. It's an opt-in build:
# Build with TRT enabled (gpu ORT prebuilt also has the TRT provider plugin).
make tensorrt
# equivalent explicit form:
# cmake -S . -B build -DSIAMIZE_GPU=tensorrt && cmake --build build -j
# Install the matching TensorRT Python wheel (ships libnvinfer + per-SM
# builder resources). Pin it to your CUDA runtime version.
pip install --user "tensorrt~=10.0"
# Make TRT libs visible alongside the CUDA libs.
TRT=$(python3 -c "import os, tensorrt_libs; print(os.path.dirname(tensorrt_libs.__file__))")
export LD_LIBRARY_PATH="$TRT:$LD_LIBRARY_PATH"
build/siamize -i input.nii.gz -o output.nii.gz \
-M models/fold_0_fp16.onnx \
-c tensorrt \
--trt-cache-dir $HOME/.cache/siamize/trt

Cost model on a Turing RTX 2080 SUPER (single fold):
| Mode | Wall time | Notes |
|---|---|---|
| CUDA EP | 13.3 s | warm |
| TRT EP, first run | 962 s | one-time engine build per fold/GPU/TRT-version |
| TRT EP, cached | 8.7 s | ~35 % faster than CUDA, every subsequent run |
Correctness: TRT vs CUDA output → 99.97 % voxel agreement, worst per-class Dice 0.997 (fused-kernel rounding only).
Breakeven: amortizing one cold engine build (962 s) against the per-run savings (13.3 − 8.7 = 4.6 s) takes ~209 inferences per fold. For a 5-fold ensemble that's ~209 full-volume runs end-to-end.
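The breakeven arithmetic can be reproduced directly from the table. This is only a worked calculation using the three timings reported above; the variable names are illustrative.

```python
import math

cuda_s = 13.3        # CUDA EP, warm, single fold (from the table)
trt_cached_s = 8.7   # TRT EP with a cached engine
trt_build_s = 962.0  # one-time engine build per fold

saving_per_run = cuda_s - trt_cached_s          # seconds saved per fold per run
breakeven_runs = trt_build_s / saving_per_run   # runs needed to repay one build
speedup_pct = 100.0 * saving_per_run / cuda_s   # relative wall-time reduction

print(f"saving per fold-run: {saving_per_run:.1f} s")
print(f"breakeven: {breakeven_runs:.0f} runs "
      f"(first net win at run {math.ceil(breakeven_runs)})")
print(f"cached TRT vs CUDA: {speedup_pct:.1f} % faster")
```

For the 5-fold ensemble both the build cost and the per-run saving scale by five, so the ratio, and therefore the breakeven count in full-volume runs, stays the same.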
Hidden costs:
- TensorRT Python wheel: ~1 GB on disk (libnvinfer + per-arch builder resources for sm_75…sm_120).
- Engine cache: 274 MB per fold (1.37 GB for the 5-fold ensemble).
- Cache invalidation: any change to the ONNX model, the GPU compute capability, or the TRT minor version forces a fresh ~16 min/fold rebuild.
If you're not deploying to a batch server, stick with the default CUDA EP. The TRT path stays available for the lab that needs it.
On macOS / Apple Silicon, siamize can run inference through Apple's
Core ML stack — CPU + Metal GPU + Neural Engine (ANE), selectable
via MLComputeUnits. Core ML is statically linked into ORT 1.26's
macOS dylib, so no separate provider plugin to fetch:
make coreml # builds with -DSIAMIZE_GPU=coreml
build/siamize -i input.nii.gz -o pred.nii.gz -M 0 -c coreml
# or just:
build/siamize -i input.nii.gz -o pred.nii.gz -M 0 # -c auto picks CoreML on macOS

CoreML-relevant flags:
| Flag | Default | Effect |
|---|---|---|
| --coreml-units {all\|cpune\|cpugpu\|cpu} | all | Which hardware Core ML can route ops to. all = CPU + Metal GPU + ANE. |
| --coreml-cache-dir P | ~/.cache/siamize/coreml | Per-host .mlmodelc compile cache. First run takes ~10-30 s to compile; subsequent runs hit the cache. |
| --coreml-static-shapes 0\|1 | 1 | RequireStaticInputShapes knob. Default 1 pairs with the doc=coreml fixed-shape ONNX. |
Important: SIAM v0.3's encoder/decoder uses 3D InstanceNorm,
which Core ML's MLProgram format rejects (rank-5 input; spec
supports only rank 3-4). siamize ships under doc=coreml an
InstanceNorm-rewritten ONNX where each rank-5 IN is replaced with
Reshape → InstanceNorm-rank-3 → Reshape. The math is bit-exact
to the original. The siamize weight resolver auto-fetches the
doc=coreml variant when CoreML EP is active.
Compile-time RAM peaks roughly:
| --coreml-units | mlcompilerd peak (estimated) |
|---|---|
| all (CPU + GPU + ANE) | 6-8 GB |
| cpugpu | 4-6 GB |
| cpu | 1-2 GB |
If your host is RAM-tight (< 14 GB free), siamize's auto-lowmem
heuristic drops to --coreml-units cpu automatically. To
benchmark real ANE acceleration, run on a host with ≥ 14 GB free
RAM and pass --coreml-units all explicitly.
The tools/onnx_export/rewrite_for_coreml.py script applies the
InstanceNorm rewrite to any fp16/fp32 SIAM ONNX; it's how the
doc=coreml bundle was produced from the same trained weights as
doc=dynshape. Useful if you want to re-export with different
fold or quantization choices.
# Full 5-fold ensemble (the digit shortcut expands to fold_<N>_fp16.onnx;
# any missing weight auto-downloads into the shared cache).
build/siamize -i input.nii.gz -o output.nii.gz -M 0,1,2,3,4 -v
# Single-fold prediction is also supported:
build/siamize -i input.nii.gz -o output.nii.gz -M 0 -v
# Explicit paths still work alongside shortcuts:
build/siamize -i input.nii.gz -o output.nii.gz \
-M models/fold_0_fp16.onnx,models/fold_1_fp16.onnx

-t/--thread defaults to 0 = min(hardware_concurrency, 16). The
16-thread cap is empirical (see the "CPU thread tuning" section below
for the Threadripper measurements that motivated it). Pass an explicit
-t N to override.
The most-used options beyond -i / -o / -M / -c:
| Flag | Effect |
|---|---|
| -v / --verbose | progress messages (default ON since 0.1.0). Pair with -q / --quiet to silence. |
| -G N | CUDA / TensorRT device id. Matches nvidia-smi -L indices (siamize sets CUDA_DEVICE_ORDER=PCI_BUS_ID). |
| -t N | ORT intra-op thread count. Default min(hc, 16). |
| -P ZxYxX | Sliding-window patch size, default 256x256x192. Smaller patches → lower peak memory, more tiles. Requires dynamic-axes ONNX. |
| -u S | Target isotropic spacing in mm, default 0.75. |
| -C N\|spm | Output classes. -C 18 (default, SIAM v0.3) or -C spm to remap to SPM12's 6 TPM channels (GM, WM, CSF, Bone, Soft, Air). |
| --tpm [0\|1] + --tpm-t T | Write a 4D float32 tissue probability map (per-voxel softmax over classes) instead of a uint8 labelmap. --tpm-t > 1 softens the softmax. |
| --upsample | Save at the network's internal 0.75 mm canonical-RAS grid instead of resampling back to the input grid. |
| --shuffle | Apply JData _ArrayShuffle_=4 byte-shuffle before zlib on TPM .jnii/.bnii output. 1.5-2.5× smaller payload for spec-compliant readers. |
| --lowmem | Force the full low-memory preset (smaller patch + tighter EP knobs). Auto-applied on hosts with < 14 GB free RAM / < 12 GB free VRAM. |
| --no-arena | Disable ORT's CPU memory arena (saves ~16 GB peak RSS on the 18-class model at ~1.5× wall cost). |
| --cudnn-max-workspace 0 + --arena-extend same + --cudnn-algo heuristic | Tight-VRAM GPU recipe (8 GB consumer cards). |
| --gpu-mem-limit N[K\|M\|G] | Cap the CUDA EP arena. |
Run build/siamize --help for the full list with longer descriptions.
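To illustrate what "softens the softmax" means for --tpm-t, here is a minimal sketch of a temperature-scaled softmax over one voxel's class logits. It assumes the common convention of dividing the logits by T before the softmax; whether siamize applies the temperature in exactly this way is an assumption, and the logit values are made up.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over one voxel's class logits; temperature > 1 flattens it."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                        # subtract max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [4.0, 1.0, 0.5]                      # made-up values for three classes
sharp = softmax_with_temperature(logits, 1.0)
soft = softmax_with_temperature(logits, 1.5)
print(max(sharp), max(soft))                  # the T=1.5 maximum is the smaller one
```

Under this convention the class ranking is unchanged for any positive T, so the argmax labelmap is unaffected; only the confidence spread in the probability map changes.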
In addition to NIfTI-1, siamize reads and writes
JNIfTI — the JData/BJData wrapping
of NIfTI metadata + voxel data used by the NeuroJSON ecosystem.
Input format is inferred from the file extension; output format is
selected via -F/--format:
# Text-JSON JNIfTI labelmap (human-readable; zlib + base64 payload)
build/siamize -i input.nii.gz -o labels.jnii -M 0 -F jnii
# BJData binary JNIfTI labelmap (compact binary; zlib raw bytes)
build/siamize -i input.nii.gz -o labels.bnii -M 0 -F bnii
# JNIfTI input (e.g. produced by jsonlab's savejd):
build/siamize -i preproc.bnii -o labels.bnii -M 0 -F bnii

| -F | Output container | Payload | Notes |
|---|---|---|---|
| nii (default) | NIfTI-1 (.nii, gzipped if .gz) | raw bytes | Smallest for sparse uint8 labelmaps. |
| jnii | JSON-text JNIfTI | zlib + base64 | Human-readable; pairs with loadjd / jsonlab. |
| bnii | BJData binary JNIfTI | zlib (raw bytes) | Compact binary; pairs with the NeuroJSON ecosystem. |
For label volumes specifically, .nii.gz is usually the tightest on
disk — raw gzip already exploits per-voxel value redundancy
efficiently. JNIfTI's value here is interoperability with JSON / JData
tooling (jsonlab, jdataencode / jdatadecode, browser-based viewers,
the NeuroJSON database), not raw compression. Voxel-identical
round-tripping has been verified for .nii.gz ↔ .jnii ↔ .bnii
inputs on the bundled sub-01_T1w.nii.gz (100 % agreement, 5-fold
ensemble).
.jnii / .bnii labelmap and TPM outputs additionally carry a
JGIFTI-style LabelTable at
NIFTIHeader._DataInfo_.LabelTable — anatomical names + per-tissue
RGBA colors keyed by label ID. Two presets are emitted automatically:
the full SIAM v0.3 18-class dictionary (default), or the SPM12
6-class TPM dictionary (GM, WM, CSF, Bone, Soft, Air) when
--classes spm is set. Viewers that honour the
JGIFTI spec render tissue
names and colors instead of an unlabeled colormap. NIfTI-1 output
(.nii / .nii.gz) doesn't carry the LabelTable since the format
has no extensible JSON header.
For TPM output (--tpm) on the .jnii / .bnii path, opt into
byte-shuffle compression with --shuffle. The flag wraps the
fp32 payload with a JData _ArrayShuffle_=4 annotation
(blosc2-style per-byte plane regrouping before zlib), giving
1.5-2.5× smaller files at no decode cost on JData-spec-compliant
readers (siamize itself; future jsonlab versions). Default OFF for
interop with current jsonlab which doesn't yet implement the
unshuffle.
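The per-byte plane regrouping mentioned above can be sketched in a few lines. This is an illustration of the general byte-shuffle technique for 4-byte elements, not siamize's implementation. The function names are hypothetical, and the exact on-disk layout required by the JData annotation should be taken from the JData specification rather than from this example.

```python
import struct
import zlib


def byte_shuffle(buf, itemsize):
    """Group byte 0 of every element, then byte 1, and so on."""
    if len(buf) % itemsize:
        raise ValueError("buffer length must be a multiple of itemsize")
    return b"".join(buf[k::itemsize] for k in range(itemsize))


def byte_unshuffle(buf, itemsize):
    """Inverse of byte_shuffle: interleave the byte planes again."""
    if len(buf) % itemsize:
        raise ValueError("buffer length must be a multiple of itemsize")
    n = len(buf) // itemsize
    out = bytearray(len(buf))
    for k in range(itemsize):
        out[k::itemsize] = buf[k * n:(k + 1) * n]
    return bytes(out)


# Smoothly varying float32 values standing in for a probability map.
values = [i / 4096.0 for i in range(4096)]
raw = struct.pack(f"<{len(values)}f", *values)

plain = zlib.compress(raw)
shuffled = zlib.compress(byte_shuffle(raw, 4))
assert byte_unshuffle(byte_shuffle(raw, 4), 4) == raw   # lossless round trip
print(len(raw), len(plain), len(shuffled))
```

Neighbouring float32 values tend to share exponent and high-mantissa bytes, so grouping those bytes together often gives zlib longer repeats; the actual gain depends on the data.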
tests/run_regression.sh

Runs the bundled sample through build/siamize and reports voxel
agreement vs tests/pred_ref_allfolds.nii.gz.
The same inference pipeline is callable from MATLAB and Octave through a
thin MEX (siamex.mex*) wrapped by a pure-MATLAB dispatcher
(matlab/siamize.m). MEX and CLI predictions are bit-identical (they
share the siamize_core C++ sources).
CPU-only MEX (works on every host):
# Octave (Linux/macOS):
make mex-octave # -> matlab/siamex.mex
# MATLAB (Linux/macOS/Windows):
make mex-matlab # -> matlab/siamex.mexa64 / .mexmaca64 / .mexw64

GPU-enabled MEX variants (same CLI capabilities as the corresponding
siamize binary build):
# CUDA EP (NVIDIA, Linux/Windows):
make cudaoct # Octave MEX, -DSIAMIZE_GPU=cuda
make cudamex # MATLAB MEX, -DSIAMIZE_GPU=cuda
# CoreML EP (Apple Silicon, macOS):
make coremloct # Octave MEX, -DSIAMIZE_GPU=coreml
make coremlmex # MATLAB MEX, -DSIAMIZE_GPU=coreml

All targets drop the .mex* next to matlab/siamize.m so the
wrapper finds it via its addpath auto-detection. Equivalent
explicit CMake forms (any of the above translates to):
cmake -S . -B build -DSIAMIZE_BUILD_OCTAVE_MEX=ON [-DSIAMIZE_GPU=cuda|coreml]
# or:
cmake -S . -B build -DSIAMIZE_BUILD_MATLAB_MEX=ON [-DSIAMIZE_GPU=cuda|coreml]
cmake --build build -j

The bundled jsonlab submodule (matlab/jsonlab/) provides
loadjd / savejd / loadnifti / jnii2nii / savejnifti / etc.;
siamize.m adds it to the path automatically if it isn't already
visible.
siamize.m accepts flexible inputs and returns a single jnifti
struct (nii.NIFTIHeader + nii.NIFTIData). The same struct is
written to disk when an outputfile is supplied.
% one-shot file -> file (defaults: single-fold fold_0, auto-downloaded)
siamize('input.nii.gz', 'labels.nii.gz');
% cross-format: read .nii.gz, write binary JNIfTI, full 5-fold ensemble
siamize('input.nii.gz', 'labels.bnii', 0:4);
% struct input (jnifti or readnifti-style), in-memory labels
nii_in = loadnifti('input.nii.gz');
nii_out = siamize(nii_in); % nii_out.NIFTIData is uint8 3D labels
% bare 3D array, default centered affine inferred
nii_out = siamize(my_volume);
nii_out = siamize(my_volume, 0); % single fold by shortcut
nii_out = siamize(my_volume, '0,2,4', 'verbose', true);
% explicit affine + output file + ensemble + opts
siamize(my_volume, A, 'labels.nii.gz', 0:4, 'compute', 'cuda');
% multi-GPU box, pick the GPU 1 of N (matches `nvidia-smi -L` index):
siamize('in.nii.gz', 'lab.nii.gz', 0:4, 'compute', 'cuda', 'gpu', 1);
% CoreML EP (Apple Silicon; needs MEX built via `make coremlmex` / coremloct):
siamize('in.nii.gz', 'lab.nii.gz', 0, 'compute', 'coreml');
% TPM mode: nii_out.NIFTIData becomes 4D single (float32) [X, Y, Z, 18]:
nii_tpm = siamize('input.nii.gz', 0:4, 'tpm', true, 'tpm_t', 1.5);
siamize('input.nii.gz', 'tpm.nii.gz', 0:4, 'tpm', true); % save TPM to disk
% SPM12-style 6-class output (GM, WM, CSF, Bone, Soft, Air):
siamize('input.nii.gz', 'spm.nii.gz', 0, 'classes', 'spm');
% Upsample mode: save at the network's 0.75 mm grid instead of the input grid.
siamize('input.nii.gz', 'pred_hires.nii.gz', 0, 'upsample', true);

| First arg | Interpretation |
|---|---|
| 'file.{nii,nii.gz,jnii,bnii}' | read via loadjd; affine taken from header |
| jnifti struct (.NIFTIData + .NIFTIHeader.Affine) | passthrough |
| readnifti struct (.img + .hdr.srow_*) | passthrough; affine from sform |
| 3D numeric array | identity rotation + centered translation synthesized when no affine follows |
The models argument accepts numeric indices, char shortcuts, full
paths, or mixes thereof: 0, 0:4, '0,2,4', {'0','fold_3_fp16.onnx'}.
Output extension picks the writer (.nii[.gz] → jnii2nii,
.jnii/.bnii → savejnifti). The shared weight cache
($SIAMIZE_CACHE_DIR) is reused so a single download serves both the
MEX and the CLI binary. Full reference: matlab/README.md.
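The accepted model-spec forms can be pictured as a small normalization step that turns every shortcut into a fold_<N>_fp16.onnx file name and passes explicit paths through. The sketch below is a Python analogue written for illustration; it is not the MATLAB dispatcher's code, and the function name is hypothetical.

```python
def expand_model_spec(spec):
    """Expand fold shortcuts into fold_<N>_fp16.onnx; keep explicit paths."""
    if isinstance(spec, int):
        items = [str(spec)]                          # e.g. 0
    elif isinstance(spec, str):
        items = [s.strip() for s in spec.split(",") if s.strip()]  # e.g. '0,2,4'
    else:
        items = [str(s).strip() for s in spec]       # list, tuple or range
    # A purely numeric item is a fold index; anything else is taken as a path.
    return [f"fold_{s}_fp16.onnx" if s.isdigit() else s for s in items]


print(expand_model_spec(0))
print(expand_model_spec(range(5)))
print(expand_model_spec("0,2,4"))
print(expand_model_spec(["0", "models/fold_3_fp16.onnx"]))
```

Resolving the resulting file names against the shared cache directory, and downloading any that are missing, would be a separate step that is not shown here.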
make mex-test
# equivalent: octave-cli --no-gui --eval "cd matlab/tests; run_tests('--exit')"

The suite has 30 unit tests that stub the underlying MEX, so they run in under a second and require no ORT or weight files. They cover argument-form dispatch, default-affine math, model-spec parsing, file-in/file-out across the four extensions, source-header preservation, and the error paths. CI runs the same suite on both Octave and MATLAB legs.
py/siam_ref.py # slim PyTorch reference, used to validate (2)
│
v
tools/onnx_export/ # PyTorch → fp16 .onnx; uses py/siam_ref to verify
│
v
src/ + CMakeLists.txt # C++ standalone with ONNX Runtime, uses .onnx from (2)
│ │
│ └───> build/siamize # CLI binary
│
└─────────> build/siamex.mex* # MATLAB / Octave MEX
│ (shares siamize_core sources)
v
matlab/siamize.m # pure-MATLAB dispatcher
matlab/jsonlab/ # bundled NeuroJSON jsonlab (submodule)
matlab/tests/ # Octave + MATLAB unit tests
| Artifact | Size |
|---|---|
| build/siamize binary (static-linked C++/zlib/OpenMP) | ~2-3 MB |
| matlab/siamex.mex (Octave MEX, dynamic libstdc++) | ~200 KB |
| matlab/siamex.mexa64 (MATLAB MEX, static libstdc++) | ~3-4 MB |
| libonnxruntime.so.1.26.0 | 23 MB |
| One fold .onnx (fp16) | 270 MB |
| Five folds | 1.35 GB |
| Single-fold deployable bundle | ≈295 MB |
vs. the original SIAM stack: multi-GB PyTorch + nnU-Net + torchio install, plus 5.4 GB checkpoints.
libonnxruntime.so.1 # the only non-glibc dep; bundled with the binary
libm.so.6 # glibc
libc.so.6 # glibc
ld-linux-x86-64.so.2 # glibc
libstdc++, libgcc, libgomp, libz and nifti_clib are all statically
linked into the binary (CMake option SIAMIZE_STATIC_LINK=ON, the default).
The libstdc++ / libgcc / libpthread etc. that show up in ldd output
are transitive deps of libonnxruntime.so.1, not of siamize itself —
verifiable via readelf -d build/siamize | grep NEEDED. ONNX Runtime
ships only as a .so (Microsoft does not provide a static .a); building
ORT from source statically is possible but a substantial undertaking and
not done by default here.
To toggle the static linking, pass -DSIAMIZE_STATIC_LINK=OFF to CMake.
The C++ code is portable C++17 and the build is CMake-driven. CI builds the binary on all three:
| Host | Toolchain | Static-linked C/C++ runtime? |
|---|---|---|
| Linux x86_64 / aarch64 | GCC (Apt) | yes (-static-libstdc++ -static-libgcc, static libgomp.a, static libz.a) |
| macOS x86_64 / arm64 | Apple clang | partial: relies on libc++ (ABI-stable on macOS); OpenMP via Homebrew libomp |
| Windows x64 | MSVC | yes (/MT static CRT); onnxruntime.dll copied next to siamize.exe |
On every platform the binary ships with libonnxruntime (.so / .dylib /
.dll) sitting next to it; everything else statically linkable is statically
linked by default. Set -DSIAMIZE_STATIC_LINK=OFF to keep things dynamic.
The MATLAB / Octave MEX (siamex.mex*) is exercised by CI on
linux-octave, linux-matlab, and windows-matlab matrix legs; on
Linux the MATLAB MEX statically embeds libstdc++ (to escape MATLAB's
older bundled libstdc++.so.6) while the Octave MEX stays dynamic
(static-linking would conflict with Octave's already-loaded C++
runtime).
Locally tested: Linux x86_64. macOS / Windows are exercised by CI (see
.github/workflows/ci.yml); please open an issue if a host setup breaks.
All measurements use the bundled tests/sub-01_T1w.nii.gz (160×192×192,
1.0/1.333/1.333 mm) running siamize's 5-fold ensemble with models/fold_*_fp16.onnx.
| Run | Time |
|---|---|
| C++ 5-fold ensemble (siamize -c cpu) | 634 s (10.6 min) |
| C++ single fold | 126 s |
| Python ORT 5-fold | 781 s (13 min) |
| Original siam-pred 5-fold CPU (per upstream README) | ~25 min |
ORT's CPU Execution Provider does not scale linearly past ~16 threads on this workload. Profiling on an AMD Threadripper 3990X (Zen2, 64C/128T, single NUMA node) gives a clear shape:
| -t | Wall (s) | CPU% | Avg cores busy |
|---|---|---|---|
| 4 (no auto) | 300.2 | 331 % | 3.3 |
| 8 | 206.4 | 540 % | 5.4 |
| 12 | 171.6 | 694 % | 6.9 |
| 16 | 158.9 | 840 % | 8.4 ← optimum |
| 32 | 160.4 | 1098 % | 11.0 |
| 48 | 162.2 | 1372 % | 13.7 |
| 64 | 165.1 | 1632 % | 16.3 |
| 0 (=128) | 178.4 | 2674 % | 26.7 |
Zen 2's chiplet topology is the likely reason. On the 3990X each CCD
holds 8 cores (16 hardware threads) and 32 MB of L3, so -t 16 can keep
the active hot set within one or two CCDs. At 32+ threads the active
set spills across more CCDs through the I/O die, which costs roughly
what the extra parallelism gains. At 128 the contention wins outright.
To match this, siamize -t 0 (the default) now resolves to
min(hardware_concurrency, 16) rather than full
hardware_concurrency. On laptops / small workstations (≤16 cores)
this is a no-op; on big servers it stops siamize from oversubscribing
ORT's thread pool. Pass -t N explicitly to override.
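The default described above amounts to a one-line rule. A minimal sketch follows; the function and constant names are illustrative rather than siamize's own identifiers.

```python
import os

THREAD_CAP = 16  # empirical cap motivated by the Threadripper measurements


def resolve_threads(requested, hardware_concurrency=None):
    """Map the -t value to an intra-op thread count; 0 means 'auto'."""
    if requested < 0:
        raise ValueError("thread count must be >= 0")
    if requested > 0:
        return requested                    # an explicit -t N always wins
    hc = hardware_concurrency or os.cpu_count() or 1
    return min(hc, THREAD_CAP)


print(resolve_threads(0, hardware_concurrency=128))  # big server: capped
print(resolve_threads(0, hardware_concurrency=8))    # laptop: unchanged
print(resolve_threads(64, hardware_concurrency=128)) # explicit override
```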
siamize -c cpu keeps ORT's CPU memory arena + memory-pattern
optimizer on by default. On the same Threadripper run this
buys a 1.5× wall-time speedup (273.6 → 178.3 s on -t 0) at the cost of
peak RSS growing from 12.4 GB → 28.3 GB for the 18-class network.
Profiling traced the off-arena path's slowdown to 75 M minor page
faults and a 43 % dTLB miss rate from per-op mmap/munmap churn.
If you'd rather trade speed for memory, pass --no-arena on the
CLI (or engine_tuning.cpu_arena = false from the MEX). The
-v header surfaces the choice (--no-arena is appended when
the arena is disabled).
| Run | GPU | Time | vs CPU C++ |
|---|---|---|---|
| Single fold | RTX 2080 Super (Turing sm_75, 8 GB) | 13.3 s (±0.04 s, n=3) | 9.5× |
| 5-fold ensemble | RTX 2080 Super (Turing sm_75, 8 GB) | 58.5 s | ~11× |
| Single fold | A100-PCIE-40GB (Ampere sm_80, 40 GB) | 9.8 s | ~13× |
Correctness: the Turing 5-fold output matches the Phase-1 PyTorch reference at 99.7167% voxel agreement — identical to the CPU C++ result. Switching to the CUDA Execution Provider does not introduce additional numerical drift on top of fp16 ONNX + cubic-Hermite resampling.
GPU memory: the full 5-fold run fits on the 8 GB RTX 2080 Super with no OOM.
Estimated peak total ≈ 4–6 GB (model weights + held activations + cuDNN
workspace + output). For low-VRAM cards, --gpu-mem-limit caps the CUDA EP
arena; in practice 8 GB has been sufficient.
The C++ binary uses ONNX Runtime with one of four execution
providers, selected by -DSIAMIZE_GPU=<backend> at build time and by
-c at run time. All four are wired in and CI-tested:
| -DSIAMIZE_GPU= | -c | Hardware | Status |
|---|---|---|---|
| none (default) | cpu | any CPU | always available |
| cuda | cuda | NVIDIA via ORT CUDA EP | CI: Linux + Windows builds |
| tensorrt | tensorrt | NVIDIA via ORT TensorRT EP (+ CUDA fallback) | opt-in build |
| coreml | coreml | Apple Silicon CPU + Metal GPU + ANE via ORT CoreML EP | CI: macos-14 build |
On a given host you build with the backend(s) you want; siamize's -c auto then picks the best available one at run time (TRT > CUDA >
CoreML > CPU). MEX builds have matching cudamex / cudaoct /
coremlmex / coremloct targets so MATLAB / Octave callers get the
same EP coverage.
ORT does also offer DirectML (any DX12 GPU on Windows), OpenVINO
(Intel CPU/GPU), and ROCm (AMD) EPs — siamize doesn't wire them up
today, but adding any of them is a CMake + sliding.cpp probe-block
change rather than a change to the inference pipeline itself. The CoreML wiring (sliding.cpp's
#ifdef SIAMIZE_HAS_COREML block + CMakeLists.txt GPU-backend
switch) is the cleanest template if you want to send a PR for one
of these.
For a vendor-neutral GPU path on Linux (Vulkan / OpenCL) the same
.onnx files can feed MNN (OpenCL)
or TVM (Vulkan / OpenCL / SPIR-V). Initial
exploration of ncnn found its
Vulkan backend lacks 3D conv kernels for this model.
The C++ pipeline uses cubic Catmull-Rom (3rd-order Hermite) for the forward image resample. scipy/skimage use cubic B-spline (also 3rd-order, different basis with a pre-filter step). The two give visually identical output but differ at fp32-noise level on the network input, which propagates into ~0.27% boundary voxel disagreements after argmax. If sub-percent precision matters, a scipy-compatible cubic B-spline resampler is the obvious next upgrade (~150 lines of standard code).
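The difference between the two cubic bases can be seen from their 1D four-tap weights. The sketch below uses the standard textbook formulas for the uniform Catmull-Rom and cubic B-spline kernels. It is an illustration of why the two schemes are not interchangeable, not a copy of siamize's or scipy's resampling code, and the function names are hypothetical.

```python
def catmull_rom_weights(t):
    """Weights for samples p0..p3 at fractional offset t in [0, 1] from p1."""
    return (
        (-t**3 + 2 * t**2 - t) / 2,
        (3 * t**3 - 5 * t**2 + 2) / 2,
        (-3 * t**3 + 4 * t**2 + t) / 2,
        (t**3 - t**2) / 2,
    )


def cubic_bspline_weights(t):
    """Uniform cubic B-spline basis weights for the same four samples."""
    return (
        (1 - t) ** 3 / 6,
        (3 * t**3 - 6 * t**2 + 4) / 6,
        (-3 * t**3 + 3 * t**2 + 3 * t + 1) / 6,
        t**3 / 6,
    )


def interpolate(samples, t, weights):
    """Blend four neighbouring samples with the given weight function."""
    return sum(w * p for w, p in zip(weights(t), samples))


samples = [0.0, 1.0, 0.0, 0.0]               # unit impulse at p1
print(interpolate(samples, 0.0, catmull_rom_weights))    # reproduces p1 exactly
print(interpolate(samples, 0.0, cubic_bspline_weights))  # smoothed: below p1
```

Catmull-Rom passes through the samples directly, whereas the raw B-spline basis smooths them, which is why B-spline interpolation needs the pre-filter step to recover coefficients first. Even with the pre-filter, the two interpolants generally differ between grid points, which is the source of the small input differences described above.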
If you use siamize in your work, please cite the original SIAM paper:
Valabregue, R., Khemir, I., Bardinet, E., Rousseau, F., Auzias, G., & Dorent, R. (2026). SIAM: Head and Brain MRI Segmentation from Few High-Quality Templates via Synthetic Training. arXiv:2605.02737. https://arxiv.org/abs/2605.02737
BibTeX:
@article{valabregue2026siam,
title = {SIAM: Head and Brain MRI Segmentation from Few High-Quality Templates via Synthetic Training},
author = {Valabregue, Romain and Khemir, Ikram and Bardinet, Eric and Rousseau, Francois and Auzias, Guillaume and Dorent, Reuben},
year = {2026},
journal = {arXiv preprint arXiv:2605.02737},
url = {https://arxiv.org/abs/2605.02737}
}

If you are required to also cite the specific software port (e.g., a
journal that asks for the inference tool you used), you may additionally
reference siamize:
@software{siamize,
title = {siamize: native C++/ONNX port of SIAM v0.3 brain segmentation},
author = {Fang, Qianqian},
year = {2026},
url = {https://github.com/NeuroJSON/siamize}
}

This is secondary; please always cite the SIAM paper above first.
siamize is a port of SIAM v0.3
by Valabregue, Khemir, Bardinet, Rousseau, Auzias & Dorent (2026), and reuses
the published SIAM v0.3 weights without modification.
- zmat by Qianqian Fang: the single-header amalgamation src/zmat/zmat.h provides all .nii.gz compression and decompression. zmat is part of the NeuroJSON project, supported by US NIH grant U24-NS124027. Upstream zmat is GPL-3.0; this single file has been dual-licensed under Apache-2.0 for siamize by the zmat author, as documented in the file's header. Inside zmat:
  - miniz by Rich Geldreich: public-domain (Unlicense) zlib-subset deflate/inflate.
- nlohmann/json (NeuroJSON fork): the single-header src/nlohmann/json.hpp provides the JSON / BJData parser and serializer used for .jnii / .bnii I/O. MIT-licensed. The NeuroJSON fork extends upstream nlohmann/json with BJData Draft 4 SOA (structure-of-arrays) support; the JData annotated-array spec (_ArrayType_, _ArrayZipData_, etc.) is implemented by src/jnifti_io.cpp on top of it. The header is vendored from the umcx project by Qianqian Fang (a single-source OpenMP / OpenACC Monte Carlo photon simulator), where the same header serves the same JData / BJData container role; keeping a single vendored copy across NeuroJSON-family C/C++ tools (umcx, siamize, …) avoids divergence.
- JNIfTI specification by Qianqian Fang: defines the JSON / BJData wrapping of NIfTI-1/2 metadata and voxel data used by siamize for .jnii / .bnii containers. JNIfTI is part of the NeuroJSON project; the underlying JData annotated-array encoding (_ArrayType_, _ArraySize_, _ArrayZipData_, …) and BJData binary container are implemented in src/jnifti_io.cpp on top of the vendored nlohmann/json header described above.
- JGIFTI specification by Qianqian Fang: defines the JSON encoding of GIFTI surface and segmentation metadata, also part of the NeuroJSON project. siamize borrows JGIFTI's LabelTable schema (object form keyed by stringified integer label IDs, each entry holding Label + RGBA) to attach anatomical names + per-tissue colors to the .jnii / .bnii labelmap output at NIFTIHeader._DataInfo_.LabelTable. Two presets are emitted: the full SIAM v0.3 18-class dictionary, and the SPM12 6-class TPM dictionary when --classes spm is set.
The bundled test image tests/sub-01_T1w.nii.gz is the sub-01 anatomical
T1-weighted scan from OpenNeuro ds000001 v1.0.0,
redistributed here under its original CC0 public-domain dedication.
See tests/README.md for details.