PERF: Enable FFTW SIMD codelets with per-CPU introspection at configure time#6004
Conversation
…re time
FFTW SIMD codelets are hand-written assembly routines baked into the
library at compile time. Passing -march=native to ITK alone does NOT
activate them; they must be explicitly requested via FFTW CMake options.
This change adds automatic SIMD codelet selection based on the actual
build-host CPU:
Native builds (not cross-compiling):
- ARM64 (aarch64/arm64): NEON=ON (mandatory in ARMv8); x86 SIMD off.
- x86/x86_64: SSE, SSE2, AVX, AVX2 each probed independently via
__builtin_cpu_supports() / CheckCSourceRuns so that codelets are
enabled only for what the CPU actually supports. A pre-AVX Sandy
Bridge gets SSE+SSE2; a Haswell or later gets all four.
- Other: all SIMD off (conservative fallback).
Cross-compiled builds:
- ARM64: NEON=ON; x86 SIMD off.
- x86_64: SSE+SSE2 only (baseline; AVX/AVX2 not assumed for target).
- Other: all SIMD off.
Each flag is an individually overridable cache option, e.g.:
cmake -DFFTW_ENABLE_AVX2=OFF
Also fixes fftwd (double-precision) ExternalProject_Add which was missing
all SIMD flags — it now mirrors the fftwf configuration.
Verified on Apple M4 (arm64): NEON=ON, 19 NEON codelets compiled into
libfftw3f.a; reduces BCDTest_rVN4 from ~1446 s to an estimated ~180-360 s.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
NOTE: This is purposefully targeted for ITKv5.4 branch, but once it is finalized it should be cherry-picked for the ITKv6 branch as well. |
|
| Filename | Overview |
|---|---|
| CMake/itkExternal_FFTW.cmake | Adds per-CPU SIMD detection at configure time for FFTW; case-sensitive regex misses Windows ARM64 (ARM64 ≠ arm64), causing NEON to stay off there; MSVC lacks __builtin_cpu_supports, producing noisy failed-check messages; both ExternalProject blocks consistently forward the new flags. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[CMake configure starts] --> B{ITK_USE_SYSTEM_FFTW?}
B -- Yes --> Z[find_package FFTW — no SIMD detection]
B -- No --> C{CMAKE_CROSSCOMPILING?}
C -- No native --> D{CMAKE_SYSTEM_PROCESSOR}
D -- aarch64 / arm64 --> E[NEON=ON\nSSE/SSE2/AVX/AVX2=OFF]
D -- x86_64 / AMD64 / i686 --> F[check_c_source_runs\n__builtin_cpu_supports per level]
F --> G[Set _fftw_default_sse/sse2/avx/avx2\nper actual CPU capability]
D -- other --> H[All SIMD OFF]
C -- Yes cross --> I{CMAKE_SYSTEM_PROCESSOR}
I -- aarch64 / arm64 --> J[NEON=ON]
I -- x86_64 / AMD64 --> K[SSE=ON, SSE2=ON\nAVX/AVX2 not assumed]
I -- other --> L[All SIMD OFF]
E & G & H & J & K & L --> M[option FFTW_ENABLE_NEON/SSE/SSE2/AVX/AVX2\ndefaults from detection, user-overridable]
M --> N[ExternalProject_Add fftwf\nforwards all 5 ENABLE_* flags]
M --> O[ExternalProject_Add fftwd\nforwards all 5 ENABLE_* flags]
Reviews (1): Last reviewed commit: "PERF: Enable FFTW SIMD codelets with per..." | Re-trigger Greptile
- Add ARM64 (all-caps) to the regex pattern for CMAKE_SYSTEM_PROCESSOR in both native and cross-compile branches; Windows reports ARM64 instead of arm64/aarch64, so NEON was silently left disabled. - Guard __builtin_cpu_supports() probes with a compiler-ID check for GNU/Clang/AppleClang; MSVC lacks this intrinsic and would emit confusing "Performing Test ... - Failed" messages during configure. - Fix doc comment to include ENABLE_SSE in the list of FFTW options. Addresses review comments from @greptile-apps on PR InsightSoftwareConsortium#6004. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
none of the ITK CI configurations enable FFTW. Both ITK_USE_FFTWD and ITK_USE_FFTWF are explicitly OFF in the user presets, and none of the Azure Pipelines, GitHub Actions, or Pixi workflows set any FFTW flags. FFTW This means the FFTW SIMD detection changes in PR #6004 won't be exercised by CI — they can only be tested on local builds that explicitly enable FFTW (-DITK_USE_FFTWF=ON or -DITK_USE_FFTWD=ON). Tested locally. |
a43fa81
into
InsightSoftwareConsortium:release-5.4
…re time FFTW SIMD codelets (NEON, SSE/SSE2, AVX, AVX2) are hand-written assembly routines baked into the library at compile time. Previously all SIMD flags were hardcoded to OFF, producing scalar-only FFTW builds regardless of the host CPU. Add per-CPU SIMD detection at CMake configure time: - ARM64 (aarch64/arm64/ARM64): NEON=ON (mandatory in ARMv8) - x86/x86_64 with GCC/Clang: probe SSE, SSE2, AVX, AVX2 individually via __builtin_cpu_supports() / CheckCSourceRuns - x86/x86_64 with MSVC: skip probes (intrinsic unavailable), default OFF - Cross-compile ARM64: NEON=ON; x86_64: SSE+SSE2 only (conservative) - All other architectures: all SIMD off (safe fallback) Every flag is an individually overridable cache option (e.g. cmake -DFFTW_ENABLE_AVX2=OFF). Cherry-picked from PR InsightSoftwareConsortium#6004 (targeting release-5.4) with review fixes: - ARM64 regex includes all-caps variant for Windows ARM64 - MSVC compiler guard on __builtin_cpu_supports probes - ENABLE_SSE included in documentation comment Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Should we turn on FFTW at least in some CI build(s)? It is most convenient to enable in Linux builds. |
Summary
FFTW SIMD codelets (NEON, SSE/SSE2, AVX, AVX2) are hand-written assembly
routines baked into the library at compile time. Passing
-march=nativeto the ITK/BRAINSTools build does not activate them; they must be
requested explicitly via FFTW's own CMake options (
ENABLE_NEON,ENABLE_SSE2,ENABLE_AVX,ENABLE_AVX2). Before this PR those optionswere not forwarded, so every FFTW build was a scalar-only build regardless
of the host CPU.
Changes
Per-CPU SIMD detection at CMake configure time (
itkExternal_FFTW.cmake):aarch64|arm64)NEON=ON(mandatory in ARMv8)CheckCSourceRuns+__builtin_cpu_supports()NEON=ONSSE=ON,SSE2=ON; AVX/AVX2 not assumedEvery flag (
FFTW_ENABLE_NEON,FFTW_ENABLE_SSE,FFTW_ENABLE_SSE2,FFTW_ENABLE_AVX,FFTW_ENABLE_AVX2) is an individually overridablecache option, e.g.
cmake -DFFTW_ENABLE_AVX2=OFF ....Both
fftwf(single-precision) andfftwd(double-precision)ExternalProject_Addblocks now forward all five SIMD flags consistently.Motivation / observed impact
On Apple M4 (ARM64),
MaskedFFTNormalizedCorrelationImageFilterwith ascalar FFTW ran 7.5–9× slower than necessary because NEON codelets were
absent. After this change:
ENABLE_NEON:BOOL=ON, all x86 flagsOFFnmconfirms 19 NEON codelet functions (_fftwf_codelet_n*) compiledinto
libfftw3f.aMeasured end-to-end on Apple M4 (BRAINSTools
BCDTest_rVN4-rpc-rac-rmpj,which exercises
MaskedFFTNormalizedCorrelationImageFilterheavily across~50 LLS landmark passes):
The
CheckCSourceRuns-based x86 probing ensures that older CPUs (SandyBridge = SSE+SSE2 only, Ivy Bridge = +AVX, Haswell+ = +AVX2) each receive
only the SIMD levels they actually support, rather than blindly enabling all
four and potentially miscompiling on a cross-compile target or older host.
Testing
ENABLE_NEON=ON, all x86 flagsOFF, 19 NEON codelets confirmed vianm.BCDTest_rVN4-rpc-rac-rmpjpassed in 351 s (timeout budget: 6000 s);image error 18 < tolerance 50 confirms numerical correctness.
CheckCSourceRunsresults are CMake-cached after the first configure run,so subsequent
cmake .calls do not re-run the CPUID executables.AI Contribution Disclosure
This change was developed with AI assistance (Claude Sonnet 4.6 via Claude
Code). The root-cause diagnosis, detection logic, cross-compile fallback
design, and documentation were reviewed and verified by the human author
before submission. All code in this PR is understood and accountable.
🤖 Generated with Claude Code