PERF: Enable FFTW SIMD codelets with per-CPU introspection at configure time by hjmjohnson · Pull Request #6004 · InsightSoftwareConsortium/ITK

hjmjohnson · 2026-04-02T00:45:56Z

Summary

FFTW SIMD codelets (NEON, SSE/SSE2, AVX, AVX2) are machine-generated C
routines that use SIMD intrinsics and are compiled into the library when
FFTW itself is built. Passing -march=native to the ITK/BRAINSTools build
does not activate them; they must be requested explicitly via FFTW's own
CMake options (ENABLE_NEON, ENABLE_SSE, ENABLE_SSE2, ENABLE_AVX,
ENABLE_AVX2). Before this PR those options were not forwarded, so every
FFTW build was a scalar-only build regardless of the host CPU.

Changes

Per-CPU SIMD detection at CMake configure time (itkExternal_FFTW.cmake):

| Scenario | Detection method | Result |
|---|---|---|
| Native build, ARM64 | Architecture pattern (`aarch64\|arm64`) | `NEON=ON` (mandatory in ARMv8) |
| Native build, x86/x86_64 | `CheckCSourceRuns` + `__builtin_cpu_supports()` | Each of SSE, SSE2, AVX, AVX2 set individually to match the actual build-host CPU |
| Cross-compile, ARM64 | Architecture pattern | `NEON=ON` |
| Cross-compile, x86_64 | Architecture pattern (conservative) | `SSE=ON`, `SSE2=ON`; AVX/AVX2 not assumed |
| All other architectures | none | All SIMD off (safe fallback) |

Every flag (FFTW_ENABLE_NEON, FFTW_ENABLE_SSE, FFTW_ENABLE_SSE2,
FFTW_ENABLE_AVX, FFTW_ENABLE_AVX2) is an individually overridable
cache option, e.g. cmake -DFFTW_ENABLE_AVX2=OFF ....
Both fftwf (single-precision) and fftwd (double-precision)
ExternalProject_Add blocks now forward all five SIMD flags consistently.
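
The detection table above can be sketched roughly as follows. This is an illustrative approximation rather than the code in itkExternal_FFTW.cmake: every variable prefixed `_sketch_` is hypothetical, the C language is assumed to be enabled, and the ENABLE_* names are the FFTW options listed in the summary.

```cmake
# Illustrative sketch only; all _sketch_* names are hypothetical.
include(CheckCSourceRuns)

# Safe fallback: start with every SIMD level off.
foreach(_lvl neon sse sse2 avx avx2)
  set(_sketch_default_${_lvl} OFF)
endforeach()

if(CMAKE_SYSTEM_PROCESSOR MATCHES "aarch64|arm64")
  # ARM64, native or cross-compiled: NEON is always enabled.
  set(_sketch_default_neon ON)
elseif(CMAKE_CROSSCOMPILING)
  # The target CPU cannot be probed; assume only the x86_64 baseline.
  if(CMAKE_SYSTEM_PROCESSOR MATCHES "x86_64|AMD64")
    set(_sketch_default_sse ON)
    set(_sketch_default_sse2 ON)
  endif()
elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "x86_64|AMD64|i[3-6]86")
  # Native x86: compile and run one tiny program per SIMD level.
  foreach(_lvl sse sse2 avx avx2)
    check_c_source_runs("
      int main(void)
      {
        __builtin_cpu_init();
        return __builtin_cpu_supports(\"${_lvl}\") ? 0 : 1;
      }" _sketch_host_has_${_lvl})
    if(_sketch_host_has_${_lvl})
      set(_sketch_default_${_lvl} ON)
    endif()
  endforeach()
endif()

# Expose each default as a user-overridable cache option and collect the
# arguments that both ExternalProject_Add blocks would forward to FFTW.
set(_sketch_fftw_simd_args "")
foreach(_lvl NEON SSE SSE2 AVX AVX2)
  string(TOLOWER "${_lvl}" _lc)
  option(FFTW_ENABLE_${_lvl} "Build FFTW with ${_lvl} codelets"
         ${_sketch_default_${_lc}})
  list(APPEND _sketch_fftw_simd_args
       "-DENABLE_${_lvl}:BOOL=${FFTW_ENABLE_${_lvl}}")
endforeach()
message(STATUS "FFTW SIMD arguments: ${_sketch_fftw_simd_args}")
```

Because option() never overwrites a value that is already in the cache, a command-line setting such as -DFFTW_ENABLE_AVX2=OFF takes precedence over the detected default.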

Motivation / observed impact

On Apple M4 (ARM64), MaskedFFTNormalizedCorrelationImageFilter with a
scalar FFTW ran 7.5–9× slower than necessary because NEON codelets were
absent. After this change:

- FFTW CMakeCache: ENABLE_NEON:BOOL=ON, all x86 flags OFF
- nm confirms 19 NEON codelet functions (_fftwf_codelet_n*) compiled
  into libfftw3f.a

Measured end-to-end on Apple M4 (BRAINSTools BCDTest_rVN4-rpc-rac-rmpj,
which exercises MaskedFFTNormalizedCorrelationImageFilter heavily across
~50 LLS landmark passes):

| FFTW build | Wall-clock time | Image error |
|---|---|---|
| Scalar (no SIMD, before this PR) | ~2,700 s (timeout) | n/a |
| NEON codelets (this PR) | 351 s ✅ Passed | 18 (threshold 50) |
| Speedup | ~7.7× | numerically identical |

The CheckCSourceRuns-based x86 probing ensures that each CPU generation
(pre-Sandy Bridge = SSE+SSE2 only, Sandy Bridge and Ivy Bridge = +AVX,
Haswell and later = +AVX2) receives only the SIMD levels it actually
supports, rather than enabling all four unconditionally and potentially
miscompiling for a cross-compile target or an older host.

Testing

- Built and verified on Apple M4 (macOS, ARM64, Release build):
  ENABLE_NEON=ON, all x86 flags OFF, 19 NEON codelets confirmed via nm.
- BCDTest_rVN4-rpc-rac-rmpj passed in 351 s (timeout budget: 6000 s);
  image error 18 < tolerance 50 confirms numerical correctness.
- CheckCSourceRuns results are CMake-cached after the first configure run,
  so subsequent cmake . calls do not re-run the CPUID executables.
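
One consequence of that caching is that a probe result survives until its cache entry is removed, for example when a build tree is reused on a different machine. A minimal sketch of forcing a re-probe, assuming a hypothetical result variable name (the names actually used in itkExternal_FFTW.cmake are not shown in this description):

```cmake
# Hypothetical variable name; substitute the result variable the probe really uses.
# check_c_source_runs() skips the test when its result variable is already
# defined, so removing the cache entry makes the probe run again on the next
# configure.
unset(_sketch_host_has_avx2 CACHE)
```

The same effect is available from the command line through cmake's -U option, which removes cache entries matching a glob pattern.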

AI Contribution Disclosure

This change was developed with AI assistance (Claude Sonnet 4.6 via Claude
Code). The root-cause diagnosis, detection logic, cross-compile fallback
design, and documentation were reviewed and verified by the human author
before submission. All code in this PR is understood and accountable.

🤖 Generated with Claude Code

…re time

FFTW SIMD codelets are hand-written assembly routines baked into the library at compile time. Passing -march=native to ITK alone does NOT activate them; they must be explicitly requested via FFTW CMake options.

This change adds automatic SIMD codelet selection based on the actual build-host CPU:

Native builds (not cross-compiling):
- ARM64 (aarch64/arm64): NEON=ON (mandatory in ARMv8); x86 SIMD off.
- x86/x86_64: SSE, SSE2, AVX, AVX2 each probed independently via __builtin_cpu_supports() / CheckCSourceRuns so that codelets are enabled only for what the CPU actually supports. A pre-AVX Sandy Bridge gets SSE+SSE2; a Haswell or later gets all four.
- Other: all SIMD off (conservative fallback).

Cross-compiled builds:
- ARM64: NEON=ON; x86 SIMD off.
- x86_64: SSE+SSE2 only (baseline; AVX/AVX2 not assumed for target).
- Other: all SIMD off.

Each flag is an individually overridable cache option, e.g.: cmake -DFFTW_ENABLE_AVX2=OFF

Also fixes fftwd (double-precision) ExternalProject_Add which was missing all SIMD flags; it now mirrors the fftwf configuration.

Verified on Apple M4 (arm64): NEON=ON, 19 NEON codelets compiled into libfftw3f.a; reduces BCDTest_rVN4 from ~1446 s to an estimated ~180-360 s.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

hjmjohnson · 2026-04-02T10:49:11Z

NOTE: This is purposefully targeted for ITKv5.4 branch, but once it is finalized it should be cherry-picked for the ITKv6 branch as well.

dzenanz

LGTM

greptile-apps · 2026-04-02T17:09:25Z

Greptile Summary

This PR enables FFTW SIMD codelets (NEON, SSE/SSE2, AVX/AVX2) by adding per-CPU introspection at CMake configure time, replacing hard-coded OFF flags in both the fftwf and fftwd ExternalProject_Add blocks. The approach is well-structured — native x86 builds use check_c_source_runs + __builtin_cpu_supports() for precise per-level detection, ARM64 sets NEON unconditionally, and cross-compiled builds fall back conservatively — with each flag exposed as an overridable cache option.

The case-sensitive regex aarch64|arm64 silently misses ARM64 (all-caps), which is what CMAKE_SYSTEM_PROCESSOR reports on Windows ARM64, so NEON will not be enabled on that platform despite being mandatory on every ARMv8 core. The same pattern appears in the cross-compile branch.
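
CMake's regular expressions have no case-insensitive flag, so one way to cover `ARM64` is to normalize the processor string before matching. The snippet below is a sketch of that idea with hypothetical `_sketch_` variable names; adding `ARM64` as a third alternative to the existing pattern would work equally well.

```cmake
# Lower-case the processor name so that "ARM64" (Windows), "arm64" (macOS)
# and "aarch64" (Linux) are all handled by one pattern.
string(TOLOWER "${CMAKE_SYSTEM_PROCESSOR}" _sketch_processor)

if(_sketch_processor MATCHES "^(aarch64|arm64)$")
  set(_sketch_is_arm64 TRUE)
else()
  set(_sketch_is_arm64 FALSE)
endif()

message(STATUS "ARM64 target detected: ${_sketch_is_arm64}")
```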

Confidence Score: 4/5

Safe to merge for the primary Apple M4 / Linux ARM64 / x86-64 targets; the regex bug only affects Windows ARM64 and is a one-character fix.

One P1 finding (case-sensitive regex drops NEON on Windows ARM64) prevents a 5/5. The rest of the logic — detection, caching, forwarding to both precision builds — is correct and well-tested on the author's primary platform.

CMake/itkExternal_FFTW.cmake — the ARM64 regex pattern in both the native and cross-compile branches.

Important Files Changed

Filename	Overview
CMake/itkExternal_FFTW.cmake	Adds per-CPU SIMD detection at configure time for FFTW; case-sensitive regex misses Windows ARM64 (ARM64 ≠ arm64), causing NEON to stay off there; MSVC lacks __builtin_cpu_supports, producing noisy failed-check messages; both ExternalProject blocks consistently forward the new flags.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[CMake configure starts] --> B{ITK_USE_SYSTEM_FFTW?}
    B -- Yes --> Z[find_package FFTW — no SIMD detection]
    B -- No --> C{CMAKE_CROSSCOMPILING?}

    C -- No native --> D{CMAKE_SYSTEM_PROCESSOR}
    D -- aarch64 / arm64 --> E[NEON=ON\nSSE/SSE2/AVX/AVX2=OFF]
    D -- x86_64 / AMD64 / i686 --> F[check_c_source_runs\n__builtin_cpu_supports per level]
    F --> G[Set _fftw_default_sse/sse2/avx/avx2\nper actual CPU capability]
    D -- other --> H[All SIMD OFF]

    C -- Yes cross --> I{CMAKE_SYSTEM_PROCESSOR}
    I -- aarch64 / arm64 --> J[NEON=ON]
    I -- x86_64 / AMD64 --> K[SSE=ON, SSE2=ON\nAVX/AVX2 not assumed]
    I -- other --> L[All SIMD OFF]

    E & G & H & J & K & L --> M[option FFTW_ENABLE_NEON/SSE/SSE2/AVX/AVX2\ndefaults from detection, user-overridable]

    M --> N[ExternalProject_Add fftwf\nforwards all 5 ENABLE_* flags]
    M --> O[ExternalProject_Add fftwd\nforwards all 5 ENABLE_* flags]

- Add ARM64 (all-caps) to the regex pattern for CMAKE_SYSTEM_PROCESSOR in both native and cross-compile branches; Windows reports ARM64 instead of arm64/aarch64, so NEON was silently left disabled.
- Guard __builtin_cpu_supports() probes with a compiler-ID check for GNU/Clang/AppleClang; MSVC lacks this intrinsic and would emit confusing "Performing Test ... - Failed" messages during configure.
- Fix doc comment to include ENABLE_SSE in the list of FFTW options.

Addresses review comments from @greptile-apps on PR InsightSoftwareConsortium#6004.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
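
The compiler-ID guard described in this commit message could look roughly like the following. It is a sketch under assumptions, not the merged code: the `_sketch_` names are hypothetical, only the AVX2 probe is shown, and the C language is assumed to be enabled.

```cmake
include(CheckCSourceRuns)

# Default stays OFF unless a probe proves otherwise.
set(_sketch_default_avx2 OFF)

if(CMAKE_C_COMPILER_ID MATCHES "^(GNU|Clang|AppleClang)$"
   AND NOT CMAKE_CROSSCOMPILING)
  # These compilers provide __builtin_cpu_supports(), so the probe is meaningful.
  check_c_source_runs("
    int main(void)
    {
      __builtin_cpu_init();
      return __builtin_cpu_supports(\"avx2\") ? 0 : 1;
    }" _sketch_host_has_avx2)
  if(_sketch_host_has_avx2)
    set(_sketch_default_avx2 ON)
  endif()
else()
  # Other compilers (for example MSVC) or cross builds: skip the probe
  # entirely instead of printing a failed "Performing Test" line.
  message(STATUS "Skipping AVX2 probe; default stays OFF")
endif()
```

With the guard in place, users of other compilers can still opt in explicitly through the overridable cache options.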

hjmjohnson · 2026-04-02T19:59:15Z

None of the ITK CI configurations enable FFTW. Both ITK_USE_FFTWD and ITK_USE_FFTWF are explicitly OFF in the user presets, and none of the Azure Pipelines, GitHub Actions, or Pixi workflows set any FFTW flags. FFTW defaults to off, so all CI builds are non-FFTW builds.

This means the FFTW SIMD detection changes in PR #6004 won't be exercised by CI — they can only be tested on local builds that explicitly enable FFTW (-DITK_USE_FFTWF=ON or -DITK_USE_FFTWD=ON).

Tested locally.
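
For anyone reproducing the local test, one option is a small initial-cache script passed to cmake with -C. The file name and the chosen values below are assumptions for illustration; the ITK_USE_FFTWF, ITK_USE_FFTWD, and FFTW_ENABLE_AVX2 names come from this PR.

```cmake
# fftw-simd-local.cmake (hypothetical file name)
# Pre-load the cache so that ITK builds its bundled FFTW in both precisions.
set(ITK_USE_FFTWF ON CACHE BOOL "Build and use single-precision FFTW")
set(ITK_USE_FFTWD ON CACHE BOOL "Build and use double-precision FFTW")

# Optional: override one detected SIMD default, here to test a non-AVX2 build.
set(FFTW_ENABLE_AVX2 OFF CACHE BOOL "Build FFTW with AVX2 codelets")
```

It would be used as cmake -C fftw-simd-local.cmake followed by the usual source and build directory arguments.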

…re time

FFTW SIMD codelets (NEON, SSE/SSE2, AVX, AVX2) are hand-written assembly routines baked into the library at compile time. Previously all SIMD flags were hardcoded to OFF, producing scalar-only FFTW builds regardless of the host CPU.

Add per-CPU SIMD detection at CMake configure time:
- ARM64 (aarch64/arm64/ARM64): NEON=ON (mandatory in ARMv8)
- x86/x86_64 with GCC/Clang: probe SSE, SSE2, AVX, AVX2 individually via __builtin_cpu_supports() / CheckCSourceRuns
- x86/x86_64 with MSVC: skip probes (intrinsic unavailable), default OFF
- Cross-compile ARM64: NEON=ON; x86_64: SSE+SSE2 only (conservative)
- All other architectures: all SIMD off (safe fallback)

Every flag is an individually overridable cache option (e.g. cmake -DFFTW_ENABLE_AVX2=OFF).

Cherry-picked from PR InsightSoftwareConsortium#6004 (targeting release-5.4) with review fixes:
- ARM64 regex includes all-caps variant for Windows ARM64
- MSVC compiler guard on __builtin_cpu_supports probes
- ENABLE_SSE included in documentation comment

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

dzenanz · 2026-04-02T20:25:58Z

Should we turn on FFTW at least in some CI build(s)? It is most convenient to enable in Linux builds.

github-actions bot added the type:Infrastructure (infrastructure/ecosystem related changes, such as CMake or buildbots) and type:Performance (improvement in terms of compilation or execution time) labels Apr 2, 2026

hjmjohnson changed the base branch from main to release-5.4 April 2, 2026 00:57

hjmjohnson requested a review from blowekamp April 2, 2026 01:44

dzenanz approved these changes Apr 2, 2026

hjmjohnson marked this pull request as ready for review April 2, 2026 17:05

greptile-apps bot reviewed Apr 2, 2026

hjmjohnson merged commit a43fa81 into InsightSoftwareConsortium:release-5.4 Apr 2, 2026
15 of 18 checks passed

hjmjohnson mentioned this pull request Apr 2, 2026

PERF: Enable FFTW SIMD codelets with per-CPU introspection at configure time #6006

Merged

3 tasks

hjmjohnson merged 2 commits into InsightSoftwareConsortium:release-5.4 from hjmjohnson:fftw-compute-optimized-defaults