PERF: Enable FFTW SIMD codelets with per-CPU introspection at configure time #6004

Merged
hjmjohnson merged 2 commits into InsightSoftwareConsortium:release-5.4 from hjmjohnson:fftw-compute-optimized-defaults on Apr 2, 2026

@hjmjohnson (Member) commented Apr 2, 2026

Summary

FFTW SIMD codelets (NEON, SSE/SSE2, AVX, AVX2) are architecture-specific
routines (generated by FFTW's genfft tool) that are compiled into the library
only when FFTW itself is configured to include them. Passing -march=native
to the ITK/BRAINSTools build does not activate them; they must be
requested explicitly via FFTW's own CMake options (ENABLE_NEON, ENABLE_SSE,
ENABLE_SSE2, ENABLE_AVX, ENABLE_AVX2). Before this PR those options
were not forwarded, so every FFTW build was a scalar-only build regardless
of the host CPU.

Changes

  • Per-CPU SIMD detection at CMake configure time (itkExternal_FFTW.cmake):

    | Scenario | Detection method | Result |
    |---|---|---|
    | Native build: ARM64 | Architecture pattern (aarch64\|arm64) | NEON=ON (mandatory in ARMv8) |
    | Native build: x86/x86_64 | CheckCSourceRuns + __builtin_cpu_supports() | Each of SSE, SSE2, AVX, AVX2 set individually to match the actual build-host CPU |
    | Cross-compile: ARM64 | Architecture pattern | NEON=ON |
    | Cross-compile: x86_64 | Architecture pattern (conservative) | SSE=ON, SSE2=ON; AVX/AVX2 not assumed |
    | All other architectures | | All SIMD off (safe fallback) |

  • Every flag (FFTW_ENABLE_NEON, FFTW_ENABLE_SSE, FFTW_ENABLE_SSE2,
    FFTW_ENABLE_AVX, FFTW_ENABLE_AVX2) is an individually overridable
    cache option, e.g. cmake -DFFTW_ENABLE_AVX2=OFF ...

  • Both fftwf (single-precision) and fftwd (double-precision)
    ExternalProject_Add blocks now forward all five SIMD flags consistently.
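
The three changes above can be pictured with the following CMake sketch. It is an illustration under stated assumptions, not the merged code: the `_fftw_default_*` names follow the review flowchart on this page, while `_fftw_host_has_*` and `_fftw_simd_cache_args` are hypothetical names chosen for the example. The all-caps ARM64 pattern and the compiler-ID guard reflect the review follow-up recorded later in this conversation.

```cmake
# Sketch only; the merged CMake/itkExternal_FFTW.cmake may be organized differently.
include(CheckCSourceRuns)

# Conservative starting point: every SIMD level off.
foreach(_level IN ITEMS neon sse sse2 avx avx2)
  set(_fftw_default_${_level} OFF)
endforeach()

if(CMAKE_SYSTEM_PROCESSOR MATCHES "^(aarch64|arm64|ARM64)$")
  # NEON is mandatory in ARMv8, so native and cross builds both enable it.
  set(_fftw_default_neon ON)
elseif(CMAKE_CROSSCOMPILING)
  if(CMAKE_SYSTEM_PROCESSOR MATCHES "^(x86_64|AMD64)$")
    # The target CPU cannot be probed; assume only the x86_64 baseline.
    set(_fftw_default_sse ON)
    set(_fftw_default_sse2 ON)
  endif()
elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "^(x86_64|AMD64|i686)$"
       AND CMAKE_C_COMPILER_ID MATCHES "^(GNU|Clang|AppleClang)$")
  # Native x86 build with a compiler that provides __builtin_cpu_supports():
  # compile and run one tiny probe per SIMD level on the build host.
  foreach(_level IN ITEMS sse sse2 avx avx2)
    # _fftw_host_has_<level> is a hypothetical result-variable name.
    check_c_source_runs("
      int main(void)
      {
        return __builtin_cpu_supports(\"${_level}\") ? 0 : 1;
      }"
      _fftw_host_has_${_level})
    if(_fftw_host_has_${_level})
      set(_fftw_default_${_level} ON)
    endif()
  endforeach()
endif()

# Detection only seeds the defaults; every flag stays user-overridable.
option(FFTW_ENABLE_NEON "Build FFTW NEON codelets" ${_fftw_default_neon})
option(FFTW_ENABLE_SSE  "Build FFTW SSE codelets"  ${_fftw_default_sse})
option(FFTW_ENABLE_SSE2 "Build FFTW SSE2 codelets" ${_fftw_default_sse2})
option(FFTW_ENABLE_AVX  "Build FFTW AVX codelets"  ${_fftw_default_avx})
option(FFTW_ENABLE_AVX2 "Build FFTW AVX2 codelets" ${_fftw_default_avx2})

# One list (hypothetical name) that both the fftwf and fftwd
# ExternalProject_Add blocks could pass through CMAKE_CACHE_ARGS.
set(_fftw_simd_cache_args
  -DENABLE_NEON:BOOL=${FFTW_ENABLE_NEON}
  -DENABLE_SSE:BOOL=${FFTW_ENABLE_SSE}
  -DENABLE_SSE2:BOOL=${FFTW_ENABLE_SSE2}
  -DENABLE_AVX:BOOL=${FFTW_ENABLE_AVX}
  -DENABLE_AVX2:BOOL=${FFTW_ENABLE_AVX2})
```

Because `option()` keeps an existing cache value, a user-supplied -DFFTW_ENABLE_AVX2=OFF takes precedence over whatever the probe detects. Sharing a single argument list between the two precision builds is one way to keep them from drifting apart again.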

Motivation / observed impact

On Apple M4 (ARM64), MaskedFFTNormalizedCorrelationImageFilter with a
scalar FFTW ran 7.5–9× slower than necessary because NEON codelets were
absent. After this change:

  • FFTW CMakeCache: ENABLE_NEON:BOOL=ON, all x86 flags OFF
  • nm confirms 19 NEON codelet functions (_fftwf_codelet_n*) compiled
    into libfftw3f.a

Measured end-to-end on Apple M4 (BRAINSTools BCDTest_rVN4-rpc-rac-rmpj,
which exercises MaskedFFTNormalizedCorrelationImageFilter heavily across
~50 LLS landmark passes):

| FFTW build | Wall-clock time | Image error |
|---|---|---|
| Scalar (no SIMD, before this PR) | ~2,700 s (timeout) | |
| NEON codelets (this PR) | 351 s | ✅ Passed 18 (threshold 50) |
| Speedup | ~7.7× | numerically identical |

The CheckCSourceRuns-based x86 probing ensures that each CPU receives only
the SIMD levels it actually supports (pre-Sandy Bridge cores such as
Westmere = SSE+SSE2 only, Sandy Bridge and Ivy Bridge = +AVX, Haswell and
later = +AVX2), rather than blindly enabling all four regardless of the
build host or a cross-compile target.

Testing

  • Built and verified on Apple M4 (macOS, ARM64, Release build):
    ENABLE_NEON=ON, all x86 flags OFF, 19 NEON codelets confirmed via nm.
  • BCDTest_rVN4-rpc-rac-rmpj passed in 351 s (timeout budget: 6000 s);
    image error 18 < tolerance 50 confirms numerical correctness.
  • CheckCSourceRuns results are CMake-cached after the first configure run,
    so subsequent cmake . calls do not re-run the CPUID executables.

AI Contribution Disclosure

This change was developed with AI assistance (Claude Sonnet 4.6 via Claude
Code). The root-cause diagnosis, detection logic, cross-compile fallback
design, and documentation were reviewed and verified by the human author
before submission. All code in this PR is understood and accountable.


🤖 Generated with Claude Code

…re time

FFTW SIMD codelets are architecture-specific routines (generated by FFTW's
genfft tool) compiled into the library only when requested.  Passing
-march=native to ITK alone does NOT activate them; they must be explicitly
requested via FFTW CMake options.

This change adds automatic SIMD codelet selection based on the actual
build-host CPU:

  Native builds (not cross-compiling):
  - ARM64  (aarch64/arm64): NEON=ON (mandatory in ARMv8); x86 SIMD off.
  - x86/x86_64: SSE, SSE2, AVX, AVX2 each probed independently via
    __builtin_cpu_supports() / CheckCSourceRuns so that codelets are
    enabled only for what the CPU actually supports.  A pre-AVX CPU (for
    example Westmere) gets SSE+SSE2; Sandy Bridge and Ivy Bridge add AVX;
    a Haswell or later gets all four.
  - Other: all SIMD off (conservative fallback).

  Cross-compiled builds:
  - ARM64: NEON=ON; x86 SIMD off.
  - x86_64: SSE+SSE2 only (baseline; AVX/AVX2 not assumed for target).
  - Other: all SIMD off.

Each flag is an individually overridable cache option, e.g.:
  cmake -DFFTW_ENABLE_AVX2=OFF

Also fixes fftwd (double-precision) ExternalProject_Add which was missing
all SIMD flags — it now mirrors the fftwf configuration.

Verified on Apple M4 (arm64): NEON=ON, 19 NEON codelets compiled into
libfftw3f.a; reduces BCDTest_rVN4 from ~1446 s to an estimated ~180-360 s.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions bot added the type:Infrastructure (Infrastructure/ecosystem related changes, such as CMake or buildbots) and type:Performance (Improvement in terms of compilation or execution time) labels Apr 2, 2026
@hjmjohnson hjmjohnson changed the base branch from main to release-5.4 April 2, 2026 00:57
@hjmjohnson hjmjohnson requested a review from blowekamp April 2, 2026 01:44
@hjmjohnson
Member Author

NOTE: This PR purposely targets the ITKv5.4 branch; once it is finalized, it should also be cherry-picked to the ITKv6 branch.

Member

@dzenanz dzenanz left a comment

LGTM

@hjmjohnson hjmjohnson marked this pull request as ready for review April 2, 2026 17:05
@greptile-apps
Contributor

greptile-apps bot commented Apr 2, 2026

Greptile Summary

This PR enables FFTW SIMD codelets (NEON, SSE/SSE2, AVX/AVX2) by adding per-CPU introspection at CMake configure time, replacing hard-coded OFF flags in both the fftwf and fftwd ExternalProject_Add blocks. The approach is well-structured — native x86 builds use check_c_source_runs + __builtin_cpu_supports() for precise per-level detection, ARM64 sets NEON unconditionally, and cross-compiled builds fall back conservatively — with each flag exposed as an overridable cache option.

  • The case-sensitive regex aarch64|arm64 silently misses ARM64 (all-caps), which is what CMAKE_SYSTEM_PROCESSOR reports on Windows ARM64, so NEON will not be enabled on that platform despite being mandatory on every ARMv8 core. The same pattern appears in the cross-compile branch.

Confidence Score: 4/5

Safe to merge for the primary Apple M4 / Linux ARM64 / x86-64 targets; the regex bug only affects Windows ARM64 and is a one-character fix.

One P1 finding (case-sensitive regex drops NEON on Windows ARM64) prevents a 5/5. The rest of the logic — detection, caching, forwarding to both precision builds — is correct and well-tested on the author's primary platform.

CMake/itkExternal_FFTW.cmake — the ARM64 regex pattern in both the native and cross-compile branches.

Important Files Changed

| Filename | Overview |
|---|---|
| CMake/itkExternal_FFTW.cmake | Adds per-CPU SIMD detection at configure time for FFTW; case-sensitive regex misses Windows ARM64 (ARM64 ≠ arm64), causing NEON to stay off there; MSVC lacks __builtin_cpu_supports, producing noisy failed-check messages; both ExternalProject blocks consistently forward the new flags. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[CMake configure starts] --> B{ITK_USE_SYSTEM_FFTW?}
    B -- Yes --> Z[find_package FFTW — no SIMD detection]
    B -- No --> C{CMAKE_CROSSCOMPILING?}

    C -- No native --> D{CMAKE_SYSTEM_PROCESSOR}
    D -- aarch64 / arm64 --> E[NEON=ON\nSSE/SSE2/AVX/AVX2=OFF]
    D -- x86_64 / AMD64 / i686 --> F[check_c_source_runs\n__builtin_cpu_supports per level]
    F --> G[Set _fftw_default_sse/sse2/avx/avx2\nper actual CPU capability]
    D -- other --> H[All SIMD OFF]

    C -- Yes cross --> I{CMAKE_SYSTEM_PROCESSOR}
    I -- aarch64 / arm64 --> J[NEON=ON]
    I -- x86_64 / AMD64 --> K[SSE=ON, SSE2=ON\nAVX/AVX2 not assumed]
    I -- other --> L[All SIMD OFF]

    E & G & H & J & K & L --> M[option FFTW_ENABLE_NEON/SSE/SSE2/AVX/AVX2\ndefaults from detection, user-overridable]

    M --> N[ExternalProject_Add fftwf\nforwards all 5 ENABLE_* flags]
    M --> O[ExternalProject_Add fftwd\nforwards all 5 ENABLE_* flags]

Reviews (1): Last reviewed commit: "PERF: Enable FFTW SIMD codelets with per..."

- Add ARM64 (all-caps) to the regex pattern for CMAKE_SYSTEM_PROCESSOR
  in both native and cross-compile branches; Windows reports ARM64
  instead of arm64/aarch64, so NEON was silently left disabled.
- Guard __builtin_cpu_supports() probes with a compiler-ID check for
  GNU/Clang/AppleClang; MSVC lacks this intrinsic and would emit
  confusing "Performing Test ... - Failed" messages during configure.
- Fix doc comment to include ENABLE_SSE in the list of FFTW options.

Addresses review comments from @greptile-apps on PR InsightSoftwareConsortium#6004.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@hjmjohnson
Member Author

None of the ITK CI configurations enable FFTW. Both ITK_USE_FFTWD and ITK_USE_FFTWF are explicitly OFF in the user presets, and none of the Azure Pipelines, GitHub Actions, or Pixi workflows set any FFTW flags. FFTW defaults to off, so every CI build is a non-FFTW build.

This means the FFTW SIMD detection changes in PR #6004 won't be exercised by CI — they can only be tested on local builds that explicitly enable FFTW (-DITK_USE_FFTWF=ON or -DITK_USE_FFTWD=ON).

Tested locally.
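
For anyone reproducing such a local test, one possible setup is a small CMake initial-cache file like the sketch below. The file name is hypothetical, and the help strings are placeholders; the option names (ITK_USE_FFTWF, ITK_USE_FFTWD, ITK_USE_SYSTEM_FFTW, FFTW_ENABLE_AVX2) are the ones mentioned in this conversation.

```cmake
# fftw-local-test.cmake: hypothetical initial-cache file, loaded with
#   cmake -C fftw-local-test.cmake -S <ITK source dir> -B <build dir>

# Turn on both FFTW precisions, which default to OFF.
set(ITK_USE_FFTWF ON CACHE BOOL "Enable single-precision FFTW")
set(ITK_USE_FFTWD ON CACHE BOOL "Enable double-precision FFTW")

# Build the bundled FFTW so the SIMD detection actually runs;
# per the review flowchart, a system FFTW skips the detection.
set(ITK_USE_SYSTEM_FFTW OFF CACHE BOOL "Use a system-installed FFTW")

# Optional: override one detected default to compare codelet sets.
# set(FFTW_ENABLE_AVX2 OFF CACHE BOOL "Build FFTW AVX2 codelets")
```

After configuring, the FFTW sub-build's CMakeCache can be inspected for the ENABLE_* values, as was done for ENABLE_NEON in the description above.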

@hjmjohnson hjmjohnson merged commit a43fa81 into InsightSoftwareConsortium:release-5.4 Apr 2, 2026
15 of 18 checks passed
hjmjohnson added a commit to hjmjohnson/ITK that referenced this pull request Apr 2, 2026
…re time

FFTW SIMD codelets (NEON, SSE/SSE2, AVX, AVX2) are architecture-specific
routines (generated by FFTW's genfft tool) compiled into the library only
when requested.  Previously all SIMD flags were hardcoded to OFF, producing
scalar-only FFTW builds regardless of the host CPU.

Add per-CPU SIMD detection at CMake configure time:
- ARM64 (aarch64/arm64/ARM64): NEON=ON (mandatory in ARMv8)
- x86/x86_64 with GCC/Clang: probe SSE, SSE2, AVX, AVX2 individually
  via __builtin_cpu_supports() / CheckCSourceRuns
- x86/x86_64 with MSVC: skip probes (intrinsic unavailable), default OFF
- Cross-compile ARM64: NEON=ON; x86_64: SSE+SSE2 only (conservative)
- All other architectures: all SIMD off (safe fallback)

Every flag is an individually overridable cache option
(e.g. cmake -DFFTW_ENABLE_AVX2=OFF).

Cherry-picked from PR InsightSoftwareConsortium#6004 (targeting release-5.4) with review fixes:
- ARM64 regex includes all-caps variant for Windows ARM64
- MSVC compiler guard on __builtin_cpu_supports probes
- ENABLE_SSE included in documentation comment

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@dzenanz
Member

dzenanz commented Apr 2, 2026

Should we turn on FFTW at least in some CI build(s)? It is most convenient to enable in Linux builds.