PERF: Enable FFTW SIMD codelets with per-CPU introspection at configure time by hjmjohnson · Pull Request #6006 · InsightSoftwareConsortium/ITK

hjmjohnson · 2026-04-02T20:02:14Z

Summary

- Add per-CPU SIMD detection at CMake configure time for FFTW external project builds
- Replace hardcoded ENABLE_*:BOOL=OFF flags with auto-detected, user-overridable cache options
- Forward all five SIMD flags (NEON, SSE, SSE2, AVX, AVX2) consistently to both fftwf and fftwd ExternalProject blocks

Detection Policy

Scenario	Detection	Result
Native ARM64 (aarch64/arm64/ARM64)	Architecture pattern	NEON=ON
Native x86_64 with GCC/Clang	`CheckCSourceRuns` + `__builtin_cpu_supports()`	Per-CPU SSE/SSE2/AVX/AVX2
Native x86_64 with MSVC	Probes skipped (intrinsic unavailable)	All SIMD OFF (user can override)
Cross-compile ARM64	Architecture pattern	NEON=ON
Cross-compile x86_64	Conservative	SSE+SSE2 only
Other architectures	—	All SIMD OFF

Every flag is individually overridable: cmake -DFFTW_ENABLE_AVX2=OFF ...
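
For illustration, the GCC/Clang row of the table could be backed by a probe along the lines of the C sketch below. The helper names (host_has_sse2 and friends, report_host_simd) are hypothetical and not taken from this PR; only __builtin_cpu_supports() is named in the description. On other compilers or architectures the sketch conservatively reports 0 for every feature.

```c
#include <stdio.h>

/* Hypothetical helpers, not part of ITK or FFTW.
 * __builtin_cpu_supports() is a GCC/Clang builtin for x86 targets that
 * queries the CPU the program is currently running on. Elsewhere, fall
 * back to "unsupported" so the code still compiles. */
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#define PROBE(feature) (__builtin_cpu_supports(feature) ? 1 : 0)
#else
#define PROBE(feature) 0
#endif

/* Each function returns 1 if the running CPU reports the feature, else 0. */
int host_has_sse(void)  { return PROBE("sse"); }
int host_has_sse2(void) { return PROBE("sse2"); }
int host_has_avx(void)  { return PROBE("avx"); }
int host_has_avx2(void) { return PROBE("avx2"); }

/* Print the probe results in a form similar to a configure log line. */
void report_host_simd(FILE *out)
{
    fprintf(out, "sse=%d sse2=%d avx=%d avx2=%d\n",
            host_has_sse(), host_has_sse2(), host_has_avx(), host_has_avx2());
}
```

A run-style configure check would typically wrap one such call in a tiny program whose exit status is 0 when the feature is present. Note that the answer describes the machine doing the configure step, not necessarily the machine that will run the resulting binaries.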

Relationship to PR #6004

This is the ITK v6 (main branch) forward-port of #6004, which targets release-5.4. Includes all review fixes from #6004:

- ARM64 (all-caps) added to regex for Windows ARM64 compatibility
- __builtin_cpu_supports probes guarded by compiler-ID check (GCC/Clang/AppleClang) to avoid noisy failures on MSVC
- ENABLE_SSE included in documentation comment

Test plan

- CI builds pass (FFTW is not enabled in CI, so this is a no-op for CI)
- Local build with -DITK_USE_FFTWF=ON on x86_64 Linux confirms SSE2/AVX2 detection
- Local build on Apple M4 confirms NEON=ON

🤖 Generated with Claude Code

…re time

FFTW SIMD codelets (NEON, SSE/SSE2, AVX, AVX2) are generated routines that use SIMD intrinsics and are compiled into the library at build time. Previously all SIMD flags were hardcoded to OFF, producing scalar-only FFTW builds regardless of the host CPU.

Add per-CPU SIMD detection at CMake configure time:
- ARM64 (aarch64/arm64/ARM64): NEON=ON (mandatory in ARMv8)
- x86/x86_64 with GCC/Clang: probe SSE, SSE2, AVX, AVX2 individually via __builtin_cpu_supports() / CheckCSourceRuns
- x86/x86_64 with MSVC: skip probes (intrinsic unavailable), default OFF
- Cross-compile ARM64: NEON=ON; x86_64: SSE+SSE2 only (conservative)
- All other architectures: all SIMD off (safe fallback)

Every flag is an individually overridable cache option (e.g. cmake -DFFTW_ENABLE_AVX2=OFF).

Cherry-picked from PR InsightSoftwareConsortium#6004 (targeting release-5.4) with review fixes:
- ARM64 regex includes all-caps variant for Windows ARM64
- MSVC compiler guard on __builtin_cpu_supports probes
- ENABLE_SSE included in documentation comment

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

greptile-apps · 2026-04-02T20:05:51Z

Greptile Summary

This PR replaces the previous hardcoded ENABLE_*=OFF flags in FFTW's ExternalProject_Add calls with auto-detected, user-overridable FFTW_ENABLE_* cache options. Detection uses CheckCSourceRuns with __builtin_cpu_supports() for GCC/Clang on x86/x86_64 (correctly gated off MSVC) and an architecture-pattern match for ARM64/AArch64 NEON, with a conservative cross-compile fallback. The five flags are consistently forwarded to both fftwf and fftwd ExternalProject blocks.

Confidence Score: 5/5

Safe to merge; detection logic is correct and all remaining findings are minor P2 suggestions.

The SIMD detection is correctly guarded for cross-compilation, MSVC, and non-x86/non-ARM architectures. The CheckCSourceRuns probes use proper CMake caching. The option() defaults are well-documented. Both remaining comments are P2 (stale-cache UX note and a harmless no-op SSE flag on fftwd), neither of which affects correctness or build reliability.

No files require special attention; only minor P2 style items remain.

Important Files Changed

Filename	Overview
CMake/itkExternal_FFTW.cmake	Adds per-CPU SIMD detection for FFTW ExternalProject builds: introduces CheckCSourceRuns probes for x86 SIMD levels (SSE/SSE2/AVX/AVX2) and architecture-based NEON detection, exposes five FFTW_ENABLE_* cache options, and forwards them to both fftwf and fftwd ExternalProject blocks.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[cmake configure] --> B{ITK_USE_SYSTEM_FFTW?}
    B -- yes --> Z[find_package FFTW]
    B -- no --> C{CMAKE_CROSSCOMPILING?}

    C -- no --> D{CMAKE_SYSTEM_PROCESSOR}
    D -- aarch64/arm64/ARM64 --> E[_fftw_default_neon = ON]
    D -- x86_64/AMD64/i686 --> F{C compiler = GCC/Clang/AppleClang?}
    F -- yes --> G[check_c_source_runs per SIMD level\nsse / sse2 / avx / avx2]
    G --> H[_fftw_default_* = ON if probe passes]
    F -- no MSVC --> I[all _fftw_default_* = OFF]
    D -- other --> I

    C -- yes --> J{CMAKE_SYSTEM_PROCESSOR}
    J -- aarch64/arm64/ARM64 --> K[_fftw_default_neon = ON]
    J -- x86_64/AMD64 --> L[_fftw_default_sse = ON\n_fftw_default_sse2 = ON]
    J -- other --> M[all _fftw_default_* = OFF]

    E & H & I & K & L & M --> N[option FFTW_ENABLE_NEON/SSE/SSE2/AVX/AVX2\ndefault = detected value\ncached — user-overridable]

    N --> O{ITK_USE_FFTWF?}
    O -- yes --> P[ExternalProject_Add fftwf\nENABLE_FLOAT=ON\n+ all FFTW_ENABLE_* flags]

    N --> Q{ITK_USE_FFTWD?}
    Q -- yes --> R[ExternalProject_Add fftwd\nENABLE_FLOAT=OFF\n+ all FFTW_ENABLE_* flags]

CMake/itkExternal_FFTW.cmake

@greptile-apps

- Add message(STATUS) showing detected FFTW SIMD flags at configure time so users can verify detection without inspecting the cache.
- Remove ENABLE_SSE from the fftwd (double-precision) ExternalProject block; SSE1 codelets are float-only and have no effect on fftwd.
- Document in the file header that option() defaults only apply on first configure and that ENABLE_SSE is not forwarded to fftwd.

Addresses review comments from @greptile-apps on PR InsightSoftwareConsortium#6006.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

hjmjohnson · 2026-04-02T20:25:54Z

CI does not test FFTW

The following tests passed:
	itkFFTWF_FFTTest
	itkFFTWF_RealFFTTest
	itkVnlFFTWF_FFTTest
	itkVnlFFTWF_RealFFTTest
	itkFFTWD_FFTTest
	itkFFTWD_RealFFTTest
	itkVnlFFTWD_FFTTest
	itkVnlFFTWD_RealFFTTest
	itkFFTWComplexToComplexFFTImageFilter2DFloatTest
	itkFFTWComplexToComplexFFTImageFilter3DFloatTest
	itkFFTWComplexToComplexFFTImageFilter2DDoubleTest
	itkFFTWComplexToComplexFFTImageFilter3DDoubleTest
	itkFFTWForward1DFFTImageFilterTest
	itkFFTWInverse1DFFTImageFilterTest
	itkFFTWComplexToComplex1DFFTImageFilterTest
	itkFFTW1DImageFilterTest

100% tests passed, 0 tests failed out of 16

hjmjohnson · 2026-04-02T20:26:43Z

@dzenanz This is the main-branch follow-up to #6004, which targets the release-5.4 branch.

dzenanz

We should enable FFTW in some CI build.

seanm · 2026-04-02T20:37:08Z

Configure-time checking of such things usually does not play well with 'universal binaries' on macOS (where one builds for both arm & intel)...

hjmjohnson · 2026-04-02T20:45:41Z

Configure-time checking of such things usually does not play well with 'universal binaries' on macOS (where one builds for both arm & intel)...

@seanm The defaults can be overridden. The default is to use no optimization, and in that case FFTW is at least 7.5 times slower, and often 10 times slower. FFTW becomes nearly useless in these cases.

The behavior is intended to default to no optimization if "CMAKE_CROSSCOMPILING" is on, and I think that is the case for universal binaries.

Do you have a recommendation for an alternative strategy?

Do we need something more than 886494b#diff-c55dd1b7b03f8f37a8d66eb87317f3aaf58d457eb7adea1aab4fd5d69a95ab1fR90?

seanm · 2026-04-02T21:52:25Z

"...each of SSE, SSE2, AVX, AVX2 is probed individually via __builtin_cpu_supports..." so even on non-Mac this strategy is dangerous, because it assumes that the machine building the code is the same as the machine running the code. If I'm making a Windows app and my buildbot is a beefy modern CPU but I still want customers with old CPUs to run the app, the app will presumably crash for them because it's using new CPU instructions that their old CPUs don't support.

Do you have a recommendation for an alternative strategy?

Compile-time checks.

#if defined(__SSE2__) && __SSE2__
  /* something1 */
#elif defined(__arm64__)
  /* something2 */
#elif defined(__x86_64__)
  /* something3 */
#else
  /* etc. */
#endif

(because the compiler knows what CPU it's compiling for)
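
As a concrete version of that idea, the C sketch below reports which SIMD levels the compiler is targeting, using only predefined macros. The function names are hypothetical; the macros (__SSE2__, __AVX__, __AVX2__, __ARM_NEON, and the MSVC-style _M_X64, _M_IX86_FP, _M_ARM64) are the usual compiler-provided ones. Because the result is fixed at compile time, it reflects the flags passed to the compiler (for example -mavx2 or /arch:AVX2), not the CPU of the build machine.

```c
/* Hypothetical helpers: each returns 1 if the compiler is generating code
 * for a target that includes the given SIMD extension, else 0.
 * Nothing here inspects the CPU at run time. */

int target_has_sse2(void)
{
#if defined(__SSE2__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
    return 1; /* SSE2 is part of the x86_64 baseline */
#else
    return 0;
#endif
}

int target_has_avx(void)
{
#if defined(__AVX__)
    return 1; /* defined with -mavx, -mavx2, -march=native on AVX hosts, /arch:AVX */
#else
    return 0;
#endif
}

int target_has_avx2(void)
{
#if defined(__AVX2__)
    return 1; /* defined with -mavx2, /arch:AVX2, or a suitable -march */
#else
    return 0;
#endif
}

int target_has_neon(void)
{
#if defined(__ARM_NEON) || defined(__aarch64__) || defined(_M_ARM64)
    return 1; /* NEON is part of the AArch64 baseline */
#else
    return 0;
#endif
}
```

Compiling the same file with and without -mavx2 changes the value returned by target_has_avx2(), which is the property a compile-only configure check relies on.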

hjmjohnson · 2026-04-02T22:20:20Z

@seanm These are the defaults for the most common use cases. In the less common scenario you define, one simply needs to explicitly turn them off. Once off, they stay off on subsequent cmake configurations, unless explicitly requested to be turned back on.

cmake -DFFTW_ENABLE_SSE:BOOL=OFF -DFFTW_ENABLE_NEON:BOOL=OFF ...

That is the primary reason for using option() rather than set() for these variables.

seanm · 2026-04-02T22:26:37Z

In the less common scenario you define...

Not so sure the scenario I described is "less common". I suspect most programmers have better CPUs than the customers running their code.

Anyway, I don't use FFTW so I don't much care in this case.

…builds

Addresses seanm's review of ITK PR InsightSoftwareConsortium#6006: the previous check_c_source_runs approach probed the BUILD HOST's CPU at configure time, producing FFTW binaries that require the build machine's exact CPU and fail with SIGILL on any machine that lacks the detected SIMD extensions. This is unsafe for redistributed binary packages (conda, pip/PyPI, manylinux Docker images) where build and target machines differ.

New detection policy (compile-time only, never runtime):

x86_64 / AMD64: SSE and SSE2 are mandated by the AMD64 ABI; every 64-bit x86 CPU supports them regardless of age. Both are enabled by default. Safe for all manylinux2014 / manylinux_2_28 / conda x86_64 builds.

aarch64 / arm64: NEON is mandated by the AArch64 ABI; every arm64 CPU has it. Enabled by default. Safe for all conda / manylinux aarch64 builds.

AVX / AVX2 (Sandy Bridge 2011 / Haswell 2013 required): NOT universally available; default OFF for redistribution safety. Auto-enabled only when the compiler is already generating those instructions, i.e. when the user passed -march=native, -mavx2, /arch:AVX2, or similar. Detected via check_c_source_compiles (not _runs), which tests what the compiler targets, not what the build host's CPU can execute. This implements seanm's recommended "the compiler knows what CPU it's compiling for" approach. The AVX/AVX2 cache variables are unset before each probe so that detection re-runs on every configure when compiler flags change (e.g. user later adds -march=native).

macOS universal binary (CMAKE_OSX_ARCHITECTURES with >1 entry): SIMD defaults disabled; a single configure pass cannot produce correct per-slice codelets for both arm64 and x86_64.

This change is a strict improvement on the previous behaviour for the most important redistribution platforms:
- conda/pip on x86_64: SSE+SSE2 always ON (was OFF without runtime probe)
- conda/pip on arm64: NEON always ON (unchanged)
- AVX2 on build host: ON only when compiler targets it (was ON always)

Closes InsightSoftwareConsortium#6006 (follow-up addressing seanm review)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
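
The revised policy can be read as a small decision function. The C sketch below restates it for illustration only: the struct, the function name, and the parameters are hypothetical, and the real logic lives in CMake rather than C. The two compiler_targets_* inputs stand for the result of a compile-only check of what the compiler is already generating.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical model of the default SIMD flags described above. */
typedef struct {
    bool neon, sse, sse2, avx, avx2;
} SimdDefaults;

SimdDefaults fftw_simd_defaults(const char *arch,
                                bool compiler_targets_avx,
                                bool compiler_targets_avx2,
                                bool universal_binary)
{
    SimdDefaults d = { false, false, false, false, false };

    /* macOS universal build: one configure pass cannot serve both slices. */
    if (universal_binary) {
        return d;
    }

    if (strcmp(arch, "x86_64") == 0 || strcmp(arch, "AMD64") == 0) {
        /* ABI baseline: always safe on 64-bit x86. */
        d.sse = true;
        d.sse2 = true;
        /* Only follow the compiler; never probe the build host. */
        d.avx = compiler_targets_avx;
        d.avx2 = compiler_targets_avx2;
    } else if (strcmp(arch, "aarch64") == 0 || strcmp(arch, "arm64") == 0 ||
               strcmp(arch, "ARM64") == 0) {
        /* ABI baseline: always safe on AArch64. */
        d.neon = true;
    }
    /* Any other architecture keeps every flag off. */
    return d;
}
```

For example, fftw_simd_defaults("x86_64", false, false, false) yields SSE and SSE2 on with AVX and AVX2 off, which matches the redistribution-safe default for conda and manylinux x86_64 builds.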

github-actions bot added the type:Infrastructure (Infrastructure/ecosystem related changes, such as CMake or buildbots) and type:Performance (Improvement in terms of compilation or execution time) labels Apr 2, 2026

greptile-apps bot reviewed Apr 2, 2026

hjmjohnson force-pushed the fftw-simd-windows-arm64-fix branch from e61066f to a9b11c8 Compare April 2, 2026 20:24

hjmjohnson requested a review from dzenanz April 2, 2026 20:26

dzenanz approved these changes Apr 2, 2026

thewtex approved these changes Apr 2, 2026

hjmjohnson merged commit 360c9be into InsightSoftwareConsortium:main Apr 2, 2026
15 of 18 checks passed

hjmjohnson mentioned this pull request Apr 3, 2026

PERF: Use ABI-guaranteed SIMD baselines for redistribution-safe FFTW builds #6007

Merged

5 tasks

hjmjohnson merged 2 commits into InsightSoftwareConsortium:main from hjmjohnson:fftw-simd-windows-arm64-fix
