PERF: Enable FFTW SIMD codelets with per-CPU introspection at configure time #6006

Merged
hjmjohnson merged 2 commits into InsightSoftwareConsortium:main from hjmjohnson:fftw-simd-windows-arm64-fix
Apr 2, 2026

@hjmjohnson
Member

Summary

  • Add per-CPU SIMD detection at CMake configure time for FFTW external project builds
  • Replace hardcoded ENABLE_*:BOOL=OFF flags with auto-detected, user-overridable cache options
  • Forward all five SIMD flags (NEON, SSE, SSE2, AVX, AVX2) consistently to both fftwf and fftwd ExternalProject blocks

Detection Policy

| Scenario | Detection | Result |
| --- | --- | --- |
| Native ARM64 (aarch64/arm64/ARM64) | Architecture pattern | NEON=ON |
| Native x86_64 with GCC/Clang | CheckCSourceRuns + __builtin_cpu_supports() | Per-CPU SSE/SSE2/AVX/AVX2 |
| Native x86_64 with MSVC | Probes skipped (intrinsic unavailable) | All SIMD OFF (user can override) |
| Cross-compile ARM64 | Architecture pattern | NEON=ON |
| Cross-compile x86_64 | Conservative | SSE+SSE2 only |
| Other architectures | | All SIMD OFF |

Every flag is individually overridable: cmake -DFFTW_ENABLE_AVX2=OFF ...
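
For readers unfamiliar with the mechanism, the native x86_64 GCC/Clang row of the policy could look roughly like the sketch below. It is not the code from CMake/itkExternal_FFTW.cmake: the project name and the `_default_avx2` and `_probe_has_avx2` variables are made up for illustration, and only the AVX2 flag is shown.

```cmake
# Illustrative sketch only; helper variable names are hypothetical.
cmake_minimum_required(VERSION 3.16)
project(FftwSimdRunProbeSketch C)

include(CheckCSourceRuns)

set(_default_avx2 OFF)
if(NOT CMAKE_CROSSCOMPILING
   AND CMAKE_SYSTEM_PROCESSOR MATCHES "^(x86_64|AMD64|i686)$"
   AND CMAKE_C_COMPILER_ID MATCHES "^(GNU|Clang|AppleClang)$")
  # The probe program is compiled AND executed on the build host.
  # It exits with 0 only if that host's CPU reports AVX2 support.
  check_c_source_runs("
    int main(void) {
      return __builtin_cpu_supports(\"avx2\") ? 0 : 1;
    }" _probe_has_avx2)
  if(_probe_has_avx2)
    set(_default_avx2 ON)
  endif()
endif()

# The detected value is only the default; -DFFTW_ENABLE_AVX2=OFF still wins.
option(FFTW_ENABLE_AVX2 "Build FFTW with AVX2 codelets" ${_default_avx2})
message(STATUS "FFTW_ENABLE_AVX2 = ${FFTW_ENABLE_AVX2}")
```

Because the probe executes on the machine running CMake, the resulting default describes the build host rather than the machine that will eventually run the binaries.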

Relationship to PR #6004

This is the ITK v6 (main branch) forward-port of #6004, which targets release-5.4. Includes all review fixes from #6004:

  • ARM64 (all-caps) added to regex for Windows ARM64 compatibility
  • __builtin_cpu_supports probes guarded by compiler-ID check (GCC/Clang/AppleClang) to avoid noisy failures on MSVC
  • ENABLE_SSE included in documentation comment

Test plan

  • CI builds pass (FFTW is not enabled in CI, so this is a no-op for CI)
  • Local build with -DITK_USE_FFTWF=ON on x86_64 Linux confirms SSE2/AVX2 detection
  • Local build on Apple M4 confirms NEON=ON

🤖 Generated with Claude Code

…re time

FFTW SIMD codelets (NEON, SSE/SSE2, AVX, AVX2) are generated C routines
built on SIMD intrinsics and compiled into the library at build time.
Previously all SIMD flags were hardcoded to OFF, producing scalar-only
FFTW builds regardless of the host CPU.

Add per-CPU SIMD detection at CMake configure time:
- ARM64 (aarch64/arm64/ARM64): NEON=ON (mandatory in ARMv8)
- x86/x86_64 with GCC/Clang: probe SSE, SSE2, AVX, AVX2 individually
  via __builtin_cpu_supports() / CheckCSourceRuns
- x86/x86_64 with MSVC: skip probes (intrinsic unavailable), default OFF
- Cross-compile ARM64: NEON=ON; x86_64: SSE+SSE2 only (conservative)
- All other architectures: all SIMD off (safe fallback)

Every flag is an individually overridable cache option
(e.g. cmake -DFFTW_ENABLE_AVX2=OFF).

Cherry-picked from PR InsightSoftwareConsortium#6004 (targeting release-5.4) with review fixes:
- ARM64 regex includes all-caps variant for Windows ARM64
- MSVC compiler guard on __builtin_cpu_supports probes
- ENABLE_SSE included in documentation comment

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added the type:Infrastructure and type:Performance labels Apr 2, 2026
@greptile-apps
Contributor

greptile-apps bot commented Apr 2, 2026

Greptile Summary

This PR replaces the previous hardcoded ENABLE_*=OFF flags in FFTW's ExternalProject_Add calls with auto-detected, user-overridable FFTW_ENABLE_* cache options. Detection uses CheckCSourceRuns with __builtin_cpu_supports() for GCC/Clang on x86/x86_64 (correctly gated off MSVC) and an architecture-pattern match for ARM64/AArch64 NEON, with a conservative cross-compile fallback. The five flags are consistently forwarded to both fftwf and fftwd ExternalProject blocks.

Confidence Score: 5/5

Safe to merge; detection logic is correct and all remaining findings are minor P2 suggestions.

The SIMD detection is correctly guarded for cross-compilation, MSVC, and non-x86/non-ARM architectures. The CheckCSourceRuns probes use proper CMake caching. The option() defaults are well-documented. Both remaining comments are P2 (stale-cache UX note and a harmless no-op SSE flag on fftwd), neither of which affects correctness or build reliability.

No files require special attention; only minor P2 style items remain.

Important Files Changed

Filename Overview
CMake/itkExternal_FFTW.cmake Adds per-CPU SIMD detection for FFTW ExternalProject builds: introduces CheckCSourceRuns probes for x86 SIMD levels (SSE/SSE2/AVX/AVX2) and architecture-based NEON detection, exposes five FFTW_ENABLE_* cache options, and forwards them to both fftwf and fftwd ExternalProject blocks.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[cmake configure] --> B{ITK_USE_SYSTEM_FFTW?}
    B -- yes --> Z[find_package FFTW]
    B -- no --> C{CMAKE_CROSSCOMPILING?}

    C -- no --> D{CMAKE_SYSTEM_PROCESSOR}
    D -- aarch64/arm64/ARM64 --> E[_fftw_default_neon = ON]
    D -- x86_64/AMD64/i686 --> F{C compiler = GCC/Clang/AppleClang?}
    F -- yes --> G[check_c_source_runs per SIMD level\nsse / sse2 / avx / avx2]
    G --> H[_fftw_default_* = ON if probe passes]
    F -- no MSVC --> I[all _fftw_default_* = OFF]
    D -- other --> I

    C -- yes --> J{CMAKE_SYSTEM_PROCESSOR}
    J -- aarch64/arm64/ARM64 --> K[_fftw_default_neon = ON]
    J -- x86_64/AMD64 --> L[_fftw_default_sse = ON\n_fftw_default_sse2 = ON]
    J -- other --> M[all _fftw_default_* = OFF]

    E & H & I & K & L & M --> N[option FFTW_ENABLE_NEON/SSE/SSE2/AVX/AVX2\ndefault = detected value\ncached — user-overridable]

    N --> O{ITK_USE_FFTWF?}
    O -- yes --> P[ExternalProject_Add fftwf\nENABLE_FLOAT=ON\n+ all FFTW_ENABLE_* flags]

    N --> Q{ITK_USE_FFTWD?}
    Q -- yes --> R[ExternalProject_Add fftwd\nENABLE_FLOAT=OFF\n+ all FFTW_ENABLE_* flags]

Reviews (1): Last reviewed commit: "PERF: Enable FFTW SIMD codelets with per..."

hjmjohnson added a commit to hjmjohnson/ITK that referenced this pull request Apr 2, 2026
- Add message(STATUS) showing detected FFTW SIMD flags at configure time
  so users can verify detection without inspecting the cache.
- Remove ENABLE_SSE from the fftwd (double-precision) ExternalProject
  block; SSE1 codelets are float-only and have no effect on fftwd.
- Document in the file header that option() defaults only apply on
  first configure and that ENABLE_SSE is not forwarded to fftwd.

Addresses review comments from @greptile-apps on PR InsightSoftwareConsortium#6006.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
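
The two review follow-ups in this commit (a configure-time status line, and not forwarding ENABLE_SSE to the double-precision build) might be sketched as follows. This is a hypothetical sketch: the `_fftw_simd_args_common`, `_fftwf_args` and `_fftwd_args` list names are invented, the option values are stand-ins, and the ExternalProject_Add calls that would consume the lists are left out.

```cmake
# Illustrative sketch only; can be run with `cmake -P <file>`.
cmake_minimum_required(VERSION 3.16)

# Stand-in values; in the PR these come from the FFTW_ENABLE_* options.
set(FFTW_ENABLE_NEON OFF)
set(FFTW_ENABLE_SSE  ON)
set(FFTW_ENABLE_SSE2 ON)
set(FFTW_ENABLE_AVX  OFF)
set(FFTW_ENABLE_AVX2 OFF)

# Report the flags so users can check them without opening the cache.
message(STATUS "FFTW SIMD: NEON=${FFTW_ENABLE_NEON} SSE=${FFTW_ENABLE_SSE} "
               "SSE2=${FFTW_ENABLE_SSE2} AVX=${FFTW_ENABLE_AVX} "
               "AVX2=${FFTW_ENABLE_AVX2}")

# Flags that are forwarded to both precisions.
set(_fftw_simd_args_common
  -DENABLE_NEON:BOOL=${FFTW_ENABLE_NEON}
  -DENABLE_SSE2:BOOL=${FFTW_ENABLE_SSE2}
  -DENABLE_AVX:BOOL=${FFTW_ENABLE_AVX}
  -DENABLE_AVX2:BOOL=${FFTW_ENABLE_AVX2})

# Single precision (fftwf) additionally receives ENABLE_SSE.
set(_fftwf_args
  -DENABLE_FLOAT:BOOL=ON
  -DENABLE_SSE:BOOL=${FFTW_ENABLE_SSE}
  ${_fftw_simd_args_common})

# Double precision (fftwd): ENABLE_SSE is deliberately not forwarded.
set(_fftwd_args
  -DENABLE_FLOAT:BOOL=OFF
  ${_fftw_simd_args_common})

message(STATUS "fftwf args: ${_fftwf_args}")
message(STATUS "fftwd args: ${_fftwd_args}")
```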
@hjmjohnson hjmjohnson force-pushed the fftw-simd-windows-arm64-fix branch from e61066f to a9b11c8 Compare April 2, 2026 20:24
@hjmjohnson
Member Author

CI does not test FFTW

The following tests passed:
	itkFFTWF_FFTTest
	itkFFTWF_RealFFTTest
	itkVnlFFTWF_FFTTest
	itkVnlFFTWF_RealFFTTest
	itkFFTWD_FFTTest
	itkFFTWD_RealFFTTest
	itkVnlFFTWD_FFTTest
	itkVnlFFTWD_RealFFTTest
	itkFFTWComplexToComplexFFTImageFilter2DFloatTest
	itkFFTWComplexToComplexFFTImageFilter3DFloatTest
	itkFFTWComplexToComplexFFTImageFilter2DDoubleTest
	itkFFTWComplexToComplexFFTImageFilter3DDoubleTest
	itkFFTWForward1DFFTImageFilterTest
	itkFFTWInverse1DFFTImageFilterTest
	itkFFTWComplexToComplex1DFFTImageFilterTest
	itkFFTW1DImageFilterTest

100% tests passed, 0 tests failed out of 16

@hjmjohnson hjmjohnson requested a review from dzenanz April 2, 2026 20:26
@hjmjohnson
Member Author

@dzenanz This is the main-branch follow-up to #6004, which targets the release-5.4 branch.

Member

@dzenanz dzenanz left a comment

We should enable FFTW in some CI build.

@seanm
Contributor

seanm commented Apr 2, 2026

Configure-time checking of such things usually does not play well with 'universal binaries' on macOS (where one builds for both arm & intel)...

@hjmjohnson
Member Author

hjmjohnson commented Apr 2, 2026

> Configure-time checking of such things usually does not play well with 'universal binaries' on macOS (where one builds for both arm & intel)...

@seanm The defaults can be overridden. Before this PR the default was no SIMD optimization, and in that case FFTW is at least 7.5 times slower, and often 10 times slower. FFTW becomes nearly useless in these cases.

The behavior is intended to skip per-CPU probing and use conservative defaults if CMAKE_CROSSCOMPILING is on, and I think that is the case for universal binaries.

Do you have a recommendation for an alternative strategy?

Do we need something more than 886494b#diff-c55dd1b7b03f8f37a8d66eb87317f3aaf58d457eb7adea1aab4fd5d69a95ab1fR90?

@seanm
Contributor

seanm commented Apr 2, 2026

"...each of SSE, SSE2, AVX, AVX2 is probed individually via __builtin_cpu_supports..." so even on non-Mac this strategy is dangerous, because it assumes that the machine building the code is the same as the machine running the code. If I'm making a Windows app and my buildbot is a beefy modern CPU but I still want customers with old CPUs to run the app, the app will presumably crash for them because it's using new CPU instructions that their old CPUs don't support.

> Do you have a recommendation for an alternative strategy?

Compile-time checks.

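#if defined(__SSE2__) && __SSE2__
  /* something1: SSE2 code path */
#elif defined(__arm64__)
  /* something2: arm64 code path */
#elif defined(__x86_64__)
  /* something3: generic x86_64 code path */
#else
  /* etc.: portable fallback */
#endif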

(because the compiler knows what CPU it's compiling for)
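
In CMake terms, the same idea could be expressed with a compile-only probe. The sketch below is an assumption about how one might do it, not code from this PR: the project name and the `_compiler_targets_avx2` result variable are hypothetical. It relies on the `__AVX2__` predefined macro, which GCC and Clang define under `-mavx2` or a matching `-march`, and MSVC defines under `/arch:AVX2`.

```cmake
# Illustrative sketch only; names are hypothetical.
cmake_minimum_required(VERSION 3.16)
project(FftwSimdCompileProbeSketch C)

include(CheckCSourceCompiles)

# The source compiles only if the compiler, with the current CMAKE_C_FLAGS,
# is already targeting AVX2. Nothing is executed, so the capabilities of
# the build host's CPU play no part in the result.
check_c_source_compiles("
  #if !defined(__AVX2__)
  #error \"AVX2 is not enabled for this compilation\"
  #endif
  int main(void) { return 0; }" _compiler_targets_avx2)

if(_compiler_targets_avx2)
  message(STATUS "Compiler targets AVX2; AVX2 codelets are a safe default")
else()
  message(STATUS "Compiler does not target AVX2; defaulting AVX2 to OFF")
endif()
```

With this approach, a configure run with `-DCMAKE_C_FLAGS=-mavx2` would be expected to pass the probe, and a run with default flags on x86_64 would be expected to fail it.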

@hjmjohnson
Member Author

@seanm These are the defaults for the most common use cases. In the less common scenario you define, one simply needs to explicitly turn them off. Once off, they stay off on subsequent cmake configurations, unless explicitly requested to be turned back on.

cmake -DFFTW_ENABLE_SSE:BOOL=OFF -DFFTW_ENABLE_NEON:BOOL=OFF ...

That is the primary reason for using option() rather than set() for these variables.
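
The cache behaviour being relied on here can be seen in a minimal sketch. The `_detected_default` variable and the project name are hypothetical stand-ins; `FFTW_ENABLE_SSE2` is one of the options named in this PR.

```cmake
# Illustrative sketch only.
cmake_minimum_required(VERSION 3.16)
project(OptionDefaultSketch LANGUAGES NONE)

# Stand-in for an auto-detected value.
set(_detected_default ON)

# First configure: the cache entry is created with the detected default,
# unless the user supplied -DFFTW_ENABLE_SSE2=... on the command line.
# Later configures: the cached value is kept and the default is ignored.
option(FFTW_ENABLE_SSE2 "Build FFTW with SSE2 codelets" ${_detected_default})

message(STATUS "FFTW_ENABLE_SSE2 = ${FFTW_ENABLE_SSE2}")
```

Configuring once with `cmake -S . -B build -DFFTW_ENABLE_SSE2=OFF` and then re-running `cmake -S . -B build` without the flag should keep reporting OFF, whereas a plain `set()` of a normal variable would be reassigned to the detected value on every run.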

@hjmjohnson hjmjohnson merged commit 360c9be into InsightSoftwareConsortium:main Apr 2, 2026
15 of 18 checks passed
@seanm
Contributor

seanm commented Apr 2, 2026

> In the less common scenario you define...

Not so sure the scenario I described is "less common". I suspect most programmers have better CPUs than the customers running their code.

Anyway, I don't use FFTW so I don't much care in this case.

hjmjohnson added a commit to hjmjohnson/ITK that referenced this pull request Apr 3, 2026
…builds

Addresses seanm's review of ITK PR InsightSoftwareConsortium#6006: the previous
check_c_source_runs approach probed the BUILD HOST's CPU at configure
time, producing FFTW binaries that require the build machine's exact CPU
and SIGILL on any machine that lacks the detected SIMD extensions.  This
is unsafe for redistributed binary packages (conda, pip/PyPI, manylinux
Docker images) where build and target machines differ.

New detection policy (compile-time only, never runtime):

  x86_64 / AMD64:
    SSE and SSE2 are mandated by the AMD64 ABI — every 64-bit x86 CPU
    supports them regardless of age.  Both are enabled by default.
    Safe for all manylinux2014 / manylinux_2_28 / conda x86_64 builds.

  aarch64 / arm64:
    NEON is mandated by the AArch64 ABI — every arm64 CPU has it.
    Enabled by default.  Safe for all conda / manylinux aarch64 builds.

  AVX / AVX2 (Sandy Bridge 2011 / Haswell 2013 required):
    NOT universally available; default OFF for redistribution safety.
    Auto-enabled only when the compiler is already generating those
    instructions — i.e. when the user passed -march=native, -mavx2,
    /arch:AVX2, or similar.  Detected via check_c_source_compiles
    (not _runs) which tests what the compiler targets, not what the
    build host's CPU can execute.  This implements seanm's recommended
    "the compiler knows what CPU it's compiling for" approach.

  macOS universal binary (CMAKE_OSX_ARCHITECTURES with >1 entry):
    SIMD defaults disabled; a single configure pass cannot produce
    correct per-slice codelets for both arm64 and x86_64.

This change is a strict improvement on the previous behaviour for the
most important redistribution platforms:
  - conda/pip on x86_64: SSE+SSE2 always ON (was OFF without runtime probe)
  - conda/pip on arm64:  NEON always ON (unchanged)
  - AVX2 on build host:  ON only when compiler targets it (was ON always)

Closes InsightSoftwareConsortium#6006 (follow-up addressing seanm review)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
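
A rough sketch of the ABI-mandated defaults and the universal-binary guard described in this commit message follows. It is an assumption about the shape of the logic, not the committed code: the `_default_*` and `_n_osx_archs` names are invented, and the AVX/AVX2 handling is omitted.

```cmake
# Illustrative sketch only; helper variable names are hypothetical.
cmake_minimum_required(VERSION 3.16)
project(FftwSimdPolicySketch C)

set(_default_neon OFF)
set(_default_sse  OFF)
set(_default_sse2 OFF)

# An unset CMAKE_OSX_ARCHITECTURES yields a length of 0.
list(LENGTH CMAKE_OSX_ARCHITECTURES _n_osx_archs)

if(_n_osx_archs GREATER 1)
  # Universal binary: one configure pass cannot choose codelets per slice,
  # so every SIMD default stays OFF.
  message(STATUS "Universal build detected; SIMD defaults left OFF")
elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "^(aarch64|arm64|ARM64)$")
  # NEON is part of the AArch64 baseline.
  set(_default_neon ON)
elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "^(x86_64|AMD64)$")
  # SSE and SSE2 are part of the x86_64 baseline.
  set(_default_sse ON)
  set(_default_sse2 ON)
endif()

option(FFTW_ENABLE_NEON "Build FFTW with NEON codelets" ${_default_neon})
option(FFTW_ENABLE_SSE  "Build FFTW with SSE codelets"  ${_default_sse})
option(FFTW_ENABLE_SSE2 "Build FFTW with SSE2 codelets" ${_default_sse2})

message(STATUS "NEON=${FFTW_ENABLE_NEON} SSE=${FFTW_ENABLE_SSE} "
               "SSE2=${FFTW_ENABLE_SSE2}")
```

None of these defaults depends on executing code on the build host, which is the property that matters for redistributed packages.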
hjmjohnson added a commit to hjmjohnson/ITK that referenced this pull request Apr 3, 2026
…builds

Addresses seanm's review of ITK PR InsightSoftwareConsortium#6006: the previous
check_c_source_runs approach probed the BUILD HOST's CPU at configure
time, producing FFTW binaries that require the build machine's exact CPU
and SIGILL on any machine that lacks the detected SIMD extensions.  This
is unsafe for redistributed binary packages (conda, pip/PyPI, manylinux
Docker images) where build and target machines differ.

New detection policy (compile-time only, never runtime):

  x86_64 / AMD64:
    SSE and SSE2 are mandated by the AMD64 ABI -- every 64-bit x86 CPU
    supports them regardless of age.  Both are enabled by default.
    Safe for all manylinux2014 / manylinux_2_28 / conda x86_64 builds.

  aarch64 / arm64:
    NEON is mandated by the AArch64 ABI -- every arm64 CPU has it.
    Enabled by default.  Safe for all conda / manylinux aarch64 builds.

  AVX / AVX2 (Sandy Bridge 2011 / Haswell 2013 required):
    NOT universally available; default OFF for redistribution safety.
    Auto-enabled only when the compiler is already generating those
    instructions -- i.e. when the user passed -march=native, -mavx2,
    /arch:AVX2, or similar.  Detected via check_c_source_compiles
    (not _runs) which tests what the compiler targets, not what the
    build host's CPU can execute.  This implements seanm's recommended
    "the compiler knows what CPU it's compiling for" approach.
    The AVX/AVX2 cache variables are unset before each probe so that
    detection re-runs on every configure when compiler flags change
    (e.g. user later adds -march=native).

  macOS universal binary (CMAKE_OSX_ARCHITECTURES with >1 entry):
    SIMD defaults disabled; a single configure pass cannot produce
    correct per-slice codelets for both arm64 and x86_64.

This change is a strict improvement on the previous behaviour for the
most important redistribution platforms:
  - conda/pip on x86_64: SSE+SSE2 always ON (was OFF without runtime probe)
  - conda/pip on arm64:  NEON always ON (unchanged)
  - AVX2 on build host:  ON only when compiler targets it (was ON always)

Closes InsightSoftwareConsortium#6006 (follow-up addressing seanm review)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
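
The cache-clearing step added in this revision of the commit message could be sketched as below. The `_compiler_targets_*` variable names are hypothetical. The point illustrated is that `check_c_source_compiles` stores its result in the cache and skips the check when the result variable is already cached, so a stale answer would otherwise survive a later change of compiler flags.

```cmake
# Illustrative sketch only; result variable names are hypothetical.
cmake_minimum_required(VERSION 3.16)
project(FftwSimdReprobeSketch C)

include(CheckCSourceCompiles)

foreach(_isa AVX AVX2)
  # Drop any cached result so the probe re-runs on every configure.
  unset(_compiler_targets_${_isa} CACHE)

  # Expands to __AVX__ and __AVX2__ respectively.
  check_c_source_compiles("
    #if !defined(__${_isa}__)
    #error \"instruction set is not targeted by the current flags\"
    #endif
    int main(void) { return 0; }" _compiler_targets_${_isa})

  message(STATUS "Compiler targets ${_isa}: ${_compiler_targets_${_isa}}")
endforeach()
```

The trade-off is a slightly slower configure step, since the two small compile checks are repeated each time.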
hjmjohnson added a commit to hjmjohnson/ITK that referenced this pull request Apr 12, 2026
…builds
