PERF: Enable FFTW SIMD codelets with per-CPU introspection at configure time #6006
…re time

FFTW SIMD codelets (NEON, SSE/SSE2, AVX, AVX2) are SIMD-specific routines baked into the library at compile time. Previously all SIMD flags were hardcoded to OFF, producing scalar-only FFTW builds regardless of the host CPU.

Add per-CPU SIMD detection at CMake configure time:
- ARM64 (aarch64/arm64/ARM64): NEON=ON (mandatory in ARMv8)
- x86/x86_64 with GCC/Clang: probe SSE, SSE2, AVX, AVX2 individually via __builtin_cpu_supports() / CheckCSourceRuns
- x86/x86_64 with MSVC: skip probes (intrinsic unavailable), default OFF
- Cross-compile ARM64: NEON=ON; x86_64: SSE+SSE2 only (conservative)
- All other architectures: all SIMD off (safe fallback)

Every flag is an individually overridable cache option (e.g. cmake -DFFTW_ENABLE_AVX2=OFF).

Cherry-picked from PR InsightSoftwareConsortium#6004 (targeting release-5.4) with review fixes:
- ARM64 regex includes all-caps variant for Windows ARM64
- MSVC compiler guard on __builtin_cpu_supports probes
- ENABLE_SSE included in documentation comment

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
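The configure-time probe described in this commit message could look roughly like the sketch below. It is not the code from CMake/itkExternal_FFTW.cmake: the helper name `probe_host_simd`, the `_PROBE_HOST_HAS_*` result variables, and the exact processor regex are assumptions made for illustration. It must run after a `project()` call that enables C.

```cmake
include(CheckCSourceRuns)

# Hypothetical helper: sets <out_var> to ON when the build host's CPU
# reports <feature> (e.g. "avx2") through __builtin_cpu_supports().
function(probe_host_simd feature out_var)
  set(${out_var} OFF PARENT_SCOPE)
  # A run-time probe is meaningless when the target is not the build host.
  if(CMAKE_CROSSCOMPILING)
    return()
  endif()
  # The builtin exists only in GCC-compatible compilers, so skip MSVC.
  if(NOT CMAKE_C_COMPILER_ID MATCHES "^(GNU|Clang|AppleClang)$")
    return()
  endif()
  # Assumed regex for x86 hosts; the real file may match differently.
  if(NOT CMAKE_SYSTEM_PROCESSOR MATCHES "^(x86_64|AMD64|i[3-6]86)$")
    return()
  endif()
  string(TOUPPER "${feature}" _upper)
  # The probe exits with 0 (success) only if the host CPU has the feature.
  check_c_source_runs("
    int main(void) {
      return __builtin_cpu_supports(\"${feature}\") ? 0 : 1;
    }" _PROBE_HOST_HAS_${_upper})
  if(_PROBE_HOST_HAS_${_upper})
    set(${out_var} ON PARENT_SCOPE)
  endif()
endfunction()

# The detected value only seeds the default; the cache option stays overridable.
probe_host_simd(avx2 _fftw_default_avx2)
option(FFTW_ENABLE_AVX2 "Enable FFTW AVX2 codelets" ${_fftw_default_avx2})
```

Because the probe executes on the configuring machine, its result describes the build host only, which is the concern raised later in this thread.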
| Filename | Overview |
|---|---|
| CMake/itkExternal_FFTW.cmake | Adds per-CPU SIMD detection for FFTW ExternalProject builds: introduces CheckCSourceRuns probes for x86 SIMD levels (SSE/SSE2/AVX/AVX2) and architecture-based NEON detection, exposes five FFTW_ENABLE_* cache options, and forwards them to both fftwf and fftwd ExternalProject blocks. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[cmake configure] --> B{ITK_USE_SYSTEM_FFTW?}
B -- yes --> Z[find_package FFTW]
B -- no --> C{CMAKE_CROSSCOMPILING?}
C -- no --> D{CMAKE_SYSTEM_PROCESSOR}
D -- aarch64/arm64/ARM64 --> E[_fftw_default_neon = ON]
D -- x86_64/AMD64/i686 --> F{C compiler = GCC/Clang/AppleClang?}
F -- yes --> G[check_c_source_runs per SIMD level\nsse / sse2 / avx / avx2]
G --> H[_fftw_default_* = ON if probe passes]
F -- no MSVC --> I[all _fftw_default_* = OFF]
D -- other --> I
C -- yes --> J{CMAKE_SYSTEM_PROCESSOR}
J -- aarch64/arm64/ARM64 --> K[_fftw_default_neon = ON]
J -- x86_64/AMD64 --> L[_fftw_default_sse = ON\n_fftw_default_sse2 = ON]
J -- other --> M[all _fftw_default_* = OFF]
E & H & I & K & L & M --> N[option FFTW_ENABLE_NEON/SSE/SSE2/AVX/AVX2\ndefault = detected value\ncached — user-overridable]
N --> O{ITK_USE_FFTWF?}
O -- yes --> P[ExternalProject_Add fftwf\nENABLE_FLOAT=ON\n+ all FFTW_ENABLE_* flags]
N --> Q{ITK_USE_FFTWD?}
Q -- yes --> R[ExternalProject_Add fftwd\nENABLE_FLOAT=OFF\n+ all FFTW_ENABLE_* flags]
Reviews (1): Last reviewed commit: "PERF: Enable FFTW SIMD codelets with per..."
- Add message(STATUS) showing detected FFTW SIMD flags at configure time so users can verify detection without inspecting the cache.
- Remove ENABLE_SSE from the fftwd (double-precision) ExternalProject block; SSE1 codelets are float-only and have no effect on fftwd.
- Document in the file header that option() defaults only apply on first configure and that ENABLE_SSE is not forwarded to fftwd.

Addresses review comments from @greptile-apps on PR InsightSoftwareConsortium#6006.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
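A minimal sketch of the two behaviours this commit describes: printing the flags at configure time, and leaving ENABLE_SSE out of the double-precision build. The `FFTW_ENABLE_*` option names come from this PR and the `-DENABLE_*` names mirror the flags quoted in this thread. The list variables `_fftw_simd_summary`, `_fftwf_simd_args`, and `_fftwd_simd_args` are hypothetical, and how the real file passes them to ExternalProject_Add is not shown here.

```cmake
set(_fftw_simd_names NEON SSE SSE2 AVX AVX2)

set(_fftw_simd_summary "")
set(_fftwf_simd_args "")
set(_fftwd_simd_args "")
foreach(_simd IN LISTS _fftw_simd_names)
  # Normalise whatever the cache option holds to a plain ON/OFF.
  if(FFTW_ENABLE_${_simd})
    set(_value ON)
  else()
    set(_value OFF)
  endif()
  string(APPEND _fftw_simd_summary " ${_simd}=${_value}")
  list(APPEND _fftwf_simd_args "-DENABLE_${_simd}:BOOL=${_value}")
  # SSE1 codelets are float-only, so ENABLE_SSE is not forwarded to fftwd.
  if(NOT _simd STREQUAL "SSE")
    list(APPEND _fftwd_simd_args "-DENABLE_${_simd}:BOOL=${_value}")
  endif()
endforeach()

# Lets users verify detection without inspecting CMakeCache.txt.
message(STATUS "FFTW SIMD flags:${_fftw_simd_summary}")
```

The two argument lists would then be appended to the CMAKE_ARGS of the single-precision and double-precision ExternalProject blocks respectively.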
e61066f to a9b11c8
CI does not test FFTW. The following tests passed:
itkFFTWF_FFTTest
itkFFTWF_RealFFTTest
itkVnlFFTWF_FFTTest
itkVnlFFTWF_RealFFTTest
itkFFTWD_FFTTest
itkFFTWD_RealFFTTest
itkVnlFFTWD_FFTTest
itkVnlFFTWD_RealFFTTest
itkFFTWComplexToComplexFFTImageFilter2DFloatTest
itkFFTWComplexToComplexFFTImageFilter3DFloatTest
itkFFTWComplexToComplexFFTImageFilter2DDoubleTest
itkFFTWComplexToComplexFFTImageFilter3DDoubleTest
itkFFTWForward1DFFTImageFilterTest
itkFFTWInverse1DFFTImageFilterTest
itkFFTWComplexToComplex1DFFTImageFilterTest
itkFFTW1DImageFilterTest
100% tests passed, 0 tests failed out of 16
dzenanz
left a comment
We should enable FFTW in some CI build.
Configure-time checking of such things usually does not play well with 'universal binaries' on macOS (where one builds for both ARM and Intel)...
@seanm The defaults can be overridden. Before this change, the default was to use no optimization, and in that case FFTW is at least 7.5 times slower, and often 10 times slower. FFTW becomes nearly useless in these cases. The behavior is intended to default to no optimization if CMAKE_CROSSCOMPILING is on, and I think that is the case for universal binaries. Do you have a recommendation for an alternative strategy? Do we need something more than 886494b#diff-c55dd1b7b03f8f37a8d66eb87317f3aaf58d457eb7adea1aab4fd5d69a95ab1fR90?
"...each of SSE, SSE2, AVX, AVX2 is probed individually via __builtin_cpu_supports..." so even on non-Mac this strategy is dangerous, because it assumes that the machine building the code is the same as the machine running the code. If I'm making a Windows app and my buildbot is a beefy modern CPU but I still want customers with old CPUs to run the app, the app will presumably crash for them because it's using new CPU instructions that their old CPUs don't support.
Compile-time checks (because the compiler knows what CPU it's compiling for).
@seanm These are the defaults for the most common use cases. In the less common scenario you describe, one simply needs to explicitly turn them off. Once off, they stay off on subsequent cmake configurations, unless explicitly requested to be turned back on.
cmake -DENABLE_SSE:BOOL=OFF -DENABLE_NEON:BOOL=OFF ....
That is the primary reason for using
360c9be into InsightSoftwareConsortium:main
Not so sure the scenario I described is "less common". I suspect most programmers have better CPUs than the customers running their code. Anyway, I don't use FFTW so I don't much care in this case.
…builds

Addresses seanm's review of ITK PR InsightSoftwareConsortium#6006: the previous check_c_source_runs approach probed the BUILD HOST's CPU at configure time, producing FFTW binaries that require the build machine's exact CPU and SIGILL on any machine that lacks the detected SIMD extensions. This is unsafe for redistributed binary packages (conda, pip/PyPI, manylinux Docker images) where build and target machines differ.

New detection policy (compile-time only, never runtime):

x86_64 / AMD64: SSE and SSE2 are mandated by the AMD64 ABI -- every 64-bit x86 CPU supports them regardless of age. Both are enabled by default. Safe for all manylinux2014 / manylinux_2_28 / conda x86_64 builds.

aarch64 / arm64: NEON is mandated by the AArch64 ABI -- every arm64 CPU has it. Enabled by default. Safe for all conda / manylinux aarch64 builds.

AVX / AVX2 (Sandy Bridge 2011 / Haswell 2013 required): NOT universally available; default OFF for redistribution safety. Auto-enabled only when the compiler is already generating those instructions -- i.e. when the user passed -march=native, -mavx2, /arch:AVX2, or similar. Detected via check_c_source_compiles (not _runs) which tests what the compiler targets, not what the build host's CPU can execute. This implements seanm's recommended "the compiler knows what CPU it's compiling for" approach. The AVX/AVX2 cache variables are unset before each probe so that detection re-runs on every configure when compiler flags change (e.g. user later adds -march=native).

macOS universal binary (CMAKE_OSX_ARCHITECTURES with >1 entry): SIMD defaults disabled; a single configure pass cannot produce correct per-slice codelets for both arm64 and x86_64.

This change is a strict improvement on the previous behaviour for the most important redistribution platforms:
- conda/pip on x86_64: SSE+SSE2 always ON (was OFF without runtime probe)
- conda/pip on arm64: NEON always ON (unchanged)
- AVX2 on build host: ON only when compiler targets it (was ON always)

Closes InsightSoftwareConsortium#6006 (follow-up addressing seanm review)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
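The compile-time policy in this commit message can be sketched as follows, shown for AVX2 only. This is an illustration under stated assumptions, not the merged code: the result variable `_FFTW_COMPILER_TARGETS_AVX2` and the local `_n_osx_archs` are hypothetical names. The sketch relies on the `__AVX2__` predefined macro, which GCC and Clang define under -mavx2 (or a -march that implies it) and MSVC defines under /arch:AVX2. It must run after a `project()` call that enables C.

```cmake
include(CheckCSourceCompiles)

# More than one entry means a macOS universal build, where one configure
# pass cannot pick codelets that are right for every architecture slice.
list(LENGTH CMAKE_OSX_ARCHITECTURES _n_osx_archs)

set(_fftw_default_avx2 OFF)
if(_n_osx_archs LESS 2)
  # Drop the cached result so the probe re-runs when compiler flags change.
  unset(_FFTW_COMPILER_TARGETS_AVX2 CACHE)
  # Compiles only if the compiler is already targeting AVX2; nothing is
  # executed, so the build host's own CPU never influences the answer.
  check_c_source_compiles("
    #ifndef __AVX2__
    #error \"compiler is not targeting AVX2\"
    #endif
    int main(void) { return 0; }" _FFTW_COMPILER_TARGETS_AVX2)
  if(_FFTW_COMPILER_TARGETS_AVX2)
    set(_fftw_default_avx2 ON)
  endif()
endif()

# option() applies the default only on the first configure; afterwards the
# cached value (or an explicit -DFFTW_ENABLE_AVX2=...) wins.
option(FFTW_ENABLE_AVX2 "Enable FFTW AVX2 codelets" ${_fftw_default_avx2})
```

Because the check compiles but never runs the test program, a build on an AVX2-capable machine without -mavx2 or -march=native still leaves the default OFF, which matches the redistribution-safe behaviour described above.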
Summary
- ENABLE_*:BOOL=OFF flags with auto-detected, user-overridable cache options
- fftwf and fftwd ExternalProject blocks

Detection Policy
CheckCSourceRuns + __builtin_cpu_supports()

Every flag is individually overridable:
cmake -DFFTW_ENABLE_AVX2=OFF ...

Relationship to PR #6004
This is the ITK v6 (main branch) forward-port of #6004, which targets release-5.4. Includes all review fixes from #6004:
- ARM64 (all-caps) added to regex for Windows ARM64 compatibility
- __builtin_cpu_supports probes guarded by compiler-ID check (GCC/Clang/AppleClang) to avoid noisy failures on MSVC
- ENABLE_SSE included in documentation comment

Test plan
- -DITK_USE_FFTWF=ON on x86_64 Linux confirms SSE2/AVX2 detection

🤖 Generated with Claude Code