Add AVX-512 support#231

Open
Shnatsel wants to merge 61 commits into
linebender:mainfrom
Shnatsel:avx512-yes-really


@Shnatsel Shnatsel commented May 24, 2026

Contributor

Yes, really. It's all here. In one humongous PR. Sorry 😅

This is probably best reviewed commit-by-commit. The first commit is still big because the history was getting really messy with changes and rollbacks, and squashing it made it less of a mess.

This also touches other backends in three ways:

  1. set_mask() is now a backend method so that it can be specialized per level.
  2. The changes to the mask conversion routines, needed to support different internal representations, bled into other levels. They occasionally add an intermediate array, but it gets optimized out in practice.
  3. transmute_copy() is wrapped into checked_transmute_copy() and the raw version is disallowed, after I almost had a horrible accident with it. This could be its own PR, but I wanted the insurance right away. Update: this was split out and shipped in v0.5.0.
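
To illustrate item 3, here is a minimal sketch of what a size-checked wrapper around `transmute_copy` could look like. It is an assumption for illustration, not the crate's actual `checked_transmute_copy()`: the signature, the panic message, and the choice of an `assert_eq!` (rather than a compile-time check) are all hypothetical.

```rust
use core::mem::{size_of, transmute_copy};

/// Hypothetical stand-in for a size-checked `transmute_copy`.
/// The real helper in the crate may differ in signature and checks.
///
/// # Safety
/// The bytes of `src` must form a valid value of type `Dst`.
unsafe fn checked_transmute_copy<Src, Dst>(src: &Src) -> Dst {
    // Both sizes are compile-time constants, so this check is expected
    // to be folded away by the optimizer whenever it holds.
    assert_eq!(
        size_of::<Src>(),
        size_of::<Dst>(),
        "checked_transmute_copy: source and destination sizes differ"
    );
    // SAFETY: the sizes match (checked above), and the caller guarantees
    // that the bit pattern of `src` is valid for `Dst`.
    unsafe { transmute_copy(src) }
}
```

The point of the wrapper is that a size mismatch becomes a loud failure instead of a silent partial copy, which is easy to hit when a generator emits the source and destination types independently.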

Everything changed here should be covered by tests. I've expanded test coverage where it was lacking.

Closes #179

Shnatsel added 20 commits May 24, 2026 18:24
…edicated AVX-512 implementations for complex int/float vector operations that benefit the most.

LLM summary of the changes:

Implemented:
- Added `X86::Avx512` in the generator with Ice Lake feature set, `native_width = 512`, `max_block_size = 512`.
- Generated new `fearless_simd/src/generated/avx512.rs`.
- Wired public API: `Avx512`, `x86::Avx512`, `Level::Avx512`, `Level::as_avx512`, dispatch, and `kernel!` support.
- Updated runtime/static detection so Ice Lake AVX-512 is selected before AVX2, while `as_avx2()` and `as_sse4_2()` downgrade correctly.
- Bumped MSRV/docs/CI/check-target metadata to Rust 1.89.
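
The detection order described in this list (AVX-512 selected before AVX2, with downgrades still available) might look roughly like the following. This is a simplified model, not the crate's code: `Level` here is a plain hypothetical enum, the downgrade methods return that enum rather than backend tokens, and only a small subset of AVX-512 features is probed because the exact Ice Lake feature list is not spelled out in this summary.

```rust
/// Hypothetical model of SIMD levels, ordered from weakest to strongest.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum Level {
    Fallback,
    Sse4_2,
    Avx2,
    Avx512,
}

impl Level {
    /// Pick the strongest level the running CPU supports.
    fn detect() -> Level {
        #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
        {
            // Assumption: a real Ice Lake check would test more features.
            if is_x86_feature_detected!("avx512f")
                && is_x86_feature_detected!("avx512bw")
                && is_x86_feature_detected!("avx512vl")
                && is_x86_feature_detected!("avx512dq")
            {
                return Level::Avx512;
            }
            if is_x86_feature_detected!("avx2") {
                return Level::Avx2;
            }
            if is_x86_feature_detected!("sse4.2") {
                return Level::Sse4_2;
            }
        }
        Level::Fallback
    }

    /// Downgrade to AVX2 if this level is at least AVX2.
    fn as_avx2(self) -> Option<Level> {
        (self >= Level::Avx2).then_some(Level::Avx2)
    }

    /// Downgrade to SSE4.2 if this level is at least SSE4.2.
    fn as_sse4_2(self) -> Option<Level> {
        (self >= Level::Sse4_2).then_some(Level::Sse4_2)
    }
}
```

Checking the widest level first matters because an AVX-512 CPU also reports AVX2 and SSE4.2, so testing in the opposite order would never select the new backend.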

Generator/backend behavior:
- 512-bit vectors use native `__m512`, `__m512d`, and `__m512i`.
- AVX-512 masks now use raw compact `__mmask8/16/32/64` storage, with no aligned wrapper.
- Generic `SimdFrom<__mmask*, S>` / `From<mask*, __mmask*>` now route through `from_bitmask` / `to_bitmask`, so they are correct for non-AVX-512 `S` too.
- Added AVX-512 compare/select paths using mask-returning compares and mask blends.
- Added direct conversion paths, including `f32 <-> i32/u32` and `u8 <-> u16`.
- Added AVX-512 vector slides for vectors only; masks intentionally have no slide support.
- Added dedicated AVX-512 zip/unzip/interleave/deinterleave using `permutex2var`, especially for 256/512-bit widths.
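
The difference between the compact AVX-512 mask representation and the lane-wise one used by other levels can be modelled in scalar code. The sketch below is illustrative only: the function names mirror `from_bitmask` / `to_bitmask` from the list above, but the signatures, the fixed 16-lane width, and the `select` helper are assumptions rather than the generated code.

```rust
/// Expand a compact bitmask (one bit per lane, as in `__mmask16`)
/// into a lane-wise mask where each lane is all-ones (-1) or zero.
fn from_bitmask(bits: u16) -> [i32; 16] {
    core::array::from_fn(|lane| if (bits >> lane) & 1 == 1 { -1 } else { 0 })
}

/// Compress a lane-wise mask back into one bit per lane,
/// using the sign bit of each lane as its boolean value.
fn to_bitmask(lanes: &[i32; 16]) -> u16 {
    let mut bits = 0u16;
    for (lane, &value) in lanes.iter().enumerate() {
        if value < 0 {
            bits |= 1 << lane;
        }
    }
    bits
}

/// Scalar model of a mask blend: take `if_true[lane]` where the mask bit
/// is set and `if_false[lane]` where it is clear.
fn select(mask: u16, if_true: &[f32; 16], if_false: &[f32; 16]) -> [f32; 16] {
    core::array::from_fn(|lane| {
        if (mask >> lane) & 1 == 1 {
            if_true[lane]
        } else {
            if_false[lane]
        }
    })
}
```

Routing generic conversions through a bitmask round trip like this is what keeps them correct for levels whose masks are stored as full-width lanes rather than as bits.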

Tests/coverage:
- Extended `#[simd_test]` to include AVX-512.
- Added AVX-512 detection/dispatch coverage.
- Updated mask bitwise tests for canonical boolean mask lanes.
- Added a regression test that AVX-512 mask public types are compact and match `__mmask*` sizes.
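
A compactness regression test of the kind described in the last item, together with an exhaustive per-bit setter check, could be sketched as follows. Everything here is hypothetical: `Mask16` stands in for a 16-lane mask type, `set_lane` is a made-up name (the real `set_mask()` signature is not shown in this PR text), and `u16` is used in place of `__mmask16` so the sketch does not depend on x86 intrinsics.

```rust
use core::mem::size_of;

/// Hypothetical compact mask: one bit per lane, no aligned wrapper.
#[repr(transparent)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Mask16(u16);

impl Mask16 {
    /// Set or clear the bit for a single lane.
    fn set_lane(&mut self, lane: usize, value: bool) {
        assert!(lane < 16, "lane index out of range");
        let bit = 1u16 << lane;
        if value {
            self.0 |= bit;
        } else {
            self.0 &= !bit;
        }
    }

    /// Read the boolean value of a single lane.
    fn get(self, lane: usize) -> bool {
        (self.0 >> lane) & 1 == 1
    }
}

/// Set and clear every lane in turn and verify that only that bit changes.
fn check_every_bit() {
    // The public type must be exactly as large as the raw mask storage.
    assert_eq!(size_of::<Mask16>(), size_of::<u16>());
    for lane in 0..16 {
        let mut mask = Mask16(0);
        mask.set_lane(lane, true);
        assert_eq!(mask.0, 1 << lane);
        mask.set_lane(lane, false);
        assert_eq!(mask.0, 0);

        let mut full = Mask16(u16::MAX);
        full.set_lane(lane, false);
        assert_eq!(full.0, !(1u16 << lane));
    }
}
```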
…ackend, and specialize it for AVX-512. Add test coverage that sets every single bit and verifies it was set correctly.
…rage. For 8-bit left shift only, LLVM autovectorizes the scalar fallback into GFNI instructions on 256-bit halves, which emits more instructions but schedules better and ends up slightly faster according to llvm-mca on Sapphire Rapids. The difference isn't huge, though, and I don't want to rely on autovectorization because it is fragile.
… so they didn't show up earlier when I removed those methods.
@Shnatsel Shnatsel mentioned this pull request May 24, 2026
@LaurenzV

Collaborator

I think it would indeed be great to have a separate PR for item 3.

@Shnatsel

Contributor Author

It will cause a lot of conflicts if I try to split it, but I have it isolated to its own commit at least: f08f7e6

Shnatsel added 25 commits May 27, 2026 21:45
Merged origin/main commit 13dd530. The merge applied cleanly.
Replace AVX512 interleaved load intrinsics emitted by the branch with checked_transmute_copy, then regenerate the generated AVX512 module.
Merged origin/main commit fbc97da. The merge applied cleanly.
Regenerate the branch-added AVX512 module so by-value transmutes use checked_transmute_copy, matching PR linebender#234.

Validation: cargo test
Merged origin/main commit 0d13b0a. The merge applied cleanly.
Regenerate the branch-added AVX512 module so reference casts use checked_cast_ref and checked_cast_mut. Also apply the float bit-pattern assertion style from PR linebender#235 to the branch-added f32x16 interleaved-load test.

Validation: cargo test
Merged origin/main commit 650815d. The merge applied cleanly.
PR linebender#237 only updates NEON load construction. The AVX512 branch-specific unsafe load sites were already adapted in the PR linebender#233 follow-up, and a search found no remaining load intrinsics needing the linebender#237 pattern.
Includes the regenerated AVX-512 output from the same generator update.
Includes regenerated AVX-512 slide helpers for the same safety cleanup.
@Shnatsel

Contributor Author

I've researched whether the instruction set we chose is forward-compatible with Intel's upcoming AVX10. It is: according to the Intel AVX10 architecture specification, revision 7.0, all AVX10 CPUs include the AVX-512 features from Ice Lake (our target) as well as those from Sapphire Rapids (a higher baseline than our target, but one that doesn't add anything particularly useful).

Development

Successfully merging this pull request may close these issues.

Mask types for AVX-512