Various Fixes and enhancements in x86 intrinsics by sayantn · Pull Request #1594 · rust-lang/stdarch

sayantn · 2024-06-24T20:18:26Z

Updated the x86-intel.xml to be a more recent update of Intel Intrinsics Guide (v3.6.8)
Added functionality to auto-generate a missing-x86.md for ease of implementation
Updated disassembly to allow windows-gnu targets (they share the implementation used for Linux targets, which relies on objdump from binutils, and binutils is a dependency of GCC). Added the x86_64-pc-windows-gnu target to CI.
Fixed some of the stream intrinsics.
Modified the floating-point reduce-add and reduce-mul intrinsics to NOT use simd_reduce_add_unordered and simd_reduce_mul_unordered, as Intel specifies a strict associativity. We follow GCC and hand-implement the associativity ourselves (see "_mm512_reduce_add_ps and friends are setting fast-math flags they should not set", #1533).
Fixed _load_mask32 etc in AVX512BW (they should have taken a __mmask32/__mmask64 pointer, but took u32/u64 pointer)
As moves never modify any flags, add preserves_flags to the asm! blocks for moves
Fix _mm_loadu_si64 (it had target-feature sse, but needs sse2), _mm256_extract_epi64, _mm256_extract_epi32, _mm256_cvtsi256_si32 (these had target-feature avx2, but need avx).
Fixed _mm_cvtt intrinsics (they were actually calling vcvtss2si, when they should call vcvttss2si)
Removed all MMX support from stdarch-verify, and made the target-feature verification stricter
Implemented the missing intrinsics listed in "Missing x86 vendor intrinsics (SSE2, SSE 4.1, AVX2)" (#1178), behind the feature gate simd-x86-updates (tracking issue: "Tracking Issue for Missing BMI1, AVX2, SSE2, SSE4.1, SSE4a and TBM intrinsics", rust-lang/rust#126936)
Fixed _mm512_kunpackb
Modified the reduce-max and reduce-min intrinsics to preserve associativity specified by Intel and to use the comparison function they described (which is NOT maxnum from LLVM)
Bumped the OS in CI Docker containers to Ubuntu 24.04 (except for armv7-unknown-linux-gnueabihf and x86_64-unknown-linux-gnu-emulated)
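
To illustrate the reduce-add item in the list above: floating-point addition is not associative, so an unordered reduction may return a different result than a fixed-order one. The following is a minimal scalar sketch in portable Rust, not the stdarch implementation. `sum_sequential` and `sum_halving_tree` are hypothetical helpers, and the halving order (upper half folded onto the lower half) is an assumption about the kind of fixed tree meant, not a quote of Intel's pseudocode.

```rust
/// Left-to-right sum: ((v[0] + v[1]) + v[2]) + ...
fn sum_sequential(v: &[f32; 16]) -> f32 {
    v.iter().fold(0.0, |acc, &x| acc + x)
}

/// Fixed halving tree (assumed order, for illustration only):
/// repeatedly add the upper half of the buffer onto the lower half
/// until a single value remains.
fn sum_halving_tree(v: &[f32; 16]) -> f32 {
    let mut buf = *v;
    let mut width = buf.len();
    while width > 1 {
        width /= 2;
        for i in 0..width {
            buf[i] += buf[i + width];
        }
    }
    buf[0]
}
```

With the first four lanes set to `[1e8, 1.0, -1e8, 1.0]` and the rest zero, the sequential sum is `1.0` while the tree sum is `2.0`, because `1e8 + 1.0` rounds back to `1e8` in `f32`. A reduction that is free to reassociate could produce either value.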

Modifying fma has been moved to #1597
Masked load/stores are on standby due to rust-lang/rust#126919
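
For the `_mm_cvtt` fix listed in the description, the difference between the two instructions is the rounding behaviour: `cvttss2si` truncates toward zero, while `cvtss2si` rounds according to the current MXCSR rounding mode (round to nearest, ties to even, by default). The sketch below models this in scalar Rust for in-range inputs only. Both function names are hypothetical, and out-of-range or NaN inputs are not modelled (Rust's `as` cast saturates there, which differs from the hardware result).

```rust
/// Models the truncating conversion (`cvttss2si`): round toward zero.
/// Only meaningful here for finite inputs that fit in an i32.
fn convert_truncating(x: f32) -> i32 {
    x as i32
}

/// Models the rounding conversion (`cvtss2si`) under the default
/// MXCSR mode: round to nearest, ties to even.
/// Only meaningful here for finite inputs that fit in an i32.
fn convert_nearest_even(x: f32) -> i32 {
    x.round_ties_even() as i32
}
```

For example, `2.7` truncates to `2` but rounds to `3`, and `-1.5` truncates to `-1` but rounds to `-2`, so calling the rounding instruction from a `cvtt` intrinsic gives visibly wrong results.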

rustbot · 2024-06-24T20:18:31Z

rustbot has assigned @Amanieu.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

RalfJung · 2024-07-02T09:50:26Z

+/// must be aligned on a 32-byte boundary or a general-protection exception may be generated. To
+/// minimize caching, the data is flagged as non-temporal (unlikely to be used again soon)
+///
+/// [Intel's documentation](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm256_stream_load_si256)


This (and all other AVX2 non-temporal operations) should get the same safety comment that the older non-temporal stores have. See e.g. here.

Amanieu · 2024-07-02

I checked and only non-temporal stores have special memory orderings on x86. x86 non-temporal loads work just like normal loads.

sayantn · 2024-07-02

@Amanieu said that this doesn't apply to streaming loads, only streaming stores.

RalfJung · 2024-07-02

Oh, I didn't realize non-temporal loads are even a thing. More nightmare waiting to happen, I guess...

ruriww · 2026-06-07T04:54:17Z

> I checked and only non-temporal stores have special memory orderings on x86. x86 non-temporal loads work just like normal loads.

Is there a reason the non-temporal loads are written with inline assembly then? It generates extra instructions when offsetting. https://godbolt.org/z/os8bG3Efr

rustbot assigned Amanieu Jun 24, 2024

workingjubilee reviewed Jun 25, 2024

Comment thread crates/core_arch/src/x86/sse41.rs Outdated

workingjubilee mentioned this pull request Jun 25, 2024

Add support for unaligned simd masked load/store rust-lang/rust#126919

Closed

sayantn force-pushed the avx512-fixes branch 2 times, most recently from 7387776 to f66cec7 Compare June 25, 2024 06:07

jhorstmann mentioned this pull request Jun 25, 2024

Exclude intel SVML functions from missing intrinsics report #1253

Closed

sayantn force-pushed the avx512-fixes branch 3 times, most recently from 79cca5a to df30a0c Compare June 26, 2024 11:48

sayantn mentioned this pull request Jun 26, 2024

Tracking Issue for AVX512 intrinsics rust-lang/rust#111137

Closed

sayantn marked this pull request as ready for review June 26, 2024 14:16

sayantn mentioned this pull request Jun 26, 2024

Tracking Issue for Missing BMI1, AVX2, SSE2, SSE4.1, SSE4a and TBM intrinsics rust-lang/rust#126936

Closed

sayantn added 5 commits June 27, 2024 23:45

Update Intrinsics list

e0c2d4b

Updated the intrinsics list from version 3.4 to 3.6.8. Added a missing-x86.md file to track progress.

Upgraded disassembly to include windows-gnu targets

5802a9c

Fixed many intrinsics

7ca3ce3

fixed reduce-add and reduce-mul. and load/store of mask32 and mask64. added preserves-flags to mov asm. fixed the missing list. fixed `_mm_loadu_si64`. Added `assert_instr`
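
As a rough illustration of the `preserves_flags` part of this commit: an `asm!` block whose only instruction is a move does not write the flags register, so it can declare that through `options(...)`. The sketch below is not the stdarch code. `load_u64` is a hypothetical helper, and the other options shown (`pure`, `readonly`, `nostack`) are assumptions chosen to suit this particular load.

```rust
/// Loads a u64 through a plain `mov` (hypothetical helper, x86_64 only).
#[cfg(target_arch = "x86_64")]
fn load_u64(p: &u64) -> u64 {
    use std::arch::asm;
    let dst: u64;
    // SAFETY: `p` is a valid, aligned reference and the block only reads
    // through it. `mov` does not modify RFLAGS, so `preserves_flags` holds.
    unsafe {
        asm!(
            "mov {dst}, qword ptr [{src}]",
            dst = out(reg) dst,
            src = in(reg) p as *const u64,
            options(pure, readonly, nostack, preserves_flags),
        );
    }
    dst
}

/// Fallback so the sketch also builds on other architectures.
#[cfg(not(target_arch = "x86_64"))]
fn load_u64(p: &u64) -> u64 {
    *p
}
```

Declaring `preserves_flags` tells the compiler it does not have to treat the flags as clobbered across the block.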

Fixed some more intrinsics

249f5c5

Added some tests, Fixed incorrect target-features, and verification code for target-features. Removed all MMX support from verification.

Add the missing BMI1, SSE2, SSE4.1 and AVX2 intrinsics

b36e66d

sayantn force-pushed the avx512-fixes branch from 900aaeb to 04a1218 Compare June 27, 2024 18:20

Update CI to accommodate for windows-gnu targets

c3d1833

sayantn force-pushed the avx512-fixes branch from 04a1218 to c3d1833 Compare June 27, 2024 18:43

sayantn requested a review from workingjubilee June 27, 2024 19:11

Fixed _mm512_kunpackb, reduce-max and reduce-min

53a290d

`_mm512_kunpackb` was implemented wrong, and `simd_reduce_max` uses `maxnum` for comparison, which adheres to IEEE754, but Intel specifically says that they do NOT adhere to IEEE754 for NaNs, which can give wrong results
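
The NaN difference this commit message refers to can be sketched in scalar Rust. This assumes the Intel-described comparison has the form `a > b ? a : b` (the behaviour of the x86 `MAXSS`/`MAXPS` instructions, where the second operand is returned whenever the comparison is false). `max_x86_style` and `max_ignoring_nan` are hypothetical names, and Rust's `f32::max` stands in for the `maxnum`-style behaviour.

```rust
/// x86-style maximum (assumed form `a > b ? a : b`): when the comparison
/// is false, including when either input is NaN, `b` is returned.
fn max_x86_style(a: f32, b: f32) -> f32 {
    if a > b { a } else { b }
}

/// NaN-ignoring maximum, as provided by `f32::max`: if exactly one
/// argument is NaN, the other argument is returned.
fn max_ignoring_nan(a: f32, b: f32) -> f32 {
    a.max(b)
}
```

With a NaN in the second position, `max_x86_style(1.0, f32::NAN)` is NaN while `max_ignoring_nan(1.0, f32::NAN)` is `1.0`, so a reduction built on the NaN-ignoring form can disagree with what the hardware semantics would produce, and the result of the x86-style form depends on operand order.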

Amanieu reviewed Jun 29, 2024

Comment thread crates/core_arch/src/x86/avx2.rs Outdated

Comment thread crates/core_arch/src/x86/sse41.rs Outdated

Comment thread crates/core_arch/src/x86/sse41.rs Outdated

Comment thread crates/stdarch-verify/tests/x86-intel.rs Outdated

Some fixes as asked by @Amanieu

8c975ef

Amanieu reviewed Jun 29, 2024

Comment thread crates/core_arch/src/x86/sse41.rs Outdated

Comment thread crates/core_arch/src/x86/avx2.rs Outdated

sayantn force-pushed the avx512-fixes branch 4 times, most recently from 231b968 to 1c7aafe Compare June 29, 2024 12:20

sayantn force-pushed the avx512-fixes branch 6 times, most recently from 2be8efe to a58f1ee Compare June 29, 2024 14:35

Fixing CI

e45d2b9

Fixed x86_64-apple-darwin freezing. Bump all docker to Ubuntu-24.04 (except for emulated and armv7)

sayantn force-pushed the avx512-fixes branch from a58f1ee to e45d2b9 Compare June 29, 2024 14:42

Amanieu merged commit 87158e6 into rust-lang:master Jun 29, 2024

Amanieu mentioned this pull request Jun 30, 2024

_mm512_reduce_add_ps and friends are setting fast-math flags they should not set #1533

Closed

This was referenced Jul 1, 2024

_mm_storeu_si16 and _mm_storeu_si64 are missing but not documented as such rust-lang/rust#62743

Closed

Missing x86/x86_64 intrinsic: _mm_loadu_si32 rust-lang/rust#62876

Closed

RalfJung reviewed Jul 2, 2024

ruriww mentioned this pull request Jun 7, 2026

Can non-temporal loads use the LLVM intrinsic instead? #2155

Open

Amanieu merged 9 commits into rust-lang:master from sayantn:avx512-fixes
