Various fixes and enhancements in x86 intrinsics #1594
Updated the intrinsics list from version 3.4 to 3.6.8. Added a missing-x86.md file to track progress.
Fixed reduce-add and reduce-mul, and load/store of mask32 and mask64. Added preserves-flags to mov asm. Fixed the missing list. Fixed `_mm_loadu_si64`. Added `assert_instr`.
Added some tests, fixed incorrect target-features, and added verification code for target-features. Removed all MMX support from verification.
`_mm512_kunpackb` was implemented incorrectly. Also, `simd_reduce_max` uses `maxnum` for comparison, which adheres to IEEE 754, but Intel specifically says that these intrinsics do NOT adhere to IEEE 754 for NaNs, so `maxnum` can give wrong results.
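The NaN mismatch mentioned in the last commit message can be illustrated with a small scalar model. This is only a sketch: the helper names `x86_style_max` and `maxnum_style_max` are hypothetical, and the first function models a single lane of the x86 `maxps` rule (the second operand is returned whenever `a > b` is false), not the code in this PR.

```rust
/// Hypothetical scalar model of one lane of x86 `maxps`:
/// the second operand is returned whenever `a > b` is false,
/// which includes every case where either input is NaN.
fn x86_style_max(a: f32, b: f32) -> f32 {
    if a > b { a } else { b }
}

/// `maxnum`-style maximum, as provided by `f32::max`:
/// if exactly one operand is NaN, the other operand is returned.
fn maxnum_style_max(a: f32, b: f32) -> f32 {
    a.max(b)
}
```

With these helpers, `maxnum_style_max(1.0, f32::NAN)` yields `1.0`, while `x86_style_max(1.0, f32::NAN)` yields NaN, so a reduction built on one rule can disagree with a reduction built on the other as soon as a NaN appears in any lane.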
/// must be aligned on a 32-byte boundary or a general-protection exception may be generated. To
/// minimize caching, the data is flagged as non-temporal (unlikely to be used again soon)
///
/// [Intel's documentation](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm256_stream_load_si256)
This (and all other AVX2 non-temporal operations) should get the same safety comment that the older non-temporal stores have. See e.g. here.
I checked and only non-temporal stores have special memory orderings on x86. x86 non-temporal loads work just like normal loads.
@Amanieu said that this doesn't apply to streaming loads, only to streaming stores.
Oh, I didn't realize non-temporal loads even are a thing. More nightmare waiting to happen, I guess...
Is there a reason the non-temporal loads are written with inline assembly then? It generates extra instructions when offsetting. https://godbolt.org/z/os8bG3Efr
- Added `missing-x86.md` for ease of implementation
- […] `objdump` from binutils, and binutils is a dependency of GCC). Add the `x86_64-pc-windows-gnu` target in CI
- […] `simd_reduce_add_unordered` and `simd_reduce_mul_unordered` as Intel specifies a strict associativity. Follow GCC and hand-implement the associativity ourselves (_mm512_reduce_add_ps and friends are setting fast-math flags they should not set #1533)
- Fixed `_load_mask32` etc. in AVX512BW (they should have taken a `__mmask32`/`__mmask64` pointer, but took a `u32`/`u64` pointer)
- Added `preserves_flags` to the `asm!` blocks for moves
- Fixed target-features of `_mm_loadu_si64` (it had target-feature sse, but needs sse2), `_mm256_extract_epi64`, `_mm256_extract_epi32`, `_mm256_cvtsi256_si32` (these had target-feature avx2, but need avx).
- […] `_mm_cvtt` intrinsics (they were actually calling vcvtss2si, when they should call vcvttss2si)
- […] `simd-x86-updates` (Tracking Issue for Missing BMI1, AVX2, SSE2, SSE4.1, SSE4a and TBM intrinsics rust#126936)
- Fixed `_mm512_kunpackb`
- […] `maxnum` from LLVM)
- […] `armv7-unknown-linux-gnueabihf` and `x86_64-unknown-linux-gnu-emulated`)

Modifying fma has been moved to #1597
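To show why the association order of a floating-point reduction matters, here is a minimal scalar sketch. The helper names are hypothetical, and the halving order below is only an assumed stand-in for a fixed, hand-written association; it is not taken from this PR or from GCC's headers.

```rust
/// Hypothetical model of a fixed-order tree reduction over 16 lanes:
/// the upper half is added onto the lower half, then the width is halved,
/// until a single value remains.
fn reduce_add_tree(v: [f32; 16]) -> f32 {
    let mut buf = v;
    let mut width = 16;
    while width > 1 {
        width /= 2;
        for i in 0..width {
            buf[i] += buf[i + width];
        }
    }
    buf[0]
}

/// Plain left-to-right sum of the same 16 lanes, for comparison.
fn reduce_add_sequential(v: [f32; 16]) -> f32 {
    v.iter().fold(0.0, |acc, x| acc + x)
}
```

For an input with `1e8` in lane 0, `1.0` in lane 1, `-1e8` in lane 8 and zeros elsewhere, the tree order cancels the two large values first and returns `1.0`, while the sequential order absorbs the `1.0` into `1e8` and returns `0.0`. An unordered reduction is free to pick either result, which is why a documented association has to be written out explicitly.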
Masked load/stores are on standby due to rust-lang/rust#126919
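
The `_mm_cvtt` item in the list above comes down to truncation versus rounding. As a rough scalar model (the helper names are hypothetical, and this does not call the actual instructions), the truncating form drops the fractional part, while the non-truncating form rounds according to the current rounding mode, assumed here to be the default round-to-nearest-even:

```rust
/// Hypothetical model of a truncating conversion (the `cvtt` family):
/// the fractional part is discarded, rounding toward zero.
/// Note: out-of-range inputs saturate with `as`, whereas the x86
/// instruction returns the integer indefinite value (0x80000000).
fn convert_truncating(x: f32) -> i32 {
    x as i32
}

/// Hypothetical model of the non-truncating conversion, assuming the
/// default rounding mode (round to nearest, ties to even).
fn convert_round_nearest_even(x: f32) -> i32 {
    x.round_ties_even() as i32
}
```

For `2.7` the truncating model returns `2` and the rounding model returns `3`, so calling the wrong instruction silently changes results for most non-integer inputs.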