(applies to Vector256 as well)
Consider Vector128.ShiftRightLogical applied to a Vector128<byte>. X86 does not have a logical right-shift instruction that operates on bytes:
Vector128<byte> v0 = Vector128.LoadUnsafe(ref source);
Vector128<byte> v1 = Vector128.ShiftRightLogical(v0, 4);
This currently emits a scalar fallback that shifts one byte at a time in a loop:
TestClass.Foo(Byte ByRef)
L0000: push rsi
L0001: sub rsp, 0x40
L0005: vzeroupper
L0008: vmovdqu xmm0, [rcx]
L000c: vmovapd [rsp+0x20], xmm0
L0012: xor esi, esi
L0014: lea rcx, [rsp+0x20]
L0019: movsxd rdx, esi
L001c: movzx ecx, byte ptr [rcx+rdx]
L0020: mov edx, 4
L0025: mov rax, 0x7ffa0845bc60
L002f: call qword ptr [rax]
L0031: lea rdx, [rsp+0x30]
L0036: movsxd rcx, esi
L0039: mov [rdx+rcx], al
L003c: inc esi
L003e: cmp esi, 0x10
L0041: jl short L0014
L0043: vmovapd xmm0, [rsp+0x30]
L0049: vpmovmskb eax, xmm0
L004d: add rsp, 0x40
L0051: pop rsi
L0052: ret
It could instead emit a 32-bit shift followed by an AND that clears the bits shifted in from the neighboring byte:
Vector128<byte> v0 = Vector128.LoadUnsafe(ref source);
Vector128<byte> v1 = Vector128.ShiftRightLogical(v0.AsInt32(), 4).AsByte() & Vector128.Create((byte)0xF);
TestClass.Bar(Byte ByRef)
L0000: vzeroupper
L0003: vmovdqu xmm0, [rcx]
L0007: vpsrld xmm0, xmm0, 4
L000c: vpand xmm0, xmm0, [0x7ffa087600d0]
L0014: vpmovmskb eax, xmm0
L0018: ret
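To show why the AND is needed and how the constant depends on the shift count, here is a minimal sketch of the same trick for any shift in the range 0 to 7. The helper name ShiftRightLogicalBytes and the class ByteShift are hypothetical and are not runtime APIs. The mask formula 0xFF >> shift is my generalization of the 0xF constant used above for a shift of 4. Within each 32-bit lane, the shift moves the low bits of each byte into the top bits of the byte below it, and the mask clears exactly those bits.

```csharp
using System;
using System.Runtime.Intrinsics;

static class ByteShift
{
    // Hypothetical helper (not a runtime API): per-byte logical right shift
    // built from a 32-bit shift plus a mask. Assumes 0 <= shift <= 7.
    public static Vector128<byte> ShiftRightLogicalBytes(Vector128<byte> value, int shift)
    {
        // After shifting a 32-bit lane right by `shift`, the top `shift` bits
        // of each byte hold bits that came from the next-higher byte.
        // 0xFF >> shift keeps only the bits that belong to the original byte.
        Vector128<byte> mask = Vector128.Create((byte)(0xFF >> shift));
        return Vector128.ShiftRightLogical(value.AsInt32(), shift).AsByte() & mask;
    }
}
```

For a shift of 4 the mask is 0x0F, which matches the constant in the snippet above. The same approach should carry over to Vector256 by swapping the vector type.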
We have a few places in runtime that are aware of this issue and employ workarounds, e.g.:
runtime/src/libraries/System.Private.CoreLib/src/System/IndexOfAnyValues/IndexOfAnyAsciiSearcher.cs
Line 875 in c1abf87
runtime/src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Decoder.cs
Line 594 in dc6ad37