Use mul+add+permute sequence for DotProduct when AVX is available by alexcovington · Pull Request #125666 · dotnet/runtime

alexcovington · 2026-03-17T17:35:17Z

On x86, when AVX is available, computing a dot product with a multiply+permute+add sequence is generally faster than using vdpps/vdppd.

This PR modifies lowering to use the multiply+permute+addition sequence if AVX is available.

| Namespace                       | Type                     | Method       | Job        | Toolchain                   | Mean     | Error     | StdDev    | Median   | Min      | Max      | Ratio | RatioSD | Allocated | Alloc Ratio |
|-------------------------------- |------------------------- |------------- |----------- |---------------------------- |---------:|----------:|----------:|---------:|---------:|---------:|------:|--------:|----------:|------------:|
| System.Numerics.Tests           | Perf_Plane               | DotBenchmark | Job-IJGDOK | \base\Core_Root\corerun.exe | 1.708 ns | 0.0591 ns | 0.0680 ns | 1.673 ns | 1.645 ns | 1.840 ns |  1.00 |    0.00 |         - |          NA |
| System.Numerics.Tests           | Perf_Plane               | DotBenchmark | Job-XUUJWJ | \diff\Core_Root\corerun.exe | 1.295 ns | 0.0098 ns | 0.0076 ns | 1.296 ns | 1.284 ns | 1.308 ns |  0.76 |    0.03 |         - |          NA |
|                                 |                          |              |            |                             |          |           |           |          |          |          |       |         |           |             |
| System.Numerics.Tests           | Perf_Quaternion          | DotBenchmark | Job-IJGDOK | \base\Core_Root\corerun.exe | 1.655 ns | 0.0324 ns | 0.0333 ns | 1.638 ns | 1.628 ns | 1.740 ns |  1.00 |    0.00 |         - |          NA |
| System.Numerics.Tests           | Perf_Quaternion          | DotBenchmark | Job-XUUJWJ | \diff\Core_Root\corerun.exe | 1.302 ns | 0.0288 ns | 0.0308 ns | 1.287 ns | 1.278 ns | 1.373 ns |  0.79 |    0.02 |         - |          NA |
|                                 |                          |              |            |                             |          |           |           |          |          |          |       |         |           |             |
| System.Numerics.Tests           | Perf_Vector2             | DotBenchmark | Job-IJGDOK | \base\Core_Root\corerun.exe | 1.664 ns | 0.0231 ns | 0.0205 ns | 1.667 ns | 1.632 ns | 1.709 ns |  1.00 |    0.00 |         - |          NA |
| System.Numerics.Tests           | Perf_Vector2             | DotBenchmark | Job-XUUJWJ | \diff\Core_Root\corerun.exe | 1.295 ns | 0.0167 ns | 0.0130 ns | 1.294 ns | 1.276 ns | 1.313 ns |  0.78 |    0.01 |         - |          NA |
|                                 |                          |              |            |                             |          |           |           |          |          |          |       |         |           |             |
| System.Runtime.Intrinsics.Tests | Perf_Vector128Float      | DotBenchmark | Job-IJGDOK | \base\Core_Root\corerun.exe | 1.706 ns | 0.0923 ns | 0.1063 ns | 1.648 ns | 1.624 ns | 1.961 ns |  1.00 |    0.00 |         - |          NA |
| System.Runtime.Intrinsics.Tests | Perf_Vector128Float      | DotBenchmark | Job-XUUJWJ | \diff\Core_Root\corerun.exe | 1.314 ns | 0.0369 ns | 0.0425 ns | 1.302 ns | 1.273 ns | 1.420 ns |  0.77 |    0.05 |         - |          NA |
|                                 |                          |              |            |                             |          |           |           |          |          |          |       |         |           |             |
| System.Runtime.Intrinsics.Tests | Perf_Vector128Of<Double> | DotBenchmark | Job-IJGDOK | \base\Core_Root\corerun.exe | 1.476 ns | 0.0282 ns | 0.0313 ns | 1.474 ns | 1.443 ns | 1.534 ns |  1.00 |    0.00 |         - |          NA |
| System.Runtime.Intrinsics.Tests | Perf_Vector128Of<Double> | DotBenchmark | Job-XUUJWJ | \diff\Core_Root\corerun.exe | 1.131 ns | 0.0327 ns | 0.0377 ns | 1.116 ns | 1.098 ns | 1.219 ns |  0.77 |    0.03 |         - |          NA |
|                                 |                          |              |            |                             |          |           |           |          |          |          |       |         |           |             |
| System.Runtime.Intrinsics.Tests | Perf_Vector128Of<Single> | DotBenchmark | Job-IJGDOK | \base\Core_Root\corerun.exe | 1.652 ns | 0.0278 ns | 0.0260 ns | 1.651 ns | 1.620 ns | 1.710 ns |  1.00 |    0.00 |         - |          NA |
| System.Runtime.Intrinsics.Tests | Perf_Vector128Of<Single> | DotBenchmark | Job-XUUJWJ | \diff\Core_Root\corerun.exe | 1.301 ns | 0.0238 ns | 0.0199 ns | 1.301 ns | 1.274 ns | 1.347 ns |  0.79 |    0.02 |         - |          NA |
|                                 |                          |              |            |                             |          |           |           |          |          |          |       |         |           |             |
| System.Numerics.Tests           | Perf_VectorOf<Double>    | DotBenchmark | Job-IJGDOK | \base\Core_Root\corerun.exe | 1.474 ns | 0.0163 ns | 0.0127 ns | 1.468 ns | 1.462 ns | 1.501 ns |  1.00 |    0.00 |         - |          NA |
| System.Numerics.Tests           | Perf_VectorOf<Double>    | DotBenchmark | Job-XUUJWJ | \diff\Core_Root\corerun.exe | 1.291 ns | 0.0109 ns | 0.0085 ns | 1.289 ns | 1.282 ns | 1.311 ns |  0.88 |    0.01 |         - |          NA |
|                                 |                          |              |            |                             |          |           |           |          |          |          |       |         |           |             |
| System.Numerics.Tests           | Perf_VectorOf<Single>    | DotBenchmark | Job-IJGDOK | \base\Core_Root\corerun.exe | 1.887 ns | 0.0756 ns | 0.0841 ns | 1.853 ns | 1.811 ns | 2.095 ns |  1.00 |    0.00 |         - |          NA |
| System.Numerics.Tests           | Perf_VectorOf<Single>    | DotBenchmark | Job-XUUJWJ | \diff\Core_Root\corerun.exe | 1.295 ns | 0.0105 ns | 0.0082 ns | 1.293 ns | 1.286 ns | 1.311 ns |  0.69 |    0.03 |         - |          NA |
|                                 |                          |              |            |                             |          |           |           |          |          |          |       |         |           |             |
| System.Numerics.Tests           | Perf_Vector3             | DotBenchmark | Job-IJGDOK | \base\Core_Root\corerun.exe | 1.704 ns | 0.0420 ns | 0.0467 ns | 1.702 ns | 1.641 ns | 1.806 ns |  1.00 |    0.00 |         - |          NA |
| System.Numerics.Tests           | Perf_Vector3             | DotBenchmark | Job-XUUJWJ | \diff\Core_Root\corerun.exe | 1.331 ns | 0.0423 ns | 0.0488 ns | 1.317 ns | 1.283 ns | 1.412 ns |  0.78 |    0.03 |         - |          NA |
|                                 |                          |              |            |                             |          |           |           |          |          |          |       |         |           |             |
| System.Numerics.Tests           | Perf_Vector4             | DotBenchmark | Job-IJGDOK | \base\Core_Root\corerun.exe | 1.675 ns | 0.0402 ns | 0.0430 ns | 1.666 ns | 1.633 ns | 1.781 ns |  1.00 |    0.00 |         - |          NA |
| System.Numerics.Tests           | Perf_Vector4             | DotBenchmark | Job-XUUJWJ | \diff\Core_Root\corerun.exe | 1.293 ns | 0.0133 ns | 0.0111 ns | 1.289 ns | 1.280 ns | 1.315 ns |  0.77 |    0.02 |         - |          NA |
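As a quick sanity check on the table, the Ratio column can be reproduced from the Mean column. The sketch below assumes Ratio is simply the diff mean divided by the base mean, rounded to two decimals. BenchmarkDotNet may derive it differently, but this matches the rows above. The function name `ratio` is made up for illustration.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: ratio of the diff mean to the base mean,
// rounded to two decimals like the Ratio column in the table.
// Assumes both inputs are positive timings in the same unit (ns).
double ratio(double base_ns, double diff_ns) {
    assert(base_ns > 0.0 && diff_ns > 0.0);
    return std::round(diff_ns / base_ns * 100.0) / 100.0;
}
```

For example, the Perf_Plane row gives 1.295 / 1.708, which rounds to 0.76, and Perf_VectorOf<Single> gives 1.295 / 1.887, which rounds to 0.69.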

Disasm

System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>

Base

; System.Runtime.Intrinsics.Tests.Perf_Vector128Of`1[[System.Single, System.Private.CoreLib]].DotBenchmark()
       vpcmpeqd  xmm0,xmm0,xmm0
       vpcmpeqd  xmm1,xmm1,xmm1
       vdpps     xmm0,xmm0,xmm1,0FF
       ret
; Total bytes of code 15

Diff

; System.Runtime.Intrinsics.Tests.Perf_Vector128Of`1[[System.Single, System.Private.CoreLib]].DotBenchmark()
       vpcmpeqd  xmm0,xmm0,xmm0
       vpcmpeqd  xmm1,xmm1,xmm1
       vmulps    xmm0,xmm1,xmm0
       vpermilps xmm1,xmm0,0B1
       vaddps    xmm0,xmm1,xmm0
       vpermilps xmm1,xmm0,4E
       vaddps    xmm0,xmm1,xmm0
       ret
; Total bytes of code 33
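
For readers who want to follow the shuffle immediates, here is a minimal scalar sketch in C++ of the 128-bit float Diff sequence above. It is not the JIT's code: the helper names (`permilps`, `mulps`, `addps`, `dot_mul_permute_add`) are made up for illustration, and each xmm register is modeled as a plain array of four floats.

```cpp
#include <array>
#include <cstdint>

using Vec4f = std::array<float, 4>;

// Models vpermilps on one 128-bit lane with an 8-bit immediate:
// result element i is src[(imm8 >> (2 * i)) & 3].
Vec4f permilps(const Vec4f& src, std::uint8_t imm8) {
    Vec4f r{};
    for (int i = 0; i < 4; ++i) {
        r[i] = src[(imm8 >> (2 * i)) & 3];
    }
    return r;
}

// Element-wise multiply (models vmulps).
Vec4f mulps(const Vec4f& a, const Vec4f& b) {
    return {a[0] * b[0], a[1] * b[1], a[2] * b[2], a[3] * b[3]};
}

// Element-wise add (models vaddps).
Vec4f addps(const Vec4f& a, const Vec4f& b) {
    return {a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]};
}

// Mirrors the Diff listing: one multiply, then two permute+add steps.
Vec4f dot_mul_permute_add(const Vec4f& a, const Vec4f& b) {
    Vec4f t = mulps(a, b);
    t = addps(permilps(t, 0xB1), t);  // 0xB1 selects [1,0,3,2]: swap neighbours
    t = addps(permilps(t, 0x4E), t);  // 0x4E selects [2,3,0,1]: swap 64-bit halves
    return t;                         // every element now holds the full sum
}
```

With a = {1, 2, 3, 4} and b = {5, 6, 7, 8} the products are {5, 12, 21, 32}; the first permute+add gives {17, 17, 53, 53} and the second leaves 70 in every element, the same broadcast result that vdpps produces with mask 0FF.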

System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>

Base

; System.Runtime.Intrinsics.Tests.Perf_Vector128Of`1[[System.Double, System.Private.CoreLib]].DotBenchmark()
       vpcmpeqd  xmm0,xmm0,xmm0
       vpcmpeqd  xmm1,xmm1,xmm1
       vdppd     xmm0,xmm0,xmm1,33
       ret
; Total bytes of code 15

Diff

; System.Runtime.Intrinsics.Tests.Perf_Vector128Of`1[[System.Double, System.Private.CoreLib]].DotBenchmark()
       vpcmpeqd  xmm0,xmm0,xmm0
       vpcmpeqd  xmm1,xmm1,xmm1
       vmulpd    xmm0,xmm1,xmm0
       vpermilpd xmm1,xmm0,1
       vaddpd    xmm0,xmm1,xmm0
       ret
; Total bytes of code 23

System.Numerics.Tests.Perf_VectorOf<Single>

Base

; System.Numerics.Tests.Perf_VectorOf`1[[System.Single, System.Private.CoreLib]].DotBenchmark()
       vbroadcastss ymm0,dword ptr [0C5A4]
       vdpps     ymm0,ymm0,[0C5C0],0FF
       vperm2f128 ymm1,ymm0,ymm0,1
       vaddps    ymm0,ymm1,ymm0
       vzeroupper
       ret
; Total bytes of code 33

Diff

; System.Numerics.Tests.Perf_VectorOf`1[[System.Single, System.Private.CoreLib]].DotBenchmark()
       vbroadcastss ymm0,dword ptr [0C8E8]
       vmulps    ymm0,ymm0,dword bcst [0C8EC]
       vpermilps ymm1,ymm0,0B1
       vaddps    ymm0,ymm1,ymm0
       vpermilps ymm1,ymm0,4E
       vaddps    ymm0,ymm0,ymm1
       vperm2f128 ymm1,ymm0,ymm0,1
       vaddps    ymm0,ymm1,ymm0
       vzeroupper
       ret
; Total bytes of code 53

System.Numerics.Tests.Perf_VectorOf<Double>

Base

; System.Numerics.Tests.Perf_VectorOf`1[[System.Double, System.Private.CoreLib]].DotBenchmark()
       vbroadcastsd ymm0,qword ptr [0E6D8]
       vmulpd    ymm0,ymm0,qword bcst [0E6E0]
       vhaddpd   ymm0,ymm0,ymm0
       vperm2f128 ymm1,ymm0,ymm0,1
       vaddpd    ymm0,ymm1,ymm0
       vzeroupper
       ret
; Total bytes of code 37

Diff

; System.Numerics.Tests.Perf_VectorOf`1[[System.Double, System.Private.CoreLib]].DotBenchmark()
       vbroadcastsd ymm0,qword ptr [0E880]
       vmulpd    ymm0,ymm0,qword bcst [0E888]
       vpermilpd ymm1,ymm0,5
       vaddpd    ymm0,ymm1,ymm0
       vperm2f128 ymm1,ymm0,ymm0,1
       vaddpd    ymm0,ymm1,ymm0
       vzeroupper
       ret
; Total bytes of code 43
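
The 256-bit double Diff sequence above adds one cross-lane step on top of the in-lane permute. A minimal scalar sketch in C++ follows. It models a ymm register as an array of four doubles, and the helper names (`permilpd256`, `swap_lanes`, `dot256_mul_permute_add`) are made up for illustration rather than taken from the JIT.

```cpp
#include <array>
#include <cstdint>

using Vec4d = std::array<double, 4>;

// Models 256-bit vpermilpd with an immediate: bits 0-1 pick elements
// within the low 128-bit lane, bits 2-3 pick within the high lane.
Vec4d permilpd256(const Vec4d& s, std::uint8_t imm8) {
    return {s[(imm8 >> 0) & 1],
            s[(imm8 >> 1) & 1],
            s[2 + ((imm8 >> 2) & 1)],
            s[2 + ((imm8 >> 3) & 1)]};
}

// Models vperm2f128 dst, src, src, 1: the two 128-bit lanes trade places.
Vec4d swap_lanes(const Vec4d& s) {
    return {s[2], s[3], s[0], s[1]};
}

// Element-wise add (models vaddpd).
Vec4d addpd(const Vec4d& a, const Vec4d& b) {
    return {a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]};
}

// Mirrors the Diff listing: multiply, in-lane swap+add, cross-lane swap+add.
Vec4d dot256_mul_permute_add(const Vec4d& a, const Vec4d& b) {
    Vec4d t = {a[0] * b[0], a[1] * b[1], a[2] * b[2], a[3] * b[3]};
    t = addpd(permilpd256(t, 5), t);  // imm 5 swaps the pair inside each lane
    t = addpd(swap_lanes(t), t);      // combine the two per-lane sums
    return t;                         // every element now holds the full sum
}
```

For a = {1, 2, 3, 4} and b = {5, 6, 7, 8}, the in-lane step yields {17, 17, 53, 53} and the cross-lane step yields 70 in all four elements.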

dotnet-policy-service · 2026-03-17T17:36:27Z

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

src/coreclr/jit/lowerxarch.cpp

tannergooding

LGTM, minus the nit about code duplication. I'm fine with deferring that, but it'd be nice to get it handled.

CC. @kg, @EgorBo for secondary review

Copilot

Pull request overview

This PR updates x86/x64 JIT lowering for SIMD DotProduct to prefer a MUL + permute + add reduction sequence when AVX is available, avoiding vdpps/vdppd in those cases to improve performance for common Vector* dot-product patterns.

Changes:

- Replace AVX float/double dot-product lowering with explicit multiply + permute + add reduction sequences in LowerHWIntrinsicDot.
- Add AVX-gated alternative lowering paths for some Vector128/Vector128 dot-product cases.

EgorBo · 2026-03-23T23:45:40Z

It looks like Copilot left some useful feedback to address.

EgorBo

LGTM with a few nits

EgorBo

Thanks!

Use mul+add+permute sequence for DotProduct when AVX is available

2a32831

github-actions bot added the area-System.Memory label Mar 17, 2026

dotnet-policy-service bot added the community-contribution label (indicates that the PR has been added by a community member) Mar 17, 2026

Correct TYP_DOUBLE case for Vector128

e6e5d54

tannergooding reviewed Mar 17, 2026

Fix typo in index

682188f

This was referenced Mar 18, 2026

[android] Android.Device_Emulator.JIT.Test failing on emulators with CoreCLR #112633

Open

[Android][CoreCLR] System.Security.Cryptography.Tests killed by lowmemorykiller #118603

Open

MsQuic fails with QUIC_STATUS_OUT_OF_MEMORY on AzureLinux #123216

Open

Mark correct node as unused

853508a

tannergooding reviewed Mar 23, 2026

tannergooding reviewed Mar 23, 2026

tannergooding approved these changes Mar 23, 2026

EgorBo requested a review from Copilot March 23, 2026 22:44

Copilot started reviewing on behalf of EgorBo March 23, 2026 22:45

Copilot AI reviewed Mar 23, 2026

Alex Covington (Advanced Micro Devices Inc) added 3 commits March 24, 2026 09:23

Update comments, insert tmps before node, lower tmps before reuse

48765f8

Move lowering of Dot for float/double w/ AVX to helper function, consolidate code duplication into a single path for Vector128/256

79f0583

Better variable name

3710108

tannergooding approved these changes Mar 25, 2026

EgorBo reviewed Mar 25, 2026

EgorBo reviewed Mar 25, 2026

EgorBo reviewed Mar 25, 2026

EgorBo approved these changes Mar 25, 2026

Cleanup unused node case

43c1d48

EgorBo reviewed Mar 25, 2026

Consolidate more into helper function

8a97846

EgorBo reviewed Mar 25, 2026

Update src/coreclr/jit/lowerxarch.cpp

e9a2e26

Don't remove node if we can't find user
Co-authored-by: Egor Bogatov <egorbo@gmail.com>

This was referenced Mar 26, 2026

The Operation will be canceled. The next steps may not contain expected logs. dotnet/dnceng#3008

Open

[android-arm64] The Operation will be canceled. The next steps may not contain expected logs. dotnet/dnceng#6408

Open

EgorBo approved these changes Mar 27, 2026

4 participants