Use mul+add+permute sequence for DotProduct when AVX is available #125666

Open
alexcovington wants to merge 10 commits into dotnet:main from alexcovington:avx-dotproduct

@alexcovington
Contributor

On x86, when AVX is available, computing a dot product with a multiply+permute+add sequence is generally faster than using vdpps/vdppd.

This PR modifies lowering to use the multiply+permute+addition sequence if AVX is available.

| Namespace                       | Type                     | Method       | Job        | Toolchain                   | Mean     | Error     | StdDev    | Median   | Min      | Max      | Ratio | RatioSD | Allocated | Alloc Ratio |
|-------------------------------- |------------------------- |------------- |----------- |---------------------------- |---------:|----------:|----------:|---------:|---------:|---------:|------:|--------:|----------:|------------:|
| System.Numerics.Tests           | Perf_Plane               | DotBenchmark | Job-IJGDOK | \base\Core_Root\corerun.exe | 1.708 ns | 0.0591 ns | 0.0680 ns | 1.673 ns | 1.645 ns | 1.840 ns |  1.00 |    0.00 |         - |          NA |
| System.Numerics.Tests           | Perf_Plane               | DotBenchmark | Job-XUUJWJ | \diff\Core_Root\corerun.exe | 1.295 ns | 0.0098 ns | 0.0076 ns | 1.296 ns | 1.284 ns | 1.308 ns |  0.76 |    0.03 |         - |          NA |
|                                 |                          |              |            |                             |          |           |           |          |          |          |       |         |           |             |
| System.Numerics.Tests           | Perf_Quaternion          | DotBenchmark | Job-IJGDOK | \base\Core_Root\corerun.exe | 1.655 ns | 0.0324 ns | 0.0333 ns | 1.638 ns | 1.628 ns | 1.740 ns |  1.00 |    0.00 |         - |          NA |
| System.Numerics.Tests           | Perf_Quaternion          | DotBenchmark | Job-XUUJWJ | \diff\Core_Root\corerun.exe | 1.302 ns | 0.0288 ns | 0.0308 ns | 1.287 ns | 1.278 ns | 1.373 ns |  0.79 |    0.02 |         - |          NA |
|                                 |                          |              |            |                             |          |           |           |          |          |          |       |         |           |             |
| System.Numerics.Tests           | Perf_Vector2             | DotBenchmark | Job-IJGDOK | \base\Core_Root\corerun.exe | 1.664 ns | 0.0231 ns | 0.0205 ns | 1.667 ns | 1.632 ns | 1.709 ns |  1.00 |    0.00 |         - |          NA |
| System.Numerics.Tests           | Perf_Vector2             | DotBenchmark | Job-XUUJWJ | \diff\Core_Root\corerun.exe | 1.295 ns | 0.0167 ns | 0.0130 ns | 1.294 ns | 1.276 ns | 1.313 ns |  0.78 |    0.01 |         - |          NA |
|                                 |                          |              |            |                             |          |           |           |          |          |          |       |         |           |             |
| System.Runtime.Intrinsics.Tests | Perf_Vector128Float      | DotBenchmark | Job-IJGDOK | \base\Core_Root\corerun.exe | 1.706 ns | 0.0923 ns | 0.1063 ns | 1.648 ns | 1.624 ns | 1.961 ns |  1.00 |    0.00 |         - |          NA |
| System.Runtime.Intrinsics.Tests | Perf_Vector128Float      | DotBenchmark | Job-XUUJWJ | \diff\Core_Root\corerun.exe | 1.314 ns | 0.0369 ns | 0.0425 ns | 1.302 ns | 1.273 ns | 1.420 ns |  0.77 |    0.05 |         - |          NA |
|                                 |                          |              |            |                             |          |           |           |          |          |          |       |         |           |             |
| System.Runtime.Intrinsics.Tests | Perf_Vector128Of<Double> | DotBenchmark | Job-IJGDOK | \base\Core_Root\corerun.exe | 1.476 ns | 0.0282 ns | 0.0313 ns | 1.474 ns | 1.443 ns | 1.534 ns |  1.00 |    0.00 |         - |          NA |
| System.Runtime.Intrinsics.Tests | Perf_Vector128Of<Double> | DotBenchmark | Job-XUUJWJ | \diff\Core_Root\corerun.exe | 1.131 ns | 0.0327 ns | 0.0377 ns | 1.116 ns | 1.098 ns | 1.219 ns |  0.77 |    0.03 |         - |          NA |
|                                 |                          |              |            |                             |          |           |           |          |          |          |       |         |           |             |
| System.Runtime.Intrinsics.Tests | Perf_Vector128Of<Single> | DotBenchmark | Job-IJGDOK | \base\Core_Root\corerun.exe | 1.652 ns | 0.0278 ns | 0.0260 ns | 1.651 ns | 1.620 ns | 1.710 ns |  1.00 |    0.00 |         - |          NA |
| System.Runtime.Intrinsics.Tests | Perf_Vector128Of<Single> | DotBenchmark | Job-XUUJWJ | \diff\Core_Root\corerun.exe | 1.301 ns | 0.0238 ns | 0.0199 ns | 1.301 ns | 1.274 ns | 1.347 ns |  0.79 |    0.02 |         - |          NA |
|                                 |                          |              |            |                             |          |           |           |          |          |          |       |         |           |             |
| System.Numerics.Tests           | Perf_VectorOf<Double>    | DotBenchmark | Job-IJGDOK | \base\Core_Root\corerun.exe | 1.474 ns | 0.0163 ns | 0.0127 ns | 1.468 ns | 1.462 ns | 1.501 ns |  1.00 |    0.00 |         - |          NA |
| System.Numerics.Tests           | Perf_VectorOf<Double>    | DotBenchmark | Job-XUUJWJ | \diff\Core_Root\corerun.exe | 1.291 ns | 0.0109 ns | 0.0085 ns | 1.289 ns | 1.282 ns | 1.311 ns |  0.88 |    0.01 |         - |          NA |
|                                 |                          |              |            |                             |          |           |           |          |          |          |       |         |           |             |
| System.Numerics.Tests           | Perf_VectorOf<Single>    | DotBenchmark | Job-IJGDOK | \base\Core_Root\corerun.exe | 1.887 ns | 0.0756 ns | 0.0841 ns | 1.853 ns | 1.811 ns | 2.095 ns |  1.00 |    0.00 |         - |          NA |
| System.Numerics.Tests           | Perf_VectorOf<Single>    | DotBenchmark | Job-XUUJWJ | \diff\Core_Root\corerun.exe | 1.295 ns | 0.0105 ns | 0.0082 ns | 1.293 ns | 1.286 ns | 1.311 ns |  0.69 |    0.03 |         - |          NA |
|                                 |                          |              |            |                             |          |           |           |          |          |          |       |         |           |             |
| System.Numerics.Tests           | Perf_Vector3             | DotBenchmark | Job-IJGDOK | \base\Core_Root\corerun.exe | 1.704 ns | 0.0420 ns | 0.0467 ns | 1.702 ns | 1.641 ns | 1.806 ns |  1.00 |    0.00 |         - |          NA |
| System.Numerics.Tests           | Perf_Vector3             | DotBenchmark | Job-XUUJWJ | \diff\Core_Root\corerun.exe | 1.331 ns | 0.0423 ns | 0.0488 ns | 1.317 ns | 1.283 ns | 1.412 ns |  0.78 |    0.03 |         - |          NA |
|                                 |                          |              |            |                             |          |           |           |          |          |          |       |         |           |             |
| System.Numerics.Tests           | Perf_Vector4             | DotBenchmark | Job-IJGDOK | \base\Core_Root\corerun.exe | 1.675 ns | 0.0402 ns | 0.0430 ns | 1.666 ns | 1.633 ns | 1.781 ns |  1.00 |    0.00 |         - |          NA |
| System.Numerics.Tests           | Perf_Vector4             | DotBenchmark | Job-XUUJWJ | \diff\Core_Root\corerun.exe | 1.293 ns | 0.0133 ns | 0.0111 ns | 1.289 ns | 1.280 ns | 1.315 ns |  0.77 |    0.02 |         - |          NA |
Disasm

System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>

Base

; System.Runtime.Intrinsics.Tests.Perf_Vector128Of`1[[System.Single, System.Private.CoreLib]].DotBenchmark()
       vpcmpeqd  xmm0,xmm0,xmm0
       vpcmpeqd  xmm1,xmm1,xmm1
       vdpps     xmm0,xmm0,xmm1,0FF
       ret
; Total bytes of code 15

Diff

; System.Runtime.Intrinsics.Tests.Perf_Vector128Of`1[[System.Single, System.Private.CoreLib]].DotBenchmark()
       vpcmpeqd  xmm0,xmm0,xmm0
       vpcmpeqd  xmm1,xmm1,xmm1
       vmulps    xmm0,xmm1,xmm0
       vpermilps xmm1,xmm0,0B1
       vaddps    xmm0,xmm1,xmm0
       vpermilps xmm1,xmm0,4E
       vaddps    xmm0,xmm1,xmm0
       ret
; Total bytes of code 33
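
To make the two permute immediates in the Diff listing easier to follow, here is a minimal scalar sketch in C of what the sequence computes. It is only a model of the instruction semantics, not the JIT lowering code from this PR; the names `permilps128` and `dot4_model` are hypothetical and introduced purely for illustration.

```c
/* Scalar model of vpermilps on a single 128-bit lane:
   element i of the result is src[(imm >> (2 * i)) & 3]. */
static void permilps128(float dst[4], const float src[4], unsigned imm)
{
    float tmp[4];
    for (int i = 0; i < 4; i++)
        tmp[i] = src[(imm >> (2 * i)) & 3];
    for (int i = 0; i < 4; i++)
        dst[i] = tmp[i];
}

/* Hypothetical helper mirroring the five-instruction Diff sequence. */
float dot4_model(const float a[4], const float b[4])
{
    float acc[4], shuf[4];

    for (int i = 0; i < 4; i++)      /* vmulps: element-wise products */
        acc[i] = a[i] * b[i];

    permilps128(shuf, acc, 0xB1);    /* vpermilps 0B1: [p1, p0, p3, p2] */
    for (int i = 0; i < 4; i++)      /* vaddps: adjacent pairs summed */
        acc[i] += shuf[i];

    permilps128(shuf, acc, 0x4E);    /* vpermilps 4E: swap the 64-bit halves */
    for (int i = 0; i < 4; i++)      /* vaddps: pair sums combined */
        acc[i] += shuf[i];

    return acc[0];                   /* every element now holds the full sum */
}
```

After the second add, all four elements hold the same total, which matches the broadcast result that `vdpps` with mask `0FF` produces.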

System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>

Base

; System.Runtime.Intrinsics.Tests.Perf_Vector128Of`1[[System.Double, System.Private.CoreLib]].DotBenchmark()
       vpcmpeqd  xmm0,xmm0,xmm0
       vpcmpeqd  xmm1,xmm1,xmm1
       vdppd     xmm0,xmm0,xmm1,33
       ret
; Total bytes of code 15

Diff

; System.Runtime.Intrinsics.Tests.Perf_Vector128Of`1[[System.Double, System.Private.CoreLib]].DotBenchmark()
       vpcmpeqd  xmm0,xmm0,xmm0
       vpcmpeqd  xmm1,xmm1,xmm1
       vmulpd    xmm0,xmm1,xmm0
       vpermilpd xmm1,xmm0,1
       vaddpd    xmm0,xmm1,xmm0
       ret
; Total bytes of code 23

System.Numerics.Tests.Perf_VectorOf<Single>

Base

; System.Numerics.Tests.Perf_VectorOf`1[[System.Single, System.Private.CoreLib]].DotBenchmark()
       vbroadcastss ymm0,dword ptr [0C5A4]
       vdpps     ymm0,ymm0,[0C5C0],0FF
       vperm2f128 ymm1,ymm0,ymm0,1
       vaddps    ymm0,ymm1,ymm0
       vzeroupper
       ret
; Total bytes of code 33

Diff

; System.Numerics.Tests.Perf_VectorOf`1[[System.Single, System.Private.CoreLib]].DotBenchmark()
       vbroadcastss ymm0,dword ptr [0C8E8]
       vmulps    ymm0,ymm0,dword bcst [0C8EC]
       vpermilps ymm1,ymm0,0B1
       vaddps    ymm0,ymm1,ymm0
       vpermilps ymm1,ymm0,4E
       vaddps    ymm0,ymm0,ymm1
       vperm2f128 ymm1,ymm0,ymm0,1
       vaddps    ymm0,ymm1,ymm0
       vzeroupper
       ret
; Total bytes of code 53
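
The 256-bit listing adds one cross-lane step, because `vpermilps` only shuffles within each 128-bit lane. The following C sketch models that, again as a scalar illustration under the assumption that a `ymm` register is treated as eight floats in two lanes of four; `permilps256`, `swap_lanes` and `dot8_model` are hypothetical names, not code from the PR.

```c
/* Scalar model of 256-bit vpermilps: the same 8-bit immediate is applied
   independently to each 128-bit lane (4 floats per lane). */
static void permilps256(float dst[8], const float src[8], unsigned imm)
{
    float tmp[8];
    for (int lane = 0; lane < 2; lane++)
        for (int i = 0; i < 4; i++)
            tmp[lane * 4 + i] = src[lane * 4 + ((imm >> (2 * i)) & 3)];
    for (int i = 0; i < 8; i++)
        dst[i] = tmp[i];
}

/* Scalar model of vperm2f128 dst, x, x, 1: swaps the two 128-bit lanes. */
static void swap_lanes(float dst[8], const float src[8])
{
    float tmp[8];
    for (int i = 0; i < 4; i++) {
        tmp[i] = src[4 + i];
        tmp[4 + i] = src[i];
    }
    for (int i = 0; i < 8; i++)
        dst[i] = tmp[i];
}

/* Hypothetical helper mirroring the 256-bit Diff sequence. */
float dot8_model(const float a[8], const float b[8])
{
    float acc[8], shuf[8];

    for (int i = 0; i < 8; i++) acc[i] = a[i] * b[i];   /* vmulps */

    permilps256(shuf, acc, 0xB1);                       /* swap adjacent elements */
    for (int i = 0; i < 8; i++) acc[i] += shuf[i];      /* vaddps */

    permilps256(shuf, acc, 0x4E);                       /* swap halves of each lane */
    for (int i = 0; i < 8; i++) acc[i] += shuf[i];      /* vaddps: per-lane sums */

    swap_lanes(shuf, acc);                              /* vperm2f128 ..., 1 */
    for (int i = 0; i < 8; i++) acc[i] += shuf[i];      /* vaddps: lanes combined */

    return acc[0];
}
```

The first two permute+add rounds leave each lane holding its own partial sum; only the final lane swap and add combine the two lanes into the full result.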

System.Numerics.Tests.Perf_VectorOf<Double>

Base

; System.Numerics.Tests.Perf_VectorOf`1[[System.Double, System.Private.CoreLib]].DotBenchmark()
       vbroadcastsd ymm0,qword ptr [0E6D8]
       vmulpd    ymm0,ymm0,qword bcst [0E6E0]
       vhaddpd   ymm0,ymm0,ymm0
       vperm2f128 ymm1,ymm0,ymm0,1
       vaddpd    ymm0,ymm1,ymm0
       vzeroupper
       ret
; Total bytes of code 37

Diff

; System.Numerics.Tests.Perf_VectorOf`1[[System.Double, System.Private.CoreLib]].DotBenchmark()
       vbroadcastsd ymm0,qword ptr [0E880]
       vmulpd    ymm0,ymm0,qword bcst [0E888]
       vpermilpd ymm1,ymm0,5
       vaddpd    ymm0,ymm1,ymm0
       vperm2f128 ymm1,ymm0,ymm0,1
       vaddpd    ymm0,ymm1,ymm0
       vzeroupper
       ret
; Total bytes of code 43
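
For the double-precision case, `vpermilpd` decodes its immediate differently from `vpermilps`: it uses one selector bit per element rather than two. The C sketch below models the 256-bit Diff sequence under that reading; it is a scalar illustration only, and `permilpd256` and `dot4d_model` are hypothetical names.

```c
/* Scalar model of 256-bit vpermilpd: bit i of the immediate picks the low (0)
   or high (1) double of the 128-bit lane that element i belongs to. */
static void permilpd256(double dst[4], const double src[4], unsigned imm)
{
    double tmp[4];
    for (int i = 0; i < 4; i++)
        tmp[i] = src[(i & ~1) + ((imm >> i) & 1)];
    for (int i = 0; i < 4; i++)
        dst[i] = tmp[i];
}

/* Hypothetical helper mirroring the 256-bit double Diff sequence. */
double dot4d_model(const double a[4], const double b[4])
{
    double acc[4], shuf[4];

    for (int i = 0; i < 4; i++)      /* vmulpd */
        acc[i] = a[i] * b[i];

    permilpd256(shuf, acc, 5);       /* vpermilpd 5: [p1, p0, p3, p2] */
    for (int i = 0; i < 4; i++)      /* vaddpd: per-lane sums */
        acc[i] += shuf[i];

    /* vperm2f128 ..., 1: swap the two 128-bit lanes */
    shuf[0] = acc[2]; shuf[1] = acc[3];
    shuf[2] = acc[0]; shuf[3] = acc[1];
    for (int i = 0; i < 4; i++)      /* vaddpd: lanes combined */
        acc[i] += shuf[i];

    return acc[0];
}
```

In this model the in-lane permute+add plays the role that `vhaddpd` plays in the Base listing; the cross-lane `vperm2f128` and final add are the same in both.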

dotnet-policy-service bot added the community-contribution label Mar 17, 2026
@dotnet-policy-service
Contributor

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

Member

@tannergooding tannergooding left a comment

LGTM, minus the nit about code duplication. I'm fine with deferring that, but it'd be nice to get it handled.

CC. @kg, @EgorBo for secondary review

Contributor

Copilot AI left a comment

Pull request overview

This PR updates x86/x64 JIT lowering for SIMD DotProduct to prefer a MUL + permute + add reduction sequence when AVX is available, avoiding vdpps/vdppd in those cases to improve performance for common Vector* dot-product patterns.

Changes:

  • Replace AVX float/double dot-product lowering with explicit multiply + permute + add reduction sequences in LowerHWIntrinsicDot.
  • Add AVX-gated alternative lowering paths for some Vector128/Vector128 dot-product cases.

@EgorBo
Member

EgorBo commented Mar 23, 2026

It looks like Copilot left some useful feedback to address.

Alex Covington (Advanced Micro Devices Inc) added 3 commits March 24, 2026 09:23
Member

@EgorBo EgorBo left a comment

LGTM with a few nits

Don't remove node if we can't find user

Co-authored-by: Egor Bogatov <egorbo@gmail.com>
Member

@EgorBo EgorBo left a comment

Thanks!

Labels

area-System.Memory, community-contribution
