Remove unnecessary delayfree from xarch FMA and TERNLOG instructions by tannergooding · Pull Request #128350 · dotnet/runtime

tannergooding · 2026-05-19T00:37:59Z

These instructions are fully reorderable and so much like various commutative nodes do not need to be marked delay free except in special scenarios.

This resolves #62215

…ctions

Copilot

Pull request overview

This PR updates xarch JIT HWIntrinsic register allocation/codegen handling to avoid marking AVX FMA and AVX-512 TernaryLogic operands as “delay-free” in most cases, while adding codegen support for additional operand/target overlap scenarios for TernaryLogic.

Changes:

Simplifies LSRA operand-use construction for AVX2/AVX512 FMA intrinsics, only using delay-free constraints when required by CopyUpperBits semantics.
Adds a dedicated LSRA path for NI_AVX512_TernaryLogic when the control byte is an immediate, avoiding delay-free uses in that case.
Extends genHWIntrinsic_R_R_R_RM_I to adjust TernaryLogic control immediates when the target register overlaps certain operands.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
src/coreclr/jit/lsraxarch.cpp	Adjusts LSRA use/target preferencing for FMA intrinsics and adds a special LSRA build path for `AVX512_TernaryLogic` with immediate control byte.
src/coreclr/jit/hwintrinsiccodegenxarch.cpp	Adds `TernaryLogic`-specific handling to rewrite the control byte when operand/target register overlap is detected.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

tannergooding · 2026-05-19T20:58:16Z

CC. @dotnet/jit-contrib, @EgorBo for review.

Diffs are here: https://dev.azure.com/dnceng-public/public/_build/results?buildId=1427007&view=ms.vss-build-web.run-extensions-tab

This removes a large number of unnecessary vmovaps prior/after to vpternlog and vfmadd instructions, such as:

-       vpternlogq xmm4, xmm0, xmm9, -106
-       vmovaps  xmm0, xmm4
+       vpternlogq xmm0, xmm4, xmm9, -106

The x86 diffs show the largest size improvement due to the limited register set available (XMM0-7 only).

The Linux x64 diffs then show a smaller improvement because they end up selecting XMM16-XMM31 in many cases which requires more bytes to encode. All XMM registers are CALLEE_TRASH and most of the SIMD methods don't involve calls.

The Windows x64 diffs then show a size regression because the register allocator ends up selecting callee saved registers more causing a bloat in the method prologue/epilogue due to them having to be saved/restored.

The change is overall an improvement and can be seen when observing the three variations here in unison. We probably want to look a bit into the Windowx 64 register ordering though since it really should be preferencing the EVEX registers over using the callee save registers.

EgorBo · 2026-05-19T22:25:32Z

The Windows x64 diffs then show a size regression

I'm trying to understand why quite a few lines of changes regressed more contexts than improved (size-wise) and PerfScore says that 6 collections regressed (overall) and only 3 improved and we should take it 😐

AndyAyersMS · 2026-05-19T23:12:53Z

The Windows x64 diffs then show a size regression

I'm trying to understand why quite a few lines of changes regressed more contexts than improved (size-wise) and PerfScore says that 6 collections regressed (overall) and only 3 improved and we should take it 😐

Most of those size-increasting diffs are in tests, so maybe there's some pattern there that is not that representative?

tannergooding · 2026-05-19T23:29:45Z

Most of those size-increasting diffs are in tests, so maybe there's some pattern there that is not that representative?

@AndyAyersMS is correct. There's primarily two things at play here...

First, Windows x64 has CALLEE_SAVE registers for FP/SIMD while Linux x64 and Windows x86 do not. This is the primary reason why the latter two are pure improvements. Any state that survives past a call must have already been CALLER_SAVE on these platforms and so the change can only remove unnecessary trashing of a register regardless of what LSRA selects. Windows x64 on the other hand can have its preferences changed such that a callee save register is preferred where it wasn't before, so while the body may reduce moves the prologue/epilogue will grow instead.

The secondary consideration and why this primarily shows up in the tests is because the tests have an unusual pattern where they do SIMD computation and then assert the value is correct. In many cases this assertion boils down to extracting each element of the vector and checking it is correct. For example, with Vector2 we have a MathHelper.Equals(Vector2 a, Vector2 b) method which does Equal(a.X, b.X) && Equal(a.Y, b.Y), where that then does Math.Abs(v1 - v2) < epsilon. This introduces singular value extractions and the SIMD value living across a call boundary, which forces LSRA to preference CALLEE_SAVE state.

If a user actually has a pattern like that for production SIMD code, then this will potentially regress them for FMA/VPTERNLOG scenarios, but the same would also be true for every other instruction that is already correctly handling target preferencing. It's also a highly unusual pattern for SIMD, since you want to avoid such code altogether and typically keep SIMD values as SIMD and in a single register.

There's also of course edges where LSRA just doesn't do a great job.

There's a couple causes where not being delay free causes LSRA to pick different registers and kill a register that was holding a well known constant (typically 0) and so we emit a new vxorps xmm4, xmm4, xmm4. This one is a known pessimization with SIMD/FP constants we've had forever. We don't have an easy fix for it.

We then have a case I'm looking into for System.Numerics.Tensors.TensorPrimitives:IndexOfMinMaxVectorized256Size4Plus where the method body change is:

-       vpmovm2q ymm16, k1
-       vpternlogq ymm6, ymm6, ymm16, 85
-       vblendvpd ymm0, ymm5, ymm0, ymm6
-       vpternlogq ymm5, ymm5, ymm16, 85
-       vpblendvb ymm3, ymm4, ymm3, ymm5
+       vpmovm2q ymm6, k1
+       vpternlogq ymm7, ymm7, ymm6, 85
+       vblendvpd ymm0, ymm5, ymm0, ymm7
+       vpternlogq ymm6, ymm6, ymm6, 85
+       vpblendvb ymm3, ymm4, ymm3, ymm6

So LSRA decided to use ymm7 (CALLEE_SAVE) where previously it opted to use ymm16 (CALLER_SAVE). This causes an extra prologue/epilogue spill even though there's no reason for that to occur.

Overall though, particularly for production code, this is a positive improvement to the handling and fixes a pessimization we knew we had around these instructions.

tannergooding added 2 commits May 18, 2026 17:06

Remove some unnecessary delay free markings from xarch FMA instructions

5a77621

Remove some unnecessary delay free markings from xarch TERNLOG instru…

8a0bfb2

…ctions

Copilot AI review requested due to automatic review settings May 19, 2026 00:37

Copilot started reviewing on behalf of tannergooding May 19, 2026 00:38 View session

github-actions Bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 19, 2026

Copilot AI reviewed May 19, 2026

View reviewed changes

Comment thread src/coreclr/jit/hwintrinsiccodegenxarch.cpp Outdated

Comment thread src/coreclr/jit/lsraxarch.cpp Outdated

tannergooding added 2 commits May 18, 2026 18:05

Resolve feedback

c679229

Remove a bad assert

de5be2c

Copilot AI review requested due to automatic review settings May 19, 2026 02:02

Copilot started reviewing on behalf of tannergooding May 19, 2026 02:02 View session

Copilot AI reviewed May 19, 2026

View reviewed changes

Remove an assert that's invalid for the jmp table fallback case

2e39f1c

build-analysis Bot mentioned this pull request May 19, 2026

[wasm] Tests failing with DirectoryNotFoundException trying to load test data #128293

Open

Fix formatting

4366dce

Copilot AI review requested due to automatic review settings May 19, 2026 12:57

Copilot started reviewing on behalf of tannergooding May 19, 2026 12:58 View session

Copilot AI reviewed May 19, 2026

View reviewed changes

Comment thread src/coreclr/jit/lsraxarch.cpp

Comment thread src/coreclr/jit/lsraxarch.cpp Outdated

tannergooding added 2 commits May 19, 2026 07:16

Only swap if all operands are used

adf7887

Don't reorder if op3 is a used from a spill temp

0b0ce37

Copilot AI review requested due to automatic review settings May 19, 2026 17:21

Copilot started reviewing on behalf of tannergooding May 19, 2026 17:22 View session

Copilot AI reviewed May 19, 2026

View reviewed changes

Comment thread src/coreclr/jit/hwintrinsiccodegenxarch.cpp

tannergooding requested a review from EgorBo May 19, 2026 20:46

This was referenced May 19, 2026

slow macOS - "##[error]The job running on agent Azure Pipelines 9 ran longer than the maximum time of 60 minutes." dotnet/dnceng#1883

Open

The Operation will be canceled. The next steps may not contain expected logs. dotnet/dnceng#3008

Open

EgorBo approved these changes May 19, 2026

View reviewed changes

build-analysis Bot mentioned this pull request May 19, 2026

ExceptionHandlingExpressions.ExpressionsUnwrapeExternallyThrownRuntimeWrappedException failure #128164

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove unnecessary delayfree from xarch FMA and TERNLOG instructions#128350

Remove unnecessary delayfree from xarch FMA and TERNLOG instructions#128350
tannergooding wants to merge 8 commits into
dotnet:mainfrom
tannergooding:remove-unnecessary-delayfree

tannergooding commented May 19, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

tannergooding commented May 19, 2026

Uh oh!

EgorBo commented May 19, 2026 •

edited

Loading

Uh oh!

AndyAyersMS commented May 19, 2026

Uh oh!

tannergooding commented May 19, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

tannergooding commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

tannergooding commented May 19, 2026

Uh oh!

EgorBo commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

AndyAyersMS commented May 19, 2026

Uh oh!

tannergooding commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

tannergooding commented May 19, 2026 •

edited

Loading

EgorBo commented May 19, 2026 •

edited

Loading

tannergooding commented May 19, 2026 •

edited

Loading