Skip to content

Remove unnecessary delayfree from xarch FMA and TERNLOG instructions#128350

Open
tannergooding wants to merge 8 commits into
dotnet:mainfrom
tannergooding:remove-unnecessary-delayfree
Open

Remove unnecessary delayfree from xarch FMA and TERNLOG instructions#128350
tannergooding wants to merge 8 commits into
dotnet:mainfrom
tannergooding:remove-unnecessary-delayfree

Conversation

@tannergooding
Copy link
Copy Markdown
Member

@tannergooding tannergooding commented May 19, 2026

These instructions are fully reorderable and so much like various commutative nodes do not need to be marked delay free except in special scenarios.

This resolves #62215

Copilot AI review requested due to automatic review settings May 19, 2026 00:37
@github-actions github-actions Bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 19, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates xarch JIT HWIntrinsic register allocation/codegen handling to avoid marking AVX FMA and AVX-512 TernaryLogic operands as “delay-free” in most cases, while adding codegen support for additional operand/target overlap scenarios for TernaryLogic.

Changes:

  • Simplifies LSRA operand-use construction for AVX2/AVX512 FMA intrinsics, only using delay-free constraints when required by CopyUpperBits semantics.
  • Adds a dedicated LSRA path for NI_AVX512_TernaryLogic when the control byte is an immediate, avoiding delay-free uses in that case.
  • Extends genHWIntrinsic_R_R_R_RM_I to adjust TernaryLogic control immediates when the target register overlaps certain operands.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
src/coreclr/jit/lsraxarch.cpp Adjusts LSRA use/target preferencing for FMA intrinsics and adds a special LSRA build path for AVX512_TernaryLogic with immediate control byte.
src/coreclr/jit/hwintrinsiccodegenxarch.cpp Adds TernaryLogic-specific handling to rewrite the control byte when operand/target register overlap is detected.

Comment thread src/coreclr/jit/hwintrinsiccodegenxarch.cpp Outdated
Comment thread src/coreclr/jit/lsraxarch.cpp Outdated
Copilot AI review requested due to automatic review settings May 19, 2026 02:02
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

Copilot AI review requested due to automatic review settings May 19, 2026 12:57
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment thread src/coreclr/jit/lsraxarch.cpp
Comment thread src/coreclr/jit/lsraxarch.cpp Outdated
Copilot AI review requested due to automatic review settings May 19, 2026 17:21
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comment thread src/coreclr/jit/hwintrinsiccodegenxarch.cpp
@tannergooding tannergooding requested a review from EgorBo May 19, 2026 20:46
@tannergooding
Copy link
Copy Markdown
Member Author

CC. @dotnet/jit-contrib, @EgorBo for review.

Diffs are here: https://dev.azure.com/dnceng-public/public/_build/results?buildId=1427007&view=ms.vss-build-web.run-extensions-tab

This removes a large number of unnecessary vmovaps prior/after to vpternlog and vfmadd instructions, such as:

-       vpternlogq xmm4, xmm0, xmm9, -106
-       vmovaps  xmm0, xmm4
+       vpternlogq xmm0, xmm4, xmm9, -106

The x86 diffs show the largest size improvement due to the limited register set available (XMM0-7 only).

The Linux x64 diffs then show a smaller improvement because they end up selecting XMM16-XMM31 in many cases which requires more bytes to encode. All XMM registers are CALLEE_TRASH and most of the SIMD methods don't involve calls.

The Windows x64 diffs then show a size regression because the register allocator ends up selecting callee saved registers more causing a bloat in the method prologue/epilogue due to them having to be saved/restored.

The change is overall an improvement and can be seen when observing the three variations here in unison. We probably want to look a bit into the Windowx 64 register ordering though since it really should be preferencing the EVEX registers over using the callee save registers.

@EgorBo
Copy link
Copy Markdown
Member

EgorBo commented May 19, 2026

The Windows x64 diffs then show a size regression

I'm trying to understand why quite a few lines of changes regressed more contexts than improved (size-wise) and PerfScore says that 6 collections regressed (overall) and only 3 improved and we should take it 😐

@AndyAyersMS
Copy link
Copy Markdown
Member

The Windows x64 diffs then show a size regression

I'm trying to understand why quite a few lines of changes regressed more contexts than improved (size-wise) and PerfScore says that 6 collections regressed (overall) and only 3 improved and we should take it 😐

Most of those size-increasting diffs are in tests, so maybe there's some pattern there that is not that representative?

@tannergooding
Copy link
Copy Markdown
Member Author

tannergooding commented May 19, 2026

Most of those size-increasting diffs are in tests, so maybe there's some pattern there that is not that representative?

@AndyAyersMS is correct. There's primarily two things at play here...

First, Windows x64 has CALLEE_SAVE registers for FP/SIMD while Linux x64 and Windows x86 do not. This is the primary reason why the latter two are pure improvements. Any state that survives past a call must have already been CALLER_SAVE on these platforms and so the change can only remove unnecessary trashing of a register regardless of what LSRA selects. Windows x64 on the other hand can have its preferences changed such that a callee save register is preferred where it wasn't before, so while the body may reduce moves the prologue/epilogue will grow instead.

The secondary consideration and why this primarily shows up in the tests is because the tests have an unusual pattern where they do SIMD computation and then assert the value is correct. In many cases this assertion boils down to extracting each element of the vector and checking it is correct. For example, with Vector2 we have a MathHelper.Equals(Vector2 a, Vector2 b) method which does Equal(a.X, b.X) && Equal(a.Y, b.Y), where that then does Math.Abs(v1 - v2) < epsilon. This introduces singular value extractions and the SIMD value living across a call boundary, which forces LSRA to preference CALLEE_SAVE state.

If a user actually has a pattern like that for production SIMD code, then this will potentially regress them for FMA/VPTERNLOG scenarios, but the same would also be true for every other instruction that is already correctly handling target preferencing. It's also a highly unusual pattern for SIMD, since you want to avoid such code altogether and typically keep SIMD values as SIMD and in a single register.


There's also of course edges where LSRA just doesn't do a great job.

There's a couple causes where not being delay free causes LSRA to pick different registers and kill a register that was holding a well known constant (typically 0) and so we emit a new vxorps xmm4, xmm4, xmm4. This one is a known pessimization with SIMD/FP constants we've had forever. We don't have an easy fix for it.

We then have a case I'm looking into for System.Numerics.Tensors.TensorPrimitives:IndexOfMinMaxVectorized256Size4Plus where the method body change is:

-       vpmovm2q ymm16, k1
-       vpternlogq ymm6, ymm6, ymm16, 85
-       vblendvpd ymm0, ymm5, ymm0, ymm6
-       vpternlogq ymm5, ymm5, ymm16, 85
-       vpblendvb ymm3, ymm4, ymm3, ymm5
+       vpmovm2q ymm6, k1
+       vpternlogq ymm7, ymm7, ymm6, 85
+       vblendvpd ymm0, ymm5, ymm0, ymm7
+       vpternlogq ymm6, ymm6, ymm6, 85
+       vpblendvb ymm3, ymm4, ymm3, ymm6

So LSRA decided to use ymm7 (CALLEE_SAVE) where previously it opted to use ymm16 (CALLER_SAVE). This causes an extra prologue/epilogue spill even though there's no reason for that to occur.

Overall though, particularly for production code, this is a positive improvement to the handling and fixes a pessimization we knew we had around these instructions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve FMA code generation related to operand last use

4 participants