JIT: improve throughput of the RLCSE greedy heuristic #98906
AndyAyersMS merged 1 commit into dotnet:main
Conversation
Profiling showed that `GetFeatures` was a major factor in throughput. For the most part the features of CSE candidates don't change as we perform CSEs, so build in some logic to avoid recomputing the feature set unless there is some evidence features have changed.

To avoid having to remove already performed candidates from the candidate vector, we now tag them as `m_performed`; these get ignored during subsequent processing, and discarded if we ever recompute features.

This should cut the TP impact roughly in half; the remaining part seems to largely be from doing more CSEs (which we hope will show some perf benefit).

Contributes to dotnet#92915.
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
One note on "unchanging" features: I realized while working on this that "is live across call" is a volatile feature (say we CSE a helper call), but we have no easy way to discover or recompute this. Likewise for "is LSRA live across call", though that does get recomputed if we recompute for other reasons.
Also, some helper calls have custom calling conventions (e.g. write barriers), so presumably CSE candidates don't suffer from living across those.
I'll have to check, but I don't think CSE analysis takes potential write barrier calls (or any other late-introduced call) into account. |
The "lsra live across" is potentially expensive as it could walk most of the flow graph per candidate. I could make this more efficient by doing just one walk but that would require revising the flow of candidate costing. Will try and get a handle on the cost first. Pareto frontier data suggests that there is a pretty hard tradeoff between perf score and code size (and hence I'm guessing TP) so not clear how much better things can get here. |
I also took advantage of #98434, so in the symbol table listing the CSE temps now show which CSE candidate inspired them.