JIT: improve throughput of the RLCSE greedy heuristic #98906
AndyAyersMS merged 1 commit into dotnet:main
Conversation
Profiling showed that `GetFeatures` was a major factor in throughput. For the most part the features of CSE candidates don't change as we perform CSEs, so build in some logic to avoid recomputing the feature set unless there is some evidence features have changed.

To avoid having to remove already performed candidates from the candidate vector, we now tag them as `m_performed`; these get ignored during subsequent processing, and discarded if we ever recompute features.

This should cut the TP impact roughly in half; the remaining part seems to largely be from doing more CSEs (which we hope will show some perf benefit).

Contributes to dotnet#92915.
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
One note on "unchanging" features: I realized while working on this that "is live across call" is a volatile feature (say we CSE a helper call), but we have no easy way to discover or recompute this. Likewise for "is LSRA live across call", though that does get recomputed if we recompute for other reasons.
Also, some helper calls have custom calling conventions (e.g. write barriers), so presumably CSE candidates don't suffer from living across those.
I'll have to check, but I don't think CSE analysis takes potential write barrier calls (or any other late-introduced call) into account. |
The "lsra live across" is potentially expensive as it could walk most of the flow graph per candidate. I could make this more efficient by doing just one walk but that would require revising the flow of candidate costing. Will try and get a handle on the cost first. Pareto frontier data suggests that there is a pretty hard tradeoff between perf score and code size (and hence I'm guessing TP) so not clear how much better things can get here. |
I also took advantage of #98434, so in the symbol table listing the CSE temps now show which CSE candidate inspired them.