
Conversation

@shipilev
Member

@shipilev shipilev commented May 19, 2025

See the bug for a discussion of the issues the current machinery has.

This PR executes the plan outlined in the bug:

  1. Common up the receiver type profiling code between the interpreter and C1
  2. Rewrite the receiver type profiling code to only do atomic receiver slot installations
  3. Trim C1OptimizeVirtualCallProfiling to only claim slots when the receiver is installed

This PR does not make the counter updates themselves atomic, as that may have much wider performance implications, including regressions. This PR should be at least performance-neutral.
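Step 2 is the core of the change. As a rough illustration of what an atomic receiver slot installation means, here is a hypothetical C++ model (the names TypeProfile, record, and kWidth are invented for illustration; HotSpot's real ReceiverTypeData layout and generated code differ): per-slot receiver pointers are claimed with a CAS, while the counters themselves stay non-atomic, matching the scope above.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical model of a receiver type profile row. Names and layout are
// illustrative only; HotSpot's real ReceiverTypeData differs.
struct TypeProfile {
  static const int kWidth = 2;              // models TypeProfileWidth
  std::atomic<uintptr_t> receiver[kWidth];  // receiver klass "pointers"; 0 == empty
  uint64_t count[kWidth];                   // counters stay non-atomic (see above)

  TypeProfile() {
    for (int i = 0; i < kWidth; i++) {
      receiver[i].store(0);
      count[i] = 0;
    }
  }

  void record(uintptr_t recv) {
    for (int i = 0; i < kWidth; i++) {
      uintptr_t cur = receiver[i].load(std::memory_order_relaxed);
      if (cur == recv) {                    // receiver already installed
        count[i]++;
        return;
      }
      if (cur == 0) {
        uintptr_t expected = 0;
        // Atomic installation: exactly one thread claims the empty slot;
        // a racing thread installing the same receiver is also fine.
        if (receiver[i].compare_exchange_strong(expected, recv) ||
            expected == recv) {
          count[i]++;
          return;
        }
        // Lost the race to a different receiver; keep scanning.
      }
    }
    // Profile is full and the receiver is not in it: polymorphic call site.
  }
};
```

Without the CAS, two threads can both observe an empty slot and overwrite each other's installation, which is the kind of reliability issue this PR addresses.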

Additional testing:

  • Linux x86_64 server fastdebug, compiler/
  • Linux x86_64 server fastdebug, all

Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8357258: x86: Improve receiver type profiling reliability (Enhancement - P4)

Reviewers

Reviewing

Using git

Check out this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/25305/head:pull/25305
$ git checkout pull/25305

Update a local copy of the PR:
$ git checkout pull/25305
$ git pull https://git.openjdk.org/jdk.git pull/25305/head

Using Skara CLI tools

Check out this PR locally:
$ git pr checkout 25305

View PR using the GUI difftool:
$ git pr show -t 25305

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/25305.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented May 19, 2025

👋 Welcome back shade! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented May 19, 2025

@shipilev This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8357258: x86: Improve receiver type profiling reliability

Reviewed-by: kvn, vlivanov

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been no new commits pushed to the master branch. If another commit should be pushed before you perform the /integrate command, your PR will be automatically rebased. If you prefer to avoid any potential automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk

openjdk bot commented May 19, 2025

@shipilev The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label May 19, 2025
@shipilev shipilev changed the title 8357258: x86: Improve C1 known virtual calls profiling reliability 8357258: x86: Improve receiver type profiling reliability May 20, 2025
@shipilev shipilev force-pushed the JDK-8357258-x86-c1-optimize-virt-calls branch from 3a49d15 to 798b4c3 Compare May 20, 2025 12:24
@shipilev shipilev force-pushed the JDK-8357258-x86-c1-optimize-virt-calls branch from d2733a3 to d5b7991 Compare July 9, 2025 10:54
@bridgekeeper

bridgekeeper bot commented Jul 14, 2025

@shipilev This pull request has been inactive for more than 8 weeks and will be automatically closed if another 8 weeks passes without any activity. To avoid this, simply issue a /touch or /keepalive command to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@openjdk

openjdk bot commented Aug 6, 2025

@shipilev this pull request cannot be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout JDK-8357258-x86-c1-optimize-virt-calls
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

@openjdk openjdk bot added the merge-conflict Pull request has merge conflict with target branch label Aug 6, 2025
@shipilev shipilev force-pushed the JDK-8357258-x86-c1-optimize-virt-calls branch from d5b7991 to 3b601fc Compare September 5, 2025 06:28
@openjdk openjdk bot removed the merge-conflict Pull request has merge conflict with target branch label Sep 5, 2025
@shipilev
Member Author

shipilev commented Sep 5, 2025

In addition to the reliability improvements, using a denser loop significantly improves tier3 code density. With a larger TypeProfileWidth, type profile checks are a significant part of the generated code. This density improvement allows us to do the CAS without increasing the code size. It also allows us to store (more) tier3 code in the AOTCache going forward. If/when folks (looking at @theRealAph, really) start doing probabilistic profiling counters, this budget increase would also help to cram in more code.

$ for I in 1 2 3 4; do build/linux-x86_64-server-release/images/jdk/bin/java -XX:TieredStopAtLevel=${I} \
  -Xcomp -XX:+CITime -Xmx2g Hello.java 2>&1 | grep "Tier${I}" | cut -d' ' -f 3,23-; done

=== -XX:TypeProfileWidth=2 (default)

# Baseline
Tier1 nmethods_code_size:  7091616 bytes
Tier2 nmethods_code_size:  7579424 bytes
Tier3 nmethods_code_size: 17494984 bytes
Tier4 nmethods_code_size:  6058128 bytes

# Patched
Tier1 nmethods_code_size:  7091648 bytes
Tier2 nmethods_code_size:  7581808 bytes
Tier3 nmethods_code_size: 16806440 bytes (-4%)
Tier4 nmethods_code_size:  6057920 bytes

=== -XX:TypeProfileWidth=8 (default with +UseJVMCICompiler)

# Baseline
Tier1 nmethods_code_size:  7091672 bytes
Tier2 nmethods_code_size:  7580576 bytes
Tier3 nmethods_code_size: 28096448 bytes
Tier4 nmethods_code_size:  6061280 bytes

# Patched
Tier1 nmethods_code_size:  7090760 bytes
Tier2 nmethods_code_size:  7579432 bytes
Tier3 nmethods_code_size: 16837688 bytes (-40% !!!)
Tier4 nmethods_code_size:  6058104 bytes

@shipilev shipilev marked this pull request as ready for review September 5, 2025 11:40
@openjdk openjdk bot added the rfr Pull request is ready for review label Sep 5, 2025
@mlbridge

mlbridge bot commented Sep 5, 2025

Webrevs

@shipilev
Member Author

Looking for reviews! @dean-long, @vnkozlov, @veresov -- you would probably be interested in this.

@bridgekeeper

bridgekeeper bot commented Oct 22, 2025

@shipilev This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply issue a /touch or /keepalive command to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@shipilev
Member Author

/touch

Not now, bot, still looking for reviewers.

@openjdk

openjdk bot commented Oct 23, 2025

@shipilev The pull request is being re-evaluated and the inactivity timeout has been reset.

@bridgekeeper

bridgekeeper bot commented Nov 20, 2025

@shipilev This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply issue a /touch or /keepalive command to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@shipilev
Member Author

Shoo bots, still looking for reviewers.

@shipilev
Member Author

shipilev commented Dec 1, 2025

Oh, all right! This made me realize we actually have a secondary "fast" case: the receiver is not found, but the profile is full. This is pretty frequent with TypeProfileWidth=2. In that case, we were doing way too much work, anticipating a receiver slot installation that would never actually come. Specializing for that case costs significantly fewer loads and makes the code pipeline much better; I suspect that is because tight loops that do not have CASes in them are uop-cached more readily.
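A hypothetical sketch of that specialization (illustrative C++, not the generated assembly; lookup_fast is an invented name). Assuming slots are claimed in order, a non-null last slot means the profile can never gain new receivers, so the search degenerates to a tight compare-only loop with no null checks and no CAS setup:

```cpp
#include <cstdint>

// Hypothetical sketch of the "receiver not found, profile full" fast case.
// Assumes slots are claimed in order, so a non-null last slot means "full".
// Returns the matching slot index, -1 for "full and not found" (polymorphic),
// or -2 for "not full" (fall through to the slower install path).
int lookup_fast(const uintptr_t* slots, int width, uintptr_t recv) {
  if (slots[width - 1] != 0) {
    // Full profile: tight compare-only loop, no null checks, no CAS setup.
    for (int i = 0; i < width; i++) {
      if (slots[i] == recv) return i;
    }
    return -1;
  }
  return -2;
}
```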

We now lose "only" 0.5ns in this test:

Benchmark                                                (randomized)  Mode  Cnt    Score   Error      Units

# Baseline
InterfaceCalls.test2ndInt5Types                                 false  avgt   12   16.945 ±  0.079      ns/op
InterfaceCalls.test2ndInt5Types:L1-dcache-load-misses           false  avgt    3    0.076 ±  2.187       #/op
InterfaceCalls.test2ndInt5Types:L1-dcache-loads                 false  avgt    3   88.738 ±  0.416       #/op
InterfaceCalls.test2ndInt5Types:branch-misses                   false  avgt    3    0.007 ±  0.003       #/op
InterfaceCalls.test2ndInt5Types:branches                        false  avgt    3   49.122 ±  0.353       #/op
InterfaceCalls.test2ndInt5Types:cycles                          false  avgt    3   57.147 ±  1.698       #/op
InterfaceCalls.test2ndInt5Types:instructions                    false  avgt    3  247.443 ±  1.531       #/op

# Old PR version
InterfaceCalls.test2ndInt5Types                                 false  avgt   12   22.513 ±  0.208      ns/op
InterfaceCalls.test2ndInt5Types:L1-dcache-load-misses           false  avgt    3    0.012 ±  0.072       #/op
InterfaceCalls.test2ndInt5Types:L1-dcache-loads                 false  avgt    3  108.446 ± 13.975       #/op  ; +20 loads
InterfaceCalls.test2ndInt5Types:branch-misses                   false  avgt    3    0.407 ±  0.010       #/op
InterfaceCalls.test2ndInt5Types:branches                        false  avgt    3   54.102 ±  0.403       #/op  ; +5 branches
InterfaceCalls.test2ndInt5Types:cycles                          false  avgt    3   75.938 ±  5.043       #/op  ; +19 cycles
InterfaceCalls.test2ndInt5Types:instructions                    false  avgt    3  280.194 ±  5.758       #/op  ; +32 instructions

# New PR version
InterfaceCalls.test2ndInt5Types                                 false  avgt   12   17.441 ±  0.287      ns/op
InterfaceCalls.test2ndInt5Types:L1-dcache-load-misses           false  avgt    3    0.009 ±  0.072       #/op
InterfaceCalls.test2ndInt5Types:L1-dcache-loads                 false  avgt    3   88.803 ±  1.401       #/op
InterfaceCalls.test2ndInt5Types:branch-misses                   false  avgt    3    0.009 ±  0.062       #/op
InterfaceCalls.test2ndInt5Types:branches                        false  avgt    3   52.945 ±  0.752       #/op  ; +4 branches
InterfaceCalls.test2ndInt5Types:cycles                          false  avgt    3   58.866 ± 15.379       #/op  ; +2 cycles
InterfaceCalls.test2ndInt5Types:instructions                    false  avgt    3  272.838 ±  1.665       #/op  ; +28 instructions

The code is in the new commits; it passes hotspot:tier1, and I am running more tests now.

@vnkozlov
Contributor

vnkozlov commented Dec 3, 2025

This looks good. Thank you for cleaning up the code and for the detailed comments.
I submitted our testing.

@iwanowww
Contributor

iwanowww commented Dec 4, 2025

Overall, looks good to me. Nice work, Aleksey!

I'm curious how performance-sensitive that part of the code is. Does it make sense to try to optimize it further?

For example:

  • 2 slots is the most common case; any benefit from optimizing specifically for it (e.g., unrolling the loops)?
  • the fast path can be further optimized for the no-nulls case by offloading more work onto the found_null slow path [1]

[1]

    // Fastest: receiver is already installed
    int i = 0;
    for (; i < receiver_count(); i++) {
      if (receiver(i) == recv) goto found_recv(i);
      if (receiver(i) == null) goto found_null(i);
    }
  
    goto polymorphic
  
    // Slow: try to install receiver
  found_null(i):
    // Finish the search
    for (int j = i ; j < receiver_count(); j++) {
      if (receiver(j) == recv) goto found_recv(j);
    }
    CAS(&receiver(i), null, recv);
    goto restart
...

@vnkozlov
Contributor

vnkozlov commented Dec 4, 2025

2 slots is the most common case; any benefit from optimizing specifically for it (e.g., unrolling the loops)?

Yes, since row_limit() is statically known and does not change, we can have two versions of the code based on its value:

  • <= 2 slots: fully unrolled (many fewer instructions)
  • > 2 slots: the currently proposed code

@vnkozlov
Contributor

vnkozlov commented Dec 4, 2025

the fast path can be further optimized for the no-nulls case by offloading more work onto the found_null slow path [1]

I don't think we need to optimize the > 2 slots case. Such a setting is not the current default. Also, based on @shipilev's comments, two separate loops are more or less optimal.

@vnkozlov
Contributor

vnkozlov commented Dec 5, 2025

My testing of version 07 passed clean.

@shipilev
Member Author

shipilev commented Dec 10, 2025

I'm curious how performance-sensitive that part of the code is. Does it make sense to try to optimize it further?

This is about the fifth version of this code, so I don't think there is much more juice to squeeze out of it. The current version is more or less optimal; the stratification into three cases looks the best-performing overall.

the fast path can be further optimized for the no-nulls case by offloading more work onto the found_null slow path [1]

Yeah, but putting checks for both the installed receiver and the nullptr slot into one loop turns out to hurt performance; this is bad even without the extra control flow. Two separate loops are more efficient, even for a small number of iterations. This also helpfully optimizes for the best case, when the profile is smaller than TypeProfileWidth, which is what we want.
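The two-separate-loops shape could be sketched as follows (hypothetical C++; profile_lookup and its slot layout are invented for illustration, not the actual generated code): the first loop only compares against installed receivers, and only on a miss does the second loop look for an empty slot to CAS into.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical sketch of the two-separate-loops structure described above.
// Returns the slot index matched or claimed, or -1 for the polymorphic case.
int profile_lookup(std::atomic<uintptr_t>* slots, int width, uintptr_t recv) {
  // Loop 1: fast path, receiver already installed somewhere.
  for (int i = 0; i < width; i++) {
    if (slots[i].load(std::memory_order_relaxed) == recv) {
      return i;
    }
  }
  // Loop 2: slow path, try to claim the first empty slot.
  for (int i = 0; i < width; i++) {
    if (slots[i].load(std::memory_order_relaxed) == 0) {
      uintptr_t expected = 0;
      if (slots[i].compare_exchange_strong(expected, recv) ||
          expected == recv) {  // a racing thread installed the same receiver
        return i;
      }
    }
  }
  return -1;  // profile full, receiver not present: polymorphic
}
```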

2 slots is the most common case; any benefit from optimizing specifically for it (e.g., unrolling the loops)?

I don't think it is worth the extra complexity, honestly. The loop-y code in the current version is still a significant code density win over the (effectively unrolled) decision-tree approach we use currently. Keeping this thing simple means more reliability and less testing surface, plus much less headache when porting to other architectures.

Note that the goal for this work is to improve profiling reliability while, hopefully, not ceding too much ground in code density and performance. When I started out, it was not clear whether that was doable, given the need for atomics; but it looks doable indeed. So I think we should call this thing done and move on to solving the actual performance problem in this code: the contention on counter updates.

@iwanowww
Contributor

Thanks for the clarifications, Aleksey. I just wanted to get a sense of how much performance we are leaving on the table, and whether it is worth spending more time on it later.

Contributor

@iwanowww iwanowww left a comment


Looks good.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Dec 10, 2025
Contributor

@vnkozlov vnkozlov left a comment


Yes, we can look at this later if we need to optimize it more. Thankfully it is all in one place now.

I don't need to retest it, since you didn't change the code after v07 and only merged from mainline.

@mikabl-arm
Contributor

Hi @shipilev, are you aware of anyone working on or planning to implement the same for AArch64, by any chance?

@shipilev
Member Author

Hi @shipilev, are you aware of anyone working on or planning to implement the same for AArch64, by any chance?

I'll task one of our folks to do it after the NY break.

Speaking of, I will integrate this one after the NY break as well, to avoid dealing with any possible fallout during the holidays.

@shipilev
Member Author

shipilev commented Jan 5, 2026

Remerged from master, re-ran tier1 and hotspot_compiler tests on Linux x86_64, all clean. There is an unrelated GHA infra failure (#29030), which IMO does not block the integration, as at least Windows x86_64 passed in GHA, and Linux x86_64 passes locally.

@shipilev
Member Author

shipilev commented Jan 5, 2026

Here goes.

/integrate

@shipilev
Member Author

shipilev commented Jan 5, 2026

I'll task one of our folks to do it after the NY break.

That would be: https://bugs.openjdk.org/browse/JDK-8374513

@openjdk

openjdk bot commented Jan 5, 2026

Going to push as commit e676c9d.
Since your change was applied there has been 1 commit pushed to the master branch:

  • 1630382: 8373704: Improve "SocketException: Protocol family unavailable" message

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Jan 5, 2026
@openjdk openjdk bot closed this Jan 5, 2026
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Jan 5, 2026
@openjdk

openjdk bot commented Jan 5, 2026

@shipilev Pushed as commit e676c9d.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.


Labels

  • hotspot hotspot-dev@openjdk.org
  • hotspot-compiler hotspot-compiler-dev@openjdk.org
  • integrated Pull request has been integrated

Development

Successfully merging this pull request may close these issues.

7 participants