
Conversation

@shipilev
Member

@shipilev shipilev commented May 19, 2025

See the bug for a discussion of the issues the current machinery has.

This PR executes the plan outlined in the bug:

  1. Common up the receiver type profiling code between the interpreter and C1
  2. Rewrite the receiver type profiling code to only do atomic receiver slot installations
  3. Trim C1OptimizeVirtualCallProfiling to only claim slots when the receiver is installed

This PR does not make the counter updates themselves atomic, as that may have much wider performance implications, including regressions. This PR should be at least performance-neutral.
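Step 2 is the core of the change. As a rough illustration of what an atomic receiver slot installation means, here is a hypothetical C++ model (the names TypeProfile, record, and kWidth are invented for illustration; HotSpot's real ReceiverTypeData layout and generated code differ): per-slot receiver pointers are claimed with a CAS, while the counters themselves stay non-atomic, matching the scope above.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical model of a receiver type profile row. Names and layout are
// illustrative only; HotSpot's real ReceiverTypeData differs.
struct TypeProfile {
  static const int kWidth = 2;              // models TypeProfileWidth
  std::atomic<uintptr_t> receiver[kWidth];  // receiver klass "pointers"; 0 == empty
  uint64_t count[kWidth];                   // counters stay non-atomic (see above)

  TypeProfile() {
    for (int i = 0; i < kWidth; i++) {
      receiver[i].store(0);
      count[i] = 0;
    }
  }

  void record(uintptr_t recv) {
    for (int i = 0; i < kWidth; i++) {
      uintptr_t cur = receiver[i].load(std::memory_order_relaxed);
      if (cur == recv) {                    // receiver already installed
        count[i]++;
        return;
      }
      if (cur == 0) {
        uintptr_t expected = 0;
        // Atomic installation: exactly one thread claims the empty slot;
        // a racing thread installing the same receiver is also fine.
        if (receiver[i].compare_exchange_strong(expected, recv) ||
            expected == recv) {
          count[i]++;
          return;
        }
        // Lost the race to a different receiver; keep scanning.
      }
    }
    // Profile is full and the receiver is not in it: polymorphic call site.
  }
};
```

Without the CAS, two threads can both observe an empty slot and overwrite each other's installation, which is the kind of reliability issue this PR addresses.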

Additional testing:

  • Linux x86_64 server fastdebug, compiler/
  • Linux x86_64 server fastdebug, all

Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8357258: x86: Improve receiver type profiling reliability (Enhancement - P4)

Reviewers

Reviewing

Using git

Check out this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/25305/head:pull/25305
$ git checkout pull/25305

Update a local copy of the PR:
$ git checkout pull/25305
$ git pull https://git.openjdk.org/jdk.git pull/25305/head

Using Skara CLI tools

Check out this PR locally:
$ git pr checkout 25305

View PR using the GUI difftool:
$ git pr show -t 25305

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/25305.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented May 19, 2025

👋 Welcome back shade! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented May 19, 2025

@shipilev This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8357258: x86: Improve receiver type profiling reliability

Reviewed-by: kvn, vlivanov

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been no new commits pushed to the master branch. If another commit should be pushed before you perform the /integrate command, your PR will be automatically rebased. If you prefer to avoid any potential automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk

openjdk bot commented May 19, 2025

@shipilev The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label May 19, 2025
@shipilev shipilev changed the title 8357258: x86: Improve C1 known virtual calls profiling reliability 8357258: x86: Improve receiver type profiling reliability May 20, 2025
@shipilev shipilev force-pushed the JDK-8357258-x86-c1-optimize-virt-calls branch from 3a49d15 to 798b4c3 Compare May 20, 2025 12:24
@shipilev shipilev force-pushed the JDK-8357258-x86-c1-optimize-virt-calls branch from d2733a3 to d5b7991 Compare July 9, 2025 10:54
@bridgekeeper

bridgekeeper bot commented Jul 14, 2025

@shipilev This pull request has been inactive for more than 8 weeks and will be automatically closed if another 8 weeks passes without any activity. To avoid this, simply issue a /touch or /keepalive command to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@openjdk

openjdk bot commented Aug 6, 2025

@shipilev this pull request cannot be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout JDK-8357258-x86-c1-optimize-virt-calls
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

@openjdk openjdk bot added the merge-conflict Pull request has merge conflict with target branch label Aug 6, 2025
@shipilev shipilev force-pushed the JDK-8357258-x86-c1-optimize-virt-calls branch from d5b7991 to 3b601fc Compare September 5, 2025 06:28
@openjdk openjdk bot removed the merge-conflict Pull request has merge conflict with target branch label Sep 5, 2025
@shipilev
Member Author

shipilev commented Sep 5, 2025

In addition to the reliability improvements, using a denser loop significantly improves tier3 code density. With a larger TypeProfileWidth, type profile checks are a significant part of the generated code. This density improvement allows us to do the CAS without increasing the code size. It also allows us to store (more) tier3 code in the AOTCache going forward. If/when folks (looking at @theRealAph, really) start doing probabilistic profiling counters, this budget increase would also help to cram in more code.

$ for I in 1 2 3 4; do build/linux-x86_64-server-release/images/jdk/bin/java -XX:TieredStopAtLevel=${I} \
  -Xcomp -XX:+CITime -Xmx2g Hello.java 2>&1 | grep "Tier${I}" | cut -d' ' -f 3,23-; done

=== -XX:TypeProfileWidth=2 (default)

# Baseline
Tier1 nmethods_code_size:  7091616 bytes
Tier2 nmethods_code_size:  7579424 bytes
Tier3 nmethods_code_size: 17494984 bytes
Tier4 nmethods_code_size:  6058128 bytes

# Patched
Tier1 nmethods_code_size:  7091648 bytes
Tier2 nmethods_code_size:  7581808 bytes
Tier3 nmethods_code_size: 16806440 bytes (-4%)
Tier4 nmethods_code_size:  6057920 bytes

=== -XX:TypeProfileWidth=8 (default with +UseJVMCICompiler)

# Baseline
Tier1 nmethods_code_size:  7091672 bytes
Tier2 nmethods_code_size:  7580576 bytes
Tier3 nmethods_code_size: 28096448 bytes
Tier4 nmethods_code_size:  6061280 bytes

# Patched
Tier1 nmethods_code_size:  7090760 bytes
Tier2 nmethods_code_size:  7579432 bytes
Tier3 nmethods_code_size: 16837688 bytes (-40% !!!)
Tier4 nmethods_code_size:  6058104 bytes

@shipilev shipilev marked this pull request as ready for review September 5, 2025 11:40
@openjdk openjdk bot added the rfr Pull request is ready for review label Sep 5, 2025
@mlbridge

mlbridge bot commented Sep 5, 2025

Webrevs

@shipilev
Member Author

Looking for reviews! @dean-long, @vnkozlov, @veresov -- you would probably be interested in this.

@bridgekeeper

bridgekeeper bot commented Oct 22, 2025

@shipilev This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply issue a /touch or /keepalive command to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@shipilev
Member Author

/touch

Not now, bot, still looking for reviewers.

@openjdk

openjdk bot commented Oct 23, 2025

@shipilev The pull request is being re-evaluated and the inactivity timeout has been reset.

@bridgekeeper

bridgekeeper bot commented Nov 20, 2025

@shipilev This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply issue a /touch or /keepalive command to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@shipilev
Member Author

Shoo bots, still looking for reviewers.

@shipilev
Member Author

shipilev commented Dec 1, 2025

Oh, all right! This made me realize we actually have a secondary "fast" case: the receiver is not found, but the profile is full. This is pretty frequent with TypeProfileWidth=2. In that case, we were doing way too much work, anticipating a receiver slot installation that would never actually come. Specializing for that case costs significantly fewer loads and makes the code pipeline much better; I suspect that is because tight loops that do not have CASes in them are uop-cached more readily.
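A hypothetical sketch of that specialization (illustrative C++, not the generated assembly; lookup_fast is an invented name). Assuming slots are claimed in order, a non-null last slot means the profile can never gain new receivers, so the search degenerates to a tight compare-only loop with no null checks and no CAS setup:

```cpp
#include <cstdint>

// Hypothetical sketch of the "receiver not found, profile full" fast case.
// Assumes slots are claimed in order, so a non-null last slot means "full".
// Returns the matching slot index, -1 for "full and not found" (polymorphic),
// or -2 for "not full" (fall through to the slower install path).
int lookup_fast(const uintptr_t* slots, int width, uintptr_t recv) {
  if (slots[width - 1] != 0) {
    // Full profile: tight compare-only loop, no null checks, no CAS setup.
    for (int i = 0; i < width; i++) {
      if (slots[i] == recv) return i;
    }
    return -1;
  }
  return -2;
}
```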

We now lose "only" 0.5ns in this test:

Benchmark                                                (randomized)  Mode  Cnt    Score   Error      Units

# Baseline
InterfaceCalls.test2ndInt5Types                                 false  avgt   12   16.945 ±  0.079      ns/op
InterfaceCalls.test2ndInt5Types:L1-dcache-load-misses           false  avgt    3    0.076 ±  2.187       #/op
InterfaceCalls.test2ndInt5Types:L1-dcache-loads                 false  avgt    3   88.738 ±  0.416       #/op
InterfaceCalls.test2ndInt5Types:branch-misses                   false  avgt    3    0.007 ±  0.003       #/op
InterfaceCalls.test2ndInt5Types:branches                        false  avgt    3   49.122 ±  0.353       #/op
InterfaceCalls.test2ndInt5Types:cycles                          false  avgt    3   57.147 ±  1.698       #/op
InterfaceCalls.test2ndInt5Types:instructions                    false  avgt    3  247.443 ±  1.531       #/op

# Old PR version
InterfaceCalls.test2ndInt5Types                                 false  avgt   12   22.513 ±  0.208      ns/op
InterfaceCalls.test2ndInt5Types:L1-dcache-load-misses           false  avgt    3    0.012 ±  0.072       #/op
InterfaceCalls.test2ndInt5Types:L1-dcache-loads                 false  avgt    3  108.446 ± 13.975       #/op  ; +20 loads
InterfaceCalls.test2ndInt5Types:branch-misses                   false  avgt    3    0.407 ±  0.010       #/op
InterfaceCalls.test2ndInt5Types:branches                        false  avgt    3   54.102 ±  0.403       #/op  ; +5 branches
InterfaceCalls.test2ndInt5Types:cycles                          false  avgt    3   75.938 ±  5.043       #/op  ; +19 cycles
InterfaceCalls.test2ndInt5Types:instructions                    false  avgt    3  280.194 ±  5.758       #/op  ; +32 instructions

# New PR version
InterfaceCalls.test2ndInt5Types                                 false  avgt   12   17.441 ±  0.287      ns/op
InterfaceCalls.test2ndInt5Types:L1-dcache-load-misses           false  avgt    3    0.009 ±  0.072       #/op
InterfaceCalls.test2ndInt5Types:L1-dcache-loads                 false  avgt    3   88.803 ±  1.401       #/op
InterfaceCalls.test2ndInt5Types:branch-misses                   false  avgt    3    0.009 ±  0.062       #/op
InterfaceCalls.test2ndInt5Types:branches                        false  avgt    3   52.945 ±  0.752       #/op  ; +4 branches
InterfaceCalls.test2ndInt5Types:cycles                          false  avgt    3   58.866 ± 15.379       #/op  ; +2 cycles
InterfaceCalls.test2ndInt5Types:instructions                    false  avgt    3  272.838 ±  1.665       #/op  ; +28 instructions

The code is in the new commits; it passes hotspot:tier1, and I am running more tests now.

@vnkozlov
Contributor

vnkozlov commented Dec 3, 2025

This looks good. Thank you for cleaning up the code and for the detailed comments.
I submitted our testing.

@iwanowww
Contributor

iwanowww commented Dec 4, 2025

Overall, looks good to me. Nice work, Aleksey!

I'm curious how performance-sensitive that part of the code is. Does it make sense to try to optimize it further?

For example:

  • 2 slots is the most common case; any benefit from optimizing specifically for it (e.g., unrolling the loops)?
  • the fast path can be further optimized for the no-nulls case by offloading more work onto the found_null slow path [1]

[1]

    // Fastest: receiver is already installed
    int i = 0;
    for (; i < receiver_count(); i++) {
      if (receiver(i) == recv) goto found_recv(i);
      if (receiver(i) == null) goto found_null(i);
    }
  
    goto polymorphic
  
    // Slow: try to install receiver
  found_null(i):
    // Finish the search
    for (int j = i ; j < receiver_count(); j++) {
      if (receiver(j) == recv) goto found_recv(j);
    }
    CAS(&receiver(i), null, recv);
    goto restart
...

@vnkozlov
Contributor

vnkozlov commented Dec 4, 2025

2 slots is the most common case; any benefit from optimizing specifically for it (e.g., unrolling the loops)?

Yes, since row_limit() is statically known and does not change, we can have two versions of the code based on its value:

  • <= 2 slots: fully unrolled (many fewer instructions)
  • > 2 slots: the currently proposed code

@vnkozlov
Contributor

vnkozlov commented Dec 4, 2025

the fast path can be further optimized for the no-nulls case by offloading more work onto the found_null slow path [1]

I don't think we need to optimize the > 2 slots case. Such a setting is not the current default. Also, based on @shipilev's comments, two separate loops are more or less optimal.

@vnkozlov
Contributor

vnkozlov commented Dec 5, 2025

My testing of version 07 passed clean.

@shipilev
Member Author

shipilev commented Dec 10, 2025

I'm curious how performance-sensitive that part of the code is. Does it make sense to try to optimize it further?

This is about the fifth version of this code, so I don't think there is much more juice to squeeze out of it. The current version is more or less optimal; the stratification into three cases looks the best-performing overall.

the fast path can be further optimized for the no-nulls case by offloading more work onto the found_null slow path [1]

Yeah, but putting checks for both the installed receiver and the nullptr slot into one loop turns out to hurt performance; this is bad even without the extra control flow. Two separate loops are more efficient, even for a small number of iterations. This also helpfully optimizes for the best case, when the profile is smaller than TypeProfileWidth, which is what we want.
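The two-separate-loops shape could be sketched as follows (hypothetical C++; profile_lookup and its slot layout are invented for illustration, not the actual generated code): the first loop only compares against installed receivers, and only on a miss does the second loop look for an empty slot to CAS into.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical sketch of the two-separate-loops structure described above.
// Returns the slot index matched or claimed, or -1 for the polymorphic case.
int profile_lookup(std::atomic<uintptr_t>* slots, int width, uintptr_t recv) {
  // Loop 1: fast path, receiver already installed somewhere.
  for (int i = 0; i < width; i++) {
    if (slots[i].load(std::memory_order_relaxed) == recv) {
      return i;
    }
  }
  // Loop 2: slow path, try to claim the first empty slot.
  for (int i = 0; i < width; i++) {
    if (slots[i].load(std::memory_order_relaxed) == 0) {
      uintptr_t expected = 0;
      if (slots[i].compare_exchange_strong(expected, recv) ||
          expected == recv) {  // a racing thread installed the same receiver
        return i;
      }
    }
  }
  return -1;  // profile full, receiver not present: polymorphic
}
```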

2 slots is the most common case; any benefit from optimizing specifically for it (e.g., unrolling the loops)?

I don't think it is worth the extra complexity, honestly. The loop-y code in the current version is still a significant code density win over the (effectively unrolled) decision-tree approach we use currently. Keeping this thing simple means more reliability and less testing surface, plus much less headache when porting to other architectures.

Note that the goal for this work is to improve profiling reliability while, hopefully, not ceding too much ground in code density and performance. When I started out, it was not clear whether that was doable, given the need for atomics; but it looks doable indeed. So I think we should call this thing done and move on to solving the actual performance problem in this code: the contention on counter updates.

@iwanowww
Contributor

Thanks for the clarifications, Aleksey. I just wanted to get a sense of how much performance we are leaving on the table, and whether it is worth spending more time on it later.

Contributor

@iwanowww iwanowww left a comment


Looks good.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Dec 10, 2025
Contributor

@vnkozlov vnkozlov left a comment


Yes, we can look at this later if we need to optimize it more. Thankfully it is all in one place now.

I don't need to retest it, since you didn't change the code after v07 and only merged from mainline.

@mikabl-arm
Contributor

Hi @shipilev, are you aware of anyone working on or planning to implement the same for AArch64, by any chance?

@shipilev
Member Author

Hi @shipilev, are you aware of anyone working on or planning to implement the same for AArch64, by any chance?

I'll task one of our folks to do it after the NY break.

Speaking of, I will integrate this one after the NY break as well, to avoid dealing with any possible fallout during the holidays.

@shipilev
Member Author

shipilev commented Jan 5, 2026

Remerged from master, re-ran tier1 and hotspot_compiler tests on Linux x86_64, all clean. There is an unrelated GHA infra failure (#29030), which IMO does not block the integration, as at least Windows x86_64 passed in GHA, and Linux x86_64 passes locally.

@shipilev
Member Author

shipilev commented Jan 5, 2026

Here goes.

/integrate

@shipilev
Member Author

shipilev commented Jan 5, 2026

I'll task one of our folks to do it after the NY break.

That would be: https://bugs.openjdk.org/browse/JDK-8374513

@openjdk

openjdk bot commented Jan 5, 2026

Going to push as commit e676c9d.
Since your change was applied there has been 1 commit pushed to the master branch:

  • 1630382: 8373704: Improve "SocketException: Protocol family unavailable" message

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Jan 5, 2026
@openjdk openjdk bot closed this Jan 5, 2026
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Jan 5, 2026
@openjdk

openjdk bot commented Jan 5, 2026

@shipilev Pushed as commit e676c9d.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.


Labels

  • hotspot hotspot-dev@openjdk.org
  • hotspot-compiler hotspot-compiler-dev@openjdk.org
  • integrated Pull request has been integrated

Development

Successfully merging this pull request may close these issues.

7 participants