Add OOM pause mechanism to pause query threads before killing on critical heap by yashmayya · Pull Request #18163 · apache/pinot

yashmayya · 2026-04-10T18:53:01Z

Summary

When heap transitions from below-critical to critical, pauses all query execution threads for a configurable timeout and calls System.gc() (only once, on first transition to critical), giving the JVM breathing room to reclaim memory before resorting to query kills.
If heap recovers below critical during the pause window, threads resume with no query killed.
If heap stays critical after timeout, the existing kill-most-expensive-query logic proceeds.
Off by default, dynamically toggleable via cluster config (no restart required).

Overhead

Happy path (no OOM pressure): waitIfPaused() is called at every sampling checkpoint and costs a single volatile read of _pauseActive. No lock, no contention.
Triggering path (critical heap): Query threads block on a ReentrantLock + Condition. Lock contention is not a concern because:
- Threads acquire the lock only to enter Condition.await(), releasing it immediately - hold time is near-zero.
- activatePause() (watcher thread) is lock-free - it writes two volatiles with correct ordering.
- This only activates when the JVM is already under critical heap pressure and queries are about to be killed anyway.

Configuration

┌────────────────────────────────────────────────┬─────────┬─────────┬─────────────────────────────────────────────────────────┐
│                   Config key                   │  Type   │ Default │                       Description                       │
├────────────────────────────────────────────────┼─────────┼─────────┼─────────────────────────────────────────────────────────┤
│ accounting.oom.critical.query.pause.enabled    │ boolean │ false   │ Enable pausing query threads before killing on critical │
│                                                │         │         │  heap                                                   │
├────────────────────────────────────────────────┼─────────┼─────────┼─────────────────────────────────────────────────────────┤
│ accounting.oom.critical.query.pause.timeout.ms │ long    │ 1000    │ How long (ms) to pause before proceeding with kill      │
└────────────────────────────────────────────────┴─────────┴─────────┴─────────────────────────────────────────────────────────┘

Both are dynamically updatable via Helix cluster config.

State transitions

Below critical → Critical (enabled): Activate pause, call System.gc()
Critical → Critical (within timeout): No-op, let GC work
Critical → Critical (timeout expired): Clear pause, kill most expensive query
Critical → Below critical (pause active): Clear pause, resume threads, no kill
Any → Panic: Kill all queries first, then clear pause
Below critical → Critical (disabled): Existing behavior (immediate kill of most expensive query)

codecov-commenter · 2026-04-10T19:42:06Z

Codecov Report

❌ Patch coverage is 69.04762% with 26 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.36%. Comparing base (3cc1902) to head (f902705).
⚠️ Report is 3 commits behind head on master.

Files with missing lines	Patch %	Lines
...pinot/core/accounting/QueryResourceAggregator.java	74.00%	9 Missing and 4 partials ⚠️
...org/apache/pinot/spi/query/QueryThreadContext.java	58.33%	5 Missing ⚠️
...ache/pinot/core/accounting/QueryMonitorConfig.java	76.47%	2 Missing and 2 partials ⚠️
...ore/accounting/ResourceUsageAccountantFactory.java	0.00%	4 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #18163      +/-   ##
============================================
+ Coverage     63.32%   63.36%   +0.03%     
  Complexity     1627     1627              
============================================
  Files          3238     3238              
  Lines        197011   197089      +78     
  Branches      30466    30481      +15     
============================================
+ Hits         124762   124877     +115     
+ Misses        62244    62193      -51     
- Partials      10005    10019      +14

Flag	Coverage Δ
custom-integration1	`100.00% <ø> (ø)`
integration	`100.00% <ø> (ø)`
integration1	`100.00% <ø> (ø)`
integration2	`0.00% <ø> (ø)`
java-11	`63.34% <69.04%> (+0.03%)`	⬆️
java-21	`63.33% <69.04%> (+8.07%)`	⬆️
temurin	`63.36% <69.04%> (+0.03%)`	⬆️
unittests	`63.35% <69.04%> (+0.03%)`	⬆️
unittests1	`55.33% <69.04%> (+0.04%)`	⬆️
unittests2	`34.94% <5.95%> (-0.01%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

xiangfu0

High-signal issue inline.

Jackie-Jiang · 2026-04-13T00:18:00Z

        threadTracker.updateMemorySnapshot();
      }
+      // Block if an OOM pause is active (fast-path: single volatile read)
+      _queryResourceAggregator.waitIfPaused();


IMO this shouldn't be part of sample usage, but termination check: pause first, then check termination. Make sure termination is set before releasing pause

That will require some SPI changes, how about we swap order of sampling usage and termination checks instead (i.e., sample first - pause if needed - check termination)?

SPI change is fine. I'm thinking from an abstraction level, where there action belongs to the termination check, not usage sampling. E.g. the engine might only check termination without doing sampling

Jackie-Jiang · 2026-04-13T00:18:30Z

+        if (remaining <= 0) {
+          break;
+        }
+        _pauseCondition.await(remaining, TimeUnit.MILLISECONDS);


Do we need timeout here, or only rely on condition variable?

We could, but that relies solely on the watcher task thread to wake up query execution threads - this wait with timeout is like a double safeguard to ensure that query threads don't ever end up waiting indefinitely and appear stuck. Any particular issues with this wait with timeout approach?

It is a little bit risky because thread might wake up before watcher thread tries to terminate the query, causing higher memory usage

Jackie-Jiang · 2026-04-13T00:25:27Z

+    public static final String CONFIG_OF_OOM_PAUSE_ENABLED = "accounting.oom.critical.query.pause.enabled";
+    public static final boolean DEFAULT_OOM_PAUSE_ENABLED = false;
+
+    public static final String CONFIG_OF_OOM_PAUSE_TIMEOUT_MS = "accounting.oom.critical.query.pause.timeout.ms";


We can merge them into one config, where non-positive value (e.g. default -1) means no pause. Something like accounting.oom.query.kill.grace.period.ms

Should we also allow pausing when memory usage directly jumping into panic level? Or consider using a flag to control it.

Yeah I can merge the configs into one, but I'm not sure query.kill.grace.period is the right terminology? It's not exactly a "grace period" we're offering the threads - which sounds like we're waiting for them to complete execution gracefully - we're actually pausing execution instead.

As for the point on panic, IMO that would significantly muddy the panic semantics. I'd argue panic should stay as immediate kill-all with no pause since the JVM is almost out of memory and any delay could tip into OOM.

How about accounting.oom.pre.query.kill.pause.duration.ms? I feel pause.timeout is a little bit ambiguous.

I remember seeing memory usage directly jumping to panic sometimes. I agree it feels dangerous to not immediately kill on panic, but maybe it is good to at least have that option so we can try it out

accounting.oom.pre.query.kill.pause.duration.ms sounds good. I've also added a new boolean config accounting.oom.panic.pre.query.kill.pause.enabled (defaulting to false) which can control the pause on direct transition to panic (bypassing critical).

…ical heap

…o termination check

yashmayya · 2026-04-15T22:55:16Z

+        if (System.currentTimeMillis() >= _pauseDeadlineMs || !pauseOnPanic) {
+          // Grace period expired, or panic-pause disabled — kill all and clear
+          killAllQueries(threadTrackers);


Hm actually now that I think about it, we probably don't want to wait for grace period to expire on transition from critical to panic, even if the pause on panic config is enabled..

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Jackie-Jiang · 2026-04-15T23:24:37Z

+    public static final long DEFAULT_OOM_PRE_QUERY_KILL_PAUSE_DURATION_MS = -1;
+
+    public static final String CONFIG_OF_OOM_PANIC_PRE_QUERY_KILL_PAUSE_ENABLED =
+        "accounting.oom.panic.pre.query.kill.pause.enabled";


(optional) I feel it might be clearer:

Suggested change

"accounting.oom.panic.pre.query.kill.pause.enabled";

"accounting.oom.panic.allow.pre.query.kill.pause";

Jackie-Jiang · 2026-04-15T23:36:12Z

    // NOTE: In production code, threadContext should never be null. It might be null in tests when QueryThreadContext
    //       is not set up.
    if (threadContext != null) {
+      threadContext.waitIfPausedInternal();


Discussed offline. Let's still check termination first, then when entering the pause, check termination again before exiting the pause. Pause should be done after the sampling

Jackie-Jiang · 2026-04-15T23:39:31Z

+  /**
+   * Clears the OOM pause and wakes all blocked query threads.
+   */
+  void clearPause() {


(minor) Is this exposed for testing purpose? Either annotate it or make it private. Same for other places

Jackie-Jiang · 2026-04-15T23:48:43Z

+  private final ReentrantLock _pauseLock = new ReentrantLock();
+  private final Condition _pauseCondition = _pauseLock.newCondition();
+  private volatile boolean _pauseActive = false;
+  private volatile long _pauseDeadlineMs = 0;


(nit) Doesn't need to be volatile as it is maintained as a local state for the watcher.
Consider adding some comments describing the contract for them:

_pauseActive is shared by both watcher and query threads
_pauseDeadlineMs is local state for the watcher (similar to _triggeringLevel)

Jackie-Jiang · 2026-04-15T23:52:50Z

  }

  private void evalTriggers() {
    TriggeringLevel previousTriggeringLevel = _triggeringLevel;


(minor) This local variable is no longer needed

yashmayya requested a review from Jackie-Jiang April 10, 2026 18:53

yashmayya added feature New functionality oom-protection Related to out-of-memory protection mechanisms labels Apr 10, 2026

xiangfu0 reviewed Apr 11, 2026

View reviewed changes

Comment thread pinot-core/src/main/java/org/apache/pinot/core/accounting/QueryResourceAggregator.java Outdated

Jackie-Jiang reviewed Apr 13, 2026

View reviewed changes

yashmayya added 3 commits April 15, 2026 14:07

Add OOM pause mechanism to pause query threads before killing on crit…

f6ccfeb

…ical heap

Address review comments

5fe7c1a

Add pause on direct transition to panic; rename configs; move pause t…

677e123

…o termination check

yashmayya force-pushed the oom-pause-queries branch from ceb1cb6 to 677e123 Compare April 15, 2026 22:53

yashmayya commented Apr 15, 2026

View reviewed changes

Rename fields, getters, and log lines to match config constant names

f902705

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Jackie-Jiang approved these changes Apr 16, 2026

View reviewed changes

	"accounting.oom.panic.pre.query.kill.pause.enabled";
	"accounting.oom.panic.allow.pre.query.kill.pause";

Conversation

yashmayya commented Apr 10, 2026

Summary

Overhead

Configuration

State transitions

Uh oh!

codecov-commenter commented Apr 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

xiangfu0 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

codecov-commenter commented Apr 10, 2026 •

edited

Loading