fix: shutdown hook deadlock under leader election and deprecate Operator#installShutdownHook(Duration) #3383

Open
Dennis-Mircea wants to merge 4 commits into operator-framework:main from Dennis-Mircea:fix/installshutdownhook-leader-election-and-deprecation

@Dennis-Mircea
Contributor

Summary

Resolves all three concerns raised in #3376:

  1. The JVM shutdown hook installed via Operator#installShutdownHook(...) was previously skipped whenever leader election was enabled (introduced in fix: leader election stop deadlock #1618 to work around the deadlock reported in LeaderElectionManager#stopLeading maybe deadlock with Operator#installShutdownHook #1614). As a result, a leader pod receiving SIGTERM did not release its lease, forcing standby replicas to wait for lease expiry before they could take over.
  2. The gracefulShutdownTimeout argument on Operator#installShutdownHook(Duration) has been ignored since PR feat: support for graceful shutdown based on configuration #2479. The reconciliation termination timeout is now read from ConfigurationService#reconciliationTerminationTimeout().
  3. The actual rationale for the leader-election skip and the dead Duration parameter was not documented anywhere; neither were the Operator#stop() shutdown sequence and the LeaderElectionManager lifecycle.

What changed

Fix the underlying deadlock (LeaderElectionManager)

The deadlock from #1614 was very specific: Operator#stop() called from inside a JVM shutdown hook would cancel the leader-election future, which fired the onStopLeading callback, which called System.exit(1). That System.exit then blocked indefinitely on the java.lang.Shutdown class lock that the shutdown hook thread itself was already holding.

This PR breaks the recursion. LeaderElectionManager now carries a stoppingGracefully AtomicBoolean, set in stop() before the future is cancelled and checked at the top of stopLeading(). When the flag is set, stopLeading() returns immediately instead of invoking System.exit. The "restart on lost lead" behavior (when the leader-election library detects a real lost lease without a prior stop()) is preserved because that path runs without the flag ever being set.
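
The guard described above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the SDK's actual code: `GracefulStopSketch` is a hypothetical stand-in for LeaderElectionManager, the exit action is injected as a `Runnable` (where the real class would call System.exit(1)) so the example can run without terminating the JVM, and the direct call from `stop()` to `stopLeading()` stands in for the callback that cancelling the leader-election future would trigger.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for LeaderElectionManager; names mirror the PR text only loosely.
class GracefulStopSketch {
  private final AtomicBoolean stoppingGracefully = new AtomicBoolean(false);
  // Assumption: the real class would call System.exit(1) here; injected so the sketch is testable.
  private final Runnable exitAction;

  GracefulStopSketch(Runnable exitAction) {
    this.exitAction = exitAction;
  }

  // Sets the flag BEFORE triggering the callback, as the PR describes for stop().
  void stop() {
    stoppingGracefully.set(true);
    stopLeading(); // stands in for the callback fired by cancelling the future
  }

  // Plays the role of the onStopLeading callback.
  void stopLeading() {
    if (stoppingGracefully.get()) {
      return; // graceful shutdown in progress: skip the exit
    }
    exitAction.run(); // lease lost without a prior stop(): exit so the process restarts
  }

  // Runs both paths and returns how often the exit action fired on each.
  static int[] demo() {
    int[] exits = new int[2];
    GracefulStopSketch graceful = new GracefulStopSketch(() -> exits[0]++);
    graceful.stop(); // graceful path
    GracefulStopSketch lostLease = new GracefulStopSketch(() -> exits[1]++);
    lostLease.stopLeading(); // lost-lease path, no prior stop()
    return exits;
  }

  public static void main(String[] args) {
    int[] exits = demo();
    System.out.println("graceful exits=" + exits[0] + ", lost-lease exits=" + exits[1]);
  }
}
```

The ordering matters in this sketch: if the flag were set after the callback is triggered, the callback could still observe `false` and run the exit action.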

Re-enable the shutdown hook unconditionally (Operator)

With the deadlock fixed at its source, the conditional in Operator#installShutdownHook() is no longer required. The hook is now registered regardless of leader-election state, and the "Leader election is on, shutdown hook will not be installed." warn log line is removed. A leader pod receiving SIGTERM will run the hook, which calls Operator#stop(), which now cleanly cancels the leader-election future and releases the lease.
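
For readers unfamiliar with the mechanism, unconditional hook registration can be sketched with the standard `Runtime#addShutdownHook` API. The `Stoppable` interface and the class name below are hypothetical stand-ins for Operator; the actual method body in the PR may differ.

```java
// Minimal sketch of unconditional shutdown-hook registration; not the SDK's actual code.
class ShutdownHookSketch {

  // Hypothetical stand-in for Operator; only stop() matters here.
  interface Stoppable {
    void stop();
  }

  // Registers the hook with no leader-election check and returns it for inspection.
  static Thread installShutdownHook(Stoppable operator) {
    Thread hook = new Thread(operator::stop, "operator-shutdown-hook");
    Runtime.getRuntime().addShutdownHook(hook);
    return hook;
  }

  public static void main(String[] args) {
    installShutdownHook(() -> System.out.println("stop() called from shutdown hook"));
    System.out.println("main finished, JVM shutting down");
  }
}
```

When the JVM shuts down normally or receives SIGTERM, it starts the registered thread, which in this sketch simply invokes `stop()`.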

Deprecate installShutdownHook(Duration) (Operator)

The Duration argument has been dead since #2479. This PR adds a new no-arg installShutdownHook() overload (the recommended replacement) whose JavaDoc points users at ConfigurationServiceOverrider#withReconciliationTerminationTimeout(Duration) as the real configuration knob. The existing installShutdownHook(Duration) is marked @Deprecated(forRemoval = true) and delegates to the no-arg overload. The Duration parameter is kept only for source and binary compatibility.
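
The deprecate-and-delegate pattern described here looks roughly like the sketch below. `DeprecationSketch` and its counter field are hypothetical and exist only to make the delegation observable; the method names follow the PR text.

```java
import java.time.Duration;

// Hypothetical stand-in for Operator, showing only the overload relationship.
class DeprecationSketch {
  int hooksInstalled = 0; // illustration only: counts registrations

  // Recommended no-arg overload.
  void installShutdownHook() {
    hooksInstalled++;
  }

  // Kept for source and binary compatibility; the argument is intentionally ignored.
  @Deprecated(forRemoval = true)
  void installShutdownHook(Duration gracefulShutdownTimeout) {
    installShutdownHook();
  }

  public static void main(String[] args) {
    DeprecationSketch operator = new DeprecationSketch();
    operator.installShutdownHook(Duration.ofSeconds(30)); // value has no effect
    System.out.println("hooks installed: " + operator.hooksInstalled);
  }
}
```

Callers outside the declaring class get a compile-time warning for the deprecated overload, while existing code keeps compiling and behaves exactly like the no-arg call.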

Documentation additions

  • Class-level JavaDoc on LeaderElectionManager explaining its role, configuration entry points, the Lease-based coordination, the three-way behavior of stopLeading(), and the lifecycle ownership by Operator.
  • JavaDoc on Operator#stop() documenting the four-step shutdown sequence (controller manager, executor service manager with timeout, leader-election manager, optional client close), an explicit "safe to call from a JVM shutdown hook" guarantee, the not-started edge-case behavior, and the closeClientOnStop() opt-out (default true).
  • JavaDoc on Operator#installShutdownHook() explaining the timeout configuration knob and adding a terminationGracePeriodSeconds note that recommends sizing the pod's grace period to fit the configured reconciliationTerminationTimeout plus a small buffer.
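
The grace-period sizing recommendation amounts to simple arithmetic, sketched below. The helper name, the 10-second timeout, and the 5-second buffer are illustrative assumptions, not values taken from the SDK or this PR.

```java
import java.time.Duration;

// Hypothetical helper: derives a terminationGracePeriodSeconds value from the configured timeout.
class GracePeriodSketch {

  // Returns timeout + buffer, rounded up to whole seconds.
  static long gracePeriodSeconds(Duration reconciliationTerminationTimeout, Duration buffer) {
    long millis = reconciliationTerminationTimeout.plus(buffer).toMillis();
    return (millis + 999) / 1000; // ceiling division for non-negative durations
  }

  public static void main(String[] args) {
    // Assumed example values, not SDK defaults.
    long seconds = gracePeriodSeconds(Duration.ofSeconds(10), Duration.ofSeconds(5));
    System.out.println("terminationGracePeriodSeconds: " + seconds);
  }
}
```

If the pod's grace period were shorter than the configured timeout, Kubernetes could send SIGKILL before in-flight reconciliations finish, which is the situation the JavaDoc note is meant to help avoid.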

Regression test

LeaderElectionManagerTest#stopLeadingDoesNotInvokeSystemExitWhenStopWasCalledFirst calls stop() and then stopLeading() directly. If the graceful-shutdown short-circuit is ever removed and stopLeading() calls System.exit(1) on this path again, the test method would terminate the JUnit JVM rather than failing cleanly, making the regression impossible to miss in CI. The visibility of LeaderElectionManager#stopLeading was widened from private to protected to enable the test (and to allow subclasses to extend the behavior).

Behavior summary by scenario

| Scenario | Before this PR | After this PR |
| --- | --- | --- |
| No leader election, SIGTERM | Hook installed, stop() runs, graceful shutdown | Same |
| Leader election, leader receives SIGTERM | Hook not installed, JVM exits, lease leaks | Hook installed, stop() runs, lease released cleanly |
| Leader election, non-leader receives SIGTERM | Hook not installed | Hook installed, stop() is effectively a no-op for leader-election state (no lease to release) |
| Leader loses lead (onStopLeading fires without prior stop()) | System.exit(1) triggered, JVM restarts | Same |
| installShutdownHook(Duration) called by user | Duration silently ignored | Duration silently ignored, plus deprecation warning at compile time |

Test plan

  • mvn -pl operator-framework-core test passes, including the new stopLeadingDoesNotInvokeSystemExitWhenStopWasCalledFirst regression.
  • Spotless / Google Java Format clean.
  • Compiler emits a deprecation warning at every existing call site of installShutdownHook(Duration).

…tor#installShutdownHook(Duration)

Signed-off-by: Dennis-Mircea Ciupitu <dennis.mircea.ciupitu@gmail.com>
Copilot AI review requested due to automatic review settings May 27, 2026 14:49
@openshift-ci openshift-ci Bot requested review from csviri and metacosm May 27, 2026 14:49
Contributor

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Improves operator shutdown behavior in leader-election scenarios by always installing a JVM shutdown hook and preventing System.exit(1) from being invoked during graceful shutdown, with a regression test covering the deadlock scenario.

Changes:

  • Always register a JVM shutdown hook in Operator, and deprecate the legacy overload that accepted a (now ignored) timeout.
  • Add a graceful-shutdown guard in LeaderElectionManager.stopLeading() to skip System.exit(1) when shutdown is already in progress.
  • Add a regression test ensuring stopLeading() doesn’t terminate the JVM when stop() was called first.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| operator-framework-core/src/test/java/io/javaoperatorsdk/operator/LeaderElectionManagerTest.java | Adds a regression test for shutdown-hook/leader-election interaction to avoid JVM termination. |
| operator-framework-core/src/main/java/io/javaoperatorsdk/operator/Operator.java | Changes shutdown hook installation behavior and improves shutdown documentation; deprecates old overload. |
| operator-framework-core/src/main/java/io/javaoperatorsdk/operator/LeaderElectionManager.java | Adds graceful-shutdown coordination to avoid System.exit during JVM shutdown; minor permission-check change. |

Comment on lines +112 to +122
  @Test
  void stopLeadingDoesNotInvokeSystemExitWhenStopWasCalledFirst() {
    // When stop() is called before the onStopLeading callback fires (which is what happens when
    // stop()'s future cancellation triggers the callback), stopLeading() must skip
    // System.exit(1). Otherwise calling stop() from inside a JVM shutdown hook deadlocks against
    // the java.lang.Shutdown class lock. If this regression is ever reintroduced, this test
    // method would terminate the JUnit JVM via System.exit(1) instead of failing cleanly.
    final var leaderElectionManager = leaderElectionManager(null);
    leaderElectionManager.stop();
    leaderElectionManager.stopLeading();
  }
Comment on lines +54 to +58
* <p>Internally this class wraps a Fabric8 {@link LeaderElector} that coordinates via a Kubernetes
* {@code Lease} resource (group {@value #COORDINATION_GROUP}, resource {@value #LEASES_RESOURCE}).
* When this pod acquires the lease, {@link #startLeading()} starts event processing on the
* controller manager. When the lease is lost or the leader-election future is cancelled, {@link
* #stopLeading()} is invoked.
Collaborator

I don't think we need that level of information, especially because this touches the internal implementation, which might change over time, and addresses concepts that have not been defined in this scope.

Contributor Author

In the latest commit I removed the JavaDoc for this class, as well as the JavaDoc that was added to the installShutdownHook methods. Please let me know if you would prefer to keep the current installShutdownHook method JavaDocs.

Signed-off-by: Dennis-Mircea Ciupitu <dennis.mircea.ciupitu@gmail.com>
@Dennis-Mircea Dennis-Mircea requested a review from metacosm May 27, 2026 15:38
Signed-off-by: Dennis-Mircea Ciupitu <dennis.mircea.ciupitu@gmail.com>
Signed-off-by: Dennis-Mircea Ciupitu <dennis.mircea.ciupitu@gmail.com>

3 participants