fix: shutdown hook deadlock under leader election and deprecate Operator#installShutdownHook(Duration)#3383
Conversation
…tor#installShutdownHook(Duration) Signed-off-by: Dennis-Mircea Ciupitu <dennis.mircea.ciupitu@gmail.com>
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Improves operator shutdown behavior in leader-election scenarios by always installing a JVM shutdown hook and preventing System.exit(1) from being invoked during graceful shutdown, with a regression test covering the deadlock scenario.
Changes:
- Always register a JVM shutdown hook in
Operator, and deprecate the legacy overload that accepted a (now ignored) timeout. - Add a graceful-shutdown guard in
LeaderElectionManager.stopLeading()to skipSystem.exit(1)when shutdown is already in progress. - Add a regression test ensuring
stopLeading()doesn’t terminate the JVM whenstop()was called first.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| operator-framework-core/src/test/java/io/javaoperatorsdk/operator/LeaderElectionManagerTest.java | Adds a regression test for shutdown-hook/leader-election interaction to avoid JVM termination. |
| operator-framework-core/src/main/java/io/javaoperatorsdk/operator/Operator.java | Changes shutdown hook installation behavior and improves shutdown documentation; deprecates old overload. |
| operator-framework-core/src/main/java/io/javaoperatorsdk/operator/LeaderElectionManager.java | Adds graceful-shutdown coordination to avoid System.exit during JVM shutdown; minor permission-check change. |
| @Test | ||
| void stopLeadingDoesNotInvokeSystemExitWhenStopWasCalledFirst() { | ||
| // When stop() is called before the onStopLeading callback fires (which is what happens when | ||
| // stop()'s future cancellation triggers the callback), stopLeading() must skip | ||
| // System.exit(1). Otherwise calling stop() from inside a JVM shutdown hook deadlocks against | ||
| // the java.lang.Shutdown class lock. If this regression is ever reintroduced, this test | ||
| // method would terminate the JUnit JVM via System.exit(1) instead of failing cleanly. | ||
| final var leaderElectionManager = leaderElectionManager(null); | ||
| leaderElectionManager.stop(); | ||
| leaderElectionManager.stopLeading(); | ||
| } |
| * <p>Internally this class wraps a Fabric8 {@link LeaderElector} that coordinates via a Kubernetes | ||
| * {@code Lease} resource (group {@value #COORDINATION_GROUP}, resource {@value #LEASES_RESOURCE}). | ||
| * When this pod acquires the lease, {@link #startLeading()} starts event processing on the | ||
| * controller manager. When the lease is lost or the leader-election future is cancelled, {@link | ||
| * #stopLeading()} is invoked. |
There was a problem hiding this comment.
I don't think we need that level of information, especially because this touches the internal implementation, which might change over time, and addresses concepts that have not been defined in this scope.
There was a problem hiding this comment.
As part of the last commit I removed the JavaDoc for this class and also the ones made as part of the installShutdownHook methods. Please let me know if you want to stick to the current installShutdownHook method JavaDocs.
Signed-off-by: Dennis-Mircea Ciupitu <dennis.mircea.ciupitu@gmail.com>
Signed-off-by: Dennis-Mircea Ciupitu <dennis.mircea.ciupitu@gmail.com>
Signed-off-by: Dennis-Mircea Ciupitu <dennis.mircea.ciupitu@gmail.com>
Summary
Resolves all three concerns raised in #3376:
Operator#installShutdownHook(...)was previously skipped whenever leader election was enabled (introduced in fix: leader election stop deadlock #1618 to work around the deadlock reported in LeaderElectionManager#stopLeading maybe deadlock with Operator#installShutdownHook #1614). As a result, a leader pod receivingSIGTERMdid not release its lease, forcing standby replicas to wait for lease expiry before they could take over.gracefulShutdownTimeoutargument onOperator#installShutdownHook(Duration)has been ignored since PR feat: support for graceful shutdown based on configuration #2479. The reconciliation termination timeout is now read fromConfigurationService#reconciliationTerminationTimeout().Durationparameter was not documented anywhere; neither were theOperator#stop()shutdown sequence and theLeaderElectionManagerlifecycle.What changed
Fix the underlying deadlock (
LeaderElectionManager)The deadlock from #1614 was very specific:
Operator#stop()called from inside a JVM shutdown hook would cancel the leader-election future, which fired theonStopLeadingcallback, which calledSystem.exit(1). ThatSystem.exitthen blocked indefinitely on thejava.lang.Shutdownclass lock that the shutdown hook thread itself was already holding.This PR breaks the recursion.
LeaderElectionManagernow carries astoppingGracefullyAtomicBoolean, set instop()before the future is cancelled and checked at the top ofstopLeading(). When the flag is set,stopLeading()returns immediately instead of invokingSystem.exit. The "restart on lost lead" behavior (when the leader-election library detects a real lost lease without a priorstop()) is preserved because that path runs without the flag ever being set.Re-enable the shutdown hook unconditionally (
Operator)With the deadlock fixed at its source, the conditional in
Operator#installShutdownHook()is no longer required. The hook is now registered regardless of leader-election state, and the"Leader election is on, shutdown hook will not be installed."warn log line is removed. A leader pod receivingSIGTERMwill run the hook, which callsOperator#stop(), which now cleanly cancels the leader-election future and releases the lease.Deprecate
installShutdownHook(Duration)(Operator)The
Durationargument has been dead since #2479. This PR adds a new no-arginstallShutdownHook()overload (the recommended replacement) whose JavaDoc points users atConfigurationServiceOverrider#withReconciliationTerminationTimeout(Duration)as the real configuration knob. The existinginstallShutdownHook(Duration)is marked@Deprecated(forRemoval = true)and delegates to the no-arg overload. TheDurationparameter is kept only for source and binary compatibility.Documentation additions
LeaderElectionManagerexplaining its role, configuration entry points, theLease-based coordination, the three-way behavior ofstopLeading(), and the lifecycle ownership byOperator.Operator#stop()documenting the four-step shutdown sequence (controller manager, executor service manager with timeout, leader-election manager, optional client close), an explicit "safe to call from a JVM shutdown hook" guarantee, the not-started edge-case behavior, and thecloseClientOnStop()opt-out (defaulttrue).Operator#installShutdownHook()explaining the timeout configuration knob and adding aterminationGracePeriodSecondsnote that recommends sizing the pod's grace period to fit the configuredreconciliationTerminationTimeoutplus a small buffer.Regression test
LeaderElectionManagerTest#stopLeadingDoesNotInvokeSystemExitWhenStopWasCalledFirstcallsstop()and thenstopLeading()directly. If the graceful-shutdown short-circuit is ever reintroduced asSystem.exit(1), this test method would terminate the JUnit JVM rather than failing cleanly, making the regression impossible to miss in CI.LeaderElectionManager#stopLeadingwas lowered fromprivatetoprotectedto enable the test (and to allow subclasses to extend the behavior).Behavior summary by scenario
SIGTERMstop()runs, graceful shutdownSIGTERMstop()runs, lease released cleanlySIGTERMstop()is effectively a no-op for leader-election state (no lease to release)onStopLeadingfires without priorstop())System.exit(1)triggered, JVM restartsinstallShutdownHook(Duration)called by userDurationsilently ignoredDurationsilently ignored, plus deprecation warning at compile timeTest plan
mvn -pl operator-framework-core testpasses, including the newstopLeadingDoesNotInvokeSystemExitWhenStopWasCalledFirstregression.installShutdownHook(Duration).