Add shared offline-budget accessor for auto Maintenance Mode #181
Merged
Adds BaseControllerDataProvider#getInstancesUnableToAcceptOnlineReplicas(nowMs):
the single source of truth for "instances that count toward the cluster-wide
offline budget" used by auto Maintenance Mode entry / exit thresholds
(MAX_OFFLINE_INSTANCES_ALLOWED, NUM_OFFLINE_INSTANCES_FOR_AUTO_EXIT).
An instance counts when all three conditions hold:
- InstanceOperation is routable (not SWAP_IN / UNKNOWN)
- It is not currently enabled-and-live
- It does not carry a valid instance-operation maintenance marker
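Read together, the three conditions form a single predicate. The following is a minimal sketch of that reading, not the code added in this PR: `OfflineBudgetSketch`, the `Instance` holder and its fields are hypothetical stand-ins for the state the real accessor reads from the data provider. Marker validity is passed in as a precomputed boolean so that the `nowMs` boundary semantics are not guessed at here.

```java
import java.util.EnumSet;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class OfflineBudgetSketch {
  enum InstanceOperation { ENABLE, DISABLE, EVACUATE, SWAP_IN, UNKNOWN }

  // Operations that never count toward the offline budget (condition 1).
  static final Set<InstanceOperation> UNROUTABLE =
      EnumSet.of(InstanceOperation.SWAP_IN, InstanceOperation.UNKNOWN);

  // Hypothetical holder for illustration only.
  static final class Instance {
    final String name;
    final InstanceOperation operation;
    final boolean enabledAndLive;  // input to condition 2
    final boolean hasValidMarker;  // input to condition 3, assumed precomputed

    Instance(String name, InstanceOperation operation,
             boolean enabledAndLive, boolean hasValidMarker) {
      this.name = name;
      this.operation = operation;
      this.enabledAndLive = enabledAndLive;
      this.hasValidMarker = hasValidMarker;
    }
  }

  // True only when all three conditions hold.
  static boolean countsTowardOfflineBudget(Instance instance) {
    return !UNROUTABLE.contains(instance.operation)
        && !instance.enabledAndLive
        && !instance.hasValidMarker;
  }

  // Collects the names of all instances that count, into a fresh set.
  static Set<String> instancesUnableToAcceptOnlineReplicas(List<Instance> all) {
    Set<String> result = new HashSet<>();
    for (Instance instance : all) {
      if (countsTowardOfflineBudget(instance)) {
        result.add(instance.name);
      }
    }
    return result;
  }
}
```

In this sketch an offline `ENABLE` instance without a marker counts, while the same instance with a valid marker, or any `SWAP_IN` instance, does not.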
Today BestPossibleStateCalcStage (entry) filters by !UNROUTABLE while
MaintenanceRecoveryStage (exit) filters by ASSIGNABLE. The two baselines
diverge on EVACUATE, which can drive a cluster to enter MM and then
auto-exit on the very next tick. This accessor is the first step toward
fixing that; a follow-up PR will switch both stages to call it.
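To make the divergence concrete, the sketch below counts the same offline instances under both baselines. It is an illustration under stated assumptions rather than the stage code: the class and method names are hypothetical, `ASSIGNABLE` is taken to be `{ENABLE, DISABLE}` as in the PR description, and the threshold comparisons (enter when the count exceeds the maximum, exit when the count is at or below the exit threshold) are assumed.

```java
import java.util.EnumSet;
import java.util.Set;

class BaselineDivergenceSketch {
  enum Op { ENABLE, DISABLE, EVACUATE, SWAP_IN, UNKNOWN }

  static final Set<Op> UNROUTABLE = EnumSet.of(Op.SWAP_IN, Op.UNKNOWN);
  static final Set<Op> ASSIGNABLE = EnumSet.of(Op.ENABLE, Op.DISABLE);

  // Entry baseline: every offline instance whose operation is not UNROUTABLE.
  static int entryCount(Op[] offlineInstanceOps) {
    int count = 0;
    for (Op op : offlineInstanceOps) {
      if (!UNROUTABLE.contains(op)) {
        count++;
      }
    }
    return count;
  }

  // Exit baseline: only offline instances whose operation is ASSIGNABLE.
  static int exitCount(Op[] offlineInstanceOps) {
    int count = 0;
    for (Op op : offlineInstanceOps) {
      if (ASSIGNABLE.contains(op)) {
        count++;
      }
    }
    return count;
  }

  // Assumed comparison: enter MM when the count exceeds the allowed maximum.
  static boolean shouldEnter(int offlineCount, int maxOfflineInstancesAllowed) {
    return offlineCount > maxOfflineInstancesAllowed;
  }

  // Assumed comparison: auto-exit when the count is at or below the threshold.
  static boolean shouldAutoExit(int offlineCount, int numOfflineInstancesForAutoExit) {
    return offlineCount <= numOfflineInstancesForAutoExit;
  }
}
```

With one offline `EVACUATE` instance and one offline `ENABLE` instance, and both thresholds set to 1, `entryCount` returns 2 and `exitCount` returns 1, so under the assumed comparisons the cluster both enters and immediately qualifies to exit.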
This change is a pure addition: no existing callers are switched, no
controller behavior changes.
Test coverage in TestInstancesUnableToAcceptOnlineReplicas (new, 18 cases):
- Per-operation buckets: ENABLE+live, ENABLE+offline, DISABLE,
EVACUATE (live and offline), SWAP_IN, UNKNOWN
- Marker interactions: valid / expired / boundary (nowMs == until)
- Aggregate: empty cluster, mixed-cluster
- Returned-set contract: modifiable, independent of future calls
All 18 pass.
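The returned-set contract (modifiable, independent of future calls) is most simply met by returning a fresh copy on every call. The sketch below shows that pattern only; how the real accessor builds its result is not shown in this PR text, and the class, field and host names here are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class ReturnedSetContractSketch {
  // Hypothetical internal state standing in for the provider's cached view.
  private final Set<String> offlineBudgetInstances =
      new HashSet<>(Arrays.asList("host_1", "host_2"));

  // Returns a new HashSet each time, so callers may mutate the result
  // without affecting internal state or later calls.
  Set<String> getInstancesUnableToAcceptOnlineReplicas() {
    return new HashSet<>(offlineBudgetInstances);
  }
}
```

A caller can clear or extend the returned set, and the next call still reports the original two instances.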
Adjacent integration tests confirmed still green:
TestClusterInMaintenanceModeWhenReachingOfflineInstancesLimit 2/2
TestInstanceOperationMaintenanceBudget 4/4
Issues
Prep PR for the auto-MM entry/exit consistency fix discussed as a follow-up to #175.
Description
Adds `BaseControllerDataProvider#getInstancesUnableToAcceptOnlineReplicas(long nowMs)`: the single source of truth for "instances that count toward the cluster-wide offline budget" used by both auto Maintenance Mode entry (`MAX_OFFLINE_INSTANCES_ALLOWED`) and auto exit (`NUM_OFFLINE_INSTANCES_FOR_AUTO_EXIT`).

An instance counts when all three conditions hold:
- `InstanceOperation` is routable (not in `UNROUTABLE_INSTANCE_OPERATIONS` = `{SWAP_IN, UNKNOWN}`).
- It is not currently enabled-and-live.
- It does not carry a valid instance-operation maintenance marker.

Why. `BestPossibleStateCalcStage` (MM entry) and `MaintenanceRecoveryStage` (MM exit) currently use different baselines: entry filters by `!UNROUTABLE`, exit filters by `ASSIGNABLE` (= `{ENABLE, DISABLE}`). The two diverge on `EVACUATE`. In a cluster with one `EVACUATE+offline` instance plus one `ENABLE+offline` instance and `MAX_OFFLINE_INSTANCES_ALLOWED=1` / `NUM_OFFLINE_INSTANCES_FOR_AUTO_EXIT=1`, entry sees 2 and enters MM, then exit sees 1 and auto-exits on the very next tick. This accessor is the first step toward fixing that.

Scope of this PR. Pure addition. No existing callers are switched, so there is no controller behavior change. The follow-up PR will:
- Switch both stages to call the accessor.
- … `TestInstanceOperationMaintenanceBudget` and lock in the new auto-exit semantics.

Tests
`TestInstancesUnableToAcceptOnlineReplicas` (new, 18 cases):
- Per-operation buckets: `ENABLE+live`, `ENABLE+offline`, `DISABLE+live`, `DISABLE+offline`, `EVACUATE+live`, `EVACUATE+offline`, `SWAP_IN`, `UNKNOWN`.
- Marker interactions: valid, expired, boundary (`nowMs == until`), marker on `SWAP_IN` is irrelevant.
- Aggregate: empty cluster, mixed cluster (`ENABLE+live` + every other bucket once).
- Returned-set contract: modifiable, independent of future calls.

Targeted run on the new + adjacent suites (helix-core only):
```
mvn -pl helix-core test \
  -Dtest='TestInstancesUnableToAcceptOnlineReplicas,TestInstanceOperationMaintenanceBudget,TestClusterInMaintenanceModeWhenReachingOfflineInstancesLimit'
TestInstancesUnableToAcceptOnlineReplicas: 18/18 pass
TestInstanceOperationMaintenanceBudget: 4/4 pass
TestClusterInMaintenanceModeWhenReachingOfflineInstancesLimit: 2/2 pass
```
Changes that Break Backward Compatibility (Optional)
None. The accessor is a pure addition with no callers in this PR.
Documentation (Optional)
N/A — internal API, exhaustively documented in the Javadoc.
Commits
Code Quality
`BaseControllerDataProvider` style. No new lint warnings introduced.