Skip to content

Add shared offline-budget accessor for auto Maintenance Mode#181

Merged
thestreak101 merged 1 commit into
devfrom
sacchoud/shared-offline-budget-accessor
May 20, 2026
Merged

Add shared offline-budget accessor for auto Maintenance Mode#181
thestreak101 merged 1 commit into
devfrom
sacchoud/shared-offline-budget-accessor

Conversation

@thestreak101
Copy link
Copy Markdown
Collaborator

Issues

Prep PR for the auto-MM entry/exit consistency fix discussed as a follow-up to #175.

Description

Adds BaseControllerDataProvider#getInstancesUnableToAcceptOnlineReplicas(long nowMs) — the single source of truth for "instances that count toward the cluster-wide offline budget" used by both auto Maintenance Mode entry (MAX_OFFLINE_INSTANCES_ALLOWED) and auto exit (NUM_OFFLINE_INSTANCES_FOR_AUTO_EXIT).

An instance counts when all three conditions hold:

  • Its InstanceOperation is routable (not in UNROUTABLE_INSTANCE_OPERATIONS = {SWAP_IN, UNKNOWN}).
  • It is not currently enabled-and-live.
  • It does not carry a valid (unexpired) instance-operation maintenance marker.

Why. BestPossibleStateCalcStage (MM entry) and MaintenanceRecoveryStage (MM exit) currently use different baselines: entry filters by !UNROUTABLE, exit filters by ASSIGNABLE (= {ENABLE, DISABLE}). The two diverge on EVACUATE. In a cluster with one EVACUATE+offline instance plus one ENABLE+offline instance and MAX_OFFLINE_INSTANCES_ALLOWED=1 / NUM_OFFLINE_INSTANCES_FOR_AUTO_EXIT=1, entry sees 2 → enters MM, and exit sees 1 → auto-exits on the very next tick. This accessor is the first step toward fixing that.

Scope of this PR. Pure addition. No existing callers are switched, so there is no controller behavior change. The follow-up PR will:

  1. Switch both stages to call the new accessor.
  2. Update log messages to match the shared terminology.
  3. Add EVACUATE-coverage integration tests to TestInstanceOperationMaintenanceBudget and lock in the new auto-exit semantics.

Tests

TestInstancesUnableToAcceptOnlineReplicas (new, 18 cases):

  • Per-operation buckets: ENABLE+live, ENABLE+offline, DISABLE+live, DISABLE+offline, EVACUATE+live, EVACUATE+offline, SWAP_IN, UNKNOWN.
  • Marker interactions: valid marker exempts, expired marker doesn't, boundary at nowMs == until, marker on SWAP_IN is irrelevant.
  • Aggregate: empty cluster, realistic mixed-cluster (8 ENABLE+live + every other bucket once).
  • Returned-set contract: modifiable, independent of future calls (the accessor's docstring promises both).

Targeted run on the new + adjacent suites (helix-core only):

​```
mvn -pl helix-core test
-Dtest='TestInstancesUnableToAcceptOnlineReplicas,TestInstanceOperationMaintenanceBudget,TestClusterInMaintenanceModeWhenReachingOfflineInstancesLimit'

TestInstancesUnableToAcceptOnlineReplicas: 18/18 pass
TestInstanceOperationMaintenanceBudget: 4/4 pass
TestClusterInMaintenanceModeWhenReachingOfflineInstancesLimit: 2/2 pass
​```

Changes that Break Backward Compatibility (Optional)

None. The accessor is a pure addition with no callers in this PR.

Documentation (Optional)

N/A — internal API, exhaustively documented in the Javadoc.

Commits

  • Subject in imperative mood, ≤ 50 chars, body wraps at 72, explains what + why.

Code Quality

  • Diff formatted to match surrounding BaseControllerDataProvider style. No new lint warnings introduced.

Adds BaseControllerDataProvider#getInstancesUnableToAcceptOnlineReplicas(nowMs):
the single source of truth for "instances that count toward the cluster-wide
offline budget" used by auto Maintenance Mode entry / exit thresholds
(MAX_OFFLINE_INSTANCES_ALLOWED, NUM_OFFLINE_INSTANCES_FOR_AUTO_EXIT).

An instance counts when all three conditions hold:
  - InstanceOperation is routable (not SWAP_IN / UNKNOWN)
  - It is not currently enabled-and-live
  - It does not carry a valid instance-operation maintenance marker

Today BestPossibleStateCalcStage (entry) filters by !UNROUTABLE while
MaintenanceRecoveryStage (exit) filters by ASSIGNABLE. The two baselines
diverge on EVACUATE, which can drive a cluster to enter MM and then
auto-exit on the very next tick. This accessor is the first step toward
fixing that; a follow-up PR will switch both stages to call it.

This change is a pure addition: no existing callers are switched, no
controller behavior changes.

Test coverage in TestInstancesUnableToAcceptOnlineReplicas (new, 18 cases):
  - Per-operation buckets: ENABLE+live, ENABLE+offline, DISABLE,
    EVACUATE (live and offline), SWAP_IN, UNKNOWN
  - Marker interactions: valid / expired / boundary (nowMs == until)
  - Aggregate: empty cluster, mixed-cluster
  - Returned-set contract: modifiable, independent of future calls
All 18 pass.

Adjacent integration tests confirmed still green:
  TestClusterInMaintenanceModeWhenReachingOfflineInstancesLimit 2/2
  TestInstanceOperationMaintenanceBudget 4/4
Copy link
Copy Markdown
Collaborator

@LZD-PratyushBhatt LZD-PratyushBhatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@thestreak101 thestreak101 merged commit b149d19 into dev May 20, 2026
1 of 2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants