Add shared offline-budget accessor for auto Maintenance Mode by thestreak101 · Pull Request #181 · linkedin/helix

thestreak101 · 2026-05-20T04:09:08Z

Issues

Prep PR for the auto-MM entry/exit consistency fix discussed as a follow-up to #175.

Description

Adds BaseControllerDataProvider#getInstancesUnableToAcceptOnlineReplicas(long nowMs) — the single source of truth for "instances that count toward the cluster-wide offline budget" used by both auto Maintenance Mode entry (MAX_OFFLINE_INSTANCES_ALLOWED) and auto exit (NUM_OFFLINE_INSTANCES_FOR_AUTO_EXIT).

An instance counts when all three conditions hold:

Its InstanceOperation is routable (not in UNROUTABLE_INSTANCE_OPERATIONS = {SWAP_IN, UNKNOWN}).
It is not currently enabled-and-live.
It does not carry a valid (unexpired) instance-operation maintenance marker.

Why. BestPossibleStateCalcStage (MM entry) and MaintenanceRecoveryStage (MM exit) currently use different baselines: entry filters by !UNROUTABLE, exit filters by ASSIGNABLE (= {ENABLE, DISABLE}). The two diverge on EVACUATE. In a cluster with one EVACUATE+offline instance plus one ENABLE+offline instance and MAX_OFFLINE_INSTANCES_ALLOWED=1 / NUM_OFFLINE_INSTANCES_FOR_AUTO_EXIT=1, entry sees 2 → enters MM, and exit sees 1 → auto-exits on the very next tick. This accessor is the first step toward fixing that.

Scope of this PR. Pure addition. No existing callers are switched, so there is no controller behavior change. The follow-up PR will:

Switch both stages to call the new accessor.
Update log messages to match the shared terminology.
Add EVACUATE-coverage integration tests to TestInstanceOperationMaintenanceBudget and lock in the new auto-exit semantics.

Tests

TestInstancesUnableToAcceptOnlineReplicas (new, 18 cases):

Per-operation buckets: ENABLE+live, ENABLE+offline, DISABLE+live, DISABLE+offline, EVACUATE+live, EVACUATE+offline, SWAP_IN, UNKNOWN.
Marker interactions: valid marker exempts, expired marker doesn't, boundary at nowMs == until, marker on SWAP_IN is irrelevant.
Aggregate: empty cluster, realistic mixed-cluster (8 ENABLE+live + every other bucket once).
Returned-set contract: modifiable, independent of future calls (the accessor's docstring promises both).

Targeted run on the new + adjacent suites (helix-core only):

```
mvn -pl helix-core test
-Dtest='TestInstancesUnableToAcceptOnlineReplicas,TestInstanceOperationMaintenanceBudget,TestClusterInMaintenanceModeWhenReachingOfflineInstancesLimit'

TestInstancesUnableToAcceptOnlineReplicas: 18/18 pass
TestInstanceOperationMaintenanceBudget: 4/4 pass
TestClusterInMaintenanceModeWhenReachingOfflineInstancesLimit: 2/2 pass
```

Changes that Break Backward Compatibility (Optional)

None. The accessor is a pure addition with no callers in this PR.

Documentation (Optional)

N/A — internal API, exhaustively documented in the Javadoc.

Commits

Subject in imperative mood, ≤ 50 chars, body wraps at 72, explains what + why.

Code Quality

Diff formatted to match surrounding BaseControllerDataProvider style. No new lint warnings introduced.

Adds BaseControllerDataProvider#getInstancesUnableToAcceptOnlineReplicas(nowMs): the single source of truth for "instances that count toward the cluster-wide offline budget" used by auto Maintenance Mode entry / exit thresholds (MAX_OFFLINE_INSTANCES_ALLOWED, NUM_OFFLINE_INSTANCES_FOR_AUTO_EXIT). An instance counts when all three conditions hold: - InstanceOperation is routable (not SWAP_IN / UNKNOWN) - It is not currently enabled-and-live - It does not carry a valid instance-operation maintenance marker Today BestPossibleStateCalcStage (entry) filters by !UNROUTABLE while MaintenanceRecoveryStage (exit) filters by ASSIGNABLE. The two baselines diverge on EVACUATE, which can drive a cluster to enter MM and then auto-exit on the very next tick. This accessor is the first step toward fixing that; a follow-up PR will switch both stages to call it. This change is a pure addition: no existing callers are switched, no controller behavior changes. Test coverage in TestInstancesUnableToAcceptOnlineReplicas (new, 18 cases): - Per-operation buckets: ENABLE+live, ENABLE+offline, DISABLE, EVACUATE (live and offline), SWAP_IN, UNKNOWN - Marker interactions: valid / expired / boundary (nowMs == until) - Aggregate: empty cluster, mixed-cluster - Returned-set contract: modifiable, independent of future calls All 18 pass. Adjacent integration tests confirmed still green: TestClusterInMaintenanceModeWhenReachingOfflineInstancesLimit 2/2 TestInstanceOperationMaintenanceBudget 4/4

LZD-PratyushBhatt

LGTM

thestreak101 requested review from LZD-PratyushBhatt, PranaviAncha, arkmish, kabragaurav, laxman-ch, ngngwr and proud-parselmouth as code owners May 20, 2026 04:09

thestreak101 mentioned this pull request May 20, 2026

Make auto-MM entry and exit measure the same offline budget #182

Merged

2 tasks

LZD-PratyushBhatt approved these changes May 20, 2026

View reviewed changes

thestreak101 merged commit b149d19 into dev May 20, 2026
1 of 2 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add shared offline-budget accessor for auto Maintenance Mode#181

Add shared offline-budget accessor for auto Maintenance Mode#181
thestreak101 merged 1 commit into
devfrom
sacchoud/shared-offline-budget-accessor

thestreak101 commented May 20, 2026

Uh oh!

LZD-PratyushBhatt left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

thestreak101 commented May 20, 2026

Issues

Description

Tests

Changes that Break Backward Compatibility (Optional)

Documentation (Optional)

Commits

Code Quality

Uh oh!

LZD-PratyushBhatt left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants