Planned-maintenance budget exemption for auto Maintenance Mode by LZD-PratyushBhatt · Pull Request #175 · linkedin/helix

LZD-PratyushBhatt · 2026-05-12T13:37:34Z

Issues

My PR addresses the following Helix issues and references them in the PR description:

Description

Here are some details about my PR, including screenshots of any UI changes:

What. Introduces a per-instance PLANNED_MAINTENANCE_UNTIL_MS marker on InstanceConfig that excludes valid-marker instances from the cluster-wide MAX_OFFLINE_INSTANCES_ALLOWED count that drives auto Maintenance Mode (MM). The filter applies symmetrically on MM entry (BestPossibleStateCalcStage) and MM exit (MaintenanceRecoveryStage).

Why. Helix today is cause-blind: every offline host counts against the budget. Rolling deploys, evacuations, and planned swaps produce false-positive MM triggers, and a static MAX_OFFLINE_INSTANCES_ALLOWED caps the deployment speedup that operators expect when raising replication factor (e.g., Venice's RF=3 to RF=4 change). The marker lets Helix recognize planned offlines while keeping the MM trigger as a safety net for unplanned loss. Helix stays free of orchestration semantics; operation-type reasoning remains in the orchestrator.

How.

InstanceConfig:
- New PLANNED_MAINTENANCE_UNTIL_MS (SimpleField, long) and PLANNED_MAINTENANCE_METADATA (MapField, audit-only) properties.
- PLANNED_MAINTENANCE_NOT_SET = -1L sentinel. Setter clears both fields when given a non-positive value.
- isUnderPlannedMaintenance(long nowMs) helper.
- Both new fields added to NON_OVERWRITABLE_PROPERTIES so the marker does not transfer onto a swap-in target via overwriteInstanceConfig.
ClusterConfig adds three opt-in fields (no defaults, absent => feature off for that cluster):
- MAX_PLANNED_MAINTENANCE_INSTANCES (absolute cap on simultaneously marked instances).
- MAX_PLANNED_MAINTENANCE_PERCENTAGE (percentage cap on simultaneously marked instances; if both caps are set, the stricter one wins).
- DEFAULT_PLANNED_MAINTENANCE_DURATION_MS (fallback TTL applied when the REST caller omits expiresAtMillis; absent => such writes are rejected with 400).
BestPossibleStateCalcStage.validateInstancesUnableToAcceptOnlineReplicasLimit: builds the offline-budget set in one pass (routable configs minus enabled-live) and drops instances under valid planned-maintenance markers via removeIf(isUnderPlannedMaintenance) before comparing against MAX_OFFLINE_INSTANCES_ALLOWED. Planned-and-alive instances are not double-counted because they were never in the offline set to begin with.
MaintenanceRecoveryStage applies the same subtraction on the auto-exit path so a deploy window does not block recovery.
helix-rest endpoints (mirror the stoppable-check routing style):
- Single: POST /clusters/{c}/instances/{i}/plannedMaintenance
- Batch: POST /clusters/{c}/instances?command=plannedMaintenance (adds plannedMaintenance to the Command enum and dispatches via the existing InstancesAccessor.instancesOperations switch, parallel to how ?command=stoppable works).
Request body (single):
```
{ "expiresAtMillis": 1776385800000, "reason": "venice-deploy:opId=abc", "source": "AUTOMATION" }
```
Batch body adds "instances": ["h1", "h2", ...]. A negative expiresAtMillis clears the marker.

TTL resolution: caller value if positive; else now + DEFAULT_PLANNED_MAINTENANCE_DURATION_MS if the cluster has it; else 400.

Partial-accept contract (mirrors batch stoppable). The batch endpoint processes instances in input order against a single snapshot of the current cap quota. Per-instance failures (missing instance, cap exhausted) do not abort the call; they land in the rejected map. HTTP 400 is reserved for caller-side bugs that invalidate the entire request (empty instances, past expiresAtMillis, missing expiresAtMillis with no cluster default).

Single endpoint response (200 on success):
```
{ "instance": "h1", "expiresAtMillis": 1776385800000 }
```
On a per-instance failure (cap full or instance missing), single returns 400 with the reason since there's nothing to be partial about.

Batch endpoint response (always 200 for any per-instance outcome):
```
{
  "applied":  ["h1", "h2"],
  "rejected": { "h3": "would exceed MAX_PLANNED_MAINTENANCE_INSTANCES=2" },
  "expiresAtMillis": 1776385800000
}
```
The shared PlannedMaintenanceWriteHandler returns a PlannedMaintenanceResult (applied, rejected, resolvedExpiresAtMillis). Existence checks use a single getInstancesInCluster snapshot rather than per-instance config reads (HelixAdmin throws on missing instances), and the same snapshot feeds the cap quota calculation.

Evaluation model. The filter is evaluated lazily on events the controller already handles (instance-config writes, live-instance changes, periodic rebalance). Orchestrator clears re-evaluate immediately via the config write. Real outages re-evaluate via the corresponding live-instance change. A stale marker left by an orchestrator crash is benign on a stable cluster and cascades into MM correctly the moment a genuine outage arrives.

Tests

The following tests are written for this issue:

helix-core/src/test/java/org/apache/helix/model/TestInstanceConfig.java (10 new cases):
- testPlannedMaintenanceUnsetByDefault
- testPlannedMaintenanceSetAndGet
- testIsUnderPlannedMaintenanceWindow (boundary semantics)
- testPlannedMaintenanceClearViaNegativeValue
- testPlannedMaintenanceClearViaZero
- testPlannedMaintenanceMetadataSetAndGet
- testPlannedMaintenanceMetadataIsReadOnly
- testPlannedMaintenanceMetadataNullClears
- testPlannedMaintenanceMetadataIsDefensiveCopied
- testPlannedMaintenanceNotTransferredByOverwrite
helix-core/src/test/java/org/apache/helix/model/TestClusterConfig.java (3 new cases):
- testPlannedMaintenanceFieldsUnsetByDefault
- testPlannedMaintenanceFieldsRoundTrip
- testPlannedMaintenanceFieldsClearWithSentinel
helix-rest/src/test/java/org/apache/helix/rest/clusterMaintenanceService/TestPlannedMaintenanceWriteHandler.java (21 cases, Mockito-based): TTL resolution (caller wins / cluster default / reject when both unset / past-timestamp rejected), set-with-metadata, set-without-audit-fields, dedup, empty/null instance rejection, missing-instance partial-accept (others still applied), cap enforcement under partial-accept semantics (no-op / absolute / boundary / percentage / stricter-of-two / counts existing markers / ignores expired markers / zero-cap rejects everything), clear with negative sentinel, clear bypasses cap, clear with missing instance partial-accept.
helix-core/src/test/java/org/apache/helix/integration/rebalancer/TestPlannedMaintenanceBudgetExemption.java (2 integration cases with real ZK):
- testValidMarkersPreventMMTrigger: two marked offline instances do not trip MM; adding two unmarked offline instances on top does.
- testExpiredMarkersDoNotExempt: markers with a past untilMs behave as if absent and MM trips.

The following is the result of the "mvn test" command on the appropriate module:

Targeted run across helix-core + helix-rest covering all new tests and the
adjacent maintenance-mode regression tests:

  mvn -pl helix-core,helix-rest test \
    -Dtest='TestInstanceConfig,TestClusterConfig,TestPlannedMaintenanceWriteHandler,
    TestPlannedMaintenanceBudgetExemption,TestBestPossibleStateCalcStage,
    TestClusterInMaintenanceModeWhenReachingOfflineInstancesLimit,
    TestClusterMaintenanceMode'

  helix-core: Tests run: 87, Failures: 0, Errors: 0, Skipped: 0
  helix-rest: Tests run: 21, Failures: 0, Errors: 0, Skipped: 0

End-to-end smoke test against a live cluster on localhost:2191 confirmed all paths: TTL resolution (caller / cluster-default / reject-with-no-default / past-timestamp), partial-accept on cap overflow, partial-accept on missing instance, single-endpoint 400 on cap-exhausted, single-endpoint 400 on missing instance, mixed missing + cap-overflow batch.

Changes that Break Backward Compatibility (Optional)

None. All new fields are opt-in:

New InstanceConfig fields default to "unset"; older readers ignore unknown ZNRecord keys, and the existing getMaxOfflineInstancesAllowed path is unchanged when no instance carries a marker.
New ClusterConfig fields default to "unset"; absent means feature is off for that cluster.
The two new REST endpoints are additive; existing endpoints are untouched.
NON_OVERWRITABLE_PROPERTIES gains two entries; this only affects callers that explicitly relied on overwriteInstanceConfig transferring planned-maintenance state during swap, which would have been an unintended behavior anyway.

Documentation (Optional)

Design doc: Helix Planned Maintenance Window (per-instance budget exemption).

Commits

My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters (not including Jira issue reference)
3. Subject does not end with a period
4. Subject uses the imperative mood ("add", not "adding")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"

Code Quality

My diff has been formatted using helix-style.xml
(helix-style-intellij.xml if IntelliJ IDE is used)

Introduces a per-instance PLANNED_MAINTENANCE_UNTIL_MS marker on InstanceConfig that excludes valid-marker instances from MAX_OFFLINE_INSTANCES_ALLOWED while the marker has not expired. The filter applies symmetrically to MM entry (BestPossibleStateCalcStage) and MM exit (MaintenanceRecoveryStage). Unplanned offline loss still trips MM, preserving the safety net. Cluster-level controls (all opt-in, no defaults): - MAX_PLANNED_MAINTENANCE_INSTANCES / MAX_PLANNED_MAINTENANCE_PERCENTAGE caps how many instances may carry a marker simultaneously - DEFAULT_PLANNED_MAINTENANCE_DURATION_MS supplies a fallback TTL when the REST caller omits expiresAtMillis; absent => write rejected with 400 REST endpoints: - POST /clusters/{c}/instances/{i}/plannedMaintenance (single) - POST /clusters/{c}/instances/plannedMaintenance:batch (batch) A negative expiresAtMillis sentinel clears the marker. Markers are listed in NON_OVERWRITABLE_PROPERTIES so they do not transfer via swap-in. Tests: - TestInstanceConfig: 10 cases covering get/set, isUnderPlannedMaintenance boundary, clear-via-negative, clear-via-zero, metadata defensive copy, overwriteInstanceConfig non-transfer - TestClusterConfig: 3 cases covering unset sentinels and round-trip - TestPlannedMaintenanceWriteHandler: 19 cases covering TTL resolution, dedup, pre-write existence validation, cap enforcement (absolute, percentage, stricter-of-two, existing markers, expired markers), clear-bypasses-cap - TestPlannedMaintenanceBudgetExemption: integration test with real ZK confirming valid markers prevent MM trigger end-to-end and expired markers behave as absent

Earlier draft put the batch endpoint at @path("plannedMaintenance:batch") on InstancesAccessor. Jersey resolves URLs under /clusters/{c}/instances/ against PerInstanceAccessor's @path("{instanceName}") root resource (more specific class @path wins), so the batch URL was being captured by updateInstance and returning "Invalid command: null". Match the stoppable-check routing pattern instead: - Single endpoint stays at /clusters/{c}/instances/{i}/plannedMaintenance (PerInstanceAccessor) - Batch is now POST /clusters/{c}/instances?command=plannedMaintenance, routed through InstancesAccessor.instancesOperations dispatch with a new Command enum value and a private batchSetPlannedMaintenance helper that mirrors batchGetStoppableInstances Verified end-to-end against localhost:2191 + REST on :8100 with set, batch, fallback-TTL, past-expiry-rejected, no-default-rejected, and cap-exceeded-rejected paths.

bellatrix007 · 2026-05-13T16:03:12Z

+      // Subtract instances carrying a valid planned-maintenance marker that are currently
+      // counted as offline (routable config, not in the enabled-live set). Planned-and-alive
+      // instances are already absent from the budget and must not be double-subtracted, so we
+      // intersect the marker filter with the !enabledLive predicate. The filter applies
+      // symmetrically: same expression gates both MM entry and (via the same code path on the
+      // next pipeline tick) MM exit.
+      long nowMs = System.currentTimeMillis();
+      Set<String> enabledLive = cache.getEnabledLiveInstances();
+      int plannedAndOfflineCount = (int) cache.getInstanceConfigMap().entrySet().stream()
+          .filter(instanceEntry -> !InstanceConstants.UNROUTABLE_INSTANCE_OPERATIONS.contains(
+              instanceEntry.getValue().getInstanceOperation().getOperation()))
+          .filter(instanceEntry -> instanceEntry.getValue().isUnderPlannedMaintenance(nowMs))
+          .filter(instanceEntry -> !enabledLive.contains(instanceEntry.getKey()))
+          .count();


This will pretty soon become confusing, lets use simple logic -

Get list of instancesUnableToAcceptOnlineReplicas.

From step 1, the size can be determined for previous offline instances calculation.

From the list of instancesUnableToAcceptOnlineReplicas, remove any intersecting instance under maintenance (no need to go through the list twice).

Done. Build the offline-budget set first (routable minus enabled-live), then drop planned maintenance instances from it via removeIf. Single pass now.

bellatrix007 · 2026-05-13T16:06:27Z

+   * unset on the cluster. The auto-MM filter and the REST write path both treat the unset
+   * value as "feature disabled for this cluster".
+   */
+  public static final int PLANNED_MAINTENANCE_LIMIT_UNSET = -1;


nit:

Suggested change

public static final int PLANNED_MAINTENANCE_LIMIT_UNSET = -1;

public static final int PLANNED_MAINTENANCE__INSTANCES_LIMIT_UNSET = -1;

UNSET seems a bit weird, is this the convention used everywhere?

Dropped the named constant entirely and using inline -1 to match the MAX_OFFLINE_INSTANCES_ALLOWED convention here.

checked MAX_OFFLINE_INSTANCES_ALLOWED, it just uses inline -1 with no named constant, switched to that. Same for the duration field.

bellatrix007 · 2026-05-13T16:08:44Z

+    // omits an explicit expiresAtMillis. Not a default in the "always applied" sense: when
+    // unset, writes that omit expiresAtMillis are rejected.


Not a default in the "always applied" sense: when unset, writes that omit expiresAtMillis are rejected.
This complete sentence feels odd, lets rephrase it.

bellatrix007 · 2026-05-13T16:10:47Z

    HELIX_INSTANCE_OPERATIONS,
-    INSTANCE_OPERATION_STATE
+    INSTANCE_OPERATION_STATE,
+    PLANNED_MAINTENANCE_UNTIL_MS,


Can this value be part of the metadata itself?

Kept it separate intentionally. The timestamp drives the filter on every controller tick so it's the hot path. SimpleField + getLongField is a typed direct read. If we move it into the map we have to parse the map + string to long every check. Also keeps the load bearing field (what affects MM) separate from the pure audit fields (reason/source/setAt). Lmk if you feel strongly though.

bellatrix007 · 2026-05-13T16:15:33Z

+    if (clear) {
+      for (String instanceName : deduped) {
+        InstanceConfig cfg = loadInstanceConfig(clusterId, instanceName);
+        cfg.setPlannedMaintenanceUntilMs(InstanceConfig.PLANNED_MAINTENANCE_NOT_SET);


instead of implicitly removing this, can this be a more explicit method?

Extracted into assertInstancesExist(clusterId, deduped)

bellatrix007 · 2026-05-13T16:20:18Z

+      }
+      return callerExpiresAtMillis;
+    }
+    long defaultDuration = clusterConfig.getDefaultPlannedMaintenanceDurationMs();


Maybe I am reading this incorrectly, but based on the above conditions, we can reach here if callerExpiresAtMillis <= 0 correct? which is essentially the unset condition. Wont defaultDuration be -1 in such cases?

Yeah you're right, defaultDuration is -1 when unset. The > 0L check worked but was implicit. Changed to compare against the -1L directly with a comment so the "feature off" branch is explicit.

bellatrix007 · 2026-05-13T16:25:27Z

+      throw new BadRequestException(String.format(
+          "Write would push planned-maintenance count to %d, exceeds cluster cap %d "
+              + "(absoluteCap=%d, percentageCap=%d, clusterSize=%d)",
+          projectedMarked, effectiveCap, absoluteCap, percentageCap, allInstances.size()));


I am worried if this cap can introduce failures which will block the complete deployment in case these values are not set properly.
Either we should gracefully handle this and add the marker for instance within limit or while setting the cap values, be more restrictive / opinionated (which is hard to enforce).

partial accept kind of defeats the purpose. If we mark half the batch and the orchestrator proceeds with the deploy anyway, the unmarked half still trips MM. Hard reject gives the orchestrator a clear signal to slow down or fix the cap. The error response already includes current count + projected + cap so it's actionable.

ACtually, thought about it again, made sense to partial accept just like we do for batch stoppable checks also. batch processes instances in input order, applies what fits the cap, returns {applied, rejected, expiresAtMillis} with per-instance reasons. 400 is now only for request body issues. Single endpoint stays binary 200/400 since there's nothing to be partial about. Updated 21 handler tests and verified end to end against a live cluster.

bellatrix007 · 2026-05-13T16:30:47Z

    return OK();
  }

+  /**


I was thinking of 1 more endpoint, lmk if that makes sense.
Can we have and endpoint which essentially give me a number of actual offline instances which are counted by Helix? At present, I dont think there is any such thing present. Clients have to do this calculation on their own, which can change based on internal logic but client wont know.

Good suggestion, agree clients shouldn't have to recompute this. Will do as a follow up PR to keep this one scoped.

… existence check - BestPossibleStateCalcStage: collapse the two-stream offline filter into one: build the offline-budget set (routable minus enabled-live) then drop instances under valid planned-maintenance markers via removeIf(isUnderPlannedMaintenance). - ClusterConfig: drop named UNSET constants (PLANNED_MAINTENANCE_LIMIT_UNSET / _DURATION_UNSET); use inline -1 to match the MAX_OFFLINE_INSTANCES_ALLOWED convention. - ClusterConfig: rephrase the DEFAULT_PLANNED_MAINTENANCE_DURATION_MS comment to drop the awkward "always applied" phrasing. - PlannedMaintenanceWriteHandler: extract the pre-write existence check into assertInstancesExist for readability; the implicit "load and throw" loop is gone. - PlannedMaintenanceWriteHandler: compare the cluster default to the literal -1L sentinel rather than > 0L, so the "feature off" branch is explicit. Adds a comment. - TestClusterConfig: inline -1 / -1L in lieu of the removed constants. All targeted tests still pass (87 in helix-core, 19 in helix-rest).

Following PR feedback: the batch endpoint now mirrors the partial-accept contract used by the batch stoppable check. Instances are processed in input order; those that fit the cap quota (or that exist on a clear) are returned under "applied", the rest under "rejected" keyed by instance name with a free-form reason. Handler changes - applyPlannedMaintenance returns a PlannedMaintenanceResult value (applied list, rejected map, resolved expiry) instead of a bare long. - Missing-instance check switched from HelixAdmin.getInstanceConfig (which throws for missing instances) to a single HelixAdmin.getInstancesInCluster snapshot. Same snapshot feeds the cap quota calculation, so we no longer pay two passes of per-instance reads for the existence check. - Cap enforcement walks candidates in input order against a precomputed remaining-quota, rejecting overflow rather than throwing. - BadRequestException is now reserved for caller-side bugs (empty input, malformed expiry, missing cluster default). Per-instance failures surface through the rejected map. REST changes - POST /clusters/{c}/instances/{i}/plannedMaintenance (single) stays binary: 200 with {instance, expiresAtMillis} on apply, 400 with the per-instance reason on reject. - POST /clusters/{c}/instances?command=plannedMaintenance (batch) now always returns 200 with {applied, rejected, expiresAtMillis} for any per-instance outcome; 400 is reserved for invalid request bodies. Tests - TestPlannedMaintenanceWriteHandler rewritten to assert the new result shape: applied/rejected contents, per-instance reject reasons, and zero-cap edge cases. 21 tests, all green. - All other adjacent tests (TestInstanceConfig, TestClusterConfig, TestBestPossibleStateCalcStage, TestPlannedMaintenanceBudgetExemption, TestClusterInMaintenanceModeWhenReachingOfflineInstancesLimit) still pass unchanged. Verified end-to-end against localhost:2191 + REST on :8100: partial accept with mixed missing-instance + cap-overflow inputs returns 200 with the right applied/rejected partitioning; single endpoint returns 400 with the per-instance reason when the cap is exhausted.

ClusterConfig and InstanceConfig non-topology fields are already trimmed out of ResourceChangeSnapshot by default (the trimmers' whitelist of "topology-related" fields excludes them, so simple-field values for PLANNED_MAINTENANCE_UNTIL_MS and the new cluster-cap fields never reach the snapshot comparison). InstanceConfigTrimmer.getNonTrimmableKeys, however, preserves the keys of every map and list field by default (only the values are emptied). That means writing PLANNED_MAINTENANCE_METADATA would still flip the snapshot from {no key} to {key: {}}, which the change detector reads as a real change and triggers a rebalance pipeline. Same pattern that HELIX_INSTANCE_OPERATIONS handles via its existing remove() override. This commit adds PLANNED_MAINTENANCE_METADATA to the same remove() path so marker writes don't trigger spurious rebalance work. While doing that, switch the override from mutating super's live keySet() views (which would propagate to the source InstanceConfig record) to copying into fresh HashSet instances first. Behavior is unchanged for callers that already had clean separation between source and trimmed snapshot, but the override is now safe regardless of whether the caller reuses the InstanceConfig instance. Adds TestPlannedMaintenanceTrimming to lock in the contract: marker writes on both InstanceConfig (untilMs simple field, metadata map field) and ClusterConfig (cap fields, default duration) are trimmed away before change-detection sees them. All 100 adjacent tests still pass.

thestreak101 · 2026-05-19T09:26:40Z

+      // The filter is symmetric: the same expression gates both MM entry here and MM exit
+      // via MaintenanceRecoveryStage on the next pipeline tick.


Not true right? BestPossibleStateCalcStage is filtering on not UNROUTABLE and MaintenanceRecoveryStage is filtering on getAssignableInstances(doesn't have evacuate and swap_out).

thestreak101 · 2026-05-19T09:29:21Z

+   * PLANNED_MAINTENANCE_METADATA is purely audit data (reason/source/setAtMs) and writing it
+   * must not trigger a rebalance pipeline.


What happens if same operation was set without this and then another one sets the same operation but with metadata this time?

Removed metadata

thestreak101 · 2026-05-19T09:37:39Z

+      // excluding planned-maintenance instances that are not currently live. This matches
+      // the filter applied at MM entry in BestPossibleStateCalcStage and keeps a deploy
+      // window from blocking auto-recovery.
+      Set<String> assignable = cache.getAssignableInstances();


Base is different from baseline, doesn't have evacuation here.

pre existing from long back, will revisit in a different PR

thestreak101 · 2026-05-19T09:39:36Z

+    MAX_PLANNED_MAINTENANCE_INSTANCES,
+    MAX_PLANNED_MAINTENANCE_PERCENTAGE,


Don't allow setting both, we should have validation that only one of these can be set in the config.

thestreak101 · 2026-05-19T09:40:22Z

+    // Fallback TTL (in millis) applied by the planned-maintenance write path when the
+    // caller omits expiresAtMillis. When unset (-1), callers must always supply an
+    // explicit expiresAtMillis; otherwise the write is rejected.
+    DEFAULT_PLANNED_MAINTENANCE_DURATION_MS,


Name needs to change to config confuse with cluster maintenance mode.

thestreak101 · 2026-05-19T09:41:23Z

-    INSTANCE_OPERATION_STATE
+    INSTANCE_OPERATION_STATE,
+    PLANNED_MAINTENANCE_UNTIL_MS,
+    PLANNED_MAINTENANCE_METADATA


thestreak101 · 2026-05-19T09:43:57Z

+   * Keys used inside the PLANNED_MAINTENANCE_METADATA map field. The map is purely audit
+   * data; it does not influence the budget-exemption decision.
+   */
+  public static final class PlannedMaintenanceMetadataKey {


Why not just instance operation metadata, are these field in anyway related to maintenance?

thestreak101 · 2026-05-19T09:45:09Z

+    public static final String REASON = "reason";
+    public static final String SOURCE = "source";
+    public static final String SET_AT_MS = "setAtMs";


Aren't these already present in instance config?

thestreak101 · 2026-05-19T09:46:01Z

+   * Sentinel returned by {@link #getPlannedMaintenanceUntilMs()} when no marker is set.
+   * Callers clear the marker by passing this value to {@link #setPlannedMaintenanceUntilMs(long)}.
+   */
+  public static final long PLANNED_MAINTENANCE_NOT_SET = -1L;


also update this name depending on decision on other field.

thestreak101 · 2026-05-19T09:47:08Z

Change name???

@thestreak101

…date mutually exclusive budget forms Renames the per-instance budget exemption feature throughout the codebase to anchor on the existing HELIX_INSTANCE_OPERATIONS concept rather than overload "Maintenance" with Helix's cluster-wide Maintenance Mode. The INSTANCE_OPERATION_ prefix disambiguates from MM while keeping the "maintenance window" semantic the orchestrator actually wants. Rename map: Field/Method/Class Before -> After -------------------------------------------------------------------- InstanceConfig field PLANNED_MAINTENANCE_UNTIL_MS -> INSTANCE_OPERATION_MAINTENANCE_UNTIL_MS InstanceConfig sentinel PLANNED_MAINTENANCE_NOT_SET -> INSTANCE_OPERATION_MAINTENANCE_NOT_SET InstanceConfig methods isUnderPlannedMaintenance(now) -> isUnderInstanceOperationMaintenance(now) (and the setter/getter pair to match) ClusterConfig absolute cap MAX_PLANNED_MAINTENANCE_INSTANCES -> INSTANCE_OPERATION_MAINTENANCE_BUDGET ClusterConfig percent cap MAX_PLANNED_MAINTENANCE_PERCENTAGE -> INSTANCE_OPERATION_MAINTENANCE_BUDGET_PERCENTAGE ClusterConfig default TTL DEFAULT_PLANNED_MAINTENANCE_DURATION_MS -> DEFAULT_INSTANCE_OPERATION_MAINTENANCE_DURATION_MS Handler class PlannedMaintenanceWriteHandler -> InstanceOperationMaintenanceWriteHandler Command enum plannedMaintenance -> instanceOperationMaintenance REST path /plannedMaintenance -> /instanceOperationMaintenance Also addresses two PR comments by @thestreak101: * Dropped PLANNED_MAINTENANCE_METADATA entirely. The reason/source/setAtMs fields duplicated InstanceOperation's existing REASON/SOURCE/TIMESTAMP; external logging at the REST layer covers the same audit ground. This also eliminates the InstanceConfigTrimmer special-case for the map field. * Made the two budget forms (absolute and percentage) mutually exclusive at setter time. Writing one when the other is already configured now throws HelixException (same pattern setNumOfflineInstancesForAutoExit uses against MaxOfflineInstancesAllowed). The handler's quota math simplifies: only one form is ever consulted at runtime. * Rephrased the filter "symmetry" comments in BestPossibleStateCalcStage and MaintenanceRecoveryStage. The pre-subtraction baselines differ between the two stages (!UNROUTABLE vs getAssignableInstances) which is pre-existing controller behavior; what's consistent across both is the marker-based subtraction. Tests: TestInstanceConfig: 26 cases (was 31; dropped 5 metadata-specific ones). TestClusterConfig: 6 new cases including mutual-exclusion paths. TestInstanceOperationMaintenanceWriteHandler: 19 cases (was 21; dropped metadata tests, kept all partial-accept + cap + clear coverage). TestInstanceOperationMaintenanceTrimming: 2 cases (was 3; dropped the metadata-MapField trimmer test that no longer applies). TestInstanceOperationMaintenanceBudget: 2 integration cases, real ZK. All 87 helix-core tests + 19 helix-rest tests in the targeted suite still pass.

Prior comment claimed entry's !UNROUTABLE_INSTANCE_OPERATIONS filter includes "EVACUATE/SWAP_OUT" relative to exit's getAssignableInstances. SWAP_OUT does not exist in InstanceConstants.InstanceOperation; the enum is {ENABLE, DISABLE, EVACUATE, SWAP_IN, UNKNOWN}. SWAP_IN is excluded from both baselines (it's in UNROUTABLE and not in ASSIGNABLE), so the only operation that actually differs between the two baselines is EVACUATE. Rephrase the comments in both stages to call out EVACUATE specifically and drop the SWAP_OUT reference.

The baseline-asymmetry caveat will be addressed in a follow-up PR that introduces a single OFFLINE_BUDGET_INSTANCE_OPERATIONS set used at both MM entry and exit. Until then, the comments don't need to call it out; they only need to describe what each stage does and that the marker subtraction is applied identically at both.

thestreak101 · 2026-05-19T14:59:12Z

+    if (absoluteCap >= 0) {
+      effectiveCap = absoluteCap;
+    } else {


what if both are set? Do we have a validation to prevent against that?

I see that we are validating that case while updating the config, can ignore it here for now.

thestreak101 · 2026-05-19T15:02:45Z

+          + "INSTANCE_OPERATION_MAINTENANCE_BUDGET are mutually exclusive; clear the "
+          + "absolute form before setting the percentage form.");
+    }
+    _record.setIntField(


Add validation to ensure % is between 0 and 100 here before writing.

thestreak101 · 2026-05-19T15:06:46Z

This only seems to be testing offline node case, can we add tests for disable, evacuate, unknown etc to validate more cases?

thestreak101 · 2026-05-19T15:13:20Z

+        return JSONRepresentation(body);
+      }
+      // Single-instance call cannot be partial; surface the per-instance reason as 400.
+      String rejectReason = result.getRejected().getOrDefault(instanceName, "rejected");


We can end up masking other errors with 400 code here. If instances is not in rejected list, it should return 400 and should fail with 500 errors instead.

…ISABLE/EVACUATE tests Three changes responding to streak's latest review pass on PR #175: * ClusterConfig.setInstanceOperationMaintenanceBudgetPercentage now rejects values outside [0, 100] (or -1 to clear) at write time. A value above 100 or below -1 was previously stored silently and would have produced a cap larger than the cluster or undefined sentinel behavior; the setter now throws HelixException with an explicit range in the message. * PerInstanceAccessor.setInstanceOperationMaintenance no longer maps every "not applied" outcome to HTTP 400. The handler is supposed to return every input instance in exactly one of applied or rejected; if a caller hits the third case (instance is in neither set), that's a server-side bug, not a client error, so the accessor now returns 500 in that case to surface the bug rather than mask it. * TestInstanceOperationMaintenanceBudget gains two integration cases exercising the marker against non-default InstanceOperation values: DISABLE (instance is in the MM-entry baseline but absent from enabledLive, so it counts toward the budget without a marker) and EVACUATE (same shape, plus EVACUATE is the operation most relevant to the orchestrator's deploy/drain scenarios). UNKNOWN/SWAP_IN are not covered because they're members of UNROUTABLE_INSTANCE_OPERATIONS and never count toward the budget regardless of marker state, so a test would only assert "no-op behaves like no-op". All adjacent tests still pass: 72 helix-core tests (was 100; difference is fewer tests in the targeted run, not regressions) and 19 helix-rest handler tests.

thestreak101

Make sure the tests failures are not related to the change before merging.

LZD-PratyushBhatt · 2026-05-20T01:10:47Z

Make sure the tests failures are not related to the change before merging.

Unrelated
Error: Test failed: testGreedyRebalanceWithGlobalPerInstancePartitionLimit(org.apache.helix.integration.rebalancer.TestGreedyRebalanceWithGlobalPerInstancePartitionLimit) Time elapsed: 6.464 s <<< FAILURE!

LZD-PratyushBhatt requested review from PranaviAncha, arkmish, kabragaurav, laxman-ch, ngngwr, proud-parselmouth and thestreak101 as code owners May 12, 2026 13:37

bellatrix007 reviewed May 13, 2026

View reviewed changes

LZD-PratyushBhatt added 3 commits May 14, 2026 00:39

LZD-PratyushBhatt requested a review from bellatrix007 May 14, 2026 05:09

bellatrix007 previously approved these changes May 19, 2026

View reviewed changes

thestreak101 reviewed May 19, 2026

View reviewed changes

LZD-PratyushBhatt dismissed bellatrix007’s stale review via 942cbd6 May 19, 2026 11:53

LZD-PratyushBhatt added 2 commits May 19, 2026 17:27

thestreak101 reviewed May 19, 2026

View reviewed changes

thestreak101 approved these changes May 19, 2026

View reviewed changes

LZD-PratyushBhatt merged commit 8cb8ae6 into dev May 20, 2026
1 of 2 checks passed

This was referenced May 20, 2026

Add shared offline-budget accessor for auto Maintenance Mode #181

Merged

Make auto-MM entry and exit measure the same offline budget #182

Merged

	public static final int PLANNED_MAINTENANCE_LIMIT_UNSET = -1;
	public static final int PLANNED_MAINTENANCE__INSTANCES_LIMIT_UNSET = -1;

		// omits an explicit expiresAtMillis. Not a default in the "always applied" sense: when
		// unset, writes that omit expiresAtMillis are rejected.

		// The filter is symmetric: the same expression gates both MM entry here and MM exit
		// via MaintenanceRecoveryStage on the next pipeline tick.

		* PLANNED_MAINTENANCE_METADATA is purely audit data (reason/source/setAtMs) and writing it
		* must not trigger a rebalance pipeline.

		MAX_PLANNED_MAINTENANCE_INSTANCES,
		MAX_PLANNED_MAINTENANCE_PERCENTAGE,

Conversation

LZD-PratyushBhatt commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Issues

Description

Tests

Changes that Break Backward Compatibility (Optional)

Documentation (Optional)

Commits

Code Quality

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

LZD-PratyushBhatt commented May 12, 2026 •

edited

Loading