Skip to content

Planned-maintenance budget exemption for auto Maintenance Mode#175

Merged
LZD-PratyushBhatt merged 9 commits into
devfrom
lzd/planned-maintenance-window
May 20, 2026
Merged

Planned-maintenance budget exemption for auto Maintenance Mode#175
LZD-PratyushBhatt merged 9 commits into
devfrom
lzd/planned-maintenance-window

Conversation

@LZD-PratyushBhatt
Copy link
Copy Markdown
Collaborator

@LZD-PratyushBhatt LZD-PratyushBhatt commented May 12, 2026

Issues

  • My PR addresses the following Helix issues and references them in the PR description:

Description

  • Here are some details about my PR, including screenshots of any UI changes:

What. Introduces a per-instance PLANNED_MAINTENANCE_UNTIL_MS marker on InstanceConfig that excludes valid-marker instances from the cluster-wide MAX_OFFLINE_INSTANCES_ALLOWED count that drives auto Maintenance Mode (MM). The filter applies symmetrically on MM entry (BestPossibleStateCalcStage) and MM exit (MaintenanceRecoveryStage).

Why. Helix today is cause-blind: every offline host counts against the budget. Rolling deploys, evacuations, and planned swaps produce false-positive MM triggers, and a static MAX_OFFLINE_INSTANCES_ALLOWED caps the deployment speedup that operators expect when raising replication factor (e.g., Venice's RF=3 to RF=4 change). The marker lets Helix recognize planned offlines while keeping the MM trigger as a safety net for unplanned loss. Helix stays free of orchestration semantics; operation-type reasoning remains in the orchestrator.

How.

  1. InstanceConfig:

    • New PLANNED_MAINTENANCE_UNTIL_MS (SimpleField, long) and PLANNED_MAINTENANCE_METADATA (MapField, audit-only) properties.
    • PLANNED_MAINTENANCE_NOT_SET = -1L sentinel. Setter clears both fields when given a non-positive value.
    • isUnderPlannedMaintenance(long nowMs) helper.
    • Both new fields added to NON_OVERWRITABLE_PROPERTIES so the marker does not transfer onto a swap-in target via overwriteInstanceConfig.
  2. ClusterConfig adds three opt-in fields (no defaults, absent => feature off for that cluster):

    • MAX_PLANNED_MAINTENANCE_INSTANCES (absolute cap on simultaneously marked instances).
    • MAX_PLANNED_MAINTENANCE_PERCENTAGE (percentage cap on simultaneously marked instances; if both caps are set, the stricter one wins).
    • DEFAULT_PLANNED_MAINTENANCE_DURATION_MS (fallback TTL applied when the REST caller omits expiresAtMillis; absent => such writes are rejected with 400).
  3. BestPossibleStateCalcStage.validateInstancesUnableToAcceptOnlineReplicasLimit: builds the offline-budget set in one pass (routable configs minus enabled-live) and drops instances under valid planned-maintenance markers via removeIf(isUnderPlannedMaintenance) before comparing against MAX_OFFLINE_INSTANCES_ALLOWED. Planned-and-alive instances are not double-counted because they were never in the offline set to begin with.

  4. MaintenanceRecoveryStage applies the same subtraction on the auto-exit path so a deploy window does not block recovery.

  5. helix-rest endpoints (mirror the stoppable-check routing style):

    • Single: POST /clusters/{c}/instances/{i}/plannedMaintenance
    • Batch: POST /clusters/{c}/instances?command=plannedMaintenance (adds plannedMaintenance to the Command enum and dispatches via the existing InstancesAccessor.instancesOperations switch, parallel to how ?command=stoppable works).

    Request body (single):

    { "expiresAtMillis": 1776385800000, "reason": "venice-deploy:opId=abc", "source": "AUTOMATION" }

    Batch body adds "instances": ["h1", "h2", ...]. A negative expiresAtMillis clears the marker.

    TTL resolution: caller value if positive; else now + DEFAULT_PLANNED_MAINTENANCE_DURATION_MS if the cluster has it; else 400.

    Partial-accept contract (mirrors batch stoppable). The batch endpoint processes instances in input order against a single snapshot of the current cap quota. Per-instance failures (missing instance, cap exhausted) do not abort the call; they land in the rejected map. HTTP 400 is reserved for caller-side bugs that invalidate the entire request (empty instances, past expiresAtMillis, missing expiresAtMillis with no cluster default).

    Single endpoint response (200 on success):

    { "instance": "h1", "expiresAtMillis": 1776385800000 }

    On a per-instance failure (cap full or instance missing), single returns 400 with the reason since there's nothing to be partial about.

    Batch endpoint response (always 200 for any per-instance outcome):

    {
      "applied":  ["h1", "h2"],
      "rejected": { "h3": "would exceed MAX_PLANNED_MAINTENANCE_INSTANCES=2" },
      "expiresAtMillis": 1776385800000
    }

    The shared PlannedMaintenanceWriteHandler returns a PlannedMaintenanceResult (applied, rejected, resolvedExpiresAtMillis). Existence checks use a single getInstancesInCluster snapshot rather than per-instance config reads (HelixAdmin throws on missing instances), and the same snapshot feeds the cap quota calculation.

Evaluation model. The filter is evaluated lazily on events the controller already handles (instance-config writes, live-instance changes, periodic rebalance). Orchestrator clears re-evaluate immediately via the config write. Real outages re-evaluate via the corresponding live-instance change. A stale marker left by an orchestrator crash is benign on a stable cluster and cascades into MM correctly the moment a genuine outage arrives.

Tests

  • The following tests are written for this issue:
  • helix-core/src/test/java/org/apache/helix/model/TestInstanceConfig.java (10 new cases):

    • testPlannedMaintenanceUnsetByDefault
    • testPlannedMaintenanceSetAndGet
    • testIsUnderPlannedMaintenanceWindow (boundary semantics)
    • testPlannedMaintenanceClearViaNegativeValue
    • testPlannedMaintenanceClearViaZero
    • testPlannedMaintenanceMetadataSetAndGet
    • testPlannedMaintenanceMetadataIsReadOnly
    • testPlannedMaintenanceMetadataNullClears
    • testPlannedMaintenanceMetadataIsDefensiveCopied
    • testPlannedMaintenanceNotTransferredByOverwrite
  • helix-core/src/test/java/org/apache/helix/model/TestClusterConfig.java (3 new cases):

    • testPlannedMaintenanceFieldsUnsetByDefault
    • testPlannedMaintenanceFieldsRoundTrip
    • testPlannedMaintenanceFieldsClearWithSentinel
  • helix-rest/src/test/java/org/apache/helix/rest/clusterMaintenanceService/TestPlannedMaintenanceWriteHandler.java (21 cases, Mockito-based): TTL resolution (caller wins / cluster default / reject when both unset / past-timestamp rejected), set-with-metadata, set-without-audit-fields, dedup, empty/null instance rejection, missing-instance partial-accept (others still applied), cap enforcement under partial-accept semantics (no-op / absolute / boundary / percentage / stricter-of-two / counts existing markers / ignores expired markers / zero-cap rejects everything), clear with negative sentinel, clear bypasses cap, clear with missing instance partial-accept.

  • helix-core/src/test/java/org/apache/helix/integration/rebalancer/TestPlannedMaintenanceBudgetExemption.java (2 integration cases with real ZK):

    • testValidMarkersPreventMMTrigger: two marked offline instances do not trip MM; adding two unmarked offline instances on top does.
    • testExpiredMarkersDoNotExempt: markers with a past untilMs behave as if absent and MM trips.
  • The following is the result of the "mvn test" command on the appropriate module:
Targeted run across helix-core + helix-rest covering all new tests and the
adjacent maintenance-mode regression tests:

  mvn -pl helix-core,helix-rest test \
    -Dtest='TestInstanceConfig,TestClusterConfig,TestPlannedMaintenanceWriteHandler,
    TestPlannedMaintenanceBudgetExemption,TestBestPossibleStateCalcStage,
    TestClusterInMaintenanceModeWhenReachingOfflineInstancesLimit,
    TestClusterMaintenanceMode'

  helix-core: Tests run: 87, Failures: 0, Errors: 0, Skipped: 0
  helix-rest: Tests run: 21, Failures: 0, Errors: 0, Skipped: 0

End-to-end smoke test against a live cluster on localhost:2191 confirmed all paths: TTL resolution (caller / cluster-default / reject-with-no-default / past-timestamp), partial-accept on cap overflow, partial-accept on missing instance, single-endpoint 400 on cap-exhausted, single-endpoint 400 on missing instance, mixed missing + cap-overflow batch.

Changes that Break Backward Compatibility (Optional)

None. All new fields are opt-in:

  • New InstanceConfig fields default to "unset"; older readers ignore unknown ZNRecord keys, and the existing getMaxOfflineInstancesAllowed path is unchanged when no instance carries a marker.
  • New ClusterConfig fields default to "unset"; absent means feature is off for that cluster.
  • The two new REST endpoints are additive; existing endpoints are untouched.
  • NON_OVERWRITABLE_PROPERTIES gains two entries; this only affects callers that explicitly relied on overwriteInstanceConfig transferring planned-maintenance state during swap, which would have been an unintended behavior anyway.

Documentation (Optional)

Design doc: Helix Planned Maintenance Window (per-instance budget exemption).

Commits

  • My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Code Quality

  • My diff has been formatted using helix-style.xml
    (helix-style-intellij.xml if IntelliJ IDE is used)

Introduces a per-instance PLANNED_MAINTENANCE_UNTIL_MS marker on InstanceConfig
that excludes valid-marker instances from MAX_OFFLINE_INSTANCES_ALLOWED while the
marker has not expired. The filter applies symmetrically to MM entry
(BestPossibleStateCalcStage) and MM exit (MaintenanceRecoveryStage). Unplanned
offline loss still trips MM, preserving the safety net.

Cluster-level controls (all opt-in, no defaults):
  - MAX_PLANNED_MAINTENANCE_INSTANCES / MAX_PLANNED_MAINTENANCE_PERCENTAGE caps
    how many instances may carry a marker simultaneously
  - DEFAULT_PLANNED_MAINTENANCE_DURATION_MS supplies a fallback TTL when the
    REST caller omits expiresAtMillis; absent => write rejected with 400

REST endpoints:
  - POST /clusters/{c}/instances/{i}/plannedMaintenance      (single)
  - POST /clusters/{c}/instances/plannedMaintenance:batch    (batch)
A negative expiresAtMillis sentinel clears the marker. Markers are listed in
NON_OVERWRITABLE_PROPERTIES so they do not transfer via swap-in.

Tests:
  - TestInstanceConfig: 10 cases covering get/set, isUnderPlannedMaintenance
    boundary, clear-via-negative, clear-via-zero, metadata defensive copy,
    overwriteInstanceConfig non-transfer
  - TestClusterConfig: 3 cases covering unset sentinels and round-trip
  - TestPlannedMaintenanceWriteHandler: 19 cases covering TTL resolution,
    dedup, pre-write existence validation, cap enforcement (absolute,
    percentage, stricter-of-two, existing markers, expired markers),
    clear-bypasses-cap
  - TestPlannedMaintenanceBudgetExemption: integration test with real ZK
    confirming valid markers prevent MM trigger end-to-end and expired
    markers behave as absent
Earlier draft put the batch endpoint at @path("plannedMaintenance:batch")
on InstancesAccessor. Jersey resolves URLs under /clusters/{c}/instances/
against PerInstanceAccessor's @path("{instanceName}") root resource (more
specific class @path wins), so the batch URL was being captured by
updateInstance and returning "Invalid command: null".

Match the stoppable-check routing pattern instead:
  - Single endpoint stays at /clusters/{c}/instances/{i}/plannedMaintenance
    (PerInstanceAccessor)
  - Batch is now POST /clusters/{c}/instances?command=plannedMaintenance,
    routed through InstancesAccessor.instancesOperations dispatch with a
    new Command enum value and a private batchSetPlannedMaintenance helper
    that mirrors batchGetStoppableInstances

Verified end-to-end against localhost:2191 + REST on :8100 with set,
batch, fallback-TTL, past-expiry-rejected, no-default-rejected, and
cap-exceeded-rejected paths.
Comment on lines +385 to +398
// Subtract instances carrying a valid planned-maintenance marker that are currently
// counted as offline (routable config, not in the enabled-live set). Planned-and-alive
// instances are already absent from the budget and must not be double-subtracted, so we
// intersect the marker filter with the !enabledLive predicate. The filter applies
// symmetrically: same expression gates both MM entry and (via the same code path on the
// next pipeline tick) MM exit.
long nowMs = System.currentTimeMillis();
Set<String> enabledLive = cache.getEnabledLiveInstances();
int plannedAndOfflineCount = (int) cache.getInstanceConfigMap().entrySet().stream()
.filter(instanceEntry -> !InstanceConstants.UNROUTABLE_INSTANCE_OPERATIONS.contains(
instanceEntry.getValue().getInstanceOperation().getOperation()))
.filter(instanceEntry -> instanceEntry.getValue().isUnderPlannedMaintenance(nowMs))
.filter(instanceEntry -> !enabledLive.contains(instanceEntry.getKey()))
.count();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will pretty soon become confusing, lets use simple logic -

  1. Get list of instancesUnableToAcceptOnlineReplicas.
  2. From step 1, the size can be determined for previous offline instances calculation.
  3. From the list of instancesUnableToAcceptOnlineReplicas, remove any intersecting instance under maintenance (no need to go through the list twice).

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Build the offline-budget set first (routable minus enabled-live), then drop planned maintenance instances from it via removeIf. Single pass now.

* unset on the cluster. The auto-MM filter and the REST write path both treat the unset
* value as "feature disabled for this cluster".
*/
public static final int PLANNED_MAINTENANCE_LIMIT_UNSET = -1;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
public static final int PLANNED_MAINTENANCE_LIMIT_UNSET = -1;
public static final int PLANNED_MAINTENANCE__INSTANCES_LIMIT_UNSET = -1;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UNSET seems a bit weird, is this the convention used everywhere?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropped the named constant entirely and using inline -1 to match the MAX_OFFLINE_INSTANCES_ALLOWED convention here.

checked MAX_OFFLINE_INSTANCES_ALLOWED, it just uses inline -1 with no named constant, switched to that. Same for the duration field.

Comment on lines +93 to +94
// omits an explicit expiresAtMillis. Not a default in the "always applied" sense: when
// unset, writes that omit expiresAtMillis are rejected.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a default in the "always applied" sense: when unset, writes that omit expiresAtMillis are rejected.
This complete sentence feels odd, lets rephrase it.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rephrased

HELIX_INSTANCE_OPERATIONS,
INSTANCE_OPERATION_STATE
INSTANCE_OPERATION_STATE,
PLANNED_MAINTENANCE_UNTIL_MS,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this value be part of the metadata itself?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kept it separate intentionally. The timestamp drives the filter on every controller tick so it's the hot path. SimpleField + getLongField is a typed direct read. If we move it into the map we have to parse the map + string to long every check. Also keeps the load bearing field (what affects MM) separate from the pure audit fields (reason/source/setAt). Lmk if you feel strongly though.

if (clear) {
for (String instanceName : deduped) {
InstanceConfig cfg = loadInstanceConfig(clusterId, instanceName);
cfg.setPlannedMaintenanceUntilMs(InstanceConfig.PLANNED_MAINTENANCE_NOT_SET);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of implicitly removing this, can this be a more explicit method?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extracted into assertInstancesExist(clusterId, deduped)

}
return callerExpiresAtMillis;
}
long defaultDuration = clusterConfig.getDefaultPlannedMaintenanceDurationMs();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I am reading this incorrectly, but based on the above conditions, we can reach here if callerExpiresAtMillis <= 0 correct? which is essentially the unset condition. Wont defaultDuration be -1 in such cases?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah you're right, defaultDuration is -1 when unset. The > 0L check worked but was implicit. Changed to compare against the -1L directly with a comment so the "feature off" branch is explicit.

Comment on lines +217 to +220
throw new BadRequestException(String.format(
"Write would push planned-maintenance count to %d, exceeds cluster cap %d "
+ "(absoluteCap=%d, percentageCap=%d, clusterSize=%d)",
projectedMarked, effectiveCap, absoluteCap, percentageCap, allInstances.size()));
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am worried if this cap can introduce failures which will block the complete deployment in case these values are not set properly.
Either we should gracefully handle this and add the marker for instance within limit or while setting the cap values, be more restrictive / opinionated (which is hard to enforce).

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

partial accept kind of defeats the purpose. If we mark half the batch and the orchestrator proceeds with the deploy anyway, the unmarked half still trips MM. Hard reject gives the orchestrator a clear signal to slow down or fix the cap. The error response already includes current count + projected + cap so it's actionable.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACtually, thought about it again, made sense to partial accept just like we do for batch stoppable checks also. batch processes instances in input order, applies what fits the cap, returns {applied, rejected, expiresAtMillis} with per-instance reasons. 400 is now only for request body issues. Single endpoint stays binary 200/400 since there's nothing to be partial about. Updated 21 handler tests and verified end to end against a live cluster.

return OK();
}

/**
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking of 1 more endpoint, lmk if that makes sense.
Can we have and endpoint which essentially give me a number of actual offline instances which are counted by Helix? At present, I dont think there is any such thing present. Clients have to do this calculation on their own, which can change based on internal logic but client wont know.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion, agree clients shouldn't have to recompute this. Will do as a follow up PR to keep this one scoped.

… existence check

- BestPossibleStateCalcStage: collapse the two-stream offline filter into
  one: build the offline-budget set (routable minus enabled-live) then
  drop instances under valid planned-maintenance markers via
  removeIf(isUnderPlannedMaintenance).
- ClusterConfig: drop named UNSET constants
  (PLANNED_MAINTENANCE_LIMIT_UNSET / _DURATION_UNSET); use inline -1 to
  match the MAX_OFFLINE_INSTANCES_ALLOWED convention.
- ClusterConfig: rephrase the DEFAULT_PLANNED_MAINTENANCE_DURATION_MS
  comment to drop the awkward "always applied" phrasing.
- PlannedMaintenanceWriteHandler: extract the pre-write existence check
  into assertInstancesExist for readability; the implicit "load and
  throw" loop is gone.
- PlannedMaintenanceWriteHandler: compare the cluster default to the
  literal -1L sentinel rather than > 0L, so the "feature off" branch is
  explicit. Adds a comment.
- TestClusterConfig: inline -1 / -1L in lieu of the removed constants.

All targeted tests still pass (87 in helix-core, 19 in helix-rest).
Following PR feedback: the batch endpoint now mirrors the partial-accept
contract used by the batch stoppable check. Instances are processed in
input order; those that fit the cap quota (or that exist on a clear) are
returned under "applied", the rest under "rejected" keyed by instance
name with a free-form reason.

Handler changes
- applyPlannedMaintenance returns a PlannedMaintenanceResult value
  (applied list, rejected map, resolved expiry) instead of a bare long.
- Missing-instance check switched from HelixAdmin.getInstanceConfig
  (which throws for missing instances) to a single
  HelixAdmin.getInstancesInCluster snapshot. Same snapshot feeds the cap
  quota calculation, so we no longer pay two passes of per-instance
  reads for the existence check.
- Cap enforcement walks candidates in input order against a precomputed
  remaining-quota, rejecting overflow rather than throwing.
- BadRequestException is now reserved for caller-side bugs (empty input,
  malformed expiry, missing cluster default). Per-instance failures
  surface through the rejected map.

REST changes
- POST /clusters/{c}/instances/{i}/plannedMaintenance (single) stays
  binary: 200 with {instance, expiresAtMillis} on apply, 400 with the
  per-instance reason on reject.
- POST /clusters/{c}/instances?command=plannedMaintenance (batch) now
  always returns 200 with {applied, rejected, expiresAtMillis} for any
  per-instance outcome; 400 is reserved for invalid request bodies.

Tests
- TestPlannedMaintenanceWriteHandler rewritten to assert the new result
  shape: applied/rejected contents, per-instance reject reasons, and
  zero-cap edge cases. 21 tests, all green.
- All other adjacent tests (TestInstanceConfig, TestClusterConfig,
  TestBestPossibleStateCalcStage, TestPlannedMaintenanceBudgetExemption,
  TestClusterInMaintenanceModeWhenReachingOfflineInstancesLimit) still
  pass unchanged.

Verified end-to-end against localhost:2191 + REST on :8100: partial
accept with mixed missing-instance + cap-overflow inputs returns 200
with the right applied/rejected partitioning; single endpoint returns
400 with the per-instance reason when the cap is exhausted.
ClusterConfig and InstanceConfig non-topology fields are already trimmed
out of ResourceChangeSnapshot by default (the trimmers' whitelist of
"topology-related" fields excludes them, so simple-field values for
PLANNED_MAINTENANCE_UNTIL_MS and the new cluster-cap fields never reach
the snapshot comparison).

InstanceConfigTrimmer.getNonTrimmableKeys, however, preserves the keys
of every map and list field by default (only the values are emptied).
That means writing PLANNED_MAINTENANCE_METADATA would still flip the
snapshot from {no key} to {key: {}}, which the change detector reads as
a real change and triggers a rebalance pipeline. Same pattern that
HELIX_INSTANCE_OPERATIONS handles via its existing remove() override.

This commit adds PLANNED_MAINTENANCE_METADATA to the same remove() path
so marker writes don't trigger spurious rebalance work. While doing
that, switch the override from mutating super's live keySet() views
(which would propagate to the source InstanceConfig record) to copying
into fresh HashSet instances first. Behavior is unchanged for callers
that already had clean separation between source and trimmed snapshot,
but the override is now safe regardless of whether the caller reuses
the InstanceConfig instance.

Adds TestPlannedMaintenanceTrimming to lock in the contract: marker
writes on both InstanceConfig (untilMs simple field, metadata map field)
and ClusterConfig (cap fields, default duration) are trimmed away
before change-detection sees them. All 100 adjacent tests still pass.
bellatrix007
bellatrix007 previously approved these changes May 19, 2026
Comment on lines +381 to +382
// The filter is symmetric: the same expression gates both MM entry here and MM exit
// via MaintenanceRecoveryStage on the next pipeline tick.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not true right? BestPossibleStateCalcStage is filtering on not UNROUTABLE and MaintenanceRecoveryStage is filtering on getAssignableInstances(doesn't have evacuate and swap_out).

Comment on lines +71 to +72
* PLANNED_MAINTENANCE_METADATA is purely audit data (reason/source/setAtMs) and writing it
* must not trigger a rebalance pipeline.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if same operation was set without this and then another one sets the same operation but with metadata this time?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed metadata

// excluding planned-maintenance instances that are not currently live. This matches
// the filter applied at MM entry in BestPossibleStateCalcStage and keeps a deploy
// window from blocking auto-recovery.
Set<String> assignable = cache.getAssignableInstances();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Base is different from baseline, doesn't have evacuation here.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pre existing from long back, will revisit in a different PR

Comment on lines +90 to +91
MAX_PLANNED_MAINTENANCE_INSTANCES,
MAX_PLANNED_MAINTENANCE_PERCENTAGE,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't allow setting both, we should have validation that only one of these can be set in the config.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

// Fallback TTL (in millis) applied by the planned-maintenance write path when the
// caller omits expiresAtMillis. When unset (-1), callers must always supply an
// explicit expiresAtMillis; otherwise the write is rejected.
DEFAULT_PLANNED_MAINTENANCE_DURATION_MS,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Name needs to change to config confuse with cluster maintenance mode.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

INSTANCE_OPERATION_STATE
INSTANCE_OPERATION_STATE,
PLANNED_MAINTENANCE_UNTIL_MS,
PLANNED_MAINTENANCE_METADATA
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

* Keys used inside the PLANNED_MAINTENANCE_METADATA map field. The map is purely audit
* data; it does not influence the budget-exemption decision.
*/
public static final class PlannedMaintenanceMetadataKey {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just instance operation metadata, are these field in anyway related to maintenance?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

Comment on lines +87 to +89
public static final String REASON = "reason";
public static final String SOURCE = "source";
public static final String SET_AT_MS = "setAtMs";
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aren't these already present in instance config?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

* Sentinel returned by {@link #getPlannedMaintenanceUntilMs()} when no marker is set.
* Callers clear the marker by passing this value to {@link #setPlannedMaintenanceUntilMs(long)}.
*/
public static final long PLANNED_MAINTENANCE_NOT_SET = -1L;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also update this name depending on decision on other field.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change name???

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

…date mutually exclusive budget forms

Renames the per-instance budget exemption feature throughout the codebase
to anchor on the existing HELIX_INSTANCE_OPERATIONS concept rather than
overload "Maintenance" with Helix's cluster-wide Maintenance Mode. The
INSTANCE_OPERATION_ prefix disambiguates from MM while keeping the
"maintenance window" semantic the orchestrator actually wants.

Rename map:
  Field/Method/Class                           Before -> After
  --------------------------------------------------------------------
  InstanceConfig field         PLANNED_MAINTENANCE_UNTIL_MS
                              -> INSTANCE_OPERATION_MAINTENANCE_UNTIL_MS
  InstanceConfig sentinel      PLANNED_MAINTENANCE_NOT_SET
                              -> INSTANCE_OPERATION_MAINTENANCE_NOT_SET
  InstanceConfig methods       isUnderPlannedMaintenance(now)
                              -> isUnderInstanceOperationMaintenance(now)
                              (and the setter/getter pair to match)
  ClusterConfig absolute cap   MAX_PLANNED_MAINTENANCE_INSTANCES
                              -> INSTANCE_OPERATION_MAINTENANCE_BUDGET
  ClusterConfig percent cap    MAX_PLANNED_MAINTENANCE_PERCENTAGE
                              -> INSTANCE_OPERATION_MAINTENANCE_BUDGET_PERCENTAGE
  ClusterConfig default TTL    DEFAULT_PLANNED_MAINTENANCE_DURATION_MS
                              -> DEFAULT_INSTANCE_OPERATION_MAINTENANCE_DURATION_MS
  Handler class                PlannedMaintenanceWriteHandler
                              -> InstanceOperationMaintenanceWriteHandler
  Command enum                 plannedMaintenance
                              -> instanceOperationMaintenance
  REST path                    /plannedMaintenance
                              -> /instanceOperationMaintenance

Also addresses two PR comments by @thestreak101:

* Dropped PLANNED_MAINTENANCE_METADATA entirely. The reason/source/setAtMs
  fields duplicated InstanceOperation's existing REASON/SOURCE/TIMESTAMP;
  external logging at the REST layer covers the same audit ground. This
  also eliminates the InstanceConfigTrimmer special-case for the map field.

* Made the two budget forms (absolute and percentage) mutually exclusive
  at setter time. Writing one when the other is already configured now
  throws HelixException (same pattern setNumOfflineInstancesForAutoExit
  uses against MaxOfflineInstancesAllowed). The handler's quota math
  simplifies: only one form is ever consulted at runtime.

* Rephrased the filter "symmetry" comments in BestPossibleStateCalcStage
  and MaintenanceRecoveryStage. The pre-subtraction baselines differ
  between the two stages (!UNROUTABLE vs getAssignableInstances) which is
  pre-existing controller behavior; what's consistent across both is the
  marker-based subtraction.

Tests:
  TestInstanceConfig: 26 cases (was 31; dropped 5 metadata-specific ones).
  TestClusterConfig: 6 new cases including mutual-exclusion paths.
  TestInstanceOperationMaintenanceWriteHandler: 19 cases (was 21; dropped
    metadata tests, kept all partial-accept + cap + clear coverage).
  TestInstanceOperationMaintenanceTrimming: 2 cases (was 3; dropped the
    metadata-MapField trimmer test that no longer applies).
  TestInstanceOperationMaintenanceBudget: 2 integration cases, real ZK.

All 87 helix-core tests + 19 helix-rest tests in the targeted suite still
pass.
Prior comment claimed entry's !UNROUTABLE_INSTANCE_OPERATIONS filter
includes "EVACUATE/SWAP_OUT" relative to exit's getAssignableInstances.
SWAP_OUT does not exist in InstanceConstants.InstanceOperation; the enum
is {ENABLE, DISABLE, EVACUATE, SWAP_IN, UNKNOWN}. SWAP_IN is excluded
from both baselines (it's in UNROUTABLE and not in ASSIGNABLE), so the
only operation that actually differs between the two baselines is
EVACUATE.

Rephrase the comments in both stages to call out EVACUATE specifically
and drop the SWAP_OUT reference.
The baseline-asymmetry caveat will be addressed in a follow-up PR that
introduces a single OFFLINE_BUDGET_INSTANCE_OPERATIONS set used at both
MM entry and exit. Until then, the comments don't need to call it out;
they only need to describe what each stage does and that the marker
subtraction is applied identically at both.
Comment on lines +262 to +264
if (absoluteCap >= 0) {
effectiveCap = absoluteCap;
} else {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if both are set? Do we have a validation to prevent against that?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that we are validating that case while updating the config, can ignore it here for now.

+ "INSTANCE_OPERATION_MAINTENANCE_BUDGET are mutually exclusive; clear the "
+ "absolute form before setting the percentage form.");
}
_record.setIntField(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add validation to ensure % is between 0 and 100 here before writing.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only seems to be testing offline node case, can we add tests for disable, evacuate, unknown etc to validate more cases?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

return JSONRepresentation(body);
}
// Single-instance call cannot be partial; surface the per-instance reason as 400.
String rejectReason = result.getRejected().getOrDefault(instanceName, "rejected");
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can end up masking other errors with 400 code here. If instances is not in rejected list, it should return 400 and should fail with 500 errors instead.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

…ISABLE/EVACUATE tests

Three changes responding to streak's latest review pass on PR #175:

* ClusterConfig.setInstanceOperationMaintenanceBudgetPercentage now
  rejects values outside [0, 100] (or -1 to clear) at write time. A
  value above 100 or below -1 was previously stored silently and would
  have produced a cap larger than the cluster or undefined sentinel
  behavior; the setter now throws HelixException with an explicit
  range in the message.

* PerInstanceAccessor.setInstanceOperationMaintenance no longer maps
  every "not applied" outcome to HTTP 400. The handler is supposed to
  return every input instance in exactly one of applied or rejected;
  if a caller hits the third case (instance is in neither set), that's
  a server-side bug, not a client error, so the accessor now returns
  500 in that case to surface the bug rather than mask it.

* TestInstanceOperationMaintenanceBudget gains two integration cases
  exercising the marker against non-default InstanceOperation values:
  DISABLE (instance is in the MM-entry baseline but absent from
  enabledLive, so it counts toward the budget without a marker) and
  EVACUATE (same shape, plus EVACUATE is the operation most relevant
  to the orchestrator's deploy/drain scenarios). UNKNOWN/SWAP_IN are
  not covered because they're members of UNROUTABLE_INSTANCE_OPERATIONS
  and never count toward the budget regardless of marker state, so a
  test would only assert "no-op behaves like no-op".

All adjacent tests still pass: 72 helix-core tests (was 100; difference
is fewer tests in the targeted run, not regressions) and 19 helix-rest
handler tests.
Copy link
Copy Markdown
Collaborator

@thestreak101 thestreak101 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure the tests failures are not related to the change before merging.

@LZD-PratyushBhatt
Copy link
Copy Markdown
Collaborator Author

Make sure the tests failures are not related to the change before merging.

Unrelated
Error: Test failed: testGreedyRebalanceWithGlobalPerInstancePartitionLimit(org.apache.helix.integration.rebalancer.TestGreedyRebalanceWithGlobalPerInstancePartitionLimit) Time elapsed: 6.464 s <<< FAILURE!

@LZD-PratyushBhatt LZD-PratyushBhatt merged commit 8cb8ae6 into dev May 20, 2026
1 of 2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants