Planned-maintenance budget exemption for auto Maintenance Mode#175
Conversation
Introduces a per-instance PLANNED_MAINTENANCE_UNTIL_MS marker on InstanceConfig
that excludes valid-marker instances from MAX_OFFLINE_INSTANCES_ALLOWED while the
marker has not expired. The filter applies symmetrically to MM entry
(BestPossibleStateCalcStage) and MM exit (MaintenanceRecoveryStage). Unplanned
offline loss still trips MM, preserving the safety net.
Cluster-level controls (all opt-in, no defaults):
- MAX_PLANNED_MAINTENANCE_INSTANCES / MAX_PLANNED_MAINTENANCE_PERCENTAGE caps
how many instances may carry a marker simultaneously
- DEFAULT_PLANNED_MAINTENANCE_DURATION_MS supplies a fallback TTL when the
REST caller omits expiresAtMillis; absent => write rejected with 400
REST endpoints:
- POST /clusters/{c}/instances/{i}/plannedMaintenance (single)
- POST /clusters/{c}/instances/plannedMaintenance:batch (batch)
A negative expiresAtMillis sentinel clears the marker. Markers are listed in
NON_OVERWRITABLE_PROPERTIES so they do not transfer via swap-in.
Tests:
- TestInstanceConfig: 10 cases covering get/set, isUnderPlannedMaintenance
boundary, clear-via-negative, clear-via-zero, metadata defensive copy,
overwriteInstanceConfig non-transfer
- TestClusterConfig: 3 cases covering unset sentinels and round-trip
- TestPlannedMaintenanceWriteHandler: 19 cases covering TTL resolution,
dedup, pre-write existence validation, cap enforcement (absolute,
percentage, stricter-of-two, existing markers, expired markers),
clear-bypasses-cap
- TestPlannedMaintenanceBudgetExemption: integration test with real ZK
confirming valid markers prevent MM trigger end-to-end and expired
markers behave as absent
Earlier draft put the batch endpoint at @path("plannedMaintenance:batch") on InstancesAccessor. Jersey resolves URLs under /clusters/{c}/instances/ against PerInstanceAccessor's @path("{instanceName}") root resource (more specific class @path wins), so the batch URL was being captured by updateInstance and returning "Invalid command: null". Match the stoppable-check routing pattern instead: - Single endpoint stays at /clusters/{c}/instances/{i}/plannedMaintenance (PerInstanceAccessor) - Batch is now POST /clusters/{c}/instances?command=plannedMaintenance, routed through InstancesAccessor.instancesOperations dispatch with a new Command enum value and a private batchSetPlannedMaintenance helper that mirrors batchGetStoppableInstances Verified end-to-end against localhost:2191 + REST on :8100 with set, batch, fallback-TTL, past-expiry-rejected, no-default-rejected, and cap-exceeded-rejected paths.
| // Subtract instances carrying a valid planned-maintenance marker that are currently | ||
| // counted as offline (routable config, not in the enabled-live set). Planned-and-alive | ||
| // instances are already absent from the budget and must not be double-subtracted, so we | ||
| // intersect the marker filter with the !enabledLive predicate. The filter applies | ||
| // symmetrically: same expression gates both MM entry and (via the same code path on the | ||
| // next pipeline tick) MM exit. | ||
| long nowMs = System.currentTimeMillis(); | ||
| Set<String> enabledLive = cache.getEnabledLiveInstances(); | ||
| int plannedAndOfflineCount = (int) cache.getInstanceConfigMap().entrySet().stream() | ||
| .filter(instanceEntry -> !InstanceConstants.UNROUTABLE_INSTANCE_OPERATIONS.contains( | ||
| instanceEntry.getValue().getInstanceOperation().getOperation())) | ||
| .filter(instanceEntry -> instanceEntry.getValue().isUnderPlannedMaintenance(nowMs)) | ||
| .filter(instanceEntry -> !enabledLive.contains(instanceEntry.getKey())) | ||
| .count(); |
There was a problem hiding this comment.
This will pretty soon become confusing, lets use simple logic -
- Get list of
instancesUnableToAcceptOnlineReplicas. - From step 1, the size can be determined for previous offline instances calculation.
- From the list of
instancesUnableToAcceptOnlineReplicas, remove any intersecting instance under maintenance (no need to go through the list twice).
There was a problem hiding this comment.
Done. Build the offline-budget set first (routable minus enabled-live), then drop planned maintenance instances from it via removeIf. Single pass now.
| * unset on the cluster. The auto-MM filter and the REST write path both treat the unset | ||
| * value as "feature disabled for this cluster". | ||
| */ | ||
| public static final int PLANNED_MAINTENANCE_LIMIT_UNSET = -1; |
There was a problem hiding this comment.
nit:
| public static final int PLANNED_MAINTENANCE_LIMIT_UNSET = -1; | |
| public static final int PLANNED_MAINTENANCE__INSTANCES_LIMIT_UNSET = -1; |
There was a problem hiding this comment.
UNSET seems a bit weird, is this the convention used everywhere?
There was a problem hiding this comment.
Dropped the named constant entirely and using inline -1 to match the MAX_OFFLINE_INSTANCES_ALLOWED convention here.
checked MAX_OFFLINE_INSTANCES_ALLOWED, it just uses inline -1 with no named constant, switched to that. Same for the duration field.
| // omits an explicit expiresAtMillis. Not a default in the "always applied" sense: when | ||
| // unset, writes that omit expiresAtMillis are rejected. |
There was a problem hiding this comment.
Not a default in the "always applied" sense: when unset, writes that omit expiresAtMillis are rejected.
This complete sentence feels odd, lets rephrase it.
There was a problem hiding this comment.
rephrased
| HELIX_INSTANCE_OPERATIONS, | ||
| INSTANCE_OPERATION_STATE | ||
| INSTANCE_OPERATION_STATE, | ||
| PLANNED_MAINTENANCE_UNTIL_MS, |
There was a problem hiding this comment.
Can this value be part of the metadata itself?
There was a problem hiding this comment.
Kept it separate intentionally. The timestamp drives the filter on every controller tick so it's the hot path. SimpleField + getLongField is a typed direct read. If we move it into the map we have to parse the map + string to long every check. Also keeps the load bearing field (what affects MM) separate from the pure audit fields (reason/source/setAt). Lmk if you feel strongly though.
| if (clear) { | ||
| for (String instanceName : deduped) { | ||
| InstanceConfig cfg = loadInstanceConfig(clusterId, instanceName); | ||
| cfg.setPlannedMaintenanceUntilMs(InstanceConfig.PLANNED_MAINTENANCE_NOT_SET); |
There was a problem hiding this comment.
instead of implicitly removing this, can this be a more explicit method?
There was a problem hiding this comment.
Extracted into assertInstancesExist(clusterId, deduped)
| } | ||
| return callerExpiresAtMillis; | ||
| } | ||
| long defaultDuration = clusterConfig.getDefaultPlannedMaintenanceDurationMs(); |
There was a problem hiding this comment.
Maybe I am reading this incorrectly, but based on the above conditions, we can reach here if callerExpiresAtMillis <= 0 correct? which is essentially the unset condition. Wont defaultDuration be -1 in such cases?
There was a problem hiding this comment.
Yeah you're right, defaultDuration is -1 when unset. The > 0L check worked but was implicit. Changed to compare against the -1L directly with a comment so the "feature off" branch is explicit.
| throw new BadRequestException(String.format( | ||
| "Write would push planned-maintenance count to %d, exceeds cluster cap %d " | ||
| + "(absoluteCap=%d, percentageCap=%d, clusterSize=%d)", | ||
| projectedMarked, effectiveCap, absoluteCap, percentageCap, allInstances.size())); |
There was a problem hiding this comment.
I am worried if this cap can introduce failures which will block the complete deployment in case these values are not set properly.
Either we should gracefully handle this and add the marker for instance within limit or while setting the cap values, be more restrictive / opinionated (which is hard to enforce).
There was a problem hiding this comment.
partial accept kind of defeats the purpose. If we mark half the batch and the orchestrator proceeds with the deploy anyway, the unmarked half still trips MM. Hard reject gives the orchestrator a clear signal to slow down or fix the cap. The error response already includes current count + projected + cap so it's actionable.
There was a problem hiding this comment.
ACtually, thought about it again, made sense to partial accept just like we do for batch stoppable checks also. batch processes instances in input order, applies what fits the cap, returns {applied, rejected, expiresAtMillis} with per-instance reasons. 400 is now only for request body issues. Single endpoint stays binary 200/400 since there's nothing to be partial about. Updated 21 handler tests and verified end to end against a live cluster.
| return OK(); | ||
| } | ||
|
|
||
| /** |
There was a problem hiding this comment.
I was thinking of 1 more endpoint, lmk if that makes sense.
Can we have and endpoint which essentially give me a number of actual offline instances which are counted by Helix? At present, I dont think there is any such thing present. Clients have to do this calculation on their own, which can change based on internal logic but client wont know.
There was a problem hiding this comment.
Good suggestion, agree clients shouldn't have to recompute this. Will do as a follow up PR to keep this one scoped.
… existence check - BestPossibleStateCalcStage: collapse the two-stream offline filter into one: build the offline-budget set (routable minus enabled-live) then drop instances under valid planned-maintenance markers via removeIf(isUnderPlannedMaintenance). - ClusterConfig: drop named UNSET constants (PLANNED_MAINTENANCE_LIMIT_UNSET / _DURATION_UNSET); use inline -1 to match the MAX_OFFLINE_INSTANCES_ALLOWED convention. - ClusterConfig: rephrase the DEFAULT_PLANNED_MAINTENANCE_DURATION_MS comment to drop the awkward "always applied" phrasing. - PlannedMaintenanceWriteHandler: extract the pre-write existence check into assertInstancesExist for readability; the implicit "load and throw" loop is gone. - PlannedMaintenanceWriteHandler: compare the cluster default to the literal -1L sentinel rather than > 0L, so the "feature off" branch is explicit. Adds a comment. - TestClusterConfig: inline -1 / -1L in lieu of the removed constants. All targeted tests still pass (87 in helix-core, 19 in helix-rest).
Following PR feedback: the batch endpoint now mirrors the partial-accept
contract used by the batch stoppable check. Instances are processed in
input order; those that fit the cap quota (or that exist on a clear) are
returned under "applied", the rest under "rejected" keyed by instance
name with a free-form reason.
Handler changes
- applyPlannedMaintenance returns a PlannedMaintenanceResult value
(applied list, rejected map, resolved expiry) instead of a bare long.
- Missing-instance check switched from HelixAdmin.getInstanceConfig
(which throws for missing instances) to a single
HelixAdmin.getInstancesInCluster snapshot. Same snapshot feeds the cap
quota calculation, so we no longer pay two passes of per-instance
reads for the existence check.
- Cap enforcement walks candidates in input order against a precomputed
remaining-quota, rejecting overflow rather than throwing.
- BadRequestException is now reserved for caller-side bugs (empty input,
malformed expiry, missing cluster default). Per-instance failures
surface through the rejected map.
REST changes
- POST /clusters/{c}/instances/{i}/plannedMaintenance (single) stays
binary: 200 with {instance, expiresAtMillis} on apply, 400 with the
per-instance reason on reject.
- POST /clusters/{c}/instances?command=plannedMaintenance (batch) now
always returns 200 with {applied, rejected, expiresAtMillis} for any
per-instance outcome; 400 is reserved for invalid request bodies.
Tests
- TestPlannedMaintenanceWriteHandler rewritten to assert the new result
shape: applied/rejected contents, per-instance reject reasons, and
zero-cap edge cases. 21 tests, all green.
- All other adjacent tests (TestInstanceConfig, TestClusterConfig,
TestBestPossibleStateCalcStage, TestPlannedMaintenanceBudgetExemption,
TestClusterInMaintenanceModeWhenReachingOfflineInstancesLimit) still
pass unchanged.
Verified end-to-end against localhost:2191 + REST on :8100: partial
accept with mixed missing-instance + cap-overflow inputs returns 200
with the right applied/rejected partitioning; single endpoint returns
400 with the per-instance reason when the cap is exhausted.
ClusterConfig and InstanceConfig non-topology fields are already trimmed
out of ResourceChangeSnapshot by default (the trimmers' whitelist of
"topology-related" fields excludes them, so simple-field values for
PLANNED_MAINTENANCE_UNTIL_MS and the new cluster-cap fields never reach
the snapshot comparison).
InstanceConfigTrimmer.getNonTrimmableKeys, however, preserves the keys
of every map and list field by default (only the values are emptied).
That means writing PLANNED_MAINTENANCE_METADATA would still flip the
snapshot from {no key} to {key: {}}, which the change detector reads as
a real change and triggers a rebalance pipeline. Same pattern that
HELIX_INSTANCE_OPERATIONS handles via its existing remove() override.
This commit adds PLANNED_MAINTENANCE_METADATA to the same remove() path
so marker writes don't trigger spurious rebalance work. While doing
that, switch the override from mutating super's live keySet() views
(which would propagate to the source InstanceConfig record) to copying
into fresh HashSet instances first. Behavior is unchanged for callers
that already had clean separation between source and trimmed snapshot,
but the override is now safe regardless of whether the caller reuses
the InstanceConfig instance.
Adds TestPlannedMaintenanceTrimming to lock in the contract: marker
writes on both InstanceConfig (untilMs simple field, metadata map field)
and ClusterConfig (cap fields, default duration) are trimmed away
before change-detection sees them. All 100 adjacent tests still pass.
| // The filter is symmetric: the same expression gates both MM entry here and MM exit | ||
| // via MaintenanceRecoveryStage on the next pipeline tick. |
There was a problem hiding this comment.
Not true right? BestPossibleStateCalcStage is filtering on not UNROUTABLE and MaintenanceRecoveryStage is filtering on getAssignableInstances(doesn't have evacuate and swap_out).
| * PLANNED_MAINTENANCE_METADATA is purely audit data (reason/source/setAtMs) and writing it | ||
| * must not trigger a rebalance pipeline. |
There was a problem hiding this comment.
What happens if same operation was set without this and then another one sets the same operation but with metadata this time?
There was a problem hiding this comment.
Removed metadata
| // excluding planned-maintenance instances that are not currently live. This matches | ||
| // the filter applied at MM entry in BestPossibleStateCalcStage and keeps a deploy | ||
| // window from blocking auto-recovery. | ||
| Set<String> assignable = cache.getAssignableInstances(); |
There was a problem hiding this comment.
Base is different from baseline, doesn't have evacuation here.
There was a problem hiding this comment.
pre existing from long back, will revisit in a different PR
| MAX_PLANNED_MAINTENANCE_INSTANCES, | ||
| MAX_PLANNED_MAINTENANCE_PERCENTAGE, |
There was a problem hiding this comment.
Don't allow setting both, we should have validation that only one of these can be set in the config.
| // Fallback TTL (in millis) applied by the planned-maintenance write path when the | ||
| // caller omits expiresAtMillis. When unset (-1), callers must always supply an | ||
| // explicit expiresAtMillis; otherwise the write is rejected. | ||
| DEFAULT_PLANNED_MAINTENANCE_DURATION_MS, |
There was a problem hiding this comment.
Name needs to change to config confuse with cluster maintenance mode.
| INSTANCE_OPERATION_STATE | ||
| INSTANCE_OPERATION_STATE, | ||
| PLANNED_MAINTENANCE_UNTIL_MS, | ||
| PLANNED_MAINTENANCE_METADATA |
| * Keys used inside the PLANNED_MAINTENANCE_METADATA map field. The map is purely audit | ||
| * data; it does not influence the budget-exemption decision. | ||
| */ | ||
| public static final class PlannedMaintenanceMetadataKey { |
There was a problem hiding this comment.
Why not just instance operation metadata, are these field in anyway related to maintenance?
| public static final String REASON = "reason"; | ||
| public static final String SOURCE = "source"; | ||
| public static final String SET_AT_MS = "setAtMs"; |
There was a problem hiding this comment.
Aren't these already present in instance config?
| * Sentinel returned by {@link #getPlannedMaintenanceUntilMs()} when no marker is set. | ||
| * Callers clear the marker by passing this value to {@link #setPlannedMaintenanceUntilMs(long)}. | ||
| */ | ||
| public static final long PLANNED_MAINTENANCE_NOT_SET = -1L; |
There was a problem hiding this comment.
also update this name depending on decision on other field.
…date mutually exclusive budget forms
Renames the per-instance budget exemption feature throughout the codebase
to anchor on the existing HELIX_INSTANCE_OPERATIONS concept rather than
overload "Maintenance" with Helix's cluster-wide Maintenance Mode. The
INSTANCE_OPERATION_ prefix disambiguates from MM while keeping the
"maintenance window" semantic the orchestrator actually wants.
Rename map:
Field/Method/Class Before -> After
--------------------------------------------------------------------
InstanceConfig field PLANNED_MAINTENANCE_UNTIL_MS
-> INSTANCE_OPERATION_MAINTENANCE_UNTIL_MS
InstanceConfig sentinel PLANNED_MAINTENANCE_NOT_SET
-> INSTANCE_OPERATION_MAINTENANCE_NOT_SET
InstanceConfig methods isUnderPlannedMaintenance(now)
-> isUnderInstanceOperationMaintenance(now)
(and the setter/getter pair to match)
ClusterConfig absolute cap MAX_PLANNED_MAINTENANCE_INSTANCES
-> INSTANCE_OPERATION_MAINTENANCE_BUDGET
ClusterConfig percent cap MAX_PLANNED_MAINTENANCE_PERCENTAGE
-> INSTANCE_OPERATION_MAINTENANCE_BUDGET_PERCENTAGE
ClusterConfig default TTL DEFAULT_PLANNED_MAINTENANCE_DURATION_MS
-> DEFAULT_INSTANCE_OPERATION_MAINTENANCE_DURATION_MS
Handler class PlannedMaintenanceWriteHandler
-> InstanceOperationMaintenanceWriteHandler
Command enum plannedMaintenance
-> instanceOperationMaintenance
REST path /plannedMaintenance
-> /instanceOperationMaintenance
Also addresses two PR comments by @thestreak101:
* Dropped PLANNED_MAINTENANCE_METADATA entirely. The reason/source/setAtMs
fields duplicated InstanceOperation's existing REASON/SOURCE/TIMESTAMP;
external logging at the REST layer covers the same audit ground. This
also eliminates the InstanceConfigTrimmer special-case for the map field.
* Made the two budget forms (absolute and percentage) mutually exclusive
at setter time. Writing one when the other is already configured now
throws HelixException (same pattern setNumOfflineInstancesForAutoExit
uses against MaxOfflineInstancesAllowed). The handler's quota math
simplifies: only one form is ever consulted at runtime.
* Rephrased the filter "symmetry" comments in BestPossibleStateCalcStage
and MaintenanceRecoveryStage. The pre-subtraction baselines differ
between the two stages (!UNROUTABLE vs getAssignableInstances) which is
pre-existing controller behavior; what's consistent across both is the
marker-based subtraction.
Tests:
TestInstanceConfig: 26 cases (was 31; dropped 5 metadata-specific ones).
TestClusterConfig: 6 new cases including mutual-exclusion paths.
TestInstanceOperationMaintenanceWriteHandler: 19 cases (was 21; dropped
metadata tests, kept all partial-accept + cap + clear coverage).
TestInstanceOperationMaintenanceTrimming: 2 cases (was 3; dropped the
metadata-MapField trimmer test that no longer applies).
TestInstanceOperationMaintenanceBudget: 2 integration cases, real ZK.
All 87 helix-core tests + 19 helix-rest tests in the targeted suite still
pass.
Prior comment claimed entry's !UNROUTABLE_INSTANCE_OPERATIONS filter
includes "EVACUATE/SWAP_OUT" relative to exit's getAssignableInstances.
SWAP_OUT does not exist in InstanceConstants.InstanceOperation; the enum
is {ENABLE, DISABLE, EVACUATE, SWAP_IN, UNKNOWN}. SWAP_IN is excluded
from both baselines (it's in UNROUTABLE and not in ASSIGNABLE), so the
only operation that actually differs between the two baselines is
EVACUATE.
Rephrase the comments in both stages to call out EVACUATE specifically
and drop the SWAP_OUT reference.
The baseline-asymmetry caveat will be addressed in a follow-up PR that introduces a single OFFLINE_BUDGET_INSTANCE_OPERATIONS set used at both MM entry and exit. Until then, the comments don't need to call it out; they only need to describe what each stage does and that the marker subtraction is applied identically at both.
| if (absoluteCap >= 0) { | ||
| effectiveCap = absoluteCap; | ||
| } else { |
There was a problem hiding this comment.
what if both are set? Do we have a validation to prevent against that?
There was a problem hiding this comment.
I see that we are validating that case while updating the config, can ignore it here for now.
| + "INSTANCE_OPERATION_MAINTENANCE_BUDGET are mutually exclusive; clear the " | ||
| + "absolute form before setting the percentage form."); | ||
| } | ||
| _record.setIntField( |
There was a problem hiding this comment.
Add validation to ensure % is between 0 and 100 here before writing.
There was a problem hiding this comment.
This only seems to be testing offline node case, can we add tests for disable, evacuate, unknown etc to validate more cases?
| return JSONRepresentation(body); | ||
| } | ||
| // Single-instance call cannot be partial; surface the per-instance reason as 400. | ||
| String rejectReason = result.getRejected().getOrDefault(instanceName, "rejected"); |
There was a problem hiding this comment.
We can end up masking other errors with 400 code here. If instances is not in rejected list, it should return 400 and should fail with 500 errors instead.
…ISABLE/EVACUATE tests Three changes responding to streak's latest review pass on PR #175: * ClusterConfig.setInstanceOperationMaintenanceBudgetPercentage now rejects values outside [0, 100] (or -1 to clear) at write time. A value above 100 or below -1 was previously stored silently and would have produced a cap larger than the cluster or undefined sentinel behavior; the setter now throws HelixException with an explicit range in the message. * PerInstanceAccessor.setInstanceOperationMaintenance no longer maps every "not applied" outcome to HTTP 400. The handler is supposed to return every input instance in exactly one of applied or rejected; if a caller hits the third case (instance is in neither set), that's a server-side bug, not a client error, so the accessor now returns 500 in that case to surface the bug rather than mask it. * TestInstanceOperationMaintenanceBudget gains two integration cases exercising the marker against non-default InstanceOperation values: DISABLE (instance is in the MM-entry baseline but absent from enabledLive, so it counts toward the budget without a marker) and EVACUATE (same shape, plus EVACUATE is the operation most relevant to the orchestrator's deploy/drain scenarios). UNKNOWN/SWAP_IN are not covered because they're members of UNROUTABLE_INSTANCE_OPERATIONS and never count toward the budget regardless of marker state, so a test would only assert "no-op behaves like no-op". All adjacent tests still pass: 72 helix-core tests (was 100; difference is fewer tests in the targeted run, not regressions) and 19 helix-rest handler tests.
thestreak101
left a comment
There was a problem hiding this comment.
Make sure the tests failures are not related to the change before merging.
Unrelated |
Issues
Description
What. Introduces a per-instance
PLANNED_MAINTENANCE_UNTIL_MSmarker onInstanceConfigthat excludes valid-marker instances from the cluster-wideMAX_OFFLINE_INSTANCES_ALLOWEDcount that drives auto Maintenance Mode (MM). The filter applies symmetrically on MM entry (BestPossibleStateCalcStage) and MM exit (MaintenanceRecoveryStage).Why. Helix today is cause-blind: every offline host counts against the budget. Rolling deploys, evacuations, and planned swaps produce false-positive MM triggers, and a static
MAX_OFFLINE_INSTANCES_ALLOWEDcaps the deployment speedup that operators expect when raising replication factor (e.g., Venice's RF=3 to RF=4 change). The marker lets Helix recognize planned offlines while keeping the MM trigger as a safety net for unplanned loss. Helix stays free of orchestration semantics; operation-type reasoning remains in the orchestrator.How.
InstanceConfig:PLANNED_MAINTENANCE_UNTIL_MS(SimpleField, long) andPLANNED_MAINTENANCE_METADATA(MapField, audit-only) properties.PLANNED_MAINTENANCE_NOT_SET = -1Lsentinel. Setter clears both fields when given a non-positive value.isUnderPlannedMaintenance(long nowMs)helper.NON_OVERWRITABLE_PROPERTIESso the marker does not transfer onto a swap-in target viaoverwriteInstanceConfig.ClusterConfigadds three opt-in fields (no defaults, absent => feature off for that cluster):MAX_PLANNED_MAINTENANCE_INSTANCES(absolute cap on simultaneously marked instances).MAX_PLANNED_MAINTENANCE_PERCENTAGE(percentage cap on simultaneously marked instances; if both caps are set, the stricter one wins).DEFAULT_PLANNED_MAINTENANCE_DURATION_MS(fallback TTL applied when the REST caller omitsexpiresAtMillis; absent => such writes are rejected with 400).BestPossibleStateCalcStage.validateInstancesUnableToAcceptOnlineReplicasLimit: builds the offline-budget set in one pass (routable configs minus enabled-live) and drops instances under valid planned-maintenance markers viaremoveIf(isUnderPlannedMaintenance)before comparing againstMAX_OFFLINE_INSTANCES_ALLOWED. Planned-and-alive instances are not double-counted because they were never in the offline set to begin with.MaintenanceRecoveryStageapplies the same subtraction on the auto-exit path so a deploy window does not block recovery.helix-rest endpoints (mirror the stoppable-check routing style):
POST /clusters/{c}/instances/{i}/plannedMaintenancePOST /clusters/{c}/instances?command=plannedMaintenance(addsplannedMaintenanceto theCommandenum and dispatches via the existingInstancesAccessor.instancesOperationsswitch, parallel to how?command=stoppableworks).Request body (single):
{ "expiresAtMillis": 1776385800000, "reason": "venice-deploy:opId=abc", "source": "AUTOMATION" }Batch body adds
"instances": ["h1", "h2", ...]. A negativeexpiresAtMillisclears the marker.TTL resolution: caller value if positive; else
now + DEFAULT_PLANNED_MAINTENANCE_DURATION_MSif the cluster has it; else 400.Partial-accept contract (mirrors batch stoppable). The batch endpoint processes instances in input order against a single snapshot of the current cap quota. Per-instance failures (missing instance, cap exhausted) do not abort the call; they land in the
rejectedmap. HTTP 400 is reserved for caller-side bugs that invalidate the entire request (emptyinstances, pastexpiresAtMillis, missingexpiresAtMilliswith no cluster default).Single endpoint response (200 on success):
{ "instance": "h1", "expiresAtMillis": 1776385800000 }On a per-instance failure (cap full or instance missing), single returns 400 with the reason since there's nothing to be partial about.
Batch endpoint response (always 200 for any per-instance outcome):
{ "applied": ["h1", "h2"], "rejected": { "h3": "would exceed MAX_PLANNED_MAINTENANCE_INSTANCES=2" }, "expiresAtMillis": 1776385800000 }The shared
PlannedMaintenanceWriteHandlerreturns aPlannedMaintenanceResult(applied,rejected,resolvedExpiresAtMillis). Existence checks use a singlegetInstancesInClustersnapshot rather than per-instance config reads (HelixAdmin throws on missing instances), and the same snapshot feeds the cap quota calculation.Evaluation model. The filter is evaluated lazily on events the controller already handles (instance-config writes, live-instance changes, periodic rebalance). Orchestrator clears re-evaluate immediately via the config write. Real outages re-evaluate via the corresponding live-instance change. A stale marker left by an orchestrator crash is benign on a stable cluster and cascades into MM correctly the moment a genuine outage arrives.
Tests
helix-core/src/test/java/org/apache/helix/model/TestInstanceConfig.java(10 new cases):testPlannedMaintenanceUnsetByDefaulttestPlannedMaintenanceSetAndGettestIsUnderPlannedMaintenanceWindow(boundary semantics)testPlannedMaintenanceClearViaNegativeValuetestPlannedMaintenanceClearViaZerotestPlannedMaintenanceMetadataSetAndGettestPlannedMaintenanceMetadataIsReadOnlytestPlannedMaintenanceMetadataNullClearstestPlannedMaintenanceMetadataIsDefensiveCopiedtestPlannedMaintenanceNotTransferredByOverwritehelix-core/src/test/java/org/apache/helix/model/TestClusterConfig.java(3 new cases):testPlannedMaintenanceFieldsUnsetByDefaulttestPlannedMaintenanceFieldsRoundTriptestPlannedMaintenanceFieldsClearWithSentinelhelix-rest/src/test/java/org/apache/helix/rest/clusterMaintenanceService/TestPlannedMaintenanceWriteHandler.java(21 cases, Mockito-based): TTL resolution (caller wins / cluster default / reject when both unset / past-timestamp rejected), set-with-metadata, set-without-audit-fields, dedup, empty/null instance rejection, missing-instance partial-accept (others still applied), cap enforcement under partial-accept semantics (no-op / absolute / boundary / percentage / stricter-of-two / counts existing markers / ignores expired markers / zero-cap rejects everything), clear with negative sentinel, clear bypasses cap, clear with missing instance partial-accept.helix-core/src/test/java/org/apache/helix/integration/rebalancer/TestPlannedMaintenanceBudgetExemption.java(2 integration cases with real ZK):testValidMarkersPreventMMTrigger: two marked offline instances do not trip MM; adding two unmarked offline instances on top does.testExpiredMarkersDoNotExempt: markers with a pastuntilMsbehave as if absent and MM trips.End-to-end smoke test against a live cluster on
localhost:2191confirmed all paths: TTL resolution (caller / cluster-default / reject-with-no-default / past-timestamp), partial-accept on cap overflow, partial-accept on missing instance, single-endpoint 400 on cap-exhausted, single-endpoint 400 on missing instance, mixed missing + cap-overflow batch.Changes that Break Backward Compatibility (Optional)
None. All new fields are opt-in:
InstanceConfigfields default to "unset"; older readers ignore unknown ZNRecord keys, and the existinggetMaxOfflineInstancesAllowedpath is unchanged when no instance carries a marker.ClusterConfigfields default to "unset"; absent means feature is off for that cluster.NON_OVERWRITABLE_PROPERTIESgains two entries; this only affects callers that explicitly relied onoverwriteInstanceConfigtransferring planned-maintenance state during swap, which would have been an unintended behavior anyway.Documentation (Optional)
Design doc: Helix Planned Maintenance Window (per-instance budget exemption).
Commits
Code Quality
(helix-style-intellij.xml if IntelliJ IDE is used)