design: add doc for cluster autoscaling and background reconfiguration#36691

Draft
aljoscha wants to merge 15 commits into
MaterializeInc:mainfrom
aljoscha:design-cluster-autoscaling

@aljoscha
Contributor

Rendered

Resolves SQL-315

aljoscha and others added 15 commits May 22, 2026 12:46
…guration

Proposes a cluster controller that runs alongside the Coordinator as a
reconciler over durable cluster config. Reshapes graceful reconfiguration to
run in the background by making the user's target the durable cluster config
and removing session-bound intent. Introduces HYDRATION_SIZE for burst replicas
during hydration, with the existing ON REFRESH scheduling lifted into the same
strategy framework.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…cess hydration

Removes the carve-out that excluded HYDRATION_SIZE on storage-only clusters,
adds a callout for the storage-side hydration signal, and reframes the
hydration consumption pattern around in-process compute-controller state
rather than the builtin view, with guidance to avoid the existing
graceful-reconfig 1-second polling cadence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
In practice they're the same mechanism for our purposes; the doc shouldn't
draw a distinction.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The doc as a whole describes the v1, and the individual steps don't make sense
as standalone shippable units.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ing changes

Drop the peek-routing and hydration-signal items (handled in the design body),
and reframe the remaining three items as user-observable behavior changes that
are corollaries of the design.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drops the stiff "Corollary" framing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…sh combo

No appetite to invest further in SCHEDULE syntax; the combination should stay
rejected indefinitely rather than be left open as a possible follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ss of RF

Burst is transient and N burst replicas to mirror an N-replica steady set
multiplies cost for diminishing benefit (tear-down only requires one steady
replica to have caught up).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e examples

Speculative future strategies (queue-depth, scale-to-zero, time-of-day) are not
on the roadmap and shouldn't shape the v1 interface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Keep the foreground (session-bound) path intact during dyncfg-gated rollout,
remove it once the background model is fully enabled.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…achinery

Preserving the foreground (session-bound) experience during rollout doesn't
require keeping the pending flag or the three-stage state machine. The
foreground UX is implemented as a thin session-side wait over the background
mechanism; deprecating it later is removing the shim, not unwinding code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bullet leads and numbered-list labels stay bold (they act as mini-headers);
inline emphasis on terms or phrases within prose becomes italics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This reverts commit 4ff6832.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two user-facing capabilities motivate this work:

1. **Background graceful cluster reconfiguration.** Today, `ALTER CLUSTER ... SET (SIZE = ...)` with the graceful (zero-downtime) strategy requires the SQL session to remain open for the duration of the reconfiguration — the session holds the wait-for-hydration stage. Long-running reconfigurations are fragile: any process or session interruption — a network blip, a client timeout, an SQL tool closing, an `environmentd` restart — aborts the reconfiguration. The user experience we want is: the statement returns immediately, and the reconfiguration continues in the background, surviving restarts and disconnects.
Contributor

yaay!
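
The background behavior described in that excerpt can be pictured as a reconcile step over the durable target. What follows is a minimal sketch, not the controller proposed in the doc: `Replica`, `reconcile`, and the action tuples are hypothetical names, and replication factor, timeouts, and burst replicas are deliberately ignored. It assumes the statement has already written `target_size` durably and returned, so the step can be re-run after any restart or disconnect.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    # Hypothetical, simplified view of a replica for illustration only.
    name: str
    size: str
    hydrated: bool


def reconcile(target_size: str, replicas: list[Replica]) -> list[tuple[str, str]]:
    """Return the actions one reconcile pass would take, as (verb, argument) pairs."""
    at_target = [r for r in replicas if r.size == target_size]
    stale = [r for r in replicas if r.size != target_size]
    if not at_target:
        # No replica at the target size yet: bring one up alongside the old set.
        return [("create", target_size)]
    if any(r.hydrated for r in at_target):
        # A target-size replica has caught up: the old replicas can be retired.
        return [("drop", r.name) for r in stale]
    # Target-size replicas exist but are still hydrating: nothing to do yet.
    return []
```

Because the step depends only on the durable target and the observed replicas, no session needs to stay open for it to make progress.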


## Out of Scope

- `HYDRATION_SIZE` combined with `SCHEDULE = ('on-refresh', ...)`. Initial version rejects; see [Open Questions](#open-questions).
Contributor

agreed

## Out of Scope

- `HYDRATION_SIZE` combined with `SCHEDULE = ('on-refresh', ...)`. Initial version rejects; see [Open Questions](#open-questions).
- More than one concurrent burst replica per cluster. Initial version supports exactly one; revisit if needed.
Contributor

also agreed

- A new autoscaling strategy can be added without restructuring the framework.
- Operators can disable the burst behavior across an environment via a break-glass flag without disabling other autoscaling.

## Out of Scope
Contributor

I agree with the calls here


1. **Stuck-reconfiguration recovery policy.** When a reconfiguration's pending replicas have not hydrated within the system timeout, do we (a) park the reconfiguration indefinitely with a clear signal in the introspection view for an operator to act, (b) auto-cancel and revert to the prior steady state, or (c) make the policy a dyncfg with one of (a)/(b) as the default? Same question for stuck burst replicas.

2. **`HYDRATION_SIZE` + `SCHEDULE = ('on-refresh', ...)` combination.** v1 rejects this combination. Semantically the combination is interesting (every refresh window, burst comes up first to accelerate hydration, steady catches up, then the schedule turns the cluster off), but our strong recommendation is to keep this rejected indefinitely: there is currently no appetite to invest further in the `SCHEDULE` syntax, and supporting the combination would expand its surface area.
Contributor

agreed


2. **`HYDRATION_SIZE` + `SCHEDULE = ('on-refresh', ...)` combination.** v1 rejects this combination. Semantically the combination is interesting (every refresh window, burst comes up first to accelerate hydration, steady catches up, then the schedule turns the cluster off), but our strong recommendation is to keep this rejected indefinitely: there is currently no appetite to invest further in the `SCHEDULE` syntax, and supporting the combination would expand its surface area.

3. **Multiple burst replicas (one per steady replica vs. one total).** v1 supports exactly one burst replica per cluster regardless of replication factor. Our strong recommendation is to keep it that way: the burst replica is by design transient, and provisioning N burst replicas to mirror an N-replica steady set would multiply cost for diminishing benefit — burst tear-down only requires one steady replica to have caught up. Revisit only if real-world usage proves the single-burst model insufficient.
Contributor

agreed


7. **Foreground/synchronous graceful reconfiguration retention.** Our strong recommendation is to deprecate the current foreground (session-bound) mechanism in favor of the background model. During rollout, the foreground experience is preserved as a thin session-side wait shim over the background mechanism (see [SQL surface](#sql-surface)) — *not* by retaining the existing parallel state machine. This means the `pending: bool` flag on replicas and the associated three-stage machinery can be removed up front; deprecating the foreground experience later is simply deleting the wait shim. The one behavioral difference vs. today is that session disconnect during the wait no longer aborts the reconfiguration (arguably a feature; the durable target stays set and the controller continues).

8. **Hydration burst during graceful reconfiguration.** Should burst kick in while a graceful reconfig is in flight (target size differs from current replicas)? Leaning toward no: the new-size replicas are themselves transient hydration capacity, and stacking burst on top risks confusing billing and behavior. Burst resumes once the reconfig settles.
Contributor

Nope. If a user types `ALTER CLUSTER ... SET (SIZE = '200cc')`, that shouldn't trigger a burst. It should trigger a 200cc replica. Once the 200cc replica is hydrated, retire the original replica.

Contributor Author

So the nope is a confirmation of my "leaning towards no", yes? 😅

Contributor

Correct :)
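
The outcome of this thread, together with the single-burst limit and the break-glass flag quoted earlier, can be summarized as one predicate. This is a sketch under stated assumptions: `should_start_burst` and all of its parameters are hypothetical names, and the rule that burst is only useful while no steady replica has hydrated is inferred from the tear-down condition in the excerpt on multiple burst replicas.

```python
from typing import Optional


def should_start_burst(
    hydration_size: Optional[str],
    burst_enabled: bool,
    reconfig_in_flight: bool,
    any_steady_hydrated: bool,
    burst_replica_exists: bool,
) -> bool:
    """Decide whether to create a burst replica on this pass (illustrative only)."""
    if hydration_size is None or not burst_enabled:
        # HYDRATION_SIZE is unset, or the break-glass flag has disabled burst.
        return False
    if reconfig_in_flight:
        # A graceful reconfiguration never stacks burst on top of the new-size replicas.
        return False
    if burst_replica_exists:
        # v1 supports exactly one burst replica per cluster.
        return False
    # Burst only helps while no steady replica has caught up.
    return not any_steady_hydrated
```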


- **Burst and reconfiguration-transient replicas appear in billing and metering identically to ordinary replicas.** A user with `HYDRATION_SIZE` set sees additional billing during hydration windows; a user issuing a background `ALTER CLUSTER` sees additional billing during the overlap between the old and new replica sets.
- **Background `ALTER CLUSTER` returns immediately** after writing the new target to the catalog. The actual replica transition happens asynchronously and is observable via the new introspection view. This matches the existing pattern for other async DDL (e.g., `CREATE INDEX` returns once the catalog entry exists; hydration happens afterwards).
- **`SHOW CLUSTERS` reports the new (target) size immediately on ALTER**, not the old size. Mid-reconfiguration the durable cluster configuration already reflects the user's intent, so `SHOW CLUSTERS` does too. This is a change from today's behavior, where the old size is reported until the graceful reconfiguration finalizes.
Contributor

Hmm. What I would really like is for SHOW CLUSTERS to tell me whether a reconfiguration is in flight or not, tell me what the current size is, and what the target size is

Contributor Author

Yeah, I can see why you would like that. I think we can do something good there!
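
One way to picture the three fields requested above (in flight or not, current size, target size) is to derive them from the durable target and the sizes of the replicas that still exist. This is purely illustrative and not a decided design: `ClusterStatus` and `cluster_status` are hypothetical names, and the sketch assumes that "current size" means the size of any replica not yet at the target.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterStatus:
    # Hypothetical shape of the requested report.
    current_size: str
    target_size: str
    reconfiguring: bool


def cluster_status(target_size: str, replica_sizes: list[str]) -> ClusterStatus:
    """Derive a status report from the durable target and existing replica sizes."""
    stale = [s for s in replica_sizes if s != target_size]
    # While any old-size replica remains, report its size as the current one.
    current = stale[0] if stale else target_size
    return ClusterStatus(current, target_size, reconfiguring=bool(stale))
```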


The following behaviors fall out of the design rather than being its headline outcomes. They are user-observable and worth flagging in release notes and user-facing documentation.

- **Burst and reconfiguration-transient replicas appear in billing and metering identically to ordinary replicas.** A user with `HYDRATION_SIZE` set sees additional billing during hydration windows; a user issuing a background `ALTER CLUSTER` sees additional billing during the overlap between the old and new replica sets.
Contributor (@maheshwarip, May 22, 2026)

Hmm this makes sense. But is this new behavior? I assumed that this was how it always worked!

Contributor

Maybe I was wrong!

Contributor Author

Nah, that is how it worked, but I put it in there because the bursting is new. For graceful reconfig it was always like this.

Contributor

Gotcha gotcha. Ok, no concerns!
