workload-replay: Handle larger captured workloads by def- · Pull Request #36686 · MaterializeInc/materialize

def- · 2026-05-22T09:38:48Z

Captured workloads from production environments often use features that local replay did not yet support, so replaying them previously failed at various points before any useful test work could happen.

* config: add extra cluster replica sizes
* objects: accept quoted database names with non-identifier characters (e.g. hyphens) in CREATE CONNECTION; parse double-quoted REFERENCE / EXTERNAL REFERENCE parts so dotted references like `"db-name".schema.table` are split correctly.
* objects: upstream CREATE TABLE statements now use IF NOT EXISTS: the same physical Postgres/MySQL/SQL Server table can back multiple MZ sources and we were failing the second time.
* objects: strip options from captured SQL that don't survive local replay: empty `TEXT COLUMNS = ()` (planner panic), `EXCLUDE COLUMNS` (names upstream columns we don't recreate), Kafka `START OFFSET = (…)` (local topics have 1 partition), `VERSION = N` on CREATE SINK (internal-only option).
* column: extend type coverage so `<scalar>[]` arrays, `interval`, `oid`, `real`, and Materialize-specific list/record/aclitem variants produce sensible non-NULL values; `character` (no size) now emits a single char so it fits CHAR(1).
* column: function-call defaults like `pg_catalog.now()` are skipped in COPY and Kafka paths, where they would be inserted as literal strings and rejected.
* column: avro_type falls back to `string` for non-primitive types so fastavro can parse the generated schema.
* util / mzcompose: workload files can now be stored compressed as `.yml.zst` and are loaded transparently via the new `load_workload` helper. Glob `*.yml` also picks up the compressed variants.
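To illustrate the quoted-reference handling described above, here is a minimal sketch of splitting a dotted reference while respecting double quotes. The function name `split_reference` is hypothetical and is not taken from the PR; the actual parsing in workload-replay may differ.

```python
def split_reference(ref: str) -> list[str]:
    """Split a dotted SQL reference on dots that are outside double quotes.

    Hypothetical helper for illustration only. A doubled quote ("") inside
    a quoted part is treated as an escaped literal quote.
    """
    parts: list[str] = []
    buf: list[str] = []
    in_quotes = False
    i = 0
    while i < len(ref):
        ch = ref[i]
        if ch == '"':
            if in_quotes and i + 1 < len(ref) and ref[i + 1] == '"':
                buf.append('"')  # escaped quote inside a quoted part
                i += 2
                continue
            in_quotes = not in_quotes  # opening or closing quote
        elif ch == "." and not in_quotes:
            parts.append("".join(buf))  # an unquoted dot ends the current part
            buf = []
        else:
            buf.append(ch)
        i += 1
    parts.append("".join(buf))
    return parts


print(split_reference('"db-name".schema.table'))  # ['db-name', 'schema', 'table']
```

A naive `ref.split(".")` would keep the quotes in the first part and would mis-split any quoted part that itself contains a dot.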

def- · 2026-05-23T02:20:27Z

Currently blocked by https://linear.app/materializeinc/issue/CLU-95/thread-coordinator-panicked-at-srccompute-clientsrcas-of & https://github.com/MaterializeInc/database-issues/issues/9656#issuecomment-4523776124

Edit: Rebased on top of potential fix, retrying: https://buildkite.com/materialize/release-qualification/builds/1250

`Interval` is lex-ordered by (months, days, micros), but adding an interval to a timestamp or date adds *calendar* months with day-clamping and then days as 24-hour periods. That arithmetic does not respect the lex order:

    t  = 2024-01-31
    i1 = {0 months, 31 days} → t + i1 = 2024-03-02
    i2 = {1 month, 0 days}   → t + i2 = 2024-02-29

In lex order `i1 < i2`, yet `t + i1 > t + i2`.

For timestamps, the day-clamping also collapses near-boundary inputs into the same date while preserving sub-day time, so the first argument is non-monotone too:

    t1 = 2024-01-30 23:59:59, i = {1 month} → 2024-02-29 23:59:59
    t2 = 2024-01-31 00:00:00, i = {1 month} → 2024-02-29 00:00:00

For dates the first argument *is* monotone (no sub-day precision means clamping only collapses, never reverses), but the interval argument has the same problem.

These annotations are consumed by the abstract interpreter that drives persist filter pushdown. Marking these functions monotone meant the interpreter computed the output range by evaluating the function only at the endpoints of the input interval range. A stats range like `[{0m,31d}, {1m,0d}]` would yield the narrow output range `[2024-02-29, 2024-03-02]` even though interior intervals (e.g. `{0m, 60d}`) actually produce timestamps far outside that window. Filter pushdown could then incorrectly conclude a part had no matching rows, tripping the `persist filter pushdown correctness violation!` audit in `persist_source.rs`.

Companion fix to b607993, which corrected the analogous annotations for `add_time_interval`, `sub_time_interval`, `mul_interval`, and `div_interval`.

Fixes database-issues#9656.
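The counterexamples in this commit message can be reproduced with a small standalone model. This is only a sketch: `add_interval` is a hypothetical Python helper that mimics "add calendar months with day-clamping, then days and microseconds"; it is not Materialize's Rust implementation.

```python
import calendar
from datetime import datetime, timedelta


def add_interval(t: datetime, months: int, days: int, micros: int = 0) -> datetime:
    """Add calendar months (clamping the day of month), then days and micros."""
    month_index = t.year * 12 + (t.month - 1) + months
    year, month0 = divmod(month_index, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]
    shifted = t.replace(year=year, month=month, day=min(t.day, last_day))
    return shifted + timedelta(days=days, microseconds=micros)


# Interval argument: lex order on (months, days, micros) is not preserved.
t = datetime(2024, 1, 31)
i1, i2 = (0, 31, 0), (1, 0, 0)
assert i1 < i2
assert add_interval(t, *i1) == datetime(2024, 3, 2)
assert add_interval(t, *i2) == datetime(2024, 2, 29)

# An interval that lies between i1 and i2 lands outside [2024-02-29, 2024-03-02].
assert i1 < (0, 60, 0) < i2
assert add_interval(t, 0, 60) == datetime(2024, 3, 31)

# Timestamp argument: clamping reverses the order of near-boundary inputs.
t1, t2 = datetime(2024, 1, 30, 23, 59, 59), datetime(2024, 1, 31)
assert t1 < t2
assert add_interval(t1, 1, 0) > add_interval(t2, 1, 0)
```

The middle check is the one that matters for pushdown: evaluating only at the endpoints `i1` and `i2` gives a range that does not contain the result for the interior interval `{0m, 60d}`.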

The override was added to suppress the audit panic from database-issues#9656; with the monotonicity annotations on timestamp/date + interval corrected, filter pushdown should be sound again. Removing this lets CI re-exercise the audit and surface any remaining latent causes.

- Drop useless .into() in the new interpret.rs regression test (clippy).
- Update filter-pushdown.slt: sub_timestamp_interval is no longer pushdownable now that it's marked non-monotone, so the expected pushdown= lines for 'timestamp - INTERVAL day' queries are gone. This is the known tradeoff for soundness: even day-only intervals share the non-monotonic annotation since the abstract interpreter can't distinguish them statically.

date_bin(stride, source) = origin + floor((source - origin) / stride) * stride. For a fixed source like 2024-01-01 12:00:00, a 1-day stride bins to 2024-01-01 00:00:00 but a 2-day stride bins to 2023-12-31 00:00:00 — i.e. the lex-larger stride produces an earlier output. Same class of bug as the timestamp/date + interval monotonicity issues. Demotes date_bin_timestamp and date_bin_timestamp_tz from (true, true) to (false, true). Still monotone in source.
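The date_bin counterexample can be checked with a few lines of Python that follow the formula above. This is a sketch, not the Materialize implementation, and it assumes the origin is the Unix epoch (1970-01-01 00:00:00), which the commit message does not state explicitly.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)  # assumed origin; not stated in the commit message


def date_bin(stride: timedelta, source: datetime, origin: datetime = EPOCH) -> datetime:
    """origin + floor((source - origin) / stride) * stride"""
    buckets = (source - origin) // stride  # timedelta // timedelta floors to an int
    return origin + buckets * stride


source = datetime(2024, 1, 1, 12, 0, 0)
one_day = date_bin(timedelta(days=1), source)   # 2024-01-01 00:00:00
two_days = date_bin(timedelta(days=2), source)  # 2023-12-31 00:00:00

# The larger stride produces the earlier bin, so date_bin is not monotone in stride.
assert two_days < one_day

# For a fixed stride, a later source never produces an earlier bin.
assert date_bin(timedelta(days=2), source + timedelta(hours=12)) >= two_days
```

2024-01-01 is day 19723 since the epoch, an odd number, which is why the 2-day grid puts its bin boundary on the previous day.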

Postgres-style `age(a, b)` is annotated `is_monotone = "(true, true)"`, but it is non-monotone in either argument.

First argument: the carry logic in the calendar-aware difference borrows a whole month's worth of days when the day field goes negative, which breaks the lex order of `Interval` (months, days, micros) at month boundaries:

    age(2024-03-31, 2024-02-15) = {1 month, 16 days}
    age(2024-04-01, 2024-02-15) = {1 month, 15 days}   ← lex-smaller
    age(2024-05-01, 2024-02-15) = {2 months, 15 days}

Second argument: after the algorithm's final sign-revert, the same day-borrow makes the result dip non-monotonically as `b` crosses a month boundary:

    age(2024-02-29, 2024-03-30) = {-1 month, -1 day}
    age(2024-02-29, 2024-03-31) = {-1 month, -2 days}  ← lex-smaller
    age(2024-02-29, 2024-04-01) = {-1 month, -1 day}

Demotes both `age_timestamp` and `age_timestamp_tz` to `(false, false)`. Adds regression tests in `interpret.rs` exercising the abstract interpreter via a `>=` predicate, mirroring the existing tests for the analogous `add_timestamp_interval` and `date_bin_timestamp` bugs.

Same bug class as 5257b0d and 51e86ed.
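The day-borrow behaviour can be reproduced with a date-only model of the PostgreSQL `age` algorithm. This is a sketch under assumptions: `pg_age` is a hypothetical helper, it ignores the time-of-day fields, it folds years into months, and it is not the Materialize implementation.

```python
import calendar
from datetime import date


def pg_age(a: date, b: date) -> tuple[int, int]:
    """Date-only model of Postgres-style age(a, b), returned as (months, days)."""
    years, months, days = a.year - b.year, a.month - b.month, a.day - b.day
    negate = a < b
    if negate:  # work with a non-negative difference, revert the sign at the end
        years, months, days = -years, -months, -days
    while days < 0:
        # Borrow the length of the month of the earlier date.
        ref = a if negate else b
        days += calendar.monthrange(ref.year, ref.month)[1]
        months -= 1
    while months < 0:
        months += 12
        years -= 1
    if negate:
        years, months, days = -years, -months, -days
    return (years * 12 + months, days)


# First argument: a later `a` yields a lex-smaller interval.
assert pg_age(date(2024, 3, 31), date(2024, 2, 15)) == (1, 16)
assert pg_age(date(2024, 4, 1), date(2024, 2, 15)) == (1, 15)
assert pg_age(date(2024, 5, 1), date(2024, 2, 15)) == (2, 15)

# Second argument: the result dips and then recovers as `b` advances.
assert pg_age(date(2024, 2, 29), date(2024, 3, 30)) == (-1, -1)
assert pg_age(date(2024, 2, 29), date(2024, 3, 31)) == (-1, -2)
assert pg_age(date(2024, 2, 29), date(2024, 4, 1)) == (-1, -1)
```

Python compares tuples lexicographically, so `(1, 15) < (1, 16)` and `(-1, -2) < (-1, -1)` mirror the lex order on `(months, days)` used in the examples above.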

The runtime MFP evaluator (SafeMfpPlan::evaluate_inner) runs every expression once all preceding predicates pass, so an expression that errors on the actual data turns the whole row into an Err. The abstract interpreter's mfp_filter / mfp_plan_filter, however, only ANDs together the predicates and temporal bounds, so the AND result misses the fallibility of any expression whose result column isn't referenced by a predicate or bound. Persist filter pushdown then discards parts that actually produce error rows, tripping the audit panic in persist_source.rs.

Concretely from the audit log on database-issues#9656:

    expressions:  [cast_string_to_uuid(merchant_group_id)]
    predicates:   [NOT IsNull(checksum)]
    upper_bounds: [cast_timestamp_tz_to_mz_timestamp(coalesce(deleted_at, ...))]

Stats say checksum is non-null and deleted_at is in the past. The interpreter computed AND({True}, {False}) = {False} with fallible=false and discarded the part. The actual evaluator: predicate passes, cast_string_to_uuid is evaluated next, errors on the row's merchant_group_id value, and the whole row is emitted as Err. Audit catches the discrepancy.

Override mfp_filter and mfp_plan_filter on ColumnSpecs so that the returned summary's fallible bit is set if any of the MFP expressions' specs are fallible. This is conservative (we'll keep parts where a predicate could-but-doesn't-have-to fail, even though predicate short-circuiting would have prevented the expression from running), but it's sound and matches the runtime semantics.

Adds a regression test that builds an MFP with one always-erroring expression and one always-passing predicate, and asserts that the interpreter's summary may_fail.
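A toy model makes the missing fallibility bit concrete. Everything here is hypothetical (`Summary`, `filter_summary`, `can_discard_part` and the flag `propagate_expression_errors` are illustration names, not the Rust types in the interpreter); it only mirrors the reasoning in the commit message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Summary:
    """Abstract result of a boolean filter over all rows in a part."""
    may_be_true: bool
    may_be_false: bool
    may_fail: bool = False

    def and_(self, other: "Summary") -> "Summary":
        return Summary(
            may_be_true=self.may_be_true and other.may_be_true,
            may_be_false=self.may_be_false or other.may_be_false,
            may_fail=self.may_fail or other.may_fail,
        )


def filter_summary(expression_may_fail, conditions, propagate_expression_errors):
    """AND all predicates/bounds; optionally fold in expression fallibility."""
    result = Summary(may_be_true=True, may_be_false=False)
    for cond in conditions:
        result = result.and_(cond)
    if propagate_expression_errors and any(expression_may_fail):
        result = Summary(result.may_be_true, result.may_be_false, may_fail=True)
    return result


def can_discard_part(summary: Summary) -> bool:
    # A part is only safe to skip if it can produce neither rows nor errors.
    return not summary.may_be_true and not summary.may_fail


expressions = [True]                   # cast_string_to_uuid(...) may error
conditions = [Summary(True, False),    # NOT IsNull(checksum): always true
              Summary(False, True)]    # temporal upper bound: always false

old = filter_summary(expressions, conditions, propagate_expression_errors=False)
new = filter_summary(expressions, conditions, propagate_expression_errors=True)
assert can_discard_part(old)       # unsound: the runtime would emit Err rows
assert not can_discard_part(new)   # conservative but sound: the part is kept
```

The model also shows the cost of the fix: any part touched by a fallible expression is kept, even when a predicate would have short-circuited before the expression ran.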

The previous commits in this stack marked add/sub_timestamp_interval (and the tz variants) as non-monotone in both arguments to preserve correctness in the presence of calendar-month / day-clamping arithmetic. But that conservative annotation costs pushdown for common temporal-filter predicates like 't_col - INTERVAL ''1'' day < literal', which were a motivating use case for filter pushdown in the first place.

Recovers that pushdown by extending SpecialBinary with a DynamicMonotone handler: when the interval argument is a literal with months == 0, the operation reduces to adding a fixed number of microseconds, which is monotone in both arguments. In that case the interpreter computes a tight output range; in all other cases it falls back to anything(), preserving the soundness fix.

The SpecialBinary's pushdownable hint is set to (true, false) so the Trace pass routes 't_col +/- INTERVAL_lit' predicates through pushdown regardless of months; at runtime the dynamic check decides whether to narrow.
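The dynamic check can be sketched as a range transformer that narrows only when the literal interval has no month component. This is a simplified illustration, not the DynamicMonotone handler itself: `add_interval_range` is a hypothetical name, and `None` stands in for the interpreter's anything().

```python
from datetime import datetime, timedelta
from typing import Optional

Interval = tuple[int, int, int]   # (months, days, micros)
Range = tuple[datetime, datetime]  # inclusive [lo, hi] bounds from part stats


def add_interval_range(ts_range: Range, interval: Interval) -> Optional[Range]:
    """Output range of `t + interval` for t in ts_range, or None for 'anything'."""
    months, days, micros = interval
    if months != 0:
        # Calendar-month arithmetic with day-clamping is not monotone: give up.
        return None
    # With months == 0 the operation is a fixed shift, so endpoints map to endpoints.
    delta = timedelta(days=days, microseconds=micros)
    lo, hi = ts_range
    return (lo + delta, hi + delta)


stats = (datetime(2024, 1, 1), datetime(2024, 1, 31))

# t_col - INTERVAL '1' day: negate the interval and narrow.
print(add_interval_range(stats, (0, -1, 0)))
# t_col + INTERVAL '1' month: falls back to the conservative answer.
print(add_interval_range(stats, (1, 0, 0)))  # None
```

Because the decision depends on the literal's value rather than on the function alone, it cannot be expressed as a static `is_monotone` annotation, which is why the commit routes it through a special-cased handler.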

Captured workloads from production environments often have features the local replay did not yet support; replaying them previously failed at various points before useful test work could happen.

* config: add source_ingestion_*, balanced_*, and highcpu_* cluster replica sizes so CREATE CLUSTER REPLICA against these production size names is accepted locally.
* objects: accept quoted database names with non-identifier characters (e.g. hyphens) in CREATE CONNECTION; parse double-quoted REFERENCE / EXTERNAL REFERENCE parts so dotted references like `"db-name".schema.table` are split correctly.
* objects: upstream CREATE TABLE statements now use IF NOT EXISTS: the same physical Postgres/MySQL/SQL Server table can back multiple MZ sources and we were failing the second time.
* objects: strip options from captured SQL that don't survive local replay: empty `TEXT COLUMNS = ()` (planner panic), `EXCLUDE COLUMNS` (names upstream columns we don't recreate), Kafka `START OFFSET = (…)` (local topics have 1 partition), `VERSION = N` on CREATE SINK (internal-only option).
* column: extend type coverage so `<scalar>[]` arrays, `interval`, `oid`, `real`, and Materialize-specific list/record/aclitem variants produce sensible non-NULL values; `character` (no size) now emits a single char so it fits CHAR(1).
* column: function-call defaults like `pg_catalog.now()` are skipped in COPY and Kafka paths, where they would be inserted as literal strings and rejected.
* column: avro_type falls back to `string` for non-primitive types so fastavro can parse the generated schema.
* util / mzcompose: workload files can now be stored compressed as `.yml.zst` and are loaded transparently via the new `load_workload` helper. Glob `*.yml` also picks up the compressed variants. Large captured workloads compress ~30× with zstd, which keeps them under GitHub's 100 MiB per-file limit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

def- force-pushed the pr-workload-capture-fixes branch 6 times, most recently from 1b5bcea to b20f0f1 on May 23, 2026 02:14

claude added 4 commits May 24, 2026 02:02

def- force-pushed the pr-workload-capture-fixes branch from b20f0f1 to 8408777 on May 24, 2026 03:57

def- mentioned this pull request May 24, 2026: expr: fix non-monotone annotations on timestamp/date/interval functions #36702 (Open)

def- force-pushed the pr-workload-capture-fixes branch from 8408777 to 7701761 on May 24, 2026 13:00

def- force-pushed the pr-workload-capture-fixes branch from 7701761 to d094b5c on May 25, 2026 02:09

def- force-pushed the pr-workload-capture-fixes branch 3 times, most recently from cdc64ce to 6c2255b on May 25, 2026 04:12

def- force-pushed the pr-workload-capture-fixes branch from 6c2255b to 9c02d48 on May 25, 2026 06:31

def- wants to merge 8 commits into MaterializeInc:main from def-:pr-workload-capture-fixes

2 participants
