Skip to content

workload-replay: Handle larger captured workloads#36686

Draft
def- wants to merge 8 commits into
MaterializeInc:mainfrom
def-:pr-workload-capture-fixes
Draft

workload-replay: Handle larger captured workloads#36686
def- wants to merge 8 commits into
MaterializeInc:mainfrom
def-:pr-workload-capture-fixes

Conversation

@def-
Copy link
Copy Markdown
Contributor

@def- def- commented May 22, 2026

Captured workloads from production environments often have features the local replay did not yet support; replaying them previously failed at various points before useful test work could happen.

  • config: add extra cluster replica sizes
  • objects: accept quoted database names with non-identifier characters (e.g. hyphens) in CREATE CONNECTION; parse double-quoted REFERENCE / EXTERNAL REFERENCE parts so dotted references like "db-name".schema.table are split correctly.
  • objects: upstream CREATE TABLE statements now use IF NOT EXISTS — the same physical Postgres/MySQL/SQL Server table can back multiple MZ sources and we were failing the second time.
  • objects: strip options from captured SQL that don't survive local replay: empty TEXT COLUMNS = () (planner panic), EXCLUDE COLUMNS (names upstream columns we don't recreate), Kafka START OFFSET = (…) (local topics have 1 partition), VERSION = N on CREATE SINK (internal-only option).
  • column: extend type coverage so <scalar>[] arrays, interval, oid, real, and Materialize-specific list/record/aclitem variants produce sensible non-NULL values; character (no size) now emits a single char so it fits CHAR(1).
  • column: function-call defaults like pg_catalog.now() are skipped in COPY and Kafka paths, where they would be inserted as literal strings and rejected.
  • column: avro_type falls back to string for non-primitive types so fastavro can parse the generated schema.
  • util / mzcompose: workload files can now be stored compressed as .yml.zst and are loaded transparently via the new load_workload helper. Glob *.yml also picks up the compressed variants.

@def- def- force-pushed the pr-workload-capture-fixes branch 6 times, most recently from 1b5bcea to b20f0f1 Compare May 23, 2026 02:14
@def-
Copy link
Copy Markdown
Contributor Author

def- commented May 23, 2026

claude added 4 commits May 24, 2026 02:02
`Interval` is lex-ordered by (months, days, micros), but adding an
interval to a timestamp or date adds *calendar* months with day-clamping
and then days as 24-hour periods. That arithmetic does not respect the
lex order:

  t = 2024-01-31
  i1 = {0 months, 31 days}   →  t + i1 = 2024-03-02
  i2 = {1 month,   0 days}   →  t + i2 = 2024-02-29

In lex order `i1 < i2`, yet `t + i1 > t + i2`. For timestamps, the
day-clamping also collapses near-boundary inputs into the same date
while preserving sub-day time, so the first argument is non-monotone
too:

  t1 = 2024-01-30 23:59:59,  i = {1 month}  →  2024-02-29 23:59:59
  t2 = 2024-01-31 00:00:00,  i = {1 month}  →  2024-02-29 00:00:00

For dates the first argument *is* monotone (no sub-day precision means
clamping only collapses, never reverses), but the interval argument has
the same problem.

These annotations are consumed by the abstract interpreter that drives
persist filter pushdown. Marking these functions monotone meant the
interpreter computed the output range by evaluating the function only
at the endpoints of the input interval range — and a stats range like
`[{0m,31d}, {1m,0d}]` would yield the narrow output range
`[2024-02-29, 2024-03-02]` even though interior intervals
(e.g. `{0m, 60d}`) actually produce timestamps far outside that window.
Filter pushdown could then incorrectly conclude a part had no matching
rows, tripping the `persist filter pushdown correctness violation!`
audit in `persist_source.rs`.

Companion fix to b607993, which corrected the analogous annotations
for `add_time_interval`, `sub_time_interval`, `mul_interval`, and
`div_interval`.

Fixes database-issues#9656.
The override was added to suppress the audit panic from
database-issues#9656; with the monotonicity annotations on
timestamp/date + interval corrected, filter pushdown should be sound
again. Removing this lets CI re-exercise the audit and surface any
remaining latent causes.
- Drop useless .into() in the new interpret.rs regression test (clippy).
- Update filter-pushdown.slt: sub_timestamp_interval is no longer
  pushdownable now that it's marked non-monotone, so the expected
  pushdown= lines for 'timestamp - INTERVAL day' queries are gone.
  This is the known tradeoff for soundness — even day-only intervals
  share the non-monotonic annotation since the abstract interpreter
  can't distinguish them statically.
date_bin(stride, source) = origin + floor((source - origin) / stride) * stride.
For a fixed source like 2024-01-01 12:00:00, a 1-day stride bins to
2024-01-01 00:00:00 but a 2-day stride bins to 2023-12-31 00:00:00 —
i.e. the lex-larger stride produces an earlier output. Same class of
bug as the timestamp/date + interval monotonicity issues.

Demotes date_bin_timestamp and date_bin_timestamp_tz from (true, true)
to (false, true). Still monotone in source.
Postgres-style `age(a, b)` is annotated `is_monotone = "(true, true)"`,
but it is non-monotone in either argument.

First argument: the carry logic in the calendar-aware difference borrows
a whole month's worth of days when the day field goes negative, which
breaks the lex order of `Interval` (months, days, micros) at month
boundaries:

  age(2024-03-31, 2024-02-15) = {1 month, 16 days}
  age(2024-04-01, 2024-02-15) = {1 month, 15 days}   ← lex-smaller
  age(2024-05-01, 2024-02-15) = {2 months, 15 days}

Second argument: after the algorithm's final sign-revert, the same
day-borrow makes the result dip non-monotonically as `b` crosses a
month boundary:

  age(2024-02-29, 2024-03-30) = {-1 month, -1 day}
  age(2024-02-29, 2024-03-31) = {-1 month, -2 days}  ← lex-smaller
  age(2024-02-29, 2024-04-01) = {-1 month, -1 day}

Demotes both `age_timestamp` and `age_timestamp_tz` to
`(false, false)`. Adds regression tests in `interpret.rs` exercising
the abstract interpreter via a `>=` predicate, mirroring the existing
tests for the analogous `add_timestamp_interval` and
`date_bin_timestamp` bugs.

Same bug class as 5257b0d and 51e86ed.
@def- def- force-pushed the pr-workload-capture-fixes branch from 8408777 to 7701761 Compare May 24, 2026 13:00
The runtime MFP evaluator (SafeMfpPlan::evaluate_inner) runs every
expression once all preceding predicates pass, so an expression that
errors on the actual data turns the whole row into an Err. The
abstract interpreter's mfp_filter / mfp_plan_filter, however, only
ANDs together the predicates and temporal bounds — so the AND result
misses the fallibility of any expression whose result column isn't
referenced by a predicate or bound. Persist filter pushdown then
discards parts that actually produce error rows, tripping the audit
panic in persist_source.rs.

Concretely from the audit log on database-issues#9656:

  expressions: [cast_string_to_uuid(merchant_group_id)]
  predicates:  [NOT IsNull(checksum)]
  upper_bounds: [cast_timestamp_tz_to_mz_timestamp(coalesce(deleted_at, ...))]

Stats say checksum is non-null and deleted_at is in the past. The
interpreter computed AND({True}, {False}) = {False} with fallible=false
and discarded the part. The actual evaluator: predicate passes,
cast_string_to_uuid is evaluated next, errors on the row's
merchant_group_id value, and the whole row is emitted as Err. Audit
catches the discrepancy.

Override mfp_filter and mfp_plan_filter on ColumnSpecs so that the
returned summary's fallible bit is set if any of the MFP expressions'
specs are fallible. This is conservative (we'll keep parts where a
predicate could-but-doesn't-have-to fail, even though predicate
short-circuiting would have prevented the expression from running), but
it's sound and matches the runtime semantics.

Adds a regression test that builds an MFP with one always-erroring
expression and one always-passing predicate, asserts that the
interpreter's summary may_fail.
@def- def- force-pushed the pr-workload-capture-fixes branch from 7701761 to d094b5c Compare May 25, 2026 02:09
The previous commits in this stack marked add/sub_timestamp_interval
(and the tz variants) as non-monotone in both arguments to preserve
correctness in the presence of calendar-month / day-clamping
arithmetic. But that conservative annotation costs pushdown for
common temporal-filter predicates like 't_col - INTERVAL ''1'' day <
literal', which were a motivating use case for filter pushdown in
the first place.

Recovers that pushdown by extending SpecialBinary with a
DynamicMonotone handler: when the interval argument is a literal with
months == 0, the operation reduces to adding a fixed number of
microseconds, which is monotone in both arguments. In that case the
interpreter computes a tight output range; in all other cases it
falls back to anything(), preserving the soundness fix.

The SpecialBinary's pushdownable hint is set to (true, false) so the
Trace pass routes 't_col +/- INTERVAL_lit' predicates through
pushdown regardless of months — at runtime the dynamic check decides
whether to narrow.
@def- def- force-pushed the pr-workload-capture-fixes branch 3 times, most recently from cdc64ce to 6c2255b Compare May 25, 2026 04:12
Captured workloads from production environments often have features the
local replay did not yet support; replaying them previously failed at
various points before useful test work could happen.

* config: add source_ingestion_*, balanced_*, and highcpu_* cluster
  replica sizes so CREATE CLUSTER REPLICA against these production size
  names is accepted locally.
* objects: accept quoted database names with non-identifier characters
  (e.g. hyphens) in CREATE CONNECTION; parse double-quoted REFERENCE /
  EXTERNAL REFERENCE parts so dotted references like
  `"db-name".schema.table` are split correctly.
* objects: upstream CREATE TABLE statements now use IF NOT EXISTS — the
  same physical Postgres/MySQL/SQL Server table can back multiple MZ
  sources and we were failing the second time.
* objects: strip options from captured SQL that don't survive local
  replay: empty `TEXT COLUMNS = ()` (planner panic), `EXCLUDE COLUMNS`
  (names upstream columns we don't recreate), Kafka `START OFFSET = (…)`
  (local topics have 1 partition), `VERSION = N` on CREATE SINK
  (internal-only option).
* column: extend type coverage so `<scalar>[]` arrays, `interval`,
  `oid`, `real`, and Materialize-specific list/record/aclitem variants
  produce sensible non-NULL values; `character` (no size) now emits a
  single char so it fits CHAR(1).
* column: function-call defaults like `pg_catalog.now()` are skipped in
  COPY and Kafka paths, where they would be inserted as literal strings
  and rejected.
* column: avro_type falls back to `string` for non-primitive types so
  fastavro can parse the generated schema.
* util / mzcompose: workload files can now be stored compressed as
  `.yml.zst` and are loaded transparently via the new `load_workload`
  helper. Glob `*.yml` also picks up the compressed variants. Large
  captured workloads compress ~30× with zstd, which keeps them under
  GitHub's 100 MiB per-file limit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@def- def- force-pushed the pr-workload-capture-fixes branch from 6c2255b to 9c02d48 Compare May 25, 2026 06:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants