
expr: propagate MFP expression fallibility through filter pushdown#36721

Draft
antiguru wants to merge 1 commit into main from claude/github-issue-9656-mfp-expr-fallibility

@antiguru antiguru commented May 25, 2026

Motivation

Even with the monotonicity fixes in #36702 (merged) and #36706, workload-replay was still hitting the "persist filter pushdown correctness violation!" panic. The audit log on that failure (database-issues#9656, u103) showed:

mfp = MapFilterProject {
  expressions: [CastStringToUuid(Column(3, "merchant_group_id"))],
  predicates:  [(2, Not(IsNull(Column(1, "checksum"))))],
  projection: [0, 1, 2, 4, 5, 6],
  input_arity: 6
}
upper_bounds: [cast_timestamp_tz_to_mz_timestamp(Coalesce(deleted_at, '9999-12-31'))]

stats: checksum lower="bar" upper="bar"; deleted_at = 2025-06-16 (single value)
result: Err((DataflowErrorSer(143), 1779629555000, +1))

The interpreter computed:

  • predicate NOT IsNull(checksum) → {True} (stats say non-null).
  • upper-bound check 2025-06-16 >= mz_now → {False} (frontier was past 2025-06-16).
  • AND({True}, {False}) = {False}, fallible = false.

may_keep=false, may_error=false, may_skip=true → Discard the part.

But the actual MFP runtime ran the predicate (True), then evaluated the cast_string_to_uuid("merchant_group_id_a") expression (NOT a valid UUID), which errored, and the whole row was emitted as Err. The discarded part actually had an error row to emit.
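The mismatch can be reproduced in a toy model. The sketch below is not Materialize code: `Summary`, `looks_like_uuid`, and `evaluate_row` are hypothetical stand-ins, timestamps are plain integers, and the evaluation order (predicate, then cast, then temporal bound) is assumed from the description above.

```rust
/// Toy summary of a boolean expression over one part (hypothetical type,
/// not Materialize's `ColumnSpecs`).
#[derive(Clone, Copy, Debug, PartialEq)]
struct Summary {
    may_be_true: bool,
    may_be_false: bool,
    fallible: bool,
}

impl Summary {
    /// Abstract AND: true only if both sides may be true, false if either
    /// side may be false, fallible if either side is fallible.
    fn and(self, other: Summary) -> Summary {
        Summary {
            may_be_true: self.may_be_true && other.may_be_true,
            may_be_false: self.may_be_false || other.may_be_false,
            fallible: self.fallible || other.fallible,
        }
    }

    /// A part may be skipped only if no row can pass and no row can error.
    fn may_skip(self) -> bool {
        !self.may_be_true && !self.fallible
    }
}

/// Simplified UUID shape check (8-4-4-4-12 hex digits), standing in for
/// the real cast.
fn looks_like_uuid(s: &str) -> bool {
    s.len() == 36
        && s.chars().enumerate().all(|(i, c)| match i {
            8 | 13 | 18 | 23 => c == '-',
            _ => c.is_ascii_hexdigit(),
        })
}

/// Toy runtime: Ok(true) keeps the row, Ok(false) drops it, Err is an
/// error row. The cast runs as soon as the predicate passes, before the
/// temporal bound is consulted.
fn evaluate_row(
    checksum: Option<&str>,
    merchant_group_id: &str,
    deleted_at: u64,
    now: u64,
) -> Result<bool, String> {
    if checksum.is_none() {
        return Ok(false);
    }
    if !looks_like_uuid(merchant_group_id) {
        return Err(format!("invalid UUID: {:?}", merchant_group_id));
    }
    Ok(deleted_at >= now)
}
```

With the predicate summarized as `{True}` and the bound as `{False}`, `and` yields a summary for which `may_skip()` holds, while `evaluate_row(Some("bar"), "merchant_group_id_a", 10, 20)` returns an `Err`. That is the same disagreement the audit caught.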

Description

This is a different bug class from the monotonicity fixes in #36702 / #36706 / #36708; it's independent and can land on its own. The runtime evaluator (SafeMfpPlan::evaluate_inner) runs every MFP expression once all preceding predicates pass; any expression error propagates as the row's error result. The abstract interpreter's mfp_filter / mfp_plan_filter, however, only ANDs together the predicates and temporal bounds — so the AND result misses the fallibility of any expression whose result column isn't referenced by a predicate or bound.

This PR overrides mfp_filter and mfp_plan_filter on ColumnSpecs to set the returned summary's fallible bit if any of the MFP expressions' specs are fallible. This is conservative (we'll keep parts where a predicate could-but-doesn't-have-to fail, even though predicate short-circuiting would have prevented the expression from running), but it is sound and matches the runtime semantics.
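A minimal sketch of the shape of that change, using hypothetical names (`FilterSummary`, `and_all`, `and_all_with_expressions`) rather than the real `ColumnSpecs` override: the pre-fix path folds only predicate and bound summaries, and the post-fix path additionally ORs in the fallibility of every expression.

```rust
/// Toy filter summary (hypothetical; the real type lives in mz-expr).
#[derive(Clone, Copy, Debug, PartialEq)]
struct FilterSummary {
    may_keep: bool,
    fallible: bool,
}

/// Pre-fix shape: AND together the predicate and temporal-bound summaries.
fn and_all(parts: &[FilterSummary]) -> FilterSummary {
    let init = FilterSummary { may_keep: true, fallible: false };
    parts.iter().fold(init, |acc, p| FilterSummary {
        may_keep: acc.may_keep && p.may_keep,
        fallible: acc.fallible || p.fallible,
    })
}

/// Post-fix shape: same AND, then set the fallible bit if any MFP
/// expression is fallible, whether or not a predicate references it.
fn and_all_with_expressions(parts: &[FilterSummary], expr_fallible: &[bool]) -> FilterSummary {
    let mut summary = and_all(parts);
    summary.fallible |= expr_fallible.iter().any(|f| *f);
    summary
}

/// A part may be skipped only if nothing can be kept and nothing can error.
fn may_skip(summary: FilterSummary) -> bool {
    !summary.may_keep && !summary.fallible
}
```

For a passing predicate plus a failed upper bound, `and_all` produces a skippable summary, whereas `and_all_with_expressions` with one fallible expression does not, so the part is fetched and the error row is emitted.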

The default trait-level implementation (used by Trace) is unchanged, since Trace is about pushdownability rather than soundness.

Verification

  • New regression test interpret::tests::test_mfp_unreferenced_fallible_expression: builds an MFP with one always-erroring expression (cast_string_to_uuid("not-a-uuid")) and one always-passing predicate, then asserts that the interpreter's summary may_fail(). Fails on the pre-fix code; passes on the fix.
  • cargo test -p mz-expr --lib — all tests pass.
  • Workload-replay should stop hitting the audit panic on this class of MFP (cast/parse expression columns that aren't gated by a predicate).

Cost

Any MFP with a fallible expression in its expressions list will now keep parts that would previously have been discarded by temporal-bound checks. The tighter version would only mark fallibility when the predicates' spec admits True (so the expression would actually run at runtime); that's a follow-up if the conservative version costs too much in practice.
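The difference between the two variants can be sketched as follows. This is an illustration under assumptions, not the planned follow-up: `PredicateSpec`, `fallible_conservative`, and `fallible_tighter` are hypothetical names, and temporal bounds are deliberately left out of the gating condition because the text above only mentions the predicates' spec.

```rust
/// Toy predicate spec (hypothetical): can the predicate evaluate to true
/// on some row of the part, and can evaluating it error?
#[derive(Clone, Copy, Debug)]
struct PredicateSpec {
    may_be_true: bool,
    fallible: bool,
}

/// Conservative rule from this PR: any fallible expression marks the
/// summary fallible.
fn fallible_conservative(predicates: &[PredicateSpec], expr_fallible: &[bool]) -> bool {
    predicates.iter().any(|p| p.fallible) || expr_fallible.iter().any(|f| *f)
}

/// Tighter rule: expression fallibility counts only if every predicate
/// may pass, since otherwise the expression is assumed never to run.
fn fallible_tighter(predicates: &[PredicateSpec], expr_fallible: &[bool]) -> bool {
    let predicates_may_pass = predicates.iter().all(|p| p.may_be_true);
    predicates.iter().any(|p| p.fallible)
        || (predicates_may_pass && expr_fallible.iter().any(|f| *f))
}
```

The two rules differ only when the stats prove that some predicate never passes. In the incident above the predicate was `{True}`, so both rules would keep the part.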

@antiguru antiguru force-pushed the claude/github-issue-9656-followup-pushdown branch from 6b3a33e to ecd05d3 Compare May 26, 2026 15:55
@antiguru antiguru force-pushed the claude/github-issue-9656-mfp-expr-fallibility branch from 2d99c58 to 718e095 Compare May 26, 2026 15:56
@DAlperin

Could we add a proptest alongside test_timestamp_plus_interval_dynamic_monotone to verify the monotonicity claim directly against the function impl?

@antiguru antiguru force-pushed the claude/github-issue-9656-followup-pushdown branch from ecd05d3 to 2aab955 Compare May 26, 2026 17:06
@antiguru antiguru force-pushed the claude/github-issue-9656-mfp-expr-fallibility branch from 718e095 to 94aec8d Compare May 26, 2026 17:07
The runtime MFP evaluator (SafeMfpPlan::evaluate_inner) runs every
expression once all preceding predicates pass, so an expression that
errors on the actual data turns the whole row into an Err. The
abstract interpreter's mfp_filter / mfp_plan_filter, however, only
ANDs together the predicates and temporal bounds — so the AND result
misses the fallibility of any expression whose result column isn't
referenced by a predicate or bound. Persist filter pushdown then
discards parts that actually produce error rows, tripping the audit
panic in persist_source.rs.

Concretely from the audit log on database-issues#9656:

  expressions: [cast_string_to_uuid(merchant_group_id)]
  predicates:  [NOT IsNull(checksum)]
  upper_bounds: [cast_timestamp_tz_to_mz_timestamp(coalesce(deleted_at, ...))]

Stats say checksum is non-null and deleted_at is in the past. The
interpreter computed AND({True}, {False}) = {False} with fallible=false
and discarded the part. The actual evaluator: predicate passes,
cast_string_to_uuid is evaluated next, errors on the row's
merchant_group_id value, and the whole row is emitted as Err. Audit
catches the discrepancy.

Override mfp_filter and mfp_plan_filter on ColumnSpecs so that the
returned summary's fallible bit is set if any of the MFP expressions'
specs are fallible. This is conservative (we'll keep parts where a
predicate could-but-doesn't-have-to fail, even though predicate
short-circuiting would have prevented the expression from running), but
it's sound and matches the runtime semantics.

Adds a regression test that builds an MFP with one always-erroring
expression and one always-passing predicate, asserts that the
interpreter's summary may_fail.
@antiguru antiguru force-pushed the claude/github-issue-9656-mfp-expr-fallibility branch from 94aec8d to 5c7a128 Compare May 26, 2026 17:08
@antiguru antiguru changed the base branch from claude/github-issue-9656-followup-pushdown to main May 26, 2026 17:09