[SPARK-57205][SQL] Add SupportsScanMerging to merge equivalent V2 file scans #56264
LuciferYang wants to merge 2 commits into
…e scans

### What changes were proposed in this pull request?

This adds a `SupportsScanMerging` mix-in for DSv2 `Scan` and teaches `MergeSubplans` (through `PlanMerger`) to use it. When two scans of the same table differ only in their projected columns and/or pushed filters, the optimizer asks the source to fuse them into one scan that covers both, and then deduplicates the subplans built on top. The interface is implemented once on the `FileScan` base, so every built-in file format (Parquet, ORC, CSV, JSON, Text, Avro) participates. Two scans merge when they read the same data (same file index, schema, options and partition filters); the merged scan reads the union of the two read schemas. When the pushed data filters differ, the merged scan widens them to `OR(f1, f2)` and reads a superset -- the exact per-side predicate is still enforced by the post-scan `Filter`, which `MergeSubplans` turns into per-aggregate `FILTER (WHERE ...)` clauses. That widening is off by default, gated by `spark.sql.files.scanMerge.ignorePushedDataFilters`. Merging is declined when an aggregate is pushed into the scan, since there is then no post-scan `Filter` to separate the two sides.

### Why are the changes needed?

V1 file sources already get this. Their filter and column pushdown happens during physical planning, so when `MergeSubplans` runs the logical plan still has plain `Filter`/`Project` nodes over identical relation leaves, which the existing rules merge. DSv2 bakes pushdown into `DataSourceV2ScanRelation` during logical optimization, so two subqueries that differ only in `WHERE` or `SELECT` become structurally different leaves and cannot be merged. For example TPC-DS q9 (fifteen scalar subqueries over `store_sales`) collapses to a single scan under V1 but reads the table fifteen times under V2. This closes that gap.

### Does this PR introduce _any_ user-facing change?

No. The connector interface and the config are internal, and the config defaults to off, so the default plan is unchanged.

### How was this patch tested?

New tests in `PlanMergeSuite` run each query through both the V1 and V2 file-source paths, with AQE on and off, and assert identical results and identical merge structure. The merge-semantics cases cover differing columns, differing data filters, differing partition filters, filter propagation through a join, and a multi-subquery composition, plus negative cases; the format-coverage cases run a representative query over Parquet, ORC and JSON. `MergeSubplansSuite`, `DataSourceV2Suite` and the Parquet/ORC V2 aggregate-pushdown suites still pass. I also checked that V1 and V2 produce identical results and merge structure on the affected TPC-DS queries (q2, q9, q28, q59, q77a, q88, q90) at scale factor 1.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.8
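For readers skimming the commit message, the merge rule it describes can be sketched roughly as follows. This is a self-contained toy model, not the PR's code: `ScanMergeSketch`, `ToyScan`, the `dataId` field (standing in for file index, schema and options), the string-valued filters, and the handling of an unfiltered side are all assumptions made for illustration, and the column names are only example values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical toy model; none of these types exist in Spark.
final class ScanMergeSketch {
    static final class ToyScan {
        final String dataId;                 // stands in for file index + schema + options
        final List<String> partitionFilters; // must match exactly for a merge
        final List<String> readColumns;      // projected (read) schema
        final String pushedDataFilter;       // pushed data filter as text, "" when none
        final boolean aggregatePushed;       // true when an aggregate was pushed into the scan

        ToyScan(String dataId, List<String> partitionFilters, List<String> readColumns,
                String pushedDataFilter, boolean aggregatePushed) {
            this.dataId = dataId;
            this.partitionFilters = partitionFilters;
            this.readColumns = readColumns;
            this.pushedDataFilter = pushedDataFilter;
            this.aggregatePushed = aggregatePushed;
        }
    }

    // Returns the merged scan, or empty when the merge is declined.
    static Optional<ToyScan> merge(ToyScan a, ToyScan b, boolean ignorePushedDataFilters) {
        // A pushed aggregate leaves no post-scan Filter to separate the two sides.
        if (a.aggregatePushed || b.aggregatePushed) return Optional.empty();
        // Both scans must read the same data with the same partition filters.
        if (!a.dataId.equals(b.dataId) || !a.partitionFilters.equals(b.partitionFilters)) {
            return Optional.empty();
        }
        String filter;
        if (a.pushedDataFilter.equals(b.pushedDataFilter)) {
            filter = a.pushedDataFilter; // strict case: only the columns differ
        } else if (!ignorePushedDataFilters) {
            return Optional.empty();     // relaxed widening is off by default
        } else if (a.pushedDataFilter.isEmpty() || b.pushedDataFilter.isEmpty()) {
            filter = "";                 // assumption: OR with an unfiltered side reads everything
        } else {
            filter = "(" + a.pushedDataFilter + ") OR (" + b.pushedDataFilter + ")";
        }
        // The merged scan reads the union of the two read schemas.
        Set<String> columns = new LinkedHashSet<>(a.readColumns);
        columns.addAll(b.readColumns);
        return Optional.of(
            new ToyScan(a.dataId, a.partitionFilters, new ArrayList<>(columns), filter, false));
    }

    public static void main(String[] args) {
        List<String> noPartitionFilters = new ArrayList<>();
        ToyScan a = new ToyScan("store_sales", noPartitionFilters,
            Arrays.asList("ss_quantity", "ss_net_paid"), "ss_quantity < 20", false);
        ToyScan b = new ToyScan("store_sales", noPartitionFilters,
            Arrays.asList("ss_quantity", "ss_ext_discount_amt"), "ss_quantity > 40", false);
        System.out.println(merge(a, b, false).isPresent()); // flag off: differing filters decline
        merge(a, b, true).ifPresent(m ->
            System.out.println(m.readColumns + " WHERE " + m.pushedDataFilter));
    }
}
```

The sketch keeps the strict path (same filters, different columns) independent of the flag, which mirrors the description's statement that only the filter widening is gated.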
cc @peter-toth Do you have any better ideas or suggestions for this optimization? Thanks!
Thanks for picking this up @LuciferYang. I haven't done a thorough review of the PR yet, but I read it through carefully alongside my old #37711 and wanted to share a thought on the approach before getting into the details.

The shape here is essentially the same as #37711 (which never landed), generalized in two important ways: the merge logic now lives once on

One thought on the design direction, and an alternative worth at least considering before we commit to this shape:

The current shape (call it Option A)

The merge body lives on the connector side, in
An alternative (Option B) SPARK-56385 (April 2026) added
The strict/relaxed split and the Trade-off
I don't have a position yet on whether to swap this PR for B or land A first and follow up -- wanted to put the alternative on the table since it changes who pays the cost (Spark once vs. every connector). If you'd like to experiment with Option B yourself, please go ahead; if you'd prefer, I can sketch it as a draft so we can compare side by side. Let me know which you'd like. cc @cloud-fan
@peter-toth Sorry for missing PR #37711 earlier, and many thanks for your review and suggestions.

You're right that with the body on

One thing worth pinning down in the Option B sketch: the holder's

A merge driven off the relation's

The strict column-union case (same pushed filters, differ only in projected columns) is universally safe and is the only on-by-default path. That's where B helps most, since it generalizes to every V2 source with no per-connector code. The relaxed OR-widen only makes sense for sources that keep a post-scan residual

There's also a layering issue that affects where the generic body can live.

I'd propose we land A first (complete, tested, ships the on-by-default strict column-union case), then generalize in a follow-up that (1) adds a generic Spark-side strict merge in sql/core behind the marker interface, (2) adds the

Either way, yes please -- I'd like to take you up on the offer to sketch B. Seeing the two side by side would make the strict-vs-relaxed boundary concrete and show whether our follow-ups converge. If you can wire B so the post-scan residual is visible to the leaf merge, that dissolves my main concern -- I may well be missing a simpler path. cc @cloud-fan
Thanks for the thorough writeup, @LuciferYang -- you're right on all the specifics, and the

Agreed that the relation's

Where I'm more optimistic is the conclusion. The predicate isn't lost, it's just not on the relation: for best-effort sources it's the post-scan

You're right about layering too: the legacy

I don't want to block your PR on a hunch, though. Let me spend a bit of time next week sketching the generic version so we can put the two side by side, as you suggested -- then we can see whether the strict-vs-relaxed boundary really dissolves, and decide together whether the DSv2 file sources are worth their own code on top of the generic core, or whether landing this first and generalizing later is cleaner after all. I'll report back here.
…-56385 pushedFilters dependency, adapted to 4.0.x

- backport SPARK-56385 pushedFilters field on DataSourceV2ScanRelation (surgical: field + doCanonicalize + V2ScanRelationPushDown population; drop join-pushdown entanglement)
- update ~12 positional DataSourceV2ScanRelation patterns to 6-arg
- cherry-pick apache#56264: FileScan union (SupportsRuntimeCatalystFilters + SupportsScanMerging), PlanMerger coexists with SPARK-40193 filter-prop
- strip bindingPolicy from new conf; drop ParquetScan variant comparison (no variant pushdown in 4.0.x)
### What changes were proposed in this pull request?

This adds a `SupportsScanMerging` mix-in for DSv2 `Scan` and wires it into `MergeSubplans` through `PlanMerger`. When two scans of the same table differ only in their projected columns and/or pushed filters, the optimizer asks the source to fuse them into a single scan that covers both, then deduplicates the subplans built on top. The interface is implemented once on the `FileScan` base, so every built-in file format (Parquet, ORC, CSV, JSON, Text, Avro) participates. Two scans merge when they read the same data (same file index, schema, options and partition filters), and the merged scan reads the union of their read schemas. When the pushed data filters differ, the merged scan widens them to `OR(f1, f2)` and reads a superset; the exact per-side predicate is still enforced by the post-scan `Filter`, which `MergeSubplans` rewrites into per-aggregate `FILTER (WHERE ...)` clauses. That widening is off by default and gated by `spark.sql.files.scanMerge.ignorePushedDataFilters`. Merging is declined when an aggregate is pushed into the scan, because there is then no post-scan `Filter` left to separate the two sides.

### Why are the changes needed?

V1 file sources already merge in this case. Their filter and column pushdown happens during physical planning, so when `MergeSubplans` runs the logical plan still has plain `Filter`/`Project` nodes over identical relation leaves, which the existing rules combine. DSv2 instead bakes pushdown into `DataSourceV2ScanRelation` during logical optimization, so two subqueries that differ only in `WHERE` or `SELECT` become structurally different leaves and cannot be merged. TPC-DS q9 is a good example: its fifteen scalar subqueries over `store_sales` collapse to a single scan under V1 but read the table fifteen times under V2. This change brings V2 to parity.

### Does this PR introduce _any_ user-facing change?

No. The connector interface and the config are both internal, and the config defaults to off, so the default plan is unchanged.

### How was this patch tested?

New tests in `PlanMergeSuite` run each query through both the V1 and V2 file-source paths, with AQE on and off, and assert that the results and the merge structure are identical. They cover differing columns, differing data filters, differing partition filters, filter propagation through a join, and a multi-subquery composition, plus the negative cases, and they run a representative query across Parquet, ORC and JSON for format coverage. `MergeSubplansSuite`, `DataSourceV2Suite`, and the Parquet/ORC V2 aggregate-pushdown suites still pass. I also verified that V1 and V2 produce identical results and merge structure on the affected TPC-DS queries (q2, q9, q28, q59, q77a, q88, q90) at scale factor 1.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.8
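
The claim above that a widened scan stays correct because each aggregate re-applies its own predicate can be checked on a toy example. The following is a minimal sketch under stated assumptions, not Spark code: `WidenedScanSketch`, `countSeparate` and `countMerged` are hypothetical names, a list of integers stands in for the table, and a plain counter stands in for each scalar-subquery aggregate.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical toy model of "widen the pushed filter, re-filter per aggregate".
final class WidenedScanSketch {
    // Baseline: one scan per subquery, each with its own filter pushed down.
    static long countSeparate(List<Integer> rows, IntPredicate pushed) {
        return rows.stream().mapToInt(Integer::intValue).filter(pushed).count();
    }

    // Merged: a single pass whose scan keeps rows matching f1 OR f2 (a superset),
    // while each count re-applies its own predicate, playing the FILTER (WHERE ...) role.
    static long[] countMerged(List<Integer> rows, IntPredicate f1, IntPredicate f2) {
        long c1 = 0;
        long c2 = 0;
        for (int v : rows) {
            if (!(f1.test(v) || f2.test(v))) continue; // widened pushed filter
            if (f1.test(v)) c1++;                      // per-aggregate filter, side 1
            if (f2.test(v)) c2++;                      // per-aggregate filter, side 2
        }
        return new long[] {c1, c2};
    }

    public static void main(String[] args) {
        List<Integer> rows = Arrays.asList(5, 15, 25, 35, 45);
        IntPredicate f1 = v -> v < 30;
        IntPredicate f2 = v -> v > 10;
        long[] merged = countMerged(rows, f1, f2);
        System.out.println(countSeparate(rows, f1) == merged[0]);
        System.out.println(countSeparate(rows, f2) == merged[1]);
    }
}
```

The same toy also shows why a pushed aggregate blocks the merge: if the count were computed inside the scan, there would be no place left to re-apply `f1` and `f2` separately.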