
feat: Plumb Parquet virtual columns (row_number) through TableSchema and ParquetOpener #22026

Open
mbutrovich wants to merge 17 commits into apache:main from mbutrovich:virtual-columns-table-schema


@mbutrovich
Contributor

@mbutrovich mbutrovich commented May 5, 2026

Which issue does this PR close?

Rationale for this change

arrow-rs 57.1.0+ supports Parquet virtual columns (row_number, row_group_index) via ArrowReaderOptions::with_virtual_columns, and DataFusion pins a new-enough arrow-rs for the API to be available. DataFusion does not yet plumb the option through ParquetOpener, so consumers (notably Comet) cannot project Spark's _tmp_metadata_row_index through the native_datafusion scan path.

This PR adds the minimal opener-boundary plumbing so TableSchema can carry virtual columns and the Parquet reader produces them. UX / SQL-layer surface for virtual columns stays deferred to the epic in #20135 — this follows the same framing alamb blessed for #20071 (the input_file_name() UDF).

What changes are included in this PR?

  • TableSchema::with_virtual_columns(...) builder + virtual_columns() getter. Layout: [file, partition, virtual]. Composable with with_table_partition_cols in either order.
  • TableSchema::schema_without_virtual_columns() — file + partition schema used by pushdown-planning paths that can't evaluate virtual-col refs.
  • ParquetOpener forwards the fields to ArrowReaderOptions::with_virtual_columns. It augments the schemas passed to the expr-adapter / simplifier with the virtual fields so that virtual-col refs are identity-rewritten, strips the virtual fields from the projection fed to ProjectionMask::roots (which only understands file columns), and appends them to stream_schema so that reassign_expr_columns resolves them by name.
  • New ParquetVirtualColumn enum with TryFrom<&FieldRef> (in datasource-parquet::virtual_column) gates which arrow-rs virtual extension types are accepted. Currently only RowNumber; adding a variant (e.g. RowGroupIndex) is a compile-time obligation. Replaces the earlier runtime string-allowlist so the contract lives in the type system.
  • ParquetSource::try_pushdown_filters classifies filters against the file+partition schema (not the full table schema) so predicates referencing virtual columns are reported as PushedDown::No and the FilterExec stays above the scan — arrow-rs's RowFilter addresses parquet leaves only and can't evaluate virtual-column refs, so silently pushing them would produce wrong results.
  • Defensive check in the opener: build_virtual_columns_state (run once per scan partition at morselizer-build time) errors when pushdown_filters=true and the predicate references a virtual column, with a clear remediation message pointing at try_pushdown_filters. This catches callers that bypass the optimizer and set the predicate on ParquetSource directly.
  • arrow-schema added as a direct dep (previously transitive via arrow) so the enum references RowNumber::NAME from arrow-rs instead of hardcoding the string.
  • Explicitly not in scope (follow-ups): ListingTable / SQL-layer surface, a three-arg constructor on TableSchema, ParquetSource::with_virtual_columns, and RowGroupIndex support.

Are these changes tested?

Yes. New unit tests in opener.rs:

  • test_row_index_basic — single row group, select data + row_number.
  • test_row_index_projection_only — select only row_number.
  • test_row_index_multi_row_group — 3 × 100 rows, verify absolute 0..300 across boundaries.
  • test_row_index_with_row_group_skip — predicate stats-prunes the middle row group; verify row numbers stay absolute (0..100 ++ 200..300). Critical correctness gate for Spark (and for arrow-rs#8863, "Fix RowNumberReader when not all row groups are selected").
  • test_row_index_with_partition_cols — partition + virtual + data columns compose correctly.
  • test_row_index_nullable_int64 — nullability flag flows through unchanged (matches Spark's _tmp_metadata_row_index declaration).
  • test_unsupported_virtual_extension_type_rejected — using RowGroupIndex (a real arrow-rs type deliberately not in the enum yet) errors with NotImplemented instead of silently forwarding.
  • test_row_index_predicate_pushdown_mixed_or_errors / _virtual_only_errors / _allowed_when_pushdown_disabled — exercise the opener's defensive check for virtual-col predicate refs with pushdown_filters=true, and confirm the pushdown_filters=false path is unaffected.
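
To make the row-group-skip expectation concrete, here is a minimal standalone sketch of how absolute row numbers behave when a row group is pruned. It is not the arrow-rs or DataFusion implementation; the function name `absolute_row_numbers` and its inputs (per-row-group sizes plus a keep/skip mask) are hypothetical and exist only for illustration.

```rust
/// Hypothetical helper: returns the absolute row numbers emitted when only
/// the selected row groups of a single file are read.
fn absolute_row_numbers(row_group_sizes: &[i64], selected: &[bool]) -> Vec<i64> {
    assert_eq!(row_group_sizes.len(), selected.len());
    let mut out = Vec::new();
    // Absolute index of the first row of the current row group.
    let mut first_row = 0i64;
    for (&size, &keep) in row_group_sizes.iter().zip(selected) {
        if keep {
            out.extend(first_row..first_row + size);
        }
        // Advance even for skipped groups so numbering stays absolute.
        first_row += size;
    }
    out
}

fn main() {
    // 3 x 100 rows with the middle row group pruned by statistics.
    let rows = absolute_row_numbers(&[100, 100, 100], &[true, false, true]);
    println!(
        "{} rows read; first row after the skipped group is {}",
        rows.len(),
        rows[100]
    );
}
```

The point of the sketch is the unconditional `first_row += size`: a reader that only advanced the counter for groups it actually decoded would renumber the third group starting at 100, which is the failure mode the test guards against.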

In source.rs: test_try_pushdown_filters_rejects_virtual_column_refs pins the planner-boundary contract — file-col filters are PushedDown::Yes, virtual-only and mixed filters are PushedDown::No.

In virtual_column.rs: unit tests covering TryFrom<&FieldRef> for valid, missing-extension-type, and unsupported-extension-type inputs.

Plus a TableSchema unit test verifying the [file, partition, virtual] layout is stable regardless of builder-call order.
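
As a rough illustration of that layout guarantee, the sketch below models the three column groups as separate lists and only concatenates them when the full schema is requested. `SketchTableSchema` is a hypothetical stand-in that tracks column names only; its method names mirror the ones in this PR, but the bodies are assumptions and not the actual TableSchema code.

```rust
/// Hypothetical stand-in for TableSchema that tracks column names only.
#[derive(Debug, Default, Clone)]
struct SketchTableSchema {
    file: Vec<String>,
    partition: Vec<String>,
    virtual_cols: Vec<String>,
}

fn to_strings(cols: &[&str]) -> Vec<String> {
    cols.iter().map(|c| c.to_string()).collect()
}

impl SketchTableSchema {
    fn new(file_cols: &[&str]) -> Self {
        Self { file: to_strings(file_cols), ..Default::default() }
    }

    fn with_table_partition_cols(mut self, cols: &[&str]) -> Self {
        self.partition = to_strings(cols);
        self
    }

    fn with_virtual_columns(mut self, cols: &[&str]) -> Self {
        self.virtual_cols = to_strings(cols);
        self
    }

    /// Full layout: [file, partition, virtual], independent of call order.
    fn table_schema(&self) -> Vec<String> {
        self.file
            .iter()
            .chain(&self.partition)
            .chain(&self.virtual_cols)
            .cloned()
            .collect()
    }

    /// File + partition columns only, as used for pushdown planning.
    fn schema_without_virtual_columns(&self) -> Vec<String> {
        self.file.iter().chain(&self.partition).cloned().collect()
    }
}

fn main() {
    let a = SketchTableSchema::new(&["id"])
        .with_table_partition_cols(&["day"])
        .with_virtual_columns(&["row_number"]);
    let b = SketchTableSchema::new(&["id"])
        .with_virtual_columns(&["row_number"])
        .with_table_partition_cols(&["day"]);
    println!("{:?}", a.table_schema());
    println!("{:?}", b.table_schema());
    println!("{:?}", b.schema_without_virtual_columns());
}
```

Keeping the groups separate until the schema is materialized is one simple way to make the order of builder calls irrelevant.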

Are there any user-facing changes?

Public API additions: TableSchema::with_virtual_columns(...), TableSchema::virtual_columns(), TableSchema::schema_without_virtual_columns(), and ParquetVirtualColumn (re-exported from datafusion-datasource-parquet). No existing API changed; no breaking changes.

mbutrovich added 2 commits May 5, 2026 13:21
…and ParquetOpener, gated behind a tested-only extension-type allowlist, to unblock Comet's native-DataFusion support for Spark's _tmp_metadata_row_index.
@github-actions github-actions Bot added the datasource Changes to the datasource crate label May 5, 2026
Comment thread datafusion/datasource-parquet/src/opener.rs
Comment thread datafusion/datasource/src/table_schema.rs
Comment thread datafusion/datasource/src/table_schema.rs
@adriangb
Contributor

adriangb commented May 5, 2026

My main concern is #22026 (comment).

The various schemas in opener.rs are already quite complex, this risks making it worse.

@mbutrovich
Contributor Author

> My main concern is #22026 (comment).
>
> The various schemas in opener.rs are already quite complex, this risks making it worse.

Thanks for the review @adriangb! Agreed it could make things more complicated, but if DataFusion is ever going to support these virtual columns it might be unavoidable. I think it's good to hash this stuff out in the smallest possible PR at the opener level. I'll push an update later today.

@mbutrovich
Contributor Author

Thanks again for the review @adriangb! Hopefully I addressed all of the feedback, but happy to keep chatting about it.

Mixed virtual/file predicates with pushdown_filters=true

Confirmed the silent-drop bug with failing tests. Root cause: ParquetSource::try_pushdown_filters called can_expr_be_pushed_down_with_schemas with the full table schema (now including virtual columns), so filters referencing row_number were marked PushedDown::Yes → FilterExec removed → the scan's build_row_filter couldn't resolve the virtual-col ref against physical_file_schema and silently dropped the conjunct.

Arrow-rs can't accept virtual-column refs in a RowFilter at all: ArrowPredicate::projection() returns a ProjectionMask over parquet leaves only, and virtual columns are synthesized after filter evaluation. So virtual columns are projectable but never pushable.

Fix: added TableSchema::schema_without_virtual_columns() (file + partition, excluding virtual) and try_pushdown_filters uses that. Virtual-col filters are now reported PushedDown::No and the FilterExec stays above the scan.
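
A minimal sketch of the classification rule described above, assuming each filter can be reduced to the set of column names it references. `SketchFilter`, `classify`, and the local `PushedDown` enum are hypothetical stand-ins written for this example; they are not DataFusion's types or the code in source.rs.

```rust
/// Local stand-in for the Yes/No pushdown verdict.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PushedDown {
    Yes,
    No,
}

/// Hypothetical filter, reduced to a label and the columns it references.
struct SketchFilter {
    label: &'static str,
    columns: Vec<&'static str>,
}

/// A filter is pushable only if every column it references exists in the
/// schema WITHOUT virtual columns (file + partition columns).
fn classify(filters: &[SketchFilter], pushable_columns: &[&str]) -> Vec<PushedDown> {
    filters
        .iter()
        .map(|f| {
            if f.columns.iter().all(|c| pushable_columns.contains(c)) {
                PushedDown::Yes
            } else {
                PushedDown::No
            }
        })
        .collect()
}

fn main() {
    // File + partition columns; the virtual row_number column is excluded.
    let schema_without_virtual = ["id", "day"];
    let filters = vec![
        SketchFilter { label: "id = 1", columns: vec!["id"] },
        SketchFilter { label: "row_number < 10", columns: vec!["row_number"] },
        SketchFilter { label: "id = 1 OR row_number = 5", columns: vec!["id", "row_number"] },
    ];
    let verdicts = classify(&filters, &schema_without_virtual);
    for (f, v) in filters.iter().zip(&verdicts) {
        println!("{:<28} -> {:?}", f.label, v);
    }
}
```

Under this rule a virtual-only filter and a mixed filter both come back as `No`, so the plan keeps a FilterExec above the scan for them.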

Defense-in-depth in the opener for callers who bypass the optimizer (e.g. manual plan builders): prepare_open_file rejects pushdown_filters=true + virtual-col predicate with a clear error pointing at with_pushdown_filters(false) or keeping the filter above the scan.

Tests: source.rs::test_try_pushdown_filters_rejects_virtual_column_refs (planner boundary), plus three opener-level tests covering mixed OR, virtual-only, and the allowed pushdown_filters=false case.

Ordering doc on virtual_columns

Struct field doc now spells out the [file, partition, virtual] layout, matching the builder methods.

Enum + TryFrom

Added ParquetVirtualColumn with TryFrom<&FieldRef> in a new virtual_column.rs. The runtime allowlist in the opener is replaced with ParquetVirtualColumn::try_from(field)?. Adding a new variant (e.g. RowGroupIndex) is now a compile-time obligation, and consumers can pattern-match instead of string-comparing extension-type names. Exposed as pub use ParquetVirtualColumn at the crate root.
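
The shape of that enum-plus-TryFrom gate can be sketched without any Arrow dependency. Everything below is hypothetical: `SketchField` stands in for a field carrying an optional extension-type name, `SketchVirtualColumn` stands in for ParquetVirtualColumn, the two extension-type strings are placeholders (the PR takes the real name from RowNumber::NAME in arrow-rs), and the error type is a plain String instead of a DataFusion error.

```rust
use std::convert::TryFrom;

// Placeholder extension-type names for this sketch only.
const ROW_NUMBER_EXT: &str = "sketch.row_number";
const ROW_GROUP_INDEX_EXT: &str = "sketch.row_group_index";

/// Hypothetical field: a name plus an optional extension-type name.
#[derive(Debug)]
struct SketchField {
    name: &'static str,
    extension_type: Option<&'static str>,
}

/// Hypothetical stand-in for the enum of accepted virtual columns.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SketchVirtualColumn {
    RowNumber,
}

impl TryFrom<&SketchField> for SketchVirtualColumn {
    type Error = String;

    fn try_from(field: &SketchField) -> Result<Self, Self::Error> {
        match field.extension_type {
            Some(ROW_NUMBER_EXT) => Ok(SketchVirtualColumn::RowNumber),
            Some(other) => Err(format!(
                "virtual column '{}': extension type '{}' is not supported",
                field.name, other
            )),
            None => Err(format!(
                "virtual column '{}' has no extension type",
                field.name
            )),
        }
    }
}

/// No wildcard arm: adding a variant makes this match fail to compile
/// until the new case is handled.
fn describe(col: SketchVirtualColumn) -> &'static str {
    match col {
        SketchVirtualColumn::RowNumber => "row number within the file",
    }
}

fn main() {
    let fields = [
        SketchField { name: "row_index", extension_type: Some(ROW_NUMBER_EXT) },
        SketchField { name: "rg_index", extension_type: Some(ROW_GROUP_INDEX_EXT) },
        SketchField { name: "plain", extension_type: None },
    ];
    for field in &fields {
        match SketchVirtualColumn::try_from(field) {
            Ok(col) => println!("{}: {}", field.name, describe(col)),
            Err(e) => println!("rejected: {}", e),
        }
    }
}
```

The exhaustive `match` in `describe` is what turns "remember to update the allowlist" into a compiler error when a variant is added.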

@mbutrovich mbutrovich requested a review from adriangb May 5, 2026 20:52
@adriangb
Contributor

adriangb commented May 5, 2026

I think this would then have a negative interaction with the goal of turning filter pushdown on by default. Maybe we'll always have to apply some filters as a FilterExec and that's fine...

@mbutrovich
Contributor Author

mbutrovich commented May 5, 2026

> I think this would then have a negative interaction with the goal of turning filter pushdown on by default. Maybe we'll always have to apply some filters as a FilterExec and that's fine...

Comet conservatively never removes FilterExec nodes above scans with pushed down filters, though that maybe shouldn't be the case.

Wouldn't this only prevent filter pushdown for filters that reference virtual columns?

@adriangb
Contributor

adriangb commented May 5, 2026

> Wouldn't this only prevent filter pushdown for filters that reference virtual columns?

Yeah but it means we'll have to keep the split forever. Which might have been the case anyway and maybe a non issue.

It also means that any filter that references a virtual column cannot be pushed down even if part of it would benefit from pushdown, e.g. row_id = 1 AND pk = 1, but I'm not sure that's a realistic scenario. In the past we prevented pushdown of projection columns and that was a real issue: we'd see queries in prod from users along the lines of day = '...' OR pk = 1 that could not get pushed down.
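
To illustrate the distinction between the two predicate shapes mentioned here, the sketch below splits a filter into its top-level AND conjuncts and keeps only the conjuncts that touch a virtual column above the scan. This is a hypothetical model written for this discussion, not a description of what DataFusion or this PR does; `SketchExpr`, `split_conjunction`, and `split_for_pushdown` are made-up names.

```rust
/// Hypothetical expression tree: leaf predicates joined by AND / OR.
#[derive(Debug)]
enum SketchExpr {
    /// A leaf predicate: a display label plus the columns it references.
    Pred { label: &'static str, columns: Vec<&'static str> },
    And(Box<SketchExpr>, Box<SketchExpr>),
    Or(Box<SketchExpr>, Box<SketchExpr>),
}

fn pred(label: &'static str, columns: &[&'static str]) -> SketchExpr {
    SketchExpr::Pred { label, columns: columns.to_vec() }
}

fn describe(expr: &SketchExpr) -> String {
    match expr {
        SketchExpr::Pred { label, .. } => label.to_string(),
        SketchExpr::And(l, r) => format!("({} AND {})", describe(l), describe(r)),
        SketchExpr::Or(l, r) => format!("({} OR {})", describe(l), describe(r)),
    }
}

fn references_any(expr: &SketchExpr, cols: &[&str]) -> bool {
    match expr {
        SketchExpr::Pred { columns, .. } => columns.iter().any(|c| cols.contains(c)),
        SketchExpr::And(l, r) | SketchExpr::Or(l, r) => {
            references_any(l, cols) || references_any(r, cols)
        }
    }
}

/// Flattens nested top-level ANDs into a list of conjuncts.
fn split_conjunction(expr: &SketchExpr) -> Vec<&SketchExpr> {
    match expr {
        SketchExpr::And(l, r) => {
            let mut parts = split_conjunction(l);
            parts.extend(split_conjunction(r));
            parts
        }
        other => vec![other],
    }
}

/// Returns (pushable conjuncts, conjuncts that must stay above the scan).
fn split_for_pushdown<'a>(
    expr: &'a SketchExpr,
    virtual_cols: &[&str],
) -> (Vec<&'a SketchExpr>, Vec<&'a SketchExpr>) {
    split_conjunction(expr)
        .into_iter()
        .partition(|e| !references_any(e, virtual_cols))
}

fn main() {
    let and_filter = SketchExpr::And(
        Box::new(pred("row_id = 1", &["row_id"])),
        Box::new(pred("pk = 1", &["pk"])),
    );
    let or_filter = SketchExpr::Or(
        Box::new(pred("day = '...'", &["day"])),
        Box::new(pred("row_id = 1", &["row_id"])),
    );
    for filter in [&and_filter, &or_filter] {
        let (pushable, kept) = split_for_pushdown(filter, &["row_id"]);
        let pushable: Vec<String> = pushable.iter().map(|e| describe(e)).collect();
        let kept: Vec<String> = kept.iter().map(|e| describe(e)).collect();
        println!("{} -> pushable {:?}, kept {:?}", describe(filter), pushable, kept);
    }
}
```

In this model an AND can be taken apart so the pk conjunct is still pushable, while an OR is a single conjunct and stays above the scan as a whole once any branch references a virtual column.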

@adriangb
Contributor

adriangb commented May 5, 2026

I plan to give this another review tomorrow.

@comphead
Contributor

comphead commented May 5, 2026

run benchmark tpch tpcds

@comphead
Contributor

comphead commented May 5, 2026

@mbutrovich from a high-level perspective, how would the row_number virtual column work when reading multiple Parquet files?

@adriangbot

🤖 Benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4383929017-2034-5dnfv 6.12.68+ #1 SMP Wed Apr 1 02:23:28 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing virtual-columns-table-schema (bd513ec) to 2c7af17 (merge-base) diff using: tpcds
Results will be posted here when complete



@adriangbot

🤖 Benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4383929017-2033-f8cjt 6.12.68+ #1 SMP Wed Apr 1 02:23:28 UTC 2026 aarch64 GNU/Linux


Comparing virtual-columns-table-schema (bd513ec) to 2c7af17 (merge-base) diff using: tpch
Results will be posted here when complete



@adriangbot

🤖 Benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)


Comparing HEAD and virtual-columns-table-schema
--------------------
Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query     ┃                           HEAD ┃   virtual-columns-table-schema ┃    Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 1  │ 40.03 / 41.45 ±1.55 / 43.79 ms │ 39.49 / 40.62 ±1.19 / 42.39 ms │ no change │
│ QQuery 2  │ 21.07 / 21.56 ±0.69 / 22.87 ms │ 20.75 / 20.84 ±0.10 / 21.04 ms │ no change │
│ QQuery 3  │ 35.68 / 38.24 ±1.33 / 39.30 ms │ 35.44 / 37.61 ±1.64 / 39.23 ms │ no change │
│ QQuery 4  │ 18.04 / 18.37 ±0.17 / 18.52 ms │ 18.06 / 18.12 ±0.05 / 18.19 ms │ no change │
│ QQuery 5  │ 43.56 / 45.18 ±2.01 / 48.95 ms │ 43.18 / 44.20 ±0.87 / 45.76 ms │ no change │
│ QQuery 6  │ 17.06 / 17.19 ±0.14 / 17.45 ms │ 17.06 / 17.16 ±0.08 / 17.28 ms │ no change │
│ QQuery 7  │ 49.86 / 50.64 ±0.54 / 51.27 ms │ 50.04 / 52.14 ±2.32 / 56.63 ms │ no change │
│ QQuery 8  │ 46.49 / 46.74 ±0.14 / 46.88 ms │ 46.50 / 46.95 ±0.65 / 48.22 ms │ no change │
│ QQuery 9  │ 51.78 / 52.17 ±0.28 / 52.54 ms │ 51.72 / 52.33 ±0.52 / 53.01 ms │ no change │
│ QQuery 10 │ 65.29 / 65.42 ±0.11 / 65.57 ms │ 65.11 / 65.91 ±1.20 / 68.29 ms │ no change │
│ QQuery 11 │ 13.62 / 14.10 ±0.63 / 15.35 ms │ 13.68 / 14.39 ±1.31 / 17.00 ms │ no change │
│ QQuery 12 │ 26.16 / 26.42 ±0.24 / 26.78 ms │ 26.36 / 26.73 ±0.28 / 27.10 ms │ no change │
│ QQuery 13 │ 35.63 / 36.37 ±0.51 / 36.97 ms │ 35.10 / 36.02 ±0.71 / 36.92 ms │ no change │
│ QQuery 14 │ 26.54 / 27.04 ±0.62 / 28.24 ms │ 26.64 / 26.83 ±0.15 / 27.07 ms │ no change │
│ QQuery 15 │ 32.68 / 32.81 ±0.10 / 32.95 ms │ 32.57 / 33.23 ±0.62 / 34.39 ms │ no change │
│ QQuery 16 │ 15.17 / 15.27 ±0.06 / 15.36 ms │ 15.10 / 15.24 ±0.11 / 15.42 ms │ no change │
│ QQuery 17 │ 75.04 / 76.49 ±0.95 / 77.33 ms │ 75.97 / 77.19 ±1.14 / 79.00 ms │ no change │
│ QQuery 18 │ 67.84 / 68.82 ±0.96 / 70.42 ms │ 67.31 / 68.81 ±0.94 / 69.99 ms │ no change │
│ QQuery 19 │ 37.52 / 37.65 ±0.13 / 37.90 ms │ 37.42 / 37.70 ±0.22 / 38.08 ms │ no change │
│ QQuery 20 │ 38.52 / 38.72 ±0.15 / 38.88 ms │ 38.62 / 39.10 ±0.33 / 39.53 ms │ no change │
│ QQuery 21 │ 58.33 / 59.44 ±0.83 / 60.37 ms │ 59.62 / 60.74 ±0.71 / 61.68 ms │ no change │
│ QQuery 22 │ 23.78 / 23.97 ±0.18 / 24.28 ms │ 23.64 / 24.06 ±0.42 / 24.80 ms │ no change │
└───────────┴────────────────────────────────┴────────────────────────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┓
┃ Benchmark Summary                           ┃          ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━┩
│ Total Time (HEAD)                           │ 854.08ms │
│ Total Time (virtual-columns-table-schema)   │ 855.91ms │
│ Average Time (HEAD)                         │  38.82ms │
│ Average Time (virtual-columns-table-schema) │  38.90ms │
│ Queries Faster                              │        0 │
│ Queries Slower                              │        0 │
│ Queries with No Change                      │       22 │
│ Queries with Failure                        │        0 │
└─────────────────────────────────────────────┴──────────┘

Resource Usage

tpch — base (merge-base)

Metric       Value
Wall time    5.0s
Peak memory  5.5 GiB
Avg memory   5.0 GiB
CPU user     32.0s
CPU sys      2.2s
Peak spill   0 B

tpch — branch

Metric       Value
Wall time    5.0s
Peak memory  5.5 GiB
Avg memory   5.0 GiB
CPU user     31.9s
CPU sys      2.3s
Peak spill   0 B


@adriangbot

🤖 Benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)


Comparing HEAD and virtual-columns-table-schema
--------------------
Benchmark tpcds_sf1.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃                                  HEAD ┃          virtual-columns-table-schema ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1  │           6.47 / 7.04 ±0.91 / 8.85 ms │           6.37 / 6.84 ±0.83 / 8.50 ms │     no change │
│ QQuery 2  │        82.70 / 83.63 ±0.58 / 84.33 ms │        83.93 / 84.87 ±0.51 / 85.28 ms │     no change │
│ QQuery 3  │        31.32 / 31.64 ±0.18 / 31.82 ms │        31.09 / 31.27 ±0.15 / 31.53 ms │     no change │
│ QQuery 4  │    563.85 / 580.17 ±13.51 / 595.57 ms │     573.63 / 590.61 ±9.15 / 599.40 ms │     no change │
│ QQuery 5  │        55.02 / 56.17 ±1.00 / 57.36 ms │        55.18 / 55.96 ±0.51 / 56.58 ms │     no change │
│ QQuery 6  │        38.16 / 38.73 ±0.55 / 39.46 ms │        38.26 / 39.97 ±2.02 / 43.82 ms │     no change │
│ QQuery 7  │     116.31 / 118.29 ±1.77 / 121.63 ms │     115.76 / 117.67 ±1.98 / 121.39 ms │     no change │
│ QQuery 8  │        41.49 / 41.70 ±0.24 / 42.13 ms │        40.89 / 40.96 ±0.07 / 41.06 ms │     no change │
│ QQuery 9  │        55.77 / 59.88 ±2.62 / 63.46 ms │        54.56 / 57.98 ±2.05 / 60.75 ms │     no change │
│ QQuery 10 │        86.67 / 87.58 ±0.77 / 88.54 ms │        85.72 / 86.23 ±0.52 / 87.06 ms │     no change │
│ QQuery 11 │    351.22 / 364.88 ±11.43 / 377.34 ms │     363.72 / 370.19 ±3.52 / 373.64 ms │     no change │
│ QQuery 12 │        30.78 / 31.07 ±0.18 / 31.24 ms │        30.51 / 30.80 ±0.26 / 31.28 ms │     no change │
│ QQuery 13 │     138.03 / 138.75 ±0.79 / 140.19 ms │     134.83 / 136.57 ±1.54 / 138.86 ms │     no change │
│ QQuery 14 │     527.83 / 534.55 ±3.64 / 538.71 ms │     532.98 / 536.60 ±1.87 / 538.43 ms │     no change │
│ QQuery 15 │        63.93 / 65.02 ±1.28 / 67.43 ms │        66.70 / 69.00 ±1.77 / 71.51 ms │  1.06x slower │
│ QQuery 16 │           7.18 / 7.40 ±0.23 / 7.83 ms │           7.29 / 7.43 ±0.11 / 7.57 ms │     no change │
│ QQuery 17 │        86.73 / 88.11 ±1.57 / 90.98 ms │        85.31 / 87.53 ±2.44 / 92.00 ms │     no change │
│ QQuery 18 │     163.53 / 164.50 ±0.72 / 165.59 ms │     159.80 / 163.91 ±2.25 / 166.47 ms │     no change │
│ QQuery 19 │        44.35 / 44.67 ±0.26 / 45.13 ms │        44.77 / 45.03 ±0.39 / 45.80 ms │     no change │
│ QQuery 20 │        37.61 / 38.23 ±0.45 / 38.98 ms │        37.94 / 38.58 ±0.43 / 39.23 ms │     no change │
│ QQuery 21 │        19.01 / 19.36 ±0.20 / 19.59 ms │        19.36 / 19.60 ±0.18 / 19.88 ms │     no change │
│ QQuery 22 │        64.17 / 65.22 ±0.94 / 66.71 ms │        68.63 / 69.51 ±0.58 / 70.44 ms │  1.07x slower │
│ QQuery 23 │    504.82 / 524.83 ±20.74 / 562.39 ms │     510.61 / 522.29 ±9.80 / 533.95 ms │     no change │
│ QQuery 24 │     249.60 / 252.27 ±2.50 / 255.57 ms │     250.70 / 259.29 ±7.06 / 271.39 ms │     no change │
│ QQuery 25 │     120.57 / 122.44 ±1.79 / 125.60 ms │     121.80 / 123.77 ±1.59 / 126.18 ms │     no change │
│ QQuery 26 │        76.56 / 77.43 ±0.76 / 78.68 ms │        76.28 / 77.69 ±0.90 / 78.91 ms │     no change │
│ QQuery 27 │           7.11 / 7.26 ±0.14 / 7.53 ms │           7.38 / 7.65 ±0.15 / 7.77 ms │  1.05x slower │
│ QQuery 28 │        65.63 / 67.13 ±0.81 / 68.00 ms │        65.96 / 67.41 ±0.74 / 67.93 ms │     no change │
│ QQuery 29 │     105.15 / 107.19 ±1.27 / 109.13 ms │     106.33 / 107.90 ±2.12 / 111.97 ms │     no change │
│ QQuery 30 │                                  FAIL │                                  FAIL │  incomparable │
│ QQuery 31 │     117.44 / 118.81 ±1.06 / 120.34 ms │     117.93 / 120.44 ±1.44 / 121.93 ms │     no change │
│ QQuery 32 │        22.60 / 22.92 ±0.18 / 23.14 ms │        22.60 / 22.97 ±0.23 / 23.24 ms │     no change │
│ QQuery 33 │        42.06 / 42.87 ±0.61 / 43.74 ms │        41.40 / 42.57 ±1.83 / 46.21 ms │     no change │
│ QQuery 34 │        10.86 / 11.45 ±0.43 / 12.08 ms │        10.67 / 11.08 ±0.33 / 11.53 ms │     no change │
│ QQuery 35 │        85.94 / 87.24 ±1.76 / 90.64 ms │        85.23 / 85.60 ±0.34 / 86.16 ms │     no change │
│ QQuery 36 │           6.91 / 7.06 ±0.10 / 7.21 ms │           6.55 / 6.71 ±0.13 / 6.92 ms │     no change │
│ QQuery 37 │           7.69 / 7.82 ±0.09 / 7.93 ms │           7.52 / 7.76 ±0.16 / 7.98 ms │     no change │
│ QQuery 38 │        73.83 / 74.10 ±0.29 / 74.63 ms │        76.04 / 76.73 ±0.57 / 77.56 ms │     no change │
│ QQuery 39 │     105.55 / 107.98 ±2.12 / 110.73 ms │     109.68 / 111.93 ±1.47 / 114.04 ms │     no change │
│ QQuery 40 │        24.25 / 24.44 ±0.10 / 24.51 ms │        24.98 / 25.22 ±0.19 / 25.57 ms │     no change │
│ QQuery 41 │        14.39 / 14.59 ±0.13 / 14.77 ms │        15.22 / 15.33 ±0.07 / 15.42 ms │  1.05x slower │
│ QQuery 42 │        25.59 / 26.12 ±0.36 / 26.66 ms │        26.28 / 26.66 ±0.33 / 27.10 ms │     no change │
│ QQuery 43 │           5.65 / 5.76 ±0.10 / 5.89 ms │           5.84 / 6.56 ±0.91 / 8.35 ms │  1.14x slower │
│ QQuery 44 │        11.66 / 11.80 ±0.08 / 11.91 ms │        11.75 / 12.08 ±0.25 / 12.51 ms │     no change │
│ QQuery 45 │        45.21 / 47.41 ±1.80 / 49.02 ms │        47.82 / 48.61 ±1.29 / 51.19 ms │     no change │
│ QQuery 46 │        14.16 / 14.51 ±0.27 / 14.87 ms │        14.85 / 15.15 ±0.23 / 15.47 ms │     no change │
│ QQuery 47 │     252.56 / 265.14 ±7.41 / 275.21 ms │     250.20 / 253.65 ±2.99 / 258.15 ms │     no change │
│ QQuery 48 │     109.27 / 110.30 ±1.01 / 112.03 ms │     109.47 / 110.67 ±1.38 / 113.27 ms │     no change │
│ QQuery 49 │        85.89 / 86.30 ±0.24 / 86.62 ms │        86.03 / 87.00 ±0.62 / 87.98 ms │     no change │
│ QQuery 50 │        63.08 / 64.30 ±1.68 / 67.59 ms │        63.11 / 65.72 ±2.31 / 69.81 ms │     no change │
│ QQuery 51 │       93.81 / 97.35 ±2.10 / 100.26 ms │       96.59 / 98.01 ±1.29 / 100.06 ms │     no change │
│ QQuery 52 │        26.20 / 27.15 ±1.01 / 29.08 ms │        25.82 / 26.11 ±0.25 / 26.41 ms │     no change │
│ QQuery 53 │        32.39 / 32.49 ±0.08 / 32.62 ms │        32.17 / 33.24 ±1.49 / 36.18 ms │     no change │
│ QQuery 54 │        57.61 / 58.25 ±0.51 / 59.05 ms │        56.43 / 58.52 ±2.13 / 62.45 ms │     no change │
│ QQuery 55 │        25.19 / 25.68 ±0.51 / 26.65 ms │        25.73 / 26.27 ±0.31 / 26.66 ms │     no change │
│ QQuery 56 │        41.62 / 42.07 ±0.57 / 43.19 ms │        42.96 / 43.28 ±0.24 / 43.64 ms │     no change │
│ QQuery 57 │     187.59 / 191.07 ±2.00 / 193.20 ms │     191.85 / 193.30 ±1.38 / 195.41 ms │     no change │
│ QQuery 58 │     123.84 / 124.66 ±0.44 / 125.07 ms │     120.61 / 123.17 ±1.51 / 124.88 ms │     no change │
│ QQuery 59 │     121.67 / 122.20 ±0.56 / 122.97 ms │     120.57 / 121.92 ±0.88 / 123.10 ms │     no change │
│ QQuery 60 │        41.96 / 42.50 ±0.39 / 43.13 ms │        42.25 / 42.78 ±0.38 / 43.32 ms │     no change │
│ QQuery 61 │        14.24 / 14.30 ±0.07 / 14.43 ms │        14.44 / 14.53 ±0.07 / 14.64 ms │     no change │
│ QQuery 62 │        49.34 / 49.86 ±0.29 / 50.24 ms │        48.86 / 49.80 ±1.52 / 52.82 ms │     no change │
│ QQuery 63 │        32.72 / 33.05 ±0.19 / 33.27 ms │        32.19 / 32.43 ±0.28 / 32.97 ms │     no change │
│ QQuery 64 │     495.24 / 501.59 ±6.70 / 513.86 ms │     492.56 / 497.63 ±3.75 / 502.42 ms │     no change │
│ QQuery 65 │     149.29 / 152.59 ±2.31 / 155.63 ms │     153.14 / 156.85 ±2.60 / 161.08 ms │     no change │
│ QQuery 66 │        86.71 / 88.91 ±1.30 / 90.44 ms │        86.27 / 90.26 ±4.05 / 98.06 ms │     no change │
│ QQuery 67 │     262.50 / 269.09 ±4.74 / 274.49 ms │     266.01 / 272.73 ±4.14 / 278.81 ms │     no change │
│ QQuery 68 │        14.25 / 14.64 ±0.23 / 14.85 ms │        14.85 / 15.03 ±0.21 / 15.38 ms │     no change │
│ QQuery 69 │        81.94 / 84.11 ±2.12 / 88.00 ms │        82.21 / 85.06 ±5.13 / 95.32 ms │     no change │
│ QQuery 70 │     110.46 / 112.49 ±2.02 / 116.35 ms │     109.60 / 115.95 ±6.54 / 124.14 ms │     no change │
│ QQuery 71 │        38.30 / 39.55 ±1.99 / 43.46 ms │        37.36 / 37.54 ±0.15 / 37.77 ms │ +1.05x faster │
│ QQuery 72 │ 2175.03 / 2325.52 ±88.51 / 2444.48 ms │ 2314.95 / 2373.68 ±38.31 / 2425.89 ms │     no change │
│ QQuery 73 │        10.79 / 11.10 ±0.29 / 11.51 ms │        10.45 / 10.62 ±0.12 / 10.76 ms │     no change │
│ QQuery 74 │     206.09 / 208.67 ±1.49 / 210.17 ms │     195.10 / 200.32 ±6.41 / 211.59 ms │     no change │
│ QQuery 75 │     155.97 / 158.32 ±1.80 / 160.67 ms │     156.02 / 158.77 ±1.86 / 161.77 ms │     no change │
│ QQuery 76 │        37.66 / 38.75 ±1.68 / 42.04 ms │        37.89 / 38.50 ±0.47 / 39.26 ms │     no change │
│ QQuery 77 │        64.99 / 66.20 ±0.67 / 66.91 ms │        64.74 / 65.94 ±0.70 / 66.89 ms │     no change │
│ QQuery 78 │     202.83 / 206.45 ±3.27 / 210.58 ms │     201.87 / 206.98 ±4.10 / 210.88 ms │     no change │
│ QQuery 79 │        69.64 / 71.02 ±1.23 / 72.96 ms │        71.07 / 71.50 ±0.39 / 72.21 ms │     no change │
│ QQuery 80 │     106.92 / 109.00 ±2.04 / 112.87 ms │     106.83 / 108.13 ±1.09 / 109.68 ms │     no change │
│ QQuery 81 │        26.49 / 27.59 ±1.67 / 30.86 ms │        26.37 / 26.78 ±0.23 / 27.03 ms │     no change │
│ QQuery 82 │        18.25 / 18.61 ±0.21 / 18.87 ms │        18.61 / 18.73 ±0.10 / 18.91 ms │     no change │
│ QQuery 83 │        39.97 / 40.40 ±0.29 / 40.88 ms │        40.16 / 41.17 ±1.42 / 43.96 ms │     no change │
│ QQuery 84 │        45.46 / 46.40 ±1.58 / 49.54 ms │        45.58 / 45.84 ±0.33 / 46.49 ms │     no change │
│ QQuery 85 │     145.11 / 146.33 ±1.23 / 148.46 ms │     144.07 / 144.94 ±0.48 / 145.39 ms │     no change │
│ QQuery 86 │        27.17 / 27.58 ±0.27 / 27.96 ms │        26.17 / 26.48 ±0.26 / 26.84 ms │     no change │
│ QQuery 87 │        72.71 / 74.93 ±1.56 / 76.67 ms │        71.79 / 72.42 ±0.39 / 72.87 ms │     no change │
│ QQuery 88 │        66.63 / 67.67 ±1.03 / 69.60 ms │        67.28 / 68.12 ±0.93 / 69.86 ms │     no change │
│ QQuery 89 │        38.56 / 38.95 ±0.33 / 39.55 ms │        38.47 / 39.04 ±0.69 / 40.37 ms │     no change │
│ QQuery 90 │        19.05 / 19.39 ±0.20 / 19.68 ms │        18.98 / 19.13 ±0.09 / 19.22 ms │     no change │
│ QQuery 91 │        55.48 / 56.13 ±0.40 / 56.69 ms │        55.26 / 55.55 ±0.32 / 56.15 ms │     no change │
│ QQuery 92 │        32.86 / 33.09 ±0.13 / 33.24 ms │        31.83 / 33.11 ±1.92 / 36.93 ms │     no change │
│ QQuery 93 │        54.32 / 56.40 ±1.53 / 58.12 ms │        54.62 / 56.82 ±2.18 / 60.32 ms │     no change │
│ QQuery 94 │        42.09 / 42.63 ±0.45 / 43.37 ms │        42.15 / 43.01 ±0.74 / 44.10 ms │     no change │
│ QQuery 95 │        91.09 / 91.95 ±0.72 / 93.02 ms │        92.95 / 93.70 ±0.51 / 94.19 ms │     no change │
│ QQuery 96 │        25.62 / 25.81 ±0.13 / 25.94 ms │        25.27 / 25.68 ±0.31 / 26.18 ms │     no change │
│ QQuery 97 │        48.19 / 49.05 ±0.78 / 50.37 ms │        48.75 / 49.22 ±0.31 / 49.68 ms │     no change │
│ QQuery 98 │        44.16 / 44.72 ±0.39 / 45.16 ms │        44.17 / 45.28 ±0.74 / 46.37 ms │     no change │
│ QQuery 99 │        72.41 / 73.55 ±1.25 / 75.89 ms │        71.23 / 71.83 ±0.39 / 72.45 ms │     no change │
└───────────┴───────────────────────────────────────┴───────────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                           ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                           │ 11275.90ms │
│ Total Time (virtual-columns-table-schema)   │ 11351.25ms │
│ Average Time (HEAD)                         │   115.06ms │
│ Average Time (virtual-columns-table-schema) │   115.83ms │
│ Queries Faster                              │          1 │
│ Queries Slower                              │          5 │
│ Queries with No Change                      │         92 │
│ Queries with Failure                        │          1 │
└─────────────────────────────────────────────┴────────────┘

Resource Usage

tpcds — base (merge-base)

| Metric | Value |
|---|---|
| Wall time | 60.0s |
| Peak memory | 6.9 GiB |
| Avg memory | 6.2 GiB |
| CPU user | 258.3s |
| CPU sys | 6.7s |
| Peak spill | 0 B |

tpcds — branch

| Metric | Value |
|---|---|
| Wall time | 60.0s |
| Peak memory | 6.7 GiB |
| Avg memory | 6.0 GiB |
| CPU user | 261.9s |
| CPU sys | 7.2s |
| Peak spill | 0 B |

@mbutrovich mbutrovich requested a review from adriangb May 12, 2026 18:45
Comment thread datafusion/datasource-parquet/src/opener.rs Outdated
@mbutrovich mbutrovich moved this from Todo to In progress in Comet Development May 13, 2026
@mbutrovich mbutrovich requested a review from adriangb May 14, 2026 18:09
Contributor

@adriangb adriangb left a comment

I think this looks good. @mbutrovich do you intend to get this into 54? At this point I think there's something to be said for waiting until 54 goes out, so we can do the rest of the wiring work and derisk the design as a whole.

@mbutrovich
Contributor Author

mbutrovich commented May 14, 2026

I think this looks good. @mbutrovich do you intend to get this into 54? At this point I think there's something to be said for waiting until 54 goes out, so we can do the rest of the wiring work and derisk the design as a whole.

I'd like it in 54 if we think the API at this layer is stable, but I see your argument: if the API needs a tweak when we go to hook everything up, we hit API stability challenges. I am okay to defer, but I was also not planning to do the work to hook it up to the front end any time soon, so deferring leaves the merge open-ended, and it may not be completely wired up in 55 either.

@adriangb
Contributor

Gotcha. If you're okay deferring until 54 is out (which should just be a week or two), I think that'd make me feel more comfortable taking the risk. We don't officially have feature freezes, but I think it's a good general approach to take. I asked in #20135 (comment) if anyone can drive the rest of this, but I'd say once 54 is out we can merge this regardless. Thanks for working on this, it's been quite the effort!

@mbutrovich
Contributor Author

mbutrovich commented May 14, 2026

No worries. This isn't urgently needed in Comet, it's just on the list of Spark gaps we want to close. Thanks for your help thus far!

if self.virtual_columns.is_empty() {
self.virtual_columns = Arc::new(virtual_columns);
} else {
let existing = Arc::get_mut(&mut self.virtual_columns).expect(
Contributor

I think this will panic if:

  • You make a TableSchema
  • Call with_virtual_columns to add some virtual columns
  • clone the TableSchema
  • Call with_virtual_columns again on one of the owned clones

Now, with_table_partition_cols has the same bug, so maybe it's okay, but I do think it's an unsafe contract. The solution also seems relatively simple: use Arc::make_mut. I might make a PR for with_table_partition_cols.
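
The clone-then-append sequence above can be reproduced with the standard library alone. This is a minimal sketch, not the PR's code: `Schema`, `try_append_get_mut`, and `append_make_mut` are hypothetical stand-ins for `TableSchema` and its setter, and the `get_mut` variant returns `None` where the real code would hit `expect`.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for `TableSchema`; only the shared `Arc` field matters here.
#[derive(Clone, Debug)]
struct Schema {
    virtual_columns: Arc<Vec<String>>,
}

impl Schema {
    /// Appends via `Arc::get_mut`; yields `None` when the `Arc` is shared with a clone.
    fn try_append_get_mut(mut self, cols: Vec<String>) -> Option<Self> {
        Arc::get_mut(&mut self.virtual_columns)?.extend(cols);
        Some(self)
    }

    /// Appends via `Arc::make_mut`, which clones the inner `Vec` first if it is shared.
    fn append_make_mut(mut self, cols: Vec<String>) -> Self {
        Arc::make_mut(&mut self.virtual_columns).extend(cols);
        self
    }
}

fn main() {
    let original = Schema {
        virtual_columns: Arc::new(vec!["row_number".to_string()]),
    };
    let cloned = original.clone(); // the Arc is now shared

    // `get_mut` refuses to hand out `&mut` while `original` still holds a reference.
    assert!(cloned
        .clone()
        .try_append_get_mut(vec!["x".to_string()])
        .is_none());

    // `make_mut` copies on write, so the append succeeds and `original` is untouched.
    let extended = cloned.append_make_mut(vec!["x".to_string()]);
    assert_eq!(extended.virtual_columns.len(), 2);
    assert_eq!(original.virtual_columns.len(), 1);
}
```

With `make_mut`, the cost of the copy is only paid when the `Arc` is actually shared; an unshared value is mutated in place.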

Contributor


Opened #22372 to fix with_table_partition_cols with Arc::make_mut (off main, so it can land independently of this PR).

@adriangb
Contributor

@mbutrovich I think we are ready to merge this. Sorry for the conflicts, we've been doing some cleanup / refactoring in opener.rs

@mbutrovich
Contributor Author

@mbutrovich I think we are ready to merge this. Sorry for the conflicts, we've been doing some cleanup / refactoring in opener.rs

I'll get it cleaned up by end of week, thanks for the reminder!

OussamaSaoudi added a commit to OussamaSaoudi/delta-kernel-rs that referenced this pull request May 22, 2026
…F + RowNumber

Reworks the Load sink path in three coupled directions:

1. **Streaming `LoadExec` -- unordered concurrent merger.** Replace the
   serial "open one file, drain it, open next" walker with a stream
   combinator chain: per-row `extract_row_inputs` -> per-row open future
   -> `futures::stream::buffer_unordered(N)` -> `try_flatten`. Concurrency
   cap N = `target_partitions` clamped [1, 64] (default 8). Output
   ordering across files is unspecified by design; intra-file batch
   order is preserved by sequential drain. Limit applied at the
   flattened output by slicing + early termination. Cancellation
   propagates via the standard stream-drop chain.

2. **Compile-time eager dispatch in `register_load_relation`.** When the
   compiled upstream `LogicalPlan` is bare `LogicalPlan::Values` and the
   sink has no DV, register a new `EagerLoadTableProvider` (custom
   `TableProvider`) instead of `LoadTableProvider`. The provider holds a
   pre-built `Vec<PartitionedFile>` with per-file `partition_values`
   populated from the row's passthrough literals -- same broadcast
   mechanism the streaming path uses. `scan()` returns a single
   `DataSourceExec` over a `FileGroup`, so DataFusion's native
   multi-partition fan-out, projection / limit pushdown, and
   `repartitioned()` apply for free. Anything else (non-Values upstream
   or DV present) falls through to the streaming path.

3. **DV applied via a `not_in_dv` ScalarUDF over the parquet `_row_number`
   virtual column.** Each per-file open future:
   - Resolves the kernel-side DV via
     `tokio::task::spawn_blocking(|| descriptor.read(...))` -- returns a
     `RoaringTreemap` (deleted row IDs) cheaply Arc-cloned into the
     UDF's closure.
   - Builds a per-file `DataSourceExec -> FilterExec(not_in_dv(_row_number))
     -> ProjectionExec(drop _row_number)` stack via
     `build_per_file_plan`. The opener's `with_virtual_columns` (via
     `TableSchema::with_virtual_columns` from
     apache/datafusion#22026) injects `_row_number` into emitted batches
     so the FilterExec predicate can reference it; the trailing
     ProjectionExec drops it before the output stream sees it.
   - Reject `LoadSink` with `FileType::Json` AND `dv_ref.is_some()` at
     `LoadExec::new`: JSON has no row-number virtual column, and DVs
     only apply to Delta parquet data files in practice.
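
The filter-then-project shape in item 3 can be illustrated without DataFusion. The sketch below is an assumption-laden model, not the commit's implementation: `Row`, `make_not_in_dv`, and `apply_dv` are hypothetical names, and a `BTreeSet<u64>` stands in for the `RoaringTreemap` of deleted row IDs.

```rust
use std::collections::BTreeSet;
use std::sync::Arc;

/// Hypothetical row shape: a payload plus the `_row_number` virtual column.
#[derive(Debug, Clone, PartialEq)]
struct Row {
    value: String,
    row_number: u64,
}

/// Builds a predicate over a shared set of deleted row ids.
/// `BTreeSet<u64>` is a stand-in for the `RoaringTreemap` named in the commit.
fn make_not_in_dv(deleted: Arc<BTreeSet<u64>>) -> impl Fn(u64) -> bool {
    move |row_number| !deleted.contains(&row_number)
}

/// Filters on `_row_number`, then drops it so callers only see the payload.
fn apply_dv(rows: Vec<Row>, deleted: Arc<BTreeSet<u64>>) -> Vec<String> {
    let not_in_dv = make_not_in_dv(deleted);
    rows.into_iter()
        .filter(|r| not_in_dv(r.row_number)) // analogue of the FilterExec step
        .map(|r| r.value) // analogue of the trailing ProjectionExec
        .collect()
}

fn main() {
    let deleted = Arc::new(BTreeSet::from([1u64, 3]));
    let rows: Vec<Row> = (0..5u64)
        .map(|i| Row { value: format!("v{}", i), row_number: i })
        .collect();
    let kept = apply_dv(rows, Arc::clone(&deleted));
    assert_eq!(kept, vec!["v0", "v2", "v4"]);
}
```

Sharing the deleted set through an `Arc` mirrors the commit's note that the set is cheaply cloned into the predicate's closure.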

Critical fix in `FieldIdPhysicalExprAdapter::rewrite_column`: virtual
columns are now reindexed to their position in the **physical** file
schema (`physical_file_schema.fields().len()` for a single virtual)
rather than passing the original index unchanged. The original index
came from the LoadExec's projected `TableSchema` (= file ++ partition
++ virtual), which doesn't match either `logical_for_rewrite` (file ++
virtual, no partition) or `physical_for_rewrite` (physical_file ++
virtual). Schema evolution can also make logical and physical schemas
diverge in length; reindexing against the physical schema's actual
length lands the virtual at the correct position regardless.
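
The index arithmetic behind that fix can be sketched as follows. `reindex_virtual` is a hypothetical helper, and generalizing from a single virtual column to an offset within the virtual group is an assumption made for illustration; the commit only describes the single-virtual case.

```rust
/// Hypothetical helper: map a virtual column's index in the table schema
/// (file ++ partition ++ virtual) to its index in the rewrite schema
/// (physical_file ++ virtual). Returns `None` for non-virtual indices.
fn reindex_virtual(
    table_index: usize,
    n_file: usize,
    n_partition: usize,
    n_physical_file: usize,
) -> Option<usize> {
    let first_virtual = n_file + n_partition;
    // Offset of this column within the virtual group.
    let offset = table_index.checked_sub(first_virtual)?;
    Some(n_physical_file + offset)
}

fn main() {
    // Logical file schema: 3 columns; partition columns: 2; one virtual column.
    // The physical file only has 2 columns (schema evolution), so the table
    // index 3 + 2 = 5 must land at physical index 2, not stay at 5.
    assert_eq!(reindex_virtual(5, 3, 2, 2), Some(2));
    // An index inside the file-column range is not a virtual column.
    assert_eq!(reindex_virtual(1, 3, 2, 2), None);
}
```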

Other touches:

- **Phase 3 (sync I/O drop)**: `file_size_for_row` no longer falls back
  to `std::fs::metadata` when the size column is unset/null. Sizes get
  resolved per-file via async object-store HEAD inside
  `build_per_file_plan` (`resolve_size_if_unknown`), avoiding a sync
  call inside an async future.
- **Phase 4 (projection-aware passthrough)**: precompute
  `projected_passthrough: Arc<Vec<usize>>` once at `LoadExec::new`;
  iterate that in `extract_row_inputs` instead of all of
  `sink.passthrough_columns`.
- **Phase 6 (factor helpers)**: shared primitives -- `RowInputs`,
  `extract_row_inputs`, `build_file_source`, `into_partitioned_file`,
  `adapter_factory_for`, `strip_field_metadata_recursive`,
  `make_not_in_dv_udf`, `resolve_dv_async`, `resolve_size_if_unknown`,
  `build_per_file_plan` -- live in `load_helpers.rs` and feed both the
  streaming `LoadExec` and the eager `EagerLoadTableProvider`.

Workspace patches: point `datafusion-*` at our local datafusion-fork
on the `pr-22026` branch (open + approved by adriangb,
2026-05-14), and `datafusion-functions-json` at a local fork carrying
the post-50.0 API-drift fix (drop redundant `as_any` impls, rename
`Cast.data_type` -> `Cast.field`). Adds `roaring = "0.11.2"` and
`tokio` direct deps.

Tests: scan_correctness gains a `scan_with_row_index_and_utf8_column`
repro that locks in the virtual-column + Utf8 + supplied_schema
interaction (was a latent regression triggered only by acceptance
fixtures). Acceptance: 3457 / 3457 pass (was 71 failures before this
work). Three previously-expected-fail entries
(`cdc_schema_evolution_read_all`,
`cdf_with_schema_evolution_read_all`,
`cm_id_matching_swapped_select_a_reads_e`) now pass via the eager /
field-id-aware path and move into `FIXED_IN_DATAFUSION`.
zhuqi-lucas pushed a commit to zhuqi-lucas/arrow-datafusion that referenced this pull request May 24, 2026
…Arc (apache#22372)

## Which issue does this PR close?

- No separate issue. This addresses a review observation from apache#22026:
apache#22026 (comment)

## Rationale for this change

`TableSchema::with_table_partition_cols` appended to an existing
partition-column list via `Arc::get_mut(...).expect(...)`. The `expect`
message assumed that owning `self` implies sole ownership of the inner
`Arc<Vec<FieldRef>>` — but that is not true. `TableSchema` derives
`Clone`, and cloning only bumps the `Arc` refcount without copying the
`Vec`.

So this sequence panicked:

```rust
let ts = TableSchema::new(file_schema, vec![some_partition_col]);
let cloned = ts.clone();                       // Arc refcount is now 2
let _ = cloned.with_table_partition_cols(more); // Arc::get_mut -> None -> expect() panics
```

`with_table_partition_cols` taking `mut self` gives unique ownership of
the *struct*, not of the inner `Arc`.

## What changes are included in this PR?

- Make `with_table_partition_cols` **replace** the partition columns
instead of appending to them, by assigning a fresh
`Arc::new(partition_cols)`. This removes the in-place mutation branch
entirely:
- It never mutates the inner `Vec`, so it is safe even when the `Arc` is
shared with a clone (copy-on-write isolation is automatic) — fixing the
panic without needing `Arc::make_mut`.
- It matches builder-API expectations (a `with_x` setter replaces) and
removes the risk of accidentally duplicating partition columns, as
raised in review.
- No production code relied on the append behavior (every `TableSchema`
is built via `new`/`from_file_schema`); only unit tests exercised it,
and they are updated to assert replacement.
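
As a rough illustration of why replacing is clone-safe, here is a standard-library sketch. `Schema` and `with_partition_cols` are hypothetical stand-ins for `TableSchema` and its setter, with `String` in place of `FieldRef`.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for `TableSchema`.
#[derive(Clone, Debug)]
struct Schema {
    partition_cols: Arc<Vec<String>>,
}

impl Schema {
    fn new(partition_cols: Vec<String>) -> Self {
        Self { partition_cols: Arc::new(partition_cols) }
    }

    /// Replaces the partition columns by installing a fresh `Arc`.
    /// The old `Vec` is never mutated, so other clones are unaffected.
    fn with_partition_cols(mut self, partition_cols: Vec<String>) -> Self {
        self.partition_cols = Arc::new(partition_cols);
        self
    }
}

fn main() {
    let ts = Schema::new(vec!["year".to_string()]);
    let cloned = ts.clone(); // shares the inner Arc with `ts`

    // No panic: the setter swaps the Arc instead of mutating through it.
    let replaced = cloned.with_partition_cols(vec!["month".to_string()]);
    assert_eq!(*replaced.partition_cols, vec!["month".to_string()]);
    assert_eq!(*ts.partition_cols, vec!["year".to_string()]);
}
```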

## Are these changes tested?

Yes:
- `test_with_table_partition_cols_replaces_existing` verifies that
calling the method on a `TableSchema` that already has partition columns
replaces them rather than appending.
- `test_with_table_partition_cols_after_clone_does_not_panic` clones a
`TableSchema` and sets partition columns on the clone, verifying it does
not panic and that the other clone is left unmodified (copy-on-write
isolation).

Existing `TableSchema` tests continue to pass.

## Are there any user-facing changes?

`TableSchema::with_table_partition_cols` now replaces existing partition
columns instead of appending to them. The previous append path panicked
on any shared/cloned `TableSchema`, so no working usage relied on it.
There are no API signature changes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Daniël Heres <danielheres@gmail.com>
@alamb
Contributor

alamb commented May 26, 2026

I was excited to try this feature out -- but I couldn't figure out how to query the virtual columns from SQL -- is that possible?

Or does this sentence mean you plan to do it as a follow-on PR?

UX / SQL-layer surface for virtual columns stays deferred to the epic in #20135 — this follows the same framing alamb blessed for #20071 (the input_file_name() UDF).

I personally prefer having APIs in the code that we are sure can work and that are documented with examples, so we know they are adequate to actually support the use case.

I wonder if we should try and hook up the virtual columns in SQL as a draft PR to be sure this API is good enough 🤔

@alamb
Contributor

alamb commented May 26, 2026

Actually, it seems like @AdamGS plans to make such a PR here: #20135 (comment)

@adriangb
Contributor

Yes, the plan was to wire it up in a follow-up; see the discussion following #20135 (comment). @AdamGS is working on it. TL;DR: it may not be trivial to hook up and get all cases working, but this low-level implementation seems likely to work for any eventual front-end implementation, and it unblocks @mbutrovich.

wlhjason pushed a commit to wlhjason/datafusion that referenced this pull request May 26, 2026
…pache#22496)

## Which issue does this PR close?

- No separate issue. Follows up on apache#22372 (panic fix in
`TableSchema::with_table_partition_cols`) and the API discussion it
spawned, and is informed by apache#22026 (which adds a third column group,
virtual columns, to `TableSchema`).

## Rationale for this change

`TableSchema` has one required input (the file schema) and a growing set
of *optional* column groups: partition columns today, virtual columns in
apache#22026. The current API expresses this awkwardly:

- `new(file_schema, partition_cols)` privileges partition columns with a
positional slot while virtual columns only get a builder method — an
asymmetry that grows with every new column kind.
- `TableSchema` eagerly recomputes and caches the concatenated table
schema on *every* incremental setter call, so
`from_file_schema(s).with_table_partition_cols(p)` rebuilds it twice
(three times once virtual columns are added). This is exactly why
`new()`'s docs told callers to avoid the builder-style chain.
- The setter mutated an inner `Arc<Vec<FieldRef>>` in place, which is
what caused the shared-`Arc` panic fixed in apache#22372.

A dedicated builder addresses all three, and mirrors the existing
`FileScanConfigBuilder` (the type that *owns* a `TableSchema`).

## What changes are included in this PR?

- **`TableSchemaBuilder`**: `new(file_schema)` →
`.with_table_partition_cols(impl Into<Fields>)` → `.build()`. The
concatenated table schema is computed exactly **once**, in `build()`.
The setter takes `impl Into<Fields>`, so an existing schema's `Fields`
is accepted zero-copy.
- **Partition columns are now stored as `arrow::datatypes::Fields`** (an
immutable `Arc<[FieldRef]>`) instead of `Arc<Vec<FieldRef>>`: one fewer
indirection, shareable zero-copy, and — being immutable — the
shared-`Arc` mutation panic is structurally impossible.
- **`TableSchema::table_partition_cols()` and the delegating
`FileScanConfig::table_partition_cols()` now return `&Fields`.**
`Fields` derefs to `&[FieldRef]`, so iteration/indexing/`len`/`is_empty`
are unchanged; only the arrow `FileFormat` path needed `.to_vec()`.
- **`TableSchema::with_table_partition_cols` is deprecated** in favor of
the builder. It now **replaces** rather than appends. (Note: `main`
currently *appends* here — the replace change in apache#22372 was not captured
by that PR's squash merge — so this also restores the intended replace
semantics.)
- `new` / `from_file_schema` are kept as conveniences that route through
the builder.
- Documented in the 54.0.0 upgrade guide.
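
The builder shape described in this list can be sketched with the standard library. Everything here is a simplified assumption: `Fields` is modeled as `Arc<[String]>` rather than arrow's `Fields`, and the struct layouts are hypothetical, not the PR's definitions.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for `arrow::datatypes::Fields`: an immutable shared slice.
type Fields = Arc<[String]>;

struct TableSchema {
    partition_cols: Fields,
    /// file ++ partition, computed once in `build()`.
    table_schema: Fields,
}

struct TableSchemaBuilder {
    file_schema: Fields,
    partition_cols: Fields,
}

impl TableSchemaBuilder {
    fn new(file_schema: impl Into<Fields>) -> Self {
        Self {
            file_schema: file_schema.into(),
            partition_cols: Arc::from(Vec::<String>::new()),
        }
    }

    /// Replaces any previously set partition columns; nothing is rebuilt here.
    fn with_table_partition_cols(mut self, cols: impl Into<Fields>) -> Self {
        self.partition_cols = cols.into();
        self
    }

    /// The only place the concatenated schema is computed.
    fn build(self) -> TableSchema {
        let table_schema: Fields = self
            .file_schema
            .iter()
            .chain(self.partition_cols.iter())
            .cloned()
            .collect();
        TableSchema { partition_cols: self.partition_cols, table_schema }
    }
}

fn main() {
    let shared: Fields = Arc::from(vec!["year".to_string()]);
    let schema = TableSchemaBuilder::new(vec!["a".to_string(), "b".to_string()])
        .with_table_partition_cols(vec!["ignored".to_string()])
        .with_table_partition_cols(Arc::clone(&shared)) // replaces; shares the allocation
        .build();
    assert_eq!(schema.table_schema.len(), 3);
    assert!(Arc::ptr_eq(&schema.partition_cols, &shared));
}
```

Because the stored slice is immutable, there is no in-place mutation path left that could observe a shared reference count.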

This intentionally leaves virtual columns out; apache#22026 should extend the
builder with `with_virtual_columns` once it lands.

## Are these changes tested?

Yes. New unit tests cover building with partition columns,
replace-on-repeat, zero-copy `Fields` input, and the deprecated setter's
behavior; existing `TableSchema` / `FileScanConfig` tests and doctests
pass. `cargo clippy --all-targets -- -D warnings` is clean across the
datasource/proto/arrow/parquet/catalog-listing crates.

## Are there any user-facing changes?

Yes — please apply the `api change` label:

- `TableSchema::table_partition_cols()` /
`FileScanConfig::table_partition_cols()` return `&Fields` instead of
`&Vec<FieldRef>` (source-compatible for most uses via `Deref`).
- `TableSchema::with_table_partition_cols` is deprecated (use the
builder) and now replaces rather than appends.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Labels

datasource Changes to the datasource crate

Projects

Status: In progress

8 participants