Add range partitioning sqllogictest fixture#22607
Conversation
| } | ||
| } | ||
|
|
||
| fn register_range_partitioned_table(ctx: &SessionContext) { |
There was a problem hiding this comment.
🤔 could it be feasable to have a table function instead? just in case this single table is not capable of satisfying all corner cases in the future.
Just food for thought, if you think a hardcoded table is good then let's stick with it (it's actually simpler)
There was a problem hiding this comment.
yes I think its feasbile and would be useful for future PRs. Ditto to my comment about this being as simple as possible
366e4dc to
d0ede58
Compare
d0ede58 to
ada1564
Compare
alamb
left a comment
There was a problem hiding this comment.
Thanks @gene-bordegaray -- I think this is a good step and could be merged as is. However, I have some suggestions for follow on work/how to make this simpler and more general. Let me know what you think
Thank you @gabotechs and @asolimando for your reviews
| } | ||
|
|
||
| impl RangePartitionedTable { | ||
| fn output_partitioning( |
There was a problem hiding this comment.
I didn't realize it would take so much ceremony to get partitioned data (aka create an entire DataSource/TableProvider) and I agree with @gabotechs that if we go down this approach it will be hard to generate all the possible cases we would like to test.
SO I would suggest we start by integrate pre-defined partitioning in higher level APIs, starating with FileScanConfig, which I think will be needed eventually to use this feature
I dug around a bit and it seems like it will be straightforward to make FileScanConfig::with_partitioned_by_file_group more general
datafusion/datafusion/datasource/src/file_scan_config/mod.rs
Lines 494 to 505 in 04c01bb
What I suggest we do (could be a follow on PR)
- Update FileScanConfig so it specifies a predefined output partitioning (
output_partitioning: Option<Partitioning>) rather than a bool - Deprecate
FileScanConfig::with_partitoned_by_file_groups - Add a new
FileScanConfig::with_output_partitioningsimilar toFileScanConfig::with_output_ordering - Use the DataSourceExec::from(...) for that config
That will save a lot of boiler plate in this setup, and I think you'll need the more general form to take advantage of range partitioning externally (aka it won't be wasted code)
Eventually, I think we should be targeting ListingTable s that declared it was RangePartitioned via ListingOptions (and thus eventually via SQL) .
However, it seems like ListingTable/ListingOptions don't have the output hooks yet and we need to buid there incrementally
https://github.com/apache/datafusion/blob/4c909bafc5c50749884fdd80a06235d7bd72dbde/datafusion/catalog-listing/src/options.rs#L32-L31
There was a problem hiding this comment.
I think this would be a good clean up. I would be ok with doing this in this PR but think it would be better on its own as I think it will have some ripppling effects
There was a problem hiding this comment.
I poke around and will see if it is feasible to add in here
There was a problem hiding this comment.
Ok, I think that the FileScanConfig::with_output_partitioning(...) with the ListingOptions::with_output_partitioning(...) is the right move. It will help us declaring range partitioning on actual file/listing tables so that is great
I think we can leave the FileScanConfig / ListingTable work as a follow-up. There is some things we need to ahndle there I don't think belong in here.
There is a smaller cleanup where MemorySourceConfig could have an output_partitioning builder, which would clean up a lot of the boilerplate, but after looking at it I don’t think it is good because adding it only to MemorySourceConfig feels like a one-off public API just for this fixture.
I would be ok with moving this to its own module now and clean up with the right fix after.
| ); | ||
| } | ||
|
|
||
| // ============================================================================== |
There was a problem hiding this comment.
if we really need a few hundred lines of setip, maybe we can put it into a module like datafusion/sqllogictest/src/test_context/range_partitioning.rs
However, I have suggestions below that I think could make this substantially smaller
There was a problem hiding this comment.
I ended up doing this with the justification in comment on refactoring. If we are ok with this path I can open an issue 😄
alamb
left a comment
There was a problem hiding this comment.
Looks good to me -- thank you @gene-bordegaray - I think this is a good start and we can iterate on this / clean it up in follow ons (clearly I trust that you will do so 😆 )
Since it is just test infrastructure I'll put it in the merge queue now
|
I have made a follow up issue on enhancing the file/listing scan APIs for |
…_partitioning` for pre-defined file partitioning (apache#22657) ## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes apache#123` indicates that this PR will close issue apache#123. --> - Closes apache#22645. ## Rationale for this change <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> This follows up on apache#22607 by replacing range-partitioning sqllogictest boilerplate with a general file/listing scan API for declared output partitioning. Related: apache#21992, apache#22607, apache#22607 (comment) ## What changes are included in this PR? <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> - Add declared `output_partitioning` to file scan and listing table configuration. - Preserve declared partition counts during listing-table file grouping. - Serialize scan `output_partitioning` through physical plan proto. - Refactor `range_partitioning.slt` to use a CSV `ListingTable` instead of a custom test-only `TableProvider` / `DataSource`. Contract: - Declared partitioning expressions are written against the full table schema before scan projection. For example, `Range([range_key@0], [(10), (20)], 3)` remains valid if the scan projects `range_key` and falls back to `UnknownPartitioning(3)` if `range_key` is not projected. - Listing tables create one file group per declared output partition (which can exceed `target_partitions`). It is up to the user to plan their partitioning. For example, a 4-partition range declaration creates four scan file groups, adding empty trailing groups when fewer files are present. - File group index is part of the contract: file group `i` must contain rows for declared output partition `i`. DataFusion does not validate row placement, matching other user-declared properties such as sortedness. ## Are these changes tested? <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> Yes. ## Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> Yes. This adds public API for declaring file/listing scan output partitioning. No breaking API changes. <!-- If there are any breaking changes to public APIs, please add the `api change` label. --> --------- Co-authored-by: Gabriel <45515538+gabotechs@users.noreply.github.com>
Which issue does this PR close?
Rationale for this change
This adds a focused sqllogictest fixture for source-provided
Rangepartitioning before changing optimizer behavior. It follows the direction discussed in #21992 and gives later planning PRs stable baselines for current behavior.What changes are included in this PR?
range_partitionedtest table forrange_partitioning.slt.Rangepartitioning whenrange_keyis projected, andUnknownPartitioningwhen it is not.UNION ALLover range-partitioned inputs.Are these changes tested?
Yes.
cargo fmt --allcargo test -p datafusion-sqllogictest --test sqllogictests range_partitioningcargo clippy --all-targets --all-features -- -D warningsAre there any user-facing changes?
No. This is test-only infrastructure and sqllogictest coverage.