GH-38849: [C++][Parquet] Add support for list view and large list view#50160
Conversation
| // Multiple leaf ranges can be produced when child values are | ||
| // skipped, such as null fixed-size-list slots, or when | ||
| // list-view ranges are non-contiguous. Concatenate the slices | ||
| // in logical write order. |
There was a problem hiding this comment.
Sorry that my previous reply was inaccurate. If this PR is merged, test cases in #35697 will not report NotImplemented: Lists with non-zero length null components are not supported but still report other issues. Null values in a FixedSizeList also need to occupy space, but they do not appear in the Parquet data array. So round trip tests will fail. We can fix FixedSizeListReader in future PR.
| arrays.push_back(result.leaf_array->Slice(range.start, range.Size())); | ||
| } | ||
| ARROW_ASSIGN_OR_RAISE(values_array, | ||
| ::arrow::Concatenate(arrays, ctx->memory_pool)); |
There was a problem hiding this comment.
Instead of concatenating, would it be possible (and potentially more efficient) to call WriteArrow for each individual chunk?
There was a problem hiding this comment.
It's difficult to write data without constructing a new array because we don't know the definition level and repetition level ranges for each segment of the array. Maybe we can add two offsets in ElementRange to implement this.
There was a problem hiding this comment.
Maybe we can add two offsets in
ElementRangeto implement this.
We can, but perhaps that can be left as a later optimization.
(also, not sure it's a good idea if there are many ElementRanges in the first place)
293c25d to
7d0f8a0
Compare
|
@pitrou Sorry for the delay in updating. I have updated the code and comments. Please take a look again. |
pitrou
left a comment
There was a problem hiding this comment.
LGTM, just a question/suggestion
|
Oh, I forgot two things, sorry:
|
Do you mean we should modify
Done. |
Oops, I had already forgotten about that. Sorry! |
|
After merging your PR, Conbench analyzed the 2 benchmarking runs that have been run so far on merge-commit cffe22d. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. |
…st view (apache#50160) ### Rationale for this change Allow us to write an arrow list view array into a parquet file and read list view type from parquet. ### What changes are included in this PR? We will sort the list views based on the offsets before inserting the array in Parquet writer, and generate a size array in Parquet reader. ### Are these changes tested? Yes. ### Are there any user-facing changes? No. * GitHub Issue: apache#38849 Authored-by: Zehua Zou <zehuazou2000@gmail.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
Rationale for this change
Allow us to write an arrow list view array into a parquet file and read list view type from parquet.
What changes are included in this PR?
We will sort the list views based on the offsets before inserting the array in Parquet writer, and generate a size array in Parquet reader.
Are these changes tested?
Yes.
Are there any user-facing changes?
No.