Skip to content

GH-38849: [C++][Parquet] Add support for list view and large list view#50160

Merged
pitrou merged 8 commits into
apache:mainfrom
HuaHuaY:list_view_type
Jun 25, 2026
Merged

GH-38849: [C++][Parquet] Add support for list view and large list view#50160
pitrou merged 8 commits into
apache:mainfrom
HuaHuaY:list_view_type

Conversation

@HuaHuaY

@HuaHuaY HuaHuaY commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Rationale for this change

Allow us to write an arrow list view array into a parquet file and read list view type from parquet.

What changes are included in this PR?

We will sort the list views based on the offsets before inserting the array in Parquet writer, and generate a size array in Parquet reader.

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

@wgtmac wgtmac left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks good to me. I've left some comments. cc @mapleFU @pitrou

Comment thread cpp/src/parquet/arrow/reader.cc
Comment thread cpp/src/parquet/arrow/reader.cc Outdated
Comment thread cpp/src/parquet/arrow/writer.cc
Comment thread cpp/src/parquet/arrow/arrow_reader_writer_test.cc Outdated
Comment thread cpp/src/parquet/arrow/arrow_reader_writer_test.cc Outdated
@github-actions github-actions Bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jun 15, 2026
Comment thread cpp/src/parquet/properties.h Outdated
Comment thread cpp/src/parquet/arrow/path_internal.cc Outdated
Comment thread cpp/src/parquet/arrow/arrow_reader_writer_test.cc
// Multiple leaf ranges can be produced when child values are
// skipped, such as null fixed-size-list slots, or when
// list-view ranges are non-contiguous. Concatenate the slices
// in logical write order.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this somehow fixing #35697?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

@HuaHuaY HuaHuaY Jun 23, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry that my previous reply was inaccurate. If this PR is merged, test cases in #35697 will not report NotImplemented: Lists with non-zero length null components are not supported but still report other issues. Null values ​​in a FixedSizeList also need to occupy space, but they do not appear in the Parquet data array. So round trip tests will fail. We can fix FixedSizeListReader in future PR.

arrays.push_back(result.leaf_array->Slice(range.start, range.Size()));
}
ARROW_ASSIGN_OR_RAISE(values_array,
::arrow::Concatenate(arrays, ctx->memory_pool));

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of concatenating, would it be possible (and potentially more efficient) to call WriteArrow for each individual chunk?

@HuaHuaY HuaHuaY Jun 16, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's difficult to write data without constructing a new array because we don't know the definition level and repetition level ranges for each segment of the array. Maybe we can add two offsets in ElementRange to implement this.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, yes, that's a good point.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can add two offsets in ElementRange to implement this.

We can, but perhaps that can be left as a later optimization.

(also, not sure it's a good idea if there are many ElementRanges in the first place)

@HuaHuaY HuaHuaY force-pushed the list_view_type branch 2 times, most recently from 293c25d to 7d0f8a0 Compare June 23, 2026 10:58
@HuaHuaY

HuaHuaY commented Jun 23, 2026

Copy link
Copy Markdown
Contributor Author

@pitrou Sorry for the delay in updating. I have updated the code and comments. Please take a look again.

@pitrou pitrou left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a question/suggestion

Comment thread cpp/src/parquet/arrow/path_internal.cc Outdated
@pitrou

pitrou commented Jun 25, 2026

Copy link
Copy Markdown
Member

Oh, I forgot two things, sorry:

  1. we should update ApplyOriginalStorageMetadata in parquet/arrow/schema.cc so that it also recurses into list-view types
  2. we should update the docs in https://arrow.apache.org/docs/cpp/parquet.html#types too

@HuaHuaY

HuaHuaY commented Jun 25, 2026

Copy link
Copy Markdown
Contributor Author

we should update ApplyOriginalStorageMetadata in parquet/arrow/schema.cc so that it also recurses into list-view types

Do you mean we should modify GetNestedFactory? I've already modified it before.

we should update the docs in https://arrow.apache.org/docs/cpp/parquet.html#types too

Done.

@pitrou

pitrou commented Jun 25, 2026

Copy link
Copy Markdown
Member

Do you mean we should modify GetNestedFactory? I've already modified it before.

Oops, I had already forgotten about that. Sorry!

@pitrou pitrou left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot @HuaHuaY ! Let's just wait for CI now.

@pitrou pitrou merged commit cffe22d into apache:main Jun 25, 2026
51 of 53 checks passed
@pitrou pitrou removed the awaiting committer review Awaiting committer review label Jun 25, 2026
@HuaHuaY HuaHuaY deleted the list_view_type branch June 25, 2026 12:31
@conbench-apache-arrow

Copy link
Copy Markdown

After merging your PR, Conbench analyzed the 2 benchmarking runs that have been run so far on merge-commit cffe22d.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

tadeja pushed a commit to tadeja/arrow that referenced this pull request Jun 27, 2026
…st view (apache#50160)

### Rationale for this change

Allow us to write an arrow list view array into a parquet file and read list view type from parquet.

### What changes are included in this PR?

We will sort the list views based on the offsets before inserting the array in Parquet writer, and generate a size array in Parquet reader.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: apache#38849

Authored-by: Zehua Zou <zehuazou2000@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants