ARROW-16131 [C++] support saving and retrieving custom metadata in batches for IPC file by niyue · Pull Request #12812 · apache/arrow

niyue · 2022-04-06T12:59:45Z

This PR aims to address https://issues.apache.org/jira/projects/ARROW/issues/ARROW-16131

Currently, when writing an IPC file with multiple record batches using the arrow::ipc::RecordBatchWriter, the custom_metadata for each record batch is not saved and will be discarded silently.

The current writer does provide the AppendCustomMetadata API internally, but it is never called. I use this API to add the custom metadata during writing and retrieve the custom metadata when reading the record batch. I am not completely sure if this is intentionally ignored or accidentally missing, but from a user's perspective, the metadata can be set via public API (see the example case in ARROW-16131) and I think it is not very friendly that this piece of data is silently discarded.

Several test cases are added in the read_write_test.cc for verifying this change.

github-actions · 2022-04-06T13:02:08Z

https://issues.apache.org/jira/browse/ARROW-16131

github-actions · 2022-04-06T13:02:10Z

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

niyue · 2022-04-06T23:57:58Z

I checked out all the build failures reported , including arrow-gcsfs-test (Timeout) and arrow-s3-test (Failed), and they don't seem to be related with my change.

I can build and run the S3 test locally, and it ran into segfault locally as the CI server did for the test case GetFileInfoRoot, but this seems not related with the PR change:

TEST_F(TestS3FS, GetFileInfoRoot) { AssertFileInfo(fs_.get(), "", FileType::Directory); }

For the gcs test, I failed to build the google cloud cpp lib and I am not able to run it locally so far, but it seems happened in other PR as well (for example, #12816)

pitrou · 2022-04-12T09:44:10Z

Hmm, I don't think this is the right way to do this. The IPC message's custom_metadata is independent from record batch (i.e. schema) metadata, one should not influence the other.

(also, you could have a custom_metadata for other message types such as Tensor)

cc @wesm for opinions.

niyue · 2022-04-12T23:20:19Z

The IPC message's custom_metadata is independent from record batch (i.e. schema) metadata

This is actually my initial understanding, according to this documentation (https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata), it says This includes Field, Schema, and Message. and it seems to indicate schema metadata is different from message metadata.

But for the documentation here (https://arrow.apache.org/docs/python/data.html#custom-schema-and-field-metadata), it says Arrow supports both schema-level and field-level custom key-value metadata allowing for systems to insert their own application defined metadata to customize behavior. and record batch (message level) metadata is not mentioned.

And I only see API/tests allowing users to modify schema's metadata (probably I am missing something), so I assumed my initial understanding is incorrect and came up with such a PR change.

If message metadata is different from schema metadata, the question here may be slightly more complex since in this case, each record batch may provide 1) this record batch's schema specific metadata, this seems not serialized/deserialized 2) this record batch's message specific metadata, we don't seem to provide API for reading/writing this (internally we have AppendCustomMetadata but it is not called and there seems no higher level API allowing users to pass this information)

Please advice what the correct behavior should be so that I could give it a try. Thanks.

pitrou · 2022-04-13T09:50:14Z

So, basically, per-message metadata is an Arrow IPC feature that's independent from record batches or schemas. That's why you don't see it mentioned in the schema metadata documentation.

If message metadata is different from schema metadata, the question here may be slightly more complex since in this case, each record batch may provide 1) this record batch's schema specific metadata, this seems not serialized/deserialized 2) this record batch's message specific metadata, we don't seem to provide API for reading/writing this

Indeed the message metadata wouldn't be attached to the record batch or schema, but passed separately.

One possible API on the write side:

class ARROW_EXPORT RecordBatchWriter {
 public:
  // (default implementation could call `WriteRecordBatch(batch)` while ignoring metadata?)
  virtual Status WriteRecordBatch(const RecordBatch& batch, const KeyValueMetadata& custom_metadata);

and on the read side:

class ARROW_EXPORT RecordBatchReader {
 public:
  // (default implementation could call `ReadNext(batch)` and reset `*custom_metadata` to null)
  virtual Status ReadNext(std::shared_ptr<RecordBatch>* batch, std::shared_ptr<KeyValueMetadata>* custom_metadata);

niyue · 2022-04-14T12:07:14Z

One possible API on the write side

Thanks for the advice. Let me give it a try.

niyue · 2022-04-17T12:15:48Z

@pitrou @lidavidm
I submitted a new commit following the suggestion above. I tried to keep the changes compatible with existing public APIs so several new APIs are added. Could you please help to review? Let me know if there is any issue. Thanks.

For the CI:

I found one test case 80 - arrow-compute-hash-join-node-test failed in C++ / AMD64 Conda C++ (pull_request), but I guess this is probably not caused by my PR
another test case 51 - arrow-flight-test (Failed) failed in continuous-integration/appveyor/pr, I am not able to find the detailed test report in the CI job, is this an issue?

pitrou · 2022-04-18T09:30:43Z

@niyue Indeed, these CI failures are unrelated. I'll let @lidavidm take a look at the Flight one; the hash-join-node-test crash is at https://issues.apache.org/jira/browse/ARROW-15361

lidavidm · 2022-04-18T12:02:39Z

I filed ARROW-16215

niyue · 2022-04-24T03:16:14Z

I rebased the PR onto the latest master branch and resolved the conflict in read_write_test.cc.

@pitrou @lidavidm could you please help to review if there is anything else that needs to be revised? Thanks.

… for IPC file.

pitrou

+1, let's wait for CI

pitrou · 2022-04-25T15:56:40Z

Thanks a lot for your contribution @niyue !

niyue · 2022-04-25T22:15:15Z

@pitrou @lidavidm thanks so much for the detailed review

ursabot · 2022-05-01T00:41:10Z

Benchmark runs are scheduled for baseline = 52698f3 and contender = 942f77e. 942f77e is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed] test-mac-arm
[Failed ⬇️0.0% ⬆️0.36%] ursa-i9-9960x
[Finished ⬇️0.88% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 942f77e5 ec2-t3-xlarge-us-east-2
[Failed] 942f77e5 test-mac-arm
[Failed] 942f77e5 ursa-i9-9960x
[Finished] 942f77e5 ursa-thinkcentre-m75q
[Finished] 52698f35 ec2-t3-xlarge-us-east-2
[Failed] 52698f35 test-mac-arm
[Finished] 52698f35 ursa-i9-9960x
[Finished] 52698f35 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

github-actions Bot added the Component: C++ label Apr 6, 2022

niyue commented Apr 6, 2022

View reviewed changes

Comment thread cpp/src/arrow/ipc/test_common.cc Outdated

niyue commented Apr 7, 2022

View reviewed changes

Comment thread cpp/src/arrow/ipc/writer.cc Outdated

niyue commented Apr 7, 2022

View reviewed changes

Comment thread cpp/src/arrow/ipc/reader.cc Outdated

lidavidm reviewed Apr 7, 2022

View reviewed changes

Comment thread cpp/src/arrow/ipc/writer.cc Outdated

niyue force-pushed the bugfix/ipc-batch-meta branch from e337fa6 to 6320bd1 Compare April 8, 2022 05:53

niyue force-pushed the bugfix/ipc-batch-meta branch from 6320bd1 to f28dfb3 Compare April 17, 2022 08:18

niyue commented Apr 17, 2022

View reviewed changes

Comment thread cpp/src/arrow/ipc/reader.h Outdated

niyue commented Apr 17, 2022

View reviewed changes

Comment thread cpp/src/arrow/record_batch.h Outdated

lidavidm reviewed Apr 18, 2022

View reviewed changes

Comment thread cpp/src/arrow/ipc/reader.h Outdated

Comment thread cpp/src/arrow/record_batch.h Outdated

Comment thread cpp/src/arrow/ipc/reader.cc Outdated

niyue force-pushed the bugfix/ipc-batch-meta branch from f28dfb3 to e65856a Compare April 20, 2022 02:23

lidavidm approved these changes Apr 20, 2022

View reviewed changes

Comment thread cpp/src/arrow/ipc/reader.cc Outdated

niyue force-pushed the bugfix/ipc-batch-meta branch 3 times, most recently from 96df81a to 77c2988 Compare April 24, 2022 03:12

lidavidm approved these changes Apr 25, 2022

View reviewed changes

ARROW-16131, support saving and retrieving custom metadata in batches…

77608fe

… for IPC file.

pitrou force-pushed the bugfix/ipc-batch-meta branch from 77c2988 to 3c5b3c0 Compare April 25, 2022 14:18

pitrou approved these changes Apr 25, 2022

View reviewed changes

pitrou force-pushed the bugfix/ipc-batch-meta branch from 3c5b3c0 to 18d9995 Compare April 25, 2022 14:23

Trivial simplifications

b01cdb5

pitrou force-pushed the bugfix/ipc-batch-meta branch from 18d9995 to b01cdb5 Compare April 25, 2022 14:26

pitrou closed this in 942f77e Apr 25, 2022

niyue mentioned this pull request May 1, 2022

ARROW-16430: [Python] Add support for reading record batch custom metadata API #13041

Merged

asfimport mentioned this pull request May 1, 2022

[C++] Record batch specific metadata is not saved in IPC file #31539

Closed

Conversation

niyue commented Apr 6, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented Apr 6, 2022

Uh oh!

github-actions Bot commented Apr 6, 2022

Uh oh!

niyue commented Apr 6, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

pitrou commented Apr 12, 2022

Uh oh!

niyue commented Apr 12, 2022

Uh oh!

pitrou commented Apr 13, 2022

Uh oh!

niyue commented Apr 14, 2022

Uh oh!

Uh oh!

Uh oh!

niyue commented Apr 17, 2022

Uh oh!

pitrou commented Apr 18, 2022

Uh oh!

lidavidm commented Apr 18, 2022

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

niyue commented Apr 24, 2022

Uh oh!

pitrou left a comment

Choose a reason for hiding this comment

Uh oh!

pitrou commented Apr 25, 2022

Uh oh!

niyue commented Apr 25, 2022

Uh oh!

ursabot commented May 1, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

niyue commented Apr 6, 2022 •

edited

Loading

niyue commented Apr 6, 2022 •

edited

Loading