ARROW-16131 [C++] support saving and retrieving custom metadata in batches for IPC file#12812
ARROW-16131 [C++] support saving and retrieving custom metadata in batches for IPC file#12812niyue wants to merge 2 commits into
Conversation
|
|
|
I checked out all the build failures reported , including I can build and run the S3 test locally, and it ran into segfault locally as the CI server did for the test case For the gcs test, I failed to build the google cloud cpp lib and I am not able to run it locally so far, but it seems happened in other PR as well (for example, #12816) |
e337fa6 to
6320bd1
Compare
|
Hmm, I don't think this is the right way to do this. The IPC message's (also, you could have a cc @wesm for opinions. |
This is actually my initial understanding, according to this documentation (https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata), it says But for the documentation here (https://arrow.apache.org/docs/python/data.html#custom-schema-and-field-metadata), it says And I only see API/tests allowing users to modify schema's metadata (probably I am missing something), so I assumed my initial understanding is incorrect and came up with such a PR change. If message metadata is different from schema metadata, the question here may be slightly more complex since in this case, each record batch may provide 1) this record batch's schema specific metadata, this seems not serialized/deserialized 2) this record batch's message specific metadata, we don't seem to provide API for reading/writing this (internally we have Please advice what the correct behavior should be so that I could give it a try. Thanks. |
|
So, basically, per-message metadata is an Arrow IPC feature that's independent from record batches or schemas. That's why you don't see it mentioned in the schema metadata documentation.
Indeed the message metadata wouldn't be attached to the record batch or schema, but passed separately. One possible API on the write side: class ARROW_EXPORT RecordBatchWriter {
public:
// (default implementation could call `WriteRecordBatch(batch)` while ignoring metadata?)
virtual Status WriteRecordBatch(const RecordBatch& batch, const KeyValueMetadata& custom_metadata);and on the read side: class ARROW_EXPORT RecordBatchReader {
public:
// (default implementation could call `ReadNext(batch)` and reset `*custom_metadata` to null)
virtual Status ReadNext(std::shared_ptr<RecordBatch>* batch, std::shared_ptr<KeyValueMetadata>* custom_metadata); |
Thanks for the advice. Let me give it a try. |
6320bd1 to
f28dfb3
Compare
|
@pitrou @lidavidm For the CI:
|
|
@niyue Indeed, these CI failures are unrelated. I'll let @lidavidm take a look at the Flight one; the hash-join-node-test crash is at https://issues.apache.org/jira/browse/ARROW-15361 |
|
I filed ARROW-16215 |
f28dfb3 to
e65856a
Compare
96df81a to
77c2988
Compare
77c2988 to
3c5b3c0
Compare
3c5b3c0 to
18d9995
Compare
18d9995 to
b01cdb5
Compare
|
Thanks a lot for your contribution @niyue ! |
|
Benchmark runs are scheduled for baseline = 52698f3 and contender = 942f77e. 942f77e is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
This PR aims to address https://issues.apache.org/jira/projects/ARROW/issues/ARROW-16131
Currently, when writing an IPC file with multiple record batches using the
arrow::ipc::RecordBatchWriter, thecustom_metadatafor each record batch is not saved and will be discarded silently.The current writer does provide the
AppendCustomMetadataAPI internally, but it is never called. I use this API to add the custom metadata during writing and retrieve the custom metadata when reading the record batch. I am not completely sure if this is intentionally ignored or accidentally missing, but from a user's perspective, the metadata can be set via public API (see the example case in ARROW-16131) and I think it is not very friendly that this piece of data is silently discarded.Several test cases are added in the
read_write_test.ccfor verifying this change.