Skip to content

ARROW-16430: [Python] Add support for reading record batch custom metadata API#13041

Merged
pitrou merged 4 commits into
apache:masterfrom
niyue:feature/py-batch-custom-meta
Dec 12, 2022
Merged

ARROW-16430: [Python] Add support for reading record batch custom metadata API#13041
pitrou merged 4 commits into
apache:masterfrom
niyue:feature/py-batch-custom-meta

Conversation

@niyue

@niyue niyue commented May 1, 2022

Copy link
Copy Markdown
Contributor

In ARROW-16131, C++ APIs were added so that users can read/write record batch custom metadata for IPC file. In this PR, pyarrow APIs are added so that python users can take advantage of these APIs to address ARROW-16430.

@github-actions

github-actions Bot commented May 1, 2022

Copy link
Copy Markdown

@github-actions

github-actions Bot commented May 1, 2022

Copy link
Copy Markdown

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

Comment thread cpp/src/arrow/ipc/writer.h Outdated
Comment thread python/pyarrow/includes/libarrow.pxd Outdated
@niyue niyue force-pushed the feature/py-batch-custom-meta branch from 1a7e908 to 620ff15 Compare May 1, 2022 07:06
@niyue niyue force-pushed the feature/py-batch-custom-meta branch 3 times, most recently from 34b556a to 38e45dd Compare May 1, 2022 08:59
@kou kou changed the title ARROW-16430, support reading record batch custom metadata API in pyarrow ARROW-16430: [Python] Add support for reading record batch custom metadata API May 6, 2022
@niyue niyue force-pushed the feature/py-batch-custom-meta branch from 38e45dd to c1fb6ad Compare May 7, 2022 01:35
Comment thread python/pyarrow/_flight.pyx Outdated
Comment thread python/pyarrow/ipc.pxi Outdated
Comment thread python/pyarrow/ipc.pxi Outdated
Comment on lines 939 to 940

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche Is this the right Numpydoc syntax when a namedtuple is returned?

Comment thread python/pyarrow/tests/test_flight.py Outdated
@pitrou

pitrou commented May 10, 2022

Copy link
Copy Markdown
Member

By the way, this PR addresses RecordBatchFileReader but not RecordBatchReader. Do you plan to do that as well?

@niyue niyue force-pushed the feature/py-batch-custom-meta branch from c1fb6ad to 78f492f Compare May 12, 2022 05:41
@niyue

niyue commented May 12, 2022

Copy link
Copy Markdown
Contributor Author

By the way, this PR addresses RecordBatchFileReader but not RecordBatchReader. Do you plan to do that as well?

I tried adding the pyarrow API for RecordBatchReader locally previously, but I found if I want to test the new pyarrow API for RecordBatchReader, I probably have to add the corresponding API implementation in C++ RecordBatchStreamReader so that I can unit test it in a meaningful way since it seems only RecordBatchStreamReader and RecordBatchFileReader provides the capability to read record batch custom metadata (is this correct?). And since this PR is more about adding new pyarrow API for existing C++ capability, I would like to make non necessary C++ API change less in this PR.

  1. if there is other way to unit test pyarrow RecordBatchReader.read_next_batch_with_metadata API besides implementing it in RecordBatchStreamReader?
  2. do you recommend to put all these changes together in single PR or keep them as separated PRs? I initially submitted this issue as a pyarrow enhancement PR, but I am okay to include more API changes if it is preferred.

@niyue niyue force-pushed the feature/py-batch-custom-meta branch from 78f492f to 3f040a0 Compare May 12, 2022 07:24
@niyue

niyue commented May 17, 2022

Copy link
Copy Markdown
Contributor Author

@pitrou I made some additional changes to the PR, could you please help to review? Thanks.

@pitrou

pitrou commented May 18, 2022

Copy link
Copy Markdown
Member

@niyue Sorry for the late replay, but the C++ RecordBatchReader already has a method Result<RecordBatchWithMetadata> ReadNext(), so you wouldn't need to add anything more to the C++ side AFAICT?

@pitrou

pitrou commented Aug 9, 2022

Copy link
Copy Markdown
Member

@niyue Are you planning to work on this?
Otherwise @milesgranger you might be interested in taking this up?

@milesgranger

Copy link
Copy Markdown
Contributor

I can certainly pick it up if needed!
Presently, I'm working locally on ARROW-13763.

@niyue

niyue commented Aug 9, 2022

Copy link
Copy Markdown
Contributor Author

@niyue Are you planning to work on this?

Otherwise @milesgranger you might be interested in taking this up?

Hi @pitrou, sorry I got too busy since last commit and forgot this issue previously. I am still interested working on this, and can start working on it from next week.

@niyue niyue force-pushed the feature/py-batch-custom-meta branch from 3f040a0 to a676680 Compare August 17, 2022 06:24
@pitrou

pitrou commented Aug 17, 2022

Copy link
Copy Markdown
Member

@niyue If/when this is ready for review, please say so :-)

@niyue

niyue commented Aug 17, 2022

Copy link
Copy Markdown
Contributor Author

@pitrou I pushed a new commit a minute ago. I followed your suggestion to add corresponding API for RecordBatchReader API in pyarrow, and a basic unit test to cover this API. I ran all pyarrow tests (using the minimum build only) and all tests passed locally. Could you please help to review? Thanks.

@niyue

niyue commented Sep 8, 2022

Copy link
Copy Markdown
Contributor Author

@pitrou I am not sure if you see my previous comment. I made some change previously, and this PR is ready for review, could you please help? Thanks.

@pitrou

pitrou commented Sep 12, 2022

Copy link
Copy Markdown
Member

@niyue Sorry, I had overlooked it. I'll take a look when I can. @jorisvandenbossche would you like to review this too?

@pitrou pitrou left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay! Here are some more comments.

Comment thread python/pyarrow/ipc.pxi Outdated
Comment thread python/pyarrow/ipc.pxi Outdated
Comment thread python/pyarrow/ipc.pxi Outdated
Comment thread python/pyarrow/ipc.pxi Outdated
Comment thread python/pyarrow/ipc.pxi Outdated
Comment thread python/pyarrow/ipc.pxi Outdated
Comment thread python/pyarrow/tests/test_ipc.py Outdated
Comment thread python/pyarrow/tests/test_ipc.py Outdated
Comment thread python/pyarrow/tests/test_ipc.py Outdated
Comment thread python/pyarrow/_flight.pyx Outdated
@niyue

niyue commented Sep 22, 2022

Copy link
Copy Markdown
Contributor Author

No problem. I will check them out and see if I can get them addressed. Thanks for the review.

@niyue niyue force-pushed the feature/py-batch-custom-meta branch 2 times, most recently from 6498958 to ef95d59 Compare September 27, 2022 10:18
Comment thread python/pyarrow/ipc.pxi Outdated
Comment thread python/pyarrow/tests/test_ipc.py Outdated
@pitrou

pitrou commented Oct 5, 2022

Copy link
Copy Markdown
Member

There were some timeouts in some CI jobs. I've restarted them just in case. If that still reproduces, should first rebase from master to catch up with any upstream fixes.

@jorisvandenbossche jorisvandenbossche left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just two minor doc formatting comments

Comment thread python/pyarrow/ipc.pxi Outdated
Comment thread python/pyarrow/ipc.pxi Outdated
@jorisvandenbossche

Copy link
Copy Markdown
Member

This looks like a nice addition. Something I am wondering more in general (potentially for future JIRA/PRs), when working with custom metadata, would it be useful to also allow to inspect the custom metadata of a certain batch, without also loading the batch? (so you could for example check the metadata before deciding whether to read the batch or not).
I suppose that would only make sense for the IPC file format? (although not sure if that makes sense at the format / C++ level) Like the RecordBatchFileReader now has a get_batch(i)/ get_batch_with_custom_metadata(i), one could also have a get_custom_metadata(i)?

@pitrou

pitrou commented Oct 5, 2022

Copy link
Copy Markdown
Member

Something I am wondering more in general (potentially for future JIRA/PRs), when working with custom metadata, would it be useful to also allow to inspect the custom metadata of a certain batch, without also loading the batch?

Perhaps, but that would have to be done on the C++ side first. Also, I would wait for users to actually request it.

@niyue niyue force-pushed the feature/py-batch-custom-meta branch from ef95d59 to af3cd34 Compare October 10, 2022 06:13
@niyue

niyue commented Oct 10, 2022

Copy link
Copy Markdown
Contributor Author

@pitrou sorry for the late response. There are two failed CI builds, and they don't seem relevant in this issue. I rebased onto the latest master branch anyway.

…w so that pyarrow can read record batch along with its custom metadata.
@niyue niyue force-pushed the feature/py-batch-custom-meta branch from af3cd34 to 8f5d119 Compare October 10, 2022 06:20
@niyue

niyue commented Nov 2, 2022

Copy link
Copy Markdown
Contributor Author

@pitrou could you please help to review the PR to see if there is still anything we desire to change? Thanks.

@pitrou pitrou left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really sorry for the delay @niyue ! I've pushed a couple minor changes and will merge if CI is green.

Thank you for the work!

@pitrou pitrou merged commit 45d83fe into apache:master Dec 12, 2022
@niyue

niyue commented Dec 13, 2022

Copy link
Copy Markdown
Contributor Author

No problem @pitrou . Thanks for the review and the fixes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants