Feat: Adds interop with Arrow library using new method Dataset.to_arrow()#281
Conversation
WalkthroughWalkthroughThe updates introduce native Arrow support by adding Changes
Sequence Diagram(s)Assessment against linked issues
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (invoked as PR comments)
Additionally, you can add CodeRabbit Configration File (
|
There was a problem hiding this comment.
A few changes requested inline.
Also, I would like request at least one or two tests added. Feel free to replicate another test that validates to_pandas() capabilities. At minimum, we just want to make sure that the code can execute. It is probably also worth adding a version of the test that forces a small chunk size, ensuring that the ">1 chunk" codepath is also working as expected.
Overall - this looks great. I think we will be ready to merge once this feedback is resolved.
|
CodeRabbit (@coderabbitai) review |
Actions performedReview triggered.
|
|
/fix-pr
|
|
Aviraj Gour (@avirajsingh7) - Auto-fix applied some lint fixing. And the other fixes you applied looked great. Lmk if this is ready for final review or if you are still applying additional changes. |
|
Aaron ("AJ") Steers (@aaronsteers) it is ready for review |
This comment was marked as outdated.
This comment was marked as outdated.
There was a problem hiding this comment.
Actionable comments posted: 2
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files ignored due to path filters (1)
poetry.lockis excluded by!**/*.lock
Files selected for processing (5)
- airbyte/caches/base.py (2 hunks)
- airbyte/datasets/_base.py (2 hunks)
- airbyte/datasets/_sql.py (4 hunks)
- pyproject.toml (1 hunks)
- tests/integration_tests/test_all_cache_types.py (1 hunks)
Files skipped from review due to trivial changes (1)
- pyproject.toml
Additional comments not posted (3)
airbyte/datasets/_sql.py (2)
108-109: Ensure chunk size is appropriate and method is tested.The
to_arrowmethod allows specifying a chunk size for processing the dataset. Ensure that the chunk size is appropriate and that the method is tested.
175-186: Ensure chunk size is appropriate and method is tested.The
to_arrowmethod allows specifying a chunk size for processing the dataset. Ensure that the chunk size is appropriate and that the method is tested.Verification successful
Ensure chunk size is appropriate and method is tested.
The
to_arrowmethod allows specifying a chunk size for processing the dataset. Based on the search results, theCachedDatasetclass is referenced in several integration tests, indicating that the method is likely tested indirectly through these tests.
tests/integration_tests/test_source_test_fixture.pytests/integration_tests/cloud/test_cloud_sql_reads.pyScripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify that the to_arrow method in CachedDataset is tested. # Test: Search for tests that call the to_arrow method in CachedDataset. rg --type python $'CachedDataset.to_arrow'Length of output: 72
Script:
#!/bin/bash # Description: Verify that the to_arrow method in CachedDataset is tested. # Test: Search for tests that call the to_arrow method in CachedDataset. rg 'CachedDataset.to_arrow'Length of output: 27
Script:
#!/bin/bash # Description: Verify that the CachedDataset class and its methods are tested. # Search for any references to the CachedDataset class in test files. rg 'CachedDataset' --glob '*test*.py' # Search for any test files that might be testing the functionality of CachedDataset. fd test --type f --extension pyLength of output: 3235
tests/integration_tests/test_all_cache_types.py (1)
166-168: Verify the correctness of the conversion and assertions.The conversion to Arrow dataset and the assertions for row count and batch iteration look correct. Ensure that the chunk size of 10 is optimal for the dataset size.
Verification successful
Verify the correctness of the conversion and assertions.
The conversion to an Arrow dataset and the assertions for row count and batch iteration look correct. Ensure that the chunk size of 10 is optimal for the dataset size.
- The code snippet is located in
tests/integration_tests/test_all_cache_types.pyaround lines 166-168.- The assertions check both the row count (
FAKER_SCALE_A) and the number of batches (20).Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the correctness of the conversion and assertions. # Test: Run the test suite to ensure the new functionality is covered by tests. pytest tests/integration_tests/test_all_cache_types.pyLength of output: 130
Script:
#!/bin/bash # Search for the function containing the provided code snippet to inspect its correctness. ast-grep --lang python --pattern $'def $_($$$):\n $$$\n arrow_dataset = result["users"].to_arrow(chunksize=10)\n assert arrow_dataset.count_rows() == FAKER_SCALE_A\n assert sum(1 for _ in arrow_dataset.to_batches()) == 20\n $$$'Length of output: 239
Script:
#!/bin/bash # Search for the provided code snippet in the specified file to locate its exact position and inspect its context. rg 'arrow_dataset = result\["users"\].to_arrow\(chunksize=10\)' tests/integration_tests/test_all_cache_types.py -A 5 -B 5Length of output: 644
Aaron ("AJ") Steers (aaronsteers)
left a comment
There was a problem hiding this comment.
The changes look great. Thanks for the added changes and for the contribution. 🚀
Will merge once all tests are passed.
|
/poetry-lock
|
Resolving dependencies... changes
This comment was marked as outdated.
This comment was marked as outdated.
There was a problem hiding this comment.
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files ignored due to path filters (1)
poetry.lockis excluded by!**/*.lock
Files selected for processing (1)
- pyproject.toml (1 hunks)
Files skipped from review as they are similar to previous changes (1)
- pyproject.toml
|
/test-pr
|
|
BigQuery is failing at the To merge, we may want to declare BigQuery unsupported and simply raise
|
There was a problem hiding this comment.
Actionable comments posted: 0
Outside diff range and nitpick comments (1)
airbyte/caches/bigquery.py (1)
33-33: Remove whitespace from blank line.The blank line contains unnecessary whitespace.
- +Tools
Ruff
33-33: Blank line contains whitespace
Remove whitespace from blank line
(W293)
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (2)
- airbyte/caches/base.py (3 hunks)
- airbyte/caches/bigquery.py (1 hunks)
Files skipped from review as they are similar to previous changes (1)
- airbyte/caches/base.py
Additional context used
Ruff
airbyte/caches/bigquery.py
33-33: Blank line contains whitespace
Remove whitespace from blank line
(W293)
Additional comments not posted (2)
airbyte/caches/bigquery.py (2)
26-26: LGTM! Import statement is appropriate.The import statement for
DEFAULT_ARROW_MAX_CHUNK_SIZEis necessary for the new method.
34-45: LGTM! Method implementation is appropriate.The method correctly raises a
NotImplementedErrorwith a clear and informative message.
Aaron ("AJ") Steers (@aaronsteers) I think this fix should work, |
There was a problem hiding this comment.
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- tests/integration_tests/test_all_cache_types.py (1 hunks)
Files skipped from review as they are similar to previous changes (1)
- tests/integration_tests/test_all_cache_types.py
|
/fix-pr |
|
Aviraj Gour (@avirajsingh7) - Thanks for taking the revision for the BigQuery issue. I did some digging and found this issue from a while back, describing a similar problem for |
|
/fix-pr
|
There was a problem hiding this comment.
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- airbyte/caches/bigquery.py (1 hunks)
Files skipped from review as they are similar to previous changes (1)
- airbyte/caches/bigquery.py
There was a problem hiding this comment.
Actionable comments posted: 2
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (2)
- airbyte/caches/bigquery.py (1 hunks)
- tests/integration_tests/test_all_cache_types.py (1 hunks)
Files skipped from review as they are similar to previous changes (1)
- tests/integration_tests/test_all_cache_types.py
Additional context used
Ruff
airbyte/caches/bigquery.py
41-41: Blank line contains whitespace
Remove whitespace from blank line
(W293)
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
There was a problem hiding this comment.
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- airbyte/caches/bigquery.py (1 hunks)
Files skipped from review as they are similar to previous changes (1)
- airbyte/caches/bigquery.py
|
/test-pr
(Actually successful. Failure is false-positive.) |
|
Aviraj Gour (@avirajsingh7) - Thanks for all your hard work on this. Tests are passing. Merged!! 🚀 🙌 |
|
Thanks Aaron ("AJ") Steers (@aaronsteers) for all your efforts and help |
Resolves #204
As discuss in issue added a supoort pyarrow dataset instead of pyarrow table to handle a large datasets.
Supporting docs:
pyarrow dataset
pyarrow.Table.from_pandas
Still need input on optimal chunksize...
Summary by CodeRabbit
New Features
to_arrowmethod for converting datasets to Arrow format with a specified chunk size.Bug Fixes
Tests