ARROW-9718: [Python] ParquetWriter to work with new FileSystem API#7991

Closed
jorisvandenbossche wants to merge 8 commits into apache:master from jorisvandenbossche:ARROW-9718-ParquetWriter-filesystems

Conversation

@jorisvandenbossche

Member

No description provided.

jorisvandenbossche force-pushed the ARROW-9718-ParquetWriter-filesystems branch from 089d84e to fb9a773 (August 27, 2020 09:44)
jorisvandenbossche marked this pull request as ready for review (August 27, 2020 09:47)
jorisvandenbossche force-pushed the ARROW-9718-ParquetWriter-filesystems branch from 37ac79f to 12017a5 (September 1, 2020 09:39)

pitrou left a comment

Member

Thank you @jorisvandenbossche. A couple of comments below.

Comment thread python/pyarrow/fs.py Outdated

Member

"where"?

Member

If it's the name of an argument, then put backquotes around it.

Member Author

Yeah, this was copy-pasted from the implementation in `pyarrow.filesystem`, but I agree it can be improved. Will update.

The keyword name itself may vary depending on where this helper function is called, so I will keep it general, something like "the specified path".

Comment thread python/pyarrow/parquet.py Outdated

Member

These are legacy filesystem imports? Do we still need them?

Member

Perhaps `import pyarrow.filesystem as legacyfs` would make the code easier to read below.

Member Author

Yes, we still need them because the full ParquetDataset/ParquetManifest (Python) implementation here is based on the legacy filesystems.
But I switched to using the `legacyfs.` prefix for the old ones, and plain imports for the new ones.

Comment thread python/pyarrow/tests/test_parquet.py Outdated

Member

Also `from pyarrow import filesystem as legacyfs`?

Member Author

Here I am going to leave it as is for now, because the old ones are still used a lot. Changing them would make the diff much larger, so I will keep that for a follow-up PR, e.g. when actually deprecating them.

Comment thread python/pyarrow/tests/test_parquet.py Outdated

Member

This must be lifted out of the `with` block.

Member

Note that it may be simpler to use `pytest.raises(ValueError, match="...")`.

Member Author

Yes, indeed. It was copied from another test, but I updated it to use `match`.
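
For readers following this thread, here is a minimal sketch of the two points raised above. It is not the test from this PR: `resolve_path` and its error message are hypothetical stand-ins, chosen only to have something that raises `ValueError`. The first function shows why an assertion placed after the raising call inside the `with` block never runs; the second shows the `match` form.

```python
import pytest


def resolve_path(path, filesystem=None):
    """Hypothetical helper (not pyarrow code) that rejects a URI plus a filesystem."""
    if filesystem is not None and "://" in path:
        raise ValueError("filesystem passed but the specified path is a URI")
    return path


def check_with_statement_inside_block():
    # Anti-pattern: anything placed after the raising call inside the block
    # is skipped, because control leaves the block as soon as the error is raised.
    reached = False
    with pytest.raises(ValueError) as err:
        resolve_path("file:///tmp/data.parquet", filesystem=object())
        reached = True  # never executed
    # Checks on the exception belong here, outside the block.
    return reached, str(err.value)


def check_with_match():
    # Simpler form: `match` is a regex searched against str(exception).
    with pytest.raises(ValueError, match="specified path is a URI"):
        resolve_path("file:///tmp/data.parquet", filesystem=object())
    return True
```

With `match`, there is no separate assertion that could end up in the wrong place, which is presumably why it was suggested.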

Comment thread python/pyarrow/tests/test_parquet.py Outdated

Member

Should we also test `ParquetWriter(path=uri)`?

jorisvandenbossche, Sep 3, 2020

Member Author

I added one, but it is segfaulting locally (maybe similar to ARROW-9814).

Member

If you pass `-s` to pytest, you should be able to see the C++ crash message (if any).

Member Author

OK to merge it with the test commented out for now? (I opened an issue for it at https://issues.apache.org/jira/browse/ARROW-9906)

Member

Yes!

jorisvandenbossche left a comment

Member Author

Thanks for the review! Pushed some updates.

jorisvandenbossche force-pushed the ARROW-9718-ParquetWriter-filesystems branch from b772535 to 5bafd1d (September 3, 2020 07:51)
pitrou commented Sep 7, 2020

Member

@jorisvandenbossche Do you want to merge this?

@jorisvandenbossche

Member Author

Yep
