Skip to content

ARROW-5494: [Python] Create FileSystem bindings#5258

Closed
kszucs wants to merge 36 commits into
apache:masterfrom
kszucs:ARROW-5494
Closed

ARROW-5494: [Python] Create FileSystem bindings#5258
kszucs wants to merge 36 commits into
apache:masterfrom
kszucs:ARROW-5494

Conversation

@kszucs

@kszucs kszucs commented Sep 3, 2019

Copy link
Copy Markdown
Member
  • FileSystem binding without S3 (it will be implemented in a followup PR)
  • Moved arrow/python/util/datetime.h one level upper, enabling to be used by the bindings

Comment thread python/pyarrow/_fs.pyx Outdated
Comment thread python/pyarrow/_fs.pyx Outdated
Comment thread python/pyarrow/_fs.pyx Outdated
Comment thread python/pyarrow/_fs.pyx Outdated
Comment thread python/pyarrow/fs.py Outdated
Comment thread python/pyarrow/tests/test_fs.py Outdated
Comment thread cpp/src/arrow/python/datetime.cc Outdated
Comment thread cpp/src/arrow/python/datetime.h Outdated

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pitrou apply ARROW_PYTHON_EXPORT for all?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the functions which were inlined should be kept inline IMO.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then I need to move all of the implementation to the header, including the static functions (practically the whole implementation).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's how it was already, right?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we keep everything as internal than that could work, although I prefer readable header files.
Actually we should benchmark whether it has a performance impact or not.

Could you recommend me specific benchmarks if we have any?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't seem to have any datetime-heavy benchmarks in the ASV benchmarks directory.
However, you can probably devise a timeit-based microbenchmark easily. No need to be sophisticated :-)

Comment thread cpp/src/arrow/python/datetime.h Outdated
@kszucs kszucs marked this pull request as ready for review September 4, 2019 14:15
Comment thread cpp/src/arrow/filesystem/filesystem.h Outdated
@kszucs

kszucs commented Sep 5, 2019

Copy link
Copy Markdown
Member Author

@wesm

wesm commented Sep 5, 2019

Copy link
Copy Markdown
Member

@kszucs it's a typedef... visibility only has to do with exported compiled symbols

@kszucs

kszucs commented Sep 6, 2019

Copy link
Copy Markdown
Member Author

Ahh, indeed! Thanks!

@kszucs kszucs requested review from pitrou and wesm September 6, 2019 07:55
@kszucs

kszucs commented Sep 6, 2019

Copy link
Copy Markdown
Member Author

@pitrou I have a "not a subpath error" which is probably windows related? https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/27224120/job/dx9qmxdj5tnenu94#L2057

@pitrou

pitrou commented Sep 9, 2019

Copy link
Copy Markdown
Member

@pitrou I have a "not a subpath error" which is probably windows related?

Right now the filesystem classes expect forward slashes, and it seems you're passing backslashes. You should try pathlib.Path.as_posix.

@pitrou pitrou left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this. It's gonna be very useful. I posted some comments below.

Comment thread cpp/cmake_modules/FindClangTools.cmake Outdated
Comment thread cpp/src/arrow/python/datetime.cc Outdated
Comment thread cpp/src/arrow/python/datetime.h Outdated

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't seem to have any datetime-heavy benchmarks in the ASV benchmarks directory.
However, you can probably devise a timeit-based microbenchmark easily. No need to be sophisticated :-)

Comment thread python/pyarrow/io.pxi Outdated
Comment thread python/pyarrow/tests/test_fs.py Outdated
Comment thread python/pyarrow/_fs.pyx Outdated
Comment thread python/pyarrow/_fs.pyx Outdated
Comment thread python/pyarrow/_fs.pyx Outdated
Comment thread python/pyarrow/_fs.pyx Outdated
Comment thread python/pyarrow/_fs.pyx Outdated
@pitrou

pitrou commented Sep 11, 2019

Copy link
Copy Markdown
Member

Looks like the AppVeyor failure needs fixing, can you take a look @kszucs ?

Comment thread python/pyarrow/_fs.pyx Outdated

@jorisvandenbossche jorisvandenbossche left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I quickly tried this out, and added a few comments. But looks good!

Comment thread python/pyarrow/_fs.pyx Outdated
Comment thread python/pyarrow/_fs.pyx Outdated
Comment thread python/pyarrow/_fs.pyx Outdated
Comment thread python/pyarrow/_fs.pyx Outdated
@jorisvandenbossche

Copy link
Copy Markdown
Member

For the pathlib comment, the other alternative would also be to add it as a dependency (for python 2 only, just like also enum34 is a py2 only dependency for enum backport).
I would say that if we think it is nice to return a pathlib, we shouldn't preclude ourselves doing that just for the few months of py2 support we still have.

@kszucs

kszucs commented Sep 11, 2019

Copy link
Copy Markdown
Member Author

After transitioning to py3 only it would be reasonable to use pathlib.Path objects as return types indeed (at least that's my preference). I suggest to create a JIRA where we can discuss further (possible involves other parts of pyarrow), meanwhile I've set the return types to string.

@pitrou

pitrou commented Sep 11, 2019

Copy link
Copy Markdown
Member

After transitioning to py3 only it would be reasonable to use pathlib.Path objects as return types indeed (at least that's my preference).

You mean from get_target_stats?

@pitrou

pitrou commented Sep 11, 2019

Copy link
Copy Markdown
Member

it would be reasonable to use pathlib.Path objects as return types indeed (at least that's my preference)

Actually, no. pathlib paths are only for the local filesystem, not for arbitrary resources (e.g. S3). Instead you should return a plain string.

@jorisvandenbossche

Copy link
Copy Markdown
Member

Actually, no. pathlib paths are only for the local filesystem, not for arbitrary resources (e.g. S3). Instead you should return a plain string.

Ah, that solves the python 2 / should we depend on pathlib question then.

@kszucs

kszucs commented Sep 11, 2019

Copy link
Copy Markdown
Member Author

We could have a pathlib compatible Path implementation as well. Whatever, the API we're providing is not stable yet.

Comment thread python/pyarrow/tests/test_fs.py Outdated
Comment thread python/pyarrow/_fs.pyx
@pitrou

pitrou commented Sep 16, 2019

Copy link
Copy Markdown
Member

For now I'm not doing a full review again, feel free to ping me when you think you're ready (and CI is fixed :-)).

@kszucs

kszucs commented Sep 17, 2019

Copy link
Copy Markdown
Member Author

@pitrou this should be green now. I'd like to move forward with bindings for S3 once it is merged.

@pitrou pitrou left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just two more comments, sorry ;-)

Comment thread python/pyarrow/_fs.pyx Outdated
Comment thread python/pyarrow/_fs.pyx Outdated
since the C++ side then decodes from utf-8. On Unix, os.fsencode may be
better.
"""
return tobytes(_stringify_path(path))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think this shouldn't accept Path objects, only str and bytes. Path objects are only valid for local paths, but here we are dealing with paths in arbitrary systems.

@kszucs kszucs Sep 17, 2019

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pitrou done, actually we don't accept bytes neither, only strings (we're discussing it previously)

@kszucs

kszucs commented Sep 17, 2019

Copy link
Copy Markdown
Member Author

@pitrou done, builds are green, the travis error is about codecov upload failure.

@pitrou pitrou closed this in ae20ce8 Sep 17, 2019
@pitrou

pitrou commented Sep 17, 2019

Copy link
Copy Markdown
Member

This is a very nice start, thank you :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants