ARROW-7965: [Python] Refine higher level dataset API by kszucs · Pull Request #6505 · apache/arrow

kszucs · 2020-02-28T16:45:02Z

Provides a more intuitive way to construct a nested dataset:

# instead of using the confusing factory function
dataset([
    factory("s3://old-taxi-data", format="parquet"),
    factory("local/path/to/new/data", format="csv")
])

# let the user construct a new dataset directly from dataset objects
dataset([
    dataset("s3://old-taxi-data", format="parquet"),
    dataset("local/path/to/new/data", format="csv")
])

In the future we might want to introduce a new Dataset class which wraps the functionality of both the dataset factory and the materialized dataset, enabling optimizations that avoid rediscovery of already materialized datasets.

github-actions · 2020-02-28T17:02:27Z

https://issues.apache.org/jira/browse/ARROW-7965

jorisvandenbossche · 2020-02-28T20:10:53Z

That indeed makes the usage a bit easier. But I am wondering: how expensive is "finishing" the factory? As it stands, you are finishing all sub-datasets just to reuse the non-finished factory later on.

kszucs · 2020-02-28T20:30:08Z

Yeah, that's a downside until we wrap both the DatasetFactory and Dataset with a single class which does the discovery lazily.

nealrichardson · 2020-02-28T22:23:57Z

I agree with the goal here, but I wonder if the solution should perhaps be in C++? That way we wouldn't have to reimplement this in R too.

kszucs · 2020-03-02T09:26:29Z

> I agree with the goal here, but I wonder if the solution should perhaps be in C++? That way we wouldn't have to reimplement this in R too.

cc @bkietz

jorisvandenbossche · 2020-03-02T09:29:00Z

If pushing into C++, that will have the same problem of the double cost of the dataset discovery

kszucs · 2020-03-02T11:16:42Z

@jorisvandenbossche in order to reuse the inspected schemas, AFAICS we need to "cache" them in the C++ instances and return them from InspectSchemas() unless we request an explicit re-discovery.
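
The caching idea above can be illustrated with a minimal pure-Python sketch. It is not the Arrow C++ implementation: `CachingFactory`, `inspect_schemas`, the `rediscover` flag, and `scan_files` are hypothetical names invented for illustration, and a schema is modelled as a plain dict.

```python
from typing import Callable, Dict, List, Optional

Schema = Dict[str, str]  # toy stand-in: column name -> type name


class CachingFactory:
    """Hypothetical factory that remembers the schemas it has inspected."""

    def __init__(self, discover: Callable[[], List[Schema]]):
        self._discover = discover  # the expensive discovery step
        self._schemas: Optional[List[Schema]] = None
        self.discovery_count = 0  # only here to make the cost visible

    def inspect_schemas(self, rediscover: bool = False) -> List[Schema]:
        # Run discovery on first use, or when a refresh is explicitly requested.
        if self._schemas is None or rediscover:
            self.discovery_count += 1
            self._schemas = self._discover()
        return self._schemas


def scan_files() -> List[Schema]:
    # Stand-in for listing files and reading their metadata.
    return [{"vendor_id": "string"}, {"vendor_id": "string", "fare": "double"}]


factory = CachingFactory(scan_files)
factory.inspect_schemas()                 # triggers discovery
factory.inspect_schemas()                 # served from the cache
factory.inspect_schemas(rediscover=True)  # explicit re-discovery
print(factory.discovery_count)
```

With a cache like this, finishing a sub-dataset and later reusing its factory would pay the discovery cost once rather than twice.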

jorisvandenbossche · 2020-03-03T10:33:37Z

Trying to put what I said in yesterday's meeting into words in a more structured way:

Right now, we need to keep a reference to the dataset factories to be able to reuse those factories, since UnionDatasetFactory is written to expect a vector of factories.

But instead of fixing this on the dataset side by holding this reference, can we change the implementation of UnionDatasetFactory to actually accept a vector of datasets instead of a vector of factories?

Looking at the UnionDatasetFactory implementation, it doesn't look like it still needs the schemas of all files/fragments of its sub-datasets. Rather, it only needs the finished schema of each sub-dataset (and then it checks whether those are compatible / can be unified). But assuming that those sub-datasets would also be created with a factory, the schema returned from child_factory->Inspect() will be the same as the schema of the finished dataset from that child_factory.

jorisvandenbossche · 2020-03-03T10:41:51Z

OK, I wanted to put this in some pseudo Python code to explain it better, which led me to see that my suggestion is indeed not possible right now ;)

The current logic:

def UnionDatasetFactory(factories):
    # inspect
    schemas = []
    for fact in factories:
        schemas.append(fact.inspect())
    schema = UnifySchemas(schemas)

    # finish
    datasets = []
    for fact in factories:
        datasets.append(fact.finish(schema))
    return UnionDataset(datasets, schema)

The issue that makes it impossible right now to pass a vector of datasets is that the sub-datasets need to be finished with a potentially different (unified) schema.
So my question is then: would it be possible to create a kind of new "view" on an existing dataset with a different (but compatible!) schema? In the end, you could create a new Dataset by passing all building blocks of the original dataset (the format, filesystem, forest, file_partitions) and only specifying a different schema?
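
A rough, self-contained sketch of that "view" idea follows. Everything in it is an assumption made for illustration: `ToyDataset`, `with_schema`, `unify_schemas`, and `union_dataset` are hypothetical names rather than Arrow APIs, schemas are plain dicts, and the file names under the two example locations are made up.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Tuple

Schema = Dict[str, str]  # toy stand-in: column name -> type name


def unify_schemas(schemas: List[Schema]) -> Schema:
    """Merge schemas field by field; conflicting types are an error."""
    unified: Schema = {}
    for schema in schemas:
        for name, type_name in schema.items():
            if unified.setdefault(name, type_name) != type_name:
                raise TypeError(f"incompatible types for field {name!r}")
    return unified


@dataclass(frozen=True)
class ToyDataset:
    files: Tuple[str, ...]
    schema: Schema

    def with_schema(self, schema: Schema) -> "ToyDataset":
        # Reuse the already discovered files; only the schema changes.
        if any(schema.get(name) != t for name, t in self.schema.items()):
            raise TypeError("new schema is not compatible with the dataset")
        return replace(self, schema=schema)


def union_dataset(children: List[ToyDataset]):
    # No factories needed: unify the finished schemas, then re-view each child.
    schema = unify_schemas([child.schema for child in children])
    return [child.with_schema(schema) for child in children], schema


# Made-up file names, for illustration only.
old = ToyDataset(("s3://old-taxi-data/a.parquet",), {"vendor_id": "string"})
new = ToyDataset(("local/path/to/new/data/b.csv",),
                 {"vendor_id": "string", "fare": "double"})
children, schema = union_dataset([old, new])
print(schema)
```

The point of the sketch is that the children are never rediscovered: each view shares the original file list and differs only in its schema.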

jorisvandenbossche · 2020-03-17T09:10:58Z

@bkietz @fsaintjacques with your deeper knowledge of the C++ dataset code, could you confirm whether my idea above is possible: having a UnionDataset constructor that takes a vector of Dataset objects instead of a vector of DatasetFactories?

bkietz · 2020-03-19T15:44:44Z

@jorisvandenbossche this is already supported in C++, but currently the schemas must be identical rather than merely compatible. Creating a view of a dataset with a differing schema would probably be straightforward; I created https://issues.apache.org/jira/browse/ARROW-8164 to track this feature.

jorisvandenbossche · 2020-03-19T18:22:43Z

Thanks for opening the issue!

wesm · 2020-04-03T23:32:34Z

What's the status of this patch? It's still in the 0.17.0 backlog

jorisvandenbossche · 2020-04-04T06:43:22Z

This is blocked by #6721 (which is blocked by a strange R failure, I think).

I added it to the 0.17 milestone, as it would be nice to get this in, as I noted in the JIRA:

> This depends on ARROW-8164, but if that gets merged quickly, it would be nice to tackle this issue for 0.17, since that would enable us to remove factory() from the high-level user API (which wasn't there yet in 0.16, so this would avoid it ever being in a released version)

But it is also certainly not a blocker, since the datasets API is still experimental anyway.

nealrichardson · 2020-04-09T15:26:32Z

#6721 has been merged, so this is unblocked

jorisvandenbossche

Thanks a lot for this PR! The diff is a bit hard to interpret, but generally looks good to me.

One thing I am not really sure about is the string URI value to specify the filesystem. What's the use for that? If you have a URI, just pass it to the source? This seems a needless complication to me, and not something we support elsewhere? (or do we?)

kszucs · 2020-04-10T19:58:23Z

@github-actions crossbow submit conda-win-vs2015-py36

github-actions · 2020-04-10T19:59:41Z

Revision: dacded6

Submitted crossbow builds: ursa-labs/crossbow @ actions-90

Task	Status
conda-win-vs2015-py36

kszucs · 2020-04-14T02:22:18Z

@github-actions crossbow submit conda-win-vs2015-py36

github-actions · 2020-04-14T02:23:43Z

Revision: 5692edc

Submitted crossbow builds: ursa-labs/crossbow @ actions-111

Task	Status
conda-win-vs2015-py36

jorisvandenbossche · 2020-04-14T09:52:44Z

Failures are due to "ignore_prefix" -> "selector_ignore_prefix" rename on the C++ side.

Since all Ursabot python builds are passing, it also seems they are not running any of the parquet or dataset related tests ..

kszucs · 2020-04-14T10:11:41Z

@github-actions crossbow submit conda-win-vs2015-py36

github-actions · 2020-04-14T10:13:06Z

Revision: acbe2cb

Submitted crossbow builds: ursa-labs/crossbow @ actions-113

Task	Status
conda-win-vs2015-py36

kszucs · 2020-04-14T10:13:50Z

> Failures are due to "ignore_prefix" -> "selector_ignore_prefix" rename on the C++ side.

Updated.

> Since all Ursabot python builds are passing, it also seems they are not running any of the parquet or dataset related tests ..

Yes, but we're not actively maintaining the ursabot builds now.

pitrou

I only skimmed through this, a few comments.

jorisvandenbossche · 2020-04-14T14:15:19Z


-def factory(path_or_paths, filesystem=None, partitioning=None,
-            format=None):
+def _ensure_filesystem(fs_or_uri):


This can be for a follow-up JIRA, but we might want to move this helper to pyarrow.fs and also use it in other places where we accept filesystems? (although I actually don't know if there are already many such places ..)

jorisvandenbossche · 2020-04-14T14:39:37Z

@kszucs can you also add this patch to this branch:

--- a/python/pyarrow/tests/test_parquet.py
+++ b/python/pyarrow/tests/test_parquet.py
@@ -2490,13 +2490,7 @@ def _assert_dataset_paths(dataset, paths, use_legacy_dataset):
         assert set(map(str, paths)) == {x.path for x in dataset.pieces}
     else:
         paths = [str(path.as_posix()) for path in paths]
-        if hasattr(dataset._dataset, 'files'):
-            assert set(paths) == set(dataset._dataset.files)
-        else:
-            # UnionDataset
-            # TODO(temp hack) remove this branch once ARROW-7965 is in (which
-            # will change this to a FileSystemDataset)
-            assert dataset.read().num_rows == 50
+        assert set(paths) == set(dataset._dataset.files)

(that's a clean-up for a hack I added yesterday because a list of paths was giving a UnionDataset, which this PR is fixing)

pitrou

Thank you. A few more comments.

kszucs · 2020-04-14T22:52:04Z

Thanks Joris, Ben, Antoine! Merging.

kszucs changed the title ~~[Python] Hold a reference to the dataset factory for later reuse~~ ARROW-7965: [Python] Hold a reference to the dataset factory for later reuse Feb 28, 2020

apache deleted a comment from github-actions Bot Feb 28, 2020

kszucs requested a review from jorisvandenbossche February 28, 2020 16:57

jorisvandenbossche reviewed Feb 28, 2020

Comment thread python/pyarrow/_dataset.pyx Outdated

kszucs commented Feb 28, 2020

Comment thread python/pyarrow/dataset.py Outdated

kszucs force-pushed the dataset-factory-reference branch from 95a2508 to f761dbd Compare April 9, 2020 18:32

kszucs changed the title ~~ARROW-7965: [Python] Hold a reference to the dataset factory for later reuse~~ ARROW-7965: [Python] Refine higher level dataset API Apr 10, 2020

kszucs requested review from bkietz and jorisvandenbossche April 10, 2020 01:49

kszucs commented Apr 10, 2020

Comment thread python/pyarrow/dataset.py Outdated

kszucs commented Apr 10, 2020

Comment thread python/pyarrow/tests/test_dataset.py Outdated

jorisvandenbossche reviewed Apr 10, 2020

bkietz requested changes Apr 10, 2020

kszucs added 2 commits April 14, 2020 04:15

more tests; make subtreefs path concatenation more robust

ff69b9d

more examples

5692edc

kszucs force-pushed the dataset-factory-reference branch from 3f21c9a to 5692edc Compare April 14, 2020 02:16

update selector_ignore_prefixes

acbe2cb

kszucs added 2 commits April 14, 2020 12:47

fix filesystem uri

a002b49

fix remaining occurences of wrong file uri

101babe

kszucs requested a review from jorisvandenbossche April 14, 2020 11:17

jorisvandenbossche reviewed Apr 14, 2020

Comment thread python/pyarrow/dataset.py Outdated

pitrou reviewed Apr 14, 2020

Comment thread python/pyarrow/tests/test_dataset.py Outdated

pitrou reviewed Apr 14, 2020

kszucs added 3 commits April 14, 2020 14:55

provide different uris on windows

a94bc10

address review comments

9eb9704

windows paths...

024a5cc

jorisvandenbossche approved these changes Apr 14, 2020

address review comments

e8c063b

pitrou requested changes Apr 14, 2020

address review comments

6fed057

pitrou approved these changes Apr 14, 2020

don't use threads

b9d95ed

kszucs closed this in 68475dd Apr 14, 2020

asfimport mentioned this pull request Apr 14, 2020

[Python] Refine higher level dataset API #24184

Closed

asfimport mentioned this pull request Apr 10, 2020

[Python][Dataset] Infer the filesystem from the first path if multiple paths are passed to dataset() #24583

Closed
