Skip to content

ARROW-7965: [Python] Refine higher level dataset API#6505

Closed
kszucs wants to merge 27 commits into
apache:masterfrom
kszucs:dataset-factory-reference
Closed

ARROW-7965: [Python] Refine higher level dataset API#6505
kszucs wants to merge 27 commits into
apache:masterfrom
kszucs:dataset-factory-reference

Conversation

@kszucs
Copy link
Copy Markdown
Member

@kszucs kszucs commented Feb 28, 2020

Provides a more intuitive way to construct nested dataset:

# instead of using confusing factory function
dataset([
     factory("s3://old-taxi-data", format="parquet"),
     factory("local/path/to/new/data", format="csv")
])

# let the user to construct a new dataset directly from dataset objects
dataset([ 
    dataset("s3://old-taxi-data", format="parquet"),
    dataset("local/path/to/new/data", format="csv")
])

In the future we might want to introduce a new Dataset class which wraps functionality of both the dataset actory and the materialized dataset enabling optimizations over rediscovery of already materialized datasets.

@kszucs kszucs changed the title [Python] Hold a reference to the dataset factory for later reuse ARROW-7965: [Python] Hold a reference to the dataset factory for later reuse Feb 28, 2020
@apache apache deleted a comment from github-actions Bot Feb 28, 2020
@github-actions
Copy link
Copy Markdown

Comment thread python/pyarrow/_dataset.pyx Outdated
@jorisvandenbossche
Copy link
Copy Markdown
Member

That makes the usage indeed a bit easier. But I am wondering: how expensive is "finishing" the factory? As now you are finishing all sub-datasets, just to reuse the non-finished factory later one.

@kszucs
Copy link
Copy Markdown
Member Author

kszucs commented Feb 28, 2020

Yeah, that's a downside until we wrap both the DatasetFactory and Dataset with a single class which does the discovery lazily.

Comment thread python/pyarrow/dataset.py Outdated
@nealrichardson
Copy link
Copy Markdown
Member

I agree with the goal here, but I wonder if the solution should perhaps be in C++? That way we wouldn't have to reimplement this in R too.

@kszucs
Copy link
Copy Markdown
Member Author

kszucs commented Mar 2, 2020

I agree with the goal here, but I wonder if the solution should perhaps be in C++? That way we wouldn't have to reimplement this in R too.

cc @bkietz

@jorisvandenbossche
Copy link
Copy Markdown
Member

If pushing into C++, that will have the same problem of the double cost of the dataset discovery

@kszucs
Copy link
Copy Markdown
Member Author

kszucs commented Mar 2, 2020

@jorisvandenbossche in order to reuse the inspected schemas AFAICS we need "cache" them in the C++ instances and return them from InspectSchemas() unless we request an explicit re-discovery.

@jorisvandenbossche
Copy link
Copy Markdown
Member

Trying to put in words a bit more structured what I said yesterday in the meeting:

Right now, we need to keep a reference to the dataset factories to be able to reuse those factories, since UnionDatasetFactory is written to expect a vector of factories.

But instead of fixing this on the dataset side by holding this reference, can we change the implementation of UnionDatasetFactory to actually accepts a vector of datasets instead of a vector of factories?

Looking at the UnionDatasetFactory implementation, it doesn't look like that still needs the schema's of all files/fragments of its sub-datasets. Rather, it only needs the finished schema of its sub-datasets (and then it checks if those are compatible / can be unified). But assuming that those sub-datasets would also created with a factory, the schema returned from child_factory->Inspect() will be the same as the schema from the finished dataset from that child_factory.

@jorisvandenbossche
Copy link
Copy Markdown
Member

OK, I wanted to put this in some pseudo python code to explain it better, which lead me to see that my suggestion is right now indeed not possible ;)

The current logic:

def UnionDatasetFactory(factories):
    # inspect
    schemas = []
    for fact in factories:
        schemas.append(fact.inspect())
    schema = UnifySchemas(schemas)

    # finish
    datasets = []
    for fact in factories:
        datasets.append(fact.finish(schema))
    return UnionDataset(datasets, schema)

The issue that makes it not possible right now to pass a vector of datasets, is that the sub-datasets need to be finished with a potentially different (unified) schema.
So my question is then: would it be possible to kind of create a new "view" on an existing dataset with a different (but compatible!) schema? In the end, you could create a new Dataset passing all building blocks of the original dataset (the format, filesystem, forest, file_partitions) but with only specifying a different schema?

@jorisvandenbossche
Copy link
Copy Markdown
Member

@bkietz @fsaintjacques with your deeper knowledge of the C++ dataset code, could you confirm or not whether my idea above is possible: having a UnionDataset constructor taking a vector of Dataset of objects instead of vector of DatasetFactories ?

@bkietz
Copy link
Copy Markdown
Member

bkietz commented Mar 19, 2020

@jorisvandenbossche this is already supported in C++, but currently the schemas must be identical rather than merely compatible. Creating a view of a dataset with differing schema would probably be straightforward, created https://issues.apache.org/jira/browse/ARROW-8164 to track this feature

@jorisvandenbossche
Copy link
Copy Markdown
Member

Thanks for opening the issue!

@wesm
Copy link
Copy Markdown
Member

wesm commented Apr 3, 2020

What's the status of this patch? It's still in the 0.17.0 backlog

@jorisvandenbossche
Copy link
Copy Markdown
Member

This is blocked by #6721 (which is blocked by a strange R failure, I think).

I added it to the 0.17 milestone, as it it would be nice to get this in, as I noted in the JIRA:

This depends on ARROW-8164, but if that gets merged quickly, it would be nice tackle this issue for 0.17 since that would enable us to remove factory() from the high-level user API (which wasn't there yet in 0.16, so this would avoid it ever being in a released version)

But it is also certainly not a blocker, since the datasets API is still experimental anyway.

@nealrichardson
Copy link
Copy Markdown
Member

#6721 has been merged, so this is unblocked

@kszucs kszucs force-pushed the dataset-factory-reference branch from 95a2508 to f761dbd Compare April 9, 2020 18:32
@kszucs kszucs changed the title ARROW-7965: [Python] Hold a reference to the dataset factory for later reuse ARROW-7965: [Python] Refine higher level dataset API Apr 10, 2020
Comment thread python/pyarrow/dataset.py Outdated
Comment thread python/pyarrow/tests/test_dataset.py Outdated
Copy link
Copy Markdown
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for this PR! The diff is a bit hard to interpret, but generally looks good to me.

One thing I am not really sure about is the string URI value to specify the filesystem. What's the use for that? If you have a URI, just pass it to the source? This seems a needless complication to me, and not something we support elsewhere? (or do we?)

Comment thread python/pyarrow/types.pxi Outdated
Comment thread python/pyarrow/types.pxi Outdated
Comment thread python/pyarrow/types.pxi Outdated
Comment thread python/pyarrow/tests/test_dataset.py Outdated
Comment thread python/pyarrow/tests/test_dataset.py Outdated
Comment thread python/pyarrow/dataset.py Outdated
Comment thread python/pyarrow/dataset.py Outdated
Comment thread python/pyarrow/dataset.py Outdated
Comment thread python/pyarrow/dataset.py
Comment thread python/pyarrow/dataset.py
Comment thread python/pyarrow/_dataset.pyx Outdated
Comment thread python/pyarrow/_dataset.pyx Outdated
Comment thread python/pyarrow/dataset.py Outdated
Comment thread python/pyarrow/dataset.py Outdated
Comment thread python/pyarrow/dataset.py Outdated
Comment thread python/pyarrow/dataset.py Outdated
@kszucs
Copy link
Copy Markdown
Member Author

kszucs commented Apr 10, 2020

@github-actions crossbow submit conda-win-vs2015-py36

@github-actions
Copy link
Copy Markdown

Revision: dacded6

Submitted crossbow builds: ursa-labs/crossbow @ actions-90

Task Status
conda-win-vs2015-py36 Azure

@kszucs kszucs force-pushed the dataset-factory-reference branch from 3f21c9a to 5692edc Compare April 14, 2020 02:16
@kszucs
Copy link
Copy Markdown
Member Author

kszucs commented Apr 14, 2020

@github-actions crossbow submit conda-win-vs2015-py36

@github-actions
Copy link
Copy Markdown

Revision: 5692edc

Submitted crossbow builds: ursa-labs/crossbow @ actions-111

Task Status
conda-win-vs2015-py36 Azure

@jorisvandenbossche
Copy link
Copy Markdown
Member

Failures are due to "ignore_prefix" -> "selector_ignore_prefix" rename on the C++ side.

Since all Ursabot python builds are passing, it also seems they are not running any of the parquet or dataset related tests ..

@kszucs
Copy link
Copy Markdown
Member Author

kszucs commented Apr 14, 2020

@github-actions crossbow submit conda-win-vs2015-py36

@github-actions
Copy link
Copy Markdown

Revision: acbe2cb

Submitted crossbow builds: ursa-labs/crossbow @ actions-113

Task Status
conda-win-vs2015-py36 Azure

@kszucs
Copy link
Copy Markdown
Member Author

kszucs commented Apr 14, 2020

Failures are due to "ignore_prefix" -> "selector_ignore_prefix" rename on the C++ side.

Updated.

Since all Ursabot python builds are passing, it also seems they are not running any of the parquet or dataset related tests ..

Yes, but we're not actively maintaining the ursabot builds now.

Comment thread python/pyarrow/dataset.py Outdated
Comment thread python/pyarrow/tests/test_dataset.py Outdated
Copy link
Copy Markdown
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only skimmed through this, a few comments.

Comment thread python/pyarrow/dataset.py Outdated
Comment thread python/pyarrow/dataset.py Outdated
Comment thread python/pyarrow/dataset.py Outdated
Comment thread python/pyarrow/dataset.py Outdated
Comment thread python/pyarrow/dataset.py Outdated
Comment thread python/pyarrow/dataset.py Outdated
Comment thread python/pyarrow/dataset.py
Comment thread python/pyarrow/dataset.py

def factory(path_or_paths, filesystem=None, partitioning=None,
format=None):
def _ensure_filesystem(fs_or_uri):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be for a follow-up JIRA, but we might want to move this helper to pyarrow.fs, and use some this also in other places where we accept filesystems? (although I actually don't know if there are already many such places ..)

Comment thread python/pyarrow/dataset.py Outdated
Comment thread python/pyarrow/tests/test_dataset.py Outdated
@jorisvandenbossche
Copy link
Copy Markdown
Member

@kszucs can you also add this patch to this branch:

--- a/python/pyarrow/tests/test_parquet.py
+++ b/python/pyarrow/tests/test_parquet.py
@@ -2490,13 +2490,7 @@ def _assert_dataset_paths(dataset, paths, use_legacy_dataset):
         assert set(map(str, paths)) == {x.path for x in dataset.pieces}
     else:
         paths = [str(path.as_posix()) for path in paths]
-        if hasattr(dataset._dataset, 'files'):
-            assert set(paths) == set(dataset._dataset.files)
-        else:
-            # UnionDataset
-            # TODO(temp hack) remove this branch once ARROW-7965 is in (which
-            # will change this to a FileSystemDataset)
-            assert dataset.read().num_rows == 50
+        assert set(paths) == set(dataset._dataset.files)

(that's a clean-up for a hack I added yesterday becuase of a list of paths giving a UnionDataset, which this PR is fixing)

Copy link
Copy Markdown
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. A few more comments.

Comment thread python/pyarrow/dataset.py Outdated
Comment thread python/pyarrow/tests/test_dataset.py Outdated
Comment thread python/pyarrow/tests/test_dataset.py
Comment thread python/pyarrow/tests/test_schema.py
Comment thread python/pyarrow/types.pxi Outdated
Comment thread python/pyarrow/types.pxi
Comment thread python/pyarrow/types.pxi Outdated
@kszucs
Copy link
Copy Markdown
Member Author

kszucs commented Apr 14, 2020

Thanks Joris, Ben, Antoine! Merging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants