Skip to content

ARROW-8644: [Python] Restore ParquetDataset behaviour to always include partition column for dask compatibility#7096

Closed
jorisvandenbossche wants to merge 2 commits into
apache:masterfrom
jorisvandenbossche:ARROW-8644-dask-partitioned
Closed

ARROW-8644: [Python] Restore ParquetDataset behaviour to always include partition column for dask compatibility#7096
jorisvandenbossche wants to merge 2 commits into
apache:masterfrom
jorisvandenbossche:ARROW-8644-dask-partitioned

Conversation

@jorisvandenbossche

@jorisvandenbossche jorisvandenbossche commented May 4, 2020

Copy link
Copy Markdown
Member

Given that the original change (https://issues.apache.org/jira/browse/ARROW-3861 / #7050) breaks dask's reading of partitioned datasets (it doesn't add the partition column to the list of columns to read, but expects it will still be read automatically), it doesn't seem worth it to me to fix this in the "old" ParquetDataset implementation.

But we can keep the "correct" behaviour in the Datasets API - based implementation going forward.

@github-actions

github-actions Bot commented May 4, 2020

Copy link
Copy Markdown

@jorisvandenbossche

Copy link
Copy Markdown
Member Author

@github-actions crossbow submit test-conda-python-3.7-dask-latest test-conda-python-3.8-dask-master

@github-actions

github-actions Bot commented May 4, 2020

Copy link
Copy Markdown

Revision: 3e480a9

Submitted crossbow builds: ursa-labs/crossbow @ actions-200

Task Status
test-conda-python-3.7-dask-latest CircleCI
test-conda-python-3.8-dask-master CircleCI

@jorisvandenbossche

jorisvandenbossche commented May 4, 2020

Copy link
Copy Markdown
Member Author

So the question comes up if we actually should have the same behaviour in case of use_legacy_dataset=False (the _ParquetDatasetV2 shim).

For me, that depends a bit on what we want to do long term with ParquetDataset. If we want to keep it as "the" ParquetDataset (maybe becoming a subclass of the actual Dataset class then), then I think it should have the "correct" behaviour. If we only see it as a temporary vehicle to get people try it out / have poeple eventually use the pyarrow.dataset API, then it is less important to fix it.

@wesm

wesm commented May 4, 2020

Copy link
Copy Markdown
Member

This is a regression? If so can you mark it with 0.17.1 (and 1.0.0)

@jorisvandenbossche

Copy link
Copy Markdown
Member Author

No, the PR that caused the regression was only merged after 0.17

@wesm

wesm commented May 5, 2020

Copy link
Copy Markdown
Member

For me, that depends a bit on what we want to do long term with ParquetDataset.

IMHO we should be trying to transition people off of this, so that Parquet format isn't necessarily treated separately from general datasets (which may contain different format files). We can keep discussing this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants