[Python] Long-term fate of pyarrow.parquet.ParquetDataset #25775

@asfimport

Description

The business logic of the Python implementation for reading partitioned Parquet datasets in pyarrow.parquet.ParquetDataset has been ported to C++ (ARROW-3764), and the new implementation can be enabled optionally in ParquetDataset(..) by passing use_legacy_dataset=False (ARROW-8039).

But the question remains: what do we do with this class in the long term?

So for users who now do:

dataset = pq.ParquetDataset(...)
dataset.metadata
table = dataset.read()

what should they do in the future?
Do we keep a class like this (but backed by the pyarrow.dataset implementation), or do we deprecate the class entirely and point users to dataset = ds.dataset(..., format="parquet")?

In any case, we should strive to entirely delete the current custom Python implementation, but we could keep a ParquetDataset class that wraps or inherits from pyarrow.dataset.FileSystemDataset and adds some Parquet specifics to it (e.g. access to the Parquet schema, the common metadata, exposing the Parquet-specific constructor keywords more easily, ..).

Features the ParquetDataset currently has that are not exactly covered by pyarrow.dataset:

  • Partitioning information (the .partitions attribute)
  • Access to common metadata (.metadata_path, .common_metadata_path and .metadata attributes)
  • ParquetSchema of the dataset

Reporter: Joris Van den Bossche / @jorisvandenbossche

Note: This issue was originally created as ARROW-9720. Please see the migration documentation for further details.
