Currently, when saving a `ParquetDataset` from Pandas, we don't get consistent schemas, even if the source was a single DataFrame. This is because object columns such as strings can be empty in some partitions, in which case the resulting Arrow schema differs. In the central metadata, we store this column as `pa.string`, whereas in the partition file with the empty column, the same column is stored as `pa.null`.
The two schemas are still a valid match in terms of schema evolution, and we should respect that in `def validate_schemas(self):` (`arrow/python/pyarrow/parquet.py`, line 754 in 79a2207).

Instead of doing a `pa.Schema.equals` in `if not dataset_schema.equals(file_schema, check_metadata=False):` (`arrow/python/pyarrow/parquet.py`, line 778 in 79a2207), we should introduce a new method `pa.Schema.can_evolve_to` that is more graceful and returns `True` if a dataset piece has a null column where the main metadata states a nullable column of any type.
Reporter: Uwe Korn / @xhochy
Note: This issue was originally created as ARROW-2659. Please see the migration documentation for further details.