[Python] Datatypes not preserved for partition fields in roundtrip to partitioned parquet dataset

### Datatypes are not preserved when a pandas DataFrame is **partitioned** and saved as a Parquet dataset using pyarrow. They are preserved when the DataFrame is saved without partitioning.

**Case 1: Saving a partitioned dataset - Data Types are NOT preserved**
```python
# Save a pandas DataFrame locally as a partitioned Parquet dataset using pyarrow
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'age': [77, 32, 234], 'name': ['agan', 'bbobby', 'test']})
path = 'test'
partition_cols = ['age']
print('Datatypes before saving the dataset')
print(df.dtypes)
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, path, partition_cols=partition_cols, preserve_index=False)

# Load the partitioned Parquet dataset back from local storage
df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas()
print('\nDatatypes after loading the dataset')
print(df.dtypes)
```
**Output:**
```text
Datatypes before saving the dataset
age      int64
name    object
dtype: object

Datatypes after loading the dataset
name      object
age     category
dtype: object
```
##### The output above shows that `age` is int64 in the original pandas DataFrame but becomes category after the dataset is saved locally and loaded back.
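
A likely explanation is that a partitioned dataset stores the partition column in directory names such as `age=77` rather than inside the Parquet files, so the value comes back as text and its type has to be inferred by the reader. The sketch below illustrates only that idea using the standard library. `parse_hive_partitions` is a hypothetical helper written for this illustration and is not pyarrow's implementation.

```python
from pathlib import PurePosixPath


def parse_hive_partitions(path):
    """Collect key=value directory segments from a dataset file path."""
    parts = {}
    for segment in PurePosixPath(path).parts:
        if "=" in segment:
            key, _, value = segment.partition("=")
            parts[key] = value  # the value is plain text; no dtype travels with it
    return parts


keys = parse_hive_partitions("test/age=77/part-0.parquet")
print(keys)               # {'age': '77'}
print(type(keys["age"]))  # <class 'str'>
```

Because only the text `77` survives in the path, the original int64 dtype is not available unless the reader is told about it or infers it.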

**Case 2: Non-partitioned dataset - Data types are preserved**
```python
# Save a pandas DataFrame locally as a Parquet dataset without partitioning using pyarrow
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

print('Saving a Pandas Dataframe to Local as a parquet file without partitioning using pyarrow')
df = pd.DataFrame({'age': [77, 32, 234], 'name': ['agan', 'bbobby', 'test']})
path = 'test_without_partition'
print('Datatypes before saving the dataset')
print(df.dtypes)
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, path, preserve_index=False)

# Load the non-partitioned Parquet dataset back from local storage
df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas()
print('\nDatatypes after loading the dataset')
print(df.dtypes)
```
**Output:**
```text
Saving a Pandas Dataframe to Local as a parquet file without partitioning using pyarrow
Datatypes before saving the dataset
age      int64
name    object
dtype: object

Datatypes after loading the dataset
age      int64
name    object
dtype: object
```
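
Until the read path preserves the type, one possible workaround for Case 1 is to cast the partition column back after loading. This is a minimal sketch, not a fix in pyarrow. The `loaded` frame is a hand-built stand-in for the result of the partitioned read, and `original_dtypes` is an assumed mapping that the caller would have to record before writing.

```python
import pandas as pd

# Stand-in for the DataFrame returned by the partitioned read in Case 1:
# the partition column 'age' comes back as a categorical.
loaded = pd.DataFrame({
    'name': ['agan', 'bbobby', 'test'],
    'age': pd.Categorical([77, 32, 234]),
})

# Assumption: the caller saved the dtypes of the partition columns before writing.
original_dtypes = {'age': 'int64'}

# Cast the partition column back to its original dtype.
restored = loaded.astype(original_dtypes)
print(restored.dtypes)
```

The cast only works when every partition value can be converted to the target dtype, so it is a stopgap rather than a general solution.
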
**Versions**
- Python 3.7.3
- pyarrow 0.14.1

**Environment**: Python 3.7.3, pyarrow 0.14.1

**Reporter**: [Naga](https://issues.apache.org/jira/browse/ARROW-6114)

#### Related issues:
- [[Python] Underscores in partition (string) values are dropped when reading dataset](https://github.com/apache/arrow/issues/22099) (is related to)
- [[C++][Dataset] Automatically detect boolean partition columns](https://github.com/apache/arrow/issues/19718) (is related to)
- [[Python][Dataset] Support using dataset API in pyarrow.parquet with a minimal ParquetDataset shim](https://github.com/apache/arrow/issues/17077) (depends upon)

<sub>**Note**: *This issue was originally created as [ARROW-6114](https://issues.apache.org/jira/browse/ARROW-6114). Please see the [migration documentation](https://github.com/apache/arrow/issues/14542) for further details.*</sub>
