When reading a partitioned dataset whose partition column contains string values with underscores, pyarrow appears to drop the underscores from the resulting values.
For example, if I write and then read a dataset as follows:
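import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq  # provides write_to_dataset and ParquetDataset

df = pd.DataFrame({
    "year_week": ["2019_2", "2019_3"],
    "value": [1, 2]
})
table = pa.Table.from_pandas(df.head())
# Write one directory per distinct year_week value, then read the dataset back.
pq.write_to_dataset(table, 'test', partition_cols=["year_week"])
table2 = pq.ParquetDataset('test').read()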
The resulting 'year_week' column in table2 has lost the underscores, and its values come back as int64 rather than strings:
table2[1] # Gives:
<Column name='year_week' type=DictionaryType(dictionary<values=int64, indices=int32, ordered=0>)>
[
-- dictionary:
[
20192,
20193
]
-- indices:
[
0
],
-- dictionary:
[
20192,
20193
]
-- indices:
[
1
]
]
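One plausible explanation, stated here as an assumption rather than a confirmed diagnosis of pyarrow's internals, is that the partition values are tentatively parsed as integers when the directory names are read back. Since Python 3.6, `int()` accepts underscores as digit separators, so a string like "2019_2" parses successfully. The helper below (`infer_partition_value` is a hypothetical name, not a pyarrow function) sketches how such a try-int-first inference would produce the values shown above:

```python
def infer_partition_value(raw):
    """Hypothetical inference step: try int first, else keep the string."""
    try:
        # int() accepts underscores as digit separators (Python 3.6+),
        # so "2019_2" is parsed as the integer 20192.
        return int(raw)
    except ValueError:
        # Anything int() rejects stays a string.
        return raw


for raw in ["2019_2", "2019_3", "2019-2"]:
    print(repr(raw), "->", repr(infer_partition_value(raw)))
```

If this is what happens, a partition value with a different separator, such as "2019-2", would be kept as a string, because `int()` rejects it.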
Is this intentional behaviour or is this a bug in arrow?
Reporter: Julian de Ruiter
Assignee: Joris Van den Bossche / @jorisvandenbossche
Note: This issue was originally created as ARROW-5666. Please see the migration documentation for further details.