Possible to read categoricals back into Pandas from Parquet using Pyarrow? #1688

Description

@kylebarron

Apologies if this is already covered in the documentation; I've tried to look through it and through the issues here and on JIRA. If so, just close this issue.

Is it possible to have a Pandas dataset with categorical variables, save it to Parquet, and read in those variables as categorical again?

Setup:

In [1]: import fastparquet as fp
In [2]: import pyarrow as pa
In [3]: import pyarrow.parquet as pq
In [4]: import pandas as pd
In [5]: df = pd.DataFrame({'A':['a','b','c','a']})
In [6]: df['B'] = df['A'].astype('category')
In [7]: df.dtypes
Out[7]: 
A      object
B    category
dtype: object

When reading and writing through Pandas, only the fastparquet engine reads B back as categorical. Interestingly, it does so even when the file was written by pyarrow:

In [11]: df.to_parquet('pa.parquet', engine='pyarrow')
    ...: df.to_parquet('fp.parquet', engine='fastparquet')

In [12]: pd.read_parquet('pa.parquet', engine='pyarrow').dtypes
Out[12]: 
A    object
B    object
dtype: object

In [13]: pd.read_parquet('pa.parquet', engine='fastparquet').dtypes
Out[13]: 
A      object
B    category
dtype: object

In [14]: pd.read_parquet('fp.parquet', engine='pyarrow').dtypes
Out[14]: 
A    object
B    object
dtype: object

In [15]: pd.read_parquet('fp.parquet', engine='fastparquet').dtypes
Out[15]: 
A      object
B    category
dtype: object

I'm guessing that fastparquet reads the pandas_type field in the pandas metadata and sees "categorical".

In [16]: pq.ParquetFile('pa.parquet').read(use_pandas_metadata=True)
Out[16]: 
pyarrow.Table
A: string
B: string
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
            b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
            b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
            b' [{"name": "A", "field_name": "A", "pandas_type": "unicode", "nu'
            b'mpy_type": "object", "metadata": null}, {"name": "B", "field_nam'
            b'e": "B", "pandas_type": "categorical", "numpy_type": "int8", "me'
            b'tadata": {"num_categories": 3, "ordered": false}}, {"name": null'
            b', "field_name": "__index_level_0__", "pandas_type": "int64", "nu'
            b'mpy_type": "int64", "metadata": null}], "pandas_version": "0.22.'
            b'0"}'}
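The pandas metadata printed above is plain JSON, so the columns that were originally categorical can be listed directly from it. Below is a minimal sketch using only the standard library. The helper name `categorical_columns` is hypothetical, and `PANDAS_METADATA` is an abbreviated copy of the output above (the `column_indexes` entry is omitted), not something read from a file here.

```python
import json

# Abbreviated copy of the b'pandas' metadata value printed above.
PANDAS_METADATA = (
    b'{"index_columns": ["__index_level_0__"], "columns": ['
    b'{"name": "A", "field_name": "A", "pandas_type": "unicode", '
    b'"numpy_type": "object", "metadata": null}, '
    b'{"name": "B", "field_name": "B", "pandas_type": "categorical", '
    b'"numpy_type": "int8", '
    b'"metadata": {"num_categories": 3, "ordered": false}}, '
    b'{"name": null, "field_name": "__index_level_0__", '
    b'"pandas_type": "int64", "numpy_type": "int64", "metadata": null}], '
    b'"pandas_version": "0.22.0"}'
)


def categorical_columns(pandas_metadata):
    """Hypothetical helper: map each categorical column name to its 'ordered' flag."""
    meta = json.loads(pandas_metadata.decode("utf-8"))
    return {
        col["name"]: (col.get("metadata") or {}).get("ordered", False)
        for col in meta["columns"]
        if col["pandas_type"] == "categorical"
    }


print(categorical_columns(PANDAS_METADATA))  # {'B': False}
```

With a real file, the same bytes should be available as `table.schema.metadata[b'pandas']` on the table returned by the read above.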

It seems that the strings_to_categorical option of pa.Table.to_pandas() doesn't work in this situation either (maybe I'm using it wrong; also I'd prefer to only read as categorical the columns that were originally categorical in Pandas):

In [17]: table = pq.ParquetFile('pa.parquet').read(use_pandas_metadata=True)
    ...: table.to_pandas(strings_to_categorical=True)
    ...: 
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-17-42387e24e864> in <module>()
      1 table = pq.ParquetFile('pa.parquet').read(use_pandas_metadata=True)
----> 2 table.to_pandas(strings_to_categorical=True)

~/local/anaconda3/lib/python3.6/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.to_pandas (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:46331)()

~/local/anaconda3/lib/python3.6/site-packages/pyarrow/pandas_compat.py in table_to_blockmanager(options, table, memory_pool, nthreads, categoricals)
    526             )
    527 
--> 528     blocks = _table_to_blocks(options, block_table, nthreads, memory_pool)
    529 
    530     # Construct the row index

~/local/anaconda3/lib/python3.6/site-packages/pyarrow/pandas_compat.py in _table_to_blocks(options, block_table, nthreads, memory_pool)
    620 
    621     # Defined above
--> 622     return [_reconstruct_block(item) for item in result]
    623 
    624 

~/local/anaconda3/lib/python3.6/site-packages/pyarrow/pandas_compat.py in <listcomp>(.0)
    620 
    621     # Defined above
--> 622     return [_reconstruct_block(item) for item in result]
    623 
    624 

~/local/anaconda3/lib/python3.6/site-packages/pyarrow/pandas_compat.py in _reconstruct_block(item)
    434         cat = pd.Categorical.from_codes(block_arr,
    435                                         categories=item['dictionary'],
--> 436                                         ordered=item['ordered'])
    437         block = _int.make_block(cat, placement=placement,
    438                                 klass=_int.CategoricalBlock,

~/local/anaconda3/lib/python3.6/site-packages/pandas/core/categorical.py in from_codes(cls, codes, categories, ordered)
    619 
    620         if len(codes) and (codes.max() >= len(categories) or codes.min() < -1):
--> 621             raise ValueError("codes need to be between -1 and "
    622                              "len(categories)-1")
    623 

ValueError: codes need to be between -1 and len(categories)-1

It seems that the pd.DataFrame > pa.Table > pd.DataFrame conversion keeps categoricals as a dictionary Arrow type, but that the pa.Table > .parquet > pa.Table process loses the categoricals.

In [22]: table = pa.Table.from_pandas(df)

In [23]: table.schema
Out[23]: 
A: string
B: dictionary<values=string, indices=int8, ordered=0>
  dictionary: ["a", "b", "c"]
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
            b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
            b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
            b' [{"name": "A", "field_name": "A", "pandas_type": "unicode", "nu'
            b'mpy_type": "object", "metadata": null}, {"name": "B", "field_nam'
            b'e": "B", "pandas_type": "categorical", "numpy_type": "int8", "me'
            b'tadata": {"num_categories": 3, "ordered": false}}, {"name": null'
            b', "field_name": "__index_level_0__", "pandas_type": "int64", "nu'
            b'mpy_type": "int64", "metadata": null}], "pandas_version": "0.22.'
            b'0"}'}

In [24]: table.to_pandas().dtypes
Out[24]: 
A      object
B    category
dtype: object

In [25]: pq.write_table(table, 'table.parquet')

In [26]: pq.ParquetFile('table.parquet').read()
Out[26]: 
pyarrow.Table
A: string
B: string
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
            b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
            b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
            b' [{"name": "A", "field_name": "A", "pandas_type": "unicode", "nu'
            b'mpy_type": "object", "metadata": null}, {"name": "B", "field_nam'
            b'e": "B", "pandas_type": "categorical", "numpy_type": "int8", "me'
            b'tadata": {"num_categories": 3, "ordered": false}}, {"name": null'
            b', "field_name": "__index_level_0__", "pandas_type": "int64", "nu'
            b'mpy_type": "int64", "metadata": null}], "pandas_version": "0.22.'
            b'0"}'}
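Since the pandas metadata survives the Parquet round trip even though the dictionary type does not, one possible workaround is to re-cast the affected columns after reading. The following is only a sketch under stated assumptions: `restore_categoricals` is a hypothetical helper, `PANDAS_METADATA` is a trimmed-down stand-in for the metadata shown above, and `df_read` stands in for the frame pyarrow returns with B as plain strings.

```python
import json

import pandas as pd

# Trimmed-down stand-in for the b'pandas' schema metadata shown above.
PANDAS_METADATA = (
    b'{"columns": ['
    b'{"name": "A", "pandas_type": "unicode", "metadata": null}, '
    b'{"name": "B", "pandas_type": "categorical", '
    b'"metadata": {"num_categories": 3, "ordered": false}}]}'
)


def restore_categoricals(df, pandas_metadata):
    """Hypothetical helper: re-cast columns the metadata marks as categorical."""
    meta = json.loads(pandas_metadata.decode("utf-8"))
    out = df.copy()
    for col in meta["columns"]:
        name = col["name"]
        if col["pandas_type"] == "categorical" and name in out.columns:
            ordered = (col.get("metadata") or {}).get("ordered", False)
            # Categories are re-inferred from the values present; the metadata
            # shown above records only their count, not the categories themselves.
            out[name] = pd.Categorical(out[name], ordered=ordered)
    return out


# Stand-in for the frame read back by pyarrow, where B is no longer categorical.
df_read = pd.DataFrame({"A": ["a", "b", "c", "a"], "B": ["a", "b", "c", "a"]})
restored = restore_categoricals(df_read, PANDAS_METADATA)
print(restored.dtypes)
```

This only converts the columns that were categorical in the original frame, which is what I'd prefer over `strings_to_categorical`. It cannot recover unused categories or a custom category order, because those are not part of the JSON shown above.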

Using fastparquet works well at the moment, but:

  1. Some files written by fastparquet cannot be read by pyarrow. I get the trace
    In [3]: df = pf.read()
    ---------------------------------------------------------------------------
    ArrowIOError                              Traceback (most recent call last)
    <ipython-input-3-910db268c9c9> in <module>()
    ----> 1 df = pf.read()
    
    ~/local/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py in read(self, columns, nthreads, use_pandas_metadata)
        140             columns, use_pandas_metadata=use_pandas_metadata)
        141         return self.reader.read_all(column_indices=column_indices,
    --> 142                                     nthreads=nthreads)
        143 
        144     def scan_contents(self, columns=None, batch_size=65536):
    
    ~/local/anaconda3/lib/python3.6/site-packages/pyarrow/_parquet.pyx in pyarrow._parquet.ParquetReader.read_all (/arrow/python/build/temp.linux-x86_64-3.6/_parquet.cxx:12865)()
    
    ~/local/anaconda3/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:8345)()
    
    ArrowIOError: Unexpected end of stream.

    I'm working with restricted data, but I could try to reproduce this error with sample data.
  2. I'm trying to write a wrapper for the Parquet C++ library that would let Stata read and write Parquet files, so I'd need all my files to be readable by pyarrow.
  3. Fastparquet is single-threaded, so I suppose pyarrow could be faster when using multiple cores.

This was done using versions:

In [18]: pa.__version__
Out[18]: '0.8.0'

In [20]: fp.__version__
Out[20]: '0.1.4'

In [21]: pd.__version__
Out[21]: '0.22.0'
