
[Python] Map data types doesn't work from Arrow to Parquet #25855

Description

@asfimport

Hi,

I'm having problems using the 'map' data type across Arrow, Parquet, and pandas.

I'm able to convert a pandas data frame to Arrow with a map data type.

When I write Arrow to Parquet, it seems to work, but I'm not sure if the data type is written correctly.

When I read the Parquet file back into Arrow, it fails with an error saying that reading lists of structs is not supported. It seems that a map is stored as a list of structs.

There are two problems here:

  1. The map data type doesn't convert from Arrow -> Pandas. Fixed in ARROW-10151.

  2. The map data type doesn't get written to or read back from Parquet correctly (Arrow -> Parquet -> Arrow).

Questions:

1. Am I doing something wrong? Is there a way to get these to work?

2. If these are unsupported features, will they be fixed in a future version? Do you have plans or an ETA?

The following code example (followed by its output) should demonstrate the issues.

I'm using Arrow 1.0.0 and Pandas 1.0.5.

Thanks!

Mayur

    $ cat arrowtest.py
    
    import pyarrow as pa
    import pandas as pd
    import pyarrow.parquet as pq
    import traceback as tb
    import io
    
    print(f'PyArrow Version = {pa.__version__}')
    print(f'Pandas Version = {pd.__version__}')
    
    df1 = pd.DataFrame({'a': [[('b', '2')]]})
    print(f'df1')
    print(f'{df1}')
    
    print(f'Pandas -> Arrow')
    try:
        t1 = pa.Table.from_pandas(df1, schema=pa.schema([pa.field('a', pa.map_(pa.string(), pa.string()))]))
        print('PASSED')
        print(t1)
    except:
        print(f'FAILED')
        tb.print_exc()
    
    print(f'Arrow -> Pandas')
    try:
        t1.to_pandas()
        print('PASSED')
    except:
        print(f'FAILED')
        tb.print_exc()
    
    print(f'Arrow -> Parquet')
    fh = io.BytesIO()
    try:
        pq.write_table(t1, fh)
        print('PASSED')
    except:
        print('FAILED')
        tb.print_exc()
        
    print(f'Parquet -> Arrow')
    try:
        t2 = pq.read_table(source=fh)
        print('PASSED')
        print(t2)
    except:
        print('FAILED')
        tb.print_exc()
    $ python3.6 arrowtest.py
    PyArrow Version = 1.0.0 
    Pandas Version = 1.0.5 
    df1
              a
    0  [(b, 2)]
    
    Pandas -> Arrow 
    PASSED 
    pyarrow.Table 
    a: map<string, string>
      child 0, entries: struct<key: string not null, value: string> not null
          child 0, key: string not null
          child 1, value: string
     
    Arrow -> Pandas 
    FAILED 
    Traceback (most recent call last):
      File "arrowtest.py", line 26, in <module>
        t1.to_pandas()
      File "pyarrow/array.pxi", line 715, in pyarrow.lib._PandasConvertible.to_pandas
      File "pyarrow/table.pxi", line 1565, in pyarrow.lib.Table._to_pandas
      File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/pandas_compat.py", line 779, in table_to_blockmanager
        blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
      File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/pandas_compat.py", line 1115, in _table_to_blocks
        list(extension_columns.keys()))
      File "pyarrow/table.pxi", line 1028, in pyarrow.lib.table_to_blocks
      File "pyarrow/error.pxi", line 105, in pyarrow.lib.check_status
    pyarrow.lib.ArrowNotImplementedError: No known equivalent Pandas block for Arrow data of type map<string, string> is known.
     
    Arrow -> Parquet 
    PASSED 
     
    Parquet -> Arrow 
    FAILED 
    Traceback (most recent call last):
      File "arrowtest.py", line 43, in <module>
        t2 = pq.read_table(source=fh)
      File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/parquet.py", line 1586, in read_table
        use_pandas_metadata=use_pandas_metadata)
      File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/parquet.py", line 1474, in read
        use_threads=use_threads
      File "pyarrow/_dataset.pyx", line 399, in pyarrow._dataset.Dataset.to_table
      File "pyarrow/_dataset.pyx", line 1994, in pyarrow._dataset.Scanner.to_table
      File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
      File "pyarrow/error.pxi", line 105, in pyarrow.lib.check_status
    pyarrow.lib.ArrowNotImplementedError: Reading lists of structs from Parquet files not yet supported: key_value: list<key_value: struct<key: string not null, value: string> not null> not null

Update: the Arrow -> Pandas conversion has since been done, but Parquet support has not.

Reporter: Mayur Srivastava / @mayuropensource

Related issues:

Note: This issue was originally created as ARROW-9812. Please see the migration documentation for further details.
