
pyarrow / pandas support for tensors (multi-dimensional arrays) #4802

Description

@pwais

pyarrow appears to have a tensor type: https://arrow.apache.org/docs/python/generated/pyarrow.Tensor.html#pyarrow.Tensor

But multi-dimensional numpy arrays stored as cell values in a pandas DataFrame column do not get converted to Tensors (or to anything else) by pa.Table.from_pandas:

if __name__ == '__main__':
  import numpy as np
  import pandas as pd
  a = np.array([
                [[1, 2],[1, 2],[1, 2]],
                [[1, 2],[1, 2],[1, 2]],
                [[1, 2],[1, 2],[1, 2]]])
  df = pd.DataFrame({'x':range(3), 'y':['a','b','b'], 'z':[a, a, a]})
  print(df)

  import pyarrow as pa
  import pyarrow.parquet as pq
  print(pa.__version__)

  table = pa.Table.from_pandas(df)
  pq.write_to_dataset(
        table,
        '/tmp/rows')

results in:

# python3 yay.py 
   x  y                                                  z
0  0  a  [[[1, 2], [1, 2], [1, 2]], [[1, 2], [1, 2], [1...
1  1  b  [[[1, 2], [1, 2], [1, 2]], [[1, 2], [1, 2], [1...
2  2  b  [[[1, 2], [1, 2], [1, 2]], [[1, 2], [1, 2], [1...
0.14.0
Traceback (most recent call last):
  File "yay.py", line 15, in <module>
    table = pa.Table.from_pandas(df)
  File "pyarrow/table.pxi", line 1174, in pyarrow.lib.Table.from_pandas
  File "/usr/local/lib/python3.6/dist-packages/pyarrow/pandas_compat.py", line 496, in dataframe_to_arrays
    for c, f in zip(columns_to_convert, convert_fields)]
  File "/usr/local/lib/python3.6/dist-packages/pyarrow/pandas_compat.py", line 496, in <listcomp>
    for c, f in zip(columns_to_convert, convert_fields)]
  File "/usr/local/lib/python3.6/dist-packages/pyarrow/pandas_compat.py", line 487, in convert_column
    raise e
  File "/usr/local/lib/python3.6/dist-packages/pyarrow/pandas_compat.py", line 481, in convert_column
    result = pa.array(col, type=type_, from_pandas=True, safe=safe)
  File "pyarrow/array.pxi", line 191, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 78, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: ('Can only convert 1-dimensional array values', 'Conversion failed for column z with type object')

Might we be able to simply patch pandas_compat.py to do Tensor conversions?
