Skip to content

ARROW-6084: [Python] Support LargeList#4979

Closed
pitrou wants to merge 1 commit into
apache:masterfrom
pitrou:ARROW-6084-py-large-list
Closed

ARROW-6084: [Python] Support LargeList#4979
pitrou wants to merge 1 commit into
apache:masterfrom
pitrou:ARROW-6084-py-large-list

Conversation

@pitrou

@pitrou pitrou commented Aug 1, 2019

Copy link
Copy Markdown
Member

No description provided.

Comment thread python/pyarrow/tests/strategies.py Outdated

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kszucs Does this look right?

@pitrou pitrou force-pushed the ARROW-6084-py-large-list branch from 8ccd61f to b5ae30c Compare August 1, 2019 12:31

@jorisvandenbossche jorisvandenbossche left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me

Comment thread python/pyarrow/scalar.pxi Outdated

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not possible to have this subclass ListValue to reduce code duplication? (getitem, iter, as_py look the same)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that the C++ types are different (e.g. ListArray vs. LargeListArray), and those need to be compile-time constants for Cython.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code like this makes me wish for some kind of macro system in Cython.

@jorisvandenbossche

Copy link
Copy Markdown
Member

Again, while trying it out this branch, I noticed some oddities. But now they are certainly not related to the code in this PR, as they also hold for non-large ListType.

When creating an array from offsets and values in python, there is no validation of the offsets that it starts with 0 and ends with the length of the array (but is that required? the docs seem to indicate that: https://github.com/apache/arrow/blob/master/docs/source/format/Layout.rst#list-type ("The first value in the offsets array is 0, and the last element is the length of the values array.").
The array you get "seems" ok (the repr), but on conversion to python or flattened arrays, things go wrong:

In [61]: a = pa.ListArray.from_arrays([1,3,10], np.arange(5)) 

In [62]: a
Out[62]: 
<pyarrow.lib.ListArray object at 0x7fdd9c468678>
[
  [
    1,
    2
  ],
  [
    3,
    4
  ]
]

In [63]: a.flatten()
Out[63]: 
<pyarrow.lib.Int64Array object at 0x7fdd9cbfe9e8>
[
  0,   # <--- includes the 0
  1,
  2,
  3,
  4
]

In [64]: a.to_pylist()
Out[64]: [[1, 2], [3, 4, 1121, 1, 64, 93969433636432, 13]]  # <--includes more elements as garbage

I understand that in C++ the main constructors are not safe, and as the caller you need to ensure that the data is correct or call a safe (slower) constructor. But do we want to use the unsafe / fast constructors without validation in Python as default as well?

@jorisvandenbossche

Copy link
Copy Markdown
Member

Ah, I see that in Python a validation function is exposed, and this actually raises:

In [65]: a.validate()                                                                                                                                                                                              
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
<ipython-input-65-b377de7e0b22> in <module>
----> 1 a.validate()

~/scipy/repos/arrow/python/pyarrow/array.pxi in pyarrow.lib.Array.validate()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Final offset invariant not equal to values length: 10!=5

Do we want to call that by default in the creation method?
A quick search seems to indicate that pa.Array.from_buffers does this, but other from_arrays method don't seem to explicitly do this. But eg in DictionaryArray.from_arrays, if you pass values that is too short for the max of the indices, you do get an error.

@pitrou

pitrou commented Aug 5, 2019

Copy link
Copy Markdown
Member Author

On the C++ side indeed, calling validation for each constructor would be expensive. We can adopt a different strategy for Python. Can you open a JIRA about that?

@jorisvandenbossche

Copy link
Copy Markdown
Member

@pitrou pitrou force-pushed the ARROW-6084-py-large-list branch 2 times, most recently from e87eef5 to 3c38574 Compare August 5, 2019 17:34
@pitrou

pitrou commented Aug 6, 2019

Copy link
Copy Markdown
Member Author

@wesm Do you want to take a look here or can I merge as-is? (assuming the R AppVeyor failure isn't related)

@pitrou pitrou force-pushed the ARROW-6084-py-large-list branch from 3c38574 to 4266ea2 Compare August 6, 2019 12:20
@pitrou

pitrou commented Aug 6, 2019

Copy link
Copy Markdown
Member Author

Rebased.

@codecov-io

Copy link
Copy Markdown

Codecov Report

Merging #4979 into master will increase coverage by 1.6%.
The diff coverage is 87.4%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4979      +/-   ##
==========================================
+ Coverage   87.57%   89.17%    +1.6%     
==========================================
  Files        1005      727     -278     
  Lines      143560   103008   -40552     
  Branches     1418        0    -1418     
==========================================
- Hits       125720    91862   -33858     
+ Misses      17478    11146    -6332     
+ Partials      362        0     -362
Impacted Files Coverage Δ
python/pyarrow/__init__.py 67.04% <ø> (ø) ⬆️
python/pyarrow/tests/test_compute.py 100% <ø> (ø) ⬆️
python/pyarrow/lib.pyx 100% <100%> (ø) ⬆️
cpp/src/arrow/python/python_to_arrow.cc 91.29% <100%> (+0.07%) ⬆️
python/pyarrow/tests/test_convert_builtin.py 97.24% <100%> (+0.03%) ⬆️
python/pyarrow/public-api.pxi 59.67% <100%> (+0.43%) ⬆️
python/pyarrow/types.py 97.84% <100%> (+0.04%) ⬆️
python/pyarrow/tests/test_scalars.py 100% <100%> (ø) ⬆️
python/pyarrow/tests/test_types.py 97.11% <100%> (+0.11%) ⬆️
python/pyarrow/tests/test_array.py 96.33% <100%> (+0.03%) ⬆️
... and 285 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 42f4f34...4266ea2. Read the comment docs.

@wesm

wesm commented Aug 6, 2019

Copy link
Copy Markdown
Member

Taking a look

@wesm wesm left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Comment thread python/pyarrow/scalar.pxi Outdated

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code like this makes me wish for some kind of macro system in Cython.

@wesm wesm closed this in 2774cfb Aug 6, 2019
@pitrou pitrou deleted the ARROW-6084-py-large-list branch August 6, 2019 18:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants