ARROW-17964: [C++] Range data comparison for struct type may go out of bounds by rtpsw · Pull Request #14347 · apache/arrow

rtpsw · 2022-10-07T19:13:10Z

See https://issues.apache.org/jira/browse/ARROW-17964

github-actions · 2022-10-07T19:13:27Z

https://issues.apache.org/jira/browse/ARROW-17964

lidavidm · 2022-10-07T19:24:22Z

We should probably add a unit test for this

rtpsw · 2022-10-07T21:00:10Z

We should probably add a unit test for this

Agreed. I'm not sure how I'd like to test this just yet, so ideas are welcome. At first, I thought I could somehow use AssertDatumsEqual, but it cannot be directly negated.

rtpsw · 2022-10-07T21:28:00Z

@lidavidm, are you aware of any example test case that checks for accessing an index out of bounds?

lidavidm · 2022-10-10T12:06:09Z

No, although looking at it, the fix is a little weird - it means we're comparing data of two different types, or the data doesn't match its type?

Regardless, it seems something like this would suffice?

auto lhs = ArrayFromJSON(..., R"([{"a": 2, "b": 3}])");
auto rhs = ArrayFromJSON(..., R"([{"a": 2}])");
ASSERT_FALSE(lhs->Equals(*rhs));  // assuming this segfaults without this PR

pitrou · 2022-10-10T13:37:46Z

Right, since the data types are supposed to match, this PR is only guarding against invalid data.
If you want to make sure data is valid, you should call Validate or ValidateFull before doing anything else.

rtpsw · 2022-10-10T13:53:11Z

No, although looking at it, the fix is a little weird - it means we're comparing data of two different types, or the data doesn't match its type?

I ran into the issue that led to this PR when a unit test compared data of two different types, one expected and the other unexpected. My quick impression was that many test cases compare this way, without comparing types or validating first. I figured it would be easier to fix the comparison in one place than to fix many test cases.

Right, since the data types are supposed to match, this PR is only guarding against invalid data.
If you want to make sure data is valid, you should call Validate or ValidateFull before doing anything else.

I may be missing something, but I don't think validating would have fixed the issue, because I believe in the above case both expected and actual data were valid, since they both were obtained via JSON parsing; they were just of different types.

Regardless, it seems something like this would suffice? ... // assuming this segfaults without this PR

I didn't check your particular case yet, but I did run into a segfault in the case I described above. My vote would be to minimize segfaults as an indication of test failure, if only because such a failure would be less convenient to work with.
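The failure mode discussed above can be illustrated with a small standalone model. This is only a sketch under assumptions: `ToyStructArray`, `UnguardedEquals`, and `GuardedEquals` are hypothetical names, not Arrow classes or functions, and the model assumes the out-of-bounds access comes from one side having fewer child columns than the other.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy stand-in for a struct array: one vector of int64 values per field.
// Hypothetical type; it is NOT arrow::ArrayData.
struct ToyStructArray {
  std::vector<std::vector<int64_t>> children;
};

// Unguarded comparison: trusts that `right` has as many children as `left`.
// If `right` has fewer, right.children[i] reads out of bounds (undefined
// behavior). Shown for contrast only; never call it with mismatched inputs.
inline bool UnguardedEquals(const ToyStructArray& left,
                            const ToyStructArray& right) {
  for (std::size_t i = 0; i < left.children.size(); ++i) {
    if (left.children[i] != right.children[i]) return false;
  }
  return true;
}

// Guarded comparison: a child-count mismatch is reported as "not equal"
// before any child is indexed.
inline bool GuardedEquals(const ToyStructArray& left,
                          const ToyStructArray& right) {
  if (left.children.size() != right.children.size()) return false;
  for (std::size_t i = 0; i < left.children.size(); ++i) {
    if (left.children[i] != right.children[i]) return false;
  }
  return true;
}
```

With a two-field array on the left and a one-field array on the right, `GuardedEquals` returns false, whereas `UnguardedEquals` would index past the end of the right-hand children.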

lidavidm · 2022-10-10T14:12:24Z

I think it's OK to have this, I wonder why the comparison doesn't start with a type comparison though (which should avoid this class of issues)

pitrou · 2022-10-10T14:36:06Z

It does start with a type comparison, it's also mentioned above:

arrow/cpp/src/arrow/compare.cc

Lines 164 to 169 in d4190cc

    
    class RangeDataEqualsImpl {
     public:
      // PRE-CONDITIONS:
      // - the types are equal
      // - the ranges are in bounds
      RangeDataEqualsImpl(const EqualOptions& options, bool floating_approximate,

and you can see an example of type checking here:

arrow/cpp/src/arrow/compare.cc

Lines 547 to 554 in d4190cc

    
    bool CompareArrayRanges(const ArrayData& left, const ArrayData& right,
                            int64_t left_start_idx, int64_t left_end_idx,
                            int64_t right_start_idx, const EqualOptions& options,
                            bool floating_approximate) {
      if (left.type->id() != right.type->id() ||
          !TypeEquals(*left.type, *right.type, false /* check_metadata */)) {
        return false;
      }
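To make the "types are equal" precondition concrete, here is a rough standalone sketch of a recursive type-equality check that rejects struct types with different field counts. `ToyType`, `ToyInt64`, `ToyStruct`, and `ToyTypeEquals` are hypothetical names invented for illustration; this is not how Arrow's `TypeEquals` is implemented, and field names and metadata are ignored for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy type descriptor (hypothetical; not arrow::DataType).
struct ToyType {
  std::string id;                                // e.g. "int64" or "struct"
  std::vector<std::shared_ptr<ToyType>> fields;  // child types of a "struct"
};

inline std::shared_ptr<ToyType> ToyInt64() {
  return std::make_shared<ToyType>(ToyType{"int64", {}});
}

inline std::shared_ptr<ToyType> ToyStruct(
    std::vector<std::shared_ptr<ToyType>> fields) {
  return std::make_shared<ToyType>(ToyType{"struct", std::move(fields)});
}

// Recursive equality: same id, same number of fields, and pairwise-equal
// fields. The size check runs before any field is indexed.
inline bool ToyTypeEquals(const ToyType& a, const ToyType& b) {
  if (a.id != b.id || a.fields.size() != b.fields.size()) return false;
  for (std::size_t i = 0; i < a.fields.size(); ++i) {
    if (!ToyTypeEquals(*a.fields[i], *b.fields[i])) return false;
  }
  return true;
}
```

A comparison routine that calls such a check first can return false for a two-field struct versus a one-field struct without touching any data buffers.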

rtpsw · 2022-10-10T14:37:12Z

I think it's OK to have this, I wonder why the comparison doesn't start with a type comparison though (which should avoid this class of issues)

AFAICS, the path of invocations seems to be Array::Equals -> ArrayEquals -> ArrayRangeEquals -> CompareArrayRanges -> TypeEqualsVisitor::Visit(const StructType &). This path does not go through TypeEqualsVisitor::VisitChildren, for example, which would have checked for equal types. Perhaps all visitor methods in TypeEqualsVisitor should start with a type comparison.

rtpsw · 2022-10-10T14:41:00Z

It does start with a type comparison

Right. As I just noted in a comment that crossed yours, I suspect the issue is with TypeEqualsVisitor, which it uses.

pitrou · 2022-10-10T14:43:07Z

Can you show a snippet that would show the issue?

rtpsw · 2022-10-10T14:44:20Z

Can you show a snippet that would show the issue?

There's a good chance I can. I'll need a bit of time to get to this, though.

rtpsw · 2022-10-13T13:20:20Z

Can you show a snippet that would show the issue?

There's a good chance I can. I'll need a bit of time to get to this, though.

I added a test that checks for this by comparing a badly structured array with a correctly structured one:

The final check in the test covers what you asked for. The other checks appearing before it are for normal conditions.
Without the change in RangeDataEqualsImpl, the final check fails due to not observing a failure.
Without the change in PrintDiff (but with the change in RangeDataEqualsImpl), the final check leads to a segmentation fault.

pitrou · 2022-10-13T13:41:23Z

Okay, so here is the problem: users shouldn't pass invalid data to Arrow APIs (except to Validate and ValidateFull, which are explicitly designed to handle such data). So it doesn't make sense to check for invalid data at the beginning of other functions; also, it can be quite costly (ValidateFull can typically be O(nrows * columns)).

(note: "invalid data" here is a badly structured array)
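The cost argument can be pictured with a toy model of cheap versus full validation. This is a sketch under assumptions, not Arrow's implementation: `ToyStringColumn`, `ToyValidate`, and `ToyValidateFull` are hypothetical names, and the model assumes the cheap check only inspects sizes and endpoints while the full check scans every row.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Toy variable-length string column (hypothetical; not an Arrow class).
struct ToyStringColumn {
  int64_t length;                // declared number of rows
  std::vector<int32_t> offsets;  // expected to hold length + 1 entries
  std::string data;              // concatenated values
};

// Cheap check, O(1): looks only at sizes and the first and last offsets.
inline bool ToyValidate(const ToyStringColumn& c) {
  if (c.length < 0) return false;
  if (c.offsets.size() != static_cast<std::size_t>(c.length) + 1) return false;
  if (c.offsets.front() < 0 || c.offsets.back() < 0) return false;
  return static_cast<std::size_t>(c.offsets.back()) <= c.data.size();
}

// Full check, O(rows): additionally scans every offset, so the cost grows
// with the number of rows (and, across a table, with the number of columns).
inline bool ToyValidateFull(const ToyStringColumn& c) {
  if (!ToyValidate(c)) return false;
  for (std::size_t i = 1; i < c.offsets.size(); ++i) {
    if (c.offsets[i] < c.offsets[i - 1]) return false;  // must not decrease
  }
  return true;
}
```

A column whose middle offset overshoots the data passes the cheap check but fails the full one, which is the kind of defect only a per-row scan can find.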

rtpsw · 2022-10-13T13:58:14Z

Okay, so here is the problem: users shouldn't pass invalid data to Arrow APIs (except to Validate and ValidateFull, which are explicitly designed to handle such data). So it doesn't make sense to check for invalid data at the beginning of other functions; also, it can be quite costly (ValidateFull can typically be O(nrows * columns)).

(note: "invalid data" here is a badly structured array)

This circles back to points we discussed. I can understand the requirement of passing valid data in a correct Arrow app, as well as in correct Arrow code, but less so during development, where incorrect code frequently occurs. This PR aims to make failure analysis during development easier, given that its runtime cost is small. As for cost, I think the calls to ValidateFull in PrintDiff shouldn't count because they can be removed; I only included them for reproducibility without a segmentation fault.

pitrou · 2022-10-13T14:44:02Z

But again, during development you can call Validate[Full] in your own code. It doesn't really make sense to randomly add checks in Arrow functions, IMHO.

rtpsw · 2022-10-13T15:56:40Z

But again, during development you can call Validate[Full] in your own code. It doesn't really make sense to randomly add checks in Arrow functions, IMHO.

Well, not really randomly, but I understand what you're saying. While I agree one can easily add validation calls to the code while developing, I still think it's not convenient, because in many cases the segfault does not make it easy to determine which structure is invalid. Moreover, a segfault is a relatively good result; with worse luck, the result might be a memory leak or a buffer overrun that would make analysis of the root cause much harder.

Having said that, since you seem to be firm in your opinion, I'll stop pushing for this PR. It's not high priority.

ARROW-17964: [C++] Range data comparison for struct type may go out of bounds

d4190cc

github-actions Bot added the Component: C++ label Oct 7, 2022

rtpsw mentioned this pull request Oct 8, 2022

ARROW-17642: [C++] Add ordered aggregation #14352

Closed

pitrou mentioned this pull request Oct 12, 2022

ARROW-18004: [C++] ExecBatch conversion to RecordBatch may go out of bounds #14386

Merged

add tests

62fdbe4

rtpsw closed this Oct 13, 2022

rtpsw deleted the ARROW-17964 branch October 13, 2022 18:31


3 participants
