Skip to content

ARROW-17964: [C++] Range data comparison for struct type may go out of bounds#14347

Closed
rtpsw wants to merge 2 commits into
apache:masterfrom
rtpsw:ARROW-17964
Closed

ARROW-17964: [C++] Range data comparison for struct type may go out of bounds#14347
rtpsw wants to merge 2 commits into
apache:masterfrom
rtpsw:ARROW-17964

Conversation

@rtpsw

@rtpsw rtpsw commented Oct 7, 2022

Copy link
Copy Markdown
Contributor

@github-actions

github-actions Bot commented Oct 7, 2022

Copy link
Copy Markdown

@lidavidm

lidavidm commented Oct 7, 2022

Copy link
Copy Markdown
Member

We should probably add a unit test for this

@rtpsw

rtpsw commented Oct 7, 2022

Copy link
Copy Markdown
Contributor Author

We should probably add a unit test for this

Agreed. I'm not sure how I'd like to test this just yet, so ideas are welcomed. At first, I thought I could somehow use AssertDatumsEqual, but it cannot be directly negated.

@rtpsw

rtpsw commented Oct 7, 2022

Copy link
Copy Markdown
Contributor Author

@lidavidm, are you aware of any example test case that checks for accessing an index out of bounds?

@lidavidm

Copy link
Copy Markdown
Member

No, although looking at it, the fix is a little weird - it means we're comparing data of two different types, or the data doesn't match its type?

Regardless, it seems something like this would suffice?

auto lhs = ArrayFromJSON(..., "[{"a": 2, "b": 3}]");
auto rhs = ArrayFromJSON(..., "[{"a": 2}]");
ASSERT_FALSE(lhs->Equals(*rhs));  // assuming this segfaults without this PR

@pitrou

pitrou commented Oct 10, 2022

Copy link
Copy Markdown
Member

Right, since the data types are supposed to match, this PR is only guarding against invalid data.
It you want to make sure data is valid, you should call Validate or ValidateFull before doing anything else.

@rtpsw

rtpsw commented Oct 10, 2022

Copy link
Copy Markdown
Contributor Author

No, although looking at it, the fix is a little weird - it means we're comparing data of two different types, or the data doesn't match its type?

I run into the issue that led to this PR when a unit test compared data of two different types, one is the expected and another is the unexpected. My quick impression was that there seemed to be sufficiently many test cases that compare this way, without trying to compare types or to validate first. I figured it would be easier to fix the comparison in one place than to fix many test cases.

Right, since the data types are supposed to match, this PR is only guarding against invalid data.
It you want to make sure data is valid, you should call Validate or ValidateFull before doing anything else.

I may be missing something, but I don't think validating would have fixed the issue, because I believe in the above case both expected and actual data were valid, since they both were obtained via JSON parsing; they were just of different types.

Regardless, it seems something like this would suffice? ... // assuming this segfaults without this PR

I didn't check your particular case yet, but I did run into a segfault in the case I described above. My vote would be to minimize segfaults as an indication of test failure, if only because such a failure would be less convenient to work with.

@lidavidm

Copy link
Copy Markdown
Member

I think it's OK to have this, I wonder why the comparison doesn't start with a type comparison though (which should avoid this class of issues)

@pitrou

pitrou commented Oct 10, 2022

Copy link
Copy Markdown
Member

It does start with a type comparison, it's also mentioned above:

class RangeDataEqualsImpl {
public:
// PRE-CONDITIONS:
// - the types are equal
// - the ranges are in bounds
RangeDataEqualsImpl(const EqualOptions& options, bool floating_approximate,

and you can see an example of type checking here:
bool CompareArrayRanges(const ArrayData& left, const ArrayData& right,
int64_t left_start_idx, int64_t left_end_idx,
int64_t right_start_idx, const EqualOptions& options,
bool floating_approximate) {
if (left.type->id() != right.type->id() ||
!TypeEquals(*left.type, *right.type, false /* check_metadata */)) {
return false;
}

@rtpsw

rtpsw commented Oct 10, 2022

Copy link
Copy Markdown
Contributor Author

I think it's OK to have this, I wonder why the comparison doesn't start with a type comparison though (which should avoid this class of issues)

AFAICS, the path of invocations seems to be Array::Equals -> ArrayEquals -> ArrayRangeEquals -> CompareArrayRanges -> TypeEqualsVisitor::Visit(const StructType &). This path does not go thorough TypeEqualsVisitor::VisitChildren, for example, that would have checked for equal types. Perhaps all visitors methods in TypeEqualsVisitor should start with a type comparison.

@rtpsw

rtpsw commented Oct 10, 2022

Copy link
Copy Markdown
Contributor Author

It does start with a type comparison

Right. As I just noted in crossing, I suspect the issue is with TypeEqualsVisitor which it uses.

@pitrou

pitrou commented Oct 10, 2022

Copy link
Copy Markdown
Member

Can you show a snippet that would show the issue?

@rtpsw

rtpsw commented Oct 10, 2022

Copy link
Copy Markdown
Contributor Author

Can you show a snippet that would show the issue?

Good chances I could. I'll need a bit of time to get to this, though.

@rtpsw

rtpsw commented Oct 13, 2022

Copy link
Copy Markdown
Contributor Author

Can you show a snippet that would show the issue?

Good chances I could. I'll need a bit of time to get to this, though.

I added a test that checks for this by comparing a badly structures array with a correctly structured one:

  1. The final check in the test covers what you asked for. The other checks appearing before it are for normal conditions.
  2. Without the change in RangeDataEqualsImpl, the final check fails due to not observing a failure.
  3. Without the change in PrintDiff (but with the change in RangeDataEqualsImpl), the final check leads to a segmentation fault.

@pitrou

pitrou commented Oct 13, 2022

Copy link
Copy Markdown
Member

Okay, so here is the problem: users shouldn't pass invalid data to Arrow APIs (except to Validate and ValidateFull, which are explicitly designed to handle such data). So it doesn't make sense to check for invalid data at the beginning of other functions; also, it can be quite costly (ValidateFull can typically be O(nrows * columns)).

(note: "invalid data" here is a badly structured array)

@rtpsw

rtpsw commented Oct 13, 2022

Copy link
Copy Markdown
Contributor Author

Okay, so here is the problem: users shouldn't pass invalid data to Arrow APIs (except to Validate and ValidateFull, which are explicitly designed to handle such data). So it doesn't make sense to check for invalid data at the beginning of other functions; also, it can be quite costly (ValidateFull can typically be O(nrows * columns)).

(note: "invalid data" here is a badly structured array)

This circles back to points we discussed. I can understand the requirement of passing valid data in a correct Arrow app, as well as in correct Arrow code, but less so during its development, where incorrect code frequently occurs. This PR aims to make (failure analysis during) development easier, given that its runtime cost is small. For the purpose of cost, I think the calls to ValidateFull in PrintDiff shouldn't count because they can be removed - I only give them for reproducibility without a segmentation fault.

@pitrou

pitrou commented Oct 13, 2022

Copy link
Copy Markdown
Member

But again, during development you can call Validate[Full] in your own code. It doesn't really make sense to randomly add checks in Arrow functions, IMHO.

@rtpsw

rtpsw commented Oct 13, 2022

Copy link
Copy Markdown
Contributor Author

But again, during development you can call Validate[Full] in your own code. It doesn't really make sense to randomly add checks in Arrow functions, IMHO.

Well, not really randomly, but I understand what you're saying. While I agree one can easily add validation calls to the code while developing, I still think it's not convenient because in many cases the segfault is not making it easy to determine which structure is invalid. Moreover, a segfault is a relatively good result; when luck betrays, the result might be a memory leak or a buffer overrun that would make analysis of the root cause much harder.

Having said that, since you seem to be firm in your opinion, I'll stop pushing for this PR. It's not high priority.

@rtpsw rtpsw closed this Oct 13, 2022
@rtpsw rtpsw deleted the ARROW-17964 branch October 13, 2022 18:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants