Skip to content

ARROW-10924: [C++] Validate temporal data in ValidateArrayFull#12014

Closed
JabariBooker wants to merge 21 commits into
apache:masterfrom
JabariBooker:ARROW-10924
Closed

ARROW-10924: [C++] Validate temporal data in ValidateArrayFull#12014
JabariBooker wants to merge 21 commits into
apache:masterfrom
JabariBooker:ARROW-10924

Conversation

@JabariBooker

Copy link
Copy Markdown
Contributor

Validate that temporal data in arrays meet the restrictions of their data type specifications.

@github-actions

Copy link
Copy Markdown

@github-actions

Copy link
Copy Markdown

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@lidavidm lidavidm left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

There are quite a few failing tests. We'll have to go through each of them individually to decide what to do about them.

You will want to lint and format the code.

Comment thread cpp/src/arrow/array/validate.cc Outdated
Comment thread cpp/src/arrow/array/validate.cc Outdated
Comment thread cpp/src/arrow/array/validate.cc Outdated

@lidavidm lidavidm left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates.

You'll want to lint and format again (see the CI failure: https://github.com/apache/arrow/runs/4612596806?check_suite_focus=true)

Looking at test failures:

Comment thread cpp/src/arrow/array/validate.cc Outdated
Comment thread cpp/src/arrow/array/validate.cc Outdated
Comment thread cpp/src/arrow/array/validate.cc
Comment thread cpp/src/arrow/array/validate.cc Outdated
Comment thread cpp/src/arrow/array/validate.cc Outdated
@edponce

edponce commented Dec 29, 2021

Copy link
Copy Markdown
Contributor

@JabariBooker Thanks for working on this! I left some minor comments.
Also, we are missing validation for the following temporal types: Date32Type and TimestampType.

@pitrou

pitrou commented Dec 30, 2021

Copy link
Copy Markdown
Member

@edponce There are no limits on Date32Type and TimestampType AFAICT.

@edponce

edponce commented Jan 6, 2022

Copy link
Copy Markdown
Contributor

@pitrou So there is a limit for Date64Type but not for Date32Type? I now understand that TimestampType does not have limits.

@lidavidm

lidavidm commented Jan 6, 2022

Copy link
Copy Markdown
Member

Date32Type's unit is days and so there's nothing to check, but Date64Type's unit is milliseconds, hence it doesn't have a limit per se, but it must be a multiple of 86400000 (=1 day in milliseconds).

@lidavidm

Copy link
Copy Markdown
Member

FYI, that test failure in GroupBy.MinMaxTypes is here:

TEST(GroupBy, MinMaxTypes) {

The test uses one set of data to go through several data types, so you could just update the hardcoded data there too.

@lidavidm lidavidm left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, LGTM. I left some nits (we can check DataType::id() instead of the string name). And thanks for going through all the function tests; we should perhaps consider removing Date64 from the list of types to check in order to just split off those tests separately given the restrictions on value. (EDIT: not in this PR of course.)

SCOPED_TRACE(ty->ToString());
auto in_schema = schema({field("argument0", ty), field("key", int64())});
auto table =
TableFromJSON(in_schema, (ty->name() == "date64") ? date64_table : default_table);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can check ty->id() == Type::DATE64

[{"min": 3, "max": 5}, 4],
[{"min": 1, "max": 4}, null]
])"),
(ty->name() == "date64") ? date64_expected : default_expected),

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: ditto here

// Sliced
CheckUnique(ArrayFromJSON(type, "[1, 2, null, 3, 2, null]")->Slice(1, 4),
ArrayFromJSON(type, "[2, null, 3]"));
if (type->name() == "date64") {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and here and so on (sorry)

Comment thread cpp/src/arrow/testing/random.cc Outdated
}

case Type::type::TIME32: {
TimeUnit::type unit = std::dynamic_pointer_cast<Time32Type>(field.type())->unit();

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we have internal::checked_[pointer_]cast which will use dynamic or static cast depending on debug/release build

@JabariBooker

Copy link
Copy Markdown
Contributor Author

@lidavidm @edponce @pitrou There are still failing integration tests (all are also failing due to invalid values for date64). The data involved in these tests seem to be generated by some mechanism I'm unaware of.

Comment thread cpp/src/arrow/array/array_test.cc
options.GenerateData(buffers[1]->mutable_data(), size);
if (std::is_same<ArrowType, Date64Type>::value) {
GenerateFullDayMillisNoNan(buffers[1]->mutable_data(), size);
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be an if-else. buffers[1]->mutable_data() is modified twice if Date64Type.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a more suitable place for this type-specific generation of numeric arrays, is in GenerateTypeDataNoNan() from GenerateOptions. Make GenerateFullDayMillisNoNan as a private member to GenerateOptions and perform the if-else check in GenerateTypedDataNoNan().

@JabariBooker JabariBooker Jan 28, 2022

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is intentional. Since we need to generate multiples of 86400000 for Date64Type, random numbers are generated in GenerateData and ultimately multiplied by 86400000 in GenerateFullDayMillisNoNan. The behavior of GenerateFullDayMillisNoNan is separated from GenerateOptions since this class doesn't know the arrow type, only the underlying C type. However, behavior of GenerateFullDayMillisNoNan could be moved into the above if statement if needed.

Comment thread cpp/src/arrow/testing/random.cc
@pitrou

pitrou commented Jan 31, 2022

Copy link
Copy Markdown
Member

@JabariBooker Integration data is generated in Python by Archery here: https://github.com/apache/arrow/blob/master/dev/archery/archery/integration/datagen.py#L196

You can find more info about the JSON integration format here, and how to run the integration tests locally, here: https://arrow.apache.org/docs/format/Integration.html

@lidavidm lidavidm left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, just a minor nit on the datagen (which we probably won't run into, but…)

return super().generate_range(size, lower, upper, name)

full_day_millis = 1000 * 60 * 60 * 24
lower //= full_day_millis

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python appears to round towards -Inf so a lower negative bound that is not exactly even will get rounded to an out-of-bound value

@lidavidm lidavidm left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I think we should be good here. This turned out to be quite a bit more involved than it looked at first.

@pitrou @edponce any other comments?

Comment thread dev/archery/archery/integration/tester_cpp.py
@lidavidm lidavidm closed this in c715beb Feb 2, 2022
@ursabot

ursabot commented Feb 2, 2022

Copy link
Copy Markdown

Benchmark runs are scheduled for baseline = 9b53235 and contender = c715beb. c715beb is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.13% ⬆️0.0%] ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants