ARROW-15251: [C++] Temporal floor/ceil/round handle ambiguous/nonexistent local time#12528
Conversation
|
|
354bef6 to
2e018c5
Compare
|
@rok could you check this?, I tested different |
|
A somewhat contrived example that currently gives a nonexistent error with rounding (it requires an atypical |
rok
left a comment
There was a problem hiding this comment.
Thanks for doing this @AlvinJ15! This looks pretty complete already. You can find an example test for nonexistent and ambiguous here: https://howardhinnant.github.io/date/tz.html#nonexistent_local_time
I'll do another pass tonight or tomorrow.
There was a problem hiding this comment.
I'm not sure if AssumeTimezoneOptions::Ambiguous and RoundTemporalOptions::Ambiguous would have the same options long-term (same for nonexistent). For now this change seems like the way to go, I'm just wondering if the name compute::Ambiguous should maybe be compute::AmbiguousTime (and compute::NonexistentTime?
There was a problem hiding this comment.
I take the suggestion and changed compute::Ambiguous to compute::AmbiguousTime and compute::Nonexistent to compute::NonexistentTime
Are you saying that |
2e018c5 to
11c1c52
Compare
|
@rok comments solved, the |
c6cb7e6 to
acdf872
Compare
0c31bd1 to
b16eeed
Compare
rok
left a comment
There was a problem hiding this comment.
Some comments. Looks good overall!
We need a review from a commiter as well @jorisvandenbossche @pitrou
There was a problem hiding this comment.
Perhaps AssumeTimezone kernel could reuse this now that you have it nicely factored out?
There was a problem hiding this comment.
Would it make sense to have arrow_vendored::date::choose as a template parameter rather than passing options every time the kernel is called? (I'm not certain just asking)
Also note that we could probably do more templating for the ceil/floor/round kernels, but that's out of scope here.
|
Can you explain the motivation for this functionality? |
The issue is that rounding is done on local time not UTC. So if the rounded-to moment does not exist in local time rounding will fail and we need to handle it at that point. |
|
Right... but as I said this usually means the original timestamp was already invalid, no? So instead of catching errors, would it be more/less useful to have a function purely to fix invalid timestamps? |
|
Original timestamp can be valid and rounded-to not. Let's take an example from date.h docs: If we start with a valid |
|
Hmm, I see. While ceiling to 2h30min sounds exotic, this is a valid use case. |
|
Still, this example doesn't make sense to me: const char* times = R"(["2018-10-28 01:20:00"])";
const char* times_earliest = R"(["2018-10-28 00:30:00"])";
const char* times_latest = R"(["2018-10-28 01:30:00"])";The only correct answer here is |
I think you can achieve this with less exotic intervals too.
Indeed. So choice should only be raise or latest? |
|
Taken more abstractly, the contract is the following:
So it should be possible to implement the expected semantics without exposing any additional options to the user, possibly by examining the two |
Currently we only raise and we want to control raise vs earliest/latest. That would mean exposing additional option I think? Or are you proposing to not raise? I think that's a valid for nonexistent, but I'm not sure about ambiguous. E.g. ceil(t) falls to exact moment of DST switch and could return t or t + dst_offset. We can take the same approach here and see if users eventually complain :). Here's an ambiguous example from date.h. |
Rok already mentioned it, but while it's true that non-existent times from rounding are a bit exotic, the ambiguous is certainly not. To give a concrete example, assume the local time "2021-10-31 02:25:00" in Europe (during a DST switch) and rounding that to the hour: But indeed, also in this case we can know that "00:00::00 UTC" is closer to the original timestamp than "01:00:00 UTC" (since the original timestamp in UTC was "00:25:00 UTC"). That adds some more logic to this kernel, but this would actually make those round kernels more useful!
If there is no ambiguity left (eg as in the example above), I think we should not raise by default. But it might be that for some cases it's still better to raise by default. For example in the case of "non-existent" times, we are actually changing the resulting timestamp, and thus that also means it will not necessarily "follow" the rounding |
By definition of ceil(), it should return the smallest applicable value, so there is no ambiguity.
That's true, though it only seems to trigger for "unusual" roundings. So we may want to add an option for non-existent timestamps, but it sounds less important than getting ambiguous timestamps right (which shouldn't require an option). |
We could catch these cases and implement logic to return correct multiple rounding. Then we wouldn't need any options and would never have to raise. I agree with the other conclusions. |
|
@AlvinJ15 any progress on this? Can I help somehow? |
|
Closing because it has been untouched for a while, in case it's still relevant feel free to reopen and move it forward 👍 |
fff7bd4 to
a48f69d
Compare
a48f69d to
c027b76
Compare
f77fd93 to
107c413
Compare
Tweaking nonexistent/ambiguous rounding Moving nonexistent/ambiguous logic to AssumeTimezone Revert AssumeTimezoneOptions::Nonexistent changes Fixing compiler warnings Fixing ceil/round issues Apply suggestions from code review Review feedback Review feedback Changes to ceil/floor, more tests Refactoring refactoring Review feedback review feedback Review feedback adding python tests adding ambiguous round test python Update cpp/src/arrow/compute/kernels/scalar_temporal_test.cc change nonexistent/ambiguous behaviour Add preserve_wall_time_order flag
107c413 to
35cab06
Compare
|
Thank you for your contribution. Unfortunately, this pull request has been marked as stale because it has had no activity in the past 365 days. Please remove the stale label or comment below, or this PR will be closed in 14 days. Feel free to re-open this if it has been closed in error. If you do not have repository permissions to reopen the PR, please tag a maintainer. |
| } | ||
|
|
||
| template <typename Duration> | ||
| Duration ConvertLocalToSys(Duration t, Status* st) const { |
There was a problem hiding this comment.
Perhaps consider bringing back the try-catch here (or review parameter Status* st and callers in scalar_temporal_unary.cc between 753 - 1037)?
| @@ -887,10 +800,8 @@ Duration CeilWeekTimePoint(const int64_t arg, const RoundTemporalOptions& option | |||
| template <typename Duration, typename Unit, typename Localizer> | |||
| // to hours since the beginning of the day. | ||
| const Duration origin = | ||
| OriginHelper(d, ConvertTimePoint<Duration>(t), options.unit); | ||
| return duration_cast<Duration>(CeilHelper<Duration, Unit>((d - origin), options) + |
There was a problem hiding this comment.
(to self: verify CeilHelper under FloorTimePoint here)
Temporal floor/ceil/round handle ambiguous/nonexistent local time