ARROW-15251: [C++] Temporal floor/ceil/round handle ambiguous/nonexistent local time #12528

Open
AlvinJ15 wants to merge 5 commits into apache:main from AlvinJ15:ARROW-15251-Temporal_floor/ceil/round_handle_ambiguous/nonexistent_local_time

Conversation

@AlvinJ15

@AlvinJ15 AlvinJ15 commented Mar 1, 2022

Contributor

Temporal floor/ceil/round handle ambiguous/nonexistent local time

@github-actions

github-actions Bot commented Mar 1, 2022


⚠️ Ticket has no components in JIRA, make sure you assign one.

@AlvinJ15 AlvinJ15 force-pushed the ARROW-15251-Temporal_floor/ceil/round_handle_ambiguous/nonexistent_local_time branch from 354bef6 to 2e018c5 Compare March 1, 2022 08:34
@AlvinJ15

AlvinJ15 commented Mar 1, 2022

Contributor Author

@rok could you check this? I tested different NonExistentTimeError cases, but floor/ceil/round didn't raise the exception; it seems like FloorTimePoint handles this.

@jorisvandenbossche

Member

A somewhat contrived example that currently gives a nonexistent error with rounding (it requires an atypical multiple to end up in a gap):

>>> import pandas as pd, pyarrow as pa, pyarrow.compute as pc
>>> arr = pc.assume_timezone(pa.array([pd.Timestamp("2015-03-29 02:30:00")]), "Europe/Brussels", nonexistent="latest")
>>> pc.round_temporal(arr, 16, "minute")
...
ArrowInvalid: Local time does not exist: 2015-03-29 02:56:00.000000 is in a gap between
2015-03-29 02:00:00 CET and
2015-03-29 03:00:00 CEST which are both equivalent to
2015-03-29 01:00:00 UTC
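
The arithmetic behind the 02:56 in that error can be reproduced with a small standard-library sketch. It assumes, purely for illustration, that multiples are counted from local midnight and that nonexistent="latest" resolved 02:30 to 03:00 local time (which is consistent with the error message). `round_wall_clock` is a hypothetical helper, not an Arrow function.

```python
from datetime import datetime, timedelta


def round_wall_clock(naive: datetime, step: timedelta) -> datetime:
    """Round a naive wall-clock time to the nearest multiple of `step`.

    Assumption for illustration: multiples are counted from local midnight.
    """
    midnight = naive.replace(hour=0, minute=0, second=0, microsecond=0)
    # Adding half a step before floor division rounds to the nearest multiple.
    steps = (naive - midnight + step / 2) // step
    return midnight + steps * step


# 03:00 local is 180 minutes after midnight; the nearest multiple of 16
# minutes is 176 minutes, which is 02:56 and lies inside the DST gap.
rounded = round_wall_clock(datetime(2015, 3, 29, 3, 0), timedelta(minutes=16))
print(rounded)  # 2015-03-29 02:56:00
```

With a more typical multiple such as 15 or 60 minutes the rounded wall-clock time lands on 03:00 itself, which is why the gap is only hit with an atypical multiple.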

@rok rok left a comment

Member


Thanks for doing this @AlvinJ15! This looks pretty complete already. You can find an example test for nonexistent and ambiguous here: https://howardhinnant.github.io/date/tz.html#nonexistent_local_time
I'll do another pass tonight or tomorrow.

Comment thread cpp/src/arrow/compute/api_scalar.cc Outdated

Member


I'm not sure if AssumeTimezoneOptions::Ambiguous and RoundTemporalOptions::Ambiguous would have the same options long-term (same for nonexistent). For now this change seems like the way to go; I'm just wondering if the name compute::Ambiguous should maybe be compute::AmbiguousTime (and compute::NonexistentTime)?

Contributor Author


I took the suggestion and changed compute::Ambiguous to compute::AmbiguousTime and compute::Nonexistent to compute::NonexistentTime.

@rok

rok commented Mar 1, 2022

Member

it seems like FloorTimePoint handles this

Are you saying that CeilTimePoint and RoundTimePoint don't raise at all? Or that FloorTimePoint raises for them?

@AlvinJ15 AlvinJ15 force-pushed the ARROW-15251-Temporal_floor/ceil/round_handle_ambiguous/nonexistent_local_time branch from 2e018c5 to 11c1c52 Compare March 2, 2022 05:54
@AlvinJ15

AlvinJ15 commented Mar 2, 2022

Contributor Author

@rok comments addressed; the re-request review button doesn't work for me.

@AlvinJ15 AlvinJ15 force-pushed the ARROW-15251-Temporal_floor/ceil/round_handle_ambiguous/nonexistent_local_time branch 2 times, most recently from c6cb7e6 to acdf872 Compare March 2, 2022 06:22
@AlvinJ15 AlvinJ15 force-pushed the ARROW-15251-Temporal_floor/ceil/round_handle_ambiguous/nonexistent_local_time branch 2 times, most recently from 0c31bd1 to b16eeed Compare March 2, 2022 07:04

@rok rok left a comment

Member


Some comments. Looks good overall!
We need a review from a committer as well @jorisvandenbossche @pitrou

Member


Perhaps AssumeTimezone kernel could reuse this now that you have it nicely factored out?

Member


Would it make sense to have arrow_vendored::date::choose as a template parameter rather than passing options every time the kernel is called? (I'm not certain, just asking.)

Also note that we could probably do more templating for the ceil/floor/round kernels, but that's out of scope here.

@pitrou

pitrou commented Mar 7, 2022

Member

Can you explain the motivation for this functionality?
Usually, if the rounded time is ambiguous/non-existent, the input time was already ambiguous/non-existent, no?

@rok

rok commented Mar 7, 2022

Member

Can you explain the motivation for this functionality? Usually, if the rounded time is ambiguous/non-existent, the input time was already ambiguous/non-existent, no?

The issue is that rounding is done on local time, not UTC. So if the rounded-to moment does not exist in local time, rounding will fail, and we need to handle it at that point.

@pitrou

pitrou commented Mar 7, 2022

Member

Right... but as I said this usually means the original timestamp was already invalid, no? So instead of catching errors, would it be more/less useful to have a function purely to fix invalid timestamps?

@rok

rok commented Mar 7, 2022

Member

Original timestamp can be valid and rounded-to not. Let's take an example from date.h docs:

2016-03-13 02:30:00 is in a gap between
2016-03-13 02:00:00 EST and
2016-03-13 03:00:00 EDT which are both equivalent to
2016-03-13 07:00:00 UTC

If we start with a valid 2016-03-13 00:01:00 EST and ceil it to a multiple of 2h30min, then the resulting local time (02:30:00) will fall into the nonexistent gap.
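
A minimal sketch of this case using only Python's standard library, assuming the multiples of 2h30min are counted from local midnight (an illustrative simplification, not necessarily how the Arrow kernel picks its origin). `ceil_wall_clock` and `classify_local_time` are hypothetical helpers.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo


def ceil_wall_clock(naive: datetime, step: timedelta) -> datetime:
    """Ceil a naive wall-clock time to a multiple of `step` since midnight."""
    midnight = naive.replace(hour=0, minute=0, second=0, microsecond=0)
    # Negating a floor division gives a ceiling division on timedeltas.
    steps = -((midnight - naive) // step)
    return midnight + steps * step


def classify_local_time(naive: datetime, tz: ZoneInfo) -> str:
    """Return "unique", "ambiguous", or "nonexistent" for a wall-clock time."""
    first = naive.replace(tzinfo=tz, fold=0)
    second = naive.replace(tzinfo=tz, fold=1)
    if first.utcoffset() == second.utcoffset():
        return "unique"
    # A wall-clock time inside a gap does not survive a round trip via UTC.
    round_trip = first.astimezone(timezone.utc).astimezone(tz)
    if round_trip.replace(tzinfo=None) == naive:
        return "ambiguous"
    return "nonexistent"


new_york = ZoneInfo("America/New_York")
start = datetime(2016, 3, 13, 0, 1)                   # valid local time (EST)
target = ceil_wall_clock(start, timedelta(minutes=150))
print(target)                                         # 2016-03-13 02:30:00
print(classify_local_time(start, new_york))           # unique
print(classify_local_time(target, new_york))          # nonexistent
```

The input is a perfectly valid local time, yet the ceiled wall-clock value cannot be converted back to UTC without some policy for the gap.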

@pitrou

pitrou commented Mar 7, 2022

Member

Hmm, I see. While ceiling to 2h30min sounds exotic, this is a valid use case.

@pitrou

pitrou commented Mar 7, 2022

Member

Still, this example doesn't make sense to me:

  const char* times = R"(["2018-10-28 01:20:00"])";
  const char* times_earliest = R"(["2018-10-28 00:30:00"])";
  const char* times_latest = R"(["2018-10-28 01:30:00"])";

The only correct answer here is 2018-10-28 01:30:00 (because ceil should produce a timestamp that is not before the input timestamp). And so there is no ambiguity.

@rok

rok commented Mar 7, 2022

Member

Hmm, I see. While ceiling to 2h30min sounds exotic, this is a valid use case.

I think you can achieve this with less exotic intervals too.

The only correct answer here is 2018-10-28 01:30:00 (because ceil should produce a timestamp that is not before the input timestamp). And so there is no ambiguity.

Indeed. So choice should only be raise or latest?

@pitrou

pitrou commented Mar 7, 2022

Member

Taken more abstractly, the contract is the following:

  • round returns the possible output that is closest to the input timestamp
  • floor returns the possible output that is closest to but not after the input timestamp
  • ceil returns the possible output that is closest to but not before the input timestamp

So it should be possible to implement the expected semantics without exposing any additional options to the user, possibly by examining the two earliest and latest values and choosing the best one.
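
One way to read that contract, sketched with Python's standard library rather than the Arrow kernels: enumerate the UTC instants that correspond to nearby wall-clock multiples, then pick among them by comparing in UTC. The helper names, the midnight origin, and the choice to silently skip multiples that fall into a gap are all assumptions made for this illustration.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

UTC = timezone.utc


def utc_candidates(wall: datetime, tz: ZoneInfo) -> list:
    """Return every UTC instant whose wall-clock reading in `tz` is `wall`.

    The list is empty in a gap, has two entries in an overlap, else one.
    """
    found = set()
    for fold in (0, 1):
        instant = wall.replace(tzinfo=tz, fold=fold).astimezone(UTC)
        # Keep the instant only if it really reads as `wall` in `tz`.
        if instant.astimezone(tz).replace(tzinfo=None) == wall:
            found.add(instant)
    return sorted(found)


def round_temporal_sketch(instant, tz, step, mode):
    """Apply the floor/ceil/round contract to an aware `instant`."""
    local = instant.astimezone(tz).replace(tzinfo=None)
    midnight = local.replace(hour=0, minute=0, second=0, microsecond=0)
    below = midnight + ((local - midnight) // step) * step
    # Gather possible outputs around the input; gap times contribute none.
    possible = []
    for k in (-1, 0, 1, 2):
        possible.extend(utc_candidates(below + k * step, tz))
    if mode == "floor":
        return max(c for c in possible if c <= instant)
    if mode == "ceil":
        return min(c for c in possible if c >= instant)
    return min(possible, key=lambda c: abs(c - instant))


brussels = ZoneInfo("Europe/Brussels")
t = datetime(2021, 10, 31, 0, 25, tzinfo=UTC)  # 02:25 CEST, first pass
hour = timedelta(hours=1)
print(round_temporal_sketch(t, brussels, hour, "round").isoformat())
# 2021-10-31T00:00:00+00:00
print(round_temporal_sketch(t, brussels, hour, "ceil").isoformat())
# 2021-10-31T01:00:00+00:00
```

Because the comparison happens in UTC, the ambiguous wall-clock value 02:00 needs no user-facing option here. Skipping a nonexistent multiple, as this sketch does, is only one possible policy; raising instead would be another.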

@rok

rok commented Mar 8, 2022

Member

So it should be possible to implement the expected semantics without exposing any additional options to the user, possibly by examining the two earliest and latest values and choosing the best one.

Currently we only raise, and we want to control raise vs. earliest/latest. That would mean exposing an additional option, I think?

Or are you proposing to not raise? I think that's valid for nonexistent, but I'm not sure about ambiguous. E.g. ceil(t) falls on the exact moment of the DST switch and could return t or t + dst_offset. We can take the same approach here and see if users eventually complain :). Here's an ambiguous example from date.h.

@jorisvandenbossche

Member

While ceiling to 2h30min sounds exotic, this is a valid use case.

Rok already mentioned it, but while it's true that non-existent times from rounding are a bit exotic, the ambiguous case is certainly not.

To give a concrete example, assume the local time "2021-10-31 02:25:00" in Europe (during a DST switch) and rounding that to the hour:

>>> arr = pa.array([pd.Timestamp("2021-10-31 02:25:00")])
>>> arr = pc.assume_timezone(arr, "Europe/Brussels", ambiguous="earliest")
>>> arr
<pyarrow.lib.TimestampArray object at 0x7f00c1e04760>
[
  2021-10-31 00:25:00.000000
]

>>> pc.round_temporal(arr, 1, "hour")
...
ArrowInvalid: Local time is ambiguous: 2021-10-31 02:00:00.000000 is ambiguous.  It could be
2021-10-31 02:00:00.000000 CEST == 2021-10-31 00:00:00.000000 UTC or
2021-10-31 02:00:00.000000 CET == 2021-10-31 01:00:00.000000 UTC

But indeed, also in this case we can know that "00:00:00 UTC" is closer to the original timestamp than "01:00:00 UTC" (since the original timestamp in UTC was "00:25:00 UTC").

That adds some more logic to this kernel, but this would actually make those round kernels more useful!
(for example, if you have a regular timeseries (say of minute interval) and you round it to the hour, you could never pick an ambiguous="latest"/"earliest" option that is correct for all values in your timeseries)
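
The point about a regular timeseries can be illustrated with a short standard-library sketch (hypothetical helper names, not Arrow API). Two inputs one hour apart both read 02:25 on the wall clock and both round to the wall-clock value 02:00, yet the closest UTC candidate differs between them.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

UTC = timezone.utc
BRUSSELS = ZoneInfo("Europe/Brussels")


def fold_candidates(wall: datetime, tz: ZoneInfo):
    """Return the (earliest, latest) UTC instants for an ambiguous wall time."""
    earliest = wall.replace(tzinfo=tz, fold=0).astimezone(UTC)
    latest = wall.replace(tzinfo=tz, fold=1).astimezone(UTC)
    return earliest, latest


def closest(instant: datetime, candidates):
    """Pick the candidate nearest to `instant`, compared in UTC."""
    return min(candidates, key=lambda c: abs(c - instant))


rounded_wall = datetime(2021, 10, 31, 2, 0)               # ambiguous local time
earliest, latest = fold_candidates(rounded_wall, BRUSSELS)

first_pass = datetime(2021, 10, 31, 0, 25, tzinfo=UTC)    # 02:25 CEST
second_pass = datetime(2021, 10, 31, 1, 25, tzinfo=UTC)   # 02:25 CET

print(closest(first_pass, (earliest, latest)) == earliest)   # True
print(closest(second_pass, (earliest, latest)) == latest)    # True
```

A fixed "earliest" would be wrong for the second input and a fixed "latest" would be wrong for the first, which is the argument for resolving the ambiguity per value.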

Or are you proposing to not raise?

If there is no ambiguity left (e.g. as in the example above), I think we should not raise by default.

But it might be that for some cases it's still better to raise by default. For example in the case of "non-existent" times, we are actually changing the resulting timestamp, and thus that also means it will not necessarily "follow" the rounding multiple and unit. I think in such cases, it might still be better to raise by default?

@pitrou

pitrou commented Mar 9, 2022

Member

@rok

Or are you proposing to not raise? I think that's valid for nonexistent, but I'm not sure about ambiguous. E.g. ceil(t) falls on the exact moment of the DST switch and could return t or t + dst_offset.

By definition of ceil(), it should return the smallest applicable value, so there is no ambiguity.

@jorisvandenbossche

For example in the case of "non-existent" times, we are actually changing the resulting timestamp, and thus that also means it will not necessarily "follow" the rounding multiple and unit. I think in such cases, it might still be better to raise by default?

That's true, though it only seems to trigger for "unusual" roundings. So we may want to add an option for non-existent timestamps, but it sounds less important than getting ambiguous timestamps right (which shouldn't require an option).

@rok

rok commented Mar 9, 2022

Member

For example in the case of "non-existent" times, we are actually changing the resulting timestamp, and thus that also means it will not necessarily "follow" the rounding multiple and unit. I think in such cases, it might still be better to raise by default?

We could catch these cases and implement logic to return a result that still follows the rounding multiple. Then we wouldn't need any options and would never have to raise.

I agree with the other conclusions.

@rok

rok commented Mar 30, 2022

Member

@AlvinJ15 any progress on this? Can I help somehow?
I have another rounding PR and I'd like to use your changes there.

@rok

rok commented Jun 16, 2022

Member

@raulcd

@rok rok self-assigned this Jan 14, 2023
@amol-

amol- commented Mar 30, 2023

Member

Closing because it has been untouched for a while. In case it's still relevant, feel free to reopen and move it forward 👍

@amol- amol- closed this Mar 30, 2023
@rok rok reopened this Mar 30, 2023
@rok rok requested review from AlenkaF and westonpace as code owners March 30, 2023 17:51
@westonpace westonpace removed their request for review July 6, 2023 14:09
@rok rok force-pushed the ARROW-15251-Temporal_floor/ceil/round_handle_ambiguous/nonexistent_local_time branch from fff7bd4 to a48f69d Compare December 23, 2023 22:31
@github-actions github-actions Bot added the awaiting review Awaiting review label Dec 23, 2023
@rok rok force-pushed the ARROW-15251-Temporal_floor/ceil/round_handle_ambiguous/nonexistent_local_time branch from a48f69d to c027b76 Compare April 8, 2024 00:22
@rok rok force-pushed the ARROW-15251-Temporal_floor/ceil/round_handle_ambiguous/nonexistent_local_time branch 2 times, most recently from f77fd93 to 107c413 Compare April 19, 2024 17:12
AlvinJ15 and others added 5 commits April 20, 2024 23:57
Tweaking nonexistent/ambiguous rounding
Moving nonexistent/ambiguous logic to AssumeTimezone
Revert AssumeTimezoneOptions::Nonexistent changes
Fixing compiler warnings
Fixing ceil/round issues
Apply suggestions from code review
Review feedback
Review feedback
Changes to ceil/floor, more tests
Refactoring
refactoring
Review feedback
review feedback
Review feedback
adding python tests
adding ambiguous round test python
Update cpp/src/arrow/compute/kernels/scalar_temporal_test.cc
change nonexistent/ambiguous behaviour
Add preserve_wall_time_order flag
@rok rok force-pushed the ARROW-15251-Temporal_floor/ceil/round_handle_ambiguous/nonexistent_local_time branch from 107c413 to 35cab06 Compare April 28, 2024 19:56
@github-actions


Thank you for your contribution. Unfortunately, this pull request has been marked as stale because it has had no activity in the past 365 days. Please remove the stale label or comment below, or this PR will be closed in 14 days. Feel free to re-open this if it has been closed in error. If you do not have repository permissions to reopen the PR, please tag a maintainer.

@github-actions github-actions Bot added the Status: stale-warning Issues and PRs flagged as stale which are due to be closed if no indication otherwise label Nov 18, 2025
@github-actions github-actions Bot closed this Dec 5, 2025
@rok rok reopened this Mar 3, 2026
@rok rok requested a review from raulcd as a code owner March 3, 2026 11:50
@rok rok removed the Status: stale-warning Issues and PRs flagged as stale which are due to be closed if no indication otherwise label Mar 3, 2026
}

template <typename Duration>
Duration ConvertLocalToSys(Duration t, Status* st) const {

Collaborator


Perhaps consider bringing back the try-catch here (or review the Status* st parameter and its callers in scalar_temporal_unary.cc, between lines 753 and 1037)?

@@ -887,10 +800,8 @@ Duration CeilWeekTimePoint(const int64_t arg, const RoundTemporalOptions& option
template <typename Duration, typename Unit, typename Localizer>

Collaborator


No caller, to be removed?

Collaborator


To be removed?

// to hours since the beginning of the day.
const Duration origin =
OriginHelper(d, ConvertTimePoint<Duration>(t), options.unit);
return duration_cast<Duration>(CeilHelper<Duration, Unit>((d - origin), options) +

Collaborator


(to self: verify CeilHelper under FloorTimePoint here)

8 participants