fix: handle date_bin negative subsecond and overflow cases#22610
Conversation
|
@kumarUjjawal |
|
Thank you @kosiew I will resolve the issues now. |
77ca1d3 to
a20b061
Compare
There was a problem hiding this comment.
@kumarUjjawal
Thanks for the update. I did not find any blocking issues. I left a couple of small suggestions that could help maintainability and future-proofing.
| } | ||
|
|
||
| Ok(match array { | ||
| ColumnarValue::Scalar(ScalarValue::TimestampNanosecond(v, tz_opt)) => { |
There was a problem hiding this comment.
Nice cleanup overall. One thought: the four scalar timestamp arms now have the same control flow and only differ by the ScalarValue variant and Arrow timestamp type. It might be worth introducing a small typed helper or macro here so the scalar path mirrors transform_array_with_stride. That would make it easier to keep the scalar and array paths in sync and reduce the chance of them drifting apart in a future change.
| TimestampSecondArray, | ||
| }; | ||
|
|
||
| let return_field = &Arc::new(Field::new( |
There was a problem hiding this comment.
Small test suggestion: test_date_bin_scale_overflow_returns_null currently creates a single return_field with Timestamp(Second) while also exercising millisecond and microsecond source cases. The test passes today because the implementation does not use return_field, but it might be a little more robust to build the field from each case's expected_type. That way the test stays accurate if invoke_with_args ever starts validating or using the return field.
|
Thanks @kosiew I have addressed the comments |
Jefffrey
left a comment
There was a problem hiding this comment.
so do we only explicitly error on the origin check now? all other errors are converted to nulls?
| let nsec = (nanos % NANOS_PER_SEC) as u32; | ||
| // DateTime::from_timestamp requires a non-negative nanosecond part. | ||
| let secs = nanos.div_euclid(NANOS_PER_SEC); | ||
| let nsec = nanos.rem_euclid(NANOS_PER_SEC) as u32; |
There was a problem hiding this comment.
whats the significance of changing to div_euclid/rem_euclid here?
There was a problem hiding this comment.
This fixes negative sub second timestamps. Normal / and % can produce a negative nanosecond remainder, but DateTime::from_timestamp requires nanos to be non-negative.div_euclid / rem_euclid normalize the timestamp into valid seconds + nanos.
There was a problem hiding this comment.
i think only rem_euclid does this? div_euclid just seems to decide which way to round up/down
row-level data problems -> NULL while query-level invalid inputs -> error. |
Jefffrey
left a comment
There was a problem hiding this comment.
row-level data problems -> NULL while query-level invalid inputs -> error.
this is a succinct way of putting it, thanks 👍
| // Scale to nanoseconds and report overflow as a normal error. | ||
| fn checked_scale_to_nanos(x: i64, scale: i64) -> Result<i64> { | ||
| x.checked_mul(scale).ok_or_else(|| { | ||
| ArrowError::InvalidArgumentError(format!( |
There was a problem hiding this comment.
doesnt need to be an arrowerror anymore, can just be datafusion error
date_bin negative subsecond and overflow cases
|
run benchmark date_bin |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing fix/date-bin-error-change (38c4c6a) to c83a981 (merge-base) diff using: date_bin File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagedate_bin — base (merge-base)
date_bin — branch
File an issue against this benchmark runner |
|
thanks all |
Which issue does this PR close?
date_binscalar & array paths have different error behaviours #22528Rationale for this change
date_binhad a few edge cases that could return the wrong result, return an error only on array inputs, or panic/wrap when scaling timestamp and time values to nanoseconds.What changes are included in this PR?
NULLconsistently for per-row binning errors.Are these changes tested?
Yes
Are there any user-facing changes?
No public API changes.