[BLOCKED] Integer arithmetic with overflow checking#3755
[BLOCKED] Integer arithmetic with overflow checking#3755fbusato wants to merge 19 commits intoNVIDIA:mainfrom
Conversation
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
/ok to test |
🟨 CI finished in 2h 27m: Pass: 92%/151 | Total: 3d 03h | Avg: 29m 59s | Max: 1h 19m | Hits: 62%/209394
|
| Project | |
|---|---|
| CCCL Infrastructure | |
| +/- | libcu++ |
| CUB | |
| Thrust | |
| CUDA Experimental | |
| python | |
| CCCL C Parallel Library | |
| Catch2Helper |
Modifications in project or dependencies?
| Project | |
|---|---|
| CCCL Infrastructure | |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 151)
| # | Runner |
|---|---|
| 108 | linux-amd64-cpu16 |
| 15 | windows-amd64-cpu16 |
| 10 | linux-arm64-cpu16 |
| 8 | linux-amd64-gpu-rtx2080-latest-1 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
| 1 | linux-amd64-gpu-h100-latest-1 |
|
I've already implemented the saturation arithmetics in #3449, there are just some compiler issues I haven't resolved yet. However the behaviour is not equivalent, the saturation arithmetics just clamps the result in If you need the overflow flag as a result, you may checkout the implementation, there are some clever ways to optimize the behaviour on device using |
|
thanks, @davebayer. Indeed, I was going to ask you to take a look at this PR. I check if I can drop the current one if it is redundant with saturation arithmetic. |
In my opinion having |
|
let me summarize the differences:
The main open question that I have is if we want the same semantics of intrinsic. This would make the implementation more complex without a clear benefits IMO (but I could be wrong) |
I am against this. I think the user should be consistent with the types passed to int16_t fn(int16 x)
{
auto [result, overflow] = cuda::add_overflow(x, 10);
if (overflow)
{
throw std::runtime_error("Error");
}
return result;
}The user clearly wants to check against
I would follow the namespace cuda
{
template <class _Tp>
struct op_overflow_result
{
_Tp value;
bool overflow;
};
template <class _Tp>
op_overflow_result<_Tp> op_overflow(_Tp __lhs, _Tp __rhs)
{
op_overflow_result<_Tp> __ret;
__ret.overflow = __builtin_op_overflow(__lhs, __rhs, &__ret.value);
return __ret;
}
} // namespace cuda |
|
based on internal discussion and current CUB use cases: https://github.com/NVIDIA/cccl/blob/main/cub/cub/agent/agent_reduce.cuh#L424 and https://github.com/NVIDIA/cccl/blob/main/cub/cub/device/dispatch/dispatch_histogram.cuh#L801. The functions will only check if an operation is valid or not, without providing the result. This is not redundant with the actual computation |
|
/ok to test |
🟨 CI finished in 2h 48m: Pass: 93%/151 | Total: 3d 00h | Avg: 28m 47s | Max: 1h 19m | Hits: 63%/213614
|
| Project | |
|---|---|
| CCCL Infrastructure | |
| +/- | libcu++ |
| CUB | |
| Thrust | |
| CUDA Experimental | |
| python | |
| CCCL C Parallel Library | |
| Catch2Helper |
Modifications in project or dependencies?
| Project | |
|---|---|
| CCCL Infrastructure | |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 151)
| # | Runner |
|---|---|
| 108 | linux-amd64-cpu16 |
| 15 | windows-amd64-cpu16 |
| 10 | linux-arm64-cpu16 |
| 8 | linux-amd64-gpu-rtx2080-latest-1 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
| 1 | linux-amd64-gpu-h100-latest-1 |
Yes, I am refering to the solution I proposed. Actually the fastest way to check if an operation overflows is to compute the result and check the overflow flags and the result. I've checked the assembly generated by the compilers and it does exactly that.
I've implemented a version fully functional in both host and device code prefering builtins and falling back a generic implementation. |
miscco
left a comment
There was a problem hiding this comment.
As touched on monday I prefer to not waste already available information.
That is why I would prefer the approach with computing the result and also passing a flag around that signifies whether overflow occurred.
I believe that there is effectively never a situation where we are completely uninterested in the result of an operation and just want to throw in that hypothetical case.
So throwing away the result in all common cases seems wastefull
|
Maybe I should have introduced better the solution. All of the functions have 2 overloads: template <class T>
constexpr bool op_overflow(T x, T y, T& result) noexcept;
template <class T>
constexpr overflow_arithmetic_result_t<T> op_overflow(T x, T y) noexcept;They can be used as: // ...
int val;
if (cuda::add_overflow(x, y, result))
{
// handle overflow
}
// use `val`
// ...and // ...
if (auto res = add_overflow(x, y))
{
// handle overflow saved in `res.overflow`
// use result saved in `res.value`
}
// ...The I've already discussed the design with @miscco and he seems to be happy with it. However, the implementation currently all of the inputs must be of the same type. If you insist on type mixing and returning common type, I can change the implementation. What are your thoughts on this, @fbusato? :) |
This is not a waste of available information. Checking the overflow could involve different operations compared to the actual computation. @davebayer I like the idea of the overloads but I would prefer to keep |
I only optimized the multiplication for device, because I did not come up with anything better than what the generic C++ implementation does. I'd like to demonstrate that there is no performance benefit from having My implementation generates the same PTX as the clang-cuda's There are the extended precision integer arithmetic instructions, but we have no way getting the The only improvements I see is that NVCC seems to have trouble using predicates, so I could use inline PTX to fix that, but it would bring more complexity to the whole thing. |
|
Add/Subtraction
Your idea is very nice, but I would argue the opposite. Even in the worst case for the comparison ( Multiplication:
Thoughts: I'm still convinced that checking for overflow and computing the operations are two different things:
Personally, I would like to have both versions, boolean value and with the result. Final note about the parameter types. Using different types + |
|
tl;dr: In libcudf, we'd love to able to use saturating addition/subtraction that also returns whether overflow occurred. In libcudf we have need of saturating integer arithmetic that also returns whether overflow occurred. The context is searching for an insertion point in an array ( For an open window, if no overflow occurs, I can find the correct insertion point for a row i with However, if overflow does occur, then I need a way of distinguishing a legitimately obtained saturated value, from one that occurred due to overflow. Particularly, for open windows, if saturation occurred then I need to change the comparator to |
|
Spark-Rapids repo requires the following APIs: template <typename T where T = int8_t, int16_t, int32_t or int64_t>
__device__ void add(T x, T y, bool check_overflow, bool* valid, T* result);
template <typename T where T = int8_t, int16_t, int32_t or int64_t>
__device__ void subtract(T x, T y, bool check_overflow, bool* valid, T* result);
template <typename T where T = int8_t, int16_t, int32_t or int64_t>
__device__ void multiply(T x, T y, bool check_overflow, bool* valid, T* result);
template <typename T where T = int8_t, int16_t, int32_t or int64_t>
__device__ void divide(T x, T y, bool check_overflow, bool* valid, T* result);Spark-Rapids will use the above APIs via cuDF repo. |
|
@fbusato could you please rebase the PR? |
|
As I said earlier I have strong reservations about the API here. personally, I would strongly prefer if we would align this closer with the C++ saturation arithmetics that were implemented by @davebayer in #3449 How about the following API: namespace cuda {
template<class T>
struct __staturation_result_t {
T result;
bool has_overflown;
};
_CCCL_TEMPLATE(class _Tp)
_CCCL_REQUIRES(__cccl_is_integer_v<_Tp>)
[[nodiscard]] _LIBCUDACXX_HIDE_FROM_ABI constexpr __staturation_result_t <_Tp> div_sat(_Tp __x, _Tp __y) noexcept {}
} // namespace cuda
namespace cuda::std {
_CCCL_TEMPLATE(class _Tp)
_CCCL_REQUIRES(__cccl_is_integer_v<_Tp>)
[[nodiscard]] _LIBCUDACXX_HIDE_FROM_ABI constexpr _Tp div_sat(_Tp __x, _Tp __y) noexcept {
return ::cuda::div_sat(__x, __y).result;
}
} // namespace cuda::stdThat would have the clear benefit of having a common implementation that greatly reduces code duplication and also provides all the information we need. Should a user really want to discard the return value and only check for overflow, they can just add their own wrapper or directly access |
I don't like the naming. To me, saturating and overflow checking are two separate things. Plus saturating the overflow everytime adds some overhead. I would really like to keep it simple, we needn't to reinvent a wheel here. Rust also implements them separately with different names. I've already implemented There is also an issue #4419 tracking those changes. I will try to finish the |
|
@davebayer @miscco We already discussed this point. I'm in favor of #4415 design, and I probably work on that soon. I will keep this PR open only for reference and to compare the implementations. |
|
closing. We already provided the functionalities in different PRs |
|
I found the following overflow checks in https://nvidia.github.io/cccl/libcudacxx/extended_api/numeric.html: |
Not yet, coming soon! |
|
@res-life would you be interested in this set of APIs for overflow arithmetic with overflow checking? template <class T>
cuda::overflow_result<T> add_sat_overflow(T lhs, T rhs) noexcept;
template <class T>
bool add_sat_overflow(T& result, T lhs, T rhs) noexcept;or would you prefer what you suggested? template <class T>
void add(T x, T y, bool check_overflow, bool* valid, T* result); |
|
I prefer the first. Please ignore what I suggested. |
Description
Provide the following functions to check if addition, subtraction, multiplication, or division of two integrals (including 128-bit integers) overflows the maximum value or underflow the minimum value of the common type (
cuda::std::common_type_t<T, U>).Inspired by https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html and https://clang.llvm.org/docs/LanguageExtensions.html#checked-arithmetic-builtins
Useful when/where undefined behavior sanitizer is not available (e.g. device code) and for assertions