
[BLOCKED] Integer arithmetic with overflow checking#3755

Closed
fbusato wants to merge 19 commits into NVIDIA:main from fbusato:arithmetic-overflow-checking


@fbusato
Contributor

@fbusato fbusato commented Feb 8, 2025

Description

Provide the following functions to check whether addition, subtraction, multiplication, or division of two integers (including 128-bit integers) overflows the maximum value or underflows the minimum value of the common type (cuda::std::common_type_t<T, U>).

template <typename T, typename U>
[[nodiscard]] __host__ __device__ inline
constexpr bool is_add_overflow(T a, U b) noexcept;

template <typename T, typename U>
[[nodiscard]] __host__ __device__ inline
constexpr bool is_sub_overflow(T a, U b) noexcept;

template <typename T, typename U>
[[nodiscard]] __host__ __device__ inline
constexpr bool is_mul_overflow(T a, U b) noexcept;

template <typename T, typename U>
[[nodiscard]] __host__ __device__ inline
constexpr bool is_div_overflow(T a, U b) noexcept;

Inspired by https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html and https://clang.llvm.org/docs/LanguageExtensions.html#checked-arithmetic-builtins

Useful where the undefined behavior sanitizer is not available (e.g. device code) and for assertions.
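As a rough illustration of what such a check might look like, here is a minimal host-only sketch for addition. It is not the PR's implementation: the name `is_add_overflow_sketch` is hypothetical, `std::common_type_t` stands in for `cuda::std::common_type_t`, and the `__host__ __device__` annotations are omitted so that it compiles with a plain C++17 compiler. It also ignores the mixed-sign case, where converting a negative value to an unsigned common type wraps before the check runs.

```cpp
#include <limits>
#include <type_traits>

// Hypothetical sketch: reports whether a + b would leave the range of the
// common type of T and U. The sum itself is never computed.
template <typename T, typename U>
[[nodiscard]] constexpr bool is_add_overflow_sketch(T a, U b) noexcept
{
  using C = std::common_type_t<T, U>;
  static_assert(std::is_integral_v<C>, "sketch handles integer types only");
  const C x = static_cast<C>(a);
  const C y = static_cast<C>(b);
  if constexpr (std::is_unsigned_v<C>)
  {
    // Unsigned: only the upper bound can be exceeded.
    return x > std::numeric_limits<C>::max() - y;
  }
  else
  {
    if (y > 0)
    {
      return x > std::numeric_limits<C>::max() - y; // would exceed the maximum
    }
    if (y < 0)
    {
      return x < std::numeric_limits<C>::min() - y; // would fall below the minimum
    }
    return false; // adding zero never overflows
  }
}
```

Because the bounds are rearranged rather than the sum evaluated, the check itself never triggers signed overflow.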

@fbusato fbusato added the 3.0 label Feb 8, 2025
@fbusato fbusato self-assigned this Feb 8, 2025
@copy-pr-bot
Contributor

copy-pr-bot bot commented Feb 8, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@fbusato
Contributor Author

fbusato commented Feb 8, 2025

/ok to test

@github-actions
Contributor

github-actions bot commented Feb 8, 2025

🟨 CI finished in 2h 27m: Pass: 92%/151 | Total: 3d 03h | Avg: 29m 59s | Max: 1h 19m | Hits: 62%/209394
  • 🟨 libcudacxx: Pass: 73%/41 | Total: 5h 47m | Avg: 8m 28s | Max: 25m 32s | Hits: 92%/69958

    🔍 cpu: amd64 🔍
      🔍 amd64              Pass:  71%/39  | Total:  5h 39m | Avg:  8m 42s | Max: 25m 32s | Hits:  92%/64333 
      🟩 arm64              Pass: 100%/2   | Total:  7m 28s | Avg:  3m 44s | Max:  3m 52s | Hits:  98%/5625  
    🔍 cudacxx_family: nvcc 🔍
      🟩 ClangCUDA          Pass: 100%/2   | Total: 41m 11s | Avg: 20m 35s | Max: 22m 01s | Hits:  26%/5589  
      🔍 nvcc               Pass:  71%/39  | Total:  5h 06m | Avg:  7m 50s | Max: 25m 32s | Hits:  98%/64369 
    🔍 jobs: Build 🔍
      🔍 Build              Pass:  69%/36  | Total:  4h 50m | Avg:  8m 04s | Max: 25m 32s | Hits:  92%/69918 
      🟩 NVRTC              Pass: 100%/2   | Total: 35m 53s | Avg: 17m 56s | Max: 20m 26s | Hits:  90%/40    
      🟩 Test               Pass: 100%/2   | Total: 18m 18s | Avg:  9m 09s | Max:  9m 21s
      🟩 VerifyCodegen      Pass: 100%/1   | Total:  2m 13s | Avg:  2m 13s | Max:  2m 13s
    🟨 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 41m 11s | Avg: 20m 35s | Max: 22m 01s | Hits:  26%/5589  
      🟨 nvcc12.0           Pass:  40%/5   | Total: 36m 43s | Avg:  7m 20s | Max: 21m 43s | Hits:  98%/5561  
      🟥 nvcc12.5           Pass:   0%/2   | Total: 21m 23s | Avg: 10m 41s | Max: 12m 21s
      🟨 nvcc12.8           Pass:  81%/32  | Total:  4h 08m | Avg:  7m 45s | Max: 25m 32s | Hits:  98%/58808 
    🟨 cxx
      🟩 Clang14            Pass: 100%/4   | Total: 16m 46s | Avg:  4m 11s | Max:  4m 33s | Hits:  99%/11142 
      🟩 Clang15            Pass: 100%/2   | Total: 11m 33s | Avg:  5m 46s | Max:  7m 26s | Hits:  94%/5581  
      🟩 Clang16            Pass: 100%/2   | Total:  8m 56s | Avg:  4m 28s | Max:  4m 37s | Hits:  99%/5581  
      🟩 Clang17            Pass: 100%/2   | Total:  8m 56s | Avg:  4m 28s | Max:  4m 36s | Hits:  99%/5581  
      🟩 Clang18            Pass: 100%/6   | Total:  1h 03m | Avg: 10m 30s | Max: 22m 01s | Hits:  70%/13982 
      🟥 GCC7               Pass:   0%/2   | Total:  7m 16s | Avg:  3m 38s | Max:  3m 41s
      🟥 GCC8               Pass:   0%/1   | Total:  3m 35s | Avg:  3m 35s | Max:  3m 35s
      🟥 GCC9               Pass:   0%/2   | Total:  7m 29s | Avg:  3m 44s | Max:  3m 59s
      🟩 GCC10              Pass: 100%/2   | Total:  8m 05s | Avg:  4m 02s | Max:  4m 09s | Hits:  98%/5587  
      🟩 GCC11              Pass: 100%/2   | Total:  8m 05s | Avg:  4m 02s | Max:  4m 17s | Hits:  98%/5583  
      🟩 GCC12              Pass: 100%/2   | Total:  8m 10s | Avg:  4m 05s | Max:  4m 14s | Hits:  99%/5583  
      🟩 GCC13              Pass: 100%/8   | Total:  1h 15m | Avg:  9m 29s | Max: 20m 26s | Hits:  98%/11338 
      🟥 MSVC14.29          Pass:   0%/2   | Total: 47m 14s | Avg: 23m 37s | Max: 25m 31s
      🟥 MSVC14.42          Pass:   0%/2   | Total: 50m 49s | Avg: 25m 24s | Max: 25m 32s
      🟥 NVHPC24.7          Pass:   0%/2   | Total: 21m 23s | Avg: 10m 41s | Max: 12m 21s
    🟨 cxx_family
      🟩 Clang              Pass: 100%/16  | Total:  1h 49m | Avg:  6m 49s | Max: 22m 01s | Hits:  88%/41867 
      🟨 GCC                Pass:  73%/19  | Total:  1h 58m | Avg:  6m 14s | Max: 20m 26s | Hits:  98%/28091 
      🟥 MSVC               Pass:   0%/4   | Total:  1h 38m | Avg: 24m 30s | Max: 25m 32s
      🟥 NVHPC              Pass:   0%/2   | Total: 21m 23s | Avg: 10m 41s | Max: 12m 21s
    🟨 gpu
      🟨 rtx2080            Pass:  73%/41  | Total:  5h 47m | Avg:  8m 28s | Max: 25m 32s | Hits:  92%/69958 
    🟨 ctk
      🟨 12.0               Pass:  40%/5   | Total: 36m 43s | Avg:  7m 20s | Max: 21m 43s | Hits:  98%/5561  
      🟥 12.5               Pass:   0%/2   | Total: 21m 23s | Avg: 10m 41s | Max: 12m 21s
      🟨 12.8               Pass:  82%/34  | Total:  4h 49m | Avg:  8m 30s | Max: 25m 32s | Hits:  92%/64397 
    🟩 sm
      🟩 75                 Pass: 100%/2   | Total: 35m 53s | Avg: 17m 56s | Max: 20m 26s | Hits:  90%/40    
      🟩 90;90a;100         Pass: 100%/1   | Total: 16m 10s | Avg: 16m 10s | Max: 16m 10s | Hits:  96%/2902  
    🟨 std
      🟨 17                 Pass:  57%/21  | Total:  3h 01m | Avg:  8m 38s | Max: 25m 31s | Hits:  92%/30479 
      🟨 20                 Pass:  89%/19  | Total:  2h 43m | Avg:  8m 37s | Max: 25m 32s | Hits:  92%/39479 
    
  • 🟩 cub: Pass: 100%/44 | Total: 1d 18h | Avg: 58m 14s | Max: 1h 19m | Hits: 29%/52496

    🟩 cpu
      🟩 amd64              Pass: 100%/42  | Total:  1d 16h | Avg: 58m 00s | Max:  1h 19m | Hits:  30%/50056 
      🟩 arm64              Pass: 100%/2   | Total:  2h 05m | Avg:  1h 02m | Max:  1h 03m | Hits:  16%/2440  
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  5h 26m | Avg:  1h 05m | Max:  1h 07m | Hits:  15%/5934  
      🟩 12.5               Pass: 100%/2   | Total:  2h 24m | Avg:  1h 12m | Max:  1h 14m | Hits:  12%/2258  
      🟩 12.8               Pass: 100%/37  | Total:  1d 10h | Avg: 56m 32s | Max:  1h 19m | Hits:  32%/44304 
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  2h 13m | Avg:  1h 06m | Max:  1h 10m | Hits:  15%/2112  
      🟩 nvcc12.0           Pass: 100%/5   | Total:  5h 26m | Avg:  1h 05m | Max:  1h 07m | Hits:  15%/5934  
      🟩 nvcc12.5           Pass: 100%/2   | Total:  2h 24m | Avg:  1h 12m | Max:  1h 14m | Hits:  12%/2258  
      🟩 nvcc12.8           Pass: 100%/35  | Total:  1d 08h | Avg: 55m 57s | Max:  1h 19m | Hits:  33%/42192 
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total:  2h 13m | Avg:  1h 06m | Max:  1h 10m | Hits:  15%/2112  
      🟩 nvcc               Pass: 100%/42  | Total:  1d 16h | Avg: 57m 50s | Max:  1h 19m | Hits:  30%/50384 
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total:  4h 22m | Avg:  1h 05m | Max:  1h 07m | Hits:  16%/4888  
      🟩 Clang15            Pass: 100%/2   | Total:  2h 09m | Avg:  1h 04m | Max:  1h 05m | Hits:  16%/2440  
      🟩 Clang16            Pass: 100%/2   | Total:  2h 04m | Avg:  1h 02m | Max:  1h 03m | Hits:  16%/2440  
      🟩 Clang17            Pass: 100%/2   | Total:  2h 02m | Avg:  1h 01m | Max:  1h 02m | Hits:  16%/2440  
      🟩 Clang18            Pass: 100%/7   | Total:  6h 06m | Avg: 52m 20s | Max:  1h 10m | Hits:  41%/8212  
      🟩 GCC7               Pass: 100%/2   | Total:  2h 15m | Avg:  1h 07m | Max:  1h 08m | Hits:  16%/2444  
      🟩 GCC8               Pass: 100%/1   | Total:  1h 01m | Avg:  1h 01m | Max:  1h 01m | Hits:  16%/1222  
      🟩 GCC9               Pass: 100%/2   | Total:  2h 04m | Avg:  1h 02m | Max:  1h 02m | Hits:  16%/2444  
      🟩 GCC10              Pass: 100%/2   | Total:  2h 13m | Avg:  1h 06m | Max:  1h 07m | Hits:  16%/2444  
      🟩 GCC11              Pass: 100%/2   | Total:  2h 10m | Avg:  1h 05m | Max:  1h 05m | Hits:  16%/2440  
      🟩 GCC12              Pass: 100%/2   | Total:  2h 13m | Avg:  1h 06m | Max:  1h 07m | Hits:  16%/2440  
      🟩 GCC13              Pass: 100%/10  | Total:  6h 46m | Avg: 40m 37s | Max:  1h 10m | Hits:  57%/12200 
      🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 15m | Avg:  1h 07m | Max:  1h 12m | Hits:  12%/2092  
      🟩 MSVC14.42          Pass: 100%/2   | Total:  2h 31m | Avg:  1h 15m | Max:  1h 19m | Hits:  12%/2092  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  2h 24m | Avg:  1h 12m | Max:  1h 14m | Hits:  12%/2258  
    🟩 cxx_family
      🟩 Clang              Pass: 100%/17  | Total: 16h 46m | Avg: 59m 10s | Max:  1h 10m | Hits:  26%/20420 
      🟩 GCC                Pass: 100%/21  | Total: 18h 44m | Avg: 53m 32s | Max:  1h 10m | Hits:  36%/25634 
      🟩 MSVC               Pass: 100%/4   | Total:  4h 47m | Avg:  1h 11m | Max:  1h 19m | Hits:  12%/4184  
      🟩 NVHPC              Pass: 100%/2   | Total:  2h 24m | Avg:  1h 12m | Max:  1h 14m | Hits:  12%/2258  
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 52m 29s | Avg: 26m 14s | Max: 28m 18s | Hits:  57%/2440  
      🟩 rtx2080            Pass: 100%/34  | Total:  1d 13h | Avg:  1h 06m | Max:  1h 19m | Hits:  15%/40296 
      🟩 rtxa6000           Pass: 100%/8   | Total:  4h 24m | Avg: 33m 04s | Max:  1h 05m | Hits:  78%/9760  
    🟩 jobs
      🟩 Build              Pass: 100%/37  | Total:  1d 16h | Avg:  1h 04m | Max:  1h 19m | Hits:  15%/43956 
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 22m 57s | Avg: 22m 57s | Max: 22m 57s | Hits:  99%/1220  
      🟩 GraphCapture       Pass: 100%/1   | Total: 16m 13s | Avg: 16m 13s | Max: 16m 13s | Hits:  99%/1220  
      🟩 HostLaunch         Pass: 100%/3   | Total:  1h 15m | Avg: 25m 10s | Max: 25m 56s | Hits:  99%/3660  
      🟩 TestGPU            Pass: 100%/2   | Total: 44m 29s | Avg: 22m 14s | Max: 23m 56s | Hits:  99%/2440  
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 52m 29s | Avg: 26m 14s | Max: 28m 18s | Hits:  57%/2440  
      🟩 90;90a;100         Pass: 100%/1   | Total:  1h 10m | Avg:  1h 10m | Max:  1h 10m | Hits:  16%/1220  
    🟩 std
      🟩 17                 Pass: 100%/20  | Total: 21h 46m | Avg:  1h 05m | Max:  1h 14m | Hits:  15%/23639 
      🟩 20                 Pass: 100%/24  | Total: 20h 56m | Avg: 52m 20s | Max:  1h 19m | Hits:  40%/28857 
    
  • 🟩 thrust: Pass: 100%/43 | Total: 1d 00h | Avg: 34m 09s | Max: 1h 10m | Hits: 53%/76572

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 37m 35s | Avg: 18m 47s | Max: 26m 28s | Hits:  73%/3564  
    🟩 cpu
      🟩 amd64              Pass: 100%/41  | Total: 23h 28m | Avg: 34m 20s | Max:  1h 10m | Hits:  53%/73009 
      🟩 arm64              Pass: 100%/2   | Total:  1h 00m | Avg: 30m 11s | Max: 32m 13s | Hits:  47%/3563  
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  3h 10m | Avg: 38m 01s | Max:  1h 01m | Hits:  46%/8901  
      🟩 12.5               Pass: 100%/2   | Total:  2h 03m | Avg:  1h 01m | Max:  1h 02m | Hits:  25%/3562  
      🟩 12.8               Pass: 100%/36  | Total: 19h 14m | Avg: 32m 04s | Max:  1h 10m | Hits:  56%/64109 
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 54m 47s | Avg: 27m 23s | Max: 28m 15s | Hits:  47%/3562  
      🟩 nvcc12.0           Pass: 100%/5   | Total:  3h 10m | Avg: 38m 01s | Max:  1h 01m | Hits:  46%/8901  
      🟩 nvcc12.5           Pass: 100%/2   | Total:  2h 03m | Avg:  1h 01m | Max:  1h 02m | Hits:  25%/3562  
      🟩 nvcc12.8           Pass: 100%/34  | Total: 18h 19m | Avg: 32m 20s | Max:  1h 10m | Hits:  56%/60547 
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 54m 47s | Avg: 27m 23s | Max: 28m 15s | Hits:  47%/3562  
      🟩 nvcc               Pass: 100%/41  | Total: 23h 33m | Avg: 34m 28s | Max:  1h 10m | Hits:  53%/73010 
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total:  2h 06m | Avg: 31m 34s | Max: 32m 30s | Hits:  56%/7124  
      🟩 Clang15            Pass: 100%/2   | Total:  1h 09m | Avg: 34m 34s | Max: 35m 20s | Hits:  47%/3562  
      🟩 Clang16            Pass: 100%/2   | Total:  1h 03m | Avg: 31m 50s | Max: 32m 04s | Hits:  47%/3562  
      🟩 Clang17            Pass: 100%/2   | Total:  1h 01m | Avg: 30m 51s | Max: 31m 23s | Hits:  47%/3562  
      🟩 Clang18            Pass: 100%/7   | Total:  2h 47m | Avg: 23m 54s | Max: 34m 35s | Hits:  64%/12467 
      🟩 GCC7               Pass: 100%/2   | Total:  1h 01m | Avg: 30m 47s | Max: 31m 25s | Hits:  60%/3564  
      🟩 GCC8               Pass: 100%/1   | Total: 33m 12s | Avg: 33m 12s | Max: 33m 12s | Hits:  47%/1782  
      🟩 GCC9               Pass: 100%/2   | Total:  1h 09m | Avg: 34m 56s | Max: 34m 59s | Hits:  57%/3564  
      🟩 GCC10              Pass: 100%/2   | Total:  1h 02m | Avg: 31m 27s | Max: 32m 52s | Hits:  47%/3564  
      🟩 GCC11              Pass: 100%/2   | Total:  1h 05m | Avg: 32m 46s | Max: 33m 03s | Hits:  47%/3564  
      🟩 GCC12              Pass: 100%/2   | Total:  1h 11m | Avg: 35m 40s | Max: 37m 25s | Hits:  47%/3564  
      🟩 GCC13              Pass: 100%/8   | Total:  3h 18m | Avg: 24m 50s | Max: 37m 51s | Hits:  68%/14256 
      🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 03m | Avg:  1h 01m | Max:  1h 01m | Hits:  30%/3550  
      🟩 MSVC14.42          Pass: 100%/3   | Total:  2h 49m | Avg: 56m 29s | Max:  1h 10m | Hits:  38%/5325  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  2h 03m | Avg:  1h 01m | Max:  1h 02m | Hits:  25%/3562  
    🟩 cxx_family
      🟩 Clang              Pass: 100%/17  | Total:  8h 08m | Avg: 28m 43s | Max: 35m 20s | Hits:  56%/30277 
      🟩 GCC                Pass: 100%/19  | Total:  9h 23m | Avg: 29m 38s | Max: 37m 51s | Hits:  58%/33858 
      🟩 MSVC               Pass: 100%/5   | Total:  4h 53m | Avg: 58m 39s | Max:  1h 10m | Hits:  35%/8875  
      🟩 NVHPC              Pass: 100%/2   | Total:  2h 03m | Avg:  1h 01m | Max:  1h 02m | Hits:  25%/3562  
    🟩 gpu
      🟩 rtx2080            Pass: 100%/33  | Total: 20h 18m | Avg: 36m 54s | Max:  1h 04m | Hits:  47%/58769 
      🟩 rtx4090            Pass: 100%/10  | Total:  4h 10m | Avg: 25m 02s | Max:  1h 10m | Hits:  74%/17803 
    🟩 jobs
      🟩 Build              Pass: 100%/37  | Total: 23h 07m | Avg: 37m 29s | Max:  1h 10m | Hits:  46%/65889 
      🟩 TestCPU            Pass: 100%/3   | Total: 49m 23s | Avg: 16m 27s | Max: 34m 39s | Hits:  90%/5338  
      🟩 TestGPU            Pass: 100%/3   | Total: 31m 58s | Avg: 10m 39s | Max: 11m 19s | Hits:  99%/5345  
    🟩 sm
      🟩 90;90a;100         Pass: 100%/1   | Total: 37m 51s | Avg: 37m 51s | Max: 37m 51s | Hits:  50%/1782  
    🟩 std
      🟩 17                 Pass: 100%/20  | Total: 12h 44m | Avg: 38m 12s | Max:  1h 04m | Hits:  46%/35611 
      🟩 20                 Pass: 100%/21  | Total: 11h 06m | Avg: 31m 45s | Max:  1h 10m | Hits:  57%/37397 
    
  • 🟩 cudax: Pass: 100%/20 | Total: 1h 50m | Avg: 5m 32s | Max: 12m 19s | Hits: 96%/10080

    🟩 cpu
      🟩 amd64              Pass: 100%/16  | Total:  1h 36m | Avg:  6m 00s | Max: 12m 19s | Hits:  95%/7868  
      🟩 arm64              Pass: 100%/4   | Total: 14m 46s | Avg:  3m 41s | Max:  3m 49s | Hits:  98%/2212  
    🟩 ctk
      🟩 12.0               Pass: 100%/1   | Total:  9m 57s | Avg:  9m 57s | Max:  9m 57s | Hits:  60%/261   
      🟩 12.5               Pass: 100%/2   | Total: 12m 00s | Avg:  6m 00s | Max:  6m 12s | Hits:  95%/706   
      🟩 12.8               Pass: 100%/17  | Total:  1h 28m | Avg:  5m 13s | Max: 12m 19s | Hits:  97%/9113  
    🟩 cudacxx
      🟩 nvcc12.0           Pass: 100%/1   | Total:  9m 57s | Avg:  9m 57s | Max:  9m 57s | Hits:  60%/261   
      🟩 nvcc12.5           Pass: 100%/2   | Total: 12m 00s | Avg:  6m 00s | Max:  6m 12s | Hits:  95%/706   
      🟩 nvcc12.8           Pass: 100%/17  | Total:  1h 28m | Avg:  5m 13s | Max: 12m 19s | Hits:  97%/9113  
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/20  | Total:  1h 50m | Avg:  5m 32s | Max: 12m 19s | Hits:  96%/10080 
    🟩 cxx
      🟩 Clang14            Pass: 100%/1   | Total:  4m 01s | Avg:  4m 01s | Max:  4m 01s | Hits:  98%/555   
      🟩 Clang15            Pass: 100%/1   | Total:  4m 00s | Avg:  4m 00s | Max:  4m 00s | Hits:  98%/553   
      🟩 Clang16            Pass: 100%/1   | Total:  4m 26s | Avg:  4m 26s | Max:  4m 26s | Hits:  98%/553   
      🟩 Clang17            Pass: 100%/1   | Total:  4m 15s | Avg:  4m 15s | Max:  4m 15s | Hits:  98%/553   
      🟩 Clang18            Pass: 100%/4   | Total: 23m 29s | Avg:  5m 52s | Max: 12m 19s | Hits:  98%/2212  
      🟩 GCC10              Pass: 100%/1   | Total:  3m 42s | Avg:  3m 42s | Max:  3m 42s | Hits:  98%/555   
      🟩 GCC11              Pass: 100%/1   | Total:  4m 07s | Avg:  4m 07s | Max:  4m 07s | Hits:  98%/553   
      🟩 GCC12              Pass: 100%/2   | Total: 16m 34s | Avg:  8m 17s | Max: 12m 03s | Hits:  98%/1106  
      🟩 GCC13              Pass: 100%/4   | Total: 14m 38s | Avg:  3m 39s | Max:  3m 49s | Hits:  98%/2212  
      🟩 MSVC14.39          Pass: 100%/1   | Total:  9m 57s | Avg:  9m 57s | Max:  9m 57s | Hits:  60%/261   
      🟩 MSVC14.42          Pass: 100%/1   | Total:  9m 41s | Avg:  9m 41s | Max:  9m 41s | Hits:  60%/261   
      🟩 NVHPC24.7          Pass: 100%/2   | Total: 12m 00s | Avg:  6m 00s | Max:  6m 12s | Hits:  95%/706   
    🟩 cxx_family
      🟩 Clang              Pass: 100%/8   | Total: 40m 11s | Avg:  5m 01s | Max: 12m 19s | Hits:  98%/4426  
      🟩 GCC                Pass: 100%/8   | Total: 39m 01s | Avg:  4m 52s | Max: 12m 03s | Hits:  98%/4426  
      🟩 MSVC               Pass: 100%/2   | Total: 19m 38s | Avg:  9m 49s | Max:  9m 57s | Hits:  60%/522   
      🟩 NVHPC              Pass: 100%/2   | Total: 12m 00s | Avg:  6m 00s | Max:  6m 12s | Hits:  95%/706   
    🟩 gpu
      🟩 rtx2080            Pass: 100%/20  | Total:  1h 50m | Avg:  5m 32s | Max: 12m 19s | Hits:  96%/10080 
    🟩 jobs
      🟩 Build              Pass: 100%/18  | Total:  1h 26m | Avg:  4m 48s | Max:  9m 57s | Hits:  95%/8974  
      🟩 Test               Pass: 100%/2   | Total: 24m 22s | Avg: 12m 11s | Max: 12m 19s | Hits:  99%/1106  
    🟩 sm
      🟩 90                 Pass: 100%/1   | Total:  3m 30s | Avg:  3m 30s | Max:  3m 30s | Hits:  98%/553   
      🟩 90a                Pass: 100%/1   | Total:  3m 34s | Avg:  3m 34s | Max:  3m 34s | Hits:  98%/553   
    🟩 std
      🟩 17                 Pass: 100%/4   | Total: 16m 44s | Avg:  4m 11s | Max:  5m 48s | Hits:  97%/2012  
      🟩 20                 Pass: 100%/16  | Total:  1h 34m | Avg:  5m 52s | Max: 12m 19s | Hits:  96%/8068  
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 11m 17s | Avg: 5m 38s | Max: 8m 46s | Hits: 97%/288

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total: 11m 17s | Avg:  5m 38s | Max:  8m 46s | Hits:  97%/288   
    🟩 ctk
      🟩 12.8               Pass: 100%/2   | Total: 11m 17s | Avg:  5m 38s | Max:  8m 46s | Hits:  97%/288   
    🟩 cudacxx
      🟩 nvcc12.8           Pass: 100%/2   | Total: 11m 17s | Avg:  5m 38s | Max:  8m 46s | Hits:  97%/288   
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total: 11m 17s | Avg:  5m 38s | Max:  8m 46s | Hits:  97%/288   
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total: 11m 17s | Avg:  5m 38s | Max:  8m 46s | Hits:  97%/288   
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total: 11m 17s | Avg:  5m 38s | Max:  8m 46s | Hits:  97%/288   
    🟩 gpu
      🟩 rtx2080            Pass: 100%/2   | Total: 11m 17s | Avg:  5m 38s | Max:  8m 46s | Hits:  97%/288   
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 31s | Avg:  2m 31s | Max:  2m 31s | Hits:  96%/144   
      🟩 Test               Pass: 100%/1   | Total:  8m 46s | Avg:  8m 46s | Max:  8m 46s | Hits:  98%/144   
    
  • 🟩 python: Pass: 100%/1 | Total: 28m 20s | Avg: 28m 20s | Max: 28m 20s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 28m 20s | Avg: 28m 20s | Max: 28m 20s
    🟩 ctk
      🟩 12.8               Pass: 100%/1   | Total: 28m 20s | Avg: 28m 20s | Max: 28m 20s
    🟩 cudacxx
      🟩 nvcc12.8           Pass: 100%/1   | Total: 28m 20s | Avg: 28m 20s | Max: 28m 20s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 28m 20s | Avg: 28m 20s | Max: 28m 20s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 28m 20s | Avg: 28m 20s | Max: 28m 20s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 28m 20s | Avg: 28m 20s | Max: 28m 20s
    🟩 gpu
      🟩 rtx2080            Pass: 100%/1   | Total: 28m 20s | Avg: 28m 20s | Max: 28m 20s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 28m 20s | Avg: 28m 20s | Max: 28m 20s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
+/- libcu++
CUB
Thrust
CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
+/- libcu++
+/- CUB
+/- Thrust
+/- CUDA Experimental
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 151)

# Runner
108 linux-amd64-cpu16
15 windows-amd64-cpu16
10 linux-arm64-cpu16
8 linux-amd64-gpu-rtx2080-latest-1
6 linux-amd64-gpu-rtxa6000-latest-1
3 linux-amd64-gpu-rtx4090-latest-1
1 linux-amd64-gpu-h100-latest-1

@davebayer
Contributor

I've already implemented saturation arithmetic in #3449; there are just some compiler issues I haven't resolved yet.

However, the behaviour is not equivalent: saturation arithmetic just clamps the result to the [TYPE_MIN, TYPE_MAX] range.

If you need the overflow flag as a result, you may check out the implementation; there are some clever ways to optimize the behaviour on device using min or max instructions.
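To make the difference between the two behaviours concrete, here is a small host-only sketch. It is not the code from #3449: both helper names are hypothetical, only `int32_t` is handled, and the sum is computed in 64 bits so it is always exact.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: clamps the exact sum into the int32_t range.
constexpr std::int32_t saturating_add_sketch(std::int32_t a, std::int32_t b) noexcept
{
  constexpr std::int64_t lo = std::numeric_limits<std::int32_t>::min();
  constexpr std::int64_t hi = std::numeric_limits<std::int32_t>::max();
  const std::int64_t sum    = static_cast<std::int64_t>(a) + b;
  return static_cast<std::int32_t>(sum < lo ? lo : (sum > hi ? hi : sum));
}

// Hypothetical helper: reports whether the exact sum leaves the int32_t range.
constexpr bool add_overflows_sketch(std::int32_t a, std::int32_t b) noexcept
{
  constexpr std::int64_t lo = std::numeric_limits<std::int32_t>::min();
  constexpr std::int64_t hi = std::numeric_limits<std::int32_t>::max();
  const std::int64_t sum    = static_cast<std::int64_t>(a) + b;
  return sum < lo || sum > hi;
}
```

For `INT32_MAX + 1` the saturating version returns `INT32_MAX` and gives no hint that clamping happened, while the overflow check returns `true`.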

@fbusato
Contributor Author

fbusato commented Feb 8, 2025

thanks, @davebayer. Indeed, I was going to ask you to take a look at this PR. I will check whether I can drop the current one if it is redundant with saturation arithmetic.

@davebayer
Contributor

> thanks, @davebayer. Indeed, I was going to ask you to take a look at this PR. I will check whether I can drop the current one if it is redundant with saturation arithmetic.

In my opinion having op_overflow functions can be useful in many cases when saturating the result is not exactly what we want. I am just unsure whether it is something we want to expose or just keep it for internal use.

@fbusato
Contributor Author

fbusato commented Feb 10, 2025

let me summarize the differences:

  • Saturation arithmetic has no concept of an overflow flag, which is very useful in practice, especially for debugging
  • Secondly, the new functions accept any combination of types

The main open question that I have is whether we want the same semantics as the intrinsics. This would make the implementation more complex without a clear benefit IMO (but I could be wrong)

@davebayer
Contributor

davebayer commented Feb 10, 2025

> • Secondly, the new functions accept any combination of types

I am against this. I think the user should be consistent with the types passed to the op_overflow function, so that it is clear which type the overflow is checked for. Consider this example:

int16_t fn(int16_t x)
{
  auto [result, overflow] = cuda::add_overflow(x, 10);
  if (overflow)
  {
    throw std::runtime_error("Error");
  }
  return result;
}

The user clearly wants to check against int16_t overflow; however, the common type of the two inputs is int. So the overflow flag will be set only if the result exceeds the int range. The result will then be silently converted to the returned int16_t, and a bug is introduced.
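The pitfall can be checked with a short host-only sketch. The helper names are hypothetical and stand in for a check performed in the common type; the sketch assumes `int` is narrower than 64 bits so the exact sum fits in `int64_t`.

```cpp
#include <cstdint>
#include <limits>
#include <type_traits>

// The common type of int16_t and int is int, not int16_t.
using common_t = std::common_type_t<std::int16_t, int>;
static_assert(std::is_same_v<common_t, int>, "int16_t and int yield int");
static_assert(sizeof(int) < sizeof(std::int64_t), "sketch assumes int is narrower than 64 bits");

// Hypothetical check done in the common type: true only if x + y leaves the int range.
constexpr bool overflows_in_common_type(std::int16_t x, int y) noexcept
{
  const std::int64_t sum = static_cast<std::int64_t>(x) + y;
  return sum < std::numeric_limits<common_t>::min() || sum > std::numeric_limits<common_t>::max();
}

// Hypothetical check for what the caller actually cares about: the int16_t range.
constexpr bool fits_in_int16(std::int64_t v) noexcept
{
  return v >= std::numeric_limits<std::int16_t>::min() && v <= std::numeric_limits<std::int16_t>::max();
}
```

With `x = INT16_MAX` and `y = 10`, `overflows_in_common_type` reports no overflow, yet the sum does not fit in `int16_t`, which is exactly the silent narrowing described above.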

> The main open question that I have is whether we want the same semantics as the intrinsics. This would make the implementation more complex without a clear benefit IMO (but I could be wrong)

I would follow the __builtin_op_overflow definition.

namespace cuda
{

template <class _Tp>
struct op_overflow_result
{
  _Tp  value;
  bool overflow;
};

template <class _Tp>
op_overflow_result<_Tp> op_overflow(_Tp __lhs, _Tp __rhs)
{
  op_overflow_result<_Tp> __ret;
  __ret.overflow = __builtin_op_overflow(__lhs, __rhs, &__ret.value);
  return __ret;
}

} // namespace cuda
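In the snippet above, `op` is a placeholder for the operation name. A concrete instance for addition could look like the following sketch, assuming a GCC- or Clang-compatible host compiler that provides `__builtin_add_overflow`. The names `add_overflow_result` and `add_overflow_sketch` are hypothetical and are placed outside the `cuda` namespace on purpose.

```cpp
// Hypothetical result type mirroring op_overflow_result from the comment above.
template <class T>
struct add_overflow_result
{
  T value;       // wrapped result, as stored by the builtin
  bool overflow; // true if the exact sum does not fit in T
};

// Hypothetical wrapper: both operands and the result share the single type T,
// so the overflow check is performed against T and nothing wider.
template <class T>
constexpr add_overflow_result<T> add_overflow_sketch(T lhs, T rhs) noexcept
{
  add_overflow_result<T> ret{};
  ret.overflow = __builtin_add_overflow(lhs, rhs, &ret.value);
  return ret;
}
```

Calling `add_overflow_sketch<short>(32767, 1)` sets `overflow` because the check is done against `short`, which is the behaviour argued for in this comment.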

@fbusato
Contributor Author

fbusato commented Feb 11, 2025

Based on internal discussion and current CUB use cases (https://github.com/NVIDIA/cccl/blob/main/cub/cub/agent/agent_reduce.cuh#L424 and https://github.com/NVIDIA/cccl/blob/main/cub/cub/device/dispatch/dispatch_histogram.cuh#L801), the functions will only check whether an operation is valid, without providing the result. This is not redundant with the actual computation.
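A check-only function of this kind can be sketched for multiplication as follows. This is an illustration only, not the PR's code and not the logic at the linked CUB lines: the name `is_mul_overflow_sketch` is hypothetical, and the sketch is restricted to a single unsigned type to keep the bounds test short.

```cpp
#include <limits>
#include <type_traits>

// Hypothetical sketch: reports whether a * b would exceed the maximum of T.
// The product is never formed; the bound is tested with a division instead.
template <typename T>
[[nodiscard]] constexpr bool is_mul_overflow_sketch(T a, T b) noexcept
{
  static_assert(std::is_unsigned_v<T>, "sketch handles unsigned types only");
  return b != 0 && a > std::numeric_limits<T>::max() / b;
}
```

A caller would typically use it as a precondition, for example in an assertion placed before the multiplication itself.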

@fbusato
Contributor Author

fbusato commented Feb 12, 2025

/ok to test

@github-actions
Contributor

🟨 CI finished in 2h 48m: Pass: 93%/151 | Total: 3d 00h | Avg: 28m 47s | Max: 1h 19m | Hits: 63%/213614
  • 🟨 libcudacxx: Pass: 78%/41 | Total: 5h 20m | Avg: 7m 49s | Max: 26m 23s | Hits: 93%/75398

    🔍 cudacxx_family: nvcc 🔍
      🟩 ClangCUDA          Pass: 100%/2   | Total: 39m 44s | Avg: 19m 52s | Max: 20m 47s | Hits:  26%/5589  
      🔍 nvcc               Pass:  76%/39  | Total:  4h 41m | Avg:  7m 12s | Max: 26m 23s | Hits:  98%/69809 
    🔍 jobs: Build 🔍
      🔍 Build              Pass:  75%/36  | Total:  4h 31m | Avg:  7m 31s | Max: 26m 23s | Hits:  93%/75358 
      🟩 NVRTC              Pass: 100%/2   | Total: 30m 04s | Avg: 15m 02s | Max: 15m 12s | Hits:  90%/40    
      🟩 Test               Pass: 100%/2   | Total: 17m 25s | Avg:  8m 42s | Max:  8m 46s
      🟩 VerifyCodegen      Pass: 100%/1   | Total:  2m 11s | Avg:  2m 11s | Max:  2m 11s
    🟨 ctk
      🟨 12.0               Pass:  40%/5   | Total: 41m 33s | Avg:  8m 18s | Max: 26m 23s | Hits:  98%/5561  
      🟩 12.5               Pass: 100%/2   | Total: 17m 43s | Avg:  8m 51s | Max:  9m 07s | Hits:  98%/5569  
      🟨 12.8               Pass:  82%/34  | Total:  4h 21m | Avg:  7m 41s | Max: 25m 03s | Hits:  92%/64268 
    🟨 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 39m 44s | Avg: 19m 52s | Max: 20m 47s | Hits:  26%/5589  
      🟨 nvcc12.0           Pass:  40%/5   | Total: 41m 33s | Avg:  8m 18s | Max: 26m 23s | Hits:  98%/5561  
      🟩 nvcc12.5           Pass: 100%/2   | Total: 17m 43s | Avg:  8m 51s | Max:  9m 07s | Hits:  98%/5569  
      🟨 nvcc12.8           Pass:  81%/32  | Total:  3h 41m | Avg:  6m 55s | Max: 25m 03s | Hits:  98%/58679 
    🟨 cxx
      🟩 Clang14            Pass: 100%/4   | Total: 16m 40s | Avg:  4m 10s | Max:  4m 15s | Hits:  99%/11142 
      🟩 Clang15            Pass: 100%/2   | Total:  8m 25s | Avg:  4m 12s | Max:  4m 19s | Hits:  99%/5581  
      🟩 Clang16            Pass: 100%/2   | Total:  9m 03s | Avg:  4m 31s | Max:  4m 45s | Hits:  99%/5581  
      🟩 Clang17            Pass: 100%/2   | Total:  9m 25s | Avg:  4m 42s | Max:  4m 43s | Hits:  99%/5581  
      🟩 Clang18            Pass: 100%/6   | Total:  1h 00m | Avg: 10m 05s | Max: 20m 47s | Hits:  70%/13982 
      🟥 GCC7               Pass:   0%/2   | Total:  7m 14s | Avg:  3m 37s | Max:  3m 48s
      🟥 GCC8               Pass:   0%/1   | Total:  3m 38s | Avg:  3m 38s | Max:  3m 38s
      🟥 GCC9               Pass:   0%/2   | Total:  7m 36s | Avg:  3m 48s | Max:  4m 04s
      🟩 GCC10              Pass: 100%/2   | Total:  8m 05s | Avg:  4m 02s | Max:  4m 04s | Hits:  98%/5587  
      🟩 GCC11              Pass: 100%/2   | Total:  8m 06s | Avg:  4m 03s | Max:  4m 15s | Hits:  98%/5583  
      🟩 GCC12              Pass: 100%/2   | Total:  8m 07s | Avg:  4m 03s | Max:  4m 12s | Hits:  99%/5583  
      🟨 GCC13              Pass:  87%/8   | Total: 59m 40s | Avg:  7m 27s | Max: 15m 12s | Hits:  95%/8525  
      🟥 MSVC14.29          Pass:   0%/2   | Total: 48m 31s | Avg: 24m 15s | Max: 26m 23s
      🟨 MSVC14.42          Pass:  50%/2   | Total: 48m 03s | Avg: 24m 01s | Max: 25m 03s | Hits:  98%/2684  
      🟩 NVHPC24.7          Pass: 100%/2   | Total: 17m 43s | Avg:  8m 51s | Max:  9m 07s | Hits:  98%/5569  
    🟨 cxx_family
      🟩 Clang              Pass: 100%/16  | Total:  1h 44m | Avg:  6m 30s | Max: 20m 47s | Hits:  89%/41867 
      🟨 GCC                Pass:  68%/19  | Total:  1h 42m | Avg:  5m 23s | Max: 15m 12s | Hits:  97%/25278 
      🟨 MSVC               Pass:  25%/4   | Total:  1h 36m | Avg: 24m 08s | Max: 26m 23s | Hits:  98%/2684  
      🟩 NVHPC              Pass: 100%/2   | Total: 17m 43s | Avg:  8m 51s | Max:  9m 07s | Hits:  98%/5569  
    🟨 gpu
      🟨 rtx2080            Pass:  78%/41  | Total:  5h 20m | Avg:  7m 49s | Max: 26m 23s | Hits:  93%/75398 
    🟨 cpu
      🟨 amd64              Pass:  79%/39  | Total:  5h 16m | Avg:  8m 06s | Max: 26m 23s | Hits:  92%/72586 
      🟨 arm64              Pass:  50%/2   | Total:  4m 44s | Avg:  2m 22s | Max:  3m 47s | Hits:  98%/2812  
    🟩 sm
      🟩 75                 Pass: 100%/2   | Total: 30m 04s | Avg: 15m 02s | Max: 15m 12s | Hits:  90%/40    
      🟩 90;90a;100         Pass: 100%/1   | Total:  9m 28s | Avg:  9m 28s | Max:  9m 28s | Hits:  88%/2902  
    🟨 std
      🟨 17                 Pass:  61%/21  | Total:  2h 54m | Avg:  8m 18s | Max: 26m 23s | Hits:  93%/33242 
      🟨 20                 Pass:  94%/19  | Total:  2h 24m | Avg:  7m 35s | Max: 25m 03s | Hits:  93%/42156 
    
  • 🟨 cub: Pass: 97%/44 | Total: 1d 16h | Avg: 55m 06s | Max: 1h 19m | Hits: 29%/51276

    🔍 cpu: arm64 🔍
      🟩 amd64              Pass: 100%/42  | Total:  1d 15h | Avg: 56m 11s | Max:  1h 19m | Hits:  30%/50056 
      🔍 arm64              Pass:  50%/2   | Total:  1h 05m | Avg: 32m 36s | Max:  1h 03m | Hits:  16%/1220  
    🔍 ctk: 12.8 🔍
      🟩 12.0               Pass: 100%/5   | Total:  5h 11m | Avg:  1h 02m | Max:  1h 06m | Hits:  15%/5934  
      🟩 12.5               Pass: 100%/2   | Total:  2h 22m | Avg:  1h 11m | Max:  1h 11m | Hits:  12%/2258  
      🔍 12.8               Pass:  97%/37  | Total:  1d 08h | Avg: 53m 16s | Max:  1h 19m | Hits:  32%/43084 
    🔍 cudacxx: nvcc12.8 🔍
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  2h 10m | Avg:  1h 05m | Max:  1h 09m | Hits:  15%/2112  
      🟩 nvcc12.0           Pass: 100%/5   | Total:  5h 11m | Avg:  1h 02m | Max:  1h 06m | Hits:  15%/5934  
      🟩 nvcc12.5           Pass: 100%/2   | Total:  2h 22m | Avg:  1h 11m | Max:  1h 11m | Hits:  12%/2258  
      🔍 nvcc12.8           Pass:  97%/35  | Total:  1d 06h | Avg: 52m 34s | Max:  1h 19m | Hits:  33%/40972 
    🔍 cudacxx_family: nvcc 🔍
      🟩 ClangCUDA          Pass: 100%/2   | Total:  2h 10m | Avg:  1h 05m | Max:  1h 09m | Hits:  15%/2112  
      🔍 nvcc               Pass:  97%/42  | Total:  1d 14h | Avg: 54m 37s | Max:  1h 19m | Hits:  30%/49164 
    🔍 cxx: Clang18 🔍
      🟩 Clang14            Pass: 100%/4   | Total:  4h 09m | Avg:  1h 02m | Max:  1h 05m | Hits:  16%/4888  
      🟩 Clang15            Pass: 100%/2   | Total:  2h 06m | Avg:  1h 03m | Max:  1h 04m | Hits:  16%/2440  
      🟩 Clang16            Pass: 100%/2   | Total:  1h 58m | Avg: 59m 19s | Max:  1h 01m | Hits:  16%/2440  
      🟩 Clang17            Pass: 100%/2   | Total:  1h 59m | Avg: 59m 57s | Max:  1h 01m | Hits:  16%/2440  
      🔍 Clang18            Pass:  85%/7   | Total:  4h 54m | Avg: 42m 06s | Max:  1h 09m | Hits:  45%/6992  
      🟩 GCC7               Pass: 100%/2   | Total:  2h 00m | Avg:  1h 00m | Max:  1h 03m | Hits:  16%/2444  
      🟩 GCC8               Pass: 100%/1   | Total:  1h 01m | Avg:  1h 01m | Max:  1h 01m | Hits:  16%/1222  
      🟩 GCC9               Pass: 100%/2   | Total:  2h 04m | Avg:  1h 02m | Max:  1h 06m | Hits:  16%/2444  
      🟩 GCC10              Pass: 100%/2   | Total:  2h 09m | Avg:  1h 04m | Max:  1h 07m | Hits:  16%/2444  
      🟩 GCC11              Pass: 100%/2   | Total:  1h 59m | Avg: 59m 51s | Max:  1h 02m | Hits:  16%/2440  
      🟩 GCC12              Pass: 100%/2   | Total:  2h 01m | Avg:  1h 00m | Max:  1h 02m | Hits:  16%/2440  
      🟩 GCC13              Pass: 100%/10  | Total:  6h 43m | Avg: 40m 22s | Max:  1h 17m | Hits:  57%/12200 
      🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 22m | Avg:  1h 11m | Max:  1h 19m | Hits:  12%/2092  
      🟩 MSVC14.42          Pass: 100%/2   | Total:  2h 29m | Avg:  1h 14m | Max:  1h 17m | Hits:  12%/2092  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  2h 22m | Avg:  1h 11m | Max:  1h 11m | Hits:  12%/2258  
    🔍 cxx_family: Clang 🔍
      🔍 Clang              Pass:  94%/17  | Total: 15h 09m | Avg: 53m 30s | Max:  1h 09m | Hits:  27%/19200 
      🟩 GCC                Pass: 100%/21  | Total: 18h 00m | Avg: 51m 28s | Max:  1h 17m | Hits:  36%/25634 
      🟩 MSVC               Pass: 100%/4   | Total:  4h 51m | Avg:  1h 12m | Max:  1h 19m | Hits:  12%/4184  
      🟩 NVHPC              Pass: 100%/2   | Total:  2h 22m | Avg:  1h 11m | Max:  1h 11m | Hits:  12%/2258  
    🔍 gpu: rtx2080 🔍
      🟩 h100               Pass: 100%/2   | Total: 52m 49s | Avg: 26m 24s | Max: 28m 39s | Hits:  57%/2440  
      🔍 rtx2080            Pass:  97%/34  | Total:  1d 11h | Avg:  1h 02m | Max:  1h 19m | Hits:  15%/39076 
      🟩 rtxa6000           Pass: 100%/8   | Total:  4h 11m | Avg: 31m 26s | Max:  1h 04m | Hits:  78%/9760  
    🔍 jobs: Build 🔍
      🔍 Build              Pass:  97%/37  | Total:  1d 13h | Avg:  1h 01m | Max:  1h 19m | Hits:  15%/42736 
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 21m 10s | Avg: 21m 10s | Max: 21m 10s | Hits:  99%/1220  
      🟩 GraphCapture       Pass: 100%/1   | Total: 17m 23s | Avg: 17m 23s | Max: 17m 23s | Hits:  99%/1220  
      🟩 HostLaunch         Pass: 100%/3   | Total:  1h 11m | Avg: 23m 40s | Max: 24m 10s | Hits:  99%/3660  
      🟩 TestGPU            Pass: 100%/2   | Total: 43m 24s | Avg: 21m 42s | Max: 21m 54s | Hits:  99%/2440  
    🔍 std: 20 🔍
      🟩 17                 Pass: 100%/20  | Total: 20h 59m | Avg:  1h 02m | Max:  1h 19m | Hits:  15%/23639 
      🔍 20                 Pass:  95%/24  | Total: 19h 25m | Avg: 48m 34s | Max:  1h 17m | Hits:  41%/27637 
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 52m 49s | Avg: 26m 24s | Max: 28m 39s | Hits:  57%/2440  
      🟩 90;90a;100         Pass: 100%/1   | Total:  1h 17m | Avg:  1h 17m | Max:  1h 17m | Hits:  16%/1220  
    
  • 🟩 thrust: Pass: 100%/43 | Total: 1d 00h | Avg: 33m 46s | Max: 1h 04m | Hits: 52%/76572

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total:  1h 04m | Avg: 32m 28s | Max: 35m 39s | Hits:  48%/3564  
    🟩 cpu
      🟩 amd64              Pass: 100%/41  | Total: 23h 11m | Avg: 33m 56s | Max:  1h 04m | Hits:  53%/73009 
      🟩 arm64              Pass: 100%/2   | Total:  1h 00m | Avg: 30m 12s | Max: 32m 01s | Hits:  47%/3563  
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  3h 01m | Avg: 36m 13s | Max: 57m 22s | Hits:  46%/8901  
      🟩 12.5               Pass: 100%/2   | Total:  2h 01m | Avg:  1h 00m | Max:  1h 00m | Hits:  25%/3562  
      🟩 12.8               Pass: 100%/36  | Total: 19h 09m | Avg: 31m 56s | Max:  1h 04m | Hits:  55%/64109 
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 53m 39s | Avg: 26m 49s | Max: 27m 27s | Hits:  47%/3562  
      🟩 nvcc12.0           Pass: 100%/5   | Total:  3h 01m | Avg: 36m 13s | Max: 57m 22s | Hits:  46%/8901  
      🟩 nvcc12.5           Pass: 100%/2   | Total:  2h 01m | Avg:  1h 00m | Max:  1h 00m | Hits:  25%/3562  
      🟩 nvcc12.8           Pass: 100%/34  | Total: 18h 16m | Avg: 32m 14s | Max:  1h 04m | Hits:  55%/60547 
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 53m 39s | Avg: 26m 49s | Max: 27m 27s | Hits:  47%/3562  
      🟩 nvcc               Pass: 100%/41  | Total: 23h 18m | Avg: 34m 06s | Max:  1h 04m | Hits:  53%/73010 
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total:  2h 06m | Avg: 31m 38s | Max: 31m 52s | Hits:  58%/7124  
      🟩 Clang15            Pass: 100%/2   | Total:  1h 03m | Avg: 31m 35s | Max: 32m 58s | Hits:  47%/3562  
      🟩 Clang16            Pass: 100%/2   | Total:  1h 03m | Avg: 31m 57s | Max: 32m 25s | Hits:  47%/3562  
      🟩 Clang17            Pass: 100%/2   | Total:  1h 07m | Avg: 33m 39s | Max: 34m 38s | Hits:  47%/3562  
      🟩 Clang18            Pass: 100%/7   | Total:  2h 45m | Avg: 23m 38s | Max: 35m 05s | Hits:  64%/12467 
      🟩 GCC7               Pass: 100%/2   | Total: 59m 06s | Avg: 29m 33s | Max: 29m 38s | Hits:  57%/3564  
      🟩 GCC8               Pass: 100%/1   | Total: 32m 24s | Avg: 32m 24s | Max: 32m 24s | Hits:  47%/1782  
      🟩 GCC9               Pass: 100%/2   | Total:  1h 02m | Avg: 31m 06s | Max: 31m 24s | Hits:  57%/3564  
      🟩 GCC10              Pass: 100%/2   | Total:  1h 05m | Avg: 32m 40s | Max: 32m 52s | Hits:  47%/3564  
      🟩 GCC11              Pass: 100%/2   | Total:  1h 11m | Avg: 35m 30s | Max: 36m 00s | Hits:  47%/3564  
      🟩 GCC12              Pass: 100%/2   | Total:  1h 10m | Avg: 35m 18s | Max: 35m 40s | Hits:  47%/3564  
      🟩 GCC13              Pass: 100%/8   | Total:  3h 32m | Avg: 26m 37s | Max: 35m 39s | Hits:  63%/14256 
      🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 54m | Avg: 57m 17s | Max: 57m 22s | Hits:  31%/3550  
      🟩 MSVC14.42          Pass: 100%/3   | Total:  2h 36m | Avg: 52m 05s | Max:  1h 04m | Hits:  38%/5325  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  2h 01m | Avg:  1h 00m | Max:  1h 00m | Hits:  25%/3562  
    🟩 cxx_family
      🟩 Clang              Pass: 100%/17  | Total:  8h 06m | Avg: 28m 36s | Max: 35m 05s | Hits:  56%/30277 
      🟩 GCC                Pass: 100%/19  | Total:  9h 33m | Avg: 30m 11s | Max: 36m 00s | Hits:  56%/33858 
      🟩 MSVC               Pass: 100%/5   | Total:  4h 30m | Avg: 54m 10s | Max:  1h 04m | Hits:  35%/8875  
      🟩 NVHPC              Pass: 100%/2   | Total:  2h 01m | Avg:  1h 00m | Max:  1h 00m | Hits:  25%/3562  
    🟩 gpu
      🟩 rtx2080            Pass: 100%/33  | Total: 19h 54m | Avg: 36m 11s | Max:  1h 02m | Hits:  47%/58769 
      🟩 rtx4090            Pass: 100%/10  | Total:  4h 17m | Avg: 25m 45s | Max:  1h 04m | Hits:  71%/17803 
    🟩 jobs
      🟩 Build              Pass: 100%/37  | Total: 22h 30m | Avg: 36m 30s | Max:  1h 04m | Hits:  47%/65889 
      🟩 TestCPU            Pass: 100%/3   | Total: 44m 19s | Avg: 14m 46s | Max: 29m 39s | Hits:  90%/5338  
      🟩 TestGPU            Pass: 100%/3   | Total: 57m 12s | Avg: 19m 04s | Max: 35m 39s | Hits:  83%/5345  
    🟩 sm
      🟩 90;90a;100         Pass: 100%/1   | Total: 32m 52s | Avg: 32m 52s | Max: 32m 52s | Hits:  47%/1782  
    🟩 std
      🟩 17                 Pass: 100%/20  | Total: 12h 29m | Avg: 37m 29s | Max:  1h 02m | Hits:  46%/35611 
      🟩 20                 Pass: 100%/21  | Total: 10h 37m | Avg: 30m 20s | Max:  1h 04m | Hits:  58%/37397 
    
  • 🟩 cudax: Pass: 100%/20 | Total: 1h 49m | Avg: 5m 27s | Max: 11m 54s | Hits: 96%/10080

    🟩 cpu
      🟩 amd64              Pass: 100%/16  | Total:  1h 33m | Avg:  5m 50s | Max: 11m 54s | Hits:  95%/7868  
      🟩 arm64              Pass: 100%/4   | Total: 15m 35s | Avg:  3m 53s | Max:  4m 14s | Hits:  98%/2212  
    🟩 ctk
      🟩 12.0               Pass: 100%/1   | Total:  9m 14s | Avg:  9m 14s | Max:  9m 14s | Hits:  60%/261   
      🟩 12.5               Pass: 100%/2   | Total: 12m 02s | Avg:  6m 01s | Max:  6m 03s | Hits:  95%/706   
      🟩 12.8               Pass: 100%/17  | Total:  1h 27m | Avg:  5m 09s | Max: 11m 54s | Hits:  97%/9113  
    🟩 cudacxx
      🟩 nvcc12.0           Pass: 100%/1   | Total:  9m 14s | Avg:  9m 14s | Max:  9m 14s | Hits:  60%/261   
      🟩 nvcc12.5           Pass: 100%/2   | Total: 12m 02s | Avg:  6m 01s | Max:  6m 03s | Hits:  95%/706   
      🟩 nvcc12.8           Pass: 100%/17  | Total:  1h 27m | Avg:  5m 09s | Max: 11m 54s | Hits:  97%/9113  
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/20  | Total:  1h 49m | Avg:  5m 27s | Max: 11m 54s | Hits:  96%/10080 
    🟩 cxx
      🟩 Clang14            Pass: 100%/1   | Total:  3m 53s | Avg:  3m 53s | Max:  3m 53s | Hits:  98%/555   
      🟩 Clang15            Pass: 100%/1   | Total:  4m 01s | Avg:  4m 01s | Max:  4m 01s | Hits:  98%/553   
      🟩 Clang16            Pass: 100%/1   | Total:  4m 09s | Avg:  4m 09s | Max:  4m 09s | Hits:  98%/553   
      🟩 Clang17            Pass: 100%/1   | Total:  4m 16s | Avg:  4m 16s | Max:  4m 16s | Hits:  98%/553   
      🟩 Clang18            Pass: 100%/4   | Total: 23m 54s | Avg:  5m 58s | Max: 11m 54s | Hits:  98%/2212  
      🟩 GCC10              Pass: 100%/1   | Total:  3m 43s | Avg:  3m 43s | Max:  3m 43s | Hits:  98%/555   
      🟩 GCC11              Pass: 100%/1   | Total:  4m 04s | Avg:  4m 04s | Max:  4m 04s | Hits:  98%/553   
      🟩 GCC12              Pass: 100%/2   | Total: 15m 45s | Avg:  7m 52s | Max: 11m 38s | Hits:  98%/1106  
      🟩 GCC13              Pass: 100%/4   | Total: 14m 42s | Avg:  3m 40s | Max:  3m 52s | Hits:  98%/2212  
      🟩 MSVC14.39          Pass: 100%/1   | Total:  9m 14s | Avg:  9m 14s | Max:  9m 14s | Hits:  60%/261   
      🟩 MSVC14.42          Pass: 100%/1   | Total:  9m 20s | Avg:  9m 20s | Max:  9m 20s | Hits:  60%/261   
      🟩 NVHPC24.7          Pass: 100%/2   | Total: 12m 02s | Avg:  6m 01s | Max:  6m 03s | Hits:  95%/706   
    🟩 cxx_family
      🟩 Clang              Pass: 100%/8   | Total: 40m 13s | Avg:  5m 01s | Max: 11m 54s | Hits:  98%/4426  
      🟩 GCC                Pass: 100%/8   | Total: 38m 14s | Avg:  4m 46s | Max: 11m 38s | Hits:  98%/4426  
      🟩 MSVC               Pass: 100%/2   | Total: 18m 34s | Avg:  9m 17s | Max:  9m 20s | Hits:  60%/522   
      🟩 NVHPC              Pass: 100%/2   | Total: 12m 02s | Avg:  6m 01s | Max:  6m 03s | Hits:  95%/706   
    🟩 gpu
      🟩 rtx2080            Pass: 100%/20  | Total:  1h 49m | Avg:  5m 27s | Max: 11m 54s | Hits:  96%/10080 
    🟩 jobs
      🟩 Build              Pass: 100%/18  | Total:  1h 25m | Avg:  4m 45s | Max:  9m 20s | Hits:  95%/8974  
      🟩 Test               Pass: 100%/2   | Total: 23m 32s | Avg: 11m 46s | Max: 11m 54s | Hits:  99%/1106  
    🟩 sm
      🟩 90                 Pass: 100%/1   | Total:  3m 39s | Avg:  3m 39s | Max:  3m 39s | Hits:  98%/553   
      🟩 90a                Pass: 100%/1   | Total:  3m 23s | Avg:  3m 23s | Max:  3m 23s | Hits:  98%/553   
    🟩 std
      🟩 17                 Pass: 100%/4   | Total: 17m 48s | Avg:  4m 27s | Max:  6m 03s | Hits:  97%/2012  
      🟩 20                 Pass: 100%/16  | Total:  1h 31m | Avg:  5m 42s | Max: 11m 54s | Hits:  96%/8068  
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 10m 49s | Avg: 5m 24s | Max: 8m 14s | Hits: 97%/288

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total: 10m 49s | Avg:  5m 24s | Max:  8m 14s | Hits:  97%/288   
    🟩 ctk
      🟩 12.8               Pass: 100%/2   | Total: 10m 49s | Avg:  5m 24s | Max:  8m 14s | Hits:  97%/288   
    🟩 cudacxx
      🟩 nvcc12.8           Pass: 100%/2   | Total: 10m 49s | Avg:  5m 24s | Max:  8m 14s | Hits:  97%/288   
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total: 10m 49s | Avg:  5m 24s | Max:  8m 14s | Hits:  97%/288   
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total: 10m 49s | Avg:  5m 24s | Max:  8m 14s | Hits:  97%/288   
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total: 10m 49s | Avg:  5m 24s | Max:  8m 14s | Hits:  97%/288   
    🟩 gpu
      🟩 rtx2080            Pass: 100%/2   | Total: 10m 49s | Avg:  5m 24s | Max:  8m 14s | Hits:  97%/288   
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 35s | Avg:  2m 35s | Max:  2m 35s | Hits:  96%/144   
      🟩 Test               Pass: 100%/1   | Total:  8m 14s | Avg:  8m 14s | Max:  8m 14s | Hits:  98%/144   
    
  • 🟩 python: Pass: 100%/1 | Total: 30m 27s | Avg: 30m 27s | Max: 30m 27s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 30m 27s | Avg: 30m 27s | Max: 30m 27s
    🟩 ctk
      🟩 12.8               Pass: 100%/1   | Total: 30m 27s | Avg: 30m 27s | Max: 30m 27s
    🟩 cudacxx
      🟩 nvcc12.8           Pass: 100%/1   | Total: 30m 27s | Avg: 30m 27s | Max: 30m 27s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 30m 27s | Avg: 30m 27s | Max: 30m 27s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 30m 27s | Avg: 30m 27s | Max: 30m 27s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 30m 27s | Avg: 30m 27s | Max: 30m 27s
    🟩 gpu
      🟩 rtx2080            Pass: 100%/1   | Total: 30m 27s | Avg: 30m 27s | Max: 30m 27s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 30m 27s | Avg: 30m 27s | Max: 30m 27s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
+/- libcu++
CUB
Thrust
CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
+/- libcu++
+/- CUB
+/- Thrust
+/- CUDA Experimental
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 151)

# Runner
108 linux-amd64-cpu16
15 windows-amd64-cpu16
10 linux-arm64-cpu16
8 linux-amd64-gpu-rtx2080-latest-1
6 linux-amd64-gpu-rtxa6000-latest-1
3 linux-amd64-gpu-rtx4090-latest-1
1 linux-amd64-gpu-h100-latest-1

@fbusato fbusato marked this pull request as ready for review February 12, 2025 23:51
@fbusato fbusato requested a review from a team as a code owner February 12, 2025 23:51
@fbusato fbusato requested a review from ericniebler February 12, 2025 23:51
@fbusato fbusato requested a review from a team as a code owner February 13, 2025 01:14
@fbusato fbusato requested a review from miscco February 14, 2025 17:34
@davebayer
Contributor

davebayer commented Feb 14, 2025

Not sure if I'm understanding it correctly. Based on comment #3755 (comment), the idea is to only verify the overflow of add, sub, mul, div. We don't see much value in computing the result of the operation.

Yes, I am referring to the solution I proposed. Actually, the fastest way to check whether an operation overflows is to compute the result and then check the overflow flags and the result. I've checked the assembly generated by the compilers, and they do exactly that.

Other problems related to builtins:

  • Not available on device (which is the main target)
  • Not all compilers support them
  • Don't work on constexpr functions (we need a dispatch)

I've implemented a version that is fully functional in both host and device code, preferring builtins and falling back to a generic implementation.
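The "prefer the builtin, fall back to generic code" dispatch described above can be sketched roughly as follows. This is a host-only illustration, not the code from the PR: the name `add_overflow_sketch` is hypothetical, and a real host/device version would also need to branch on the compilation target, which is omitted here.

```cpp
#include <type_traits>

// Hypothetical helper (not the PR's code): stores the wrapped sum in `result`
// and returns true if the exact sum does not fit in T.
template <class T>
bool add_overflow_sketch(T x, T y, T& result) noexcept {
  static_assert(std::is_integral<T>::value, "integer types only");
#if defined(__GNUC__) || defined(__clang__)
  // Host path: GCC and Clang provide the checked-arithmetic builtin.
  return __builtin_add_overflow(x, y, &result);
#else
  // Generic path: add in the unsigned counterpart, where wraparound is well
  // defined, then detect overflow by comparing the wrapped sum with x.
  using U = typename std::make_unsigned<T>::type;
  result = static_cast<T>(static_cast<U>(x) + static_cast<U>(y));
  return (y >= 0) ? (result < x) : (result > x);
#endif
}
```

Both paths produce the wrapped result and the overflow flag together, which is the point being argued: the flag comes out of the same computation as the value.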

Contributor

@miscco miscco left a comment

As touched on Monday, I prefer not to waste already available information.

That is why I would prefer the approach of computing the result and also passing around a flag that signifies whether overflow occurred.

I believe that there is effectively never a situation where we are completely uninterested in the result of an operation and just want to throw in that hypothetical case.

So throwing away the result in all common cases seems wasteful.

@davebayer
Contributor

davebayer commented Feb 14, 2025

Maybe I should have introduced the solution better. All of the functions have two overloads:

template <class T>
constexpr bool op_overflow(T x, T y, T& result) noexcept;

template <class T>
constexpr overflow_arithmetic_result_t<T> op_overflow(T x, T y) noexcept;

They can be used as:

// ...
int result;
if (cuda::add_overflow(x, y, result))
{
  // handle overflow
}
// use `result`
// ...

and

// ...
if (auto res = add_overflow(x, y))
{
  // handle overflow saved in `res.overflow`
  // use result saved in `res.value`
}
// ...

The overflow_arithmetic_result_t type implements explicit operator bool(), so it can be used directly in if statements and in static_assert.
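As a rough illustration of how a result type with `explicit operator bool()` behaves, here is a minimal sketch. The names `overflow_result_sketch` and `sub_overflow_sketch` are hypothetical stand-ins, not the types from the proposal, and subtraction is used only as one representative of the `op_overflow` family.

```cpp
#include <type_traits>

// Hypothetical stand-in for the result type described above.
template <class T>
struct overflow_result_sketch {
  T value;        // wrapped result of the operation
  bool overflow;  // true if the exact result does not fit in T
  constexpr explicit operator bool() const noexcept { return overflow; }
};

// Hypothetical subtraction with overflow reporting (requires C++14 or later).
template <class T>
constexpr overflow_result_sketch<T> sub_overflow_sketch(T x, T y) noexcept {
  using U = typename std::make_unsigned<T>::type;
  // Subtract in the unsigned counterpart, where wraparound is well defined.
  const T wrapped = static_cast<T>(static_cast<U>(x) - static_cast<U>(y));
  // Subtracting a non-negative y must not increase x; subtracting a
  // negative y must not decrease it. Any other outcome means overflow.
  const bool overflow = (y >= 0) ? (wrapped > x) : (wrapped < x);
  return {wrapped, overflow};
}
```

Because the conversion is explicit, `if (auto res = sub_overflow_sketch(x, y))` compiles, while an accidental `bool b = sub_overflow_sketch(x, y);` does not.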

I've already discussed the design with @miscco and he seems to be happy with it.

However, in the current implementation, all of the inputs must be of the same type. If you insist on type mixing and returning the common type, I can change the implementation.

What are your thoughts on this, @fbusato? :)

@fbusato
Contributor Author

fbusato commented Feb 14, 2025

As touched on Monday, I prefer not to waste already available information.

This is not a waste of available information. Checking the overflow could involve different operations compared to the actual computation.

@davebayer I like the idea of the overloads but I would prefer to keep bool is_op_overflow(T,U) + the version with both results.
Then we need to decide how to proceed.
Additional note: trying to optimize these functions only for x86-64 (host) seems a bit out of scope.

@davebayer
Contributor

davebayer commented Feb 14, 2025

@davebayer I like the idea of the overloads but I would prefer to keep bool is_op_overflow(T,U) + the version with both results. Then we need to decide how to proceed. Additional note: trying to optimize these functions only for x86-64 (host) seems a bit out of scope.

I only optimized the multiplication for device, because I did not come up with anything better than what the generic C++ implementation does.

I'd like to demonstrate that there is no performance benefit from having is_op_overflow implemented differently than op_overflow. See the generated PTX code for add on godbolt.

My implementation generates the same PTX as the clang-cuda's __builtin_add_overflow. Your solution is more complicated, generates more comparisons and introduces branching.

There are the extended-precision integer arithmetic instructions, but we have no way of getting the CC.CF flag other than using addc a second time.

The only improvement I see is that NVCC seems to have trouble using predicates, so I could use inline PTX to fix that, but it would add complexity to the whole thing.

@fbusato
Contributor Author

fbusato commented Feb 14, 2025

Add/Subtraction

I'd like to demonstrate that there is no performance benefit from having is_op_overflow implemented differently than op_overflow. See the generated PTX code for add on godbolt.

My implementation generates the same PTX as the clang-cuda's __builtin_add_overflow. Your solution is more complicated, generates more comparisons and introduces branching.

Your idea is very nice, but I would argue the opposite. Even in the worst case for the comparison approach (int), there is a difference of just one instruction at the SASS level, and in my version only half of the instructions are actually executed.

federico_is_add_overflow(int, int):
 ISETP.GT.AND P1, PT, R5, -0x1, PT 
 @!P1 IADD3 R3, -R5, -0x80000000, RZ 
 @!P1 ISETP.GT.AND P0, PT, R3, R4, PT 
 @!P1 ISETP.EQ.OR P0, PT, R5.reuse, -0x80000000, P0 
 @P1 IADD3 R5, -R5, 0x7fffffff, RZ 
 @!P1 ISETP.LT.AND P0, PT, R4, RZ, P0 
 @P1 ISETP.LT.AND P0, PT, R5, R4, PT 
 SEL R4, RZ, 0x1, !P0 
 RET.ABS.NODEC R20 0x0 
david_is_add_overflow(int, int):
 IMAD.IADD R3, R4, 0x1, R5.reuse 
 SHF.R.U32.HI R5, RZ, 0x1f, R5 
 ISETP.GE.AND P0, PT, R3, R4, PT 
 LOP3.LUT R5, R5, 0x1, RZ, 0x3c, !PT 
 SEL R0, RZ, 0x1, P0 
 ISETP.NE.AND P0, PT, R5, R0, PT 
 SEL R4, RZ, 0x1, P0 
 RET.ABS.NODEC R20 0x0 
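
For readers without the original source at hand, the predicated, branch-per-sign shape of the first listing is what a check that never performs the addition tends to compile to. A minimal sketch of that strategy follows; it is an assumption about the general technique, not the exact source behind `federico_is_add_overflow`, and the function name is hypothetical.

```cpp
#include <limits>

// Hypothetical pre-check variant: decides from the operands alone whether
// a + b would leave the range of int, without ever computing a + b.
constexpr bool is_add_overflow_precheck(int a, int b) noexcept {
  return (b >= 0) ? (a > std::numeric_limits<int>::max() - b)   // upper bound
                  : (a < std::numeric_limits<int>::min() - b);  // lower bound
}
```

Only one of the two comparisons is evaluated for any given `b`, which matches the remark that only about half of the instructions execute.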

Multiplication:

  • The idea of using PTX for 32-bit/64-bit is excellent
  • For T/U smaller than 4 bytes we can skip most cases. Also, I don't think the 8-bit/16-bit variants in PTX are very efficient
  • For 128-bit, our solutions look pretty similar
  • Technically, we can also optimize the multiplication check by looking at the number of bits of a and b or checking only the upper-part of the multiplication.
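
The last bullet, checking only the upper part of the product, can be sketched for 32-bit operands by widening to 64 bits. This is a plain host C++ illustration under the assumption that a 64-bit type is available; the function names are hypothetical and the 64-bit and 128-bit cases need a different technique.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical unsigned check: the 64-bit product of two 32-bit values is
// exact, so any set bit in its upper half means the 32-bit result overflows.
constexpr bool is_mul_overflow_u32(std::uint32_t a, std::uint32_t b) noexcept {
  return ((static_cast<std::uint64_t>(a) * b) >> 32) != 0;
}

// Hypothetical signed check: the exact 64-bit product must lie inside the
// 32-bit signed range.
constexpr bool is_mul_overflow_i32(std::int32_t a, std::int32_t b) noexcept {
  return static_cast<std::int64_t>(a) * b < std::numeric_limits<std::int32_t>::min() ||
         static_cast<std::int64_t>(a) * b > std::numeric_limits<std::int32_t>::max();
}
```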

Thoughts:

I'm still convinced that checking for overflow and computing the operations are two different things:

  • Add/sub generate different code
  • 128-bit mul doesn't need to compute the multiplication
  • same for division
  • same for small integer types

Personally, I would like to have both versions, boolean value and with the result.

Final note about the parameter types: using different types + common_type_t internally gives users more flexibility and is aligned with the other cuda/cmath functions.
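
To make the division and mixed-type points concrete, here is a minimal sketch of a boolean-only check that converts both operands to their common type and never performs the division. The name `is_div_overflow_sketch` is hypothetical, and reporting division by zero as an error is an assumption rather than something stated in this thread.

```cpp
#include <limits>
#include <type_traits>

// Hypothetical check in the common type of T and U (requires C++14 or later).
template <class T, class U>
constexpr bool is_div_overflow_sketch(T a, U b) noexcept {
  using C = typename std::common_type<T, U>::type;
  const C lhs = static_cast<C>(a);
  const C rhs = static_cast<C>(b);
  if (rhs == 0) {
    return true;  // assumption: division by zero is reported as well
  }
  // For a signed common type, the only quotient that does not fit is
  // min / -1. Unsigned division cannot overflow.
  return std::is_signed<C>::value &&
         lhs == std::numeric_limits<C>::min() &&
         rhs == static_cast<C>(-1);
}
```

Note how the answer depends on the common type: `INT_MIN / -1` overflows as `int`, but not when the divisor is a `long long`, because the operation is then carried out in the wider type.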

@wence-
Contributor

wence- commented Feb 28, 2025

tl;dr: In libcudf, we'd love to be able to use saturating addition/subtraction that also returns whether overflow occurred.

In libcudf we have need of saturating integer arithmetic that also returns whether overflow occurred. The context is searching for an insertion point in an array (thrust::lower_bound, for example). Particularly, the use case is, I have some sorted array x = [a, b, c, ...] and some delta. For each row i, I want to find the insertion point in x of x[i] - delta, defining rows of the array that are contained in a "window" around the current row. There are two treatments of the endpoints: open (endpoint not included) and closed (endpoint included).

For an open window, if no overflow occurs, I can find the correct insertion point for a row i with thrust::lower_bound(x.begin(), x.end(), x[i] - delta, thrust::less_equal<>{}) (for a closed window, I need to use thrust::less<>{} instead).

However, if overflow does occur, then I need a way of distinguishing a legitimately obtained saturated value, from one that occurred due to overflow. Particularly, for open windows, if saturation occurred then I need to change the comparator to thrust::less<>. If all I know is that after calling x[i].sub_sat(delta) that the result is equal to a saturated value, I don't have enough information to know if I should make this modification.
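
The distinction being asked for can be sketched as a saturating subtraction that also reports whether clamping happened. The names `sat_result_sketch` and `sub_sat_flagged` are hypothetical and do not correspond to an existing libcu++ or libcudf API.

```cpp
#include <limits>

// Hypothetical result type: the clamped value plus a flag that tells a
// legitimately computed limit value apart from a clamped one.
template <class T>
struct sat_result_sketch {
  T value;
  bool saturated;
};

// Hypothetical saturating subtraction (requires C++14 or later).
template <class T>
constexpr sat_result_sketch<T> sub_sat_flagged(T x, T delta) noexcept {
  using L = std::numeric_limits<T>;
  // Compare against the limits first so that no overflowing subtraction is
  // ever evaluated.
  if (delta > 0 && x < L::min() + delta) {
    return {L::min(), true};
  }
  if (delta < 0 && x > L::max() + delta) {
    return {L::max(), true};
  }
  return {static_cast<T>(x - delta), false};
}
```

With `int`, `sub_sat_flagged(INT_MIN + 5, 5)` and `sub_sat_flagged(INT_MIN + 4, 5)` both yield `INT_MIN` as the value, but only the second sets `saturated`, which is the information needed to pick the comparator.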

@fbusato
Contributor Author

fbusato commented Feb 28, 2025

Thanks @wence- for describing your use case. Indeed, we also have a PR for saturation arithmetic: #3449.

@fbusato fbusato changed the title Integer arithmetic with overflow checking [BLOCKED] Integer arithmetic with overflow checking Mar 8, 2025
@res-life

The Spark-Rapids repo requires the following APIs:

// In all four functions, T is one of int8_t, int16_t, int32_t, or int64_t.
template <typename T>
__device__ void add(T x, T y, bool check_overflow, bool* valid, T* result);

template <typename T>
__device__ void subtract(T x, T y, bool check_overflow, bool* valid, T* result);

template <typename T>
__device__ void multiply(T x, T y, bool check_overflow, bool* valid, T* result);

template <typename T>
__device__ void divide(T x, T y, bool check_overflow, bool* valid, T* result);

Spark-Rapids will use the above APIs via the cuDF repo.
Please help with the above APIs.

@miscco
Contributor

miscco commented Jul 15, 2025

@fbusato could you please rebase the PR?

@miscco
Contributor

miscco commented Jul 15, 2025

As I said earlier, I have strong reservations about the API here.

Personally, I would strongly prefer that we align this more closely with the C++ saturation arithmetic that was implemented by @davebayer in #3449.

How about the following API:

namespace cuda {

template<class T>
struct __saturation_result_t {
  T result;
  bool has_overflown;
};

_CCCL_TEMPLATE(class _Tp)
_CCCL_REQUIRES(__cccl_is_integer_v<_Tp>)
[[nodiscard]] _LIBCUDACXX_HIDE_FROM_ABI constexpr __saturation_result_t<_Tp> div_sat(_Tp __x, _Tp __y) noexcept;

} // namespace cuda

namespace cuda::std {

_CCCL_TEMPLATE(class _Tp)
_CCCL_REQUIRES(__cccl_is_integer_v<_Tp>)
[[nodiscard]] _LIBCUDACXX_HIDE_FROM_ABI constexpr _Tp div_sat(_Tp __x, _Tp __y) noexcept {
  return ::cuda::div_sat(__x, __y).result;
}

} // namespace cuda::std

That would have the clear benefit of having a common implementation that greatly reduces code duplication and also provides all the information we need.

Should a user really want to discard the return value and only check for overflow, they can just add their own wrapper or directly access has_overflown.

@davebayer
Contributor

How about the following API:

namespace cuda {

template<class T>
struct __saturation_result_t {
  T result;
  bool has_overflown;
};

_CCCL_TEMPLATE(class _Tp)
_CCCL_REQUIRES(__cccl_is_integer_v<_Tp>)
[[nodiscard]] _LIBCUDACXX_HIDE_FROM_ABI constexpr __saturation_result_t<_Tp> div_sat(_Tp __x, _Tp __y) noexcept;

} // namespace cuda

namespace cuda::std {

_CCCL_TEMPLATE(class _Tp)
_CCCL_REQUIRES(__cccl_is_integer_v<_Tp>)
[[nodiscard]] _LIBCUDACXX_HIDE_FROM_ABI constexpr _Tp div_sat(_Tp __x, _Tp __y) noexcept {
  return ::cuda::div_sat(__x, __y).result;
}

} // namespace cuda::std

That would have the clear benefit of having a common implementation that greatly reduces code duplication and also provides all the information we need.

Should a user really want to discard the return value and only check for overflow, they can just add their own wrapper or directly access has_overflown.

I don't like the naming. To me, saturating and overflow checking are two separate things. Plus, saturating the result on every overflow adds some overhead. I would really like to keep it simple; we needn't reinvent the wheel here.

Rust also implements them separately with different names.

I've already implemented overflow_cast in #4151 and I have an unfinished PR on mul_overflow, #4415. The implementations of addition, subtraction, and division are trivial; multiplication was the hard one. All of the functions should return overflow_result<T>, which is already implemented, too.

There is also an issue #4419 tracking those changes. I will try to finish the mul_overflow branch, @fbusato if you'd like to continue with the others I would be grateful :)

@fbusato
Contributor Author

fbusato commented Jul 15, 2025

@davebayer @miscco We already discussed this point. I'm in favor of the #4415 design, and I will probably work on that soon. I will keep this PR open only for reference and to compare the implementations.

@fbusato
Contributor Author

fbusato commented Jan 22, 2026

Closing. We already provided this functionality in different PRs.

@fbusato fbusato closed this Jan 22, 2026
@github-project-automation github-project-automation bot moved this from Blocked to Done in CCCL Jan 22, 2026
@res-life

I found the following overflow checks in https://nvidia.github.io/cccl/libcudacxx/extended_api/numeric.html: add, sub, div.
Is there a multiply overflow check?

@davebayer
Contributor

davebayer commented Jan 23, 2026

Is there a multiply overflow check?

Not yet, coming soon!

@davebayer
Contributor

@res-life would you be interested in this set of APIs for saturating arithmetic with overflow checking?

template <class T>
cuda::overflow_result<T> add_sat_overflow(T lhs, T rhs) noexcept;

template <class T>
bool add_sat_overflow(T& result, T lhs, T rhs) noexcept;

or would you prefer what you suggested?

template <class T>
void add(T x, T y, bool check_overflow, bool* valid, T* result);

@res-life

res-life commented Feb 2, 2026

I prefer the first. Please ignore what I suggested.

@davebayer
Contributor

@res-life I've implemented those functions in #7808. Hopefully, they will be part of the 3.4 release (CUDA 13.4) together with the cuda::std::op_sat variants (#3449).


6 participants