Conversation
🟨 CI finished in 1h 41m: Pass: 84%/93 | Total: 2d 21h | Avg: 44m 42s | Max: 1h 25m | Hits: 60%/115695
| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| | Thrust |
| | CUDA Experimental |
| | python |
| | CCCL C Parallel Library |
| | Catch2Helper |

Modifications in project or dependencies?

| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 93)
| # | Runner |
|---|---|
| 66 | linux-amd64-cpu16 |
| 9 | windows-amd64-cpu16 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 4 | linux-arm64-cpu16 |
| 3 | linux-amd64-gpu-h100-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
| 2 | linux-amd64-gpu-rtx2080-latest-1 |
Thanks! I'm out until Monday. Will review then. Could you, in the meantime, please run the benchmarks for radix sort and share the results here?
I can take a look |
🟨 CI finished in 1h 15m: Pass: 90%/93 | Total: 18h 36m | Avg: 12m 00s | Max: 1h 00m | Hits: 97%/121785
| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| | Thrust |
| | CUDA Experimental |
| | python |
| | CCCL C Parallel Library |
| | Catch2Helper |

Modifications in project or dependencies?

| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 93)
| # | Runner |
|---|---|
| 66 | linux-amd64-cpu16 |
| 9 | windows-amd64-cpu16 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 4 | linux-arm64-cpu16 |
| 3 | linux-amd64-gpu-h100-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
| 2 | linux-amd64-gpu-rtx2080-latest-1 |
🟨 CI finished in 1h 33m: Pass: 98%/158 | Total: 2d 13h | Avg: 23m 16s | Max: 1h 16m | Hits: 84%/246942
| | Project |
|---|---|
| | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| | Thrust |
| | CUDA Experimental |
| | python |
| | CCCL C Parallel Library |
| | Catch2Helper |

Modifications in project or dependencies?

| | Project |
|---|---|
| | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 158)
| # | Runner |
|---|---|
| 111 | linux-amd64-cpu16 |
| 15 | windows-amd64-cpu16 |
| 10 | linux-arm64-cpu16 |
| 8 | linux-amd64-gpu-rtx2080-latest-1 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 5 | linux-amd64-gpu-h100-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
🟩 CI finished in 1h 18m: Pass: 100%/158 | Total: 1d 05h | Avg: 11m 10s | Max: 1h 01m | Hits: 91%/250596
| | Project |
|---|---|
| | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| | Thrust |
| | CUDA Experimental |
| | python |
| | CCCL C Parallel Library |
| | Catch2Helper |

Modifications in project or dependencies?

| | Project |
|---|---|
| | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 158)
| # | Runner |
|---|---|
| 111 | linux-amd64-cpu16 |
| 15 | windows-amd64-cpu16 |
| 10 | linux-arm64-cpu16 |
| 8 | linux-amd64-gpu-rtx2080-latest-1 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 5 | linux-amd64-gpu-h100-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
```cpp
const int begin_bit = GENERATE_COPY(take(2, random(0, key_size - 1)));
const int end_bit   = GENERATE_COPY(take(2, random(begin_bit + 1, key_size)));
```
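For reference, the generators above draw `begin_bit` from `[0, key_size - 1]` and `end_bit` from `[begin_bit + 1, key_size]`, so `begin_bit < end_bit` always holds and zero-width ranges are excluded. A plain-C++ sketch of that sampling invariant (hypothetical `sample_bit_range` helper standing in for Catch2's `GENERATE_COPY`/`random`, which are inclusive on both ends):

```cpp
#include <cassert>
#include <random>
#include <utility>

// Model of the test's bit-range sampling: begin_bit in [0, key_size - 1],
// end_bit in [begin_bit + 1, key_size]. Guarantees begin_bit < end_bit,
// i.e. the degenerate zero-width range is never generated.
std::pair<int, int> sample_bit_range(std::mt19937& rng, int key_size)
{
    std::uniform_int_distribution<int> lo(0, key_size - 1);
    const int begin_bit = lo(rng);
    std::uniform_int_distribution<int> hi(begin_bit + 1, key_size);
    const int end_bit = hi(rng);
    return {begin_bit, end_bit};
}
```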
@elstehle could you please have a quick look at whether this fix is correct? We previously tested `begin_bit == end_bit` sometimes. Was this an invalid scenario?
If `begin_bit == end_bit` is valid, then the kernel should not be called (I guess).
I think `begin_bit == end_bit` is generally valid; it's similar to `num_items == 0`.

> [...] then the kernel should not be called (I guess)

We can skip any kernel invocation only if the user invoked `DeviceRadixSort` via the `DoubleBuffer` interface. Otherwise the user will expect the output to end up in `d_{keys,values}_out`, in which case we need to copy the "sorted" output there.
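The distinction described above can be sketched in plain host-side C++. This is a hypothetical model (`sort_keys_model` is not a real CUB entry point): with zero sortable bits the output must simply mirror the input, which the `DoubleBuffer` interface gets for free but the separate-output interface must pay a copy for:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Model of the separate-output radix-sort interface. When begin_bit ==
// end_bit, no bits participate in the sort, so the "sorted" result is the
// input order -- but it must still be copied into the output buffer, since
// the caller expects the result there.
void sort_keys_model(const std::vector<unsigned>& d_keys_in,
                     std::vector<unsigned>& d_keys_out,
                     int begin_bit, int end_bit)
{
    if (begin_bit == end_bit) {
        d_keys_out = d_keys_in; // zero sortable bits: output mirrors input
        return;
    }
    // Normal path: stable sort on the key bits in [begin_bit, end_bit).
    d_keys_out = d_keys_in;
    std::stable_sort(d_keys_out.begin(), d_keys_out.end(),
                     [=](unsigned a, unsigned b) {
                         auto bits = [=](unsigned v) {
                             const int width = end_bit - begin_bit;
                             const unsigned mask =
                                 width >= 32 ? ~0u : ((1u << width) - 1u);
                             return (v >> begin_bit) & mask;
                         };
                         return bits(a) < bits(b);
                     });
}
```

With a `DoubleBuffer`, by contrast, the "current" buffer already holds the input order, so the degenerate case needs no work at all.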
Added a performance comparison in the description.
elstehle left a comment:
Apart from the `begin_bit == end_bit` constraint, this looks good. Can we lift that new constraint? Otherwise, this will be a breaking change, as it narrows the usage scenarios.
Btw, I already included this change in the incoming CCCL 3.0 migration guide: #4069
@elstehle sorry, what do you mean by lifting the new constraint? What do you think about returning from the host call when `begin_bit == end_bit`?
I assumed that the new
Is your question: "What should we do if the user passes `begin_bit == end_bit`?"
After reconsidering, I would actually prefer failing explicitly when `begin_bit == end_bit`.

Broadly, we have two types of users:
The priority should be to avoid regressing user group (1), whether that means preventing performance degradation or avoiding the compilation of an extra copy kernel. If we can achieve that while still accommodating user group (2), I'm in favor. Otherwise, I'd prefer to fail in cases where `begin_bit == end_bit`.
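The "fail explicitly" option discussed above can be sketched as a host-side range check that rejects the degenerate range before any kernel is launched. This is an illustrative model only; the enum stands in for `cudaError_t`, and `validate_bit_range` is a hypothetical helper, not CUB code:

```cpp
#include <cassert>

// Stand-ins for cudaSuccess / cudaErrorInvalidValue.
enum Status { Success = 0, InvalidValue = 1 };

// Reject invalid or degenerate bit ranges at the host call, instead of
// compiling and launching an extra copy kernel for the zero-width case.
Status validate_bit_range(int begin_bit, int end_bit, int key_bits)
{
    if (begin_bit < 0 || end_bit > key_bits || begin_bit >= end_bit) {
        return InvalidValue; // includes begin_bit == end_bit
    }
    return Success;
}
```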
🟨 CI finished in 1h 37m: Pass: 93%/93 | Total: 2d 21h | Avg: 44m 56s | Max: 1h 24m | Hits: 63%/125451
| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| | Thrust |
| | CUDA Experimental |
| | python |
| | CCCL C Parallel Library |
| | Catch2Helper |

Modifications in project or dependencies?

| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 93)
| # | Runner |
|---|---|
| 66 | linux-amd64-cpu16 |
| 9 | windows-amd64-cpu16 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 4 | linux-arm64-cpu16 |
| 3 | linux-amd64-gpu-h100-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
| 2 | linux-amd64-gpu-rtx2080-latest-1 |
🟨 CI finished in 1h 21m: Pass: 94%/158 | Total: 2d 10h | Avg: 22m 23s | Max: 1h 18m | Hits: 88%/240766
| | Project |
|---|---|
| | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| | Thrust |
| | CUDA Experimental |
| | python |
| | CCCL C Parallel Library |
| | Catch2Helper |

Modifications in project or dependencies?

| | Project |
|---|---|
| | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 158)
| # | Runner |
|---|---|
| 111 | linux-amd64-cpu16 |
| 15 | windows-amd64-cpu16 |
| 10 | linux-arm64-cpu16 |
| 8 | linux-amd64-gpu-rtx2080-latest-1 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 5 | linux-amd64-gpu-h100-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
🟨 CI finished in 1h 17m: Pass: 96%/158 | Total: 2d 10h | Avg: 22m 18s | Max: 1h 15m | Hits: 92%/242650
| | Project |
|---|---|
| | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| | Thrust |
| | CUDA Experimental |
| | python |
| | CCCL C Parallel Library |
| | Catch2Helper |

Modifications in project or dependencies?

| | Project |
|---|---|
| | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 158)
| # | Runner |
|---|---|
| 111 | linux-amd64-cpu16 |
| 15 | windows-amd64-cpu16 |
| 10 | linux-arm64-cpu16 |
| 8 | linux-amd64-gpu-rtx2080-latest-1 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 5 | linux-amd64-gpu-h100-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
🟩 CI finished in 1h 24m: Pass: 100%/158 | Total: 2d 10h | Avg: 22m 21s | Max: 1h 22m | Hits: 91%/251089
| | Project |
|---|---|
| | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| | Thrust |
| | CUDA Experimental |
| | python |
| | CCCL C Parallel Library |
| | Catch2Helper |

Modifications in project or dependencies?

| | Project |
|---|---|
| | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 158)
| # | Runner |
|---|---|
| 111 | linux-amd64-cpu16 |
| 15 | windows-amd64-cpu16 |
| 10 | linux-arm64-cpu16 |
| 8 | linux-amd64-gpu-rtx2080-latest-1 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 5 | linux-amd64-gpu-h100-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
Fixes #4025
Description
Replace/deprecate `cub::BFE` with the new `<cuda/bit>` functionality: nvidia.github.io/cccl/libcudacxx/extended_api/bit.html

The initial PR found the following problems:

- `bitfield.h`
- `catch2_test_device_segmented_radix_sort_keys.cu` includes tests with 0 bit width
- `catch2_test_block_radix_sort.cu` includes tests with 0 bit width

Performance comparison

PTX `BFE` vs. `cuda::bitfield_extract` with SM80. TL;DR: slightly faster [0].

[0] NVIDIA RTX A6000
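For readers unfamiliar with the operation being compared: PTX `BFE` extracts a contiguous bitfield of a given width starting at a given bit position, which is what `cuda::bitfield_extract` maps to. A host-side C++ model of the unsigned semantics (`bitfield_extract_model` is an illustrative name, not the libcu++ implementation):

```cpp
#include <cassert>
#include <cstdint>

// Model of an unsigned bitfield extract: take `width` bits of `value`
// starting at bit position `start`. A zero-width extraction yields 0,
// matching the "0 bit width" test scenarios mentioned above.
uint32_t bitfield_extract_model(uint32_t value, int start, int width)
{
    if (width == 0) {
        return 0u; // no bits selected
    }
    const uint32_t mask = (width >= 32) ? ~0u : ((1u << width) - 1u);
    return (value >> start) & mask;
}
```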