Fix SM100 histogram tunings #3691
🟩 CI finished in 1h 42m: Pass: 100%/90 | Total: 2d 16h | Avg: 43m 05s | Max: 1h 16m | Hits: 214%/13398
| +/- | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| | Thrust |
| | CUDA Experimental |
| | python |
| | CCCL C Parallel Library |
| | Catch2Helper |
Modifications in project or dependencies?
| +/- | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 90)
| # | Runner |
|---|---|
| 65 | linux-amd64-cpu16 |
| 9 | windows-amd64-cpu16 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 4 | linux-arm64-cpu16 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
| 2 | linux-amd64-gpu-rtx2080-latest-1 |
| 1 | linux-amd64-gpu-h100-latest-1 |
Please don't merge until we have a perf diff from @gonidelis
preliminary
From looking at the code, we only provide a tuning for
From the benchmark results it seems that also only the
Force-pushed from 7623d12 to 2699952
🟩 CI finished in 1h 38m: Pass: 100%/90 | Total: 2d 19h | Avg: 44m 40s | Max: 1h 19m | Hits: 56%/132225
@gonidelis please provide multi-histogram benchmarks as well. Thx!
Force-pushed from 98eae95 to e28fed2
The tuning data member names did not match the ones used when selecting tunings, so all SM100 tunings were SFINAE-ed out.
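The failure mode described in this commit message can be sketched in isolation. The following is a minimal illustration, not CUB's actual code: `sm100_tuning`, `items`, `items_per_thread`, `select_items`, `items_for`, and the numeric values are all hypothetical names and numbers chosen for the example. It assumes the selector detects a tuning by checking for a specific static data member and silently falls back to a default when that member is absent.

```cpp
// Hypothetical, simplified stand-ins; none of these names are CUB's real ones.

// Value used when no tuning is detected.
constexpr int default_items_per_thread = 8;

// Primary template: no members, meaning "no tuning available".
template <int Channels>
struct sm100_tuning
{};

// Tuning whose member name matches what the selector queries.
template <>
struct sm100_tuning<1>
{
  static constexpr int items = 12;
};

// Tuning with a mismatched member name: the value exists, but under a
// name the selector never looks at.
template <>
struct sm100_tuning<4>
{
  static constexpr int items_per_thread = 9;
};

// Preferred overload: only viable if `Tuning::items` is a valid expression.
// If it is not, substitution fails and the overload is removed (SFINAE).
template <class Tuning>
constexpr auto select_items(int) -> decltype(int(Tuning::items))
{
  return Tuning::items;
}

// Fallback overload: chosen whenever the preferred one was removed.
template <class Tuning>
constexpr int select_items(long)
{
  return default_items_per_thread;
}

template <class Tuning>
constexpr int items_for()
{
  return select_items<Tuning>(0); // 0 is an int, so the first overload wins if viable
}

static_assert(items_for<sm100_tuning<1>>() == 12, "matching name: tuning is used");
static_assert(items_for<sm100_tuning<4>>() == 8, "mismatched name: silent fallback");
static_assert(items_for<sm100_tuning<2>>() == 8, "no tuning at all: fallback");
```

Note that the mismatched case compiles without any diagnostic, which is why a bug of this kind tends to show up only in benchmarks or generated code rather than at build time.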
Force-pushed from e28fed2 to 30f5b21
🟩 CI finished in 1h 10m: Pass: 100%/90 | Total: 1d 16h | Avg: 27m 18s | Max: 1h 04m | Hits: 90%/132225
Backport failed. Please cherry-pick the changes locally and resolve any conflicts.
git fetch origin branch/2.8.x
git worktree add -d .worktree/backport-3691-to-branch/2.8.x origin/branch/2.8.x
cd .worktree/backport-3691-to-branch/2.8.x
git switch --create backport-3691-to-branch/2.8.x
git cherry-pick -x e7aae03124f20d8a4783d3e1668307d4a9e3bb8b
The tuning data member names did not match the ones used when selecting tunings, so all SM100 tunings were SFINAE-ed out. Also drop tunings with no benefit.
* Add b200 tunings for histogram (#3616)

Co-authored-by: Giannis Gonidelis <ggonidelis@nvidia.com>

* Fix SM100 histogram tunings (#3691)

The tuning data member names did not match the ones used when selecting tunings, so all SM100 tunings were SFINAE-ed out. Also drop tunings with no benefit.

---------

Co-authored-by: Giannis Gonidelis <ggonidelis@nvidia.com>
The tuning data member names did not match the ones used when selecting tunings, so all SM100 tunings were SFINAE-ed out.
* cub.bench.radix_sort.pairs.base is empty (no side effect)
* cub.bench.histogram.even.base contains instruction changes (tunings have effect)