Fix the vectorized loading of BlockLoad #3517
Conversation
Thank you @ChristinaZ for looking into this. It seems that the root cause is that we have a superfluous template parameter that prevents the compiler from choosing the overload implementing the vectorized load; #431 describes this limitation. However, as Georgii stated in a comment on that issue, the actual shortcoming is that we need to safeguard against unaligned pointers. I think the right path forward is what Georgii suggested in that comment:
Is this something you could take on?
Yes, I think so. We can use a similar check within the function
I can reproduce the issue on godbolt: https://godbolt.org/z/dfTa9oe1v
Just to summarize the findings from the offline discussion with @miscco and @ChristinaZ: We have this overload that is supposed to be chosen when the iterator type qualifies for vectorization: cccl/cub/cub/block/block_load.cuh, lines 882 to 886 in 7b19a43. There seem to be two root causes why that overload isn't chosen:
Since the implementation of the overload is a one-liner, my suggestion was to add another overload taking a pointer-to-non-const. As mentioned before, we should also address the scenario where the pointer is not aligned to the vectorized type (see #431 (comment)).
Could you please also add a small test to test/catch2_test_block_load.cu that loads from each {aligned, unaligned} × {ptr-to-const, ptr-to-non-const} combination?
Regarding performance concerns: I also checked our tuning policies for the use of BLOCK_LOAD_VECTORIZE and it seems it is only considered for DeviceSpmv (which is about to be dropped) and in DeviceRadixSort for SM 35 (which is also about to be dropped).
Given that it's not used in any Device* algorithm today, I'd be happy to have it merged. It finally makes the BLOCK_LOAD_VECTORIZE algorithm do what it promises: it provides upside for the vectorized case (i.e., we are seeing considerable improvements in our top-k algorithm) and only a minor downside for the case where we now have the extra alignment checks but need to default to non-vectorized loading.
No problem. Let me add this test.
Yes, I just reran the benchmark with the latest updates in BlockLoad(). I find that adding the address-alignment check doesn't hurt the performance of our top-k, and the vectorized data loading truly helps improve performance compared with the default direct loading.
/ok to test
Note, we have the following line in the block load test: Which means the source file will be compiled twice: (1) once with
🟨 CI finished in 1h 47m: Pass: 95%/90 | Total: 2d 14h | Avg: 41m 20s | Max: 1h 08m | Hits: 216%/12772
Modified projects: CUB.
Modifications in project or dependencies: CUB, Thrust, python, CCCL C Parallel Library, Catch2Helper.
🏃 Runner counts (total jobs: 90)
| # | Runner |
|---|---|
| 65 | linux-amd64-cpu16 |
| 11 | linux-amd64-gpu-v100-latest-1 |
| 9 | windows-amd64-cpu16 |
| 4 | linux-arm64-cpu16 |
| 1 | linux-amd64-gpu-h100-latest-1-testing |
Force-pushed from 0f36910 to 18d3da9 (compare)
/ok to test

/ok to test
elstehle left a comment
That's great. Thanks a lot for your contribution!
@elstehle Do we need a before/after benchmark? AFAIK, we don't have one for block load. I expect the SASS to change (for good!).
My take on the performance would be this:
Christina has some very positive performance data in her top-k work, which reaffirms the performance upside we get from her fix.

I should learn how to read.
🟩 CI finished in 1h 45m: Pass: 100%/90 | Total: 2d 15h | Avg: 42m 31s | Max: 1h 14m | Hits: 216%/12730
Modified projects: CUB.
Modifications in project or dependencies: CUB, Thrust, python, CCCL C Parallel Library, Catch2Helper.
🏃 Runner counts (total jobs: 90)
| # | Runner |
|---|---|
| 65 | linux-amd64-cpu16 |
| 9 | windows-amd64-cpu16 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 4 | linux-arm64-cpu16 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
| 2 | linux-amd64-gpu-rtx2080-latest-1 |
| 1 | linux-amd64-gpu-h100-latest-1 |
Force-pushed from 5cab148 to 4ae725b (compare)
/ok to test
🟩 CI finished in 2h 49m: Pass: 100%/90 | Total: 2d 17h | Avg: 43m 35s | Max: 1h 45m | Hits: 216%/12730
Modified projects: CUB.
Modifications in project or dependencies: CUB, Thrust, python, CCCL C Parallel Library, Catch2Helper.
🏃 Runner counts (total jobs: 90)
| # | Runner |
|---|---|
| 65 | linux-amd64-cpu16 |
| 9 | windows-amd64-cpu16 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 4 | linux-arm64-cpu16 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
| 2 | linux-amd64-gpu-rtx2080-latest-1 |
| 1 | linux-amd64-gpu-h100-latest-1 |
Thanks a lot for your contribution, @ChristinaZ

Thanks Elias! It's my honor to contribute to CUB!
Description
Partially addresses #431
What remains from #431 after this PR is applying the same logic to `WARP_LOAD_VECTORIZE`.

Hi @elstehle and @bernhardmgruber,
Background:
We (Elias and I) are working on adding a parallel top-k algorithm to CUB. In this work, we are trying to load data using BlockLoad (`BLOCK_LOAD_VECTORIZE`). However, we found that the data loading used the instruction LDG instead of LDG128. This PR is intended to fix this issue.

Description
To show this bug, I wrote a standalone unit test in this repo.
To run this unit test, you first need to modify the cccl directory `CCCL_DIR` in the Makefile, then run `make` to compile. Then we can get the related PTX instructions with `cuobjdump --dump-ptx ./benchmark | grep ld`. We can find that it failed to perform vectorized loading. In this repo, we use the template parameter `InputIterator` for the input. Its value is actually `float*`, so it should be able to perform vectorized loading. Running `cuobjdump --dump-ptx ./benchmark | grep ld` again after the modification, we can see that vectorized loading is now used successfully.
Checklist
Related issue
#431