[libcudacxx] Stable abstraction for Blackwell work-stealing (PTX try_cancel)#3671
Merged
miscco merged 58 commits intoNVIDIA:mainfrom Feb 20, 2025
Merged
[libcudacxx] Stable abstraction for Blackwell work-stealing (PTX try_cancel)#3671miscco merged 58 commits intoNVIDIA:mainfrom
miscco merged 58 commits intoNVIDIA:mainfrom
Conversation
Contributor
miscco
requested changes
Feb 4, 2025
Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>
Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>
Contributor
Author
|
pre-commit.ci autofix |
Contributor
|
We discussed the PR on VC today. @gonzalobg will push a few fixes, @miscco will polish, and I shall review. |
miscco
reviewed
Feb 19, 2025
Comment on lines
+113
to
+117
| if (auto e = cudaDeviceSynchronize(); e != cudaSuccess) | ||
| { | ||
| std::cerr << "ERROR: synchronize failed" << std::endl; | ||
| return false; | ||
| } |
Contributor
|
/ok to test |
Contributor
|
/ok to test |
Contributor
|
/ok to test |
gonzalobg
commented
Feb 19, 2025
libcudacxx/test/libcudacxx/cuda/for_each_canceled/for_each_canceled.pass.cpp
Show resolved
Hide resolved
gonzalobg
commented
Feb 19, 2025
Contributor
|
/ok to test |
Contributor
|
/ok to test |
Contributor
|
/ok to test |
1 similar comment
Contributor
|
/ok to test |
Contributor
🟩 CI finished in 1h 36m: Pass: 100%/158 | Total: 3d 01h | Avg: 27m 50s | Max: 1h 24m | Hits: 76%/247898
|
| Project | |
|---|---|
| CCCL Infrastructure | |
| +/- | libcu++ |
| CUB | |
| Thrust | |
| CUDA Experimental | |
| python | |
| CCCL C Parallel Library | |
| Catch2Helper |
Modifications in project or dependencies?
| Project | |
|---|---|
| CCCL Infrastructure | |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 158)
| # | Runner |
|---|---|
| 111 | linux-amd64-cpu16 |
| 15 | windows-amd64-cpu16 |
| 10 | linux-arm64-cpu16 |
| 8 | linux-amd64-gpu-rtx2080-latest-1 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 5 | linux-amd64-gpu-h100-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
Contributor
|
/ok to test |
miscco
approved these changes
Feb 20, 2025
12 tasks
Contributor
|
/ok to test |
Contributor
🟩 CI finished in 1h 04m: Pass: 100%/158 | Total: 23h 42m | Avg: 9m 00s | Max: 31m 09s | Hits: 94%/247972
|
| Project | |
|---|---|
| CCCL Infrastructure | |
| +/- | libcu++ |
| CUB | |
| Thrust | |
| CUDA Experimental | |
| python | |
| CCCL C Parallel Library | |
| Catch2Helper |
Modifications in project or dependencies?
| Project | |
|---|---|
| CCCL Infrastructure | |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 158)
| # | Runner |
|---|---|
| 111 | linux-amd64-cpu16 |
| 15 | windows-amd64-cpu16 |
| 10 | linux-arm64-cpu16 |
| 8 | linux-amd64-gpu-rtx2080-latest-1 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 5 | linux-amd64-gpu-h100-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
18 tasks
davebayer
pushed a commit
to davebayer/cccl
that referenced
this pull request
Feb 20, 2025
…cancel) (NVIDIA#3671) * [libcudacxx] Experimental try_cancel exposure * Update documentation * Update try_cancel_blocks ABI Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com> * Update ABI of __cluster_get_dim Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com> * Use if target and provide SW fallback * Use simple license * Fix if guard * Guard for C++20 or newer * Simplify API * Add tests * Clarify C++20 support in docs * Test main function should only run in host * Rename to for_each_cancelled_block and extend docs * Fix typo in docs * [pre-commit.ci] auto code formatting * Support C++17, move to different file, improve docs * Fix two typos * Free memory in doc example * Fix typos and add suggestions * [pre-commit.ci] auto code formatting * remove dangling requires clauses * More comments; initial arrive can be relaxed * cancelled 2 cancelled for consistency with PTX * Add missing invocable include * Add missing __detail namespace closing brace * [pre-commit.ci] auto code formatting * Stabilize API for CTK * [pre-commit.ci] auto code formatting * Update docs * Update docs Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com> * Enable tests in C++17 Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com> * Update test Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com> * Update test Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com> * Run test on all silicon * Fix Bernhard suggestions * Improve docs clarity * This needs a cuda compiler * [pre-commit.ci] auto code formatting * Use int as the size type * Use `assert` in tests * Use functions from cuda::std * Reduce includes to necessary ones * Use proper license * Drop unnecessary `__detail` namespace Everything that is `__ugly` is already internal * Cleanup the test a bit more * Drop unsupported dialects * Move to `<cuda/functional>` * clusterlaunchcontrol.try_cancel requires PTX 8.7 * Add missing include * Drop superfluous header guard * Use `NV_DISPATCH_TARGET` because that is more future proof * document requirement on PTX ISA 8.7 * Add check back ^^ * Fix namespace * Add pre PTX ISA 8.7 fallback and use invoke to support function pointers * Move to `<cuda/work_stealing>` * move test file --------- Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>
davebayer
pushed a commit
to davebayer/cccl
that referenced
this pull request
Apr 7, 2025
…cancel) (NVIDIA#3671) * [libcudacxx] Experimental try_cancel exposure * Update documentation * Update try_cancel_blocks ABI Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com> * Update ABI of __cluster_get_dim Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com> * Use if target and provide SW fallback * Use simple license * Fix if guard * Guard for C++20 or newer * Simplify API * Add tests * Clarify C++20 support in docs * Test main function should only run in host * Rename to for_each_cancelled_block and extend docs * Fix typo in docs * [pre-commit.ci] auto code formatting * Support C++17, move to different file, improve docs * Fix two typos * Free memory in doc example * Fix typos and add suggestions * [pre-commit.ci] auto code formatting * remove dangling requires clauses * More comments; initial arrive can be relaxed * cancelled 2 cancelled for consistency with PTX * Add missing invocable include * Add missing __detail namespace closing brace * [pre-commit.ci] auto code formatting * Stabilize API for CTK * [pre-commit.ci] auto code formatting * Update docs * Update docs Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com> * Enable tests in C++17 Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com> * Update test Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com> * Update test Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com> * Run test on all silicon * Fix Bernhard suggestions * Improve docs clarity * This needs a cuda compiler * [pre-commit.ci] auto code formatting * Use int as the size type * Use `assert` in tests * Use functions from cuda::std * Reduce includes to necessary ones * Use proper license * Drop unnecessary `__detail` namespace Everything that is `__ugly` is already internal * Cleanup the test a bit more * Drop unsupported dialects * Move to `<cuda/functional>` * clusterlaunchcontrol.try_cancel requires PTX 8.7 * Add missing include * Drop superfluous header guard * Use `NV_DISPATCH_TARGET` because that is more future proof * document requirement on PTX ISA 8.7 * Add check back ^^ * Fix namespace * Add pre PTX ISA 8.7 fallback and use invoke to support function pointers * Move to `<cuda/work_stealing>` * move test file --------- Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
This PR provides a stable low-level abstraction for Blackwell work-stealing features (PTX
try_cancel) in libcu++. It's targeting CUDA 13.0.The main goals of this low-level abstraction are to:
This PR only covers the thread-block level work-stealing. Once it lands, a follow-up PR will add cluster level work-stealing in a 1:1 fashion.
The following features have made MVP:
There are many extensions to this feature that are worth evaluating but not made it into the MVP. These are tracked here: #3870 .
Checklist