[libcudacxx] Stable abstraction for Blackwell work-stealing (PTX try_cancel) #3671
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged

Commits (58):
- 803e781 [libcudacxx] Experimental try_cancel exposure (gonzalobg)
- 3740f0f Update documentation (gonzalobg)
- 09c48f8 Update try_cancel_blocks ABI (gonzalobg)
- 6d15626 Update ABI of __cluster_get_dim (gonzalobg)
- 4b9f6a3 Use if target and provide SW fallback (gonzalobg)
- 57dcd42 Use simple license (gonzalobg)
- aba388b Fix if guard (gonzalobg)
- 220cf29 Guard for C++20 or newer (gonzalobg)
- d2baa62 Simplify API (gonzalobg)
- 27a6a52 Add tests (gonzalobg)
- e1f090b Clarify C++20 support in docs (gonzalobg)
- 7e99c76 Test main function should only run in host (gonzalobg)
- 0b62b46 Rename to for_each_cancelled_block and extend docs (gonzalobg)
- 0202b52 Fix typo in docs (gonzalobg)
- a9b46bc [pre-commit.ci] auto code formatting (pre-commit-ci[bot])
- dd6c971 Support C++17, move to different file, improve docs (gonzalobg)
- cbd4885 Fix two typos (gonzalobg)
- bdc1011 Free memory in doc example (gonzalobg)
- ac02abc Fix typos and add suggestions (gonzalobg)
- ad807c1 [pre-commit.ci] auto code formatting (pre-commit-ci[bot])
- 9e75ce2 remove dangling requires clauses (gonzalobg)
- ab5f28b More comments; initial arrive can be relaxed (gonzalobg)
- b3a9933 cancelled 2 cancelled for consistency with PTX (gonzalobg)
- 299e4b3 Add missing invocable include (gonzalobg)
- e20a87a Add missing __detail namespace closing brace (gonzalobg)
- 1ddcc83 [pre-commit.ci] auto code formatting (pre-commit-ci[bot])
- d11cb32 Merge branch 'main' into try_cancel_api (gonzalobg)
- c09a381 Stabilize API for CTK (gonzalobg)
- aba1484 [pre-commit.ci] auto code formatting (pre-commit-ci[bot])
- 69c5313 Update docs (gonzalobg)
- 59cf5e1 Update docs (gonzalobg)
- 9cb8360 Enable tests in C++17 (gonzalobg)
- c3da9b3 Update test (gonzalobg)
- cb16ef3 Update test (gonzalobg)
- 898ae57 Run test on all silicon (gonzalobg)
- ea9f3bb Fix Bernhard suggestions (gonzalobg)
- 709671c Improve docs clarity (gonzalobg)
- 1e5ed5d This needs a cuda compiler (gonzalobg)
- 18d2cec [pre-commit.ci] auto code formatting (pre-commit-ci[bot])
- 57a6918 Use int as the size type (miscco)
- 2654baa Use `assert` in tests (miscco)
- 828fbd1 Use functions from cuda::std (miscco)
- 01bb9ac Reduce includes to necessary ones (miscco)
- e788fb0 Use proper license (miscco)
- f7fc53c Drop unnecessary `__detail` namespace (miscco)
- ca0e92a Cleanup the test a bit more (miscco)
- e03de4e Drop unsupported dialects (miscco)
- 4104076 Move to `<cuda/functional>` (miscco)
- a4b9783 clusterlaunchcontrol.try_cancel requires PTX 8.7 (miscco)
- 0272ffe Add missing include (miscco)
- 7ea70fb Drop superfluous header guard (miscco)
- 284d84b Use `NV_DISPATCH_TARGET` because that is more future proof (miscco)
- 8829e90 document requirement on PTX ISA 8.7 (miscco)
- 0d1991e Add check back ^^ (miscco)
- da83903 Fix namespace (miscco)
- 7068da5 Add pre PTX ISA 8.7 fallback and use invoke to support function pointers (miscco)
- 91f3e4d Move to `<cuda/work_stealing>` (miscco)
- 990bb85 move test file (miscco)
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Extended API index (diff):

.. code:: diff

   @@ -16,3 +16,4 @@ Extended API
      extended_api/streams
      extended_api/memory_resource
      extended_api/math
   +  extended_api/work_stealing

New documentation page (111 lines):
.. _libcudacxx-extended-api-work-stealing:

Work stealing
=============

In header file ``<cuda/for_each_canceled>``:

.. code:: cuda

   namespace cuda {

   template <int ThreadBlockRank = 3, typename UnaryFunction = ..unspecified..>
   __device__ void for_each_canceled_block(UnaryFunction uf);

   } // namespace cuda

On devices with compute capability 10.0 or higher, it may leverage hardware acceleration.

This API is mainly intended for implementing work-stealing at thread-block granularity. Compared with alternative work-distribution techniques, such as `grid-stride loops <https://developer.nvidia.com/blog/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/>`__, which distribute work statically, or other dynamic work-distribution techniques built on global-memory concurrency, its main advantages are:

- It performs work-stealing dynamically: thread blocks that finish their work sooner may do more work than thread blocks whose work takes longer.
- It may cooperate with the GPU work scheduler to respect work priorities and perform load balancing.
- It may have lower work-stealing latency than global-memory atomics.
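For contrast, here is a sketch of the static alternative mentioned above. A grid-stride loop fixes each thread's share of the work at launch time, so a thread block that finishes early cannot take over work assigned to a slower block:

.. code:: cuda

   // Static work distribution with a grid-stride loop: every thread's share
   // of the work is determined by the launch configuration alone, so there is
   // no opportunity for dynamic load balancing between thread blocks.
   __global__ void vec_add_grid_stride(const int* a, const int* b, int* c, int n) {
     for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < n;
          i += blockDim.x * gridDim.x) {
       c[i] += a[i] + b[i];
     }
   }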
For better performance, extract the shared thread-block prologue and epilogue outside the lambda and reuse them across thread-block iterations:

- Prologue: thread-block initialization code and data common to all thread blocks, e.g., ``__shared__`` memory allocations and their initialization.
- Epilogue: thread-block finalization code common to all thread blocks, e.g., writing shared memory back to global memory.
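As a sketch of this guideline, consider a hypothetical per-block sum kernel (assuming a launch with 256 threads per block): the ``__shared__`` prologue is hoisted out of the lambda so it runs once per physical thread block rather than once per stolen block index:

.. code:: cuda

   // Hypothetical example: the shared-memory allocation (prologue) lives
   // outside the lambda; only the per-block body runs once per block index.
   __global__ void sum_per_block(const int* in, int* block_sums, int n) {
     __shared__ int partial[256]; // prologue: allocated once per physical block

     cuda::for_each_canceled_block<1>([&](dim3 block_idx) {
       int idx = threadIdx.x + block_idx.x * blockDim.x;
       partial[threadIdx.x] = (idx < n) ? in[idx] : 0;
       __syncthreads();
       // Tree reduction in shared memory.
       for (int s = blockDim.x / 2; s > 0; s /= 2) {
         if (threadIdx.x < s) {
           partial[threadIdx.x] += partial[threadIdx.x + s];
         }
         __syncthreads();
       }
       if (threadIdx.x == 0) {
         block_sums[block_idx.x] = partial[0];
       }
     });
     // Epilogue (if any) would go here, once per physical thread block.
   }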
**Mandates**:

- ``ThreadBlockRank`` equals the rank of the thread block: ``1``, ``2``, or ``3`` for one-dimensional, two-dimensional, and three-dimensional thread blocks, respectively.
- ``is_invocable_r_v<void, UnaryFunction, dim3>`` is ``true``.

**Preconditions**:

- All threads of the current thread block shall call ``for_each_canceled_block`` **exactly once**.

**Effects**:

- Invokes ``uf`` with ``blockIdx``, then repeatedly attempts to cancel the launch of another thread block in the current grid:

  - on success, it invokes ``uf`` with the cancelled thread block's ``blockIdx`` and repeats;
  - on failure, it returns.
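The effects above can be summarized with a behavioral sketch; ``try_cancel_another_block`` below is a hypothetical placeholder for the hardware cancellation attempt, not a real API:

.. code:: cuda

   // Behavioral sketch only, not the actual implementation (which uses
   // clusterlaunchcontrol.try_cancel PTX instructions on supported hardware).
   template <class UnaryFunction>
   __device__ void for_each_canceled_block_sketch(UnaryFunction uf) {
     dim3 block_idx = blockIdx; // first invocation uses this block's own index
     do {
       uf(block_idx);
       // Hypothetical helper: on success, writes the cancelled block's index
       // into block_idx and returns true; on failure, returns false.
     } while (try_cancel_another_block(&block_idx));
   }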
Example
-------

This example shows how to perform work-stealing at thread-block granularity using this API.
.. code:: cuda

   #include <cuda/for_each_canceled>
   #include <cuda/math>
   #include <iostream>

   __global__ void vec_add(int* a, int* b, int* c, int n) {
     // Extract the common prologue outside the lambda, e.g.,
     // - __shared__ or global (malloc) memory allocation,
     // - common initialization code,
     // - etc.

     cuda::for_each_canceled_block<1>([=](dim3 block_idx) {
       int idx = threadIdx.x + block_idx.x * blockDim.x;
       // assert(block_idx == blockIdx); // May fail!
       if (idx < n) {
         c[idx] += a[idx] + b[idx];
       }
     });
     // Note: calling for_each_canceled_block<1> again from this
     // thread block exhibits undefined behavior.

     // Extract the common epilogue outside the lambda, e.g.,
     // - write shared memory back to global memory,
     // - external synchronization,
     // - global memory deallocation (free),
     // - etc.
   }

   int main() {
     int N = 10000;
     int *a, *b, *c;
     cudaMallocManaged(&a, N * sizeof(int));
     cudaMallocManaged(&b, N * sizeof(int));
     cudaMallocManaged(&c, N * sizeof(int));

     for (int i = 0; i < N; ++i) {
       a[i] = i;
       b[i] = 1;
       c[i] = 0;
     }

     int tpb = 256;
     int bpg = cuda::ceil_div(N, tpb);

     vec_add<<<bpg, tpb>>>(a, b, c, N);
     cudaDeviceSynchronize();

     bool success = true;
     for (int i = 0; i < N; ++i) {
       if (c[i] != (1 + i)) {
         std::cerr << "ERROR " << i << ", " << c[i] << std::endl;
         success = false;
       }
     }
     cudaFree(a);
     cudaFree(b);
     cudaFree(c);

     return success ? 0 : 1;
   }
New header file (240 lines):
// SPDX-FileCopyrightText: Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception

#ifndef _CUDA_FOR_EACH_CANCELED
#define _CUDA_FOR_EACH_CANCELED

#include <cuda/std/detail/__config>

#include <cuda/std/type_traits> // For cuda::std::is_invocable_r_v
#include <cuda/std/utility> // For cuda::std::move, unreachable

#include <nv/target>

#if defined(_CCCL_IMPLICIT_SYSTEM_HEADER_GCC)
#  pragma GCC system_header
#elif defined(_CCCL_IMPLICIT_SYSTEM_HEADER_CLANG)
#  pragma clang system_header
#elif defined(_CCCL_IMPLICIT_SYSTEM_HEADER_MSVC)
#  pragma system_header
#endif // no system header

_LIBCUDACXX_BEGIN_NAMESPACE_CUDA

namespace __detail
{

template <int __I>
_CCCL_NODISCARD _CCCL_DEVICE _CCCL_HIDE_FROM_ABI int __cluster_get_dim(__int128 __result) noexcept
{
  int __r;
  if constexpr (__I == 0)
  {
    asm volatile("clusterlaunchcontrol.query_cancel.get_first_ctaid::x.b32.b128 %0, %1;"
                 : "=r"(__r)
                 : "q"(__result)
                 : "memory");
  }
  else if constexpr (__I == 1)
  {
    asm volatile("clusterlaunchcontrol.query_cancel.get_first_ctaid::y.b32.b128 %0, %1;"
                 : "=r"(__r)
                 : "q"(__result)
                 : "memory");
  }
  else if constexpr (__I == 2)
  {
    asm volatile("clusterlaunchcontrol.query_cancel.get_first_ctaid::z.b32.b128 %0, %1;"
                 : "=r"(__r)
                 : "q"(__result)
                 : "memory");
  }
  else
  {
    _CCCL_UNREACHABLE();
  }
  return __r;
}

/// This API, used to implement work-stealing, repeatedly attempts to cancel the launch of a thread block
/// from the current grid. On success, it invokes the unary function `__uf` before trying again.
/// On failure, it returns.
///
/// This API does not provide any memory synchronization.
/// This API does not guarantee that any thread will invoke `__uf` with the next block index until all
/// invocations of `__uf` for the prior block index have returned.
///
/// Preconditions:
/// - All thread block threads shall call this API exactly once.
/// - Exactly one thread block thread shall call this API with `__is_leader` equal to `true`.
template <int __ThreadBlockRank = 3, typename __UnaryFunction = void>
_CCCL_DEVICE _CCCL_HIDE_FROM_ABI void __for_each_canceled_block(bool __is_leader, __UnaryFunction __uf)
{
  static_assert(__ThreadBlockRank >= 1 && __ThreadBlockRank <= 3, "ThreadBlockRank out-of-range [1, 3].");
  static_assert(::cuda::std::is_invocable_r_v<void, __UnaryFunction, dim3>,
                "__for_each_canceled_block requires a UnaryFunction with signature: void(dim3).\n"
                "For example, call with lambda: __for_each_canceled_block([](dim3 block_idx) { ... });");
  dim3 __block_idx = dim3(blockIdx.x, 1, 1);
  if constexpr (__ThreadBlockRank >= 2)
  {
    __block_idx = dim3(blockIdx.x, blockIdx.y, 1);
  }
  if constexpr (__ThreadBlockRank >= 3)
  {
    __block_idx = dim3(blockIdx.x, blockIdx.y, blockIdx.z);
  }

  NV_IF_ELSE_TARGET(
    NV_PROVIDES_SM_100,
    (
      __shared__ uint64_t __barrier; // TODO: use 2 barriers and 2 results to avoid last sync threads
      __shared__ __int128 __result;
      bool __phase = false;

      // Initialize barrier and kick-start try_cancel pipeline:
      if (__is_leader) {
        auto __leader_mask = __activemask();
        asm volatile(
          "{\n\t"
          ".reg .pred p;\n\t"
          // elect.sync is a workaround for peeling loop (#nvbug-id)
          "elect.sync _|p, %2;\n\t"
          "@p mbarrier.init.shared::cta.b64 [%1], 1;\n\t"
          // `try_cancel` accesses the mbarrier using the generic proxy, so no cross-proxy fence is required here
          "@p clusterlaunchcontrol.try_cancel.async.shared::cta.mbarrier::complete_tx::bytes.b128 [%0], [%1];\n\t"
          // This arrive does not order prior memory operations and can be relaxed.
          "@p mbarrier.arrive.expect_tx.relaxed.cta.shared::cta.b64 _, [%1], 16;\n\t"
          "}"
          :
          : "r"((int) __cvta_generic_to_shared(&__result)),
            "r"((int) __cvta_generic_to_shared(&__barrier)),
            "r"(__leader_mask)
          : "memory");
      }

      do {
        __uf(__block_idx); // Invoke unary function.

        if (__is_leader)
        {
          asm volatile(
            "{\n\t"
            ".reg .pred p;\n\t"
            "waitLoop:\n\t\t"
            "mbarrier.try_wait.parity.relaxed.cta.shared.b64 p, [%0], %1;\n\t\t"
            "@!p bra waitLoop;\n\t"
            "}"
            :
            : "r"((int) __cvta_generic_to_shared(&__barrier)), "r"((unsigned) __phase)
            : "memory");
          __phase = !__phase;
        }
        __syncthreads(); // All threads of the prior thread block have "exited".
        // Note: this syncthreads provides the .acquire.cta fence preventing
        // the next query operations from being re-ordered above the poll loop.
        {
          int __success = 0;
          asm volatile(
            "{\n\t"
            ".reg .pred p;\n\t"
            "clusterlaunchcontrol.query_cancel.is_canceled.pred.b128 p, %1;\n\t"
            "selp.b32 %0, 1, 0, p;\n\t"
            "}\n\t"
            : "=r"(__success)
            : "q"(__result));
          if (__success != 1)
          {
            // Invalidating the mbarrier and synchronizing before exiting is not
            // required since each thread block calls this API at most once.
            break;
          }
        }

        // Read new thread block dimensions
        dim3 __b(__detail::__cluster_get_dim<0>(__result), 1, 1);
        if constexpr (__ThreadBlockRank >= 2)
        {
          __b.y = __detail::__cluster_get_dim<1>(__result);
        }
        if constexpr (__ThreadBlockRank == 3)
        {
          __b.z = __detail::__cluster_get_dim<2>(__result);
        }
        __block_idx = __b;

        // Wait for all threads to read __result before issuing the next async op.
        // generic->generic synchronization
        __syncthreads();
        // TODO: only the control warp requires this sync; other warps could just arrive
        // TODO: double-buffering result+barrier pairs using the phase would avoid this sync

        if (__is_leader)
        {
          auto __leader_mask = __activemask();
          asm volatile(
            "{\n\t"
            ".reg .pred p;\n\t"
            // elect.sync is a workaround for peeling loop (#nvbug-id)
            "elect.sync _|p, %2;\n\t"
            // generic->async release + acquire synchronization of prior reads:
            // use a bi-directional cross-proxy acq_rel fence instead of uni-directional rel; acq; fences.
            "@p fence.proxy.async.shared::cta;\n\t"
            // try to cancel another thread block
            "@p clusterlaunchcontrol.try_cancel.async.shared::cta.mbarrier::complete_tx::bytes.b128 [%0], [%1];\n\t"
            "@p mbarrier.arrive.expect_tx.relaxed.cta.shared::cta.b64 _, [%1], 16;\n\t"
            "}"
            :
            : "r"((int) __cvta_generic_to_shared(&__result)),
              "r"((int) __cvta_generic_to_shared(&__barrier)),
              "r"(__leader_mask)
            : "memory");
        }
      } while (true);),
    ( // NV_IF_ELSE_TARGET(NV_PROVIDES_SM_100,
      // SW fallback for lower compute capabilities.
      // TODO: it may make sense to __trap here instead, since lower compute capabilities may want
      // to do something else (grid-stride, atomics, etc.).
      // A higher-level abstraction like for_each should handle that.

      __uf(__block_idx);)) // NV_IF_ELSE_TARGET(NV_PROVIDES_SM_100,
}

} // namespace __detail

/// This API, used to implement work-stealing, repeatedly attempts to cancel the launch of a thread block
/// from the current grid. On success, it invokes the unary function `__uf` before trying again.
/// On failure, it returns.
///
/// This API does not provide any memory synchronization.
/// This API does not guarantee that any thread will invoke `__uf` with the next block index until all
/// invocations of `__uf` for the prior block index have returned.
///
/// Preconditions:
/// - All thread block threads shall call this API exactly once.
template <int __ThreadBlockRank = 3, typename __UnaryFunction = void>
_CCCL_DEVICE _CCCL_HIDE_FROM_ABI void for_each_canceled_block(__UnaryFunction __uf)
{
  static_assert(__ThreadBlockRank >= 1 && __ThreadBlockRank <= 3,
                "for_each_canceled_block<ThreadBlockRank>: ThreadBlockRank out-of-range [1, 3].");
  static_assert(::cuda::std::is_invocable_r_v<void, __UnaryFunction, dim3>,
                "for_each_canceled_block requires a UnaryFunction with signature: void(dim3).\n"
                "For example, call with lambda: for_each_canceled_block([](dim3 block_idx) { ... });");
  if constexpr (__ThreadBlockRank == 1)
  {
    __detail::__for_each_canceled_block<1>(threadIdx.x == 0, ::cuda::std::move(__uf));
  }
  else if constexpr (__ThreadBlockRank == 2)
  {
    __detail::__for_each_canceled_block<2>(threadIdx.x == 0 && threadIdx.y == 0, ::cuda::std::move(__uf));
  }
  else if constexpr (__ThreadBlockRank == 3)
  {
    __detail::__for_each_canceled_block<3>(
      threadIdx.x == 0 && threadIdx.y == 0 && threadIdx.z == 0, ::cuda::std::move(__uf));
  }
}

_LIBCUDACXX_END_NAMESPACE_CUDA

#endif // _CUDA_FOR_EACH_CANCELED