Add cuda::device::memcpy_async_tx#405
Conversation
f400754 to
3398386
Compare
gonzalobg
left a comment
There was a problem hiding this comment.
Initial review, this is probably blocked on the other PRs about arrive_tx and the TMA intrinsics.
libcudacxx/docs/extended_api/synchronization_primitives/barrier/memcpy_async_tx.md
Show resolved
Hide resolved
libcudacxx/docs/extended_api/synchronization_primitives/barrier/memcpy_async_tx.md
Show resolved
Hide resolved
| Copies `size` bytes from global memory `src` to shared memory `dest` and arrives | ||
| on a shared memory barrier `bar`, updating its transaction count by `size` | ||
| bytes. | ||
|
|
There was a problem hiding this comment.
Docs should mention preconditions and effects.
There was a problem hiding this comment.
I did the preconditions. I really don't grok what should be documented under effects, so I have left that as is. The current memcpy_async docs didn't have anything I could use as inspiration.
libcudacxx/docs/extended_api/synchronization_primitives/barrier/memcpy_async_tx.md
Outdated
Show resolved
Hide resolved
libcudacxx/docs/extended_api/synchronization_primitives/barrier/memcpy_async_tx.md
Show resolved
Hide resolved
libcudacxx/docs/extended_api/synchronization_primitives/barrier/memcpy_async_tx.md
Show resolved
Hide resolved
libcudacxx/docs/extended_api/synchronization_primitives/barrier/memcpy_async_tx.md
Outdated
Show resolved
Hide resolved
libcudacxx/include/cuda/std/detail/libcxx/include/__cuda/barrier.h
Outdated
Show resolved
Hide resolved
griwes
left a comment
There was a problem hiding this comment.
This seems fine. If we ever come up with the formulation for pipeline_tx we'll want to move the function into its own header and reuse much of the code between pipeline and barrier versions, but since that's very much a future thing, this is sufficient for now.
libcudacxx/include/cuda/std/detail/libcxx/include/__cuda/barrier.h
Outdated
Show resolved
Hide resolved
libcudacxx/docs/extended_api/synchronization_primitives/barrier/memcpy_async_tx.md
Show resolved
Hide resolved
| // unconditionally. This permits checking for the feature in host code. When | ||
| // __CUDA_MINIMUM_ARCH__ is available, we only enable the feature when the | ||
| // hardware supports it. | ||
| #if (!defined(__CUDA_MINIMUM_ARCH__)) || (defined(__CUDA_MINIMUM_ARCH__) && 900 <= __CUDA_MINIMUM_ARCH__) |
libcudacxx/docs/extended_api/synchronization_primitives/barrier/memcpy_async_tx.md
Show resolved
Hide resolved
3398386 to
9aab0cf
Compare
ahendriksen
left a comment
There was a problem hiding this comment.
Thanks for the review. I have moved the docs, added tests, and improved the docs.
The only thing that I am still unsure of is the what the effects section should list. @gonzalobg
libcudacxx/docs/extended_api/synchronization_primitives/barrier/memcpy_async_tx.md
Show resolved
Hide resolved
libcudacxx/docs/extended_api/synchronization_primitives/barrier/memcpy_async_tx.md
Outdated
Show resolved
Hide resolved
libcudacxx/include/cuda/std/detail/libcxx/include/__cuda/barrier.h
Outdated
Show resolved
Hide resolved
libcudacxx/docs/extended_api/synchronization_primitives/barrier/memcpy_async_tx.md
Show resolved
Hide resolved
libcudacxx/docs/extended_api/synchronization_primitives/barrier/memcpy_async_tx.md
Show resolved
Hide resolved
libcudacxx/docs/extended_api/synchronization_primitives/barrier/memcpy_async_tx.md
Outdated
Show resolved
Hide resolved
| Copies `size` bytes from global memory `src` to shared memory `dest` and arrives | ||
| on a shared memory barrier `bar`, updating its transaction count by `size` | ||
| bytes. | ||
|
|
There was a problem hiding this comment.
I did the preconditions. I really don't grok what should be documented under effects, so I have left that as is. The current memcpy_async docs didn't have anything I could use as inspiration.
libcudacxx/docs/extended_api/synchronization_primitives/barrier/memcpy_async_tx.md
Show resolved
Hide resolved
5ea0488 to
7ac7ce7
Compare
libcudacxx/docs/extended_api/asynchronous_operations/memcpy_async_tx.md
Outdated
Show resolved
Hide resolved
libcudacxx/docs/extended_api/asynchronous_operations/memcpy_async_tx.md
Outdated
Show resolved
Hide resolved
libcudacxx/docs/extended_api/asynchronous_operations/memcpy_async_tx.md
Outdated
Show resolved
Hide resolved
libcudacxx/docs/extended_api/asynchronous_operations/memcpy_async_tx.md
Outdated
Show resolved
Hide resolved
libcudacxx/docs/extended_api/synchronization_primitives/barrier.md
Outdated
Show resolved
Hide resolved
libcudacxx/docs/extended_api/asynchronous_operations/memcpy_async_tx.md
Outdated
Show resolved
Hide resolved
libcudacxx/.upstream-tests/test/cuda/memcpy_async/memcpy_async_tx.pass.cpp
Show resolved
Hide resolved
libcudacxx/.upstream-tests/test/cuda/memcpy_async/memcpy_async_tx.pass.cpp
Show resolved
Hide resolved
libcudacxx/docs/extended_api/asynchronous_operations/memcpy_async_tx.md
Outdated
Show resolved
Hide resolved
libcudacxx/docs/extended_api/asynchronous_operations/memcpy_async_tx.md
Outdated
Show resolved
Hide resolved
libcudacxx/docs/extended_api/asynchronous_operations/memcpy_async_tx.md
Outdated
Show resolved
Hide resolved
9f85f3c to
abe3457
Compare
libcudacxx/.upstream-tests/test/cuda/memcpy_async/memcpy_async_tx.pass.cpp
Outdated
Show resolved
Hide resolved
libcudacxx/include/cuda/std/detail/libcxx/include/__cuda/barrier.h
Outdated
Show resolved
Hide resolved
Co-authored-by: gonzalobg <65027571+gonzalobg@users.noreply.github.com>
abe3457 to
b886685
Compare
libcudacxx/docs/extended_api/asynchronous_operations/memcpy_async_tx.md
Outdated
Show resolved
Hide resolved
libcudacxx/docs/extended_api/asynchronous_operations/memcpy_async_tx.md
Outdated
Show resolved
Hide resolved
…ync_tx.md Co-authored-by: gonzalobg <65027571+gonzalobg@users.noreply.github.com>
|
Thanks a lot for adding this new feature 🎉 |
Description
Closes #404
Add
cuda::device::memcpy_async_txfor global to shared memory loads.Checklist