@@ -18,16 +18,14 @@ cuda::device::memcpy_async_tx(
1818 cuda::barrier<cuda::thread_scope_block>& bar);
1919```
2020
-Copies `size` bytes from global memory `src` to shared memory `dest` and arrives
-on a shared memory barrier `bar`, updating its transaction count by `size`
-bytes.
+Copies `size` bytes from global memory `src` to shared memory `dest` and decrements the transaction count of `bar` by `size` bytes.
 
 ## Preconditions
 
 * `src`, `dest` are 16-byte aligned and `size` is a multiple of 16, i.e.,
   `Alignment >= 16`.
-* `dest` points to shared memory
-* `src` points to global memory
+* `dest` points to a shared memory allocation that is at least `size` bytes wide.
+* `src` points to a global memory allocation that is at least `size` bytes wide.
 * `bar` is located in shared memory
 * If either `destination` or `source` is an invalid or null pointer, the
   behavior is undefined (even if `count` is zero).
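
Review note: the description and preconditions in this hunk can be illustrated with a minimal usage sketch. It assumes the usual pattern in which one thread issues the copy and registers the expected byte count on the barrier, while the remaining threads arrive normally. The names `gmem_x`, `smem_x`, and `example_kernel` are illustrative and do not come from this diff. `cuda::device::barrier_arrive_tx` and the `__CUDA_MINIMUM_ARCH__` guard are assumed to be provided by the same libcu++ release. The code is assumed to be compiled for `sm_90` or newer and launched with at most 1024 threads per block.

```cuda
#include <cuda/barrier>
#include <cuda/std/utility> // cuda::std::move

// Assumption: libcu++ defines __CUDA_MINIMUM_ARCH__; if it does not, this guard is a no-op.
#if defined(__CUDA_MINIMUM_ARCH__) && __CUDA_MINIMUM_ARCH__ < 900
static_assert(false, "cuda::device::memcpy_async_tx requires Compute Capability 9.0 or newer.");
#endif

// Illustrative global buffer; 16-byte aligned to satisfy the alignment precondition.
__device__ alignas(16) int gmem_x[2048];

__global__ void example_kernel() {
  using barrier_t = cuda::barrier<cuda::thread_scope_block>;

  // Destination and barrier both live in shared memory, as the preconditions require.
  __shared__ alignas(16) int smem_x[1024];
  __shared__ barrier_t bar;

  if (threadIdx.x == 0) {
    init(&bar, blockDim.x); // every thread in the block arrives once per phase
  }
  __syncthreads();

  barrier_t::arrival_token token;
  if (threadIdx.x == 0) {
    // sizeof(smem_x) is 4096 bytes, a multiple of 16.
    // Completion of the copy decrements the transaction count of bar by that many bytes.
    cuda::device::memcpy_async_tx(
      smem_x, gmem_x, cuda::aligned_size_t<16>(sizeof(smem_x)), bar);
    // Arrive and raise the expected transaction count by the same number of bytes.
    token = cuda::device::barrier_arrive_tx(bar, 1, sizeof(smem_x));
  } else {
    token = bar.arrive(1);
  }

  // Returns once all threads have arrived and the transaction count is back to zero.
  bar.wait(cuda::std::move(token));

  // smem_x now holds gmem_x[0] ... gmem_x[1023].
  smem_x[threadIdx.x] += 1;
}
```

The copy and the `barrier_arrive_tx` call are kept in the same thread so that the expected byte count and the bytes actually copied stay in sync; a mismatch between the two would leave the barrier phase unable to complete.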
@@ -46,9 +44,8 @@ available.
4644
 **Comparison to `cuda::memcpy_async`**: `memcpy_async_tx` supports a subset of
 the operations of `memcpy_async`. It gives more control over the synchronization
-with a barrier than `memcpy_async`. `memcpy_async_tx` has no synchronous
-fallback mechanism, so it can be used to ensure that the newest hardware
-features are used. The drawback is that it does not work on older hardware
+with a barrier than `memcpy_async`. Currently, `memcpy_async_tx` has no synchronous
+fallback mechanism, so it does not work on older hardware
 (pre-CUDA Compute Capability 9.0, i.e., Hopper).
 
 ## Return Value