[cuda.cooperative] Support multidimensional thread blocks in block load/store and improve load/store docs (NVIDIA#3161)

* [cuda.cooperative] Support multidimensional thread blocks in block load/store
* [cuda.cooperative] Add tests and documentation for multidimensional block loads and stores
* [cuda.cooperative] Remove an unnecessary synchronization from the block load/store example and fix the return types of block load/store in the docs
"""Creates an operation that performs a block-wide load.

Returns a callable object that can be linked to and invoked from device code. It can be
invoked with the following signature:

- `(src: numba.types.Array, dest: numba.types.Array) -> None`: Each thread loads
  `items_per_thread` items from `src` into `dest`. `dest` must contain at least
  `items_per_thread` items.

Different data movement strategies can be selected via the `algorithm` parameter:

- `algorithm="direct"` (default): A blocked arrangement of data is read directly from memory.
- `algorithm="striped"`: A striped arrangement of data is read directly from memory.
- `algorithm="vectorize"`: A blocked arrangement of data is read directly from memory using CUDA's built-in vectorized loads as a coalescing optimization.
- `algorithm="transpose"`: A striped arrangement of data is read directly from memory and is then locally transposed into a blocked arrangement.
- `algorithm="warp_transpose"`: A warp-striped arrangement of data is read directly from memory and is then locally transposed into a blocked arrangement.
- `algorithm="warp_transpose_timesliced"`: A warp-striped arrangement of data is read directly from memory and is then locally transposed into a blocked arrangement one warp at a time.

For more details, [read the corresponding CUB C++ documentation](https://nvidia.github.io/cccl/cub/api/classcub_1_1BlockLoad.html).

Args:
    dtype: Data type being loaded
    threads_per_block: The number of threads in a block, either an integer or a tuple of 2 or 3 integers
    items_per_thread: The number of items each thread loads
    algorithm: The data movement algorithm to use

Example:
    The code snippet below illustrates a striped load and store of 128 integer items by 32 threads, with
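The distinction between blocked and striped arrangements drives the choice of `algorithm`. As a rough sketch (a CPU-side illustration with numpy, not the cuda.cooperative device API), with `T` threads and `I` items per thread over `T*I` items, a blocked arrangement gives thread `t` the contiguous chunk `src[t*I : (t+1)*I]`, while a striped arrangement gives it the interleaved items `src[t], src[t+T], src[t+2*T], ...`:

```python
import numpy as np

def blocked_arrangement(src, threads, items_per_thread):
    # Thread t's registers hold src[t*I : (t+1)*I] (contiguous chunks).
    return src.reshape(threads, items_per_thread)

def striped_arrangement(src, threads, items_per_thread):
    # Thread t's registers hold src[t], src[t+T], src[t+2T], ... (interleaved).
    return src.reshape(items_per_thread, threads).T

src = np.arange(8)  # 4 threads, 2 items each
print(blocked_arrangement(src, 4, 2))  # row t = thread t's items: [[0 1] [2 3] [4 5] [6 7]]
print(striped_arrangement(src, 4, 2))  # row t = thread t's items: [[0 4] [1 5] [2 6] [3 7]]
```

The striped arrangement is what makes a direct read coalesced: at each step, adjacent threads touch adjacent memory locations.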
"""Creates an operation that performs a block-wide store.

Returns a callable object that can be linked to and invoked from device code. It can be
invoked with the following signature:

- `(dest: numba.types.Array, src: numba.types.Array) -> None`: Each thread stores
  `items_per_thread` items from `src` into `dest`. `src` must contain at least
  `items_per_thread` items.

Different data movement strategies can be selected via the `algorithm` parameter:

- `algorithm="direct"` (default): A blocked arrangement of data is written directly to memory.
- `algorithm="striped"`: A striped arrangement of data is written directly to memory.
- `algorithm="vectorize"`: A blocked arrangement of data is written directly to memory using CUDA's built-in vectorized stores as a coalescing optimization.
- `algorithm="transpose"`: A blocked arrangement is locally transposed into a striped arrangement which is then written to memory.
- `algorithm="warp_transpose"`: A blocked arrangement is locally transposed into a warp-striped arrangement which is then written to memory.
- `algorithm="warp_transpose_timesliced"`: A blocked arrangement is locally transposed into a warp-striped arrangement which is then written to memory. To reduce the shared memory requirement, only one warp's worth of shared memory is provisioned and is subsequently time-sliced among warps.

For more details, [read the corresponding CUB C++ documentation](https://nvidia.github.io/cccl/cub/api/classcub_1_1BlockStore.html).

Args:
    dtype: Data type being stored
    threads_per_block: The number of threads in a block, either an integer or a tuple of 2 or 3 integers
    items_per_thread: The number of items each thread stores
    algorithm: The data movement algorithm to use

Example:
    The code snippet below illustrates a striped load and store of 128 integer items by 32 threads, with
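A striped store is the inverse of a striped load, so the two round-trip: storing each thread's registers in striped order and then loading them back striped recovers the original blocked registers. A CPU-side sketch with numpy (again an illustration of the data layout, not the cuda.cooperative device API):

```python
import numpy as np

def store_striped(regs):
    # regs: (threads, items_per_thread) blocked registers.
    # Thread t writes its i-th item to dest[i*threads + t].
    threads, items_per_thread = regs.shape
    return regs.T.reshape(threads * items_per_thread)

def load_striped(mem, threads, items_per_thread):
    # Thread t reads mem[t], mem[t+threads], ... back into its registers.
    return mem.reshape(items_per_thread, threads).T

regs = np.arange(8).reshape(4, 2)  # blocked: thread t owns items 2t, 2t+1
mem = store_striped(regs)          # striped layout in memory: [0 2 4 6 1 3 5 7]
assert (load_striped(mem, 4, 2) == regs).all()  # striped load round-trips
```

This round-trip property is why a matched `algorithm="striped"` load/store pair is safe even though the intermediate memory layout differs from the blocked register order; the `transpose` store variants exist for when the data in memory must end up in its original (blocked) order.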