
Commit 7d7d65a

Update multi-gpu discussion for device_buffer and device_vector dtors (rapidsai#1524)
Since rapidsai#1370, the dtor for `device_buffer` ensures that the correct device is active when the deallocation occurs. We therefore update the example to discuss this. Since `device_vector` still requires the user to manage the active device correctly by hand, call this out explicitly in the documentation.

- Closes rapidsai#1523

Authors:
- Lawrence Mitchell (https://github.com/wence-)

Approvers:
- Mark Harris (https://github.com/harrism)

URL: rapidsai#1524
1 parent af756c6 commit 7d7d65a

File tree: 1 file changed (+51 −8 lines)


README.md

Lines changed: 51 additions & 8 deletions
````diff
@@ -336,25 +336,68 @@ for(int i = 0; i < N; ++i) {
 
 Note that the CUDA device that is current when creating a `device_memory_resource` must also be
 current any time that `device_memory_resource` is used to deallocate memory, including in a
-destructor. This affects RAII classes like `rmm::device_buffer` and `rmm::device_uvector`. Here's an
-(incorrect) example that assumes the above example loop has been run to create a
-`pool_memory_resource` for each device. A correct example adds a call to `cudaSetDevice(0)` on the
-line of the error comment.
+destructor. The RAII class `rmm::device_buffer` and classes that use it as a backing store
+(`rmm::device_scalar` and `rmm::device_uvector`) handle this by storing the active device when the
+constructor is called, and then ensuring that the stored device is active whenever an allocation or
+deallocation is performed (including in the destructor). The user must therefore only ensure that
+the device active during _creation_ of an `rmm::device_buffer` matches the active device of the
+memory resource being used.
+
+Here is an incorrect example that creates a memory resource on device zero and then uses it to
+allocate a `device_buffer` on device one:
 
 ```c++
 {
   RMM_CUDA_TRY(cudaSetDevice(0));
-  rmm::device_buffer buf_a(16);
-
+  auto mr = rmm::mr::cuda_memory_resource{};
   {
     RMM_CUDA_TRY(cudaSetDevice(1));
-    rmm::device_buffer buf_b(16);
+    // Invalid, current device is 1, but MR is only valid for device 0
+    rmm::device_buffer buf(16, rmm::cuda_stream_default, &mr);
   }
+}
+```
+
+A correct example creates the device buffer with device zero active. After that it is safe to switch
+devices and let the buffer go out of scope and destruct with a different device active. For example,
+this code is correct:
+
+```c++
+{
+  RMM_CUDA_TRY(cudaSetDevice(0));
+  auto mr = rmm::mr::cuda_memory_resource{};
+  rmm::device_buffer buf(16, rmm::cuda_stream_default, &mr);
+  RMM_CUDA_TRY(cudaSetDevice(1));
+  ...
+  // No need to switch back to device 0 before ~buf runs
+}
+```
+
````
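The device-tracking behaviour that the added text describes can be sketched without CUDA. The mock below is illustrative only: `current_device` and `set_device` stand in for the CUDA runtime's active-device state (`cudaGetDevice`/`cudaSetDevice`), and `tracked_buffer` is a hypothetical class, not part of RMM, that mirrors the same pattern: capture the active device at construction, then make it active (and restore the caller's device) around deallocation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Mock of the CUDA runtime's active-device state (stands in for
// cudaGetDevice/cudaSetDevice). Purely illustrative, not real CUDA.
static int current_device = 0;
static std::vector<int> dealloc_devices;  // device active at each deallocation

void set_device(int d) { current_device = d; }

// Hypothetical sketch of the pattern described for rmm::device_buffer:
// remember the device active at construction, and make it active (restoring
// the caller's device afterwards) when the allocation is freed.
class tracked_buffer {
  int device_;  // device captured at construction time
  void* data_;  // stands in for the device allocation
 public:
  explicit tracked_buffer(std::size_t n)
    : device_{current_device}, data_{std::malloc(n)} {}
  ~tracked_buffer() {
    int entered = current_device;  // caller's active device
    set_device(device_);           // activate the stored device...
    dealloc_devices.push_back(current_device);
    std::free(data_);              // ...so the free targets it
    set_device(entered);           // then restore the caller's device
  }
};
```

With this pattern the destructor is safe to run with any device active, which is why the corrected example does not need to switch back to device 0 before `~buf` runs.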
````diff
+#### Use of `rmm::device_vector` with multiple devices
+
+> [!CAUTION] In contrast to the uninitialized `rmm::device_uvector`, `rmm::device_vector` **DOES
+> NOT** store the active device during construction, and therefore cannot arrange for it to be
+> active when the destructor runs. It is therefore the responsibility of the user to ensure the
+> currently active device is correct.
+
+`rmm::device_vector` is therefore slightly less ergonomic to use in a multiple-device setting,
+since the caller must arrange that the active devices at allocation and deallocation match.
+Recapitulating the previous example using `rmm::device_vector`:
 
-  // Error: when buf_a is destroyed, the current device must be 0, but it is 1
+```c++
+{
+  RMM_CUDA_TRY(cudaSetDevice(0));
+  auto mr = rmm::mr::cuda_memory_resource{};
+  rmm::device_vector<int> vec(16, rmm::mr::thrust_allocator<int>(rmm::cuda_stream_default, &mr));
+  RMM_CUDA_TRY(cudaSetDevice(1));
+  ...
+  // ERROR: ~vec runs with device 1 active, but needs device 0 to be active
 }
 ```
 
+A correct example adds a call to `cudaSetDevice(0)` on the line of the error comment, before the
+dtor for `~vec` runs.
+
 ## `cuda_stream_view` and `cuda_stream`
 
 `rmm::cuda_stream_view` is a simple non-owning wrapper around a CUDA `cudaStream_t`. This wrapper's
````
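Because `rmm::device_vector` does no such tracking, the caller must make the right device active by hand before the vector is destroyed. A common way to manage that manually is a small scope guard; the `device_guard` below is a hypothetical sketch against the same kind of mock device state as above, not an RMM API.

```cpp
#include <cassert>

// Mock of the CUDA runtime's active-device state
// (stands in for cudaGetDevice/cudaSetDevice). Illustrative only.
static int current_device = 0;

void set_device(int d) { current_device = d; }
int get_device() { return current_device; }

// Hypothetical scope guard: activate `target` now, and restore the previously
// active device when the scope ends. Destroying a device_vector-like object
// inside such a scope ensures its deallocation runs with the right device
// active, with no manual cudaSetDevice calls on the way out.
class device_guard {
  int old_;
 public:
  explicit device_guard(int target) : old_{get_device()} { set_device(target); }
  ~device_guard() { set_device(old_); }
  device_guard(device_guard const&) = delete;
  device_guard& operator=(device_guard const&) = delete;
};
```

In the erroneous example, entering such a guard for device 0 before `~vec` runs would play the role of the `cudaSetDevice(0)` call on the line of the error comment.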

0 commit comments
