While attempting to benchmark NVIDIA-Merlin/NVTabular#1687, I discovered that the dask-criteo benchmark does not work with the latest version of NVTabular/Merlin-core.
As far as I can tell, the problem is that #98 added the following logic to detect GPU availability: `HAS_GPU = len(cuda.gpus.lst) > 0`. This logic works just fine within a local process, but it breaks Dask-CUDA device pinning when it is executed by a top-level import (or anywhere in the global context of the program). In other words, code like this shouldn't run as a side effect of an import statement like `from merlin.core.compat import HAS_GPU`.
The problem becomes apparent in a simple (Merlin-free) reproducer:
```python
# reproducer.py
from dask_cuda import LocalCUDACluster
from numba import cuda  # This is fine

HAS_GPU = len(cuda.gpus.lst) > 0  # This is not fine

if __name__ == "__main__":
    cluster = LocalCUDACluster()
```

If you execute `python ./reproducer.py`, you will see warnings like:
```
/.../distributed/distributed/comm/ucx.py:67: UserWarning: Worker with process ID 49507 should have a CUDA context assigned to device 1, but instead the CUDA context is on device 0. This is often the result of a CUDA-enabled library calling a CUDA runtime function before Dask-CUDA can spawn worker processes. Please make sure any such function calls don't happen at import time or in the global scope of a program.
```
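As the warning suggests, one way to avoid the problem is to defer the device query until it is actually needed, rather than running it at import time. The following is a minimal sketch (not Merlin's actual fix; the module and function names are hypothetical) of wrapping the same check in a cached function, so that importing the module never touches the CUDA runtime:

```python
# safe_compat.py -- hypothetical sketch: defer GPU detection so that merely
# importing this module never initializes a CUDA context.
import functools


@functools.lru_cache(maxsize=None)
def has_gpu() -> bool:
    """Return True if at least one CUDA device is visible.

    The numba import and the device query happen only on the first call,
    never at import time, so Dask-CUDA workers can pin their devices
    before any CUDA context is created.
    """
    try:
        from numba import cuda  # imported lazily, inside the function

        return len(cuda.gpus.lst) > 0
    except Exception:
        # No CUDA toolkit, no driver, or no visible devices.
        return False
```

Callers would then use `has_gpu()` instead of a module-level `HAS_GPU` constant, keeping the check out of the global scope of any program that imports the module.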