Merged
19 changes: 14 additions & 5 deletions merlin/core/compat.py

```diff
@@ -13,13 +13,22 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 #
-HAS_GPU = False
 try:
     from numba import cuda
-    try:
-        HAS_GPU = len(cuda.gpus.lst) > 0
-    except cuda.cudadrv.error.CudaSupportError:
-        pass
 except ImportError:
     cuda = None
+
+HAS_GPU = False
+try:
+    import dask_cuda
+
+    HAS_GPU = dask_cuda.utils.get_gpu_count()
+except ImportError:
+    # Don't let numba.cuda set the context
+    # unless dask_cuda is not installed
+    if cuda is not None:
+        try:
+            HAS_GPU = len(cuda.gpus.lst) > 0
```
Contributor:
I made the change that added this check, and wasn't very familiar with the side effects of numba.cuda.gpus.lst and its relationship with dask_cuda.

The motivation for the change was making the automatic cluster-type selection in the Distributed class in merlin.core.utils work more reliably in contexts with and without GPUs. That selection previously used a successful import of numba.cuda to determine whether or not to use dask_cuda.LocalCUDACluster.

Given that use case, maybe we don't need anything in this except block? If we get to this point (without dask_cuda) and numba.cuda reports visible CUDA devices, that doesn't help us anyway, since we're using dask_cuda for the distributed functionality.
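To make the cluster-selection motivation above concrete, here is a minimal sketch of that kind of logic. This is illustrative only, not the actual merlin.core.utils code: the helper name, the string return values, and the `HAS_DASK_CUDA` flag are all our own inventions; only `dask_cuda.LocalCUDACluster` and `distributed.LocalCluster` are real class names.

```python
# Illustrative sketch: pick a Dask cluster class based on a HAS_GPU-style
# flag plus whether dask_cuda is importable. Names are hypothetical.
try:
    from dask_cuda import LocalCUDACluster  # noqa: F401  (probe import only)
    HAS_DASK_CUDA = True
except ImportError:
    HAS_DASK_CUDA = False


def choose_cluster_class(has_gpu: bool) -> str:
    """Return which cluster class to use: a GPU cluster only when both
    the GPU flag is set and dask_cuda is importable; otherwise fall back
    to the CPU LocalCluster."""
    if has_gpu and HAS_DASK_CUDA:
        return "dask_cuda.LocalCUDACluster"
    return "distributed.LocalCluster"
```

The point of the guarded import is that a GPU being visible to numba is not enough on its own; the distributed path still needs dask_cuda installed.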

Contributor Author:

> I made the change that added this check, and wasn't very familiar with the side effects of numba.cuda.gpus.lst and its relationship with dask_cuda.

No worries - I wouldn't have had any idea that this was a problem if I hadn't been running the benchmark.

> Given that use case, maybe we don't need anything in this except block?

Since dask_cuda depends on pynvml anyway, I suppose we can just simplify the HAS_GPU logic to be:

try:
    import pynvml

    try:
        pynvml.nvmlInit()
        HAS_GPU = pynvml.nvmlDeviceGetCount() > 0
    except pynvml.nvml.NVMLError_LibraryNotFound:
        HAS_GPU = False
except ImportError:
    HAS_GPU = False

cc @pentschev (in case there is another recommended way to check for a CUDA GPU on the system) :)

Contributor:

I added this comment before seeing the suggestion about pynvml above. Doing that would keep the intent of this variable: to indicate whether or not at least one GPU is visible.

Assuming there are no side effects of pynvml.nvmlInit() that would cause issues.
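If the nvmlInit() side effect is the concern, one hedged way to probe is to pair the call with nvmlShutdown() so the check leaves NVML uninitialized afterwards. A sketch assuming the standard pynvml API; the helper name is our own:

```python
def count_gpus_via_pynvml() -> int:
    """Probe the visible GPU count via NVML, returning 0 when pynvml is
    missing or NVML cannot initialize. The nvmlShutdown() in the finally
    block keeps the probe from leaving NVML initialized as a side effect.
    Sketch only; the helper name is illustrative."""
    try:
        import pynvml
    except ImportError:
        return 0
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:  # no driver, no GPUs, or library not found
        return 0
    try:
        return pynvml.nvmlDeviceGetCount()
    finally:
        pynvml.nvmlShutdown()
```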

@pentschev:

Do you always have Distributed installed? If so, it may be simpler to just use https://github.com/dask/distributed/blob/e05c7a44b18e8f617681f5906054189bceb20f34/distributed/diagnostics/nvml.py#L113-L118. The reason I suggest that is that more complex cases (such as having PyNVML installed but no libnvidia-ml.so available, or no GPUs) may need extra handling; see https://github.com/dask/distributed/blob/e05c7a44b18e8f617681f5906054189bceb20f34/distributed/diagnostics/nvml.py#L59-L110.

Contributor Author:

> Do you always have Distributed installed? If so, it may be simpler to just use https://github.com/dask/distributed/blob/e05c7a44b18e8f617681f5906054189bceb20f34/distributed/diagnostics/nvml.py#L113-L118.

Thanks for advising, @pentschev!
Hmm. My understanding is that distributed may not be installed, but I suppose we can do:

try:
    from numba import cuda
except ImportError:
    cuda = None

HAS_GPU = False
try:
    from distributed.diagnostics import nvml

    HAS_GPU = nvml.device_get_count() > 0
except ImportError:
    HAS_GPU = cuda is not None

If I understand Oliver's comments correctly, we are okay using the success of import numba.cuda to set HAS_GPU if we are not going to be using dask_cuda anyway. Is that correct @oliverholworthy?

Contributor:

distributed is a dependency of this project, so we could use the device_get_count function from distributed if that simplifies handling the edge cases.

Contributor Author:

Okay, sounds good! We can let distributed worry about the status of nvml.

```diff
+        except cuda.cudadrv.error.CudaSupportError:
+            pass
```
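Putting the thread's conclusion together, the resulting check might look like the following. This is a sketch of the agreed direction (defer NVML edge cases to distributed's diagnostics wrapper, falling back to the mere presence of numba.cuda only when distributed itself is unavailable), not necessarily the exact code that was merged:

```python
# Sketch: let distributed's nvml wrapper handle missing libnvidia-ml.so,
# failed NVML init, etc.; only if distributed is absent do we fall back
# to treating an importable numba.cuda as "has a GPU".
try:
    from numba import cuda
except ImportError:
    cuda = None

try:
    from distributed.diagnostics import nvml

    # device_get_count() returns 0 when NVML cannot initialize.
    HAS_GPU = nvml.device_get_count() > 0
except ImportError:
    HAS_GPU = cuda is not None
```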