Update GPU detection in merlin.core.utils for Distributed class #98
CI Results
GitHub pull request #98 of commit 5386077a55d023523a735ba2d21a9a3be18685ed, no merge conflicts.
Running as SYSTEM
Setting status of 5386077a55d023523a735ba2d21a9a3be18685ed to PENDING with url https://10.20.13.93:8080/job/merlin_core/65/console and message: 'Pending'
Using context: Jenkins
Building on master in workspace /var/jenkins_home/workspace/merlin_core
using credential ce87ff3c-94f0-400a-8303-cb4acb4918b5
> git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
> git config remote.origin.url https://github.com/NVIDIA-Merlin/core # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/core
> git --version # timeout=10
using GIT_ASKPASS to set credentials login for merlin-systems username and pass
> git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/core +refs/pull/98/*:refs/remotes/origin/pr/98/* # timeout=10
> git rev-parse 5386077a55d023523a735ba2d21a9a3be18685ed^{commit} # timeout=10
Checking out Revision 5386077a55d023523a735ba2d21a9a3be18685ed (detached)
> git config core.sparsecheckout # timeout=10
> git checkout -f 5386077a55d023523a735ba2d21a9a3be18685ed # timeout=10
Commit message: "Check cuda.gpus.lst for available GPUs"
> git rev-list --no-walk f336ca3ff96810efbded64e0559ebb880ee06364 # timeout=10
[merlin_core] $ /bin/bash /tmp/jenkins14859643852832795028.sh
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Requirement already up-to-date: setuptools in /var/jenkins_home/.local/lib/python3.8/site-packages (62.3.2)
/usr/share/python-wheels/urllib3-1.25.8-py2.py3-none-any.whl/urllib3/connectionpool.py:999: InsecureRequestWarning: Unverified HTTPS request is being made to host 'pypi.ngc.nvidia.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_core/core, configfile: pyproject.toml
plugins: anyio-3.5.0, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 342 items / 1 skipped
5386077 to 21fbb71
CI Results
GitHub pull request #98 of commit 21fbb71aa1808f19a1357651992b0c2f9bb60239, no merge conflicts.
Running as SYSTEM
Setting status of 21fbb71aa1808f19a1357651992b0c2f9bb60239 to PENDING with url https://10.20.13.93:8080/job/merlin_core/66/console and message: 'Pending'
Using context: Jenkins
Building on master in workspace /var/jenkins_home/workspace/merlin_core
using credential ce87ff3c-94f0-400a-8303-cb4acb4918b5
> git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
> git config remote.origin.url https://github.com/NVIDIA-Merlin/core # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/core
> git --version # timeout=10
using GIT_ASKPASS to set credentials login for merlin-systems username and pass
> git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/core +refs/pull/98/*:refs/remotes/origin/pr/98/* # timeout=10
> git rev-parse 21fbb71aa1808f19a1357651992b0c2f9bb60239^{commit} # timeout=10
Checking out Revision 21fbb71aa1808f19a1357651992b0c2f9bb60239 (detached)
> git config core.sparsecheckout # timeout=10
> git checkout -f 21fbb71aa1808f19a1357651992b0c2f9bb60239 # timeout=10
Commit message: "Check cuda.gpus.lst for available GPUs"
> git rev-list --no-walk 5386077a55d023523a735ba2d21a9a3be18685ed # timeout=10
[merlin_core] $ /bin/bash /tmp/jenkins15247958366670095269.sh
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Requirement already up-to-date: setuptools in /var/jenkins_home/.local/lib/python3.8/site-packages (62.3.2)
/usr/share/python-wheels/urllib3-1.25.8-py2.py3-none-any.whl/urllib3/connectionpool.py:999: InsecureRequestWarning: Unverified HTTPS request is being made to host 'pypi.ngc.nvidia.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_core/core, configfile: pyproject.toml
plugins: anyio-3.5.0, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 342 items / 1 skipped
CI Results
GitHub pull request #98 of commit 0e6f300caea67715418756e1f77b3990d8010caf, no merge conflicts.
Running as SYSTEM
Setting status of 0e6f300caea67715418756e1f77b3990d8010caf to PENDING with url https://10.20.13.93:8080/job/merlin_core/67/console and message: 'Pending'
Using context: Jenkins
Building on master in workspace /var/jenkins_home/workspace/merlin_core
using credential ce87ff3c-94f0-400a-8303-cb4acb4918b5
> git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
> git config remote.origin.url https://github.com/NVIDIA-Merlin/core # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/core
> git --version # timeout=10
using GIT_ASKPASS to set credentials login for merlin-systems username and pass
> git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/core +refs/pull/98/*:refs/remotes/origin/pr/98/* # timeout=10
> git rev-parse 0e6f300caea67715418756e1f77b3990d8010caf^{commit} # timeout=10
Checking out Revision 0e6f300caea67715418756e1f77b3990d8010caf (detached)
> git config core.sparsecheckout # timeout=10
> git checkout -f 0e6f300caea67715418756e1f77b3990d8010caf # timeout=10
Commit message: "Move gpu check to compat module"
> git rev-list --no-walk 21fbb71aa1808f19a1357651992b0c2f9bb60239 # timeout=10
[merlin_core] $ /bin/bash /tmp/jenkins13254012742949887097.sh
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Requirement already up-to-date: setuptools in /var/jenkins_home/.local/lib/python3.8/site-packages (62.3.2)
/usr/share/python-wheels/urllib3-1.25.8-py2.py3-none-any.whl/urllib3/connectionpool.py:999: InsecureRequestWarning: Unverified HTTPS request is being made to host 'pypi.ngc.nvidia.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_core/core, configfile: pyproject.toml
plugins: anyio-3.5.0, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 342 items / 1 skipped
Updating the automatic GPU detection in `merlin.core.utils` so that the `Distributed` class works on both GPU and CPU automatically, depending on availability.

Motivation

While adding the XGBoost + Dask integration in Merlin Models, I added an example that uses the `merlin.core.utils.Distributed` helper and encountered this issue with the default behaviour. NVIDIA-Merlin/models#466

Implementation Details 🚧

- `numba.cuda` doesn't necessarily indicate that we have GPUs available.
- Check `numba.cuda.gpus.lst`, handling a potential `CudaSupportError` exception.
- `merlin.core.dispatch` has a `HAS_GPU` variable; however, this raises a `RuntimeError` when the GPU is unavailable in some configurations (and a lazy runtime error even if you try to catch a `RuntimeError` on importing `dask_cudf`).

Testing Details

Unsure how best to automate tests for this in CI.

Manual tests conducted:

- `CUDA_VISIBLE_DEVICES` unset -> `HAS_GPU = True`
- `CUDA_VISIBLE_DEVICES=""` -> `HAS_GPU = False`
- `--gpu` setting -> `HAS_GPU = False`
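
For readers who want to see the shape of the check described under Implementation Details, here is a minimal sketch. It is not the code from this PR: the function name `detect_gpu` is hypothetical, and the sketch assumes that a missing `numba` install should also count as "no GPU". It relies only on `numba.cuda.gpus.lst` and `numba.cuda.cudadrv.error.CudaSupportError`, which the description mentions.

```python
def detect_gpu() -> bool:
    """Return True only if numba can enumerate at least one CUDA device.

    Hypothetical helper for illustration; the name is not taken from the PR.
    """
    try:
        # Importing numba.cuda can succeed on a machine with no usable GPU,
        # so a successful import alone is not treated as evidence of a GPU.
        from numba import cuda
        from numba.cuda.cudadrv.error import CudaSupportError
    except ImportError:
        # Assumption: without numba installed we fall back to CPU.
        return False

    try:
        # Accessing `lst` initialises the CUDA driver and lists the devices.
        return len(cuda.gpus.lst) > 0
    except CudaSupportError:
        # Raised when the driver cannot be initialised (for example, no
        # driver library, or no GPU exposed to the container).
        return False


HAS_GPU = detect_gpu()
print(f"HAS_GPU = {HAS_GPU}")
```

Running this script with `CUDA_VISIBLE_DEVICES` unset and again with `CUDA_VISIBLE_DEVICES=""` is one way to repeat the manual checks listed above.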