
Run with import without gpu #261

Merged
karlhigley merged 4 commits into NVIDIA-Merlin:main from jperez999:run-with-import-without-gpu
Mar 29, 2023

Conversation

@jperez999
Collaborator

This PR enables a user to run our GPU Docker container without GPUs and still run all tests successfully. Previously we had try/except blocks in a few places in the code for importing GPU-specific packages. This PR removes those try/except blocks because, in an environment where you have the package but no GPU, you end up getting a ccuda.pyx init failure. This is because the packages (i.e. cudf, dask_cudf, rmm) do exist, but when they try to access information about GPUs they fail and throw an error like:

collecting ... Traceback (most recent call last):
  File "cuda/_cuda/ccuda.pyx", line 3671, in cuda._cuda.ccuda._cuInit
  File "cuda/_cuda/ccuda.pyx", line 435, in cuda._cuda.ccuda.cuPythonInit
RuntimeError: Failed to dlopen libcuda.so
Exception ignored in: 'cuda._lib.ccudart.utils.cudaPythonGlobal.lazyInitGlobal'
Traceback (most recent call last):
  File "cuda/_cuda/ccuda.pyx", line 3671, in cuda._cuda.ccuda._cuInit
  File "cuda/_cuda/ccuda.pyx", line 435, in cuda._cuda.ccuda.cuPythonInit
RuntimeError: Failed to dlopen libcuda.so
Fatal Python error: Segmentation fault

This PR leverages the compat file, making it the single point of import for the main packages (cudf, cupy), and adds a safeguard ensuring those packages are only imported when GPUs are detected. So if you find yourself in a scenario where the package is installed but no GPUs are detected, you can now still safely use the core package. As a result, we can run our containers both with and without the --gpus=all flag in Docker. This was a customer ask, and it helps developers who want to test CPU-only behavior on a machine that has GPUs.
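The gating described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual merlin.core.compat source; `numba.cuda.is_available()` stands in here for the NVML-via-dask device detection the PR relies on.

```python
# Hypothetical sketch of a compat-style module: detect devices first,
# and only attempt the GPU-specific imports when a device is visible.
HAS_GPU = False
cudf = None
cupy = None

try:
    # Stand-in for the NVML-based device detection used in the real module.
    from numba import cuda

    HAS_GPU = cuda.is_available()
except Exception:
    HAS_GPU = False

if HAS_GPU:
    try:
        import cudf  # noqa: F401
        import cupy  # noqa: F401
    except ImportError:
        # Installed-but-broken (or missing) GPU packages fall back to None,
        # so downstream code can check `if cudf is None`.
        cudf = None
        cupy = None
```

Callers would then write `from merlin.core.compat import cudf` and check for `None`, rather than importing cudf directly.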

 def _register_tensor_table_from_cudf_df():
-    import cudf
+    from merlin.core.compat import cudf
Collaborator Author


This import forced me (via interrogate) to add comments for all public methods in the file.

Contributor


Could changing this cause unexpected runtime errors later in the case where cudf is installed but no GPUs are visible?

@jperez999 jperez999 self-assigned this Mar 29, 2023
@jperez999 jperez999 added enhancement New feature or request chore Maintenance for the repository labels Mar 29, 2023
@jperez999 jperez999 modified the milestones: Merlin 23.04, Merlin 23.03 Mar 29, 2023
Contributor

@karlhigley karlhigley left a comment


This seems like a good direction to go in, and I wonder if we can even take it a little farther.

@oliverholworthy
Contributor

oliverholworthy commented Mar 29, 2023

Fatal Python error: Segmentation fault

The segmentation fault was fixed in cuda-python 11.7.1; however, that requires cudf 22.12 or above.

We had a discussion about this before, here: #244 (comment)

@oliverholworthy
Contributor

Having the cudf imports wrapped with HAS_GPU means that we are bundling the idea of having at least one visible device together with the ability to import cudf, for example.

@oliverholworthy
Contributor

I've recently been working toward a similar goal in #244, #243, and #236.

@oliverholworthy
Contributor

Part of those changes involved adding checks for HAS_GPU where relevant, alongside checks of cudf. At one point HAS_GPU was the variable we were using to capture both whether we have a GPU and whether cudf is importable. We split that out in #211. And I suppose this is moving what was once HAS_GPU into compat.cudf, which is clearer than what we had before in the HAS_GPU variable.

One reason we might consider keeping compat.cudf separate from compat.HAS_GPU is mostly about edge cases and error reporting, here for example in the Dataset. I've seen cases where pynvml fails to initialize for some reason but cudf is still available, which may result in some confusion if compat.cudf includes this notion. On the other hand, being able to refer to compat.cudf without checking compat.HAS_GPU does simplify our checks in tests quite a bit, so on balance it could be reasonable, if less explicit. And any code that does need to check for cudf availability (not including visible CUDA devices) can achieve that with a separate import elsewhere, I suppose.

@jperez999
Collaborator Author

Having the cudf imports wrapped with HAS_GPU means that we are bundling the idea of having at least one visible device together with the ability to import cudf, for example.

I mean, we can't ever use cudf without having a GPU, so that seems like the correct usage. Do you see a time when we could use cudf without a GPU?

@jperez999
Collaborator Author

jperez999 commented Mar 29, 2023

Part of those changes involved adding checks for HAS_GPU where relevant, alongside checks of cudf. At one point HAS_GPU was the variable we were using to capture both whether we have a GPU and whether cudf is importable. We split that out in #211. And I suppose this is moving what was once HAS_GPU into compat.cudf, which is clearer than what we had before in the HAS_GPU variable.

One reason we might consider keeping compat.cudf separate from compat.HAS_GPU is mostly about edge cases and error reporting, here for example in the Dataset. I've seen cases where pynvml fails to initialize for some reason but cudf is still available, which may result in some confusion if compat.cudf includes this notion. On the other hand, being able to refer to compat.cudf without checking compat.HAS_GPU does simplify our checks in tests quite a bit, so on balance it could be reasonable, if less explicit. And any code that does need to check for cudf availability (not including visible CUDA devices) can achieve that with a separate import elsewhere, I suppose.

So if there is ever a time we cannot detect a GPU, we should not import cudf; chances are it would also fail on cudf's internal imports. We should work to make sure that we detect GPUs, and dask's NVML should not have problems detecting them; I have not seen that happen. We are no longer using pynvml to detect GPUs. Since we are now using NVML from dask, we are essentially offloading device detection to dask. I think that is OK, but I would like to get a consensus. Thoughts?

@jperez999
Collaborator Author

@oliverholworthy, in terms of the edge case you outlined:

core/merlin/io/dataset.py

Lines 249 to 261 in ec9a360

if self.cpu is False:
    if not HAS_GPU:
        raise RuntimeError(
            "Cannot initialize Dataset on GPU. "
            "No devices detected (with pynvml). "
            "Check that pynvml can be initialized. "
        )
    if cudf is None:
        raise RuntimeError(
            "Cannot initialize Dataset on GPU. "
            "cudf package not found. "
            "Check that cudf is installed in this environment and can be imported. "
        )
We first check whether HAS_GPU is available, and only if it is do we check on cudf. This is exactly what happens in compat: we first decide whether you have GPUs, then we try to import the package. I don't see how this is an edge case. The only thing we could change is the error message, since we now rely on NVML via dask instead of pynvml.
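For illustration, the updated check might look something like this. This is a hypothetical sketch where only the error message changes to reflect the NVML-via-dask detection; `check_gpu_dataset` is an invented helper name, not code from this PR.

```python
def check_gpu_dataset(cudf, has_gpu):
    """Validate GPU prerequisites before building a GPU-backed Dataset.

    Hypothetical helper mirroring the dataset.py check quoted above,
    with the error message updated for NVML-based detection.
    """
    if not has_gpu:
        raise RuntimeError(
            "Cannot initialize Dataset on GPU. "
            "No devices detected (via dask's NVML bindings). "
            "Check that NVML can be initialized. "
        )
    if cudf is None:
        raise RuntimeError(
            "Cannot initialize Dataset on GPU. "
            "cudf package not found. "
            "Check that cudf is installed in this environment and can be imported. "
        )
```

The ordering mirrors compat itself: device detection first, then the importability of cudf.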

@oliverholworthy
Contributor

we can't ever use cudf without having a GPU, so that seems like the correct usage. Do you see a time when we could use cudf without a GPU?

Yeah, that's true. Aside from the segfault in earlier versions of cudf, you still can't use cudf without a GPU even if it's importable.

@oliverholworthy
Contributor

I'm on board with this change to wrap the imports in compat with HAS_GPU.

@karlhigley karlhigley force-pushed the run-with-import-without-gpu branch from b80e9e2 to b83b159 Compare March 29, 2023 18:11
        translator=NumpyPreprocessor("cudf", cudf_translator, attrs=["_categories"]),
    )
    _dtype_registry.register("cudf", cudf_dtypes)
except ImportError as exc:
Contributor


Do we still need this try/except here? What kind of ImportError can we get now that we're checking cudf?

if cudf:
    dask_cudf = pytest.importorskip("dask_cudf")
else:
    pytest.mark.skip(reason="cudf did not import successfully")
Contributor


One interpretation of this could suggest to someone reading it that only the cudf import matters here. Could be worth mentioning that we also need at least one visible CUDA device?
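A skip condition that makes both requirements explicit might look like this. This is a hypothetical sketch; `_skip_unless_cudf_and_gpu` is an invented name, not code from this PR.

```python
import pytest


def _skip_unless_cudf_and_gpu(cudf, has_gpu):
    """Skip the calling test unless cudf imported AND a CUDA device is visible.

    Hypothetical helper: separates the two failure reasons so the skip
    message tells the reader which prerequisite was missing.
    """
    if cudf is None:
        pytest.skip("cudf did not import successfully")
    if not has_gpu:
        pytest.skip("no visible CUDA devices detected")
```

Distinct skip messages avoid the ambiguity the comment above points out.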

@karlhigley karlhigley merged commit 1011afb into NVIDIA-Merlin:main Mar 29, 2023
karlhigley pushed a commit to karlhigley/core that referenced this pull request Mar 29, 2023
karlhigley pushed a commit to karlhigley/core that referenced this pull request Mar 29, 2023
karlhigley added a commit that referenced this pull request Mar 29, 2023
* Revert "Run with import without gpu (#261)"

This reverts commit 1011afb.

* Revert "Update `merlin.core.compat` to use `HAS_GPU` and add add'l libraries (#262)"

This reverts commit e365820.

---------

Co-authored-by: Julio Perez <37191411+jperez999@users.noreply.github.com>
