
Run with import without gpu #261

Merged
karlhigley merged 4 commits into NVIDIA-Merlin:main from jperez999:run-with-import-without-gpu
Mar 29, 2023

Conversation

@jperez999
Collaborator

This PR enables a user to run our GPU Docker container without GPUs and still run all tests successfully. Previously we had try/except blocks in a few places in the code for importing GPU-specific packages. This PR removes those try/except blocks because, in an environment where you have the package but no GPU, you end up getting a ccuda.pyx init failure. This is because the packages (i.e. cudf, dask_cudf, rmm) do exist, but when they try to access information about GPUs they fail and throw an error like:

collecting ... Traceback (most recent call last):
  File "cuda/_cuda/ccuda.pyx", line 3671, in cuda._cuda.ccuda._cuInit
  File "cuda/_cuda/ccuda.pyx", line 435, in cuda._cuda.ccuda.cuPythonInit
RuntimeError: Failed to dlopen libcuda.so
Exception ignored in: 'cuda._lib.ccudart.utils.cudaPythonGlobal.lazyInitGlobal'
Traceback (most recent call last):
  File "cuda/_cuda/ccuda.pyx", line 3671, in cuda._cuda.ccuda._cuInit
  File "cuda/_cuda/ccuda.pyx", line 435, in cuda._cuda.ccuda.cuPythonInit
RuntimeError: Failed to dlopen libcuda.so
Fatal Python error: Segmentation fault

This PR leverages the compat file, making it the single point of import for the main packages (cudf, cupy), and adds a safeguard ensuring those packages are only imported when GPUs are detected. So if you find yourself in a scenario where the package is installed but no GPUs are detected, you can now still safely use the core package. As a result, we can run our containers both with and without the --gpus=all flag in Docker. This was a customer ask, and it helps developers who want to test CPU-only behavior on a machine that has GPUs.
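The gating described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual merlin.core.compat source; `numba.cuda.is_available()` stands in here for the NVML-via-dask device detection the PR relies on.

```python
# Hypothetical sketch of a compat-style module: detect devices first,
# and only attempt the GPU-specific imports when a device is visible.
HAS_GPU = False
cudf = None
cupy = None

try:
    # Stand-in for the NVML-based device detection used in the real module.
    from numba import cuda

    HAS_GPU = cuda.is_available()
except Exception:
    HAS_GPU = False

if HAS_GPU:
    try:
        import cudf  # noqa: F401
        import cupy  # noqa: F401
    except ImportError:
        # Installed-but-broken (or missing) GPU packages fall back to None,
        # so downstream code can check `if cudf is None`.
        cudf = None
        cupy = None
```

Callers would then write `from merlin.core.compat import cudf` and check for `None`, rather than importing cudf directly.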

 def _register_tensor_table_from_cudf_df():
-    import cudf
+    from merlin.core.compat import cudf
Collaborator Author


This import forced me (via interrogate) to add comments for all public methods in the file.

Contributor


Could changing this cause unexpected runtime errors later in the case where cudf is installed but no GPUs are visible?

@jperez999 jperez999 self-assigned this Mar 29, 2023
@jperez999 jperez999 added enhancement New feature or request chore Maintenance for the repository labels Mar 29, 2023
@jperez999 jperez999 modified the milestones: Merlin 23.04, Merlin 23.03 Mar 29, 2023
Contributor

@karlhigley karlhigley left a comment


This seems like a good direction to go in, and I wonder if we can even take it a little farther.

@oliverholworthy
Contributor

oliverholworthy commented Mar 29, 2023

Fatal Python error: Segmentation fault

The segmentation fault was fixed in cuda-python 11.7.1; however, that requires cudf 22.12 or above.

We had a discussion about this before, here: #244 (comment)

@oliverholworthy
Contributor

Having the cudf imports wrapped with HAS_GPU means that we are bundling the idea of having at least one visible device together with the ability to import cudf, for example.

@oliverholworthy
Contributor

I've recently been working toward a similar goal in #244, #243, and #236.

@oliverholworthy
Contributor

Part of those changes involved adding checks for HAS_GPU where relevant, alongside checks of cudf. At one point HAS_GPU was the variable we were using to capture both whether we have a GPU and whether cudf is importable. We split that out in #211. And I suppose this is moving what was once HAS_GPU into compat.cudf, which is clearer than what we had before in the HAS_GPU variable.

One reason we might consider keeping compat.cudf separate from compat.HAS_GPU is mostly about edge cases and error reporting, here for example in the Dataset. I've seen cases where pynvml fails to initialize for some reason but cudf is still available, which may result in some confusion if compat.cudf includes this notion. On the other hand, being able to refer to compat.cudf without checking compat.HAS_GPU does simplify our checks in tests quite a bit, so on balance it could be reasonable, if less explicit. And any code that does need to check for cudf availability (not including visible CUDA devices) can achieve that with a separate import elsewhere, I suppose.

@jperez999
Collaborator Author

Having the cudf imports wrapped with HAS_GPU means that we are bundling the idea of having at least one visible device together with the ability to import cudf, for example.

I mean, we can't ever use cudf without having a GPU, so that seems like the correct usage. Do you see a time when we could use cudf without a GPU?

@jperez999
Collaborator Author

jperez999 commented Mar 29, 2023

Part of those changes involved adding checks for HAS_GPU where relevant, alongside checks of cudf. At one point HAS_GPU was the variable we were using to capture both whether we have a GPU and whether cudf is importable. We split that out in #211. And I suppose this is moving what was once HAS_GPU into compat.cudf, which is clearer than what we had before in the HAS_GPU variable.

One reason we might consider keeping compat.cudf separate from compat.HAS_GPU is mostly about edge cases and error reporting, here for example in the Dataset. I've seen cases where pynvml fails to initialize for some reason but cudf is still available, which may result in some confusion if compat.cudf includes this notion. On the other hand, being able to refer to compat.cudf without checking compat.HAS_GPU does simplify our checks in tests quite a bit, so on balance it could be reasonable, if less explicit. And any code that does need to check for cudf availability (not including visible CUDA devices) can achieve that with a separate import elsewhere, I suppose.

So if there is ever a time we cannot detect a GPU, we should not import cudf; chances are it would also fail on cudf's internal imports. We should work to make sure that we detect GPUs, and dask's NVML should not have problems detecting them; I have not seen that happen. We are no longer using pynvml to detect GPUs. Since we are now using NVML from dask, we are essentially offloading device detection to dask. I think that is OK, but I would like to get a consensus. Thoughts?

@jperez999
Collaborator Author

@oliverholworthy, in terms of the edge case you outlined:

core/merlin/io/dataset.py

Lines 249 to 261 in ec9a360

if self.cpu is False:
    if not HAS_GPU:
        raise RuntimeError(
            "Cannot initialize Dataset on GPU. "
            "No devices detected (with pynvml). "
            "Check that pynvml can be initialized. "
        )
    if cudf is None:
        raise RuntimeError(
            "Cannot initialize Dataset on GPU. "
            "cudf package not found. "
            "Check that cudf is installed in this environment and can be imported. "
        )
We first check whether HAS_GPU is available, and only if it is do we check on cudf. This is exactly what happens in compat: we first decide whether you have GPUs, then we try to import the package. I don't see how this is an edge case. The only thing we could change is the error message, since we now rely on NVML via dask instead of pynvml.
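For illustration, the updated check might look something like this. This is a hypothetical sketch where only the error message changes to reflect the NVML-via-dask detection; `check_gpu_dataset` is an invented helper name, not code from this PR.

```python
def check_gpu_dataset(cudf, has_gpu):
    """Validate GPU prerequisites before building a GPU-backed Dataset.

    Hypothetical helper mirroring the dataset.py check quoted above,
    with the error message updated for NVML-based detection.
    """
    if not has_gpu:
        raise RuntimeError(
            "Cannot initialize Dataset on GPU. "
            "No devices detected (via dask's NVML bindings). "
            "Check that NVML can be initialized. "
        )
    if cudf is None:
        raise RuntimeError(
            "Cannot initialize Dataset on GPU. "
            "cudf package not found. "
            "Check that cudf is installed in this environment and can be imported. "
        )
```

The ordering mirrors compat itself: device detection first, then the importability of cudf.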

@oliverholworthy
Contributor

we can't ever use cudf without having a GPU, so that seems like the correct usage. Do you see a time when we could use cudf without a GPU?

Yeah, that's true. Aside from the segfault in earlier versions of cudf, you still can't use cudf without a GPU even if it's importable.

@oliverholworthy
Contributor

I'm on board with this change to wrap the imports in compat with HAS_GPU.

@karlhigley karlhigley force-pushed the run-with-import-without-gpu branch from b80e9e2 to b83b159 Compare March 29, 2023 18:11
        translator=NumpyPreprocessor("cudf", cudf_translator, attrs=["_categories"]),
    )
    _dtype_registry.register("cudf", cudf_dtypes)
except ImportError as exc:
Contributor


Do we still need this try/except here? What kind of ImportError can we get now that we're checking cudf?

if cudf:
    dask_cudf = pytest.importorskip("dask_cudf")
else:
    pytest.mark.skip(reason="cudf did not import successfully")
Contributor


One interpretation of this could suggest to someone reading it that only the cudf import matters here. Could be worth mentioning that we also need at least one visible CUDA device?
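A skip condition that makes both requirements explicit might look like this. This is a hypothetical sketch; `_skip_unless_cudf_and_gpu` is an invented name, not code from this PR.

```python
import pytest


def _skip_unless_cudf_and_gpu(cudf, has_gpu):
    """Skip the calling test unless cudf imported AND a CUDA device is visible.

    Hypothetical helper: separates the two failure reasons so the skip
    message tells the reader which prerequisite was missing.
    """
    if cudf is None:
        pytest.skip("cudf did not import successfully")
    if not has_gpu:
        pytest.skip("no visible CUDA devices detected")
```

Distinct skip messages avoid the ambiguity the comment above points out.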

@karlhigley karlhigley merged commit 1011afb into NVIDIA-Merlin:main Mar 29, 2023
karlhigley pushed a commit to karlhigley/core that referenced this pull request Mar 29, 2023
karlhigley pushed a commit to karlhigley/core that referenced this pull request Mar 29, 2023
karlhigley added a commit that referenced this pull request Mar 29, 2023
* Revert "Run with import without gpu (#261)"

This reverts commit 1011afb.

* Revert "Update `merlin.core.compat` to use `HAS_GPU` and add add'l libraries (#262)"

This reverts commit e365820.

---------

Co-authored-by: Julio Perez <37191411+jperez999@users.noreply.github.com>
