
Remove use of HAS_GPU from dispatch functions#244

Merged
karlhigley merged 9 commits into NVIDIA-Merlin:main from oliverholworthy:dispatch-remove-has-gpu
Mar 17, 2023

Conversation

@oliverholworthy
Contributor

Remove use of HAS_GPU from dispatch functions.

Following #211, the variable HAS_GPU no longer reflects whether or not cudf is installed. This can result in confusing errors when using the dispatch functions in a GPU-enabled environment without cudf available (`'NoneType' object has no attribute 'DataFrame'`).
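A minimal sketch of the failure mode being fixed (not the actual merlin.core.dispatch source; the `make_df_old`/`make_df_new` names are illustrative): when HAS_GPU is true but the cudf import failed, keying dispatch off HAS_GPU dereferences a `None` module.

```python
# Hypothetical sketch, assuming a GPU-enabled env where cudf failed to import.
import pandas as pd

cudf = None      # stands in for a failed `import cudf`
HAS_GPU = True   # GPU detected (e.g. via pynvml), but cudf unavailable

def make_df_old(data):
    # Buggy pattern: keys off HAS_GPU, so df_lib is None here.
    df_lib = cudf if HAS_GPU else pd
    return df_lib.DataFrame(data)  # AttributeError: 'NoneType' object has no attribute 'DataFrame'

def make_df_new(data):
    # Fixed pattern: keys off whether cudf actually imported.
    df_lib = cudf if cudf is not None else pd
    return df_lib.DataFrame(data)
```

With the fixed pattern, `make_df_new({"a": [1]})` falls back to pandas cleanly instead of raising.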

@oliverholworthy oliverholworthy added the chore Maintenance for the repository label Mar 13, 2023
@oliverholworthy oliverholworthy self-assigned this Mar 13, 2023
oliverholworthy added a commit to oliverholworthy/core that referenced this pull request Mar 13, 2023
```diff
 rmm = None

-if HAS_GPU:
+if cudf:
```
Contributor

I could be missing something, but it seems like cudf would need to be imported from merlin.core.compat for this to work

Contributor Author

I missed this commit when moving from the other PR. Updated

```python
cupy = None

try:
    import cudf
```
Contributor Author

Based on your comment here ("may blow up in CPU-only environments") @karlhigley

This is currently the case: if you try to import cudf in our containers without GPUs available, you'll get a segmentation fault, which the `except ImportError` can't catch:

```
Exception ignored in: 'cuda._lib.ccudart.utils.cudaPythonGlobal.lazyInitGlobal'
Traceback (most recent call last):
  File "cuda/_cuda/ccuda.pyx", line 3671, in cuda._cuda.ccuda._cuInit
  File "cuda/_cuda/ccuda.pyx", line 435, in cuda._cuda.ccuda.cuPythonInit
RuntimeError: Failed to dlopen libcuda.so
Segmentation fault (core dumped)
```
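For context, the guarded-import pattern under discussion can be sketched generically as below (`optional_import` is a hypothetical helper, not project code). The key limitation noted above: a segmentation fault inside the CUDA driver kills the process outright, so the `except ImportError` never gets a chance to run.

```python
def optional_import(name):
    """Return the module if importable, else None. Catches ImportError
    only; a hard crash (segfault) in the imported library's native code
    bypasses Python exception handling entirely."""
    try:
        return __import__(name)
    except ImportError:
        return None

cudf = optional_import("cudf")      # None in environments without cudf
json_mod = optional_import("json")  # stdlib module, always importable
```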

Contributor Author

I was, and still am, considering wrapping this with a HAS_GPU check, but I've been going back and forth on that because it's slightly confusing for this variable to represent both the successful import of cudf and what HAS_GPU represents (the successful import of pynvml and its initialization).

It's possible to be in a situation where cudf can be imported but HAS_GPU=False (pynvml not installed, or it fails to initialize for some reason)
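The two signals are independent, which is the source of the confusion. A hedged sketch of how they diverge (mirroring the compat-module idea described above, not its exact code):

```python
# Signal 1: HAS_GPU -- pynvml imports, initializes, and sees a device.
try:
    import pynvml
    pynvml.nvmlInit()
    HAS_GPU = pynvml.nvmlDeviceGetCount() > 0
except Exception:
    # pynvml missing, or NVML failed to initialize
    HAS_GPU = False

# Signal 2: cudf importability -- can succeed even when HAS_GPU is False
# (e.g. pynvml not installed), and fail even when HAS_GPU is True.
try:
    import cudf
except ImportError:
    cudf = None
```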

Contributor Author

The segmentation fault was fixed in cuda-python 11.7.1; however, that requires cudf 22.12 or above. With that in mind, I think it might be OK to keep the import as it is and say that support for CPU-only environments that have cudf installed requires cudf >= 22.12?
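If one ever wanted to enforce that constraint in code rather than documentation, a version gate could look like the following (illustrative only; `cudf_safe_on_cpu` is a hypothetical helper, and cudf uses CalVer strings like "22.12.00"):

```python
def cudf_safe_on_cpu(cudf_version: str) -> bool:
    """Return True if this cudf version can be imported safely in a
    CPU-only environment, per the >= 22.12 threshold discussed above."""
    # Compare (major, minor) numerically rather than as strings,
    # so "22.2" doesn't sort after "22.12".
    major, minor = (int(part) for part in cudf_version.split(".")[:2])
    return (major, minor) >= (22, 12)
```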

Contributor

Yeah, that sounds good. You're way ahead of me (as usual).

@karlhigley
Contributor

It looks like some of the NVT test failures might actually be related to these changes

@oliverholworthy
Contributor Author

> It looks like some of the NVT test failures might actually be related to these changes

I've updated and resolved these issues. The type hint for read_dispatch suggested that it might only accept a dataframe type, but it is also called with strings. Updated to handle that case.

And the functions {random_state, arange, array, zeros} all had the same pattern. Updated to handle a RangeIndex type passed as like_df, and the base case where like_df is None and we're simply picking between cupy and numpy. The code is more verbose, but it's clearer about what it's doing now. Hopefully a further refactor later will remove the need for these altogether.
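A rough sketch of the shared pattern described above, assuming guarded cupy/cudf imports; the function body is illustrative, not the PR's actual code:

```python
import numpy as np

try:
    import cupy
except ImportError:
    cupy = None

try:
    import cudf
except ImportError:
    cudf = None

def zeros(size, like_df=None, dtype=None):
    """Allocate zeros with the array backend implied by `like_df`."""
    if like_df is None:
        # Base case: no reference object, prefer cupy when it imported.
        xp = cupy if cupy is not None else np
    elif cudf is not None and isinstance(
        like_df, (cudf.DataFrame, cudf.Series, cudf.RangeIndex)
    ):
        # GPU reference object (including RangeIndex) -> cupy.
        xp = cupy
    else:
        # CPU reference object (e.g. pandas) -> numpy.
        xp = np
    return xp.zeros(size, dtype=dtype)
```

Passing a pandas object as `like_df` takes the numpy branch regardless of whether cupy is installed, which is the point of keying off the reference object rather than HAS_GPU.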

@karlhigley karlhigley merged commit 20bf15f into NVIDIA-Merlin:main Mar 17, 2023
karlhigley added a commit that referenced this pull request Mar 22, 2023
…n detected (#236)

* Run tests in GPU environment with no GPUs visible

* Update TensorTable tests with checks for HAS_GPU

* Remove unused `_HAS_GPU` variable from `test_utils`

* Wrap cupy/cudf imports in HAS_GPU check in `compat`

* Update tests to use HAS_GPU from compat module

* Reformat test_tensor_table.py

* Move HAS_GPU import to compat module

* Add pynvml dependency

* Update functions in `dispatch` to not use HAS_GPU

* Raise RuntimeError in Dataset if we can't run on GPU when cpu=False

* Update `convert_data` to handle unavailable cudf and dask_cudf

* Remove use of `HAS_GPU` from dispatch

* Keep cudf and cupy values representing presence of package

* Revert changes to `dataset.py`. Now part of #243

* Revert changes to `dispatch.py`. Now part of #244

* Use branch-name action for branch selection

* Remove unused ref_type variable

* Extend reason in `test_tensor_column.py`

Co-authored-by: Karl Higley <kmhigley@gmail.com>

* Extend reason in `tests/unit/table/test_tensor_column.py`

Co-authored-by: Karl Higley <kmhigley@gmail.com>

* Remove cudf import from compat. Now unrelated to this PR

* Remove use of branch-name action. `docker` not available in runner

* Add HAS_GPU checks with cupy to support env without visible devices

* Correct value of empty visible devices

* Update deps for GPU envs to match others

* Update get_lib to account for missing visible GPU

* Check HAS_GPU in `make_df` to handle visible GPU devices

* Update Dataset to handle default case when no visible GPUs are found

* Update fixtures to handle cudf with no visible devices

* Update tests to handle case of no visible GPUs

---------

Co-authored-by: Karl Higley <kmhigley@gmail.com>
Co-authored-by: Karl Higley <karlb@nvidia.com>
