[FEA]: Introduce Python module with CCCL headers #3201
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here.

Q: In what way is it not working?

It is getting a non-existing path here: At HEAD, cuda_parallel/cuda/_include exists in the source directory (it is
On August 30, 2014 @leofang wrote: Leo: Do you still recommend that we replace …? I'm asking because that'll take this PR in a very different direction (I think).
Logging an observation (JIC it's useful to reference this later): With CCCL HEAD (I have @ d6253b5). TL;DR: @gevtushenko, could this explain your "only works 50% of the time" experience? Current working directory is … The output is: … Similarly for cuda_parallel: same output as above.

Now with this PR (@ daab580). TL;DR: Same problem (this had me really confused TBH). Output: …
Small summary:
Commit ef9d5f4 makes the … I wouldn't be surprised if this isn't the right way of doing it, but it does work in one pass.
… cuda._include to find the include path.
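The approach of commit ef9d5f4 (add an `__init__.py`, then import the package to locate the bundled headers) can be sketched generically as follows; `package_dir` is an illustrative helper name, not code from the PR:

```python
import importlib
from pathlib import Path


def package_dir(module_name: str) -> Path:
    """Locate a package's on-disk directory by importing it."""
    mod = importlib.import_module(module_name)
    # For a package, __file__ points at its __init__.py; the parent
    # directory is where any bundled data (e.g. headers) lives.
    return Path(mod.__file__).parent
```

With the `__init__.py` added by the build step, `package_dir("cuda._include")` would resolve to the directory the CCCL headers were copied into.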
Commit bc116dc fixes the …
… (they are equivalent to the default functions)
It turns out that what I discovered the hard way was actually a known issue (lines 23 to 27 in d6253b5).
/ok to test |
🟩 CI finished in 58m 34s: Pass: 100%/176 | Total: 1d 00h | Avg: 8m 22s | Max: 44m 12s | Hits: 99%/22510
| Modified? | Project |
|---|---|
| +/- | CCCL Infrastructure |
| | libcu++ |
| | CUB |
| | Thrust |
| | CUDA Experimental |
| +/- | python |
| | CCCL C Parallel Library |
| | Catch2Helper |
Modifications in project or dependencies?
| Project | |
|---|---|
| +/- | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 176)
| # | Runner |
|---|---|
| 125 | linux-amd64-cpu16 |
| 25 | linux-amd64-gpu-v100-latest-1 |
| 15 | windows-amd64-cpu16 |
| 10 | linux-arm64-cpu16 |
| 1 | linux-amd64-gpu-h100-latest-1-testing |
leofang left a comment:
We're in good shape now! A few minor comments.
```python
from cuda.cccl import get_include_paths
```

```python
for path in get_include_paths().as_tuple():
    if path:
```
I think this check should be moved to get_include_paths() so that we only pay this cost once per process? (and I think you've done that check via assert!)
I think it's better to keep the if here, because with the current setup ...
```python
@dataclass
class IncludePaths:
    cuda: Optional[Path]
    libcudacxx: Optional[Path]
    cub: Optional[Path]
    thrust: Optional[Path]
```
... it's safer. The Optional fields here are for flexibility/reusability/future-proofing.
Possibly, in the future some of the paths will be None.
I expect the runtime overhead (the price we pay for the flexibility) to be unmeasurable, especially because this function is cached, but even without caching.
However, I changed it to if path is None: (commit 12dbf29), for consistency, after I just realized that that's what we have in python/cuda_parallel/cuda/parallel/experimental/_bindings.py.
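Putting the pieces of this thread together, a minimal sketch of the described setup might look like this; the class and method names mirror the quoted code, while the caching decorator, the concrete paths, and the `None` entry are assumptions for illustration only:

```python
from dataclasses import dataclass, fields
from functools import lru_cache
from pathlib import Path
from typing import Optional, Tuple


@dataclass
class IncludePaths:
    cuda: Optional[Path]
    libcudacxx: Optional[Path]
    cub: Optional[Path]
    thrust: Optional[Path]

    def as_tuple(self) -> Tuple[Optional[Path], ...]:
        # Fixed field order, so callers can zip it against flag templates.
        return tuple(getattr(self, f.name) for f in fields(self))


@lru_cache()  # computed once per process, so the per-path check is cheap
def get_include_paths() -> IncludePaths:
    base = Path("include")  # placeholder location, not the real layout
    return IncludePaths(
        cuda=None,  # may legitimately be None, hence the Optional fields
        libcudacxx=base / "libcudacxx",
        cub=base,
        thrust=base,
    )


# Callers skip the None entries, as in the reviewed hunks:
flags = [f"-I{path}" for path in get_include_paths().as_tuple() if path is not None]
```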
```python
    cuda_include_path,
)
for path in get_include_paths().as_tuple()
if path is not None
```
ditto, path could be checked only once in get_include_paths()
```toml
]
requires-python = ">=3.9"
dependencies = [
    "cuda-cccl",
```
I don't have a good way to declare a version constraint for cuda-cccl statically; I suspect we will need to move dependencies to setup.py's install_requires. Let's do this in another PR.
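A sketch of what the move to `install_requires` could look like; the helper name and the version scheme are hypothetical, not this PR's code:

```python
# setup.py sketch: compute the cuda-cccl constraint at build time, since a
# matching version cannot be written statically into pyproject.toml.
def make_install_requires(cccl_version: str) -> list:
    # Pin cuda-cccl to the co-released CCCL version (placeholder scheme).
    return [f"cuda-cccl=={cccl_version}"]


# Would then be passed as setup(..., install_requires=make_install_requires("2.8.0"))
```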
🟩 CI finished in 2h 16m: Pass: 100%/148 | Total: 1d 14h | Avg: 15m 26s | Max: 1h 32m | Hits: 455%/25823
| Modified? | Project |
|---|---|
| +/- | CCCL Infrastructure |
| | libcu++ |
| | CUB |
| | Thrust |
| | CUDA Experimental |
| +/- | python |
| | CCCL C Parallel Library |
| | Catch2Helper |
Modifications in project or dependencies?
| Project | |
|---|---|
| +/- | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 148)
| # | Runner |
|---|---|
| 98 | linux-amd64-cpu16 |
| 23 | linux-amd64-gpu-v100-latest-1 |
| 16 | windows-amd64-cpu16 |
| 10 | linux-arm64-cpu16 |
| 1 | linux-amd64-gpu-h100-latest-1-testing |
* Add cccl/python/cuda_cccl directory and use from cuda_parallel, cuda_cooperative
* Run `copy_cccl_headers_to_cuda_include()` before `setup()`
* Create python/cuda_cccl/cuda/_include/__init__.py, then simply import cuda._include to find the include path.
* Add cuda.cccl._version exactly as for cuda.cooperative and cuda.parallel
* Bug fix: cuda/_include only exists after shutil.copytree() ran.
* Use `f"cuda-cccl @ file://{cccl_path}/python/cuda_cccl"` in setup.py
* Remove CustomBuildCommand, CustomWheelBuild in cuda_parallel/setup.py (they are equivalent to the default functions)
* Replace := operator (needs Python 3.8+)
* Fix oversights: remove `pip3 install ./cuda_cccl` lines from README.md
* Restore original README.md: `pip3 install -e` now works on first pass.
* cuda_cccl/README.md: FOR INTERNAL USE ONLY
* Remove `$pymajor.$pyminor.` prefix in cuda_cccl _version.py (as suggested under NVIDIA#3201 (comment))
Command used: ci/update_version.sh 2 8 0
* Modernize pyproject.toml, setup.py
Trigger for this change:
* NVIDIA#3201 (comment)
* NVIDIA#3201 (comment)
* Install CCCL headers under cuda.cccl.include
Trigger for this change:
* NVIDIA#3201 (comment)
Unexpected accidental discovery: cuda.cooperative unit tests pass without CCCL headers entirely.
* Factor out cuda_cccl/cuda/cccl/include_paths.py
* Reuse cuda_cccl/cuda/cccl/include_paths.py from cuda_cooperative
* Add missing Copyright notice.
* Add missing __init__.py (cuda.cccl)
* Add `"cuda.cccl"` to `autodoc.mock_imports`
* Move cuda.cccl.include_paths into function where it is used. (Attempt to resolve Build and Verify Docs failure.)
* Add # TODO: move this to a module-level import
* Modernize cuda_cooperative/pyproject.toml, setup.py
* Convert cuda_cooperative to use hatchling as build backend.
* Revert "Convert cuda_cooperative to use hatchling as build backend."
This reverts commit 61637d6.
* Move numpy from [build-system] requires -> [project] dependencies
* Move pyproject.toml [project] dependencies -> setup.py install_requires, to be able to use CCCL_PATH
* Remove copy_license() and use license_files=["../../LICENSE"] instead.
* Further modernize cuda_cccl/setup.py to use pathlib
* Trivial simplifications in cuda_cccl/pyproject.toml
* Further simplify cuda_cccl/pyproject.toml, setup.py: remove inconsequential code
* Make cuda_cooperative/pyproject.toml more similar to cuda_cccl/pyproject.toml
* Add taplo-pre-commit to .pre-commit-config.yaml
* taplo-pre-commit auto-fixes
* Use pathlib in cuda_cooperative/setup.py
* CCCL_PYTHON_PATH in cuda_cooperative/setup.py
* Modernize cuda_parallel/pyproject.toml, setup.py
* Use pathlib in cuda_parallel/setup.py
* Add `# TOML lint & format` comment.
* Replace MANIFEST.in with `[tool.setuptools.package-data]` section in pyproject.toml
* Use pathlib in cuda/cccl/include_paths.py
* pre-commit autoupdate (EXCEPT clang-format, which was manually restored)
* Fixes after git merge main
* Resolve warning: AttributeError: '_Reduce' object has no attribute 'build_result'
```
=========================================================================== warnings summary ===========================================================================
tests/test_reduce.py::test_reduce_non_contiguous
/home/coder/cccl/python/devenv/lib/python3.12/site-packages/_pytest/unraisableexception.py:85: PytestUnraisableExceptionWarning: Exception ignored in: <function _Reduce.__del__ at 0x7bf123139080>
Traceback (most recent call last):
File "/home/coder/cccl/python/cuda_parallel/cuda/parallel/experimental/algorithms/reduce.py", line 132, in __del__
bindings.cccl_device_reduce_cleanup(ctypes.byref(self.build_result))
^^^^^^^^^^^^^^^^^
AttributeError: '_Reduce' object has no attribute 'build_result'
warnings.warn(pytest.PytestUnraisableExceptionWarning(msg))
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================================= 1 passed, 93 deselected, 1 warning in 0.44s ==============================================================
```
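The warning occurs when `__init__` raises before `build_result` is assigned but `__del__` still runs on the partially constructed object. One common guard (a sketch, not necessarily the exact fix used in this PR) is to look the attribute up defensively:

```python
class _Reduce:
    """Sketch of guarding __del__ against partially constructed objects."""

    def __init__(self, fail: bool = False):
        if fail:
            # Simulates __init__ failing before build_result is assigned;
            # __del__ still runs on the half-built instance.
            raise RuntimeError("build failed")
        self.build_result = object()

    def __del__(self):
        # getattr avoids the AttributeError from the warning above: if
        # __init__ raised before build_result was assigned, skip cleanup.
        build_result = getattr(self, "build_result", None)
        if build_result is not None:
            pass  # e.g. bindings.cccl_device_reduce_cleanup(...)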
* Move `copy_cccl_headers_to_cuda_cccl_include()` functionality to `class CustomBuildPy`
* Introduce cuda_cooperative/constraints.txt
* Also add cuda_parallel/constraints.txt
* Add `--constraint constraints.txt` in ci/test_python.sh
* Update Copyright dates
* Switch to https://github.com/ComPWA/taplo-pre-commit (the other repo has been archived by the owner on Jul 1, 2024)
For completeness: The other repo took a long time to install into the pre-commit cache; so long it led to timeouts in the CCCL CI.
* Remove unused cuda_parallel jinja2 dependency (noticed by chance).
* Remove constraints.txt files, advertise running `pip install cuda-cccl` first instead.
* Make cuda_cooperative, cuda_parallel testing completely independent.
* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Fix sign-compare warning (NVIDIA#3408) [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Revert "Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]"
This reverts commit ea33a21.
Error message: NVIDIA#3201 (comment)
* Try using A100 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Also show cuda-cooperative site-packages, cuda-parallel site-packages (after pip install) [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Try using l4 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Restore original ci/matrix.yaml [skip-rapids]
* Use for loop in test_python.sh to avoid code duplication.
* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]
* Comment out taplo-lint in pre-commit config [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Revert "Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]"
This reverts commit ec206fd.
* Implement suggestion by @shwina (NVIDIA#3201 (review))
* Address feedback by @leofang
---------
Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>
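The header-copy step listed above (`copy_cccl_headers_to_cuda_cccl_include()`, later moved into `class CustomBuildPy`) can be sketched as follows; the project layout and function body are illustrative, not the PR's exact code:

```python
import shutil
from pathlib import Path


def copy_cccl_headers(repo_root: Path, target: Path) -> None:
    # Copy each CCCL project's headers into the package tree so they ship
    # inside the cuda-cccl wheel. The directory names are illustrative.
    for project in ("libcudacxx", "cub", "thrust"):
        shutil.copytree(repo_root / project, target / project, dirs_exist_ok=True)
```

In the PR this logic ends up inside a `build_py` subclass (`CustomBuildPy`), presumably registered via `cmdclass={"build_py": CustomBuildPy}`, so the copy happens as part of every build rather than only on the first `setup()` run.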
implement `add_sat`
split `signed`/`unsigned` implementation, improve implementation for MSVC
improve device `add_sat` implementation
add `add_sat` test
improve generic `add_sat` implementation for signed types
implement `sub_sat`
allow more msvc intrinsics on x86
add op tests
partially implement `mul_sat`
implement `div_sat` and `saturate_cast`
add `saturate_cast` test
simplify `div_sat` test
Deprecate C++11 and C++14 for libcu++ (#3173)
* Deprecate C++11 and C++14 for libcu++
Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>
Implement `abs` and `div` from `cstdlib` (#3153)
* implement integer abs functions
* improve tests, fix constexpr support
* just use our implementation
* implement `cuda::std::div`
* prefer host's `div_t` like types
* provide `cuda::std::abs` overloads for floats
* allow fp abs for NVRTC
* silence msvc's warning about conversion from floating point to integral
Fix missing radix sort policies (#3174)
Fixes NVBug 5009941
Introduces new `DeviceReduce::Arg{Min,Max}` interface with two output iterators (#3148)
* introduces new arg{min,max} interface with two output iterators
* adds fp inf tests
* fixes docs
* improves code example
* fixes exec space specifier
* trying to fix deprecation warning for more compilers
* inlines unzip operator
* trying to fix deprecation warning for nvhpc
* integrates suppression fixes in diagnostics
* pre-ctk 11.5 deprecation suppression
* fixes icc
* fix for pre-ctk11.5
* cleans up deprecation suppression
* cleanup
Extend tuning documentation (#3179)
Add codespell pre-commit hook, fix typos in CCCL (#3168)
* Add codespell pre-commit hook
* Automatic changes from codespell.
* Manual changes.
Fix parameter space for TUNE_LOAD in scan benchmark (#3176)
fix various old compiler checks (#3178)
implement C++26 `std::projected` (#3175)
Fix pre-commit config for codespell and remaining typos (#3182)
Massive cleanup of our config (#3155)
Fix UB in atomics with automatic storage (#2586)
* Adds specialized local cuda atomics and injects them into most atomics paths.
Co-authored-by: Georgy Evtushenko <evtushenko.georgy@gmail.com>
Co-authored-by: gonzalobg <65027571+gonzalobg@users.noreply.github.com>
* Allow CUDA 12.2 to keep perf, this addresses earlier comments in #478
* Remove extraneous double brackets in unformatted code.
* Merge unsafe atomic logic into `__cuda_is_local`.
* Use `const_cast` for type conversions in cuda_local.h
* Fix build issues from interface changes
* Fix missing __nanosleep on sm70-
* Guard __isLocal from NVHPC
* Use PTX instead of running nothing from NVHPC
* fixup /s/nvrtc/nvhpc
* Fixup missing CUDA ifdef surrounding device code
* Fix codegen
* Bypass some sort of compiler bug on GCC7
* Apply suggestions from code review
* Use unsafe automatic storage atomics in codegen tests
---------
Co-authored-by: Georgy Evtushenko <evtushenko.georgy@gmail.com>
Co-authored-by: gonzalobg <65027571+gonzalobg@users.noreply.github.com>
Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>
Refactor the source code layout for `cuda.parallel` (#3177)
* Refactor the source layout for cuda.parallel
* Add copyright
* Address review feedback
* Don't import anything into `experimental` namespace
* fix import
---------
Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>
new type-erased memory resources (#2824)
s/_LIBCUDACXX_DECLSPEC_EMPTY_BASES/_CCCL_DECLSPEC_EMPTY_BASES/g (#3186)
Document address stability of `thrust::transform` (#3181)
* Do not document _LIBCUDACXX_MARK_CAN_COPY_ARGUMENTS
* Reformat and fix UnaryFunction/BinaryFunction in transform docs
* Mention transform can use proclaim_copyable_arguments
* Document cuda::proclaims_copyable_arguments better
* Deprecate depending on transform functor argument addresses
Fixes: #3053
turn off cuda version check for clangd (#3194)
[STF] jacobi example based on parallel_for (#3187)
* Simple jacobi example with parallel for and reductions
* clang-format
* remove useless capture list
fixes pre-nv_diag suppression issues (#3189)
Prefer c2h::type_name over c2h::demangle (#3195)
Fix memcpy_async* tests (#3197)
* memcpy_async_tx: Fix bug in test
Two bugs, one of which occurs in practice:
1. There is a missing fence.proxy.space::global between the writes to
global memory and the memcpy_async_tx. (Occurs in practice)
2. The end of the kernel should be fenced with `__syncthreads()`,
because the barrier is invalidated in the destructor. If other
threads are still waiting on it, there will be UB. (Has not yet
manifested itself)
* cp_async_bulk_tensor: Pre-emptively fence more in test
Add type annotations and mypy checks for `cuda.parallel` (#3180)
* Refactor the source layout for cuda.parallel
* Add initial type annotations
* Update pre-commit config
* More typing
* Fix bad merge
* Fix TYPE_CHECKING and numpy annotations
* typing bindings.py correctly
* Address review feedback
---------
Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>
Fix rendering of cuda.parallel docs (#3192)
* Fix pre-commit config for codespell and remaining typos
* Fix rendering of docs for cuda.parallel
---------
Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>
Enable PDL for DeviceMergeSortBlockSortKernel (#3199)
The kernel already contains a call to _CCCL_PDL_GRID_DEPENDENCY_SYNC.
This commit enables PDL when launching the kernel.
Adds support for large `num_items` to `DeviceReduce::{ArgMin,ArgMax}` (#2647)
* adds benchmarks for reduce::arg{min,max}
* preliminary streaming arg-extremum reduction
* fixes implicit conversion
* uses streaming dispatch class
* changes arg benches to use new streaming reduce
* streaming arg-extrema reduction
* fixes style
* fixes compilation failures
* cleanups
* adds rst style comments
* declare vars const and use clamp
* consolidates argmin argmax benchmarks
* fixes thrust usage
* drops offset type in arg-extrema benchmarks
* fixes clang cuda
* exec space macros
* switch to signed global offset type for slightly better perf
* clarifies documentation
* applies minor benchmark style changes from review comments
* fixes interface documentation and comments
* list-init accumulating output op
* improves style, comments, and tests
* cleans up aggregate init
* renames dispatch class usage in benchmarks
* fixes merge conflicts
* addresses review comments
* addresses review comments
* fixes assertion
* removes superseded implementation
* changes large problem tests to use new interface
* removes obsolete tests for deprecated interface
Fixes for Python 3.7 docs environment (#3206)
Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>
Adds support for large number of items to `DeviceTransform` (#3172)
* moves large problem test helper to common file
* adds support for large num items to device transform
* adds tests for large number of items to device interface
* fixes format
* addresses review comments
cp_async_bulk: Fix test (#3198)
* memcpy_async_tx: Fix bug in test
Two bugs, one of which occurs in practice:
1. There is a missing fence.proxy.space::global between the writes to
global memory and the memcpy_async_tx. (Occurs in practice)
2. The end of the kernel should be fenced with `__syncthreads()`,
because the barrier is invalidated in the destructor. If other
threads are still waiting on it, there will be UB. (Has not yet
manifested itself)
* cp_async_bulk_tensor: Pre-emptively fence more in test
* cp_async_bulk: Fix test
The global memory pointer could be misaligned.
cudax fixes for msvc 14.41 (#3200)
avoid instantiating class templates in `is_same` implementation when possible (#3203)
Fix: make launchers a CUB detail; make kernel source functions hidden. (#3209)
* Fix: make launchers a CUB detail; make kernel source functions hidden.
* [pre-commit.ci] auto code formatting
* Address review comments, fix which macro gets fixed.
help the ranges concepts recognize standard contiguous iterators in c++14/17 (#3202)
unify macros and cmake options that control the suppression of deprecation warnings (#3220)
* unify macros and cmake options that control the suppression of deprecation warnings
* suppress nvcc warning #186 in thrust header tests
* suppress c++ dialect deprecation warnings in libcudacxx header tests
Fix thread-reduce performance regression (#3225)
cuda.parallel: In-memory caching of build objects (#3216)
* Define __eq__ and __hash__ for Iterators
* Define cache_with_key utility and use it to cache Reduce objects
* Add tests for caching Reduce objects
* Tighten up types
* Updates to support 3.7
* Address review feedback
* Introduce IteratorKind to hold iterator type information
* Use the .kind to generate an abi_name
* Remove __eq__ and __hash__ methods from IteratorBase
* Move helper function
* Formatting
* Don't unpack tuple in cache key
---------
Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>
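The `cache_with_key` utility named in the commits above might look roughly like this; only the name comes from the commit messages, while the decorator shape and the example key function are assumptions:

```python
import functools


def cache_with_key(key):
    """Cache a function's results keyed by key(*args, **kwargs) (sketch)."""

    def decorator(func):
        cache = {}

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Deriving the key from a user-supplied function lets unhashable
            # or loosely-equal arguments (e.g. equivalent iterators) share
            # one cached build object.
            k = key(*args, **kwargs)
            if k not in cache:
                cache[k] = func(*args, **kwargs)
            return cache[k]

        return wrapper

    return decorator


# Example: calls whose inputs map to the same key reuse one "build".
@cache_with_key(key=lambda arr: (arr["dtype"], arr["kind"]))
def build(arr):
    return object()  # stands in for an expensive compiled Reduce object
```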
Just enough ranges for c++14 `span` (#3211)
use generalized concepts portability macros to simplify the `range` concept (#3217)
fixes some issues in the concepts portability macros and then re-implements the `range` concept with `_CCCL_REQUIRES_EXPR`
Use Ruff to sort imports (#3230)
* Update pyproject.tomls for import sorting
* Update files after running pre-commit
* Move ruff config to pyproject.toml
---------
Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>
fix tuning_scan sm90 config issue (#3236)
Co-authored-by: Shijie Chen <shijiec@nvidia.com>
[STF] Logical token (#3196)
* Split the implementation of the void interface into the definition of the interface, and its implementations on streams and graphs.
* Add missing files
* Check if a task implementation can match a prototype where the void_interface arguments are ignored
* Implement ctx.abstract_logical_data() which relies on a void data interface
* Illustrate how to use abstract handles in local contexts
* Introduce an is_void_interface() virtual method in the data interface to potentially optimize some stages
* Small improvements in the examples
* Do not try to allocate or move void data
* Do not use I as a variable
* fix linkage error
* rename abstract_logical_data into logical_token
* Document logical token
* fix spelling error
* fix sphinx error
* reflect name changes
* use meaningful variable names
* simplify logical_token implementation because writeback is already disabled
* add a unit test for token elision
* implement token elision in host_launch
* Remove unused type
* Implement helpers to check if a function can be invoked from a tuple, or from a tuple where we removed tokens
* Much simpler is_tuple_invocable_with_filtered implementation
* Fix buggy test
* Factorize code
* Document that we can ignore tokens for task and host_launch
* Documentation for logical data freeze
Fix ReduceByKey tuning (#3240)
Fix RLE tuning (#3239)
cuda.parallel: Forbid non-contiguous arrays as inputs (or outputs) (#3233)
* Forbid non-contiguous arrays as inputs (or outputs)
* Implement a more robust way to check for contiguity
* Don't bother if cublas unavailable
* Fix how we check for zero-element arrays
* sort imports
---------
Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>
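A robust contiguity check in the spirit of this commit (a sketch, not the PR's code) compares the strides against the expected C-contiguous layout derived from the shape and itemsize, e.g. as reported by `__cuda_array_interface__`; extent-0 and extent-1 dimensions impose no constraint on their strides:

```python
def is_c_contiguous(shape, strides, itemsize):
    """Return True if (shape, strides) describes a C-contiguous layout."""
    if 0 in shape:
        return True  # zero-element arrays are trivially contiguous
    # Walk dimensions right-to-left, accumulating the expected stride.
    expected = itemsize
    for extent, stride in zip(reversed(shape), reversed(strides)):
        if extent != 1 and stride != expected:
            return False
        expected *= extent
    return True
```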
expands support for more offset types in segmented benchmark (#3231)
Add escape hatches to the cmake configuration of the header tests so that we can test deprecated compilers / dialects (#3253)
* Add escape hatches to the cmake configuration of the header tests so that we can test deprecated compilers / dialects
* Do not add option twice
ptx: Add add_instruction.py (#3190)
This file helps create the necessary structure for new PTX instructions.
Co-authored-by: Allard Hendriksen <ahendriksen@nvidia.com>
Bump main to 2.9.0. (#3247)
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Drop cub::Mutex (#3251)
Fixes: #3250
Remove legacy macros from CUB util_arch.cuh (#3257)
Fixes: #3256
Remove thrust::[unary|binary]_traits (#3260)
Fixes: #3259
Architecture and OS identification macros (#3237)
Bump main to 3.0.0. (#3265)
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Drop thrust not1 and not2 (#3264)
Fixes: #3263
CCCL Internal macro documentation (#3238)
Deprecate GridBarrier and GridBarrierLifetime (#3258)
Fixes: #1389
Require at least gcc7 (#3268)
Fixes: #3267
Drop thrust::[unary|binary]_function (#3274)
Fixes: #3273
Drop ICC from CI (#3277)
[STF] Corruption of the capture list of an extended lambda with a parallel_for construct on a host execution place (#3270)
* Add a test to reproduce a bug observed with parallel_for on a host place
* clang-format
* use _CCCL_ASSERT
* Attempt to debug
* do not create a tuple with a universal reference that is out of scope when we use it, use an lvalue instead
* fix lambda expression
* clang-format
Enable thrust::identity test for non-MSVC (#3281)
This seems to be an oversight when the test was added
Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>
Enable PDL in triple chevron launch (#3282)
It seems PDL was disabled by accident when _THRUST_HAS_PDL was renamed
to _CCCL_HAS_PDL during the review introducing the feature.
Disambiguate line continuations and macro continuations in <nv/target> (#3244)
Drop VS 2017 from CI (#3287)
Fixes: #3286
Drop ICC support in code (#3279)
* Drop ICC from code
Fixes: #3278
Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>
Make CUB NVRTC commandline arguments come from a cmake template (#3292)
Propose the same components (thrust, cub, libc++, cudax, cuda.parallel,...) in the bug report template as in the feature request template (#3295)
Use process isolation instead of default hyper-v for Windows. (#3294)
Try improving build times by using process isolation instead of hyper-v
Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>
[pre-commit.ci] pre-commit autoupdate (#3248)
* [pre-commit.ci] pre-commit autoupdate
updates:
- [github.com/pre-commit/mirrors-clang-format: v18.1.8 → v19.1.6](https://github.com/pre-commit/mirrors-clang-format/compare/v18.1.8...v19.1.6)
- [github.com/astral-sh/ruff-pre-commit: v0.8.3 → v0.8.6](https://github.com/astral-sh/ruff-pre-commit/compare/v0.8.3...v0.8.6)
- [github.com/pre-commit/mirrors-mypy: v1.13.0 → v1.14.1](https://github.com/pre-commit/mirrors-mypy/compare/v1.13.0...v1.14.1)
Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>
Drop Thrust legacy arch macros (#3298)
Which were disabled and could be re-enabled using THRUST_PROVIDE_LEGACY_ARCH_MACROS
Drop Thrust's compiler_fence.h (#3300)
Drop CTK 11.x from CI (#3275)
* Add cuda12.0-gcc7 devcontainer
* Move MSVC2017 jobs to CTK 12.6
That is the only combination where rapidsai has devcontainers
* Add /Zc:__cplusplus for the libcudacxx tests
* Only add escape hatch for affected CTKs
* Workaround missing cudaLaunchKernelEx on MSVC
cudaLaunchKernelEx requires C++11, but unfortunately <cuda_runtime.h> checks this using the __cplusplus macro, which is reported wrongly for MSVC. CTK 12.3 fixed this by additionally detecting _MSC_VER. As a workaround, we provide our own copy of cudaLaunchKernelEx when it is not available from the CTK.
* Workaround nvcc+MSVC issue
* Regenerate devcontainers
Fixes: #3249
Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>
Drop CUB's util_compiler.cuh (#3302)
All contained macros were deprecated
Update packman and repo_docs versions (#3293)
Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>
Drop Thrust's deprecated compiler macros (#3301)
Drop CUB_RUNTIME_ENABLED and __THRUST_HAS_CUDART__ (#3305)
Adds support for large number of items to `DevicePartition::If` with the `ThreeWayPartition` overload (#2506)
* adds support for large number of items to three-way partition
* adapts interface to use choose_signed_offset_t
* integrates applicable feedback from device-select pr
* changes behavior for empty problems
* unifies grid constant macro
* fixes kernel template specialization mismatch
* integrates _CCCL_GRID_CONSTANT changes
* resolve merge conflicts
* fixes checks in test
* fixes test verification
* improves tests
* makes few improvements to streaming dispatch
* improves code comment on test
* fixes unrelated compiler error
* minor style improvements
Refactor scan tunings (#3262)
Require C++17 for compiling Thrust and CUB (#3255)
* Issue an unsuppressable warning when compiling with < C++17
* Remove C++11/14 presets
* Remove CCCL_IGNORE_DEPRECATED_CPP_DIALECT from headers
* Remove [CUB|THRUST|TCT]_IGNORE_DEPRECATED_CPP_[11|14]
* Remove CUB_ENABLE_DIALECT_CPP[11|14]
* Update CI runs
* Remove C++11/14 CI runs for CUB and Thrust
* Raise compiler minimum versions for C++17
* Update ReadMe
* Drop Thrust's cpp14_required.h
* Add escape hatch for C++17 removal
Fixes: #3252
Implement `views::empty` (#3254)
* Disable pair conversion of subrange with clang in C++17
* Fix namespace views
* Implement `views::empty`
This implements `std::ranges::views::empty`, see https://en.cppreference.com/w/cpp/ranges/empty_view
Refactor `limits` and `climits` (#3221)
* implement builtins for huge val, nan and nans
* change `INFINITY` and `NAN` implementation for NVRTC
cuda.parallel: Add documentation for the current iterators along with examples and tests (#3311)
* Add tests demonstrating usage of different iterators
* Update documentation of reduce_into by merging import code snippet with the rest of the example
* Add documentation for current iterators
* Run pre-commit checks and update accordingly
* Fix comments to refer to the proper lines in the code snippets in the docs
Drop clang<14 from CI, update devcontainers. (#3309)
Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>
[STF] Cleanup task dependencies object constructors (#3291)
* Define tag types for access modes
* - Rework how we build task_dep objects based on access mode tags
- pack_state is now responsible for using a const_cast for read only data
* Greatly simplify the previous attempt: do not define new types, but use integral constants based on the enums
* It seems the const_cast was not necessary, so we can simplify it and not even do some dispatch based on access modes
Disable test with a gcc-14 regression (#3297)
Deprecate Thrust's cpp_compatibility.h macros (#3299)
Remove dropped function objects from docs (#3319)
Document `NV_TARGET` macros (#3313)
[STF] Define ctx.pick_stream() which was missing for the unified context (#3326)
* Define ctx.pick_stream() which was missing for the unified context
* clang-format
Deprecate cub::IterateThreadStore (#3337)
Drop CUB's BinaryFlip operator (#3332)
Deprecate cub::Swap (#3333)
Clarify transform output can overlap input (#3323)
Drop CUB APIs with a debug_synchronous parameter (#3330)
Fixes: #3329
Drop CUB's util_compiler.cuh for real (#3340)
PR #3302 planned to drop the file, but only dropped its content. This
was an oversight. So let's drop the entire file.
Drop cub::ValueCache (#3346)
limits offset types for merge sort (#3328)
Drop CDPv1 (#3344)
Fixes: #3341
Drop thrust::void_t (#3362)
Use cuda::std::addressof in Thrust (#3363)
Fix all_of documentation for empty ranges (#3358)
all_of always returns true on an empty range.
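This vacuous-truth convention is the same one Python's built-in `all()` follows, which makes for a quick host-side sanity check of the documented behavior:

```python
# all_of over an empty range is vacuously true: there is no element
# that fails the predicate. Python's all() behaves identically.
empty = []
result = all(x > 0 for x in empty)
print(result)  # True
```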
[STF] Do not keep track of dangling events in a CUDA graph backend (#3327)
* Unlike the CUDA stream backend, nodes in a CUDA graph are necessarily done when
the CUDA graph completes. Therefore keeping track of "dangling events" is a
waste of time and resources.
* replace can_ignore_dangling_events by track_dangling_events which leads to more readable code
* When not storing the dangling events, we must still perform the deinit operations that were producing these events!
Extract scan kernels into NVRTC-compilable header (#3334)
* Extract scan kernels into NVRTC-compilable header
* Update cub/cub/device/dispatch/dispatch_scan.cuh
Co-authored-by: Georgii Evtushenko <evtushenko.georgy@gmail.com>
---------
Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>
Co-authored-by: Georgii Evtushenko <evtushenko.georgy@gmail.com>
Drop deprecated aliases in Thrust functional (#3272)
Fixes: #3271
Drop cub::DivideAndRoundUp (#3347)
Use cuda::std::min/max in Thrust (#3364)
Implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` (#3361)
* implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16`
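`__half` follows the IEEE 754 binary16 format, so the values the new specialization presumably reports can be cross-checked on the host against NumPy's `float16` (an illustration only, not the libcu++ API itself):

```python
import numpy as np

# IEEE 754 binary16 characteristics, which __half shares.
fi = np.finfo(np.float16)
print(float(fi.max))  # 65504.0, largest finite half-precision value
print(fi.nmant)       # 10 explicit mantissa bits
```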
Cleanup util_arch (#2773)
Deprecate thrust::null_type (#3367)
Deprecate cub::DeviceSpmv (#3320)
Fixes: #896
Improves `DeviceSegmentedSort` test run time for large number of items and segments (#3246)
* fixes segment offset generation
* switches to analytical verification
* switches to analytical verification for pairs
* fixes spelling
* adds tests for large number of segments
* fixes narrowing conversion in tests
* addresses review comments
* fixes includes
Compile basic infra test with C++17 (#3377)
Adds support for large number of items and large number of segments to `DeviceSegmentedSort` (#3308)
* fixes segment offset generation
* switches to analytical verification
* switches to analytical verification for pairs
* addresses review comments
* introduces segment offset type
* adds tests for large number of segments
* adds support for large number of segments
* drops segment offset type
* fixes thrust namespace
* removes about-to-be-deprecated cub iterators
* no exec specifier on defaulted ctor
* fixes gcc7 linker error
* uses local_segment_index_t throughout
* determine offset type based on type returned by segment iterator begin/end iterators
* minor style improvements
Exit with error when RAPIDS CI fails. (#3385)
cuda.parallel: Support structured types as algorithm inputs (#3218)
* Introduce gpu_struct decorator and typing
* Enable `reduce` to accept arrays of structs as inputs
* Add test for reducing arrays-of-struct
* Update documentation
* Use a numpy array rather than ctypes object
* Change zeros -> empty for output array and temp storage
* Add a TODO for typing GpuStruct
* Documentation updates
* Remove test_reduce_struct_type from test_reduce.py
* Revert to `to_cccl_value()` accepting ndarray + GpuStruct
* Bump copyrights
---------
Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>
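A host-side sketch of the idea using a NumPy structured dtype (the actual `gpu_struct` decorator and device-side reduction live in the PR; the names and the field-wise sum below are illustrative only):

```python
import numpy as np

# Element type with two fields, analogous to a struct handed to reduce.
dtype = np.dtype([("x", np.int32), ("y", np.int64)])
a = np.array([(1, 10), (2, 20), (3, 30)], dtype=dtype)

# Field-wise sum reduction over the array of structs.
total = (int(a["x"].sum()), int(a["y"].sum()))
print(total)  # (6, 60)
```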
Deprecate thrust::async (#3324)
Fixes: #100
Review/Deprecate CUB `util.ptx` for CCCL 2.x (#3342)
Fix broken `_CCCL_BUILTIN_ASSUME` macro (#3314)
* add compiler-specific path
* fix device code path
* add _CCCL_ASSUME
Deprecate thrust::numeric_limits (#3366)
Replace `typedef` with `using` in libcu++ (#3368)
Deprecate thrust::optional (#3307)
Fixes: #3306
Upgrade to Catch2 3.8 (#3310)
Fixes: #1724
refactor `<cuda/std/cstdint>` (#3325)
Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>
Update CODEOWNERS (#3331)
* Update CODEOWNERS
* Update CODEOWNERS
* Update CODEOWNERS
* [pre-commit.ci] auto code formatting
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Fix sign-compare warning (#3408)
Implement more cmath functions to be usable on host and device (#3382)
* Implement more cmath functions to be usable on host and device
* Implement math roots functions
* Implement exponential functions
Redefine and deprecate thrust::remove_cvref (#3394)
* Redefine and deprecate thrust::remove_cvref
Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>
Fix assert definition for NVHPC due to constexpr issues (#3418)
NVHPC cannot decide at compile time where the code would run, so _CCCL_ASSERT within a constexpr function breaks it.
Fix this by always using the host definition which should also work on device.
Fixes #3411
Extend CUB reduce benchmarks (#3401)
* Rename max.cu to custom.cu, since it uses a custom operator
* Extend types covered by min.cu to all fundamental types
* Add some notes on how to collect tuning parameters
Fixes: #3283
Update upload-pages-artifact to v3 (#3423)
* Update upload-pages-artifact to v3
* Empty commit
---------
Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>
Replace and deprecate thrust::cuda_cub::terminate (#3421)
`std::linalg` accessors and `transposed_layout` (#2962)
Add round up/down to multiple (#3234)
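The classic integer formulation of such helpers, sketched in Python (the libcu++ names and signatures may differ):

```python
def round_up(x, m):
    # Smallest multiple of m that is >= x (assumes m > 0).
    return -(-x // m) * m

def round_down(x, m):
    # Largest multiple of m that is <= x (assumes m > 0).
    return (x // m) * m

print(round_up(13, 4), round_down(13, 4))  # 16 12
```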
[FEA]: Introduce Python module with CCCL headers (#3201)
* Add cccl/python/cuda_cccl directory and use from cuda_parallel, cuda_cooperative
* Run `copy_cccl_headers_to_cuda_include()` before `setup()`
* Create python/cuda_cccl/cuda/_include/__init__.py, then simply import cuda._include to find the include path.
* Add cuda.cccl._version exactly as for cuda.cooperative and cuda.parallel
* Bug fix: cuda/_include only exists after shutil.copytree() ran.
* Use `f"cuda-cccl @ file://{cccl_path}/python/cuda_cccl"` in setup.py
* Remove CustomBuildCommand, CustomWheelBuild in cuda_parallel/setup.py (they are equivalent to the default functions)
* Replace := operator (needs Python 3.8+)
* Fix oversights: remove `pip3 install ./cuda_cccl` lines from README.md
* Restore original README.md: `pip3 install -e` now works on first pass.
* cuda_cccl/README.md: FOR INTERNAL USE ONLY
* Remove `$pymajor.$pyminor.` prefix in cuda_cccl _version.py (as suggested under https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894035917)
Command used: ci/update_version.sh 2 8 0
* Modernize pyproject.toml, setup.py
Trigger for this change:
* https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894043178
* https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894044996
* Install CCCL headers under cuda.cccl.include
Trigger for this change:
* https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894048562
Unexpected accidental discovery: the cuda.cooperative unit tests pass even without CCCL headers present.
* Factor out cuda_cccl/cuda/cccl/include_paths.py
* Reuse cuda_cccl/cuda/cccl/include_paths.py from cuda_cooperative
* Add missing Copyright notice.
* Add missing __init__.py (cuda.cccl)
* Add `"cuda.cccl"` to `autodoc.mock_imports`
* Move cuda.cccl.include_paths into function where it is used. (Attempt to resolve Build and Verify Docs failure.)
* Add # TODO: move this to a module-level import
* Modernize cuda_cooperative/pyproject.toml, setup.py
* Convert cuda_cooperative to use hatchling as build backend.
* Revert "Convert cuda_cooperative to use hatchling as build backend."
This reverts commit 61637d608da06fcf6851ef6197f88b5e7dbc3bbe.
* Move numpy from [build-system] requires -> [project] dependencies
* Move pyproject.toml [project] dependencies -> setup.py install_requires, to be able to use CCCL_PATH
* Remove copy_license() and use license_files=["../../LICENSE"] instead.
* Further modernize cuda_cccl/setup.py to use pathlib
* Trivial simplifications in cuda_cccl/pyproject.toml
* Further simplify cuda_cccl/pyproject.toml, setup.py: remove inconsequential code
* Make cuda_cooperative/pyproject.toml more similar to cuda_cccl/pyproject.toml
* Add taplo-pre-commit to .pre-commit-config.yaml
* taplo-pre-commit auto-fixes
* Use pathlib in cuda_cooperative/setup.py
* CCCL_PYTHON_PATH in cuda_cooperative/setup.py
* Modernize cuda_parallel/pyproject.toml, setup.py
* Use pathlib in cuda_parallel/setup.py
* Add `# TOML lint & format` comment.
* Replace MANIFEST.in with `[tool.setuptools.package-data]` section in pyproject.toml
* Use pathlib in cuda/cccl/include_paths.py
* pre-commit autoupdate (EXCEPT clang-format, which was manually restored)
* Fixes after git merge main
* Resolve warning: AttributeError: '_Reduce' object has no attribute 'build_result'
```
=========================================================================== warnings summary ===========================================================================
tests/test_reduce.py::test_reduce_non_contiguous
/home/coder/cccl/python/devenv/lib/python3.12/site-packages/_pytest/unraisableexception.py:85: PytestUnraisableExceptionWarning: Exception ignored in: <function _Reduce.__del__ at 0x7bf123139080>
Traceback (most recent call last):
File "/home/coder/cccl/python/cuda_parallel/cuda/parallel/experimental/algorithms/reduce.py", line 132, in __del__
bindings.cccl_device_reduce_cleanup(ctypes.byref(self.build_result))
^^^^^^^^^^^^^^^^^
AttributeError: '_Reduce' object has no attribute 'build_result'
warnings.warn(pytest.PytestUnraisableExceptionWarning(msg))
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================================= 1 passed, 93 deselected, 1 warning in 0.44s ==============================================================
```
* Move `copy_cccl_headers_to_cuda_cccl_include()` functionality to `class CustomBuildPy`
* Introduce cuda_cooperative/constraints.txt
* Also add cuda_parallel/constraints.txt
* Add `--constraint constraints.txt` in ci/test_python.sh
* Update Copyright dates
* Switch to https://github.com/ComPWA/taplo-pre-commit (the other repo has been archived by the owner on Jul 1, 2024)
For completeness: The other repo took a long time to install into the pre-commit cache; so long that it led to timeouts in the CCCL CI.
* Remove unused cuda_parallel jinja2 dependency (noticed by chance).
* Remove constraints.txt files, advertise running `pip install cuda-cccl` first instead.
* Make cuda_cooperative, cuda_parallel testing completely independent.
* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Fix sign-compare warning (#3408) [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Revert "Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]"
This reverts commit ea33a218ed77a075156cd1b332047202adb25aa2.
Error message: https://github.com/NVIDIA/cccl/pull/3201#issuecomment-2594012971
* Try using A100 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Also show cuda-cooperative site-packages, cuda-parallel site-packages (after pip install) [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Try using l4 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Restore original ci/matrix.yaml [skip-rapids]
* Use for loop in test_python.sh to avoid code duplication.
* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]
* Comment out taplo-lint in pre-commit config [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Revert "Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]"
This reverts commit ec206fd8b50a6a293e00a5825b579e125010b13d.
* Implement suggestion by @shwina (https://github.com/NVIDIA/cccl/pull/3201#pullrequestreview-2556918460)
* Address feedback by @leofang
---------
Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>
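A minimal sketch of the include-path discovery pattern the `include_paths.py` bullets describe, assuming headers ship as package data next to the module (the function name and layout here are illustrative, not the actual cuda.cccl API):

```python
from pathlib import Path

def find_bundled_include_dir(anchor_file=__file__):
    # Headers installed as package data live next to the importing
    # module; return the include directory if present, else None.
    inc = Path(anchor_file).resolve().parent / "include"
    return inc if inc.is_dir() else None
```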
cuda.parallel: Add optional stream argument to reduce_into() (#3348)
* Add optional stream argument to reduce_into()
* Add tests to check for reduce_into() stream behavior
* Move protocol related utils to separate file and rework __cuda_stream__ error messages
* Fix synchronization issue in stream test and add one more invalid stream test case
* Rename cuda stream validation function after removing leading underscore
* Unpack values from __cuda_stream__ instead of indexing
* Fix linting errors
* Handle TypeError when unpacking invalid __cuda_stream__ return
* Use stream to allocate cupy memory in new stream test
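The `__cuda_stream__` protocol returns a `(version, handle)` pair; a pure-Python sketch of the unpack-and-validate logic described in the bullets above (class and function names are illustrative):

```python
class FakeStream:
    """Minimal object implementing the __cuda_stream__ protocol."""
    def __init__(self, handle):
        self._handle = handle

    def __cuda_stream__(self):
        # Protocol: (protocol version, raw stream handle).
        return (0, self._handle)

def stream_handle(obj):
    # Unpack rather than index, and turn malformed returns into TypeError.
    try:
        version, handle = obj.__cuda_stream__()
    except (AttributeError, TypeError) as exc:
        raise TypeError("expected an object implementing __cuda_stream__") from exc
    return handle

print(stream_handle(FakeStream(42)))  # 42
```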
Upgrade to actions/deploy-pages@v4 (from v2), as suggested by @leofang (#3434)
Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ (#3419)
* Deprecate `cub::{min, max}` and replace internal uses with those from libcu++
Fixes #3404
move to c++17, finalize device optimization
fix msvc compilation, update tests
Deprecate C++11 and C++14 for libcu++ (#3173)
* Deprecate C++11 and C++14 for libcu++
Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>
Implement `abs` and `div` from `cstdlib` (#3153)
* implement integer abs functions
* improve tests, fix constexpr support
* just use the our implementation
* implement `cuda::std::div`
* prefer host's `div_t` like types
* provide `cuda::std::abs` overloads for floats
* allow fp abs for NVRTC
* silence msvc's warning about conversion from floating point to integral
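`div` in C and C++ truncates the quotient toward zero, which differs from Python's floor division; a host-side sketch of that semantic (illustrative, not the libcu++ implementation):

```python
def c_div(n, d):
    # C/C++ div(): quotient truncates toward zero and the remainder
    # takes the sign of the numerator. Python's divmod floors toward
    # -inf instead, so adjust when signs differ and a remainder exists.
    q, r = divmod(n, d)
    if r != 0 and (n < 0) != (d < 0):
        q += 1
        r -= d
    return q, r

print(c_div(7, 3), c_div(-7, 3))  # (2, 1) (-2, -1)
```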
Fix missing radix sort policies (#3174)
Fixes NVBug 5009941
Introduces new `DeviceReduce::Arg{Min,Max}` interface with two output iterators (#3148)
* introduces new arg{min,max} interface with two output iterators
* adds fp inf tests
* fixes docs
* improves code example
* fixes exec space specifier
* trying to fix deprecation warning for more compilers
* inlines unzip operator
* trying to fix deprecation warning for nvhpc
* integrates suppression fixes in diagnostics
* pre-ctk 11.5 deprecation suppression
* fixes icc
* fix for pre-ctk11.5
* cleans up deprecation suppression
* cleanup
Extend tuning documentation (#3179)
Add codespell pre-commit hook, fix typos in CCCL (#3168)
* Add codespell pre-commit hook
* Automatic changes from codespell.
* Manual changes.
Fix parameter space for TUNE_LOAD in scan benchmark (#3176)
fix various old compiler checks (#3178)
implement C++26 `std::projected` (#3175)
Fix pre-commit config for codespell and remaining typos (#3182)
Massive cleanup of our config (#3155)
Fix UB in atomics with automatic storage (#2586)
* Adds specialized local cuda atomics and injects them into most atomics paths.
Co-authored-by: Georgy Evtushenko <evtushenko.georgy@gmail.com>
Co-authored-by: gonzalobg <65027571+gonzalobg@users.noreply.github.com>
* Allow CUDA 12.2 to keep perf, this addresses earlier comments in #478
* Remove extraneous double brackets in unformatted code.
* Merge unsafe atomic logic into `__cuda_is_local`.
* Use `const_cast` for type conversions in cuda_local.h
* Fix build issues from interface changes
* Fix missing __nanosleep on sm70-
* Guard __isLocal from NVHPC
* Use PTX instead of running nothing from NVHPC
* fixup /s/nvrtc/nvhpc
* Fixup missing CUDA ifdef surrounding device code
* Fix codegen
* Bypass some sort of compiler bug on GCC7
* Apply suggestions from code review
* Use unsafe automatic storage atomics in codegen tests
---------
Co-authored-by: Georgy Evtushenko <evtushenko.georgy@gmail.com>
Co-authored-by: gonzalobg <65027571+gonzalobg@users.noreply.github.com>
Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>
Refactor the source code layout for `cuda.parallel` (#3177)
* Refactor the source layout for cuda.parallel
* Add copyright
* Address review feedback
* Don't import anything into `experimental` namespace
* fix import
---------
Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>
new type-erased memory resources (#2824)
s/_LIBCUDACXX_DECLSPEC_EMPTY_BASES/_CCCL_DECLSPEC_EMPTY_BASES/g (#3186)
Document address stability of `thrust::transform` (#3181)
* Do not document _LIBCUDACXX_MARK_CAN_COPY_ARGUMENTS
* Reformat and fix UnaryFunction/BinaryFunction in transform docs
* Mention transform can use proclaim_copyable_arguments
* Document cuda::proclaims_copyable_arguments better
* Deprecate depending on transform functor argument addresses
Fixes: #3053
turn off cuda version check for clangd (#3194)
[STF] jacobi example based on parallel_for (#3187)
* Simple jacobi example with parallel for and reductions
* clang-format
* remove useless capture list
fixes pre-nv_diag suppression issues (#3189)
Prefer c2h::type_name over c2h::demangle (#3195)
Fix memcpy_async* tests (#3197)
* memcpy_async_tx: Fix bug in test
Two bugs, one of which occurs in practice:
1. There is a missing fence.proxy.space::global between the writes to
global memory and the memcpy_async_tx. (Occurs in practice)
2. The end of the kernel should be fenced with `__syncthreads()`,
because the barrier is invalidated in the destructor. If other
threads are still waiting on it, there will be UB. (Has not yet
manifested itself)
* cp_async_bulk_tensor: Pre-emptively fence more in test
Add type annotations and mypy checks for `cuda.parallel` (#3180)
* Refactor the source layout for cuda.parallel
* Add initial type annotations
* Update pre-commit config
* More typing
* Fix bad merge
* Fix TYPE_CHECKING and numpy annotations
* typing bindings.py correctly
* Address review feedback
---------
Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>
Fix rendering of cuda.parallel docs (#3192)
* Fix pre-commit config for codespell and remaining typos
* Fix rendering of docs for cuda.parallel
---------
Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>
Enable PDL for DeviceMergeSortBlockSortKernel (#3199)
The kernel already contains a call to _CCCL_PDL_GRID_DEPENDENCY_SYNC.
This commit enables PDL when launching the kernel.
Adds support for large `num_items` to `DeviceReduce::{ArgMin,ArgMax}` (#2647)
* adds benchmarks for reduce::arg{min,max}
* preliminary streaming arg-extremum reduction
* fixes implicit conversion
* uses streaming dispatch class
* changes arg benches to use new streaming reduce
* streaming arg-extrema reduction
* fixes style
* fixes compilation failures
* cleanups
* adds rst style comments
* declare vars const and use clamp
* consolidates argmin argmax benchmarks
* fixes thrust usage
* drops offset type in arg-extrema benchmarks
* fixes clang cuda
* exec space macros
* switch to signed global offset type for slightly better perf
* clarifies documentation
* applies minor benchmark style changes from review comments
* fixes interface documentation and comments
* list-init accumulating output op
* improves style, comments, and tests
* cleans up aggregate init
* renames dispatch class usage in benchmarks
* fixes merge conflicts
* addresses review comments
* addresses review comments
* fixes assertion
* removes superseded implementation
* changes large problem tests to use new interface
* removes obsolete tests for deprecated interface
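The streaming idea behind the large-`num_items` support, sketched in Python: reduce one chunk at a time while carrying a global index offset, so indices stay valid across an arbitrarily long logical input (illustrative only; the CUB implementation dispatches per-partition device reductions):

```python
def streaming_argmin(chunks):
    # Combine per-chunk partial results (value, global index).
    best_val, best_idx, offset = None, None, 0
    for chunk in chunks:
        for i, v in enumerate(chunk):
            if best_val is None or v < best_val:
                best_val, best_idx = v, offset + i
        offset += len(chunk)
    return best_val, best_idx

print(streaming_argmin([[3, 1], [0, 5]]))  # (0, 2)
```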
Fixes for Python 3.7 docs environment (#3206)
Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>
Adds support for large number of items to `DeviceTransform` (#3172)
* moves large problem test helper to common file
* adds support for large num items to device transform
* adds tests for large number of items to device interface
* fixes format
* addresses review comments
cp_async_bulk: Fix test (#3198)
* memcpy_async_tx: Fix bug in test
Two bugs, one of which occurs in practice:
1. There is a missing fence.proxy.space::global between the writes to
global memory and the memcpy_async_tx. (Occurs in practice)
2. The end of the kernel should be fenced with `__syncthreads()`,
because the barrier is invalidated in the destructor. If other
threads are still waiting on it, there will be UB. (Has not yet
manifested itself)
* cp_async_bulk_tensor: Pre-emptively fence more in test
* cp_async_bulk: Fix test
The global memory pointer could be misaligned.
cudax fixes for msvc 14.41 (#3200)
avoid instantiating class templates in `is_same` implementation when possible (#3203)
Fix: make launchers a CUB detail; make kernel source functions hidden. (#3209)
* Fix: make launchers a CUB detail; make kernel source functions hidden.
* [pre-commit.ci] auto code formatting
* Address review comments, fix which macro gets fixed.
help the ranges concepts recognize standard contiguous iterators in c++14/17 (#3202)
unify macros and cmake options that control the suppression of deprecation warnings (#3220)
* unify macros and cmake options that control the suppression of deprecation warnings
* suppress nvcc warning #186 in thrust header tests
* suppress c++ dialect deprecation warnings in libcudacxx header tests
Fix thread-reduce performance regression (#3225)
cuda.parallel: In-memory caching of build objects (#3216)
* Define __eq__ and __hash__ for Iterators
* Define cache_with_key utility and use it to cache Reduce objects
* Add tests for caching Reduce objects
* Tighten up types
* Updates to support 3.7
* Address review feedback
* Introduce IteratorKind to hold iterator type information
* Use the .kind to generate an abi_name
* Remove __eq__ and __hash__ methods from IteratorBase
* Move helper function
* Formatting
* Don't unpack tuple in cache key
---------
Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>
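A minimal sketch of a `cache_with_key` utility as described above (the real implementation's signature and key handling may differ):

```python
import functools

def cache_with_key(key):
    # Memoize fn, but compute the cache key with a caller-supplied
    # function instead of hashing the raw arguments.
    def decorator(fn):
        cache = {}
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            k = key(*args, **kwargs)
            if k not in cache:
                cache[k] = fn(*args, **kwargs)
            return cache[k]
        return wrapper
    return decorator

calls = []

@cache_with_key(key=lambda x: type(x).__name__)
def build(x):
    calls.append(x)
    return f"built-for-{type(x).__name__}"

build(1)
build(2)        # same key "int": this call hits the cache
print(calls)    # [1]
```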
Just enough ranges for c++14 `span` (#3211)
use generalized concepts portability macros to simplify the `range` concept (#3217)
fixes some issues in the concepts portability macros and then re-implements the `range` concept with `_CCCL_REQUIRES_EXPR`
Use Ruff to sort imports (#3230)
* Update pyproject.tomls for import sorting
* Update files after running pre-commit
* Move ruff config to pyproject.toml
---------
Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>
fix tuning_scan sm90 config issue (#3236)
Co-authored-by: Shijie Chen <shijiec@nvidia.com>
[STF] Logical token (#3196)
* Split the implementation of the void interface into the definition of the interface, and its implementations on streams and graphs.
* Add missing files
* Check if a task implementation can match a prototype where the void_interface arguments are ignored
* Implement ctx.abstract_logical_data() which relies on a void data interface
* Illustrate how to use abstract handles in local contexts
* Introduce an is_void_interface() virtual method in the data interface to potentially optimize some stages
* Small improvements in the examples
* Do not try to allocate or move void data
* Do not use I as a variable
* fix linkage error
* rename abstract_logical_data into logical_token
* Document logical token
* fix spelling error
* fix sphinx error
* reflect name changes
* use meaningful variable names
* simplify logical_token implementation because writeback is already disabled
* add a unit test for token elision
* implement token elision in host_launch
* Remove unused type
* Implement helpers to check if a function can be invoked from a tuple, or from a tuple where we removed tokens
* Much simpler is_tuple_invocable_with_filtered implementation
* Fix buggy test
* Factorize code
* Document that we can ignore tokens for task and host_launch
* Documentation for logical data freeze
Fix ReduceByKey tuning (#3240)
Fix RLE tuning (#3239)
cuda.parallel: Forbid non-contiguous arrays as inputs (or outputs) (#3233)
* Forbid non-contiguous arrays as inputs (or outputs)
* Implement a more robust way to check for contiguity
* Don't bother if cublas unavailable
* Fix how we check for zero-element arrays
* sort imports
---------
Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>
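A sketch of the contiguity guard described above, using NumPy's layout flags (the check in the PR is more robust; the zero-element exemption mirrors the last bullet):

```python
import numpy as np

def require_contiguous(arr):
    # Zero-element arrays carry no data, so they pass trivially;
    # everything else must be C- or F-contiguous before its pointer
    # can be handed to the C library.
    if arr.size == 0:
        return arr
    if not (arr.flags["C_CONTIGUOUS"] or arr.flags["F_CONTIGUOUS"]):
        raise ValueError("non-contiguous arrays are not supported")
    return arr

a = np.arange(10)
require_contiguous(a)          # fine
# require_contiguous(a[::2])   # would raise: strided view, not contiguous
```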
expands support for more offset types in segmented benchmark (#3231)
Add escape hatches to the cmake configuration of the header tests so that we can test deprecated compilers / dialects (#3253)
* Add escape hatches to the cmake configuration of the header tests so that we can test deprecated compilers / dialects
* Do not add option twice
ptx: Add add_instruction.py (#3190)
This file helps create the necessary structure for new PTX instructions.
Co-authored-by: Allard Hendriksen <ahendriksen@nvidia.com>
Bump main to 2.9.0. (#3247)
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Drop cub::Mutex (#3251)
Fixes: #3250
Remove legacy macros from CUB util_arch.cuh (#3257)
Fixes: #3256
Remove thrust::[unary|binary]_traits (#3260)
Fixes: #3259
Architecture and OS identification macros (#3237)
Bump main to 3.0.0. (#3265)
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Drop thrust not1 and not2 (#3264)
Fixes: #3263
CCCL Internal macro documentation (#3238)
Deprecate GridBarrier and GridBarrierLifetime (#3258)
Fixes: #1389
Require at least gcc7 (#3268)
Fixes: #3267
Drop thrust::[unary|binary]_function (#3274)
Fixes: #3273
Drop ICC from CI (#3277)
[STF] Corruption of the capture list of an extended lambda with a parallel_for construct on a host execution place (#3270)
* Add a test to reproduce a bug observed with parallel_for on a host place
* clang-format
* use _CCCL_ASSERT
* Attempt to debug
* do not create a tuple with a universal reference that is out of scope when we use it, use an lvalue instead
* fix lambda expression
* clang-format
Enable thrust::identity test for non-MSVC (#3281)
This seems to be an oversight when the test was added
Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>
Enable PDL in triple chevron launch (#3282)
It seems PDL was disabled by accident when _THRUST_HAS_PDL was renamed
to _CCCL_HAS_PDL during the review introducing the feature.
Disambiguate line continuations and macro continuations in <nv/target> (#3244)
Drop VS 2017 from CI (#3287)
Fixes: #3286
Drop ICC support in code (#3279)
* Drop ICC from code
Fixes: #3278
Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>
Make CUB NVRTC commandline arguments come from a cmake template (#3292)
Propose the same components (thrust, cub, libc++, cudax, cuda.parallel,...) in the bug report template than in the feature request template (#3295)
Use process isolation instead of default hyper-v for Windows. (#3294)
Try improving build times by using process isolation instead of hyper-v
Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>
[pre-commit.ci] pre-commit autoupdate (#3248)
* [pre-commit.ci] pre-commit autoupdate
updates:
- [github.com/pre-commit/mirrors-clang-format: v18.1.8 → v19.1.6](https://github.com/pre-commit/mirrors-clang-format/compare/v18.1.8...v19.1.6)
- [github.com/astral-sh/ruff-pre-commit: v0.8.3 → v0.8.6](https://github.com/astral-sh/ruff-pre-commit/compare/v0.8.3...v0.8.6)
- [github.com/pre-commit/mirrors-mypy: v1.13.0 → v1.14.1](https://github.com/pre-commit/mirrors-mypy/compare/v1.13.0...v1.14.1)
Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>
Drop Thrust legacy arch macros (#3298)
Which were disabled and could be re-enabled using THRUST_PROVIDE_LEGACY_ARCH_MACROS
Drop Thrust's compiler_fence.h (#3300)
Drop CTK 11.x from CI (#3275)
* Add cuda12.0-gcc7 devcontainer
* Move MSVC2017 jobs to CTK 12.6
Those is the only combination where rapidsai has devcontainers
* Add /Zc:__cplusplus for the libcudacxx tests
* Only add excape hatch for affected CTKs
* Workaround missing cudaLaunchKernelEx on MSVC
cudaLaunchKernelEx requires C++11, but unfortunately <cuda_runtime.h> checks this using the __cplusplus macro, which is reported wrongly for MSVC. CTK 12.3 fixed this by additionally detecting _MSV_VER. As a workaround, we provide our own copy of cudaLaunchKernelEx when it is not available from the CTK.
* Workaround nvcc+MSVC issue
* Regenerate devcontainers
Fixes: #3249
Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>
Update packman and repo_docs versions (#3293)
Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>
Drop Thrust's deprecated compiler macros (#3301)
Drop CUB_RUNTIME_ENABLED and __THRUST_HAS_CUDART__ (#3305)
Adds support for large number of items to `DevicePartition::If` with the `ThreeWayPartition` overload (#2506)
* adds support for large number of items to three-way partition
* adapts interface to use choose_signed_offset_t
* integrates applicable feedback from device-select pr
* changes behavior for empty problems
* unifies grid constant macro
* fixes kernel template specialization mismatch
* integrates _CCCL_GRID_CONSTANT changes
* resolve merge conflicts
* fixes checks in test
* fixes test verification
* improves tests
* makes few improvements to streaming dispatch
* improves code comment on test
* fixes unrelated compiler error
* minor style improvements
Refactor scan tunings (#3262)
Require C++17 for compiling Thrust and CUB (#3255)
* Issue an unsuppressable warning when compiling with < C++17
* Remove C++11/14 presets
* Remove CCCL_IGNORE_DEPRECATED_CPP_DIALECT from headers
* Remove [CUB|THRUST|TCT]_IGNORE_DEPRECATED_CPP_[11|14]
* Remove CUB_ENABLE_DIALECT_CPP[11|14]
* Update CI runs
* Remove C++11/14 CI runs for CUB and Thrust
* Raise compiler minimum versions for C++17
* Update ReadMe
* Drop Thrust's cpp14_required.h
* Add escape hatch for C++17 removal
Fixes: #3252
Implement `views::empty` (#3254)
* Disable pair conversion of subrange with clang in C++17
* Fix namespace views
* Implement `views::empty`
This implements `std::ranges::views::empty`, see https://en.cppreference.com/w/cpp/ranges/empty_view
Refactor `limits` and `climits` (#3221)
* implement builtins for huge val, nan and nans
* change `INFINITY` and `NAN` implementation for NVRTC
cuda.parallel: Add documentation for the current iterators along with examples and tests (#3311)
* Add tests demonstrating usage of different iterators
* Update documentation of reduce_into by merging import code snippet with the rest of the example
* Add documentation for current iterators
* Run pre-commit checks and update accordingly
* Fix comments to refer to the proper lines in the code snippets in the docs
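The iterators documented in the commit above map a linear index to a generated value instead of reading from memory. As a rough pure-Python model of that idea (this is not the actual cuda.parallel API, whose iterators generate device code):

```python
class CountingIteratorModel:
    """Index -> value mapping with no backing storage.

    Illustrative stand-in only; the real cuda.parallel counting iterator
    is compiled to device code and is not implemented this way.
    """

    def __init__(self, start: int):
        self.start = start

    def __getitem__(self, i: int) -> int:
        return self.start + i


it = CountingIteratorModel(10)
print([it[i] for i in range(4)])  # [10, 11, 12, 13]
```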
Drop clang<14 from CI, update devcontainers. (#3309)
Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>
[STF] Cleanup task dependencies object constructors (#3291)
* Define tag types for access modes
* - Rework how we build task_dep objects based on access mode tags
- pack_state is now responsible for using a const_cast for read-only data
* Greatly simplify the previous attempt : do not define new types, but use integral constants based on the enums
* It seems the const_cast was not necessary, so we can simplify it and avoid dispatching on access modes
Disable test with a gcc-14 regression (#3297)
Deprecate Thrust's cpp_compatibility.h macros (#3299)
Remove dropped function objects from docs (#3319)
Document `NV_TARGET` macros (#3313)
[STF] Define ctx.pick_stream() which was missing for the unified context (#3326)
* Define ctx.pick_stream() which was missing for the unified context
* clang-format
Deprecate cub::IterateThreadStore (#3337)
Drop CUB's BinaryFlip operator (#3332)
Deprecate cub::Swap (#3333)
Clarify transform output can overlap input (#3323)
Drop CUB APIs with a debug_synchronous parameter (#3330)
Fixes: #3329
Drop CUB's util_compiler.cuh for real (#3340)
PR #3302 planned to drop the file, but only dropped its content. This
was an oversight. So let's drop the entire file.
Drop cub::ValueCache (#3346)
limits offset types for merge sort (#3328)
Drop CDPv1 (#3344)
Fixes: #3341
Drop thrust::void_t (#3362)
Use cuda::std::addressof in Thrust (#3363)
Fix all_of documentation for empty ranges (#3358)
all_of always returns true on an empty range.
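The convention above is vacuous truth, the same one Python's own built-ins follow; a quick sanity check:

```python
# all_of on an empty range is vacuously true -- the same convention as
# Python's built-in all() and any():
print(all([]))  # True: no element violates the predicate
print(any([]))  # False: no element satisfies it
```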
[STF] Do not keep track of dangling events in a CUDA graph backend (#3327)
* Unlike the CUDA stream backend, nodes in a CUDA graph are necessarily done when
the CUDA graph completes. Therefore keeping track of "dangling events" is a
waste of time and resources.
* replace can_ignore_dangling_events by track_dangling_events which leads to more readable code
* When not storing the dangling events, we must still perform the deinit operations that were producing these events!
Extract scan kernels into NVRTC-compilable header (#3334)
* Extract scan kernels into NVRTC-compilable header
* Update cub/cub/device/dispatch/dispatch_scan.cuh
Co-authored-by: Georgii Evtushenko <evtushenko.georgy@gmail.com>
---------
Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>
Co-authored-by: Georgii Evtushenko <evtushenko.georgy@gmail.com>
Drop deprecated aliases in Thrust functional (#3272)
Fixes: #3271
Drop cub::DivideAndRoundUp (#3347)
Use cuda::std::min/max in Thrust (#3364)
Implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` (#3361)
* implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16`
Cleanup util_arch (#2773)
Deprecate thrust::null_type (#3367)
Deprecate cub::DeviceSpmv (#3320)
Fixes: #896
Improves `DeviceSegmentedSort` test run time for large number of items and segments (#3246)
* fixes segment offset generation
* switches to analytical verification
* switches to analytical verification for pairs
* fixes spelling
* adds tests for large number of segments
* fixes narrowing conversion in tests
* addresses review comments
* fixes includes
Compile basic infra test with C++17 (#3377)
Adds support for large number of items and large number of segments to `DeviceSegmentedSort` (#3308)
* fixes segment offset generation
* switches to analytical verification
* switches to analytical verification for pairs
* addresses review comments
* introduces segment offset type
* adds tests for large number of segments
* adds support for large number of segments
* drops segment offset type
* fixes thrust namespace
* removes about-to-be-deprecated cub iterators
* no exec specifier on defaulted ctor
* fixes gcc7 linker error
* uses local_segment_index_t throughout
* determine offset type based on type returned by segment iterator begin/end iterators
* minor style improvements
Exit with error when RAPIDS CI fails. (#3385)
cuda.parallel: Support structured types as algorithm inputs (#3218)
* Introduce gpu_struct decorator and typing
* Enable `reduce` to accept arrays of structs as inputs
* Add test for reducing arrays-of-struct
* Update documentation
* Use a numpy array rather than ctypes object
* Change zeros -> empty for output array and temp storage
* Add a TODO for typing GpuStruct
* Documentation updates
* Remove test_reduce_struct_type from test_reduce.py
* Revert to `to_cccl_value()` accepting ndarray + GpuStruct
* Bump copyrights
---------
Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>
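As a host-side sketch of what a field-wise reduction over an array of structures computes (the `Pixel` type and `min_op` here are illustrative stand-ins, not the `gpu_struct` API itself):

```python
from dataclasses import dataclass
from functools import reduce


@dataclass
class Pixel:  # stand-in for a @gpu_struct-decorated type
    r: int
    g: int
    b: int


def min_op(a: Pixel, b: Pixel) -> Pixel:
    # Field-wise minimum: the kind of operator a struct reduction applies.
    return Pixel(min(a.r, b.r), min(a.g, b.g), min(a.b, b.b))


data = [Pixel(3, 7, 2), Pixel(1, 9, 4), Pixel(5, 0, 8)]
print(reduce(min_op, data))  # Pixel(r=1, g=0, b=2)
```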
Deprecate thrust::async (#3324)
Fixes: #100
Review/Deprecate CUB `util.ptx` for CCCL 2.x (#3342)
Fix broken `_CCCL_BUILTIN_ASSUME` macro (#3314)
* add compiler-specific path
* fix device code path
* add _CCCL_ASSUME
Deprecate thrust::numeric_limits (#3366)
Replace `typedef` with `using` in libcu++ (#3368)
Deprecate thrust::optional (#3307)
Fixes: #3306
Upgrade to Catch2 3.8 (#3310)
Fixes: #1724
refactor `<cuda/std/cstdint>` (#3325)
Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>
Update CODEOWNERS (#3331)
* Update CODEOWNERS
* Update CODEOWNERS
* Update CODEOWNERS
* [pre-commit.ci] auto code formatting
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Fix sign-compare warning (#3408)
Implement more cmath functions to be usable on host and device (#3382)
* Implement more cmath functions to be usable on host and device
* Implement math roots functions
* Implement exponential functions
Redefine and deprecate thrust::remove_cvref (#3394)
* Redefine and deprecate thrust::remove_cvref
Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>
Fix assert definition for NVHPC due to constexpr issues (#3418)
NVHPC cannot decide at compile time where the code will run, so _CCCL_ASSERT inside a constexpr function breaks it.
Fix this by always using the host definition, which should also work on device.
Fixes #3411
Extend CUB reduce benchmarks (#3401)
* Rename max.cu to custom.cu, since it uses a custom operator
* Extend types covered by min.cu to all fundamental types
* Add some notes on how to collect tuning parameters
Fixes: #3283
Update upload-pages-artifact to v3 (#3423)
* Update upload-pages-artifact to v3
* Empty commit
---------
Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>
Replace and deprecate thrust::cuda_cub::terminate (#3421)
`std::linalg` accessors and `transposed_layout` (#2962)
Add round up/down to multiple (#3234)
[FEA]: Introduce Python module with CCCL headers (#3201)
* Add cccl/python/cuda_cccl directory and use from cuda_parallel, cuda_cooperative
* Run `copy_cccl_headers_to_cuda_include()` before `setup()`
* Create python/cuda_cccl/cuda/_include/__init__.py, then simply import cuda._include to find the include path.
* Add cuda.cccl._version exactly as for cuda.cooperative and cuda.parallel
* Bug fix: cuda/_include only exists after shutil.copytree() ran.
* Use `f"cuda-cccl @ file://{cccl_path}/python/cuda_cccl"` in setup.py
* Remove CustomBuildCommand, CustomWheelBuild in cuda_parallel/setup.py (they are equivalent to the default functions)
* Replace := operator (needs Python 3.8+)
* Fix oversights: remove `pip3 install ./cuda_cccl` lines from README.md
* Restore original README.md: `pip3 install -e` now works on first pass.
* cuda_cccl/README.md: FOR INTERNAL USE ONLY
* Remove `$pymajor.$pyminor.` prefix in cuda_cccl _version.py (as suggested under https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894035917)
Command used: ci/update_version.sh 2 8 0
* Modernize pyproject.toml, setup.py
Trigger for this change:
* https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894043178
* https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894044996
* Install CCCL headers under cuda.cccl.include
Trigger for this change:
* https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894048562
Unexpected accidental discovery: cuda.cooperative unit tests pass entirely without CCCL headers.
* Factor out cuda_cccl/cuda/cccl/include_paths.py
* Reuse cuda_cccl/cuda/cccl/include_paths.py from cuda_cooperative
* Add missing Copyright notice.
* Add missing __init__.py (cuda.cccl)
* Add `"cuda.cccl"` to `autodoc.mock_imports`
* Move cuda.cccl.include_paths into function where it is used. (Attempt to resolve Build and Verify Docs failure.)
* Add # TODO: move this to a module-level import
* Modernize cuda_cooperative/pyproject.toml, setup.py
* Convert cuda_cooperative to use hatchling as build backend.
* Revert "Convert cuda_cooperative to use hatchling as build backend."
This reverts commit 61637d608da06fcf6851ef6197f88b5e7dbc3bbe.
* Move numpy from [build-system] requires -> [project] dependencies
* Move pyproject.toml [project] dependencies -> setup.py install_requires, to be able to use CCCL_PATH
* Remove copy_license() and use license_files=["../../LICENSE"] instead.
* Further modernize cuda_cccl/setup.py to use pathlib
* Trivial simplifications in cuda_cccl/pyproject.toml
* Further simplify cuda_cccl/pyproject.toml, setup.py: remove inconsequential code
* Make cuda_cooperative/pyproject.toml more similar to cuda_cccl/pyproject.toml
* Add taplo-pre-commit to .pre-commit-config.yaml
* taplo-pre-commit auto-fixes
* Use pathlib in cuda_cooperative/setup.py
* CCCL_PYTHON_PATH in cuda_cooperative/setup.py
* Modernize cuda_parallel/pyproject.toml, setup.py
* Use pathlib in cuda_parallel/setup.py
* Add `# TOML lint & format` comment.
* Replace MANIFEST.in with `[tool.setuptools.package-data]` section in pyproject.toml
* Use pathlib in cuda/cccl/include_paths.py
* pre-commit autoupdate (EXCEPT clang-format, which was manually restored)
* Fixes after git merge main
* Resolve warning: AttributeError: '_Reduce' object has no attribute 'build_result'
```
=========================================================================== warnings summary ===========================================================================
tests/test_reduce.py::test_reduce_non_contiguous
/home/coder/cccl/python/devenv/lib/python3.12/site-packages/_pytest/unraisableexception.py:85: PytestUnraisableExceptionWarning: Exception ignored in: <function _Reduce.__del__ at 0x7bf123139080>
Traceback (most recent call last):
File "/home/coder/cccl/python/cuda_parallel/cuda/parallel/experimental/algorithms/reduce.py", line 132, in __del__
bindings.cccl_device_reduce_cleanup(ctypes.byref(self.build_result))
^^^^^^^^^^^^^^^^^
AttributeError: '_Reduce' object has no attribute 'build_result'
warnings.warn(pytest.PytestUnraisableExceptionWarning(msg))
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================================= 1 passed, 93 deselected, 1 warning in 0.44s ==============================================================
```
* Move `copy_cccl_headers_to_cuda_cccl_include()` functionality to `class CustomBuildPy`
* Introduce cuda_cooperative/constraints.txt
* Also add cuda_parallel/constraints.txt
* Add `--constraint constraints.txt` in ci/test_python.sh
* Update Copyright dates
* Switch to https://github.com/ComPWA/taplo-pre-commit (the other repo has been archived by the owner on Jul 1, 2024)
For completeness: The other repo took a long time to install into the pre-commit cache; so long that it led to timeouts in the CCCL CI.
* Remove unused cuda_parallel jinja2 dependency (noticed by chance).
* Remove constraints.txt files, advertise running `pip install cuda-cccl` first instead.
* Make cuda_cooperative, cuda_parallel testing completely independent.
* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Fix sign-compare warning (#3408) [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Revert "Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]"
This reverts commit ea33a218ed77a075156cd1b332047202adb25aa2.
Error message: https://github.com/NVIDIA/cccl/pull/3201#issuecomment-2594012971
* Try using A100 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Also show cuda-cooperative site-packages, cuda-parallel site-packages (after pip install) [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Try using l4 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Restore original ci/matrix.yaml [skip-rapids]
* Use for loop in test_python.sh to avoid code duplication.
* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]
* Comment out taplo-lint in pre-commit config [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Revert "Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]"
This reverts commit ec206fd8b50a6a293e00a5825b579e125010b13d.
* Implement suggestion by @shwina (https://github.com/NVIDIA/cccl/pull/3201#pullrequestreview-2556918460)
* Address feedback by @leofang
---------
Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>
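The include-path discovery idea running through the bullets above (bundle the headers inside the package, then derive their location from the package itself at runtime) can be sketched as follows; the function name and layout are illustrative, not the exact `cuda.cccl.include_paths` API:

```python
from pathlib import Path
import importlib.util


def bundled_include_path(package: str, subdir: str = "include") -> Path:
    """Locate a directory shipped inside an installed package."""
    spec = importlib.util.find_spec(package)
    # spec.origin points at the package's __init__.py; files bundled
    # alongside it are found relative to that file.
    return Path(spec.origin).parent / subdir


# Demonstrate with a stdlib package ('email' bundles a 'mime' subdirectory):
print(bundled_include_path("email", "mime"))
```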
cuda.parallel: Add optional stream argument to reduce_into() (#3348)
* Add optional stream argument to reduce_into()
* Add tests to check for reduce_into() stream behavior
* Move protocol related utils to separate file and rework __cuda_stream__ error messages
* Fix synchronization issue in stream test and add one more invalid stream test case
* Rename cuda stream validation function after removing leading underscore
* Unpack values from __cuda_stream__ instead of indexing
* Fix linting errors
* Handle TypeError when unpacking invalid __cuda_stream__ return
* Use stream to allocate cupy memory in new stream test
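The `__cuda_stream__` protocol mentioned above returns a `(version, handle)` pair; unpacking it (rather than indexing) turns a malformed return value into an exception that can be reported cleanly. A hedged sketch, with illustrative names:

```python
class FakeStream:
    # Minimal object implementing the __cuda_stream__ protocol:
    # it returns a (protocol version, raw stream handle) pair.
    def __cuda_stream__(self):
        return (0, 42)


def validate_stream(obj):
    if not hasattr(obj, "__cuda_stream__"):
        raise TypeError(f"{obj!r} does not implement __cuda_stream__")
    try:
        # Unpacking (not indexing) raises if the return value is not a
        # two-element sequence, which we convert to a clear TypeError.
        version, handle = obj.__cuda_stream__()
    except (TypeError, ValueError) as exc:
        raise TypeError("__cuda_stream__ must return (version, handle)") from exc
    return handle


print(validate_stream(FakeStream()))  # -> 42
```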
Upgrade to actions/deploy-pages@v4 (from v2), as suggested by @leofang (#3434)
Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ (#3419)
* Deprecate `cub::{min, max}` and replace internal uses with those from libcu++
Fixes #3404
Fix CI issues (#3443)
update docs
fix review
restrict allowed types
replace constexpr implementations with generic
optimize `__is_arithmetic_integral`
update docs update docs add `memcmp`, `memmove` and `memchr` implementations implement tests Use cuda::std::min/max in Thrust (NVIDIA#3364) Implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` (NVIDIA#3361) * implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` Cleanup util_arch (NVIDIA#2773) Deprecate thrust::null_type (NVIDIA#3367) Deprecate cub::DeviceSpmv (NVIDIA#3320) Fixes: NVIDIA#896 Improves `DeviceSegmentedSort` test run time for large number of items and segments (NVIDIA#3246) * fixes segment offset generation * switches to analytical verification * switches to analytical verification for pairs * fixes spelling * adds tests for large number of segments * fixes narrowing conversion in tests * addresses review comments * fixes includes Compile basic infra test with C++17 (NVIDIA#3377) Adds support for large number of items and large number of segments to `DeviceSegmentedSort` (NVIDIA#3308) * fixes segment offset generation * switches to analytical verification * switches to analytical verification for pairs * addresses review comments * introduces segment offset type * adds tests for large number of segments * adds support for large number of segments * drops segment offset type * fixes thrust namespace * removes about-to-be-deprecated cub iterators * no exec specifier on defaulted ctor * fixes gcc7 linker error * uses local_segment_index_t throughout * determine offset type based on type returned by segment iterator begin/end iterators * minor style improvements Exit with error when RAPIDS CI fails. 
(NVIDIA#3385) cuda.parallel: Support structured types as algorithm inputs (NVIDIA#3218) * Introduce gpu_struct decorator and typing * Enable `reduce` to accept arrays of structs as inputs * Add test for reducing arrays-of-struct * Update documentation * Use a numpy array rather than ctypes object * Change zeros -> empty for output array and temp storage * Add a TODO for typing GpuStruct * Documentation udpates * Remove test_reduce_struct_type from test_reduce.py * Revert to `to_cccl_value()` accepting ndarray + GpuStruct * Bump copyrights --------- Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com> Deprecate thrust::async (NVIDIA#3324) Fixes: NVIDIA#100 Review/Deprecate CUB `util.ptx` for CCCL 2.x (NVIDIA#3342) Fix broken `_CCCL_BUILTIN_ASSUME` macro (NVIDIA#3314) * add compiler-specific path * fix device code path * add _CCC_ASSUME Deprecate thrust::numeric_limits (NVIDIA#3366) Replace `typedef` with `using` in libcu++ (NVIDIA#3368) Deprecate thrust::optional (NVIDIA#3307) Fixes: NVIDIA#3306 Upgrade to Catch2 3.8 (NVIDIA#3310) Fixes: NVIDIA#1724 refactor `<cuda/std/cstdint>` (NVIDIA#3325) Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com> Update CODEOWNERS (NVIDIA#3331) * Update CODEOWNERS * Update CODEOWNERS * Update CODEOWNERS * [pre-commit.ci] auto code formatting --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Fix sign-compare warning (NVIDIA#3408) Implement more cmath functions to be usable on host and device (NVIDIA#3382) * Implement more cmath functions to be usable on host and device * Implement math roots functions * Implement exponential functions Redefine and deprecate thrust::remove_cvref (NVIDIA#3394) * Redefine and deprecate thrust::remove_cvref Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com> Fix assert definition for NVHPC due to constexpr issues (NVIDIA#3418) NVHPC cannot decide at compile time where the code would run so _CCCL_ASSERT within a 
constexpr function breaks it. Fix this by always using the host definition which should also work on device. Fixes NVIDIA#3411 Extend CUB reduce benchmarks (NVIDIA#3401) * Rename max.cu to custom.cu, since it uses a custom operator * Extend types covered my min.cu to all fundamental types * Add some notes on how to collect tuning parameters Fixes: NVIDIA#3283 Update upload-pages-artifact to v3 (NVIDIA#3423) * Update upload-pages-artifact to v3 * Empty commit --------- Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com> Replace and deprecate thrust::cuda_cub::terminate (NVIDIA#3421) `std::linalg` accessors and `transposed_layout` (NVIDIA#2962) Add round up/down to multiple (NVIDIA#3234) [FEA]: Introduce Python module with CCCL headers (NVIDIA#3201) * Add cccl/python/cuda_cccl directory and use from cuda_parallel, cuda_cooperative * Run `copy_cccl_headers_to_aude_include()` before `setup()` * Create python/cuda_cccl/cuda/_include/__init__.py, then simply import cuda._include to find the include path. * Add cuda.cccl._version exactly as for cuda.cooperative and cuda.parallel * Bug fix: cuda/_include only exists after shutil.copytree() ran. * Use `f"cuda-cccl @ file://{cccl_path}/python/cuda_cccl"` in setup.py * Remove CustomBuildCommand, CustomWheelBuild in cuda_parallel/setup.py (they are equivalent to the default functions) * Replace := operator (needs Python 3.8+) * Fix oversights: remove `pip3 install ./cuda_cccl` lines from README.md * Restore original README.md: `pip3 install -e` now works on first pass. 
* cuda_cccl/README.md: FOR INTERNAL USE ONLY * Remove `$pymajor.$pyminor.` prefix in cuda_cccl _version.py (as suggested under NVIDIA#3201 (comment)) Command used: ci/update_version.sh 2 8 0 * Modernize pyproject.toml, setup.py Trigger for this change: * NVIDIA#3201 (comment) * NVIDIA#3201 (comment) * Install CCCL headers under cuda.cccl.include Trigger for this change: * NVIDIA#3201 (comment) Unexpected accidental discovery: cuda.cooperative unit tests pass without CCCL headers entirely. * Factor out cuda_cccl/cuda/cccl/include_paths.py * Reuse cuda_cccl/cuda/cccl/include_paths.py from cuda_cooperative * Add missing Copyright notice. * Add missing __init__.py (cuda.cccl) * Add `"cuda.cccl"` to `autodoc.mock_imports` * Move cuda.cccl.include_paths into function where it is used. (Attempt to resolve Build and Verify Docs failure.) * Add # TODO: move this to a module-level import * Modernize cuda_cooperative/pyproject.toml, setup.py * Convert cuda_cooperative to use hatchling as build backend. * Revert "Convert cuda_cooperative to use hatchling as build backend." This reverts commit 61637d6. * Move numpy from [build-system] requires -> [project] dependencies * Move pyproject.toml [project] dependencies -> setup.py install_requires, to be able to use CCCL_PATH * Remove copy_license() and use license_files=["../../LICENSE"] instead. * Further modernize cuda_cccl/setup.py to use pathlib * Trivial simplifications in cuda_cccl/pyproject.toml * Further simplify cuda_cccl/pyproject.toml, setup.py: remove inconsequential code * Make cuda_cooperative/pyproject.toml more similar to cuda_cccl/pyproject.toml * Add taplo-pre-commit to .pre-commit-config.yaml * taplo-pre-commit auto-fixes * Use pathlib in cuda_cooperative/setup.py * CCCL_PYTHON_PATH in cuda_cooperative/setup.py * Modernize cuda_parallel/pyproject.toml, setup.py * Use pathlib in cuda_parallel/setup.py * Add `# TOML lint & format` comment. 
* Replace MANIFEST.in with `[tool.setuptools.package-data]` section in pyproject.toml * Use pathlib in cuda/cccl/include_paths.py * pre-commit autoupdate (EXCEPT clang-format, which was manually restored) * Fixes after git merge main * Resolve warning: AttributeError: '_Reduce' object has no attribute 'build_result' ``` =========================================================================== warnings summary =========================================================================== tests/test_reduce.py::test_reduce_non_contiguous /home/coder/cccl/python/devenv/lib/python3.12/site-packages/_pytest/unraisableexception.py:85: PytestUnraisableExceptionWarning: Exception ignored in: <function _Reduce.__del__ at 0x7bf123139080> Traceback (most recent call last): File "/home/coder/cccl/python/cuda_parallel/cuda/parallel/experimental/algorithms/reduce.py", line 132, in __del__ bindings.cccl_device_reduce_cleanup(ctypes.byref(self.build_result)) ^^^^^^^^^^^^^^^^^ AttributeError: '_Reduce' object has no attribute 'build_result' warnings.warn(pytest.PytestUnraisableExceptionWarning(msg)) -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html ============================================================= 1 passed, 93 deselected, 1 warning in 0.44s ============================================================== ``` * Move `copy_cccl_headers_to_cuda_cccl_include()` functionality to `class CustomBuildPy` * Introduce cuda_cooperative/constraints.txt * Also add cuda_parallel/constraints.txt * Add `--constraint constraints.txt` in ci/test_python.sh * Update Copyright dates * Switch to https://github.com/ComPWA/taplo-pre-commit (the other repo has been archived by the owner on Jul 1, 2024) For completeness: The other repo took a long time to install into the pre-commit cache; so long it lead to timeouts in the CCCL CI. * Remove unused cuda_parallel jinja2 dependency (noticed by chance). 
* Remove constraints.txt files, advertise running `pip install cuda-cccl` first instead. * Make cuda_cooperative, cuda_parallel testing completely independent. * Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc] * Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Fix sign-compare warning (NVIDIA#3408) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Revert "Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]" This reverts commit ea33a21. Error message: NVIDIA#3201 (comment) * Try using A100 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Also show cuda-cooperative site-packages, cuda-parallel site-packages (after pip install) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Try using l4 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Restore original ci/matrix.yaml [skip-rapids] * Use for loop in test_python.sh to avoid code duplication. * Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci] * Comment out taplo-lint in pre-commit config [skip-rapids][skip-matx][skip-docs][skip-vdc] * Revert "Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]" This reverts commit ec206fd. 
cuda.parallel: Add optional stream argument to reduce_into() (NVIDIA#3348)
* Add optional stream argument to reduce_into()
* Add tests to check for reduce_into() stream behavior
* Move protocol related utils to separate file and rework __cuda_stream__ error messages
* Fix synchronization issue in stream test and add one more invalid stream test case
* Rename cuda stream validation function after removing leading underscore
* Unpack values from __cuda_stream__ instead of indexing
* Fix linting errors
* Handle TypeError when unpacking invalid __cuda_stream__ return
* Use stream to allocate cupy memory in new stream test

Upgrade to actions/deploy-pages@v4 (from v2), as suggested by @leofang (NVIDIA#3434)

Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ (NVIDIA#3419)
Fixes NVIDIA#3404

Fix CI issues (NVIDIA#3443)

Remove deprecated `cub::min` (NVIDIA#3450)
* Remove deprecated `cuda::{min,max}`
* Drop unused `thrust::remove_cvref` file

Fix typo in builtin (NVIDIA#3451)

Moves agents to `detail::<algorithm_name>` namespace (NVIDIA#3435)

Uses unsigned offset types in thrust's scan dispatch (NVIDIA#3436)

Default transform_iterator's copy ctor (NVIDIA#3395)
Fixes: NVIDIA#2393

Turn C++ dialect warning into error (NVIDIA#3453)

Uses unsigned offset types in thrust's sort algorithm calling into `DispatchMergeSort` (NVIDIA#3437)
* uses thrust's dynamic dispatch for merge_sort
* [pre-commit.ci] auto code formatting
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

Refactor allocator handling of contiguous_storage (NVIDIA#3050)
Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Drop thrust::detail::integer_traits (NVIDIA#3391)

Add cuda::is_floating_point supporting half and bfloat (NVIDIA#3379)
Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Improve docs of std headers (NVIDIA#3416)

Drop C++11 and C++14 support for all of cccl (NVIDIA#3417)
Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>

Deprecate a few CUB macros (NVIDIA#3456)

Deprecate thrust universal iterator categories (NVIDIA#3461)

Fix launch args order (NVIDIA#3465)

Add `--extended-lambda` to the list of removed clangd flags (NVIDIA#3432)

Add `_CCCL_HAS_NVFP8` macro (NVIDIA#3429)

Add `_CCCL_BUILTIN_PREFETCH` (NVIDIA#3433)

Drop universal iterator categories (NVIDIA#3474)

Ensure that headers in `<cuda/*>` can be built with a C++ only compiler (NVIDIA#3472)

Specialize __is_extended_floating_point for FP8 types (NVIDIA#3470)
Also ensure that we actually can enable FP8 due to FP16 and BF16 requirements
Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Moves CUB kernel entry points to a detail namespace (NVIDIA#3468)
* moves emptykernel to detail ns
* second batch
* third batch
* fourth batch
* fixes cuda parallel
* concatenates nested namespaces

Deprecate block/warp algo specializations (NVIDIA#3455)
Fixes: NVIDIA#3409

Refactor CUB's util_debug (NVIDIA#3345)
* Add cccl/python/cuda_cccl directory and use from cuda_parallel, cuda_cooperative
* Run `copy_cccl_headers_to_aude_include()` before `setup()`
* Create python/cuda_cccl/cuda/_include/__init__.py, then simply import cuda._include to find the include path.
* Add cuda.cccl._version exactly as for cuda.cooperative and cuda.parallel
* Bug fix: cuda/_include only exists after shutil.copytree() ran.
* Use `f"cuda-cccl @ file://{cccl_path}/python/cuda_cccl"` in setup.py
* Remove CustomBuildCommand, CustomWheelBuild in cuda_parallel/setup.py (they are equivalent to the default functions)
* Replace := operator (needs Python 3.8+)
* Fix oversights: remove `pip3 install ./cuda_cccl` lines from README.md
* Restore original README.md: `pip3 install -e` now works on first pass.
* cuda_cccl/README.md: FOR INTERNAL USE ONLY
* Remove `$pymajor.$pyminor.` prefix in cuda_cccl _version.py (as suggested under NVIDIA#3201 (comment))
Command used: ci/update_version.sh 2 8 0
* Modernize pyproject.toml, setup.py
Trigger for this change:
* NVIDIA#3201 (comment)
* NVIDIA#3201 (comment)
* Install CCCL headers under cuda.cccl.include
Trigger for this change:
* NVIDIA#3201 (comment)
Unexpected accidental discovery: cuda.cooperative unit tests pass without CCCL headers entirely.
* Factor out cuda_cccl/cuda/cccl/include_paths.py
* Reuse cuda_cccl/cuda/cccl/include_paths.py from cuda_cooperative
* Add missing Copyright notice.
* Add missing __init__.py (cuda.cccl)
* Add `"cuda.cccl"` to `autodoc.mock_imports`
* Move cuda.cccl.include_paths into function where it is used. (Attempt to resolve Build and Verify Docs failure.)
* Add # TODO: move this to a module-level import
* Modernize cuda_cooperative/pyproject.toml, setup.py
* Convert cuda_cooperative to use hatchling as build backend.
* Revert "Convert cuda_cooperative to use hatchling as build backend."
This reverts commit 61637d6.
* Move numpy from [build-system] requires -> [project] dependencies
* Move pyproject.toml [project] dependencies -> setup.py install_requires, to be able to use CCCL_PATH
* Remove copy_license() and use license_files=["../../LICENSE"] instead.
* Further modernize cuda_cccl/setup.py to use pathlib
* Trivial simplifications in cuda_cccl/pyproject.toml
* Further simplify cuda_cccl/pyproject.toml, setup.py: remove inconsequential code
* Make cuda_cooperative/pyproject.toml more similar to cuda_cccl/pyproject.toml
* Add taplo-pre-commit to .pre-commit-config.yaml
* taplo-pre-commit auto-fixes
* Use pathlib in cuda_cooperative/setup.py
* CCCL_PYTHON_PATH in cuda_cooperative/setup.py
* Modernize cuda_parallel/pyproject.toml, setup.py
* Use pathlib in cuda_parallel/setup.py
* Add `# TOML lint & format` comment.
* Replace MANIFEST.in with `[tool.setuptools.package-data]` section in pyproject.toml
* Use pathlib in cuda/cccl/include_paths.py
* pre-commit autoupdate (EXCEPT clang-format, which was manually restored)
* Fixes after git merge main
* Resolve warning: AttributeError: '_Reduce' object has no attribute 'build_result'
```
=========================================================================== warnings summary ===========================================================================
tests/test_reduce.py::test_reduce_non_contiguous
/home/coder/cccl/python/devenv/lib/python3.12/site-packages/_pytest/unraisableexception.py:85: PytestUnraisableExceptionWarning: Exception ignored in: <function _Reduce.__del__ at 0x7bf123139080>
Traceback (most recent call last):
File "/home/coder/cccl/python/cuda_parallel/cuda/parallel/experimental/algorithms/reduce.py", line 132, in __del__
bindings.cccl_device_reduce_cleanup(ctypes.byref(self.build_result))
^^^^^^^^^^^^^^^^^
AttributeError: '_Reduce' object has no attribute 'build_result'
warnings.warn(pytest.PytestUnraisableExceptionWarning(msg))
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================================= 1 passed, 93 deselected, 1 warning in 0.44s ==============================================================
```
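The traceback above is an instance of a general Python hazard rather than anything specific to cuda.parallel: if `__init__` raises before an attribute is assigned, `__del__` still runs on the partially constructed object. A minimal, self-contained sketch of the hazard and one defensive fix (the class and attribute names are illustrative only, not the real cuda.parallel code):

```python
import gc


class Builder:
    """Illustrative stand-in for a class whose __del__ cleans up a handle."""

    def __init__(self, fail: bool):
        if fail:
            # __init__ raises before build_result is ever assigned ...
            raise RuntimeError("build failed")
        self.build_result = "handle"

    def __del__(self):
        # ... yet __del__ still runs on the half-built object, so guard the
        # attribute access instead of assuming __init__ completed. Without
        # the guard, Python reports "Exception ignored in: ... __del__",
        # which pytest surfaces as PytestUnraisableExceptionWarning.
        result = getattr(self, "build_result", None)
        if result is not None:
            pass  # real code would invoke its cleanup binding here


try:
    Builder(fail=True)
except RuntimeError:
    pass
gc.collect()  # finalizing the partial object must not raise
```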
* Move `copy_cccl_headers_to_cuda_cccl_include()` functionality to `class CustomBuildPy`
* Introduce cuda_cooperative/constraints.txt
* Also add cuda_parallel/constraints.txt
* Add `--constraint constraints.txt` in ci/test_python.sh
* Update Copyright dates
* Switch to https://github.com/ComPWA/taplo-pre-commit (the other repo has been archived by the owner on Jul 1, 2024)
For completeness: The other repo took a long time to install into the pre-commit cache; so long it led to timeouts in the CCCL CI.
* Remove unused cuda_parallel jinja2 dependency (noticed by chance).
* Remove constraints.txt files, advertise running `pip install cuda-cccl` first instead.
* Make cuda_cooperative, cuda_parallel testing completely independent.
* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Fix sign-compare warning (NVIDIA#3408) [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Revert "Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]"
This reverts commit ea33a21.
Error message: NVIDIA#3201 (comment)
* Try using A100 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Also show cuda-cooperative site-packages, cuda-parallel site-packages (after pip install) [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Try using l4 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Restore original ci/matrix.yaml [skip-rapids]
* Use for loop in test_python.sh to avoid code duplication.
* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]
* Comment out taplo-lint in pre-commit config [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Revert "Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]"
This reverts commit ec206fd.
* Implement suggestion by @shwina (NVIDIA#3201 (review))
* Address feedback by @leofang
---------
Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>
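The `cuda-cccl @ file://...` requirement string used in setup.py above is a PEP 508 direct reference; the same string also works as a line in a pip constraints file. A small sketch of how it might be composed (the checkout path here is an assumption for illustration; the real setup.py would derive it from its own location or a CCCL_PATH-style variable):

```python
from pathlib import Path

# Assumed CCCL checkout location (illustrative only).
cccl_path = Path("/opt/src/cccl")

# PEP 508 direct reference pointing at the local cuda_cccl source tree.
requirement = f"cuda-cccl @ file://{cccl_path}/python/cuda_cccl"
print(requirement)  # -> cuda-cccl @ file:///opt/src/cccl/python/cuda_cccl
```

Written into a constraints file, this line lets CI perform a single `pip install --constraint constraints.txt` that resolves cuda-cccl from the repo, and fails if cuda-cccl is missing from the package's declared dependencies.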
* Backport PRs #3201, #3523, #3547, #3580 to the 2.8.x branch. (#3536)
* [FEA]: Introduce Python module with CCCL headers (#3201)
* cuda.parallel: invoke pytest directly rather than via `python -m pytest` (#3523)
Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>
* Copy file from PR #3547 (bugfix/drop_pipe_in_lit by @wmaxey)
* Revert "cuda.parallel: invoke pytest directly rather than via `python -m pytest` (#3523)"
This reverts commit a2e21cb.
* Replace pipes.quote with shlex.quote in lit config (#3547)
* Drop TBB run on windows to unblock CI
* Update ci/matrix.yaml
Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>
Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>
* Remove nvks runners from testing pool. (#3580)
---------
Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>
Co-authored-by: Ashwin Srinath <3190405+shwina@users.noreply.github.com>
Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>
Co-authored-by: Wesley Maxey <71408887+wmaxey@users.noreply.github.com>
Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>
Co-authored-by: Allison Piper <alliepiper16@gmail.com>
* Suppress execution checks for vocabulary types (#3578)
* Suppress execution checks for optional
* Suppress execution checks for `expected`
* Suppress execution checks for `pair`
* Suppress execution checks for `variant`
* Remove some jobs
* Disable samples in old CI
* Fix compiler detection
* Disable tests for unsupported standard modes
* Fix compiler detection more
* Fix matrix
* Also suppress for swap
* Fix formatting
* Use the internal function for MSVC
* Try adding import?
* Revert all changes to python module
* Update `upload-pages-artifact`
* Update RAPIDS to 25.02. (#2967)
* Remove RAFT BUILD_ANN_BENCH option.
* Rename KvikIO to kvikio.
* Add back cugraph-ops until it's completely purged from RAPIDS upstream dependencies.
* Update devcontainers.
* Use the 24.10 image for cccl CI
* Drop cugraph-ops
* Also drop cugraph-gnn for now
---------
Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>
Co-authored-by: Ashwin Srinath <3190405+shwina@users.noreply.github.com>
Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>
Co-authored-by: Wesley Maxey <71408887+wmaxey@users.noreply.github.com>
Co-authored-by: Allison Piper <alliepiper16@gmail.com>
Co-authored-by: Bradley Dice <bdice@bradleydice.com>
Description
closes #2281
* `pip install` works as expected in one pass. Resolves this.
* `cuda.cccl.include_paths` from `cuda.cooperative.experimental._nvrtc` and `cuda.parallel.experimental._bindings`.
* `os.path` -> `pathlib` modernization in all .py files touched by this PR.

Currently cuda-cccl is not published on PyPI. For interactive development, this is the recommended workflow:

I.e. by installing `cuda-cccl` first, the dependency is satisfied when running `pip install` in cuda_cooperative or cuda_parallel.

However, CI testing (ci/test_python.sh) uses an alternative approach, to ensure that we're not accidentally removing `cuda-cccl` from the `dependencies` in `{cuda_cooperative,cuda_parallel}/pyproject.toml`. Concretely, `pip install` is only run once, using the `pip install --constraint` option:

This will fail if `cuda-cccl` is missing in the `dependencies`.

Note for completeness:
I spent a significant amount of time trying to use `hatchling` as the build backend (instead of `setuptools`):

With that commit, `pip install` worked, but `pip install --editable` did not. The root cause is this file installed by cuda-python:

This file interferes with Python's Implicit Namespace Packages mechanism, which is what `hatchling` relies on in `--editable` mode.
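The namespace-package interference described above can be reproduced with the standard library alone. The sketch below (the package name `nspkg` is made up for illustration) shows the PEP 420 behavior that `hatchling` relies on: two portions of a package found on sys.path merge into one package only as long as neither directory contains an `__init__.py`.

```python
import pathlib
import sys
import tempfile

tmp = pathlib.Path(tempfile.mkdtemp())

# Two independent "site" directories, each contributing one portion of nspkg.
for site, mod in (("site_a", "mod_a"), ("site_b", "mod_b")):
    pkg = tmp / site / "nspkg"
    pkg.mkdir(parents=True)
    (pkg / f"{mod}.py").write_text(f"NAME = {mod!r}\n")
    sys.path.insert(0, str(tmp / site))

# With no __init__.py in either nspkg directory, PEP 420 merges the portions:
import nspkg.mod_a
import nspkg.mod_b

print(nspkg.mod_a.NAME, nspkg.mod_b.NAME)  # -> mod_a mod_b

# Dropping a regular __init__.py into ONE of the two nspkg directories turns
# that portion into a classic package and hides the other portion -- the
# failure mode caused by the file that cuda-python installs into the shared
# `cuda` directory.
```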