update docs
update docs
add `memcmp`, `memmove` and `memchr` implementations
implement tests
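A minimal host-side sketch of the `memcmp`/`memchr` semantics being added; the names here are illustrative stand-ins, not the actual `cuda::std` implementations:

```cpp
#include <cstddef>

// Illustrative byte-wise comparison, mirroring C's memcmp contract:
// returns <0, 0, or >0 for the first differing byte.
inline int my_memcmp(const void* lhs, const void* rhs, std::size_t count)
{
  auto* l = static_cast<const unsigned char*>(lhs);
  auto* r = static_cast<const unsigned char*>(rhs);
  for (std::size_t i = 0; i < count; ++i)
  {
    if (l[i] != r[i])
    {
      return l[i] < r[i] ? -1 : 1;
    }
  }
  return 0;
}

// Illustrative memchr: finds the first occurrence of a byte, or nullptr.
inline const void* my_memchr(const void* ptr, int ch, std::size_t count)
{
  auto* p = static_cast<const unsigned char*>(ptr);
  for (std::size_t i = 0; i < count; ++i)
  {
    if (p[i] == static_cast<unsigned char>(ch))
    {
      return p + i;
    }
  }
  return nullptr;
}
```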
Use cuda::std::min/max in Thrust (NVIDIA#3364)
Implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` (NVIDIA#3361)
* implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16`
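The shape of such a specialization can be sketched on the host with a hypothetical 16-bit type standing in for `__half` (the real specialization lives in libcu++ and requires the CUDA toolkit; the binary16 constants below are the well-known IEEE values):

```cpp
#include <limits>

// Hypothetical binary16 stand-in; the real __half needs a CUDA toolkit.
struct my_half
{
  unsigned short bits;
};

namespace std
{
// Sketch of how a numeric_limits specialization is shaped.
template <>
class numeric_limits<my_half>
{
public:
  static constexpr bool is_specialized = true;
  static constexpr bool is_signed      = true;
  static constexpr bool is_integer     = false;
  static constexpr int digits          = 11; // mantissa bits incl. implicit bit
  static constexpr int max_exponent    = 16;

  static constexpr my_half max() noexcept
  {
    return my_half{0x7BFF}; // bit pattern of 65504, the largest finite binary16
  }
};
} // namespace std
```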
Cleanup util_arch (NVIDIA#2773)
Deprecate thrust::null_type (NVIDIA#3367)
Deprecate cub::DeviceSpmv (NVIDIA#3320)
Fixes: NVIDIA#896
Improves `DeviceSegmentedSort` test run time for large numbers of items and segments (NVIDIA#3246)
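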
* fixes segment offset generation
* switches to analytical verification
* switches to analytical verification for pairs
* fixes spelling
* adds tests for large number of segments
* fixes narrowing conversion in tests
* addresses review comments
* fixes includes
Compile basic infra test with C++17 (NVIDIA#3377)
Adds support for large numbers of items and segments to `DeviceSegmentedSort` (NVIDIA#3308)
* fixes segment offset generation
* switches to analytical verification
* switches to analytical verification for pairs
* addresses review comments
* introduces segment offset type
* adds tests for large number of segments
* adds support for large number of segments
* drops segment offset type
* fixes thrust namespace
* removes about-to-be-deprecated cub iterators
* no exec specifier on defaulted ctor
* fixes gcc7 linker error
* uses local_segment_index_t throughout
* determine the offset type based on the type returned by the segment begin/end iterators
* minor style improvements
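Deriving the offset type from what the segment begin/end iterators return can be sketched like this (illustrative only, not the CUB dispatch code):

```cpp
#include <iterator>
#include <type_traits>
#include <vector>

// Sketch: take the offset type from the segment offset iterator itself,
// rather than hard-coding a fixed integer type.
template <class BeginOffsetIt>
using segment_offset_t = typename std::iterator_traits<BeginOffsetIt>::value_type;

// A raw pointer to long yields long offsets; a vector<int> iterator yields int.
static_assert(std::is_same<segment_offset_t<const long*>, long>::value,
              "pointer to long -> long offsets");
```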
Exit with error when RAPIDS CI fails. (NVIDIA#3385)
cuda.parallel: Support structured types as algorithm inputs (NVIDIA#3218)
* Introduce gpu_struct decorator and typing
* Enable `reduce` to accept arrays of structs as inputs
* Add test for reducing arrays-of-struct
* Update documentation
* Use a numpy array rather than ctypes object
* Change zeros -> empty for output array and temp storage
* Add a TODO for typing GpuStruct
* Documentation updates
* Remove test_reduce_struct_type from test_reduce.py
* Revert to `to_cccl_value()` accepting ndarray + GpuStruct
* Bump copyrights
---------
Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>
Deprecate thrust::async (NVIDIA#3324)
Fixes: NVIDIA#100
Review/Deprecate CUB `util.ptx` for CCCL 2.x (NVIDIA#3342)
Fix broken `_CCCL_BUILTIN_ASSUME` macro (NVIDIA#3314)
* add compiler-specific path
* fix device code path
* add `_CCCL_ASSUME`
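A compiler-specific "assume" macro of this kind can be sketched as follows; `MY_ASSUME` is an illustrative name, not the actual `_CCCL_BUILTIN_ASSUME` definition:

```cpp
// Pick the compiler's native assumption hint; fall back to a no-op.
#if defined(__clang__)
#  define MY_ASSUME(expr) __builtin_assume(expr)
#elif defined(_MSC_VER)
#  define MY_ASSUME(expr) __assume(expr)
#elif defined(__GNUC__)
#  define MY_ASSUME(expr) ((expr) ? static_cast<void>(0) : __builtin_unreachable())
#else
#  define MY_ASSUME(expr) static_cast<void>(0)
#endif

// The assumption lets the optimizer drop the divide-by-zero path;
// behavior is unchanged for valid inputs.
inline int fast_div(int a, int b)
{
  MY_ASSUME(b != 0);
  return a / b;
}
```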
Deprecate thrust::numeric_limits (NVIDIA#3366)
Replace `typedef` with `using` in libcu++ (NVIDIA#3368)
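The mechanical shape of that rewrite: `typedef` and `using` declare the same alias, but `using` reads left-to-right and also supports alias templates, which `typedef` cannot express directly:

```cpp
#include <type_traits>
#include <vector>

typedef std::vector<int> int_vector_old; // before
using int_vector = std::vector<int>;     // after: identical alias

// Alias templates are only possible with `using`.
template <class T>
using vec = std::vector<T>;

static_assert(std::is_same<int_vector_old, int_vector>::value,
              "both spellings name the same type");
```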
Deprecate thrust::optional (NVIDIA#3307)
Fixes: NVIDIA#3306
Upgrade to Catch2 3.8 (NVIDIA#3310)
Fixes: NVIDIA#1724
refactor `<cuda/std/cstdint>` (NVIDIA#3325)
Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>
Update CODEOWNERS (NVIDIA#3331)
* Update CODEOWNERS
* Update CODEOWNERS
* Update CODEOWNERS
* [pre-commit.ci] auto code formatting
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Fix sign-compare warning (NVIDIA#3408)
Implement more cmath functions to be usable on host and device (NVIDIA#3382)
* Implement more cmath functions to be usable on host and device
* Implement math roots functions
* Implement exponential functions
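The host/device pattern behind such functions can be sketched with a wrapper; this is a host-side stand-in only, since the real libcu++ implementations dispatch to device intrinsics under nvcc:

```cpp
#include <cmath>

// Annotate with __host__ __device__ when compiling with nvcc,
// compile as a plain inline function otherwise.
#if defined(__CUDACC__)
#  define MY_HOST_DEVICE __host__ __device__
#else
#  define MY_HOST_DEVICE
#endif

MY_HOST_DEVICE inline double my_exp(double x)
{
#if defined(__CUDA_ARCH__)
  return ::exp(x); // device code path (intrinsic)
#else
  return std::exp(x); // host code path (standard library)
#endif
}
```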
Redefine and deprecate thrust::remove_cvref (NVIDIA#3394)
* Redefine and deprecate thrust::remove_cvref
Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>
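The C++20 `std::remove_cvref` definition that the redefined alias matches is a one-liner in C++17:

```cpp
#include <type_traits>

// remove_cvref strips references first, then cv-qualifiers,
// exactly as C++20's std::remove_cvref does.
template <class T>
struct remove_cvref
{
  using type = std::remove_cv_t<std::remove_reference_t<T>>;
};

template <class T>
using remove_cvref_t = typename remove_cvref<T>::type;
```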
Fix assert definition for NVHPC due to constexpr issues (NVIDIA#3418)
NVHPC cannot decide at compile time where the code will run, so `_CCCL_ASSERT` within a constexpr function breaks it.
Fix this by always using the host definition, which should also work on device.
Fixes NVIDIA#3411
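The fix boils down to using the host `assert`, which has been valid inside constexpr functions since C++14; `MY_ASSERT` below is an illustrative name, not the real `_CCCL_ASSERT`:

```cpp
#include <cassert>

// Always use the host assert definition so the macro stays valid in
// constexpr functions that NVHPC cannot split into host/device variants.
#define MY_ASSERT(expr) assert(expr)

constexpr int checked_increment(int x)
{
  MY_ASSERT(x < 100); // OK in a constexpr function since C++14
  return x + 1;
}
```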
Extend CUB reduce benchmarks (NVIDIA#3401)
* Rename max.cu to custom.cu, since it uses a custom operator
* Extend types covered by min.cu to all fundamental types
* Add some notes on how to collect tuning parameters
Fixes: NVIDIA#3283
Update upload-pages-artifact to v3 (NVIDIA#3423)
* Update upload-pages-artifact to v3
* Empty commit
---------
Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>
Replace and deprecate thrust::cuda_cub::terminate (NVIDIA#3421)
`std::linalg` accessors and `transposed_layout` (NVIDIA#2962)
Add round up/down to multiple (NVIDIA#3234)
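For unsigned integers, the usual definitions of round up/down to a multiple are short enough to sketch directly (illustrative names, assuming a nonzero multiple; not the actual libcu++ API):

```cpp
#include <cstdint>

// Round value down to the nearest multiple of `multiple` (multiple > 0).
constexpr std::uint64_t round_down_to_multiple(std::uint64_t value, std::uint64_t multiple)
{
  return (value / multiple) * multiple;
}

// Round value up to the nearest multiple of `multiple` (multiple > 0).
constexpr std::uint64_t round_up_to_multiple(std::uint64_t value, std::uint64_t multiple)
{
  return ((value + multiple - 1) / multiple) * multiple;
}
```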
[FEA]: Introduce Python module with CCCL headers (NVIDIA#3201)
* Add cccl/python/cuda_cccl directory and use from cuda_parallel, cuda_cooperative
* Run `copy_cccl_headers_to_cuda_cccl_include()` before `setup()`
* Create python/cuda_cccl/cuda/_include/__init__.py, then simply import cuda._include to find the include path.
* Add cuda.cccl._version exactly as for cuda.cooperative and cuda.parallel
* Bug fix: cuda/_include only exists after shutil.copytree() ran.
* Use `f"cuda-cccl @ file://{cccl_path}/python/cuda_cccl"` in setup.py
* Remove CustomBuildCommand, CustomWheelBuild in cuda_parallel/setup.py (they are equivalent to the default functions)
* Replace := operator (needs Python 3.8+)
* Fix oversights: remove `pip3 install ./cuda_cccl` lines from README.md
* Restore original README.md: `pip3 install -e` now works on first pass.
* cuda_cccl/README.md: FOR INTERNAL USE ONLY
* Remove `$pymajor.$pyminor.` prefix in cuda_cccl _version.py (as suggested under NVIDIA#3201 (comment))
Command used: ci/update_version.sh 2 8 0
* Modernize pyproject.toml, setup.py
Trigger for this change:
* NVIDIA#3201 (comment)
* NVIDIA#3201 (comment)
* Install CCCL headers under cuda.cccl.include
Trigger for this change:
* NVIDIA#3201 (comment)
Unexpected accidental discovery: the cuda.cooperative unit tests pass entirely without CCCL headers.
* Factor out cuda_cccl/cuda/cccl/include_paths.py
* Reuse cuda_cccl/cuda/cccl/include_paths.py from cuda_cooperative
* Add missing Copyright notice.
* Add missing __init__.py (cuda.cccl)
* Add `"cuda.cccl"` to `autodoc.mock_imports`
* Move cuda.cccl.include_paths into function where it is used. (Attempt to resolve Build and Verify Docs failure.)
* Add # TODO: move this to a module-level import
* Modernize cuda_cooperative/pyproject.toml, setup.py
* Convert cuda_cooperative to use hatchling as build backend.
* Revert "Convert cuda_cooperative to use hatchling as build backend."
This reverts commit 61637d6.
* Move numpy from [build-system] requires -> [project] dependencies
* Move pyproject.toml [project] dependencies -> setup.py install_requires, to be able to use CCCL_PATH
* Remove copy_license() and use license_files=["../../LICENSE"] instead.
* Further modernize cuda_cccl/setup.py to use pathlib
* Trivial simplifications in cuda_cccl/pyproject.toml
* Further simplify cuda_cccl/pyproject.toml, setup.py: remove inconsequential code
* Make cuda_cooperative/pyproject.toml more similar to cuda_cccl/pyproject.toml
* Add taplo-pre-commit to .pre-commit-config.yaml
* taplo-pre-commit auto-fixes
* Use pathlib in cuda_cooperative/setup.py
* CCCL_PYTHON_PATH in cuda_cooperative/setup.py
* Modernize cuda_parallel/pyproject.toml, setup.py
* Use pathlib in cuda_parallel/setup.py
* Add `# TOML lint & format` comment.
* Replace MANIFEST.in with `[tool.setuptools.package-data]` section in pyproject.toml
* Use pathlib in cuda/cccl/include_paths.py
* pre-commit autoupdate (EXCEPT clang-format, which was manually restored)
* Fixes after git merge main
* Resolve warning: AttributeError: '_Reduce' object has no attribute 'build_result'
```
=========================================================================== warnings summary ===========================================================================
tests/test_reduce.py::test_reduce_non_contiguous
/home/coder/cccl/python/devenv/lib/python3.12/site-packages/_pytest/unraisableexception.py:85: PytestUnraisableExceptionWarning: Exception ignored in: <function _Reduce.__del__ at 0x7bf123139080>
Traceback (most recent call last):
File "/home/coder/cccl/python/cuda_parallel/cuda/parallel/experimental/algorithms/reduce.py", line 132, in __del__
bindings.cccl_device_reduce_cleanup(ctypes.byref(self.build_result))
^^^^^^^^^^^^^^^^^
AttributeError: '_Reduce' object has no attribute 'build_result'
warnings.warn(pytest.PytestUnraisableExceptionWarning(msg))
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================================= 1 passed, 93 deselected, 1 warning in 0.44s ==============================================================
```
* Move `copy_cccl_headers_to_cuda_cccl_include()` functionality to `class CustomBuildPy`
* Introduce cuda_cooperative/constraints.txt
* Also add cuda_parallel/constraints.txt
* Add `--constraint constraints.txt` in ci/test_python.sh
* Update Copyright dates
* Switch to https://github.com/ComPWA/taplo-pre-commit (the other repo has been archived by the owner on Jul 1, 2024)
For completeness: The other repo took a long time to install into the pre-commit cache; so long that it led to timeouts in the CCCL CI.
* Remove unused cuda_parallel jinja2 dependency (noticed by chance).
* Remove constraints.txt files, advertise running `pip install cuda-cccl` first instead.
* Make cuda_cooperative, cuda_parallel testing completely independent.
* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Fix sign-compare warning (NVIDIA#3408) [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Revert "Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]"
This reverts commit ea33a21.
Error message: NVIDIA#3201 (comment)
* Try using A100 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Also show cuda-cooperative site-packages, cuda-parallel site-packages (after pip install) [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Try using l4 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Restore original ci/matrix.yaml [skip-rapids]
* Use for loop in test_python.sh to avoid code duplication.
* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]
* Comment out taplo-lint in pre-commit config [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Revert "Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]"
This reverts commit ec206fd.
* Implement suggestion by @shwina (NVIDIA#3201 (review))
* Address feedback by @leofang
---------
Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>
cuda.parallel: Add optional stream argument to reduce_into() (NVIDIA#3348)
* Add optional stream argument to reduce_into()
* Add tests to check for reduce_into() stream behavior
* Move protocol related utils to separate file and rework __cuda_stream__ error messages
* Fix synchronization issue in stream test and add one more invalid stream test case
* Rename cuda stream validation function after removing leading underscore
* Unpack values from __cuda_stream__ instead of indexing
* Fix linting errors
* Handle TypeError when unpacking invalid __cuda_stream__ return
* Use stream to allocate cupy memory in new stream test
Upgrade to actions/deploy-pages@v4 (from v2), as suggested by @leofang (NVIDIA#3434)
Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ (NVIDIA#3419)
* Deprecate `cub::{min, max}` and replace internal uses with those from libcu++
Fixes NVIDIA#3404
Fix CI issues (NVIDIA#3443)
Remove deprecated `cub::min` (NVIDIA#3450)
* Remove deprecated `cub::{min, max}`
* Drop unused `thrust::remove_cvref` file
Fix typo in builtin (NVIDIA#3451)
Moves agents to `detail::<algorithm_name>` namespace (NVIDIA#3435)
uses unsigned offset types in thrust's scan dispatch (NVIDIA#3436)
Default transform_iterator's copy ctor (NVIDIA#3395)
Fixes: NVIDIA#2393
Turn C++ dialect warning into error (NVIDIA#3453)
Uses unsigned offset types in thrust's sort algorithm calling into `DispatchMergeSort` (NVIDIA#3437)
* uses thrust's dynamic dispatch for merge_sort
* [pre-commit.ci] auto code formatting
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Refactor allocator handling of contiguous_storage (NVIDIA#3050)
Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>
Drop thrust::detail::integer_traits (NVIDIA#3391)
Add cuda::is_floating_point supporting half and bfloat (NVIDIA#3379)
Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>
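The trait extension pattern can be sketched with hypothetical stand-in types, since the real `__half`/`__nv_bfloat16` require the CUDA toolkit; this is illustrative, not the `cuda::is_floating_point` implementation:

```cpp
#include <type_traits>

// Hypothetical stand-ins for __half and __nv_bfloat16.
struct fake_half
{};
struct fake_bfloat16
{};

// Start from the standard trait, then opt in the extended floating-point types.
template <class T>
struct is_extended_floating_point : std::is_floating_point<T>
{};

template <>
struct is_extended_floating_point<fake_half> : std::true_type
{};

template <>
struct is_extended_floating_point<fake_bfloat16> : std::true_type
{};
```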
Improve docs of std headers (NVIDIA#3416)
Drop C++11 and C++14 support for all of cccl (NVIDIA#3417)
* Drop C++11 and C++14 support for all of cccl
---------
Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>
Deprecate a few CUB macros (NVIDIA#3456)
Deprecate thrust universal iterator categories (NVIDIA#3461)
Fix launch args order (NVIDIA#3465)
Add `--extended-lambda` to the list of removed clangd flags (NVIDIA#3432)
add `_CCCL_HAS_NVFP8` macro (NVIDIA#3429)
Add `_CCCL_BUILTIN_PREFETCH` (NVIDIA#3433)
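A prefetch wrapper in this spirit can be sketched as follows; the macro name is illustrative, and because prefetch is purely an optimization hint, the no-op fallback keeps results identical on compilers without it:

```cpp
// Map to the compiler's prefetch builtin where available; otherwise no-op.
#if defined(__GNUC__) || defined(__clang__)
#  define MY_PREFETCH(addr) __builtin_prefetch(addr)
#else
#  define MY_PREFETCH(addr) static_cast<void>(addr)
#endif

// Demo: hint that `data` will be read soon, then sum it.
// The result is the same with or without the hint.
inline long sum_demo()
{
  static const int data[4] = {1, 2, 3, 4};
  MY_PREFETCH(data);
  return static_cast<long>(data[0]) + data[1] + data[2] + data[3];
}
```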
Drop universal iterator categories (NVIDIA#3474)
Ensure that headers in `<cuda/*>` can be built with a C++-only compiler (NVIDIA#3472)
Specialize __is_extended_floating_point for FP8 types (NVIDIA#3470)
Also ensure that FP8 can actually be enabled, given its FP16 and BF16 requirements
Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>
Moves CUB kernel entry points to a detail namespace (NVIDIA#3468)
* moves emptykernel to detail ns
* second batch
* third batch
* fourth batch
* fixes cuda parallel
* concatenates nested namespaces
Deprecate block/warp algo specializations (NVIDIA#3455)
Fixes: NVIDIA#3409
Refactor CUB's util_debug (NVIDIA#3345)
Accidentally compiling with < C++17 issues a warning (that C++ < 17 is no longer supported), but it also causes errors later during compilation, and the dialect warning gets lost in the compiler error novel. We should error out sooner so users get a clear error message.