CCCL 2.3.0
What’s New
In addition to various fixes and documentation improvements, the following notable improvements have been made to Thrust, CUB, and libcudacxx.
System Headers and Warnings
Users don't want to see warnings from CCCL headers. The typical way to accomplish this with header libraries is to use -isystem. However, this causes problems when using CCCL from GitHub, because those headers will then conflict with the CCCL headers shipped in the CUDA Toolkit (CTK). Therefore, you should always include CCCL headers via -I.
To achieve the same effect as -isystem, CCCL headers will now use the system_header pragma. For more information, see #527.
TL;DR: You should never see warnings emitted from a CCCL header ever again!
Linkage Issues
Using CUB and Thrust in shared libraries is a known source of issues. Previously, the solution was to use the THRUST_CUB_WRAPPED_NAMESPACE macro so that different shared libraries have different symbol names. However, this solution has poor discoverability, since the issues show up as segmentation faults, hangs, wrong results, etc. As of the 2.3 release, linkage issues are addressed by default without the need for THRUST_CUB_WRAPPED_NAMESPACE. Although the fix is API compatible, it might cause ABI compatibility issues. For more details, see issue #443.
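The idea behind the namespace-wrapping workaround can be sketched in plain C++, without Thrust or CUB. Everything below is hypothetical: DEMO_DEFINE_LIBRARY, lib_a, lib_b, and demo are illustration-only names standing in for a header-only library and for the role THRUST_CUB_WRAPPED_NAMESPACE plays. This is a sketch of the general technique, not of how CCCL implements it.

```cpp
#include <cassert>

// Hypothetical stand-in for a header-only library. WRAPPER names an outer
// namespace that every symbol of the library is nested in; VERSION mimics
// two libraries being built against different releases of the same headers.
#define DEMO_DEFINE_LIBRARY(WRAPPER, VERSION) \
  namespace WRAPPER {                         \
  namespace demo {                            \
  inline int version() { return VERSION; }    \
  }                                           \
  }

// Two "shared libraries", each with its own wrapper namespace.
DEMO_DEFINE_LIBRARY(lib_a, 1)
DEMO_DEFINE_LIBRARY(lib_b, 2)

// The fully qualified names differ, so these are two distinct functions and
// both definitions can coexist in one program.
inline bool wrapped_symbols_are_distinct() {
  int (*a)() = &lib_a::demo::version;
  int (*b)() = &lib_b::demo::version;
  return a != b;
}

// Small usage check: each wrapper sees its own version of the code.
inline void check_wrapped_namespaces() {
  assert(lib_a::demo::version() == 1);
  assert(lib_b::demo::version() == 2);
  assert(wrapped_symbols_are_distinct());
}
```

Without distinct wrapper namespaces, both definitions would share one name, and which one a caller ends up with would depend on symbol resolution at load time. That matches the kind of hard-to-diagnose failure described above.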
Thrust
thrust::tuple, thrust::pair, and thrust::complex have been replaced with cuda::std alternatives. This can be a breaking change, but should be source compatible.
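One way such a replacement can stay source compatible is to keep the old type name as a thin struct derived from the new implementation, which is the approach the changelog entry for thrust::complex (#454) names. The sketch below uses std::complex as a stand-in so that it builds without CUDA. The legacy namespace and the square helper are hypothetical, and the code is not Thrust's actual implementation.

```cpp
#include <cassert>
#include <complex>

namespace legacy {  // hypothetical namespace standing in for the older library

template <class T>
struct complex : std::complex<T> {
  using base = std::complex<T>;
  using base::base;  // reuse the base class constructors

  // Base-class operators return `base`; this converting constructor lets
  // their results be stored in the legacy type again.
  complex(const base& z) : base(z) {}
};

}  // namespace legacy

// Code written against the legacy type keeps compiling: operator* is the
// std::complex overload, found through the base class.
inline legacy::complex<double> square(const legacy::complex<double>& z) {
  return z * z;
}

// Small usage check: (1 + 2i)^2 = -3 + 4i.
inline void check_square() {
  legacy::complex<double> r = square(legacy::complex<double>(1.0, 2.0));
  assert(r.real() == -3.0);
  assert(r.imag() == 4.0);
}
```

Existing call sites compile unchanged, but the type's base class and layout are now those of the new implementation, which is why a change like this can still be breaking for code that depends on the old definition.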
CUB
Up to 60% performance improvement for cub::DeviceSelect::UniqueByKey, cub::DeviceScan::ExclusiveSumByKey, and cub::DeviceReduce::ReduceByKey on A100. cub::DeviceSegmentedReduce now supports 64-bit indexing.
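To illustrate what 64-bit indexing means for a segmented reduction, here is a minimal CPU reference of a segmented sum whose segment offsets are 64-bit integers. It is not the CUB API and says nothing about how CUB implements the algorithm. The function name segmented_sum and the single offsets array layout are assumptions made for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// CPU reference for a segmented sum. Segment i covers the half-open range
// [offsets[i], offsets[i + 1]) of `in`. Offsets are 64-bit so that positions
// beyond the range of a 32-bit integer stay addressable.
inline std::vector<long long> segmented_sum(
    const std::vector<int>& in, const std::vector<std::int64_t>& offsets) {
  std::vector<long long> out;
  if (offsets.empty()) {
    return out;  // no offsets means no segments
  }
  out.reserve(offsets.size() - 1);
  for (std::size_t seg = 0; seg + 1 < offsets.size(); ++seg) {
    long long sum = 0;
    for (std::int64_t i = offsets[seg]; i < offsets[seg + 1]; ++i) {
      sum += in[static_cast<std::size_t>(i)];
    }
    out.push_back(sum);  // an empty segment contributes 0
  }
  return out;
}

// Small usage check with three segments, the middle one empty.
inline void check_segmented_sum() {
  const std::vector<int> in = {8, 6, 7, 5, 3, 0, 9};
  const std::vector<std::int64_t> offsets = {0, 3, 3, 7};
  const std::vector<long long> sums = segmented_sum(in, offsets);
  assert(sums.size() == 3);
  assert(sums[0] == 21 && sums[1] == 0 && sums[2] == 17);
}
```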
libcudacxx
- The cuda::ptx namespace and <cuda/ptx> header are now available and provide access to various inline PTX functions that enumerate various async memcpy and barrier intrinsics (#379).
- Added experimental bulk TMA memcpy under <cuda/barrier>.
What's Changed
- Port cub::DeviceSegmentedReduce tests to catch2 by @elstehle in #303
- Branch/2.2.x by @gevtushenko in #305
- Tune unique by key on A100 by @gevtushenko in #306
- Merge branch/2.2.x to main by @jrhemstad in #308
- Add example cmake project by @jrhemstad in #177
- Adds catch2 tests for reduce-by-key by @elstehle in #311
- Tune scan by key on A100 by @gevtushenko in #325
- Replace diag_suppress by nv_diag_suppress in documentation by @ahendriksen in #281
- Fix MSVC / CUB tests build by @gevtushenko in #336
- gdb pretty printer: handle non-cuda device vectors by @siboehm in #264
- Add a nvrtc configuration for libcu++ by @miscco in #202
- GH Infra: project automation and issue template fixes by @jarmak-nv in #297
- Tune reduce by key on A100 by @gevtushenko in #346
- Merge commits from 2.2 branch by @miscco in #350
- Fix a shadow warning in thrust's execute_with_dependencies.h by @hageboeck in #334
- Assorted fixes for MSVC 2017 by @miscco in #341
- [skip-tests] Guard inline variables with _LIBCUDACXX_INLINE_VAR macro by @miscco in #355
- Port cub::DeviceScan tests to catch2 by @elstehle in #347
- Remove _NOEXCEPT macro in favor of noexcept in libcu++ by @Blonck in #349
- Project Automation: add conditional steps due to context errors by @jarmak-nv in #353
- Work around strange gcc bug by @miscco in #363
- Implement iter_swap CPO by @miscco in #332
- Replace default, constexpr, and delete macros by original keywords by @Blonck in #360
- Add clang16 devcontainer and CI job by @miscco in #362
- [skip-tests] Skip merge conflict from old iter_swap PR by @miscco in #369
- [skip-tests] Also skip all CI runs that require a GPU when [skip-tests] is set by @miscco in #370
- Remove _LIBCUDACXX_CXX03_LANG macro and all encapsulated code by @Blonck in #368
- Remove checks against _LIBCUDACXX_STD_VER < 11 by @Blonck in #375
- Use copy-pr-bot by @ajschmidt8 in #381
- Implement the permutable concept by @miscco in #367
- [NFC] We missed some _NOEXCEPT_ macro uses by @miscco in #371
- Implement identity changes for c++20 by @miscco in #383
- Hide third party cmake options in our cmake developer builds. by @allisonvacanti in #300
- Port cub::DeviceScanByKey tests to Catch2 by @elstehle in #380
- Fixes a race in DeviceRunLengthEncode::NonTrivialRuns by @elstehle in #399
- Add commit information to the test output by @miscco in #401
- Project Automation: Handle PRs opened as non-draft + multiple bug fixes by @jarmak-nv in #387
- Project Automation: set Roadmap project value on issue/pr close and Auto-type new issues by @jarmak-nv in #389
- Add support for tests that should fail at runtime by @ahendriksen in #418
- Port DeviceAdjacentDifference::SubtractRight tests to catch2 by @miscco in #390
- Project automation - Fix indentation for continue-on-error by @jarmak-nv in #425
- [BUG] Ensure that all headers build on their own by @miscco in #200
- Remove util_device.cuh from iterator headers to enable online compilation by @leofang in #412
- Fix ci-overview example by @gevtushenko in #428
- Port cub::DeviceRunLengthEncode tests to catch2 by @miscco in #411
- Add cuda::device::barrier_arrive tx by @ahendriksen in #358
- Fix CubDebug by @gevtushenko in #430
- Do not use static member functions to initialize static member variables. by @miscco in #438
- Implement the projected helper struct by @miscco in #385
- Add PTX wrapping functions for TMA features by @ahendriksen in #379
- Clarify docstring for num_items parameter of DeviceSegmentedRadixSort by @HapeMask in #320
- Enable lit to determine the compute architectures by @miscco in #447
- Add NVRTC_SKIP_KERNEL_RUN tag to compile, but skip running NVRTC test by @ahendriksen in #434
- Improve documentation of cuda::barrier by @ahendriksen in #440
- Extend thrust::complex unit tests to prepare for upcoming replacement with std::complex by @Blonck in #413
- Remove having two install rules for -header-search.cmake by @robertmaynard in #298
- Run .devcontainer/launch.sh with bash + add error checking by @wence- in #407
- Remove C++03 compatibility from unit tests by @Blonck in #378
- [libcu++] Fix use of __ppc64__ by @miscco in #451
- Update the README by @jrhemstad in #291
- [libcu++] Try to avoid gcc miscompilation issues by @miscco in #452
- Consolidate matrix logic into single script/job by @jrhemstad in #361
- Implement the indirectly_comparable concept by @miscco in #445
- Fix compute matrix dropping trailing zeros by @jrhemstad in #466
- Avoid integer promotion warnings with MSVC by @miscco in #460
- Implement ranges comparison objects by @miscco in #464
- Fix CUB/MSVC/RDC tests by @gevtushenko in #469
- Fix Thrust/CUB Linkage Issues by @gevtushenko in #443
- Script for Running CUB Benchmarks by @gevtushenko in #472
- [skip ci] Add list of CCCL users to README by @jrhemstad in #474
- constexpr all the things by @pb-dseifert in #476
- Add Gonzalo/Allard to trustees by @jrhemstad in #482
- Implement the sortable concept by @miscco in #471
- [libcu++] Add _LIBCUDACXX_CUDACC_BELOW_12_3 macro by @gonzalobg in #479
- Refactor thrust::complex as a struct derived from cuda::std::complex by @Blonck in #454
- Add ci scripts for windows by @miscco in #251
- Enable complex interop on MSVC by @miscco in #490
- [skip ci] Add related projects to readme. by @jrhemstad in #492
- Reenable nvrtc tests by @miscco in #488
- Implement the mergeable concept by @miscco in #484
- 64-bit indexing for DeviceSegmentedReduce by @jecs in #414
- Implement move_sentinel by @miscco in #496
- Support skipped benches in run script by @gevtushenko in #508
- Implement unreachable_sentinel by @miscco in #506
- Disable flaky barrier tests by @miscco in #510
- Add constant initialization of managed variable to silence gcc warning by @miscco in #509
- Add verbose flag to ninja build. by @jrhemstad in #491
- Add devcontainer readme by @jrhemstad in #481
- Add contributor guide by @jrhemstad in #500
- [skip ci] Fix devcontainer guide link by @jrhemstad in #518
- [skip ci] Add example godbolt link. by @jrhemstad in #519
- Replace cuda::atomic with legacy functions for old arch compatibility. by @allisonvacanti in #516
- Simplify examples matrix. by @jrhemstad in #517
- Disable PR workflow triggering on pushes to main. by @jrhemstad in #532
- Add CI job to verify devcontainers are always up to date by @jrhemstad in #514
- [CI] Sink error when git repo is missing from build. by @wmaxey in #533
- Rework our tuple implementation to work with older MSVC by @miscco in #530
- Add jobs using clang as CUDA compiler by @jrhemstad in #493
- Remove cudaDeviceSetSharedMemConfig from CUB tests by @gevtushenko in #538
- Implement __bounded_iter by @miscco in #540
- Fix cub::BlockAdjacentDifference documentation by @pauleonix in #542
- Add cuda::device::memcpy_async_tx by @ahendriksen in #405
- Introduce Thrust benchmarks by @gevtushenko in #534
- Fix MSVC benchmarks build by @gevtushenko in #536
- Fix nvc++ as host compiler by @gevtushenko in #560
- Add missing overload definition of thrust::complex operator!= by @srinivasyadav18 in #564
- Make template parameters consistent in thrust::complex operators by @srinivasyadav18 in #555
- Migrate CI configs to CMake presets. by @allisonvacanti in #324
- Replace thrust::detail::integral_constant with libcudacxx implementation by @ZelboK in #561
- Add cuda::device::barrier_expect_tx by @ahendriksen in #498
- Add ARM build configs for latest gcc/clang. by @jrhemstad in #468
- Fea/486 Improve thrust::complex operators compile time throughput by @srinivasyadav18 in #567
- Define compiler env vars for CMake in dev containers. by @allisonvacanti in #576
- Revert back to working nvbench commit by @miscco in #582
- use clang-format in dev containers by @miscco in #513
- Introduce CCCL clang-format by @gevtushenko in #551
- Add cp.async.bulk global -> shared support to cuda::memcpy_async by @ahendriksen in #501
- [skip ci] Also update the base image by @miscco in #584
- Replace thrust::tuple implementation with cuda::std::tuple by @miscco in #262
- Fix clangd integration by @gevtushenko in #588
- Always treat CCCL as system headers by @miscco in #531
- Refactor inline comments by @gevtushenko in #581
- Relax Catch2 include order requirements by @gevtushenko in #601
- Project Automation - Fix issue/pr sync workflow by @jarmak-nv in #504
- [skip-tests] Add a preset that builds all configs of all projects. by @allisonvacanti in #580
- Implement ranges::advance by @miscco in #546
- Update status check job to check status of precursor jobs by @jrhemstad in #605
- Report times for libcudacxx tests in CI by @jrhemstad in #606
- Fix bug in the construct_at optimization by @miscco in #608
- [skip-tests] Disable rdc tests for windows. by @miscco in #615
- Implement ranges::next by @miscco in #611
- Support FP8 in radix sort by @gevtushenko in #623
- Fix examples/cccl_infra mixup in ci. by @wmaxey in #633
- Fixes block-scope run-length decode one-past-the-end memory access into smem TempStorage by @elstehle in #626
- Harmonize CUB includes by @gevtushenko in #632
- Create NVRTCC, a utility for running tests under NVRTC by @wmaxey in #494
- Fix typo and grammar errors by @VaibhavWakde52 in #639
- [Backport branch/2.3.x] Add CCCL_VERSION and script for updating version by @github-actions in #667
- Backport 574 ptx by @miscco in #663
- [Backport branch/2.3.x] Fix C++11 support of recently added tests by @github-actions in #658
- [Backport branch/2.3.x] Update CUDA newest to CTK 12.3 by @github-actions in #1072
- [Backport to branch/2.3.x] Rework our system header approach to be more error proof (#661) by @miscco in #675
- [Backport branch/2.3.x] Fix fallback when checking git repo by @github-actions in #1086
- [Backport branch/2.3.x] Currently the verbose option does not work because of a typo in the argument handling by @github-actions in #1090
- [Backport branch/2.3.x] Add cuda::ptx::st_async by @github-actions in #1093
- [Backport branch/2.3.x] Add cuda::ptx::red_async by @github-actions in #1094
- Backport PR #1075 by @wmaxey in #1100
- [Backport branch/2.3.x] Add cuda::ptx::mbarrier_{try/test}_wait{_parity} by @github-actions in #1106
- [Backport branch/2.3.x] Fix cuda::ptx::red.async for int32_t types by @github-actions in #1107
- [Backport branch/2.3.x] Fix local test runs with lit by @github-actions in #1110
- [Backport branch/2.3.x] Fix config when only non-CDPv1 arches are enabled. by @github-actions in #1111
- [Backport branch/2.3.x] Fix GCC6 / FP8 warning by @github-actions in #1131
- [Backport branch/2.3.x] Fix ptx.st.async.compile.pass.cpp failing in C++11. by @github-actions in #1136
- BACKPORT: Fix _LIBCUDACXX_UNREACHABLE for old MSVC (#1114) by @miscco in #1143
- [2.3.x] Backport benchmarking PRs by @wmaxey in #1168
- Backport P0 filter commit. by @wmaxey in #1172
- [BACKPORT] Implement math functions for thrust::complex by @miscco in #1191
- Backport fix icc / cub (#1152) by @wmaxey in #1171
- [BACKPORT]: Fix availability of is_constant_evaluated on old MSVC by @miscco in #1198
- [BACKPORT] Add icc to the ci matrix by @miscco in #1209
- [BACKPORT]: Add missing overloads for thrust::pow by @miscco in #1223
New Contributors
- @siboehm made their first contribution in #264
- @hageboeck made their first contribution in #334
- @Blonck made their first contribution in #349
- @leofang made their first contribution in #412
- @HapeMask made their first contribution in #320
- @jecs made their first contribution in #414
- @pauleonix made their first contribution in #542
- @srinivasyadav18 made their first contribution in #564
- @ZelboK made their first contribution in #561
- @VaibhavWakde52 made their first contribution in #639
Full Changelog: v2.2.0...2.3.0