Is your feature request related to a problem? Please describe.
The packed_cas update routine shows better performance compared to back_to_back_cas and cas_dependent_write.
On sm_90 and higher we have hardware support for 16B atomic CAS which we currently don't make use of.
Describe the solution you'd like
16B atomicCAS was introduced with CUDA 12.3 (see docs).
Idea: Add a dedicated codepath for sm_90+ by adding something like
NV_IF_ELSE_TARGET(NV_PROVIDES_SM_90,
(atomicCAS(...);), // 16B CAS
(/* pre-sm_90 code path */));
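A minimal sketch of how such a dispatch could look, assuming the two-branch `NV_IF_ELSE_TARGET` macro and the `NV_PROVIDES_SM_90` target from `<nv/target>`. The names `slot_type`, `cas_then_write`, and `try_update` are hypothetical and only serve the illustration; they are not existing routines of this library. The 16B overload is used the way I understand the CUDA programming guide describes it: the type must be 16 bytes in size, 16-byte aligned, and trivially copyable.

```cuda
#include <nv/target>

// Hypothetical 16-byte slot: an 8B key packed next to an 8B payload.
// alignas(16) is needed to satisfy the alignment requirement of the 16B atomicCAS.
struct alignas(16) slot_type {
  unsigned long long key;
  unsigned long long value;
};

// Hypothetical pre-sm_90 fallback: CAS on the key, then a dependent plain write
// of the payload. Readers can briefly observe the new key with the old payload.
__device__ inline bool cas_then_write(slot_type* slot, slot_type expected, slot_type desired)
{
  unsigned long long const old = atomicCAS(&slot->key, expected.key, desired.key);
  if (old != expected.key) { return false; }
  slot->value = desired.value;
  return true;
}

// Swaps the whole slot with a single 16B CAS on sm_90+ and falls back otherwise.
__device__ inline bool try_update(slot_type* slot, slot_type expected, slot_type desired)
{
  NV_IF_ELSE_TARGET(
    NV_PROVIDES_SM_90,
    (slot_type const old = atomicCAS(slot, expected, desired);  // 16B CAS
     return old.key == expected.key && old.value == expected.value;),
    (return cas_then_write(slot, expected, desired);));
}
```

Because the macro selects a branch per compilation target, the 16B `atomicCAS` call is only compiled for sm_90 and newer, so the same translation unit should still build for older architectures. The fallback shown here is just one possible choice; any of the existing pre-sm_90 routines could take its place.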
Describe alternatives you've considered
Convince CCCL to expose cuda::atomic_ref::compare_exchange_* for 16B types ;)
Additional context
No response