[ENHANCEMENT]: Enable `packed_cas` codepath using 16B CAS on sm_90+ architectures

### Is your feature request related to a problem? Please describe.

The `packed_cas` update routine performs better than `back_to_back_cas` and `cas_dependent_write`.

On sm_90 and higher, the hardware supports 16B atomic CAS, which we currently don't use.

### Describe the solution you'd like

16B `atomicCAS` was introduced with CUDA 12.3 (see [docs](https://docs.nvidia.com/cuda/archive/12.3.0/cuda-c-programming-guide/index.html#atomiccas)).

Idea: Add a dedicated codepath for sm_90+ by adding something like:

```cpp
// `NV_PROVIDES_SM_90` (from <nv/target>) is one candidate for the "sm_90 or higher" target.
NV_IF_ELSE_TARGET(NV_PROVIDES_SM_90,
                  (atomicCAS(...);),  // 16B CAS
                  (/* pre-sm_90 code path */));
```
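To make the idea more concrete, here is a rough sketch of how such a dispatch could look. Everything named here (`packed_slot`, `same_slot`, `reference_cas`, `packed_update`, `fallback`, `SKETCH_HD`) is hypothetical and not existing library code. It assumes the 16B `atomicCAS` overload accepts any trivially copyable type with `sizeof == 16` and `alignof >= 16`, as the CUDA programming guide describes, and that `NV_PROVIDES_SM_90` is the right target. The device part only compiles under `nvcc`; the host-only `reference_cas` is a non-atomic stand-in that just documents the intended semantics.

```cpp
#include <cstdint>
#include <type_traits>

#if defined(__CUDACC__)
#include <nv/target>
#define SKETCH_HD __host__ __device__
#else
#define SKETCH_HD
#endif

// Hypothetical 16B slot: an 8B key packed next to an 8B payload.
struct alignas(16) packed_slot {
  std::uint64_t key;
  std::uint64_t value;
};

// Type requirements the CUDA programming guide lists for the 128-bit atomicCAS.
static_assert(sizeof(packed_slot) == 16, "slot must be exactly 16 bytes");
static_assert(alignof(packed_slot) >= 16, "slot must be 16-byte aligned");
static_assert(std::is_trivially_copyable<packed_slot>::value,
              "slot must be trivially copyable");

SKETCH_HD inline bool same_slot(packed_slot const& a, packed_slot const& b)
{
  return a.key == b.key && a.value == b.value;
}

// Non-atomic host reference of the CAS semantics, for illustration and testing only.
inline packed_slot reference_cas(packed_slot& slot, packed_slot expected, packed_slot desired)
{
  packed_slot const old = slot;
  if (same_slot(old, expected)) { slot = desired; }
  return old;
}

#if defined(__CUDACC__)
// `fallback` is a hypothetical callable standing in for the existing pre-sm_90 routine.
template <class Fallback>
__device__ bool packed_update(packed_slot* slot,
                              packed_slot expected,
                              packed_slot desired,
                              Fallback fallback)
{
  NV_IF_ELSE_TARGET(
    NV_PROVIDES_SM_90,
    (packed_slot const old = atomicCAS(slot, expected, desired);  // single 16B CAS
     return same_slot(old, expected);),
    (return fallback(slot, expected, desired);));
}
#endif
```

Passing the pre-sm_90 routine as a callable keeps the sketch independent of how the existing update paths are structured; in practice the dispatch would more likely sit wherever the update routine is selected today.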

### Describe alternatives you've considered

Convince CCCL to expose `cuda::atomic_ref::compare_exchange_*` for 16B types ;)

### Additional context

_No response_

[ENHANCEMENT]: Enable `packed_cas` codepath using 16B CAS on sm_90+ architectures #547
