Is this a duplicate?
Type of Bug
Performance
Describe the bug
To address rapidsai/cudf#12261, we are migrating hash-based algorithms in cudf to use the new cuco data structures. Specifically, for hash join operations, the new multiset is being adopted to replace the legacy multimap-based implementation.
In the previous design, the implementation used row hash values as keys and row indices as payloads, involving somewhat awkward `pair_count` and `pair_retrieve` operations. The new approach uses a pair consisting of the row hash value and row index as the key, and leverages `count` and `retrieve` operations directly.
However, during the initial migration rapidsai/cudf#18021, we observed significant performance regressions of up to 60% with the new multiset/multimap for hash join.

This was traced to the addition of `erase` support in the new cuco data structures. The issue was addressed in #681, reducing the performance gap between the legacy hash join and the new multiset-based implementation to approximately 20%, and narrowing the difference in multimap insertion performance to around 5%.
Based on profiling results after merging #681, we found that the majority of the remaining 20% performance gap between the legacy and new hash join implementations comes from hash table `count` operations. While these operations show an ~8% slowdown in cuco benchmarks, they exhibit up to a 50% slowdown in cudf hash join benchmarks. Re-profiling the code after #681 revealed that the new vector load logic introduces additional SASS instructions (more precisely, more store local (`STL`) instructions).
The vector load in the new design requires extra `STL` instructions:

while the legacy vector load is much less expensive:

After further investigation, we found that using the new cuco `bucket_storage`, which is essentially an array of `cuda::std::array<slot_type>`, introduces approximately 40% more branching instructions compared to the original 1D array of `slot_type`. Based on our preliminary tests, replacing the 2D bucket storage with a flat 1D storage eliminates the performance gap between the new and legacy multimap `count` operations. We plan to address this issue in #694.
Notes
Profiling results of the legacy and new `count`: count_profiling.zip
Note that even though flat storage may eliminate the performance gap between the legacy and new `count`, we still find that the new implementation always loads about 35% more data from global memory (`dram_bytes_read.sum`), and we cannot find a good explanation:
