Is this a duplicate?
Type of Bug
Performance
Describe the bug
To address rapidsai/cudf#12261, we are migrating hash-based algorithms in cudf to use the new cuco data structures. Specifically, for hash join operations, the new multiset is being adopted to replace the legacy multimap-based implementation.
In the previous design, the implementation used row hash values as keys and row indices as payloads, involving somewhat awkward `pair_count` and `pair_retrieve` operations. The new approach uses a pair consisting of the row hash value and row index as the key, and leverages `count` and `retrieve` operations directly.
However, during the initial migration rapidsai/cudf#18021, we observed significant performance regressions of up to 60% with the new multiset/multimap for hash join.

This was traced to the addition of `erase` support in the new cuco data structures. The issue was addressed in #681, reducing the performance gap between the legacy hash join and the new multiset-based implementation to approximately 20%, and narrowing the difference in multimap insertion performance to around 5%.
Based on profiling results after merging #681, we found that the majority of the remaining 20% performance gap between the legacy and new hash join implementations comes from hash table `count` operations. While these operations show an ~8% slowdown in cuco benchmarks, they exhibit up to a 50% slowdown in cudf hash join benchmarks. Re-profiling the code after #681 revealed that the new vector load logic introduces additional SASS instructions (more precisely, more store local (`STL`) instructions).
The vector load in the new design requires extra `STL` instructions:

while the legacy vector load is much less expensive:

After further investigation, we found that using the new cuco `bucket_storage`, which is essentially an array of `cuda::std::array<slot_type>`, introduces approximately 40% more branching instructions compared to the original 1D array of `slot_type`. Based on our preliminary tests, replacing the 2D bucket storage with a flat 1D storage eliminates the performance gap between the new and legacy multimap `count` operations. We plan to address this issue in #694.
Notes
Profiling results of the legacy and new `count`: count_profiling.zip
Note that even though flat storage may eliminate the performance gap between the legacy and new `count`, we still find that the new implementation always loads about 35% more data from global memory (`dram_bytes_read.sum`), and we cannot find a good explanation:
