In order to support fully asynchronous bulk operations, e.g., for multi-GPU hash tables, as requested in #65, we need to rethink the way we compute the table's size.
For now, `insert` tracks the number of successful insertions on the fly. After the kernel has finished, we copy the number back to the host and add it to the `size_` member. This implies that `insert` synchronizes with the host.
In order to overcome this limitation for `cuco::static_reduction_map` (PR #98), 902b93a proposes a standalone `size` computation based on `thrust::count_if` and also implements fully asynchronous bulk operations.
The `size` computation using `thrust::count_if` achieves near speed-of-light (SOL) throughput.
Additionally, since we do not need to reduce the number of added pairs during insertion, the overall performance of the `insert` bulk operation improves by ~3-5%.
IMHO we should also add this feature to the other hash table implementations.
This feature also implicitly solves issue #39.