Add an efficient unstable thread sort, use it in unstable block/device merge/segmented sorts, and improve tests #1552
Nyrio wants to merge 15 commits into NVIDIA:main
Conversation
…erge sort, and fix many issues with warp/block merge sort tests
…ce segmented sort
/ok to test
There is an odd compiler error in C++17 builds in this code:

```cpp
CUB_IF_CONSTEXPR(IS_LAST_TILE)
{
#pragma unroll
  for (int item = 1; item < ITEMS_PER_THREAD; ++item)
  {
    ...
  }
}
```

I never saw that error before and am struggling to make a reproducer. I'm changing it back to a regular `if`. Also, CI ran into issues with constructing an …
/ok to test
cub/cub/agent/agent_merge_sort.cuh (outdated):

```cpp
/// \brief This agent is responsible for the initial in-tile sorting.
template <typename Policy,
          bool IS_STABLE,
```
When adding API extensions I would strongly prefer if we can get away from raw booleans. It is much harder to discern what `true` means in some API call deep in the code, as opposed to `Stability::Stable` or `Stability::Unstable`.
That requires us to put a bit more work into the implementation, but it makes it much easier to work with the API.
We have some examples of that here: cub/cub/device/dispatch/tuning/tuning_reduce_by_key.cuh
```cpp
  int valid_items,
  KeyT oob_default)
{
  if (IS_LAST_TILE)
```
I assume this is the one place you mentioned where CUB_IF_CONSTEXPR is having issues?
FYI, I'm updating the new tests in that PR to support unstable sorting. Should have something ready in the next couple of days.

#1484 now supports unstable sort for the new …
gevtushenko left a comment:
@Nyrio thank you for the contribution! I'm sorry that it's taking us so long to review it. The PR doesn't introduce any difference in codegen for stable sort. Regarding the unstable sort, preliminary benchmarks show about the same performance for built-in types, and about a 4% speedup for complex data types (benchmarked on H100 and A6000 Ada). Is this the expected improvement, or did you have a different workload in mind? If you have a different workload illustrating a better speedup, we'd highly appreciate it if you could contribute it with this PR. Apart from that, while @elstehle is looking at the algorithm itself, I've left a few minor comments below.
```cpp
typename ValueIteratorT,
typename OffsetT,
typename CompareOpT,
bool IS_STABLE = true,
```
important: unfortunately, the dispatch structure is part of the CUB API. I'm afraid the new template parameter has to go after the selected policy, so as not to break existing code that relies on dispatch directly. If you don't want to duplicate policy selection code in every usage, you could add a DispatchStableMergeSort type alias with a different order of arguments.
cub/cub/thread/thread_sort.cuh (outdated):

```cpp
CUB_NAMESPACE_BEGIN

template <typename IntT>
_CCCL_HOST_DEVICE _CCCL_FORCEINLINE constexpr IntT NetworkDegree(IntT n, IntT m = IntT{1})
```
question: would you like this function to be part of CUB API? If so, it'll need its own tests and documentation. I'd suggest putting it into a detail namespace.
Indeed it should be in a detail namespace.
What is your opinion about `Swap`, which existed before this PR: should I add documentation, or move it to detail? I feel like it should not be part of the API, but that would be a breaking change.
```cpp
template <typename KeyT, typename ValueT, typename CompareOp>
_CCCL_DEVICE _CCCL_FORCEINLINE void
CompareSwap(KeyT& key_lhs, KeyT& key_rhs, ValueT& item_lhs, ValueT& item_rhs, CompareOp compare_op)
```
important: same note, if you want this function to be part of public API, we'll need docs and tests.
cub/cub/thread/thread_sort.cuh (outdated):

```cpp
SPECIALIZE_SORT_ASC(float)
SPECIALIZE_SORT_DESC(::cuda::std::int32_t)
SPECIALIZE_SORT_DESC(::cuda::std::uint32_t)
SPECIALIZE_SORT_DESC(float)
```
suggestion: we'd probably like to avoid leaking this macro into user code. I'd suggest undefining it after usage.
cub/cub/thread/thread_sort.cuh (outdated):

```cpp
  }
}

#define SPECIALIZE_SORT_ASC(T) \
```
suggestion: we don't know if users have the same macro or not. To avoid potential collisions, I'd suggest to add a CUB_ prefix.
cub/cub/thread/thread_sort.cuh (outdated):

```cpp
    CompareSwapMinMaxAsc(key_rhs, key_lhs); \
  }

SPECIALIZE_SORT_ASC(::cuda::std::int32_t)
```
question: I'm not sure why {u,}int32 and float are special. Do you think we could go with enable_if + is_arithmetic on CompareSwap instead of specializations?
For arithmetic types and keys only, using min and max takes two instructions per compare-swap, whereas the conditional version takes 3 (1 set predicate, 2 selections). For integers, the compiler does the optimization automatically, so this specialization is not strictly required, but for float32 it can't do the optimization because the behavior is slightly different with special cases like NaN.
Regarding NaN, the conditional version would not produce a sorted array if there are any NaNs, because comparisons with NaN always evaluate to false, breaking the rules of strict weak ordering:
IN: { 5, 4, NaN, 3, 2, 1 }
OUT: { 4, 5, NaN, 1, 2, 3 }
With the min/max version, afaik if one input of CompareSwap is NaN it will duplicate the other, so a possible output of the sort would be:
IN: { 5, 4, NaN, 3, 2, 1 }
OUT: { 1 2 3 3 4 5 }
The point is that I regard NaNs in the array as invalid inputs and prefer to use the fast implementation with 2 instructions instead of 3.
```cpp
using value_it_t   = value_t*;
using offset_t     = OffsetT;
using compare_op_t = less_t;
constexpr bool is_stable = true;
```
important: I'd like the new algorithm to be benchmarked. Could you please copy this file into unstable directory with is_stable = false?
question: why copy the file and not parameterize for better code reuse?
It's in line with my expectations: the block merge sort is memory-bound and the bottleneck is the merging part in shared memory, so even if the per-thread sort issues fewer instructions, that does not affect the overall runtime much. The goals of the PR are: (a) to expose a more efficient thread sort, e.g. if the user wants to do a segmented sort of many small arrays of the same size, one array per thread with Parberry's pairwise sort is much faster than using CUB's segmented sort; (b) to enable using more items per thread, as the quadratic cost would previously have prevented that.
No problem, thanks for the reviews. I will try to make some changes this week.
…d inflating the diff)
…n last place to avoid breaking API
I've made most of the requested changes. What remains to be done is adding the unstable benchmark.
elstehle left a comment:
The first stage of the sorting network looks good to me. Currently going through the second stage.
@Nyrio We have recently applied formatting to the cub subproject. I have merged in main and applied formatting to your changes. I hope that should make this transition as painless as possible.
/ok to test
Thanks @miscco for applying formatting. :) I think I've made all the requested changes. @miscco and @gevtushenko, please resolve the discussions if you're satisfied with the changes.
Hey @Nyrio, I was going through old PRs and came across this, and wanted to let you know that I dropped the ball in not making sure we got back to you. My apologies. If you are still interested in moving forward with this work, we would be happy to work with you. I believe what would help move this forward is to show some performance graphs that demonstrate speed-ups in benchmarks that are part of the repo. I see you've already added some benchmarks. Could you provide a summary table/graph of the performance improvements you're seeing?
Description

closes #1551

Changes:

- … `cub::Less` and `cub::Greater` operators, and provide specializations of compare-swap to improve fp32 key-only sorting performance using min/max instead of a predicate.

Testing improvements/bug fixes:

- `test_thread_sort` was casting the values to KeyT.
- `test_warp_merge_sort` was comparing sort-pairs-by-key to a lexicographic ref sort.
- `test_warp_merge_sort` was comparing stable and unstable variants to a stable ref sort.
- `test_warp_merge_sort` did not generate the values, they were all 0...
- `test_block_merge_sort` was only testing stable APIs.
- `test_block_merge_sort` was not testing inputs for which stability is relevant (drawing random int32_t keys has a very small chance of conflict).
- `test_device_segmented_sort` tested the unstable sort against a stable reference sort.

Future work (in follow-up PRs):

- … `IS_STABLE`, and re-tune.
- … `IS_STABLE`, and re-tune.

Misc notes:

- … `is_stable = true` in the benchmarks but ideally, we'd want to benchmark both. What would be the best way to do that? (a) add a new set; (b) add a boolean axis; (c) a separate benchmark (quite redundant).

Checklist