Add Merge Sort implementation for c.parallel #3636
Merged
NaderAlAwar merged 38 commits into NVIDIA:main on Feb 7, 2025
…te parameters to allow calling merge sort from c.parallel
…e to missing definition when using NVRTC, and make changes to included thrust headers to make them NVRTC compilable
bernhardmgruber
requested changes
Jan 31, 2025
… per thread for the tuning policy
…ng set incorrectly
This reverts commit ee290a4.
…as the copy versions
1 task
Contributor
🟩 CI finished in 1h 34m: Pass: 100%/90 | Total: 2d 16h | Avg: 42m 51s | Max: 1h 19m | Hits: 73%/132233
| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| | Thrust |
| | CUDA Experimental |
| | python |
| +/- | CCCL C Parallel Library |
| | Catch2Helper |

Modifications in project or dependencies?

| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 90)
| # | Runner |
|---|---|
| 65 | linux-amd64-cpu16 |
| 9 | windows-amd64-cpu16 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 4 | linux-arm64-cpu16 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
| 2 | linux-amd64-gpu-rtx2080-latest-1 |
| 1 | linux-amd64-gpu-h100-latest-1 |
shwina
approved these changes
Feb 7, 2025
Looks great to me! Approving with just a couple of nits
c/parallel/test/test_util.h
Outdated

    }

    template <class T>
    std::vector<T> make_shuffled_key_ranks_vector(std::size_t num_items)
Nit: Given that we're defining this in the general-purpose test_util.h, we could consider a more general name
Suggested change

    - std::vector<T> make_shuffled_key_ranks_vector(std::size_t num_items)
    + std::vector<T> make_shuffled_sequence(std::size_t num_items)
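For context, a helper with the suggested name could look roughly like the sketch below. The PR page does not show the function body, so this is only an assumption about its intent: produce the values `0 .. num_items - 1` in shuffled order, which makes a convenient sort input because the sorted result is known in advance. The `seed` parameter is hypothetical and is added here only to keep the example reproducible.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical sketch, not the body from the PR: returns 0, 1, ..., num_items - 1
// in pseudo-random order. The `seed` parameter is an assumption made for this example.
template <class T>
std::vector<T> make_shuffled_sequence(std::size_t num_items, unsigned seed = 42)
{
  std::vector<T> sequence(num_items);
  std::iota(sequence.begin(), sequence.end(), T{0}); // fill with 0, 1, 2, ...
  std::mt19937 rng(seed);                            // fixed seed keeps test runs reproducible
  std::shuffle(sequence.begin(), sequence.end(), rng);
  return sequence;
}
```

Sorting the returned vector should yield exactly `0, 1, ..., num_items - 1`, so a test can compare the sort output against `std::iota` without keeping a separate reference copy.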
c/parallel/src/kernels/operators.h
Outdated

    - std::string make_kernel_user_binary_operator(std::string_view input_value_t, cccl_op_t operation);
    + std::string
    + make_kernel_user_binary_operator(std::string_view input_value_t, cccl_op_t operation, bool comparison_op = false);
Nit: perhaps we should just have a distinct make_kernel_user_comparison_operator helper.
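One way to read this suggestion is sketched below: two thin, separately named helpers that share a single generator, instead of a `bool` flag on one function. Everything here is illustrative. `user_op` is a hypothetical stand-in for `cccl_op_t`, the emitted string is not the source text c.parallel actually generates, and the sketch assumes the flag's only effect is to select a `bool` return type for comparisons.

```cpp
#include <string>
#include <string_view>

// Hypothetical stand-in for the real operation descriptor; only a symbol name is modeled.
struct user_op
{
  std::string name;
};

// Shared generator (illustrative only): declares the user-provided device
// function with the requested return type.
inline std::string make_operator_decl(std::string_view return_t, std::string_view value_t, const user_op& op)
{
  std::string src = "extern \"C\" __device__ ";
  src += return_t;
  src += ' ';
  src += op.name;
  src += '(';
  src += value_t;
  src += " lhs, ";
  src += value_t;
  src += " rhs);";
  return src;
}

// Binary operator: returns the same type as its inputs (assumed signature).
inline std::string make_kernel_user_binary_operator(std::string_view value_t, const user_op& op)
{
  return make_operator_decl(value_t, value_t, op);
}

// Comparison operator: always returns bool, regardless of the input type (assumed signature).
inline std::string make_kernel_user_comparison_operator(std::string_view value_t, const user_op& op)
{
  return make_operator_decl("bool", value_t, op);
}
```

With distinct names, a call site states its intent directly, and a comparison can no longer be generated with the value type as its return type by forgetting a flag.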
      using iterator_category = cuda::std::random_access_iterator_tag;
      using difference_type = {0};
    - using value_type = void;
    + using value_type = VALUE_T;
Is this change required independent of #3722?
If not, I'd recommend we make all the output_iterator changes in that PR.
Force-pushed from 1d4dc30 to 5124fe2
Contributor
🟨 CI finished in 1h 02m: Pass: 98%/90 | Total: 15h 54m | Avg: 10m 36s | Max: 56m 09s | Hits: 94%/132089
| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| | Thrust |
| | CUDA Experimental |
| | python |
| +/- | CCCL C Parallel Library |
| | Catch2Helper |

Modifications in project or dependencies?

| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 90)
| # | Runner |
|---|---|
| 65 | linux-amd64-cpu16 |
| 9 | windows-amd64-cpu16 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 4 | linux-arm64-cpu16 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
| 2 | linux-amd64-gpu-rtx2080-latest-1 |
| 1 | linux-amd64-gpu-h100-latest-1 |
Contributor
🟩 CI finished in 1h 19m: Pass: 100%/90 | Total: 16h 00m | Avg: 10m 40s | Max: 56m 09s | Hits: 94%/132233
| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| | Thrust |
| | CUDA Experimental |
| | python |
| +/- | CCCL C Parallel Library |
| | Catch2Helper |

Modifications in project or dependencies?

| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 90)
| # | Runner |
|---|---|
| 65 | linux-amd64-cpu16 |
| 9 | windows-amd64-cpu16 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 4 | linux-arm64-cpu16 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
| 2 | linux-amd64-gpu-rtx2080-latest-1 |
| 1 | linux-amd64-gpu-h100-latest-1 |
bernhardmgruber
approved these changes
Feb 7, 2025
shwina pushed a commit to shwina/cccl that referenced this pull request on Feb 10, 2025
* Rename merge_sort VSMem helper methods
* Change `KeyT` and `ValueT` in `DispatchMergeSort` to defaulted template parameters to allow calling merge sort from c.parallel
* Add `dependent_launch` parameter to `CudaDriverLauncherFactory`
* Add initial c.parallel `merge_sort` implementation
* Fix policy name in merge_sort src
* Add missing `ITEMS_PER_THREAD` member to merge_sort policy
* Add missing merge_sort policy parameters
* Replace `make_load_iterator` forward declaration with file include due to missing definition when using NVRTC, and make changes to included thrust headers to make them NVRTC compilable
* Conditionally include `cuda/std/iterator` if NVRTC is being used
* Use `dependent_launch` parameter in `CudaDriverLauncher`
* Use proper driver API instead of runtime API
* Add initial c.parallel merge sort test
* Pass in `cub::NullType` if merge sort item iterators are `nullptr`
* Fix wrong enum value being passed to iterator name
* Introduce `nominal_4b_items_to_items` to properly calculate the items per thread for the tuning policy
* Use the correct BlockLoad and BlockStore algorithms in the policy
* Fix issue where we incorrectly use the fallback policy in the vsmem helper
* Complete `merge_sort` initial test
* Fix issue with obtaining item iterator names
* Add merge sort test with copy
* Fix wrong assertion in merge sort test
* Make c.parallel `merge_sort` tests similar to cub tests
* Fix doubles not being sorted properly due to operator return type being set incorrectly
* Add `merge_sort` tests when items are not null
* Remove unneeded storage type from merge sort kernel source
* Revert "Remove unneeded storage type from merge sort kernel source". This reverts commit ee290a4.
* Add custom data type test to `merge_sort`
* Fix custom types being passed to `merge_sort`
* Add initial support for input iterators to `merge_sort`
* Fix issue where the non-copy merge sort tests were being implemented as the copy versions
* Add support for output iterators
* Add more merge_sort iterator tests
* Reverse NVRTC include guard change following merge with main
* Rename function to generate merge sort key sequence
* Refactor c.parallel custom operator source generation
* Undo output iterator changes so they can be resolved with NVIDIA#3722
* clang-format
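The commit list mentions `nominal_4b_items_to_items` without showing it. As a rough illustration of the idea, the sketch below applies the common "nominal 4-byte items" scaling: a tuning value chosen for 4-byte keys is scaled by `4 / key_size` and clamped to the range `[1, nominal]`. The signature and the exact clamping are assumptions for this example; the helper in the PR may differ.

```cpp
#include <algorithm>
#include <cstddef>

// Sketch only: scales an items-per-thread value tuned for 4-byte keys to a key
// of `key_size` bytes, clamped to [1, nominal]. The signature is an assumption.
constexpr int nominal_4b_items_to_items(int nominal_4b_items_per_thread, std::size_t key_size)
{
  const int scaled = nominal_4b_items_per_thread * 4 / static_cast<int>(key_size);
  return std::min(nominal_4b_items_per_thread, std::max(1, scaled));
}
```

For a nominal value of 11, this gives 11 items per thread for 4-byte keys, 5 for 8-byte keys, and 1 for a 64-byte custom type, so larger keys get fewer items per thread and the per-block working set stays roughly bounded.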
Description
Closes #2547
This PR adds the c.parallel `merge_sort` API. It also includes some changes to `merge_sort.cuh` and other included headers to enable NVRTC compilation.
Notes to reviewers
… `merge_sort.cu` were hardcoded in order to avoid the assertion failure that checks for the presence of certain memory ops (presumably for performance reasons). This means that we will get suboptimal performance. I think this can be addressed once "[FEA]: Redesign default tuning" (#3570) is solved.
Checklist