Add NVTX ranges for all CUB algorithms #1657
#pragma once

#include <cub/config.cuh>
Question: This could be a quite useful facility also for other libraries, do we want to move this within cccl?
With your permission, sure! The scope of the PR is CUB though, so let's maybe move it in a second step? Maybe @gevtushenko lets me do the same work for Thrust; then I could move it on that occasion.
/ok to test
00260d0 to
f9de904
Compare
|
Minor suggestion/note: This likely doesn't matter, but technically there is some extra performance overhead here because when you call […] See more here: https://nvidia.github.io/NVTX/doxygen-cpp/index.html#REGISTERED_MESSAGE
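The NVTX documentation section linked above covers registered messages. As far as I understand it, associating a plain string with an event requires copying the message each time, while a registered message is copied once and afterwards referenced through a cheap handle. A rough sketch of that idea follows. It uses a hypothetical `mock_profiler` class that is not the NVTX API, and all names in it are assumptions made for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

namespace msg_sketch {

// Hypothetical profiler backend for illustration only; NOT the NVTX API.
class mock_profiler {
public:
  using handle = std::size_t;

  // Unregistered path: the message characters are copied for every event.
  void mark_with_string(const char* message) {
    last_message_ = message; // copies the string contents
    ++string_copies_;
    ++events_;
  }

  // Registration copies the message once and returns a cheap handle.
  handle register_message(const char* message) {
    registered_.emplace_back(message);
    ++string_copies_;
    return registered_.size() - 1;
  }

  // Registered path: only the handle is passed, no string copy per event.
  void mark_with_handle(handle h) {
    assert(h < registered_.size());
    last_handle_ = h;
    ++events_;
  }

  std::size_t string_copies() const { return string_copies_; }
  std::size_t events() const { return events_; }

private:
  std::vector<std::string> registered_;
  std::string last_message_;
  handle last_handle_ = 0;
  std::size_t string_copies_ = 0;
  std::size_t events_ = 0;
};

} // namespace msg_sketch
```

In this mock, emitting N events by string costs N copies, while registering once and emitting N events by handle costs a single copy. Whether the difference is measurable for CUB's use case is a separate question, as the comment above already notes.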
25c2423 to
c963082
Compare
f16a826 to
21e6697
Compare
@jrhemstad addressed.
CUB_DETAIL_NVTX_RANGE_SCOPE_IF(d_temp_storage, GetName());
//return SortPairsNoNVTX<KeyT, ValueT, BeginOffsetIteratorT, EndOffsetIteratorT>(
return SortPairs<KeyT, ValueT, BeginOffsetIteratorT, EndOffsetIteratorT>( // FIXME(bgruber): bug on purpose to validate unit tests
That's a deliberate bug to see whether the CI catches it.
This test and a few others caught it. The failure looks like this:
208/327 Test #208: cub.cpp17.test.device_segmented_sort_keys.lid_0 .....................Subprocess aborted***Exception: 0.22 sec
Recursion detected
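How the unit tests detect the nested range is not shown in this thread. As a minimal sketch of the general idea, assuming a range is opened only when the first macro argument (here `d_temp_storage`) is non-null, and assuming a check that fails when a range is opened while another one is still active: the names `scoped_range_if`, `sort_pairs`, `sort_pairs_no_range`, and `sort_pairs_buggy_overload` below are hypothetical and do not come from CUB or NVTX. The sketch throws an exception where the real test aborts the process.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

namespace nvtx_sketch {

// Names of the ranges currently open on this thread, innermost last.
// Stand-in for a profiler's range stack.
inline std::vector<std::string>& open_ranges() {
  static thread_local std::vector<std::string> ranges;
  return ranges;
}

// RAII range that is only active when `enabled` is true.
// Opening a range while another one is active is reported as recursion.
class scoped_range_if {
public:
  scoped_range_if(bool enabled, const char* name) : enabled_(enabled) {
    if (!enabled_) { return; }
    if (!open_ranges().empty()) { throw std::logic_error("Recursion detected"); }
    open_ranges().emplace_back(name);
  }
  ~scoped_range_if() {
    if (enabled_) {
      assert(!open_ranges().empty());
      open_ranges().pop_back();
    }
  }
  scoped_range_if(const scoped_range_if&) = delete;
  scoped_range_if& operator=(const scoped_range_if&) = delete;

private:
  bool enabled_;
};

// Range-free implementation, analogous in spirit to the "NoNVTX" function in the diff.
inline int sort_pairs_no_range(void* d_temp_storage) {
  return d_temp_storage != nullptr ? 1 : 0; // 0: size query only, 1: work done
}

// Public entry point: opens one range, then calls the range-free implementation.
inline int sort_pairs(void* d_temp_storage) {
  scoped_range_if range(d_temp_storage != nullptr, "sort_pairs");
  return sort_pairs_no_range(d_temp_storage);
}

// Buggy overload: opens a range, then calls another public entry point,
// which tries to open a second, nested range.
inline int sort_pairs_buggy_overload(void* d_temp_storage) {
  scoped_range_if range(d_temp_storage != nullptr, "sort_pairs_buggy_overload");
  return sort_pairs(d_temp_storage);
}

} // namespace nvtx_sketch
```

With a null `d_temp_storage` (the size query), no range is opened, so even the buggy overload passes. The nesting only shows up on the call that actually does work, which is why a forwarding overload has to call the range-free variant.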
b9585ee to
9ea100e
Compare
// The purpose of this test is to verify that CUB can use NVTX without any additional dependencies. It is built as part
// of the unit tests, but can also be built standalone:

// Compile (from current directory):
Question: did you verify that this builds in CI with the right command line options?
Right. So I considered hooking up this test as a special minimal build analogous to the header checks like we discussed offline. However, I noticed the header checks also link against the cub target in the same way as the non-Catch2 unit tests do. So adding this standalone test as a non-Catch2 test should be equally minimal.
For reference, here are the compile and link command lines from building with VERBOSE=1 make cub.cpp17.test.nvtx_standalone. Compile:
/usr/local/cuda/bin/nvcc
-forward-unknown-to-host-compiler
-ccbin=clang++-17
-DCUB_DETAIL_DEBUG_ENABLE_SYNC
-DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_CUDA
-DTHRUST_HOST_SYSTEM=THRUST_HOST_SYSTEM_CPP
-D_CCCL_NO_SYSTEM_HEADER
-I/home/bgruber/dev/cccl/cub/test
-I/home/bgruber/dev/cccl/cub/cub/cmake/../..
-I/home/bgruber/dev/cccl/libcudacxx/lib/cmake/libcudacxx/../../../include
-I/home/bgruber/dev/cccl/thrust/thrust/cmake/../..
-O2
-g
-DNDEBUG
-std=c++17
"--generate-code=arch=compute_89,code=[compute_89,sm_89]"
-Xcompiler=-Werror
-Xcompiler=-Wall
-Xcompiler=-Wextra
-Xcompiler=-Winit-self
-Xcompiler=-Woverloaded-virtual
-Xcompiler=-Wcast-qual
-Xcompiler=-Wpointer-arith
-Xcompiler=-Wunused-local-typedef
-Xcompiler=-Wvla
-Xcompiler=-Wgnu
-Xcompiler=-Wno-gnu-line-marker
-Xcompiler=-Wno-gnu-zero-variadic-macro-arguments
-Xcompiler=-Wno-unused-function
-Wreorder
-Xcudafe=--display_error_number
-Xcudafe=--promote_warnings
-Wno-deprecated-gpu-targets
-MD
-MT
cub/test/CMakeFiles/cub.cpp17.test.nvtx_standalone.dir/test_nvtx_standalone.cu.o
-MF
CMakeFiles/cub.cpp17.test.nvtx_standalone.dir/test_nvtx_standalone.cu.o.d
-x
cu
-c
/home/bgruber/dev/cccl/cub/test/test_nvtx_standalone.cu
-o
CMakeFiles/cub.cpp17.test.nvtx_standalone.dir/test_nvtx_standalone.cu.o
Link:
/usr/bin/clang++-17
CMakeFiles/cub.cpp17.test.nvtx_standalone.dir/test_nvtx_standalone.cu.o
-o
../../bin/cub.cpp17.test.nvtx_standalone
-lcudadevrt
-lcudart_static
-lrt
-lpthread
-ldl
-L"/usr/local/cuda/targets/x86_64-linux/lib/stubs"
-L"/usr/local/cuda/targets/x86_64-linux/lib"
However, the test compiles and runs equally well if I just compile it with:
nvcc test_nvtx_standalone.cu -I../../cub -I../../thrust -I../../libcudacxx/include -o nvtx_standalone
Fixes: NVIDIA#719
Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>
c:\cccl\cub\cub\detail\nvtx3.hpp(807): warning C4459: declaration of 'd' hides global declaration
C:/cccl/thrust/examples/cuda/global_device_vector.cu(30): note: see declaration of 'd'
Because @gevtushenko asked again about whether NVTX requires any new link-time dependencies, like […]
Linux nvcc: […]
Linux nvc++: […]
Linux clang CUDA: […]
Windows MSVC: […]
All executables run and produce the expected output in Nsight Systems.
Description
Add NVTX ranges to all CUB algorithms, so they can be visualized in a profiler such as Nsight Systems.
Here is what this could look like in Nsight Systems:

Fixes: #719
Checklist
Open questions
CCCL (like in Feature Request: Add NVTX Ranges to Thrust/CUB algorithms #719) or CUB/Thrust etc.? @jrhemstad says CCCL.