Extract BFS paths SG implementation #1838
rapids-bot[bot] merged 17 commits into rapidsai:branch-21.12
Conversation
seunghwak
left a comment
Review part 1 (I am still reading extract_bfs_paths_impl.cuh).
vertex_t const* distances_;
vertex_t const* destinations_;

size_t __device__ operator()(vertex_t idx)

Shouldn't this be size_t idx?

Fixed in next push
###################################################################################################
# - MST tests ----------------------------------------------------------------------------
# - MST tests -------------------------------------------------------------------------------------

Thanks for fixing these :-)
vertex_t const* distances_;

thrust::tuple<vertex_t, vertex_t, int> __device__
operator()(thrust::tuple<vertex_t, vertex_t, int> const& tuple)

What's the third tuple element (type int)?
The original rank. Each rank has to retrieve results from another rank. The current implementation is:
- Construct the tuple (current_vertex, offset_in_current_frontier_array, my_rank)
- Shuffle the tuples based on which rank contains the predecessor information for current_vertex
- Replace current_vertex with previous_vertex
- Shuffle the tuples back to the original rank where the data is required
- Update current_frontier in the proper location with the predecessor value
I could achieve the same result without the int parameter by writing a more sophisticated communication pattern (sort the vertices by which rank they are assigned to, do block transfers keeping track of the offsets for each rank, etc). I suspect this would be more time efficient as well as memory efficient. But I was trying to reuse existing functions as much as possible to get something working. We can optimize this more later.
Ah... OK, so I guess this code can go away if we implement collect_values_for_vertices.
So, is this functor still used? I can't find a place where this function is actually used.
vertex_t const* predecessors_;

thrust::tuple<vertex_t, vertex_t, int> __device__
operator()(thrust::tuple<vertex_t, vertex_t, int> const& tuple)

What's the third tuple element (type int)?

So, is this functor still used? I can't find a place where this function is actually used.
rmm::device_uvector<vertex_t> const& d_predecessors,
size_t num_paths_to_check,
uint64_t seed)
{

So, if I am not mistaken, this randomly selects num_paths_to_check vertices (that are not sources).
I think we can achieve this by
thrust::copy_if();
thrust::shuffle();

And as this is only for tests, another option is to do this on the CPU, so there's no need to add this additional file, keeping our codebase size in check.
I can look into shuffle; that's one I haven't used before.
I can certainly look at a CPU implementation. I was imagining that we might have other tests that could benefit from a function to randomly select vertices, although I started with it here because it was easier.
Yeah... so, for personalized PageRank, we're randomly selecting personalization vertices as well (it's currently implemented as host code).
https://github.com/rapidsai/cugraph/blob/branch-21.10/cpp/tests/link_analysis/pagerank_test.cpp#L178
https://github.com/rapidsai/cugraph/blob/branch-21.10/cpp/tests/link_analysis/mg_pagerank_test.cpp#L104
We may better combine these and add something under tests/utilities?
* @param graph_view Graph view object.
* @param distances Pointer to the distance array constructed by bfs.
* @param predecessors Pointer to the predecessor array constructed by bfs.
* @param destinations Destination vertices, extract path from source to each of these destinations
So, in multi-GPU, destinations should be local to this GPU, right? Shouldn't we explain this here?
And I think this should be mentioned in
https://github.com/rapidsai/cugraph/blob/branch-21.10/cpp/include/cugraph/algorithms.hpp#L1106
https://github.com/rapidsai/cugraph/blob/branch-21.10/cpp/include/cugraph/algorithms.hpp#L1185
as well.
Initially I started with this assumption. However, once I started implementing I realized I had to handle off-GPU vertices anyway, so I dropped that assumption. If you specify on GPU 0 a destination that resides on GPU 1, the implementation will just go to GPU 1 to get the predecessor of that destination.
I agree that should be mentioned in the other places. I will add those in this PR.
I think we may still enforce this for consistency with other APIs (and we can avoid communication for the first step).
Changed this back to this assumption.
- Added a check to validate that destinations are on the correct gpu
- Eliminated the initial shuffle logic (outside the loop in the implementation)
- Updated the documentation (including the other locations)
> 3. Updated the documentation (including the other locations)

I can see updates in other locations, but I can't find documentation updates for extract_bfs_paths. Could you point out the exact line?
Somehow missed this one, updated in next push.
size_t count =
  multi_gpu
    ? cugraph::host_scalar_allreduce(
        handle.get_comms(), current_frontier.size(), raft::comms::op_t::MAX, handle.get_stream())
Wouldn't a SUM reduction be better here (to minimize the difference between the SG and MG logic)?
I think MAX is semantically what I want, although SUM would suffice.
I'm actually thinking I'll change this loop to a for loop over the range [0, max_path_length), since we already know how many iterations are required.
> I'm actually thinking I'll change this loop to a for loop over the range [0, max_path_length), since we already know how many iterations are required.

+1 for this; it will be clearer.
Changed to a for loop, definitely much clearer and saves the allreduce cost.
count = multi_gpu ? cugraph::host_scalar_allreduce(handle.get_comms(),
                                                   current_frontier.size(),
                                                   raft::comms::op_t::MAX,
Wouldn't a SUM reduction be better here (to minimize the difference between the SG and MG logic)?
Same as above.

Same fix as above.
Codecov Report

@@           Coverage Diff            @@
##           branch-21.12    #1838   +/- ##
==========================================
  Coverage              ?   70.11%
==========================================
  Files                 ?      143
  Lines                 ?     8827
  Branches              ?        0
==========================================
  Hits                  ?     6189
  Misses                ?     2638
  Partials              ?        0

Continue to review the full report at Codecov.
Pushing out to 21.12 to incorporate more of the discussed changes.
template <typename VertexIterator,
          typename ValueIterator,
          typename KeyToGPUIdOp,
I think we'd better rename this to VertexToGPUIdOp to clarify that this function's behavior is different from collect_values_for_keys.
There was a problem hiding this comment.
Fixed in next push
VertexIterator map_vertex_first,
VertexIterator map_vertex_last,
ValueIterator map_value_first,
KeyToGPUIdOp key_to_gpu_id_op,
Better be VertexToGPUIdOp vertex_to_gpu_id_op?
Fixed in next push
Changed to vertex_partition_lasts
raft::update_host(h_gpu_counts.data(), gpu_counts.data(), gpu_counts.size(), stream_view);

std::tie(shuffled_vertices, shuffled_counts) =
auto [shuffled_vertices, shuffled_counts] = (so, no need to explicitly define shuffled_vertices & shuffled_counts)
And actually, input_vertices here will not be used after this line, so better to invent a name that covers both "shuffled_vertices" and "input_vertices" and do something like std::tie(new_name, ...) = shuffle_values(..., std::move(new_name), ...); (so if this function becomes a memory bottleneck, we can free the memory held by input_vertices).
input_vertices is used at the end of the function as the keys for a lookup.
using vertex_t = typename std::iterator_traits<VertexIterator>::value_type;
using value_t  = typename std::iterator_traits<ValueIterator>::value_type;

size_t input_size = thrust::distance(map_vertex_first, map_vertex_last);
If I am not mistaken, [map_vertex_first, map_vertex_last) are vertices we're querying values (not the vertices in local GPU).
In the collect_values_for_keys function,
VertexIterator0 map_key_first,
VertexIterator0 map_key_last,
ValueIterator map_value_first,
store (key, value) pairs local to this GPU and
VertexIterator1 collect_key_first,
VertexIterator1 collect_key_last,
are keys we want to query values for. So, I think map_key_first & last here are inconsistent (and misnomers).
collect_values_for_vertices(raft::comms::comms_t const& comm,
VertexPartitionDeviceViewType vertex_partition_device_view (or local_vertex_first, local_vertex_last to better mirror the above collect_values_for_keys function),
ValueIterator map_value_first,
VertexToGPUIdOp vertex_to_gpu_id_op,
VertexIterator1 collect_unique_vertex_first,
VertexIterator1 collect_unique_vertex_last,
...
Or even better as vertex to GPU ID mapping is pretty much pre-determined by partitioning, we may take std::vector<vertex_t> const& vertex_partition_lasts instead of vertex_to_gpu_id_op (then no need to take vertex_partition_device_view).
collect_values_for_vertices(raft::comms::comms_t const& comm,
ValueIterator local_value_first,
VertexIterator1 collect_unique_vertex_first, (or sorted_unique_vertex_first, if we want to ask caller to sort if not already sorted)
VertexIterator1 collect_unique_vertex_last,
std::vector<vertex_t> const& vertex_partition_lasts, (size == # GPUs),
...
Restructured in the push I just made considering some of these ideas.
thrust::sort(rmm::exec_policy(stream_view), input_vertices.begin(), input_vertices.end());

// 1: Shuffle vertices to the correct node
auto gpu_counts = groupby_and_count(
  input_vertices.begin(), input_vertices.end(), key_to_gpu_id_op, comm.get_size(), stream_view);
So, this is an optimization issue, and we may not need to address it right now (having at least a FIXME statement would be nice), but there are a few issues we need to think about.
In some cases, users may pass an already-sorted vertex list to query values for, and this sort is redundant in those cases. For sorted input_vertices, we can just run thrust::lower_bound for # GPUs values instead of visiting and group-bying every element in input_vertices (and if we're running groupby_and_count, I think the sort above is unnecessary; please double check). See the code around https://github.com/rapidsai/cugraph/blob/branch-21.12/cpp/src/structure/renumber_utils_impl.cuh#L258 for reference.
Restructured the code in the next push to accommodate this concern and some of the other comments above.
The sort here is necessary to use the keys for the lookup at the end of this function.
I added a call to thrust::unique, since there's no point in a particular GPU asking for the same key more than once in a single call.
vertex_t const* distances_;

thrust::tuple<vertex_t, vertex_t, int> __device__
operator()(thrust::tuple<vertex_t, vertex_t, int> const& tuple)

So, is this functor still used? I can't find a place where this function is actually used.
vertex_t const* predecessors_;

thrust::tuple<vertex_t, vertex_t, int> __device__
operator()(thrust::tuple<vertex_t, vertex_t, int> const& tuple)

So, is this functor still used? I can't find a place where this function is actually used.
rmm::device_uvector<vertex_t> d_vertex_partition_offsets(0, handle.get_stream());

if constexpr (multi_gpu) {
  // FIXME: There should be a better way to do this
Yeah... I think we'd better add a member function to graph_view_t that returns vertex_partition_lasts as a std::vector<vertex_t> (size == # GPUs).
Added the function to graph_view_t in next push
vertex_t number_of_vertices,
vertex_t local_vertex_first,
rmm::device_uvector<vertex_t> const& d_predecessors,
size_t num_paths_to_check)
Just an issue to think about.
So, for testing, should we use CUDA code for this kind of input-preparation function (and add additional files, since we cannot use thrust from .cpp files), or would we be better off with a slower host function, as this is only for testing? Performance for this part may not matter much unless it becomes a bottleneck in large-scale benchmarking, and the host version might help keep our test code simpler.
Definitely worth considering. Not sure how frequent this pattern will be.
…values_for_vertices, some code cleanup
seunghwak
left a comment
Looks good to me besides a minor complaint.
cpp/include/cugraph/algorithms.hpp
Outdated
* @param distances Pointer to the distance array constructed by bfs.
* @param predecessors Pointer to the predecessor array constructed by bfs.
* @param destinations Destination vertices, extract path from source to each of these destinations
* In a multi-gpu context the source vertex should be local to this GPU.
In a multi-gpu context the source vertex should be local to this GPU.
=>
In a multi-gpu context the destination vertex should be local to this GPU.
vertex_t num_vertices,
ValueIterator local_value_first,
std::vector<vertex_t> const& vertex_partition_lasts,
VertexPartitionDeviceViewType vertex_partition_device_view,
Just a minor point, I feel like vertex_partition_device_view is a bit redundant as we can easily compute local_vertex_first from vertex_partition_lasts (0 if comm_rank == 0 and vertex_partition_lasts[comm_rank - 1] otherwise).
I guess since collect_comm.cuh is part of the primitive functionality it's reasonable to have it know the internal workings of the data structures. I have been trying to avoid having things outside the primitives (most of what I work on) rely on understanding the inner workings of the data structures.
I'll make these two changes, hopefully we can merge later today.
detail::update_paths<vertex_t>{paths.data(), invalid_vertex});
}

return std::make_tuple(std::move(paths), max_path_length);
Just want to say that this code looks beautiful :-) The MG part is just slightly more complex than the SG path.
It does look nice. Creating the collect_values_for_vertices function really cleaned this up well.
@gpucibot merge
Partially addresses #1753
Creates a new C++ function to extract BFS paths.
This PR includes the SG implementation and the SG unit tests. A separate PR will provide the MG unit test.