API improvements for end-to-end MG sampling performance#3269
Conversation
seunghwak
left a comment
Few quick comments, let me think more about this and I will post additional questions.
std::optional<rmm::device_uvector<int32_t>>,
std::optional<rmm::device_uvector<edge_t>>>
uniform_neighbor_sample(
  raft::handle_t const& handle,
So, these are what this API should satisfy.
- If starting_vertices includes seeds from multiple batches, the caller should be able to separate sampling outputs from different batches, and the caller should be able to place sampling outputs on different GPUs based on a mapping between batch IDs and GPUs. If the caller is invoking this function for a single batch, the caller may want the sampling outputs for the local sampling_vertices to be stored locally as well.
- For the output stored on a single GPU, we should be able to sort it based on batch IDs.
- Once we sort based on the batch ID, some users may want to sort based on hops as a primary key (if they don't need to create a separate tree per seed within a batch) or seed IDs as a primary key and hops as a secondary key (if they need to create a separate tree per seed).
I am thinking about the following,
This function optionally takes rmm::device_uvector<label_t>&& labels and rmm::device_uvector<int>&& dst_comm_ranks (labels.size() == dst_comm_ranks.size() == start_vertices.size()).
Here labels don't necessarily coincide with batch IDs, but users may set labels to batch_id * batch_size + seed index within a batch (if the results need to be sorted within a batch using seed IDs as a primary key and hops as a secondary key) or just batch_id (if there is no need to create a tree per seed). The output results are shuffled based on dst_comm_ranks for each starting vertex (if dst_comm_ranks is valid), or all the results for local seeds will be stored locally.
We return optional labels and hops for each edge data (src, dst, optional (edge weight, ID, type)). Within each GPU, the output results will be sorted using labels as a primary key (if labels are provided in the input) and hops as a secondary key.
One concern is that computing global "seed index within a batch" might be challenging for the caller if they want to distribute seeds for a single batch to multiple GPUs, but I guess this is not a common use case. If there are 100 batches and 10 GPUs, users may assign batch [0-10) to GPU0 (so start_vertices for GPU0 holding all the seeds for batch [0-10)), batch [10-20) to GPU1, and so on.
We may be able to reduce the output size if we don't return label and hop per every edge datum and instead create offset arrays per label & hop, but I am not sure the benefit outweighs the increase in complexity. If the memory footprint is a major concern, we may just reduce the number of batches we process per call (the main reason for processing multiple batches in a single call is to saturate GPUs, but if we're hitting the memory limit, we might be going too far in this direction).
Am I missing anything, or any other thoughts?
This is a good point; there's a lot to unpack here. Let me create a few separate comments.
The original objective of processing multiple batches in a single call is to efficiently utilize the parallelism on the GPU. If we do individual calls with a small batch size then we don't get enough parallelism and the overhead of calling the function is large relative to the work accomplished. By processing multiple batches of seeds in the same call we increase the amount of parallel work to be done on the GPU and we reduce the number of times we face the overhead of calling the function, allocating memory, launching kernels, etc. This improves our overall efficiency.
I don't know where the threshold is, but at some point there are enough vertices in a call to get the efficiencies we are looking for. Doing larger numbers of batches in a single call should still drive down overhead, but the improvement should decrease dramatically once we reach saturation of the GPU parallelism. It seems (with the large graphs we are targeting) that this threshold is much less than "all seeds to be sampled in a training epoch".
In the current pipeline, the output from this processing is going to be moved out of GPU memory to be retrieved by the trainers as they work (currently written at the python layer to parquet files). As long as we process in the first call enough batches to keep the trainers busy while we do a second call, there would be no drop in pipeline throughput by making multiple calls to the sampling code with smaller numbers of batches. That would reduce the memory requirement, and we can overlap the sampling with the training a bit more. It might actually improve the overall latency for training an epoch (although perhaps only marginally, I don't really know the performance numbers you're dealing with).
This does suggest that the extra memory required to drive this function doesn't need to be a driving factor in the design of the API.
Regarding the labels discussion. Part of the reason I went with the name "label" rather than "batch" was this notion that batches might not be the only reason for labeling the data.
There's absolutely no reason that the sampling code needs to understand anything about how the caller assigns labels and uses them. If we keep the label as a vector (labels.size() == start_vertices.size()) rather than switching to a CSR-style representation, that label can literally be any arbitrary 32-bit integer value (it doesn't need to be contiguous values starting from 0). Either route for organizing the result (mapping label to GPU id, or providing a destination vector with dst_comm_ranks.size() == start_vertices.size()) allows us to shuffle the data to the correct GPU. We could talk about sorting options (beyond sorting by label) if that would be helpful, although that seems like an easy enough feature to add later when we need it.
It seems like many of the use cases you describe can be addressed by the caller creating the labels in a more sophisticated way and then grouping the results once they are returned. There's no reason the caller couldn't create labels such that the training batch is decomposed into k labels that all get mapped to the same GPU, with the results combined by the caller. That allows for construction of the trees, and it allows the same seed to be repeated within a batch while still differentiating which tree came from which seed.
The beauty there is that the responsibility for managing the sophistication lies in the caller rather than in the library function.
I don't know where the threshold is, but at some point there are enough vertices in a call to get the efficiencies we are looking for.
I guess if the memory footprint for the sampling is a significant percentage (say more than 10%) of the GPU's total memory, we may not be too far from saturating the GPU. And I agree that "This does suggest that the extra memory required to drive this function doesn't need to be a driving factor in the design of the API." We still need to be frugal with memory, but it should not be the #1 priority when we need to make trade-offs.
Regarding the labels discussion. Part of the reason I went with the name "label" rather than "batch" was this notion that batches might not be the only reason for labeling the data.
There's absolutely no reason that the sampling code needs to understand anything about how the caller assigns labels and uses them.
Yes, now I agree with this. And I just want to emphasize that this API should allow creating one tree per seed within a batch, so we need to provide a mechanism to distinguish sampled edges from different seeds, and labels can serve this purpose (32 bits might be sufficient for the foreseeable future, but in a few GPU generations, especially with Grace Hopper-like systems, we may exceed the 32-bit boundary, so we may keep it as label_t rather than int32_t).
I will add a label_t and only instantiate int32_t for now.
…od sampling, default to current behavior
template <typename vertex_t, typename edge_t, bool store_transposed, bool multi_gpu>
rmm::device_uvector<vertex_t> select_random_vertices(
  raft::handle_t const& handle,

No need for this empty space.
Removed in next push
labels->end(),
[output_label = std::get<0>(*label_to_output_comm_rank),
 output_rank  = std::get<1>(*label_to_output_comm_rank)] __device__(auto val) {
  auto pos = thrust::find(thrust::seq, output_label.begin(), output_label.end(), val);
thrust::find is a linear scan; this can be expensive if output_label.size() is large. Better to use binary search (if output_labels can be sorted) or a hash table. You may look at prims/kv_store.cuh as well.
I meant to go back and look at that, I will fix in next push.
thrust::count_if(handle.get_thrust_policy(),
                 thrust::make_counting_iterator<size_t>(0),
                 thrust::make_counting_iterator<size_t>(labels->size()),
                 [d_labels = labels->data()] __device__(size_t idx) {
You can use the predefined is_first_in_run_t:
auto num_unique_majors =
thrust::count_if(rmm::exec_policy(loop_stream),
thrust::make_counting_iterator(size_t{0}),
thrust::make_counting_iterator(this_chunk_size),
is_first_in_run_t<vertex_t const*>{tmp_majors.data()});
Thanks for the link. I knew we did something like this in the renumbering code, but I wasn't able to find it.
Refactored in the next push to look like the linked renumbering code.
rmm::device_uvector<label_t> unique_labels(num_unique_labels, handle.get_stream());

thrust::copy_if(
  handle.get_thrust_policy(),
  thrust::make_zip_iterator(thrust::make_counting_iterator<size_t>(0), labels->begin()),
  thrust::make_zip_iterator(thrust::make_counting_iterator<size_t>(labels->size()),
                            labels->end()),
  thrust::make_zip_iterator(thrust::make_discard_iterator(), unique_labels.begin()),
  [d_labels = labels->data()] __device__(auto tuple) {
    size_t idx = thrust::get<0>(tuple);
    return (idx == 0) || (d_labels[idx - 1] != d_labels[idx]);
  });

thrust::transform(handle.get_thrust_policy(),
                  labels->begin(),
                  labels->end(),
                  labels->begin(),
                  [d_unique_labels = raft::device_span<label_t const>{
                    unique_labels.data(), unique_labels.size()}] __device__(label_t label) {
                    auto pos = thrust::lower_bound(
                      thrust::seq, d_unique_labels.begin(), d_unique_labels.end(), label);
                    return static_cast<label_t>(thrust::distance(d_unique_labels.begin(), pos));
                  });

offsets = detail::compute_sparse_offsets<size_t>(
  labels->begin(), labels->end(), size_t{0}, unique_labels.size(), handle.get_stream());
labels = std::move(unique_labels);
Can't you just use reduce_by_key? labels are already sorted and you want to compute the count of each label (you can compute offsets from the counts by running a scan).
Refactored in the next push to look like the linked renumbering code.
 * @return true if sorted, false if not sorted
 */
template <typename data_t>
bool is_span_sorted(raft::handle_t const& handle, raft::device_span<data_t> span);
Just a suggestion, but span in is_span_sorted sounds a bit redundant. This takes a device_span, so it is clear that it checks whether a device_span is sorted even without span in the function name.
Fixed in next push.
thrust::tabulate(
  handle_->get_thrust_policy(),
  batch_number.begin(),
  batch_number.end(),
  [rank = handle_->get_comms().get_rank(),
   batch_size = uniform_neighbor_sampling_usecase.batch_size] __device__(int32_t index) {
    return rank + (index / batch_size);
  });

size_t num_batches = 1 + handle_->get_comms().get_rank() +
                     (batch_number.size() / uniform_neighbor_sampling_usecase.batch_size);
num_batches = cugraph::host_scalar_allreduce(
  handle_->get_comms(), num_batches, raft::comms::op_t::MAX, handle_->get_stream());

rmm::device_uvector<int32_t> unique_batches(num_batches, handle_->get_stream());
rmm::device_uvector<int32_t> comm_ranks(num_batches, handle_->get_stream());

cugraph::detail::sequence_fill(
  handle_->get_stream(), unique_batches.data(), unique_batches.size(), int32_t{0});
thrust::tabulate(handle_->get_thrust_policy(),
                 batch_number.begin(),
                 batch_number.end(),
                 [batch_size = uniform_neighbor_sampling_usecase.batch_size] __device__(
                   int32_t index) { return index / batch_size; });
thrust::tabulate(handle_->get_thrust_policy(),
                 comm_ranks.begin(),
                 comm_ranks.end(),
                 [num_gpus = handle_->get_comms().get_size()] __device__(auto index) {
                   return index % num_gpus;
                 });
Is this logic correct?
After
thrust::tabulate(
handle_->get_thrust_policy(),
batch_number.begin(),
batch_number.end(),
[rank = handle_->get_comms().get_rank(),
batch_size = uniform_neighbor_sampling_usecase.batch_size] __device__(int32_t index) {
return rank + (index / batch_size);
});
Batch IDs in each GPU will be [rank, rank + (local_number_of_sources + (batch_size - 1)) / batch_size).
Say local_number_of_sources = 100 and batch_size = 10.
GPU 0: batch IDs will be [0, 10)
GPU 1: batch IDs will be [1, 11)
Isn't num_batches the number of unique batch IDs? In the case above, this should be 11 (0, 1, 2, ... , 10), right?
Not sure how
size_t num_batches = 1 + handle_->get_comms().get_rank() +
(batch_number.size() / uniform_neighbor_sampling_usecase.batch_size);
num_batches = cugraph::host_scalar_allreduce(
handle_->get_comms(), num_batches, raft::comms::op_t::MAX, handle_->get_stream());
computes the number of unique batches.
And shouldn't batch_size be globally enforced for each batch ID? Based on this logic, batch_size will be a local batch size, and the global batch size for each batch ID will grow with the number of GPUs. Am I missing something?
Yes, this logic is faulty. I will reconsider.
Restructured this code in latest push. Batch sizes should now be correct except for the last batch which might be small.
The new update looks good to me.
num_batches] __device__(int32_t index) { return (seed_offset + index) % num_batches; });

rmm::device_uvector<int32_t> unique_batches(num_batches, handle_->get_stream());
rmm::device_uvector<int32_t> comm_ranks(num_batches, handle_->get_stream());
So, are we replicating the same unique_batches and comm_ranks on every GPU?
I might be thinking too far, but this can be a problem if # labels is very large. May not need to address in this PR, but something we need to at least start thinking about.
if TYPE_CHECKING:
    from cugraph import Graph

src_n = "sources"
no code change
We really need a utility class with all the standard column names.
/merge |
Recent discussion has identified some performance issues in end-to-end testing of the MG sampling code.
This PR implements changes to the API to improve the end-to-end performance.