API improvements for end-to-end MG sampling performance #3269
Merged: rapids-bot merged 24 commits into rapidsai:branch-23.04 from ChuckHastings:uniform_neighbor_sampling_tuning on Mar 22, 2023.
Commits (24):
- fe5c3fd (ChuckHastings): Propose new API to improve efficient of MG sampling in end-to-end wor…
- e9fcd31 (ChuckHastings): respond to PR review feedback
- ebb51a7 (ChuckHastings): Update API based on PR comments
- fd81f64 (ChuckHastings): implement new uniform neighborhood sampling API
- 12de841 (ChuckHastings): Merge branch 'branch-23.04' into uniform_neighbor_sampling_tuning
- a451397 (ChuckHastings): missed a few formatting issues
- 75e6acb (ChuckHastings): update pylibcugraph files to add new parameters to uniform neighborho…
- 38ff456 (ChuckHastings): some cleanup to make names consistent
- c4d5f0b (ChuckHastings): debug failing C API test
- 47ed81d (ChuckHastings): improve tests, some debugging
- e1c8298 (ChuckHastings): Merge branch 'branch-23.04' into uniform_neighbor_sampling_tuning
- a09713c (ChuckHastings): address some PR comments
- d04ac14 (ChuckHastings): address PR comments with new structure
- 994c52c (ChuckHastings): update names of parameters in PLC
- ff49397 (ChuckHastings): Merge branch 'branch-23.04' into uniform_neighbor_sampling_tuning
- 9461846 (ChuckHastings): fix formatting errors
- 5fd2d17 (ChuckHastings): address PR comments
- cca68de (ChuckHastings): rename is_span_sorted to is_sorted
- ccbb3d2 (ChuckHastings): add unit test to check shuffling, need to sort before shuffling
- 27a5647 (alexbarghi-nv): update and verify python tests
- 7f39f16 (ChuckHastings): refactor code to group test seeds into batches
- 99a7a98 (ChuckHastings): Merge branch 'branch-23.04' into uniform_neighbor_sampling_tuning
- b63f601 (alexbarghi-nv): Merge branch 'branch-23.04' into uniform_neighbor_sampling_tuning
- b575a5d (ChuckHastings): Merge branch 'branch-23.04' into uniform_neighbor_sampling_tuning
So, these are the requirements this API should satisfy: starting_vertices includes seeds from multiple batches; the caller should be able to separate sampling outputs from different batches, and the caller should be able to place sampling outputs on different GPUs based on a mapping between batch IDs and GPUs. If the caller is invoking this function for a single batch, the caller may want the sampling outputs for the local sampling_vertices to be stored locally as well.
I am thinking about the following. This function optionally takes rmm::device_uvector<label_t>&& labels and rmm::device_uvector<int>&& dst_comm_ranks (with labels.size() == dst_comm_ranks.size() == start_vertices.size()).

Here labels don't necessarily coincide with batch IDs; users may set labels to batch_id * batch_size + seed index within a batch (if the result needs to be sorted within a batch using seed IDs as a primary key and hops as a secondary key), or just batch_id (if there is no need to create a tree per seed). The output results are shuffled based on dst_comm_ranks for each starting vertex (if dst_comm_ranks is valid); otherwise, all the results for local seeds will be stored locally.

We return optional labels and hops for each edge datum (src, dst, optional (edge weight, ID, type)). Within each GPU, the output results will be sorted using labels as a primary key (if labels are provided in the input) and hops as a secondary key.

One concern is that computing a global "seed index within a batch" might be challenging for the caller if they want to distribute seeds for a single batch across multiple GPUs, but I guess this is not a common use case. If there are 100 batches and 10 GPUs, users may assign batches [0, 10) to GPU 0 (so start_vertices for GPU 0 holds all the seeds for batches [0, 10)), batches [10, 20) to GPU 1, and so on.

We may be able to reduce the output size if we don't return a label and hop per edge and instead create offset arrays per label & hop, but I am not sure the benefit outweighs the increase in complexity. If the memory footprint is a major concern, we may just reduce the number of batches we process per call (the main reason for processing multiple batches in a single call is to saturate the GPUs, but if we're hitting the memory limit, we may be going too far in that direction).

Is there anything I'm missing, or any other thoughts?
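The proposed semantics can be sketched on the host side in plain Python (this is a conceptual illustration of the contract being discussed, not the cuGraph implementation; the field names and `shuffle_and_sort` helper are made up for this sketch): each sampled edge inherits the label and destination rank of its starting vertex, results are shuffled to the rank given by dst_comm_ranks, and each GPU's results are sorted with label as the primary key and hop as the secondary key.

```python
def shuffle_and_sort(sampled_edges, n_ranks):
    """sampled_edges: list of dicts with keys src, dst, label, hop, dst_rank."""
    per_rank = {r: [] for r in range(n_ranks)}
    for e in sampled_edges:
        # "Shuffle": send each result to the rank requested for its seed.
        per_rank[e["dst_rank"]].append(e)
    for r in per_rank:
        # Within each GPU: label is the primary sort key, hop the secondary.
        per_rank[r].sort(key=lambda e: (e["label"], e["hop"]))
    return per_rank

# Labels encode batch_id * batch_size + seed_index, as suggested above.
batch_size = 2
edges = [
    {"src": 0, "dst": 5, "label": 1 * batch_size + 0, "hop": 1, "dst_rank": 1},
    {"src": 0, "dst": 3, "label": 0 * batch_size + 1, "hop": 0, "dst_rank": 0},
    {"src": 2, "dst": 7, "label": 0 * batch_size + 0, "hop": 0, "dst_rank": 0},
]
out = shuffle_and_sort(edges, n_ranks=2)
# Rank 0 now holds its two results ordered by label; rank 1 holds one result.
```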
This is a good point; there's a lot to unpack here. Let me create a few separate comments.
The original objective of processing multiple batches in a single call is to efficiently utilize the parallelism on the GPU. If we do individual calls with a small batch size then we don't get enough parallelism and the overhead of calling the function is large relative to the work accomplished. By processing multiple batches of seeds in the same call we increase the amount of parallel work to be done on the GPU and we reduce the number of times we face the overhead of calling the function, allocating memory, launching kernels, etc. This improves our overall efficiency.
I don't know where the threshold is, but at some point there are enough vertices in a call to get the efficiencies we are looking for. Doing larger numbers of batches in a single call should still drive down overhead, but the improvement should decrease dramatically once we saturate the GPU's parallelism. It seems likely (with the large graphs we are targeting) that this threshold is much smaller than "all seeds to be sampled in a training epoch".
In the current pipeline, the output from this processing is going to be moved out of GPU memory to be retrieved by the trainers as they work (currently written at the Python layer to Parquet files). As long as the first call processes enough batches to keep the trainers busy while we do a second call, there would be no drop in pipeline throughput from making multiple calls to the sampling code with smaller numbers of batches. That would reduce the memory requirement, and we could overlap the sampling with the training a bit more. It might actually improve the overall latency for training an epoch (although perhaps only marginally; I don't really know the performance numbers you're dealing with).
This does suggest that the extra memory required to drive this function doesn't need to be a driving factor in the design of the API.
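The overhead argument above can be captured in a toy cost model (the numbers are illustrative assumptions, not measured cuGraph figures): total epoch time is the number of calls times a fixed per-call overhead, plus the per-batch sampling work, so grouping batches into fewer calls amortizes the fixed overhead.

```python
def epoch_time(n_batches, batches_per_call, per_call_overhead, work_per_batch):
    """Toy model: fixed overhead per call plus linear work per batch."""
    n_calls = -(-n_batches // batches_per_call)  # ceiling division
    return n_calls * per_call_overhead + n_batches * work_per_batch

# 100 batches, hypothetical overhead of 5 time units per call, 1 unit of work per batch.
one_at_a_time = epoch_time(100, 1, per_call_overhead=5.0, work_per_batch=1.0)
ten_at_a_time = epoch_time(100, 10, per_call_overhead=5.0, work_per_batch=1.0)
# Grouping 10 batches per call cuts the overhead term by 10x; past GPU
# saturation, the work term dominates and further grouping stops helping.
```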
Regarding the labels discussion: part of the reason I went with the name "label" rather than "batch" was the notion that batches might not be the only reason for labeling the data.

There's absolutely no reason that the sampling code needs to understand anything about how the caller assigns labels and uses them. If we keep the label as a vector (labels.size() == start_vertices.size()) rather than switching to a CSR-style representation, that label can literally be any arbitrary 32-bit integer value (it doesn't need to be contiguous values starting from 0). Either route for organizing the result (mapping label to GPU ID, or providing a destination vector with dst_comm_ranks.size() == start_vertices.size()) allows us to shuffle the data to the correct GPU. We could talk about sorting options (beyond sorting by label) if that would be helpful, although that seems like an easy enough feature to add later when we need it.

It seems like many of the use cases you describe can be addressed by the caller creating the labels in a more sophisticated way and then grouping the results once they are returned. There's no reason the caller couldn't create labels such that the training batch is decomposed into k labels that all get mapped to the same GPU, with the results combined by the caller. That allows for construction of the trees, and it allows the same seed to be repeated within a batch while still differentiating which tree came from which seed.

The beauty there is that the responsibility for managing the sophistication lies with the caller rather than with the library function.
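The caller-side scheme described above can be sketched in plain Python (a hypothetical illustration; `make_label` and `group_by_label` are made-up helper names, and the real data would live in device memory): the caller encodes (batch_id, seed_index) into an opaque integer label before sampling, then groups the returned edges by label to rebuild one tree per seed.

```python
from collections import defaultdict

def make_label(batch_id, seed_index, batch_size):
    # Any encoding works; the label is opaque to the sampling library.
    return batch_id * batch_size + seed_index

def group_by_label(sampled_edges):
    """sampled_edges: iterable of (src, dst, label) tuples."""
    trees = defaultdict(list)
    for src, dst, label in sampled_edges:
        trees[label].append((src, dst))
    return dict(trees)

# Two seeds in batch 0 (batch_size 4): seed 0 produced a 2-hop chain,
# seed 1 produced a single edge. Grouping recovers one tree per seed.
edges = [(0, 3, make_label(0, 0, 4)),
         (3, 7, make_label(0, 0, 4)),
         (1, 2, make_label(0, 1, 4))]
trees = group_by_label(edges)
```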
I guess if the memory footprint for the sampling is a significant percentage (say, more than 10%) of the GPU's total memory, we may not be too far from saturating the GPU. And I agree that "the extra memory required to drive this function doesn't need to be a driving factor in the design of the API." We still need to be frugal with memory, but it should not be the #1 priority when we need to make trade-offs.
Yes, now I agree with this. And I just want to emphasize that this API should allow creating one tree per seed within a batch, so we need to provide a mechanism to distinguish sampled edges from different seeds, and labels can serve this purpose. (32 bits might be sufficient for the foreseeable future, but a few GPU generations from now, especially with Grace Hopper-like systems, we may exceed the 32-bit boundary, so we should keep it as label_t rather than int32_t.)
I will add a label_t and only instantiate int32_t for now.