Moving MG functions into unified API + raft::device_resources_snmg as device resource type for MG functions#454
Conversation
cpp/include/cuvs/neighbors/mg.hpp
Outdated
| * @return the constructed IVF-Flat MG index | ||
| */ | ||
| auto build(const raft::device_resources& handle, | ||
| auto build(const raft::device_resources_snmg& clique, |
There was a problem hiding this comment.
One of the main goals with having the "snmg" resources object match the single-gpu object is that we wanted to be able to remove this additional MG. We should now be able to accept device_resources and then check to see if a nccl clique has been set on it (which would imply that it's a multi-gpu resources object and not a single-gpu object). The whole goal with doing this was to consolidate code paths.
There was a problem hiding this comment.
The NCCL clique is not set as a resource anymore, but we should still be able to implement the dispatching by checking the dynamic type of the device_resources. The real question then is, do we truly want dispatching on both the regular API (cuvs::neighbors::build) and the mg namespace (cuvs::neighbors::mg::build)? It kind of make sense that a user providing a device_resources_snmg instance to the regular API (cuvs::neighbors::build) would want things to be deployed on multiple GPUs. However, the reverse is not necessarily true. A user who explicitly chose the mg namespace, but did not provide the adequate device_resources_snmg would fallback to single GPU, potentially unintentionally. Is this what we want?
I propose that we implement the dispatching mechanism solely on the regular API (cuvs::neighbors::build) in a dedicated follow-up PR? This also allows the MG doc to explicitly avert users that they should use an adequate device_resources_snmg to use the MG API. What do you think?
|
Not sure why CI is still failing. The RAFT PR was merged quite awhile ago but I'm still seeing errors like this: |
|
This fell back to old 24.12 RAFT packages for unclear reasons: https://github.com/rapidsai/cuvs/actions/runs/12818869804/job/35749392606?pr=454#step:8:521 |
Oh fooey! Thanks for noticing that @bdice, I just realized the darn target branch was never updated! |
6ae1c70 to
b0c7ab9
Compare
raft::device_resources_snmg as device resource type for MG functions
|
/ok to test |
@cjnolet, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/ |
|
/ok to test |
@cjnolet, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/ |
|
/ok to test da48376 |
|
/merge |
|
/ok to test da48 |
|
/ok to test da48376 |
@cjnolet, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/ |
|
/ok to test d2e2be9 |
|
/merge |
92ced86 to
1e1d97e
Compare
|
/ok to test |
@viclafargue, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/ |
|
/ok to test |
@achirkin, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/ |
|
/ok to test 1e1d97e |
|
/ok to test 1e1d97e |
Co-authored-by: Artem M. Chirkin <9253178+achirkin@users.noreply.github.com>
|
/ok to test 34d67ee |
|
/merge |
raft::device_resources_snmgas device resource type for MG functionsraft::comms::build_comms_nccl_onlysolving [FEA] Remove dependency ofbuild_comms_nccl_onlyon UCP raft#2465raft::device_resources_snmgis introduced in this PR : rapidsai/raft#2549