
[WIP][Java] Exposing CAGRA graph #1102

Closed

chatman wants to merge 2 commits into rapidsai:pull-request/1086 from SearchScale:ishan/exposed-cagra-graph

Conversation

@chatman chatman commented Jul 10, 2025

Initial attempt at exposing the CAGRA graph via the Java API. This is currently based on @benfred's PR; I'll change the base branch to branch-25.08 once #1086 is merged there.

Note: Introduced int[][] getGraph() in CagraIndex for now. We should revisit the performance implications of this, and possibly avoid this two-dimensional on-heap array. Just wanted to try something out for testing the functionality.

copy-pr-bot bot commented Jul 10, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

// Prepare dataset tensor
long[] datasetShape = {rows, cols};
MemorySegment datasetTensor =
prepareTensor(resources.getArena(), dataSeg, datasetShape, 2, 32, 2, 2, 1);
You can/should use the localArena here
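As a sketch of what the reviewer is suggesting (the class and method names here are illustrative, not code from this PR), a short-lived confined `Arena` scopes native allocations to the call and frees them deterministically when the try block exits, instead of tying them to the lifetime of the shared resources arena:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class LocalArenaSketch {
    // Copies the input into off-heap memory owned by a local confined arena,
    // sums it back, and releases the native memory on exit from the try block.
    static long sumWithLocalArena(int[] values) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment seg = arena.allocate(ValueLayout.JAVA_INT, values.length);
            for (int i = 0; i < values.length; i++) {
                seg.setAtIndex(ValueLayout.JAVA_INT, i, values[i]);
            }
            long sum = 0;
            for (int i = 0; i < values.length; i++) {
                sum += seg.getAtIndex(ValueLayout.JAVA_INT, i);
            }
            return sum;
        } // off-heap memory is freed deterministically here
    }
}
```

A resources-owned arena, by contrast, keeps every allocation alive until the whole resources object is closed, which is why scratch tensors are better placed in a local arena.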

@ldematte ldematte left a comment

FYI, I have something similar but without the on-heap array: #1105
I think we should combine the two PRs: I like how you extended the CagraRandomizedIT, but I have a preference for letting the C API deal with GPU memory (IMO it is cleaner, and in at least one case it avoids a copy in GPU memory), unless there is a real need/use case for doing it in Java.

prepareTensor(resources.getArena(), dataSeg, datasetShape, 2, 32, 2, 2, 1);

// Prepare graph tensor
Arena arena = resources.getArena();

Same, the resources arena will go away

}

// Allocate device memory for the graph
MemorySegment graphD = arena.allocate(C_POINTER);

It's interesting that you are copying things to device memory here, but I don't think it is needed: cuvsCagraIndexFromGraph works with host memory and does the copy itself (and maybe a tiny bit more efficiently).


// Allocate memory for the graph
long graphElements = (long) size * graphDegree;
Arena arena = resources.getArena();

Same here, this can be local

MemorySegment graphMemorySegment = arena.allocate(graphSequenceLayout);

// Allocate device memory for the graph
MemorySegment graphD = arena.allocate(C_POINTER);

Same here, I think cuvsCagraIndexGetGraph does this for you; this time even more important, as (I think) this will save a GPU memory copy.

// Convert to 2D int array
int[][] graph = new int[size][graphDegree];
for (int i = 0; i < size; i++) {
for (int j = 0; j < graphDegree; j++) {

Nit: this can be done "by-row" more efficiently with MemorySegment.copy
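To illustrate the nit (the helper names below are hypothetical, not code from this PR), `MemorySegment.copy` can bulk-copy each row of a row-major native graph into the on-heap array in one call per row, rather than one `getAtIndex` call per element:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class GraphCopySketch {
    // Copies a row-major (size x degree) int graph from native memory into an
    // on-heap int[][], one bulk MemorySegment.copy per row.
    static int[][] toHeap(MemorySegment graphSeg, int size, int degree) {
        int[][] graph = new int[size][degree];
        for (int i = 0; i < size; i++) {
            long rowOffsetBytes = (long) i * degree * Integer.BYTES;
            MemorySegment.copy(graphSeg, ValueLayout.JAVA_INT, rowOffsetBytes,
                               graph[i], 0, degree);
        }
        return graph;
    }

    // Builds a small native graph where entry (i, j) == i * degree + j,
    // then converts it to an on-heap array for inspection.
    static int[][] demo(int size, int degree) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment seg = arena.allocate(ValueLayout.JAVA_INT, (long) size * degree);
            for (long k = 0; k < (long) size * degree; k++) {
                seg.setAtIndex(ValueLayout.JAVA_INT, k, (int) k);
            }
            return toHeap(seg, size, degree);
        }
    }
}
```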

CuVSResourcesImpl resources) {
this.memorySegment = indexMemorySegment;
this.dataset = dataset;
this.graphDevicePointer = graphDevicePointer;

I think this can be avoided if you don't allocate GPU graph memory yourself, keeping things easier/tidier (but better to double-check).

@cjnolet added the `improvement` (Improves an existing functionality) and `non-breaking` (Introduces a non-breaking change) labels on Jul 11, 2025
chatman commented Jul 14, 2025

> FYI, I have something similar but without the on-heap array: #1105
> I think we should combine the 2 PRs

@ldematte That's fantastic. We can proceed via your PR (#1105) and close this one (#1102). I opened this draft PR even before #1086 was merged so that @punAhuja could test the end-to-end workflow for building an HNSW graph in Lucene. He reported that it is working for him :-)

@chatman chatman closed this Jul 14, 2025
rapids-bot bot pushed a commit that referenced this pull request Jul 26, 2025
In #902 and #1034 we introduced a `Dataset` interface to support on-heap and off-heap ("native") memory seamlessly as inputs for cagra and bruteforce index building.

As we expand the functionality of cuvs-java, we realized we had similar needs for outputs (see e.g. #1105 / #1102 or #1104).

This PR extends `Dataset` to support being used as an output, wrapping native (off-heap) memory in a convenient and efficient way, and providing common utilities to transform to and from on-heap memory.
This work is inspired by the existing raft `mdspan` and `DLTensor` data structures, but tailored to our needs (2d only, just 3 data types, etc.). The PR keeps the current implementation simple and minimal on purpose, but structures it in a way that is easy to extend.

By itself, the PR is just a refactoring to extend the `Dataset` implementation and reorganize the implementation classes; its real usefulness will be in using it in the PRs mentioned above (in fact, this PR has been extracted from #1105).
The implementation class hierarchy is designed with future extensions in mind: at the moment we have a single `HostMemoryDatasetImpl`, but we are already planning a corresponding `DeviceMemoryDatasetImpl` that will wrap and manage views on GPU memory, avoiding (in some cases) extra copies of data from GPU memory to CPU memory just to process it or forward it to another algorithm (e.g. quantization followed by indexing).

Future work will also include support for allocating and managing GPU memory and DLTensors (e.g. working better with, or refactoring, `prepareTensor`).
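As a rough sketch of the output idea described above (all names here, `OutputDataset`, `toHeap`, etc., are hypothetical and not the actual cuvs-java `Dataset` API), a dataset wrapping off-heap memory as an output might own a native buffer that a C API writes into, plus a utility to convert it to on-heap arrays:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Illustrative only: sketches wrapping native (off-heap) memory as an output
// and providing a transform back to on-heap memory, as the PR description says.
public class OutputDataset implements AutoCloseable {
    private final Arena arena;
    private final MemorySegment segment;
    private final int rows, cols;

    public OutputDataset(int rows, int cols) {
        this.arena = Arena.ofConfined();
        this.rows = rows;
        this.cols = cols;
        // Native row-major float buffer a C API could write results into.
        this.segment = arena.allocate(ValueLayout.JAVA_FLOAT, (long) rows * cols);
    }

    public MemorySegment segment() { return segment; }

    // Copies the off-heap result into an on-heap array, one row per bulk copy.
    public float[][] toHeap() {
        float[][] out = new float[rows][cols];
        for (int i = 0; i < rows; i++) {
            MemorySegment.copy(segment, ValueLayout.JAVA_FLOAT,
                               (long) i * cols * Float.BYTES, out[i], 0, cols);
        }
        return out;
    }

    @Override public void close() { arena.close(); }
}
```

A device-memory variant would keep the same interface but hold a GPU pointer instead of an arena-backed segment, which is what makes the "avoid extra GPU-to-CPU copies" extension possible.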

Authors:
  - Lorenzo Dematté (https://github.com/ldematte)
  - MithunR (https://github.com/mythrocks)

Approvers:
  - MithunR (https://github.com/mythrocks)

URL: #1111
lowener pushed a commit to lowener/cuvs that referenced this pull request Aug 11, 2025
enp1s0 pushed a commit to enp1s0/cuvs that referenced this pull request Aug 22, 2025

Labels

improvement (Improves an existing functionality), non-breaking (Introduces a non-breaking change)

3 participants