
[WIP][Java] Exposing CAGRA graph #1102

Closed

chatman wants to merge 2 commits into rapidsai:pull-request/1086 from SearchScale:ishan/exposed-cagra-graph

Conversation

@chatman chatman commented Jul 10, 2025

Initial attempt at exposing the CAGRA graph via the Java API. This is currently based on @benfred's PR; I'll change the base branch to branch-25.08 once #1086 is merged there.

Note: Introduced int[][] getGraph() in CagraIndex for now. We should revisit the performance implications of this, and possibly avoid this two-dimensional on-heap array. Just wanted to try something out for testing the functionality.

copy-pr-bot bot commented Jul 10, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

// Prepare dataset tensor
long[] datasetShape = {rows, cols};
MemorySegment datasetTensor =
prepareTensor(resources.getArena(), dataSeg, datasetShape, 2, 32, 2, 2, 1);
You can/should use the localArena here
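As a sketch of what the reviewer is suggesting (the class and method names here are illustrative, not code from this PR), a short-lived confined `Arena` scopes native allocations to the call and frees them deterministically when the try block exits, instead of tying them to the lifetime of the shared resources arena:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class LocalArenaSketch {
    // Copies the input into off-heap memory owned by a local confined arena,
    // sums it back, and releases the native memory on exit from the try block.
    static long sumWithLocalArena(int[] values) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment seg = arena.allocate(ValueLayout.JAVA_INT, values.length);
            for (int i = 0; i < values.length; i++) {
                seg.setAtIndex(ValueLayout.JAVA_INT, i, values[i]);
            }
            long sum = 0;
            for (int i = 0; i < values.length; i++) {
                sum += seg.getAtIndex(ValueLayout.JAVA_INT, i);
            }
            return sum;
        } // off-heap memory is freed deterministically here
    }
}
```

A resources-owned arena, by contrast, keeps every allocation alive until the whole resources object is closed, which is why scratch tensors are better placed in a local arena.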

@ldematte ldematte left a comment

FYI, I have something similar but without the on-heap array: #1105
I think we should combine the two PRs: I like how you extended the CagraRandomizedIT, but I have a preference for letting the C API deal with GPU memory (IMO it is cleaner, and in at least one case it avoids a copy in GPU memory), unless there is a real need/use case for doing it in Java.

prepareTensor(resources.getArena(), dataSeg, datasetShape, 2, 32, 2, 2, 1);

// Prepare graph tensor
Arena arena = resources.getArena();

Same, the resources arena will go away

}

// Allocate device memory for the graph
MemorySegment graphD = arena.allocate(C_POINTER);

It's interesting that you are copying things to device memory here, but I don't think it is needed: cuvsCagraIndexFromGraph works with host memory and does the copy itself (and maybe a tiny bit more efficiently).


// Allocate memory for the graph
long graphElements = (long) size * graphDegree;
Arena arena = resources.getArena();

Same here, this can be local

MemorySegment graphMemorySegment = arena.allocate(graphSequenceLayout);

// Allocate device memory for the graph
MemorySegment graphD = arena.allocate(C_POINTER);

Same here, I think cuvsCagraIndexGetGraph does this for you; this time even more important, as (I think) this will save a GPU memory copy.

// Convert to 2D int array
int[][] graph = new int[size][graphDegree];
for (int i = 0; i < size; i++) {
for (int j = 0; j < graphDegree; j++) {

Nit: this can be done "by-row" more efficiently with MemorySegment.copy
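To illustrate the nit (the helper names below are hypothetical, not code from this PR), `MemorySegment.copy` can bulk-copy each row of a row-major native graph into the on-heap array in one call per row, rather than one `getAtIndex` call per element:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class GraphCopySketch {
    // Copies a row-major (size x degree) int graph from native memory into an
    // on-heap int[][], one bulk MemorySegment.copy per row.
    static int[][] toHeap(MemorySegment graphSeg, int size, int degree) {
        int[][] graph = new int[size][degree];
        for (int i = 0; i < size; i++) {
            long rowOffsetBytes = (long) i * degree * Integer.BYTES;
            MemorySegment.copy(graphSeg, ValueLayout.JAVA_INT, rowOffsetBytes,
                               graph[i], 0, degree);
        }
        return graph;
    }

    // Builds a small native graph where entry (i, j) == i * degree + j,
    // then converts it to an on-heap array for inspection.
    static int[][] demo(int size, int degree) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment seg = arena.allocate(ValueLayout.JAVA_INT, (long) size * degree);
            for (long k = 0; k < (long) size * degree; k++) {
                seg.setAtIndex(ValueLayout.JAVA_INT, k, (int) k);
            }
            return toHeap(seg, size, degree);
        }
    }
}
```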

CuVSResourcesImpl resources) {
this.memorySegment = indexMemorySegment;
this.dataset = dataset;
this.graphDevicePointer = graphDevicePointer;

I think this can be avoided if you don't allocate GPU graph memory yourself, keeping things easier/tidier (but better to double-check).

@cjnolet added the `improvement` (Improves an existing functionality) and `non-breaking` (Introduces a non-breaking change) labels on Jul 11, 2025
chatman commented Jul 14, 2025

> FYI, I have something similar but without the on-heap array: #1105
> I think we should combine the 2 PRs

@ldematte That's fantastic. We can proceed via your PR (#1105) and close this one (#1102). I opened this draft PR even before #1086 was merged so that @punAhuja could test the end-to-end workflow for building an HNSW graph in Lucene. He reported that it is working for him :-)

@chatman chatman closed this Jul 14, 2025
rapids-bot bot pushed a commit that referenced this pull request Jul 26, 2025
In #902 and #1034 we introduced a `Dataset` interface to support on-heap and off-heap ("native") memory seamlessly as inputs for cagra and bruteforce index building.

As we expand the functionality of cuvs-java, we realized we had similar needs for outputs (see e.g. #1105 / #1102 or #1104).

This PR extends `Dataset` to support being used as an output, wrapping native (off-heap) memory in a convenient and efficient way, and providing common utilities to transform to and from on-heap memory.
This work is inspired by the existing raft `mdspan` and `DLTensor` data structures, but tailored to our needs (2d only, just 3 data types, etc.). The PR keeps the current implementation simple and minimal on purpose, but structures it in a way that is easy to extend.

By itself, the PR is just a refactoring to extend the `Dataset` implementation and reorganize the implementation classes; its real usefulness will be in using it in the PRs mentioned above (in fact, this PR has been extracted from #1105).
The implementation class hierarchy is designed with future extensions in mind: at the moment we have a single `HostMemoryDatasetImpl`, but we are already planning a corresponding `DeviceMemoryDatasetImpl` that will wrap and manage views on GPU memory, avoiding (in some cases) extra copies of data from GPU memory to CPU memory just to process it or forward it to another algorithm (e.g. quantization followed by indexing).

Future work will also include support for allocating and managing GPU memory and DLTensors (e.g. working better with, or refactoring, `prepareTensor`).
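As a rough sketch of the output idea described above (all names here, `OutputDataset`, `toHeap`, etc., are hypothetical and not the actual cuvs-java `Dataset` API), a dataset wrapping off-heap memory as an output might own a native buffer that a C API writes into, plus a utility to convert it to on-heap arrays:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Illustrative only: sketches wrapping native (off-heap) memory as an output
// and providing a transform back to on-heap memory, as the PR description says.
public class OutputDataset implements AutoCloseable {
    private final Arena arena;
    private final MemorySegment segment;
    private final int rows, cols;

    public OutputDataset(int rows, int cols) {
        this.arena = Arena.ofConfined();
        this.rows = rows;
        this.cols = cols;
        // Native row-major float buffer a C API could write results into.
        this.segment = arena.allocate(ValueLayout.JAVA_FLOAT, (long) rows * cols);
    }

    public MemorySegment segment() { return segment; }

    // Copies the off-heap result into an on-heap array, one row per bulk copy.
    public float[][] toHeap() {
        float[][] out = new float[rows][cols];
        for (int i = 0; i < rows; i++) {
            MemorySegment.copy(segment, ValueLayout.JAVA_FLOAT,
                               (long) i * cols * Float.BYTES, out[i], 0, cols);
        }
        return out;
    }

    @Override public void close() { arena.close(); }
}
```

A device-memory variant would keep the same interface but hold a GPU pointer instead of an arena-backed segment, which is what makes the "avoid extra GPU-to-CPU copies" extension possible.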

Authors:
  - Lorenzo Dematté (https://github.com/ldematte)
  - MithunR (https://github.com/mythrocks)

Approvers:
  - MithunR (https://github.com/mythrocks)

URL: #1111
lowener pushed a commit to lowener/cuvs that referenced this pull request Aug 11, 2025
enp1s0 pushed a commit to enp1s0/cuvs that referenced this pull request Aug 22, 2025

Labels

improvement (Improves an existing functionality), non-breaking (Introduces a non-breaking change)

3 participants