[Java] New off-heap Dataset support for CAGRA and Bruteforce by chatman · Pull Request #902 · rapidsai/cuvs

chatman · 2025-05-15T20:24:34Z

As reported in #698, the current `withDataset(float[][] arr)` requires the entire dataset to be copied onto the heap first, before it is written out to the MemorySegment.

This PR introduces a new `Dataset` (interface and implementation) with an `addVector(float[] vector)` method for adding vectors to the MemorySegment one by one, without needing to load them all at once.
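
The one-vector-at-a-time idea can be sketched roughly as below. This is only an illustration, not the cuvs-java implementation: `OffHeapVectorBuffer` and all of its methods are hypothetical names, and it uses a direct `ByteBuffer` as a stand-in for the `MemorySegment` the PR writes to, so that it also runs on JDKs without the finalized foreign-memory API.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

/** Hypothetical sketch: a fixed-capacity, row-major vector store kept outside the Java heap. */
final class OffHeapVectorBuffer {
  private final int capacity;     // maximum number of vectors
  private final int dimensions;   // length of every vector
  private final FloatBuffer data; // float view over direct (off-heap) memory
  private int size;               // vectors added so far

  OffHeapVectorBuffer(int capacity, int dimensions) {
    this.capacity = capacity;
    this.dimensions = dimensions;
    int bytes = Math.multiplyExact(Math.multiplyExact(capacity, dimensions), Float.BYTES);
    // allocateDirect reserves the backing memory outside the Java heap.
    this.data = ByteBuffer.allocateDirect(bytes).order(ByteOrder.nativeOrder()).asFloatBuffer();
  }

  /** Copies one vector into the next free row; only this single vector has to be on the heap. */
  void addVector(float[] vector) {
    if (vector.length != dimensions) {
      throw new IllegalArgumentException("expected " + dimensions + " values, got " + vector.length);
    }
    if (size == capacity) {
      throw new IllegalStateException("buffer is full");
    }
    data.put(vector); // relative bulk put: appends at the current position
    size++;
  }

  /** Reads a single value back from off-heap memory. */
  float get(int row, int column) {
    return data.get(row * dimensions + column); // absolute read, does not move the position
  }

  int size() {
    return size;
  }

  public static void main(String[] args) {
    OffHeapVectorBuffer buffer = new OffHeapVectorBuffer(1_000, 4);
    float[] scratch = new float[4]; // reused for every vector, so heap usage stays constant
    for (int i = 0; i < 1_000; i++) {
      for (int j = 0; j < scratch.length; j++) {
        scratch[j] = i + j / 10f;
      }
      buffer.addVector(scratch);
    }
    System.out.println("stored " + buffer.size() + " vectors off-heap");
  }
}
```

The caller can reuse one small `float[]` while streaming vectors from a file or another source, so the full `float[][]` never has to exist on the heap.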

copy-pr-bot · 2025-05-15T20:24:38Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

chatman · 2025-05-15T20:25:29Z

FYI, @ChrisHegarty @narangvivek10.

java/cuvs-java/src/main/java/com/nvidia/cuvs/Dataset.java

java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/DatasetImpl.java

java/cuvs-java/src/main/java/com/nvidia/cuvs/CagraIndex.java

narangvivek10 · 2025-05-26T22:59:39Z

Also, maybe update the title to [Java] New off-heap Dataset support for CAGRA and Bruteforce. Thanks!

java/cuvs-java/src/test/java/com/nvidia/cuvs/BruteForceRandomizedIT.java

chatman · 2025-05-27T18:41:44Z

@narangvivek10 @punAhuja Incorporated your feedback, thanks! Can you approve the PR?
@cjnolet I think this is ready for CI testing/merge now.

cjnolet · 2025-05-27T18:55:39Z

/ok to test 7a75d9a

…[][]

Co-authored-by: MithunR <mythrocks@gmail.com>

chatman · 2025-05-27T22:11:43Z

Thanks @mythrocks , incorporated your suggestions.

I think we need to add a style check for the Java project and standardize tabs vs spaces in a separate PR. This codebase is a mix right now :-(

chatman · 2025-05-27T22:14:27Z

@mythrocks can you please trigger a CI run?

chatman · 2025-05-28T16:52:32Z

@cjnolet Can you please review and merge this?

mythrocks

LGTM.

cjnolet · 2025-05-28T21:45:50Z

/ok to test 6148cff

cjnolet · 2025-05-28T23:38:34Z

/merge

ChrisHegarty

Belated LGTM. ❤️

…i#902)

As reported in rapidsai#698, current `withDataset(float[][] arr)` requires the entire dataset to be copied in heap first, before writing out the MemorySegment for it.

Introducing a new `Dataset` (interface and impl) support with a `addVector(float[] vector)` support for adding the vectors into the MemorySegment one by one, without needing to load them all at once.

Authors:
- Ishan Chattopadhyaya (https://github.com/chatman)
- Vivek Narang (https://github.com/narangvivek10)

Approvers:
- MithunR (https://github.com/mythrocks)
- Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#902

chatman · 2025-06-13T14:11:18Z

That's a good idea. Please feel free to rig up a PR. Thanks!

…

On Fri, 13 Jun, 2025, 7:28 pm Lorenzo Dematté wrote:

In java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/BruteForceIndexImpl.java:

> + private float[][] vectors;
> + private Dataset dataset;

So, externally you can provide data either as a float[][] or a Dataset, but internally it will just be a Dataset.
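
The suggestion of accepting both input forms while storing only one could look roughly like the sketch below. All names here (`VectorDataset`, `ListBackedDataset`, `IndexBuilderSketch`) are hypothetical and chosen for illustration; they are not the cuvs-java classes, and the on-heap list is used only to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a dataset abstraction; not the cuvs-java interface. */
interface VectorDataset {
  void addVector(float[] vector);

  int size();

  int dimensions();
}

/** On-heap implementation, used only to keep this sketch self-contained. */
final class ListBackedDataset implements VectorDataset {
  private final List<float[]> rows = new ArrayList<>();
  private final int dimensions;

  ListBackedDataset(int dimensions) {
    this.dimensions = dimensions;
  }

  @Override
  public void addVector(float[] vector) {
    if (vector.length != dimensions) {
      throw new IllegalArgumentException("dimension mismatch");
    }
    rows.add(vector.clone()); // defensive copy so later caller edits do not leak in
  }

  @Override
  public int size() {
    return rows.size();
  }

  @Override
  public int dimensions() {
    return dimensions;
  }
}

/** Hypothetical builder: two public entry points, one internal representation. */
final class IndexBuilderSketch {
  private VectorDataset dataset; // the only field; no separate float[][] is kept

  IndexBuilderSketch withDataset(VectorDataset dataset) {
    this.dataset = dataset;
    return this;
  }

  /** Convenience overload: wraps the array and delegates to the dataset overload. */
  IndexBuilderSketch withDataset(float[][] vectors) {
    VectorDataset wrapped = new ListBackedDataset(vectors.length == 0 ? 0 : vectors[0].length);
    for (float[] vector : vectors) {
      wrapped.addVector(vector);
    }
    return withDataset(wrapped);
  }

  VectorDataset dataset() {
    return dataset;
  }
}
```

With this shape the index-building code has a single code path, and the `float[][]` overload becomes a thin adapter.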

This PR is a follow-up from #902. Still WIP (see self-comments on the changes) but I'd like some early feedback.

Authors:
- Lorenzo Dematté (https://github.com/ldematte)
- MithunR (https://github.com/mythrocks)

Approvers:
- Chris Hegarty (https://github.com/ChrisHegarty)
- MithunR (https://github.com/mythrocks)

URL: #1024

In #902 and #1034 we introduced a `Dataset` interface to support on-heap and off-heap ("native") memory seamlessly as inputs for CAGRA and brute-force index building. As we expand the functionality of cuvs-java, we realized we have similar needs for outputs (see e.g. #1105 / #1102 or #1104).

This PR extends `Dataset` so that it can be used as an output: it wraps native (off-heap) memory in a convenient and efficient way and provides common utilities to convert to and from on-heap memory. This work is inspired by the existing raft `mdspan` and `DLTensor` data structures, but tailored to our needs (2D only, just 3 data types, etc.). The PR keeps the current implementation simple and minimal on purpose, but structures it so that it is easy to extend.

By itself, the PR is just a refactoring that extends the `Dataset` implementation and reorganizes the implementation classes; its real usefulness will come from using it in the PRs mentioned above (in fact, this PR was extracted from #1105).

The implementation class hierarchy is designed with future extensions in mind: at the moment we have one `HostMemoryDatasetImpl`, but we are already thinking of adding a corresponding `DeviceMemoryDatasetImpl` that will wrap and manage (views on) GPU memory, to avoid (in some cases) extra copies of data from GPU memory to CPU memory only to process them or forward them to another algorithm (e.g. quantization followed by indexing). Future work will also include adding support for, or refactoring, the allocation and management of GPU memory and DLTensors (e.g. working better with, or refactoring, `prepareTensor`).

Authors:
- Lorenzo Dematté (https://github.com/ldematte)
- MithunR (https://github.com/mythrocks)

Approvers:
- MithunR (https://github.com/mythrocks)

URL: #1111
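
To make the "wrap native memory and convert to and from on-heap memory" idea concrete, here is a minimal sketch for a 2D float matrix. `HostMatrixSketch`, `fromArray` and `toArray` are hypothetical names, not the `HostMemoryDatasetImpl` API, and a direct `ByteBuffer` stands in for the native memory segment the real implementation wraps.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

/** Hypothetical sketch of a row-major 2D float matrix held in off-heap memory. */
final class HostMatrixSketch {
  private final int rows;
  private final int columns;
  private final FloatBuffer data; // float view over direct (off-heap) memory

  private HostMatrixSketch(int rows, int columns) {
    this.rows = rows;
    this.columns = columns;
    int bytes = Math.multiplyExact(Math.multiplyExact(rows, columns), Float.BYTES);
    this.data = ByteBuffer.allocateDirect(bytes).order(ByteOrder.nativeOrder()).asFloatBuffer();
  }

  /** On-heap to off-heap: copies a rectangular array row by row. */
  static HostMatrixSketch fromArray(float[][] source) {
    int rows = source.length;
    int columns = rows == 0 ? 0 : source[0].length;
    HostMatrixSketch matrix = new HostMatrixSketch(rows, columns);
    for (float[] row : source) {
      if (row.length != columns) {
        throw new IllegalArgumentException("all rows must have the same length");
      }
      matrix.data.put(row); // relative bulk put, advances the write position
    }
    return matrix;
  }

  /** Off-heap to on-heap: copies the contents into a fresh array. */
  float[][] toArray() {
    float[][] result = new float[rows][columns];
    FloatBuffer view = data.duplicate(); // shares the memory, has its own position
    view.rewind();
    for (float[] row : result) {
      view.get(row); // relative bulk get, fills one row at a time
    }
    return result;
  }

  int rows() {
    return rows;
  }

  int columns() {
    return columns;
  }
}
```

A device-memory counterpart would presumably expose the same shape information while keeping the data on the GPU, which is what lets callers chain algorithms without a round trip through host memory.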

Java: New off-heap Dataset support for CAGRA and Brute Force

297fe51

cjnolet assigned chatman May 15, 2025

cjnolet added the improvement (Improves an existing functionality), non-breaking (Introduces a non-breaking change), and Java labels May 15, 2025

cjnolet added this to Vector Search, ML, & Data Mining Release Board May 15, 2025

cjnolet moved this to In Progress in Vector Search, ML, & Data Mining Release Board May 15, 2025

chatman added 2 commits May 16, 2025 02:25

Merge branch 'branch-25.06' into ishan/new-dataset-method

fa1e2e4

Merge branch 'branch-25.06' into ishan/new-dataset-method

9241eed

chatman marked this pull request as ready for review May 23, 2025 16:27