feat(index): add streaming ivf kmeans training #6913
Open
BubbleCal wants to merge 4 commits into
Summary
Adds optional streaming IVF kmeans training controls so very large IVF partition counts can train with bounded peak raw-vector memory while keeping the existing non-streaming path unchanged by default.
This PR introduces LanceStream: a streaming coreset + weighted hierarchical kmeans training path for IVF. The path is enabled only when `streaming_sample_rate` is configured below the existing `sample_rate`; otherwise the current trainer is used.
Feature
- Adds `streaming_sample_rate`, `streaming_coreset_rate`, and `streaming_refine_passes` to IVF build parameters.
- Keeps `num_partitions > 256` on a hierarchical kmeans path. The streaming path does not fall back to flat kmeans for large `k`.
Algorithm
The existing IVF trainer samples up to `num_partitions * sample_rate` raw vectors before training. For very large `num_partitions`, this can require too much memory. LanceStream decouples the total sample size from the peak raw vectors loaded at one time.
The streaming path works as follows:
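1. Sample a fixed set of row indices, up to the smaller of `num_partitions * sample_rate` and the dataset row count. Using fixed indices keeps coreset construction, optional refine passes, and final loss measurement on the same training sample.
2. Stream the sampled rows in chunks of at most `num_partitions * streaming_sample_rate` raw vectors.
3. Summarize each chunk into a local weighted coreset, within a total coreset budget of `num_partitions * streaming_coreset_rate`.
4. Run weighted hierarchical kmeans on the coreset until `num_partitions` is reached.
5. Optionally run `streaming_refine_passes` raw-vector Lloyd passes. Each pass streams the same fixed sampled rows in chunks, assigns raw vectors to the current centroids, accumulates global cluster sums/counts/loss, updates non-empty centroids at pass end, and keeps old centroids for empty clusters.
6. Measure the final loss over the same fixed sample as the sum of `min_c ||x - c||^2`.
Memory model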
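Peak raw-vector memory is bounded by one chunk of `num_partitions * streaming_sample_rate` raw vectors, or about `num_partitions * streaming_sample_rate * dimension * 4` bytes for `float32` data.
The coreset budget is bounded separately by `num_partitions * streaming_coreset_rate` weighted points.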
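As a worked example of these bounds, the sketch below plugs the benchmark configuration into the row counts used in this description. The helper name `streaming_budget`, the `dataset_rows` argument, and the assumption that each coreset point costs one `float32` vector (ignoring its weight) are illustrative only and are not taken from the Lance code.

```python
import math

def streaming_budget(num_partitions, sample_rate, streaming_sample_rate,
                     streaming_coreset_rate, dimension, dataset_rows,
                     bytes_per_value=4):
    """Estimate row counts and raw byte sizes for a streaming training run.

    Hypothetical helper: it only restates the bounds from the description
    and does not account for centroids, accumulators, or allocator overhead.
    """
    total_rows = min(num_partitions * sample_rate, dataset_rows)
    chunk_rows = min(num_partitions * streaming_sample_rate, total_rows)
    coreset_points = num_partitions * streaming_coreset_rate
    row_bytes = dimension * bytes_per_value
    return {
        "total_rows": total_rows,
        "chunk_rows": chunk_rows,
        "num_chunks": math.ceil(total_rows / chunk_rows),
        "full_sample_bytes": total_rows * row_bytes,
        "peak_chunk_bytes": chunk_rows * row_bytes,
        # Assumption: one float32 vector per coreset point, weights ignored.
        "coreset_bytes": coreset_points * row_bytes,
    }

GIB = 1024 ** 3
budget = streaming_budget(
    num_partitions=131_072, sample_rate=256, streaming_sample_rate=64,
    streaming_coreset_rate=16, dimension=128, dataset_rows=131_072 * 256,
)
print(budget["num_chunks"])                # 4
print(budget["peak_chunk_bytes"] / GIB)    # 4.0
print(budget["full_sample_bytes"] / GIB)   # 16.0
print(budget["coreset_bytes"] / GIB)       # 1.0
```

Under these assumptions the raw chunk is a quarter of the full sample, which is the ratio `streaming_sample_rate / sample_rate = 64 / 256`.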
The final centroid and accumulation buffers scale with `num_partitions * dimension`. This avoids materializing the full `num_partitions * sample_rate` raw-vector sample in memory.
Quality model
The local coreset preserves the chunk distribution with centroid weights and local losses. Weighted hierarchical kmeans keeps Lance's existing high-loss partition splitting heuristic, but applies it to a bounded weighted summary instead of the full raw sample. Optional streaming refine passes then optimize the true raw-vector kmeans objective without increasing the raw-vector memory bound.
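The two building blocks named above, a local weighted summary of a chunk and a streaming refine pass, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the Lance implementation: the function names are hypothetical, and the local centroids passed to `summarize_chunk` are taken as given because this description does not say how they are chosen.

```python
def sq_dist(a, b):
    """Squared L2 distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(x, centroids):
    """Return (index, squared distance) of the centroid closest to x."""
    best = min(range(len(centroids)), key=lambda j: sq_dist(x, centroids[j]))
    return best, sq_dist(x, centroids[best])

def summarize_chunk(chunk, local_centroids):
    """Compress one raw chunk into (centroid, weight, local loss) triples."""
    k = len(local_centroids)
    weights = [0] * k
    losses = [0.0] * k
    for x in chunk:
        j, d = nearest(x, local_centroids)
        weights[j] += 1
        losses[j] += d
    # Drop local centroids that attracted no points.
    return [(local_centroids[j], weights[j], losses[j])
            for j in range(k) if weights[j] > 0]

def refine_pass(chunks, centroids):
    """One streaming Lloyd pass; only one chunk of raw vectors is live at a time."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    loss = 0.0  # measured against the centroids the pass started with
    for chunk in chunks:  # may be a generator that reads chunks lazily
        for x in chunk:
            j, d = nearest(x, centroids)
            counts[j] += 1
            loss += d
            for t in range(dim):
                sums[j][t] += x[t]
    # Update non-empty clusters at pass end; empty clusters keep the old centroid.
    updated = [
        [s / counts[j] for s in sums[j]] if counts[j] > 0 else list(centroids[j])
        for j in range(k)
    ]
    return updated, loss

chunks = [[[0.0, 0.0], [0.0, 2.0]], [[10.0, 0.0], [10.0, 2.0]]]
start = [[1.0, 1.0], [9.0, 1.0], [100.0, 100.0]]
refined, first_loss = refine_pass(chunks, start)
_, second_loss = refine_pass(chunks, refined)
print(refined)                   # [[0.0, 1.0], [10.0, 1.0], [100.0, 100.0]]
print(first_loss, second_loss)   # 8.0 4.0
```

In this toy run the third centroid attracts no points and is left unchanged, while the loss on the raw vectors drops from 8.0 to 4.0 after one pass.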
Benchmarks
Setup
- Machine: `c2-standard-16`, 16 vCPU, ~64 GiB RAM, AVX512, 500 GB pd-ssd.
- Vectors: `float32`.
- Training sample: `k * 256` rows.
- LanceStream: `sample_rate=256`, `streaming_sample_rate=64`, `streaming_coreset_rate=16`, `streaming_refine_passes=0`, `max_iters=50`.
- Lance non-stream: `streaming_sample_rate=None`, `sample_rate=256`, `max_iters=50`, default `hierarchical_k=16`, and IVF `balance_factor=1.0`.
Loss is reported as the exact kmeans objective over the same training sample whenever centroids were available: the sum over training vectors `x` of `min_c ||x - c||^2`.
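For reference, the objective can be computed by brute force as in the sketch below. It is a stdlib-only stand-in for the `faiss.IndexFlatL2` nearest-centroid search used in the benchmarks, and `relative_loss` assumes that relative loss is a plain ratio against a baseline, which this description does not spell out. Both function names are hypothetical.

```python
def exact_kmeans_loss(vectors, centroids):
    """Sum over vectors of the squared L2 distance to the nearest centroid."""
    total = 0.0
    for x in vectors:
        total += min(
            sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in centroids
        )
    return total

def relative_loss(loss, baseline_loss):
    """Assumed definition: ratio of a method's loss to the baseline's loss."""
    return loss / baseline_loss

sample = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
centroids_a = [[0.0, 1.0], [10.0, 1.0]]   # baseline centroids
centroids_b = [[0.0, 0.0], [10.0, 0.0]]   # a worse set of centroids
loss_a = exact_kmeans_loss(sample, centroids_a)
loss_b = exact_kmeans_loss(sample, centroids_b)
print(loss_a, loss_b)                  # 4.0 8.0
print(relative_loss(loss_b, loss_a))   # 2.0
```

The cost is proportional to the number of vectors times the number of centroids times the dimension, which is why recomputing it for the largest `k` values is expensive.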
For BICO and treeCoreset, the native reported objective is not directly comparable, so exact loss was recomputed from saved centroids with `faiss.IndexFlatL2` for `k=1,024`, `4,096`, and `16,384`. Lance non-stream was evaluated the same way for those exact-loss rows. Exact full loss for `k=65,536` and `131,072` was not recomputed because it is very expensive at those sample sizes; the large-k table uses each completed Lance run's reported training loss.
Exact Loss Results
Relative loss is shown against LanceStream for the same `k`.
Large-k Scalability
For `k=131,072`, LanceStream's raw chunk size is approximately `131072 * 64 * 128 * 4 = 4 GiB`; observed peak RSS was 7.12 GB including coreset, centroids, accumulators, allocator overhead, and runtime overhead. Lance non-stream loads the full `k * 256` sample and reached 32.52 GB RSS on the same workload. The non-stream path is faster and has slightly lower or comparable loss on these SIFT runs, but its memory scales with the full raw sample and is the pressure point this PR is designed to avoid.
Validation
- `cargo fmt --all --check`
- `cargo check -p lance-index -p lance`
- `cargo clippy -p lance-index --all-targets -- -D warnings`
- `cargo clippy -p lance --all-targets -- -D warnings`
- `cargo clippy --profile ci --locked --features aws,azure,bitpacking,cli,credential-vendor-aws,credential-vendor-azure,credential-vendor-gcp,datafusion,default,dir-aws,dir-azure,dir-gcp,dir-huggingface,dir-oss,dynamodb,dynamodb_tests,fp16kernels,gcp,gcs-test,geo,huggingface,jieba-rs,lindera,lz4,oss,protoc,rest,rest-adapter,slow_tests,substrait,tencent,test-util,tokenizer-jieba,tokenizer-lindera,zstd --all-targets -- -D warnings`
- `git diff --check`