feat(index): add streaming ivf kmeans training #6913

Open
BubbleCal wants to merge 4 commits into main from yang/streaming-ivf-kmeans-training

@BubbleCal BubbleCal commented May 22, 2026

Summary

Adds optional streaming IVF kmeans training controls so that training with very large IVF partition counts keeps peak raw-vector memory bounded. The existing non-streaming path remains the default and is unchanged.

This PR introduces LanceStream: a streaming coreset + weighted hierarchical kmeans training path for IVF. The path is enabled only when streaming_sample_rate is configured below the existing sample_rate; otherwise the current trainer is used.
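
The selection rule above can be sketched as follows. This is a minimal illustration of the rule as worded in this description only; the function name `uses_streaming_path` is hypothetical, and the actual check in the Rust code may differ in detail.

```python
from typing import Optional


def uses_streaming_path(sample_rate: int, streaming_sample_rate: Optional[int]) -> bool:
    """Return True when the streaming trainer would be selected.

    Hypothetical helper: it mirrors the rule stated in the PR description,
    not the actual implementation.
    """
    if streaming_sample_rate is None:
        # Streaming is opt-in: with no value configured, the existing trainer runs.
        return False
    # Streaming only helps when each chunk is smaller than the full sample.
    return streaming_sample_rate < sample_rate


print(uses_streaming_path(256, 64))    # streaming path
print(uses_streaming_path(256, None))  # existing trainer
```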

Feature

  • Adds streaming_sample_rate, streaming_coreset_rate, and streaming_refine_passes to IVF build parameters.
  • Exposes the new parameters through the Python index creation path.
  • Keeps num_partitions > 256 on a hierarchical kmeans path. The streaming path does not fall back to flat kmeans for large k.
  • Reports final loss against the same fixed sampled vectors used for training.

Algorithm

The existing IVF trainer samples up to num_partitions * sample_rate raw vectors before training. For very large num_partitions, this can require too much memory. LanceStream decouples the total sample size from the peak raw vectors loaded at one time.

The streaming path works as follows:

  1. Build a fixed set of sampled row indices, bounded by num_partitions * sample_rate and the dataset row count. Using fixed indices keeps coreset construction, optional refine passes, and final loss measurement on the same training sample.
  2. Split the sampled indices into chunks of at most num_partitions * streaming_sample_rate raw vectors.
  3. For each chunk:
    • load only the current raw vectors;
    • train a local kmeans summary over the chunk;
    • assign raw vectors to local centroids;
    • record each local centroid with its weight and local loss contribution.
  4. Merge the weighted local summaries into a global coreset, bounded by num_partitions * streaming_coreset_rate.
  5. Train final IVF centroids with weighted hierarchical kmeans over the coreset:
    • start with a small weighted kmeans fanout, currently up to 16 clusters;
    • repeatedly split the highest-loss / highest-weight partitions;
    • continue until the requested num_partitions is reached.
  6. Run a short weighted Lloyd refinement over the coreset centroids.
  7. Optionally run streaming_refine_passes raw-vector Lloyd passes. Each pass streams the same fixed sampled rows in chunks, assigns raw vectors to the current centroids, accumulates global cluster sums/counts/loss, updates non-empty centroids at pass end, and keeps old centroids for empty clusters.
  8. Compute final true loss by streaming the fixed sampled rows again and summing min_c ||x - c||^2.
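
Steps 1, 2, and 7 above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the PR's Rust implementation: all function names are hypothetical, the rows are held in an in-memory list rather than read from a dataset, and nearest-centroid search is brute force.

```python
import random


def fixed_sample_indices(num_rows, num_partitions, sample_rate, seed=0):
    """Step 1: a fixed, sorted set of row indices, capped by the row count."""
    size = min(num_rows, num_partitions * sample_rate)
    return sorted(random.Random(seed).sample(range(num_rows), size))


def split_into_chunks(indices, num_partitions, streaming_sample_rate):
    """Step 2: chunks of at most num_partitions * streaming_sample_rate rows."""
    step = num_partitions * streaming_sample_rate
    return [indices[i:i + step] for i in range(0, len(indices), step)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def refine_pass(rows, index_chunks, centroids):
    """Step 7: one raw-vector Lloyd pass; returns (new_centroids, loss)."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    loss = 0.0
    for chunk in index_chunks:
        for i in chunk:  # only the current chunk's vectors are touched
            x = rows[i]
            dists = [sq_dist(x, c) for c in centroids]
            j = min(range(len(centroids)), key=dists.__getitem__)
            loss += dists[j]
            counts[j] += 1
            for d in range(dim):
                sums[j][d] += x[d]
    # Update non-empty centroids at pass end; keep old centroids for empty clusters.
    new_centroids = [
        [s / counts[j] for s in sums[j]] if counts[j] else list(centroids[j])
        for j in range(len(centroids))
    ]
    return new_centroids, loss


rows = [[0.0], [1.0], [10.0], [11.0]]
indices = fixed_sample_indices(len(rows), num_partitions=2, sample_rate=256)
index_chunks = split_into_chunks(indices, num_partitions=2, streaming_sample_rate=1)
centroids, loss = refine_pass(rows, index_chunks, [[0.0], [10.0], [100.0]])
print(index_chunks, centroids, loss)
```

Because the accumulators are global and the centroids are updated only at the end of the pass, the result does not depend on how the sampled rows are split into chunks.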

Memory model

Peak raw-vector memory is bounded by:

num_partitions * streaming_sample_rate * dimension * sizeof(float32)

The coreset budget is bounded separately by:

num_partitions * streaming_coreset_rate

The final centroid and accumulation buffers scale with num_partitions * dimension. This avoids materializing the full num_partitions * sample_rate raw-vector sample in memory.
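
As a quick check of the raw-vector bound, the arithmetic can be written out directly. The helper name `raw_vector_bytes` is hypothetical; the inputs are the benchmark values used in this description (128-dimensional float32 vectors, sample_rate=256, streaming_sample_rate=64) at num_partitions=131,072.

```python
GIB = 1024 ** 3


def raw_vector_bytes(num_partitions, rate, dimension, bytes_per_value=4):
    """Bytes of raw float32 vectors held at once for a given per-partition rate."""
    return num_partitions * rate * dimension * bytes_per_value


# Streaming: one chunk of num_partitions * streaming_sample_rate vectors.
streaming_bytes = raw_vector_bytes(131072, 64, 128)
# Non-streaming: the full num_partitions * sample_rate sample.
full_sample_bytes = raw_vector_bytes(131072, 256, 128)

print(streaming_bytes / GIB, "GiB per streaming chunk")
print(full_sample_bytes / GIB, "GiB for the full raw sample")
```

These figures cover raw vectors only; the coreset, centroids, and accumulators add to the observed RSS.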

Quality model

The local coreset preserves the chunk distribution with centroid weights and local losses. Weighted hierarchical kmeans keeps Lance's existing high-loss partition splitting heuristic, but applies it to a bounded weighted summary instead of the full raw sample. Optional streaming refine passes then optimize the true raw-vector kmeans objective without increasing the raw-vector memory bound.
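
A small sketch of why the weights matter: when a local centroid is the mean of the raw vectors assigned to it and its weight is their count, a weighted centroid update over the summary reproduces the mean of the underlying raw vectors. The names below are hypothetical, and the equality is exact only when all raw vectors behind the merged local centroids belong to the same final cluster; in general the coreset is an approximation.

```python
import math


def weighted_mean(points, weights):
    """Weighted centroid update over coreset points."""
    total = sum(weights)
    dim = len(points[0])
    return [sum(w * p[d] for p, w in zip(points, weights)) / total for d in range(dim)]


def mean(points):
    return weighted_mean(points, [1.0] * len(points))


# Two chunks of raw vectors, each summarized by one local centroid and a weight.
chunk_a = [[0.0, 0.0], [2.0, 0.0]]
chunk_b = [[3.0, 0.0], [5.0, 0.0], [4.0, 6.0]]
coreset_points = [mean(chunk_a), mean(chunk_b)]
coreset_weights = [len(chunk_a), len(chunk_b)]

from_coreset = weighted_mean(coreset_points, coreset_weights)
from_raw = mean(chunk_a + chunk_b)
print(from_coreset, from_raw)
```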

Benchmarks

Setup

  • VM: GCP c2-standard-16, 16 vCPU, ~64 GiB RAM, AVX512, 500 GB pd-ssd.
  • Dataset: real SIFT vectors, 128-dimensional float32.
  • Training sample size: k * 256 rows.
  • Parameters for LanceStream: sample_rate=256, streaming_sample_rate=64, streaming_coreset_rate=16, streaming_refine_passes=0, max_iters=50.
  • Parameters for Lance non-stream: production IVF training path with streaming_sample_rate=None, sample_rate=256, max_iters=50, default hierarchical_k=16, and IVF balance_factor=1.0.
  • Timeout: 1 hour per run.
  • Compared implementations:
    • LanceStream: this PR.
    • Lance non-stream: current production non-stream heuristic / hierarchical IVF kmeans path.
    • Faiss Kmeans v1.12.0 built from source with AVX512 optimization.
    • sklearn MiniBatchKMeans.
    • BICO Python package.
    • treeCoreset / StreamKM++ style implementation.

Loss is reported as the exact kmeans objective over the same training sample whenever centroids were available:

sum_x min_c ||x - c||^2
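
For reference, the objective above can be evaluated by brute force as in the sketch below. This is only an illustration of the formula with a hypothetical function name; the benchmark itself used faiss.IndexFlatL2 for the nearest-centroid search, which this sketch does not reproduce.

```python
def exact_kmeans_loss(vectors, centroids):
    """Sum over all vectors of the squared L2 distance to the nearest centroid."""
    total = 0.0
    for x in vectors:
        total += min(
            sum((xi - ci) ** 2 for xi, ci in zip(x, c))
            for c in centroids
        )
    return total


vectors = [[0.0, 0.0], [3.0, 4.0], [10.0, 0.0]]
centroids = [[0.0, 0.0], [10.0, 0.0]]
print(exact_kmeans_loss(vectors, centroids))
```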

For BICO and treeCoreset, the native reported objective is not directly comparable, so exact loss was recomputed from saved centroids with faiss.IndexFlatL2 for k=1,024, 4,096, and 16,384. Lance non-stream was evaluated the same way for those exact-loss rows. Exact full loss for k=65,536 and 131,072 was not recomputed because it is very expensive at those sample sizes; the large-k table uses each completed Lance run's reported training loss.

Exact Loss Results

Relative loss is shown against LanceStream for the same k.

| k | algorithm | status | train time | peak RSS | exact loss | loss vs LanceStream |
|---|---|---|---|---|---|---|
| 1,024 | LanceStream | ok | 11.29s | 413 MB | 1.3828e10 | baseline |
| 1,024 | Lance non-stream | ok | 2.22s | 344 MB | 1.3840e10 | +0.09% |
| 1,024 | Faiss | ok | 34.92s | 201 MB | 1.3279e10 | -4.0% |
| 1,024 | MiniBatchKMeans | ok | 33.33s | 329 MB | 1.3494e10 | -2.4% |
| 1,024 | BICO | ok | 34.25s | 548 MB | 1.4307e10 | +3.5% |
| 1,024 | treeCoreset | ok | 18.36s | 1.35 GB | 1.4697e10 | +6.3% |
| 4,096 | LanceStream | ok | 45.79s | 416 MB | 6.3529e10 | baseline |
| 4,096 | Lance non-stream | ok | 16.51s | 1.30 GB | 6.2720e10 | -1.27% |
| 4,096 | Faiss | ok | 9.40m | 603 MB | 6.0843e10 | -4.2% |
| 4,096 | MiniBatchKMeans | ok | 7.57m | 881 MB | 6.1763e10 | -2.8% |
| 4,096 | BICO | ok | 2.46m | 1.05 GB | 6.5588e10 | +3.2% |
| 4,096 | treeCoreset | ok | 3.00m | 3.50 GB | 6.7396e10 | +6.1% |
| 16,384 | LanceStream | ok | 2.28m | 1.01 GB | 2.2856e11 | baseline |
| 16,384 | Lance non-stream | ok | 30.04s | 4.15 GB | 2.2923e11 | +0.30% |
| 16,384 | Faiss | timeout | 1h cutoff | 2.16 GB | n/a | n/a |
| 16,384 | MiniBatchKMeans | timeout | 1h cutoff | 3.02 GB | n/a | n/a |
| 16,384 | BICO | ok | 45.40m | 3.79 GB | 2.3061e11 | +0.90% |
| 16,384 | treeCoreset | ok | 40.33m | 12.33 GB | 2.3749e11 | +3.91% |
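
The relative-loss column can be reproduced from the exact-loss values with a one-line formula. The sketch below uses a hypothetical helper name and two rows from the results above; values recomputed from the rounded losses may differ from the reported column in the last digit.

```python
def loss_vs_baseline(loss, baseline_loss):
    """Relative difference in percent; negative means lower loss than the baseline."""
    return (loss - baseline_loss) / baseline_loss * 100.0


# k=1,024: Faiss against the LanceStream baseline.
faiss_1k = loss_vs_baseline(1.3279e10, 1.3828e10)
# k=4,096: Lance non-stream against the LanceStream baseline.
non_stream_4k = loss_vs_baseline(6.2720e10, 6.3529e10)

print(f"{faiss_1k:+.1f}%")
print(f"{non_stream_4k:+.2f}%")
```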

Large-k Scalability

| k | algorithm | status | train time | peak RSS | reported training loss | notes |
|---|---|---|---|---|---|---|
| 16,384 | LanceStream | ok | 2.28m | 1.01 GB | 2.2761e11 | streaming coreset, completed |
| 16,384 | Lance non-stream | ok | 30.04s | 4.15 GB | 2.3169e11 | production non-stream heuristic kmeans |
| 16,384 | Faiss | timeout | 1h cutoff | 2.16 GB | n/a | no centroids/loss |
| 16,384 | MiniBatchKMeans | timeout | 1h cutoff | 3.02 GB | n/a | no centroids/loss |
| 16,384 | BICO | ok | 45.40m | 3.79 GB | n/a | exact loss +0.90% vs LanceStream |
| 16,384 | treeCoreset | ok | 40.33m | 12.33 GB | n/a | exact loss +3.91% vs LanceStream |
| 65,536 | LanceStream | ok | 11.75m | 3.64 GB | 8.3324e11 | streaming coreset, completed |
| 65,536 | Lance non-stream | ok | 2.67m | 16.34 GB | 8.1670e11 | production non-stream heuristic kmeans |
| 65,536 | Faiss | skipped | n/a | n/a | n/a | skipped after 16K timeout |
| 65,536 | MiniBatchKMeans | skipped | n/a | n/a | n/a | skipped after 16K timeout |
| 65,536 | BICO | timeout | 1h cutoff | 7.29 GB | n/a | did not complete |
| 65,536 | treeCoreset | error 139 | 27.53s | 22.31 GB | n/a | segfault / dumped core |
| 131,072 | LanceStream | ok | 20.80m | 7.12 GB | 1.6070e12 | streaming coreset, completed |
| 131,072 | Lance non-stream | ok | 5.33m | 32.52 GB | 1.5818e12 | production non-stream heuristic kmeans |
| 131,072 | Faiss | skipped | n/a | n/a | n/a | skipped after 16K timeout |
| 131,072 | MiniBatchKMeans | skipped | n/a | n/a | n/a | skipped after 16K timeout |
| 131,072 | BICO | skipped | n/a | n/a | n/a | skipped after 64K timeout |
| 131,072 | treeCoreset | skipped | n/a | n/a | n/a | skipped after 64K error |

For k=131,072, LanceStream's raw chunk size is approximately 131072 * 64 * 128 * 4 bytes = 4 GiB; observed peak RSS was 7.12 GB, which also includes the coreset, centroids, accumulators, allocator overhead, and runtime overhead. Lance non-stream loads the full k * 256 sample and reached 32.52 GB RSS on the same workload. The non-stream path is faster and has slightly lower or comparable loss on these SIFT runs, but its memory scales with the full raw sample, which is the pressure point this PR is designed to avoid.

Validation

  • cargo fmt --all --check
  • cargo check -p lance-index -p lance
  • cargo clippy -p lance-index --all-targets -- -D warnings
  • cargo clippy -p lance --all-targets -- -D warnings
  • cargo clippy --profile ci --locked --features aws,azure,bitpacking,cli,credential-vendor-aws,credential-vendor-azure,credential-vendor-gcp,datafusion,default,dir-aws,dir-azure,dir-gcp,dir-huggingface,dir-oss,dynamodb,dynamodb_tests,fp16kernels,gcp,gcs-test,geo,huggingface,jieba-rs,lindera,lz4,oss,protoc,rest,rest-adapter,slow_tests,substrait,tencent,test-util,tokenizer-jieba,tokenizer-lindera,zstd --all-targets -- -D warnings
  • git diff --check

@claude claude Bot left a comment

Claude Code Review

This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.

Tip: disable this comment in your organization's Code Review settings.

github-actions Bot added the enhancement and python labels May 22, 2026

codecov Bot commented May 22, 2026

Codecov Report

❌ Patch coverage is 69.76271% with 446 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance/src/index/vector/ivf.rs | 68.74% | 395 Missing and 36 partials ⚠️ |
| rust/lance-index/src/vector/kmeans.rs | 83.33% | 11 Missing and 4 partials ⚠️ |
