feat(index): add streaming ivf kmeans training by BubbleCal · Pull Request #6913 · lance-format/lance

BubbleCal · 2026-05-22T08:52:52Z

Summary

Adds optional streaming controls for IVF kmeans training so that indexes with very large partition counts can be trained with bounded peak raw-vector memory. The existing non-streaming path remains the default and is unchanged.

This PR introduces LanceStream: a streaming coreset + weighted hierarchical kmeans training path for IVF. The path is enabled only when streaming_sample_rate is configured below the existing sample_rate; otherwise the current trainer is used.
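The selection rule above can be read as a simple predicate. The sketch below is illustrative only: the function name `uses_streaming_trainer` is hypothetical and not part of Lance, and representing an unset `streaming_sample_rate` as `None` is an assumption based on the benchmark setup later in this description.

```python
from typing import Optional


def uses_streaming_trainer(sample_rate: int,
                           streaming_sample_rate: Optional[int]) -> bool:
    """Hypothetical helper: True when the streaming path would be selected.

    Assumption: the streaming path is chosen only when streaming_sample_rate
    is set and strictly below sample_rate, as the summary states.
    """
    return (streaming_sample_rate is not None
            and streaming_sample_rate < sample_rate)
```

With the benchmark settings (`sample_rate=256`, `streaming_sample_rate=64`) this predicate is true. With `streaming_sample_rate=None`, or a value equal to `sample_rate`, it is false and the current trainer would be used.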

Feature

Adds streaming_sample_rate, streaming_coreset_rate, and streaming_refine_passes to IVF build parameters.
Exposes the new parameters through the Python index creation path.
Keeps num_partitions > 256 on a hierarchical kmeans path. The streaming path does not fall back to flat kmeans for large k.
Reports final loss against the same fixed sampled vectors used for training.

Algorithm

The existing IVF trainer samples up to num_partitions * sample_rate raw vectors before training. For very large num_partitions, this can require too much memory. LanceStream decouples the total sample size from the peak raw vectors loaded at one time.

The streaming path works as follows:

1. Build a fixed set of sampled row indices, bounded by num_partitions * sample_rate and the dataset row count. Using fixed indices keeps coreset construction, optional refine passes, and final loss measurement on the same training sample.
2. Split the sampled indices into chunks of at most num_partitions * streaming_sample_rate raw vectors.
3. For each chunk:
   - load only the current raw vectors;
   - train a local kmeans summary over the chunk;
   - assign raw vectors to local centroids;
   - record each local centroid with its weight and local loss contribution.
4. Merge the weighted local summaries into a global coreset, bounded by num_partitions * streaming_coreset_rate.
5. Train final IVF centroids with weighted hierarchical kmeans over the coreset:
   - start with a small weighted kmeans fanout, currently up to 16 clusters;
   - repeatedly split the highest-loss / highest-weight partitions;
   - continue until the requested num_partitions is reached.
6. Run a short weighted Lloyd refinement over the coreset centroids.
7. Optionally run streaming_refine_passes raw-vector Lloyd passes. Each pass streams the same fixed sampled rows in chunks, assigns raw vectors to the current centroids, accumulates global cluster sums/counts/loss, updates non-empty centroids at pass end, and keeps old centroids for empty clusters.
8. Compute final true loss by streaming the fixed sampled rows again and summing min_c ||x - c||^2.
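The first steps of the streaming path (fixed sample, bounded chunks, per-chunk weighted summary) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the Rust implementation in this PR: `plan_chunks` and `summarize_chunk` are hypothetical names, the local centroids are taken as given rather than trained, and dropping local centroids with zero weight is an assumption.

```python
import random


def plan_chunks(num_rows, num_partitions, sample_rate,
                streaming_sample_rate, seed=0):
    """Pick a fixed sample of row indices and split it into bounded chunks."""
    # Total sample is bounded by both the dataset size and k * sample_rate.
    sample_size = min(num_rows, num_partitions * sample_rate)
    rng = random.Random(seed)  # fixed seed, so every pass sees the same rows
    indices = sorted(rng.sample(range(num_rows), sample_size))
    # At most k * streaming_sample_rate raw vectors are loaded per chunk.
    chunk_size = num_partitions * streaming_sample_rate
    return [indices[i:i + chunk_size]
            for i in range(0, sample_size, chunk_size)]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def summarize_chunk(vectors, local_centroids):
    """Assign raw vectors to local centroids.

    Returns (centroid, weight, local_loss) rows for non-empty centroids.
    """
    weights = [0] * len(local_centroids)
    losses = [0.0] * len(local_centroids)
    for v in vectors:
        dists = [sq_dist(v, c) for c in local_centroids]
        j = min(range(len(dists)), key=dists.__getitem__)
        weights[j] += 1        # number of raw vectors this centroid stands for
        losses[j] += dists[j]  # local loss contribution
    return [(c, w, l)
            for c, w, l in zip(local_centroids, weights, losses) if w > 0]
```

For example, `plan_chunks(10_000, 4, 256, 64)` yields four chunks of 256 indices each. Only one such chunk of raw vectors needs to be resident while its summary is built.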

Memory model

Peak raw-vector memory is bounded by:

num_partitions * streaming_sample_rate * dimension * sizeof(float32)

The coreset budget is bounded separately by:

num_partitions * streaming_coreset_rate

The final centroid and accumulation buffers scale with num_partitions * dimension. This avoids materializing the full num_partitions * sample_rate raw-vector sample in memory.
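As a rough sanity check, the bounds above can be evaluated for a given configuration. The helper names below are hypothetical and exist only for illustration. The formulas are the ones stated in this section, and float32 (4 bytes per value) is assumed.

```python
def peak_raw_bytes(num_partitions, streaming_sample_rate, dimension,
                   bytes_per_value=4):
    """Upper bound on raw-vector bytes resident at once (float32 assumed)."""
    return num_partitions * streaming_sample_rate * dimension * bytes_per_value


def coreset_rows(num_partitions, streaming_coreset_rate):
    """Upper bound on the number of weighted rows kept in the global coreset."""
    return num_partitions * streaming_coreset_rate


GIB = 2 ** 30
# k = 131,072 with the benchmark settings used later in this description.
print(peak_raw_bytes(131_072, 64, 128) / GIB)  # raw chunk bound in GiB
print(coreset_rows(131_072, 16))               # coreset row bound
```

These figures cover raw vectors and coreset rows only. Centroids, accumulators, and allocator overhead come on top, so observed RSS will be higher.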

Quality model

The local coreset preserves the chunk distribution with centroid weights and local losses. Weighted hierarchical kmeans keeps Lance's existing high-loss partition splitting heuristic, but applies it to a bounded weighted summary instead of the full raw sample. Optional streaming refine passes then optimize the true raw-vector kmeans objective without increasing the raw-vector memory bound.
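To make the role of the weights concrete, here is a minimal sketch of one Lloyd pass that accepts weighted points. Under the assumptions of this sketch, the same routine covers both the weighted refinement over the coreset (each row carries its weight) and a raw-vector refine pass (each vector has weight 1, streamed chunk by chunk). The name `lloyd_pass` is hypothetical and the code is not taken from the PR.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd_pass(chunks, centroids):
    """One weighted Lloyd pass over streamed chunks of (vector, weight) pairs.

    Returns (updated_centroids, loss), where loss is measured against the
    centroids passed in. Empty clusters keep their old centroid.
    """
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]  # accumulators: k * dimension
    mass = [0.0] * len(centroids)
    loss = 0.0
    for chunk in chunks:  # only one chunk needs to be resident at a time
        for v, w in chunk:
            dists = [sq_dist(v, c) for c in centroids]
            j = min(range(len(dists)), key=dists.__getitem__)
            mass[j] += w
            loss += w * dists[j]
            for d in range(dim):
                sums[j][d] += w * v[d]
    updated = [
        [s / mass[j] for s in sums[j]] if mass[j] > 0 else list(centroids[j])
        for j in range(len(centroids))
    ]
    return updated, loss
```

A coreset row `((1.0,), 2)` moves a centroid exactly as two raw vectors averaging to `1.0` would, which is why the weighted summary can stand in for the raw sample during the coreset stage.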

Benchmarks

Setup

VM: GCP c2-standard-16, 16 vCPU, ~64 GiB RAM, AVX512, 500 GB pd-ssd.
Dataset: real SIFT vectors, 128-dimensional float32.
Training sample size: k * 256 rows.
Parameters for LanceStream: sample_rate=256, streaming_sample_rate=64, streaming_coreset_rate=16, streaming_refine_passes=0, max_iters=50.
Parameters for Lance non-stream: production IVF training path with streaming_sample_rate=None, sample_rate=256, max_iters=50, default hierarchical_k=16, and IVF balance_factor=1.0.
Timeout: 1 hour per run.
Compared implementations:
- LanceStream: this PR.
- Lance non-stream: current production non-stream heuristic / hierarchical IVF kmeans path.
- Faiss Kmeans v1.12.0 built from source with AVX512 optimization.
- sklearn MiniBatchKMeans.
- BICO Python package.
- treeCoreset / StreamKM++ style implementation.

Loss is reported as the exact kmeans objective over the same training sample whenever centroids were available:

sum_x min_c ||x - c||^2

For BICO and treeCoreset, the native reported objective is not directly comparable, so exact loss was recomputed from saved centroids with faiss.IndexFlatL2 for k=1,024, 4,096, and 16,384. Lance non-stream was evaluated the same way for those exact-loss rows. Exact full loss for k=65,536 and 131,072 was not recomputed because it is very expensive at those sample sizes; the large-k table uses each completed Lance run's reported training loss.
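For reference, the objective above can be written as a brute-force loop that streams the sample in chunks. This is an illustrative sketch only: the benchmark recomputation used faiss.IndexFlatL2 for the nearest-centroid search, whereas the hypothetical `exact_loss` below uses plain Python and is practical only for tiny inputs.

```python
def exact_loss(chunks, centroids):
    """Sum over all sampled vectors of the squared distance to the nearest centroid."""
    total = 0.0
    for chunk in chunks:  # stream the fixed sample chunk by chunk
        for x in chunk:
            total += min(
                sum((a - b) ** 2 for a, b in zip(x, c)) for c in centroids
            )
    return total
```

Because the sum is accumulated per chunk, the result does not depend on how the fixed sample is split, only on which rows it contains.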

Exact Loss Results

Relative loss is shown against LanceStream for the same k.

k	algorithm	status	train time	peak RSS	exact loss	loss vs LanceStream
1,024	LanceStream	ok	11.29s	413 MB	1.3828e10	baseline
1,024	Lance non-stream	ok	2.22s	344 MB	1.3840e10	+0.09%
1,024	Faiss	ok	34.92s	201 MB	1.3279e10	-4.0%
1,024	MiniBatchKMeans	ok	33.33s	329 MB	1.3494e10	-2.4%
1,024	BICO	ok	34.25s	548 MB	1.4307e10	+3.5%
1,024	treeCoreset	ok	18.36s	1.35 GB	1.4697e10	+6.3%
4,096	LanceStream	ok	45.79s	416 MB	6.3529e10	baseline
4,096	Lance non-stream	ok	16.51s	1.30 GB	6.2720e10	-1.27%
4,096	Faiss	ok	9.40m	603 MB	6.0843e10	-4.2%
4,096	MiniBatchKMeans	ok	7.57m	881 MB	6.1763e10	-2.8%
4,096	BICO	ok	2.46m	1.05 GB	6.5588e10	+3.2%
4,096	treeCoreset	ok	3.00m	3.50 GB	6.7396e10	+6.1%
16,384	LanceStream	ok	2.28m	1.01 GB	2.2856e11	baseline
16,384	Lance non-stream	ok	30.04s	4.15 GB	2.2923e11	+0.30%
16,384	Faiss	timeout	1h cutoff	2.16 GB	n/a	n/a
16,384	MiniBatchKMeans	timeout	1h cutoff	3.02 GB	n/a	n/a
16,384	BICO	ok	45.40m	3.79 GB	2.3061e11	+0.90%
16,384	treeCoreset	ok	40.33m	12.33 GB	2.3749e11	+3.91%
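The last column of the table is a signed percentage difference against the LanceStream row for the same k. A small sketch of that calculation follows, using values copied from the table. The function name `relative_loss_pct` is hypothetical.

```python
def relative_loss_pct(loss, baseline_loss):
    """Signed percentage difference of `loss` against the baseline loss."""
    return (loss - baseline_loss) / baseline_loss * 100.0


# k = 1,024: Faiss versus LanceStream.
print(round(relative_loss_pct(1.3279e10, 1.3828e10), 1))
# k = 4,096: Lance non-stream versus LanceStream.
print(round(relative_loss_pct(6.2720e10, 6.3529e10), 2))
```

Negative values mean the algorithm reached a lower loss than LanceStream on the same training sample.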

Large-k Scalability

k	algorithm	status	train time	peak RSS	reported training loss	notes
16,384	LanceStream	ok	2.28m	1.01 GB	2.2761e11	streaming coreset, completed
16,384	Lance non-stream	ok	30.04s	4.15 GB	2.3169e11	production non-stream heuristic kmeans
16,384	Faiss	timeout	1h cutoff	2.16 GB	n/a	no centroids/loss
16,384	MiniBatchKMeans	timeout	1h cutoff	3.02 GB	n/a	no centroids/loss
16,384	BICO	ok	45.40m	3.79 GB	n/a	exact loss +0.90% vs LanceStream
16,384	treeCoreset	ok	40.33m	12.33 GB	n/a	exact loss +3.91% vs LanceStream
65,536	LanceStream	ok	11.75m	3.64 GB	8.3324e11	streaming coreset, completed
65,536	Lance non-stream	ok	2.67m	16.34 GB	8.1670e11	production non-stream heuristic kmeans
65,536	Faiss	skipped	n/a	n/a	n/a	skipped after 16K timeout
65,536	MiniBatchKMeans	skipped	n/a	n/a	n/a	skipped after 16K timeout
65,536	BICO	timeout	1h cutoff	7.29 GB	n/a	did not complete
65,536	treeCoreset	error 139	27.53s	22.31 GB	n/a	segfault / dumped core
131,072	LanceStream	ok	20.80m	7.12 GB	1.6070e12	streaming coreset, completed
131,072	Lance non-stream	ok	5.33m	32.52 GB	1.5818e12	production non-stream heuristic kmeans
131,072	Faiss	skipped	n/a	n/a	n/a	skipped after 16K timeout
131,072	MiniBatchKMeans	skipped	n/a	n/a	n/a	skipped after 16K timeout
131,072	BICO	skipped	n/a	n/a	n/a	skipped after 64K timeout
131,072	treeCoreset	skipped	n/a	n/a	n/a	skipped after 64K error

For k=131,072, LanceStream's raw chunk size is approximately 131072 * 64 * 128 * 4 bytes = 4 GiB. Observed peak RSS was 7.12 GB, which also includes the coreset, centroids, accumulators, allocator overhead, and runtime overhead. Lance non-stream loads the full k * 256 sample and reached 32.52 GB RSS on the same workload. The non-stream path is faster and has comparable or slightly lower loss on these SIFT runs, but its memory scales with the full raw sample, which is the pressure point this PR is designed to avoid.

Validation

cargo fmt --all --check
cargo check -p lance-index -p lance
cargo clippy -p lance-index --all-targets -- -D warnings
cargo clippy -p lance --all-targets -- -D warnings
cargo clippy --profile ci --locked --features aws,azure,bitpacking,cli,credential-vendor-aws,credential-vendor-azure,credential-vendor-gcp,datafusion,default,dir-aws,dir-azure,dir-gcp,dir-huggingface,dir-oss,dynamodb,dynamodb_tests,fp16kernels,gcp,gcs-test,geo,huggingface,jieba-rs,lindera,lz4,oss,protoc,rest,rest-adapter,slow_tests,substrait,tencent,test-util,tokenizer-jieba,tokenizer-lindera,zstd --all-targets -- -D warnings
git diff --check


codecov · 2026-05-22T09:38:03Z

Codecov Report

❌ Patch coverage is 69.76271% with 446 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
rust/lance/src/index/vector/ivf.rs	68.74%	395 Missing and 36 partials ⚠️
rust/lance-index/src/vector/kmeans.rs	83.33%	11 Missing and 4 partials ⚠️


feat(index): add streaming ivf kmeans training

ed5b3be

claude Bot reviewed May 22, 2026


github-actions Bot added the enhancement and python labels May 22, 2026

chore: merge main into streaming ivf kmeans branch

6c5574d

BubbleCal added 2 commits May 22, 2026 19:06

fix(index): simplify kmeans helper return types

0b1c7d8

fix(index): satisfy streaming ivf clippy lints

ac0e6cb

BubbleCal wants to merge 4 commits into main from yang/streaming-ivf-kmeans-training
