feat: support batch vector queries #6828
|
Updated based on review feedback:
Local disk benchmark result for 8 queries, 50k rows, dim=4: separate mean 3.8834 ms vs batch mean 3.3045 ms, about 1.17x speedup. The gain is modest on local disk because the repeated reads are served from OS page cache. |
|
Updated the benchmark scale per feedback:
Local result with OS cache accepted:
This is meaningfully higher than the previous small-data local run (~1.17x), which matches the expectation that larger scan workloads show more benefit from sharing read/decode work across queries. |
|
Added a controlled benchmark matrix to make the trend clearer. Query-count scaling at 1M rows x 512d:
Dataset-size scaling at m=10, 512d:
So the relative speedup clearly increases with m. For dataset size, the absolute time saved grows from ~71 ms to ~600 ms while relative speedup stays above 2x on local disk with OS cache effects accepted. The benchmark is now parameterized with env vars so these rows can be reproduced without editing source. |
| ); | ||
| } | ||
|
|
||
| fn bench_batch_flat_knn(c: &mut Criterion) { |
Can we port this benchmark to Python?
It has been ported to: python/python/benchmarks/test_search.py:227
| DataType::List(_) | DataType::FixedSizeList(_, _) => { | ||
| if !matches!(vector_type, DataType::List(_)) { | ||
| return Err(Error::invalid_input(format!( | ||
| "Query is multivector but column {}({})is not multivector", |
Can you explain in more detail how this distinguishes a multivector query from a query batch?
Batch vs. multivector is distinguished by the vector column type: a list-like q with a List column is one multivector query; a list-like q with a FixedSizeList column is a batch of single-vector queries.
Batch vs. multivector is decided by the vector column type; comments explaining this were added in Scanner::nearest (lines 1467-1475).
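The rule stated in this thread can be sketched as a small decision function. This is only an illustration, not Lance's implementation: classify_list_like_query and the string type tags are hypothetical stand-ins for the Arrow data type checks described for Scanner::nearest.

```python
def classify_list_like_query(column_type: str) -> str:
    """Classify a list-like query q by the vector column's type.

    column_type is a hypothetical string tag standing in for the Arrow
    data type of the vector column.
    """
    if column_type == "List":
        # Multivector column: the list of vectors is ONE multivector query.
        return "multivector"
    if column_type == "FixedSizeList":
        # Single-vector column: each inner vector is its own query.
        return "batch"
    raise ValueError(f"unsupported vector column type: {column_type!r}")


print(classify_list_like_query("List"))           # multivector
print(classify_list_like_query("FixedSizeList"))  # batch
```

The point of keying on the column type is that the same query shape (a list of vectors) is ambiguous by itself; only the column resolves it.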
BubbleCal left a comment
- The distance_range param is lost if it's a batch query.
- This forces the query to be executed by flat KNN even when there's an index; we still need to use the index if there is one (just query the index for each query vector).
Please add tests verifying that both are really fixed.
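The kind of test requested here can be sketched with a brute-force reference: a batch search with distance_range should return exactly what the per-query searches return, tagged with query_index. Everything below is a self-contained toy under stated assumptions, not the Lance API: flat_knn and batch_flat_knn are hypothetical helpers, distances are squared L2, and the range is assumed to be lower-inclusive and upper-exclusive.

```python
import heapq


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def in_range(d, distance_range):
    # Assumption: lower bound inclusive, upper bound exclusive.
    if distance_range is None:
        return True
    lo, hi = distance_range
    return (lo is None or d >= lo) and (hi is None or d < hi)


def flat_knn(rows, query, k, distance_range=None):
    """Reference single-query search: range filter first, then top-k."""
    scored = [(squared_l2(v, query), rid) for rid, v in enumerate(rows)]
    kept = [s for s in scored if in_range(s[0], distance_range)]
    return heapq.nsmallest(k, kept)


def batch_flat_knn(rows, queries, k, distance_range=None):
    """One pass over the rows; candidates are tracked per query."""
    per_query = [[] for _ in queries]
    for rid, v in enumerate(rows):  # each row is visited once
        for qi, q in enumerate(queries):
            d = squared_l2(v, q)
            if in_range(d, distance_range):
                per_query[qi].append((d, rid))
    return [
        (qi, d, rid)
        for qi, cands in enumerate(per_query)
        for d, rid in heapq.nsmallest(k, cands)
    ]


rows = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
queries = [[0.0, 0.0], [3.0, 3.0]]
bounds = (0.5, 14.0)

batch = batch_flat_knn(rows, queries, k=2, distance_range=bounds)
expected = [
    (qi, d, rid)
    for qi, q in enumerate(queries)
    for d, rid in flat_knn(rows, q, k=2, distance_range=bounds)
]
assert batch == expected
```

Filtering before the top-k step matters: applying the range after truncation could return fewer than k rows even when enough in-range candidates exist.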
|
if the query is with:
it's expected to return an empty result, but the schema should still contain |
| In that case Lance runs a flat batch KNN query, returns up to ``k`` rows | ||
| for each query vector, and adds ``query_index`` to identify the source | ||
| query for each result row. Indexed/ANN batch search is not used in this | ||
| first implementation. |
These comments don't look correct.
- The comments were outdated.
- There will be a query_index column for batch results.
| q: QueryVectorLike | ||
| The query vector. | ||
| The query vector. For fixed-size vector columns, this may be a 2-D | ||
| array-like batch of query vectors. Batch queries run flat KNN, apply |
The outdated comments were committed with the old batch implementation and were not updated by later commits. Indexed batch and distance_range behavior is now consistent with the comments, as in the thread above.
| .into_iter() | ||
| .flat_map(BinaryHeap::into_vec) | ||
| .collect::<Vec<_>>(); | ||
| results.sort_by(|left, right| { |
I think we need to rethink how this is implemented: the results will now be truncated because SortExec has limit=k.
Say for a batch query with 2 vectors and k=10, this would return 20 rows, but SortExec will keep only 10 of them and we will lose the rest.
Introduced Scanner::is_batch_nearest and skipped SortExec on the batch flat path; per-query top-k is handled by KNNVectorDistanceExec::execute_batch.
I don't think this is fixed; please add a test to verify it.
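The truncation concern can be reproduced in miniature. The sketch below is a toy, not the DataFusion plan: global_top_k mimics a single sort with limit=k over the combined stream, and per_query_top_k mimics per-query selection. Both function names and the candidate rows are made up for illustration.

```python
import heapq
from itertools import groupby

# (query_index, distance, row_id) candidates for m=2 queries, k=2.
candidates = [
    (0, 0.10, 7), (0, 0.20, 3), (0, 0.90, 5),
    (1, 0.50, 2), (1, 0.60, 9), (1, 0.70, 4),
]
k = 2


def global_top_k(rows, k):
    """What a single sort with limit=k does: keeps k rows overall."""
    return heapq.nsmallest(k, rows, key=lambda r: r[1])


def per_query_top_k(rows, k):
    """Keeps up to k rows for EACH query_index."""
    out = []
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    for _, group in groupby(ordered, key=lambda r: r[0]):
        out.extend(list(group)[:k])
    return out


print(len(global_top_k(candidates, k)))     # 2
print(len(per_query_top_k(candidates, k)))  # 4
```

In this example the global limit keeps only rows for query 0, so query 1 silently loses all of its results; a regression test can assert on the row count per query_index to catch that.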
| match t { | ||
| DisplayFormatType::Default | DisplayFormatType::Verbose => { | ||
| write!(f, "KNNVectorDistance: metric={}", self.distance_type,) | ||
| if self.query_count > 1 { |
I think all self.query_count > 1 checks need to be replaced by self.is_batch_query(), which should check the query shape, not the query count.
If the query is a list of vectors that contains only 1 vector, it's still a batch query; otherwise the behavior will be hard to predict.
- Changed how a batch query is detected: Scanner::is_batch_nearest + KNNVectorDistanceExec::is_batch replace self.query_count > 1.
- Indexed batch still leverages the existing single-query ANN path with is_batch_nearest=false.
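The difference between a shape-based and a count-based check shows up with a batch of one. A minimal sketch, assuming queries are plain nested Python lists; is_batch_query here is a hypothetical helper, not the Rust method discussed above.

```python
def is_batch_query(q) -> bool:
    """True if q is a sequence of vectors (2-D), regardless of its length."""
    return len(q) > 0 and all(isinstance(v, (list, tuple)) for v in q)


single = [0.1, 0.2, 0.3]          # one flat vector
batch_of_one = [[0.1, 0.2, 0.3]]  # one vector wrapped in a batch

print(is_batch_query(single))        # False
print(is_batch_query(batch_of_one))  # True
print(len(batch_of_one) > 1)         # False: a count-based check misses this batch
```

With the shape-based check, a caller who passes a batch always gets batch-shaped output (including query_index), even when the batch happens to hold a single vector.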
Force-pushed from 99de360 to bbe534b
|
Test dedup only; no production or indexed-batch behavior changes.
Net: fewer overlapping assertions, same coverage. |
Add a flat KNN batch query path so callers can submit multiple query vectors and share scan work while preserving per-query top-k results. Co-authored-by: Cursor <cursoragent@cursor.com>
Fold batch flat KNN into the existing nearest and KNN execution paths so the public API and plan nodes stay consistent with reviewer feedback. Co-authored-by: Cursor <cursoragent@cursor.com>
Use a larger local-disk dataset and stream benchmark data generation so batch query gains are measured under a more realistic scan workload. Co-authored-by: Cursor <cursoragent@cursor.com>
Allow the local-disk batch KNN benchmark to vary row count, dimensionality, and query count so PR results can show scaling trends. Co-authored-by: Cursor <cursoragent@cursor.com>
Use the LanceDB-compatible query_index result column and move the batch flat KNN benchmark to Python so benchmark scaling can be reproduced from the binding API. Co-authored-by: Cursor <cursoragent@cursor.com>
Apply rustfmt output expected by CI for the batch query binding change. Co-authored-by: Cursor <cursoragent@cursor.com>
Move batch flat KNN benchmark configuration into pytest parameters so review and reproduction do not rely on environment variables. Co-authored-by: Cursor <cursoragent@cursor.com>
Route batched queries through vector indices when available and apply distance range bounds before per-query top-k selection on the flat path. Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Update nearest/search docstrings to describe indexed batch queries and add Python tests that batch distance_range matches per-query searches. Co-authored-by: Cursor <cursoragent@cursor.com>
When fast_search is used with a batch nearest query and no vector index, return an empty result whose schema still contains query_index. Co-authored-by: Cursor <cursoragent@cursor.com>
Use is_batch_nearest based on list-like queries on fixed-size vector columns instead of query_count > 1, so single-vector batch queries still get query_index and avoid SortExec TopK(fetch=k) truncating m*k results to k rows. Co-authored-by: Cursor <cursoragent@cursor.com>
Consolidate overlapping Rust/Python batch nearest tests via shared helpers. No production changes; merge with main deferred. Co-authored-by: Cursor <cursoragent@cursor.com>
Keep the main-based batch vector query branch compiling cleanly after conflict resolution. Co-authored-by: Cursor <cursoragent@cursor.com>
Force-pushed from bbe534b to fc0e7f0
| let vector_expr = expressions::col(DIST_COL, current_schema)?; | ||
| output_expr.push((vector_expr, DIST_COL.to_string())); | ||
| } | ||
| if self.is_batch_nearest && output_expr.iter().all(|(_, name)| name != QUERY_INDEX_COL) |
I think the query_index column shouldn't be added in autoproject_scoring_columns; it's a little confusing.
Add a regression test that batch flat KNN returns k rows per query instead of being truncated by SortExec, and keep query_index autoprojection separate from scoring-column autoprojection. Co-authored-by: Cursor <cursoragent@cursor.com>
Summary
- Extend the Scanner::nearest API to accept batched query vectors for fixed-size vector columns (no separate nearest_batch API).
- KNNVectorDistanceExec: each data batch is loaded once, all query vectors are evaluated against it, and results are returned in one stream with up to m * k rows.
- Add query_index to batch results so callers can split top-k rows per input query (LanceDB-compatible name, not _query_index).
- When use_index=true and a vector index is available, batch queries run through the indexed path (per-query ANN search, union, and query_index tagging) instead of forcing flat search.
- distance_range is applied before per-query top-k selection on the flat path; indexed batch respects the same bounds.
- fast_search with batch queries and no vector index returns an empty result whose schema still includes query_index.
- Batch vs. multivector is distinguished by the vector column type (FixedSizeList → batch of single-vector queries; List multivector column → one multivector query).
Closes #6821.
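Because batch results come back as one stream tagged with query_index, a caller can regroup them per input query. A minimal sketch, assuming the result has already been converted to a list of row dicts; split_by_query_index, the _distance column name, and the id column are illustrative assumptions.

```python
def split_by_query_index(rows, num_queries):
    """Group result rows by their 0-based query_index."""
    groups = [[] for _ in range(num_queries)]
    for row in rows:
        groups[row["query_index"]].append(row)
    return groups


rows = [
    {"query_index": 0, "_distance": 0.1, "id": 7},
    {"query_index": 1, "_distance": 0.5, "id": 2},
    {"query_index": 0, "_distance": 0.2, "id": 3},
]
per_query = split_by_query_index(rows, num_queries=3)

print([r["id"] for r in per_query[0]])  # [7, 3]
print([r["id"] for r in per_query[1]])  # [2]
print(per_query[2])                     # []
```

Passing the number of input queries explicitly keeps a slot for queries that returned no rows, which a plain group-by on the result would drop.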
API contract
- Batch queries require a FixedSizeList embedding column.
- The result contains up to k rows per query vector, plus query_index (0-based index into the input query batch).
- […] _rowid in the scan plan.
Benchmark
Python benchmark command:
Dataset size, dimensionality, query count, batch size, and rounds are declared in the benchmark's @pytest.mark.parametrize values. Adjust those parameters in python/benchmarks/test_search.py to reproduce the scaling rows below.
Dataset: random float32 vectors written to a real local .lance dataset. No memory:// dataset and no throttled/simulated object store latency. OS page cache effects are accepted.
Query Count Scaling
Fixed dataset: 1,000,000 rows, dim=512, k=10. This is about 1.9 GiB of raw vector values.
Batching becomes more valuable as m increases because shared scan/decode work is amortized over more query vectors.
Dataset Size Scaling
Fixed query count: m=10, dim=512, k=10.
On local disk with OS page cache, relative speedup is not strictly monotonic with row count because both plans become increasingly dominated by cached vector decoding and distance-compute work. The robust trend here is absolute time saved, which grows from ~71 ms to ~600 ms as dataset size grows.
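The amortization argument under Query Count Scaling can be illustrated with a toy cost model. This is an assumption-laden sketch, not a fit to the measurements: it supposes each separate query pays one scan/decode cost plus one compute cost, while a batch pays the scan/decode cost once. The cost values are arbitrary units chosen for illustration.

```python
def modeled_speedup(m, scan_cost, compute_cost):
    """Speedup of one batch of m queries over m separate queries."""
    separate = m * (scan_cost + compute_cost)  # scan repeated per query
    batch = scan_cost + m * compute_cost       # scan shared across queries
    return separate / batch


# Illustrative units only: scan/decode costs 3x the per-query compute.
for m in (1, 2, 10, 100):
    print(m, round(modeled_speedup(m, scan_cost=3.0, compute_cost=1.0), 2))
```

Under this model the speedup is 1.0 at m=1, grows with m, and approaches (scan_cost + compute_cost) / compute_cost, so the benefit is capped by how much of each query is shared scan/decode work.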
Test plan
- cargo test -p lance test_batch_knn
- cargo test -p lance fast_search_without
- uv run pytest python/tests/test_vector_index.py -k batch
- uv run --extra benchmarks pytest --collect-only python/benchmarks/test_search.py::test_batch_flat_knn
- cargo clippy -p lance --tests -- -D warnings
- uv run ruff format --check python/lance/dataset.py python/tests/test_vector_index.py python/benchmarks/test_search.py