feat: add manifest version hint for fast latest-version lookup #6752
|
Thanks for carrying on this work 🙏 One outcome of the discussion on this topic at the last community sync was that we would like to see how conflict resolution and commit rate behave with multiple writers. I think we need to add that benchmark, with results, before we can be sure this new approach does not regress performance. |
|
@jackye1995 added a Results
( Takeaways
Reproducing
export AWS_REGION=us-east-1
export NUM_WRITERS=32 APPENDS_PER_WRITER=100 ROWS_PER_APPEND=100 BASE_ROWS=100000
# S3 Standard
export DATASET_URI=s3://your-bucket/bench/concurrent_append
cargo bench --release --bench concurrent_append
# S3 Express (the `--x-s3` suffix is auto-detected to set s3_express=true)
export DATASET_URI=s3://your-bucket--use1-az4--x-s3/bench/concurrent_append
cargo bench --release --bench concurrent_append |
|
Thanks @touch-of-grey, those are good results! I'm good with these changes, but would like @jackye1995 to also have a chance to take a look through it. |
|
Ran the full scaling sweep on a Setup
Throughput (commits/sec)
p50 commit-attempt latency (ms)
Takeaways
Disabling the hint at runtime
Anyone (operators, future benchmarks) can run with the hint off via:
export LANCE_USE_VERSION_HINT=0
Read once at first use; affects writes (no hint PUT), |
jackye1995
left a comment
Looks good to me, just a nit regarding the change in the format spec. I think we will need a vote to add this? Please start a vote thread in discussions.
|
@jackye1995 @wjones127 spec change vote thread is up: #6797 (1-week minimum, 3 binding PMC +1 needed). Also pushed the two doc nits from the review ( |
|
@wjones127 the vote has passed, any further comments? |
On object stores where listing is not lexicographically ordered (e.g. S3
Express, the local filesystem), resolving the latest manifest version is
O(n) in the number of versions. After every successful commit on such a
store, write a small JSON file `_versions/latest_version_hint.json`
(`{"version":N}`); readers use it as a starting point and probe a few
higher versions with HEAD requests (O(k), k = versions added since the
hint was written), falling back to a full listing if the hint is missing
(older datasets) or stale, or if a transient object-store error makes the
probed range untrustworthy.
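The read path described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the `VersionStore` trait, the `Probe` enum, `MemStore`, and the hand-rolled `parse_hint` are hypothetical names introduced here, and treating a hint whose own version is not found as unusable is an assumption about what "stale" covers.

```rust
use std::collections::BTreeSet;

/// Outcome of probing one version (hypothetical; stands in for a HEAD request).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Probe {
    Exists,
    NotFound,
    /// Any other object-store error: the probed range cannot be trusted.
    Transient,
}

/// Hypothetical minimal view of an object store, for illustration only.
trait VersionStore {
    /// Contents of `_versions/latest_version_hint.json`, if present.
    fn read_hint(&self) -> Option<String>;
    /// HEAD-style existence check for one manifest version.
    fn probe(&self, version: u64) -> Probe;
    /// Full O(n) listing of all committed versions.
    fn list_versions(&self) -> Vec<u64>;
}

/// Parse `{"version":N}` by hand; a real reader would use a JSON parser.
fn parse_hint(raw: &str) -> Option<u64> {
    raw.trim()
        .strip_prefix("{\"version\":")?
        .strip_suffix('}')?
        .trim()
        .parse()
        .ok()
}

/// Resolve the latest version: hint plus forward probes, else a full listing.
fn latest_version(store: &dyn VersionStore) -> Option<u64> {
    let full_listing = || store.list_versions().into_iter().max();
    let Some(hint) = store.read_hint().as_deref().and_then(parse_hint) else {
        return full_listing(); // missing or unparsable hint (older datasets)
    };
    // Assumption: a hint pointing at a version we cannot confirm is unusable.
    if store.probe(hint) != Probe::Exists {
        return full_listing();
    }
    let mut latest = hint;
    loop {
        match store.probe(latest + 1) {
            Probe::Exists => latest += 1,               // newer commit found
            Probe::NotFound => return Some(latest),     // reached the tip
            Probe::Transient => return full_listing(),  // range untrustworthy
        }
    }
}

/// In-memory stand-in used only to exercise the sketch.
struct MemStore {
    hint: Option<String>,
    versions: BTreeSet<u64>,
    /// Version whose probe fails with a transient error, if any.
    flaky: Option<u64>,
}

impl VersionStore for MemStore {
    fn read_hint(&self) -> Option<String> {
        self.hint.clone()
    }
    fn probe(&self, version: u64) -> Probe {
        if self.flaky == Some(version) {
            Probe::Transient
        } else if self.versions.contains(&version) {
            Probe::Exists
        } else {
            Probe::NotFound
        }
    }
    fn list_versions(&self) -> Vec<u64> {
        self.versions.iter().copied().collect()
    }
}
```

With a hint of 3 and versions 1 through 5 present, the sketch issues three probes (4, 5, 6) after confirming the hint, instead of listing every manifest.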
The hint is written/read only on non-lexically-ordered stores — on S3
Standard / GCS / Azure / DynamoDB / memory the ordered listing already
resolves the latest version in roughly one request. The write is awaited
as part of the commit (no fire-and-forget mode) and is best-effort:
failures are logged and ignored, since the hint only accelerates reads
and never affects correctness. Detached versions are never hinted.
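A rough sketch of the best-effort write contract, under stated assumptions: `hint_body` and `write_hint_best_effort` are hypothetical names, the `put` closure stands in for an object-store PUT, and the caller is assumed to already know whether the version is detached. It is not the `write_version_hint` implementation from this PR.

```rust
/// Serialize the hint body in the `{"version":N}` form described above.
fn hint_body(version: u64) -> String {
    format!("{{\"version\":{version}}}")
}

/// Best-effort hint write. `put` is a hypothetical stand-in for an
/// object-store PUT taking (path, body). Returns true only if a hint was
/// actually written; errors are logged and swallowed so the commit that
/// triggered the write still succeeds.
fn write_hint_best_effort<E: std::fmt::Debug>(
    version: u64,
    is_detached: bool,
    put: impl FnOnce(&str, String) -> Result<(), E>,
) -> bool {
    // Detached versions are never hinted.
    if is_detached {
        return false;
    }
    match put("_versions/latest_version_hint.json", hint_body(version)) {
        Ok(()) => true,
        Err(err) => {
            // The hint only accelerates reads, so a failure is not fatal.
            eprintln!("failed to write version hint for v{version}: {err:?}");
            false
        }
    }
}
```

Returning a `bool` instead of a `Result` is a deliberate choice in this sketch: the caller has nothing to recover from, which mirrors the "logged and ignored" behavior.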
`current_manifest_path` uses the hint for non-lexically-ordered, non-local
stores (the local filesystem keeps its single-directory-read fast path);
`CommitHandler::list_manifest_locations_since` (used by
`load_new_transactions`) follows the same strategy, with the gap-fill
HEADs bounded by `io_parallelism()` and a fallback to a single paginated
listing once a reader is more than 1000 versions behind.
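The choice between gap-fill probes and a paginated listing might look like the sketch below. The `Plan` enum, `plan_catch_up`, and the `MAX_PROBE_GAP` constant are hypothetical names; only the 1000-version threshold and the `io_parallelism()` bound come from the text above. The real code presumably drives a bounded concurrent stream of HEADs, which fixed-size batches only approximate here.

```rust
/// Threshold from the description above (the constant name is hypothetical).
const MAX_PROBE_GAP: u64 = 1000;

/// How a reader at version `since` catches up to `latest`.
#[derive(Debug, PartialEq)]
enum Plan {
    /// Nothing new to load.
    UpToDate,
    /// Probe each missing version, in batches of at most `io_parallelism`.
    Probe(Vec<Vec<u64>>),
    /// Reader is too far behind: use a single paginated listing instead.
    List,
}

fn plan_catch_up(since: u64, latest: u64, io_parallelism: usize) -> Plan {
    if latest <= since {
        return Plan::UpToDate;
    }
    if latest - since > MAX_PROBE_GAP {
        return Plan::List;
    }
    let versions: Vec<u64> = (since + 1..=latest).collect();
    // Guard against a zero parallelism setting, which `chunks` would reject.
    let batch = io_parallelism.max(1);
    Plan::Probe(versions.chunks(batch).map(|c| c.to_vec()).collect())
}
```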
Carries on lance-format#5997 / discussion lance-format#5947, and follows up on lance-format#6728 where moving
S3 Express to a version hint was raised.
- Mark uses_version_hint as pub so the doc link from write_version_hint resolves under rustdoc.
- Update test_dir_listing_extra_calls_with_migration to expect one fewer listing call: on local FS the __manifest reload now uses the version hint (a HEAD-and-probe on _versions/latest_version_hint.json) instead of a full LIST, so table_exists / describe_table in the migration path now make only the table-directory fallback list call.
A new `concurrent_append` benchmark seeds a 100k-row base table then runs N tokio writer tasks that each loop calling `InsertBuilder::execute` on the same dataset. The output records commits/sec, per-commit latency distribution (p50/p90/p95/p99/max/mean), and the final version count, so the version-hint optimisation can be measured against S3 Standard and S3 Express directly. Designed to be driven from a single very large EC2 host so the writer count itself isn't the bottleneck. Configurable via env vars (DATASET_URI, NUM_WRITERS, APPENDS_PER_WRITER, ROWS_PER_APPEND, BASE_ROWS, KEEP_DATASET) and detects S3 Express via the `--x-s3` suffix.
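For readers who want to see how such a latency distribution can be derived from raw per-commit timings, here is a minimal sketch. `LatencySummary`, `percentile`, and `summarize` are hypothetical names, and the nearest-rank percentile definition is an assumption; the benchmark itself may compute its percentiles differently.

```rust
use std::time::Duration;

/// Subset of the reported distribution (struct and field names are hypothetical).
#[derive(Debug, PartialEq)]
struct LatencySummary {
    p50: Duration,
    p90: Duration,
    p95: Duration,
    p99: Duration,
    max: Duration,
    mean: Duration,
}

/// Nearest-rank percentile over a sorted, non-empty slice.
/// Integer arithmetic avoids floating-point rounding at the rank boundary.
fn percentile(sorted: &[Duration], pct: usize) -> Duration {
    let rank = (pct * sorted.len() + 99) / 100; // ceil(pct * n / 100)
    sorted[rank.clamp(1, sorted.len()) - 1]
}

/// Summarize per-commit latencies; returns None when there are no samples.
fn summarize(mut samples: Vec<Duration>) -> Option<LatencySummary> {
    if samples.is_empty() {
        return None;
    }
    samples.sort();
    let total: Duration = samples.iter().sum();
    Some(LatencySummary {
        p50: percentile(&samples, 50),
        p90: percentile(&samples, 90),
        p95: percentile(&samples, 95),
        p99: percentile(&samples, 99),
        max: *samples.last()?,
        mean: total / samples.len() as u32,
    })
}
```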
Setting BASE_ROWS=0 now creates the dataset with a single zero-row batch so writers begin at version 1 with no data, instead of the previous ~100k-row seed.
The hint is now controlled by a process-wide env var (read once via OnceLock) that overrides every store-type check. Setting LANCE_USE_VERSION_HINT=0 (or false / off) makes write_version_hint a no-op, makes current_manifest_path skip the hint probe, and makes CommitHandler::list_manifest_locations_since fall back to the listing path on every store — so the same binary can be benchmarked with and without the optimization, and operators have a clear escape hatch if it ever misbehaves.
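A process-wide switch read once through `OnceLock` could be sketched like this. The function names `parse_flag` and `version_hint_enabled` are hypothetical, and two behaviors are assumptions rather than facts from this PR: that an unset variable leaves the hint enabled, and that values are matched case-insensitively.

```rust
use std::sync::OnceLock;

/// Interpret the raw env-var value. "0", "false", and "off" disable the hint;
/// treating anything else (including unset) as enabled is an assumption.
fn parse_flag(value: Option<&str>) -> bool {
    match value.map(|v| v.trim().to_ascii_lowercase()) {
        Some(v) => !matches!(v.as_str(), "0" | "false" | "off"),
        None => true,
    }
}

/// Process-wide gate, evaluated once on first use and cached afterwards.
/// Later changes to the environment are therefore not observed.
fn version_hint_enabled() -> bool {
    static ENABLED: OnceLock<bool> = OnceLock::new();
    *ENABLED.get_or_init(|| {
        parse_flag(std::env::var("LANCE_USE_VERSION_HINT").ok().as_deref())
    })
}
```

Splitting the parsing from the cached read keeps the parsing testable without mutating the process environment.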
Lets each writer stop after a wall-clock budget instead of always finishing APPENDS_PER_WRITER commits, so high-concurrency runs (where contention drags per-commit latency up) don't run unbounded.
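The stop condition can be illustrated with a small synchronous sketch. `run_writer` here is a simplified, hypothetical stand-in for the benchmark's async writer task: the `append` closure represents one commit attempt and returns whether it succeeded, and the budget is checked only between attempts.

```rust
use std::time::{Duration, Instant};

/// Run up to `max_appends` attempts, stopping early once the optional
/// wall-clock budget is spent. Returns the number of successful commits.
fn run_writer(
    max_appends: usize,
    budget: Option<Duration>,
    mut append: impl FnMut() -> bool,
) -> usize {
    let deadline = budget.map(|b| Instant::now() + b);
    let mut committed = 0;
    for _ in 0..max_appends {
        // Checked before each attempt; an in-flight attempt is not interrupted.
        if matches!(deadline, Some(d) if Instant::now() >= d) {
            break;
        }
        if append() {
            committed += 1;
        }
    }
    committed
}
```

Because the check runs only between attempts, a single slow attempt can still overrun the budget, so a separate per-attempt timeout is needed to bound the total wall time.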
Lets the driver bound a run's total wall to MAX_WALL_SECS + per-attempt timeout, even when contention pushes a single commit attempt's retry chain past several minutes.
… link CI clippy was tripping on the 8-arg run_writer signature; tag it explicitly. rustdoc was rejecting the link from public uses_version_hint to the private VERSION_HINT_ENV const, so inline the env-var name instead.
Drop the env-var mention (implementation detail) and drop the specific non-lex store examples; describe what the file is and the contract readers can rely on, not which stores choose to write it.
|
thanks for pushing this through! |
Carries on #5997 (and the benchmarking in discussion #5947), and follows up on #6728 where moving S3 Express away from O(n) manifest listing to a version hint was raised — picking that up here.
What
On object stores where `list` is not lexicographically ordered (e.g. S3 Express, the local filesystem), resolving the latest manifest version is O(n) in the number of versions. To avoid this, after every successful commit on such a store we write a small JSON file `_versions/latest_version_hint.json` with content `{"version":N}`. A reader then does a GET on the hint file plus a few HEAD probes (O(k), where k = versions added since the hint was written), and falls back to a full listing if the hint is missing (older datasets) or stale.
`current_manifest_path` uses the hint for non-lexically-ordered, non-local stores (the local filesystem keeps its existing single-directory-read fast path); `CommitHandler::list_manifest_locations_since` (used by `load_new_transactions`) follows the same strategy.
A transient (non-`NotFound`) object-store error while probing abandons the hint path, so the caller falls back to a full listing rather than trusting a possibly stale or incomplete result. The gap-fill HEADs are bounded by `io_parallelism()`, and a far-behind reader (gap > 1000) falls back to a single paginated listing.
Differences from #5997
`current_manifest_path` picks one strategy based on the store rather than racing a HEAD-probe against a listing, keeping IO behavior deterministic.
A `manifest_commit` benchmark is included to measure commit/load latency growth with many small fragments.