feat: add manifest version hint for fast latest-version lookup by touch-of-grey · Pull Request #6752 · lance-format/lance

touch-of-grey · 2026-05-13T03:05:02Z

Carries on #5997 (and the benchmarking in discussion #5947), and follows up on #6728 where moving S3 Express away from O(n) manifest listing to a version hint was raised — picking that up here.

What

On object stores where list is not lexicographically ordered (e.g. S3 Express, the local filesystem), resolving the latest manifest version is O(n) in the number of versions. To avoid this, after every successful commit on such a store we write a small JSON file _versions/latest_version_hint.json with content {"version":N}. A reader then does a GET on the hint file plus a few HEAD probes (O(k), where k = versions added since the hint was written), and falls back to a full listing if the hint is missing (older datasets) or stale.
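As a rough illustration of what the hint file holds, the sketch below builds and parses the `{"version":N}` body using only the standard library. The function names `hint_body` and `parse_hint` are hypothetical, and the hand-rolled parser is an assumption made to keep the example dependency-free; it is not the code in this PR.

```rust
/// Build the hint body for a given version, e.g. {"version":42}.
/// (Hypothetical helper name; the format itself is from the PR description.)
pub fn hint_body(version: u64) -> String {
    format!("{{\"version\":{}}}", version)
}

/// Parse a hint body back into a version number.
/// Returns None for anything that is not a single-key {"version":N} object,
/// which a reader would treat the same as a missing hint.
pub fn parse_hint(body: &str) -> Option<u64> {
    // Strip the surrounding braces of the JSON object.
    let inner = body.trim().strip_prefix('{')?.strip_suffix('}')?;
    // Split the single key/value pair.
    let (key, value) = inner.split_once(':')?;
    if key.trim() != "\"version\"" {
        return None;
    }
    value.trim().parse::<u64>().ok()
}
```

An unparseable body maps to `None` here so that a caller can fall back to a full listing, matching the "hint is missing" case described above.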

The hint is written/read only on non-lexically-ordered stores. On S3 Standard / GCS / Azure / OSS / Tencent / DynamoDB / memory the ordered listing already resolves the latest version in roughly one request, so the hint would only add a PUT per commit for nothing.
current_manifest_path uses the hint for non-lexically-ordered, non-local stores (the local filesystem keeps its existing single-directory-read fast path); CommitHandler::list_manifest_locations_since (used by load_new_transactions) follows the same strategy.
The hint write is awaited as part of the commit (no fire-and-forget mode). It is best-effort: failures are logged and ignored, since the hint only accelerates reads and never affects correctness — readers always verify the hinted version and probe upward from it. Detached versions are never written to the hint.
A transient (non-NotFound) object-store error while probing abandons the hint path so the caller falls back to a full listing rather than trust a possibly-stale or incomplete result. The gap-fill HEADs are bounded by io_parallelism(), and a far-behind reader (gap > 1000) falls back to a single paginated listing.
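The read path described above can be sketched against an in-memory stand-in for the object store. Everything here is illustrative: `MockStore`, `Probe`, `Resolution`, and `resolve_latest` are hypothetical names, the probes run sequentially (the PR bounds concurrent HEADs by `io_parallelism()`), and treating a hint that points at a missing version as "fall back" is an assumption about what "stale" covers. Only the 1000-version threshold comes from the description.

```rust
use std::collections::BTreeSet;
use std::ops::RangeInclusive;

/// Gap threshold from the PR description (gap > 1000 falls back to listing).
pub const MAX_PROBE_GAP: u64 = 1000;

/// Outcome of a single HEAD probe on a manifest version.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Probe {
    Found,
    NotFound,
    Transient, // any non-NotFound object-store error
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Resolution {
    FromHint(u64),
    FallBackToListing,
}

/// Hypothetical in-memory store: which versions exist, and which probes fail.
pub struct MockStore {
    versions: BTreeSet<u64>,
    flaky: BTreeSet<u64>,
}

impl MockStore {
    pub fn new(versions: RangeInclusive<u64>, flaky: &[u64]) -> Self {
        Self {
            versions: versions.collect(),
            flaky: flaky.iter().copied().collect(),
        }
    }

    pub fn head(&self, version: u64) -> Probe {
        if self.flaky.contains(&version) {
            Probe::Transient
        } else if self.versions.contains(&version) {
            Probe::Found
        } else {
            Probe::NotFound
        }
    }
}

/// Verify the hinted version, then probe upward until the first NotFound.
pub fn resolve_latest(store: &MockStore, hint: Option<u64>) -> Resolution {
    // No hint file (older dataset): use the listing path.
    let Some(start) = hint else {
        return Resolution::FallBackToListing;
    };
    // Readers always verify the hinted version before trusting it.
    if store.head(start) != Probe::Found {
        return Resolution::FallBackToListing;
    }
    let mut latest = start;
    loop {
        // A far-behind reader gives up on probing and lists instead.
        if latest - start > MAX_PROBE_GAP {
            return Resolution::FallBackToListing;
        }
        match store.head(latest + 1) {
            Probe::Found => latest += 1,
            Probe::NotFound => return Resolution::FromHint(latest),
            // A transient error makes the probed range untrustworthy.
            Probe::Transient => return Resolution::FallBackToListing,
        }
    }
}
```

With versions 1 through 5 present and a hint of 3, the sketch issues probes for 3, 4, 5, and 6 and settles on 5; the cost grows with the number of versions committed since the hint was written rather than with the total version count.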

Differences from #5997

Only the JSON hint format is kept (the alternative file-size-encoded format and its env var are dropped).
The fire-and-forget / async hint-write mode is removed — the hint is always written synchronously, which keeps concurrent writes simpler with no meaningful latency cost.
The hint is gated to non-lexically-ordered stores, where it's actually read.
current_manifest_path picks one strategy based on the store rather than racing a HEAD-probe against a listing, keeping IO behavior deterministic.

A manifest_commit benchmark is included to measure commit/load latency growth with many small fragments.

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

jackye1995 · 2026-05-13T05:23:06Z

Thanks for carrying on this work 🙏

One outcome of the discussion on this topic in the last community sync was that we would like to see how conflict resolution and commit rate behave when there are multiple writers. I think we need to add that benchmark, with results, before we can be sure this new approach does not regress performance.

codecov · 2026-05-13T06:44:09Z

Codecov Report

❌ Patch coverage is 96.16368% with 15 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| rust/lance-table/src/io/commit.rs | 95.17% | 9 Missing and 6 partials ⚠️ |

touch-of-grey · 2026-05-13T08:16:48Z

@jackye1995 added a concurrent_append benchmark (rust/lance/benches/concurrent_append.rs) and ran it on a c7i.48xlarge against both S3 Standard and S3 Express, with this PR and against main as a baseline. The setup is a 100k-row base table and 32 concurrent writer tasks, each performing 100 appends of 100 rows (32 × 100 = 3200 commits per run). All commits succeeded in every run, so the new approach handles concurrent conflict resolution without introducing failures.

Results

| Run | Wall time | Throughput | p50 | p95 | p99 | mean | max |
| --- | --- | --- | --- | --- | --- | --- | --- |
| `main` / S3 Express | 1414.00 s | 2.26 commits/s | 144 ms | 186 ms | 3.22 s | 3.25 s | 1395 s |
| VersionHint / S3 Express | 694.86 s | 4.61 commits/s | 39 ms | 47 ms | 4.65 s | 1.66 s | 691 s |
| VersionHint / S3 Standard | 1793.68 s | 1.78 commits/s | 157 ms | 275 ms | 2.93 s | 4.45 s | 1775 s |

(main / S3 Standard was not re-run separately because the hint is gated to non-lexically-ordered stores, so on S3 Standard this PR's commit path is identical to main.)

Takeaways

S3 Express (the case this PR targets) is ~2× faster with the hint than on main: throughput goes from 2.26 → 4.61 commits/s, and steady-state per-attempt latency goes from 144 → 39 ms p50 and 186 → 47 ms p95. That comes from the conflict-resolution path no longer doing an O(n) listing on every retry.
S3 Standard is unchanged. The hint is only written/read on non-lexically-ordered stores; on S3 Standard / GCS / Azure the ordered listing already resolves the latest version in roughly one list request, so commits on those stores don't pay an extra PUT or take a different read path. The 1.78 commits/s on S3 Standard reflects S3's higher per-request latency vs Express, not anything this PR changes.
Zero failures across all 3 runs (3200 commits each). The default 20-retry conflict resolution keeps converging even with 32 writers hammering the same dataset.
The tail (p99 / max) is dominated by conflict-retry pile-ups under sustained 32-way concurrency. The hint roughly halves the tail (max 1395 s → 691 s on Express), but doesn't eliminate it — that's the next thing worth looking at if we want to push concurrent throughput further.
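The throughput column and the "~2×" figure follow directly from the wall times in the results table. A small worked check, using hypothetical helper names `throughput` and `speedup` and only numbers quoted above:

```rust
/// Commits per second for a run: total commits divided by wall-clock seconds.
pub fn throughput(commits: u32, wall_secs: f64) -> f64 {
    f64::from(commits) / wall_secs
}

/// Ratio of the new throughput to the baseline throughput.
pub fn speedup(new_rate: f64, baseline_rate: f64) -> f64 {
    new_rate / baseline_rate
}

/// 32 writers x 100 appends each, as stated in the benchmark setup.
pub const COMMITS_PER_RUN: u32 = 32 * 100;
```

For example, `throughput(COMMITS_PER_RUN, 1414.00)` is about 2.26 commits/s and `throughput(COMMITS_PER_RUN, 694.86)` is about 4.61 commits/s, which gives a ratio just above 2.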

Reproducing

export AWS_REGION=us-east-1
export NUM_WRITERS=32 APPENDS_PER_WRITER=100 ROWS_PER_APPEND=100 BASE_ROWS=100000

# S3 Standard
export DATASET_URI=s3://your-bucket/bench/concurrent_append
cargo bench --release --bench concurrent_append

# S3 Express (the `--x-s3` suffix is auto-detected to set s3_express=true)
export DATASET_URI=s3://your-bucket--use1-az4--x-s3/bench/concurrent_append
cargo bench --release --bench concurrent_append

wjones127 · 2026-05-13T21:54:56Z

Thanks @touch-of-grey, those are good results!

I'm good with these changes, but would like @jackye1995 to also have a chance to take a look through it.

touch-of-grey · 2026-05-14T09:40:08Z

Ran the full scaling sweep on a c7i.48xlarge (us-east-1d / use1-az1), 11 writer counts × 4 cases. The bench now also exposes LANCE_USE_VERSION_HINT=0/1 so the baseline and the new path can be compared from the same binary (no main re-build / file-swap needed), plus MAX_WALL_SECS and PER_ATTEMPT_TIMEOUT_SECS so every config has a bounded wall.

Setup

Fresh empty table per run (BASE_ROWS=0).
10-row appends, each writer commits as fast as it can.
MAX_WALL_SECS=30: each writer schedules new commits for ~30s.
PER_ATTEMPT_TIMEOUT_SECS=30: any single attempt (including retries) is capped at 30s, then counted as a failure and the writer reloads.
4 cases: {LANCE_USE_VERSION_HINT=0, =1} × {S3 Standard, S3 Express}.
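To make the two time limits concrete, here is a minimal simulation of one writer's loop with a virtual clock. It is a sketch under stated assumptions: `simulate_writer` and `WriterStats` are hypothetical names, attempt durations are supplied as plain numbers instead of real commits, and a timed-out attempt is charged exactly the timeout.

```rust
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct WriterStats {
    pub succeeded: u32,
    pub failed: u32,
    pub elapsed_secs: u64,
}

/// Run attempts until the wall budget is spent.
/// `attempt_secs` gives how long each attempt would take if left uncapped.
pub fn simulate_writer(
    max_wall_secs: u64,
    per_attempt_timeout_secs: u64,
    attempt_secs: &[u64],
) -> WriterStats {
    let mut stats = WriterStats::default();
    for &duration in attempt_secs {
        // New commits are only scheduled while the budget has time left.
        if stats.elapsed_secs >= max_wall_secs {
            break;
        }
        if duration > per_attempt_timeout_secs {
            // The attempt is cut off at the timeout and counted as a failure.
            stats.failed += 1;
            stats.elapsed_secs += per_attempt_timeout_secs;
        } else {
            stats.succeeded += 1;
            stats.elapsed_secs += duration;
        }
    }
    stats
}
```

Because an attempt can start just before the budget expires and then run for up to the timeout, the writer's total wall time in this model is bounded by `max_wall_secs + per_attempt_timeout_secs`, i.e. 60 s for the settings listed above.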

Throughput (commits/sec)

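| N (writers) | baseline / S3 | hint / S3 | baseline / S3 Express | hint / S3 Express |
| --- | --- | --- | --- | --- |
| 10 | 8.91 | 7.07 | 14.97 | 23.84 |
| 20 | 8.10 | 5.79 | 14.29 | 22.17 |
| 50 | 6.74 | 7.06 | 11.90 | 20.60 |
| 100 | 6.49 | 7.00 | 9.91 | 15.57 |
| 200 | 7.15 | 6.49 | 9.96 | 10.74 |
| 300 | 8.29 | 6.76 | 8.26 | 9.95 |
| 400 | 7.88 | 8.22 | 7.39 | 8.46 |
| 500 | 7.62 | 8.66 | 6.33 | 6.89 |
| 600 | 7.08 | 7.92 | 6.99 | 6.97 |
| 700 | 7.74 | 8.04 | 6.95 | 4.15 |
| 800 | 7.62 | 8.16 | 6.58 | 4.60 |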

p50 commit-attempt latency (ms)

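| N (writers) | baseline / S3 | hint / S3 | baseline / S3 Express | hint / S3 Express |
| --- | --- | --- | --- | --- |
| 10 | 102 | 107 | 63 | 41 |
| 20 | 109 | 110 | 64 | 40 |
| 50 | 105 | 112 | 64 | 41 |
| 100 | 110 | 103 | 67 | 42 |
| 200 | 109 | 110 | 75 | 43 |
| 300 | 108 | 112 | 62 | 44 |
| 400 | 109 | 110 | 68 | 45 |
| 500 | 112 | 104 | 70 | 53 |
| 600 | 117 | 112 | 69 | 46 |
| 700 | 111 | 108 | 64 | 51 |
| 800 | 110 | 106 | 66 | 68 |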

Takeaways

S3 Standard is untouched. baseline/S3 and hint/S3 are statistically indistinguishable across the whole range (the hint is gated off on lexically-ordered stores). No regression.
S3 Express benefits clearly in the realistic regime (N ≤ 200). With the hint, p50 drops from ~65 ms to ~40 ms (1.6×) and throughput rises 1.6× at low N (24 vs 15 at N=10) and 1.6× even at N=100 (15.6 vs 9.9). The conflict-rebase path no longer re-lists every commit on S3 Express.
At very high concurrency (N ≥ 300) all four cases converge to ~5–9 commits/s. The bottleneck stops being "find the latest version" and becomes the conflict-retry storm itself — every commit collides, every commit rebases, and the system is throughput-limited by the conditional PUT on the same key. The hint can't help once that's the dominant cost.
hint / S3 Express regresses at N=700–800 (p50 jumps to ~5s, throughput drops to 4 commits/s). Almost certainly a connection-pool / fd-pressure effect at this writer count; the run still produces non-zero commits and zero data corruption.
No data corruption. Across all 44 runs the final dataset versions match the succeeded counts — failures are 30s per-attempt timeouts, not actual write errors.

Disabling the hint at runtime

Anyone (operators, future benchmarks) can run with the hint off via:

export LANCE_USE_VERSION_HINT=0

Read once at first use; affects writes (no hint PUT), current_manifest_path, and CommitHandler::list_manifest_locations_since.
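A minimal sketch of a read-once switch of this kind, using `std::sync::OnceLock`. The function names are hypothetical, and the trimming and case-insensitive matching are assumptions; the accepted "off" spellings (`0`, `false`, `off`) are the ones named in the commit message. The sketch models only the off switch and leaves the store-type gating out.

```rust
use std::sync::OnceLock;

/// Interpret the raw env-var value. Unset or any other value leaves the
/// hint enabled (assumption: only explicit "off" spellings disable it).
pub fn parse_hint_flag(raw: Option<&str>) -> bool {
    match raw {
        Some(value) => !matches!(
            value.trim().to_ascii_lowercase().as_str(),
            "0" | "false" | "off"
        ),
        None => true,
    }
}

/// Read LANCE_USE_VERSION_HINT once and cache the answer for the process.
pub fn version_hint_enabled() -> bool {
    static ENABLED: OnceLock<bool> = OnceLock::new();
    *ENABLED.get_or_init(|| {
        parse_hint_flag(std::env::var("LANCE_USE_VERSION_HINT").ok().as_deref())
    })
}
```

Because the value is cached on first use, changing the variable after the first call has no effect in this model, which is why it needs to be exported before the process starts.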

jackye1995

Looks good to me, just a nit regarding the change in the format spec. I think we will need a vote to add this? Please start a vote thread in discussions.

touch-of-grey · 2026-05-15T08:34:54Z

@jackye1995 @wjones127 spec change vote thread is up: #6797 (1-week minimum, 3 binding PMC +1 needed). Also pushed the two doc nits from the review (ef302f030): dropped the LANCE_USE_VERSION_HINT env-var line from the spec, and dropped the specific store examples from the version-hint section so it just describes the file and the contract.

jackye1995 · 2026-05-19T18:38:52Z

@wjones127 the vote has passed, any further comments?

On object stores where listing is not lexicographically ordered (e.g. S3 Express, the local filesystem), resolving the latest manifest version is O(n) in the number of versions. After every successful commit on such a store, write a small JSON file `_versions/latest_version_hint.json` (`{"version":N}`); readers use it as a starting point and probe a few higher versions with HEAD requests (O(k), k = versions added since the hint was written), falling back to a full listing if the hint is missing (older datasets) or stale, or if a transient object-store error makes the probed range untrustworthy.

The hint is written/read only on non-lexically-ordered stores; on S3 Standard / GCS / Azure / DynamoDB / memory the ordered listing already resolves the latest version in roughly one request. The write is awaited as part of the commit (no fire-and-forget mode) and is best-effort: failures are logged and ignored, since the hint only accelerates reads and never affects correctness. Detached versions are never hinted.

`current_manifest_path` uses the hint for non-lexically-ordered, non-local stores (the local filesystem keeps its single-directory-read fast path); `CommitHandler::list_manifest_locations_since` (used by `load_new_transactions`) follows the same strategy, with the gap-fill HEADs bounded by `io_parallelism()` and a fallback to a single paginated listing once a reader is more than 1000 versions behind.

Carries on lance-format#5997 / discussion lance-format#5947, and follows up on lance-format#6728 where moving S3 Express to a version hint was raised.

- Mark uses_version_hint as pub so the doc link from write_version_hint resolves under rustdoc.
- Update test_dir_listing_extra_calls_with_migration to expect one fewer listing call: on local FS the __manifest reload now uses the version hint (a HEAD-and-probe on _versions/latest_version_hint.json) instead of a full LIST, so table_exists / describe_table in the migration path now make only the table-directory fallback list call.

A new `concurrent_append` benchmark seeds a 100k-row base table then runs N tokio writer tasks that each loop calling `InsertBuilder::execute` on the same dataset. The output records commits/sec, per-commit latency distribution (p50/p90/p95/p99/max/mean), and the final version count, so the version-hint optimisation can be measured against S3 Standard and S3 Express directly. Designed to be driven from a single very large EC2 host so the writer count itself isn't the bottleneck. Configurable via env vars (DATASET_URI, NUM_WRITERS, APPENDS_PER_WRITER, ROWS_PER_APPEND, BASE_ROWS, KEEP_DATASET) and detects S3 Express via the `--x-s3` suffix.
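For readers unfamiliar with how a latency distribution like p50/p95/p99 is summarised, here is a minimal nearest-rank percentile sketch. The nearest-rank method and the names `percentile` and `mean_ms` are assumptions for illustration; the benchmark may compute its percentiles differently.

```rust
/// Nearest-rank percentile over latencies that are already sorted ascending.
/// `p` is a whole-number percentage in 0..=100.
pub fn percentile(sorted_ms: &[u64], p: usize) -> Option<u64> {
    if sorted_ms.is_empty() || p > 100 {
        return None;
    }
    // rank = ceil(p/100 * n), done in integer arithmetic; clamp to at least 1.
    let rank = (p * sorted_ms.len() + 99) / 100;
    Some(sorted_ms[rank.max(1) - 1])
}

/// Arithmetic mean of the latencies, or None for an empty sample.
pub fn mean_ms(latencies_ms: &[u64]) -> Option<f64> {
    if latencies_ms.is_empty() {
        return None;
    }
    let total: u64 = latencies_ms.iter().sum();
    Some(total as f64 / latencies_ms.len() as f64)
}
```

A few very slow commits barely move p50 or p95 but can pull the mean well above them, which is the pattern visible in the results table where the mean sits near or above p99.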

The hint is now controlled by a process-wide env var (read once via OnceLock) that overrides every store-type check. Setting LANCE_USE_VERSION_HINT=0 (or false / off) makes write_version_hint a no-op, makes current_manifest_path skip the hint probe, and makes CommitHandler::list_manifest_locations_since fall back to the listing path on every store — so the same binary can be benchmarked with and without the optimization, and operators have a clear escape hatch if it ever misbehaves.

… link CI clippy was tripping on the 8-arg run_writer signature; tag it explicitly. rustdoc was rejecting the link from public uses_version_hint to the private VERSION_HINT_ENV const, so inline the env-var name instead.

jackye1995 · 2026-05-19T20:13:27Z

thanks for pushing this through!

claude Bot reviewed May 13, 2026

github-actions Bot added the enhancement, A-python, and A-java labels May 13, 2026

touch-of-grey force-pushed the VersionHint branch from fa96252 to d326472 Compare May 13, 2026 03:45

wjones127 self-assigned this May 13, 2026

jackye1995 approved these changes May 15, 2026

Comment thread docs/src/format/table/layout.md Outdated

touch-of-grey force-pushed the VersionHint branch from ef302f0 to b16a2d4 Compare May 19, 2026 06:22

touch-of-grey added 9 commits May 19, 2026 12:16

bench: allow concurrent_append to start from a fully empty table

c48d005

Setting BASE_ROWS=0 now creates the dataset with a single zero-row batch so writers begin at version 1 with no data, instead of the previous ~100k-row seed.

bench: add MAX_WALL_SECS time-budget cap for concurrent_append

6f0cdce

Lets each writer stop after a wall-clock budget instead of always finishing APPENDS_PER_WRITER commits, so high-concurrency runs (where contention drags per-commit latency up) don't run unbounded.

bench: add PER_ATTEMPT_TIMEOUT_SECS to cap commit-attempt wall time

90ee182

Lets the driver bound a run's total wall to MAX_WALL_SECS + per-attempt timeout, even when contention pushes a single commit attempt's retry chain past several minutes.

docs(spec): tighten version-hint section per review

a191615

Drop the env-var mention (implementation detail) and drop the specific non-lex store examples; describe what the file is and the contract readers can rely on, not which stores choose to write it.

jackye1995 force-pushed the VersionHint branch from b16a2d4 to a191615 Compare May 19, 2026 19:18

jackye1995 merged commit dd887ec into lance-format:main May 19, 2026
29 checks passed
