feat: add manifest version hint for fast latest-version lookup #6752

Merged
jackye1995 merged 9 commits into lance-format:main from touch-of-grey:VersionHint
May 19, 2026

touch-of-grey (Contributor) commented May 13, 2026

Carries on #5997 (and the benchmarking in discussion #5947), and follows up on #6728 where moving S3 Express away from O(n) manifest listing to a version hint was raised — picking that up here.

What

On object stores where list is not lexicographically ordered (e.g. S3 Express, the local filesystem), resolving the latest manifest version is O(n) in the number of versions. To avoid this, after every successful commit on such a store we write a small JSON file _versions/latest_version_hint.json with content {"version":N}. A reader then does a GET on the hint file plus a few HEAD probes (O(k), where k = versions added since the hint was written), and falls back to a full listing if the hint is missing (older datasets) or stale.

  • The hint is written/read only on non-lexically-ordered stores. On S3 Standard / GCS / Azure / OSS / Tencent / DynamoDB / memory the ordered listing already resolves the latest version in roughly one request, so the hint would only add a PUT per commit for nothing.
  • current_manifest_path uses the hint for non-lexically-ordered, non-local stores (the local filesystem keeps its existing single-directory-read fast path); CommitHandler::list_manifest_locations_since (used by load_new_transactions) follows the same strategy.
  • The hint write is awaited as part of the commit (no fire-and-forget mode). It is best-effort: failures are logged and ignored, since the hint only accelerates reads and never affects correctness — readers always verify the hinted version and probe upward from it. Detached versions are never written to the hint.
  • A transient (non-NotFound) object-store error while probing abandons the hint path so the caller falls back to a full listing rather than trust a possibly-stale or incomplete result. The gap-fill HEADs are bounded by io_parallelism(), and a far-behind reader (gap > 1000) falls back to a single paginated listing.
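
The read path described above can be sketched as follows. This is a minimal, std-only illustration, not the code in rust/lance-table/src/io/commit.rs: `MockStore`, `parse_hint`, and `resolve_latest` are hypothetical names, the store is an in-memory set rather than a real object store, and treating a hint whose version does not exist as a reason to fall back to listing is an assumption.

```rust
use std::collections::BTreeSet;

/// Hypothetical in-memory stand-in for a dataset's `_versions/` directory.
pub struct MockStore {
    /// Manifest versions that exist (what a HEAD probe would find).
    pub manifests: BTreeSet<u64>,
    /// Contents of `_versions/latest_version_hint.json`, if present.
    pub hint_json: Option<String>,
}

/// Parse `{"version":N}` without a JSON dependency (sketch only).
pub fn parse_hint(json: &str) -> Option<u64> {
    let rest = json.trim().strip_prefix('{')?.strip_suffix('}')?;
    let (key, value) = rest.split_once(':')?;
    if key.trim() != "\"version\"" {
        return None;
    }
    value.trim().parse().ok()
}

/// Resolve the latest version: read the hint, verify the hinted version,
/// then probe upward one version at a time. Falls back to a full listing
/// when the hint is missing, unparseable, or does not verify.
/// Returns `(latest_version, used_full_listing)`.
pub fn resolve_latest(store: &MockStore) -> Option<(u64, bool)> {
    // O(n) fallback: scan everything and take the maximum.
    let full_listing = || store.manifests.iter().next_back().copied().map(|v| (v, true));

    let Some(hinted) = store.hint_json.as_deref().and_then(parse_hint) else {
        return full_listing();
    };
    if !store.manifests.contains(&hinted) {
        return full_listing(); // assumption: an unverifiable hint is not trusted
    }
    let mut latest = hinted;
    // O(k): one HEAD-like probe per version added since the hint was written.
    while store.manifests.contains(&(latest + 1)) {
        latest += 1;
    }
    Some((latest, false))
}
```

With manifests 1 through 7 and a hint of `{"version":5}`, the sketch verifies version 5, finds 6 and 7, and stops at the missing version 8 without listing anything.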

Differences from #5997

  • Only the JSON hint format is kept (the alternative file-size-encoded format and its env var are dropped).
  • The fire-and-forget / async hint-write mode is removed — the hint is always written synchronously, which keeps concurrent writes simpler with no meaningful latency cost.
  • The hint is gated to non-lexically-ordered stores, where it's actually read.
  • current_manifest_path picks one strategy based on the store rather than racing a HEAD-probe against a listing, keeping IO behavior deterministic.

A manifest_commit benchmark is included to measure commit/load latency growth with many small fragments.

github-actions bot added the enhancement (New feature or request), A-python (Python bindings), and A-java (Java bindings + JNI) labels May 13, 2026

jackye1995 (Contributor) commented

Thanks for carrying on this work 🙏

One outcome of the discussion on this topic in the last community sync was that we would like to see how conflict resolution and commit rate behave when there are multiple writers. I think we need to add that benchmark, with results, before we can be sure this new approach does not regress performance.

wjones127 self-assigned this May 13, 2026

codecov bot commented May 13, 2026

Codecov Report

❌ Patch coverage is 96.16368% with 15 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| rust/lance-table/src/io/commit.rs | 95.17% | 9 Missing and 6 partials ⚠️ |

touch-of-grey (Contributor, Author) commented May 13, 2026

@jackye1995 I added a concurrent_append benchmark (rust/lance/benches/concurrent_append.rs) and ran it on a c7i.48xlarge against both S3 Standard and S3 Express, on this PR and on main as a baseline. Setup: a 100k-row base table and 32 concurrent writer tasks, each performing 100 appends of 100 rows (3200 commits per run). All commits succeeded in every run, so the new approach handles concurrent conflict resolution without introducing failures.

Results

| Run | Wall time | Throughput | p50 | p95 | p99 | mean | max |
| --- | --- | --- | --- | --- | --- | --- | --- |
| main / S3 Express | 1414.00 s | 2.26 commits/s | 144 ms | 186 ms | 3.22 s | 3.25 s | 1395 s |
| VersionHint / S3 Express | 694.86 s | 4.61 commits/s | 39 ms | 47 ms | 4.65 s | 1.66 s | 691 s |
| VersionHint / S3 Standard | 1793.68 s | 1.78 commits/s | 157 ms | 275 ms | 2.93 s | 4.45 s | 1775 s |

(main / S3 Standard was not re-run separately because the hint is gated to non-lexically-ordered stores, so on S3 Standard this PR's commit path is identical to main.)
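
As a quick sanity check on the table, the throughput column is the 3200 commits per run divided by the wall time. The helper names below (`throughput`, `round2`, `COMMITS_PER_RUN`) are illustrative and not part of the benchmark.

```rust
/// Total commits per run: 32 writers x 100 appends each.
pub const COMMITS_PER_RUN: u32 = 32 * 100;

/// Commits per second for a run with the given wall time.
pub fn throughput(commits: u32, wall_secs: f64) -> f64 {
    f64::from(commits) / wall_secs
}

/// Round to two decimals, matching the precision used in the table.
pub fn round2(x: f64) -> f64 {
    (x * 100.0).round() / 100.0
}
```

For example, `round2(throughput(COMMITS_PER_RUN, 694.86))` reproduces the 4.61 commits/s reported for VersionHint / S3 Express, and the ratio of the two S3 Express wall times (1414.00 / 694.86) is about 2.03.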

Takeaways

  • S3 Express (the case this PR targets) is ~2× faster with the hint than on main: throughput goes from 2.26 → 4.61 commits/s, and steady-state per-attempt latency goes from 144 → 39 ms p50 and 186 → 47 ms p95. That comes from the conflict-resolution path no longer doing an O(n) listing on every retry.
  • S3 Standard is unchanged. The hint is only written/read on non-lexically-ordered stores; S3 Standard / GCS / Azure listings already resolve the latest version in roughly one ordered list request, so commits on those stores don't pay an extra PUT or take a different read path. The 1.78 commits/s on S3 Standard reflects S3's higher per-request latency vs Express, not anything this PR changes.
  • Zero failures across all 3 runs (3200 commits each). The default 20-retry conflict resolution keeps converging even with 32 writers hammering the same dataset.
  • The tail (p99 / max) is dominated by conflict-retry pile-ups under sustained 32-way concurrency. On Express the hint roughly halves the max (1395 s → 691 s) and the mean (3.25 s → 1.66 s), although p99 is somewhat higher (3.22 s → 4.65 s). The tail is not eliminated; that's the next thing worth looking at if we want to push concurrent throughput further.

Reproducing

```shell
export AWS_REGION=us-east-1
export NUM_WRITERS=32 APPENDS_PER_WRITER=100 ROWS_PER_APPEND=100 BASE_ROWS=100000

# S3 Standard
export DATASET_URI=s3://your-bucket/bench/concurrent_append
cargo bench --release --bench concurrent_append

# S3 Express (the `--x-s3` suffix is auto-detected to set s3_express=true)
export DATASET_URI=s3://your-bucket--use1-az4--x-s3/bench/concurrent_append
cargo bench --release --bench concurrent_append
```

wjones127 (Contributor) commented

Thanks @touch-of-grey, those are good results!

I'm good with these changes, but would like @jackye1995 to also have a chance to take a look through it.

touch-of-grey (Contributor, Author) commented May 14, 2026

Ran the full scaling sweep on a c7i.48xlarge (us-east-1d / use1-az1): 11 writer counts × 4 cases. The bench now also exposes LANCE_USE_VERSION_HINT=0/1 so the baseline and the new path can be compared from the same binary (no rebuild of main or file swap needed), plus MAX_WALL_SECS and PER_ATTEMPT_TIMEOUT_SECS so every config has a bounded wall-clock time.

Setup

  • Fresh empty table per run (BASE_ROWS=0).
  • 10-row appends, each writer commits as fast as it can.
  • MAX_WALL_SECS=30: each writer schedules new commits for ~30s.
  • PER_ATTEMPT_TIMEOUT_SECS=30: any single attempt (including retries) is capped at 30s, then counted as a failure and the writer reloads.
  • 4 cases: {LANCE_USE_VERSION_HINT=0, =1} × {S3 Standard, S3 Express}.
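
The two time limits in this setup interact as sketched below. This is a simulation with made-up attempt durations, not the benchmark's actual `run_writer`; the signature and the `WriterStats` type are hypothetical. It shows why a writer's total time stays within MAX_WALL_SECS plus one per-attempt timeout.

```rust
/// Outcome counters for one simulated writer.
#[derive(Debug, Default, PartialEq)]
pub struct WriterStats {
    pub succeeded: u32,
    pub timed_out: u32,
    /// Simulated seconds spent by this writer.
    pub elapsed_secs: f64,
}

/// Simulate one writer. `attempt_secs` gives how long each successive
/// commit attempt would take if it were allowed to run to completion.
pub fn run_writer(
    max_wall_secs: f64,
    per_attempt_timeout_secs: f64,
    attempt_secs: Vec<f64>,
) -> WriterStats {
    let mut stats = WriterStats::default();
    for duration in attempt_secs {
        // Wall budget spent: do not schedule any new commit.
        if stats.elapsed_secs >= max_wall_secs {
            break;
        }
        if duration > per_attempt_timeout_secs {
            // The attempt is cut off at the timeout and counted as a failure.
            stats.elapsed_secs += per_attempt_timeout_secs;
            stats.timed_out += 1;
        } else {
            stats.elapsed_secs += duration;
            stats.succeeded += 1;
        }
    }
    stats
}
```

Because an attempt may start just before the budget runs out and then run up to the timeout, the worst case per writer is `max_wall_secs + per_attempt_timeout_secs`, which is 60 s with the values above.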

Throughput (commits/sec)

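| N | baseline / S3 | hint / S3 | baseline / S3 Express | hint / S3 Express |
| --- | --- | --- | --- | --- |
| 10 | 8.91 | 7.07 | 14.97 | 23.84 |
| 20 | 8.10 | 5.79 | 14.29 | 22.17 |
| 50 | 6.74 | 7.06 | 11.90 | 20.60 |
| 100 | 6.49 | 7.00 | 9.91 | 15.57 |
| 200 | 7.15 | 6.49 | 9.96 | 10.74 |
| 300 | 8.29 | 6.76 | 8.26 | 9.95 |
| 400 | 7.88 | 8.22 | 7.39 | 8.46 |
| 500 | 7.62 | 8.66 | 6.33 | 6.89 |
| 600 | 7.08 | 7.92 | 6.99 | 6.97 |
| 700 | 7.74 | 8.04 | 6.95 | 4.15 |
| 800 | 7.62 | 8.16 | 6.58 | 4.60 |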

p50 commit-attempt latency (ms)

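| N | baseline / S3 | hint / S3 | baseline / S3 Express | hint / S3 Express |
| --- | --- | --- | --- | --- |
| 10 | 102 | 107 | 63 | 41 |
| 20 | 109 | 110 | 64 | 40 |
| 50 | 105 | 112 | 64 | 41 |
| 100 | 110 | 103 | 67 | 42 |
| 200 | 109 | 110 | 75 | 43 |
| 300 | 108 | 112 | 62 | 44 |
| 400 | 109 | 110 | 68 | 45 |
| 500 | 112 | 104 | 70 | 53 |
| 600 | 117 | 112 | 69 | 46 |
| 700 | 111 | 108 | 64 | 51 |
| 800 | 110 | 106 | 66 | 68 |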

Takeaways

  • S3 Standard is untouched. baseline/S3 and hint/S3 are statistically indistinguishable across the whole range (the hint is gated off on lexically-ordered stores). No regression.
  • S3 Express benefits clearly in the realistic regime (N ≤ 200). With the hint, p50 drops from ~65 ms to ~40 ms (1.6×) and throughput rises 1.6× at low N (24 vs 15 at N=10) and 1.6× even at N=100 (15.6 vs 9.9). The conflict-rebase path no longer re-lists every commit on S3 Express.
  • At very high concurrency (N ≥ 300) all four cases converge to roughly 4–10 commits/s. The bottleneck stops being "find the latest version" and becomes the conflict-retry storm itself: every commit collides, every commit rebases, and the system is throughput-limited by the conditional PUT on the same key. The hint can't help once that's the dominant cost.
  • hint / S3 Express regresses at N=700–800 (p50 jumps to ~5s, throughput drops to 4 commits/s). Almost certainly a connection-pool / fd-pressure effect at this writer count; the run still produces non-zero commits and zero data corruption.
  • No data corruption. Across all 44 runs the final dataset versions match the succeeded counts — failures are 30s per-attempt timeouts, not actual write errors.

Disabling the hint at runtime

Anyone (operators, future benchmarks) can run with the hint off via:

export LANCE_USE_VERSION_HINT=0

Read once at first use; affects writes (no hint PUT), current_manifest_path, and CommitHandler::list_manifest_locations_since.
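
A read-once switch of this kind could look roughly like the sketch below. The function names are hypothetical, and the trimming, case-insensitive matching, and "enabled when unset" default are assumptions rather than confirmed behavior of the PR; only the variable name and the `0` / `false` / `off` values come from this thread.

```rust
use std::sync::OnceLock;

/// Interpret an optional env-var value: "0", "false", or "off" disable
/// the hint; any other value, or an unset variable, leaves it enabled.
pub fn hint_enabled_from(value: Option<&str>) -> bool {
    match value {
        Some(v) => !matches!(
            v.trim().to_ascii_lowercase().as_str(),
            "0" | "false" | "off"
        ),
        None => true, // assumption: the hint stays on when the variable is unset
    }
}

/// Process-wide switch, read from the environment once at first use.
/// Later changes to the variable are not observed.
pub fn version_hint_enabled() -> bool {
    static ENABLED: OnceLock<bool> = OnceLock::new();
    *ENABLED.get_or_init(|| {
        hint_enabled_from(std::env::var("LANCE_USE_VERSION_HINT").ok().as_deref())
    })
}
```

Keeping the parsing in a pure function (`hint_enabled_from`) makes it testable without mutating the process environment, while the `OnceLock` wrapper gives the read-once behavior.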

jackye1995 (Contributor) left a comment

Looks good to me, just a nit regarding the change in the format spec. I think we will need a vote to add this? Please start a vote thread in discussions.

Comment thread docs/src/format/table/layout.md Outdated
Comment thread docs/src/format/table/layout.md Outdated
touch-of-grey (Contributor, Author) commented

@jackye1995 @wjones127 spec change vote thread is up: #6797 (1-week minimum, 3 binding PMC +1 needed). Also pushed the two doc nits from the review (ef302f030): dropped the LANCE_USE_VERSION_HINT env-var line from the spec, and dropped the specific store examples from the version-hint section so it just describes the file and the contract.

jackye1995 (Contributor) commented

@wjones127 the vote has passed, any further comments?

On object stores where listing is not lexicographically ordered (e.g. S3
Express, the local filesystem), resolving the latest manifest version is
O(n) in the number of versions. After every successful commit on such a
store, write a small JSON file `_versions/latest_version_hint.json`
(`{"version":N}`); readers use it as a starting point and probe a few
higher versions with HEAD requests (O(k), k = versions added since the
hint was written), falling back to a full listing if the hint is missing
(older datasets) or stale, or if a transient object-store error makes the
probed range untrustworthy.

The hint is written/read only on non-lexically-ordered stores — on S3
Standard / GCS / Azure / DynamoDB / memory the ordered listing already
resolves the latest version in roughly one request. The write is awaited
as part of the commit (no fire-and-forget mode) and is best-effort:
failures are logged and ignored, since the hint only accelerates reads
and never affects correctness. Detached versions are never hinted.

`current_manifest_path` uses the hint for non-lexically-ordered, non-local
stores (the local filesystem keeps its single-directory-read fast path);
`CommitHandler::list_manifest_locations_since` (used by
`load_new_transactions`) follows the same strategy, with the gap-fill
HEADs bounded by `io_parallelism()` and a fallback to a single paginated
listing once a reader is more than 1000 versions behind.
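
The catch-up decision in this paragraph can be sketched as a small planning function. This is an illustration only: `CatchUpPlan`, `plan_catch_up`, and `MAX_GAP` are hypothetical names, and modelling the `io_parallelism()` bound as fixed-size batches is an assumption (the real code may bound in-flight requests differently).

```rust
/// How a reader at version `since` catches up to `latest` (sketch).
#[derive(Debug, PartialEq)]
pub enum CatchUpPlan {
    /// Already up to date.
    Nothing,
    /// HEAD each missing version, with at most `io_parallelism`
    /// requests per batch.
    HeadProbes(Vec<Vec<u64>>),
    /// Reader is too far behind: use one paginated listing instead.
    PaginatedListing,
}

/// Gap beyond which HEAD probes are abandoned, per the commit message.
pub const MAX_GAP: u64 = 1000;

pub fn plan_catch_up(since: u64, latest: u64, io_parallelism: usize) -> CatchUpPlan {
    if latest <= since {
        return CatchUpPlan::Nothing;
    }
    if latest - since > MAX_GAP {
        return CatchUpPlan::PaginatedListing;
    }
    // Versions the reader has not seen yet, oldest first.
    let versions: Vec<u64> = (since + 1..=latest).collect();
    let batches = versions
        .chunks(io_parallelism.max(1)) // guard against a zero batch size
        .map(|chunk| chunk.to_vec())
        .collect();
    CatchUpPlan::HeadProbes(batches)
}
```

For a reader at version 5 with the latest at 9 and a parallelism of 2, the plan is two batches of HEAD probes, `[6, 7]` and `[8, 9]`; a reader 1001 versions behind gets a paginated listing instead.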

Carries on lance-format#5997 / discussion lance-format#5947, and follows up on lance-format#6728 where moving
S3 Express to a version hint was raised.

- Mark uses_version_hint as pub so the doc link from write_version_hint
  resolves under rustdoc.
- Update test_dir_listing_extra_calls_with_migration to expect one fewer
  listing call: on local FS the __manifest reload now uses the version
  hint (a HEAD-and-probe on _versions/latest_version_hint.json) instead
  of a full LIST, so table_exists / describe_table in the migration path
  now make only the table-directory fallback list call.

A new `concurrent_append` benchmark seeds a 100k-row base table then runs
N tokio writer tasks that each loop calling `InsertBuilder::execute` on
the same dataset. The output records commits/sec, per-commit latency
distribution (p50/p90/p95/p99/max/mean), and the final version count, so
the version-hint optimisation can be measured against S3 Standard and
S3 Express directly. Designed to be driven from a single very large EC2
host so the writer count itself isn't the bottleneck.

Configurable via env vars (DATASET_URI, NUM_WRITERS, APPENDS_PER_WRITER,
ROWS_PER_APPEND, BASE_ROWS, KEEP_DATASET) and detects S3 Express via the
`--x-s3` suffix.

Setting BASE_ROWS=0 now creates the dataset with a single zero-row batch
so writers begin at version 1 with no data, instead of the previous
~100k-row seed.

The hint is now controlled by a process-wide env var (read once via
OnceLock) that overrides every store-type check. Setting
LANCE_USE_VERSION_HINT=0 (or false / off) makes write_version_hint a
no-op, makes current_manifest_path skip the hint probe, and makes
CommitHandler::list_manifest_locations_since fall back to the listing
path on every store — so the same binary can be benchmarked with and
without the optimization, and operators have a clear escape hatch if it
ever misbehaves.

Lets each writer stop after a wall-clock budget instead of always
finishing APPENDS_PER_WRITER commits, so high-concurrency runs (where
contention drags per-commit latency up) don't run unbounded.

Lets the driver bound a run's total wall to MAX_WALL_SECS + per-attempt
timeout, even when contention pushes a single commit attempt's retry
chain past several minutes.

… link

CI clippy was tripping on the 8-arg run_writer signature; tag it
explicitly. rustdoc was rejecting the link from public
uses_version_hint to the private VERSION_HINT_ENV const, so inline the
env-var name instead.

Drop the env-var mention (implementation detail) and drop the specific
non-lex store examples; describe what the file is and the contract
readers can rely on, not which stores choose to write it.

jackye1995 (Contributor) commented

thanks for pushing this through!

@jackye1995 jackye1995 merged commit dd887ec into lance-format:main May 19, 2026
29 checks passed