Add S3 GET request coalescer#6530
Conversation
Object storage GETs have a high fixed cost (a round-trip, and a per-request charge on S3) relative to transferring a few extra bytes. A single search often issues several get_slice requests for ranges that sit close together within the same object. CoalescingStorage buffers get_slice requests for a short window (5ms by default), groups the ones targeting nearby ranges of the same object, and issues a single underlying GET spanning each group, handing every waiter a zero-copy slice of the merged buffer. The small gaps between ranges are fetched wastefully, trading a little bandwidth for fewer round-trips. Coalescing one wave of requests is also what makes the next wave coalescable. Search reads are staged: a term dictionary lookup must complete before the postings list it points to can be read. If the term dictionary requests are coalesced into one GET they all resolve at the same instant, so the postings list requests they trigger are issued together and land in a single window where they too get coalesced. If the term dictionary requests are served separately they return at staggered times, scattering the dependent postings list requests across multiple windows and preventing them from merging. So the gap/span thresholds should stay generous enough to keep co-issued term-dictionary reads together. Unlike DebouncedStorage, which only deduplicates identical in-flight requests, the coalescer merges distinct but nearby ranges; it is wired beneath the debouncer in the S3 resolver. Grouping is bounded by a max gap (1 MiB) and max merged span (8 MiB); all three knobs are tunable via QW_S3_COALESCE_* env vars, and a zero window disables coalescing. The fetch runs in a detached task and reports back over oneshot channels, so a cancelled caller never aborts an in-flight GET, and the internal mutex is never held across an await.
|
We have no clue that it will catch anything and it makes the debouncer redundant. This PR cannot be merged as is. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4842cbaab2
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| ); | ||
| let relative_start = request.range.start - merged.start; | ||
| let relative_end = request.range.end - merged.start; | ||
| Ok(bytes.slice(relative_start..relative_end)) |
There was a problem hiding this comment.
Detach coalesced slices before caching
When this slice is returned for a small cached fast-field read, it can be inserted into the long-lived QuickwitCache: StorageWithCache caches the OwnedBytes returned by BundleStorage, and BundleStorage::get_slice forwards the underlying bytes without copying. In that scenario a few-KB .fast entry coalesced with nearby split reads pins the whole merged buffer (up to the 8 MiB span) while MemorySizedCache accounts only bytes.len(), so the fast-field cache can exceed its configured memory by a large factor. The module comment acknowledges this hazard but incorrectly excludes the fast-field path because the “whole fast field” is still a sub-slice of the coalesced .split allocation.
Useful? React with 👍 / 👎.
| record_cluster_metrics(&cluster); | ||
|
|
||
| let merged = cluster.range; | ||
| let result = underlying.get_slice(path, merged.clone()).await; |
There was a problem hiding this comment.
Use merged range size for storage timeouts
When searcher.storage_timeout_policy is enabled, the search path wraps this resolver storage in TimeoutAndRetryStorage (quickwit-search/src/leaf.rs), and that wrapper computes its deadline from each caller's original range.len() before this coalescer merges requests. A small read that lands in an 8 MiB cluster therefore still gets the small-read timeout, so the merged S3 GET can be timed out and retried even when it meets the configured throughput for the actual bytes being downloaded; with the default sample policy this can turn coalescing into repeated timeouts for clustered small ranges.
Useful? React with 👍 / 👎.
| let coalesced_storage = CoalescingStorage::new(storage); | ||
| Ok(Arc::new(DebouncedStorage::new(coalesced_storage))) |
There was a problem hiding this comment.
Preserve physical download stats for coalesced S3 reads
In the leaf-search path, CountingStorage::instrument_storage wraps the storage returned by this resolver and feeds SplitResourceStats.download_num_bytes/download_num_requests, which the proto documents as bytes downloaded and storage GETs issued. With the coalescer hidden below that request-scoped wrapper, two 4 KiB reads 1 MiB apart are reported as 8 KiB and 2 requests even though S3 downloads roughly 1 MiB and issues 1 GET, so per-split resource stats and experiments become inaccurate whenever coalescing occurs; the merged GET accounting needs to be propagated to that layer or measured where the physical request is made.
Useful? React with 👍 / 👎.
The purpose of this PR is to find out more. |
Object storage GETs have a high fixed cost (a round-trip, and a per-request
charge on S3) relative to transferring a few extra bytes. A single search
often issues several get_slice requests for ranges that sit close together
within the same object.
CoalescingStorage buffers get_slice requests for a short window (5ms by
default), groups the ones targeting nearby ranges of the same object, and
issues a single underlying GET spanning each group, handing every waiter a
zero-copy slice of the merged buffer. The small gaps between ranges are
fetched wastefully, trading a little bandwidth for fewer round-trips.
Coalescing one wave of requests is also what makes the next wave coalescable.
Search reads are staged: a term dictionary lookup must complete before the
postings list it points to can be read. If the term dictionary requests are
coalesced into one GET they all resolve at the same instant, so the postings
list requests they trigger are issued together and land in a single window
where they too get coalesced. If the term dictionary requests are served
separately they return at staggered times, scattering the dependent postings
list requests across multiple windows and preventing them from merging. So
the gap/span thresholds should stay generous enough to keep co-issued
term-dictionary reads together.
Unlike DebouncedStorage, which only deduplicates identical in-flight
requests, the coalescer merges distinct but nearby ranges; it is wired
beneath the debouncer in the S3 resolver. Grouping is bounded by a max gap
(1 MiB) and max merged span (8 MiB); all three knobs are tunable via
QW_S3_COALESCE_* env vars, and a zero window disables coalescing.
The fetch runs in a detached task and reports back over oneshot channels, so
a cancelled caller never aborts an in-flight GET, and the internal mutex is
never held across an await.
Description
Describe the proposed changes made in this PR.
How was this PR tested?
Describe how you tested this PR.