Skip to content

Add S3 GET request coalescer#6530

Draft
PSeitz-dd wants to merge 1 commit into
mainfrom
s3_request_coalescer
Draft

Add S3 GET request coalescer#6530
PSeitz-dd wants to merge 1 commit into
mainfrom
s3_request_coalescer

Conversation

@PSeitz-dd

Copy link
Copy Markdown
Contributor

Object storage GETs have a high fixed cost (a round-trip, and a per-request
charge on S3) relative to transferring a few extra bytes. A single search
often issues several get_slice requests for ranges that sit close together
within the same object.

CoalescingStorage buffers get_slice requests for a short window (5ms by
default), groups the ones targeting nearby ranges of the same object, and
issues a single underlying GET spanning each group, handing every waiter a
zero-copy slice of the merged buffer. The small gaps between ranges are
fetched wastefully, trading a little bandwidth for fewer round-trips.

Coalescing one wave of requests is also what makes the next wave coalescable.
Search reads are staged: a term dictionary lookup must complete before the
postings list it points to can be read. If the term dictionary requests are
coalesced into one GET they all resolve at the same instant, so the postings
list requests they trigger are issued together and land in a single window
where they too get coalesced. If the term dictionary requests are served
separately they return at staggered times, scattering the dependent postings
list requests across multiple windows and preventing them from merging. So
the gap/span thresholds should stay generous enough to keep co-issued
term-dictionary reads together.

Unlike DebouncedStorage, which only deduplicates identical in-flight
requests, the coalescer merges distinct but nearby ranges; it is wired
beneath the debouncer in the S3 resolver. Grouping is bounded by a max gap
(1 MiB) and max merged span (8 MiB); all three knobs are tunable via
QW_S3_COALESCE_* env vars, and a zero window disables coalescing.

The fetch runs in a detached task and reports back over oneshot channels, so
a cancelled caller never aborts an in-flight GET, and the internal mutex is
never held across an await.

Description

Describe the proposed changes made in this PR.

How was this PR tested?

Describe how you tested this PR.

Object storage GETs have a high fixed cost (a round-trip, and a per-request
charge on S3) relative to transferring a few extra bytes. A single search
often issues several get_slice requests for ranges that sit close together
within the same object.

CoalescingStorage buffers get_slice requests for a short window (5ms by
default), groups the ones targeting nearby ranges of the same object, and
issues a single underlying GET spanning each group, handing every waiter a
zero-copy slice of the merged buffer. The small gaps between ranges are
fetched wastefully, trading a little bandwidth for fewer round-trips.

Coalescing one wave of requests is also what makes the next wave coalescable.
Search reads are staged: a term dictionary lookup must complete before the
postings list it points to can be read. If the term dictionary requests are
coalesced into one GET they all resolve at the same instant, so the postings
list requests they trigger are issued together and land in a single window
where they too get coalesced. If the term dictionary requests are served
separately they return at staggered times, scattering the dependent postings
list requests across multiple windows and preventing them from merging. So
the gap/span thresholds should stay generous enough to keep co-issued
term-dictionary reads together.

Unlike DebouncedStorage, which only deduplicates identical in-flight
requests, the coalescer merges distinct but nearby ranges; it is wired
beneath the debouncer in the S3 resolver. Grouping is bounded by a max gap
(1 MiB) and max merged span (8 MiB); all three knobs are tunable via
QW_S3_COALESCE_* env vars, and a zero window disables coalescing.

The fetch runs in a detached task and reports back over oneshot channels, so
a cancelled caller never aborts an in-flight GET, and the internal mutex is
never held across an await.
@PSeitz-dd PSeitz-dd requested a review from a team as a code owner June 19, 2026 09:52
@fulmicoton

Copy link
Copy Markdown
Collaborator

We have no clue that it will catch anything and it makes the debouncer redundant. This PR cannot be merged as is.
Let's first study whether there is something to be done.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4842cbaab2

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

);
let relative_start = request.range.start - merged.start;
let relative_end = request.range.end - merged.start;
Ok(bytes.slice(relative_start..relative_end))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Detach coalesced slices before caching

When this slice is returned for a small cached fast-field read, it can be inserted into the long-lived QuickwitCache: StorageWithCache caches the OwnedBytes returned by BundleStorage, and BundleStorage::get_slice forwards the underlying bytes without copying. In that scenario a few-KB .fast entry coalesced with nearby split reads pins the whole merged buffer (up to the 8 MiB span) while MemorySizedCache accounts only bytes.len(), so the fast-field cache can exceed its configured memory by a large factor. The module comment acknowledges this hazard but incorrectly excludes the fast-field path because the “whole fast field” is still a sub-slice of the coalesced .split allocation.

Useful? React with 👍 / 👎.

record_cluster_metrics(&cluster);

let merged = cluster.range;
let result = underlying.get_slice(path, merged.clone()).await;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Use merged range size for storage timeouts

When searcher.storage_timeout_policy is enabled, the search path wraps this resolver storage in TimeoutAndRetryStorage (quickwit-search/src/leaf.rs), and that wrapper computes its deadline from each caller's original range.len() before this coalescer merges requests. A small read that lands in an 8 MiB cluster therefore still gets the small-read timeout, so the merged S3 GET can be timed out and retried even when it meets the configured throughput for the actual bytes being downloaded; with the default sample policy this can turn coalescing into repeated timeouts for clustered small ranges.

Useful? React with 👍 / 👎.

Comment on lines +67 to +68
let coalesced_storage = CoalescingStorage::new(storage);
Ok(Arc::new(DebouncedStorage::new(coalesced_storage)))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve physical download stats for coalesced S3 reads

In the leaf-search path, CountingStorage::instrument_storage wraps the storage returned by this resolver and feeds SplitResourceStats.download_num_bytes/download_num_requests, which the proto documents as bytes downloaded and storage GETs issued. With the coalescer hidden below that request-scoped wrapper, two 4 KiB reads 1 MiB apart are reported as 8 KiB and 2 requests even though S3 downloads roughly 1 MiB and issues 1 GET, so per-split resource stats and experiments become inaccurate whenever coalescing occurs; the merged GET accounting needs to be propagated to that layer or measured where the physical request is made.

Useful? React with 👍 / 👎.

@PSeitz PSeitz marked this pull request as draft June 19, 2026 11:36
@PSeitz

PSeitz commented Jun 19, 2026

Copy link
Copy Markdown
Collaborator

We have no clue that it will catch anything and it makes the debouncer redundant. This PR cannot be merged as is. Let's first study whether there is something to be done.

The purpose of this PR is to find out more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants