Add S3 GET request coalescer by PSeitz-dd · Pull Request #6530 · quickwit-oss/quickwit

PSeitz-dd · 2026-06-19T09:52:10Z

Object storage GETs have a high fixed cost (a round-trip, and a per-request
charge on S3) relative to transferring a few extra bytes. A single search
often issues several get_slice requests for ranges that sit close together
within the same object.

CoalescingStorage buffers get_slice requests for a short window (5ms by
default), groups the ones targeting nearby ranges of the same object, and
issues a single underlying GET spanning each group, handing every waiter a
zero-copy slice of the merged buffer. The small gaps between ranges are
fetched wastefully, trading a little bandwidth for fewer round-trips.

Coalescing one wave of requests is also what makes the next wave coalescable.
Search reads are staged: a term dictionary lookup must complete before the
postings list it points to can be read. If the term dictionary requests are
coalesced into one GET they all resolve at the same instant, so the postings
list requests they trigger are issued together and land in a single window
where they too get coalesced. If the term dictionary requests are served
separately they return at staggered times, scattering the dependent postings
list requests across multiple windows and preventing them from merging. So
the gap/span thresholds should stay generous enough to keep co-issued
term-dictionary reads together.

Unlike DebouncedStorage, which only deduplicates identical in-flight
requests, the coalescer merges distinct but nearby ranges; it is wired
beneath the debouncer in the S3 resolver. Grouping is bounded by a max gap
(1 MiB) and max merged span (8 MiB); all three knobs are tunable via
QW_S3_COALESCE_* env vars, and a zero window disables coalescing.

The fetch runs in a detached task and reports back over oneshot channels, so
a cancelled caller never aborts an in-flight GET, and the internal mutex is
never held across an await.

Description

Describe the proposed changes made in this PR.

How was this PR tested?

Describe how you tested this PR.

Object storage GETs have a high fixed cost (a round-trip, and a per-request charge on S3) relative to transferring a few extra bytes. A single search often issues several get_slice requests for ranges that sit close together within the same object. CoalescingStorage buffers get_slice requests for a short window (5ms by default), groups the ones targeting nearby ranges of the same object, and issues a single underlying GET spanning each group, handing every waiter a zero-copy slice of the merged buffer. The small gaps between ranges are fetched wastefully, trading a little bandwidth for fewer round-trips. Coalescing one wave of requests is also what makes the next wave coalescable. Search reads are staged: a term dictionary lookup must complete before the postings list it points to can be read. If the term dictionary requests are coalesced into one GET they all resolve at the same instant, so the postings list requests they trigger are issued together and land in a single window where they too get coalesced. If the term dictionary requests are served separately they return at staggered times, scattering the dependent postings list requests across multiple windows and preventing them from merging. So the gap/span thresholds should stay generous enough to keep co-issued term-dictionary reads together. Unlike DebouncedStorage, which only deduplicates identical in-flight requests, the coalescer merges distinct but nearby ranges; it is wired beneath the debouncer in the S3 resolver. Grouping is bounded by a max gap (1 MiB) and max merged span (8 MiB); all three knobs are tunable via QW_S3_COALESCE_* env vars, and a zero window disables coalescing. The fetch runs in a detached task and reports back over oneshot channels, so a cancelled caller never aborts an in-flight GET, and the internal mutex is never held across an await.

fulmicoton · 2026-06-19T09:59:24Z

We have no clue that it will catch anything and it makes the debouncer redundant. This PR cannot be merged as is.
Let's first study whether there is something to be done.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4842cbaab2

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-06-19T10:02:45Z

+                );
+                let relative_start = request.range.start - merged.start;
+                let relative_end = request.range.end - merged.start;
+                Ok(bytes.slice(relative_start..relative_end))


Detach coalesced slices before caching

When this slice is returned for a small cached fast-field read, it can be inserted into the long-lived QuickwitCache: StorageWithCache caches the OwnedBytes returned by BundleStorage, and BundleStorage::get_slice forwards the underlying bytes without copying. In that scenario a few-KB .fast entry coalesced with nearby split reads pins the whole merged buffer (up to the 8 MiB span) while MemorySizedCache accounts only bytes.len(), so the fast-field cache can exceed its configured memory by a large factor. The module comment acknowledges this hazard but incorrectly excludes the fast-field path because the “whole fast field” is still a sub-slice of the coalesced .split allocation.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-19T10:02:46Z

+    record_cluster_metrics(&cluster);
+
+    let merged = cluster.range;
+    let result = underlying.get_slice(path, merged.clone()).await;


Use merged range size for storage timeouts

When searcher.storage_timeout_policy is enabled, the search path wraps this resolver storage in TimeoutAndRetryStorage (quickwit-search/src/leaf.rs), and that wrapper computes its deadline from each caller's original range.len() before this coalescer merges requests. A small read that lands in an 8 MiB cluster therefore still gets the small-read timeout, so the merged S3 GET can be timed out and retried even when it meets the configured throughput for the actual bytes being downloaded; with the default sample policy this can turn coalescing into repeated timeouts for clustered small ranges.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-19T10:02:46Z

+        let coalesced_storage = CoalescingStorage::new(storage);
+        Ok(Arc::new(DebouncedStorage::new(coalesced_storage)))


Preserve physical download stats for coalesced S3 reads

In the leaf-search path, CountingStorage::instrument_storage wraps the storage returned by this resolver and feeds SplitResourceStats.download_num_bytes/download_num_requests, which the proto documents as bytes downloaded and storage GETs issued. With the coalescer hidden below that request-scoped wrapper, two 4 KiB reads 1 MiB apart are reported as 8 KiB and 2 requests even though S3 downloads roughly 1 MiB and issues 1 GET, so per-split resource stats and experiments become inaccurate whenever coalescing occurs; the merged GET accounting needs to be propagated to that layer or measured where the physical request is made.

Useful? React with 👍 / 👎.

PSeitz · 2026-06-19T11:43:48Z

We have no clue that it will catch anything and it makes the debouncer redundant. This PR cannot be merged as is. Let's first study whether there is something to be done.

The purpose of this PR is to find out more.

PSeitz-dd requested a review from a team as a code owner June 19, 2026 09:52

chatgpt-codex-connector Bot reviewed Jun 19, 2026

View reviewed changes

PSeitz marked this pull request as draft June 19, 2026 11:36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add S3 GET request coalescer#6530

Add S3 GET request coalescer#6530
PSeitz-dd wants to merge 1 commit into
mainfrom
s3_request_coalescer

PSeitz-dd commented Jun 19, 2026

Uh oh!

fulmicoton commented Jun 19, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 19, 2026

Uh oh!

chatgpt-codex-connector Bot Jun 19, 2026

Uh oh!

chatgpt-codex-connector Bot Jun 19, 2026

Uh oh!

PSeitz commented Jun 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		let coalesced_storage = CoalescingStorage::new(storage);
		Ok(Arc::new(DebouncedStorage::new(coalesced_storage)))

Conversation

PSeitz-dd commented Jun 19, 2026

Description

How was this PR tested?

Uh oh!

fulmicoton commented Jun 19, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

PSeitz commented Jun 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants