Skip to content

Coalesce consecutive page cache misses into single S3 requests#104230

Merged
alexey-milovidov merged 2 commits into
ClickHouse:masterfrom
alexey-milovidov:fix-page-cache-coalesce-bounded
May 6, 2026
Merged

Coalesce consecutive page cache misses into single S3 requests#104230
alexey-milovidov merged 2 commits into
ClickHouse:masterfrom
alexey-milovidov:fix-page-cache-coalesce-bounded

Conversation

@alexey-milovidov

@alexey-milovidov alexey-milovidov commented May 6, 2026

Copy link
Copy Markdown
Member

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Cold reads of object storage through the userspace page cache (use_page_cache_for_object_storage = 1) are now significantly faster, because consecutive cache misses are coalesced into a single HTTP request instead of one request per page_cache_block_size block.

Description

CachedInMemoryReadBufferFromFile::populateBlockRange previously issued one in->readBigAt per missing 1 MiB block. On object storage, each call is a separate HTTP request, so a cold scan of a 14 GB Parquet file through the userspace page cache made ~15k requests, each paying the TCP/TLS round-trip — measurably slower than the filesystem cache, which fetches in larger segments.

Coalescing was previously implemented in commit 682b070 and reverted in c178d2a to avoid transient memory spikes from huge temporary buffers under parallel cold reads.

This change re-introduces coalescing with a hard cap on the temporary buffer (max_coalesced_bytes = 16 MiB). Long miss runs are split into multiple fetches, bounding peak transient memory per call. Single-block misses still read directly into the cache cell, avoiding the buffer and the extra memcpy.

Test results

Measured on c8g.24xlarge against the ClickBench clickhouse-datalake queries (43 queries, single 14.7 GB Parquet on S3, totals over all queries):

Config Cold total Hot total
OLD filesystem cache 62.28s 18.57s
NEW page cache (broken) 106.14s 20.14s
NEW page cache + this fix 56.58s 13.59s

Per-query, the worst regressed queries are back to baseline:

  • Q24 (heaviest, SELECT *): 15.23s -> 34.61s broken -> 15.27s fixed
  • Q23: 4.89s -> 10.14s broken -> 4.72s fixed
  • Q22: 2.30s -> 5.00s broken -> 2.43s fixed

Context: ClickHouse/ClickBench#818

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Version info

  • Merged into: 26.5.1.356

`CachedInMemoryReadBufferFromFile::populateBlockRange` previously issued
one `in->readBigAt` per missing 1 MiB block. On object storage, each call
is a separate HTTP request, so a cold scan of a 14 GB Parquet file
through the userspace page cache made ~15k requests, each paying the
TCP/TLS round-trip — measurably slower than the filesystem cache, which
fetches in larger segments.

Coalescing was previously implemented in commit 682b070 and reverted
in c178d2a to avoid transient memory spikes from huge temporary
buffers under parallel cold reads.

Re-introduce coalescing with a hard cap on the temporary buffer
(`max_coalesced_bytes` = 16 MiB). Long miss runs are split into multiple
fetches, bounding peak transient memory per call. Single-block misses
still read directly into the cache cell, avoiding the buffer and the
extra `memcpy`.

Measured locally on c8g.24xlarge against the ClickBench
`clickhouse-datalake` queries (43 queries, single 14.7 GB Parquet on S3,
totals over all queries):

  cold runs:  filesystem cache 62.28s -> page cache (default)  56.58s
  hot runs:   filesystem cache 18.57s -> page cache (default)  13.59s

The page cache is now strictly faster than the filesystem cache on both
cold and hot, with no benchmark-script tuning required.

Context: ClickHouse/ClickBench#818

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@alexey-milovidov alexey-milovidov added the comp-datalake Data lake table formats (Iceberg/Delta/Hudi) integration. label May 6, 2026
@clickhouse-gh

clickhouse-gh Bot commented May 6, 2026

Copy link
Copy Markdown
Contributor

Workflow [PR], commit [26f5279]

Summary:


AI Review

Summary

This PR re-introduces coalesced page-cache miss reads for object storage, now bounded by a new setting page_cache_max_coalesced_bytes to avoid transient memory spikes under parallel cold reads. The change is coherent across settings plumbing and read-path implementation, and I did not find correctness, safety, or compatibility issues that require changes.

ClickHouse Rules
Item Status Notes
Deletion logging
Serialization versioning
Core-area scrutiny
No test removal
Experimental gate
No magic constants
Backward compatibility
SettingsChangesHistory.cpp
PR metadata quality
Safe rollout
Compilation time
No large/binary files
Final Verdict
  • Status: ✅ Approve

@clickhouse-gh clickhouse-gh Bot added the pr-performance Pull request with some performance improvements label May 6, 2026
/// The coalesced read uses a temporary buffer, capped at `max_coalesced_bytes` to bound
/// transient memory under parallel cold reads. A run longer than the cap is split.
/// Single-block misses bypass the buffer and read directly into the cache cell.
constexpr size_t max_coalesced_bytes = 16 * 1024 * 1024;

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make it a setting?

@alexey-milovidov alexey-milovidov requested a review from al13n321 May 6, 2026 14:09
/// The coalesced read uses a temporary buffer, capped at `max_coalesced_bytes` to bound
/// transient memory under parallel cold reads. A run longer than the cap is split.
/// Single-block misses bypass the buffer and read directly into the cache cell.
constexpr size_t max_coalesced_bytes = 16 * 1024 * 1024;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

max_coalesced_bytes = 16 MiB is an important behavior threshold (memory vs request coalescing), but it's hardcoded. This makes tuning impossible for clusters with very different object-store RTT or memory pressure, and it violates the usual ClickHouse pattern of exposing such trade-offs as a setting.

Please make this cap configurable (e.g. a read/page-cache setting with a conservative default), then use that value here.

Address review feedback on PR ClickHouse#104230: the 16 MiB cap on coalesced
page-cache reads was hardcoded, which prevents tuning for clusters
with different object-store RTT or memory pressure.

Expose it as the session setting `page_cache_max_coalesced_bytes`,
with the same 16 MiB default. The setting flows through `ReadSettings`
to `CachedInMemoryReadBufferFromFile::populateBlockRange`, which uses
it to compute the per-fetch block cap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@clickhouse-gh

clickhouse-gh Bot commented May 6, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.10% 84.10% +0.00%
Functions 91.10% 91.10% +0.00%
Branches 76.60% 76.60% +0.00%

Changed lines: 30.77% (20/65) · Uncovered code

Full report · Diff report

@alexey-milovidov alexey-milovidov left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

@alexey-milovidov alexey-milovidov self-assigned this May 6, 2026
@alexey-milovidov alexey-milovidov added this pull request to the merge queue May 6, 2026
Merged via the queue into ClickHouse:master with commit 16657f0 May 6, 2026
165 checks passed
@alexey-milovidov alexey-milovidov deleted the fix-page-cache-coalesce-bounded branch May 6, 2026 21:50
@robot-ch-test-poll1 robot-ch-test-poll1 added the pr-synced-to-cloud The PR is synced to the cloud repo label May 6, 2026
zlareb1 added a commit to zlareb1/ClickHouse that referenced this pull request May 24, 2026
Drop two randomization adds from this PR that need additional
groundwork to ship safely, and revert the seven test files that
needed `-- Tags: no-random-settings` because of them:

* `max_streams_for_union_step` and
  `max_streams_for_union_step_to_max_threads_ratio` (ClickHouse#100176) —
  reshape the `UNION ALL` pipeline, which surfaced pre-existing
  weak-ordering invariants in 8 tests (constant ORDER BY keys, NaN
  sort keys from `sin(0/0)`, `ORDER BY tuple()`, duplicate values
  without secondary sort). The dedicated `04051_max_streams_for_union_step`
  test also failed in a way suggesting randomized session-level
  values bypass per-query `SETTINGS` for these settings — a separate
  interaction worth investigating first.

* Pairing `max_bytes_ratio_before_external_join` with
  `max_bytes_before_external_join` in `randomize_external_sort_group_by`
  using the same `threshold_generator(0.3, 0.5, 0, 10 GiB)` as
  sort/group_by — caused `03279_join_choose_build_table` to fail at
  `--max_bytes_before_external_join 7693351` (grace hash join
  returned 500/125 rows instead of 1000). That smells like a real
  correctness issue in grace hash join under spilling. Restore
  upstream's standalone `max_bytes_before_external_join: random.choice([0, 1000000000])`
  (binary off / 1 GiB, never spills) and drop the ratio randomization
  for now.

What stays in this PR:

* `optimize_dictget_tuple_element` (ClickHouse#100186)
* `use_top_k_dynamic_filtering_for_variable_length_types` (ClickHouse#104216)
  with conditional_settings parent
* `page_cache_max_coalesced_bytes` (ClickHouse#104230)
* re-enable `grace_hash_join_initial_buckets` randomization with
  the narrow, previously-working `[1, 2, 4, 8]` range

CI report:
https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=105630&sha=6f3e784ddf7&name_0=PR

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
itsjoeoui pushed a commit to itsjoeoui/ClickHouse that referenced this pull request Jun 4, 2026
QA notes on recent PRs flagged that the following settings were
introduced but not added to the randomization list in
`tests/clickhouse-test`:

* `max_bytes_ratio_before_external_join` (ClickHouse#103862) — mirrors the
  existing `max_bytes_ratio_before_external_group_by` and
  `max_bytes_ratio_before_external_sort`, so wire it through the
  Bytes-vs-Ratio choice in `randomize_external_sort_group_by`
  alongside its absolute counterpart `max_bytes_before_external_join`.
* `max_streams_for_union_step` and
  `max_streams_for_union_step_to_max_threads_ratio` (ClickHouse#100176) —
  occasionally clamp to a small number of simultaneously active
  streams to exercise the `UNION ALL` narrowing path.
* `optimize_dictget_tuple_element` (ClickHouse#100186) — default-true; disable
  with a small probability so both planner paths stay covered.
* `use_top_k_dynamic_filtering_for_variable_length_types` (ClickHouse#104216) —
  cover both the fixed-length-only default and the opt-in path.
* `page_cache_max_coalesced_bytes` (ClickHouse#104230) — randomize across
  small/large values so the coalescing path and the cap that splits
  long miss runs into multiple reads both get exercised.

Also re-enable randomization of `grace_hash_join_initial_buckets`,
which had been disabled in
`tests/clickhouse-test` after wrong results on some tests. Restore
the original conservative range `[1, 2, 4, 8]`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp-datalake Data lake table formats (Iceberg/Delta/Hudi) integration. pr-performance Pull request with some performance improvements pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants