Coalesce consecutive page cache misses into single S3 requests#104230
Conversation
`CachedInMemoryReadBufferFromFile::populateBlockRange` previously issued one `in->readBigAt` per missing 1 MiB block. On object storage, each call is a separate HTTP request, so a cold scan of a 14 GB Parquet file through the userspace page cache made ~15k requests, each paying the TCP/TLS round-trip — measurably slower than the filesystem cache, which fetches in larger segments. Coalescing was previously implemented in commit 682b070 and reverted in c178d2a to avoid transient memory spikes from huge temporary buffers under parallel cold reads. Re-introduce coalescing with a hard cap on the temporary buffer (`max_coalesced_bytes` = 16 MiB). Long miss runs are split into multiple fetches, bounding peak transient memory per call. Single-block misses still read directly into the cache cell, avoiding the buffer and the extra `memcpy`. Measured locally on c8g.24xlarge against the ClickBench `clickhouse-datalake` queries (43 queries, single 14.7 GB Parquet on S3, totals over all queries): cold runs: filesystem cache 62.28s -> page cache (default) 56.58s hot runs: filesystem cache 18.57s -> page cache (default) 13.59s The page cache is now strictly faster than the filesystem cache on both cold and hot, with no benchmark-script tuning required. Context: ClickHouse/ClickBench#818 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Workflow [PR], commit [26f5279] Summary: ✅ AI ReviewSummaryThis PR re-introduces coalesced page-cache miss reads for object storage, now bounded by a new setting ClickHouse Rules
Final Verdict
|
| /// The coalesced read uses a temporary buffer, capped at `max_coalesced_bytes` to bound | ||
| /// transient memory under parallel cold reads. A run longer than the cap is split. | ||
| /// Single-block misses bypass the buffer and read directly into the cache cell. | ||
| constexpr size_t max_coalesced_bytes = 16 * 1024 * 1024; |
There was a problem hiding this comment.
Can we make it a setting?
| /// The coalesced read uses a temporary buffer, capped at `max_coalesced_bytes` to bound | ||
| /// transient memory under parallel cold reads. A run longer than the cap is split. | ||
| /// Single-block misses bypass the buffer and read directly into the cache cell. | ||
| constexpr size_t max_coalesced_bytes = 16 * 1024 * 1024; |
There was a problem hiding this comment.
max_coalesced_bytes = 16 MiB is an important behavior threshold (memory vs request coalescing), but it's hardcoded. This makes tuning impossible for clusters with very different object-store RTT or memory pressure, and it violates the usual ClickHouse pattern of exposing such trade-offs as a setting.
Please make this cap configurable (e.g. a read/page-cache setting with a conservative default), then use that value here.
Address review feedback on PR ClickHouse#104230: the 16 MiB cap on coalesced page-cache reads was hardcoded, which prevents tuning for clusters with different object-store RTT or memory pressure. Expose it as the session setting `page_cache_max_coalesced_bytes`, with the same 16 MiB default. The setting flows through `ReadSettings` to `CachedInMemoryReadBufferFromFile::populateBlockRange`, which uses it to compute the per-fetch block cap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
LLVM Coverage Report
Changed lines: 30.77% (20/65) · Uncovered code |
Drop two randomization adds from this PR that need additional groundwork to ship safely, and revert the seven test files that needed `-- Tags: no-random-settings` because of them: * `max_streams_for_union_step` and `max_streams_for_union_step_to_max_threads_ratio` (ClickHouse#100176) — reshape the `UNION ALL` pipeline, which surfaced pre-existing weak-ordering invariants in 8 tests (constant ORDER BY keys, NaN sort keys from `sin(0/0)`, `ORDER BY tuple()`, duplicate values without secondary sort). The dedicated `04051_max_streams_for_union_step` test also failed in a way suggesting randomized session-level values bypass per-query `SETTINGS` for these settings — a separate interaction worth investigating first. * Pairing `max_bytes_ratio_before_external_join` with `max_bytes_before_external_join` in `randomize_external_sort_group_by` using the same `threshold_generator(0.3, 0.5, 0, 10 GiB)` as sort/group_by — caused `03279_join_choose_build_table` to fail at `--max_bytes_before_external_join 7693351` (grace hash join returned 500/125 rows instead of 1000). That smells like a real correctness issue in grace hash join under spilling. Restore upstream's standalone `max_bytes_before_external_join: random.choice([0, 1000000000])` (binary off / 1 GiB, never spills) and drop the ratio randomization for now. What stays in this PR: * `optimize_dictget_tuple_element` (ClickHouse#100186) * `use_top_k_dynamic_filtering_for_variable_length_types` (ClickHouse#104216) with conditional_settings parent * `page_cache_max_coalesced_bytes` (ClickHouse#104230) * re-enable `grace_hash_join_initial_buckets` randomization with the narrow, previously-working `[1, 2, 4, 8]` range CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=105630&sha=6f3e784ddf7&name_0=PR Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
QA notes on recent PRs flagged that the following settings were introduced but not added to the randomization list in `tests/clickhouse-test`: * `max_bytes_ratio_before_external_join` (ClickHouse#103862) — mirrors the existing `max_bytes_ratio_before_external_group_by` and `max_bytes_ratio_before_external_sort`, so wire it through the Bytes-vs-Ratio choice in `randomize_external_sort_group_by` alongside its absolute counterpart `max_bytes_before_external_join`. * `max_streams_for_union_step` and `max_streams_for_union_step_to_max_threads_ratio` (ClickHouse#100176) — occasionally clamp to a small number of simultaneously active streams to exercise the `UNION ALL` narrowing path. * `optimize_dictget_tuple_element` (ClickHouse#100186) — default-true; disable with a small probability so both planner paths stay covered. * `use_top_k_dynamic_filtering_for_variable_length_types` (ClickHouse#104216) — cover both the fixed-length-only default and the opt-in path. * `page_cache_max_coalesced_bytes` (ClickHouse#104230) — randomize across small/large values so the coalescing path and the cap that splits long miss runs into multiple reads both get exercised. Also re-enable randomization of `grace_hash_join_initial_buckets`, which had been disabled in `tests/clickhouse-test` after wrong results on some tests. Restore the original conservative range `[1, 2, 4, 8]`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Cold reads of object storage through the userspace page cache (
use_page_cache_for_object_storage = 1) are now significantly faster, because consecutive cache misses are coalesced into a single HTTP request instead of one request perpage_cache_block_sizeblock.Description
CachedInMemoryReadBufferFromFile::populateBlockRangepreviously issued onein->readBigAtper missing 1 MiB block. On object storage, each call is a separate HTTP request, so a cold scan of a 14 GB Parquet file through the userspace page cache made ~15k requests, each paying the TCP/TLS round-trip — measurably slower than the filesystem cache, which fetches in larger segments.Coalescing was previously implemented in commit 682b070 and reverted in c178d2a to avoid transient memory spikes from huge temporary buffers under parallel cold reads.
This change re-introduces coalescing with a hard cap on the temporary buffer (
max_coalesced_bytes= 16 MiB). Long miss runs are split into multiple fetches, bounding peak transient memory per call. Single-block misses still read directly into the cache cell, avoiding the buffer and the extramemcpy.Test results
Measured on c8g.24xlarge against the ClickBench
clickhouse-datalakequeries (43 queries, single 14.7 GB Parquet on S3, totals over all queries):Per-query, the worst regressed queries are back to baseline:
SELECT *): 15.23s -> 34.61s broken -> 15.27s fixedContext: ClickHouse/ClickBench#818
Documentation entry for user-facing changes
Version info
26.5.1.356