
ParquetPushDecoder API to clear all buffered ranges#9624

Merged
alamb merged 6 commits into
apache:mainfrom
nathanb9:parquet-push-decoder-api-to-clear-all-buffered-ranges
Apr 7, 2026
Conversation

@nathanb9 nathanb9 commented Mar 29, 2026

Contributor

Which issue does this PR close?

Rationale for this change

ParquetPushDecoder clears ranges that exactly match a request, but larger, speculatively pushed ranges can remain buffered in PushBuffers. This adds a way for callers to explicitly release those non-exact ranges.

What changes are included in this PR?

This adds release_all_ranges(), which clears all byte ranges still staged in the decoder's internal PushBuffers.

Are these changes tested?

Partially. Tests were added to verify that the buffer is empty after the call and that clearing does not interrupt the row group reader.

Are there any user-facing changes?

Yes, this adds a new public release_all_ranges() API on ParquetPushDecoder.
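
To illustrate the behavior described above, here is a minimal sketch in plain Rust. It does not use the parquet crate: `ToyPushBuffers`, `buffered_bytes`, and `demo_release` are hypothetical names invented for this example, and the exact-match `clear_ranges` is a simplified model of the clearing behavior this PR describes, not the actual implementation.

```rust
use std::ops::Range;

/// Toy stand-in for the decoder's internal `PushBuffers`.
/// Hypothetical and simplified; not the arrow-rs implementation.
#[derive(Default)]
pub struct ToyPushBuffers {
    // Each entry is a pushed byte range of the file plus its bytes.
    buffers: Vec<(Range<u64>, Vec<u8>)>,
}

impl ToyPushBuffers {
    /// Stage `data` as the contents of `range`.
    pub fn push_range(&mut self, range: Range<u64>, data: Vec<u8>) {
        assert_eq!(range.end - range.start, data.len() as u64);
        self.buffers.push((range, data));
    }

    /// Drop only the buffers whose range exactly equals a consumed range.
    pub fn clear_ranges(&mut self, consumed: &[Range<u64>]) {
        self.buffers.retain(|(r, _)| !consumed.contains(r));
    }

    /// Drop every staged buffer, whether or not it was ever requested.
    pub fn release_all_ranges(&mut self) {
        self.buffers.clear();
    }

    /// Total number of bytes currently staged.
    pub fn buffered_bytes(&self) -> usize {
        self.buffers.iter().map(|(_, d)| d.len()).sum()
    }
}

/// Pushes one exact range and one larger speculative range, then returns the
/// buffered byte count after `clear_ranges` and after `release_all_ranges`.
pub fn demo_release() -> (usize, usize) {
    let mut buffers = ToyPushBuffers::default();
    buffers.push_range(0..10, vec![0; 10]); // exactly what the decoder asked for
    buffers.push_range(10..110, vec![0; 100]); // speculative read-ahead

    // The decoder consumed 0..10 and 10..20; only 0..10 matches a buffer exactly.
    buffers.clear_ranges(&[0..10, 10..20]);
    let after_clear = buffers.buffered_bytes();

    buffers.release_all_ranges();
    (after_clear, buffers.buffered_bytes())
}
```

In this model the 100-byte speculative buffer survives `clear_ranges` and is only dropped by the explicit `release_all_ranges` call, so `demo_release()` returns `(100, 0)`.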

@github-actions github-actions Bot added the parquet Changes to the parquet crate label Mar 29, 2026
@nathanb9 nathanb9 marked this pull request as ready for review March 29, 2026 22:53

@AndreaBozzo AndreaBozzo left a comment

Contributor

I like this; waiting for someone else to have a look as well.

@alamb alamb left a comment

Contributor

This is very nice -- thank you @nathanb9 and @AndreaBozzo

My only comment is about naming. Let me know what you think

Comment thread parquet/src/arrow/push_decoder/reader_builder/mod.rs Outdated
nathanb9 commented Apr 6, 2026

Contributor Author

Thanks @alamb @AndreaBozzo. We should probably also add an analogous API for ParquetMetaDataPushDecoder, since it can also be used to push speculatively. I'll make a PR for that too if you can review it.

Also, if users find clever ways to benefit from speculative pushing, we might eventually want a smarter version of this clear API, or a more granular kind of clear. Maybe we can experiment with this in DataFusion.

alamb commented Apr 7, 2026

Contributor

> We should probably also add an analogous API for ParquetMetaDataPushDecoder, since it can also be used to push speculatively. I'll make a PR for that too if you can review it.

Thanks @nathanb9 -- yes, I agree; that sounds like a good idea to me.

alamb commented Apr 7, 2026

Contributor

I pushed a new commit to this PR to fix CI and merged up from main

@alamb alamb merged commit aac969d into apache:main Apr 7, 2026
16 checks passed
alamb commented Apr 7, 2026

Contributor

Thanks again @nathanb9 and @AndreaBozzo

HippoBaro added a commit to HippoBaro/arrow-rs that referenced this pull request Apr 13, 2026
The `PushDecoder` (introduced in apache#7997, apache#8080) is designed to decouple
IO and CPU. It holds non-contiguous byte ranges, with a
`NeedsData`/`push_range` protocol. However, it requires each logical
read to be satisfied in full by a single physical buffer: `has_range`,
`get_bytes`, and `Read::read` all search for one buffer that entirely
covers the requested range.

This assumption conflates two orthogonal IO strategies:

- Coalescing: the IO layer merges adjacent requested ranges into fewer,
  larger fetches.
- Prefetching: the IO layer pushes data ahead of what the decoder has
  requested. This is an inversion of control: the IO layer speculatively
  fills buffers at offsets not yet requested and for arbitrary buffer
  sizes.

These two strategies interact poorly with the current release
mechanism (`clear_ranges`), which matches buffers by exact range
equality:

- Coalescing is both rewarded and punished. It is load-bearing because
  without it, the number of physical buffers scales with the number of
  ranges requested, and `clear_ranges` performs an O(N×M) scan to remove
  consumed ranges, producing quadratic overhead on wide schemas.

  But it is also punished because a coalesced buffer never exactly
  matches any individual requested range, so `clear_ranges` silently
  skips it: the buffer leaks in `PushBuffers` until the decoder
  finishes or the caller manually calls `release_all_ranges` (apache#9624).
  This increases peak RSS proportionally to the amount of data coalesced
  ahead of the current row group.

- Prefetching is structurally impossible: speculatively pushed
  buffers will straddle future read boundaries, so the decoder
  cannot consume them, and `clear_ranges` cannot release them.
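
To make the single-buffer limitation concrete, here is a minimal sketch of a lookup that only succeeds when one pushed buffer covers the whole requested range. The names `has_range_single_buffer` and `demo_straddle` are hypothetical and chosen for this illustration; this models the rule described above and is not the crate's code.

```rust
use std::ops::Range;

/// Returns true only if a single pushed buffer fully covers `wanted`.
/// Models the "one buffer must cover the whole read" rule (an assumption
/// made for illustration, not the actual `has_range` implementation).
pub fn has_range_single_buffer(pushed: &[Range<u64>], wanted: &Range<u64>) -> bool {
    pushed
        .iter()
        .any(|b| b.start <= wanted.start && wanted.end <= b.end)
}

/// Returns (found when inside one part, found when straddling two parts).
pub fn demo_straddle() -> (bool, bool) {
    // The IO layer prefetched the file in two uniform 64-byte parts.
    let pushed: Vec<Range<u64>> = vec![0..64, 64..128];

    // A read that falls inside one part is found.
    let inside = has_range_single_buffer(&pushed, &(10..20));

    // A read that straddles the part boundary is not found, even though
    // every byte of 60..70 has already been pushed.
    let straddling = has_range_single_buffer(&pushed, &(60..70));

    (inside, straddling)
}
```

Here `demo_straddle()` returns `(true, false)`: the bytes for `60..70` are all present, but no single buffer contains them.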

This commit makes `PushBuffers` boundary-agnostic, completing the
prefetching story, and changes the internals to scale with buffer count
instead of range count:

- Buffer stitching: `has_range`, `get_bytes`, and `Read::read` resolve
  logical ranges across multiple contiguous physical buffers via binary
  search, so the IO layer is free to push arbitrarily-sized parts
  without knowing future read boundaries. This is a nice improvement,
  because some IO layers can be made much more efficient when using
  uniform buffers and vectorized reads.

- Incremental release (`release_through`): replaces `clear_ranges` with
  a watermark-based release that drops all buffers below a byte offset,
  trimming straddling buffers via zero-copy `Bytes::slice`.
  The decoder calls this automatically at row-group boundaries.
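
The two mechanisms above can be sketched together in plain Rust. This is an illustration under stated assumptions, not the commit's implementation: `StitchedBuffers`, `push`, `buffered_bytes`, and `demo_stitch` are hypothetical names, buffers are assumed to be non-overlapping, and the sketch copies bytes with `Vec<u8>` to stay dependency-free, whereas the commit describes zero-copy trimming via `Bytes::slice`.

```rust
/// Toy buffer store: physical parts sorted by start offset, non-overlapping.
#[derive(Default)]
pub struct StitchedBuffers {
    // (start offset in the file, bytes)
    parts: Vec<(u64, Vec<u8>)>,
}

impl StitchedBuffers {
    /// Stage a part, keeping `parts` sorted by start offset.
    pub fn push(&mut self, start: u64, data: Vec<u8>) {
        let idx = self.parts.partition_point(|(s, _)| *s < start);
        self.parts.insert(idx, (start, data));
    }

    /// Copy out `start..end`, stitching across contiguous parts.
    /// Returns `None` if any byte in the range is missing.
    pub fn get_bytes(&self, start: u64, end: u64) -> Option<Vec<u8>> {
        if start >= end {
            return Some(Vec::new());
        }
        // Binary search for the last part that begins at or before `start`.
        let first = self.parts.partition_point(|(s, _)| *s <= start).checked_sub(1)?;
        let mut out = Vec::with_capacity((end - start) as usize);
        let mut pos = start;
        for (s, data) in &self.parts[first..] {
            if pos >= end {
                break;
            }
            let part_end = *s + data.len() as u64;
            if *s > pos || part_end <= pos {
                return None; // gap between parts, or the part ends too early
            }
            let from = (pos - *s) as usize;
            let to = (end.min(part_end) - *s) as usize;
            out.extend_from_slice(&data[from..to]);
            pos = *s + to as u64;
        }
        (pos >= end).then_some(out)
    }

    /// Drop every byte below `watermark`, trimming a straddling part.
    pub fn release_through(&mut self, watermark: u64) {
        // Parts that end at or below the watermark are dropped entirely.
        self.parts.retain(|(s, d)| *s + d.len() as u64 > watermark);
        // Only the first remaining part can still start below the watermark.
        if let Some((s, d)) = self.parts.first_mut() {
            if *s < watermark {
                let tail = d.split_off((watermark - *s) as usize); // copies here
                *d = tail;
                *s = watermark;
            }
        }
    }

    /// Total number of bytes currently staged.
    pub fn buffered_bytes(&self) -> usize {
        self.parts.iter().map(|(_, d)| d.len()).sum()
    }
}

/// Returns (bytes 2..6 stitched across two parts, bytes left after releasing
/// through offset 6, bytes 6..8 read after the release).
pub fn demo_stitch() -> (Option<Vec<u8>>, usize, Option<Vec<u8>>) {
    let mut bufs = StitchedBuffers::default();
    // Two uniform 4-byte parts covering file offsets 0..8, pushed out of order.
    bufs.push(4, vec![4, 5, 6, 7]);
    bufs.push(0, vec![0, 1, 2, 3]);

    let stitched = bufs.get_bytes(2, 6); // spans both parts
    bufs.release_through(6); // e.g. at a row-group boundary
    (stitched, bufs.buffered_bytes(), bufs.get_bytes(6, 8))
}
```

In this sketch the read of `2..6` succeeds by stitching the two parts, and after `release_through(6)` only the two bytes at offsets `6..8` remain buffered and readable.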

Benchmark results (vs baseline):

  push_decoder/1buf/1000ranges       321.9 µs   (was 323.5 µs,  −1%)
  push_decoder/1buf/10000ranges       3.26 ms   (was  3.25 ms,  +0%)
  push_decoder/1buf/100000ranges      34.9 ms   (was  34.6 ms,  +1%)
  push_decoder/1buf/500000ranges     192.2 ms   (was 185.3 ms,  +4%)
  push_decoder/Nbuf/1000ranges       363.9 µs   (was 437.2 µs, −17%)
  push_decoder/Nbuf/10000ranges       3.82 ms   (was  10.7 ms, −64%)
  push_decoder/Nbuf/100000ranges      42.1 ms   (was 711.6 ms, −94%)

Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
alamb pushed a commit that referenced this pull request Apr 13, 2026
This PR is a follow-up to [this ticket](#8676). It implements the same
API, but for the metadata decoder.

See also
#9624 (comment)

## Rationale for this change
`ParquetMetaDataPushDecoder` clears exact requested ranges, but larger
speculative pushed ranges can remain buffered in `PushBuffers`. This
adds a way for callers to explicitly release non-exact ranges.

## What changes are included in this PR?
This adds `clear_all_ranges()`, which clears all byte ranges still
staged in the decoder's internal `PushBuffers`.

## Are these changes tested?
Yes.

## Are there any user-facing changes?
Yes, this adds a new public `clear_all_ranges()` API on
`ParquetMetaDataPushDecoder`.
Rich-T-kid pushed a commit to Rich-T-kid/arrow-rs that referenced this pull request Jun 2, 2026
## Which issue does this PR close?
- Closes apache#8676

## Rationale for this change
`ParquetPushDecoder` clears exact requested ranges, but larger
speculative pushed ranges can remain buffered in `PushBuffers`. This
adds a way for callers to explicitly release non-exact ranges.

## What changes are included in this PR?
This adds `release_all_ranges()`, which clears all byte ranges still
staged in the decoder's internal `PushBuffers`.

## Are these changes tested?
Partially. Tests were added to verify that the buffer is empty after the
call and that clearing does not interrupt the row group reader.

## Are there any user-facing changes?
Yes, this adds a new public `release_all_ranges()` API on
`ParquetPushDecoder`.

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Rich-T-kid pushed a commit to Rich-T-kid/arrow-rs that referenced this pull request Jun 2, 2026
This PR is a follow-up to [this ticket](apache#8676). It implements the
same API, but for the metadata decoder.

See also
apache#9624 (comment)

## Rationale for this change
`ParquetMetaDataPushDecoder` clears exact requested ranges, but larger
speculative pushed ranges can remain buffered in `PushBuffers`. This
adds a way for callers to explicitly release non-exact ranges.

## What changes are included in this PR?
This adds `clear_all_ranges()`, which clears all byte ranges still
staged in the decoder's internal `PushBuffers`.

## Are these changes tested?
Yes.

## Are there any user-facing changes?
Yes, this adds a new public `clear_all_ranges()` API on
`ParquetMetaDataPushDecoder`.
Labels

parquet Changes to the parquet crate

Development

Successfully merging this pull request may close these issues.

Add a way to clear out all buffered ranges from ParquetPushDecoder

3 participants