feat: Add ArrivalOrder to ArrowScan for bounded-memory concurrent reads#44
Merged
robreeves merged 1 commit into linkedin:li-0.11 on Apr 3, 2026
Conversation
Backport of apache/iceberg-python#3046. Adds a new `order` parameter to `to_arrow_batch_reader()` with TaskOrder (default) and ArrivalOrder implementations to support bounded-memory concurrent reads.
ShreyeshArangath approved these changes on Apr 3, 2026
Summary
Addresses #3036 — `ArrowScan.to_record_batches()` uses `executor.map` + `list()`, which eagerly materializes all record batches per file into memory, causing OOM on large tables.
This PR adds a new `order` parameter to `to_arrow_batch_reader()` with two implementations:

- `TaskOrder` (default) — preserves existing behavior: batches grouped by file in task submission order, each file fully materialized before proceeding to the next.
- `ArrivalOrder` — yields batches as they are produced across files without materializing entire files into memory. Accepts three sub-parameters:
  - `concurrent_streams: int` — number of files to read concurrently (default: 8). A per-scan `ThreadPoolExecutor(max_workers=concurrent_streams)` bounds concurrency.
  - `batch_size: int | None` — number of rows per batch passed to PyArrow's `ds.Scanner` (default: PyArrow's built-in 131,072).
  - `max_buffered_batches: int` — size of the bounded queue between producers and consumer (default: 16), providing backpressure to cap memory usage.

Problem
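To make the parameter surface concrete, here is a minimal sketch of the two order classes with the defaults described above. These dataclasses are illustrative stand-ins, not the PR's actual code — the real classes live in `pyiceberg.table`:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskOrder:
    """Default: batches grouped by file, in task submission order."""


@dataclass
class ArrivalOrder:
    concurrent_streams: int = 8       # files read concurrently
    batch_size: Optional[int] = None  # rows per batch; None -> PyArrow's 131,072
    max_buffered_batches: int = 16    # bounded queue size (backpressure)


# Tuning example: fewer streams, larger buffer.
order = ArrivalOrder(concurrent_streams=4, max_buffered_batches=32)
print(order.concurrent_streams, order.batch_size, order.max_buffered_batches)
```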
The current implementation materializes all batches from each file via list() inside executor.map, which runs up to min(32, cpu_count+4) files in parallel. For large files this means all batches from ~20 files are held in memory simultaneously before any are yielded to the consumer.
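The eager-materialization problem can be reproduced with a self-contained sketch (the file/batch names are placeholders, not pyiceberg code):

```python
from concurrent.futures import ThreadPoolExecutor


def read_file(path):
    # Stand-in for reading record batches from one data file.
    return [f"{path}:batch{i}" for i in range(3)]


paths = [f"file{i}" for i in range(4)]

with ThreadPoolExecutor() as pool:
    # The problematic pattern: list(...) inside map materializes every
    # batch of every file before any batch reaches the consumer.
    per_file = list(pool.map(lambda p: list(read_file(p)), paths))

all_batches = [b for batches in per_file for b in batches]
print(len(all_batches))  # all 12 batches resident in memory at once
```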
Solution
Before: OOM on large tables
After: bounded memory, tunable parallelism
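The bounded-memory "after" path can be sketched as a producer/consumer pipeline with a bounded queue. This is a simplified stand-in for the pattern, assuming one producer per file and a sentinel to signal completion — function and variable names here are illustrative, not the PR's:

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from queue import Queue

_SENTINEL = object()


def bounded_concurrent_batches(files, read_batches,
                               concurrent_streams=8, max_buffered_batches=16):
    # One producer per file pushes batches into a bounded queue; a full
    # queue blocks producers, capping how many batches are buffered.
    if not files:
        return
    q = Queue(maxsize=max_buffered_batches)
    remaining = len(files)
    lock = threading.Lock()

    def produce(f):
        nonlocal remaining
        try:
            for batch in read_batches(f):
                q.put(batch)  # blocks when the queue is full (backpressure)
        finally:
            with lock:
                remaining -= 1
                if remaining == 0:
                    q.put(_SENTINEL)  # last producer signals completion

    with ThreadPoolExecutor(max_workers=concurrent_streams) as pool:
        for f in files:
            pool.submit(produce, f)
        while (item := q.get()) is not _SENTINEL:
            yield item


def read_batches(f):
    return [f"{f}:b{i}" for i in range(3)]


out = list(bounded_concurrent_batches(["f1", "f2"], read_batches,
                                      concurrent_streams=2,
                                      max_buffered_batches=4))
print(sorted(out))
```

At any moment, at most `max_buffered_batches` batches sit in the queue (plus whatever each worker holds in flight), so memory stays bounded regardless of table size.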
Default behavior is unchanged — `TaskOrder` preserves the existing `executor.map` + `list()` path for backwards compatibility.

Architecture
When `order=ArrivalOrder(...)`, batches flow through `_bounded_concurrent_batches`:

- A `ThreadPoolExecutor(max_workers=concurrent_streams)` runs one producer per file.
- Producers push into a `Queue(maxsize=max_buffered_batches)` — when full, workers block (backpressure).
- The consumer drains batches with `queue.get()` and yields them to the caller.

Refactored `to_record_batches` into helpers: `_prepare_tasks_and_deletes`, `_iter_batches_arrival`, `_iter_batches_materialized`, `_apply_limit`.

Ordering semantics
- `TaskOrder()` (default): deterministic; batches grouped by file in task submission order.
- `ArrivalOrder(concurrent_streams=1)`: files stream one at a time, so batch order matches task order while memory stays bounded.
- `ArrivalOrder(concurrent_streams>1)`: batches from different files interleave in arrival order, which is non-deterministic across runs.

Benchmark results
32 files × 500K rows, 5 columns (int64, float64, string, bool, timestamp), batch_size=131,072 (PyArrow default):
TTFR = Time to First Record, cs = concurrent_streams
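A back-of-envelope memory bound helps put these settings in context. This is a rough model, not from the PR, assuming each worker holds at most one in-flight batch in addition to the shared queue, and a hypothetical average row width:

```python
# Rough upper bound on buffered data under ArrivalOrder.
concurrent_streams = 8       # default
max_buffered_batches = 16    # default
batch_size = 131_072         # PyArrow's default rows per batch
bytes_per_row = 100          # hypothetical average row width

# Queue slots plus one in-flight batch per worker.
buffered_rows = (concurrent_streams + max_buffered_batches) * batch_size
print(buffered_rows)                                  # 3145728 rows
print(buffered_rows * bytes_per_row // 2**20, "MiB")  # 300 MiB
```

By contrast, the eager path buffers every batch of every in-flight file, so its footprint grows with file size rather than staying at a fixed ceiling.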
Are these changes tested?
Yes. 25 new unit tests across two test files, plus a micro-benchmark.
Are there any user-facing changes?
Yes. New `order` parameter on `DataScan.to_arrow_batch_reader()`:

- `order: ScanOrder | None` — controls batch ordering. Pass `TaskOrder()` (default) or `ArrivalOrder(concurrent_streams=N, batch_size=B, max_buffered_batches=M)`.
- New public classes `TaskOrder` and `ArrivalOrder` (subclasses of `ScanOrder`) exported from `pyiceberg.table`.

All parameters are optional with backwards-compatible defaults. Existing code is unaffected.
Documentation updated in `mkdocs/docs/api.md` with usage examples, ordering semantics, and a configuration guidance table.