feat: add dictionary_columns to scan API for memory-efficient string reads #3234

Closed
tanmayrauth wants to merge 1 commit into apache:main from tanmayrauth:feat/dictionary-columns-scan

Conversation

@tanmayrauth

Exposes dictionary_columns: tuple[str, ...] | None = None on Table.scan() and DataScan and threads it through to PyArrow's ParquetFileFormat, so that the named columns are read as DictionaryArray instead of plain large_utf8. Because each distinct value is then stored once and rows hold only integer indices, this can substantially reduce memory usage for JSON/string columns whose values repeat heavily, i.e. low-cardinality columns (issue #3168). It also addresses the general scan parameter extensibility request (issue #3170).
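To make the memory argument concrete, here is a minimal pure-Python sketch of dictionary encoding. It is only an illustration of the idea, not PyArrow's or PyIceberg's implementation; the helper names `dictionary_encode` and `dictionary_decode` and the sample `payloads` data are hypothetical.

```python
def dictionary_encode(values):
    """Split values into (dictionary, indices).

    Each distinct value is stored once in `dictionary`; `indices` holds
    one small integer per row pointing into it.
    """
    dictionary = []   # distinct values, in first-seen order
    positions = {}    # value -> index into `dictionary`
    indices = []      # one entry per input row
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Rebuild the original row values from the encoded form."""
    return [dictionary[i] for i in indices]


# Hypothetical sample: 3000 rows, but only two distinct JSON strings.
payloads = ['{"status": "ok"}', '{"status": "error"}', '{"status": "ok"}'] * 1000
dictionary, indices = dictionary_encode(payloads)
```

With heavily repeated values the dictionary stays tiny while the per-row cost shrinks to one integer. When nearly every value is distinct, the dictionary grows as large as the column itself and the encoding saves nothing.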

Key implementation details:

  • ORC files are guarded: dictionary_columns is only passed to the reader for Parquet files
  • ArrowScan.to_table() rebuilds the Arrow schema with dictionary types before the empty-table fast path, so the schema is consistent regardless of row count
  • DataScan.to_arrow_batch_reader() rebuilds target_schema with dictionary types to prevent .cast() from silently decoding DictionaryArray back to plain string
  • DataScan.__init__ declares and stores the parameter so that TableScan.update() (which uses inspect.signature) preserves it across scan copies
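The last point can be sketched in isolation. The following is a simplified stand-in rather than PyIceberg's actual code: `ToyScan` and its fields are hypothetical, and the `update()` body only assumes the behaviour described above, namely that the copy is rebuilt from the parameters found via `inspect.signature`.

```python
import inspect


class ToyScan:
    """Hypothetical, stripped-down scan object (not the PyIceberg class)."""

    def __init__(self, row_filter="true", limit=None, dictionary_columns=None):
        # Every constructor parameter is stored under the same attribute
        # name, so update() can read it back when building a copy.
        self.row_filter = row_filter
        self.limit = limit
        self.dictionary_columns = dictionary_columns

    def update(self, **overrides):
        """Return a copy of this scan with some fields replaced."""
        params = inspect.signature(type(self).__init__).parameters
        # Only names declared on __init__ are carried over to the copy.
        current = {name: getattr(self, name) for name in params if name != "self"}
        return type(self)(**{**current, **overrides})


scan = ToyScan(dictionary_columns=("payload",))
narrowed = scan.update(limit=10)
```

Under this pattern, an attribute that is set on the instance but not declared as an `__init__` parameter would not reach the copy, which is why the parameter has to be both declared and stored.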

Fixes #3168, closes #3170

Rationale for this change

Are these changes tested? Yes

Are there any user-facing changes? Yes. Table.scan() and DataScan gain an optional dictionary_columns parameter; it defaults to None.

@tanmayrauth force-pushed the feat/dictionary-columns-scan branch from 52b2070 to 9fc3b0c on April 13, 2026 21:48
@tanmayrauth
Author

@kevinjqliu @Fokko can you please review and approve this?

@tanmayrauth
Author

tanmayrauth commented Apr 16, 2026

@geruh @kevinjqliu @Fokko can you please review this implementation?

@github-actions

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions Bot added the stale label May 17, 2026
@github-actions

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions Bot closed this May 24, 2026