Core: Improve HadoopFileIO performance when working with cloud storage by steveloughran · Pull Request #15586 · apache/iceberg

steveloughran · 2026-03-11T17:57:31Z

Improve file opening and read times by:

* keeping file status when known, using it in the openFile() call to eliminate HEAD requests
* choosing the file input policy when reading a file (Util.determineReadPolicy()).
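To make the second point concrete, here is a minimal sketch of choosing a read policy from the file type. It is not the PR's `Util.determineReadPolicy()`: the class name `ReadPolicySketch`, the method `forPath()`, and the extension-to-policy mapping are all assumptions for illustration. The option key and the policy names (`vector`, `random`, `sequential`, `whole-file`, `adaptive`) are the ones Hadoop's openFile() builder accepts; a comma-separated value is read as a preference list.

```java
import java.util.Locale;

/** Hypothetical helper: picks an openFile() read policy from a file name. */
final class ReadPolicySketch {
  // Option key understood by Hadoop's openFile() builder.
  static final String READ_POLICY_KEY = "fs.option.openfile.read.policy";

  private ReadPolicySketch() {}

  /** Returns a read policy list for the path; the mapping is illustrative only. */
  static String forPath(String path) {
    String name = path.toLowerCase(Locale.ROOT);
    if (name.endsWith(".parquet") || name.endsWith(".orc")) {
      // Columnar formats seek to the footer and then to column chunks.
      return "vector, random";
    }
    if (name.endsWith(".avro")) {
      // Avro manifests are read front to back.
      return "sequential";
    }
    if (name.endsWith(".json")) {
      // Small metadata files are read in full.
      return "whole-file";
    }
    // Anything else (including puffin, whose policy is still a TODO in this PR)
    // is left to the store's adaptive default.
    return "adaptive";
  }

  public static void main(String[] args) {
    String[] paths = {
      "data/00000-0.parquet", "metadata/snap-1.avro", "metadata/v3.metadata.json", "stats.puffin"
    };
    for (String p : paths) {
      System.out.println(p + " -> " + READ_POLICY_KEY + "=" + forPath(p));
    }
  }
}
```

The returned string would be passed as the value of `fs.option.openfile.read.policy` when building the open request, alongside the known file status or length.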

ParquetIO already hands down file opening to parquet, which does the right thing.
What matters for it is retaining any FileStatus already obtained, which is what the changes in TableMigrationUtil do.

It's a shame that parquet (currently) lacks a way to skip the stat() call which it does to get the file length, as this adds a HEAD request to every opening of a parquet file whose length is already known from a manifest. That is fixable and would save 100+ ms per file opening, as well as the associated IO capacity.

However, until #12554 is fixed, that manifest file length can't be trusted, so the stat() matters. That is possibly why this issue hasn't been noticed in iceberg-java code.

github-actions · 2026-05-11T00:40:52Z

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

steveloughran · 2026-05-11T12:57:23Z

still live

* set read policy on filetype.
* if status is set, pass in
* if length is known, pass in
* handle the future completion by extracting the cause
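The last point, extracting the cause when the openFile() future completes, can be sketched in plain Java. This is an illustration under assumptions, not the code in the PR: the names `FutureUnwrapSketch` and `awaitIO()` are hypothetical. The idea is that an asynchronous open reports failures wrapped in an `ExecutionException`, and callers of a FileIO expect the original `IOException` (for example `FileNotFoundException`) instead.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.io.UncheckedIOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

/** Hypothetical helper: waits for an async open and rethrows the underlying failure. */
final class FutureUnwrapSketch {
  private FutureUnwrapSketch() {}

  static <T> T awaitIO(CompletableFuture<T> future) throws IOException {
    try {
      return future.get();
    } catch (InterruptedException e) {
      // Restore the interrupt flag and surface the interruption as an IO failure.
      Thread.currentThread().interrupt();
      throw new InterruptedIOException(e.toString());
    } catch (ExecutionException e) {
      Throwable cause = e.getCause();
      if (cause instanceof IOException) {
        throw (IOException) cause; // the common case: rethrow as-is
      }
      if (cause instanceof UncheckedIOException) {
        throw ((UncheckedIOException) cause).getCause();
      }
      if (cause instanceof RuntimeException) {
        throw (RuntimeException) cause;
      }
      if (cause instanceof Error) {
        throw (Error) cause;
      }
      // Any other checked exception is wrapped so the signature stays IOException.
      throw new IOException(cause);
    }
  }
}
```

With this in place, a missing file surfaces to the caller as `FileNotFoundException` rather than as a generic wrapper, which keeps existing error handling unchanged.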

- read policy set in open file
TODO
- validate policy choice of puffin files

+ roll back making listing closeable(); too much a change for too little value.

github-actions · 2026-06-12T00:52:17Z

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

github-actions · 2026-06-19T00:56:50Z

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

steveloughran · 2026-06-22T16:39:42Z

This was marked stale and closed while I was on a European-length holiday... clearly it is designed for a US developer world where a 2-week break is unusual.

steveloughran marked this pull request as draft March 11, 2026 17:57

github-actions Bot added core data labels Mar 11, 2026

steveloughran changed the title ~~Improve HadoopFileIO when working with cloud storage~~ Core: Improve HadoopFileIO performance when working with cloud storage Mar 11, 2026

steveloughran mentioned this pull request Apr 10, 2026

Core: update manifest delete file size after rewrite table action #15470

Merged

github-actions Bot added the stale label May 11, 2026

steveloughran force-pushed the pr/15353-cloud-io branch from 64eadc2 to fa92607 Compare May 11, 2026 12:58

github-actions Bot removed the stale label May 12, 2026

steveloughran added 5 commits May 12, 2026 11:26

Optimise HadoopFileIO for cloud IO: open file

6cdb748

* set read policy on filetype.
* if status is set, pass in
* if length is known, pass in
* handle the future completion by extracting the cause

Cloud performance improvements

782f7be

- read policy set in open file
TODO
- validate policy choice of puffin files

Replace HadoopInputFile.fromPath() with fromStatus()

bae71d8

+ roll back making listing closeable(); too much a change for too little value.

Spotless

e09267e

checksyle

7d0dd4e

steveloughran force-pushed the pr/15353-cloud-io branch from fa92607 to 7d0dd4e Compare May 12, 2026 16:41

github-actions Bot added the stale label Jun 12, 2026

github-actions Bot closed this Jun 19, 2026

steveloughran wants to merge 5 commits into apache:main from steveloughran:pr/15353-cloud-io
