Spark: Reduce TestRewriteDataFilesAction data volume to speed up CI by wombatu-kun · Pull Request #16565 · apache/iceberg

wombatu-kun · 2026-05-26T06:37:07Z

Part of #16397.

TestRewriteDataFilesAction is the #2 slowest Spark test class in spark-ci (~12.4 min in the profiling gist linked from #16397). It is parameterized only on formatVersion = [2, 3, 4] - each version is meaningful (v2 position deletes, v3 deletion vectors, v4 Parquet manifests) - so its matrix cannot be trimmed. Its runtime is instead dominated by data volume: a shared SCALE = 400000 consumed by ~50 @TestTemplate methods that each write and then rewrite ~400k rows, across three format versions.

What changed

Most methods only assert on file/snapshot counts and rewrite structure, which do not depend on the absolute row count, so they now use a small SCALE = 400. The few methods whose assertions genuinely depend on large files (size-based splitting, sort/z-order shuffle output) keep the original volume via a new LARGE_SCALE = 400000 constant, so they stay byte-for-byte equivalent: testBinPackSplitLargeFile, testBinPackCombineMixedFiles, testBinPackCombineMediumFiles, testAutoSortShuffleOutput, testZOrderSort, testSortCustomSortOrderRequiresRepartition, and testBinPackAfterPartitionChange.

The unpartitioned createTable(int files) gains a createTable(int files, int numRecords) overload that threads the row count to writeRecords (mirroring the existing partitioned createTablePartitioned(..., numRecords, ...)), so the large volume is requested only where it matters.

The same change is applied identically to the v3.5, v4.0, and v4.1 Spark trees.

Measured impact

Measured locally as the JUnit testsuite time summed across the three formatVersion suites, three runs each, via cleanTest test --no-build-cache (forces real re-execution, no cache):

	mean of 3 runs	tests
Before	688 s (11.5 min)	171 pass / 0 fail
After	316 s (5.3 min)	171 pass / 0 fail

That is a ~54% reduction (≈60% at warm steady-state). Test counts and pass/fail are unchanged across all three trees, so coverage is preserved - only the data volume shrank.

kevinjqliu · 2026-05-26T16:50:40Z

Thanks for the PR! this was one of the optimizations I observed as well.
Would be great to get some Spark folks to review this; I want to make sure that the smaller SCALE variable does not affect the semantics of those tests

kevinjqliu · 2026-05-26T16:53:05Z

btw TestRewriteDataFilesAction is ran sequentially 168 times in CI, totaling ~13 minutes
https://gist.github.com/kevinjqliu/86827bedf9d02adad36881f476484208#top-25--core-spark

so a 50% reduction should drop ~6 minutes in total wall time for spark-ci 🥳

wombatu-kun

Thanks @kevinjqliu! Here is the reasoning for why the smaller SCALE is semantics-preserving, to make the review quick.

The methods that stayed on the small SCALE = 400 assert only on file/snapshot counts and rewrite structure, and those are driven by the number of input files, not by the row count. createTable(files) -> writeRecords(files, numRecords, ...) builds the rows and then does .repartition(files), so the file count is fixed by the files argument and is independent of SCALE. For example testBinPackUnpartitionedTable does createTable(4) -> shouldHaveFiles(table, 4) -> rewrite -> addedDataFilesCount() == 1; none of those numbers move when rows-per-file shrinks.

The handful of tests whose assertions genuinely depend on large files are exactly the ones pinned back to LARGE_SCALE = 400000, so they stay byte-for-byte equivalent to before: testBinPackSplitLargeFile, testBinPackCombineMixedFiles, testBinPackCombineMediumFiles, testAutoSortShuffleOutput, testZOrderSort (the last two assert ">= 40 output files", which needs real volume to hold).

The remaining size-based tests that kept the small SCALE are scale-invariant by construction: they derive their target from the actual table, e.g. targetSize = testDataSize(table) / 3 or averageFileSize(table) + 1000, instead of a hardcoded byte count, so the split/combine math holds at any volume.

This matches what the suite reports: test counts and pass/fail are identical before and after, 171 pass / 0 fail, across all three format versions and all three Spark trees (v3.5/v4.0/v4.1). Coverage is unchanged; only the data volume shrank, and no production code is touched.

I am keen to land this quickly: on dev@ yesterday Bob Thomson noted that "with respect to Iceberg we can see over the last 7 days that the project is the top consumer of runner time", and this PR is part of #16397, which targets exactly that - it should drop the ~6 min off spark-ci you measured. Could you give it a review and merge when you get a chance? Happy to also pin any borderline test to LARGE_SCALE if a Spark reviewer prefers.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

wombatu-kun · 2026-06-23T01:57:56Z

@kevinjqliu rebased onto current main. The branch had been stacked on the input-file cache from #16740, which was reverted in #16912, so I re-expressed the data-volume reduction as a standalone change (no cache dependency) and restored the original no-cache numbers in the description (688s -> 316s, ~54%). Ready for another look.

github-actions Bot added the spark label May 26, 2026

wombatu-kun commented Jun 12, 2026

View reviewed changes

wombatu-kun force-pushed the issue/16397-rewrite-data-files-test-scale branch from bb80a8b to 24d3e83 Compare June 20, 2026 03:16

Spark: Reduce TestRewriteDataFilesAction data volume to speed up CI

f23f3ec

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

wombatu-kun force-pushed the issue/16397-rewrite-data-files-test-scale branch from 24d3e83 to f23f3ec Compare June 23, 2026 01:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Spark: Reduce TestRewriteDataFilesAction data volume to speed up CI#16565

Spark: Reduce TestRewriteDataFilesAction data volume to speed up CI#16565
wombatu-kun wants to merge 1 commit into
apache:mainfrom
wombatu-kun:issue/16397-rewrite-data-files-test-scale

wombatu-kun commented May 26, 2026 •

edited

Loading

Uh oh!

kevinjqliu commented May 26, 2026

Uh oh!

kevinjqliu commented May 26, 2026

Uh oh!

wombatu-kun left a comment

Uh oh!

wombatu-kun commented Jun 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

wombatu-kun commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changed

Measured impact

Uh oh!

kevinjqliu commented May 26, 2026

Uh oh!

kevinjqliu commented May 26, 2026

Uh oh!

wombatu-kun left a comment

Choose a reason for hiding this comment

Uh oh!

wombatu-kun commented Jun 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

wombatu-kun commented May 26, 2026 •

edited

Loading