Skip to content

Spark: Reduce TestRewriteDataFilesAction data volume to speed up CI#16565

Open
wombatu-kun wants to merge 1 commit into
apache:mainfrom
wombatu-kun:issue/16397-rewrite-data-files-test-scale
Open

Spark: Reduce TestRewriteDataFilesAction data volume to speed up CI#16565
wombatu-kun wants to merge 1 commit into
apache:mainfrom
wombatu-kun:issue/16397-rewrite-data-files-test-scale

Conversation

@wombatu-kun

@wombatu-kun wombatu-kun commented May 26, 2026

Copy link
Copy Markdown
Contributor

Part of #16397.

TestRewriteDataFilesAction is the #2 slowest Spark test class in spark-ci (~12.4 min in the profiling gist linked from #16397). It is parameterized only on formatVersion = [2, 3, 4] - each version is meaningful (v2 position deletes, v3 deletion vectors, v4 Parquet manifests) - so its matrix cannot be trimmed. Its runtime is instead dominated by data volume: a shared SCALE = 400000 consumed by ~50 @TestTemplate methods that each write and then rewrite ~400k rows, across three format versions.

What changed

Most methods only assert on file/snapshot counts and rewrite structure, which do not depend on the absolute row count, so they now use a small SCALE = 400. The few methods whose assertions genuinely depend on large files (size-based splitting, sort/z-order shuffle output) keep the original volume via a new LARGE_SCALE = 400000 constant, so they stay byte-for-byte equivalent: testBinPackSplitLargeFile, testBinPackCombineMixedFiles, testBinPackCombineMediumFiles, testAutoSortShuffleOutput, testZOrderSort, testSortCustomSortOrderRequiresRepartition, and testBinPackAfterPartitionChange.

The unpartitioned createTable(int files) gains a createTable(int files, int numRecords) overload that threads the row count to writeRecords (mirroring the existing partitioned createTablePartitioned(..., numRecords, ...)), so the large volume is requested only where it matters.

The same change is applied identically to the v3.5, v4.0, and v4.1 Spark trees.

Measured impact

Measured locally as the JUnit testsuite time summed across the three formatVersion suites, three runs each, via cleanTest test --no-build-cache (forces real re-execution, no cache):

mean of 3 runs tests
Before 688 s (11.5 min) 171 pass / 0 fail
After 316 s (5.3 min) 171 pass / 0 fail

That is a ~54% reduction (≈60% at warm steady-state). Test counts and pass/fail are unchanged across all three trees, so coverage is preserved - only the data volume shrank.

@github-actions github-actions Bot added the spark label May 26, 2026
@kevinjqliu

Copy link
Copy Markdown
Contributor

Thanks for the PR! this was one of the optimizations I observed as well.
Would be great to get some Spark folks to review this; I want to make sure that the smaller SCALE variable does not affect the semantics of those tests

@kevinjqliu

Copy link
Copy Markdown
Contributor

btw TestRewriteDataFilesAction is ran sequentially 168 times in CI, totaling ~13 minutes
https://gist.github.com/kevinjqliu/86827bedf9d02adad36881f476484208#top-25--core-spark

so a 50% reduction should drop ~6 minutes in total wall time for spark-ci 🥳

@wombatu-kun wombatu-kun left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kevinjqliu! Here is the reasoning for why the smaller SCALE is semantics-preserving, to make the review quick.

The methods that stayed on the small SCALE = 400 assert only on file/snapshot counts and rewrite structure, and those are driven by the number of input files, not by the row count. createTable(files) -> writeRecords(files, numRecords, ...) builds the rows and then does .repartition(files), so the file count is fixed by the files argument and is independent of SCALE. For example testBinPackUnpartitionedTable does createTable(4) -> shouldHaveFiles(table, 4) -> rewrite -> addedDataFilesCount() == 1; none of those numbers move when rows-per-file shrinks.

The handful of tests whose assertions genuinely depend on large files are exactly the ones pinned back to LARGE_SCALE = 400000, so they stay byte-for-byte equivalent to before: testBinPackSplitLargeFile, testBinPackCombineMixedFiles, testBinPackCombineMediumFiles, testAutoSortShuffleOutput, testZOrderSort (the last two assert ">= 40 output files", which needs real volume to hold).

The remaining size-based tests that kept the small SCALE are scale-invariant by construction: they derive their target from the actual table, e.g. targetSize = testDataSize(table) / 3 or averageFileSize(table) + 1000, instead of a hardcoded byte count, so the split/combine math holds at any volume.

This matches what the suite reports: test counts and pass/fail are identical before and after, 171 pass / 0 fail, across all three format versions and all three Spark trees (v3.5/v4.0/v4.1). Coverage is unchanged; only the data volume shrank, and no production code is touched.

I am keen to land this quickly: on dev@ yesterday Bob Thomson noted that "with respect to Iceberg we can see over the last 7 days that the project is the top consumer of runner time", and this PR is part of #16397, which targets exactly that - it should drop the ~6 min off spark-ci you measured. Could you give it a review and merge when you get a chance? Happy to also pin any borderline test to LARGE_SCALE if a Spark reviewer prefers.

@wombatu-kun wombatu-kun force-pushed the issue/16397-rewrite-data-files-test-scale branch from bb80a8b to 24d3e83 Compare June 20, 2026 03:16
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@wombatu-kun wombatu-kun force-pushed the issue/16397-rewrite-data-files-test-scale branch from 24d3e83 to f23f3ec Compare June 23, 2026 01:45
@wombatu-kun

Copy link
Copy Markdown
Contributor Author

@kevinjqliu rebased onto current main. The branch had been stacked on the input-file cache from #16740, which was reverted in #16912, so I re-expressed the data-volume reduction as a standalone change (no cache dependency) and restored the original no-cache numbers in the description (688s -> 316s, ~54%). Ready for another look.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants