Spark: Reduce TestRewriteDataFilesAction data volume to speed up CI #16565
wombatu-kun wants to merge 1 commit into
Thanks for the PR! This was one of the optimizations I observed as well.
btw so a 50% reduction should drop ~6 minutes in total wall time for spark-ci 🥳
wombatu-kun left a comment
Thanks @kevinjqliu! Here is the reasoning for why the smaller SCALE is semantics-preserving, to make the review quick.
The methods that stayed on the small SCALE = 400 assert only on file/snapshot counts and rewrite structure, and those are driven by the number of input files, not by the row count. createTable(files) -> writeRecords(files, numRecords, ...) builds the rows and then does .repartition(files), so the file count is fixed by the files argument and is independent of SCALE. For example testBinPackUnpartitionedTable does createTable(4) -> shouldHaveFiles(table, 4) -> rewrite -> addedDataFilesCount() == 1; none of those numbers move when rows-per-file shrinks.
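To make the file-count argument concrete, here is a minimal sketch in plain Java (no Spark involved). It only models the behaviour described above: the class name and the round-robin bucketing are hypothetical stand-ins for writeRecords plus repartition(files), not the actual test helper.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the test helper; not the real Iceberg/Spark code.
final class FileCountSketch {

  // Models writeRecords(files, numRecords) followed by repartition(files):
  // rows are dealt round-robin into exactly `files` buckets, one bucket per data file.
  static List<List<Integer>> writeRecords(int files, int numRecords) {
    List<List<Integer>> buckets = new ArrayList<>();
    for (int i = 0; i < files; i++) {
      buckets.add(new ArrayList<>());
    }
    for (int row = 0; row < numRecords; row++) {
      buckets.get(row % files).add(row);
    }
    return buckets;
  }

  public static void main(String[] args) {
    int smallScale = 400;      // SCALE after this PR
    int largeScale = 400_000;  // LARGE_SCALE, the original volume

    // The bucket (file) count follows the `files` argument, not the row count.
    System.out.println(writeRecords(4, smallScale).size()); // prints 4
    System.out.println(writeRecords(4, largeScale).size()); // prints 4
  }
}
```

Only the rows per bucket change (100 versus 100,000 here), which is why count-based assertions such as shouldHaveFiles(table, 4) are unaffected.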
The handful of tests whose assertions genuinely depend on large files are exactly the ones pinned back to LARGE_SCALE = 400000, so they stay byte-for-byte equivalent to before: testBinPackSplitLargeFile, testBinPackCombineMixedFiles, testBinPackCombineMediumFiles, testAutoSortShuffleOutput, testZOrderSort (the last two assert ">= 40 output files", which needs real volume to hold).
The remaining size-based tests that kept the small SCALE are scale-invariant by construction: they derive their target from the actual table, e.g. targetSize = testDataSize(table) / 3 or averageFileSize(table) + 1000, instead of a hardcoded byte count, so the split/combine math holds at any volume.
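A rough way to see the scale-invariance is the sketch below, again plain Java under loud assumptions: the table sizes are made-up numbers chosen to divide evenly by 3, and chunks() is a simplified ceiling-division model, not Iceberg's actual rewrite planner (which applies further thresholds).

```java
// Hypothetical model; the byte sizes and the chunk formula are illustrative assumptions.
final class TargetSizeSketch {

  // Number of target-sized chunks needed to hold totalBytes (ceiling division).
  static long chunks(long totalBytes, long targetBytes) {
    return (totalBytes + targetBytes - 1) / targetBytes;
  }

  public static void main(String[] args) {
    long smallTable = 1_200L;      // assumed table size at the small volume
    long largeTable = 1_200_000L;  // assumed table size at the large volume
    long hardcodedTarget = 400_000L;

    // Target derived from the table itself (total / 3): same answer at both volumes.
    System.out.println(chunks(smallTable, smallTable / 3)); // prints 3
    System.out.println(chunks(largeTable, largeTable / 3)); // prints 3

    // Hardcoded byte target: the answer depends on the volume.
    System.out.println(chunks(smallTable, hardcodedTarget)); // prints 1
    System.out.println(chunks(largeTable, hardcodedTarget)); // prints 3
  }
}
```

In this model a derived target keeps the ratio between table size and target fixed, while a hardcoded byte count does not, which is the distinction the paragraph above relies on.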
This matches what the suite reports: test counts and pass/fail are identical before and after, 171 pass / 0 fail, across all three format versions and all three Spark trees (v3.5/v4.0/v4.1). Coverage is unchanged; only the data volume shrank, and no production code is touched.
I am keen to land this quickly: on dev@ yesterday Bob Thomson noted that "with respect to Iceberg we can see over the last 7 days that the project is the top consumer of runner time", and this PR is part of #16397, which targets exactly that - it should drop the ~6 min off spark-ci you measured. Could you give it a review and merge when you get a chance? Happy to also pin any borderline test to LARGE_SCALE if a Spark reviewer prefers.
bb80a8b to 24d3e83
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
24d3e83 to f23f3ec
@kevinjqliu rebased onto current
Part of #16397.
TestRewriteDataFilesAction is the #2 slowest Spark test class in spark-ci (~12.4 min in the profiling gist linked from #16397). It is parameterized only on formatVersion = [2, 3, 4] - each version is meaningful (v2 position deletes, v3 deletion vectors, v4 Parquet manifests) - so its matrix cannot be trimmed. Its runtime is instead dominated by data volume: a shared SCALE = 400000 consumed by ~50 @TestTemplate methods that each write and then rewrite ~400k rows, across three format versions.
What changed
Most methods only assert on file/snapshot counts and rewrite structure, which do not depend on the absolute row count, so they now use a small SCALE = 400. The few methods whose assertions genuinely depend on large files (size-based splitting, sort/z-order shuffle output) keep the original volume via a new LARGE_SCALE = 400000 constant, so they stay byte-for-byte equivalent: testBinPackSplitLargeFile, testBinPackCombineMixedFiles, testBinPackCombineMediumFiles, testAutoSortShuffleOutput, testZOrderSort, testSortCustomSortOrderRequiresRepartition, and testBinPackAfterPartitionChange.
The unpartitioned createTable(int files) gains a createTable(int files, int numRecords) overload that threads the row count to writeRecords (mirroring the existing partitioned createTablePartitioned(..., numRecords, ...)), so the large volume is requested only where it matters.
The same change is applied identically to the v3.5, v4.0, and v4.1 Spark trees.
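For readers skimming without the diff open, the overload pattern looks roughly like the sketch below. The two createTable signatures and the constant values come from the description above; the delegation from the one-argument form to SCALE and the TableStub type are assumptions made for illustration, not the actual test code.

```java
// Illustrative sketch only; TableStub and the method bodies are hypothetical.
final class CreateTableOverloadSketch {
  static final int SCALE = 400;            // new default volume
  static final int LARGE_SCALE = 400_000;  // original volume, requested explicitly

  // Stand-in for a created table: records what the helper was asked to write.
  static final class TableStub {
    final int files;
    final int numRecords;

    TableStub(int files, int numRecords) {
      this.files = files;
      this.numRecords = numRecords;
    }
  }

  // Existing entry point: assumed here to fall back to the small default.
  static TableStub createTable(int files) {
    return createTable(files, SCALE);
  }

  // New overload: threads the row count through (to writeRecords in the real helper).
  static TableStub createTable(int files, int numRecords) {
    return new TableStub(files, numRecords);
  }

  public static void main(String[] args) {
    System.out.println(createTable(4).numRecords);              // prints 400
    System.out.println(createTable(4, LARGE_SCALE).numRecords); // prints 400000
  }
}
```

With this shape, only the pinned tests need to pass LARGE_SCALE explicitly; every other call site keeps its existing createTable(files) call.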
Measured impact
Measured locally as the JUnit testsuite time summed across the three formatVersion suites, three runs each, via cleanTest test --no-build-cache (forces real re-execution, no cache):
That is a ~54% reduction (≈60% at warm steady-state). Test counts and pass/fail are unchanged across all three trees, so coverage is preserved - only the data volume shrank.