Spark: Spark tests cache rewrite input by Baunsgaard · Pull Request #16740 · apache/iceberg

Baunsgaard · 2026-06-09T08:29:58Z

What

TestRewriteDataFilesAction is the slowest single class in the Spark core test module. Each test materializes a large (SCALE = 400000-row) input table via a Spark write before exercising the rewrite under test, and many tests reuse the same input shape across the formatVersion matrix.

This caches the written input data files keyed by table shape (formatVersion, spec, files, rows, partitions, properties) and reuses them by re-appending the cached DataFiles to a fresh table. The expensive Spark write of the input now runs once per JVM fork instead of once per test; the rewrite under test still runs per test on its own fresh table.

Applied identically to Spark 3.5, 4.0 and 4.1.

Why it is safe

The generated data is deterministic (fixed Random(42) seed), so reuse is
byte-identical to regenerating it.
Cached files live in a static @TempDir, so they survive across tests (not wiped
by the per-test temp dir) and are cleaned up after the class.
includeColumnStats() is used when collecting the cached files so lower/upper
bounds and value counts are preserved on re-append.
The rewrite under test is unchanged and still runs per test, so no assertion is weakened.

Results

CI core job durations (baseline main run vs. this PR's latest green run; core jobs averaged per Spark version across the JVM/Scala matrix):

Spark core jobs	Before	After	Δ
3.5	40.6 min	36.5 min	−10.0%
4.0	47.3 min	43.8 min	−7.5%
4.1	42.9 min	40.9 min	−4.6%
Average	42.8 min	39.4 min	−7.9%

Test-only change, applied identically across v3.5, v4.0, and v4.1.

Notes

Scoped to the createTable(int) / createTablePartitioned(...) helpers; the few
in-test writeRecords(..., SCALE, ...) call sites are not yet cached, so there is more potential for this change.

TestRewriteDataFilesAction materializes a large (SCALE-row) input table via a Spark write before exercising the rewrite under test, and many tests reuse the same input shape. Cache the written input data files keyed by table shape (formatVersion, spec, files, rows, partitions, properties) and reuse them by re-appending the cached DataFiles to a fresh table, so the Spark write runs once per JVM fork instead of once per test. The rewrite under test still runs per test on its own table, and the generated data is deterministic so reuse is byte-identical. Applied to Spark 3.5, 4.0 and 4.1.

Replace inline string concatenation with String.format when building the input-file cache keys in createTable and createTablePartitioned, for readability. The produced key strings are unchanged.

Replace new ConcurrentHashMap<>() with Maps.newConcurrentMap() to satisfy the Iceberg checkstyle rule that forbids direct ConcurrentHashMap construction.

Replace the computeIfAbsent build with explicit per-key locking and double-checked caching so concurrent callers requesting the same table shape block on the first build and reuse its result instead of materializing identical input twice. The heavy Spark write now happens outside any map lock, so distinct shapes can still build in parallel.

Encode formatVersion and PartitionSpec explicitly in the partitioned cache key so distinct table shapes can never collide on the same key through implicit option dependencies, mirroring the unpartitioned key. Collect cached input files into an immutable list so callers cannot mutate the shared, per-fork cached instance.

Baunsgaard · 2026-06-10T11:19:08Z

Small update on the times:

Core-job wall time, before = baseline main run, after = mean of 3 green
PR runs (per Spark version, averaged across the JVM x Scala matrix to smooth
runner noise):

Spark core jobs	Before	After (mean of 3 runs)	Δ
3.5	40.6 min	37.5 min	−7.6%
4.0	47.3 min	42.9 min	−9.3%
4.1	42.9 min	39.1 min	−8.9%
Average	42.8 min	39.2 min	−8.4%

The per-run overall core average is stable across the three runs (39.4 / 39.3 /
38.9 min), so the ~8% improvement is consistent rather than a single lucky run.

laskoviymishka

Nice, this is right improvement. SCALE write dominates these tests, and caching the golden files per table shape is exactly where the ~8% win should come from. Re-appending the same physical files is spec-legal too, since sequence numbers and firstRowId are assigned on each append.

I’d hold before merge, mostly around the new static cache.

Main issue: cachedInputFiles passes planFiles() into Streams.stream, so the CloseableIterable is never closed. Each cache miss leaks manifest readers. On a long-lived JVM that’s FD pressure, and on Windows it can leave manifest files locked. A try-with-resources around the scan should fix it.

The other one is the unpartitioned cache key. The partitioned path includes table properties, but the unpartitioned key drops them, even though createTable() sets row-group size and that shapes the cached files. Dormant today, but easy future collision.

Before merge I’d fix:

close the planFiles() scan
include table properties in the unpartitioned key
make the static cache safe with static @TempDir lifecycle: clear in @AfterAll, or key by cache dir
add a note/assertion on weaker orphan-scan coverage, since cached inputs live outside table.location()

DCL-vs-PER_METHOD and virtual dispatch are smaller; fine to fold in here.

Once those are addressed, happy to do one more pass.

laskoviymishka · 2026-06-10T15:41:24Z

+        // includeColumnStats() is required: a plain scan drops lower/upper bounds and
+        // value counts, and re-appending stat-less files breaks tests that read bounds.
+        List<DataFile> built =
+            Streams.stream(golden.newScan().includeColumnStats().planFiles())


planFiles() returns a CloseableIterable, and handing it to Streams.stream(Iterable) means it never gets closed — so every cache miss leaks the manifest readers it holds open. One leak per distinct key per fork; on a long-lived JVM that's FD pressure, on Windows it leaves the manifest files locked, and @TempDir cleanup can warn at teardown.

I'd wrap the scan in try-with-resources and collect inside the try:

List<DataFile> built; try (CloseableIterable<FileScanTask> tasks = golden.newScan().includeColumnStats().planFiles()) { built = Streams.stream(tasks) .map(FileScanTask::file) .map(DataFile::copy) .collect(ImmutableList.toImmutableList()); } INPUT_FILE_CACHE.put(key, built); return built;

(The pre-existing planFiles() calls elsewhere in the file have the same shape but are out of scope here.) Same in v4.0 and v4.1.

Okay! Wrapped the scan in try-with-resources and collect inside, mapping IOException to UncheckedIOException. No more reader leak per cache miss.

laskoviymishka · 2026-06-10T15:41:24Z

   * @return the created table
   */
  protected Table createTable(int files) {
+    String key = String.format("unpartitioned|fv=%d|files=%d|rows=%d", formatVersion, files, SCALE);


The unpartitioned key drops the table-properties dimension that the partitioned path includes. The partitioned key folds in opts via new TreeMap<>(options), but here the key is just unpartitioned|fv|files|rows while createTable() sets PARQUET_ROW_GROUP_SIZE_BYTES=20*1024, which shapes split_offsets on every cached DataFile.

It's dormant today because the value is hardcoded, but it's a silent shape collision waiting to happen: if any subclass or future test overrides createTable()'s properties, the cache serves files with mismatched split offsets, and tests like testBinPackSplitLargeFile that lean on accurate offsets fail in a way that's very hard to trace back to here.

I'd mirror the partitioned path and fold the relevant properties into the unpartitioned key. wdyt? Same in v4.0 and v4.1.

I think this is very speculative about future commits, but fine. I pulled the row-group size into a constant (INPUT_PARQUET_ROW_GROUP_SIZE_BYTES) and folded it into the key (rowGroup=%d), mirroring the partitioned path. A future createTable() property override now changes the key instead of silently colliding.

laskoviymishka · 2026-06-10T15:41:24Z

+  // fork and reused by every test that asks for the same shape. The Spark write of SCALE
+  // rows dominates these tests; the rewrite under test still runs per test on a fresh table.
+  @TempDir private static Path inputCacheDir;
+  private static final Map<String, List<DataFile>> INPUT_FILE_CACHE = Maps.newConcurrentMap();


I think there's a subtle lifecycle issue when the class runs twice in one JVM. INPUT_FILE_CACHE is static and never cleared, but inputCacheDir is a static @TempDir, so JUnit deletes and recreates it on a second run in the same JVM (IDE re-run, forkCount=0, suite aggregation). On that second run the cache still holds DataFiles whose location() points into the first, now-deleted directory, so a cache hit hands back dangling paths and you get a FileNotFoundException deep in the Spark Parquet reader.

Invisible in normal Gradle CI since each fork is a fresh JVM, but it bites IntelliJ re-runs.

Either clear the maps and reset the seq in @AfterAll, or key the cache by inputCacheDir.toAbsolutePath() so stale entries can't match a new directory. Either is fine, but one of them should be in. Same in v4.0 and v4.1.

Okay, accordingly added an @AfterAll that clears the cache + lock map and resets the seq. @AfterAll runs once after all tests, so within-run cross-test caching is unchanged; it only stops a second in-JVM run (IDE re-run) from returning DataFiles pointing into the recreated @TempDir.

laskoviymishka · 2026-06-10T15:41:24Z

+      String savedLocation = this.tableLocation;
+      try {
+        this.tableLocation =
+            inputCacheDir.resolve("input-" + INPUT_CACHE_SEQ.incrementAndGet()).toUri().toString();


Caching the input files outside the per-test table location quietly weakens the orphan-file coverage. shouldHaveNoOrphans(table) runs deleteOrphanFiles, which by default only scans table.location() (the per-test tableDir), but the cached inputs live under this static inputCacheDir, outside that prefix — so the startsWith(location) filter in DeleteOrphanFilesSparkAction never sees them.

The tests still pass, but if a bug ever leaked a cached input path into a post-rewrite snapshot, the no-orphans assertions wouldn't catch it — we've lost diagnostic value, not gained a false failure.

I'd at minimum drop a comment near shouldHaveNoOrphans noting the cached inputs are out of scan scope by design; or, if we want to keep the invariant strong, add a manifest-scan assertion that the fresh table's current snapshot contains no inputCacheDir-prefixed paths. Same in v4.0 and v4.1.

Since the cache explicitly saves the files in a different folder, and actively copies them over for the individual tests it does not invalidate any of the shouldHaveNoOrphans calls, but to be defensive, sure we can add it.

Therefore, accordingly I a comment at shouldHaveNoOrphans noting cached inputs live under the static inputCacheDir, outside table.location(), so deleteOrphanFiles intentionally doesn't scan them.

@afterall

- Close planFiles() via try-with-resources in cachedInputFiles so each cache miss no longer leaks the manifest readers the scan holds open. - Encode the Parquet row group size (via a shared constant) in the unpartitioned cache key so a future createTable() property override cannot collide with cached files of a different shape. - Clear the static input-file cache and locks and reset the sequence in @afterall, so a second in-JVM run (IDE re-run, forkCount=0) cannot return DataFiles pointing into a recreated @tempdir. - Document that cached inputs live outside table.location() and are intentionally outside deleteOrphanFiles scan scope.

Baunsgaard · 2026-06-10T16:52:56Z

Nice, this is right improvement. SCALE write dominates these tests, and caching the golden files per table shape is exactly where the ~8% win should come from. Re-appending the same physical files is spec-legal too, since sequence numbers and firstRowId are assigned on each append.

...

Once those are addressed, happy to do one more pass.

Thanks for the review @laskoviymishka ! I have addressed all the four blockers!

One Nit: I left the DCL-vs-PER_METHOD and virtual-dispatch nits as-is for now. I know my implementation on the locking is hypothetical for future parallel execution and the tests are run single threaded now, but i thought it would be fine to make it thread-safe already now.

laskoviymishka

Nice, thanks! Look good to me.

Baunsgaard · 2026-06-17T14:12:43Z

@kevinjqliu Could you take a look, thanks!

kevinjqliu · 2026-06-17T17:11:08Z

Thanks for the improvement @Baunsgaard, the changes looks good to me.
I did have one nit about making the changes easier to read. I think we can get rid of the locks and atomic integer. And possibly isolate the changes to the writeRecords function.

I had codex generate something similar if you would like to take a look, https://github.com/apache/iceberg/compare/main...kevinjqliu:iceberg:kevinjqliu/codex-cache-rewrite-data-files-input?expand=1

Again, its mostly for readability and maintainability -- the effect is the same 😄

Baunsgaard · 2026-06-19T12:10:04Z

Sure, I took a look at your suggestion and it makes sense. Up to you which version you like best.

The pr is open to contributor modification so you can edit directly if you wish.

Otherwise I can change the commit to yours next week and set you as coauthor.

kevinjqliu · 2026-06-19T14:42:00Z

Thanks for taking a look @Baunsgaard! I can open a PR with the edits. I mainly want this part of the tests to be more maintainable, the coauthor part is not a concern 😄

kevinjqliu · 2026-06-19T21:41:06Z

opened #16882 let me know what you think!

This reverts commit e5151f3.

…e#16912) This reverts commit e5151f3.

github-actions Bot added the spark label Jun 9, 2026

Baunsgaard changed the title ~~Spark tests cache rewrite input~~ Spark: Spark tests cache rewrite input Jun 9, 2026

Baunsgaard force-pushed the spark-tests-cache-rewrite-input branch from a7c5d80 to 1ed1cb6 Compare June 9, 2026 10:36

Baunsgaard marked this pull request as ready for review June 9, 2026 22:34

Baunsgaard added 5 commits June 10, 2026 09:58

Spark: Build TestRewriteDataFilesAction cache keys with String.format

e00ae09

Replace inline string concatenation with String.format when building the input-file cache keys in createTable and createTablePartitioned, for readability. The produced key strings are unchanged.

Spark: Use Maps.newConcurrentMap() for input file cache

3fe3e82

Replace new ConcurrentHashMap<>() with Maps.newConcurrentMap() to satisfy the Iceberg checkstyle rule that forbids direct ConcurrentHashMap construction.

Baunsgaard force-pushed the spark-tests-cache-rewrite-input branch from f489961 to 10510d8 Compare June 10, 2026 10:02

Baunsgaard mentioned this pull request Jun 10, 2026

Spark: Cap RandomData collection size to speed up nested tests #16696

Merged

nssalian requested a review from kevinjqliu June 10, 2026 15:13

laskoviymishka requested changes Jun 10, 2026

View reviewed changes

laskoviymishka self-requested a review June 10, 2026 17:08

laskoviymishka approved these changes Jun 10, 2026

View reviewed changes

laskoviymishka merged commit e5151f3 into apache:main Jun 17, 2026
43 checks passed

kevinjqliu mentioned this pull request Jun 19, 2026

Spark: Refactor rewrite input cache setup #16882

Closed

wombatu-kun mentioned this pull request Jun 20, 2026

Spark: Reduce TestRewriteDataFilesAction data volume to speed up CI #16565

Open

laskoviymishka mentioned this pull request Jun 22, 2026

Revert "Spark: Spark tests cache rewrite input" #16912

Merged

amogh-jahagirdar pushed a commit that referenced this pull request Jun 22, 2026

Revert "Spark: Spark tests cache rewrite input (#16740)" (#16912)

902ed1b

This reverts commit e5151f3.

jakelong95 pushed a commit to jakelong95/iceberg that referenced this pull request Jun 29, 2026

Revert "Spark: Spark tests cache rewrite input (apache#16740)" (apach…

09aafc0

…e#16912) This reverts commit e5151f3.

Uh oh!

Conversation

Baunsgaard commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Why it is safe

Results

Notes

Uh oh!

Baunsgaard commented Jun 10, 2026

Uh oh!

laskoviymishka left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Baunsgaard commented Jun 10, 2026

Uh oh!

laskoviymishka left a comment

Choose a reason for hiding this comment

Uh oh!

Baunsgaard commented Jun 17, 2026

Uh oh!

Uh oh!

kevinjqliu commented Jun 17, 2026

Uh oh!

Baunsgaard commented Jun 19, 2026

Uh oh!

kevinjqliu commented Jun 19, 2026

Uh oh!

kevinjqliu commented Jun 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Baunsgaard commented Jun 9, 2026 •

edited

Loading