Spark: Cap RandomData collection size to speed up nested tests by Baunsgaard · Pull Request #16696 · apache/iceberg

Baunsgaard · 2026-06-06T12:30:54Z

What

RandomData (Spark test helper) sized every generated list and map with
random.nextInt(20). Because the bound is applied at every nesting level, it
multiplies for deeply-nested schemas. The worst case is
AvroDataTestBase.testMixedTypes, which embeds the full ~19-field primitive
struct two-to-three levels deep across five fields — each run generates well
over a million leaf values, so the test cost is dominated by random-data volume
rather than the read/write code paths being exercised.
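
To make the multiplication concrete, here is a minimal sketch of the arithmetic under a simplified model: every nesting level is a collection sized independently with random.nextInt(bound), and nothing else contributes to the count. The class and method names are hypothetical and this is not the actual RandomData code.

```java
import java.util.Random;

class CollectionBlowup {
  // random.nextInt(bound) is uniform over 0..bound-1, so the mean size is (bound - 1) / 2.
  static double expectedSize(int bound) {
    return (bound - 1) / 2.0;
  }

  // Expected leaf count when collections are nested `depth` levels deep and each
  // level is sized independently with the same bound: the per-level means multiply.
  static double expectedLeaves(int bound, int depth) {
    return Math.pow(expectedSize(bound), depth);
  }

  // Counts the leaves of one randomly generated nesting, for comparison with the mean.
  static long sampleLeaves(Random random, int bound, int depth) {
    if (depth == 0) {
      return 1;
    }
    int size = random.nextInt(bound);
    long leaves = 0;
    for (int i = 0; i < size; i++) {
      leaves += sampleLeaves(random, bound, depth - 1);
    }
    return leaves;
  }

  public static void main(String[] args) {
    for (int depth = 1; depth <= 3; depth++) {
      System.out.printf(
          "depth %d: bound 20 -> %.1f expected leaves, bound 10 -> %.1f expected leaves%n",
          depth, expectedLeaves(20, depth), expectedLeaves(10, depth));
    }
    System.out.println(
        "one sample at depth 3, bound 20: " + sampleLeaves(new Random(42L), 20, 3));
  }
}
```

Under this model the mean collection size falls from 9.5 to 4.5 when the bound goes from 20 to 10, so at three levels of nesting the expected leaf count falls from about 857 to about 91, roughly a 9x reduction per nested field.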

This replaces the hard-coded 20 with a named constant
MAX_COLLECTION_SIZE = 10 in the Spark 3.5 / 4.0 / 4.1 test copies.

Why

testMixedTypes is the single most expensive test method in the
iceberg-spark core suite, appearing at the top of every format read/write
test class. The collection size has no bearing on coverage — the schemas,
types, and nesting structures under test are identical regardless of how many
elements each collection holds — so this is pure scaffolding overhead.

Impact

Measured locally (JDK 17, Spark 3.5 core), testMixedTypes per class,
single-threaded:

Class	before	after
TestAvroDataFrameWrite	24.1s	7.8s
TestParquetDataFrameWrite	20.0s	3.6s
TestORCDataFrameWrite	19.9s	3.4s
TestParquetScan	17.7s	2.6s
TestParquetVectorizedScan	17.5s	2.3s
TestAvroScan	17.5s	2.4s

Collections still hold up to nine elements, preserving data variety.
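
For reference, the per-class speedup implied by the table above can be computed directly from the before/after timings. This is only arithmetic on the numbers reported in this description; the class name is hypothetical.

```java
class MixedTypesSpeedup {
  // Speedup factor: how many times faster the test ran after the change.
  static double speedup(double beforeSeconds, double afterSeconds) {
    return beforeSeconds / afterSeconds;
  }

  public static void main(String[] args) {
    // Timings copied from the table in this description (seconds, testMixedTypes per class).
    String[] classes = {
      "TestAvroDataFrameWrite",
      "TestParquetDataFrameWrite",
      "TestORCDataFrameWrite",
      "TestParquetScan",
      "TestParquetVectorizedScan",
      "TestAvroScan"
    };
    double[] before = {24.1, 20.0, 19.9, 17.7, 17.5, 17.5};
    double[] after = {7.8, 3.6, 3.4, 2.6, 2.3, 2.4};

    for (int i = 0; i < classes.length; i++) {
      System.out.printf("%-28s %.1fx%n", classes[i], speedup(before[i], after[i]));
    }
  }
}
```

The factors range from about 3.1x (TestAvroDataFrameWrite) to about 7.6x (TestParquetVectorizedScan).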

Testing

./gradlew :iceberg-spark:iceberg-spark-3.5_2.13:test — 5,084 tests, 0
failures (identical pass/skip counts to before the change).

Baunsgaard · 2026-06-06T14:17:51Z

Locally (Spark 3.5 core, JDK 17, single-threaded), testMixedTypes drops from
~17–24s to ~2–8s per class (3–7×), with identical pass/skip counts.

However, on CI the effect is small and within runner noise (~±10%): core jobs
are on average only ~2% to ~5% faster.

nssalian · 2026-06-06T18:18:31Z

+  // Exclusive upper bound on the number of elements generated for each list/map.
+  // Applied per nesting level, so deeply-nested schemas multiply quickly; keep
+  // this small enough to avoid combinatorial blow-up in heavily-nested tests.
+  private static final int MAX_COLLECTION_SIZE = 10;


I see MAX_ENTRIES being used as the name in RandomGenericData.java.
MAX_COLLECTION_SIZE does seem like a more apt name, though it is also OK to use MAX_ENTRIES for consistency.

Baunsgaard

The number before was 20, and reusing the name MAX_ENTRIES might be less clear, especially since we are working with collections specifically here.

Happy to change it if there is some other preference.

laskoviymishka

Nice and clean, capping per-level collection size is the right lever for nested-schema blowup, and documenting the multiplicative behavior on the constant is a nice touch.

One thing before merge: v3.4 is still an active module with the same four nextInt(20) sites in its RandomData.java, so its nested tests stay slow and a future v3.5→v3.4 sync could re-introduce the gap. I'd apply the same change there, or note why it's excluded.

Minor: the title says "nested tests" broadly but the shared RandomUtil/RandomGenericData paths are untouched, worth either a follow-up or narrowing to "Spark".

Smaller stuff inline (nothing is blocking).

RandomData sized every generated list and map with random.nextInt(20), applied at each nesting level. For deeply-nested schemas this multiplies into well over a million leaf values per run, so the cost is dominated by data volume rather than the read/write code paths being exercised. Replace the hard-coded bound with a named constant COLLECTION_SIZE_BOUND in the Spark 3.5, 4.0, and 4.1 test copies. The name and comment make the exclusive bound and the legal zero-length case explicit. Schemas, types, and nesting structures under test are unchanged, so coverage is preserved while the nested round-trip tests run several times faster.
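
As an illustration of what "exclusive bound" and "legal zero-length case" mean here, the following is a minimal sketch and not the actual RandomData implementation. The class and method names are hypothetical, and the value 10 is an assumption carried over from the earlier MAX_COLLECTION_SIZE version of the diff.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class BoundedListSketch {
  // Exclusive upper bound on the element count of each generated collection.
  // Assumed value: 10, as in the earlier MAX_COLLECTION_SIZE revision.
  static final int COLLECTION_SIZE_BOUND = 10;

  // Generates a list whose size is uniform over 0..COLLECTION_SIZE_BOUND-1.
  // A size of 0 is a legal outcome, so empty collections are still exercised.
  static List<Integer> randomList(Random random) {
    int size = random.nextInt(COLLECTION_SIZE_BOUND);
    List<Integer> values = new ArrayList<>(size);
    for (int i = 0; i < size; i++) {
      values.add(random.nextInt());
    }
    return values;
  }

  public static void main(String[] args) {
    Random random = new Random(0L);
    int smallest = Integer.MAX_VALUE;
    int largest = Integer.MIN_VALUE;
    for (int i = 0; i < 10_000; i++) {
      int size = randomList(random).size();
      smallest = Math.min(smallest, size);
      largest = Math.max(largest, size);
    }
    System.out.println("smallest size: " + smallest + ", largest size: " + largest);
  }
}
```

Because Random.nextInt(bound) never returns the bound itself, a bound of 10 yields collections of at most nine elements, which matches the "up to nine elements" note in the original description.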

laskoviymishka

Looks good to me, but I will hold off on merging until @kevinjqliu has had a look, since he has been working on optimizing CI recently.

Baunsgaard · 2026-06-10T14:58:25Z

As an orthogonal follow-up, I also have this PR: #16740. It gives another 8%, if you want to take a look, @laskoviymishka and @nssalian.

kevinjqliu

LGTM

Thank you for the change; this will definitely have an outsized impact across all CI runs 💯

kevinjqliu · 2026-06-15T16:39:00Z

Thanks again for the PR @Baunsgaard and thank you @nssalian @laskoviymishka for the reviews!

github-actions Bot added the spark label Jun 6, 2026

Baunsgaard marked this pull request as ready for review June 6, 2026 14:30

nssalian reviewed Jun 6, 2026

laskoviymishka requested changes Jun 10, 2026

Baunsgaard force-pushed the spark-tests-cap-random-collection-size branch from 86450b1 to f9633fb Compare June 10, 2026 12:10

laskoviymishka approved these changes Jun 10, 2026

laskoviymishka requested a review from kevinjqliu June 10, 2026 12:17

nssalian approved these changes Jun 10, 2026

kevinjqliu approved these changes Jun 15, 2026

kevinjqliu merged commit 2f244c0 into apache:main Jun 15, 2026
43 checks passed

kevinjqliu merged 1 commit into apache:main from Baunsgaard:spark-tests-cap-random-collection-size
