Skip to content

Spark: Cap RandomData collection size to speed up nested tests#16696

Merged
kevinjqliu merged 1 commit into
apache:mainfrom
Baunsgaard:spark-tests-cap-random-collection-size
Jun 15, 2026
Merged

Spark: Cap RandomData collection size to speed up nested tests#16696
kevinjqliu merged 1 commit into
apache:mainfrom
Baunsgaard:spark-tests-cap-random-collection-size

Conversation

@Baunsgaard

Copy link
Copy Markdown
Contributor

What

RandomData (Spark test helper) sized every generated list and map with
random.nextInt(20). Because the bound is applied at every nesting level, it
multiplies for deeply-nested schemas. The worst case is
AvroDataTestBase.testMixedTypes, which embeds the full ~19-field primitive
struct two-to-three levels deep across five fields — each run generates well
over a million leaf values, so the test cost is dominated by random-data volume
rather than the read/write code paths being exercised.

This replaces the hard-coded 20 with a named constant
MAX_COLLECTION_SIZE = 10 in the Spark 3.5 / 4.0 / 4.1 test copies.

Why

testMixedTypes is the single most expensive test method in the
iceberg-spark core suite, appearing at the top of every format read/write
test class. The collection size has no bearing on coverage — the schemas,
types, and nesting structures under test are identical regardless of how many
elements each collection holds — so this is pure scaffolding overhead.

Impact

Measured locally (JDK 17, Spark 3.5 core), testMixedTypes per class,
single-threaded:

Class before after
TestAvroDataFrameWrite 24.1s 7.8s
TestParquetDataFrameWrite 20.0s 3.6s
TestORCDataFrameWrite 19.9s 3.4s
TestParquetScan 17.7s 2.6s
TestParquetVectorizedScan 17.5s 2.3s
TestAvroScan 17.5s 2.4s

Collections still hold up to nine elements, preserving data variety.

Testing

./gradlew :iceberg-spark:iceberg-spark-3.5_2.13:test5,084 tests, 0
failures
(identical pass/skip counts to before the change).

@github-actions github-actions Bot added the spark label Jun 6, 2026
@Baunsgaard

Copy link
Copy Markdown
Contributor Author

Locally (Spark 3.5 core, JDK 17, single-threaded), drops from
~17–24s to ~2–8s per class (3–7×), with identical pass/skip counts.

However, on CI the effect is small and within runner noise (~±10%): core jobs
but on average ~2% to ~5% faster.

@Baunsgaard Baunsgaard marked this pull request as ready for review June 6, 2026 14:30
// Exclusive upper bound on the number of elements generated for each list/map.
// Applied per nesting level, so deeply-nested schemas multiply quickly; keep
// this small enough to avoid combinatorial blow-up in heavily-nested tests.
private static final int MAX_COLLECTION_SIZE = 10;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see MAX_ENTRIES being used as the naming in RandomGenericData.java.
MAX_COLLECTION_SIZE does seem like a more apt name though but ok to keep MAX_ENTRIES as well to be consistent.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The number before was 20, and reusing the name MAX_ENTRIES might be less clear, especially since we are working with collections specifically here.

Happy to change it if there are some other prefference.

@laskoviymishka laskoviymishka left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice and clean, capping per-level collection size is the right lever for nested-schema blowup, and documenting the multiplicative behavior on the constant is a nice touch.

One thing before merge: v3.4 is still an active module with the same four nextInt(20) sites in its RandomData.java, so its nested tests stay slow and a future v3.5→v3.4 sync could re-introduce the gap. I'd apply the same change there, or note why it's excluded.

Minor: the title says "nested tests" broadly but the shared RandomUtil/RandomGenericData paths are untouched, worth either a follow-up or narrowing to "Spark".

Smaller stuff inline (nothing is blocking).

Comment thread spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/data/RandomData.java Outdated
Comment thread spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/data/RandomData.java Outdated
Comment thread spark/v4.0/spark/src/test/java/org/apache/iceberg/spark/data/RandomData.java Outdated
RandomData sized every generated list and map with random.nextInt(20),
applied at each nesting level. For deeply-nested schemas this multiplies
into well over a million leaf values per run, so the cost is dominated by
data volume rather than the read/write code paths being exercised.

Replace the hard-coded bound with a named constant COLLECTION_SIZE_BOUND
in the Spark 3.5, 4.0, and 4.1 test copies. The name and comment make the
exclusive bound and the legal zero-length case explicit. Schemas, types,
and nesting structures under test are unchanged, so coverage is preserved
while the nested round-trip tests run several times faster.
@Baunsgaard Baunsgaard force-pushed the spark-tests-cap-random-collection-size branch from 86450b1 to f9633fb Compare June 10, 2026 12:10

@laskoviymishka laskoviymishka left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, but will wait for merge for @kevinjqliu to look on it, since he doing a work around optimizing CI recently.

@Baunsgaard

Copy link
Copy Markdown
Contributor Author

As a orthogonal followup i also have this PR: #16740 It gives another 8% if you want to take a look @laskoviymishka and @nssalian ?

@kevinjqliu kevinjqliu left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Thanks you for the change, this will definitely have an outside impact across all CI runs 💯

@kevinjqliu kevinjqliu merged commit 2f244c0 into apache:main Jun 15, 2026
43 checks passed
@kevinjqliu

Copy link
Copy Markdown
Contributor

Thanks again for the PR @Baunsgaard and thank you @nssalian @laskoviymishka for the reviews!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants