Spark: Cap RandomData collection size to speed up nested tests #16696
Locally (Spark 3.5 core, JDK 17, single-threaded), drops from However, on CI the effect is small and within runner noise (~±10%): core jobs
// Exclusive upper bound on the number of elements generated for each list/map.
// Applied per nesting level, so deeply-nested schemas multiply quickly; keep
// this small enough to avoid combinatorial blow-up in heavily-nested tests.
private static final int MAX_COLLECTION_SIZE = 10;
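A minimal sketch of how an exclusive bound like this behaves when sizing a generated list. The class `CollectionSizeSketch` and its methods are hypothetical stand-ins for illustration, not the actual `RandomData` source; only `java.util.Random.nextInt(int)` is a real API here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical stand-in for the test helper; not the actual RandomData code.
final class CollectionSizeSketch {
  // Exclusive upper bound, mirroring the constant in the diff above.
  private static final int MAX_COLLECTION_SIZE = 10;

  private CollectionSizeSketch() {}

  // Builds a list whose length is drawn uniformly from [0, MAX_COLLECTION_SIZE).
  static List<Integer> randomList(Random random) {
    int size = random.nextInt(MAX_COLLECTION_SIZE); // 0..9 inclusive, never 10
    List<Integer> result = new ArrayList<>(size);
    for (int i = 0; i < size; i++) {
      result.add(random.nextInt());
    }
    return result;
  }

  // Returns the largest list length seen over the given number of draws.
  static int maxObservedSize(long seed, int draws) {
    Random random = new Random(seed);
    int max = 0;
    for (int i = 0; i < draws; i++) {
      max = Math.max(max, randomList(random).size());
    }
    return max;
  }
}
```

Because `nextInt(bound)` excludes the bound itself, a value of 10 allows lists of zero to nine elements, so the empty-collection case stays reachable.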
I see MAX_ENTRIES used as the name in RandomGenericData.java.
MAX_COLLECTION_SIZE does seem like the more apt name, but it is also fine to keep MAX_ENTRIES for consistency.
The number before was 20, and reusing the name MAX_ENTRIES might be less clear, especially since we are working specifically with collections here.
Happy to change it if there is another preference.
Nice and clean. Capping per-level collection size is the right lever for nested-schema blowup, and documenting the multiplicative behavior on the constant is a nice touch.
One thing before merge: v3.4 is still an active module with the same four nextInt(20) sites in its RandomData.java, so its nested tests stay slow and a future v3.5→v3.4 sync could re-introduce the gap. I'd apply the same change there, or note why it's excluded.
Minor: the title says "nested tests" broadly, but the shared RandomUtil/RandomGenericData paths are untouched. Worth either a follow-up or narrowing to "Spark".
Smaller stuff inline (nothing is blocking).
RandomData sized every generated list and map with random.nextInt(20), applied at each nesting level. For deeply-nested schemas this multiplies into well over a million leaf values per run, so the cost is dominated by data volume rather than the read/write code paths being exercised. Replace the hard-coded bound with a named constant COLLECTION_SIZE_BOUND in the Spark 3.5, 4.0, and 4.1 test copies. The name and comment make the exclusive bound and the legal zero-length case explicit. Schemas, types, and nesting structures under test are unchanged, so coverage is preserved while the nested round-trip tests run several times faster.
86450b1 to f9633fb
laskoviymishka
left a comment
Looks good to me, but I will hold off on merging until @kevinjqliu has had a look, since he has been working on optimizing CI recently.
As an orthogonal follow-up, I also have this PR: #16740. It gives another 8%, if you want to take a look, @laskoviymishka and @nssalian.
kevinjqliu
left a comment
LGTM
Thank you for the change, this will definitely have an outsized impact across all CI runs 💯
Thanks again for the PR @Baunsgaard and thank you @nssalian @laskoviymishka for the reviews!
What
RandomData (Spark test helper) sized every generated list and map with random.nextInt(20). Because the bound is applied at every nesting level, it multiplies for deeply-nested schemas. The worst case is AvroDataTestBase.testMixedTypes, which embeds the full ~19-field primitive struct two-to-three levels deep across five fields. Each run generates well over a million leaf values, so the test cost is dominated by random-data volume rather than the read/write code paths being exercised.
This replaces the hard-coded 20 with a named constant MAX_COLLECTION_SIZE = 10 in the Spark 3.5 / 4.0 / 4.1 test copies.
Why
testMixedTypes is the single most expensive test method in the iceberg-spark core suite, appearing at the top of every format read/write test class. The collection size has no bearing on coverage: the schemas, types, and nesting structures under test are identical regardless of how many elements each collection holds, so this is pure scaffolding overhead.
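The multiplicative effect of a per-level bound can be sketched with a simplified model. This assumes purely nested collections whose sizes are drawn independently and uniformly from [0, bound) at every level; it is an idealization for illustration and does not reproduce the actual testMixedTypes schema or its leaf counts. The class and method names are hypothetical.

```java
// Idealized model: each nesting level holds a collection whose size is
// uniform on [0, bound), chosen independently of every other collection.
final class NestingGrowthSketch {
  private NestingGrowthSketch() {}

  // Expected leaf count under the model: the mean size (bound - 1) / 2
  // multiplies once per nesting level.
  static double expectedLeaves(int bound, int depth) {
    double meanSize = (bound - 1) / 2.0;
    return Math.pow(meanSize, depth);
  }

  // Worst-case leaf count: every collection at every level has bound - 1 elements.
  static long maxLeaves(int bound, int depth) {
    long result = 1;
    for (int level = 0; level < depth; level++) {
      result *= (bound - 1);
    }
    return result;
  }

  public static void main(String[] args) {
    for (int depth = 1; depth <= 3; depth++) {
      System.out.printf(
          "depth=%d expected(20)=%.3f expected(10)=%.3f max(20)=%d max(10)=%d%n",
          depth,
          expectedLeaves(20, depth),
          expectedLeaves(10, depth),
          maxLeaves(20, depth),
          maxLeaves(10, depth));
    }
  }
}
```

Under this model, three levels of nesting give an expected 9.5³ ≈ 857 leaves per top-level collection at a bound of 20, against 4.5³ ≈ 91 at a bound of 10. Halving the bound therefore cuts the expected volume by roughly a factor of nine at that depth, while the shape of the data is unchanged.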
Impact
Measured locally (JDK 17, Spark 3.5 core), testMixedTypes per class, single-threaded:
Collections still hold up to nine elements, preserving data variety.
Testing
./gradlew :iceberg-spark:iceberg-spark-3.5_2.13:test reported 5,084 tests, 0 failures (identical pass/skip counts to before the change).