[SPARK-57420][INFRA] Add generate-tpcds input and early CPU check to benchmark workflow#56479
[SPARK-57420][INFRA] Add generate-tpcds input and early CPU check to benchmark workflow#56479iemejia wants to merge 2 commits into
Conversation
|
@LuciferYang Would you mind taking a look at this one when you get a chance? While working on the Parquet encoding benchmarks, I noticed the workflow was spending ~5-10 min generating TPC-DS data on every run even when the benchmark does not use it (because I also kept having to wait for full 20-30 min runs to complete only to discover the runner landed on the wrong CPU. For that, I added an optional These two small fixes should save a lot of time for anyone using the benchmark workflow with specific class patterns and CPU-sensitive comparisons. Happy to adjust anything if needed. |
| CPU_MODEL=$(grep "model name" /proc/cpuinfo | head -1 | sed 's/model name\s*:\s*//') | ||
| echo "Runner CPU: $CPU_MODEL" | ||
| echo "::notice::Runner CPU: $CPU_MODEL" | ||
| if [ -n "${{ inputs.expected-cpu }}" ]; then |
There was a problem hiding this comment.
Nit: ${{ inputs.expected-cpu }} is template-expanded before the shell runs, so special characters in the input would be interpreted as shell syntax. For workflow_dispatch this is low-risk (only committers can trigger), but the standard hygiene fix is to route through env: so it becomes a properly-quoted shell variable:
env:
EXPECTED_CPU: ${{ inputs.expected-cpu }}
run: |
...
if echo "$CPU_MODEL" | grep -qF "$EXPECTED_CPU"; thenThere was a problem hiding this comment.
Good catch, thanks! Updated to route through env: so the input is a properly-quoted shell variable. Fixed in the next push.
|
This is clean workflow improvement. The Left one minor hygiene nit on the CPU check step. |
|
Thanks for the review @shrirangmhalgi! Addressed the nit -- now routing |
…k CPU compatibility early in benchmark workflow Three improvements to the benchmark workflow: 1. Skip TPC-DS data generation for non-TPCDS benchmarks. Replace the verbose list of contains() checks with a single contains(inputs.class, 'TPCDS') that catches both exact class names and globs like '*TPCDS*'. Use exact equality for '*' to avoid matching wildcard patterns like '*VectorizedDeltaReaderBenchmark' that do not need TPC-DS data (~5-10 min saved per run). 2. Add early CPU model check step that runs immediately after checkout, before compilation. Prints the CPU as a ::notice:: annotation for live visibility, and optionally fails fast if the runner CPU does not match the expected-cpu input parameter. The input is routed through an environment variable for shell safety. 3. Simplify the TPC-DS cache condition to match the generation condition, keeping both in sync. Assisted-by: OpenCode:claude-opus-4.6
84f28d6 to
2fb16a7
Compare
|
@uros-b Sorry for the noise -- I had to rebase and force-push due to a mistake on my side. The changes include the fix you suggested. Could you please re-review when you get a chance? Thank you! |
Replace the heuristic condition `contains(inputs.class, 'TPCDS') || inputs.class == '*'` with an explicit `generate-tpcds` boolean input (default: true). This eliminates edge cases with glob patterns like `org.apache.spark.sql.execution.benchmark.*` and is future-proof against new class names. Assisted-by: OpenCode:claude-opus-4.6
|
It should be good now. PTAL @uros-b |
| java-version: ${{ inputs.jdk }} | ||
| - name: Cache TPC-DS generated data | ||
| if: contains(inputs.class, 'TPCDSQueryBenchmark') || contains(inputs.class, 'LZ4TPCDSDataBenchmark') || contains(inputs.class, 'ZStandardTPCDSDataBenchmark') || contains(inputs.class, '*') | ||
| if: inputs.generate-tpcds |
There was a problem hiding this comment.
I'm fine with this change, given that the auto-detection rule is tricky.
|
@LuciferYang it seems we are all good to merge this one. WDYT ? |
|
Do you have any other suggestions? @uros-b |
uros-b
left a comment
There was a problem hiding this comment.
Thank you @iemejia for following through with this approach, I think it's more clear and maintainable now! Also thank you @shrirangmhalgi @LuciferYang for reviews.
|
This has three approvals, can someone please merge it. Thanks! |
…benchmark workflow ### What changes were proposed in this pull request? Two improvements to the benchmark workflow: 1. **Add explicit `generate-tpcds` boolean input** (default: `true`) to control TPC-DS data generation. Users running non-TPC-DS benchmarks can set it to `false` to skip the expensive generation step (~5-10 min saved per run). This replaces the previous `contains(inputs.class, '*')` heuristic which incorrectly triggered TPC-DS generation for wildcard patterns like `*VectorizedDeltaReaderBenchmark`. 2. **Add early CPU model check step** that runs immediately after checkout, before compilation. Prints the CPU as a `::notice::` annotation for live visibility in the Actions UI, and optionally fails fast if the runner CPU does not match the `expected-cpu` input parameter. ### Why are the changes needed? The benchmark workflow previously used heuristic pattern matching (`contains(inputs.class, '*')`, later `contains(inputs.class, 'TPCDS')`) to decide whether to generate TPC-DS data. This approach had edge cases -- wildcard patterns that don't need TPC-DS data would trigger generation, and generic package-level globs like `org.apache.spark.sql.execution.benchmark.*` that do include TPC-DS benchmarks would miss it. An explicit boolean input eliminates all ambiguity and is future-proof against new class names. Additionally, when benchmark results need to match a specific CPU (e.g., AMD EPYC 7763 for consistent comparisons against upstream baselines), there is no way to detect a CPU mismatch until the full benchmark completes (~20-30 min). The early CPU check allows the job to fail within seconds of starting if the runner does not match, saving significant time and compute. ### Does this PR introduce _any_ user-facing change? No. This only affects the GHA benchmark workflow. Existing behavior is preserved: `generate-tpcds` defaults to `true` (matching the default `class: '*'`), and `expected-cpu` defaults to empty (no check). ### How was this patch tested? The workflow changes are self-contained in `.github/workflows/benchmark.yml`. Tested by inspection. Both new parameters are optional with backward-compatible defaults. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: OpenCode (Claude claude-opus-4.6) Closes #56479 from iemejia/SPARK-57420-benchmark-workflow-improvements. Authored-by: Ismaël Mejía <iemejia@gmail.com> Signed-off-by: yangjie01 <yangjie01@baidu.com> (cherry picked from commit 9dab83b) Signed-off-by: yangjie01 <yangjie01@baidu.com>
|
Merged into master/branch-4.x. Thanks @iemejia @uros-b @pan3793 @shrirangmhalgi |
What changes were proposed in this pull request?
Two improvements to the benchmark workflow:
Add explicit
generate-tpcdsboolean input (default:true) to control TPC-DS data generation. Users running non-TPC-DS benchmarks can set it tofalseto skip the expensive generation step (~5-10 min saved per run). This replaces the previouscontains(inputs.class, '*')heuristic which incorrectly triggered TPC-DS generation for wildcard patterns like*VectorizedDeltaReaderBenchmark.Add early CPU model check step that runs immediately after checkout, before compilation. Prints the CPU as a
::notice::annotation for live visibility in the Actions UI, and optionally fails fast if the runner CPU does not match theexpected-cpuinput parameter.Why are the changes needed?
The benchmark workflow previously used heuristic pattern matching (
contains(inputs.class, '*'), latercontains(inputs.class, 'TPCDS')) to decide whether to generate TPC-DS data. This approach had edge cases -- wildcard patterns that don't need TPC-DS data would trigger generation, and generic package-level globs likeorg.apache.spark.sql.execution.benchmark.*that do include TPC-DS benchmarks would miss it. An explicit boolean input eliminates all ambiguity and is future-proof against new class names.Additionally, when benchmark results need to match a specific CPU (e.g., AMD EPYC 7763 for consistent comparisons against upstream baselines), there is no way to detect a CPU mismatch until the full benchmark completes (~20-30 min). The early CPU check allows the job to fail within seconds of starting if the runner does not match, saving significant time and compute.
Does this PR introduce any user-facing change?
No. This only affects the GHA benchmark workflow. Existing behavior is preserved:
generate-tpcdsdefaults totrue(matching the defaultclass: '*'), andexpected-cpudefaults to empty (no check).How was this patch tested?
The workflow changes are self-contained in
.github/workflows/benchmark.yml. Tested by inspection. Both new parameters are optional with backward-compatible defaults.Was this patch authored or co-authored using generative AI tooling?
Generated-by: OpenCode (Claude claude-opus-4.6)