[SPARK-57420][INFRA] Add generate-tpcds input and early CPU check to benchmark workflow by iemejia · Pull Request #56479 · apache/spark

iemejia · 2026-06-12T18:53:27Z

What changes were proposed in this pull request?

Two improvements to the benchmark workflow:

1. Add explicit `generate-tpcds` boolean input (default: `true`) to control TPC-DS data generation. Users running non-TPC-DS benchmarks can set it to `false` to skip the expensive generation step (~5-10 min saved per run). This replaces the previous `contains(inputs.class, '*')` heuristic which incorrectly triggered TPC-DS generation for wildcard patterns like `*VectorizedDeltaReaderBenchmark`.
2. Add early CPU model check step that runs immediately after checkout, before compilation. Prints the CPU as a `::notice::` annotation for live visibility in the Actions UI, and optionally fails fast if the runner CPU does not match the `expected-cpu` input parameter.
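As a rough illustration of the early CPU check, here is a minimal shell sketch. The helper names `cpu_model` and `check_cpu` are hypothetical and are not the step in `.github/workflows/benchmark.yml`; the sketch only assumes a Linux-style `cpuinfo` file and a substring match against the expected model.

```shell
#!/usr/bin/env bash
# Hypothetical helpers illustrating the idea; not the workflow's actual step.

# Print the first "model name" value found in a cpuinfo-style file.
cpu_model() {
  local cpuinfo_file="${1:-/proc/cpuinfo}"
  grep -m1 'model name' "$cpuinfo_file" | sed 's/^[^:]*:[[:space:]]*//'
}

# Return 0 when no expectation is set or when the model contains the
# expected substring; print an error annotation and return 1 otherwise.
check_cpu() {
  local model="$1" expected="$2"
  if [ -z "$expected" ]; then
    return 0
  fi
  if printf '%s\n' "$model" | grep -qF -- "$expected"; then
    return 0
  fi
  echo "::error::Expected CPU '$expected' but runner has '$model'" >&2
  return 1
}
```

In a workflow step, something like `check_cpu "$(cpu_model)" "$EXPECTED_CPU" || exit 1` placed right after checkout would stop the job before compilation starts.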

Why are the changes needed?

The benchmark workflow previously used heuristic pattern matching (contains(inputs.class, '*'), later contains(inputs.class, 'TPCDS')) to decide whether to generate TPC-DS data. This approach had edge cases -- wildcard patterns that don't need TPC-DS data would trigger generation, and generic package-level globs like org.apache.spark.sql.execution.benchmark.* that do include TPC-DS benchmarks would miss it. An explicit boolean input eliminates all ambiguity and is future-proof against new class names.
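The two edge cases can be reproduced with a small shell sketch of substring matching. The `contains` function here is a hypothetical stand-in for the GitHub Actions expression of the same name, simplified to be case-sensitive (the real expression function is not case sensitive).

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the Actions `contains(search, item)` expression:
# succeeds when $2 occurs anywhere in $1 as a literal substring.
contains() {
  case "$1" in
    *"$2"*) return 0 ;;
    *) return 1 ;;
  esac
}

# False positive: a wildcard pattern that needs no TPC-DS data still matches '*'.
contains '*VectorizedDeltaReaderBenchmark' '*' && echo "generates TPC-DS data"

# False negative: a package-level glob that includes TPC-DS benchmarks
# does not contain the literal text 'TPCDS'.
contains 'org.apache.spark.sql.execution.benchmark.*' 'TPCDS' || echo "skips TPC-DS data"
```

Neither heuristic can tell what a glob expands to, which is the gap an explicit boolean input closes.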

Additionally, when benchmark results need to match a specific CPU (e.g., AMD EPYC 7763 for consistent comparisons against upstream baselines), there is no way to detect a CPU mismatch until the full benchmark completes (~20-30 min). The early CPU check allows the job to fail within seconds of starting if the runner does not match, saving significant time and compute.

Does this PR introduce any user-facing change?

No. This only affects the GHA benchmark workflow. Existing behavior is preserved: generate-tpcds defaults to true (matching the default class: '*'), and expected-cpu defaults to empty (no check).

How was this patch tested?

The workflow changes are self-contained in .github/workflows/benchmark.yml. Tested by inspection. Both new parameters are optional with backward-compatible defaults.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: OpenCode (Claude claude-opus-4.6)

iemejia · 2026-06-12T18:58:49Z

@LuciferYang Would you mind taking a look at this one when you get a chance? While working on the Parquet encoding benchmarks, I noticed the workflow was spending ~5-10 min generating TPC-DS data on every run even when the benchmark does not use it (because contains(inputs.class, '*') matches any wildcard pattern, not just the literal *).

I also kept having to wait for full 20-30 min runs to complete only to discover the runner landed on the wrong CPU. For that, I added an optional expected-cpu input parameter that detects the runner CPU immediately after checkout and fails the job within seconds if it does not match -- so you do not waste the entire compilation + benchmark time before finding out.

These two small fixes should save a lot of time for anyone using the benchmark workflow with specific class patterns and CPU-sensitive comparisons. Happy to adjust anything if needed.

shrirangmhalgi · 2026-06-15T06:31:33Z

+        CPU_MODEL=$(grep "model name" /proc/cpuinfo | head -1 | sed 's/model name\s*:\s*//')
+        echo "Runner CPU: $CPU_MODEL"
+        echo "::notice::Runner CPU: $CPU_MODEL"
+        if [ -n "${{ inputs.expected-cpu }}" ]; then


Nit: ${{ inputs.expected-cpu }} is template-expanded before the shell runs, so special characters in the input would be interpreted as shell syntax. For workflow_dispatch this is low-risk (only committers can trigger), but the standard hygiene fix is to route through env: so it becomes a properly-quoted shell variable:

      env:
        EXPECTED_CPU: ${{ inputs.expected-cpu }}
      run: |
        ...
        if echo "$CPU_MODEL" | grep -qF "$EXPECTED_CPU"; then

Good catch, thanks! Updated to route through env: so the input is a properly-quoted shell variable. Fixed in the next push.
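To make the difference concrete, the following shell sketch simulates both approaches. It is only an approximation: `run_templated` and `run_via_env` are hypothetical names, and pasting the input into the script text with `bash -c` stands in for what `${{ }}` expansion does before the shell starts.

```shell
#!/usr/bin/env bash
# Hypothetical simulation; GitHub Actions itself is not involved here.

# Template-style: the input is pasted into the script source before it runs,
# so any shell syntax inside the input is parsed and executed.
run_templated() {
  local input="$1"
  bash -c "if [ -n \"$input\" ]; then echo checked; fi" 2>&1
}

# Env-style: the input reaches the script only as the value of a variable,
# so it is treated as data and never parsed as shell syntax.
run_via_env() {
  local input="$1"
  EXPECTED_CPU="$input" bash -c 'if [ -n "$EXPECTED_CPU" ]; then echo checked; fi' 2>&1
}

# An input that carries a command substitution.
payload='$(echo INJECTED >&2)'
run_templated "$payload"   # the embedded command runs
run_via_env "$payload"     # the input is only tested for non-emptiness
```

With the templated form the embedded command executes; with the environment variable the same string is just a non-empty value.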

shrirangmhalgi · 2026-06-15T06:34:08Z

This is a clean workflow improvement. The contains -> == fix for the wildcard case and the early CPU check are both well thought out.

Left one minor hygiene nit on the CPU check step.

iemejia · 2026-06-15T07:18:55Z

Thanks for the review @shrirangmhalgi! Addressed the nit -- now routing expected-cpu through an env variable for proper shell quoting.

…k CPU compatibility early in benchmark workflow

Three improvements to the benchmark workflow:

1. Skip TPC-DS data generation for non-TPCDS benchmarks. Replace the verbose list of contains() checks with a single contains(inputs.class, 'TPCDS') that catches both exact class names and globs like '*TPCDS*'. Use exact equality for '*' to avoid matching wildcard patterns like '*VectorizedDeltaReaderBenchmark' that do not need TPC-DS data (~5-10 min saved per run).
2. Add early CPU model check step that runs immediately after checkout, before compilation. Prints the CPU as a ::notice:: annotation for live visibility, and optionally fails fast if the runner CPU does not match the expected-cpu input parameter. The input is routed through an environment variable for shell safety.
3. Simplify the TPC-DS cache condition to match the generation condition, keeping both in sync.

Assisted-by: OpenCode:claude-opus-4.6

iemejia · 2026-06-18T05:21:30Z

@uros-b Sorry for the noise -- I had to rebase and force-push due to a mistake on my side. The changes include the fix you suggested. Could you please re-review when you get a chance? Thank you!

Replace the heuristic condition `contains(inputs.class, 'TPCDS') || inputs.class == '*'` with an explicit `generate-tpcds` boolean input (default: true). This eliminates edge cases with glob patterns like `org.apache.spark.sql.execution.benchmark.*` and is future-proof against new class names.

Assisted-by: OpenCode:claude-opus-4.6

iemejia · 2026-06-20T16:45:04Z

It should be good now. PTAL @uros-b

LuciferYang · 2026-06-21T11:51:12Z

        java-version: ${{ inputs.jdk }}
    - name: Cache TPC-DS generated data
-      if: contains(inputs.class, 'TPCDSQueryBenchmark') || contains(inputs.class, 'LZ4TPCDSDataBenchmark') || contains(inputs.class, 'ZStandardTPCDSDataBenchmark') || contains(inputs.class, '*')
+      if: inputs.generate-tpcds


also cc @pan3793 @dongjoon-hyun

I'm fine with this change, given that the auto-detection rule is tricky.

iemejia · 2026-06-23T08:15:41Z

@LuciferYang It seems we are all good to merge this one. WDYT?

LuciferYang · 2026-06-23T14:50:14Z

Do you have any other suggestions? @uros-b

uros-b

Thank you @iemejia for following through with this approach, I think it's more clear and maintainable now! Also thank you @shrirangmhalgi @LuciferYang for reviews.

iemejia · 2026-06-24T15:46:14Z

This has three approvals. Could someone please merge it? Thanks!

…benchmark workflow

### What changes were proposed in this pull request?

Two improvements to the benchmark workflow:

1. **Add explicit `generate-tpcds` boolean input** (default: `true`) to control TPC-DS data generation. Users running non-TPC-DS benchmarks can set it to `false` to skip the expensive generation step (~5-10 min saved per run). This replaces the previous `contains(inputs.class, '*')` heuristic which incorrectly triggered TPC-DS generation for wildcard patterns like `*VectorizedDeltaReaderBenchmark`.
2. **Add early CPU model check step** that runs immediately after checkout, before compilation. Prints the CPU as a `::notice::` annotation for live visibility in the Actions UI, and optionally fails fast if the runner CPU does not match the `expected-cpu` input parameter.

### Why are the changes needed?

The benchmark workflow previously used heuristic pattern matching (`contains(inputs.class, '*')`, later `contains(inputs.class, 'TPCDS')`) to decide whether to generate TPC-DS data. This approach had edge cases -- wildcard patterns that don't need TPC-DS data would trigger generation, and generic package-level globs like `org.apache.spark.sql.execution.benchmark.*` that do include TPC-DS benchmarks would miss it. An explicit boolean input eliminates all ambiguity and is future-proof against new class names.

Additionally, when benchmark results need to match a specific CPU (e.g., AMD EPYC 7763 for consistent comparisons against upstream baselines), there is no way to detect a CPU mismatch until the full benchmark completes (~20-30 min). The early CPU check allows the job to fail within seconds of starting if the runner does not match, saving significant time and compute.

### Does this PR introduce _any_ user-facing change?

No. This only affects the GHA benchmark workflow. Existing behavior is preserved: `generate-tpcds` defaults to `true` (matching the default `class: '*'`), and `expected-cpu` defaults to empty (no check).

### How was this patch tested?

The workflow changes are self-contained in `.github/workflows/benchmark.yml`. Tested by inspection. Both new parameters are optional with backward-compatible defaults.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: OpenCode (Claude claude-opus-4.6)

Closes #56479 from iemejia/SPARK-57420-benchmark-workflow-improvements.

Authored-by: Ismaël Mejía <iemejia@gmail.com>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
(cherry picked from commit 9dab83b)
Signed-off-by: yangjie01 <yangjie01@baidu.com>

LuciferYang · 2026-06-24T15:50:23Z

Merged into master/branch-4.x. Thanks @iemejia @uros-b @pan3793 @shrirangmhalgi

shrirangmhalgi reviewed Jun 15, 2026

shrirangmhalgi approved these changes Jun 16, 2026

This was referenced Jun 16, 2026

[SPARK-57415][SQL] Parquet vectorized reader performance improvements (umbrella) #56011

Open

[SPARK-56892][SQL] Bulk read optimization for Parquet DELTA_BINARY_PACKED decoding #55919

Closed

uros-b reviewed Jun 17, 2026

iemejia force-pushed the SPARK-57420-benchmark-workflow-improvements branch from 84f28d6 to 2fb16a7 Compare June 18, 2026 05:14

uros-b reviewed Jun 19, 2026

iemejia changed the title ~~[SPARK-57420][INFRA] Only generate TPC-DS data when required and check CPU compatibility early in benchmark workflow~~ [SPARK-57420][INFRA] Add generate-tpcds input and early CPU check to benchmark workflow Jun 20, 2026

LuciferYang reviewed Jun 21, 2026

LuciferYang approved these changes Jun 23, 2026

uros-b approved these changes Jun 23, 2026

LuciferYang closed this in 9dab83b Jun 24, 2026

iemejia deleted the SPARK-57420-benchmark-workflow-improvements branch June 24, 2026 15:54
