Skip to content

[SPARK-57420][INFRA] Add generate-tpcds input and early CPU check to benchmark workflow#56479

Closed
iemejia wants to merge 2 commits into
apache:masterfrom
iemejia:SPARK-57420-benchmark-workflow-improvements
Closed

[SPARK-57420][INFRA] Add generate-tpcds input and early CPU check to benchmark workflow#56479
iemejia wants to merge 2 commits into
apache:masterfrom
iemejia:SPARK-57420-benchmark-workflow-improvements

Conversation

@iemejia

@iemejia iemejia commented Jun 12, 2026

Copy link
Copy Markdown
Member

What changes were proposed in this pull request?

Two improvements to the benchmark workflow:

  1. Add explicit generate-tpcds boolean input (default: true) to control TPC-DS data generation. Users running non-TPC-DS benchmarks can set it to false to skip the expensive generation step (~5-10 min saved per run). This replaces the previous contains(inputs.class, '*') heuristic which incorrectly triggered TPC-DS generation for wildcard patterns like *VectorizedDeltaReaderBenchmark.

  2. Add early CPU model check step that runs immediately after checkout, before compilation. Prints the CPU as a ::notice:: annotation for live visibility in the Actions UI, and optionally fails fast if the runner CPU does not match the expected-cpu input parameter.

Why are the changes needed?

The benchmark workflow previously used heuristic pattern matching (contains(inputs.class, '*'), later contains(inputs.class, 'TPCDS')) to decide whether to generate TPC-DS data. This approach had edge cases -- wildcard patterns that don't need TPC-DS data would trigger generation, and generic package-level globs like org.apache.spark.sql.execution.benchmark.* that do include TPC-DS benchmarks would miss it. An explicit boolean input eliminates all ambiguity and is future-proof against new class names.

Additionally, when benchmark results need to match a specific CPU (e.g., AMD EPYC 7763 for consistent comparisons against upstream baselines), there is no way to detect a CPU mismatch until the full benchmark completes (~20-30 min). The early CPU check allows the job to fail within seconds of starting if the runner does not match, saving significant time and compute.

Does this PR introduce any user-facing change?

No. This only affects the GHA benchmark workflow. Existing behavior is preserved: generate-tpcds defaults to true (matching the default class: '*'), and expected-cpu defaults to empty (no check).

How was this patch tested?

The workflow changes are self-contained in .github/workflows/benchmark.yml. Tested by inspection. Both new parameters are optional with backward-compatible defaults.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: OpenCode (Claude claude-opus-4.6)

@iemejia

iemejia commented Jun 12, 2026

Copy link
Copy Markdown
Member Author

@LuciferYang Would you mind taking a look at this one when you get a chance? While working on the Parquet encoding benchmarks, I noticed the workflow was spending ~5-10 min generating TPC-DS data on every run even when the benchmark does not use it (because contains(inputs.class, '*') matches any wildcard pattern, not just the literal *).

I also kept having to wait for full 20-30 min runs to complete only to discover the runner landed on the wrong CPU. For that, I added an optional expected-cpu input parameter that detects the runner CPU immediately after checkout and fails the job within seconds if it does not match -- so you do not waste the entire compilation + benchmark time before finding out.

These two small fixes should save a lot of time for anyone using the benchmark workflow with specific class patterns and CPU-sensitive comparisons. Happy to adjust anything if needed.

Comment thread .github/workflows/benchmark.yml Outdated
CPU_MODEL=$(grep "model name" /proc/cpuinfo | head -1 | sed 's/model name\s*:\s*//')
echo "Runner CPU: $CPU_MODEL"
echo "::notice::Runner CPU: $CPU_MODEL"
if [ -n "${{ inputs.expected-cpu }}" ]; then

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: ${{ inputs.expected-cpu }} is template-expanded before the shell runs, so special characters in the input would be interpreted as shell syntax. For workflow_dispatch this is low-risk (only committers can trigger), but the standard hygiene fix is to route through env: so it becomes a properly-quoted shell variable:

env:
  EXPECTED_CPU: ${{ inputs.expected-cpu }}
run: |
  ...
  if echo "$CPU_MODEL" | grep -qF "$EXPECTED_CPU"; then

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, thanks! Updated to route through env: so the input is a properly-quoted shell variable. Fixed in the next push.

@shrirangmhalgi

Copy link
Copy Markdown
Contributor

This is clean workflow improvement. The contains -> == fix for the wildcard case and the early CPU check are both well thought out.

Left one minor hygiene nit on the CPU check step.

@iemejia

iemejia commented Jun 15, 2026

Copy link
Copy Markdown
Member Author

Thanks for the review @shrirangmhalgi! Addressed the nit -- now routing expected-cpu through an env variable for proper shell quoting.

Comment thread .github/workflows/benchmark.yml Outdated
…k CPU compatibility early in benchmark workflow

Three improvements to the benchmark workflow:

1. Skip TPC-DS data generation for non-TPCDS benchmarks. Replace the
   verbose list of contains() checks with a single
   contains(inputs.class, 'TPCDS') that catches both exact class
   names and globs like '*TPCDS*'. Use exact equality for '*' to
   avoid matching wildcard patterns like
   '*VectorizedDeltaReaderBenchmark' that do not need TPC-DS data
   (~5-10 min saved per run).

2. Add early CPU model check step that runs immediately after checkout,
   before compilation. Prints the CPU as a ::notice:: annotation for
   live visibility, and optionally fails fast if the runner CPU does
   not match the expected-cpu input parameter. The input is routed
   through an environment variable for shell safety.

3. Simplify the TPC-DS cache condition to match the generation
   condition, keeping both in sync.

Assisted-by: OpenCode:claude-opus-4.6
@iemejia iemejia force-pushed the SPARK-57420-benchmark-workflow-improvements branch from 84f28d6 to 2fb16a7 Compare June 18, 2026 05:14
@iemejia

iemejia commented Jun 18, 2026

Copy link
Copy Markdown
Member Author

@uros-b Sorry for the noise -- I had to rebase and force-push due to a mistake on my side. The changes include the fix you suggested. Could you please re-review when you get a chance? Thank you!

Comment thread .github/workflows/benchmark.yml Outdated
Replace the heuristic condition
`contains(inputs.class, 'TPCDS') || inputs.class == '*'`
with an explicit `generate-tpcds` boolean input (default: true).
This eliminates edge cases with glob patterns like
`org.apache.spark.sql.execution.benchmark.*` and is future-proof
against new class names.

Assisted-by: OpenCode:claude-opus-4.6
@iemejia iemejia changed the title [SPARK-57420][INFRA] Only generate TPC-DS data when required and check CPU compatibility early in benchmark workflow [SPARK-57420][INFRA] Add generate-tpcds input and early CPU check to benchmark workflow Jun 20, 2026
@iemejia

iemejia commented Jun 20, 2026

Copy link
Copy Markdown
Member Author

It should be good now. PTAL @uros-b

java-version: ${{ inputs.jdk }}
- name: Cache TPC-DS generated data
if: contains(inputs.class, 'TPCDSQueryBenchmark') || contains(inputs.class, 'LZ4TPCDSDataBenchmark') || contains(inputs.class, 'ZStandardTPCDSDataBenchmark') || contains(inputs.class, '*')
if: inputs.generate-tpcds

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with this change, given that the auto-detection rule is tricky.

@iemejia

iemejia commented Jun 23, 2026

Copy link
Copy Markdown
Member Author

@LuciferYang it seems we are all good to merge this one. WDYT ?

@LuciferYang

Copy link
Copy Markdown
Contributor

Do you have any other suggestions? @uros-b

@uros-b uros-b left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @iemejia for following through with this approach, I think it's more clear and maintainable now! Also thank you @shrirangmhalgi @LuciferYang for reviews.

@iemejia

iemejia commented Jun 24, 2026

Copy link
Copy Markdown
Member Author

This has three approvals, can someone please merge it. Thanks!

LuciferYang pushed a commit that referenced this pull request Jun 24, 2026
…benchmark workflow

### What changes were proposed in this pull request?

Two improvements to the benchmark workflow:

1. **Add explicit `generate-tpcds` boolean input** (default: `true`) to control TPC-DS data generation. Users running non-TPC-DS benchmarks can set it to `false` to skip the expensive generation step (~5-10 min saved per run). This replaces the previous `contains(inputs.class, '*')` heuristic which incorrectly triggered TPC-DS generation for wildcard patterns like `*VectorizedDeltaReaderBenchmark`.

2. **Add early CPU model check step** that runs immediately after checkout, before compilation. Prints the CPU as a `::notice::` annotation for live visibility in the Actions UI, and optionally fails fast if the runner CPU does not match the `expected-cpu` input parameter.

### Why are the changes needed?

The benchmark workflow previously used heuristic pattern matching (`contains(inputs.class, '*')`, later `contains(inputs.class, 'TPCDS')`) to decide whether to generate TPC-DS data. This approach had edge cases -- wildcard patterns that don't need TPC-DS data would trigger generation, and generic package-level globs like `org.apache.spark.sql.execution.benchmark.*` that do include TPC-DS benchmarks would miss it. An explicit boolean input eliminates all ambiguity and is future-proof against new class names.

Additionally, when benchmark results need to match a specific CPU (e.g., AMD EPYC 7763 for consistent comparisons against upstream baselines), there is no way to detect a CPU mismatch until the full benchmark completes (~20-30 min). The early CPU check allows the job to fail within seconds of starting if the runner does not match, saving significant time and compute.

### Does this PR introduce _any_ user-facing change?

No. This only affects the GHA benchmark workflow. Existing behavior is preserved: `generate-tpcds` defaults to `true` (matching the default `class: '*'`), and `expected-cpu` defaults to empty (no check).

### How was this patch tested?

The workflow changes are self-contained in `.github/workflows/benchmark.yml`. Tested by inspection. Both new parameters are optional with backward-compatible defaults.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: OpenCode (Claude claude-opus-4.6)

Closes #56479 from iemejia/SPARK-57420-benchmark-workflow-improvements.

Authored-by: Ismaël Mejía <iemejia@gmail.com>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
(cherry picked from commit 9dab83b)
Signed-off-by: yangjie01 <yangjie01@baidu.com>
@LuciferYang

Copy link
Copy Markdown
Contributor

Merged into master/branch-4.x. Thanks @iemejia @uros-b @pan3793 @shrirangmhalgi

@iemejia iemejia deleted the SPARK-57420-benchmark-workflow-improvements branch June 24, 2026 15:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants