Add optional energy efficiency reporting schema for inference benchmarks #2587

Open
hongping-zh wants to merge 1 commit into mlcommons:master from hongping-zh:energy-reporting-schema-rfc

@hongping-zh

Summary

This PR proposes an optional energy-efficiency reporting schema for MLPerf Inference results.

It adds a standalone schema package under energy-reporting/ and does not modify existing benchmark logic, submission flow, or current compliance requirements.


Motivation

During the multi-round technical discussion in Issue #2558, the design converged on several directions:

  1. Task-appropriate normalization:
    • energy_per_token_joules for LLM workloads
    • energy_per_query_joules for CV workloads
  2. Prefill vs. generation energy separation for LLM
  3. Static vs. active energy separation
  4. Architecture-agnostic metric design
  5. Multiple measurement backends (nvml, dcgm, rocm_smi, rapl, external_analyzer)

This PR translates those discussion outcomes into a concrete, reviewable schema artifact.
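
As a rough illustration of items 1 to 3 above, the sketch below normalizes energy per token or per query and splits total energy into static and active parts. The `EnergyRun` container, the idle-baseline definition of static energy, and all numbers are assumptions made for illustration; this thread does not show how the schema itself defines or computes these values.

```python
from dataclasses import dataclass


@dataclass
class EnergyRun:
    """Hypothetical container for one benchmark run (not the PR's schema)."""
    total_energy_joules: float  # energy integrated over the measurement window
    duration_seconds: float     # length of the measurement window
    idle_power_watts: float     # assumed idle baseline, measured separately


def static_energy_joules(run: EnergyRun) -> float:
    # Assumption: static energy is the idle baseline held for the whole window.
    return run.idle_power_watts * run.duration_seconds


def active_energy_joules(run: EnergyRun) -> float:
    # Active energy is whatever remains above the idle baseline.
    return run.total_energy_joules - static_energy_joules(run)


def energy_per_unit_joules(energy_joules: float, units: int) -> float:
    # Normalize by tokens (LLM workloads) or queries (CV workloads).
    if units <= 0:
        raise ValueError("units must be positive")
    return energy_joules / units


# Made-up example values.
run = EnergyRun(total_energy_joules=12_000.0, duration_seconds=60.0, idle_power_watts=50.0)
energy_per_token_joules = energy_per_unit_joules(run.total_energy_joules, 40_000)
energy_per_query_joules = energy_per_unit_joules(run.total_energy_joules, 2_400)
```

Keeping the normalization in one helper means the same arithmetic applies whether the unit is a token or a query; only the count passed in changes.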


Scope of this PR (intentionally minimal)

This PR includes only:

  • energy-reporting/mlperf_energy_schema_v6.json
    (JSON Schema, draft 2020-12)
  • energy-reporting/README.md
    (field definitions, examples, validation-rule summary)

This PR does not include:

  • toolkit/runtime measurement code
  • reference result uploads
  • modifications to existing submission checker behavior
  • changes to current required fields

Compatibility / Impact

  • Backward compatible: Yes
  • Breaking change: No
  • Existing submitters affected: No (all fields are optional)
  • Compliance behavior changed: No (RFC proposal stage)

Validation

Schema and examples were validated locally:

  • JSON syntax/schema validity: ✅
  • LLM single-accelerator valid sample: ✅
  • CV multi-accelerator valid sample: ✅
  • LLM sample missing conditional required fields: ✅ correctly rejected

(Validation logs can be provided if reviewers request them.)
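
The rejected case in the last bullet depends on conditional requirements by task type. A minimal, standard-library-only sketch of that idea follows. The `task_type` discriminator, its values, and the mapping of required fields are assumptions, since the schema file is not reproduced in this thread; a JSON Schema would normally express the same rule with its `if`/`then` keywords rather than hand-written Python.

```python
from typing import Dict, List

# Hypothetical mapping from task type to conditionally required fields.
CONDITIONAL_REQUIRED: Dict[str, List[str]] = {
    "llm": ["energy_per_token_joules"],
    "cv": ["energy_per_query_joules"],
}


def missing_required_fields(record: dict) -> List[str]:
    """Return the conditionally required fields that are absent from record."""
    task_type = record.get("task_type")  # assumed discriminator field
    required = CONDITIONAL_REQUIRED.get(task_type, [])
    return [name for name in required if name not in record]


# Made-up sample records mirroring the three validation cases.
valid_llm = {"task_type": "llm", "energy_per_token_joules": 0.3}
invalid_llm = {"task_type": "llm"}  # missing the per-token field
valid_cv = {"task_type": "cv", "energy_per_query_joules": 5.0}
```

A record is accepted when `missing_required_fields` returns an empty list and rejected otherwise.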


Request for Comments (RFC)

This PR is submitted as an RFC to collect Working Group feedback on field design and integration direction before any broader implementation steps.

Feedback is especially welcome on:

  • field naming and granularity
  • conditional requirements by task type
  • whether variability fields (e.g., std) should become mandatory in future revisions

cc @JiwaniZakir @arav-agarwal2


@hongping-zh hongping-zh requested a review from a team as a code owner May 14, 2026 08:36
@github-actions
Contributor

MLCommons CLA bot:
Thank you very much for your submission; we really appreciate it. Before we can accept your contribution, we ask that you sign the MLCommons CLA (Apache 2). Please submit your GitHub ID to our onboarding form to initiate authorization. If you are from an MLCommons member organization, we will request that you be added to the CLA. If you are not from a member organization, we will email you a CLA to sign. For any questions, please contact support@mlcommons.org.
0 out of 1 committers have signed the MLCommons CLA.
@hongping
hongping seems not to be a GitHub user. You need a GitHub account after you become an MLCommons member. If you already have a GitHub account, please add the email address used for this commit to your account.
You can retrigger this bot by commenting recheck in this Pull Request.

@hongping-zh
Author

recheck

@hongping-zh
Author

recheck

@hongping-zh
Author

Quick update: I have contacted support@mlcommons.org to resolve the CLA mapping for the GitHub account "hongping-zh". I am waiting for a support-side refresh and will run recheck as soon as it is done.

@hongping-zh
Author

recheck

@dslik

dslik commented May 26, 2026

Has there been a discussion in the inference WG on whether measuring only accelerator power consumption is a useful (and non-misleading) reporting metric? High-performance inference requires coordination, processing, and data movement tasks to be performed on the CPUs, and system DRAM and network usage also consume significant power. I can see how this data would be valuable to augment entire-system power measurements, but I have concerns about it being presented on its own.

Also, it is important to ensure that measurements capture cumulative power draw over the run rather than instantaneous power draw, since the latter can easily produce misleading results. Careful rules (and verified implementations) are needed to prevent power measurements from being easily gamed.

@hongping-zh
Author

Thank you, David — this is an important concern, and I agree.

My intent is not for accelerator-only measurements to replace whole-system power or energy measurements. For high-performance inference, CPU coordination, host DRAM, networking, storage, and data movement can all be significant, and accelerator-only numbers would be misleading if presented as total system energy efficiency.

A better framing for this PR is therefore as an optional accelerator-level energy breakdown / supplementary reporting schema. The intended use is to augment whole-system measurements where available, and to provide attribution/debugging information about the accelerator-side behavior of a run, rather than to define a standalone system-level efficiency metric.

I also agree on cumulative energy. The schema should define fields such as total_energy_joules, active_energy_joules, and energy_per_token_joules as integrated energy over the benchmark measurement window, not as instantaneous power snapshots. Instantaneous power samples should only be intermediate samples used for integration, with the measurement window, sampling rate, and integration method documented.
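
To make the distinction concrete, here is a minimal sketch of integrating power samples into cumulative energy with the trapezoidal rule. The function name and the sample values are made up for illustration, and the trapezoidal rule is only one possible integration method; the schema wording has not yet fixed one.

```python
from typing import Sequence


def integrate_energy_joules(timestamps_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Trapezoidal integration of power samples (W) over time (s), giving energy (J)."""
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have equal length")
    if len(timestamps_s) < 2:
        raise ValueError("need at least two samples to integrate")
    energy = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        if dt <= 0:
            raise ValueError("timestamps must be strictly increasing")
        # Area of the trapezoid between two adjacent samples.
        energy += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return energy


# Made-up samples: low power at the window edges, higher power in between.
timestamps = [0.0, 1.0, 2.0, 3.0, 4.0]
power = [100.0, 300.0, 300.0, 300.0, 100.0]
total_energy_joules = integrate_energy_joules(timestamps, power)

# A single snapshot taken at the window edge, multiplied by the duration.
snapshot_estimate_joules = power[0] * (timestamps[-1] - timestamps[0])
```

With these samples, the snapshot-based estimate comes out well below the integrated value, which is the kind of discrepancy that cumulative reporting is meant to rule out.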

I can update the README/schema wording to make this explicit, for example:

  • accelerator-only measurements must not be interpreted as whole-system energy efficiency;
  • whole-system power/energy should be reported where available;
  • reported energy fields are cumulative/integrated over the benchmark window;
  • validation rules should check measurement-window consistency and discourage easily gameable reporting.
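
The measurement-window consistency check in the last bullet could look roughly like the sketch below. The `window_start_s` and `window_end_s` field names are hypothetical, and the specific rules are assumptions about what such validation might cover; only `total_energy_joules` and `active_energy_joules` are named in this thread.

```python
from typing import List


def window_consistency_errors(record: dict) -> List[str]:
    """Return human-readable problems found in one reported record."""
    errors = []
    start = record["window_start_s"]  # hypothetical field name
    end = record["window_end_s"]      # hypothetical field name
    total = record["total_energy_joules"]
    active = record["active_energy_joules"]
    if end <= start:
        errors.append("measurement window must have positive duration")
    if total < 0 or active < 0:
        errors.append("energy values must be non-negative")
    if active > total:
        errors.append("active energy cannot exceed total energy")
    return errors


# Made-up records: one consistent, one with two problems.
consistent = {
    "window_start_s": 0.0,
    "window_end_s": 60.0,
    "total_energy_joules": 12_000.0,
    "active_energy_joules": 9_000.0,
}
inconsistent = {
    "window_start_s": 60.0,
    "window_end_s": 60.0,
    "total_energy_joules": 5_000.0,
    "active_energy_joules": 9_000.0,
}
```

Returning a list of messages rather than failing on the first problem lets a checker report every issue in a submission at once.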

Would this framing address your concern, or would you prefer that the fields be renamed more explicitly as accelerator-level fields to avoid ambiguity?
