eval skill: auto-detect predefined per-cluster execution configs by cjluo-nv · Pull Request #1599 · NVIDIA/Model-Optimizer

cjluo-nv · 2026-06-02T06:34:45Z

What does this PR do?

Type of change: documentation

Some NEL installs ship ready-made per-cluster execution configs as an internal/slurm/<cluster> group (via the optional nemo_evaluator_launcher_internal package). When present, the matching one pre-fills the cluster's hostname / partition / gres (and node-exclusivity), so the user only sets account / output_dir / walltime instead of hand-entering the hostname.

SKILL.md Step 4 — adds a "check FIRST" step: run a discovery snippet that lists the available cluster → hostname pairs from the installed package at runtime; on a hostname match, use defaults: - execution: internal/slurm/<cluster> (replacing slurm/default) and drop the now-redundant execution.hostname. If the package isn't installed or nothing matches, fall back to slurm/default and fill the fields manually.
example_eval.yaml — a short, name-free comment on the defaults block pointing to that check.

Discovery-based by design (no internal data committed): cluster names, hostnames, and accounts are read from the install at runtime and are not hardcoded here. The only internal references in the repo are the package name nemo_evaluator_launcher_internal and the generic internal/slurm/<cluster> group pattern. External users without the package degrade gracefully (import fails → slurm/default).

Usage

N/A — documentation / skill guidance only.

Testing

Ran the discovery snippet verbatim (lists cluster → hostname from the installed package) and confirmed internal/slurm/<cluster> resolves via a --dry-run. pre-commit run passes (markdownlint + YAML format). Grep-verified no cluster names/hostnames are hardcoded in the changed files.

Before your PR is "Ready for review"

Is this change backward compatible?: ✅ (falls back to slurm/default)
If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
Did you write any new necessary tests?: N/A (documentation)
Did you update Changelog?: N/A (skill docs)
Did you get Claude approval on this PR?: ❌ (pending)

🤖 Generated with Claude Code

Summary by CodeRabbit

Documentation
- Added Step 4 guidance for discovering and using optional internal per-cluster SLURM execution configs: how to list available internal configs, select the matching cluster by switching the default execution, remove redundant hostname entries, and verify with a --dry-run.
- Clarified that slurm/default is cluster-agnostic, recommend using internal per-cluster configs when present, and explained fallback behavior to manual hostname/account/output_dir entry.

Some NEL installs ship ready-made per-cluster execution configs as an internal/slurm/<cluster> group (optional nemo_evaluator_launcher_internal package). When present, the matching one pre-fills hostname/partition/gres (and node-exclusivity), so the user only sets account/output_dir/walltime. SKILL.md Step 4 now runs a discovery check FIRST: list the available cluster->hostname pairs from the installed package at runtime and, on a match, use internal/slurm/<cluster> instead of slurm/default; otherwise fall back to slurm/default and fill hostname/account/output_dir manually. Discovery-based by design: cluster names / hostnames / accounts are read from the install at runtime and never hardcoded here. example_eval.yaml gets a short name-free pointer to the check. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>

coderabbitai · 2026-06-02T06:34:56Z

📝 Walkthrough

Walkthrough

Adds documentation and comments that explain how to discover and use optional internal per-cluster SLURM execution configs (internal/slurm/) instead of the generic slurm/default, and instructs verification via --dry-run or fallback to manual slurm/default configuration.

Changes

SLURM Configuration Guidance

Layer / File(s)	Summary
Internal SLURM cluster config guidance `.claude/skills/evaluation/SKILL.md`, `.claude/skills/evaluation/recipes/examples/example_eval.yaml`	Step 4 now documents a command to list internal `internal/slurm/<cluster>.yaml` files and their `hostname` values, instructs switching `defaults` from `slurm/default` to `internal/slurm/<cluster>` (dropping redundant `execution.hostname`) when a hostname matches and to confirm via `--dry-run`. The example YAML adds comments noting `slurm/default` is cluster-agnostic and recommends using internal per-cluster configs for pre-filled scheduling parameters.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Suggested reviewers

kaix-nv
chadvoegele

🚥 Pre-merge checks | ✅ 6

✅ Passed checks (6 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title directly and clearly describes the main change: adding documentation and guidance for auto-detecting per-cluster execution configs from an internal package.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns	✅ Passed	PR contains only documentation and example YAML changes with no Python code, dependencies, or security anti-patterns. Bash snippet uses safe imports and string parsing without eval/exec.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch skill/eval-internal-cluster-exec

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Compress the Step 4 discovery guidance and the example_eval.yaml comment added in the prior commit; no behavior change (discovery snippet unchanged). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>

coderabbitai

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.claude/skills/evaluation/SKILL.md:
- Line 205: Update the fallback instructions so the explicit fields users must
fill when no internal SLURM profile matches include partition, gres, and
walltime in addition to hostname/account/output_dir; specifically modify the
sentence about replacing `slurm/default` and removing `execution.hostname` (the
`defaults: - execution: internal/slurm/<cluster>` guidance) to state that if no
match is found you should keep `slurm/default` and manually set `hostname`,
`account`, `output_dir`, `partition`, `gres`, and `walltime` so it matches the
earlier claim that internal profiles auto-fill partition/gres/walltime.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: aa91707f-d38e-4d05-8a24-1445c9bc1e1e

📥 Commits

Reviewing files that changed from the base of the PR and between db9ea8f and 33b532c.

📒 Files selected for processing (2)

.claude/skills/evaluation/SKILL.md
.claude/skills/evaluation/recipes/examples/example_eval.yaml

coderabbitai · 2026-06-02T06:38:25Z

+  echo "$(basename "$f" .yaml) -> $(grep -E '^hostname:' "$f" | awk '{print $2}')"; done
+```
+
+If a listed hostname matches the target SLURM cluster, set `defaults: - execution: internal/slurm/<cluster>` (replacing `slurm/default`), and remove any now-redundant `execution.hostname` (keep `account` / `output_dir` / `walltime` overrides). Confirm it resolves with a `--dry-run`. If the package isn't installed or nothing matches, keep `slurm/default` and fill `hostname` / `account` / `output_dir` manually.


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Align fallback fields with the fields you say are auto-filled

Line 205’s fallback omits partition/gres (and walltime mention), even though Line 196 says the internal profile pre-fills those. Please make the fallback list explicit and consistent so users don’t miss required manual values when no internal profile matches.

Suggested doc edit

-If the package isn't installed or nothing matches, keep `slurm/default` and fill `hostname` / `account` / `output_dir` manually. +If the package isn't installed or nothing matches, keep `slurm/default` and fill `hostname` / `partition` / `gres` / `account` / `output_dir` / `walltime` manually.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

If a listed hostname matches the target SLURM cluster, set `defaults: - execution: internal/slurm/<cluster>` (replacing `slurm/default`), and remove any now-redundant `execution.hostname` (keep `account` / `output_dir` / `walltime` overrides). Confirm it resolves with a `--dry-run`. If the package isn't installed or nothing matches, keep `slurm/default` and fill `hostname` / `account` / `output_dir` manually.

If a listed hostname matches the target SLURM cluster, set `defaults: - execution: internal/slurm/<cluster>` (replacing `slurm/default`), and remove any now-redundant `execution.hostname` (keep `account` / `output_dir` / `walltime` overrides). Confirm it resolves with a `--dry-run`. If the package isn't installed or nothing matches, keep `slurm/default` and fill `hostname` / `partition` / `gres` / `account` / `output_dir` / `walltime` manually.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.claude/skills/evaluation/SKILL.md at line 205, Update the fallback instructions so the explicit fields users must fill when no internal SLURM profile matches include partition, gres, and walltime in addition to hostname/account/output_dir; specifically modify the sentence about replacing `slurm/default` and removing `execution.hostname` (the `defaults: - execution: internal/slurm/<cluster>` guidance) to state that if no match is found you should keep `slurm/default` and manually set `hostname`, `account`, `output_dir`, `partition`, `gres`, and `walltime` so it matches the earlier claim that internal profiles auto-fill partition/gres/walltime.

coderabbitai

♻️ Duplicate comments (1)

.claude/skills/evaluation/SKILL.md (1)

205-205: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix fallback field list to match the pre-fill contract.

Line 205 still omits partition, gres, and walltime, which conflicts with Line 196’s statement about what internal profiles pre-fill. Please make the fallback explicit and consistent.

Proposed doc fix

-Hostname match → set `defaults: - execution: internal/slurm/<cluster>`, drop the redundant `execution.hostname` (keep account/output_dir/walltime), verify with `--dry-run`. Else keep `slurm/default` and fill hostname/account/output_dir manually.
+Hostname match → set `defaults: - execution: internal/slurm/<cluster>`, drop the redundant `execution.hostname` (keep account/output_dir/walltime), verify with `--dry-run`. Else keep `slurm/default` and fill hostname/partition/gres/account/output_dir/walltime manually.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.claude/skills/evaluation/SKILL.md at line 205, Update the fallback field
list in SKILL.md so it matches the pre-fill contract: when hostname matches set
defaults to "execution: internal/slurm/<cluster>" and remove the redundant
"execution.hostname" entry, but explicitly include and pre-fill the fields
account, output_dir, walltime, partition, and gres; note the user should verify
with "--dry-run". Otherwise keep "slurm/default" and instruct filling
hostname/account/output_dir/partition/gres/walltime manually. Ensure the text
around the existing "Hostname match → set `defaults: - execution:
internal/slurm/<cluster>`" line is changed to list the full fallback fields
(account, output_dir, walltime, partition, gres) to be consistent with the
statement at line 196.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In @.claude/skills/evaluation/SKILL.md:
- Line 205: Update the fallback field list in SKILL.md so it matches the
pre-fill contract: when hostname matches set defaults to "execution:
internal/slurm/<cluster>" and remove the redundant "execution.hostname" entry,
but explicitly include and pre-fill the fields account, output_dir, walltime,
partition, and gres; note the user should verify with "--dry-run". Otherwise
keep "slurm/default" and instruct filling
hostname/account/output_dir/partition/gres/walltime manually. Ensure the text
around the existing "Hostname match → set `defaults: - execution:
internal/slurm/<cluster>`" line is changed to list the full fallback fields
(account, output_dir, walltime, partition, gres) to be consistent with the
statement at line 196.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e94bef91-0dc6-4ebb-972c-6ead79589133

📥 Commits

Reviewing files that changed from the base of the PR and between 33b532c and 24ba62a.

📒 Files selected for processing (2)

.claude/skills/evaluation/SKILL.md
.claude/skills/evaluation/recipes/examples/example_eval.yaml

✅ Files skipped from review due to trivial changes (1)

.claude/skills/evaluation/recipes/examples/example_eval.yaml

codecov · 2026-06-02T06:48:01Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.38%. Comparing base (905259f) to head (24ba62a).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1599   +/-   ##
=======================================
  Coverage   77.38%   77.38%           
=======================================
  Files         479      479           
  Lines       52435    52435           
=======================================
  Hits        40578    40578           
  Misses      11857    11857

Flag	Coverage Δ
unit	`53.62% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

meenchen

Bot review — DM the bot to share feedback.

Docs-only change (+13/-0) adding a "check FIRST" discovery snippet for optional nemo_evaluator_launcher_internal per-cluster execution configs, plus a pointer comment in example_eval.yaml. No cluster-specific data is hardcoded — discovery is at runtime and falls back gracefully to slurm/default when the package isn't installed. No correctness, licensing, or test concerns.

coderabbitai Bot reviewed Jun 2, 2026

View reviewed changes

cjluo-nv requested a review from Edwardf0t1 June 2, 2026 06:38

coderabbitai Bot reviewed Jun 2, 2026

View reviewed changes

cjluo-nv requested review from chadvoegele and meenchen June 2, 2026 16:05

meenchen approved these changes Jun 2, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

eval skill: auto-detect predefined per-cluster execution configs#1599

eval skill: auto-detect predefined per-cluster execution configs#1599
cjluo-nv wants to merge 2 commits into
mainfrom
skill/eval-internal-cluster-exec

cjluo-nv commented Jun 2, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 2, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 2, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

codecov Bot commented Jun 2, 2026

Uh oh!

meenchen left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

cjluo-nv commented Jun 2, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Usage

Testing

Before your PR is "Ready for review"

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented Jun 2, 2026

Codecov Report

Uh oh!

meenchen left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

cjluo-nv commented Jun 2, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 2, 2026 •

edited

Loading