Skip to content

eval skill: auto-detect predefined per-cluster execution configs#1599

Open
cjluo-nv wants to merge 2 commits into
mainfrom
skill/eval-internal-cluster-exec
Open

eval skill: auto-detect predefined per-cluster execution configs#1599
cjluo-nv wants to merge 2 commits into
mainfrom
skill/eval-internal-cluster-exec

Conversation

@cjluo-nv
Copy link
Copy Markdown
Collaborator

@cjluo-nv cjluo-nv commented Jun 2, 2026

What does this PR do?

Type of change: documentation

Some NEL installs ship ready-made per-cluster execution configs as an internal/slurm/<cluster> group (via the optional nemo_evaluator_launcher_internal package). When present, the matching one pre-fills the cluster's hostname / partition / gres (and node-exclusivity), so the user only sets account / output_dir / walltime instead of hand-entering the hostname.

  • SKILL.md Step 4 — adds a "check FIRST" step: run a discovery snippet that lists the available cluster → hostname pairs from the installed package at runtime; on a hostname match, use defaults: - execution: internal/slurm/<cluster> (replacing slurm/default) and drop the now-redundant execution.hostname. If the package isn't installed or nothing matches, fall back to slurm/default and fill the fields manually.
  • example_eval.yaml — a short, name-free comment on the defaults block pointing to that check.

Discovery-based by design (no internal data committed): cluster names, hostnames, and accounts are read from the install at runtime and are not hardcoded here. The only internal references in the repo are the package name nemo_evaluator_launcher_internal and the generic internal/slurm/<cluster> group pattern. External users without the package degrade gracefully (import fails → slurm/default).

Usage

N/A — documentation / skill guidance only.

Testing

Ran the discovery snippet verbatim (lists cluster → hostname from the installed package) and confirmed internal/slurm/<cluster> resolves via a --dry-run. pre-commit run passes (markdownlint + YAML format). Grep-verified no cluster names/hostnames are hardcoded in the changed files.

Before your PR is "Ready for review"

  • Is this change backward compatible?: ✅ (falls back to slurm/default)
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
  • Did you write any new necessary tests?: N/A (documentation)
  • Did you update Changelog?: N/A (skill docs)
  • Did you get Claude approval on this PR?: ❌ (pending)

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Documentation
    • Added Step 4 guidance for discovering and using optional internal per-cluster SLURM execution configs: how to list available internal configs, select the matching cluster by switching the default execution, remove redundant hostname entries, and verify with a --dry-run.
    • Clarified that slurm/default is cluster-agnostic, recommend using internal per-cluster configs when present, and explained fallback behavior to manual hostname/account/output_dir entry.

Some NEL installs ship ready-made per-cluster execution configs as an
internal/slurm/<cluster> group (optional nemo_evaluator_launcher_internal
package). When present, the matching one pre-fills hostname/partition/gres
(and node-exclusivity), so the user only sets account/output_dir/walltime.

SKILL.md Step 4 now runs a discovery check FIRST: list the available
cluster->hostname pairs from the installed package at runtime and, on a match,
use internal/slurm/<cluster> instead of slurm/default; otherwise fall back to
slurm/default and fill hostname/account/output_dir manually.

Discovery-based by design: cluster names / hostnames / accounts are read from
the install at runtime and never hardcoded here. example_eval.yaml gets a short
name-free pointer to the check.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 2, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

Adds documentation and comments that explain how to discover and use optional internal per-cluster SLURM execution configs (internal/slurm/) instead of the generic slurm/default, and instructs verification via --dry-run or fallback to manual slurm/default configuration.

Changes

SLURM Configuration Guidance

Layer / File(s) Summary
Internal SLURM cluster config guidance
.claude/skills/evaluation/SKILL.md, .claude/skills/evaluation/recipes/examples/example_eval.yaml
Step 4 now documents a command to list internal internal/slurm/<cluster>.yaml files and their hostname values, instructs switching defaults from slurm/default to internal/slurm/<cluster> (dropping redundant execution.hostname) when a hostname matches and to confirm via --dry-run. The example YAML adds comments noting slurm/default is cluster-agnostic and recommends using internal per-cluster configs for pre-filled scheduling parameters.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Suggested reviewers

  • kaix-nv
  • chadvoegele
🚥 Pre-merge checks | ✅ 6
✅ Passed checks (6 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title directly and clearly describes the main change: adding documentation and guidance for auto-detecting per-cluster execution configs from an internal package.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns ✅ Passed PR contains only documentation and example YAML changes with no Python code, dependencies, or security anti-patterns. Bash snippet uses safe imports and string parsing without eval/exec.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch skill/eval-internal-cluster-exec

Comment @coderabbitai help to get the list of available commands and usage tips.

Compress the Step 4 discovery guidance and the example_eval.yaml comment added
in the prior commit; no behavior change (discovery snippet unchanged).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.claude/skills/evaluation/SKILL.md:
- Line 205: Update the fallback instructions so the explicit fields users must
fill when no internal SLURM profile matches include partition, gres, and
walltime in addition to hostname/account/output_dir; specifically modify the
sentence about replacing `slurm/default` and removing `execution.hostname` (the
`defaults: - execution: internal/slurm/<cluster>` guidance) to state that if no
match is found you should keep `slurm/default` and manually set `hostname`,
`account`, `output_dir`, `partition`, `gres`, and `walltime` so it matches the
earlier claim that internal profiles auto-fill partition/gres/walltime.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: aa91707f-d38e-4d05-8a24-1445c9bc1e1e

📥 Commits

Reviewing files that changed from the base of the PR and between db9ea8f and 33b532c.

📒 Files selected for processing (2)
  • .claude/skills/evaluation/SKILL.md
  • .claude/skills/evaluation/recipes/examples/example_eval.yaml

Comment thread .claude/skills/evaluation/SKILL.md Outdated
echo "$(basename "$f" .yaml) -> $(grep -E '^hostname:' "$f" | awk '{print $2}')"; done
```

If a listed hostname matches the target SLURM cluster, set `defaults: - execution: internal/slurm/<cluster>` (replacing `slurm/default`), and remove any now-redundant `execution.hostname` (keep `account` / `output_dir` / `walltime` overrides). Confirm it resolves with a `--dry-run`. If the package isn't installed or nothing matches, keep `slurm/default` and fill `hostname` / `account` / `output_dir` manually.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Align fallback fields with the fields you say are auto-filled

Line 205’s fallback omits partition/gres (and walltime mention), even though Line 196 says the internal profile pre-fills those. Please make the fallback list explicit and consistent so users don’t miss required manual values when no internal profile matches.

Suggested doc edit
-If the package isn't installed or nothing matches, keep `slurm/default` and fill `hostname` / `account` / `output_dir` manually.
+If the package isn't installed or nothing matches, keep `slurm/default` and fill `hostname` / `partition` / `gres` / `account` / `output_dir` / `walltime` manually.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
If a listed hostname matches the target SLURM cluster, set `defaults: - execution: internal/slurm/<cluster>` (replacing `slurm/default`), and remove any now-redundant `execution.hostname` (keep `account` / `output_dir` / `walltime` overrides). Confirm it resolves with a `--dry-run`. If the package isn't installed or nothing matches, keep `slurm/default` and fill `hostname` / `account` / `output_dir` manually.
If a listed hostname matches the target SLURM cluster, set `defaults: - execution: internal/slurm/<cluster>` (replacing `slurm/default`), and remove any now-redundant `execution.hostname` (keep `account` / `output_dir` / `walltime` overrides). Confirm it resolves with a `--dry-run`. If the package isn't installed or nothing matches, keep `slurm/default` and fill `hostname` / `partition` / `gres` / `account` / `output_dir` / `walltime` manually.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.claude/skills/evaluation/SKILL.md at line 205, Update the fallback
instructions so the explicit fields users must fill when no internal SLURM
profile matches include partition, gres, and walltime in addition to
hostname/account/output_dir; specifically modify the sentence about replacing
`slurm/default` and removing `execution.hostname` (the `defaults: - execution:
internal/slurm/<cluster>` guidance) to state that if no match is found you
should keep `slurm/default` and manually set `hostname`, `account`,
`output_dir`, `partition`, `gres`, and `walltime` so it matches the earlier
claim that internal profiles auto-fill partition/gres/walltime.

@cjluo-nv cjluo-nv requested a review from Edwardf0t1 June 2, 2026 06:38
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
.claude/skills/evaluation/SKILL.md (1)

205-205: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix fallback field list to match the pre-fill contract.

Line 205 still omits partition, gres, and walltime, which conflicts with Line 196’s statement about what internal profiles pre-fill. Please make the fallback explicit and consistent.

Proposed doc fix
-Hostname match → set `defaults: - execution: internal/slurm/<cluster>`, drop the redundant `execution.hostname` (keep account/output_dir/walltime), verify with `--dry-run`. Else keep `slurm/default` and fill hostname/account/output_dir manually.
+Hostname match → set `defaults: - execution: internal/slurm/<cluster>`, drop the redundant `execution.hostname` (keep account/output_dir/walltime), verify with `--dry-run`. Else keep `slurm/default` and fill hostname/partition/gres/account/output_dir/walltime manually.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.claude/skills/evaluation/SKILL.md at line 205, Update the fallback field
list in SKILL.md so it matches the pre-fill contract: when hostname matches set
defaults to "execution: internal/slurm/<cluster>" and remove the redundant
"execution.hostname" entry, but explicitly include and pre-fill the fields
account, output_dir, walltime, partition, and gres; note the user should verify
with "--dry-run". Otherwise keep "slurm/default" and instruct filling
hostname/account/output_dir/partition/gres/walltime manually. Ensure the text
around the existing "Hostname match → set `defaults: - execution:
internal/slurm/<cluster>`" line is changed to list the full fallback fields
(account, output_dir, walltime, partition, gres) to be consistent with the
statement at line 196.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In @.claude/skills/evaluation/SKILL.md:
- Line 205: Update the fallback field list in SKILL.md so it matches the
pre-fill contract: when hostname matches set defaults to "execution:
internal/slurm/<cluster>" and remove the redundant "execution.hostname" entry,
but explicitly include and pre-fill the fields account, output_dir, walltime,
partition, and gres; note the user should verify with "--dry-run". Otherwise
keep "slurm/default" and instruct filling
hostname/account/output_dir/partition/gres/walltime manually. Ensure the text
around the existing "Hostname match → set `defaults: - execution:
internal/slurm/<cluster>`" line is changed to list the full fallback fields
(account, output_dir, walltime, partition, gres) to be consistent with the
statement at line 196.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e94bef91-0dc6-4ebb-972c-6ead79589133

📥 Commits

Reviewing files that changed from the base of the PR and between 33b532c and 24ba62a.

📒 Files selected for processing (2)
  • .claude/skills/evaluation/SKILL.md
  • .claude/skills/evaluation/recipes/examples/example_eval.yaml
✅ Files skipped from review due to trivial changes (1)
  • .claude/skills/evaluation/recipes/examples/example_eval.yaml

@codecov
Copy link
Copy Markdown

codecov Bot commented Jun 2, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.38%. Comparing base (905259f) to head (24ba62a).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1599   +/-   ##
=======================================
  Coverage   77.38%   77.38%           
=======================================
  Files         479      479           
  Lines       52435    52435           
=======================================
  Hits        40578    40578           
  Misses      11857    11857           
Flag Coverage Δ
unit 53.62% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@cjluo-nv cjluo-nv requested review from chadvoegele and meenchen June 2, 2026 16:05
Copy link
Copy Markdown
Contributor

@meenchen meenchen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bot review — DM the bot to share feedback.

Docs-only change (+13/-0) adding a "check FIRST" discovery snippet for optional nemo_evaluator_launcher_internal per-cluster execution configs, plus a pointer comment in example_eval.yaml. No cluster-specific data is hardcoded — discovery is at runtime and falls back gracefully to slurm/default when the package isn't installed. No correctness, licensing, or test concerns.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants