Skip to content

[Early Testing] Add day0-release orchestration skill with enforced gates#1596

Open
Edwardf0t1 wants to merge 2 commits into
mainfrom
day0-release
Open

[Early Testing] Add day0-release orchestration skill with enforced gates#1596
Edwardf0t1 wants to merge 2 commits into
mainfrom
day0-release

Conversation

@Edwardf0t1
Copy link
Copy Markdown
Contributor

@Edwardf0t1 Edwardf0t1 commented Jun 2, 2026

What does this PR do?

Type of change: new feature (Claude agent skill)

Adds a day0-release orchestration skill that chains the existing domain skills (ptqdeploymentevaluationcompare-results) into the day-0 release happy path, with code-enforced gates between stages.

The problem it solves. Today the day-0 release sequence is model-driven — the agent re-decides the PTQ → deploy → eval → compare order each session, and can silently skip a gate (e.g. report scores from an incomplete eval, or hand off a checkpoint that missed quantization coverage). These are failure modes we hit repeatedly in live trials. day0-release makes the sequence and the gates deterministic so the common case is repeatable.

What it is — and isn't. It's a conductor, not a new instrument. It owns the sequence, the gates, and the accept/retry decision; the domain skills still own the actual work. It is not for single-stage asks ("just quantize X" → ptq; "run MMLU on this endpoint" → evaluation) — its negative trigger excludes those. It fires only for the full goal-driven release.

Goal it drives toward (the documented day-0 criterion): a quantized checkpoint smaller than the original, with <1% accuracy drop on the standard set vs the matching baseline, plus a publish recommendation.

Contents:

  • .claude/skills/day0-release/SKILL.md — the chain + gate calls + decision logic.
  • .claude/skills/day0-release/scripts/gate_ptq.py, gate_run.py, gate_compare.py — deterministic gate scripts. Each is a pure decision function (evaluate_*) plus a thin file-reading main, returning JSON {pass, failure_class, detail, ...}. The failure_class values match the modelopt-agent-protocols strawman (MODEL_UNSUPPORTED, QUANT_COVERAGE_FAILURE, EVAL_JUDGE_FAILED, SAMPLE_ACCOUNTING_FAILED, …).
  • .claude/skills/day0-release/scripts/test_gates.py — 18 unit tests for the gate decision functions (GPU-free, no cluster).
  • .claude/skills/day0-release/tests/evals.json — 5 routing + behavior assertions.

As-built note — gate_ptq.py input. v1 takes a --summary <validation.json> (size scan + hf_ptq quant summary: source_bytes, output_bytes, recipe, layer_precision_counts, metadata_diffs) rather than reading the safetensors checkpoint directly. The --checkpoint/--source/--recipe args are reserved stubs; wiring the gate to build that summary from the exported checkpoint is a follow-up. gate_run.py and gate_compare.py likewise read small JSON summaries the agent produces from the run artifacts.

Usage

Ask Claude Code:

Release `<org>/<model>` at day-0: quantize to NVFP4, validate accuracy is within
1% of the BF16 baseline on the AA suite, and tell me if it's publishable.
Run on <cluster>.

The skill then runs the chain, enforcing a gate after each stage:

setup ─▶ PTQ ─▶ deploy ─▶ baseline-eval ─▶ quantized-eval ─▶ compare ─▶ closeout
          │       │            │                │              │
       gate_ptq  health     gate_run         gate_run      gate_compare
After stage Gate Pass condition On fail
Setup reachability creds present, cluster SSH ok SYSTEMIC → abort
PTQ gate_ptq.py size ratio <1, layer coverage matches recipe, metadata consistent triage → PATCH / skip recipe / abort
Deploy health endpoint up + 1 successful generation triage → PATCH flags/TP / skip
Each eval gate_run.py complete, all samples scored, no judge/parse failure retry / triage
Compare gate_compare.py every task within <1% drop REGRESSION → report; ANOMALOUS → human

It returns a decision, not a raw artifact: ACCEPT (publishable, with report) / REGRESSION (which tasks failed the threshold) / ANOMALOUS / INFEASIBLE — plus the workspace path and MLflow run IDs.

Testing

Tested in two layers — deterministic control flow (CI-able) separate from the agentic stages (integration-level):

  • Gate-script unit teststest_gates.py, 18 cases, all passing (python -m pytest .claude/skills/day0-release/scripts/test_gates.py). Covers the pass path plus each failure_class branch for all three gates: gate_compare (ACCEPT / REGRESSION / ANOMALOUS-on-implausible-gain / ANOMALOUS-out-of-range / mismatched-task-sets / relative-threshold), gate_run (valid / dropped-samples / judge-error / missing-score / timeout-not-terminal / no-tasks), gate_ptq (pass / not-smaller / zero-coverage→MODEL_UNSUPPORTED / unexpected-unquantized / metadata-diff / unknown-recipe). No GPU or cluster needed. Also verified: SPDX headers on all scripts, py_compile clean, SKILL.md/evals.json markdown + JSON valid.
  • Routing assertions (tests/evals.json, 5 cases): documents expected routing — fires on "release model X at day-0", does not fire on "just quantize X" / "run MMLU on this endpoint", and the gate-blocking / regression-reports-and-stops behaviors. These are behavior specs for manual QA; there is no automated skill-routing harness yet (tracked as "Remaining Work" in the design doc).
  • End-to-end integration (manual, GPU + cluster): not yet run. Planned on a small known-good model (e.g. Qwen3-0.6B → FP8): full chain reaches ACCEPT and produces a report; a known-failure fixture (coverage-miss recipe) verifies the triage branch routes to PATCH/INFEASIBLE rather than silently continuing. Tracked as a follow-up before this graduates from [Early Testing].

Open question for reviewers

Does v1 run the baseline eval, or assume a baseline already exists in MLflow?

  • Running it doubles GPU cost per release.
  • Assuming it adds a precondition the user must satisfy.
  • Proposed default: look up the baseline in MLflow by (model, task set, sampling params); run it only if missing. Surfaces as a gate_run precondition. Flagging it here because it's a product/cost call, not just an implementation detail.

Scope

In v1: the linear chain + gate scripts + ACCEPT/fail-with-report outcomes. On REGRESSION, v1 reports "recipe R regressed on tasks [...]" and stops.

Deferred (follow-up PR): the evaluator-optimizer recipe loop (compare → pick next recipe → re-PTQ), which needs the bigpareto integration and the shared modelopt-agent-protocols schema adopted on both sides.

Before your PR is "Ready for review"

  • Is this change backward compatible?: ✅ (new skill; no change to existing skills)
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A (no new deps)
  • Did you write any new necessary tests?: ✅ (gate-script unit tests + routing evals; see Testing TODO for run status)
  • Did you update Changelog?: ✅ (Early Testing entry)
  • Did you get Claude approval on this PR?: ⬜ (run /claude review before marking ready)

Additional Information

Design rationale: ModelOpt Agent Skills Design doc — this skill implements the "deterministic day-0 chain driver" (prompt-chaining-with-gates, the code-driven orchestration pattern from Anthropic's Building Effective Agents). The gate scripts double as the data source for the Observability stage metrics, and gate_compare.py's verdict is the entry point for the deferred evaluator-optimizer recipe loop.

Summary by CodeRabbit

  • New Features
    • Added day0-release skill: an end-to-end PTQ → deployment → evaluation → comparison workflow with enforced gates between stages and final publish decision.
  • Tests
    • Added deterministic unit tests covering gate logic for checkpoints, runs, and baseline-vs-candidate comparisons.
  • Documentation
    • Updated changelog and added a skill spec describing inputs, per-stage checks, gating outcomes, and triage actions.
  • Chores
    • Lint config extended to include new test files.

Adds a deterministic end-to-end driver that chains the existing domain
skills (ptq -> deployment -> evaluation -> compare-results) with a gate
after every stage, returning a publish decision (ACCEPT / REGRESSION /
ANOMALOUS / INFEASIBLE) rather than a raw artifact.

The skill is a conductor: it sequences the domain skills and enforces
the gates; it does not re-implement quantization, serving, evaluation,
or comparison. Negative trigger excludes single-stage requests.

Ships three GPU-free, unit-tested gate scripts whose pure decision
functions mirror logic already specified in the domain skills:
- gate_ptq.py     -> checkpoint-validation.md (size, coverage, metadata)
- gate_run.py     -> run-validation.md (completeness, judge/sample checks)
- gate_compare.py -> compare-results threshold decision

18 unit tests cover pass + each failure_class branch; routing tests in
tests/evals.json assert the skill fires on full-release requests and not
on single-stage ones.

v1 scope: linear chain + gates + report. On REGRESSION it reports which
tasks exceeded the threshold and stops. The evaluator-optimizer recipe
loop (compare -> next recipe -> re-PTQ) is deferred to a follow-up that
needs the bigpareto integration and a shared config/result schema.

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Jun 2, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 2, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

Adds a new day0-release skill spec plus three gate scripts (PTQ, run, compare), deterministic pytest coverage, eval scenarios, changelog entry, and a Ruff lint ignore for gate test files.

Changes

Day-0 Release Workflow and Gates

Layer / File(s) Summary
Workflow specification and gating contract
.claude/skills/day0-release/SKILL.md
Defines the end-to-end PTQ → deployment → baseline eval → quantized eval → comparison pipeline with enforced per-stage gating, triage action mappings, required output shape (decision + rationale + traceability), and v1 scope constraints.
Compare gate for accuracy threshold validation
.claude/skills/day0-release/scripts/gate_compare.py
Validates candidate task scores against baseline using absolute or relative thresholds, detects implausible gains and out-of-range values, requires identical task sets, and returns ACCEPT/REGRESSION/ANOMALOUS with per-task details in structured JSON and a CLI.
PTQ gate for checkpoint validation
.claude/skills/day0-release/scripts/gate_ptq.py
Validates PTQ summary JSON: enforces output smaller than source, checks quantized-layer coverage vs recipe-expected precision buckets (zero coverage → MODEL_UNSUPPORTED, unexpected unquantized layers, precision mismatches), flags metadata diffs, and returns deterministic failure_class and verdict via CLI.
Run gate for evaluation completeness validation
.claude/skills/day0-release/scripts/gate_run.py
Validates evaluation run-summary JSON per-task for terminal SUCCESS, error presence/classification, sample accounting, and canonical score extraction; aggregates per-task ok/reasons into overall verdict and exposes a CLI.
Unit tests for gate evaluation logic
.claude/skills/day0-release/scripts/test_gates.py
Deterministic pytest module covering compare/regression/anomalous cases, run-level task/sample/error handling, PTQ checkpoint validation cases, with helper builders and a __main__ test runner entrypoint.
Evaluation test specifications and version documentation
.claude/skills/day0-release/tests/evals.json, CHANGELOG.rst
Adds eval scenarios asserting correct workflow routing, stage sequencing, PTQ gate branching, and compare-based regression behavior; updates CHANGELOG with a 0.45 entry describing the new Early Testing skill.
Ruff lint config
pyproject.toml
Adds a per-file-ignores exception for .claude/skills/*/scripts/test_*.py matching existing tests/* ignores.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 11.54% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and concisely describes the main addition: a new day0-release orchestration skill with enforced gates between stages of the release workflow.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns ✅ Passed All Python code additions use only standard library imports; no unsafe torch.load/numpy.load/eval/exec/trust_remote_code patterns found; no nosec comments used; no new dependencies added.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch day0-release

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov
Copy link
Copy Markdown

codecov Bot commented Jun 2, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.23%. Comparing base (905259f) to head (504cb72).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1596      +/-   ##
==========================================
- Coverage   77.38%   73.23%   -4.16%     
==========================================
  Files         479      479              
  Lines       52435    52435              
==========================================
- Hits        40578    38401    -2177     
- Misses      11857    14034    +2177     
Flag Coverage Δ
unit 53.62% <ø> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Edwardf0t1 Edwardf0t1 marked this pull request as ready for review June 2, 2026 07:07
code-quality CI flagged D103 (missing docstrings) on the gate-script
main() functions and the test functions, since .claude/skills/**/scripts
isn't covered by the existing tests/* and examples/* docstring
exemptions.

- Add one-line docstrings to gate_ptq/gate_run/gate_compare main()
  (the gate scripts stay fully pydocstyle-compliant — they're the
  contract).
- Exempt skill test scripts from D / E402 via a per-file-ignore
  mirroring the existing tests/* rule (.claude/skills/*/scripts/test_*.py).
- Apply ruff format.

18 gate unit tests still pass.

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
@Edwardf0t1 Edwardf0t1 requested a review from a team as a code owner June 2, 2026 07:12
@Edwardf0t1 Edwardf0t1 requested a review from kevalmorabia97 June 2, 2026 07:12
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.claude/skills/day0-release/scripts/gate_compare.py:
- Around line 83-94: The loop currently logs anomalies for
non-numeric/out-of-range scores but still computes arithmetic (drop = b - c / b)
which raises TypeError for invalid b/c; after the validation block that appends
to anomalies for ("baseline", b) / ("candidate", c) add a short-circuit that
skips the delta/gain math when either score is invalid (e.g., check if not
isinstance(b, (int,float)) or not isinstance(c, (int,float)) or if anomalies was
appended for this task) and mark the task as anomalous/continue the loop so you
do not execute drop/limit/within calculations; update references to drop, limit,
within, and the implausible gain check to only run when b and c are valid
numbers within [_SCORE_MIN, _SCORE_MAX].

In @.claude/skills/day0-release/scripts/gate_ptq.py:
- Around line 137-149: main() currently hard-fails if args.summary is not
provided, which breaks the documented Step 2 flow that invokes the script with
--checkpoint/--source/--recipe; change the validation so that when args.summary
is missing the code will either (a) accept and derive or locate the
validation-summary.json from args.checkpoint or args.source (use the same path
resolution logic as the export/checkpoint step) or (b) fall back to using
args.recipe/args.source to reconstruct the summary before proceeding, instead of
returning exit code 2; update the branch handling around p.add_argument(...) and
the args.summary check in main() (and any helper that reads the summary) to
implement this fallback behavior so the script runs with the documented
invocation.

In @.claude/skills/day0-release/scripts/gate_run.py:
- Around line 91-97: The sample-count and score validation currently allows
missing/invalid counts or non-finite scores to pass; update the checks around
the sample accounting block (the expected/scored handling) and the score check
so that if expected is None or scored is None it sets ok = False and appends a
reason (e.g., "missing expected/scored samples"), and if score is not a finite
number (use math.isfinite after converting/parsing or handle
ValueError/TypeError) then set ok = False and append a descriptive reason (e.g.,
"invalid/non-finite score"); ensure you import math and use math.isfinite(score)
(or equivalent parsing+check) and keep the existing messages like "sample
accounting: scored {scored} of {expected}" and "no score extracted" but expand
them to cover missing/invalid cases so malformed eval outputs fail the gate.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 18ea1284-1a4f-443e-ba5f-f2606c8478c4

📥 Commits

Reviewing files that changed from the base of the PR and between 905259f and 504cb72.

📒 Files selected for processing (7)
  • .claude/skills/day0-release/SKILL.md
  • .claude/skills/day0-release/scripts/gate_compare.py
  • .claude/skills/day0-release/scripts/gate_ptq.py
  • .claude/skills/day0-release/scripts/gate_run.py
  • .claude/skills/day0-release/scripts/test_gates.py
  • .claude/skills/day0-release/tests/evals.json
  • CHANGELOG.rst

Comment on lines +83 to +94
for label, val in (("baseline", b), ("candidate", c)):
if not isinstance(val, (int, float)) or not (_SCORE_MIN <= val <= _SCORE_MAX):
anomalies.append(f"{task}: {label} score {val!r} outside [0, 100]")
if relative:
drop = (b - c) / b if b else 0.0
limit = threshold
else:
drop = b - c # percentage points
limit = threshold * 100.0 # threshold is a fraction of the 0-100 scale
within = drop <= limit
if c - b > _IMPLAUSIBLE_GAIN:
anomalies.append(f"{task}: candidate exceeds baseline by {c - b:.2f} pts (implausible)")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Short-circuit invalid score values before computing deltas.

This block records an anomaly for non-numeric scores, but still executes b - c / (b - c) / b. A payload like {"gpqa": "42"} will therefore crash the gate with TypeError instead of returning ANOMALOUS.

Suggested fix
+import math
+
 ...
     for task in sorted(baseline):
         b, c = baseline[task], candidate[task]
+        valid = True
         for label, val in (("baseline", b), ("candidate", c)):
-            if not isinstance(val, (int, float)) or not (_SCORE_MIN <= val <= _SCORE_MAX):
+            if (
+                not isinstance(val, (int, float))
+                or not math.isfinite(val)
+                or not (_SCORE_MIN <= val <= _SCORE_MAX)
+            ):
                 anomalies.append(f"{task}: {label} score {val!r} outside [0, 100]")
+                valid = False
+        if not valid:
+            per_task[task] = {
+                "baseline": b,
+                "candidate": c,
+                "drop": None,
+                "within_threshold": False,
+            }
+            continue
         if relative:
             drop = (b - c) / b if b else 0.0
             limit = threshold
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.claude/skills/day0-release/scripts/gate_compare.py around lines 83 - 94,
The loop currently logs anomalies for non-numeric/out-of-range scores but still
computes arithmetic (drop = b - c / b) which raises TypeError for invalid b/c;
after the validation block that appends to anomalies for ("baseline", b) /
("candidate", c) add a short-circuit that skips the delta/gain math when either
score is invalid (e.g., check if not isinstance(b, (int,float)) or not
isinstance(c, (int,float)) or if anomalies was appended for this task) and mark
the task as anomalous/continue the loop so you do not execute drop/limit/within
calculations; update references to drop, limit, within, and the implausible gain
check to only run when b and c are valid numbers within [_SCORE_MIN,
_SCORE_MAX].

Comment on lines +137 to +149
p.add_argument("--summary", help="validation-summary JSON (see module docstring)")
p.add_argument("--checkpoint", help="(reserved) checkpoint dir; v1 expects --summary")
p.add_argument("--source", help="(reserved) source model id/path")
p.add_argument("--recipe", help="(reserved) qformat; overrides summary.recipe if given")
args = p.parse_args(argv)

if not args.summary:
print(json.dumps({
"pass": False, "failure_class": "USER_CONFIG_ERROR",
"detail": "v1 requires --summary <validation-summary.json>; "
"produce it from the exported checkpoint (size scan + hf_ptq quant summary)",
}))
return 2
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Align the PTQ gate CLI with the documented Step 2 invocation.

main() hard-fails unless --summary is passed, but the published workflow calls this script with only --checkpoint/--source/--recipe. As shipped, the documented day-0 flow cannot get past the PTQ gate even when PTQ succeeds. Either support the Step 2 arguments here or update the workflow to pass --summary consistently.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.claude/skills/day0-release/scripts/gate_ptq.py around lines 137 - 149,
main() currently hard-fails if args.summary is not provided, which breaks the
documented Step 2 flow that invokes the script with
--checkpoint/--source/--recipe; change the validation so that when args.summary
is missing the code will either (a) accept and derive or locate the
validation-summary.json from args.checkpoint or args.source (use the same path
resolution logic as the export/checkpoint step) or (b) fall back to using
args.recipe/args.source to reconstruct the summary before proceeding, instead of
returning exit code 2; update the branch handling around p.add_argument(...) and
the args.summary check in main() (and any helper that reads the summary) to
implement this fallback behavior so the script runs with the documented
invocation.

Comment on lines +91 to +97
if expected is not None and scored is not None and scored != expected:
ok = False
reasons.append(f"sample accounting: scored {scored} of {expected}")

if score is None:
ok = False
reasons.append("no score extracted")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail the gate when sample counts or the canonical score are malformed.

A task with status="SUCCESS", no expected_samples/scored_samples, and a non-numeric or non-finite score currently passes. That lets incomplete or malformed eval outputs through to comparison. Treat missing/invalid counts and non-finite scores as gate failures here.

Suggested fix
+import math
+
 ...
-        if expected is not None and scored is not None and scored != expected:
-            ok = False
-            reasons.append(f"sample accounting: scored {scored} of {expected}")
+        if not isinstance(expected, int) or expected < 0 or not isinstance(scored, int) or scored < 0:
+            ok = False
+            reasons.append("sample accounting missing/invalid")
+        elif scored != expected:
+            ok = False
+            reasons.append(f"sample accounting: scored {scored} of {expected}")
 
-        if score is None:
+        if not isinstance(score, (int, float)) or not math.isfinite(score):
             ok = False
-            reasons.append("no score extracted")
+            reasons.append("no valid score extracted")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if expected is not None and scored is not None and scored != expected:
ok = False
reasons.append(f"sample accounting: scored {scored} of {expected}")
if score is None:
ok = False
reasons.append("no score extracted")
if not isinstance(expected, int) or expected < 0 or not isinstance(scored, int) or scored < 0:
ok = False
reasons.append("sample accounting missing/invalid")
elif scored != expected:
ok = False
reasons.append(f"sample accounting: scored {scored} of {expected}")
if not isinstance(score, (int, float)) or not math.isfinite(score):
ok = False
reasons.append("no valid score extracted")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.claude/skills/day0-release/scripts/gate_run.py around lines 91 - 97, The
sample-count and score validation currently allows missing/invalid counts or
non-finite scores to pass; update the checks around the sample accounting block
(the expected/scored handling) and the score check so that if expected is None
or scored is None it sets ok = False and appends a reason (e.g., "missing
expected/scored samples"), and if score is not a finite number (use
math.isfinite after converting/parsing or handle ValueError/TypeError) then set
ok = False and append a descriptive reason (e.g., "invalid/non-finite score");
ensure you import math and use math.isfinite(score) (or equivalent
parsing+check) and keep the existing messages like "sample accounting: scored
{scored} of {expected}" and "no score extracted" but expand them to cover
missing/invalid cases so malformed eval outputs fail the gate.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
.claude/skills/day0-release/scripts/gate_run.py (1)

91-97: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail the gate when sample counts or the canonical score are malformed.

The validation gaps flagged in the previous review remain unaddressed. A task with status="SUCCESS", missing expected_samples/scored_samples (None), and a non-finite score (NaN/Inf) currently passes the gate. This allows incomplete or malformed eval outputs through to comparison, which will produce misleading results or downstream failures.

Missing validations:

  • Lines 91-93: If expected or scored is None, the check is skipped entirely
  • Lines 91-93: No type or non-negative validation for sample counts
  • Lines 95-97: If score is not None but is NaN or Inf, it passes

The fix requires importing math and adding explicit checks for missing/invalid counts and non-finite scores.

🛡️ Suggested fix
+import math
+
 ...
-        if expected is not None and scored is not None and scored != expected:
+        if not isinstance(expected, int) or expected < 0 or not isinstance(scored, int) or scored < 0:
             ok = False
-            reasons.append(f"sample accounting: scored {scored} of {expected}")
+            reasons.append("sample accounting missing/invalid")
+        elif scored != expected:
+            ok = False
+            reasons.append(f"sample accounting: scored {scored} of {expected}")
 
-        if score is None:
+        if not isinstance(score, (int, float)) or not math.isfinite(score):
             ok = False
-            reasons.append("no score extracted")
+            reasons.append("no valid score extracted")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.claude/skills/day0-release/scripts/gate_run.py around lines 91 - 97, The
current validation around sample counts and canonical score is too permissive;
update the block that inspects expected, scored, and score (variables expected,
scored, score; function logic around lines shown) to: import math at the top,
explicitly treat expected or scored being None as a failure (set ok=False and
append a clear reason), validate that expected and scored are integers and >= 0
(fail and append reason if not), still check for mismatch when both valid, and
validate score with math.isfinite(score) so NaN/Inf are treated as invalid (if
score is None or not finite set ok=False and append "no score extracted" or
"non-finite score" as appropriate). Ensure all failures set ok=False and push a
descriptive message into reasons.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In @.claude/skills/day0-release/scripts/gate_run.py:
- Around line 91-97: The current validation around sample counts and canonical
score is too permissive; update the block that inspects expected, scored, and
score (variables expected, scored, score; function logic around lines shown) to:
import math at the top, explicitly treat expected or scored being None as a
failure (set ok=False and append a clear reason), validate that expected and
scored are integers and >= 0 (fail and append reason if not), still check for
mismatch when both valid, and validate score with math.isfinite(score) so
NaN/Inf are treated as invalid (if score is None or not finite set ok=False and
append "no score extracted" or "non-finite score" as appropriate). Ensure all
failures set ok=False and push a descriptive message into reasons.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: bc70c917-a62d-4c25-be89-6d7e5b7e8206

📥 Commits

Reviewing files that changed from the base of the PR and between 504cb72 and fcdb6a4.

📒 Files selected for processing (5)
  • .claude/skills/day0-release/scripts/gate_compare.py
  • .claude/skills/day0-release/scripts/gate_ptq.py
  • .claude/skills/day0-release/scripts/gate_run.py
  • .claude/skills/day0-release/scripts/test_gates.py
  • pyproject.toml
✅ Files skipped from review due to trivial changes (1)
  • pyproject.toml
🚧 Files skipped from review as they are similar to previous changes (3)
  • .claude/skills/day0-release/scripts/gate_compare.py
  • .claude/skills/day0-release/scripts/gate_ptq.py
  • .claude/skills/day0-release/scripts/test_gates.py

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new day0-release Claude skill that deterministically orchestrates the “day-0 release” happy path (PTQ → deploy → baseline eval → candidate eval → compare) and enforces code-driven gates between stages to prevent silently advancing with invalid artifacts/runs.

Changes:

  • Introduces day0-release skill spec (SKILL.md) describing the fixed stage chain, gating points, and decision outputs.
  • Adds three gate scripts (gate_ptq.py, gate_run.py, gate_compare.py) plus unit tests (test_gates.py) and routing/behavior assertions (tests/evals.json).
  • Updates lint config (pyproject.toml) for skill test scripts and documents the feature in CHANGELOG.rst.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
pyproject.toml Adds Ruff per-file ignores for skill-local test scripts.
CHANGELOG.rst Documents the new [Early Testing] day0-release skill and gates.
.claude/skills/day0-release/tests/evals.json Adds routing/behavior assertions for manual QA of skill selection and stop-on-gate behavior.
.claude/skills/day0-release/SKILL.md Defines the deterministic orchestration chain, gate invocation points, and triage/decision mapping.
.claude/skills/day0-release/scripts/test_gates.py Adds GPU-free unit tests covering the three gate decision functions.
.claude/skills/day0-release/scripts/gate_run.py Implements evaluation-run completeness/validity gate (sample accounting, status, judge errors).
.claude/skills/day0-release/scripts/gate_ptq.py Implements PTQ output gate (size reduction, coverage, metadata diffs) based on a summary JSON.
.claude/skills/day0-release/scripts/gate_compare.py Implements baseline-vs-candidate threshold gate producing ACCEPT/REGRESSION/ANOMALOUS decisions.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +67 to +70
```bash
python .claude/skills/day0-release/scripts/gate_ptq.py \
--checkpoint <output_path> --source <source_model> --recipe <qformat>
```
Comment on lines +94 to +96
```bash
python .claude/skills/day0-release/scripts/gate_run.py --run <run_dir_or_results_yml>
```
Comment on lines +81 to +92
for task in sorted(baseline):
b, c = baseline[task], candidate[task]
for label, val in (("baseline", b), ("candidate", c)):
if not isinstance(val, (int, float)) or not (_SCORE_MIN <= val <= _SCORE_MAX):
anomalies.append(f"{task}: {label} score {val!r} outside [0, 100]")
if relative:
drop = (b - c) / b if b else 0.0
limit = threshold
else:
drop = b - c # percentage points
limit = threshold * 100.0 # threshold is a fraction of the 0-100 scale
within = drop <= limit
Comment on lines +35 to +36
A walltime TIMEOUT with a healthy resume is not a failure (NEL resumes); only a
terminal FAILED/incomplete run fails the gate.
Comment thread pyproject.toml
Comment on lines +45 to +47
_TERMINAL_OK = "SUCCESS"
_RESUMABLE = {"TIMEOUT", "RESUMING"} # NEL dependency-chain resume — not a failure

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this not in the .claude/skills/day0-release/tests/ directory?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants