chore: add data-designer skill evals #718
Review: PR #718 —
Greptile Summary
This PR adds eval infrastructure for the
| Filename | Overview |
|---|---|
| skills/data-designer/evals/evals.json | Single-object eval file with one positive case; claims 4 tasks in benchmark docs, mismatched expected_script usage already flagged in prior threads |
| skills/data-designer/BENCHMARK.md | New benchmark report documenting 4-task evaluation; task count is inconsistent with the 1-case evals.json committed alongside it |
| skills/data-designer/SKILL.md | Minor metadata addition: license field and owner metadata; no logic changes |
| skills/data-designer/skill-card.md | New skill card with evaluation results; also states 4 evaluation tasks, inconsistent with the single eval case in evals.json |
| skills/data-designer/skill.oms.sig | New cryptographic OMS signature bundle for the skill; no logic concerns |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[PR: add data-designer skill evals] --> B[evals/evals.json\n1 positive eval case]
A --> C[BENCHMARK.md\nclaims 4 evaluation tasks]
A --> D[skill-card.md\nclaims 4 evaluation tasks]
A --> E[SKILL.md\nadd license + metadata.owner]
A --> F[skill.oms.sig\nOMS cryptographic signature]
B -- "bare object format\nnot an array" --> G{Harness\nloads evals.json}
C -- "4 tasks documented" --> H[Benchmark results\nfrom 4 tasks]
B -- "1 task present" --> H
H --> I[⚠️ Mismatch:\nmetrics not reproducible\nfrom committed eval]
G -- "expects array?" --> J[⚠️ Possible parse failure\nor wrong eval count]
G -- "expects object?" --> K[Runs single eval case]
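The object-versus-array ambiguity in the flowchart can be sidestepped by normalizing the file at load time. The following is a minimal sketch, not the actual harness: `load_eval_cases` and the `prompt` field are hypothetical names, and it assumes a bare top-level object is meant to be a single eval case.

```python
import json
from typing import Any


def load_eval_cases(text: str) -> list[dict[str, Any]]:
    """Parse eval JSON and always return a list of case objects.

    Assumption: a bare top-level object represents exactly one eval case.
    """
    data = json.loads(text)
    if isinstance(data, dict):
        # Bare object: wrap it so callers always iterate over a list.
        return [data]
    if isinstance(data, list) and all(isinstance(item, dict) for item in data):
        return data
    raise ValueError("expected a JSON object or an array of objects")


# Hypothetical inputs; the real evals.json schema is not shown in this PR page.
single = load_eval_cases('{"prompt": "generate a dataset"}')
several = load_eval_cases('[{"prompt": "a"}, {"prompt": "b"}]')
print(len(single), len(several))  # 1 2
```

With this shape, a harness reports the same eval count whether the file holds one object or an array, which makes a count mismatch visible instead of surfacing as a parse failure.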
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 1
skills/data-designer/evals/evals.json:1-13
**Benchmark metadata claims 4 tasks; only 1 is present**
Both `BENCHMARK.md` ("Dataset: 4 evaluation tasks") and `skill-card.md` ("Evaluated against 4 evaluation tasks") reference a 4-task dataset, but `evals.json` ships with exactly one eval case. If a harness loads this file to reproduce the benchmark results or measure discoverability/correctness scores, it will be operating on a dataset that is 75% smaller than what the benchmark report describes, making those reported metrics unverifiable from the committed artefacts.
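A consistency check along these lines could catch the mismatch before merge. This is a sketch under stated assumptions: the helper names are hypothetical, and the regular expression only matches the "N evaluation tasks" wording quoted above, not any confirmed document format.

```python
import re
from typing import Optional


def documented_task_count(doc_text: str) -> Optional[int]:
    """Return N from the first 'N evaluation tasks' phrase, or None if absent."""
    match = re.search(r"(\d+)\s+evaluation tasks", doc_text)
    return int(match.group(1)) if match else None


def coverage_gap(documented: int, committed: int) -> float:
    """Fraction of documented tasks missing from the committed eval file."""
    if documented <= 0:
        raise ValueError("documented count must be positive")
    return max(documented - committed, 0) / documented


# The phrase below is the one quoted from BENCHMARK.md in the review comment.
documented = documented_task_count("Dataset: 4 evaluation tasks")
print(documented, coverage_gap(documented, committed=1))  # 4 0.75
```

The 0.75 result corresponds to the "75% smaller" figure in the review: three of the four documented tasks have no committed eval case.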
Reviews (15): Last reviewed commit: "Attach NVSkills validation signatures"
0a5e916 to b6cd817
/nvskills-ci
@johnnygreco - I think this is failing because it's missing the DCO sign-off. Run `git rebase --signoff origin/main && git push --force-with-lease`.
d0f0a40 to 467a900
/nvskills-ci
467a900 to abf988a
/nvskills-ci
/nvskills-ci
/nvskills-ci
/nvskills-ci
All contributors have signed the DCO ✍️ ✅
/nvskills-ci
/nvskills-ci
recheck
Signed-off-by: Johnny Greco <jogreco@nvidia.com>
Signed-off-by: Johnny Greco <jogreco@nvidia.com>
Signed-off-by: Johnny Greco <jogreco@nvidia.com>
Signed-off-by: Johnny Greco <jogreco@nvidia.com>
Signed-off-by: nvskills-svc-account <svc-nvskills-signing@nvidia.com>
Signed-off-by: Johnny Greco <jogreco@nvidia.com>
Signed-off-by: nvskills-svc-account <svc-nvskills-signing@nvidia.com>
45c323c to da47b1e
/nvskills-ci
📋 Summary
Adds targeted eval coverage for the `data-designer` skill so Autopilot routing and skill-specific behaviors are easier to verify. The cases focus on Data Designer workflow use, person sampling, LLM judge score access, sampler params, and unrelated negative prompts.
🔗 Related Issue
N/A
🔄 Changes
`skills/data-designer/evals/evals.json` with focused positive evals for Autopilot dataset generation scenarios.
🧪 Testing
`make test` passes (not run; eval JSON only)
`python3 -m json.tool skills/data-designer/evals/evals.json` passes
✅ Checklist