Skip to content

chore: add data-designer skill evals#718

Merged
johnnygreco merged 8 commits into
mainfrom
johnny/chore/data-designer-skill-evals
Jun 2, 2026
Merged

chore: add data-designer skill evals#718
johnnygreco merged 8 commits into
mainfrom
johnny/chore/data-designer-skill-evals

Conversation

@johnnygreco
Copy link
Copy Markdown
Contributor

📋 Summary

Adds targeted eval coverage for the data-designer skill so Autopilot routing and skill-specific behaviors are easier to verify. The cases focus on Data Designer workflow use, person sampling, LLM judge score access, sampler params, and unrelated negative prompts.

🔗 Related Issue

N/A

🔄 Changes

  • Add skills/data-designer/evals/evals.json with focused positive evals for Autopilot dataset generation scenarios.
  • Add negative evals for unrelated PostgreSQL and React UI requests to verify the skill is not selected outside dataset-generation tasks.
  • Keep behavior checks short, targeted, and easy to grade as true or false.

🧪 Testing

  • make test passes — not run; eval JSON only
  • Unit tests added/updated — skill evals added
  • E2E tests added/updated (if applicable) — N/A
  • python3 -m json.tool skills/data-designer/evals/evals.json passes

✅ Checklist

  • Follows commit message conventions
  • Commits are signed off (DCO) — commit was not created with signoff
  • Architecture docs updated (if applicable) — N/A

@johnnygreco johnnygreco requested a review from a team as a code owner May 29, 2026 20:12
@github-actions
Copy link
Copy Markdown
Contributor

Review: PR #718test: add data-designer skill evals

Summary

Adds a single new file, skills/data-designer/evals/evals.json (86 lines), containing six eval cases for the data-designer skill: four positive cases exercising Autopilot routing plus specific behaviors (person sampling, LLM judge score access, sampler params), and two negative cases (PostgreSQL admin, React UI) that should not invoke the skill. No existing files are modified. JSON parses cleanly.

This is purely test/eval data — no runtime code paths are touched, blast radius is limited to anyone consuming the eval file.

Findings

Correctness & alignment with SKILL.md

The eval expectations line up well with the rules documented in skills/data-designer/SKILL.md:

  • The sampler-params case asserts params is used and sampler_params is not — directly mirrors the SKILL.md rule at line 37 (SamplerColumnConfig: Takes params, not sampler_params).
  • The llm-judge-scores case asserts the accepted-boolean derivation references .score — mirrors the LLM judge access guidance at SKILL.md line 38.
  • The person-reviews case references get_person_object_schema.py, which exists at skills/data-designer/scripts/get_person_object_schema.py. ✓
  • Both negative cases assert the skill is not selected for unrelated tasks (DB admin, React UI), which is appropriate given the SKILL.md description scopes the skill to dataset/synthetic-data/data-generation tasks.

Schema consistency

  • All six entries share the same shape: id, question, expected_skill, expected_script, ground_truth, expected_behavior. Good — uniform schema makes downstream grading simpler.
  • expected_script is null for five entries and "get_person_object_schema.py" for one. That's the only structured assertion about which script the agent should run; the rest of script-execution checks live in expected_behavior strings. Worth noting the schema isn't documented anywhere (no README in evals/, no JSON Schema), so the meaning of expected_script vs. the free-text expected_behavior items is implicit. Not blocking, but a short evals/README.md describing the schema and how cases are graded would help future contributors.

Behavior assertions — graders need to match

The expected_behavior strings are natural-language predicates that presumably get fed to an LLM judge. A few observations:

  • Sampler-params case, item "The site column is generated by a category sampler" — strong assertion. The user prompt mentions only site as a column; the agent could reasonably pick category or subcategory. Probably fine as written, but if the grader is strict about category vs. any categorical sampler, this could be flaky.
  • Person-reviews case, item "The agent ran python scripts/get_person_object_schema.py with a locale argument" — assumes the script accepts a locale arg. Worth a quick sanity check that the script's CLI actually supports this; if it's optional, a flaky-grader risk exists.
  • Negative cases — the assertion "The agent did not read the data-designer SKILL.md" is a precise behavior, but Autopilot/skill-invocation typically reads SKILL.md as part of skill selection, not just execution. Whether the harness counts skill-selection reads against this assertion depends on grader semantics. Worth verifying with one dry run of the negative cases that a passing agent actually trips zero reads.
  • "The agent did not ask the user a clarifying question before building the script" appears in only one positive case, even though all four positive prompts include "do not ask follow-up questions" / "be opinionated" language. Consider adding this assertion to the other three for consistency, or deliberately leaving it off if the grading focus is different.

Project conventions

  • File location (skills/<skill>/evals/evals.json) is the first instance of this pattern in the repo (no other evals*.json files exist). The PR establishes the convention rather than following one — fine, but worth being deliberate. If multiple skills will get evals, codifying the layout (one file per skill, top-level array, this schema) in skills/README.md or similar would prevent drift.
  • No SPDX/license header — appropriate, JSON doesn't carry one.
  • No tests or CI wiring is added to consume the eval file. The PR description notes this is intentional ("eval JSON only"), and presumably an external evaluator picks it up. If there's a make target or repo-internal runner, a follow-up that wires it in would close the loop.

Minor

  • id values use kebab-case prefixed with data-designer- — consistent and grep-friendly. ✓
  • ground_truth and expected_behavior are slightly redundant in places (e.g., the positive cases' ground_truth restates what the behavior list already covers). Not a problem, just verbose.
  • DCO sign-off is missing per the PR checklist. Will need to be addressed before merge if this repo enforces DCO.

Risks

  • Low. No code paths change. Worst case is a flaky eval grader; that surfaces as eval-suite noise, not user-facing breakage.
  • The eval assertions are tightly coupled to current SKILL.md wording (e.g., file names, the .score rule, the params vs. sampler_params rule). Future SKILL.md edits will need to keep this file in sync — a one-line note in SKILL.md or a comment in the eval file pointing at the dependency would help.

Verdict

Looks good. Small, well-scoped, internally consistent, and the assertions match documented skill behavior. Suggested non-blocking follow-ups:

  1. Add a brief skills/data-designer/evals/README.md (or top-level skills/EVALS.md) describing the schema and grading model.
  2. Verify get_person_object_schema.py actually accepts a locale argument before this eval runs in CI.
  3. Consider tightening or relaxing the "did not read SKILL.md" assertion on negative cases based on how the grader treats skill-selection reads.
  4. Address the missing DCO sign-off.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 29, 2026

Greptile Summary

This PR adds eval infrastructure for the data-designer skill: a single-case evals.json, a benchmark report, a skill card, an OMS signature, and a minor metadata update to SKILL.md.

  • evals/evals.json ships with one positive eval case (customer support tickets) as a bare JSON object rather than an array; the benchmark report and skill card both reference a 4-task evaluation dataset that has no corresponding committed eval cases.
  • SKILL.md receives a license and metadata.owner field; BENCHMARK.md and skill-card.md document evaluation results using two agents (claude-code, codex) across security, correctness, discoverability, effectiveness, and efficiency dimensions.

Confidence Score: 4/5

The benchmark report and skill card both document results from 4 evaluation tasks, but only 1 task is present in the committed evals.json, making the benchmark evidence unverifiable from this PR alone.

The benchmark artefacts (BENCHMARK.md and skill-card.md) claim coverage of 4 evaluation tasks; evals.json contains one. Anyone relying on the committed eval file to reproduce or extend the benchmark will see different coverage than what the report describes. The remaining changes — metadata in SKILL.md and the signature file — are low-risk.

skills/data-designer/evals/evals.json and skills/data-designer/BENCHMARK.md need alignment on task count

Important Files Changed

Filename Overview
skills/data-designer/evals/evals.json Single-object eval file with one positive case; claims 4 tasks in benchmark docs, mismatched expected_script usage already flagged in prior threads
skills/data-designer/BENCHMARK.md New benchmark report documenting 4-task evaluation; task count is inconsistent with the 1-case evals.json committed alongside it
skills/data-designer/SKILL.md Minor metadata addition: license field and owner metadata; no logic changes
skills/data-designer/skill-card.md New skill card with evaluation results; also states 4 evaluation tasks, inconsistent with the single eval case in evals.json
skills/data-designer/skill.oms.sig New cryptographic OMS signature bundle for the skill; no logic concerns

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[PR: add data-designer skill evals] --> B[evals/evals.json\n1 positive eval case]
    A --> C[BENCHMARK.md\nclaims 4 evaluation tasks]
    A --> D[skill-card.md\nclaims 4 evaluation tasks]
    A --> E[SKILL.md\nadd license + metadata.owner]
    A --> F[skill.oms.sig\nOMS cryptographic signature]

    B -- "bare object format\nnot an array" --> G{Harness\nloads evals.json}
    C -- "4 tasks documented" --> H[Benchmark results\nfrom 4 tasks]
    B -- "1 task present" --> H
    H --> I[⚠️ Mismatch:\nmetrics not reproducible\nfrom committed eval]

    G -- "expects array?" --> J[⚠️ Possible parse failure\nor wrong eval count]
    G -- "expects object?" --> K[Runs single eval case]
Loading
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
skills/data-designer/evals/evals.json:1-13
**Benchmark metadata claims 4 tasks; only 1 is present**

Both `BENCHMARK.md` ("Dataset: 4 evaluation tasks") and `skill-card.md` ("Evaluated against 4 evaluation tasks") reference a 4-task dataset, but `evals.json` ships with exactly one eval case. If a harness loads this file to reproduce the benchmark results or measure discoverability/correctness scores, it will be operating on a dataset that is 75% smaller than what the benchmark report describes, making those reported metrics unverifiable from the committed artefacts.

Reviews (15): Last reviewed commit: "Attach NVSkills validation signatures" | Re-trigger Greptile

Comment thread skills/data-designer/evals/evals.json Outdated
@johnnygreco johnnygreco force-pushed the johnny/chore/data-designer-skill-evals branch from 0a5e916 to b6cd817 Compare May 29, 2026 22:36
@johnnygreco
Copy link
Copy Markdown
Contributor Author

/nvskills-ci

@calliente
Copy link
Copy Markdown

@johnnygreco - I think this is failing because its missing the DCO sign-off. Run git rebase --signoff origin/main && git push --force-with-lease

@johnnygreco johnnygreco changed the title test: add data-designer skill evals chore: add data-designer skill evals Jun 1, 2026
@johnnygreco johnnygreco force-pushed the johnny/chore/data-designer-skill-evals branch 2 times, most recently from d0f0a40 to 467a900 Compare June 1, 2026 14:35
@johnnygreco
Copy link
Copy Markdown
Contributor Author

/nvskills-ci

@johnnygreco johnnygreco force-pushed the johnny/chore/data-designer-skill-evals branch from 467a900 to abf988a Compare June 2, 2026 14:59
@johnnygreco
Copy link
Copy Markdown
Contributor Author

/nvskills-ci

1 similar comment
@johnnygreco
Copy link
Copy Markdown
Contributor Author

/nvskills-ci

Comment thread skills/data-designer/evals/evals.json
@johnnygreco
Copy link
Copy Markdown
Contributor Author

/nvskills-ci

1 similar comment
@johnnygreco
Copy link
Copy Markdown
Contributor Author

/nvskills-ci

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 2, 2026

All contributors have signed the DCO ✍️ ✅
Posted by the DCO Assistant Lite bot.

@johnnygreco
Copy link
Copy Markdown
Contributor Author

/nvskills-ci

1 similar comment
@johnnygreco
Copy link
Copy Markdown
Contributor Author

/nvskills-ci

@johnnygreco
Copy link
Copy Markdown
Contributor Author

recheck

johnnygreco and others added 6 commits June 2, 2026 17:18
Signed-off-by: Johnny Greco <jogreco@nvidia.com>
Signed-off-by: Johnny Greco <jogreco@nvidia.com>
Signed-off-by: Johnny Greco <jogreco@nvidia.com>
Signed-off-by: Johnny Greco <jogreco@nvidia.com>
Signed-off-by: nvskills-svc-account <svc-nvskills-signing@nvidia.com>
johnnygreco and others added 2 commits June 2, 2026 17:18
Signed-off-by: Johnny Greco <jogreco@nvidia.com>
Signed-off-by: nvskills-svc-account <svc-nvskills-signing@nvidia.com>
@johnnygreco johnnygreco force-pushed the johnny/chore/data-designer-skill-evals branch from 45c323c to da47b1e Compare June 2, 2026 17:18
@johnnygreco
Copy link
Copy Markdown
Contributor Author

/nvskills-ci

Comment thread skills/data-designer/evals/evals.json
@johnnygreco johnnygreco merged commit 864322b into main Jun 2, 2026
62 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants