Merged
3 changes: 3 additions & 0 deletions evals/suites/generation-quality/cases.jsonl
@@ -10,3 +10,6 @@
{"id":"generation-quality.unanswered-spec-questions-ask-user","suite":"generation-quality","executor":"ricky-cli","kind":"regression","input":{"message":"Ricky receives a workflow generation spec with an explicit open question and no `--best-judgement` flag."},"expected":{"ok":false,"contentIncludes":["Generation: failed (status: needs_clarification).","Next: Clarify: Who owns final rollout signoff?"],"forbidPhrases":["Best-judgement clarifications","TypeError","ReferenceError"],"maxToolCalls":1,"must":["Stop before generating a workflow artifact when the spec carries an unanswered question.","Ask the user the unresolved question directly.","Avoid writing an implementation assumption unless the caller explicitly opts into best judgement."],"mustNot":["Generate a workflow by silently guessing the answer.","Hide the clarification question behind a generic failure."],"humanReviewRequired":false},"tags":["generation","clarification","local"],"mock":{"cwd":"temp","specFileContent":"Generate a workflow for package validation.\\nOpen questions:\\n- Who owns final rollout signoff?","argv":"--mode local --spec-file {{specFile}} --no-workforce-persona"}}
{"id":"generation-quality.best-judgement-answers-spec-questions","suite":"generation-quality","executor":"ricky-cli","kind":"regression","input":{"message":"Ricky receives the same open-question spec with `--best-judgement`."},"expected":{"ok":true,"contentIncludes":["generated; run when ready","Warning: --best-judgement Who owns final rollout signoff?","Answered by implementing agent impl-primary-codex using --best-judgement","Workflow: workflows/generated/"],"forbidPhrases":["Generation: failed","needs_clarification","TypeError","ReferenceError"],"maxToolCalls":1,"must":["Continue to workflow generation after explicitly answering the unresolved question.","Call out each best-judgement question and answer in user-visible output or generated context.","Identify the implementing agent that made the assumption."],"mustNot":["Pretend the user supplied the answer.","Drop the original question from the assumption record."],"humanReviewRequired":false},"tags":["generation","clarification","local","best-judgement"],"mock":{"cwd":"temp","specFileContent":"Generate a workflow for package validation.\\nOpen questions:\\n- Who owns final rollout signoff?","argv":"local --spec-file {{specFile}} --best-judgement --no-workforce-persona"}}
{"id":"generation-quality.mode-local-overrides-runtime-wording","suite":"generation-quality","executor":"ricky-cli","kind":"regression","input":{"message":"Ricky receives a spec that legitimately discusses both local and Cloud execution while the CLI selected local mode."},"expected":{"ok":true,"contentIncludes":["Generation: ok","Run: ricky run workflows/generated/"],"forbidPhrases":["execution-mode-conflict","needs_clarification","Should this workflow run locally/BYOH","TypeError","ReferenceError"],"maxToolCalls":1,"must":["Treat the explicit local CLI mode as the execution preference.","Generate a workflow even when the design spec mentions both local and Cloud runtime support.","Avoid re-asking the local-vs-Cloud clarification after mode has already been chosen."],"mustNot":["Infer `auto` solely from runtime keywords when an explicit CLI mode is present.","Force the user to rewrite a design spec to remove one runtime keyword."],"humanReviewRequired":false},"tags":["generation","clarification","local","issue-76"],"mock":{"cwd":"temp","specFileContent":"Generate a workflow for a primitive whose API supports local BYOH execution and Cloud hosted execution. The generated workflow should implement the primitive docs and validation gates.","argv":"--mode local --spec-file {{specFile}} --no-workforce-persona"}}
{"id":"generation-quality.target-files-from-backticked-prose","suite":"generation-quality","executor":"ricky-cli","kind":"regression","input":{"message":"Ricky receives a markdown spec that names target file paths inside backticks in prose. The parser must recognize them so the workflow targets real source files instead of falling back to the manifest-driven single-artifact path."},"expected":{"ok":true,"contentIncludes":["\"target_files\":","packages/web/app/api/v1/workflows/run/route.ts","packages/core/src/bootstrap/launcher.ts"],"forbidPhrases":["TypeError","ReferenceError"],"maxToolCalls":1,"must":["Extract paths wrapped in markdown backticks into `target_files`.","Surface `target_files` in the generation JSON so callers can verify scope."],"mustNot":["Fall back to the manifest-driven single-artifact path when the spec names concrete files.","Capture prose noise like `base/head` as a target file."],"humanReviewRequired":false},"tags":["generation","target-files","parser","local"],"mock":{"cwd":"temp","specFileContent":"# Spec\\n\\nImplementation plan:\\n\\n- Update `packages/web/app/api/v1/workflows/run/route.ts` to accept the new mode.\\n- Update `packages/core/src/bootstrap/launcher.ts` to provision a sandbox.\\n","argv":"--mode local --spec-file {{specFile}} --no-run --json --no-workforce-persona"}}
{"id":"generation-quality.target-files-from-structured-block","suite":"generation-quality","executor":"ricky-cli","kind":"regression","input":{"message":"A spec with an explicit `## Target Files` block must take precedence over any prose paths so authors can be unambiguous about scope."},"expected":{"ok":true,"contentIncludes":["\"target_files\":","packages/web/app/api/v1/workflows/run/route.ts","packages/core/src/bootstrap/launcher.ts"],"forbidPhrases":["tests/scratch/example.ts","TypeError","ReferenceError"],"maxToolCalls":1,"must":["Honor the structured `## Target Files` block as the source of truth when present.","Strip leading bullets and surrounding backticks from each line in the block."],"mustNot":["Mix prose-extracted candidates into `target_files` when a structured block is declared."],"humanReviewRequired":false},"tags":["generation","target-files","parser","local"],"mock":{"cwd":"temp","specFileContent":"# Spec\\n\\nProse mentions `tests/scratch/example.ts` casually.\\n\\n## Target Files\\n\\n- `packages/web/app/api/v1/workflows/run/route.ts`\\n- packages/core/src/bootstrap/launcher.ts\\n\\n## Acceptance\\n\\nIt works.\\n","argv":"--mode local --spec-file {{specFile}} --no-run --json --no-workforce-persona"}}
{"id":"generation-quality.target-files-suppresses-prose-noise","suite":"generation-quality","executor":"ricky-cli","kind":"regression","input":{"message":"The parser must suppress two-segment prose tokens that have no extension and no recognized leading directory (e.g. `base/head`, `my-org/my-repo`) so they are not captured as target files."},"expected":{"ok":true,"contentIncludes":["\"target_files\":","packages/web/app/api/v1/workflows/run/route.ts"],"forbidPhrases":["\"\\\"base/head\\\"\"","\"\\\"user/account\\\"\"","TypeError","ReferenceError"],"maxToolCalls":1,"must":["Keep real backticked paths in `target_files`.","Drop two-segment prose tokens that look like noise."],"mustNot":["Capture human-readable phrases as file paths."],"humanReviewRequired":false},"tags":["generation","target-files","parser","local"],"mock":{"cwd":"temp","specFileContent":"# Spec\\n\\nSend the PR number, base/head SHA, and the user/account pair to MSD. Then update `packages/web/app/api/v1/workflows/run/route.ts`.\\n","argv":"--mode local --spec-file {{specFile}} --no-run --json --no-workforce-persona"}}
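
Each record above carries an `expected` object with `ok`, `contentIncludes`, and `forbidPhrases`. As a rough illustration of how such deterministic checks could be evaluated, here is a minimal Python sketch. The function names `load_cases` and `check_expected` are hypothetical, plain case-sensitive substring matching is an assumption, and `maxToolCalls` is deliberately left out; the real eval harness may behave differently.

```python
import json


def load_cases(jsonl_text: str) -> list[dict]:
    """Parse one JSON object per non-empty line of a cases.jsonl file."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def check_expected(case: dict, ok: bool, output: str) -> list[str]:
    """Return a list of failure messages; an empty list means the case passed.

    Assumption: phrases are compared as plain, case-sensitive substrings.
    """
    expected = case["expected"]
    failures = []
    if ok != expected["ok"]:
        failures.append(f"ok: expected {expected['ok']}, got {ok}")
    for phrase in expected.get("contentIncludes", []):
        if phrase not in output:
            failures.append(f"missing required phrase: {phrase!r}")
    for phrase in expected.get("forbidPhrases", []):
        if phrase in output:
            failures.append(f"found forbidden phrase: {phrase!r}")
    return failures
```

Returning every failure rather than stopping at the first makes a regression report easier to read when one change breaks several expectations at once.
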
99 changes: 99 additions & 0 deletions evals/suites/generation-quality/cases.md
@@ -263,3 +263,102 @@ maxToolCalls: 1
### Must Not
- Infer `auto` solely from runtime keywords when an explicit CLI mode is present.
- Force the user to rewrite a design spec to remove one runtime keyword.
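
The precedence rule in the two items above can be sketched as follows. This is only an illustration under stated assumptions: `resolve_execution_mode` is a hypothetical name, the keyword patterns are guesses, and returning `None` when the spec gives no signal is a choice made for the sketch, not documented Ricky behavior.

```python
import re
from typing import Optional


def resolve_execution_mode(cli_mode: Optional[str], spec_text: str) -> Optional[str]:
    """Pick an execution mode, letting an explicit CLI mode win outright."""
    if cli_mode is not None:
        # Explicit choice: never re-infer from wording in the spec.
        return cli_mode
    # Assumed keyword heuristics, consulted only when no mode was passed.
    mentions_local = re.search(r"\b(local|byoh)\b", spec_text, re.I) is not None
    mentions_cloud = re.search(r"\bcloud\b", spec_text, re.I) is not None
    if mentions_local and mentions_cloud:
        return "auto"
    if mentions_local:
        return "local"
    if mentions_cloud:
        return "cloud"
    return None  # no signal; the caller decides what to do
```
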
## generation-quality.target-files-from-backticked-prose
Executor: ricky-cli
Kind: regression
Tags: generation, target-files, parser, local
Human Review: false

### Message
Ricky receives a markdown spec that names target file paths inside backticks in prose. The parser must recognize them so the workflow targets real source files instead of falling back to the manifest-driven single-artifact path.

### Mock
cwd: temp
specFileContent: # Spec\n\nImplementation plan:\n\n- Update `packages/web/app/api/v1/workflows/run/route.ts` to accept the new mode.\n- Update `packages/core/src/bootstrap/launcher.ts` to provision a sandbox.\n
argv: --mode local --spec-file {{specFile}} --no-run --json --no-workforce-persona

### Deterministic Checks
ok: true
contentIncludes:
- "target_files":
- packages/web/app/api/v1/workflows/run/route.ts
- packages/core/src/bootstrap/launcher.ts
forbidPhrases:
- TypeError
- ReferenceError
maxToolCalls: 1

### Must
- Extract paths wrapped in markdown backticks into `target_files`.
- Surface `target_files` in the generation JSON so callers can verify scope.

### Must Not
- Fall back to the manifest-driven single-artifact path when the spec names concrete files.
- Capture prose noise like `base/head` as a target file.
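
As an illustration of the extraction this case exercises, the sketch below pulls path-like tokens out of backtick spans. It is not Ricky's parser: `extract_backticked_paths` is a hypothetical name, and the rule that a path needs at least one `/` and a file extension is an assumption chosen so that `base/head` is rejected.

```python
import re

# Inline code spans delimited by single backticks.
BACKTICK_SPAN = re.compile(r"`([^`\n]+)`")
# Assumption: a path has at least one "/" and ends in a file extension.
PATH_LIKE = re.compile(r"^[\w.-]+(?:/[\w.-]+)+\.\w+$")


def extract_backticked_paths(spec: str) -> list[str]:
    """Return path-like backticked tokens in order of first appearance."""
    found: list[str] = []
    for match in BACKTICK_SPAN.finditer(spec):
        candidate = match.group(1).strip()
        if PATH_LIKE.match(candidate) and candidate not in found:
            found.append(candidate)
    return found
```
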

## generation-quality.target-files-from-structured-block
Executor: ricky-cli
Kind: regression
Tags: generation, target-files, parser, local
Human Review: false

### Message
A spec with an explicit `## Target Files` block must take precedence over any prose paths so authors can be unambiguous about scope.

### Mock
cwd: temp
specFileContent: # Spec\n\nProse mentions `tests/scratch/example.ts` casually.\n\n## Target Files\n\n- `packages/web/app/api/v1/workflows/run/route.ts`\n- packages/core/src/bootstrap/launcher.ts\n\n## Acceptance\n\nIt works.\n
argv: --mode local --spec-file {{specFile}} --no-run --json --no-workforce-persona

### Deterministic Checks
ok: true
contentIncludes:
- "target_files":
- packages/web/app/api/v1/workflows/run/route.ts
- packages/core/src/bootstrap/launcher.ts
forbidPhrases:
- tests/scratch/example.ts
- TypeError
- ReferenceError
maxToolCalls: 1

### Must
- Honor the structured `## Target Files` block as the source of truth when present.
- Strip leading bullets and surrounding backticks from each line in the block.

### Must Not
- Mix prose-extracted candidates into `target_files` when a structured block is declared.
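
The precedence rule above might look like the following sketch. The helper names are hypothetical, and treating any later heading as the end of the block is an assumption made for the example rather than confirmed parser behavior.

```python
import re


def parse_target_files_block(spec: str) -> list[str]:
    """Return entries listed under a '## Target Files' heading, or [] if absent."""
    lines = spec.splitlines()
    start = next(
        (i for i, line in enumerate(lines) if line.strip().lower() == "## target files"),
        None,
    )
    if start is None:
        return []
    entries = []
    for line in lines[start + 1:]:
        if line.startswith("#"):  # assumed: the next heading closes the block
            break
        # Strip a leading bullet, then any surrounding backticks.
        cleaned = re.sub(r"^\s*[-*+]\s+", "", line).strip().strip("`").strip()
        if cleaned:
            entries.append(cleaned)
    return entries


def resolve_target_files(spec: str, prose_candidates: list[str]) -> list[str]:
    """Prefer the structured block; use prose candidates only when it is absent."""
    structured = parse_target_files_block(spec)
    return structured if structured else list(prose_candidates)
```

Because the structured list replaces the prose candidates instead of merging with them, a casual mention such as `tests/scratch/example.ts` cannot widen the scope.
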

## generation-quality.target-files-suppresses-prose-noise
Executor: ricky-cli
Kind: regression
Tags: generation, target-files, parser, local
Human Review: false

### Message
The parser must suppress two-segment prose tokens that have no extension and no recognized leading directory (e.g. `base/head`, `my-org/my-repo`) so they are not captured as target files.

### Mock
cwd: temp
specFileContent: # Spec\n\nSend the PR number, base/head SHA, and the user/account pair to MSD. Then update `packages/web/app/api/v1/workflows/run/route.ts`.\n
argv: --mode local --spec-file {{specFile}} --no-run --json --no-workforce-persona

### Deterministic Checks
ok: true
contentIncludes:
- "target_files":
- packages/web/app/api/v1/workflows/run/route.ts
forbidPhrases:
- "\"base/head\""
- "\"user/account\""
- TypeError
- ReferenceError
maxToolCalls: 1

### Must
- Keep real backticked paths in `target_files`.
- Drop two-segment prose tokens that look like noise.

### Must Not
- Capture human-readable phrases as file paths.
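
A minimal sketch of the suppression rule described in this case: a token counts as noise when it has exactly two segments, no file extension, and no recognized leading directory. The `KNOWN_ROOTS` set and the name `is_prose_noise` are assumptions for illustration; the actual list of recognized directories is not shown in this change.

```python
import posixpath

# Assumption: an illustrative allowlist, not the parser's real one.
KNOWN_ROOTS = {"packages", "src", "tests", "docs", "evals", "workflows"}


def is_prose_noise(token: str) -> bool:
    """True for two-segment tokens with no extension and no known root."""
    segments = token.strip("/").split("/")
    if len(segments) != 2:
        return False
    has_extension = bool(posixpath.splitext(segments[-1])[1])
    has_known_root = segments[0] in KNOWN_ROOTS
    return not has_extension and not has_known_root
```

Longer paths pass through untouched, so the filter only affects short slash-separated phrases of the kind that show up in ordinary prose.
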
