From eb92b9d75fa8841746c6c7e81c0c86046c91d466 Mon Sep 17 00:00:00 2001
From: Christopher Tso
Date: Tue, 17 Mar 2026 23:50:07 +0000
Subject: [PATCH] =?UTF-8?q?docs:=20rename=20assert=20=E2=86=92=20assertion?=
 =?UTF-8?q?s=20across=20all=20documentation?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Aligns documentation with the assert: → assertions: YAML key rename
completed in PR #604. Updates prose references, YAML examples, table
entries, SDK code samples, skill docs, and agent prompts across 17
files.

Co-Authored-By: Claude Opus 4.6
---
 README.md                                          |  2 +-
 apps/cli/README.md                                 |  2 +-
 .../content/docs/evaluation/eval-cases.mdx         | 38 +++++++++----------
 .../content/docs/evaluation/eval-files.mdx         | 12 +++---
 .../src/content/docs/evaluation/examples.mdx       |  2 +-
 .../src/content/docs/evaluation/rubrics.mdx        |  4 +-
 .../content/docs/evaluation/running-evals.mdx      |  2 +-
 apps/web/src/content/docs/evaluation/sdk.mdx       |  2 +-
 .../src/content/docs/evaluators/composite.mdx      |  4 +-
 .../docs/evaluators/custom-evaluators.mdx          |  6 +--
 .../content/docs/evaluators/llm-graders.mdx        |  6 +--
 .../docs/guides/agent-skills-evals.mdx             |  2 +-
 .../src/content/docs/guides/human-review.mdx       |  2 +-
 apps/web/src/content/docs/tools/convert.mdx        |  2 +-
 plugins/agentv-dev/agents/eval-analyzer.md         |  2 +-
 .../skills/agentv-eval-analyzer/SKILL.md           |  2 +-
 .../skills/agentv-eval-writer/SKILL.md             |  8 ++--
 17 files changed, 49 insertions(+), 49 deletions(-)

diff --git a/README.md b/README.md
index 887765a98..0a160be47 100644
--- a/README.md
+++ b/README.md
@@ -309,7 +309,7 @@ const { results, summary } = await evaluate({
     {
       id: 'greeting',
       input: 'Say hello',
-      assert: [{ type: 'contains', value: 'Hello' }],
+      assertions: [{ type: 'contains', value: 'Hello' }],
     },
   ],
 });
diff --git a/apps/cli/README.md b/apps/cli/README.md
index 887765a98..0a160be47 100644
--- a/apps/cli/README.md
+++ b/apps/cli/README.md
@@ -309,7 +309,7 @@ const { results, summary } = await evaluate({
     {
       id: 'greeting',
       input: 'Say hello',
-      assert: [{ type: 'contains', value: 'Hello' }],
+      assertions: [{ type: 'contains', value: 'Hello' }],
     },
   ],
 });
diff --git a/apps/web/src/content/docs/evaluation/eval-cases.mdx b/apps/web/src/content/docs/evaluation/eval-cases.mdx
index a6ca3c705..ae24420c2 100644
--- a/apps/web/src/content/docs/evaluation/eval-cases.mdx
+++ b/apps/web/src/content/docs/evaluation/eval-cases.mdx
@@ -31,7 +31,7 @@ tests:
 | `workspace` | No | Per-case workspace config (overrides suite-level) |
 | `metadata` | No | Arbitrary key-value pairs passed to evaluators and workspace scripts |
 | `rubrics` | No | Structured evaluation criteria |
-| `assert` | No | Per-test evaluators |
+| `assertions` | No | Per-test evaluators |
 
 ## Input
 
@@ -87,7 +87,7 @@ tests:
     prompt: ./graders/depth.md
 ```
 
-Per-case `assert` evaluators are **merged** with root-level `assert` evaluators — test-specific evaluators run first, then root-level defaults are appended. To opt out of root-level defaults for a specific test, set `execution.skip_defaults: true`:
+Per-case `assertions` evaluators are **merged** with root-level `assertions` evaluators — test-specific evaluators run first, then root-level defaults are appended. To opt out of root-level defaults for a specific test, set `execution.skip_defaults: true`:
 
 ```yaml
 assertions:
@@ -99,7 +99,7 @@ tests:
   - id: normal-case
     criteria: Returns correct answer
     input: What is 2+2?
-    # Gets latency_check from root-level assert
+    # Gets latency_check from root-level assertions
 
   - id: special-case
     criteria: Handles edge case
@@ -161,7 +161,7 @@ The `metadata` field is included in the stdin JSON passed to lifecycle commands
 
 ## Per-Test Assertions
 
-The `assert` field defines evaluators directly on a test. It supports both deterministic assertion types and LLM-based rubric evaluation.
+The `assertions` field defines evaluators directly on a test. It supports both deterministic assertion types and LLM-based rubric evaluation.
 
 ### Deterministic Assertions
 
@@ -217,7 +217,7 @@ tests:
 
 ### Required Gates
 
-Any evaluator in `assert` can be marked as `required`. When a required evaluator fails, the overall test verdict is `fail` regardless of the aggregate score.
+Any evaluator in `assertions` can be marked as `required`. When a required evaluator fails, the overall test verdict is `fail` regardless of the aggregate score.
 
 | Value | Behavior |
 |-------|----------|
@@ -239,39 +239,39 @@ assertions:
 
 Required gates are evaluated after all evaluators run. If any required evaluator falls below its threshold, the verdict is forced to `fail`.
 
-### Assert Merge Behavior
+### Assertions Merge Behavior
 
-`assert` can be defined at both suite and test levels:
+`assertions` can be defined at both suite and test levels:
 
-- Per-test `assert` evaluators run first.
-- Suite-level `assert` evaluators are appended automatically.
+- Per-test `assertions` evaluators run first.
+- Suite-level `assertions` evaluators are appended automatically.
 - Set `execution.skip_defaults: true` on a test to skip suite-level defaults.
 
-## How `criteria` and `assert` Interact
+## How `criteria` and `assertions` Interact
 
-The `criteria` field is a **data field** that describes what the response should accomplish. It is not an evaluator itself — how it gets used depends on whether `assert` is present.
+The `criteria` field is a **data field** that describes what the response should accomplish. It is not an evaluator itself — how it gets used depends on whether `assertions` is present.
 
-### No `assert` — implicit LLM grader
+### No `assertions` — implicit LLM grader
 
-When a test has no `assert` field, a default `llm-grader` evaluator runs automatically and uses `criteria` as the evaluation prompt:
+When a test has no `assertions` field, a default `llm-grader` evaluator runs automatically and uses `criteria` as the evaluation prompt:
 
 ```yaml
 tests:
   - id: simple-eval
     criteria: Assistant correctly explains the bug and proposes a fix
     input: "Debug this function..."
-    # No assert → default llm-grader evaluates against criteria
+    # No assertions → default llm-grader evaluates against criteria
 ```
 
-### `assert` present — explicit evaluators only
+### `assertions` present — explicit evaluators only
 
-When `assert` is defined, only the declared evaluators run. No implicit grader is added. Graders that are declared (such as `llm-grader`, `code-grader`, or `rubrics`) receive `criteria` as input automatically.
+When `assertions` is defined, only the declared evaluators run. No implicit grader is added. Graders that are declared (such as `llm-grader`, `code-grader`, or `rubrics`) receive `criteria` as input automatically.
 
-If `assert` contains only deterministic evaluators (like `contains` or `regex`), the `criteria` field is not evaluated and a warning is emitted:
+If `assertions` contains only deterministic evaluators (like `contains` or `regex`), the `criteria` field is not evaluated and a warning is emitted:
 
 ```
-Warning: Test 'my-test': criteria is defined but no evaluator in assert
-will evaluate it. Add 'type: llm-grader' to assert, or remove criteria
+Warning: Test 'my-test': criteria is defined but no evaluator in assertions
+will evaluate it. Add 'type: llm-grader' to assertions, or remove criteria
 if it is documentation-only.
 ```
diff --git a/apps/web/src/content/docs/evaluation/eval-files.mdx b/apps/web/src/content/docs/evaluation/eval-files.mdx
index 4e9c010e5..0869163e9 100644
--- a/apps/web/src/content/docs/evaluation/eval-files.mdx
+++ b/apps/web/src/content/docs/evaluation/eval-files.mdx
@@ -37,7 +37,7 @@ tests:
 | `execution` | Default execution config (`target`, `fail_on_error`, etc.) |
 | `workspace` | Suite-level workspace config — inline object or string path to an [external workspace file](/guides/workspace-pool/#external-workspace-config) |
 | `tests` | Array of individual tests, or a string path to an external file |
-| `assert` | Suite-level evaluators appended to each test unless `execution.skip_defaults: true` is set on the test |
+| `assertions` | Suite-level evaluators appended to each test unless `execution.skip_defaults: true` is set on the test |
 | `input` | Suite-level input messages prepended to each test's input unless `execution.skip_defaults: true` is set on the test |
 
 ### Metadata Fields
@@ -70,9 +70,9 @@ tests:
     input: Screen "Acme Corp" against denied parties list
 ```
 
-### Suite-level Assert
+### Suite-level Assertions
 
-The `assert` field is the canonical way to define suite-level evaluators. Suite-level assertions are appended to every test's evaluators unless a test sets `execution.skip_defaults: true`.
+The `assertions` field is the canonical way to define suite-level evaluators. Suite-level assertions are appended to every test's evaluators unless a test sets `execution.skip_defaults: true`.
 
 ```yaml
 description: API response validation
@@ -88,11 +88,11 @@ tests:
     input: Check API health
 ```
 
-`assert` supports all evaluator types, including deterministic assertion types (`contains`, `regex`, `is_json`, `equals`) and `rubrics`. See [Tests](/evaluation/eval-cases/#per-test-assertions) for per-test assert usage.
+`assertions` supports all evaluator types, including deterministic assertion types (`contains`, `regex`, `is_json`, `equals`) and `rubrics`. See [Tests](/evaluation/eval-cases/#per-test-assertions) for per-test assertions usage.
 
 ### Suite-level Input
 
-The `input` field defines messages that are **prepended** to every test's input. This avoids repeating the same prompt or system context in each test case — following the same pattern as suite-level `assert`.
+The `input` field defines messages that are **prepended** to every test's input. This avoids repeating the same prompt or system context in each test case — following the same pattern as suite-level `assertions`.
 
 ```yaml
 description: Travel assistant evaluation
@@ -119,7 +119,7 @@ Suite-level `input` accepts the same formats as test-level `input`:
 - **String** — wrapped as `[{ role: "user", content: "..." }]`
 - **Message array** — used as-is, including file references
 
-To opt out for a specific test, set `execution.skip_defaults: true` (same flag that skips suite-level `assert`).
+To opt out for a specific test, set `execution.skip_defaults: true` (same flag that skips suite-level `assertions`).
 
 ### Suite-level Input Files
diff --git a/apps/web/src/content/docs/evaluation/examples.mdx b/apps/web/src/content/docs/evaluation/examples.mdx
index 655955418..d7c33e7e6 100644
--- a/apps/web/src/content/docs/evaluation/examples.mdx
+++ b/apps/web/src/content/docs/evaluation/examples.mdx
@@ -343,7 +343,7 @@ tests:
 
 ## Suite-level Input
 
-Share a common prompt or system instruction across all tests. Suite-level `input` messages are prepended to each test's input — like suite-level `assert` for evaluators:
+Share a common prompt or system instruction across all tests. Suite-level `input` messages are prepended to each test's input — like suite-level `assertions` for evaluators:
 
 ```yaml
 description: Travel assistant evaluation
diff --git a/apps/web/src/content/docs/evaluation/rubrics.mdx b/apps/web/src/content/docs/evaluation/rubrics.mdx
index 362a50d00..6f03d7c8c 100644
--- a/apps/web/src/content/docs/evaluation/rubrics.mdx
+++ b/apps/web/src/content/docs/evaluation/rubrics.mdx
@@ -5,11 +5,11 @@ sidebar:
   order: 3
 ---
 
-Rubrics are defined with `assert` entries and support binary checklist grading and score-range analytic grading.
+Rubrics are defined with `assertions` entries and support binary checklist grading and score-range analytic grading.
 
 ## Basic Usage
 
-The simplest form — list plain strings in `assert` and each one becomes a required criterion:
+The simplest form — list plain strings in `assertions` and each one becomes a required criterion:
 
 ```yaml
 tests:
diff --git a/apps/web/src/content/docs/evaluation/running-evals.mdx b/apps/web/src/content/docs/evaluation/running-evals.mdx
index 924c7ded3..9a565cfbd 100644
--- a/apps/web/src/content/docs/evaluation/running-evals.mdx
+++ b/apps/web/src/content/docs/evaluation/running-evals.mdx
@@ -262,7 +262,7 @@ The `--file` option reads a JSON file with `{ "output": "...", "input": "..." }`
 
 **Exit codes:** 0 if score >= 0.5 (pass), 1 if score < 0.5 (fail).
 
-This is the same interface that agent-orchestrated evals use — the EVAL.yaml transpiler emits `assert` instructions for code graders so external grading agents can execute them directly.
+This is the same interface that agent-orchestrated evals use — the EVAL.yaml transpiler emits `assertions` instructions for code graders so external grading agents can execute them directly.
 
 ## Agent-Orchestrated Evals
 
diff --git a/apps/web/src/content/docs/evaluation/sdk.mdx b/apps/web/src/content/docs/evaluation/sdk.mdx
index e52c0862c..f8fa2a139 100644
--- a/apps/web/src/content/docs/evaluation/sdk.mdx
+++ b/apps/web/src/content/docs/evaluation/sdk.mdx
@@ -105,7 +105,7 @@ const { results, summary } = await evaluate({
     {
       id: 'greeting',
       input: 'Say hello',
-      assert: [{ type: 'contains', value: 'Hello' }],
+      assertions: [{ type: 'contains', value: 'Hello' }],
     },
   ],
 });
diff --git a/apps/web/src/content/docs/evaluators/composite.mdx b/apps/web/src/content/docs/evaluators/composite.mdx
index 668e27ddb..9f7340a41 100644
--- a/apps/web/src/content/docs/evaluators/composite.mdx
+++ b/apps/web/src/content/docs/evaluators/composite.mdx
@@ -30,9 +30,9 @@ assertions:
 ```
 
 Each sub-evaluator runs independently, then the aggregator combines their results.
-Use `assert` for composite members. `evaluators` is still accepted for backward compatibility.
+Use `assertions` for composite members. `evaluators` is still accepted for backward compatibility.
 
-If you only need weighted-average aggregation, a plain test-level `assert` list already computes a weighted mean across evaluators. Use `composite` when you need a custom aggregation strategy (`threshold`, `code_grader`, `llm_grader`) or nested evaluator groups.
+If you only need weighted-average aggregation, a plain test-level `assertions` list already computes a weighted mean across evaluators. Use `composite` when you need a custom aggregation strategy (`threshold`, `code_grader`, `llm_grader`) or nested evaluator groups.
 
 ## Aggregator Types
 
diff --git a/apps/web/src/content/docs/evaluators/custom-evaluators.mdx b/apps/web/src/content/docs/evaluators/custom-evaluators.mdx
index 73ee796e9..b7c847acf 100644
--- a/apps/web/src/content/docs/evaluators/custom-evaluators.mdx
+++ b/apps/web/src/content/docs/evaluators/custom-evaluators.mdx
@@ -13,11 +13,11 @@ AgentV supports multiple evaluator types that can be combined for comprehensive
 |------|-------------|----------|
 | `code_grader` | Deterministic command (Python/TS/any) | Exact matching, format validation, programmatic checks |
 | `llm_grader` | LLM-based evaluation with custom prompt | Semantic evaluation, nuance, subjective quality |
-| `rubrics` | Structured rubric evaluator via `assert` | Multi-criterion grading with weights |
+| `rubrics` | Structured rubric evaluator via `assertions` | Multi-criterion grading with weights |
 
 ## Referencing Evaluators
 
-Evaluators are configured using `assert` — either top-level (applies to all tests) or per-test:
+Evaluators are configured using `assertions` — either top-level (applies to all tests) or per-test:
 
 ### Top-Level (Default for All Tests)
 
@@ -72,7 +72,7 @@ tests:
 
 Each evaluator produces its own score. Results appear in `scores[]` in the output JSONL.
 
-For multiple evaluators in `assert`, the test score is the weighted mean:
+For multiple evaluators in `assertions`, the test score is the weighted mean:
 
 ```
 final_score = sum(score_i * weight_i) / sum(weight_i)
diff --git a/apps/web/src/content/docs/evaluators/llm-graders.mdx b/apps/web/src/content/docs/evaluators/llm-graders.mdx
index 246cffacc..2651fb8c2 100644
--- a/apps/web/src/content/docs/evaluators/llm-graders.mdx
+++ b/apps/web/src/content/docs/evaluators/llm-graders.mdx
@@ -9,17 +9,17 @@ LLM graders (also accepts `llm-judge` for backward compatibility) use a language
 
 ## Default Grader
 
-When a test defines `criteria` but has **no `assert` field**, a default `llm-grader` runs automatically. The built-in prompt evaluates the response against your `criteria` and `expected_output`:
+When a test defines `criteria` but has **no `assertions` field**, a default `llm-grader` runs automatically. The built-in prompt evaluates the response against your `criteria` and `expected_output`:
 
 ```yaml
 tests:
   - id: simple-eval
     criteria: Correctly explains the bug and proposes a fix
     input: "Debug this function..."
-    # No assert needed — default llm-grader evaluates against criteria
+    # No assertions needed — default llm-grader evaluates against criteria
 ```
 
-When `assert` **is** present, no default grader is added. To use an LLM grader alongside other evaluators, declare it explicitly. See [How criteria and assert interact](/evaluation/eval-cases/#how-criteria-and-assert-interact).
+When `assertions` **is** present, no default grader is added. To use an LLM grader alongside other evaluators, declare it explicitly. See [How criteria and assertions interact](/evaluation/eval-cases/#how-criteria-and-assertions-interact).
 
 ## Configuration
 
diff --git a/apps/web/src/content/docs/guides/agent-skills-evals.mdx b/apps/web/src/content/docs/guides/agent-skills-evals.mdx
index 6954360cc..a288711b2 100644
--- a/apps/web/src/content/docs/guides/agent-skills-evals.mdx
+++ b/apps/web/src/content/docs/guides/agent-skills-evals.mdx
@@ -56,7 +56,7 @@ When AgentV loads `evals.json`, it promotes fields to its internal representatio
 |---|---|---|
 | `prompt` | `input` | Wrapped as `[{role: "user", content: prompt}]` |
 | `expected_output` | `expected_output` + `criteria` | Used as reference answer and evaluation criteria |
-| `assertions[]` | `assert[]` | Each string becomes `{type: llm-grader, prompt: text}` |
+| `assertions[]` | `assertions[]` | Each string becomes `{type: llm-grader, prompt: text}` |
 | `files[]` | `file_paths` | Resolved relative to evals.json, copied into workspace |
 | `skill_name` | `metadata.skill_name` | Carried as metadata |
 | `id` (number) | `id` (string) | Converted via `String(id)` |
diff --git a/apps/web/src/content/docs/guides/human-review.mdx b/apps/web/src/content/docs/guides/human-review.mdx
index d8da87701..cff452ff9 100644
--- a/apps/web/src/content/docs/guides/human-review.mdx
+++ b/apps/web/src/content/docs/guides/human-review.mdx
@@ -147,7 +147,7 @@ For workspace evaluations with multiple evaluators (code graders, LLM graders, t
 }
 ```
 
-Keys use the format `evaluator-type:evaluator-name` to match the evaluators defined in `assert` blocks.
+Keys use the format `evaluator-type:evaluator-name` to match the evaluators defined in `assertions` blocks.
 
 ## Storing feedback across iterations
 
diff --git a/apps/web/src/content/docs/tools/convert.mdx b/apps/web/src/content/docs/tools/convert.mdx
index 417d0fe4c..aeac6640f 100644
--- a/apps/web/src/content/docs/tools/convert.mdx
+++ b/apps/web/src/content/docs/tools/convert.mdx
@@ -35,7 +35,7 @@ Converts an [Agent Skills `evals.json`](/guides/agent-skills-evals) file into an
 
 - Maps `prompt` → `input` message array
 - Maps `expected_output` → `expected_output`
-- Maps `assertions` → `assert` evaluators (llm-grader)
+- Maps `assertions` → `assertions` evaluators (llm-grader)
 - Resolves `files[]` paths relative to the evals.json directory
 - Adds TODO comments for AgentV-specific features (workspace setup, code graders, rubrics)
 
diff --git a/plugins/agentv-dev/agents/eval-analyzer.md b/plugins/agentv-dev/agents/eval-analyzer.md
index 31660128e..ad84ff377 100644
--- a/plugins/agentv-dev/agents/eval-analyzer.md
+++ b/plugins/agentv-dev/agents/eval-analyzer.md
@@ -48,7 +48,7 @@ For each evaluator entry in `scores` where `type` is `"llm-judge"` or `"rubrics"
 
 ### Step 3: Weak Assertion Detection
 
-Scan the EVAL.yaml `assert` entries (if `eval-path` provided) and the `reasoning` fields in results for weak assertions:
+Scan the EVAL.yaml `assertions` entries (if `eval-path` provided) and the `reasoning` fields in results for weak assertions:
 
 | Weakness | Detection | Improvement |
 |----------|-----------|-------------|
diff --git a/plugins/agentv-dev/skills/agentv-eval-analyzer/SKILL.md b/plugins/agentv-dev/skills/agentv-eval-analyzer/SKILL.md
index d059dc8d3..cf92fda0a 100644
--- a/plugins/agentv-dev/skills/agentv-eval-analyzer/SKILL.md
+++ b/plugins/agentv-dev/skills/agentv-eval-analyzer/SKILL.md
@@ -75,7 +75,7 @@ When results span multiple targets, flags evaluators with > 0.3 score variance a
 The analyzer report includes concrete YAML snippets for each suggestion. To apply:
 
 1. Open the EVAL.yaml referenced in the report
-2. Find the `assert` entry for the flagged evaluator (matched by `name` and `test_id`)
+2. Find the `assertions` entry for the flagged evaluator (matched by `name` and `test_id`)
 3. Replace or supplement the evaluator config with the suggested deterministic assertion
 4. Re-run `agentv eval` to verify the change produces equivalent scores
 
diff --git a/plugins/agentv-dev/skills/agentv-eval-writer/SKILL.md b/plugins/agentv-dev/skills/agentv-eval-writer/SKILL.md
index 5ae6275a3..a32746d6e 100644
--- a/plugins/agentv-dev/skills/agentv-eval-writer/SKILL.md
+++ b/plugins/agentv-dev/skills/agentv-eval-writer/SKILL.md
@@ -35,7 +35,7 @@ agentv prompt eval --input evals.json --test-id 1
 agentv prompt eval --expected-output evals.json --test-id 1
 ```
 
-The converter maps `prompt` → `input`, `expected_output` → `expected_output`, `assertions` → `assert` (llm-grader), and resolves `files[]` paths. The generated YAML includes TODO comments for AgentV features to add (workspace setup, code judges, rubrics, required gates).
+The converter maps `prompt` → `input`, `expected_output` → `expected_output`, `assertions` → `assertions` (llm-grader), and resolves `files[]` paths. The generated YAML includes TODO comments for AgentV features to add (workspace setup, code judges, rubrics, required gates).
 
 If you're running the lifecycle through `agentv-bench`, use `agentv convert` and `agentv prompt eval` directly — the Python scripts in `agentv-bench/scripts/` orchestrate these same commands.
 
@@ -158,7 +158,7 @@ requires:
 
 ## Suite-level Input
 
-Prepend shared input messages to every test (like suite-level `assert`). Avoids repeating the same prompt file in each test:
+Prepend shared input messages to every test (like suite-level `assertions`). Avoids repeating the same prompt file in each test:
 
 ```yaml
 input:
@@ -505,7 +505,7 @@ Binary check: is the output valid JSON?
 LLM-judged structured evaluation with weighted criteria. Criteria items support `id`, `outcome`, `weight`, and `required` fields.
 
 ### rubrics (inline, deprecated)
-Top-level `rubrics:` field is deprecated. Use `type: rubrics` under `assert` instead.
+Top-level `rubrics:` field is deprecated. Use `type: rubrics` under `assertions` instead.
 See `references/rubric-evaluator.md` for score-range mode and scoring formula.
 
 ## Execution Error Tolerance
@@ -607,7 +607,7 @@ const { results, summary } = await evaluate({
     {
       id: 'greeting',
       input: 'Say hello',
-      assert: [{ type: 'contains', value: 'hello' }],
+      assertions: [{ type: 'contains', value: 'hello' }],
     },
   ],
   target: { provider: 'mock_agent' },