Merged
2 changes: 1 addition & 1 deletion README.md
Original file line number Diff line number Diff line change
@@ -309,7 +309,7 @@ const { results, summary } = await evaluate({
{
id: 'greeting',
input: 'Say hello',
assert: [{ type: 'contains', value: 'Hello' }],
assertions: [{ type: 'contains', value: 'Hello' }],
},
],
});
2 changes: 1 addition & 1 deletion apps/cli/README.md
@@ -309,7 +309,7 @@ const { results, summary } = await evaluate({
{
id: 'greeting',
input: 'Say hello',
assert: [{ type: 'contains', value: 'Hello' }],
assertions: [{ type: 'contains', value: 'Hello' }],
},
],
});
38 changes: 19 additions & 19 deletions apps/web/src/content/docs/evaluation/eval-cases.mdx
@@ -31,7 +31,7 @@ tests:
| `workspace` | No | Per-case workspace config (overrides suite-level) |
| `metadata` | No | Arbitrary key-value pairs passed to evaluators and workspace scripts |
| `rubrics` | No | Structured evaluation criteria |
| `assert` | No | Per-test evaluators |
| `assertions` | No | Per-test evaluators |

## Input

@@ -87,7 +87,7 @@ tests:
prompt: ./graders/depth.md
```

Per-case `assert` evaluators are **merged** with root-level `assert` evaluators — test-specific evaluators run first, then root-level defaults are appended. To opt out of root-level defaults for a specific test, set `execution.skip_defaults: true`:
Per-case `assertions` evaluators are **merged** with root-level `assertions` evaluators — test-specific evaluators run first, then root-level defaults are appended. To opt out of root-level defaults for a specific test, set `execution.skip_defaults: true`:

```yaml
assertions:
@@ -99,7 +99,7 @@ tests:
- id: normal-case
criteria: Returns correct answer
input: What is 2+2?
# Gets latency_check from root-level assert
# Gets latency_check from root-level assertions

- id: special-case
criteria: Handles edge case
```
@@ -161,7 +161,7 @@ The `metadata` field is included in the stdin JSON passed to lifecycle commands

## Per-Test Assertions

The `assert` field defines evaluators directly on a test. It supports both deterministic assertion types and LLM-based rubric evaluation.
The `assertions` field defines evaluators directly on a test. It supports both deterministic assertion types and LLM-based rubric evaluation.

### Deterministic Assertions

@@ -217,7 +217,7 @@ tests:

### Required Gates

Any evaluator in `assert` can be marked as `required`. When a required evaluator fails, the overall test verdict is `fail` regardless of the aggregate score.
Any evaluator in `assertions` can be marked as `required`. When a required evaluator fails, the overall test verdict is `fail` regardless of the aggregate score.

| Value | Behavior |
|-------|----------|
@@ -239,39 +239,39 @@ assertions:

Required gates are evaluated after all evaluators run. If any required evaluator falls below its threshold, the verdict is forced to `fail`.

### Assert Merge Behavior
### Assertions Merge Behavior

`assert` can be defined at both suite and test levels:
`assertions` can be defined at both suite and test levels:

- Per-test `assert` evaluators run first.
- Suite-level `assert` evaluators are appended automatically.
- Per-test `assertions` evaluators run first.
- Suite-level `assertions` evaluators are appended automatically.
- Set `execution.skip_defaults: true` on a test to skip suite-level defaults.

## How `criteria` and `assert` Interact
## How `criteria` and `assertions` Interact

The `criteria` field is a **data field** that describes what the response should accomplish. It is not an evaluator itself — how it gets used depends on whether `assert` is present.
The `criteria` field is a **data field** that describes what the response should accomplish. It is not an evaluator itself — how it gets used depends on whether `assertions` is present.

### No `assert` — implicit LLM grader
### No `assertions` — implicit LLM grader

When a test has no `assert` field, a default `llm-grader` evaluator runs automatically and uses `criteria` as the evaluation prompt:
When a test has no `assertions` field, a default `llm-grader` evaluator runs automatically and uses `criteria` as the evaluation prompt:

```yaml
tests:
- id: simple-eval
criteria: Assistant correctly explains the bug and proposes a fix
input: "Debug this function..."
# No assert → default llm-grader evaluates against criteria
# No assertions → default llm-grader evaluates against criteria
```

### `assert` present — explicit evaluators only
### `assertions` present — explicit evaluators only

When `assert` is defined, only the declared evaluators run. No implicit grader is added. Graders that are declared (such as `llm-grader`, `code-grader`, or `rubrics`) receive `criteria` as input automatically.
When `assertions` is defined, only the declared evaluators run. No implicit grader is added. Graders that are declared (such as `llm-grader`, `code-grader`, or `rubrics`) receive `criteria` as input automatically.

If `assert` contains only deterministic evaluators (like `contains` or `regex`), the `criteria` field is not evaluated and a warning is emitted:
If `assertions` contains only deterministic evaluators (like `contains` or `regex`), the `criteria` field is not evaluated and a warning is emitted:

```
Warning: Test 'my-test': criteria is defined but no evaluator in assert
will evaluate it. Add 'type: llm-grader' to assert, or remove criteria
Warning: Test 'my-test': criteria is defined but no evaluator in assertions
will evaluate it. Add 'type: llm-grader' to assertions, or remove criteria
if it is documentation-only.
```

12 changes: 6 additions & 6 deletions apps/web/src/content/docs/evaluation/eval-files.mdx
@@ -37,7 +37,7 @@ tests:
| `execution` | Default execution config (`target`, `fail_on_error`, etc.) |
| `workspace` | Suite-level workspace config — inline object or string path to an [external workspace file](/guides/workspace-pool/#external-workspace-config) |
| `tests` | Array of individual tests, or a string path to an external file |
| `assert` | Suite-level evaluators appended to each test unless `execution.skip_defaults: true` is set on the test |
| `assertions` | Suite-level evaluators appended to each test unless `execution.skip_defaults: true` is set on the test |
| `input` | Suite-level input messages prepended to each test's input unless `execution.skip_defaults: true` is set on the test |

### Metadata Fields
@@ -70,9 +70,9 @@ tests:
input: Screen "Acme Corp" against denied parties list
```

### Suite-level Assert
### Suite-level Assertions

The `assert` field is the canonical way to define suite-level evaluators. Suite-level assertions are appended to every test's evaluators unless a test sets `execution.skip_defaults: true`.
The `assertions` field is the canonical way to define suite-level evaluators. Suite-level assertions are appended to every test's evaluators unless a test sets `execution.skip_defaults: true`.

```yaml
description: API response validation
@@ -88,11 +88,11 @@ tests:
input: Check API health
```

`assert` supports all evaluator types, including deterministic assertion types (`contains`, `regex`, `is_json`, `equals`) and `rubrics`. See [Tests](/evaluation/eval-cases/#per-test-assertions) for per-test assert usage.
`assertions` supports all evaluator types, including deterministic assertion types (`contains`, `regex`, `is_json`, `equals`) and `rubrics`. See [Tests](/evaluation/eval-cases/#per-test-assertions) for per-test assertions usage.

### Suite-level Input

The `input` field defines messages that are **prepended** to every test's input. This avoids repeating the same prompt or system context in each test case — following the same pattern as suite-level `assert`.
The `input` field defines messages that are **prepended** to every test's input. This avoids repeating the same prompt or system context in each test case — following the same pattern as suite-level `assertions`.

```yaml
description: Travel assistant evaluation
```
@@ -119,7 +119,7 @@ Suite-level `input` accepts the same formats as test-level `input`:
- **String** — wrapped as `[{ role: "user", content: "..." }]`
- **Message array** — used as-is, including file references

To opt out for a specific test, set `execution.skip_defaults: true` (same flag that skips suite-level `assert`).
To opt out for a specific test, set `execution.skip_defaults: true` (same flag that skips suite-level `assertions`).

### Suite-level Input Files

2 changes: 1 addition & 1 deletion apps/web/src/content/docs/evaluation/examples.mdx
@@ -343,7 +343,7 @@ tests:

## Suite-level Input

Share a common prompt or system instruction across all tests. Suite-level `input` messages are prepended to each test's input — like suite-level `assert` for evaluators:
Share a common prompt or system instruction across all tests. Suite-level `input` messages are prepended to each test's input — like suite-level `assertions` for evaluators:

```yaml
description: Travel assistant evaluation
```
4 changes: 2 additions & 2 deletions apps/web/src/content/docs/evaluation/rubrics.mdx
@@ -5,11 +5,11 @@ sidebar:
order: 3
---

Rubrics are defined with `assert` entries and support binary checklist grading and score-range analytic grading.
Rubrics are defined with `assertions` entries and support binary checklist grading and score-range analytic grading.

## Basic Usage

The simplest form — list plain strings in `assert` and each one becomes a required criterion:
The simplest form — list plain strings in `assertions` and each one becomes a required criterion:

```yaml
tests:
```
2 changes: 1 addition & 1 deletion apps/web/src/content/docs/evaluation/running-evals.mdx
@@ -262,7 +262,7 @@ The `--file` option reads a JSON file with `{ "output": "...", "input": "..." }`

**Exit codes:** 0 if score >= 0.5 (pass), 1 if score < 0.5 (fail).

This is the same interface that agent-orchestrated evals use — the EVAL.yaml transpiler emits `assert` instructions for code graders so external grading agents can execute them directly.
This is the same interface that agent-orchestrated evals use — the EVAL.yaml transpiler emits `assertions` instructions for code graders so external grading agents can execute them directly.

## Agent-Orchestrated Evals

2 changes: 1 addition & 1 deletion apps/web/src/content/docs/evaluation/sdk.mdx
@@ -105,7 +105,7 @@ const { results, summary } = await evaluate({
{
id: 'greeting',
input: 'Say hello',
assert: [{ type: 'contains', value: 'Hello' }],
assertions: [{ type: 'contains', value: 'Hello' }],
},
],
});
4 changes: 2 additions & 2 deletions apps/web/src/content/docs/evaluators/composite.mdx
@@ -30,9 +30,9 @@ assertions:
```

Each sub-evaluator runs independently, then the aggregator combines their results.
Use `assert` for composite members. `evaluators` is still accepted for backward compatibility.
Use `assertions` for composite members. `evaluators` is still accepted for backward compatibility.

If you only need weighted-average aggregation, a plain test-level `assert` list already computes a weighted mean across evaluators. Use `composite` when you need a custom aggregation strategy (`threshold`, `code_grader`, `llm_grader`) or nested evaluator groups.
If you only need weighted-average aggregation, a plain test-level `assertions` list already computes a weighted mean across evaluators. Use `composite` when you need a custom aggregation strategy (`threshold`, `code_grader`, `llm_grader`) or nested evaluator groups.

## Aggregator Types

6 changes: 3 additions & 3 deletions apps/web/src/content/docs/evaluators/custom-evaluators.mdx
@@ -13,11 +13,11 @@ AgentV supports multiple evaluator types that can be combined for comprehensive
|------|-------------|----------|
| `code_grader` | Deterministic command (Python/TS/any) | Exact matching, format validation, programmatic checks |
| `llm_grader` | LLM-based evaluation with custom prompt | Semantic evaluation, nuance, subjective quality |
| `rubrics` | Structured rubric evaluator via `assert` | Multi-criterion grading with weights |
| `rubrics` | Structured rubric evaluator via `assertions` | Multi-criterion grading with weights |

## Referencing Evaluators

Evaluators are configured using `assert` — either top-level (applies to all tests) or per-test:
Evaluators are configured using `assertions` — either top-level (applies to all tests) or per-test:

### Top-Level (Default for All Tests)

@@ -72,7 +72,7 @@ tests:

Each evaluator produces its own score. Results appear in `scores[]` in the output JSONL.

For multiple evaluators in `assert`, the test score is the weighted mean:
For multiple evaluators in `assertions`, the test score is the weighted mean:

```
final_score = sum(score_i * weight_i) / sum(weight_i)
```
6 changes: 3 additions & 3 deletions apps/web/src/content/docs/evaluators/llm-graders.mdx
@@ -9,17 +9,17 @@ LLM graders (also accepts `llm-judge` for backward compatibility) use a language

## Default Grader

When a test defines `criteria` but has **no `assert` field**, a default `llm-grader` runs automatically. The built-in prompt evaluates the response against your `criteria` and `expected_output`:
When a test defines `criteria` but has **no `assertions` field**, a default `llm-grader` runs automatically. The built-in prompt evaluates the response against your `criteria` and `expected_output`:

```yaml
tests:
- id: simple-eval
criteria: Correctly explains the bug and proposes a fix
input: "Debug this function..."
# No assert needed — default llm-grader evaluates against criteria
# No assertions needed — default llm-grader evaluates against criteria
```

When `assert` **is** present, no default grader is added. To use an LLM grader alongside other evaluators, declare it explicitly. See [How criteria and assert interact](/evaluation/eval-cases/#how-criteria-and-assert-interact).
When `assertions` **is** present, no default grader is added. To use an LLM grader alongside other evaluators, declare it explicitly. See [How criteria and assertions interact](/evaluation/eval-cases/#how-criteria-and-assertions-interact).

## Configuration

2 changes: 1 addition & 1 deletion apps/web/src/content/docs/guides/agent-skills-evals.mdx
@@ -56,7 +56,7 @@ When AgentV loads `evals.json`, it promotes fields to its internal representation
|---|---|---|
| `prompt` | `input` | Wrapped as `[{role: "user", content: prompt}]` |
| `expected_output` | `expected_output` + `criteria` | Used as reference answer and evaluation criteria |
| `assertions[]` | `assert[]` | Each string becomes `{type: llm-grader, prompt: text}` |
| `assertions[]` | `assertions[]` | Each string becomes `{type: llm-grader, prompt: text}` |
| `files[]` | `file_paths` | Resolved relative to evals.json, copied into workspace |
| `skill_name` | `metadata.skill_name` | Carried as metadata |
| `id` (number) | `id` (string) | Converted via `String(id)` |
2 changes: 1 addition & 1 deletion apps/web/src/content/docs/guides/human-review.mdx
@@ -147,7 +147,7 @@ For workspace evaluations with multiple evaluators (code graders, LLM graders, t
}
```

Keys use the format `evaluator-type:evaluator-name` to match the evaluators defined in `assert` blocks.
Keys use the format `evaluator-type:evaluator-name` to match the evaluators defined in `assertions` blocks.

## Storing feedback across iterations

2 changes: 1 addition & 1 deletion apps/web/src/content/docs/tools/convert.mdx
@@ -35,7 +35,7 @@ Converts an [Agent Skills `evals.json`](/guides/agent-skills-evals) file into an

- Maps `prompt` → `input` message array
- Maps `expected_output` → `expected_output`
- Maps `assertions` → `assert` evaluators (llm-grader)
- Maps `assertions` → `assertions` evaluators (llm-grader)
- Resolves `files[]` paths relative to the evals.json directory
- Adds TODO comments for AgentV-specific features (workspace setup, code graders, rubrics)

2 changes: 1 addition & 1 deletion plugins/agentv-dev/agents/eval-analyzer.md
@@ -48,7 +48,7 @@ For each evaluator entry in `scores` where `type` is `"llm-judge"` or `"rubrics"`

### Step 3: Weak Assertion Detection

Scan the EVAL.yaml `assert` entries (if `eval-path` provided) and the `reasoning` fields in results for weak assertions:
Scan the EVAL.yaml `assertions` entries (if `eval-path` provided) and the `reasoning` fields in results for weak assertions:

| Weakness | Detection | Improvement |
|----------|-----------|-------------|
2 changes: 1 addition & 1 deletion plugins/agentv-dev/skills/agentv-eval-analyzer/SKILL.md
@@ -75,7 +75,7 @@ When results span multiple targets, flags evaluators with > 0.3 score variance a
The analyzer report includes concrete YAML snippets for each suggestion. To apply:

1. Open the EVAL.yaml referenced in the report
2. Find the `assert` entry for the flagged evaluator (matched by `name` and `test_id`)
2. Find the `assertions` entry for the flagged evaluator (matched by `name` and `test_id`)
3. Replace or supplement the evaluator config with the suggested deterministic assertion
4. Re-run `agentv eval` to verify the change produces equivalent scores

8 changes: 4 additions & 4 deletions plugins/agentv-dev/skills/agentv-eval-writer/SKILL.md
@@ -35,7 +35,7 @@ agentv prompt eval --input evals.json --test-id 1
agentv prompt eval --expected-output evals.json --test-id 1
```

The converter maps `prompt` → `input`, `expected_output` → `expected_output`, `assertions` → `assert` (llm-judge), and resolves `files[]` paths. The generated YAML includes TODO comments for AgentV features to add (workspace setup, code judges, rubrics, required gates).
The converter maps `prompt` → `input`, `expected_output` → `expected_output`, `assertions` → `assertions` (llm-grader), and resolves `files[]` paths. The generated YAML includes TODO comments for AgentV features to add (workspace setup, code judges, rubrics, required gates).

If you're running the lifecycle through `agentv-bench`, use `agentv convert` and `agentv prompt eval` directly — the Python scripts in `agentv-bench/scripts/` orchestrate these same commands.

@@ -158,7 +158,7 @@ requires:

## Suite-level Input

Prepend shared input messages to every test (like suite-level `assert`). Avoids repeating the same prompt file in each test:
Prepend shared input messages to every test (like suite-level `assertions`). Avoids repeating the same prompt file in each test:

```yaml
input:
```
@@ -505,7 +505,7 @@ Binary check: is the output valid JSON?
LLM-judged structured evaluation with weighted criteria. Criteria items support `id`, `outcome`, `weight`, and `required` fields.

### rubrics (inline, deprecated)
Top-level `rubrics:` field is deprecated. Use `type: rubrics` under `assert` instead.
Top-level `rubrics:` field is deprecated. Use `type: rubrics` under `assertions` instead.
See `references/rubric-evaluator.md` for score-range mode and scoring formula.

## Execution Error Tolerance
@@ -607,7 +607,7 @@ const { results, summary } = await evaluate({
{
id: 'greeting',
input: 'Say hello',
assert: [{ type: 'contains', value: 'hello' }],
assertions: [{ type: 'contains', value: 'hello' }],
},
],
target: { provider: 'mock_agent' },