diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md
index 3b7fddb6c2..e05da37de0 100644
--- a/.github/PULL_REQUEST_TEMPLATE.md
+++ b/.github/PULL_REQUEST_TEMPLATE.md
@@ -23,3 +23,12 @@ _Specify any issues that can be closed from these changes (e.g. `Closes #233`)._
 ### Screen Recording
 
 _If possible provide screenshots and/or a screen recording of proposed change._
+
+### Harness Validation (Required for Launch-Flow Impact)
+
+_If this PR affects any launch flow, attach harness evidence and release-gate output._
+
+- [ ] I ran `yarn test:wallet-flows` (or targeted flow subset) and reviewed `suite/report.json`.
+- [ ] I checked `suite/release-matrix.md` and confirmed class distribution (`happy-path-pass` / `blocker-or-partial-pass` / `failed`).
+- [ ] I ran `yarn test:wallet-flows:gate` and included output summary.
+- [ ] For critical flows (`FLOW-001,002,005,010,011,013,014,018,019`), I confirmed `happy-path-pass` (or documented explicit exception + owner sign-off).
diff --git a/AGENTS.md b/AGENTS.md
index 7e6b6cec5a..5d5c88d7e6 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -31,3 +31,8 @@
 - Seed scripts (Substrate dev):
   - `yarn script:setupServices` (create blueprints)
   - `yarn script:setupStaking` (LST/vault/operator staking fixtures)
+
+## Harness runbook
+- Operating spec: `docs/harness-engineering-spec.md`
+- Execution checklist: `docs/harness-engineering-checklist.md`
+- Wallet flow suite usage: `docs/wallet-flow-suite.md`
diff --git a/CLAUDE.md b/CLAUDE.md
index cb7b875609..5b5c327f6e 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -77,6 +77,7 @@ yarn generate:release # Review version bumps and changelog
 - Report status with concrete evidence (commands run, pass/fail, remaining gaps), not vague progress language.
 - For release-readiness tasks, drive to production-grade confidence: strict validation, explicit failure reasons, and concrete remediation steps.
 - Avoid “do you want me to…” phrasing when the expected next step is obvious from context.
+- For launch-flow-impacting changes, follow `docs/harness-engineering-spec.md` and complete `docs/harness-engineering-checklist.md` before requesting merge.
 
 ### Wallet Flow Reliability (agent-browser-driver)
 - Treat wallet E2E as environment-first: do not trust flow results until local chain + indexer + dApp are confirmed on the same network.
diff --git a/docs/harness-engineering-checklist.md b/docs/harness-engineering-checklist.md
new file mode 100644
index 0000000000..2066456824
--- /dev/null
+++ b/docs/harness-engineering-checklist.md
@@ -0,0 +1,51 @@
+# Harness Engineering Checklist
+
+Use this checklist for launch-flow-impacting work.
+
+## Design
+
+- [ ] Flow scope is mapped to `flow_id` values in `docs/launch-readiness-board.csv`.
+- [ ] Acceptance criteria distinguish happy-path vs explicit blocker path.
+- [ ] Critical-flow impact is identified up front.
+- [ ] Flow owner and release/harness owner are assigned in the PR.
+
+## Implementation
+
+- [ ] Criteria updates are route-resilient (canonical-route recheck where needed).
+- [ ] Tx checks include objective signals (tx history delta and/or explicit blocker copy).
+- [ ] New env toggles are documented in `docs/wallet-flow-suite.md`.
+
+## Verification
+
+- [ ] Run suite (full or targeted): `yarn test:wallet-flows`.
+- [ ] Inspect `suite/report.json` for `verified` and `agentSuccess`.
+- [ ] Inspect `suite/release-matrix.md` classification counts.
+- [ ] Run gate: `yarn test:wallet-flows:gate`.
+- [ ] If release strictness is required, run: `yarn test:wallet-flows:gate:strict`.
+- [ ] Confirm matrix artifacts exist (`json`, `csv`, `md`) under suite output.
+
+## Critical Flows
+
+- [ ] `FLOW-001` happy-path-pass
+- [ ] `FLOW-002` happy-path-pass
+- [ ] `FLOW-005` happy-path-pass
+- [ ] `FLOW-010` happy-path-pass
+- [ ] `FLOW-011` happy-path-pass
+- [ ] `FLOW-013` happy-path-pass
+- [ ] `FLOW-014` happy-path-pass
+- [ ] `FLOW-018` happy-path-pass
+- [ ] `FLOW-019` happy-path-pass
+
+## PR Hygiene
+
+- [ ] PR description includes matrix summary (happy/blocker/failed).
+- [ ] Any blocker-or-partial critical flow has explicit exception, owner, and ETA.
+- [ ] Evidence links are included (artifact directory or CI artifact URLs).
+- [ ] If launch-impacting, PR approval includes a release-captain signoff.
+
+## Post-Merge
+
+- [ ] If semantics changed, update `CLAUDE.md` runbook section.
+- [ ] If recurring flake found, add flow id to flaky rerun set in spec.
+- [ ] File a follow-up for fixture/indexer reliability if blocker-pass rate is rising.
+- [ ] Update weekly trend snapshot with this run's matrix totals.
diff --git a/docs/harness-engineering-spec.md b/docs/harness-engineering-spec.md
new file mode 100644
index 0000000000..be503b2d44
--- /dev/null
+++ b/docs/harness-engineering-spec.md
@@ -0,0 +1,167 @@
+# Harness Engineering Operating Spec
+
+Last updated: 2026-03-05
+
+## Why This Exists
+
+This repo has strong momentum but still leaks reliability through:
+- flow verification that can pass without happy-path completion
+- drift between “what docs say” and “what release gates enforce”
+- scattered operational knowledge across AGENTS/CLAUDE/docs/PR threads
+- weak mechanical governance on release evidence quality
+
+This spec defines the senior-level operating model to convert harness work into predictable release outcomes.
+
+## Scope
+
+In scope:
+- launch-critical dApp flows validated by the wallet flow suite
+- release-go/no-go evidence used by maintainers
+- repository process changes that make agent execution more reliable
+
+Out of scope:
+- full native restaking UX (deprioritized)
+- replacing manual signoff for flows that require external non-local actors
+
+## Source Principles
+
+Based on OpenAI Harness Engineering guidance:
+- optimize for stable maps, not giant prompts
+- enforce output quality mechanically (not by intention)
+- classify evidence quality explicitly (not pass/fail only)
+- continuously prune stale knowledge and keep docs compact
+
+Reference:
+- https://openai.com/index/harness-engineering/
+
+## Current Gaps In This Repo
+
+1. `verified` and `agentSuccess` can diverge, but were historically treated as equivalent in go/no-go conversations.
+2. Launch evidence was captured, but not classified into quality tiers for release decisions.
+3. Critical flows did not have hard happy-path enforcement by default.
+4. PR reviews lacked a required harness evidence checklist.
+5. No single script existed to fail release gate when matrix quality degraded.
+6. Harness process details were spread across files without one operating contract.
+
+## Target Operating Model
+
+### 1) Two-Layer Pass Semantics
+
+- `verified=true`:
+  - criteria passed (tx delta OR explicit blocker state)
+- `agentSuccess=true`:
+  - agent completed intended narrative without terminal tool/runtime failure
+
+Both are reported. Never collapse them into one metric.
+
+### 2) Matrix-Based Evidence
+
+Every run produces matrix artifacts:
+- `suite/release-matrix.json`
+- `suite/release-matrix.csv`
+- `suite/release-matrix.md`
+
+Each flow is classified as:
+- `happy-path-pass`
+- `blocker-or-partial-pass`
+- `failed`
+
+### 3) Critical-Flow Strictness
+
+Critical flows require happy-path completion (`agentSuccess=true`) even when `verified=true`.
+
+Default critical set:
+- `FLOW-001`, `FLOW-002`, `FLOW-005`, `FLOW-010`, `FLOW-011`, `FLOW-013`, `FLOW-014`, `FLOW-018`, `FLOW-019`
+
+### 4) Mechanical Gate Script
+
+Release gate is enforced by:
+- `yarn test:wallet-flows:gate`
+
+Script behavior:
+- fails on `failed` rows above threshold
+- fails on missing critical flow rows
+- fails when critical flows are not `happy-path-pass` (unless explicitly overridden)
+
+### 5) PR Governance
+
+PR template requires harness evidence for launch-flow-impacting changes:
+- report artifact review
+- release matrix review
+- gate script output
+- critical flow exceptions explicitly documented
+
+### 6) Ownership And Escalation
+
+- Flow owner: feature owner who changed launch-flow behavior.
+- Harness owner: engineer running/triaging suite output for release cut.
+- Escalation owner: release captain when critical-flow gate fails.
+
+Escalation rules:
+- critical flow failing: block merge to release branch until fixed or exception signed off
+- blocker-or-partial trend worsening for 2 consecutive release cycles: open remediation issue with owner and ETA
+- missing evidence in PR: do not approve launch-impacting changes
+
+### 7) CI Policy
+
+- Pre-merge (required):
+  - lint/type/build
+  - harness gate output for launch-flow-impacting PRs
+- Nightly (required):
+  - full wallet flow suite
+  - matrix trend snapshot committed/attached as artifact
+- Weekly hygiene (required):
+  - rerun known flaky flows
+  - open targeted cleanup PRs for recurring failure patterns
+
+### 8) Definition Of Done (Launch-Flow Changes)
+
+All must be true:
+- code merged with tests/checks passing
+- release matrix generated and attached
+- `failed=0`
+- all critical flows are `happy-path-pass`
+- any non-critical blocker-or-partial rows have owner + ETA + issue link
+- docs updated if semantics/criteria changed
+
+## Required Commands
+
+Run suite:
+- `yarn test:wallet-flows`
+
+Run gate:
+- `yarn test:wallet-flows:gate`
+
+Strict blocker cap:
+- `yarn test:wallet-flows:gate:strict`
+
+## SLOs (Release Quality)
+
+Release candidate targets:
+- `failed = 0`
+- critical flows: all `happy-path-pass`
+- blocker/partial flows: explicitly justified, tracked, and owner-assigned
+
+Escalation:
+- any critical regression blocks merge to release branch
+- any blocker growth trend over 2 consecutive release cycles requires remediation plan
+
+## Change Management
+
+When flow criteria are modified:
+1. rerun impacted flow IDs
+2. rerun known flaky set (`FLOW-007`, `FLOW-013`, `FLOW-016`)
+3. update docs if semantics changed
+4. include before/after matrix summary in PR
+
+## 30/60 Day Rollout
+
+Within 30 days:
+1. enforce PR harness validation section for launch-impacting PRs
+2. require gate output in release candidate PR descriptions
+3. publish weekly matrix trend summary
+
+Within 60 days:
+1. add nightly suite + gate CI job
+2. add auto-generated matrix trend dashboard doc
+3. codify recurring cleanup cadence as a standing maintenance task
diff --git a/docs/wallet-flow-suite.md b/docs/wallet-flow-suite.md
index 37dee03a7f..5b06649be2 100644
--- a/docs/wallet-flow-suite.md
+++ b/docs/wallet-flow-suite.md
@@ -27,6 +27,8 @@ Optional wallet env vars:
 - `AGENT_WALLET_USER_DATA_DIR=/abs/path/to/.agent-wallet-profile`
 - `AGENT_STRICT_WALLET_PREFLIGHT=false` to allow non-blocking preflight (default is strict/fail-closed)
 - `AGENT_WALLET_ALLOW_HEADLESS=true` to force wallet runs in headless mode (default is headful for extension stability)
+- `AGENT_REQUIRE_AGENT_SUCCESS=true` to require agent narrative success for all flows
+- `AGENT_REQUIRE_AGENT_SUCCESS_FLOWS=FLOW-001,FLOW-002,...` to enforce agent-success gate for specific flows (defaults to critical tx flows)
 
 Notes:
 
@@ -57,9 +59,11 @@ Notes:
 
 - Default pass requires:
   - `verified=true` (all declared criteria pass)
-- Optional dual-gate mode (`AGENT_REQUIRE_AGENT_SUCCESS=true`) also requires:
-  - `agentSuccess=true`
-- Flow dependencies are expanded automatically (for example `FLOW-012` includes `FLOW-010`, `FLOW-016` includes `FLOW-013`).
+- Critical-flow dual gate is enabled by default for:
+  - `FLOW-001`, `FLOW-002`, `FLOW-005`, `FLOW-010`, `FLOW-011`, `FLOW-013`, `FLOW-014`, `FLOW-018`, `FLOW-019`
+  - these flows require both `verified=true` and `agentSuccess=true` unless overridden via `AGENT_REQUIRE_AGENT_SUCCESS_FLOWS`
+- Global strict mode (`AGENT_REQUIRE_AGENT_SUCCESS=true`) requires `agentSuccess=true` for every flow.
+- Flow dependencies are expanded automatically when defined in case metadata.
 - `tx-outcome` flows pass when either:
   - a new terminal transaction status (`finalized` or `failed`) is observed in current-run `tx-history`, or
   - an explicit non-actionable blocker state is visible (permissions, missing wallet dependency, empty inventory, etc.).
@@ -94,5 +98,10 @@ Notes:
 ## Artifacts and Exit Criteria
 
 - Artifacts are written to `agent-results/wallet-flows/` by default.
+- Runner also writes release matrix artifacts under `agent-results/.../suite/`:
+  - `release-matrix.json`
+  - `release-matrix.csv`
+  - `release-matrix.md`
+  - classification: `happy-path-pass`, `blocker-or-partial-pass`, `failed`
 - Runner exits non-zero when any case fails or is skipped.
 - Use generated report artifacts plus tx hashes/request ids as launch sign-off evidence.
diff --git a/package.json b/package.json
index cebf8dc338..e53bb0c25f 100644
--- a/package.json
+++ b/package.json
@@ -43,7 +43,9 @@
     "script:setupStaking": "bun scripts/setupStaking.ts",
     "test:wallet-flows": "node ./scripts/agent-browser/run-wallet-flow-suite.mjs",
     "test:wallet-flows:list": "node ./scripts/agent-browser/run-wallet-flow-suite.mjs --list",
-    "test:wallet-flows:docker": "bash ./scripts/agent-browser/run-wallet-flow-suite-docker.sh"
+    "test:wallet-flows:docker": "bash ./scripts/agent-browser/run-wallet-flow-suite-docker.sh",
+    "test:wallet-flows:gate": "node ./scripts/agent-browser/check-release-gate.mjs",
+    "test:wallet-flows:gate:strict": "node ./scripts/agent-browser/check-release-gate.mjs --max-blocker-partial 0"
   },
   "resolutions": {
     "@polkadot/api": "^13.2.1",
diff --git a/scripts/agent-browser/check-release-gate.mjs b/scripts/agent-browser/check-release-gate.mjs
new file mode 100644
index 0000000000..f3d374c3be
--- /dev/null
+++ b/scripts/agent-browser/check-release-gate.mjs
@@ -0,0 +1,173 @@
+#!/usr/bin/env node
+
+import fs from 'node:fs';
+import path from 'node:path';
+import { parseArgs } from 'node:util';
+
+const DEFAULT_CRITICAL_FLOW_IDS = [
+  'FLOW-001',
+  'FLOW-002',
+  'FLOW-005',
+  'FLOW-010',
+  'FLOW-011',
+  'FLOW-013',
+  'FLOW-014',
+  'FLOW-018',
+  'FLOW-019',
+];
+
+const parseNonNegativeInteger = (value, label) => {
+  const parsed = Number(value);
+  if (!Number.isFinite(parsed) || !Number.isInteger(parsed) || parsed < 0) {
+    throw new Error(`${label} must be a non-negative integer. Received: ${value}`);
+  }
+  return parsed;
+};
+
+const parseFlowList = (value, fallback) => {
+  if (value === undefined) {
+    return [...fallback];
+  }
+
+  const normalized = String(value).trim().toLowerCase();
+  if (normalized === '' || normalized === 'none' || normalized === 'off') {
+    return [];
+  }
+
+  return [...new Set(String(value).split(/[,\s]+/).map((entry) => entry.trim()))]
+    .filter(Boolean)
+    .map((entry) => entry.toUpperCase());
+};
+
+const findLatestReleaseMatrix = (rootDir) => {
+  const agentResultsDir = path.join(rootDir, 'agent-results');
+  if (!fs.existsSync(agentResultsDir)) {
+    return null;
+  }
+
+  const candidates = fs
+    .readdirSync(agentResultsDir, { withFileTypes: true })
+    .filter((entry) => entry.isDirectory())
+    .map((entry) => ({
+      dir: path.join(agentResultsDir, entry.name),
+      matrix: path.join(agentResultsDir, entry.name, 'suite', 'release-matrix.json'),
+    }))
+    .filter((entry) => fs.existsSync(entry.matrix))
+    .map((entry) => ({
+      ...entry,
+      mtimeMs: fs.statSync(entry.matrix).mtimeMs,
+    }))
+    .sort((a, b) => b.mtimeMs - a.mtimeMs);
+
+  return candidates[0]?.matrix ?? null;
+};
+
+const argv = parseArgs({
+  args: process.argv.slice(2),
+  options: {
+    matrix: { type: 'string' },
+    'critical-flows': { type: 'string' },
+    'max-failed': { type: 'string' },
+    'max-blocker-partial': { type: 'string' },
+    'allow-critical-blocker': { type: 'boolean' },
+    help: { type: 'boolean', short: 'h' },
+  },
+  strict: true,
+}).values;
+
+if (argv.help) {
+  console.log(`\
+Usage:
+  node scripts/agent-browser/check-release-gate.mjs [options]
+
+Options:
+  --matrix                  Path to suite/release-matrix.json (defaults to latest under agent-results/)
+  --critical-flows          Comma-separated critical flow IDs (default: ${DEFAULT_CRITICAL_FLOW_IDS.join(',')})
+  --max-failed              Max allowed failed rows (default: 0)
+  --max-blocker-partial     Max allowed blocker-or-partial rows (default: unlimited)
+  --allow-critical-blocker  Allow critical flows to pass as blocker-or-partial (default: false)
+  -h, --help                Show help
+`);
+  process.exit(0);
+}
+
+const rootDir = process.cwd();
+// path.resolve('') returns the cwd (which exists), so resolve only a real candidate.
+const matrixCandidate = argv.matrix ?? findLatestReleaseMatrix(rootDir);
+const matrixPath = matrixCandidate ? path.resolve(matrixCandidate) : null;
+if (!matrixPath || !fs.existsSync(matrixPath)) {
+  throw new Error(
+    'release-matrix.json not found. Provide --matrix or run wallet flow suite first.',
+  );
+}
+
+const matrix = JSON.parse(fs.readFileSync(matrixPath, 'utf8'));
+const rows = Array.isArray(matrix.rows) ? matrix.rows : [];
+if (rows.length === 0) {
+  throw new Error(`No rows found in release matrix: ${matrixPath}`);
+}
+
+const criticalFlows = new Set(
+  parseFlowList(argv['critical-flows'], DEFAULT_CRITICAL_FLOW_IDS),
+);
+const maxFailed = parseNonNegativeInteger(argv['max-failed'] ?? '0', '--max-failed');
+const maxBlockerPartial =
+  argv['max-blocker-partial'] === undefined
+    ? Number.POSITIVE_INFINITY
+    : parseNonNegativeInteger(argv['max-blocker-partial'], '--max-blocker-partial');
+const allowCriticalBlocker = Boolean(argv['allow-critical-blocker'] ?? false);
+
+const failedRows = rows.filter((row) => row.classification === 'failed');
+const blockerRows = rows.filter(
+  (row) => row.classification === 'blocker-or-partial-pass',
+);
+const criticalRows = rows.filter((row) => criticalFlows.has(row.flowId));
+const missingCriticalFlows = [...criticalFlows].filter(
+  (flowId) => !criticalRows.some((row) => row.flowId === flowId),
+);
+
+const criticalViolations = allowCriticalBlocker
+  ? []
+  : criticalRows.filter((row) => row.classification !== 'happy-path-pass');
+
+console.log('[release-gate] matrix:', matrixPath);
+console.log(
+  `[release-gate] summary: total=${rows.length} happy=${rows.filter((row) => row.classification === 'happy-path-pass').length} blocker=${blockerRows.length} failed=${failedRows.length}`,
+);
+console.log(
+  `[release-gate] critical flows: ${[...criticalFlows].join(', ') || '(none)'}`,
+);
+
+const reasons = [];
+if (failedRows.length > maxFailed) {
+  reasons.push(
+    `Failed rows ${failedRows.length} exceeded max-failed=${maxFailed}.`,
+  );
+}
+if (blockerRows.length > maxBlockerPartial) {
+  reasons.push(
+    `Blocker-or-partial rows ${blockerRows.length} exceeded max-blocker-partial=${maxBlockerPartial}.`,
+  );
+}
+if (missingCriticalFlows.length > 0) {
+  reasons.push(
+    `Missing critical flows in matrix: ${missingCriticalFlows.join(', ')}.`,
+  );
+}
+if (criticalViolations.length > 0) {
+  reasons.push(
+    `Critical flows are not happy-path-pass: ${criticalViolations
+      .map((row) => `${row.flowId}(${row.classification})`)
+      .join(', ')}.`,
+  );
+}
+
+if (reasons.length > 0) {
+  console.error('[release-gate] FAILED');
+  for (const reason of reasons) {
+    console.error(`- ${reason}`);
+  }
+  process.exit(1);
+}
+
+console.log('[release-gate] PASSED');
diff --git a/scripts/agent-browser/run-wallet-flow-suite.mjs b/scripts/agent-browser/run-wallet-flow-suite.mjs
index 9d015e56fb..de7176b523 100755
--- a/scripts/agent-browser/run-wallet-flow-suite.mjs
+++ b/scripts/agent-browser/run-wallet-flow-suite.mjs
@@ -11,6 +11,17 @@ const DEFAULT_CONFIG_PATH = 'agent-browser-driver.config.mjs';
 const LAUNCH_BOARD_PATH = 'docs/launch-readiness-board.csv';
 
 const log = (message) => console.log(`[wallet-flows] ${message}`);
+const DEFAULT_STRICT_AGENT_SUCCESS_FLOW_IDS = [
+  'FLOW-001',
+  'FLOW-002',
+  'FLOW-005',
+  'FLOW-010',
+  'FLOW-011',
+  'FLOW-013',
+  'FLOW-014',
+  'FLOW-018',
+  'FLOW-019',
+];
 const DEFAULT_WALLET_PASSWORD =
   process.env.AGENT_WALLET_PASSWORD ?? 'TangleLocal123!';
 const DEFAULT_WALLET_CHAIN_ID = Number(
@@ -1581,6 +1592,147 @@ const parseBooleanEnv = (value, fallback) => {
   return fallback;
 };
 
+const parseFlowIdList = (value, fallback = []) => {
+  if (value === undefined) {
+    return [...fallback];
+  }
+
+  const normalized = String(value).trim().toLowerCase();
+  if (normalized === '' || normalized === 'none' || normalized === 'off') {
+    return [];
+  }
+
+  return [...new Set(String(value).split(/[,\s]+/).map((entry) => entry.trim()))]
+    .filter(Boolean)
+    .map((entry) => entry.toUpperCase());
+};
+
+const toReleaseMatrixRows = (results, strictAgentSuccessFlowIds) =>
+  results.map((result) => {
+    const flowId = result.testCase?.id ?? 'UNKNOWN';
+    const requiresAgentSuccess =
+      strictAgentSuccessFlowIds.has(flowId) || false;
+    const strictPass = Boolean(
+      result.verified &&
+        (result.agentSuccess || (!requiresAgentSuccess && true)),
+    );
+    const classification = strictPass
+      ? result.agentSuccess
+        ? 'happy-path-pass'
+        : 'blocker-or-partial-pass'
+      : 'failed';
+
+    return {
+      flowId,
+      name: result.testCase?.name ?? '',
+      persona: result.testCase?.category ?? 'uncategorized',
+      tags: (result.testCase?.tags ?? []).join(','),
+      verified: Boolean(result.verified),
+      agentSuccess: Boolean(result.agentSuccess),
+      requiresAgentSuccess,
+      strictPass,
+      classification,
+      verdict: String(result.verdict ?? ''),
+      durationMs: Number(result.durationMs ?? 0),
+    };
+  });
+
+const writeReleaseMatrixArtifacts = (
+  outputDir,
+  suite,
+  strictAgentSuccessFlowIds,
+) => {
+  const suiteDir = path.join(outputDir, 'suite');
+  fs.mkdirSync(suiteDir, { recursive: true });
+
+  const rows = toReleaseMatrixRows(suite.results ?? [], strictAgentSuccessFlowIds);
+  const summary = {
+    total: rows.length,
+    happyPathPass: rows.filter(
+      (row) => row.classification === 'happy-path-pass',
+    ).length,
+    blockerOrPartialPass: rows.filter(
+      (row) => row.classification === 'blocker-or-partial-pass',
+    ).length,
+    failed: rows.filter((row) => row.classification === 'failed').length,
+  };
+
+  const jsonPath = path.join(suiteDir, 'release-matrix.json');
+  fs.writeFileSync(
+    jsonPath,
+    JSON.stringify(
+      {
+        generatedAt: new Date().toISOString(),
+        strictAgentSuccessFlowIds: [...strictAgentSuccessFlowIds],
+        summary,
+        rows,
+      },
+      null,
+      2,
+    ),
+  );
+
+  const csvHeader = [
+    'flow_id',
+    'name',
+    'persona',
+    'classification',
+    'strict_pass',
+    'verified',
+    'agent_success',
+    'requires_agent_success',
+    'duration_ms',
+    'verdict',
+  ];
+  const escapeCsv = (value) => `"${String(value ?? '').replaceAll('"', '""')}"`;
+  const csvBody = rows.map((row) =>
+    [
+      row.flowId,
+      row.name,
+      row.persona,
+      row.classification,
+      row.strictPass,
+      row.verified,
+      row.agentSuccess,
+      row.requiresAgentSuccess,
+      row.durationMs,
+      row.verdict,
+    ]
+      .map(escapeCsv)
+      .join(','),
+  );
+  const csvPath = path.join(suiteDir, 'release-matrix.csv');
+  fs.writeFileSync(csvPath, `${csvHeader.join(',')}\n${csvBody.join('\n')}\n`);
+
+  const mdRows = rows
+    .map(
+      (row) =>
+        `| ${row.flowId} | ${row.persona} | ${row.classification} | ${row.strictPass} | ${row.verified} | ${row.agentSuccess} | ${row.durationMs} |`,
+    )
+    .join('\n');
+  const mdPath = path.join(suiteDir, 'release-matrix.md');
+  fs.writeFileSync(
+    mdPath,
+    [
+      '# Release Matrix',
+      '',
+      `Generated: ${new Date().toISOString()}`,
+      '',
+      `- Total: ${summary.total}`,
+      `- Happy-path pass: ${summary.happyPathPass}`,
+      `- Blocker/partial pass: ${summary.blockerOrPartialPass}`,
+      `- Failed: ${summary.failed}`,
+      '',
+      '| Flow | Persona | Class | Strict Pass | Verified | Agent Success | Duration ms |',
+      '| --- | --- | --- | --- | --- | --- | --- |',
+      mdRows,
+      '',
+    ].join('\n'),
+  );
+
+  return { rows, summary, jsonPath, csvPath, mdPath };
+};
+
 const printCaseList = (cases) => {
   log(`Selected ${cases.length} cases:`);
   for (const testCase of cases) {
@@ -1862,6 +2014,12 @@ const main = async () => {
     process.env.AGENT_REQUIRE_AGENT_SUCCESS,
     false,
   );
+  const strictAgentSuccessFlowIds = new Set(
+    parseFlowIdList(
+      process.env.AGENT_REQUIRE_AGENT_SUCCESS_FLOWS,
+      DEFAULT_STRICT_AGENT_SUCCESS_FLOW_IDS,
+    ),
+  );
   const recycleWalletProfileOnPreflightFailure = parseBooleanEnv(
     process.env.AGENT_WALLET_PREFLIGHT_RECYCLE,
     false,
@@ -1947,7 +2105,9 @@ const main = async () => {
   log(
     `Wallet chain target: id=${runtimeWalletChain.id} hex=${runtimeWalletChain.hex} rpc=${runtimeWalletChain.rpcUrl}`,
   );
-  log(`Verification gate mode: require-agent-success=${requireAgentSuccess}`);
+  log(
+    `Verification gate mode: require-agent-success=${requireAgentSuccess} strict-flow-gate=[${[...strictAgentSuccessFlowIds].join(', ')}]`,
+  );
   for (const warning of launchPlan.warnings ?? []) {
     log(`warning: ${warning}`);
   }
@@ -2248,9 +2408,11 @@ const main = async () => {
       artifactSink: new FilesystemSink(outputDir),
       onTestStart: (testCase) => log(`start ${testCase.id} ${testCase.name}`),
       onTestComplete: (result) => {
+        const requiresAgentSuccessForFlow =
+          requireAgentSuccess || strictAgentSuccessFlowIds.has(result.testCase.id);
         const criteriaPass = Boolean(
           result.verified &&
-            (result.agentSuccess || requireAgentSuccess === false),
+            (result.agentSuccess || requiresAgentSuccessForFlow === false),
         );
         log(
           `${criteriaPass ? 'pass' : 'fail'} ${result.testCase.id} verified=${result.verified} agentSuccess=${result.agentSuccess} verdict=${result.verdict} durationMs=${result.durationMs}`,
@@ -2267,10 +2429,21 @@ const main = async () => {
     runnableCases.map((testCase) => applyCaseRuntimeOverrides(testCase)),
   );
 
+  const releaseMatrix = writeReleaseMatrixArtifacts(
+    outputDir,
+    suite,
+    strictAgentSuccessFlowIds,
+  );
+  log(
+    `Release matrix: happy-path=${releaseMatrix.summary.happyPathPass} blocker-or-partial=${releaseMatrix.summary.blockerOrPartialPass} failed=${releaseMatrix.summary.failed}`,
+  );
+
   const strictPassed = suite.results.filter((result) =>
     Boolean(
       result.verified &&
-        (result.agentSuccess || requireAgentSuccess === false),
+        (result.agentSuccess ||
+          (!requireAgentSuccess &&
+            !strictAgentSuccessFlowIds.has(result.testCase.id))),
     ),
   ).length;
   const strictFailed = suite.results.length - strictPassed;