feat(analyst): registry + findings envelope over existing primitives #56
Adds a generic, model-agnostic, transport-agnostic Analyst layer that orchestrates agent-eval's existing analyzers without re-implementing them. One contract, one runner, one persistence path, reusable by VB operator bench, the leaderboard submission pipeline, and the orchestrator on-completion reports surface with the same code.

- `src/analyst/types.ts`: `Analyst` contract, `AnalystFinding` envelope with sha-stable `finding_id`, `AnalystRunInputs` with `inputKind` routing (trace-store | artifact-dir | run-record | judge-input | custom)
- `src/analyst/chat-client.ts`: `ChatClient` abstraction over router | sandbox-sdk | cli-bridge | direct-provider | mock so analyst code never depends on the transport
- `src/analyst/registry.ts`: register/list/run with input routing, per-analyst isolation (one failure does not stop others), budget split, per-analyst telemetry
- `src/analyst/findings-store.ts`: locked JSONL append + `diffFindings` (appeared / disappeared / persisted / changed) keyed by stable id
- `src/analyst/adapters.ts`: five thin lifters wrapping `analyzeTraces`, `MultiLayerVerifier`, `RunCritic`, `JudgeFn`, `SemanticConceptJudge`
- `src/analyst/analyst.test.ts`: 12 tests covering hash stability, registration validation, routing, failure isolation, only/skip, cost attribution, store round-trip, diff semantics, mock transport

Version: 0.28.0
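To make the "sha-stable `finding_id`" idea concrete, here is a minimal sketch of how such an id could be derived. It assumes the id is a truncated SHA-256 over a normalized tuple of fields. `FindingIdBasis`, `normalizeClaim`, and `computeFindingId` are hypothetical names, and the normalization rules and 16-character truncation are assumptions, not the package's actual implementation.

```typescript
import { createHash } from "node:crypto";

// Hypothetical basis for the id: the fields a finding is keyed on.
interface FindingIdBasis {
  analyst_id: string;
  area: string;
  subject: string;
  claim: string;
}

// Assumed normalization: trim, lowercase, collapse whitespace runs,
// so that cosmetic rewording of spacing or case does not change the id.
function normalizeClaim(claim: string): string {
  return claim.trim().toLowerCase().replace(/\s+/g, " ");
}

// Hash a JSON array (fixed field order) so the same basis always
// serializes to the same bytes, then keep a short hex prefix.
function computeFindingId(basis: FindingIdBasis): string {
  const payload = JSON.stringify([
    basis.analyst_id,
    basis.area,
    basis.subject,
    normalizeClaim(basis.claim),
  ]);
  return createHash("sha256").update(payload).digest("hex").slice(0, 16);
}
```

Two runs that report the same claim with different casing or spacing would share an id under this scheme, which is the property a cross-run diff relies on.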
Lifts five reviewer concerns into a small policy surface so consumers
override what they need without changing the registry. All defaults
preserve previous behavior.
- types.ts: `AnalystContext.chat` is now `ChatClient` (was `LlmClient`).
Drops the cast in registry.run() and matches what the PR promised —
analyst code is transport-agnostic by contract, not by convention.
- chat-client.ts: `wrapLlmClient` races the in-flight call against
`ChatCallOpts.signal`. Awaiting code unblocks on abort; the in-flight
HTTP request itself is still bounded only by `timeoutMs` (LlmClient
doesn't yet accept an external AbortSignal; documented inline).
- registry.ts:
  * `AnalystHooks`: `onBeforeAnalyze`, `onAfterAnalyze`, `onError`,
    `onComplete`. `onError` MAY return findings to convert a crash into
    structured findings; `onAfterAnalyze` runs for ok | failed | skipped.
    This is the seam for telemetry, cost ingestion, storage rotation,
    error-to-finding conversion, all without registry changes.
  * `BudgetPolicy`: `{ totalUsd, weights, allocate }`. Default still
    equal-split; `allocate` is the precise hook when weights aren't
    enough.
- findings-store.ts: `diffFindings(prev, cur, { isMaterial })`. Default
materiality test (severity / confidence Δ > 0.05 / evidence count) is
exported as `defaultIsMaterial` so consumers can layer stricter
predicates without re-implementing the base.
- 8 new tests cover hook ordering, error→finding conversion, skipped
hooks, equal-split + weighted budget, default + custom diff policy,
signal racing.
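
The signal racing described for `wrapLlmClient` can be sketched as below. This is an illustration of the general technique, not the code in `chat-client.ts`: `raceWithSignal` is a hypothetical helper, and the `"aborted"` error message is an assumption. As the commit notes, racing only unblocks the awaiting caller; the underlying work keeps running until it finishes or times out on its own.

```typescript
// Race an in-flight promise against an optional AbortSignal.
// The returned promise rejects as soon as the signal aborts, even though
// the underlying work is not cancelled.
function raceWithSignal<T>(work: Promise<T>, signal?: AbortSignal): Promise<T> {
  if (!signal) return work;
  if (signal.aborted) {
    // Swallow a later rejection from the abandoned work so it does not
    // surface as an unhandled rejection.
    work.catch(() => undefined);
    return Promise.reject(new Error("aborted"));
  }
  return new Promise<T>((resolve, reject) => {
    const onAbort = (): void => reject(new Error("aborted"));
    signal.addEventListener("abort", onAbort, { once: true });
    work.then(
      (value) => {
        signal.removeEventListener("abort", onAbort);
        resolve(value);
      },
      (err) => {
        signal.removeEventListener("abort", onAbort);
        reject(err);
      },
    );
  });
}
```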
1125/1125 tests pass.
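
The pluggable diff policy above can be illustrated with a small self-contained sketch. The names `diffFindings` and `defaultIsMaterial` come from the commit message, but the bodies below are an assumed reimplementation over a simplified `Finding` shape (hypothetical), not the code in `findings-store.ts`.

```typescript
// Simplified finding shape for illustration only.
interface Finding {
  finding_id: string;
  severity: string;
  confidence: number;
  evidence: string[];
}

type IsMaterial = (prev: Finding, cur: Finding) => boolean;

// Materiality as described in the commit: severity change,
// confidence delta above 0.05, or a different evidence count.
const defaultIsMaterial: IsMaterial = (prev, cur) =>
  prev.severity !== cur.severity ||
  Math.abs(prev.confidence - cur.confidence) > 0.05 ||
  prev.evidence.length !== cur.evidence.length;

interface FindingsDiff {
  appeared: Finding[];
  disappeared: Finding[];
  persisted: Finding[];
  changed: Finding[];
}

// Bucket findings by stable id; `isMaterial` decides persisted vs changed.
function diffFindings(
  prev: Finding[],
  cur: Finding[],
  opts: { isMaterial?: IsMaterial } = {},
): FindingsDiff {
  const isMaterial = opts.isMaterial ?? defaultIsMaterial;
  const prevById = new Map<string, Finding>();
  for (const finding of prev) prevById.set(finding.finding_id, finding);
  const curIds = new Set(cur.map((finding) => finding.finding_id));
  const diff: FindingsDiff = { appeared: [], disappeared: [], persisted: [], changed: [] };
  for (const finding of cur) {
    const before = prevById.get(finding.finding_id);
    if (!before) diff.appeared.push(finding);
    else if (isMaterial(before, finding)) diff.changed.push(finding);
    else diff.persisted.push(finding);
  }
  for (const finding of prev) {
    if (!curIds.has(finding.finding_id)) diff.disappeared.push(finding);
  }
  return diff;
}

// A stricter predicate layered on the default: any confidence change counts.
const anyConfidenceChange: IsMaterial = (prev, cur) =>
  defaultIsMaterial(prev, cur) || prev.confidence !== cur.confidence;
```

Layering on `defaultIsMaterial` keeps the base rules in one place, so a consumer only writes the extra condition it cares about.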
Pushed

- #1 ChatClient wiring: fixed.
- #2 Signal threading: partial, with honest scope.
- #4 Diff materiality: pluggable.
- #5 Budget: pluggable.
- #6 Rotation, #+ telemetry, #+ error→finding: covered by
- #3 AxAIService coupling: non-issue.
- #7 schema_version literal: leaving for now. Migration helper is a 0.29 concern; current literal forces a deliberate type bump on breaking changes, which is the right ergonomic for 1.0.
- #8 Isolation under load: added. Hook tests + error-conversion tests cover the failure-isolation invariant from another angle (and the failing analyst's siblings are observed via

Tests: 20 in the analyst file, 1125 across the suite, all green.
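
For readers following the budget item, here is a rough sketch of how a `{ totalUsd, weights, allocate }` policy could resolve to per-analyst budgets. The field names come from the commit message; the `allocate` signature, the default weight of 1 for unlisted analysts, and the `splitBudget` helper are assumptions made for illustration.

```typescript
// Assumed shape; only the three field names are taken from the commit.
interface BudgetPolicy {
  totalUsd: number;
  weights?: Record<string, number>;
  allocate?: (analystIds: string[], totalUsd: number) => Record<string, number>;
}

// Resolve a policy to a per-analyst USD budget.
// Precedence (assumed): allocate > weights > equal split.
function splitBudget(analystIds: string[], policy: BudgetPolicy): Record<string, number> {
  if (policy.allocate) return policy.allocate(analystIds, policy.totalUsd);
  // Analysts without an explicit weight default to 1, so an empty
  // weights map degenerates to an equal split.
  const weightOf = (id: string): number => policy.weights?.[id] ?? 1;
  const totalWeight = analystIds.reduce((sum, id) => sum + weightOf(id), 0);
  const budgets: Record<string, number> = {};
  for (const id of analystIds) {
    budgets[id] = totalWeight > 0 ? (policy.totalUsd * weightOf(id)) / totalWeight : 0;
  }
  return budgets;
}
```

With this shape, weights cover the common "give this analyst more" case, while `allocate` stays available for anything that is not a simple proportion.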
…on top of #56) (#57)

The original PR-A adapter (`createTraceAnalystAdapter`) shipped a single generic flow whose output was `findings:string[]`: every bullet became a flat-severity `medium` / confidence `0.6` `AnalystFinding`, losing the per-finding grading the analyzer LLM is perfectly capable of producing. This commit adds the typed-kind architecture on top:

finding-signature.ts
  Strict Zod schema for one analyst-emitted row (severity / claim / subject / evidence_uri / confidence / rationale / recommended_action). `RAW_FINDING_SCHEMA_PROMPT` embeds the shape into every kind's actor prompt; the LLM emits structured JSON; the factory Zod-validates each row at the boundary and drops malformed rows with a logged reason.

kind-factory.ts
  `createTraceAnalystKind(spec, { ai })` lifts a kind spec into the existing `Analyst<TraceAnalysisStore>` contract. Wires Ax `agent(...)` with `findings:json[]` output, AxJSRuntime sandbox, advanced-mode recursion when `maxDepth>0`, and the per-kind tool subset. `versionSuffix` is reserved for prompt-optimization artifacts (MIPRO/GEPA bumps Ax version → wire-up in a follow-up).

tool-groups.ts
  Five named subsets (`all`, `discovery`, `discoveryAndRead`, `discoveryAndSearch`, `targeted`) so each kind takes only what it needs from the seven trace-analyst tools. Unknown group names throw (silent-all would defeat the cost-control point).

kinds/
  Four default kinds, ordered for dependency-aware runs:

  1. failure-mode (maxDepth 3, parallel 4): clusters dataset failures into distinct modes with cited evidence. Discovery → cluster → cite protocol. Aggressive RLM delegation: one subagent per cluster, and confounded clusters split again at the next level.
  2. knowledge-gap (maxDepth 2, parallel 4): names the specific information the agent lacked or that was stale, attributed to the runtime layer that should have surfaced it. Anchored on `@tangle-network/agent-knowledge` (wiki page / claim / raw source loci) with secondary loci for `websearch:outdated:*`, `tool-doc:*`, `system-prompt:*`, `memory:*`. Subagents fan out per layer.
  3. knowledge-poisoning (maxDepth 2, parallel 4): finds confident-but-wrong actions. DUAL-VERIFY protocol: subagents prove (i) the agent acted on the belief and (ii) the belief is false in this trace's evidence. Only findings with both halves proven survive.
  4. improvement (maxDepth 3, parallel 4): converts upstream findings into concrete locus-named edits. DISCOVERY → CANDIDATE-FIXES → COMPETE → CITE: subagents simulate competing fix candidates per cluster; the winning candidate per cluster is emitted with leverage grade, rationale, and a literal edit phrased as a diff. Cross-references upstream findings via `evidence_uri: "finding://<id>"` so the dependency graph renders.

Tests (21 new in `kinds/kinds.test.ts`):
- Zod schema rejects out-of-range confidence / unknown severity / extra fields (strict mode), logs the rejection reason
- `parseRawFinding` returns null + logs on failure, types value on success
- default suite emits the four kinds in run order
- every kind exercises Ax recursion (maxDepth ≥ 1)
- improvement has the deepest depth (competing candidate fixes)
- knowledge-gap prompt anchors on agent-knowledge + websearch + tool-doc, not generic RAG
- knowledge-poisoning enforces dual-verify
- failure-mode requires clustering, not enumeration
- tool groups filter narrowly; unknown name throws
- `versionSuffix` appends to kind version (for future optimizer pins)
- `finding_id` stable across runs for same kind + area + claim + subject

The legacy `createTraceAnalystAdapter` is now `@deprecated` (kept one minor for consumer migration). New code should reach for kinds first.

Total: 1146/1146 tests pass, typecheck clean. Ax 19's `.d.ts` is missing some optimizer-class exports that 21.x ships; the optimizer-fit pipeline lands in a follow-up after the Ax bump.
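
The "validate each row at the boundary and drop malformed rows with a logged reason" behavior can be sketched without any dependency. The commit uses a strict Zod schema; the hand-rolled checks below are a stand-in for it, not the schema in `finding-signature.ts`. The severity vocabulary, the `[0, 1]` confidence range, and the assumption that every field is a required string are all guesses for illustration, as are the `RawFinding` type and the `parseRawFindings` helper.

```typescript
// Assumed severity vocabulary; the source only shows `medium` explicitly.
const SEVERITIES = ["low", "medium", "high", "critical"] as const;
const STRING_FIELDS = ["claim", "subject", "evidence_uri", "rationale", "recommended_action"] as const;
const ALLOWED_KEYS = new Set<string>(["severity", "confidence", ...STRING_FIELDS]);

// Hypothetical row type mirroring the fields listed in the commit message.
type RawFinding = {
  severity: (typeof SEVERITIES)[number];
  confidence: number;
} & Record<(typeof STRING_FIELDS)[number], string>;

// Returns the typed row on success, or null after logging why it was dropped.
function parseRawFinding(
  row: unknown,
  log: (reason: string) => void = console.warn,
): RawFinding | null {
  const reject = (reason: string): null => {
    log(`dropped finding row: ${reason}`);
    return null;
  };
  if (typeof row !== "object" || row === null || Array.isArray(row)) return reject("not an object");
  const rec = row as Record<string, unknown>;
  // Strict mode: any key outside the known set fails the row.
  const extra = Object.keys(rec).filter((key) => !ALLOWED_KEYS.has(key));
  if (extra.length > 0) return reject(`unknown fields: ${extra.join(", ")}`);
  if (!(SEVERITIES as readonly unknown[]).includes(rec["severity"])) return reject("unknown severity");
  const confidence = rec["confidence"];
  // The negated range check also rejects NaN.
  if (typeof confidence !== "number" || !(confidence >= 0 && confidence <= 1)) {
    return reject("confidence outside [0, 1]");
  }
  for (const field of STRING_FIELDS) {
    if (typeof rec[field] !== "string") return reject(`${field} must be a string`);
  }
  return rec as unknown as RawFinding;
}

// Keep only the rows that validate; malformed rows are logged and skipped.
function parseRawFindings(rows: unknown[], log?: (reason: string) => void): RawFinding[] {
  return rows
    .map((row) => parseRawFinding(row, log))
    .filter((finding): finding is RawFinding => finding !== null);
}
```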
The npm package was bumped to 0.28.0 in PR #56 (analyst registry + findings envelope) but clients/python/pyproject.toml stayed at 0.27.2. The publish workflow rejects mismatched versions across npm and PyPI to prevent a half-published release where one ecosystem advertises a version the other doesn't have. Bumping pyproject to 0.28.0 so v0.28.0 can be tagged and the workflow publishes both packages atomically.
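
A version-parity gate of the kind described could look roughly like the sketch below. The actual publish workflow is not shown in this PR, so everything here is hypothetical: the function names, the error messages, and the regex-based read of `pyproject.toml` (a real check would likely use a TOML parser).

```typescript
// Read the "version" field from the text of a package.json file.
function readNpmVersion(packageJsonText: string): string {
  const version: unknown = JSON.parse(packageJsonText).version;
  if (typeof version !== "string") throw new Error("package.json has no string version");
  return version;
}

// Read the first top-of-line `version = "..."` entry from pyproject.toml text.
// Assumption: that entry is the project version.
function readPyprojectVersion(pyprojectText: string): string {
  const version = /^version\s*=\s*"([^"]+)"/m.exec(pyprojectText)?.[1];
  if (!version) throw new Error("pyproject.toml has no version entry");
  return version;
}

// Throw when the two ecosystems disagree; return the shared version otherwise.
function assertVersionsMatch(packageJsonText: string, pyprojectText: string): string {
  const npmVersion = readNpmVersion(packageJsonText);
  const pypiVersion = readPyprojectVersion(pyprojectText);
  if (npmVersion !== pypiVersion) {
    throw new Error(`version mismatch: npm ${npmVersion} vs PyPI ${pypiVersion}`);
  }
  return npmVersion;
}
```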
Summary
Adds a generic, model-agnostic, transport-agnostic Analyst layer that orchestrates agent-eval's existing analyzers (`analyzeTraces`, `MultiLayerVerifier`, `RunCritic`, `SemanticConceptJudge`, `JudgeFn`) without re-implementing them. One contract, one runner, one persistence path, reusable by VB operator bench, the leaderboard submission pipeline, and the orchestrator on-completion `/v1/intelligence/reports` `mode:'agentic'` surface with the same code.

This is PR-A of a three-PR series:
- `@tangle-network/agent-eval@0.28.0`
- `submit-external-run.ts` to persist findings + render
- `generateIntelligenceReport` as an `Analyst`; route `mode:'agentic'` through the registry

What's in
- `src/analyst/types.ts`: `Analyst<TInput>` contract, `AnalystFinding` envelope with sha-stable `finding_id` (so cross-run diffs work), `AnalystRunInputs` discriminated by `inputKind` (trace-store | artifact-dir | run-record | judge-input | custom)
- `src/analyst/chat-client.ts`: unified `ChatClient` over `router | sandbox-sdk | cli-bridge | direct-provider | mock` so analyst code never knows the transport (production binds to sandbox-sdk; tests bind to mock)
- `src/analyst/registry.ts`: `register` / `list` / `run` with input routing, per-analyst isolation (one analyst failure does not stop the run), equal-split budgeting, per-analyst telemetry (latency_ms, cost_usd, status, error.class)
- `src/analyst/findings-store.ts`: locked JSONL append + `diffFindings` returning `{ appeared, disappeared, persisted, changed }` keyed by stable id, with a narrow material-change definition (severity / confidence > 0.05 / evidence count) to avoid LLM-reword noise
- `src/analyst/adapters.ts`: five thin lifters that wrap the existing primitives into the `Analyst` shape (no duplication; the wrapped code is unchanged)
- `src/analyst/analyst.test.ts`: 12 unit tests

Version bump: `0.27.2 → 0.28.0`.

Test plan
- `pnpm tsc --noEmit`: clean
- `pnpm vitest run src/analyst/analyst.test.ts`: 12/12 pass
- `pnpm test` (full suite): 1117/1117 pass, no regressions
- `pnpm build`: clean, OpenAPI spec re-emits

Design notes
- Nothing in `analyst/` imports a specific provider or model name. Analysts that call an LLM go through `ChatClient` on `AnalystContext.chat`.
- The transport is chosen in the `createChatClient(...)` call.
- `LlmClient` callers and `TCloud`-based judges keep working untouched. The five adapters lift, they don't re-implement.
- The `finding_id` hashes `{analyst_id, area, subject, normalized_claim}` (or an explicit `id_basis`): two runs that "find the same thing" share the id, which is what makes `diffFindings` meaningful for regression tracking.
- A failing analyst yields a `status: 'failed'` summary row with `error.class` + `error.message`; siblings still run.
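
The failure-isolation behavior in the design notes can be sketched as a small runner. This is an illustration only, not the code in `registry.ts`: `SketchAnalyst`, `AnalystSummary`, and `runAll` are hypothetical names, findings are reduced to plain strings, and running analysts sequentially is an assumption.

```typescript
// Hypothetical minimal analyst: an id plus an async analyze step.
interface SketchAnalyst {
  id: string;
  analyze: () => Promise<string[]>;
}

// One telemetry row per analyst, whether it succeeded or failed.
interface AnalystSummary {
  analyst_id: string;
  status: "ok" | "failed";
  latency_ms: number;
  error?: { class: string; message: string };
}

// Run every analyst; a throw is recorded as a failed row and the loop continues.
async function runAll(
  analysts: SketchAnalyst[],
): Promise<{ findings: string[]; summaries: AnalystSummary[] }> {
  const findings: string[] = [];
  const summaries: AnalystSummary[] = [];
  for (const analyst of analysts) {
    const started = Date.now();
    try {
      findings.push(...(await analyst.analyze()));
      summaries.push({ analyst_id: analyst.id, status: "ok", latency_ms: Date.now() - started });
    } catch (err) {
      const error =
        err instanceof Error
          ? { class: err.name, message: err.message }
          : { class: "Unknown", message: String(err) };
      summaries.push({
        analyst_id: analyst.id,
        status: "failed",
        latency_ms: Date.now() - started,
        error,
      });
    }
  }
  return { findings, summaries };
}
```

Wrapping each analyst in its own `try`/`catch` is what keeps one crash from hiding the findings of the analysts that run after it.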