
feat(analyst): registry + findings envelope over existing primitives #56

Merged
drewstone merged 4 commits into main from feat/analyst-registry-and-findings on May 19, 2026

@drewstone
Contributor

Summary

Adds a generic, model-agnostic, transport-agnostic Analyst layer that orchestrates agent-eval's existing analyzers (analyzeTraces, MultiLayerVerifier, RunCritic, SemanticConceptJudge, JudgeFn) without re-implementing them. One contract, one runner, one persistence path — reusable by VB operator bench, the leaderboard submission pipeline, and the orchestrator on-completion /v1/intelligence/reports mode:'agentic' surface with the same code.

This is PR-A of a three-PR series:

  • PR-A (this): shared substrate in @tangle-network/agent-eval@0.28.0
  • PR-B (next, blueprint-agent): adopt 0.28.0, extract VB analysts (archetype miner, propose-fixes), wire submit-external-run.ts to persist findings + render
  • PR-C (next, agent-dev-container): register generateIntelligenceReport as an Analyst; route mode:'agentic' through the registry

What's in

  • src/analyst/types.ts — Analyst<TInput> contract, AnalystFinding envelope with sha-stable finding_id (so cross-run diffs work), AnalystRunInputs discriminated by inputKind (trace-store | artifact-dir | run-record | judge-input | custom)
  • src/analyst/chat-client.ts — unified ChatClient over router | sandbox-sdk | cli-bridge | direct-provider | mock so analyst code never knows the transport (production binds to sandbox-sdk; tests bind to mock)
  • src/analyst/registry.ts — register/list/run with input routing, per-analyst isolation (one analyst failure does not stop the run), equal-split budgeting, per-analyst telemetry (latency_ms, cost_usd, status, error.class)
  • src/analyst/findings-store.ts — locked JSONL append + diffFindings returning { appeared, disappeared, persisted, changed } keyed by stable id, with a narrow material-change definition (severity / confidence > 0.05 / evidence count) to avoid LLM-reword noise
  • src/analyst/adapters.ts — five thin lifters that wrap the existing primitives into the Analyst shape (no duplication; the wrapped code is unchanged)
  • src/analyst/analyst.test.ts — 12 unit tests
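
The input routing described for registry.ts can be pictured roughly as follows. This is a minimal sketch, not the shipped code: the payload fields on each input variant, the `RegisteredAnalyst` shape, and the `route` method are assumptions made for illustration. Only the five `inputKind` values and the register/list verbs come from the description above.

```typescript
// The five kinds are named in the PR; everything else here is hypothetical.
type InputKind = "trace-store" | "artifact-dir" | "run-record" | "judge-input" | "custom";

// Discriminated union: the payload fields are illustrative placeholders.
type AnalystRunInput =
  | { inputKind: "trace-store"; storePath: string }
  | { inputKind: "artifact-dir"; dir: string }
  | { inputKind: "run-record"; runId: string }
  | { inputKind: "judge-input"; prompt: string; response: string }
  | { inputKind: "custom"; payload: unknown };

interface RegisteredAnalyst {
  id: string;
  inputKind: InputKind;
}

class AnalystRegistry {
  private readonly analysts = new Map<string, RegisteredAnalyst>();

  // Registration validation: duplicate ids are rejected up front.
  register(analyst: RegisteredAnalyst): void {
    if (this.analysts.has(analyst.id)) {
      throw new Error(`duplicate analyst id: ${analyst.id}`);
    }
    this.analysts.set(analyst.id, analyst);
  }

  list(): RegisteredAnalyst[] {
    return Array.from(this.analysts.values());
  }

  // Each analyst receives only the input whose inputKind it declared;
  // analysts with no matching input are left out of the routing table.
  route(inputs: AnalystRunInput[]): Map<string, AnalystRunInput> {
    const byKind = new Map<InputKind, AnalystRunInput>();
    inputs.forEach((input) => byKind.set(input.inputKind, input));

    const routed = new Map<string, AnalystRunInput>();
    this.list().forEach((analyst) => {
      const input = byKind.get(analyst.inputKind);
      if (input !== undefined) routed.set(analyst.id, input);
    });
    return routed;
  }
}
```

Keeping the discriminator on the input rather than on the analyst call site means a caller can hand the registry whatever inputs it has and let routing decide which analysts are eligible.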

Version bump: 0.27.2 → 0.28.0.

Test plan

  • pnpm tsc --noEmit — clean
  • pnpm vitest run src/analyst/analyst.test.ts — 12/12 pass
  • pnpm test (full suite) — 1117/1117 pass, no regressions
  • pnpm build — clean, OpenAPI spec re-emits

Design notes

  • Model-agnostic. Nothing in analyst/ imports a specific provider or model name. Analysts that call an LLM go through ChatClient on AnalystContext.chat.
  • Transport-agnostic. Production runs bind to sandbox-sdk; local dev binds to cli-bridge; tests bind to mock. Swapping is a one-line createChatClient(...) call.
  • Extends, doesn't duplicate. Existing LlmClient callers and TCloud-based judges keep working untouched. The five adapters lift, they don't re-implement.
  • finding_id stability. Hash over {analyst_id, area, subject, normalized_claim} (or explicit id_basis) — two runs that "find the same thing" share the id, which is what makes diffFindings meaningful for regression tracking.
  • Isolation invariant. A throwing analyst becomes a status: 'failed' summary row with error.class + error.message; siblings still run.
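
To make the finding_id stability note concrete, here is a minimal sketch of a content-derived id. It assumes SHA-256 over a JSON-encoded tuple, a 16-character truncation, and a simple claim normalization (lowercase, collapsed whitespace, stripped trailing punctuation). The PR states only that the hash covers {analyst_id, area, subject, normalized_claim}; the function names and the normalization rules below are hypothetical.

```typescript
import { createHash } from "node:crypto";

interface FindingIdBasis {
  analyst_id: string;
  area: string;
  subject: string;
  claim: string;
}

// Assumed normalization: lowercase, collapse whitespace, drop trailing punctuation.
function normalizeClaim(claim: string): string {
  return claim
    .toLowerCase()
    .replace(/\s+/g, " ")
    .trim()
    .replace(/[.!?]+$/, "");
}

// Hash a canonical, ordered encoding so field order can never change the id.
function findingId(basis: FindingIdBasis): string {
  const canonical = JSON.stringify([
    basis.analyst_id,
    basis.area,
    basis.subject,
    normalizeClaim(basis.claim),
  ]);
  // Truncating to 16 hex characters is an illustrative choice, not the package's.
  return createHash("sha256").update(canonical).digest("hex").slice(0, 16);
}
```

Under this scheme two runs whose claims differ only in casing, spacing, or a trailing period map to the same id, while a change of subject or area produces a new one.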

drewstone added 2 commits May 19, 2026 09:25
Adds a generic, model-agnostic, transport-agnostic Analyst layer that
orchestrates agent-eval's existing analyzers without re-implementing
them. One contract, one runner, one persistence path — reusable by VB
operator bench, the leaderboard submission pipeline, and the orchestrator
on-completion reports surface with the same code.

- `src/analyst/types.ts` — `Analyst` contract, `AnalystFinding` envelope
  with sha-stable `finding_id`, `AnalystRunInputs` with `inputKind`
  routing (trace-store | artifact-dir | run-record | judge-input | custom)
- `src/analyst/chat-client.ts` — `ChatClient` abstraction over
  router | sandbox-sdk | cli-bridge | direct-provider | mock so analyst
  code never depends on the transport
- `src/analyst/registry.ts` — register/list/run with input routing,
  per-analyst isolation (one failure does not stop others), budget split,
  per-analyst telemetry
- `src/analyst/findings-store.ts` — locked JSONL append + `diffFindings`
  (appeared / disappeared / persisted / changed) keyed by stable id
- `src/analyst/adapters.ts` — five thin lifters wrapping `analyzeTraces`,
  `MultiLayerVerifier`, `RunCritic`, `JudgeFn`, `SemanticConceptJudge`
- `src/analyst/analyst.test.ts` — 12 tests covering hash stability,
  registration validation, routing, failure isolation, only/skip,
  cost attribution, store round-trip, diff semantics, mock transport

Version: 0.28.0

Lifts five reviewer concerns into a small policy surface so consumers
override what they need without changing the registry. All defaults
preserve previous behavior.

- types.ts: `AnalystContext.chat` is now `ChatClient` (was `LlmClient`).
  Drops the cast in registry.run() and matches what the PR promised —
  analyst code is transport-agnostic by contract, not by convention.

- chat-client.ts: `wrapLlmClient` races the in-flight call against
  `ChatCallOpts.signal`. Awaiting code unblocks on abort; the in-flight
  HTTP request still bounds by `timeoutMs` (LlmClient doesn't yet
  accept an external AbortSignal — documented inline).

- registry.ts:
  * `AnalystHooks` — `onBeforeAnalyze`, `onAfterAnalyze`, `onError`,
    `onComplete`. `onError` MAY return findings to convert a crash into
    structured findings; `onAfterAnalyze` runs for ok | failed | skipped.
    This is the seam for telemetry, cost ingestion, storage rotation,
    error-to-finding conversion — all without registry changes.
  * `BudgetPolicy` — `{ totalUsd, weights, allocate }`. Default still
    equal-split; `allocate` is the precise hook when weights aren't
    enough.

- findings-store.ts: `diffFindings(prev, cur, { isMaterial })`. Default
  materiality test (severity / confidence Δ > 0.05 / evidence count) is
  exported as `defaultIsMaterial` so consumers can layer stricter
  predicates without re-implementing the base.

- 8 new tests cover hook ordering, error→finding conversion, skipped
  hooks, equal-split + weighted budget, default + custom diff policy,
  signal racing.

1125/1125 tests pass.
@drewstone
Contributor Author

Pushed 7d03b7a addressing the review. Took the architectural framing — most of these concerns are policies, so they collapse into a small hook + policy surface instead of per-issue patches.

#1 ChatClient wiring — fixed. AnalystContext.chat is now typed as ChatClient directly. The cast in registry.run() is gone; analysts that call ctx.chat.chat(...) get the documented API.

#2 Signal threading — partial, with honest scope. wrapLlmClient now races the call against ChatCallOpts.signal so awaiting code unblocks on abort. The in-flight HTTP request still completes (or hits its own timeoutMs) because LlmClient.call doesn't accept an external AbortSignal yet. That's a one-line addition to LlmClient in a follow-up; the inline comment now reflects the actual contract instead of a TODO.
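
The racing behaviour described in #2 can be sketched as below. This is an illustration of the general pattern, not the body of wrapLlmClient: the helper name `raceWithSignal` and the error message are assumptions. As in the comment above, the underlying call is not cancelled; only the awaiting code is released.

```typescript
// Settles with the call's result, or rejects as soon as the signal aborts.
// The in-flight work behind `call` keeps running; only the awaiter unblocks.
function raceWithSignal<T>(call: Promise<T>, signal?: AbortSignal): Promise<T> {
  if (!signal) return call;

  if (signal.aborted) {
    // Swallow a late rejection from the abandoned call so it is not reported
    // as an unhandled rejection.
    call.catch(() => undefined);
    return Promise.reject(new Error("aborted"));
  }

  return new Promise<T>((resolve, reject) => {
    const onAbort = (): void => reject(new Error("aborted"));
    signal.addEventListener("abort", onAbort, { once: true });
    call
      .then(resolve, reject)
      // Remove the listener once the call settles so it does not leak.
      .finally(() => signal.removeEventListener("abort", onAbort));
  });
}
```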

#4 Diff materiality — pluggable. diffFindings(prev, cur, { isMaterial }) takes an optional predicate. Default exported as defaultIsMaterial so consumers can layer (e.g. (a,b) => defaultIsMaterial(a,b) || a.rationale !== b.rationale) without re-implementing the base.
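
A minimal sketch of how a pluggable materiality predicate might sit inside the diff. The `Finding` fields shown here (including `evidence` as a string array) and the choice to make `persisted` and `changed` disjoint are assumptions; the thresholds follow the default described above (severity, confidence delta above 0.05, evidence count).

```typescript
// Hypothetical, trimmed-down finding shape for illustration only.
interface Finding {
  finding_id: string;
  severity: "low" | "medium" | "high";
  confidence: number;
  evidence: string[];
  rationale: string;
}

type IsMaterial = (prev: Finding, cur: Finding) => boolean;

// Default test as described in the PR: severity, confidence delta, evidence count.
const defaultIsMaterial: IsMaterial = (a, b) =>
  a.severity !== b.severity ||
  Math.abs(a.confidence - b.confidence) > 0.05 ||
  a.evidence.length !== b.evidence.length;

function diffFindings(prev: Finding[], cur: Finding[], opts: { isMaterial?: IsMaterial } = {}) {
  const isMaterial = opts.isMaterial ?? defaultIsMaterial;
  const prevById = new Map<string, Finding>();
  prev.forEach((f) => prevById.set(f.finding_id, f));
  const curIds = new Set(cur.map((f) => f.finding_id));

  const appeared: Finding[] = [];
  const persisted: Finding[] = [];
  const changed: Finding[] = [];
  for (const f of cur) {
    const before = prevById.get(f.finding_id);
    if (before === undefined) appeared.push(f);
    else if (isMaterial(before, f)) changed.push(f);
    else persisted.push(f); // same id, no material change (assumed disjoint from `changed`)
  }
  const disappeared = prev.filter((f) => !curIds.has(f.finding_id));
  return { appeared, disappeared, persisted, changed };
}
```

With the default predicate a reworded rationale stays in `persisted`; passing the layered predicate from the comment above would move it into `changed`.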

#5 Budget — pluggable. RegistryRunOpts.budget is now { totalUsd, weights, allocate }. Default is still equal split; weighted split via weights; precise control via allocate({ analyst, totalUsd, remainingUsd, runningCount }).
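
The three budget modes could resolve along these lines. This is a sketch under assumptions: `planBudget` is a hypothetical helper, `analyst` is passed to `allocate` as a plain id string, unweighted analysts default to weight 1, and every grant is clamped to the remaining budget. Only the `{ totalUsd, weights, allocate }` shape and the `allocate` argument names come from the comment above.

```typescript
interface AllocateArgs {
  analyst: string; // assumed to be the analyst id; the real type may differ
  totalUsd: number;
  remainingUsd: number;
  runningCount: number;
}

interface BudgetPolicy {
  totalUsd: number;
  weights?: Record<string, number>;
  allocate?: (args: AllocateArgs) => number;
}

// Resolves a per-analyst USD cap. Precedence: allocate, then weights, then equal split.
function planBudget(analystIds: string[], policy: BudgetPolicy): Map<string, number> {
  const weightOf = (id: string): number => policy.weights?.[id] ?? 1;
  const totalWeight = analystIds.reduce((sum, id) => sum + weightOf(id), 0);

  const plan = new Map<string, number>();
  let remainingUsd = policy.totalUsd;
  for (const id of analystIds) {
    const requested = policy.allocate
      ? policy.allocate({
          analyst: id,
          totalUsd: policy.totalUsd,
          remainingUsd,
          runningCount: analystIds.length,
        })
      : (policy.totalUsd * weightOf(id)) / totalWeight;
    // Never grant a negative amount or more than what is left.
    const granted = Math.max(0, Math.min(requested, remainingUsd));
    plan.set(id, granted);
    remainingUsd -= granted;
  }
  return plan;
}
```

With no `weights` and no `allocate`, every analyst has weight 1, which reduces to the equal split that remains the default.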

#6 Rotation, #+ telemetry, #+ error→finding — covered by AnalystHooks. New surface: onBeforeAnalyze | onAfterAnalyze | onError | onComplete. onAfterAnalyze fires for ok/failed/skipped — that's where rotation, ingestion, cost-aggregation belong. onError MAY return findings to convert a crash into structured signal. This was the architectural answer the reviewer was probing for: a generic seam for cross-cutting concerns instead of a hardcoded list of features.
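
The hook seam might look roughly like this. The hook names match the comment above, but the signatures, the `skip` flag, and the `runWithHooks` runner are assumptions made for illustration. The sketch shows the two properties called out: onAfterAnalyze fires for ok, failed, and skipped analysts, and onError may turn a crash into findings while sibling analysts keep running.

```typescript
interface Finding {
  analyst_id: string;
  claim: string;
}

type Status = "ok" | "failed" | "skipped";

// Hook names follow the PR; parameter lists are hypothetical.
interface AnalystHooks {
  onBeforeAnalyze?(analystId: string): void;
  onAfterAnalyze?(analystId: string, status: Status, findings: Finding[]): void;
  onError?(analystId: string, error: Error): Finding[] | undefined;
  onComplete?(all: Finding[]): void;
}

interface Analyst {
  id: string;
  skip?: boolean;
  analyze(): Promise<Finding[]>;
}

async function runWithHooks(analysts: Analyst[], hooks: AnalystHooks = {}): Promise<Finding[]> {
  const all: Finding[] = [];
  for (const analyst of analysts) {
    if (analyst.skip) {
      hooks.onAfterAnalyze?.(analyst.id, "skipped", []);
      continue;
    }
    hooks.onBeforeAnalyze?.(analyst.id);
    let status: Status = "ok";
    let findings: Finding[] = [];
    try {
      findings = await analyst.analyze();
    } catch (err) {
      // Isolation: the failure is recorded and the loop moves on to siblings.
      status = "failed";
      const error = err instanceof Error ? err : new Error(String(err));
      findings = hooks.onError?.(analyst.id, error) ?? [];
    }
    all.push(...findings);
    hooks.onAfterAnalyze?.(analyst.id, status, findings);
  }
  hooks.onComplete?.(all);
  return all;
}
```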

#3 AxAIService coupling — non-issue. @tangle-network/agent-eval already declares @ax-llm/ax as a dep (the trace analyst itself lives in this package). Importing createTraceAnalystAdapter doesn't add a transitive — it surfaces what's already there. Consumers that don't need that adapter don't import it.

#7 schema_version literal — leaving for now. Migration helper is a 0.29 concern; current literal forces a deliberate type bump on breaking changes, which is the right ergonomic for 1.0.

#8 Isolation under load — added. Hook tests + error-conversion tests cover the failure-isolation invariant from another angle (and the failing analyst's siblings are observed via onAfterAnalyze).

Tests: 20 in the analyst file, 1125 across the suite, all green.

…on top of #56) (#57)

The original PR-A adapter (`createTraceAnalystAdapter`) shipped a single
generic flow whose output was `findings:string[]` — every bullet became
a flat-severity `medium` / confidence `0.6` `AnalystFinding`, losing the
per-finding grading the analyzer LLM is perfectly capable of producing.

This commit adds the typed-kind architecture on top:

  finding-signature.ts
    Strict Zod schema for one analyst-emitted row (severity / claim /
    subject / evidence_uri / confidence / rationale / recommended_action).
    `RAW_FINDING_SCHEMA_PROMPT` embeds the shape into every kind's actor
    prompt; the LLM emits structured JSON; the factory Zod-validates each
    row at the boundary and drops malformed rows with a logged reason.
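
  A rough picture of the boundary validation described above. The commit uses
  a strict Zod schema; the sketch below is a dependency-free stand-in written
  by hand so that it runs without Zod installed. The severity values, the
  assumption that every field is required, and the function body are
  hypothetical; only the field names and the "return null and log the reason"
  behaviour of `parseRawFinding` come from this commit message.

```typescript
// Assumed severity vocabulary; the real enum may differ.
const SEVERITIES: readonly string[] = ["low", "medium", "high", "critical"];
const STRING_FIELDS = ["claim", "subject", "evidence_uri", "rationale", "recommended_action"];

interface RawFinding {
  severity: string;
  confidence: number;
  claim: string;
  subject: string;
  evidence_uri: string;
  rationale: string;
  recommended_action: string;
}

// Returns the typed row on success, or null after logging why the row was dropped.
function parseRawFinding(
  row: unknown,
  log: (reason: string) => void = console.warn,
): RawFinding | null {
  const reject = (reason: string): null => {
    log(`dropped finding: ${reason}`);
    return null;
  };

  if (typeof row !== "object" || row === null || Array.isArray(row)) {
    return reject("not an object");
  }
  const rec = row as Record<string, unknown>;

  // Strict mode: any field outside the schema invalidates the row.
  const allowed = new Set(["severity", "confidence", ...STRING_FIELDS]);
  const extra = Object.keys(rec).filter((key) => !allowed.has(key));
  if (extra.length > 0) return reject(`unexpected fields: ${extra.join(", ")}`);

  const severity = rec["severity"];
  if (typeof severity !== "string" || !SEVERITIES.includes(severity)) {
    return reject("unknown severity");
  }
  const confidence = rec["confidence"];
  if (typeof confidence !== "number" || !(confidence >= 0 && confidence <= 1)) {
    return reject("confidence outside [0, 1]");
  }
  for (const field of STRING_FIELDS) {
    if (typeof rec[field] !== "string") return reject(`${field} must be a string`);
  }
  return rec as unknown as RawFinding;
}
```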

  kind-factory.ts
    `createTraceAnalystKind(spec, { ai })` lifts a kind spec into the
    existing `Analyst<TraceAnalysisStore>` contract. Wires Ax `agent(...)`
    with `findings:json[]` output, AxJSRuntime sandbox, advanced-mode
    recursion when `maxDepth>0`, and the per-kind tool subset. `versionSuffix`
    is reserved for prompt-optimization artifacts (MIPRO/GEPA bumps Ax
    version → wire-up in a follow-up).

  tool-groups.ts
    Five named subsets — `all`, `discovery`, `discoveryAndRead`,
    `discoveryAndSearch`, `targeted` — so each kind takes only what it
    needs from the seven trace-analyst tools. Unknown group names throw
    (silent-all would defeat the cost-control point).
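
  The group lookup can be sketched as below. The five group names are the ones
  listed above; the seven tool names and the membership of each group are
  placeholders invented for illustration, since this commit message does not
  enumerate them. `resolveToolGroup` is likewise a hypothetical name.

```typescript
// Placeholder tool names: the real seven trace-analyst tools are not listed here.
const ALL_TOOLS = [
  "list_traces",
  "dataset_stats",
  "read_trace",
  "read_span",
  "search_spans",
  "search_text",
  "compare_traces",
] as const;

type ToolName = (typeof ALL_TOOLS)[number];

// Group membership below is illustrative, not the package's actual assignment.
const TOOL_GROUPS = new Map<string, readonly ToolName[]>([
  ["all", ALL_TOOLS],
  ["discovery", ["list_traces", "dataset_stats"]],
  ["discoveryAndRead", ["list_traces", "dataset_stats", "read_trace", "read_span"]],
  ["discoveryAndSearch", ["list_traces", "dataset_stats", "search_spans", "search_text"]],
  ["targeted", ["read_trace", "read_span"]],
]);

// An unknown name throws instead of falling back to "all", so a typo cannot
// silently hand a kind every tool.
function resolveToolGroup(name: string): readonly ToolName[] {
  const group = TOOL_GROUPS.get(name);
  if (group === undefined) {
    throw new Error(`unknown tool group: ${name}`);
  }
  return group;
}
```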

  kinds/ — four default kinds, ordered for dependency-aware runs:

    1. failure-mode (maxDepth 3, parallel 4) — clusters dataset failures
       into distinct modes with cited evidence. Discovery → cluster → cite
       protocol. Aggressive RLM delegation: one subagent per cluster, and
       confounded clusters split again at the next level.

    2. knowledge-gap (maxDepth 2, parallel 4) — names the specific
       information the agent lacked or that was stale, attributed to the
       runtime layer that should have surfaced it. Anchored on
       `@tangle-network/agent-knowledge` (wiki page / claim / raw source
       loci) with secondary loci for `websearch:outdated:*`, `tool-doc:*`,
       `system-prompt:*`, `memory:*`. Subagents fan out per layer.

    3. knowledge-poisoning (maxDepth 2, parallel 4) — finds confident-
       but-wrong actions. DUAL-VERIFY protocol: subagents prove (i) the
       agent acted on the belief and (ii) the belief is false in this
       trace's evidence. Only findings with both halves proven survive.

    4. improvement (maxDepth 3, parallel 4) — converts upstream findings
       into concrete locus-named edits. DISCOVERY → CANDIDATE-FIXES →
       COMPETE → CITE: subagents simulate competing fix candidates per
       cluster; the winning candidate per cluster is emitted with leverage
       grade, rationale, and a literal edit phrased as a diff. Cross-
       references upstream findings via `evidence_uri: "finding://<id>"`
       so the dependency graph renders.

  Tests (21 new in `kinds/kinds.test.ts`):
    - Zod schema rejects out-of-range confidence / unknown severity /
      extra fields (strict mode), logs the rejection reason
    - `parseRawFinding` returns null + logs on failure, types value on success
    - default suite emits the four kinds in run order
    - every kind exercises Ax recursion (maxDepth ≥ 1)
    - improvement has the deepest depth (competing candidate fixes)
    - knowledge-gap prompt anchors on agent-knowledge + websearch + tool-doc,
      not generic RAG
    - knowledge-poisoning enforces dual-verify
    - failure-mode requires clustering, not enumeration
    - tool groups filter narrowly; unknown name throws
    - `versionSuffix` appends to kind version (for future optimizer pins)
    - `finding_id` stable across runs for same kind + area + claim + subject

  The legacy `createTraceAnalystAdapter` is now `@deprecated` (kept one
  minor for consumer migration). New code should reach for kinds first.

Total: 1146/1146 tests pass, typecheck clean. Ax 19's `.d.ts` is missing
some optimizer-class exports that 21.x ships; the optimizer-fit pipeline
lands in a follow-up after the Ax bump.
@drewstone drewstone merged commit 641b0b3 into main May 19, 2026
1 check failed
drewstone added a commit that referenced this pull request May 19, 2026
The npm package was bumped to 0.28.0 in PR #56 (analyst registry +
findings envelope) but clients/python/pyproject.toml stayed at 0.27.2.
The publish workflow rejects mismatched versions across npm and PyPI
to prevent a half-published release where one ecosystem advertises a
version the other doesn't have.

Bumping pyproject to 0.28.0 so v0.28.0 can be tagged and the workflow
publishes both packages atomically.