
refactor: route evolution judges through Agent SDK subprocess#49

Merged
mcheemaa merged 2 commits into main from phase1/judge-subprocess
Apr 12, 2026

Conversation

@mcheemaa (Member)

Summary

  • LLM evolution judges now route through the same Agent SDK query() subprocess as the main agent, via a new runtime.judgeQuery() method
  • Removes the raw @anthropic-ai/sdk dependency. The Agent SDK (@anthropic-ai/claude-agent-sdk) is unchanged
  • Structured output moves from messages.parse() to prompt instruction + JSON.parse() + Zod validation, with tolerant recovery for raw JSON, fenced JSON, and JSON wrapped in prose
  • Adds an optional judge_model config field for operators who want a different model tier for judges
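
The tolerant recovery described above could look roughly like the following. This is a hypothetical sketch, not the repo's actual `parseJsonFromResponse`: the recovery order (raw JSON, then fenced block, then outermost brace span in prose) matches the description, but the exact implementation is an assumption, and schema validation (Zod in the real code) is left to the caller.

```ts
// Sketch of tolerant JSON recovery: try the raw text, then a ```json or
// plain ``` fence, then the outermost {...} span embedded in prose.
// Names and exact behavior are assumptions, not the repo's implementation.
function parseJsonFromResponse(text: string): unknown {
  const trimmed = text.trim();
  if (trimmed.length === 0) {
    throw new Error("empty judge response");
  }
  const candidates: string[] = [trimmed];
  // ```json ... ``` or plain ``` ... ``` fences
  const fence = trimmed.match(/```(?:json)?\s*([\s\S]*?)```/);
  if (fence) candidates.push(fence[1].trim());
  // JSON object wrapped in prose: take the outermost brace span
  const first = trimmed.indexOf("{");
  const last = trimmed.lastIndexOf("}");
  if (first !== -1 && last > first) {
    candidates.push(trimmed.slice(first, last + 1));
  }
  for (const candidate of candidates) {
    try {
      return JSON.parse(candidate);
    } catch {
      // fall through to the next recovery strategy
    }
  }
  throw new Error("judge response contained no parseable JSON");
}
```

The parsed value would then be handed to a Zod schema's `parse()` for the validation step the summary mentions.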

Unifies authentication: ANTHROPIC_API_KEY, ANTHROPIC_BASE_URL, and any Claude Code credentials now apply to both the main agent and every evolution judge. Judge voting logic, prompts, schemas, and the 5-gate validation pipeline are unchanged. Existing deployments continue to work without configuration changes.

Judges previously imported Anthropic and zodOutputFormat from @anthropic-ai/sdk and held their own singleton client. They now delegate to runtime.judgeQuery(), which reuses the Agent SDK subprocess, so a single code path and a single credential store drive both tiers.
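
The delegation can be sketched as follows. This is an illustrative shape only: `runSubprocessQuery` stands in for the Agent SDK `query()` call, and the method names besides `judgeQuery` are assumptions, not the repo's real API.

```ts
// Sketch: judges funnel through one runtime method instead of each
// holding a private Anthropic client. `runSubprocessQuery` is a
// placeholder for the shared Agent SDK subprocess call.
type Validator<T> = (raw: unknown) => T;

class Runtime {
  constructor(
    private runSubprocessQuery: (prompt: string) => Promise<string>,
  ) {}

  // Every judge goes through here, so credentials and transport are
  // configured exactly once, alongside the main agent.
  async judgeQuery<T>(prompt: string, validate: Validator<T>): Promise<T> {
    const text = await this.runSubprocessQuery(prompt);
    return validate(JSON.parse(text));
  }
}
```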

Test plan

  • bun test: 838 pass, 0 fail (up from 825 with 13 new parser tests in src/agent/__tests__/judge-query.test.ts)
  • bun run typecheck clean
  • bun run lint clean
  • No file in src/ imports @anthropic-ai/sdk anymore
  • Judge voting logic (minority_veto, majority, unanimous) byte-for-byte unchanged
  • parseJsonFromResponse handles raw JSON, ```json fences, plain ``` fences, prose-wrapped JSON, and throws clear errors on empty / non-JSON / malformed / schema-violating output

@mcheemaa mcheemaa merged commit 7b455a7 into main Apr 12, 2026
1 check passed
@mcheemaa mcheemaa deleted the phase1/judge-subprocess branch April 12, 2026 04:30

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 64f036ae82


```ts
  options: JudgeQueryOptions<T>,
): Promise<JudgeQueryResult<T>> {
  const startTime = Date.now();
  const resolvedModel = options.model ?? config.judge_model ?? config.model;
```

P2: Honor judge_model override in judge selection

runJudgeQuery resolves the model as options.model ?? config.judge_model ?? config.model, but every judge wrapper still passes a hard-coded model into callJudge (for example, the Sonnet/Haiku constants in the judge modules), so config.judge_model is never reached in practice. Operators who set judge_model expecting to shift judge traffic to a cheaper/faster tier will see no behavior change and continue paying for the hard-coded models.
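
The resolution order itself is correct; the fix is for judge wrappers to leave `model` unset unless a caller explicitly overrides it, so `config.judge_model` can take effect. A minimal sketch, with types assumed:

```ts
// Sketch of the resolution order the review describes. The fix: judge
// wrappers pass `model: undefined` instead of a hard-coded constant,
// letting the operator's judge_model tier win. Types are assumptions.
interface JudgeConfig {
  model: string;
  judge_model?: string;
}
interface JudgeQueryOptions {
  model?: string;
}

function resolveJudgeModel(options: JudgeQueryOptions, config: JudgeConfig): string {
  // Explicit per-call override wins, then the operator's judge tier,
  // then the main agent's model.
  return options.model ?? config.judge_model ?? config.model;
}
```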


Comment on lines +121 to +123

```ts
  maxTurns: 1,
  effort: "low",
  persistSession: false,
```

P2: Apply maxTokens when issuing judge subprocess queries

The new judge path keeps maxTokens in the public options shape and forwards it from callJudge, but runJudgeQuery never includes that value in the SDK query() options. This silently drops token caps that previously bounded judge responses, which can increase latency/cost or make long judge outputs fail unpredictably when callers rely on that limit.
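
One way to address this is to attach the cap only when a caller supplies it, so subprocess defaults apply otherwise. A sketch under the assumption that the query options accept a `maxTokens` field (the field name and option shape here are illustrative, not confirmed from the repo):

```ts
// Sketch: forward a caller-supplied maxTokens into the subprocess query
// options instead of silently dropping it. Field names are assumptions.
interface SubprocessOptions {
  maxTurns: number;
  effort: string;
  persistSession: boolean;
  maxTokens?: number;
}

function buildQueryOptions(maxTokens?: number): SubprocessOptions {
  const opts: SubprocessOptions = {
    maxTurns: 1,
    effort: "low",
    persistSession: false,
  };
  // Only attach the cap when the caller provided one, so subprocess
  // defaults apply otherwise.
  if (maxTokens !== undefined) opts.maxTokens = maxTokens;
  return opts;
}
```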

