Skip to content

feat(acp): emit agent_error session/update when LLM call fails after retries#26352

Closed
truenorth-lj wants to merge 3 commits into
anomalyco:devfrom
truenorth-lj:feat/halt-emits-agent-error
Closed

feat(acp): emit agent_error session/update when LLM call fails after retries#26352
truenorth-lj wants to merge 3 commits into
anomalyco:devfrom
truenorth-lj:feat/halt-emits-agent-error

Conversation

@truenorth-lj

@truenorth-lj truenorth-lj commented May 8, 2026

Copy link
Copy Markdown

Issue for this PR

Closes #24494
Closes #28453

Stacks on #26306, which adds the agent_error SessionUpdate kind and the LLMErrorPayload shape this PR uses. Diffs from that PR are included here until it merges.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement / new feature (non-breaking change which adds functionality)

What does this PR do?

When the retry policy in session/retry.ts is exhausted, session/processor.ts halt() records the failure internally but does not emit an ACP frame to the connected client. From an ACP client perspective, the turn can look silently stuck: no final error notification and no stopReason.

This PR adds a case "session.error": branch in packages/opencode/src/acp/agent.ts that translates the SDK error variant into an LLMErrorPayload and emits the new agent_error session/update kind with stopReason: "error".

The payload extraction prefers classification headers when present:

  • x-llm-error-type: budget | rate_limit | provider_unavailable | context_overflow | auth | unknown
  • x-llm-error-retryable: "true" / "false" (overrides type-derived retryable)
  • x-llm-error-reset-at: epoch ms - populates reset_at_epoch_ms
  • retry-after: seconds - populates retry_after_seconds

It falls back to status-code heuristics when headers are absent (401 -> auth, 5xx -> provider_unavailable, else -> unknown). ProviderAuthError and ContextOverflowError SDK variants are mapped by name.

Two session.error variants are intentionally not emitted as agent_error:

  • ContextOverflowError: compaction handles this path and the turn can continue on a smaller context window.
  • MessageAbortedError: this is the normal user-stop path triggered by session/cancel; the prompt RPC reports cancellation via stopReason: "cancelled", so emitting agent_error would make one user action look like both an error and a cancellation.

The existing event subscription already receives session.error events through the SDK event stream. No bus, SDK, or event-stream wiring changes are required.

How did you verify your code works?

packages/opencode/test/acp/halt-emits-agent-error.test.ts covers:

  • APIError classification headers -> typed payload fields preserved
  • fallback status-code classification for 503 and 401
  • ContextOverflowError and ProviderAuthError mapping
  • explicit retryability override from headers
  • session.error emits exactly one agent_error update for a classified API error
  • ContextOverflowError emits no agent_error
  • MessageAbortedError emits no agent_error
  • unknown session IDs emit no update
$ bun test test/acp/halt-emits-agent-error.test.ts
 11 pass
 0 fail
 24 expect() calls

$ bun run typecheck
# clean

# push hook
$ bun turbo typecheck
 Tasks:    12 successful, 12 total

Screenshots / recordings

N/A - backend / ACP wire change. Client rendering of the agent_error frame is downstream consumer work.

Checklist

  • All tests pass locally
  • Code follows the existing event-handling style
  • Diff is confined to ACP error mapping / event handling and tests
  • No new dependencies, schema migrations, or breaking protocol changes
  • bun run typecheck clean

truenorth-lj and others added 2 commits May 8, 2026 15:53
…or propagation

Phase 1B of the LLM error propagation refactor. This is the TS mirror
of the Python contract merged in tn-mono PR anomalyco#721 (commit 884bcf71).

Why a new file instead of an ambient .d.ts augmentation: the SDK's
SessionUpdate is a closed `type` alias (discriminated union), not an
`interface`. TypeScript declaration merging only works on interfaces,
so we cannot extend SessionUpdate via a `.d.ts` patch. The clean
TS-native approach is a local extended type that consumers opt into.

Adds:
- LLMErrorType: closed string union of 6 categories (budget, rate_limit,
  provider_unavailable, context_overflow, auth, unknown)
- LLMErrorPayload: snake_case wire shape mirroring Python LLMErrorPayload;
  retryable is on-the-wire explicit even though derivable from type
- AgentErrorUpdate: { sessionUpdate: "agent_error"; error; stopReason? }
- SessionUpdateWithAgentError: SessionUpdate | AgentErrorUpdate (Phase 4
  emit site at session/processor.ts halt() will use this)
- isRetriable(type): TS mirror of the Python is_retriable() classifier
- isAgentErrorUpdate(value): type guard for narrowing unknown frames

12 unit tests cover: classifier per type, vocabulary stability, type-guard
positive/negative paths, JSON round-trip, compile-time discriminated
union narrowing.

Spec: specs/20260508-llm-error-propagation/spec.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…retries

When the retry policy in session/retry.ts is exhausted, halt() updates
internal state (ctx.assistantMessage.error, Bus.Session.Event.Error,
EventV2.SessionEvent.Step.Failed.Sync) but never emits any ACP frame
to the connected client. From an ACP client's perspective the turn is
silently stuck — no stopReason, no error notification, no final
session/update — and the only signal is whatever timeout the client
imposes locally.

Add a session.error case in acp/agent.ts handleEvent() that translates
the SDK error variant into an LLMErrorPayload and emits the
agent_error session/update kind with stopReason: "error". The payload
prefers headers set by an upstream classifying proxy
(x-llm-error-type / x-llm-error-retryable / x-llm-error-reset-at /
retry-after) over status-code heuristics.

ContextOverflowError is intentionally NOT emitted — halt() routes that
variant into in-process compaction; the turn continues on a smaller
context window rather than ending in an error state.

Tests cover the SDK→LLMErrorPayload mapping for every error variant
(APIError with classification headers, APIError status-code fallback,
ContextOverflowError, ProviderAuthError, explicit retryable header
override) and the integration path that pushes a session.error event
through the agent's event subscription and asserts the resulting
agent_error session/update.

Builds on anomalyco#26306 (which adds the agent_error SessionUpdate kind +
LLMErrorPayload type definitions). Closes anomalyco#26350.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

github-actions Bot commented May 8, 2026

Copy link
Copy Markdown
Contributor

The following comment was made by an LLM, it may be inaccurate:

Based on my search, PR #26306 is a related PR but not a duplicate:

The other PRs found (26292, 26192, 19116, 24369, 18443) address different aspects of retry/error handling but are not duplicates of the current PR.

Conclusion: No duplicate PRs found. The only related PR is #26306 (a required dependency that this PR builds upon).

@github-actions

github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Automated PR Cleanup

Thank you for contributing to opencode.

Due to the high volume of PRs from users and AI agents, we periodically close older PRs using automated criteria so maintainers can focus review time on the most active and community-supported contributions.

This PR was closed because it matched the following cleanup criteria:

  • The PR was created more than 1 month ago
  • The PR had fewer than 2 positive reactions
  • Positive reactions are counted as thumbs-up, heart, celebration, or rocket reactions on the PR

PRs created within the last month are not affected by this cleanup.

If you believe this PR was closed incorrectly, or if you are still actively working on it, please leave a comment explaining why it should be reopened. A maintainer can review and reopen it if appropriate.

Thanks again for taking the time to contribute.

@alshain

alshain commented Jun 9, 2026

Copy link
Copy Markdown

Leaving a comment so the bot will reopen. I'm still affected by this behavior when using ACP.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

2 participants