Sessions fail on transient network errors instead of retrying #30611

@literally-dan

Description

The session retry path only treats ECONNRESET as retryable. Other transient transport failures are classified as hard errors, so a brief network problem kills the assistant turn instead of being retried by the existing retry policy.

Cases that currently fail the session outright include system error codes such as ETIMEDOUT, ENOTFOUND, EAI_AGAIN, ECONNREFUSED, EHOSTUNREACH, ENETUNREACH, and EPIPE; request timeouts surfaced as a TimeoutError; and bare fetch/undici errors whose only signal is the message text, such as "fetch failed", "socket hang up", "terminated", and "other side closed".
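To make the three signal types above concrete, here is a rough sketch of what a widened classifier could look like. It is not the code in MessageV2.fromError: the names isTransientTransportError, RETRYABLE_CODES, RETRYABLE_MESSAGES, and field are hypothetical, and the cause-chain walk assumes the system error is attached as `cause` on the outer "fetch failed" error, as Node's fetch usually does.

```typescript
// Illustrative sketch only: none of these names come from the OpenCode source.

// System error codes listed in this issue, plus ECONNRESET (already retryable today).
const RETRYABLE_CODES = new Set<string>([
  "ECONNRESET", "ETIMEDOUT", "ENOTFOUND", "EAI_AGAIN",
  "ECONNREFUSED", "EHOSTUNREACH", "ENETUNREACH", "EPIPE",
]);

// Message texts for bare fetch/undici errors that carry no code.
const RETRYABLE_MESSAGES = new Set<string>([
  "fetch failed", "socket hang up", "terminated", "other side closed",
]);

// Safely read one property from a value of unknown shape.
function field(value: unknown, key: "code" | "name" | "message" | "cause"): unknown {
  return typeof value === "object" && value !== null
    ? (value as Record<string, unknown>)[key]
    : undefined;
}

function isTransientTransportError(err: unknown): boolean {
  let current: unknown = err;
  // Walk a bounded number of `cause` links so a cyclic chain cannot loop forever.
  for (let depth = 0; depth < 5 && current !== undefined; depth++) {
    const code = field(current, "code");
    if (typeof code === "string" && RETRYABLE_CODES.has(code)) return true;

    // Request timeouts surface with name "TimeoutError".
    if (field(current, "name") === "TimeoutError") return true;

    // Exact match (after trimming and lowercasing) rather than substring match,
    // so that an unrelated message containing "terminated" is not retried.
    const message = field(current, "message");
    if (typeof message === "string" && RETRYABLE_MESSAGES.has(message.trim().toLowerCase())) {
      return true;
    }
    current = field(current, "cause");
  }
  return false;
}
```

Exact message matching is a deliberate choice in this sketch. A substring match would catch more variants but would also retry errors that merely mention one of these phrases.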

On a flaky connection this shows up as the turn ending with an error that a retry would have recovered from.

This is a different layer from #21893 (provider/stream-layer stream_read_error and wrapped rate-limit errors) and from #20822 (a broader "retry UnknownError by default" strategy). This issue is specifically about transport/socket-level failures in MessageV2.fromError.

I have a fix that widens the retryable classification while keeping a real user cancel non-retryable.
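One way to keep a real user cancel non-retryable is to check for it before any transient classification runs. The sketch below is an assumption about how that guard could look, not the actual fix: isUserCancel and its userSignal parameter are hypothetical names. It relies on the standard behavior that AbortController.abort() rejects with an error named AbortError, while AbortSignal.timeout() rejects with one named TimeoutError.

```typescript
// Hypothetical helper: the name and signature are illustrative, not taken from the fix.
// `userSignal` is assumed to be the signal tied only to the user's cancel action,
// not a combined signal that also fires on timeout.
function isUserCancel(err: unknown, userSignal?: AbortSignal): boolean {
  // If the user's own signal fired, treat the failure as a cancel
  // no matter how the error surfaced.
  if (userSignal?.aborted) return true;

  // Otherwise fall back to the error name. A TimeoutError is deliberately
  // not a cancel here, so it stays eligible for retry.
  const name =
    typeof err === "object" && err !== null
      ? (err as { name?: unknown }).name
      : undefined;
  return name === "AbortError";
}
```

Running this check first means a cancelled turn never enters retry, even if the underlying socket error that accompanies the cancel would otherwise look transient.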

Plugins

None

OpenCode version

1.15.13

Steps to reproduce

  1. Start an assistant turn that makes a provider request.
  2. Trigger a transient transport failure (drop the network briefly, or force a DNS failure / connection timeout) so the error surfaces as one of the system codes above, a TimeoutError, or a bare "fetch failed" / "socket hang up".
  3. The session ends in error instead of entering retry, even though the existing retry policy would have recovered.
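For a reproduction that does not depend on actually dropping the network, the error shapes from step 2 can be simulated and passed to the classification code in a unit test. The helpers below are hypothetical fixtures, not part of OpenCode, and the shapes are an approximation of what Node's fetch and AbortSignal.timeout() produce.

```typescript
// Hypothetical test fixtures: names and shapes are illustrative.

// Approximates the error Node's fetch throws on a socket-level failure:
// a TypeError("fetch failed") whose `cause` carries the system error code.
function makeFetchFailure(code: string): TypeError {
  const cause = Object.assign(new Error(`simulated ${code}`), { code });
  return Object.assign(new TypeError("fetch failed"), { cause });
}

// Approximates the rejection produced when a request timeout fires.
function makeTimeoutError(): Error {
  return Object.assign(new Error("The operation timed out."), { name: "TimeoutError" });
}

// Approximates a bare error whose only signal is its message text.
function makeBareError(message: string): Error {
  return new Error(message);
}
```

Passing makeFetchFailure("ENOTFOUND"), makeTimeoutError(), or makeBareError("socket hang up") through MessageV2.fromError should show whether each one is classified as retryable.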

Screenshot and/or share link

No response

Operating System

Linux 6.17.0-29-generic

Terminal

foot
