fix(session): retry empty stream truncations and discard partial parts by edevil · Pull Request #26167 · anomalyco/opencode

edevil · 2026-05-07T11:04:20Z

Issue for this PR

Type of change

Bug fix

What does this PR do?

When an upstream provider stream ends without a proper stop_reason, the AI
SDK emits a fallback finish with zero output tokens. opencode previously
accepted this as a normal end-of-step, persisting a truncated message with no
error and no retry. The user got a half-finished response and had to manually
re-prompt.

This PR detects the truncation pattern at the session-processor layer,
surfaces it as a retryable APIError (capped at 3 attempts), and discards the
parts the failed attempt persisted so a successful retry replaces — rather than
appends to — the truncated content.

The trigger condition

When the upstream provider stream is cut mid-generation, the AI SDK flushes a
finish-step whose normalized reason is "unknown" with usage.outputTokens
of 0:

{ type: "text-delta", delta: "..." }
{ type: "text-delta", delta: "..." }   // ← upstream stream cuts here
{ type: "finish-step", reason: "unknown", usage: { outputTokens: 0 } }
                               ↑
              AI SDK's "no stop reason was given" fallback (provider "other")

opencode's session processor receives the finish-step with
value.reason === "unknown" and usage.tokens.output === 0. Pre-fix, the
processor accepts that as a legitimate end-of-step.
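The check itself is small. Below is a minimal TypeScript sketch of it. It assumes a simplified, hypothetical FinishStep shape that carries only the reason and the output-token count, not opencode's actual processor types.

```typescript
// Hypothetical, simplified view of a finish-step event: only the fields the
// truncation check needs. Not opencode's or the AI SDK's real type.
interface FinishStep {
  reason: string // normalized finish reason, e.g. "stop" or "unknown"
  outputTokens: number
}

// A finish-step carrying the SDK's "no stop reason" fallback ("unknown")
// together with zero output tokens is treated as a cut stream, not a
// legitimate end of step.
function isEmptyTruncation(step: FinishStep): boolean {
  return step.reason === "unknown" && step.outputTokens === 0
}

console.log(isEmptyTruncation({ reason: "unknown", outputTokens: 0 })) // true
console.log(isEmptyTruncation({ reason: "stop", outputTokens: 412 })) // false
console.log(isEmptyTruncation({ reason: "unknown", outputTokens: 37 })) // false
```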

Symptom (real-world evidence)

I found more than a dozen instances of this exact bug pattern across my own
opencode session database, spanning two providers (anthropic, openai)
and four models (gpt-5.3-codex, claude-opus-4-6, claude-opus-4-7,
claude-haiku-4-5). All exhibit the same shape:

// assistant message stored after the truncation
{
  "role": "assistant",
  "providerID": "anthropic",
  "modelID": "claude-opus-4-7",
  "finish": "other",
  "tokens": { "input": 0, "output": 0, "reasoning": 0, "cache": {...} },
  "cost": 0
}

// the corresponding step-finish part
{
  "type": "step-finish",
  "reason": "other",
  "tokens": { "input": 0, "output": 0, "reasoning": 0 },
  "cost": 0
}
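The PR does not say how these instances were found. As a hypothetical illustration, the filter below runs over already-loaded message records. It matches the signature above (finish "other", zero output tokens, zero cost) and tallies the hits per provider/model. The StoredMessage shape is an assumption modeled on the JSON above, not opencode's storage API, and the sample records are made up.

```typescript
// Hypothetical record shape modeled on the stored message above.
interface StoredMessage {
  id: string
  role: "assistant" | "user"
  providerID: string
  modelID: string
  finish?: string
  tokens: { input: number; output: number; reasoning: number }
  cost: number
}

// Matches the truncation signature: an assistant message that finished with
// "other", produced zero output tokens, and cost nothing.
function isTruncatedMessage(m: StoredMessage): boolean {
  return m.role === "assistant" && m.finish === "other" && m.tokens.output === 0 && m.cost === 0
}

// Tally matches per "provider/model" to see how widespread the pattern is.
function countTruncationsByModel(messages: StoredMessage[]): Map<string, number> {
  const counts = new Map<string, number>()
  for (const m of messages) {
    if (!isTruncatedMessage(m)) continue
    const key = `${m.providerID}/${m.modelID}`
    counts.set(key, (counts.get(key) ?? 0) + 1)
  }
  return counts
}

// Illustrative sample records (not real data).
const zero = { input: 0, output: 0, reasoning: 0 }
const sample: StoredMessage[] = [
  { id: "m1", role: "assistant", providerID: "anthropic", modelID: "claude-opus-4-7", finish: "other", tokens: zero, cost: 0 },
  { id: "m2", role: "assistant", providerID: "anthropic", modelID: "claude-opus-4-7", finish: "stop", tokens: { ...zero, output: 120 }, cost: 0.01 },
  { id: "m3", role: "assistant", providerID: "openai", modelID: "gpt-5.3-codex", finish: "other", tokens: zero, cost: 0 },
]
const counts = countTruncationsByModel(sample)
console.log(counts)
```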

Mid-stream cut, not a model decision: in one diagnostic example, the
reasoning text literally ends mid-word — "...really just wrapping the existing whichlang::detect_language() functi". The upstream stream was
severed before the next chunk arrived.

User-visible behavior pre-fix: the session stores a half-finished
message with no error, no retry, no recovery. In one observed session the
user manually re-prompted ~111s later, succeeded for 3 turns, hit the bug
again, re-prompted again — the "session degradation" pattern users report
in #16214.

The fix

processor.ts: Detect the truncation (value.reason === "unknown" with zero
output tokens) on finish-step and fail the stream with a retryable APIError
tagged metadata.code = "EmptyOther".

retry.ts: Cap EmptyOther retries at 3 attempts so a misbehaving provider can't
loop forever. Other retryable classifications keep their existing unbounded
behavior. The retry policy's set callback now also receives the parsed error
so the processor can decide whether to discard.

message-v2.ts: Add a case APIError.isInstance(e) to fromError that converts
the class instance to its wire form, so the structured message and metadata
reach the TUI instead of being wrapped in a generic UnknownError whose payload
is the JSON-stringified original.

processor.ts (discard): On retry, drop the parts the failed attempt persisted
(see below) so the message reflects only the final attempt.
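The message-v2.ts change is a type-guarded passthrough placed ahead of the generic fallback. Here is a minimal sketch of that idea. It uses hypothetical stand-ins: this APIError class, its toObject() method, and the WireError union are assumptions for illustration, not MessageV2's actual definitions (the real APIError is built with namedSchemaError).

```typescript
// Hypothetical stand-ins for the wire-format errors the TUI receives.
interface APIErrorData {
  message: string
  isRetryable: boolean
  metadata?: Record<string, string>
}

type WireError =
  | { name: "APIError"; data: APIErrorData }
  | { name: "UnknownError"; data: { message: string } }

// Hypothetical class form of the error, as thrown inside the processor.
class APIError extends Error {
  constructor(readonly data: APIErrorData) {
    super(data.message)
    this.name = "APIError"
  }

  static isInstance(e: unknown): e is APIError {
    return e instanceof APIError
  }

  // Convert the class instance to its plain { name, data } wire form.
  toObject(): WireError {
    return { name: "APIError", data: this.data }
  }
}

function fromError(e: unknown): WireError {
  // The new case: pass structured APIError instances through intact.
  if (APIError.isInstance(e)) return e.toObject()
  // Simplified generic fallback for anything unrecognized.
  return { name: "UnknownError", data: { message: e instanceof Error ? e.message : String(e) } }
}

const wire = fromError(
  new APIError({
    message: "Provider stream ended without a stop reason",
    isRetryable: true,
    metadata: { code: "EmptyOther" },
  }),
)
console.log(wire.name) // APIError
```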

Discarding the truncated attempt

A naive "remove the partial text on retry" would leave the message in an
inconsistent state — earlier iterations only tracked the text/reasoning parts,
so the step-start part created at the top of each attempt was left behind and
piled up one orphan per retry (the "weird ux" raised in review).

This PR instead records a partFloor (a part ID captured just before the first
attempt of each process() call) and, when discarding, removes every part the
attempt created after that floor (step-start, text, reasoning, etc.), so no
orphans remain. The assistant message is created fresh per process() call, so
the floor scopes removal precisely to this turn's output.
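The in-memory sketch below makes the floor idea concrete. It assumes part IDs sort in creation order, modeled here with a hypothetical zero-padded counter (nextPartId). It also uses a hypothetical PartStore in place of opencode's session storage. The floor is a fresh ID taken before the first attempt, so every part an attempt creates compares greater than it.

```typescript
type PartType = "step-start" | "text" | "reasoning"

interface Part {
  id: string
  messageID: string
  type: PartType
}

// Hypothetical ascending ID generator: zero-padding keeps string order equal
// to creation order, which is what the floor comparison relies on.
let counter = 0
function nextPartId(): string {
  counter += 1
  return `prt_${String(counter).padStart(10, "0")}`
}

// Hypothetical in-memory stand-in for persisted parts.
class PartStore {
  private parts: Part[] = []

  add(messageID: string, type: PartType): Part {
    const part: Part = { id: nextPartId(), messageID, type }
    this.parts.push(part)
    return part
  }

  list(messageID: string): Part[] {
    return this.parts.filter((p) => p.messageID === messageID)
  }

  // Remove every part of this message created after the floor, whatever its
  // type (step-start included), and return how many were removed.
  discardAfter(messageID: string, floor: string): number {
    const before = this.parts.length
    this.parts = this.parts.filter((p) => p.messageID !== messageID || p.id <= floor)
    return before - this.parts.length
  }
}

const store = new PartStore()
const msg = "msg_1"
const floor = nextPartId() // captured once, before the first attempt

// Attempt 1 is truncated after creating a step-start and some text.
store.add(msg, "step-start")
store.add(msg, "text")
console.log(store.discardAfter(msg, floor)) // 2

// Attempt 2 succeeds; no orphan step-start from attempt 1 remains.
store.add(msg, "step-start")
store.add(msg, "text")
console.log(store.list(msg).map((p) => p.type).join(", ")) // step-start, text
```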

The discard is deliberately scoped:

Only stream truncations (EmptyOther) trigger it. Other retryable errors
(rate limits, 5xx, decompression) retry untouched, exactly as before — this
avoids touching attempts where tools may have already executed.
It also runs when the 3-retry cap is hit, so a permanently failing
message doesn't keep the orphan parts either.
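Taken together, the retry cap and the discard scope reduce to a small decision per failed attempt. This is a sketch under stated assumptions. FailureInfo and planAfterFailure are hypothetical names, and retriesSoFar counts retries already performed. The cap is read here as three retries, because the PR's wording ("3 attempts" vs. "3-retry cap") leaves the exact off-by-one open.

```typescript
interface FailureInfo {
  isRetryable: boolean
  code?: string // e.g. "EmptyOther" for a detected truncation
}

interface Plan {
  retry: boolean
  discard: boolean
}

const EMPTY_OTHER_MAX_RETRIES = 3

function planAfterFailure(err: FailureInfo, retriesSoFar: number): Plan {
  const truncation = err.code === "EmptyOther"
  // Only truncations are capped; other retryable errors keep retrying.
  const retry = err.isRetryable && (!truncation || retriesSoFar < EMPTY_OTHER_MAX_RETRIES)
  // Discard only truncation attempts, both before a retry and on give-up.
  // Other errors (rate limits, 5xx) never discard, since tools may have run.
  return { retry, discard: truncation }
}

console.log(planAfterFailure({ isRetryable: true, code: "EmptyOther" }, 0)) // { retry: true, discard: true }
console.log(planAfterFailure({ isRetryable: true, code: "EmptyOther" }, 3)) // { retry: false, discard: true }
console.log(planAfterFailure({ isRetryable: true }, 10)) // { retry: true, discard: false }
```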

On the UX side there is nothing to "flicker": an EmptyOther truncation has
zero output tokens, so the only thing discarded is an effectively empty
step-start. The existing "retrying" indicator still shows.

Scope: why processor-layer instead of provider-layer

Related #21727 catches a similar truncation pattern at the
@ai-sdk/openai-compatible provider's flush() callback, which works only
for OpenAI-compatible providers. This PR catches the same condition one
layer up, in the session processor, where it applies to all AI-SDK
providers — including Anthropic direct, Bedrock, and Vertex. The instances
I observed include Anthropic-direct cases that #21727 cannot reach. The
two PRs are independent and complementary; either order of merge is fine.

How did you verify your code works?

retry.test.ts — recognizes EmptyOther as retryable, stops retrying after
3 attempts, and round-trips APIError class instances through fromError
(preserving data.message and metadata.code). 36 pass.
prompt.test.ts — retry discards in-flight parts from the failed attempt:
asserts the retried message keeps only the final text and exactly one
step-start (i.e. the orphan is gone). 54 pass / 1 skip.
processor-effect.test.ts — reasoning state is reset across retries (no
concatenated leftovers). 15 pass.
bun typecheck adds no new errors.

Other user-visible issues this likely helps

#24132 ("coding agent often abruptly stops mid codeblock"): same symptom class;
the user has not run diagnostics to confirm the mechanism.

#13841 ("Explore subagent hangs indefinitely with Anthropic Claude Opus 4.6 -- no
timeout or recovery"): partially addresses it; satisfies one of the four
suggested fixes (the retry-cap angle, for EmptyOther only).

#16214 ("Intermittent OpenAI streamed server_error (sequence_number:2) with
gpt-5.3-codex; retries degrade session"): the fromError passthrough makes these
errors readable in the TUI, and the retry cap prevents an indefinite loop.

Checklist

I have tested my changes locally
I have not included unrelated changes in this PR

github-actions · 2026-05-07T11:04:37Z

Thanks for your contribution!

This PR doesn't have a linked issue. All PRs must reference an existing issue.

Please:

Open an issue describing the bug/feature (if one doesn't exist)
Add Fixes #<number> or Closes #<number> to this PR description

See CONTRIBUTING.md for details.

github-actions · 2026-05-07T11:05:29Z

The following comment was made by an LLM, it may be inaccurate:

Results

Found 1 related PR:

#21727 - fix: handle stream interruption for OpenAI-compatible providers
- This PR is explicitly mentioned in the current PR's description as complementary. It catches the same truncation pattern at the @ai-sdk/openai-compatible provider layer's flush() callback, while PR #26167 catches it at the session processor layer (applying to all AI-SDK providers). The description notes both are independent and either order of merge is fine.

Note: PR #26167 is the current PR being analyzed, so it correctly appears in search results but is not a duplicate of itself.

No other duplicate PRs found addressing the same issue.

rekram1-node · 2026-05-15T03:06:24Z

/review

github-actions · 2026-05-15T03:09:18Z

+            // No usage and no output means the connection was cut mid-generation,
+            // which is a transient failure that should be retried.
+            if (value.finishReason === "other" && usage.tokens.output === 0) {
+              return yield* Effect.fail(


Suggestion for the human to decide: this failure happens after stream parts may already have been persisted on the current assistant message. Because Effect.retry(...) wraps the stream before cleanup() runs, a retry will start a new stream on the same message without removing the partial text/reasoning parts from the truncated attempt, so a successful retry can leave the original truncated content plus the retried response in the final assistant message. Consider clearing the in-flight attempt parts before retrying, or moving this detection earlier to a place where no partial parts have been committed.

Good catch — verified the concern is real:

Effect.ensuring(cleanup()) wraps the retry, so cleanup() only runs at the very end of the whole chain.

ctx.currentText / ctx.reasoningMap persist across retry attempts (closure-captured).

text-start and reasoning-start call session.updatePart(...) immediately, so partial parts are already in SQLite by the time finish-step fires.

Pushed a fix in 0a09591b2:

Track partIDs created during each attempt on ctx.attemptParts (pushed in text-start / reasoning-start).

New discardAttempt() helper deletes those parts via session.removePart(...) and resets currentText / reasoningMap / snapshot.

Hooked into the retry policy's set callback so it fires only when a retry will actually happen. Terminal failures (no retry) route through halt and keep the partial content as user-visible context.

Note this is a pre-existing issue affecting all retryable mid-stream errors (ECONNRESET, ZlibError, SSE timeout, etc.); the EmptyOther path just makes it more frequent. The fix applies uniformly to all of them.

Added an it.instance regression test (retry discards in-flight parts from the failed attempt) that pushes a truncated reply followed by a clean success and asserts the final message contains only the retried text.

github-actions · 2026-05-15T03:09:24Z

+            // No usage and no output means the connection was cut mid-generation,
+            // which is a transient failure that should be retried.
+            if (value.finishReason === "other" && usage.tokens.output === 0) {
+              return yield* Effect.fail(


Small style-guide suggestion, optional for the human to decide: in Effect.gen / Effect.fn, this repo prefers yield* new MyError(...) for direct typed-error failures instead of wrapping the error with Effect.fail(...). This branch could be written as return yield* new MessageV2.APIError({ ... }) while preserving the same behavior.

The yield* new MyError(...) pattern requires Schema.TaggedErrorClass-derived errors (Effect's YieldableError). MessageV2.APIError is built with namedSchemaError (message-v2.ts:51), which extends Error directly without [Symbol.iterator]. The suggested form fails to compile:

src/session/processor.ts: error TS2488: Type 'NamedSchemaError' must have a '[Symbol.iterator]()' method that returns an iterator.

All 14 existing yield* new ... sites in src/ use Schema.TaggedErrorClass (UpgradeFailedError, CliError, PhotonUnavailableError, RejectedError, etc.). Migrating MessageV2.APIError and its siblings (AbortedError, OutputLengthError, AuthError, ContextOverflowError) from namedSchemaError to Schema.TaggedErrorClass would change the wire schema ({ name, data } → { _tag, ... }) and break SDK consumers — out of scope for this PR.

Keeping the Effect.fail(new MessageV2.APIError(...)) form.
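The compile error follows from a plain language rule: yield* only accepts an iterable. The toy sketch below is not Effect's actual machinery; YieldableLikeError, run, and attempt are hypothetical. It shows why an error class needs a [Symbol.iterator] to support yield* new MyError(...). It also shows why a plain Error subclass without one, like the namedSchemaError-based classes described above, does not qualify.

```typescript
// A plain Error subclass: no [Symbol.iterator], so `yield* new PlainError()`
// is rejected by the compiler (TS2488) and would throw a TypeError at runtime.
class PlainError extends Error {}

// A "yieldable" error: iterating it yields the error itself, which lets a
// generator-based runner treat `yield* new YieldableLikeError(...)` as failure.
class YieldableLikeError extends Error {
  *[Symbol.iterator](): Generator<YieldableLikeError, void, unknown> {
    yield this
  }
}

type Result<A> = { ok: true; value: A } | { ok: false; error: Error }

// Toy runner: any yielded value is treated as a failure and stops the body.
function run<A>(body: () => Generator<Error, A, unknown>): Result<A> {
  const step = body().next()
  return step.done ? { ok: true, value: step.value } : { ok: false, error: step.value }
}

function attempt(outputTokens: number): Result<number | void> {
  return run(function* () {
    if (outputTokens === 0) {
      return yield* new YieldableLikeError("Provider stream ended without a stop reason")
    }
    return outputTokens
  })
}

console.log(Symbol.iterator in new PlainError("x")) // false
console.log(attempt(0).ok) // false
console.log(attempt(12).ok) // true
```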

edevil · 2026-05-16T07:28:09Z

/review

energyd · 2026-05-27T04:35:40Z

+            if (value.reason === "unknown" && usage.tokens.output === 0) {
+              return yield* Effect.fail(
+                new MessageV2.APIError({
+                  message: "Provider stream ended without a stop reason",
+                  isRetryable: true,
+                  metadata: { code: "EmptyOther" },
+                }),
+              )
+            }


Suggest extending this check to also catch finish_reason: "stop" with zero output tokens.

Hit the same failure shape on Azure-served gpt-5.5 via the OpenAI-compatible adapter:

Assistant turn finished cleanly: finish_reason: stop, 0 output tokens, no text or tool parts

Every subsequent user message returned the same empty shape

Same "session degradation" pattern you describe in this PR

Why the current check misses it: the PR guards on reason === "unknown", which is the AI SDK fallback when the stream ends without a stop_reason. In my case the /chat/completions stream still emitted finish_reason: stop in the final chunk despite carrying no content. My turn slipped through with reason: stop and got persisted.

Suggested extension (reuses ctx.attemptParts from this PR so it doesn't trip on legitimate text-emitting stop turns):

if (
  usage.tokens.output === 0 &&
  ctx.attemptParts.length === 0 &&
  (value.reason === "unknown" || value.reason === "stop")
) {
  return yield* Effect.fail(
    new MessageV2.APIError({
      message: "Provider returned empty stream",
      isRetryable: true,
      metadata: { code: "EmptyStream" },
    }),
  )
}
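The sketch below compares the two guards side by side. It uses a hypothetical StepSummary in which partsEmitted stands in for ctx.attemptParts.length. Both functions transcribe the conditions quoted above and are not code from the PR.

```typescript
// Hypothetical summary of one finished step.
interface StepSummary {
  reason: string // normalized finish reason
  outputTokens: number
  partsEmitted: number // stand-in for ctx.attemptParts.length
}

// The guard as written in this PR (EmptyOther).
function prGuard(s: StepSummary): boolean {
  return s.reason === "unknown" && s.outputTokens === 0
}

// The suggested extension (EmptyStream).
function suggestedGuard(s: StepSummary): boolean {
  return s.outputTokens === 0 && s.partsEmitted === 0 && (s.reason === "unknown" || s.reason === "stop")
}

const azureEmptyStop: StepSummary = { reason: "stop", outputTokens: 0, partsEmitted: 0 }
const cutAfterText: StepSummary = { reason: "unknown", outputTokens: 0, partsEmitted: 2 }
const normalStop: StepSummary = { reason: "stop", outputTokens: 250, partsEmitted: 2 }

for (const [name, s] of Object.entries({ azureEmptyStop, cutAfterText, normalStop })) {
  console.log(name, prGuard(s), suggestedGuard(s))
}
// azureEmptyStop false true
// cutAfterText true false
// normalStop false false
```

As written, the partsEmitted === 0 clause also applies to the "unknown" branch. A truncated step that has already emitted text or reasoning parts (cutAfterText here, like the mid-word example in the PR description) would therefore pass the PR's guard but not the suggested one.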

rekram1-node · 2026-06-03T20:01:14Z

alright, review time

rekram1-node · 2026-06-04T05:21:34Z

im not sure we can just discard attempts like this without some other changes too, i think it lends itself to a weird ux potentially

Detect provider stream truncation (finish reason "unknown" with zero output tokens) and retry it as a transient failure, capped at 3 attempts. On an EmptyOther retry — and when the retry cap is hit — discard the parts the failed attempt persisted (everything created after a per-call part floor) so the message reflects only the final attempt instead of accumulating an orphan step-start / partial text or reasoning. The discard is scoped to truncations; other retryable errors (rate limits, 5xx) retry untouched. Surface APIError instances through MessageV2.fromError so the TUI receives the structured message and metadata. Refs anomalyco#14108

edevil · 2026-06-05T14:38:03Z

yeah that's fair, the original version was half-doing it which is what made it weird. reworked it so the discard is way more targeted:

it only runs on empty-stream truncations now (the EmptyOther case), not every retry. rate limits / 5xx etc retry untouched like before, so no behavior change there
when it does discard, it now removes everything the failed attempt created (via a per-call part floor), not just the text/reasoning. that was the actual bug — we were leaving an orphan step-start behind on each retry, so you'd get duplicate step separators piling up
also discards on the final give-up (when the 3-retry cap is hit) so a failed message doesn't keep the leftover either

On the ux side: for empty truncations there's no visible content to flicker (0 output tokens, it's basically just the step-start), so nothing the user was reading disappears. the existing "retrying" indicator still shows. i deliberately left the broader "discard partial content on any mid-stream retry" case out of scope since that's where the tool side-effect / flicker concerns actually live.

Also rebased onto latest dev and squashed to a single commit.

github-actions · 2026-06-07T22:25:29Z

Automated PR Cleanup

Thank you for contributing to opencode.

Due to the high volume of PRs from users and AI agents, we periodically close older PRs using automated criteria so maintainers can focus review time on the most active and community-supported contributions.

This PR was closed because it matched the following cleanup criteria:

The PR was created more than 1 month ago
The PR had fewer than 2 positive reactions
Positive reactions are counted as thumbs-up, heart, celebration, or rocket reactions on the PR

PRs created within the last month are not affected by this cleanup.

If you believe this PR was closed incorrectly, or if you are still actively working on it, please leave a comment explaining why it should be reopened. A maintainer can review and reopen it if appropriate.

Thanks again for taking the time to contribute.

github-actions Bot added needs:issue contributor labels May 7, 2026

github-actions Bot removed the needs:issue label May 7, 2026

This was referenced May 7, 2026

fix(session): exclude orphaned interrupted tools from run-loop continuation #26178

Merged

fix(session): add fallback retry handling and harden pre-push bun path #26192

Closed

fix(session): cap retry schedule at RETRY_MAX_ATTEMPTS = 3 #26343

Closed

edevil force-pushed the fix/empty-other-stream-truncation branch from a2b30cc to 3d55892 May 13, 2026 07:36

github-actions Bot reviewed May 15, 2026


edevil force-pushed the fix/empty-other-stream-truncation branch 2 times, most recently from 0a09591 to b2fd02a May 15, 2026 14:52

YIWANG-sketch mentioned this pull request May 17, 2026

[Bug] todo-continuation-enforcer: assistant loops repeating final summary after all tasks completed code-yeongyu/oh-my-openagent#4013

Closed

edevil force-pushed the fix/empty-other-stream-truncation branch 2 times, most recently from 1e7c26d to 5bc0028 May 20, 2026 18:07

github-actions Bot mentioned this pull request May 22, 2026

fix(session): normalize wrapped subagent stream errors #28898

Open


energyd reviewed May 27, 2026


lele872 mentioned this pull request May 31, 2026

ECONNRESET with zai-coding-plan provider (api.z.ai) #15350

Open

edevil force-pushed the fix/empty-other-stream-truncation branch from e0b8569 to 1d8748e June 3, 2026 11:50

edevil force-pushed the fix/empty-other-stream-truncation branch from fed896e to 03b8a31 June 5, 2026 14:27

edevil changed the title from "fix(session): retry empty stream truncations with attempt cap" to "fix(session): retry empty stream truncations and discard partial parts" Jun 5, 2026

github-actions Bot closed this Jun 7, 2026

github-actions Bot added the automated-pr-cleanup label Jun 7, 2026

rekram1-node reopened this Jun 8, 2026

This was referenced Jun 8, 2026

📊 AI CLI Tools Community Daily Report 2026-06-08 litang9/big_model_radar#31

Open

fix(cli): flush run parts after json stream idle #31483

Closed


3 participants
