
fix(session): retry empty stream truncations and discard partial parts #26167

Open
edevil wants to merge 1 commit into anomalyco:dev from edevil:fix/empty-other-stream-truncation

@edevil edevil commented May 7, 2026

Contributor

Issue for this PR

Closes #26170
Related #21727

Type of change

  • Bug fix

What does this PR do?

When an upstream provider stream ends without a proper stop_reason, the AI
SDK emits a fallback finish with zero output tokens. opencode previously
accepted this as a normal end-of-step, persisting a truncated message with no
error and no retry. The user got a half-finished response and had to manually
re-prompt.

This PR detects the truncation pattern at the session-processor layer,
surfaces it as a retryable APIError (capped at 3 attempts), and discards the
parts the failed attempt persisted so a successful retry replaces — rather than
appends to — the truncated content.

The trigger condition

When the upstream provider stream is cut mid-generation, the AI SDK flushes a
finish-step whose normalized reason is "unknown" with usage.outputTokens
of 0:

{ type: "text-delta", delta: "..." }
{ type: "text-delta", delta: "..." }   // ← upstream stream cuts here
{ type: "finish-step", reason: "unknown", usage: { outputTokens: 0 } }
// ↑ AI SDK's "no stop reason was given" fallback (provider "other")

opencode's session processor receives the finish-step with
value.reason === "unknown" and usage.tokens.output === 0. Pre-fix, the
processor accepts that as a legitimate end-of-step.
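
A minimal sketch of the trigger check, written as a standalone predicate. The `FinishStep` interface and the `isEmptyTruncation` name are hypothetical and only mirror the event shape shown above; they are not the actual opencode or AI SDK types.

```typescript
// Hypothetical shape of the normalized finish-step event. Field names follow
// the event example above, not the real opencode types.
interface FinishStep {
  type: "finish-step"
  reason: string
  usage: { outputTokens: number }
}

// True when the step looks like a severed stream: the reason fell back to
// "unknown" (no stop reason arrived) and no output tokens were counted.
function isEmptyTruncation(step: FinishStep): boolean {
  return step.reason === "unknown" && step.usage.outputTokens === 0
}

// Example events: one truncated step and one ordinary end-of-step.
const truncated: FinishStep = { type: "finish-step", reason: "unknown", usage: { outputTokens: 0 } }
const normal: FinishStep = { type: "finish-step", reason: "stop", usage: { outputTokens: 128 } }
```

Both conditions are required here: under this sketch, an "unknown" reason with non-zero output tokens is left alone.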

Symptom (real-world evidence)

I found more than a dozen instances of this exact bug pattern across my own
opencode session database, spanning two providers (anthropic, openai)
and four models (gpt-5.3-codex, claude-opus-4-6, claude-opus-4-7,
claude-haiku-4-5). All exhibit the same shape:

// assistant message stored after the truncation
{
  "role": "assistant",
  "providerID": "anthropic",
  "modelID": "claude-opus-4-7",
  "finish": "other",
  "tokens": { "input": 0, "output": 0, "reasoning": 0, "cache": {...} },
  "cost": 0
}
// the corresponding step-finish part
{
  "type": "step-finish",
  "reason": "other",
  "tokens": { "input": 0, "output": 0, "reasoning": 0 },
  "cost": 0
}

Mid-stream cut, not a model decision: in one diagnostic example, the
reasoning text literally ends mid-word — "...really just wrapping the existing whichlang::detect_language() functi". The upstream stream was
severed before the next chunk arrived.

User-visible behavior pre-fix: the session stores a half-finished
message with no error, no retry, no recovery. In one observed session the
user manually re-prompted ~111s later, succeeded for 3 turns, hit the bug
again, re-prompted again — the "session degradation" pattern users report
in #16214.

The fix

  1. processor.ts — Detect the truncation (value.reason === "unknown"
    with zero output tokens) on finish-step and fail the stream with a
    retryable APIError tagged metadata.code = "EmptyOther".

  2. retry.ts — Cap EmptyOther retries at 3 attempts so a misbehaving
    provider can't loop forever. Other retryable classifications keep their
    existing unbounded behaviour. The retry set callback now also receives the
    parsed error so the processor can decide whether to discard.

  3. message-v2.ts — Add case APIError.isInstance(e) to fromError
    that converts the class instance to its wire form, so the structured
    message and metadata reach the TUI instead of being wrapped in a generic
    UnknownError whose payload is the JSON-stringified original.

  4. processor.ts (discard) — On retry, drop the parts the failed attempt
    persisted (see below) so the message reflects only the final attempt.
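
The retry cap in step 2 can be sketched roughly as follows. `ParsedError`, `shouldRetry`, and `EMPTY_OTHER_MAX_RETRIES` are hypothetical names for illustration, and the sketch assumes the cap means three retries after the initial attempt; the real retry.ts is built on the Effect retry policy and may count differently.

```typescript
// Hypothetical wire form of a parsed error, limited to the fields named above.
interface ParsedError {
  isRetryable: boolean
  metadata?: { code?: string }
}

// Assumption: "capped at 3" is read as three retries after the first attempt.
const EMPTY_OTHER_MAX_RETRIES = 3

// Decide whether another attempt should be scheduled. `retriesSoFar` is the
// number of retries already performed for this call (0 before the first retry).
function shouldRetry(error: ParsedError, retriesSoFar: number): boolean {
  if (!error.isRetryable) return false
  if (error.metadata?.code === "EmptyOther") {
    return retriesSoFar < EMPTY_OTHER_MAX_RETRIES
  }
  // Other retryable classifications keep their existing uncapped behaviour.
  return true
}

const emptyOther: ParsedError = { isRetryable: true, metadata: { code: "EmptyOther" } }
const otherRetryable: ParsedError = { isRetryable: true }
```

Keeping the cap keyed on `metadata.code` means only the truncation path changes, which matches the intent stated in step 2.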

Discarding the truncated attempt

A naive "remove the partial text on retry" would leave the message in an
inconsistent state — earlier iterations only tracked the text/reasoning parts,
so the step-start part created at the top of each attempt was left behind and
piled up one orphan per retry (the "weird ux" raised in review).

This PR instead records a partFloor (a part id captured just before each
process() call's attempts) and, when discarding, removes every part the
attempt created after that floor — step-start, text, reasoning, etc. — so no
orphans remain. The assistant message is created fresh per process() call, so
the floor scopes removal precisely to this turn's output.
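
A rough sketch of the floor-based discard, assuming part ids are strings that sort in creation order (an assumption of this example, not something stated above). The `Part` interface, `partitionAtFloor`, and the id values are hypothetical.

```typescript
// Hypothetical part record. Ids are assumed to be fixed-width strings that
// compare in creation order.
interface Part {
  id: string
  type: "step-start" | "text" | "reasoning"
}

// Split stored parts into those at or before the floor (kept) and those the
// failed attempt created after it (discarded).
function partitionAtFloor(parts: Part[], floor: string): { keep: Part[]; discard: Part[] } {
  const keep: Part[] = []
  const discard: Part[] = []
  for (const part of parts) {
    // Plain string comparison is enough under the sortable-id assumption.
    if (part.id > floor) discard.push(part)
    else keep.push(part)
  }
  return { keep, discard }
}

const floor = "prt_0001"
const stored: Part[] = [
  { id: "prt_0001", type: "text" },       // existed before this call's attempts
  { id: "prt_0002", type: "step-start" }, // created by the truncated attempt
  { id: "prt_0003", type: "reasoning" },  // partial reasoning from the same attempt
]
const { keep, discard } = partitionAtFloor(stored, floor)
```

Because the comparison is by position relative to the floor rather than by part type, a step-start created by the failed attempt is removed along with its text and reasoning parts.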

The discard is deliberately scoped:

  • Only stream truncations (EmptyOther) trigger it. Other retryable errors
    (rate limits, 5xx, decompression) retry untouched, exactly as before — this
    avoids touching attempts where tools may have already executed.
  • It also runs when the 3-retry cap is hit, so a permanently failing
    message doesn't keep the orphan parts either.
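
The two scoping rules above can be summarized as a small decision function. This is an illustrative sketch only: `Plan`, `planAfterRetryableFailure`, and the constant are hypothetical, and the cap is again read as three retries.

```typescript
// Assumption: the cap is three retries for the EmptyOther classification.
const MAX_EMPTY_OTHER_RETRIES = 3

interface Plan {
  outcome: "retry" | "give-up"
  discardParts: boolean
}

// Plan the next step after a retryable failure. `code` is the error's
// metadata.code (undefined for other classifications such as rate limits).
function planAfterRetryableFailure(code: string | undefined, retriesSoFar: number): Plan {
  const truncation = code === "EmptyOther"
  const capHit = truncation && retriesSoFar >= MAX_EMPTY_OTHER_RETRIES
  return {
    outcome: capHit ? "give-up" : "retry",
    // Discard on every truncation, whether a retry follows or the cap was hit.
    discardParts: truncation,
  }
}
```

Under this sketch a non-truncation error always yields `discardParts: false`, so attempts where tools may already have run are never touched.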

On the UX side there is nothing to "flicker": an EmptyOther truncation has
zero output tokens, so the only thing discarded is an effectively empty
step-start. The existing "retrying" indicator still shows.

Scope: why processor-layer instead of provider-layer

Related #21727 catches a similar truncation pattern at the
@ai-sdk/openai-compatible provider's flush() callback, which works only
for OpenAI-compatible providers. This PR catches the same condition one
layer up, in the session processor, where it applies to all AI-SDK
providers — including Anthropic direct, Bedrock, and Vertex. The instances
I observed include Anthropic-direct cases that #21727 cannot reach. The
two PRs are independent and complementary; either order of merge is fine.

How did you verify your code works?

  • retry.test.ts — recognizes EmptyOther as retryable, stops retrying after
    3 attempts, and round-trips APIError class instances through fromError
    (preserving data.message and metadata.code). 36 pass.
  • prompt.test.ts, case "retry discards in-flight parts from the failed attempt":
    asserts the retried message keeps only the final text and exactly one
    step-start (i.e. the orphan is gone). 54 pass / 1 skip.
  • processor-effect.test.ts — reasoning state is reset across retries (no
    concatenated leftovers). 15 pass.
  • bun typecheck adds no new errors.

Other user-visible issues this likely helps

Checklist

  • I have tested my changes locally
  • I have not included unrelated changes in this PR

github-actions Bot commented May 7, 2026

Contributor

Thanks for your contribution!

This PR doesn't have a linked issue. All PRs must reference an existing issue.

Please:

  1. Open an issue describing the bug/feature (if one doesn't exist)
  2. Add Fixes #<number> or Closes #<number> to this PR description

See CONTRIBUTING.md for details.

github-actions Bot commented May 7, 2026

Contributor

The following comment was made by an LLM, it may be inaccurate:

Results

Found 1 related PR:

  1. #21727 - fix: handle stream interruption for OpenAI-compatible providers
    • This PR is explicitly mentioned in the current PR's description as complementary. It catches the same truncation pattern at the @ai-sdk/openai-compatible provider layer's flush() callback, while PR #26167 catches it at the session processor layer (applying to all AI-SDK providers). The description notes both are independent and either order of merge is fine.

Note: PR #26167 is the current PR being analyzed, so it correctly appears in search results but is not a duplicate of itself.

No other duplicate PRs found addressing the same issue.

@rekram1-node

Collaborator

/review

// No usage and no output means the connection was cut mid-generation,
// which is a transient failure that should be retried.
if (value.finishReason === "other" && usage.tokens.output === 0) {
  return yield* Effect.fail(

Contributor

Suggestion for the human to decide: this failure happens after stream parts may already have been persisted on the current assistant message. Because Effect.retry(...) wraps the stream before cleanup() runs, a retry will start a new stream on the same message without removing the partial text/reasoning parts from the truncated attempt, so a successful retry can leave the original truncated content plus the retried response in the final assistant message. Consider clearing the in-flight attempt parts before retrying, or moving this detection earlier to a place where no partial parts have been committed.

Contributor Author

Good catch — verified the concern is real:

  • Effect.ensuring(cleanup()) wraps the retry, so cleanup() only runs at the very end of the whole chain.
  • ctx.currentText / ctx.reasoningMap persist across retry attempts (closure-captured).
  • text-start and reasoning-start call session.updatePart(...) immediately, so partial parts are already in SQLite by the time finish-step fires.

Pushed a fix in 0a09591b2:

  • Track partIDs created during each attempt on ctx.attemptParts (pushed in text-start / reasoning-start).
  • New discardAttempt() helper deletes those parts via session.removePart(...) and resets currentText / reasoningMap / snapshot.
  • Hooked into the retry policy's set callback so it fires only when a retry will actually happen. Terminal failures (no retry) route through halt and keep the partial content as user-visible context.

Note this is a pre-existing issue affecting all retryable mid-stream errors (ECONNRESET, ZlibError, SSE timeout, etc.); the EmptyOther path just makes it more frequent. The fix applies uniformly to all of them.

Added an it.instance regression test (retry discards in-flight parts from the failed attempt) that pushes a truncated reply followed by a clean success and asserts the final message contains only the retried text.

// No usage and no output means the connection was cut mid-generation,
// which is a transient failure that should be retried.
if (value.finishReason === "other" && usage.tokens.output === 0) {
  return yield* Effect.fail(

Contributor

Small style-guide suggestion, optional for the human to decide: in Effect.gen / Effect.fn, this repo prefers yield* new MyError(...) for direct typed-error failures instead of wrapping the error with Effect.fail(...). This branch could be written as return yield* new MessageV2.APIError({ ... }) while preserving the same behavior.

Contributor Author

The yield* new MyError(...) pattern requires Schema.TaggedErrorClass-derived errors (Effect's YieldableError). MessageV2.APIError is built with namedSchemaError (message-v2.ts:51), which extends Error directly without [Symbol.iterator]. The suggested form fails to compile:

src/session/processor.ts: error TS2488:
  Type 'NamedSchemaError' must have a '[Symbol.iterator]()' method that returns an iterator.

All 14 existing yield* new ... sites in src/ use Schema.TaggedErrorClass (UpgradeFailedError, CliError, PhotonUnavailableError, RejectedError, etc.). Migrating MessageV2.APIError and its siblings (AbortedError, OutputLengthError, AuthError, ContextOverflowError) from namedSchemaError to Schema.TaggedErrorClass would change the wire schema ({ name, data } → { _tag, ... }) and break SDK consumers, which is out of scope for this PR.

Keeping the Effect.fail(new MessageV2.APIError(...)) form.

@edevil edevil force-pushed the fix/empty-other-stream-truncation branch 2 times, most recently from 0a09591 to b2fd02a Compare May 15, 2026 14:52

edevil commented May 16, 2026

Contributor Author

/review

Comment on lines +572 to +580
if (value.reason === "unknown" && usage.tokens.output === 0) {
  return yield* Effect.fail(
    new MessageV2.APIError({
      message: "Provider stream ended without a stop reason",
      isRetryable: true,
      metadata: { code: "EmptyOther" },
    }),
  )
}


Suggest extending this check to also catch finish_reason: "stop" with zero output tokens.

Hit the same failure shape on Azure-served gpt-5.5 via the OpenAI-compatible adapter:

  • Assistant turn finished cleanly: finish_reason: stop, 0 output tokens, no text or tool parts
  • Every subsequent user message returned the same empty shape
  • Same "session degradation" pattern you describe in this PR

Why the current check misses it: the PR guards on reason === "unknown", which is the AI SDK fallback when the stream ends without a stop_reason. In my case the /chat/completions stream still emitted finish_reason: stop in the final chunk despite carrying no content. My turn slipped through with reason: stop and got persisted.

Suggested extension (reuses ctx.attemptParts from this PR so it doesn't trip on legitimate text-emitting stop turns):

if (
  usage.tokens.output === 0 &&
  ctx.attemptParts.length === 0 &&
  (value.reason === "unknown" || value.reason === "stop")
) {
  return yield* Effect.fail(new MessageV2.APIError({
    message: "Provider returned empty stream",
    isRetryable: true,
    metadata: { code: "EmptyStream" },
  }))
}

@rekram1-node

Collaborator

alright, review time

@rekram1-node

Collaborator

im not sure we can just discard attempts like this without some other changes too, i think it lends itself to a weird ux potentially

Detect provider stream truncation (finish reason "unknown" with zero
output tokens) and retry it as a transient failure, capped at 3 attempts.

On an EmptyOther retry — and when the retry cap is hit — discard the
parts the failed attempt persisted (everything created after a per-call
part floor) so the message reflects only the final attempt instead of
accumulating an orphan step-start / partial text or reasoning. The
discard is scoped to truncations; other retryable errors (rate limits,
5xx) retry untouched.

Surface APIError instances through MessageV2.fromError so the TUI
receives the structured message and metadata.

Refs anomalyco#14108
@edevil edevil force-pushed the fix/empty-other-stream-truncation branch from fed896e to 03b8a31 Compare June 5, 2026 14:27

edevil commented Jun 5, 2026

Contributor Author

yeah that's fair, the original version was half-doing it which is what made it weird. reworked it so the discard is way more targeted:

  • it only runs on empty-stream truncations now (the EmptyOther case), not every retry. rate limits / 5xx etc retry untouched like before, so no behavior change there
  • when it does discard, it now removes everything the failed attempt created (via a per-call part floor), not just the text/reasoning. that was the actual bug — we were leaving an orphan step-start behind on each retry, so you'd get duplicate step separators piling up
  • also discards on the final give-up (when the 3-retry cap is hit) so a failed message doesn't keep the leftover either

On the ux side: for empty truncations there's no visible content to flicker (0 output tokens, it's basically just the step-start), so nothing the user was reading disappears. the existing "retrying" indicator still shows. i deliberately left the broader "discard partial content on any mid-stream retry" case out of scope since that's where the tool side-effect / flicker concerns actually live.

Also rebased onto latest dev and squashed to a single commit.

@edevil edevil changed the title fix(session): retry empty stream truncations with attempt cap fix(session): retry empty stream truncations and discard partial parts Jun 5, 2026

github-actions Bot commented Jun 7, 2026

Contributor

Automated PR Cleanup

Thank you for contributing to opencode.

Due to the high volume of PRs from users and AI agents, we periodically close older PRs using automated criteria so maintainers can focus review time on the most active and community-supported contributions.

This PR was closed because it matched the following cleanup criteria:

  • The PR was created more than 1 month ago
  • The PR had fewer than 2 positive reactions
  • Positive reactions are counted as thumbs-up, heart, celebration, or rocket reactions on the PR

PRs created within the last month are not affected by this cleanup.

If you believe this PR was closed incorrectly, or if you are still actively working on it, please leave a comment explaining why it should be reopened. A maintainer can review and reopen it if appropriate.

Thanks again for taking the time to contribute.

@github-actions github-actions Bot closed this Jun 7, 2026
Development

Successfully merging this pull request may close these issues.

Provider stream truncation (finishReason="other" with zero output) silently accepted, persisting half-finished assistant messages

3 participants