fix(session): retry empty stream truncations and discard partial parts#26167
fix(session): retry empty stream truncations and discard partial parts#26167edevil wants to merge 1 commit into
Conversation
|
Thanks for your contribution! This PR doesn't have a linked issue. All PRs must reference an existing issue. Please:
See CONTRIBUTING.md for details. |
|
The following comment was made by an LLM, it may be inaccurate: ResultsFound 1 related PR:
Note: PR #26167 is the current PR being analyzed, so it correctly appears in search results but is not a duplicate of itself. No other duplicate PRs found addressing the same issue. |
a2b30cc to
3d55892
Compare
|
/review |
| // No usage and no output means the connection was cut mid-generation, | ||
| // which is a transient failure that should be retried. | ||
| if (value.finishReason === "other" && usage.tokens.output === 0) { | ||
| return yield* Effect.fail( |
There was a problem hiding this comment.
Suggestion for the human to decide: this failure happens after stream parts may already have been persisted on the current assistant message. Because Effect.retry(...) wraps the stream before cleanup() runs, a retry will start a new stream on the same message without removing the partial text/reasoning parts from the truncated attempt, so a successful retry can leave the original truncated content plus the retried response in the final assistant message. Consider clearing the in-flight attempt parts before retrying, or moving this detection earlier to a place where no partial parts have been committed.
There was a problem hiding this comment.
Good catch — verified the concern is real:
Effect.ensuring(cleanup())wraps the retry, socleanup()only runs at the very end of the whole chain.ctx.currentText/ctx.reasoningMappersist across retry attempts (closure-captured).text-startandreasoning-startcallsession.updatePart(...)immediately, so partial parts are already in SQLite by the timefinish-stepfires.
Pushed a fix in 0a09591b2:
- Track partIDs created during each attempt on
ctx.attemptParts(pushed intext-start/reasoning-start). - New
discardAttempt()helper deletes those parts viasession.removePart(...)and resetscurrentText/reasoningMap/snapshot. - Hooked into the retry policy's
setcallback so it fires only when a retry will actually happen. Terminal failures (no retry) route throughhaltand keep the partial content as user-visible context.
Note this is a pre-existing issue affecting all retryable mid-stream errors (ECONNRESET, ZlibError, SSE timeout, etc.); the EmptyOther path just makes it more frequent. The fix applies uniformly to all of them.
Added an it.instance regression test (retry discards in-flight parts from the failed attempt) that pushes a truncated reply followed by a clean success and asserts the final message contains only the retried text.
| // No usage and no output means the connection was cut mid-generation, | ||
| // which is a transient failure that should be retried. | ||
| if (value.finishReason === "other" && usage.tokens.output === 0) { | ||
| return yield* Effect.fail( |
There was a problem hiding this comment.
Small style-guide suggestion, optional for the human to decide: in Effect.gen / Effect.fn, this repo prefers yield* new MyError(...) for direct typed-error failures instead of wrapping the error with Effect.fail(...). This branch could be written as return yield* new MessageV2.APIError({ ... }) while preserving the same behavior.
There was a problem hiding this comment.
The yield* new MyError(...) pattern requires Schema.TaggedErrorClass-derived errors (Effect's YieldableError). MessageV2.APIError is built with namedSchemaError (message-v2.ts:51), which extends Error directly without [Symbol.iterator]. The suggested form fails to compile:
src/session/processor.ts: error TS2488:
Type 'NamedSchemaError' must have a '[Symbol.iterator]()' method that returns an iterator.
All 14 existing yield* new ... sites in src/ use Schema.TaggedErrorClass (UpgradeFailedError, CliError, PhotonUnavailableError, RejectedError, etc.). Migrating MessageV2.APIError and its siblings (AbortedError, OutputLengthError, AuthError, ContextOverflowError) from namedSchemaError to Schema.TaggedErrorClass would change the wire schema ({ name, data } → { _tag, ... }) and break SDK consumers — out of scope for this PR.
Keeping the Effect.fail(new MessageV2.APIError(...)) form.
0a09591 to
b2fd02a
Compare
|
/review |
1e7c26d to
5bc0028
Compare
| if (value.reason === "unknown" && usage.tokens.output === 0) { | ||
| return yield* Effect.fail( | ||
| new MessageV2.APIError({ | ||
| message: "Provider stream ended without a stop reason", | ||
| isRetryable: true, | ||
| metadata: { code: "EmptyOther" }, | ||
| }), | ||
| ) | ||
| } |
There was a problem hiding this comment.
Suggest extending this check to also catch finish_reason: "stop" with zero output tokens.
Hit the same failure shape on Azure-served gpt-5.5 via the OpenAI-compatible adapter:
- Assistant turn finished cleanly:
finish_reason: stop, 0 output tokens, no text or tool parts - Every subsequent user message returned the same empty shape
- Same "session degradation" pattern you describe in this PR
Why the current check misses it: the PR guards on reason === "unknown", which is the AI SDK fallback when the stream ends without a stop_reason. In my case the /chat/completions stream still emitted finish_reason: stop in the final chunk despite carrying no content. My turn slipped through with reason: stop and got persisted.
Suggested extension (reuses ctx.attemptParts from this PR so it doesn't trip on legitimate text-emitting stop turns):
if (
usage.tokens.output === 0 &&
ctx.attemptParts.length === 0 &&
(value.reason === "unknown" || value.reason === "stop")
) {
return yield* Effect.fail(new MessageV2.APIError({
message: "Provider returned empty stream",
isRetryable: true,
metadata: { code: "EmptyStream" },
}))
}e0b8569 to
1d8748e
Compare
|
alright rreview time |
|
im not sure we can just discard attempts like this without some other changes too, i think it lends itself to a weird ux potentially |
Detect provider stream truncation (finish reason "unknown" with zero output tokens) and retry it as a transient failure, capped at 3 attempts. On an EmptyOther retry — and when the retry cap is hit — discard the parts the failed attempt persisted (everything created after a per-call part floor) so the message reflects only the final attempt instead of accumulating an orphan step-start / partial text or reasoning. The discard is scoped to truncations; other retryable errors (rate limits, 5xx) retry untouched. Surface APIError instances through MessageV2.fromError so the TUI receives the structured message and metadata. Refs anomalyco#14108
fed896e to
03b8a31
Compare
|
yeah that's fair, the original version was half-doing it which is what made it weird. reworked it so the discard is way more targeted:
On the ux side: for empty truncations there's no visible content to flicker (0 output tokens, it's basically just the step-start), so nothing the user was reading disappears. the existing "retrying" indicator still shows. i deliberately left the broader "discard partial content on any mid-stream retry" case out of scope since that's where the tool side-effect / flicker concerns actually live. Also rebased onto latest dev and squashed to a single commit. |
|
Automated PR Cleanup Thank you for contributing to opencode. Due to the high volume of PRs from users and AI agents, we periodically close older PRs using automated criteria so maintainers can focus review time on the most active and community-supported contributions. This PR was closed because it matched the following cleanup criteria:
PRs created within the last month are not affected by this cleanup. If you believe this PR was closed incorrectly, or if you are still actively working on it, please leave a comment explaining why it should be reopened. A maintainer can review and reopen it if appropriate. Thanks again for taking the time to contribute. |
Issue for this PR
Closes #26170
Related #21727
Type of change
What does this PR do?
When an upstream provider stream ends without a proper
stop_reason, the AISDK emits a fallback finish with zero output tokens. opencode previously
accepted this as a normal end-of-step, persisting a truncated message with no
error and no retry. The user got a half-finished response and had to manually
re-prompt.
This PR detects the truncation pattern at the session-processor layer,
surfaces it as a retryable
APIError(capped at 3 attempts), and discards theparts the failed attempt persisted so a successful retry replaces — rather than
appends to — the truncated content.
The trigger condition
When the upstream provider stream is cut mid-generation, the AI SDK flushes a
finish-stepwhose normalized reason is"unknown"withusage.outputTokensof 0:
opencode's session processor receives the
finish-stepwithvalue.reason === "unknown"andusage.tokens.output === 0. Pre-fix, theprocessor accepts that as a legitimate end-of-step.
Symptom (real-world evidence)
I found more than a dozen instances of this exact bug pattern across my own
opencode session database, spanning two providers (
anthropic,openai)and four models (
gpt-5.3-codex,claude-opus-4-6,claude-opus-4-7,claude-haiku-4-5). All exhibit the same shape:Mid-stream cut, not a model decision: in one diagnostic example, the
reasoning text literally ends mid-word —
"...really just wrapping the existing whichlang::detect_language() functi". The upstream stream wassevered before the next chunk arrived.
User-visible behavior pre-fix: the session stores a half-finished
message with no error, no retry, no recovery. In one observed session the
user manually re-prompted ~111s later, succeeded for 3 turns, hit the bug
again, re-prompted again — the "session degradation" pattern users report
in #16214.
The fix
processor.ts— Detect the truncation (value.reason === "unknown"with zero output tokens) on
finish-stepand fail the stream with aretryable
APIErrortaggedmetadata.code = "EmptyOther".retry.ts— CapEmptyOtherretries at 3 attempts so a misbehavingprovider can't loop forever. Other retryable classifications keep their
existing unbounded behaviour. The retry
setcallback now also receives theparsed error so the processor can decide whether to discard.
message-v2.ts— Addcase APIError.isInstance(e)tofromErrorthat converts the class instance to its wire form, so the structured
message and metadata reach the TUI instead of being wrapped in a generic
UnknownErrorwhose payload is the JSON-stringified original.processor.ts(discard) — On retry, drop the parts the failed attemptpersisted (see below) so the message reflects only the final attempt.
Discarding the truncated attempt
A naive "remove the partial text on retry" would leave the message in an
inconsistent state — earlier iterations only tracked the text/reasoning parts,
so the
step-startpart created at the top of each attempt was left behind andpiled up one orphan per retry (the "weird ux" raised in review).
This PR instead records a
partFloor(a part id captured just before eachprocess()call's attempts) and, when discarding, removes every part theattempt created after that floor —
step-start, text, reasoning, etc. — so noorphans remain. The assistant message is created fresh per
process()call, sothe floor scopes removal precisely to this turn's output.
The discard is deliberately scoped:
EmptyOther) trigger it. Other retryable errors(rate limits, 5xx, decompression) retry untouched, exactly as before — this
avoids touching attempts where tools may have already executed.
message doesn't keep the orphan parts either.
On the UX side there is nothing to "flicker": an
EmptyOthertruncation haszero output tokens, so the only thing discarded is an effectively empty
step-start. The existing "retrying" indicator still shows.Scope: why processor-layer instead of provider-layer
Related #21727 catches a similar truncation pattern at the
@ai-sdk/openai-compatibleprovider'sflush()callback, which works onlyfor OpenAI-compatible providers. This PR catches the same condition one
layer up, in the session processor, where it applies to all AI-SDK
providers — including Anthropic direct, Bedrock, and Vertex. The instances
I observed include Anthropic-direct cases that #21727 cannot reach. The
two PRs are independent and complementary; either order of merge is fine.
How did you verify your code works?
retry.test.ts— recognizesEmptyOtheras retryable, stops retrying after3 attempts, and round-trips
APIErrorclass instances throughfromError(preserving
data.messageandmetadata.code). 36 pass.prompt.test.ts—retry discards in-flight parts from the failed attempt:asserts the retried message keeps only the final text and exactly one
step-start(i.e. the orphan is gone). 54 pass / 1 skip.processor-effect.test.ts— reasoning state is reset across retries (noconcatenated leftovers). 15 pass.
bun typecheckadds no new errors.Other user-visible issues this likely helps
symptom class; user has not run diagnostics to confirm mechanism.
satisfies one of four suggested fixes (retry-cap angle for EmptyOther only).
session") —
fromErrorpassthrough makes these errors readable in theTUI; retry cap prevents indefinite loop.
Checklist