runtime: recover oversized user message after wire/media overflow #2821

Draft
trungutt wants to merge 2 commits into docker:main from trungutt:trungutt/overflow-aware-compaction

Conversation

Contributor

@trungutt trungutt commented May 19, 2026

Stacked on #2819. Review after that one merges.

The problem this step fixes

After #2818 + #2819, an oversized message produces the right error message and fails fast — but the same chat session cannot continue. The offending user message stays verbatim in sess.Messages for the rest of the process, and every retry resends the same oversized payload alongside the new (smaller) one. The user sees the same rejection every time they hit Send.

The original user-reported regression:

"I shortened my paste but it kept failing with the same character-limit error."

What this change does

After a wire- or media-overflow rejection, walk back to the most recent user message and rewrite it in place in memory:

  • Each media part (image, file, document) becomes a short text placeholder that records what was attached (name, size, MIME when known).
  • Plain-text content over a conservative threshold is replaced with a placeholder that records the original size.

The rewrite happens immediately after the failure event fires. The error itself is still emitted with the kind-specific code from #2818, and a Warning event explains to the user that the previous message was rewritten — so the hygiene action is observable rather than silent.
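
The scrub rules above can be sketched roughly as follows. This is a minimal illustration, not the code in pkg/runtime/overflow_recovery.go: the Part and Message types, the placeholder wording, and the 1 MiB threshold are assumptions made for the example (the PR only says "a conservative threshold").

```go
package main

import "fmt"

// PartKind, Part and Message are hypothetical stand-ins for the runtime's
// chat types; the real field names are not shown in this PR.
type PartKind string

const (
	PartText  PartKind = "text"
	PartImage PartKind = "image"
	PartFile  PartKind = "file"
)

type Part struct {
	Kind PartKind
	Text string // set for text parts and placeholders
	Name string // attachment name, when known
	MIME string // attachment MIME type, when known
	Size int    // attachment size in bytes, when known
}

type Message struct {
	Role    string
	Content string // top-level plain text
	Parts   []Part // multi-part content
}

// maxTextBytes is an assumed threshold for this sketch.
const maxTextBytes = 1 << 20

// placeholder describes a removed attachment using whatever metadata is known.
func placeholder(p Part) string {
	s := "[" + string(p.Kind) + " removed"
	if p.Name != "" {
		s += ": " + p.Name
	}
	if p.Size > 0 {
		s += fmt.Sprintf(", %d bytes", p.Size)
	}
	if p.MIME != "" {
		s += ", " + p.MIME
	}
	return s + "]"
}

// scrubMessage returns a slimmed copy of m, the number of parts replaced,
// and whether the top-level text was replaced. The input is not mutated.
func scrubMessage(m Message) (Message, int, bool) {
	out := m
	textReplaced := false
	if len(m.Content) > maxTextBytes {
		out.Content = fmt.Sprintf("[oversized text removed: %d bytes]", len(m.Content))
		textReplaced = true
	}
	partsReplaced := 0
	if len(m.Parts) > 0 {
		out.Parts = make([]Part, 0, len(m.Parts))
	}
	for _, p := range m.Parts {
		switch p.Kind {
		case PartImage, PartFile:
			// Media always becomes a short text placeholder.
			p = Part{Kind: PartText, Text: placeholder(p)}
			partsReplaced++
		case PartText:
			if len(p.Text) > maxTextBytes {
				p = Part{Kind: PartText, Text: fmt.Sprintf("[oversized text removed: %d bytes]", len(p.Text))}
				partsReplaced++
			}
		}
		// Unknown part kinds fall through unchanged rather than being dropped.
		out.Parts = append(out.Parts, p)
	}
	return out, partsReplaced, textReplaced
}
```

Returning a copy rather than mutating the argument keeps the function easy to test; the caller decides where the result is stored.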

Before / after

Today (after #2818 + #2819)          After this change
───────────────────────────          ─────────────────
User pastes huge content             User pastes huge content
       │                                    │
       ▼                                    ▼
Provider rejects                     Provider rejects
       │                                    │
       ▼                                    ▼
Error surfaced with the right code   Error surfaced with the right code
       │                                    │
       ▼                                    ▼
User shortens, retries               Offending message rewritten in
       │                              memory: media → placeholders,
       ▼                              oversized text → placeholder
sess.Messages still carries                 │
the oversized turn                          ▼
       │                             User shortens, retries
       ▼                                    │
Same limit tripped again                    ▼
       │                             sess.Messages now carries only
       ▼                             the slim placeholder + the new
Chat session stuck                   smaller message
                                            │
                                            ▼
                                     Provider accepts — chat session
                                     continues normally

Scope — what this PR is and is not

                       In scope (this PR)              Out of scope (separate follow-up)
─────────────────────  ──────────────────────────────  ─────────────────────────────────
Same-process recovery  ✓ in-memory sess.Messages
                       rewritten so the chat session
                       continues immediately
Persistence                                            Mirroring the rewrite to the
                                                       session store so it survives a
                                                       docker-agent restart

Why the persistence side is a separate follow-up

The persistence side requires Message.ID to round-trip through Store.AddMessage (currently the returned ID is discarded by the PersistenceObserver) and through loadSessionItems (currently the id column is not selected on reload). That gap is independent of overflow handling — it would affect anything that needs Store.UpdateMessage against an in-memory message, including any future compaction-by-id work or message-editing features.
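
To make the gap concrete, here is a small sketch with an in-memory stand-in for the store. Everything in it is hypothetical (memStore, the two persist helpers, and the simplified AddMessage/UpdateMessage signatures); it only illustrates why a discarded ID leaves nothing to pass to an update-by-ID call later.

```go
package main

import "errors"

// Message is a hypothetical in-memory message; ID 0 means "no store ID known".
type Message struct {
	ID      int64
	Content string
}

// memStore is a toy stand-in for the session store, keyed by row ID.
type memStore struct {
	nextID int64
	rows   map[int64]string
}

func newMemStore() *memStore { return &memStore{rows: map[int64]string{}} }

// AddMessage inserts a row and returns its ID.
func (s *memStore) AddMessage(content string) int64 {
	s.nextID++
	s.rows[s.nextID] = content
	return s.nextID
}

var errUnknownID = errors.New("unknown message id")

// UpdateMessage rewrites an existing row and fails for IDs it never issued.
func (s *memStore) UpdateMessage(id int64, content string) error {
	if _, ok := s.rows[id]; !ok {
		return errUnknownID
	}
	s.rows[id] = content
	return nil
}

// persistDiscardingID mirrors the gap: the row is written, but the in-memory
// message never learns which row it corresponds to.
func persistDiscardingID(s *memStore, m *Message) {
	_ = s.AddMessage(m.Content)
}

// persistKeepingID shows one possible shape of a fix: copy the returned ID
// back onto the in-memory message so later updates can address the row.
func persistKeepingID(s *memStore, m *Message) {
	m.ID = s.AddMessage(m.Content)
}
```

With persistDiscardingID the message keeps ID 0, so a later UpdateMessage has no row to target; with persistKeepingID the same call succeeds.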

Folding that infrastructure fix into this PR doubled its size and conflated concerns. It now lands as its own focused change where it can be evaluated on its own merits (propagate the ID? position-based updates? new API?).

The scope of the persistence gap, for clarity:

Affected         Not affected
────────         ────────────
docker-agent     The chat session within a single docker-agent process
restart mid-     (closed by this PR)
session   →
oversized
message reloads
from disk

For the user-reported regression (paste oversized → 413 → shortening keeps failing in the same chat), the same-process fix in this PR is sufficient.

What is preserved

  • Token overflow handling is unchanged — that path goes through auto-compaction, which is the correct mechanism for that shape of failure.
  • Hooks, telemetry, retry semantics: untouched.
  • The fatal ErrorEvent for the original rejection is still emitted — scrubbing is in addition to surfacing the error, not instead of it.

Files touched

pkg/runtime/loop_steps.go              Hook recoverFromOversizedTurn into handleStreamError
pkg/runtime/loop_steps_test.go         Tests for the new branch
pkg/runtime/overflow_recovery.go       scrubMessage + recoverFromOversizedTurn (new file)
pkg/runtime/overflow_recovery_test.go  Tests for scrub logic + recovery (new file)
pkg/session/session.go                 New Session.RewriteLatestUserMessage helper

5 files, ~850 lines (about half of that is tests).

@docker-agent docker-agent left a comment

Assessment: 🟡 NEEDS ATTENTION

2 medium-severity issues found in the overflow recovery implementation.

Comment thread pkg/runtime/loop_steps.go Outdated
Comment thread pkg/runtime/overflow_recovery.go
@trungutt trungutt force-pushed the trungutt/overflow-aware-compaction branch from a1d06ad to 18a703a Compare May 19, 2026 12:17
@trungutt trungutt changed the base branch from main to trungutt/overflow-skip-compaction-on-wire May 19, 2026 12:24
@trungutt trungutt changed the title feat(runtime): recover oversized user message after wire/media overflow runtime: recover oversized user message after wire/media overflow May 19, 2026
@trungutt trungutt marked this pull request as ready for review May 19, 2026 12:45
@trungutt trungutt requested a review from a team as a code owner May 19, 2026 12:45
@docker-agent docker-agent left a comment

Assessment: 🟡 NEEDS ATTENTION

Two medium-severity findings in the new overflow-recovery code. The core approach — scrubbing oversized messages in-place, persisting the rewrite, and emitting a Warning event — is sound and the two learned patterns (wire/media scrub independence from compaction config; top-level Content vs MultiContent text scrubbing) are both correctly implemented.

Comment thread pkg/runtime/overflow_recovery.go Outdated
Comment thread pkg/session/session.go Outdated
@trungutt trungutt force-pushed the trungutt/overflow-aware-compaction branch 2 times, most recently from b22d0fb to c724506 Compare May 19, 2026 13:17
@trungutt trungutt requested a review from docker-agent May 19, 2026 14:11
@docker-agent docker-agent left a comment

Assessment: 🟢 APPROVE

Reviewed the overflow recovery changes across pkg/runtime/overflow_recovery.go, pkg/runtime/loop_steps.go, pkg/session/session.go, and the accompanying tests.

What was checked:

  • Correctness of findPersistedMessage when sub-sessions are present — GetAllMessages() appends top-level messages before sub-session messages, so even in a hypothetical ID-collision scenario the first match is always the correctly-rewritten top-level message ✅
  • Notification ordering (emitScrubNotice fires unconditionally after in-memory rewrite, regardless of persistence outcome) — correctly implemented per design intent ✅
  • UpdateMessage call site — called with the post-mutation message content, messageID == 0 guard handles un-persisted synthesised messages ✅
  • Store-write failure path — non-fatal, logged with slog.WarnContext, and user is still notified via the scrub notice ✅
  • Test coverage — happy path, persist-fails path, and no-user-message path all covered ✅

No confirmed or likely bugs found. The implementation correctly addresses the session-poisoning scenario described in the PR description.

@trungutt trungutt changed the base branch from trungutt/overflow-skip-compaction-on-wire to main May 19, 2026 17:09
Auto-compaction is only useful when the rejection is a token-count
overflow — summarising older turns reduces the input token count.

For wire-level overflow ([OverflowKindWire]) the request body itself
exceeds the provider's cap, and the latest turn alone is over the limit;
the compaction call would have to send the same oversized history and
would also be rejected.

For media overflow ([OverflowKindMedia]) we have no media-stripping
during compaction today, so a retry would resend the same attachment
and fail again.

In both cases the recovery attempt is bound to fail, and we surface the
error anyway after spending an extra provider call and several seconds
of wall-clock latency. This change skips compaction for those two kinds
and surfaces the error directly. The token-overflow path is unchanged.
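
As a rough sketch of that decision, assuming an integer OverflowKind enum and a hypothetical nextStep helper (only the three OverflowKind names appear in this PR; their values and the helper are made up for illustration):

```go
package main

// OverflowKind classifies why the provider rejected the request.
// The concrete values here are assumptions for this sketch.
type OverflowKind int

const (
	OverflowKindTokens OverflowKind = iota // too many input tokens
	OverflowKindWire                       // request body over the size cap
	OverflowKindMedia                      // a single attachment is too large
)

// Step names what the runtime does next after a rejection.
type Step string

const (
	StepCompact      Step = "compact-and-retry"
	StepSurfaceError Step = "surface-error"
)

// nextStep picks the follow-up action for an overflow rejection.
func nextStep(kind OverflowKind) Step {
	switch kind {
	case OverflowKindTokens:
		// Summarising older turns lowers the token count, so a retry can succeed.
		return StepCompact
	case OverflowKindWire, OverflowKindMedia:
		// The compaction request would carry the same oversized payload
		// and be rejected too, so skip it and fail fast.
		return StepSurfaceError
	default:
		// Unknown kinds fail fast in this sketch; the real default may differ.
		return StepSurfaceError
	}
}
```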
@trungutt trungutt force-pushed the trungutt/overflow-aware-compaction branch from c724506 to b334412 Compare May 19, 2026 17:13
@trungutt trungutt requested a review from docker-agent May 19, 2026 17:14
@docker-agent docker-agent left a comment

Assessment: 🔴 CRITICAL

1 high-severity and 1 low-severity finding in the new overflow recovery code.

Comment thread pkg/runtime/overflow_recovery.go Outdated
Comment thread pkg/session/session.go Outdated
@trungutt trungutt force-pushed the trungutt/overflow-aware-compaction branch from b334412 to 354f333 Compare May 19, 2026 19:04
@trungutt trungutt requested a review from docker-agent May 20, 2026 07:12
@docker-agent docker-agent left a comment

Assessment: 🟢 APPROVE

Reviewed the overflow recovery implementation across pkg/runtime/overflow_recovery.go, pkg/runtime/loop_steps.go, pkg/session/session.go, and pkg/session/store.go.

Summary of analysis:

The implementation is correct and well-structured. Three hypotheses were investigated and all were dismissed after verification:

  1. Token overflow gap — The deliberate decision to only run recoverFromOversizedTurn for OverflowKindWire/OverflowKindMedia (not OverflowKindTokens) is correct by design. Token overflow is caused by accumulated conversation history, not a single oversized message; scrubbing only the latest user message would not recover it. Auto-compaction handles the token case differently and correctly.

  2. Sub-session ID collision in findPersistedMessage — SQLite auto-increment guarantees globally unique message IDs across the table. Additionally, GetAllMessages returns top-level messages before sub-session messages, so findPersistedMessage would always find the correct top-level match first even in the theoretical event of an ID collision.

  3. Misleading log fields for MultiContent scrubbing: parts_replaced > 0 already communicates to operators that content was scrubbed when only MultiContent parts are oversized; the text_replaced/original_text_bytes fields are supplementary precision, not the primary scrub indicator.

Notable design positives:

  • RewriteLatestUserMessage correctly holds s.mu.Lock() for the full rewrite, preventing torn state
  • The messageID == 0 guard correctly skips the store write for unpersisted messages
  • emitScrubNotice is emitted unconditionally after a successful in-memory rewrite, even on persistence failure — this is the right user-observable behavior
  • scrubMessagePart passes unknown part types through unchanged rather than silently dropping them
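
The locking point in the first bullet can be illustrated with a minimal sketch. The Session and Message shapes and the callback-style signature are assumptions; the actual RewriteLatestUserMessage in pkg/session/session.go may look different.

```go
package main

import "sync"

// Message is a hypothetical, simplified chat message.
type Message struct {
	Role    string
	Content string
}

// Session guards its message list with a mutex, as the review describes.
type Session struct {
	mu       sync.Mutex
	Messages []Message
}

// RewriteLatestUserMessage walks back to the most recent user message and
// replaces it with rewrite's result. The lock is held for the whole
// search-and-replace so no reader can observe a half-rewritten session.
// It returns false when the session contains no user message.
func (s *Session) RewriteLatestUserMessage(rewrite func(Message) Message) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	for i := len(s.Messages) - 1; i >= 0; i-- {
		if s.Messages[i].Role == "user" {
			s.Messages[i] = rewrite(s.Messages[i])
			return true
		}
	}
	return false
}
```

One consequence of holding the lock across the callback is that rewrite must not call back into the session, or it would deadlock on the non-reentrant mutex.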

When the provider rejects a request because the body itself is over the
wire-size cap or contains an oversized attachment, the offending user
message stays verbatim in the session. Every subsequent call reloads
that message as part of the conversation history and trips the same
limit. The session is effectively dead until the user starts over.

Add a hygiene step that runs on wire- and media-overflow rejections:
walk back to the latest user message, replace each media part (image,
file, document) with a text placeholder that records what was attached,
and replace plain-text content over 1 MiB with a size-noting placeholder.
The rewrite is mirrored to the session store so the next session load
reflects it; the in-memory mutation alone keeps the current process
healthy even if the store write fails.

A Warning event is emitted so the UI can tell the user that their
previous message was rewritten in place. The fatal ErrorEvent for the
original rejection is still emitted — scrubbing is in addition to
surfacing the error, not instead of it.

Token-overflow is unchanged: it still goes through auto-compaction,
which is the correct mechanism for that shape of failure.
@trungutt trungutt force-pushed the trungutt/overflow-aware-compaction branch from 354f333 to 693a6c1 Compare May 20, 2026 07:25
@trungutt trungutt requested a review from docker-agent May 20, 2026 07:39
@docker-agent
PR Review Failed — The review agent encountered an error and could not complete the review. View logs.

@trungutt trungutt marked this pull request as draft May 20, 2026 14:34