runtime: recover oversized user message after wire/media overflow by trungutt · Pull Request #2821 · docker/docker-agent

trungutt · 2026-05-19T11:51:14Z

Stacked on #2819. Review after that one merges.

The problem this step fixes

After #2818 + #2819, an oversized message produces the right error message and fails fast — but the same chat session cannot continue. The offending user message stays verbatim in sess.Messages for the rest of the process, and every retry resends the same oversized payload alongside the new (smaller) one. The user sees the same rejection every time they hit Send.

The original user-reported regression:

"I shortened my paste but it kept failing with the same character-limit error."

What this change does

After a wire- or media-overflow rejection, walk back to the most recent user message and rewrite it in place in memory:

Each media part (image, file, document) becomes a short text placeholder that records what was attached (name, size, MIME when known).
Plain-text content over a conservative threshold is replaced with a placeholder that records the original size.
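
The two rewrite rules above can be sketched roughly as follows. This is a minimal illustration, not the code in pkg/runtime/overflow_recovery.go: the `Part`, `Message`, and `PartKind` types, the placeholder wording, and the 1 MiB value of `maxTextBytes` are all assumptions made for the example.

```go
package main

import "fmt"

// PartKind, Part, and Message are hypothetical stand-ins for the real chat types.
type PartKind string

const (
	KindText  PartKind = "text"
	KindImage PartKind = "image"
	KindFile  PartKind = "file"
)

type Part struct {
	Kind PartKind
	Text string // set for text parts and placeholders
	Name string // attachment name, when known
	MIME string // attachment MIME type, when known
	Size int    // attachment size in bytes
}

type Message struct {
	Content string
	Parts   []Part
}

// maxTextBytes is an assumed threshold for this sketch.
const maxTextBytes = 1 << 20

func textPlaceholder(n int) string {
	return fmt.Sprintf("[text removed: %d bytes]", n)
}

// scrubPart replaces media parts and oversized text parts with a text
// placeholder. Unknown kinds pass through unchanged.
func scrubPart(p Part) (Part, bool) {
	switch p.Kind {
	case KindImage, KindFile:
		desc := fmt.Sprintf("[%s removed: name=%q size=%d bytes", p.Kind, p.Name, p.Size)
		if p.MIME != "" {
			desc += " mime=" + p.MIME
		}
		return Part{Kind: KindText, Text: desc + "]"}, true
	case KindText:
		if len(p.Text) > maxTextBytes {
			return Part{Kind: KindText, Text: textPlaceholder(len(p.Text))}, true
		}
	}
	return p, false
}

// scrubMessage returns a slimmed copy of m and the number of parts replaced.
func scrubMessage(m Message) (Message, int) {
	out := Message{Content: m.Content, Parts: make([]Part, 0, len(m.Parts))}
	if len(m.Content) > maxTextBytes {
		out.Content = textPlaceholder(len(m.Content))
	}
	replaced := 0
	for _, p := range m.Parts {
		sp, changed := scrubPart(p)
		if changed {
			replaced++
		}
		out.Parts = append(out.Parts, sp)
	}
	return out, replaced
}

func main() {
	msg := Message{
		Content: "see attached",
		Parts:   []Part{{Kind: KindImage, Name: "shot.png", MIME: "image/png", Size: 9_000_000}},
	}
	slim, n := scrubMessage(msg)
	fmt.Println(n, slim.Parts[0].Text)
}
```

Returning a copy rather than mutating the argument keeps the sketch easy to test; the PR itself describes an in-place rewrite of the in-memory message.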

The rewrite happens immediately after the failure event fires. The error itself is still emitted with the kind-specific code from #2818, and a Warning event explains to the user that the previous message was rewritten — so the hygiene action is observable rather than silent.

Before / after

Today (after #2818 + #2819)          After this change
───────────────────────────          ─────────────────
User pastes huge content             User pastes huge content
       │                                    │
       ▼                                    ▼
Provider rejects                     Provider rejects
       │                                    │
       ▼                                    ▼
Error surfaced with the right code   Error surfaced with the right code
       │                                    │
       ▼                                    ▼
User shortens, retries               Offending message rewritten in
       │                              memory: media → placeholders,
       ▼                              oversized text → placeholder
sess.Messages still carries                 │
the oversized turn                          ▼
       │                             User shortens, retries
       ▼                                    │
Same limit tripped again                    ▼
       │                             sess.Messages now carries only
       ▼                             the slim placeholder + the new
Chat session stuck                   smaller message
                                            │
                                            ▼
                                     Provider accepts — chat session
                                     continues normally

Scope — what this PR is and is not

| | In scope (this PR) | Out of scope (separate follow-up) |
|---|---|---|
| Same-process recovery | ✓ in-memory `sess.Messages` rewritten so the chat session continues immediately | n/a |
| Persistence | n/a | Mirroring the rewrite to the session store so it survives a docker-agent restart |

Why the persistence side is a separate follow-up

The persistence side requires Message.ID to round-trip through Store.AddMessage (currently the returned ID is discarded by the PersistenceObserver) and through loadSessionItems (currently the id column is not selected on reload). That gap is independent of overflow handling — it would affect anything that needs Store.UpdateMessage against an in-memory message, including any future compaction-by-id work or message-editing features.
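
To make the gap concrete, here is a small sketch of why a discarded ID blocks a later update. The in-memory `Store`, its method signatures, and the `Message` type are hypothetical and only mimic the shape described above; they are not the real pkg/session API.

```go
package main

import "fmt"

// Message and Store are hypothetical stand-ins for the session types.
type Message struct {
	ID      int64
	Content string
}

type Store struct {
	nextID int64
	rows   map[int64]string
}

func NewStore() *Store {
	return &Store{rows: make(map[int64]string)}
}

// AddMessage inserts a row and returns the ID the store assigned to it.
func (s *Store) AddMessage(m Message) int64 {
	s.nextID++
	s.rows[s.nextID] = m.Content
	return s.nextID
}

// UpdateMessage rewrites a row by ID and fails when the ID is unknown.
func (s *Store) UpdateMessage(id int64, content string) error {
	if _, ok := s.rows[id]; !ok {
		return fmt.Errorf("no row with id %d", id)
	}
	s.rows[id] = content
	return nil
}

func main() {
	st := NewStore()

	// The returned ID is discarded, so the in-memory message keeps ID 0.
	dropped := Message{Content: "huge paste"}
	_ = st.AddMessage(dropped)

	// The returned ID is propagated back onto the in-memory message.
	kept := Message{Content: "huge paste"}
	kept.ID = st.AddMessage(kept)

	fmt.Println(st.UpdateMessage(dropped.ID, "[text removed]"))
	fmt.Println(st.UpdateMessage(kept.ID, "[text removed]"))
}
```

Propagating the ID is only one of the options the description lists; position-based updates or a new API would close the same gap differently.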

Folding that infrastructure fix into this PR doubled its size and conflated concerns. It now lands as its own focused change where it can be evaluated on its own merits (propagate the ID? position-based updates? new API?).

The scope of the persistence gap, for clarity:

Affected                              Not affected
────────                              ────────────
docker-agent restart mid-session      The chat session within a single
→ oversized message reloads           docker-agent process (closed by
from disk                             this PR)

For the user-reported regression (paste oversized → 413 → shortening keeps failing in the same chat), the same-process fix in this PR is sufficient.

What is preserved

Token overflow handling is unchanged — that path goes through auto-compaction, which is the correct mechanism for that shape of failure.
Hooks, telemetry, retry semantics: untouched.
The fatal ErrorEvent for the original rejection is still emitted — scrubbing is in addition to surfacing the error, not instead of it.

Files touched

pkg/runtime/loop_steps.go              Hook recoverFromOversizedTurn into handleStreamError
pkg/runtime/loop_steps_test.go         Tests for the new branch
pkg/runtime/overflow_recovery.go       scrubMessage + recoverFromOversizedTurn (new file)
pkg/runtime/overflow_recovery_test.go  Tests for scrub logic + recovery (new file)
pkg/session/session.go                 New Session.RewriteLatestUserMessage helper

5 files, ~850 lines (about half of that is tests).

docker-agent

Assessment: 🟡 NEEDS ATTENTION

2 medium-severity issues found in the overflow recovery implementation.

docker-agent

Assessment: 🟡 NEEDS ATTENTION

Two medium-severity findings in the new overflow-recovery code. The core approach — scrubbing oversized messages in-place, persisting the rewrite, and emitting a Warning event — is sound and the two learned patterns (wire/media scrub independence from compaction config; top-level Content vs MultiContent text scrubbing) are both correctly implemented.

docker-agent

Assessment: 🟢 APPROVE

Reviewed the overflow recovery changes across pkg/runtime/overflow_recovery.go, pkg/runtime/loop_steps.go, pkg/session/session.go, and the accompanying tests.

What was checked:

Correctness of findPersistedMessage when sub-sessions are present — GetAllMessages() appends top-level messages before sub-session messages, so even in a hypothetical ID-collision scenario the first match is always the correctly-rewritten top-level message ✅
Notification ordering (emitScrubNotice fires unconditionally after in-memory rewrite, regardless of persistence outcome) — correctly implemented per design intent ✅
UpdateMessage call site — called with the post-mutation message content, messageID == 0 guard handles un-persisted synthesised messages ✅
Store-write failure path — non-fatal, logged with slog.WarnContext, and user is still notified via the scrub notice ✅
Test coverage — happy path, persist-fails path, and no-user-message path all covered ✅
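
The checks above describe an iteration of the PR that also mirrored the rewrite to the store. The failure handling it describes might look roughly like this sketch. The `Store` interface, `persistAndNotify`, and the notice text are assumptions for illustration; only `slog.WarnContext` is a real standard-library call.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log/slog"
)

// Message and Store are hypothetical stand-ins for the session types.
type Message struct {
	ID      int64
	Content string
}

type Store interface {
	UpdateMessage(ctx context.Context, id int64, m Message) error
}

// failingStore simulates a store whose writes always fail.
type failingStore struct{}

func (failingStore) UpdateMessage(context.Context, int64, Message) error {
	return errors.New("disk full")
}

// persistAndNotify mirrors an already-rewritten message to the store when it
// has an ID, logs (but does not return) a write failure, and always notifies.
// It reports whether a store write was attempted.
func persistAndNotify(ctx context.Context, st Store, m Message, notify func(string)) bool {
	attempted := false
	if m.ID != 0 { // ID 0 means the message was never persisted
		attempted = true
		if err := st.UpdateMessage(ctx, m.ID, m); err != nil {
			slog.WarnContext(ctx, "could not persist scrubbed message",
				"message_id", m.ID, "error", err)
		}
	}
	notify("previous message was rewritten in place")
	return attempted
}

// runDemo exercises a persisted and an unpersisted message against a
// store that always fails.
func runDemo() (attempts []bool, notices int) {
	ctx := context.Background()
	notify := func(string) { notices++ }
	msgs := []Message{
		{ID: 7, Content: "[text removed]"},
		{ID: 0, Content: "[text removed]"},
	}
	for _, m := range msgs {
		attempts = append(attempts, persistAndNotify(ctx, failingStore{}, m, notify))
	}
	return attempts, notices
}

func main() {
	attempts, notices := runDemo()
	fmt.Println(attempts, notices)
}
```

In this sketch the notice fires in both cases, which matches the reviewed behaviour that the user is told about the rewrite whether or not the store write succeeds.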

No confirmed or likely bugs found. The implementation correctly addresses the session-poisoning scenario described in the PR description.

Auto-compaction is only useful when the rejection is a token-count overflow — summarising older turns reduces the input token count. For wire-level overflow ([OverflowKindWire]) the request body itself exceeds the provider's cap, and the latest turn alone is over the limit; the compaction call would have to send the same oversized history and would also be rejected. For media overflow ([OverflowKindMedia]) we have no media-stripping during compaction today, so a retry would resend the same attachment and fail again. In both cases the recovery attempt always fails, then we surface the error anyway, while having spent an extra provider call and several seconds of wall-clock latency. This change skips compaction for those two kinds and surfaces the error directly. The token-overflow path is unchanged.
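
The decision described in that paragraph reduces to a check on the overflow kind. A minimal sketch, assuming an integer enum: the constant names come from the text above, while the type definition, `compactionCanHelp`, and `nextStep` are hypothetical.

```go
package main

import "fmt"

// OverflowKind is an assumed representation; only the constant names
// are taken from the PR text.
type OverflowKind int

const (
	OverflowKindTokens OverflowKind = iota
	OverflowKindWire
	OverflowKindMedia
)

// compactionCanHelp reports whether summarising older turns could shrink
// the request enough to get past the provider's limit.
func compactionCanHelp(kind OverflowKind) bool {
	return kind == OverflowKindTokens
}

// nextStep names the action the loop would take for a given overflow kind.
func nextStep(kind OverflowKind) string {
	if compactionCanHelp(kind) {
		return "compact-and-retry"
	}
	return "surface-error"
}

func main() {
	kinds := []OverflowKind{OverflowKindTokens, OverflowKindWire, OverflowKindMedia}
	for _, k := range kinds {
		fmt.Println(k, nextStep(k))
	}
}
```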

docker-agent

Assessment: 🔴 CRITICAL

1 high-severity and 1 low-severity finding in the new overflow recovery code.

docker-agent

Assessment: 🟢 APPROVE

Reviewed the overflow recovery implementation across pkg/runtime/overflow_recovery.go, pkg/runtime/loop_steps.go, pkg/session/session.go, and pkg/session/store.go.

Summary of analysis:

The implementation is correct and well-structured. Three hypotheses were investigated and all were dismissed after verification:

Token overflow gap — The deliberate decision to only run recoverFromOversizedTurn for OverflowKindWire/OverflowKindMedia (not OverflowKindTokens) is correct by design. Token overflow is caused by accumulated conversation history, not a single oversized message; scrubbing only the latest user message would not recover it. Auto-compaction handles the token case differently and correctly.
Sub-session ID collision in findPersistedMessage — SQLite auto-increment guarantees globally unique message IDs across the table. Additionally, GetAllMessages returns top-level messages before sub-session messages, so findPersistedMessage would always find the correct top-level match first even in the theoretical event of an ID collision.
Misleading log fields for MultiContent scrubbing — parts_replaced > 0 already communicates to operators that content was scrubbed when only MultiContent parts are oversized; the text_replaced/original_text_bytes fields are supplementary precision, not the primary scrub indicator.
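
The ordering argument in the second hypothesis can be illustrated with a first-match scan. This is a sketch under assumptions: the `Message` type and the `findPersistedMessage` signature are invented here, and the duplicate ID is constructed by hand purely to show the lookup order.

```go
package main

import "fmt"

// Message is a hypothetical stand-in for the session message type.
type Message struct {
	ID      int64
	Content string
}

// findPersistedMessage returns the first message with the given ID,
// scanning in the order the slice was assembled.
func findPersistedMessage(all []Message, id int64) (Message, bool) {
	for _, m := range all {
		if m.ID == id {
			return m, true
		}
	}
	return Message{}, false
}

func main() {
	topLevel := []Message{{ID: 1, Content: "top-level"}}
	subSession := []Message{{ID: 1, Content: "sub-session"}} // contrived collision

	// Top-level messages are appended before sub-session messages.
	all := append(append([]Message{}, topLevel...), subSession...)

	m, ok := findPersistedMessage(all, 1)
	fmt.Println(ok, m.Content)
}
```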

Notable design positives:

RewriteLatestUserMessage correctly holds s.mu.Lock() for the full rewrite, preventing torn state
The messageID == 0 guard correctly skips the store write for unpersisted messages
emitScrubNotice is emitted unconditionally after a successful in-memory rewrite, even on persistence failure — this is the right user-observable behavior
scrubMessagePart passes unknown part types through unchanged rather than silently dropping them
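
As an illustration of the first point, a lock-holding rewrite helper could look like the sketch below. The method name comes from the PR; the signature, the callback style, and the `Session` and `Message` fields are assumptions.

```go
package main

import (
	"fmt"
	"sync"
)

// Message and Session are hypothetical stand-ins for the session types.
type Message struct {
	ID      int64
	Role    string
	Content string
}

type Session struct {
	mu       sync.Mutex
	Messages []Message
}

// RewriteLatestUserMessage applies fn to the most recent user message while
// holding the lock for the whole scan and write, so no reader observes a
// half-rewritten message. It returns the rewritten message, or false when
// the session has no user message.
func (s *Session) RewriteLatestUserMessage(fn func(Message) Message) (Message, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for i := len(s.Messages) - 1; i >= 0; i-- {
		if s.Messages[i].Role == "user" {
			s.Messages[i] = fn(s.Messages[i])
			return s.Messages[i], true
		}
	}
	return Message{}, false
}

func main() {
	s := &Session{Messages: []Message{
		{Role: "user", Content: "first"},
		{Role: "assistant", Content: "reply"},
		{Role: "user", Content: "huge paste"},
		{Role: "assistant", Content: "partial"},
	}}
	got, ok := s.RewriteLatestUserMessage(func(m Message) Message {
		m.Content = "[text removed]"
		return m
	})
	fmt.Println(ok, got.Content)
}
```

Walking backwards from the end means trailing assistant or tool messages are skipped and only the latest user turn is touched.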

When the provider rejects a request because the body itself is over the wire-size cap or contains an oversized attachment, the offending user message stays verbatim in the session. Every subsequent call reloads that message as part of the conversation history and trips the same limit. The session is effectively dead until the user starts over.

Add a hygiene step that runs on wire- and media-overflow rejections: walk back to the latest user message, replace each media part (image, file, document) with a text placeholder that records what was attached, and replace plain-text content over 1 MiB with a size-noting placeholder. The rewrite is mirrored to the session store so the next session load reflects it; the in-memory mutation alone keeps the current process healthy even if the store write fails.

A Warning event is emitted so the UI can tell the user that their previous message was rewritten in place. The fatal ErrorEvent for the original rejection is still emitted: scrubbing is in addition to surfacing the error, not instead of it. Token-overflow is unchanged: it still goes through auto-compaction, which is the correct mechanism for that shape of failure.

docker-agent · 2026-05-20T08:10:50Z

❌ PR Review Failed: The review agent encountered an error and could not complete the review.

docker-agent reviewed May 19, 2026

Comment thread pkg/runtime/loop_steps.go Outdated

Comment thread pkg/runtime/overflow_recovery.go

trungutt force-pushed the trungutt/overflow-aware-compaction branch from a1d06ad to 18a703a Compare May 19, 2026 12:17

trungutt changed the base branch from main to trungutt/overflow-skip-compaction-on-wire May 19, 2026 12:24

trungutt changed the title ~~feat(runtime): recover oversized user message after wire/media overflow~~ runtime: recover oversized user message after wire/media overflow May 19, 2026

trungutt marked this pull request as ready for review May 19, 2026 12:45

trungutt requested a review from a team as a code owner May 19, 2026 12:45

docker-agent reviewed May 19, 2026

Comment thread pkg/runtime/overflow_recovery.go Outdated

Comment thread pkg/session/session.go Outdated

trungutt force-pushed the trungutt/overflow-aware-compaction branch 2 times, most recently from b22d0fb to c724506 Compare May 19, 2026 13:17

trungutt requested a review from docker-agent May 19, 2026 14:11

docker-agent reviewed May 19, 2026

trungutt changed the base branch from trungutt/overflow-skip-compaction-on-wire to main May 19, 2026 17:09

trungutt force-pushed the trungutt/overflow-aware-compaction branch from c724506 to b334412 Compare May 19, 2026 17:13

trungutt requested a review from docker-agent May 19, 2026 17:14

docker-agent reviewed May 19, 2026

Comment thread pkg/runtime/overflow_recovery.go Outdated

Comment thread pkg/session/session.go Outdated

trungutt force-pushed the trungutt/overflow-aware-compaction branch from b334412 to 354f333 Compare May 19, 2026 19:04

trungutt requested a review from docker-agent May 20, 2026 07:12

docker-agent reviewed May 20, 2026

trungutt force-pushed the trungutt/overflow-aware-compaction branch from 354f333 to 693a6c1 Compare May 20, 2026 07:25

trungutt requested a review from docker-agent May 20, 2026 07:39

trungutt marked this pull request as draft May 20, 2026 14:34

trungutt wants to merge 2 commits into docker:main from trungutt:trungutt/overflow-aware-compaction

2 participants
