Harden routine replay against long-context attention decay#64
Merged
Conversation
Session d1395b5d ran qwen3.5-flash through the value-stocks-monthly-drop routine replay. Three problems surfaced: 1. The agent lost the routine itself between events 97 and 99 — reasoned "the routine is provided in the user's message" one turn and "I don't have access to this routine" the next, then called please_help_me. 2. It rubber-stamped several confirm_click / confirm_select calls whose reasoning never compared the YELLOW preview to the step's intent. 3. The condenser never fired. qwen3.5-flash advertises a 1M-token context, so the 0.7×context_window threshold (~700k) was unreachable at the observation volumes a routine replay produces. Changes in this PR: - `pyproject.toml` / `uv.lock`: bump agent-sdk to 66ed257b, which adds the matching system-prompt hardening (task_tracker plan pinning in routine_replay mode, small-model confirmation reasoning gate) and refocuses the summarizing-condenser prompt. - `frontend/index.html::buildRoutinePrompt`: send only the SOP markdown as the user message. The routine name / goal / "follow step by step" framing now lives in the ROUTINE_REPLAY system prompt block, so the user message no longer carries a redundant identifier for the model to lose track of. - `server/agent/browser_condenser.py`: add SMALL_MODEL_TOKEN_OVERRIDES mapping model-name substrings to fixed token caps, taking precedence over the context-window derivation. Seeded with qwen3.5-flash -> 100k so condensation actually kicks in on realistic routine replays. - `server/tests/unit/test_browser_condenser.py`: cover the override, the substring match, the non-match fallback, and the configure integration. Replay of the same routine after these changes completed end-to-end with 0 please_help_me calls, 32 actions, and clean step-by-step task_tracker progression. The model also caught a wrong-column click mid-confirm, cancelled the pending confirmation, and re-highlighted with the correct keyword. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Picks up the SMALL_MODEL_GUIDANCE / LARGE_MODEL_GUIDANCE tag rename to <ACTION_PROTOCOL> so the model-tier identity no longer leaks into rendered prompts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
d1395b5d(replayedvalue-stocks-monthly-drop), then rubber-stamped several wrong-element confirmations, then calledplease_help_meat event 99. The condenser never fired because the 0.7×context_window threshold for a 1M-token model is unreachable at realistic observation volumes.66ed257b(onopen-browser) to pin an SOP plan intotask_trackerup-front, force a three-part confirmation-reasoning gate in small-model guidance, and refocus the condenser summary prompt on generic progress framing. The OpenBrowser side strips the redundant "Run the saved routine…" preamble (the routine framing now lives in the system prompt) and adds a per-model token cap forqwen3.5-flashso condensation actually kicks in.please_help_mecalls, 32 actions, clean step-by-steptask_trackerprogression, and one successful self-correction where the model caught a wrong-column click mid-confirm and re-highlighted.Test plan
uv run pre-commit run --files <changed>— pass (black reformatted one test, re-run clean)uv run pytest— 468 passed, 4 skippedvalue-stocks-monthly-dropwithdashscope/qwen3.5-flashend-to-end viaskill/claude/ob-routines/scripts/replay.py(convd7c4856b)🤖 Generated with Claude Code