Harden routine replay against long-context attention decay by softpudding · Pull Request #64 · softpudding/OpenBrowser

softpudding · 2026-04-18T11:32:20Z

Summary

qwen3.5-flash lost the routine mid-run in session d1395b5d (replayed value-stocks-monthly-drop), then rubber-stamped several wrong-element confirmations, then called please_help_me at event 99. The condenser never fired because the 0.7×context_window threshold for a 1M-token model is unreachable at realistic observation volumes.
This PR pairs with agent-sdk commit 66ed257b (on open-browser) to pin an SOP plan into task_tracker up-front, force a three-part confirmation-reasoning gate in small-model guidance, and refocus the condenser summary prompt on generic progress framing. The OpenBrowser side strips the redundant "Run the saved routine…" preamble (the routine framing now lives in the system prompt) and adds a per-model token cap for qwen3.5-flash so condensation actually kicks in.
Re-run of the same routine after these changes completed end-to-end with 0 please_help_me calls, 32 actions, clean step-by-step task_tracker progression, and one successful self-correction where the model caught a wrong-column click mid-confirm and re-highlighted.
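For a sense of scale, the threshold arithmetic can be sketched as below. This is only an illustration: `condensation_threshold` is a hypothetical helper name, not the function in `browser_condenser.py`; only the 0.7 ratio, the 1M-token window, and the 100k cap come from this PR.

```python
def condensation_threshold(context_window: int, ratio: float = 0.7) -> int:
    """Token count at which condensation would trigger (hypothetical helper)."""
    return round(context_window * ratio)


# qwen3.5-flash advertises a 1M-token context window.
derived = condensation_threshold(1_000_000)
print(derived)  # 700000

# The per-model cap introduced in this PR is 100k tokens, so the
# derived threshold sat 7x above the point where condensation now fires.
cap = 100_000
print(derived / cap)  # 7.0
```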

Test plan

uv run pre-commit run --files <changed> — pass (black reformatted one test, re-run clean)
uv run pytest — 468 passed, 4 skipped
Live replay of value-stocks-monthly-drop with dashscope/qwen3.5-flash end-to-end via skill/claude/ob-routines/scripts/replay.py (conv d7c4856b)
CI: Pre-commit / Pytest / Extension Tests

🤖 Generated with Claude Code

Session d1395b5d ran qwen3.5-flash through the value-stocks-monthly-drop routine replay. Three problems surfaced:

1. The agent lost the routine itself between events 97 and 99: it reasoned "the routine is provided in the user's message" one turn and "I don't have access to this routine" the next, then called please_help_me.
2. It rubber-stamped several confirm_click / confirm_select calls whose reasoning never compared the YELLOW preview to the step's intent.
3. The condenser never fired. qwen3.5-flash advertises a 1M-token context, so the 0.7×context_window threshold (~700k) was unreachable at the observation volumes a routine replay produces.

Changes in this PR:

- `pyproject.toml` / `uv.lock`: bump agent-sdk to 66ed257b, which adds the matching system-prompt hardening (task_tracker plan pinning in routine_replay mode, small-model confirmation reasoning gate) and refocuses the summarizing-condenser prompt.
- `frontend/index.html::buildRoutinePrompt`: send only the SOP markdown as the user message. The routine name / goal / "follow step by step" framing now lives in the ROUTINE_REPLAY system prompt block, so the user message no longer carries a redundant identifier for the model to lose track of.
- `server/agent/browser_condenser.py`: add SMALL_MODEL_TOKEN_OVERRIDES mapping model-name substrings to fixed token caps, taking precedence over the context-window derivation. Seeded with qwen3.5-flash -> 100k so condensation actually kicks in on realistic routine replays.
- `server/tests/unit/test_browser_condenser.py`: cover the override, the substring match, the non-match fallback, and the configure integration.

Replay of the same routine after these changes completed end-to-end with 0 please_help_me calls, 32 actions, and clean step-by-step task_tracker progression. The model also caught a wrong-column click mid-confirm, cancelled the pending confirmation, and re-highlighted with the correct keyword.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
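The override lookup described for `browser_condenser.py` might look roughly like the sketch below. The diff itself is not shown here, so treat this as an assumption-laden illustration: `resolve_max_tokens`, its signature, and the model name `example-large-model` are hypothetical. Only the `SMALL_MODEL_TOKEN_OVERRIDES` name, the qwen3.5-flash -> 100k seed, the substring match, and the 0.7×context_window fallback come from the commit message.

```python
# Model-name substring -> fixed token cap. The single seed entry is the one
# named in the commit message.
SMALL_MODEL_TOKEN_OVERRIDES: dict[str, int] = {
    "qwen3.5-flash": 100_000,
}


def resolve_max_tokens(model_name: str, context_window: int, ratio: float = 0.7) -> int:
    """Return the condensation token cap for a model (hypothetical helper).

    An override whose key is a substring of the model name takes precedence;
    otherwise the cap is derived from the context window.
    """
    for needle, cap in SMALL_MODEL_TOKEN_OVERRIDES.items():
        if needle in model_name:
            return cap
    return round(context_window * ratio)


# Substring match: the provider prefix does not defeat the override.
print(resolve_max_tokens("dashscope/qwen3.5-flash", 1_000_000))  # 100000

# Non-match fallback: derived from the context window.
print(resolve_max_tokens("example-large-model", 200_000))  # 140000
```

Matching on a substring rather than the full identifier means provider-prefixed names such as `dashscope/qwen3.5-flash` hit the same entry without needing a separate key.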

softpudding and others added 2 commits April 18, 2026 19:31

Bump agent-sdk to c92a185a

e445430

Picks up the SMALL_MODEL_GUIDANCE / LARGE_MODEL_GUIDANCE tag rename to <ACTION_PROTOCOL> so the model-tier identity no longer leaks into rendered prompts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

softpudding merged commit e32fa5e into main Apr 18, 2026
4 checks passed

softpudding merged 2 commits into main from fix/routine-replay-attention-decay