Harden routine replay against long-context attention decay #64

Merged
softpudding merged 2 commits into main from fix/routine-replay-attention-decay
Apr 18, 2026

@softpudding

Owner

Summary

  • qwen3.5-flash lost the routine mid-run in session d1395b5d (replayed value-stocks-monthly-drop), then rubber-stamped several wrong-element confirmations, then called please_help_me at event 99. The condenser never fired because the 0.7×context_window threshold for a 1M-token model is unreachable at realistic observation volumes.
  • This PR pairs with agent-sdk commit 66ed257b (on open-browser) to pin an SOP plan into task_tracker up-front, force a three-part confirmation-reasoning gate in small-model guidance, and refocus the condenser summary prompt on generic progress framing. The OpenBrowser side strips the redundant "Run the saved routine…" preamble (the routine framing now lives in the system prompt) and adds a per-model token cap for qwen3.5-flash so condensation actually kicks in.
  • Re-run of the same routine after these changes completed end-to-end with 0 please_help_me calls, 32 actions, clean step-by-step task_tracker progression, and one successful self-correction where the model caught a wrong-column click mid-confirm and re-highlighted.

Test plan

  • uv run pre-commit run --files <changed> — pass (black reformatted one test, re-run clean)
  • uv run pytest — 468 passed, 4 skipped
  • Live replay of value-stocks-monthly-drop with dashscope/qwen3.5-flash end-to-end via skill/claude/ob-routines/scripts/replay.py (conv d7c4856b)
  • CI: Pre-commit / Pytest / Extension Tests

🤖 Generated with Claude Code

softpudding and others added 2 commits April 18, 2026 19:31
Session d1395b5d ran qwen3.5-flash through the value-stocks-monthly-drop
routine replay. Three problems surfaced:

1. The agent lost the routine itself between events 97 and 99 — reasoned
   "the routine is provided in the user's message" one turn and "I don't
   have access to this routine" the next, then called please_help_me.
2. It rubber-stamped several confirm_click / confirm_select calls whose
   reasoning never compared the YELLOW preview to the step's intent.
3. The condenser never fired. qwen3.5-flash advertises a 1M-token
   context, so the 0.7×context_window threshold (~700k) was unreachable
   at the observation volumes a routine replay produces.
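
To make the arithmetic in item 3 concrete, here is a minimal sketch of a fraction-of-context threshold. The names `THRESHOLD_PERCENT`, `condense_threshold`, and `should_condense` are hypothetical and chosen for illustration; they are not taken from the condenser's code. Only the 0.7 factor and the 1M-token window come from the description above.

```python
# Illustrative only: names are assumptions, not the real condenser API.
THRESHOLD_PERCENT = 70  # the 0.7 x context_window rule, kept as an integer


def condense_threshold(context_window: int) -> int:
    """Token count at which a fraction-based condenser would fire."""
    return context_window * THRESHOLD_PERCENT // 100


def should_condense(tokens_in_context: int, context_window: int) -> bool:
    """True once the running context reaches the derived threshold."""
    return tokens_in_context >= condense_threshold(context_window)


# With a 1M-token window the threshold lands at 700k tokens.
threshold_1m = condense_threshold(1_000_000)
```

Under this rule, a replay whose context stays well below 700k tokens never triggers condensation, which matches the behaviour reported for the failing session.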

Changes in this PR:

- `pyproject.toml` / `uv.lock`: bump agent-sdk to 66ed257b, which
  adds the matching system-prompt hardening (task_tracker plan pinning
  in routine_replay mode, small-model confirmation reasoning gate) and
  refocuses the summarizing-condenser prompt.
- `frontend/index.html::buildRoutinePrompt`: send only the SOP markdown
  as the user message. The routine name / goal / "follow step by step"
  framing now lives in the ROUTINE_REPLAY system prompt block, so the
  user message no longer carries a redundant identifier for the model
  to lose track of.
- `server/agent/browser_condenser.py`: add SMALL_MODEL_TOKEN_OVERRIDES
  mapping model-name substrings to fixed token caps, taking precedence
  over the context-window derivation. Seeded with qwen3.5-flash -> 100k
  so condensation actually kicks in on realistic routine replays.
- `server/tests/unit/test_browser_condenser.py`: cover the override,
  the substring match, the non-match fallback, and the configure
  integration.
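
The override lookup described for `server/agent/browser_condenser.py` could look roughly like the sketch below. Only the `SMALL_MODEL_TOKEN_OVERRIDES` name, the substring match, the qwen3.5-flash -> 100k seed, and the precedence over the context-window derivation come from this commit message. The function name `resolve_token_cap`, its signature, and the integer form of the 0.7 fallback are assumptions made for illustration.

```python
# Sketch under stated assumptions; not the actual module contents.
SMALL_MODEL_TOKEN_OVERRIDES: dict[str, int] = {
    "qwen3.5-flash": 100_000,  # seed entry named in the commit message
}


def resolve_token_cap(model_name: str, context_window: int) -> int:
    """Return the token cap at which condensation should start.

    A substring match against SMALL_MODEL_TOKEN_OVERRIDES wins; otherwise
    the cap falls back to 70% of the advertised context window.
    """
    for needle, cap in SMALL_MODEL_TOKEN_OVERRIDES.items():
        if needle in model_name:
            return cap
    # Hypothetical fallback mirroring the 0.7 x context_window derivation.
    return context_window * 70 // 100


# A provider-prefixed name still matches because the check is a substring test.
flash_cap = resolve_token_cap("dashscope/qwen3.5-flash", 1_000_000)
other_cap = resolve_token_cap("some-other-model", 200_000)
```

Matching on a substring rather than the full name keeps the table short when the same model is reached through different provider prefixes, at the cost of possible accidental matches on similarly named models.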

Replay of the same routine after these changes completed end-to-end
with 0 please_help_me calls, 32 actions, and clean step-by-step
task_tracker progression. The model also caught a wrong-column click
mid-confirm, cancelled the pending confirmation, and re-highlighted
with the correct keyword.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Picks up the SMALL_MODEL_GUIDANCE / LARGE_MODEL_GUIDANCE tag rename to
<ACTION_PROTOCOL> so the model-tier identity no longer leaks into
rendered prompts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@softpudding softpudding merged commit e32fa5e into main Apr 18, 2026
4 checks passed