
Observability: code execution timeout awareness + subagent lifecycle ledger #24

Description

@Number531

Context

Raw SSE logs from session 2026-03-02-1772489470 revealed two observability gaps:

  1. 3 code execution timeouts: run_python_analysis hit the 300s wall-clock limit because neither the system prompt nor the tool description told the model about the time budget. The model wrote expensive Monte Carlo simulations that exceeded it.
  2. Subagent start/stop mismatch (33 starts vs 40 stops) — the agentRegistry in hookSSEBridge.js was fire-and-forget (deleted entries on stop), so there was no session-end audit. Stops without matching starts (compaction recovery, gate re-invocations) went undetected.

Both were observability/guidance gaps, not functional bugs — the session completed successfully.

Changes (v3.10.3, commit 8f1f2cf)

Issue 1: Code Execution Timeout Awareness

  • System prompt: Execution budget paragraph added (interpolates OVERALL_TIMEOUT_MS / 1000 = 300s)
  • Tool description: EXECUTION BUDGET block with iteration limits and vectorization guidance
  • Structured error: Timeout path now sets timeout: true + retry_hint on the result object
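
The three changes above can be sketched roughly as follows. This is a minimal illustration, not the code from commit 8f1f2cf: only OVERALL_TIMEOUT_MS, timeout, and retry_hint are named in this issue. The helper names (executionBudgetParagraph, timeoutResult, runWithBudget), the success and error fields, and the prompt and hint wording are assumptions.

```javascript
// Assumed value: the issue only states that OVERALL_TIMEOUT_MS / 1000 = 300s.
const OVERALL_TIMEOUT_MS = 300_000;

// Hypothetical helper: builds the budget paragraph from the constant so the
// prompt text and the enforced limit cannot drift apart.
function executionBudgetParagraph(timeoutMs = OVERALL_TIMEOUT_MS) {
  const seconds = timeoutMs / 1000;
  return (
    `Execution budget: each code execution call has a ${seconds}s wall-clock limit. ` +
    `Keep iteration counts bounded and prefer vectorized operations.`
  );
}

// Structured result for the timeout path. `timeout` and `retry_hint` come from
// the issue; `success`, `error`, and the hint wording are illustrative.
function timeoutResult(timeoutMs = OVERALL_TIMEOUT_MS) {
  return {
    success: false,
    timeout: true,
    error: `Execution exceeded the ${timeoutMs / 1000}s wall-clock limit`,
    retry_hint: "Reduce iteration counts or vectorize the computation, then retry.",
  };
}

// Hypothetical wrapper: races the execution against a timer. Note that this
// only stops waiting; it does not cancel the underlying execution.
function runWithBudget(execute, timeoutMs = OVERALL_TIMEOUT_MS) {
  let timer;
  const expiry = new Promise((resolve) => {
    timer = setTimeout(() => resolve(timeoutResult(timeoutMs)), timeoutMs);
  });
  // Promise.resolve().then(execute) also captures synchronous throws.
  return Promise.race([Promise.resolve().then(execute), expiry]).finally(() =>
    clearTimeout(timer)
  );
}
```

A caller can then branch on result.timeout instead of parsing an error string, which is what makes the retry_hint usable by the model.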

Issue 2: Subagent Lifecycle Ledger

  • createSSEBridge(): New factory replacing wrapHooksForSSE() as the server's bridge entry point
  • agentLedger (Map): Persistent ledger tracking all start/stop events (never deletes entries)
  • Orphan detection: Stops without matching starts logged to console + flagged in summary
  • getAgentSummary(): Returns { total_starts, total_stops, orphan_stops[], unmatched_starts[], has_mismatch }
  • SSE event: agent_lifecycle_summary emitted before manifest.finalize() when mismatches exist
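
As a rough sketch of the ledger behavior described above (not the actual hookSSEBridge.js code): the summary field names and the agent_lifecycle_summary event name come from this issue, while createAgentLedger, recordStart, recordStop, emitSummaryIfMismatch, and the sendEvent callback are hypothetical names chosen for illustration.

```javascript
// Persistent lifecycle ledger: entries are never deleted, so a session-end
// summary can compare starts against stops.
function createAgentLedger(log = console.warn) {
  const agentLedger = new Map(); // agentId -> { starts, stops, open }
  const orphanStops = [];

  function entryFor(agentId) {
    if (!agentLedger.has(agentId)) {
      agentLedger.set(agentId, { starts: 0, stops: 0, open: 0 });
    }
    return agentLedger.get(agentId);
  }

  return {
    recordStart(agentId) {
      const entry = entryFor(agentId);
      entry.starts += 1;
      entry.open += 1;
    },
    recordStop(agentId) {
      const entry = entryFor(agentId);
      entry.stops += 1;
      if (entry.open > 0) {
        entry.open -= 1; // matched an earlier start
      } else {
        orphanStops.push(agentId); // stop with no start currently open
        log(`[agentLedger] stop without matching start: ${agentId}`);
      }
    },
    getAgentSummary() {
      let total_starts = 0;
      let total_stops = 0;
      const unmatched_starts = [];
      for (const [agentId, entry] of agentLedger) {
        total_starts += entry.starts;
        total_stops += entry.stops;
        if (entry.open > 0) unmatched_starts.push(agentId);
      }
      return {
        total_starts,
        total_stops,
        orphan_stops: [...orphanStops],
        unmatched_starts,
        has_mismatch: orphanStops.length > 0 || unmatched_starts.length > 0,
      };
    },
  };
}

// Emit the summary only when something is off; intended to run just before
// the manifest is finalized. `sendEvent(name, payload)` is an assumed callback.
function emitSummaryIfMismatch(ledger, sendEvent) {
  const summary = ledger.getAgentSummary();
  if (summary.has_mismatch) sendEvent("agent_lifecycle_summary", summary);
  return summary;
}
```

Tracking an open count per agent, rather than comparing totals, keeps a stop-then-start sequence from cancelling out: the stop is still reported as an orphan and the later start as unmatched.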

Backward Compat

  • wrapHooksForSSE() remains exported and functional (passes null for ledger — all guards are no-ops)
  • Zero behavioral change to any existing code path
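
One way the null-ledger guard could work is sketched below. These are simplified stand-ins that reuse the exported names for readability; the real signatures are not shown in this issue. The hook names (onSubagentStart, onSubagentStop), the wrapWithLedger helper, and the reduced summary shape are assumptions.

```javascript
// Shared wrapper: every ledger access is guarded, so passing null makes the
// ledger logic a no-op and the wrapped hooks behave exactly as before.
function wrapWithLedger(hooks, ledger) {
  return {
    onSubagentStart(agentId) {
      if (ledger) ledger.starts.push(agentId);
      return hooks.onSubagentStart?.(agentId);
    },
    onSubagentStop(agentId) {
      if (ledger) ledger.stops.push(agentId);
      return hooks.onSubagentStop?.(agentId);
    },
  };
}

// Legacy entry point: same wrapping, no ledger.
function wrapHooksForSSE(hooks) {
  return wrapWithLedger(hooks, null);
}

// New entry point: same wrapping plus a ledger and a summary accessor.
// Only the totals are shown here; the full summary shape is omitted.
function createSSEBridge(hooks) {
  const ledger = { starts: [], stops: [] };
  return {
    hooks: wrapWithLedger(hooks, ledger),
    getAgentSummary() {
      return {
        total_starts: ledger.starts.length,
        total_stops: ledger.stops.length,
      };
    },
  };
}
```

Routing both entry points through one wrapper is what lets the legacy path stay unchanged: the only difference between them is whether a ledger object exists.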

Test Results

  • test/sdk/code-execution-bridge.test.js: 337/337 pass
  • Inline hookSSEBridge verification: 8/8 scenarios pass (normal, orphan, unmatched, bail-out)

Files Changed

File                                Lines
src/tools/codeExecutionBridge.js    +18 -6
src/utils/hookSSEBridge.js          +108 -2
src/server/claude-sdk-server.js     +13 -2
CHANGELOG.md                        +29

Future Work

  • Frontend: render agent_lifecycle_summary SSE event in the dashboard (e.g., warning badge on mismatch)
  • Metrics: track timeout frequency and retry success rate in hookDBBridge
  • Consider adjusting OVERALL_TIMEOUT_MS if Anthropic changes the ~4.5 min container idle expiry
  • Evaluate whether retry_hint should trigger automatic retry with simplified constraints


Labels: enhancement (New feature or request), infrastructure (Backend/infrastructure changes)
