diff --git a/tests/playwright/STRESS-EXPLORATION-REPORT.md b/tests/playwright/STRESS-EXPLORATION-REPORT.md
index d5ea07ac..ea9b9514 100644
--- a/tests/playwright/STRESS-EXPLORATION-REPORT.md
+++ b/tests/playwright/STRESS-EXPLORATION-REPORT.md
@@ -46,8 +46,9 @@ For duration runs, the explorer also records first/last one-second FPS windows,
 | Chat mix ramp | custom | 1,000 | 10 | 30s | 5% | 300,000 | 25.5 | 50.1 | 82.4ms | 0 | FAIL | Two runs: min FPS 24, 27 with 15,000 chat events. 10%/25% were skipped after this confirmed the lower break. |
 | Duration ramp | custom PTY-only | 1,000 | 10 | 2m | 0% | 1,200,000 | 50.0 | 59.1 | 75.1ms | 0 | PASS | Two runs: min FPS 49, 51. First 30s avg 58.6/58.7; last 30s avg 59.2/59.2. |
 | Duration ramp | custom PTY-only | 1,000 | 10 | 5m | 0% | 3,000,000 | 50.5 | 59.1 | 68.2ms | 0 | PASS | Two runs: min FPS 50, 51. First 30s avg 58.9/58.9; last 30s avg 59.2/59.3. |
-| Duration ramp | custom PTY-only | 1,000 | 10 | 15m | 0% | TBD | TBD | TBD | TBD | TBD | TBD | Pending. |
-| Combined high-load | custom | 5,000 | 50 | 5m | 25% | TBD | TBD | TBD | TBD | TBD | TBD | Pending. |
+| Duration ramp | custom PTY-only | 1,000 | 10 | 15m | 0% | 9,000,000 | 19.5 | 59.2 | 425.3ms | 0 | FAIL | Two runs: min FPS 18, 21; longest frame 550.4ms, 300.1ms. Last 30s avg stayed 59.4/59.2, so this is intermittent pause behavior, not sustained drift. CDP heap delta +570MB/+1,061MB. |
+| Duration ramp | custom PTY-only | 100 | 10 | 15m | 0% | 900,000 | 2.0 | 57.9 | 1,134.3ms | 0 | FAIL | Single run only; second repeat was stopped to avoid contaminating concurrent perf validation. CDP heap delta +117.6MB. Mock PTY store retained 9,100 chunks / 46.4M characters, max 464.6k chars for one agent. |
+| Combined high-load | custom | 1,000 | 25 | 5m | 1% | TBD | TBD | TBD | TBD | TBD | TBD | Pending; paused while perf implementation team reruns stress gates to avoid benchmark contention. |
 
 ## Break Point Analysis
 
@@ -73,13 +74,15 @@ The failure mode is sustained low per-second FPS during the run, not one-off lon
 
 The safe 1,000-agent PTY-only profile does not show FPS drift through 5 minutes. Both 2-minute runs and both 5-minute runs passed with min FPS at or above 49, and every run's last 30-second average FPS was slightly higher than its first 30-second average FPS.
 
-The in-page `performance.memory` reading stayed flat within each of these runs, but this appears too coarse to rely on. CDP heap sampling has been added for the remaining 15-minute and combined runs.
+The 15-minute 1,000-agent case failed twice, but not as sustained degradation. Average FPS remained around 59 and the final 30-second windows were healthy. The failures came from isolated long pauses: 550ms and 300ms longest frames, with min FPS 18 and 21. CDP heap sampling showed large growth over the run: +570MB and +1,061MB, consistent with heap-pressure/major-GC pauses.
+
+A 100-agent 15-minute scaling check also failed on one long pause, with min FPS 2 and a 1,134ms longest frame, while its final 30-second average was still 59.5 FPS. CDP heap delta was +117.6MB, much lower than the 1,000-agent runs. The mock PTY store retained 9,100 chunk entries and 46.4M characters for that run, so at least part of the heap growth is retained PTY text; the remaining gap is consistent with renderer terminal/xterm buffer structures. The second 100-agent repeat was intentionally stopped to avoid contaminating concurrent perf validation in the same checkout.
 
 ## Bottleneck Hypothesis
 
 Current data suggests the PTY-heavy break is driven more by agent/stream multiplicity than by raw PTY line count. The renderer stays smooth at 1,000 streams with up to 250k logical PTY events/sec, but drops below the FPS gate at 1,750-2,000 streams with only 17.5k-20k logical PTY events/sec.
 
-Likely contributors are per-agent terminal bookkeeping, PTY buffer fanout by agent key, store reconciliation over larger agent collections, and DOM/list work associated with many tracked agents. The known chat-heavy expected-fail remains a separate per-row chat rendering bottleneck.
+Likely contributors are per-agent terminal bookkeeping, PTY buffer fanout by agent key, store reconciliation over larger agent collections, and DOM/list work associated with many tracked agents. The 15-minute heap data strengthens the xterm/terminal-buffer hypothesis: the long-run failures look like major GC pauses after retained terminal/PTY state grows, not raw throughput collapse. The known chat-heavy expected-fail remains a separate per-row chat rendering bottleneck.
 
 The chat mix data confirms a separate chat-path bottleneck: even 3,000 live chat messages over 30s is borderline, and 7,500 chat messages fails consistently. That points at message reconciliation, chat list virtualization pressure, markdown formatting, `ChatMessage` subtree cost, and agent metadata lookups in chat rows.
 
@@ -90,6 +93,7 @@ Recommendations from the completed axes:
 - Add a perf regression target around 1,625-1,750 PTY-only agents at 10 events/sec/agent; this is the current knee.
 - Profile per-agent PTY dispatch and terminal bookkeeping before optimizing raw chunk size. The 25 events/sec result suggests throughput bytes are not the first limit.
 - Keep PTY aggregation in place for high-volume stream traffic; removing it turns the test into a per-tick chunk churn benchmark and inflates wall time.
+- Prioritize bounding retained terminal/PTY state for non-visible agents before deeper CPU tuning. The 15-minute failures are heap/GC shaped, and lazy mounting inactive terminals should directly reduce xterm scrollback allocation.
 - Add a chat-mix regression target around 1%-2.5% chat at 1,000 agents and 10 events/sec/agent; this is the current chat-path knee.
 - Treat the chat-heavy profile as a separate optimization track: reduce per-row chat render work, memoize agent metadata lookup used by chat rows, and consider batching/debouncing message reconciliation under replay bursts.
 - Add a follow-up "agents but non-rendered" explorer mode before the next production fix. If 2,000+ spawned agents stay smooth when only one `TerminalInstance` is mounted, lazy unmounting inactive terminal panes should be the highest-leverage PTY-side optimization.
diff --git a/tests/playwright/stress-explorer.spec.ts b/tests/playwright/stress-explorer.spec.ts
index fe7c7c81..22bc9f2c 100644
--- a/tests/playwright/stress-explorer.spec.ts
+++ b/tests/playwright/stress-explorer.spec.ts
@@ -56,6 +56,9 @@ type StressResult = {
   finalMockAgentCount: number
   finalMockBrokerEventCount: number
   ptyChunkKeyCount: number
+  ptyChunkEntryCount: number
+  ptyChunkCharacterCount: number
+  maxPtyChunkCharactersForAgent: number
   terminalSampleAgent: string
 }
 
@@ -167,6 +170,9 @@ test.describe('renderer stress explorer', () => {
       finalMockAgentCount: 0,
       finalMockBrokerEventCount: 0,
       ptyChunkKeyCount: 0,
+      ptyChunkEntryCount: 0,
+      ptyChunkCharacterCount: 0,
+      maxPtyChunkCharactersForAgent: 0,
       terminalSampleAgent: 'agent-0001'
     }
 
@@ -350,6 +356,10 @@ test.describe('renderer stress explorer', () => {
       await new Promise((resolve) => setTimeout(resolve, 250))
 
       const state = mock.getState()
+      const ptyChunkValues = Object.values(state.ptyChunks)
+      const ptyChunkCharacterCounts = ptyChunkValues.map((chunks) =>
+        chunks.reduce((sum, chunk) => sum + chunk.length, 0)
+      )
       const totalFrameMs = frameDeltas.reduce((sum, delta) => sum + delta, 0)
       const longestFrameMs = frameDeltas.length > 0 ? Math.max(...frameDeltas) : 0
       const avgFrameMs = totalFrameMs / Math.max(1, frameDeltas.length)
@@ -397,6 +407,9 @@ test.describe('renderer stress explorer', () => {
         finalMockAgentCount: state.agents.length,
         finalMockBrokerEventCount: state.events.length,
         ptyChunkKeyCount: Object.keys(state.ptyChunks).length,
+        ptyChunkEntryCount: ptyChunkValues.reduce((sum, chunks) => sum + chunks.length, 0),
+        ptyChunkCharacterCount: ptyChunkCharacterCounts.reduce((sum, count) => sum + count, 0),
+        maxPtyChunkCharactersForAgent: ptyChunkCharacterCounts.length > 0 ? Math.max(...ptyChunkCharacterCounts) : 0,
         terminalSampleAgent: 'agent-0001'
       }
     },
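
The retained-PTY accounting added in the last two spec hunks can be read as one standalone helper. The sketch below is illustrative only: `summarizePtyChunks`, `PtyChunkStore`, and `PtyChunkSummary` are hypothetical names, and the store shape (agent key mapped to an array of string chunks) is an assumption inferred from the `chunk.length` and `chunks.length` usage in the diff, not a confirmed type. It uses a single loop with a running maximum instead of `Math.max(...)` over a spread array, which should give the same result without depending on the engine's argument-count limit.

```typescript
// Assumed shape: agent key -> retained PTY text chunks for that agent.
type PtyChunkStore = Record<string, string[]>

// Field names mirror the new StressResult fields in the diff.
type PtyChunkSummary = {
  ptyChunkKeyCount: number
  ptyChunkEntryCount: number
  ptyChunkCharacterCount: number
  maxPtyChunkCharactersForAgent: number
}

// Hypothetical helper: one pass over the store, no intermediate arrays.
function summarizePtyChunks(ptyChunks: PtyChunkStore): PtyChunkSummary {
  const summary: PtyChunkSummary = {
    ptyChunkKeyCount: 0,
    ptyChunkEntryCount: 0,
    ptyChunkCharacterCount: 0,
    maxPtyChunkCharactersForAgent: 0
  }

  for (const chunks of Object.values(ptyChunks)) {
    // String length counts UTF-16 code units, the same measure the diff uses.
    let agentCharacters = 0
    for (const chunk of chunks) {
      agentCharacters += chunk.length
    }

    summary.ptyChunkKeyCount += 1
    summary.ptyChunkEntryCount += chunks.length
    summary.ptyChunkCharacterCount += agentCharacters
    summary.maxPtyChunkCharactersForAgent = Math.max(
      summary.maxPtyChunkCharactersForAgent,
      agentCharacters
    )
  }

  return summary
}
```

Applied to the 100-agent run in the report, these fields correspond to 9,100 chunk entries, 46.4M characters in total, and a 464.6k-character maximum for one agent. Since 46.4M / 100 is about 464k, the per-agent maximum sits close to the mean, which suggests retention grew roughly evenly across agents in that run.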