fix(sdk-driver): resolve idle-timeout as success when output collected and all tools complete#40419
Conversation
…and no pending tools When session.sendAndWait times out waiting for session.idle but the agent produced output and all tracked tool calls have already completed, the SDK simply failed to emit the idle signal — the agent's work is done. Resolve with exitCode 0 instead of 1 so the harness does not classify the run as a failure or waste retry budget. Also fix the catch block to: - Use the accumulated hasOutput value (was hardcoded false) - Preserve the accumulated output in the error return (was discarded) Adds three regression tests covering: - idle-timeout with output + all tool calls complete → exitCode 0 (success) - idle-timeout with pending tool calls → exitCode 1 (genuine failure, retry) - idle-timeout with no output collected → exitCode 1 (no work done) Closes #40418 Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Extract the inline idle-timeout regex into a named module-level constant SDK_IDLE_TIMEOUT_PATTERN, matching the more specific format already used in copilot_harness.cjs (SDK_SESSION_IDLE_TIMEOUT_PATTERN). This makes the intent clear and is easy to update if the SDK error message format changes. Also export the constant for testability. Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
|
✅ Design Decision Gate 🏗️ completed the design decision gate check. No ADR enforcement needed: PR #40419 does not have the 'implementation' label and has 0 new lines of code in business logic directories (≤100 threshold). Only 2 files changed, neither in tracked business-logic paths. |
|
🧠 Matt Pocock Skills Reviewer has completed the skills-based review. ✅ |
|
✅ PR Code Quality Reviewer completed the code quality review. |
|
✅ Test Quality Sentinel completed test quality analysis. |
There was a problem hiding this comment.
Pull request overview
This PR adjusts the Copilot SDK session driver so that a sendAndWait idle-timeout (waiting for session.idle) is treated as a successful completion when the agent has already produced output and there are no tracked in-flight tool calls, preventing long single-turn runs from being incorrectly marked as failures.
Changes:
- Add
SDK_IDLE_TIMEOUT_PATTERNand use it inrunWithCopilotSDKto detect the specific SDKsession.idletimeout message. - On idle-timeout errors, return
exitCode: 0whenhasOutput === trueandpendingToolCalls.size === 0; otherwise preserve accumulated output and returnexitCode: 1with correcthasOutput. - Add regression tests covering the success case and two failure cases (pending tool calls, no output).
Show a summary per file
| File | Description |
|---|---|
| actions/setup/js/copilot_sdk_session.cjs | Tracks pending tool calls across try/catch, preserves accumulated output on error, and treats SDK idle-timeout as success when work is complete. |
| actions/setup/js/copilot_sdk_driver.test.cjs | Adds regression coverage for idle-timeout behavior across success/failure scenarios. |
Copilot's findings
Tip
Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Files reviewed: 2/2 changed files
- Comments generated: 0
🧪 Test Quality Sentinel Report✅ Test Quality Score: 90/100 — Excellent
📊 Metrics & Test Classification (3 tests analyzed)
Go: 0 ( i️ Note on Test Inflation (non-blocking)The test file added 143 lines versus 30 lines in the production file (≈ 4.8:1 ratio, threshold 2:1). Per the rubric this deducts 10 points from the Proportional Growth component. In context this is expected and justified: the production fix is compact (pattern match + conditional in the Verdict
|
There was a problem hiding this comment.
✅ Test Quality Sentinel: 90/100. Test quality is excellent — 0% of new tests are implementation tests (threshold: 30%). All 3 tests verify observable behavioral contracts (exitCode and hasOutput under distinct idle-timeout scenarios). Mocking targets are external SDK I/O dependencies.
There was a problem hiding this comment.
Skills-Based Review 🧠
Applied /diagnose and /tdd — approving with minor suggestions on test completeness and a cross-file DRY opportunity.
📋 Key Themes & Highlights
Key Themes
- Cross-file regex duplication:
SDK_IDLE_TIMEOUT_PATTERN(new) andSDK_SESSION_IDLE_TIMEOUT_PATTERN(existing incopilot_harness.cjs) are identical. The constant is now exported — the harness could import it to eliminate the "keep in sync" burden. - Test specification gaps: The three regression tests are well-structured, but the two failure paths don't assert cleanup (
disconnect/stop) and the no-output test leavesoutputcontent unspecified. - Untested guard: The
!catastrophicToolDenialsErrorcondition that prevents tool-denial failures from being promoted to success has no dedicated test.
Positive Highlights
- ✅ Root cause addressed cleanly: hoisting
pendingToolCallsto function scope is the minimal structural change needed — nothing was added unnecessarily. - ✅ Three-condition guard (
isIdleTimeout && hasOutput && pendingToolCalls.size === 0) is precise and well-reasoned; each condition maps directly to a test case. - ✅
hasOutputandoutputfix on the error path are bonuses that correct silent data loss and wrong retry signals. - ✅ Mock event sequences (start/complete pairing) in the tests exercise real state transitions, not just surface throw/catch behaviour.
- ✅ The PR description is excellent — the before/after code block makes the change immediately graspable.
🧠 Reviewed using Matt Pocock's skills by Matt Pocock Skills Reviewer · 73.3 AIC · ⌖ 10.6 AIC · ⊞ 6.9K
| // deadline waiting for the session.idle event. This matches the message format | ||
| // "Timeout after <N>ms waiting for session.idle" produced by the Copilot SDK. | ||
| // Keep in sync with SDK_SESSION_IDLE_TIMEOUT_PATTERN in copilot_harness.cjs. | ||
| const SDK_IDLE_TIMEOUT_PATTERN = /Timeout after \d+ms waiting for session\.idle/; |
There was a problem hiding this comment.
[/diagnose] SDK_IDLE_TIMEOUT_PATTERN is identical to SDK_SESSION_IDLE_TIMEOUT_PATTERN already defined in copilot_harness.cjs (line 120). The "Keep in sync" comment flags a live drift risk — the two patterns can diverge silently.
💡 Suggestion: import from here instead of duplicating
Since SDK_IDLE_TIMEOUT_PATTERN is now exported, copilot_harness.cjs could import it directly:
// copilot_harness.cjs
const { SDK_IDLE_TIMEOUT_PATTERN } = require("./copilot_sdk_session.cjs");
// ... replace SDK_SESSION_IDLE_TIMEOUT_PATTERN usages with SDK_IDLE_TIMEOUT_PATTERNThis eliminates the "keep in sync" maintenance burden and makes the single source of truth explicit.
| // output and all tracked tool calls have already completed, the session work is | ||
| // done — the SDK simply failed to emit the idle signal. Treat it as a successful | ||
| // run so the harness does not classify it as a failure or waste retry attempts. | ||
| const isIdleTimeout = !catastrophicToolDenialsError && SDK_IDLE_TIMEOUT_PATTERN.test(failure.message); |
There was a problem hiding this comment.
[/tdd] The !catastrophicToolDenialsError guard is a critical safety rail — it prevents a tool-denial failure whose error message accidentally matches the idle-timeout pattern from being promoted to success. But this branch is not covered by a regression test.
💡 Suggested test case
it("returns exitCode 1 when catastrophic tool denials fires even if timeout message matches", async () => {
// Simulate maxToolDenials being hit; the driver should NOT treat this as
// an idle-timeout success even if the error text happens to match.
const session = {
sessionId: "session-denial-not-idle",
on: handler => { onEvent = handler; },
sendAndWait: vi.fn().mockImplementation(async () => {
// fire enough tool-denial events to breach the threshold, then stall
throw new Error("Timeout after 870000ms waiting for session.idle");
}),
disconnect: vi.fn().mockResolvedValue(undefined),
};
// ... configure maxToolDenials:1 and fire denial events ...
expect(result.exitCode).toBe(1);
});Without this test, a future refactor that accidentally drops the !catastrophicToolDenialsError condition would go undetected.
| }); | ||
|
|
||
| expect(result.exitCode).toBe(1); | ||
| expect(result.hasOutput).toBe(false); |
There was a problem hiding this comment.
[/tdd] This test verifies exitCode: 1 and hasOutput: false but doesn't assert what output contains. When hasOutput is false, the implementation returns failure.message — so the harness receives the timeout error string, not an empty value. Asserting this fully specifies the behaviour and prevents a silent regression if the fallback is changed.
💡 Suggested addition
// Add after the existing expects:
expect(result.output).toContain("waiting for session.idle");
// confirms the error string is surfaced to the harness for diagnostics|
|
||
| expect(result.exitCode).toBe(1); | ||
| expect(result.hasOutput).toBe(true); | ||
| expect(result.output).toContain("working on it"); |
There was a problem hiding this comment.
[/tdd] The pending-tools and no-output failure tests don't assert that cleanup (disconnect, stop) was called, unlike the success-case test above. The finally block should run on all exit paths — adding these assertions prevents a future regression where error handling disrupts cleanup.
💡 Suggested additions (also applies to the no-output test)
// Add to both failure tests:
expect(disconnect).toHaveBeenCalledTimes(1);
expect(stop).toHaveBeenCalledTimes(1);There was a problem hiding this comment.
🔎 Code quality review by PR Code Quality Reviewer · 153.4 AIC · ⌖ 10.2 AIC · ⊞ 5.1K
| // deadline waiting for the session.idle event. This matches the message format | ||
| // "Timeout after <N>ms waiting for session.idle" produced by the Copilot SDK. | ||
| // Keep in sync with SDK_SESSION_IDLE_TIMEOUT_PATTERN in copilot_harness.cjs. | ||
| const SDK_IDLE_TIMEOUT_PATTERN = /Timeout after \d+ms waiting for session\.idle/; |
There was a problem hiding this comment.
Exported pattern is not consumed by its stated peer — copilot_harness.cjs still carries its own copy. Two independent regexes can silently drift.
💡 Details
The comment says "Keep in sync with SDK_SESSION_IDLE_TIMEOUT_PATTERN in copilot_harness.cjs", implying the export is meant to be a single source of truth. But copilot_harness.cjs defines its own copy at line 120 and never imports from this module:
// copilot_harness.cjs line 120 — still exists, still independent
const SDK_SESSION_IDLE_TIMEOUT_PATTERN = /Timeout after \d+ms waiting for session\.idle/;The export is semantically incomplete.
There is also a diagnostic side-effect: for the idle-timeout + pending-tools + output path, this PR correctly returns output = <assistant content> rather than the error string. That means isSDKSessionIdleTimeoutError(result.output) in the harness will now return false for those cases, so classifyCopilotFailure logs partial_execution instead of sdk_session_idle_timeout. Retry logic is unaffected (keyed on hasOutput), but failure traces become less specific.
Suggested fix: import from this module in copilot_harness.cjs:
const { SDK_IDLE_TIMEOUT_PATTERN } = require('./copilot_sdk_session.cjs');
// retain old name via alias if needed:
const SDK_SESSION_IDLE_TIMEOUT_PATTERN = SDK_IDLE_TIMEOUT_PATTERN;|
|
||
| expect(result.exitCode).toBe(1); | ||
| expect(result.hasOutput).toBe(true); | ||
| expect(result.output).toContain("working on it"); |
There was a problem hiding this comment.
Tests 2 and 3 don't assert that disconnect and stop are called, leaving the error-path cleanup unverified. Test 1 (success path) asserts both; the failure paths should too.
💡 Suggested additions
For the pending-tools test (after line 230):
expect(disconnect).toHaveBeenCalledTimes(1);
expect(stop).toHaveBeenCalledTimes(1);For the no-output test (after line 262), the same assertions apply, and it's also worth adding:
expect(result.output).toBe('Timeout after 870000ms waiting for session.idle');to lock in the new behaviour where output = failure.message when hasOutput is false (the fix this PR introduces — currently this invariant has no test coverage).
Without the cleanup assertions, a regression that skips disconnect/stop in the catch path would not be caught by these two tests.
|
@copilot review all comments and address unresolved review feedback.
|
|
Fix the open review threads: dedupe the timeout regex and add cleanup/output assertions to the failure tests.
|
|
@copilot review all comments and address unresolved review feedback.
|
|
@copilot review all comments and address the unresolved review feedback.
|
|
@copilot review all comments and address unresolved review feedback. Then refresh the branch, rerun checks, and summarize anything still blocking mergeability.
|
The Copilot SDK driver would classify a run as failed (exit 1) whenever
sendAndWaithit the 870ssession.idlewatchdog — even when the agent had produced output and all tool calls had already completed. The SDK simply dropped the idle signal on long single-turn runs; the work was done but the driver discarded it.Changes
copilot_sdk_session.cjssendAndWaitthrows a timeout error,hasOutput=true, andpendingToolCalls.size === 0(all tracked tool calls finished), returnexitCode: 0with the accumulated output instead of failing.hasOutputbug: thecatchblock hardcodedhasOutput: false; now uses the actual accumulated value so the harness gets correct retry signals for cases that are genuine failures.outputin the error return now uses the accumulated assistant-message content instead of discarding it in favour of the error string.pendingToolCallsto function scope (was only accessible insidetry) so thecatchblock can inspect in-flight tool state.SDK_IDLE_TIMEOUT_PATTERNconstant matching the specific format used incopilot_harness.cjs, replacing a looser inline regex.copilot_sdk_driver.test.cjs— three regression tests:exitCode: 0exitCode: 1(harness retries)exitCode: 1(nothing to surface)