Skip to content

fix(sdk-driver): resolve idle-timeout as success when output collected and all tools complete#40419

Merged
pelikhan merged 3 commits into
mainfrom
copilot/aw-failures-fix-sdk-idle-timeout
Jun 20, 2026
Merged

fix(sdk-driver): resolve idle-timeout as success when output collected and all tools complete#40419
pelikhan merged 3 commits into
mainfrom
copilot/aw-failures-fix-sdk-idle-timeout

Conversation

Copilot AI commented Jun 20, 2026

Copy link
Copy Markdown
Contributor

The Copilot SDK driver would classify a run as failed (exit 1) whenever sendAndWait hit the 870s session.idle watchdog — even when the agent had produced output and all tool calls had already completed. The SDK simply dropped the idle signal on long single-turn runs; the work was done but the driver discarded it.

Changes

copilot_sdk_session.cjs

  • Idle-timeout → success: when sendAndWait throws a timeout error, hasOutput=true, and pendingToolCalls.size === 0 (all tracked tool calls finished), return exitCode: 0 with the accumulated output instead of failing.
  • Fix hasOutput bug: the catch block hardcoded hasOutput: false; now uses the actual accumulated value so the harness gets correct retry signals for cases that are genuine failures.
  • Preserve output on error: output in the error return now uses the accumulated assistant-message content instead of discarding it in favour of the error string.
  • Moved pendingToolCalls to function scope (was only accessible inside try) so the catch block can inspect in-flight tool state.
  • Extracted SDK_IDLE_TIMEOUT_PATTERN constant matching the specific format used in copilot_harness.cjs, replacing a looser inline regex.
// Before — always discarded output on any sendAndWait error
} catch (err) {
  return { exitCode: 1, output: failure.message, hasOutput: false, durationMs };
}

// After — idle-timeout with completed work resolves as success
const isIdleTimeout = !catastrophicToolDenialsError && SDK_IDLE_TIMEOUT_PATTERN.test(failure.message);
if (isIdleTimeout && hasOutput && pendingToolCalls.size === 0) {
  log(`warning: SDK idle-timeout with collected output and no pending tool calls — treating as completed`);
  return { exitCode: 0, output, hasOutput, durationMs };
}
return { exitCode: 1, output: hasOutput ? output : failure.message, hasOutput, durationMs };

copilot_sdk_driver.test.cjs — three regression tests:

  • idle-timeout + output + all tool calls complete → exitCode: 0
  • idle-timeout + pending tool calls → exitCode: 1 (harness retries)
  • idle-timeout + no output → exitCode: 1 (nothing to surface)

Copilot AI and others added 2 commits June 20, 2026 03:15
…and no pending tools

When session.sendAndWait times out waiting for session.idle but the agent
produced output and all tracked tool calls have already completed, the SDK
simply failed to emit the idle signal — the agent's work is done. Resolve
with exitCode 0 instead of 1 so the harness does not classify the run as a
failure or waste retry budget.

Also fix the catch block to:
- Use the accumulated hasOutput value (was hardcoded false)
- Preserve the accumulated output in the error return (was discarded)

Adds three regression tests covering:
- idle-timeout with output + all tool calls complete → exitCode 0 (success)
- idle-timeout with pending tool calls → exitCode 1 (genuine failure, retry)
- idle-timeout with no output collected → exitCode 1 (no work done)

Closes #40418

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Extract the inline idle-timeout regex into a named module-level constant
SDK_IDLE_TIMEOUT_PATTERN, matching the more specific format already used in
copilot_harness.cjs (SDK_SESSION_IDLE_TIMEOUT_PATTERN). This makes the
intent clear and is easy to update if the SDK error message format changes.

Also export the constant for testability.

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix Copilot SDK session.idle timeout issue in PR Code Quality Reviewer fix(sdk-driver): resolve idle-timeout as success when output collected and all tools complete Jun 20, 2026
Copilot AI requested a review from pelikhan June 20, 2026 03:17
@pelikhan pelikhan marked this pull request as ready for review June 20, 2026 03:22
Copilot AI review requested due to automatic review settings June 20, 2026 03:22
@github-actions

github-actions Bot commented Jun 20, 2026

Copy link
Copy Markdown
Contributor

Design Decision Gate 🏗️ completed the design decision gate check.

No ADR enforcement needed: PR #40419 does not have the 'implementation' label and has 0 new lines of code in business logic directories (≤100 threshold). Only 2 files changed, neither in tracked business-logic paths.

@github-actions

github-actions Bot commented Jun 20, 2026

Copy link
Copy Markdown
Contributor

🧠 Matt Pocock Skills Reviewer has completed the skills-based review. ✅

@github-actions

github-actions Bot commented Jun 20, 2026

Copy link
Copy Markdown
Contributor

PR Code Quality Reviewer completed the code quality review.

@github-actions

github-actions Bot commented Jun 20, 2026

Copy link
Copy Markdown
Contributor

Test Quality Sentinel completed test quality analysis.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adjusts the Copilot SDK session driver so that a sendAndWait idle-timeout (waiting for session.idle) is treated as a successful completion when the agent has already produced output and there are no tracked in-flight tool calls, preventing long single-turn runs from being incorrectly marked as failures.

Changes:

  • Add SDK_IDLE_TIMEOUT_PATTERN and use it in runWithCopilotSDK to detect the specific SDK session.idle timeout message.
  • On idle-timeout errors, return exitCode: 0 when hasOutput === true and pendingToolCalls.size === 0; otherwise preserve accumulated output and return exitCode: 1 with correct hasOutput.
  • Add regression tests covering the success case and two failure cases (pending tool calls, no output).
Show a summary per file
File Description
actions/setup/js/copilot_sdk_session.cjs Tracks pending tool calls across try/catch, preserves accumulated output on error, and treats SDK idle-timeout as success when work is complete.
actions/setup/js/copilot_sdk_driver.test.cjs Adds regression coverage for idle-timeout behavior across success/failure scenarios.

Copilot's findings

Tip

Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

  • Files reviewed: 2/2 changed files
  • Comments generated: 0

@github-actions github-actions Bot mentioned this pull request Jun 20, 2026
@github-actions

Copy link
Copy Markdown
Contributor

🧪 Test Quality Sentinel Report

Test Quality Score: 90/100 — Excellent

Analyzed 3 test(s) across 1 JavaScript file: 3 design tests, 0 implementation tests, 0 guideline violations.

📊 Metrics & Test Classification (3 tests analyzed)
Metric Value
New/modified tests analyzed 3
✅ Design tests (behavioral contracts) 3 (100%)
⚠️ Implementation tests (low value) 0 (0%)
Tests with error/edge cases 3 (100%)
Duplicate test clusters 0
Test inflation detected ⚠️ Yes — copilot_sdk_driver.test.cjs +143 lines vs copilot_sdk_session.cjs +30 lines (ratio ≈ 4.8:1)
🚨 Coding-guideline violations 0
Test File Classification Issues Detected
resolves exitCode 0 on SDK idle-timeout when output collected and all tool calls complete copilot_sdk_driver.test.cjs ✅ Design
returns exitCode 1 on SDK idle-timeout when tool calls are still pending copilot_sdk_driver.test.cjs ✅ Design
returns exitCode 1 on SDK idle-timeout with no output collected copilot_sdk_driver.test.cjs ✅ Design

Go: 0 (*_test.go); JavaScript: 3 (*.test.cjs). No other languages detected.

i️ Note on Test Inflation (non-blocking)

The test file added 143 lines versus 30 lines in the production file (≈ 4.8:1 ratio, threshold 2:1). Per the rubric this deducts 10 points from the Proportional Growth component. In context this is expected and justified: the production fix is compact (pattern match + conditional in the catch block) but requires three distinct test scenarios to cover the state machine — success (all tools complete + output), failure (pending tools), and failure (no output). No action is required, but it is noted per the scoring rules.

Verdict

Check passed. 0% implementation tests (threshold: 30%). All three tests verify observable behavioral contracts — the exitCode and hasOutput values returned by runWithCopilotSDK under three distinct idle-timeout scenarios — and each covers an error/edge case. Mocking targets (CopilotClient, SDK session) are external I/O dependencies, which is the accepted pattern for this codebase.

🧪 Test quality analysis by Test Quality Sentinel · 54.4 AIC · ⌖ 8.79 AIC · ⊞ 8.3K ·

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Test Quality Sentinel: 90/100. Test quality is excellent — 0% of new tests are implementation tests (threshold: 30%). All 3 tests verify observable behavioral contracts (exitCode and hasOutput under distinct idle-timeout scenarios). Mocking targets are external SDK I/O dependencies.

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skills-Based Review 🧠

Applied /diagnose and /tdd — approving with minor suggestions on test completeness and a cross-file DRY opportunity.

📋 Key Themes & Highlights

Key Themes

  • Cross-file regex duplication: SDK_IDLE_TIMEOUT_PATTERN (new) and SDK_SESSION_IDLE_TIMEOUT_PATTERN (existing in copilot_harness.cjs) are identical. The constant is now exported — the harness could import it to eliminate the "keep in sync" burden.
  • Test specification gaps: The three regression tests are well-structured, but the two failure paths don't assert cleanup (disconnect/stop) and the no-output test leaves output content unspecified.
  • Untested guard: The !catastrophicToolDenialsError condition that prevents tool-denial failures from being promoted to success has no dedicated test.

Positive Highlights

  • Root cause addressed cleanly: hoisting pendingToolCalls to function scope is the minimal structural change needed — nothing was added unnecessarily.
  • Three-condition guard (isIdleTimeout && hasOutput && pendingToolCalls.size === 0) is precise and well-reasoned; each condition maps directly to a test case.
  • hasOutput and output fix on the error path are bonuses that correct silent data loss and wrong retry signals.
  • Mock event sequences (start/complete pairing) in the tests exercise real state transitions, not just surface throw/catch behaviour.
  • ✅ The PR description is excellent — the before/after code block makes the change immediately graspable.

🧠 Reviewed using Matt Pocock's skills by Matt Pocock Skills Reviewer · 73.3 AIC · ⌖ 10.6 AIC · ⊞ 6.9K

// deadline waiting for the session.idle event. This matches the message format
// "Timeout after <N>ms waiting for session.idle" produced by the Copilot SDK.
// Keep in sync with SDK_SESSION_IDLE_TIMEOUT_PATTERN in copilot_harness.cjs.
const SDK_IDLE_TIMEOUT_PATTERN = /Timeout after \d+ms waiting for session\.idle/;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/diagnose] SDK_IDLE_TIMEOUT_PATTERN is identical to SDK_SESSION_IDLE_TIMEOUT_PATTERN already defined in copilot_harness.cjs (line 120). The "Keep in sync" comment flags a live drift risk — the two patterns can diverge silently.

💡 Suggestion: import from here instead of duplicating

Since SDK_IDLE_TIMEOUT_PATTERN is now exported, copilot_harness.cjs could import it directly:

// copilot_harness.cjs
const { SDK_IDLE_TIMEOUT_PATTERN } = require("./copilot_sdk_session.cjs");
// ... replace SDK_SESSION_IDLE_TIMEOUT_PATTERN usages with SDK_IDLE_TIMEOUT_PATTERN

This eliminates the "keep in sync" maintenance burden and makes the single source of truth explicit.

// output and all tracked tool calls have already completed, the session work is
// done — the SDK simply failed to emit the idle signal. Treat it as a successful
// run so the harness does not classify it as a failure or waste retry attempts.
const isIdleTimeout = !catastrophicToolDenialsError && SDK_IDLE_TIMEOUT_PATTERN.test(failure.message);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/tdd] The !catastrophicToolDenialsError guard is a critical safety rail — it prevents a tool-denial failure whose error message accidentally matches the idle-timeout pattern from being promoted to success. But this branch is not covered by a regression test.

💡 Suggested test case
it("returns exitCode 1 when catastrophic tool denials fires even if timeout message matches", async () => {
  // Simulate maxToolDenials being hit; the driver should NOT treat this as
  // an idle-timeout success even if the error text happens to match.
  const session = {
    sessionId: "session-denial-not-idle",
    on: handler => { onEvent = handler; },
    sendAndWait: vi.fn().mockImplementation(async () => {
      // fire enough tool-denial events to breach the threshold, then stall
      throw new Error("Timeout after 870000ms waiting for session.idle");
    }),
    disconnect: vi.fn().mockResolvedValue(undefined),
  };
  // ... configure maxToolDenials:1 and fire denial events ...
  expect(result.exitCode).toBe(1);
});

Without this test, a future refactor that accidentally drops the !catastrophicToolDenialsError condition would go undetected.

});

expect(result.exitCode).toBe(1);
expect(result.hasOutput).toBe(false);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/tdd] This test verifies exitCode: 1 and hasOutput: false but doesn't assert what output contains. When hasOutput is false, the implementation returns failure.message — so the harness receives the timeout error string, not an empty value. Asserting this fully specifies the behaviour and prevents a silent regression if the fallback is changed.

💡 Suggested addition
// Add after the existing expects:
expect(result.output).toContain("waiting for session.idle");
// confirms the error string is surfaced to the harness for diagnostics


expect(result.exitCode).toBe(1);
expect(result.hasOutput).toBe(true);
expect(result.output).toContain("working on it");

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/tdd] The pending-tools and no-output failure tests don't assert that cleanup (disconnect, stop) was called, unlike the success-case test above. The finally block should run on all exit paths — adding these assertions prevents a future regression where error handling disrupts cleanup.

💡 Suggested additions (also applies to the no-output test)
// Add to both failure tests:
expect(disconnect).toHaveBeenCalledTimes(1);
expect(stop).toHaveBeenCalledTimes(1);

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔎 Code quality review by PR Code Quality Reviewer · 153.4 AIC · ⌖ 10.2 AIC · ⊞ 5.1K

// deadline waiting for the session.idle event. This matches the message format
// "Timeout after <N>ms waiting for session.idle" produced by the Copilot SDK.
// Keep in sync with SDK_SESSION_IDLE_TIMEOUT_PATTERN in copilot_harness.cjs.
const SDK_IDLE_TIMEOUT_PATTERN = /Timeout after \d+ms waiting for session\.idle/;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exported pattern is not consumed by its stated peer — copilot_harness.cjs still carries its own copy. Two independent regexes can silently drift.

💡 Details

The comment says "Keep in sync with SDK_SESSION_IDLE_TIMEOUT_PATTERN in copilot_harness.cjs", implying the export is meant to be a single source of truth. But copilot_harness.cjs defines its own copy at line 120 and never imports from this module:

// copilot_harness.cjs line 120 — still exists, still independent
const SDK_SESSION_IDLE_TIMEOUT_PATTERN = /Timeout after \d+ms waiting for session\.idle/;

The export is semantically incomplete.

There is also a diagnostic side-effect: for the idle-timeout + pending-tools + output path, this PR correctly returns output = <assistant content> rather than the error string. That means isSDKSessionIdleTimeoutError(result.output) in the harness will now return false for those cases, so classifyCopilotFailure logs partial_execution instead of sdk_session_idle_timeout. Retry logic is unaffected (keyed on hasOutput), but failure traces become less specific.

Suggested fix: import from this module in copilot_harness.cjs:

const { SDK_IDLE_TIMEOUT_PATTERN } = require('./copilot_sdk_session.cjs');
// retain old name via alias if needed:
const SDK_SESSION_IDLE_TIMEOUT_PATTERN = SDK_IDLE_TIMEOUT_PATTERN;


expect(result.exitCode).toBe(1);
expect(result.hasOutput).toBe(true);
expect(result.output).toContain("working on it");

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests 2 and 3 don't assert that disconnect and stop are called, leaving the error-path cleanup unverified. Test 1 (success path) asserts both; the failure paths should too.

💡 Suggested additions

For the pending-tools test (after line 230):

expect(disconnect).toHaveBeenCalledTimes(1);
expect(stop).toHaveBeenCalledTimes(1);

For the no-output test (after line 262), the same assertions apply, and it's also worth adding:

expect(result.output).toBe('Timeout after 870000ms waiting for session.idle');

to lock in the new behaviour where output = failure.message when hasOutput is false (the fix this PR introduces — currently this invariant has no test coverage).

Without the cleanup assertions, a regression that skips disconnect/stop in the catch path would not be caught by these two tests.

@github-actions

Copy link
Copy Markdown
Contributor

@copilot review all comments and address unresolved review feedback.

Generated by 👨‍🍳 PR Sous Chef · 87.7 AIC · ⌖ 1.04 AIC · ⊞ 17.2K ·

@github-actions

Copy link
Copy Markdown
Contributor

Fix the open review threads: dedupe the timeout regex and add cleanup/output assertions to the failure tests.

Generated by 👨‍🍳 PR Sous Chef · 87.7 AIC · ⌖ 1.04 AIC · ⊞ 17.2K ·

@github-actions

Copy link
Copy Markdown
Contributor

@copilot review all comments and address unresolved review feedback.
Please address the open review threads, especially the regex duplication and failure-path cleanup/output assertions, then re-request review.

Generated by 👨‍🍳 PR Sous Chef · 51.6 AIC · ⌖ 1.07 AIC · ⊞ 17.2K ·

@github-actions

Copy link
Copy Markdown
Contributor

@copilot review all comments and address the unresolved review feedback.
pr-sous-chef run: https://github.com/github/gh-aw/actions/runs/27868244120

Generated by 👨‍🍳 PR Sous Chef · 57.2 AIC · ⌖ 1.06 AIC · ⊞ 17.2K ·

@github-actions

Copy link
Copy Markdown
Contributor

@copilot review all comments and address unresolved review feedback. Then refresh the branch, rerun checks, and summarize anything still blocking mergeability.

Generated by 👨‍🍳 PR Sous Chef · 72.3 AIC · ⌖ 1.04 AIC · ⊞ 17.2K ·

@pelikhan pelikhan merged commit 7bb0ca2 into main Jun 20, 2026
54 of 62 checks passed
@pelikhan pelikhan deleted the copilot/aw-failures-fix-sdk-idle-timeout branch June 20, 2026 14:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants