fix(session): cap retry schedule at RETRY_MAX_ATTEMPTS = 3 #26369
truenorth-lj wants to merge 2 commits into
The retry schedule in session/retry.ts policy() previously had no numeric cap on the number of attempts; it continued as long as retryable() returned a truthy value. A misclassified-retryable error (or an upstream that's permanently unhealthy but keeps emitting retryable signals) could cause the schedule to spin without bound.
Adds RETRY_MAX_ATTEMPTS = 3 alongside the other RETRY_* constants and an explicit cap check in the policy step function. With the existing 2-4-8 s backoff this gives a wall-clock retry budget of ~14 s.
Test: drives the schedule RETRY_MAX_ATTEMPTS + 1 times and asserts that the set() callback was invoked exactly RETRY_MAX_ATTEMPTS times.
bun test packages/opencode/test/session/retry.test.ts → 31 pass / 0 fail
bun run typecheck → clean
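A minimal sketch of the cap described in this commit message, not the actual code in retry.ts. Only RETRY_MAX_ATTEMPTS = 3, retryable(), and the set() callback come from the description; sketchPolicy, PolicyInput, and the shape of the step function are hypothetical stand-ins chosen for illustration.

```typescript
// RETRY_MAX_ATTEMPTS = 3 is from the PR; everything else here is an assumed shape.
const RETRY_MAX_ATTEMPTS = 3

// Hypothetical policy input: `retryable` classifies an error (truthy = retry),
// `set` is invoked once for every retry that gets scheduled.
type PolicyInput = {
  retryable: (error: unknown) => string | undefined
  set: (info: { attempt: number; message: string }) => void
}

function sketchPolicy(input: PolicyInput) {
  let attempt = 0
  // Returns true when another retry should be scheduled, false to stop.
  return function step(error: unknown): boolean {
    // The cap check runs first, so it ends the schedule even when the
    // classifier keeps reporting the error as retryable.
    if (attempt >= RETRY_MAX_ATTEMPTS) return false
    const message = input.retryable(error)
    if (!message) return false
    attempt += 1
    input.set({ attempt, message })
    return true
  }
}

// Drive the schedule one step past the cap with an always-retryable error,
// mirroring the test described above.
const calls: number[] = []
const step = sketchPolicy({
  retryable: () => "upstream overloaded",
  set: (info) => calls.push(info.attempt),
})
const results: boolean[] = []
for (let i = 0; i < RETRY_MAX_ATTEMPTS + 1; i++) {
  results.push(step(new Error("boom")))
}
```

Under these assumptions, set() is recorded for attempts 1 through 3 and the fourth step returns false without consulting the classifier.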
Reconciles RETRY_MAX_ATTEMPTS=3 cap on top of anomalyco#26366's retryable(error, provider) and policy({provider,...}) signature changes. The cap check runs before the retryable() classification, so it still terminates the schedule independently of provider-specific reasoning. Test pattern matches the upstream change: maxAttempts test now also passes provider: 'test' to policy().
… max retries at 3
- Add RETRY_MAX_ATTEMPTS = 3 to prevent infinite retry loops
- Add NETWORK_ERROR_PATTERNS for ECONNRESET, ECONNREFUSED, ETIMEDOUT, fetch failed, socket hang up, network error, connection reset/refused/timeout
- Add nested error envelope inspection (server_error, upstream_error, stream_read_error, service_unavailable_error)
- Fix OpenRouter numeric code bug (typeof json.code === 'number')
- Add comprehensive test coverage for all new retry patterns
Closes anomalyco#20822, anomalyco#21716, anomalyco#21893, anomalyco#23287
Related anomalyco#19394, anomalyco#20466, anomalyco#22448, anomalyco#26369
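For illustration, the network-pattern and nested-envelope checks named in that commit message could look roughly like the sketch below. The pattern strings and envelope type names are taken from the commit message; the function names, the case-insensitive substring matching, and the assumption that envelopes nest under an `error` key are guesses, not the referenced commit's implementation.

```typescript
// Patterns listed in the commit message, lowercased for matching (an assumption).
const NETWORK_ERROR_PATTERNS = [
  "econnreset",
  "econnrefused",
  "etimedout",
  "fetch failed",
  "socket hang up",
  "network error",
  "connection reset",
  "connection refused",
  "connection timeout",
]

// Envelope types listed in the commit message.
const RETRYABLE_ENVELOPE_TYPES = new Set([
  "server_error",
  "upstream_error",
  "stream_read_error",
  "service_unavailable_error",
])

// Hypothetical helper: case-insensitive substring match against the patterns.
function isNetworkError(message: string): boolean {
  const lower = message.toLowerCase()
  return NETWORK_ERROR_PATTERNS.some((pattern) => lower.includes(pattern))
}

// Hypothetical helper: parse the body and walk down nested `error` objects,
// checking the `type` field at each level.
function hasRetryableEnvelope(body: string): boolean {
  let current: unknown
  try {
    current = JSON.parse(body)
  } catch {
    return false // not JSON, so there is no envelope to inspect
  }
  while (typeof current === "object" && current !== null) {
    const record = current as Record<string, unknown>
    if (typeof record.type === "string" && RETRYABLE_ENVELOPE_TYPES.has(record.type)) {
      return true
    }
    current = record.error
  }
  return false
}
```

With these helpers, a message such as "read ECONNRESET" or a body such as {"type":"error","error":{"type":"server_error"}} would be treated as retryable, while an invalid_request_error envelope would not.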
Hi, how about making the retry count configurable instead of hardcoding 3? I sometimes find it quite handy to be able to retry several times.
Hey, thanks for capping the max attempts! However, this doesn't fully fix the issue for users hitting the 429 hard monthly quota limit from the opencode proxy. The proxy returns a 429 with an 18-day
Thanks for capping this; it fixes the runaway transient case. One gap that echoes @niStee's comment above: for the hard
Suggestion: keep the attempt cap for transient 5xx / generic 429s, but treat the
I've pushed the local commits that resolve the silent slumber and network hang gaps we discussed:
All fixes passed secret scanning locally and are actively preventing the gpt-5.5 network hangs in my live session right now. Ready for review!
Follow-up: I've also pushed a commit to mirror the 180s TTFB timeout patch into the v1
Automated PR Cleanup
Thank you for contributing to opencode. Due to the high volume of PRs from users and AI agents, we periodically close older PRs using automated criteria so maintainers can focus review time on the most active and community-supported contributions. This PR was closed because it matched the following cleanup criteria:
PRs created within the last month are not affected by this cleanup. If you believe this PR was closed incorrectly, or if you are still actively working on it, please leave a comment explaining why it should be reopened. A maintainer can review and reopen it if appropriate. Thanks again for taking the time to contribute.
Issue for this PR
Closes #21960
Prior reports (this addresses what the community has already raised, with credit to the original reporters):
Type of change
What does this PR do?
Adds a hard upper bound to the retry schedule in packages/opencode/src/session/retry.ts policy().
Before: the schedule had no numeric cap on attempts. It continued as long as retryable() returned a truthy value, which meant a misclassified-retryable error (or an upstream that's permanently unhealthy but keeps emitting retryable signals) could spin without bound.
After:
With the existing 2-4-8 s backoff this gives a wall-clock retry budget of ~14 s for a worst-case run.
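The ~14 s figure follows from summing the delays. A small sketch of that arithmetic, assuming the "2-4-8 s" backoff means a 2000 ms initial delay that doubles each attempt with no per-delay ceiling (retryBudgetMs and its parameters are hypothetical names, not constants from retry.ts):

```typescript
// Sum of the backoff delays for a capped schedule.
// Defaults are assumptions derived from the "2-4-8 s" description.
function retryBudgetMs(maxAttempts: number, initialDelayMs = 2000, factor = 2): number {
  let total = 0
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    // Delay before retry number attempt + 1: 2000, 4000, 8000, ...
    total += initialDelayMs * Math.pow(factor, attempt)
  }
  return total
}

const budgetForCapOfThree = retryBudgetMs(3) // 2000 + 4000 + 8000 = 14000 ms
```

This counts only time spent waiting between attempts; the duration of each request itself comes on top of it.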
Context: This PR is the standalone bug-fix half of the closed #26343. That PR additionally proposed a header-driven retry classification using custom X-Llm-Error-* headers, which @rekram1-node correctly closed as non-standard. This PR drops that part entirely; only the standalone schedule cap remains. Any proxy can express retry semantics through the standard signals already handled by retryable():
- 429 Too Many Requests + Retry-After
- 503 Service Unavailable + Retry-After
- 402 Payment Required (4xx → no retry)
- 400 Bad Request (4xx → no retry)
How did you verify your code works?
The new test (session.retry.policy.maxAttempts) drives the schedule RETRY_MAX_ATTEMPTS + 1 times and asserts that the set() callback was invoked exactly RETRY_MAX_ATTEMPTS times.
Screenshots / recordings
N/A — no UI changes.
Checklist