fix: enable TCP keepalive on default httpx transports to prevent NAT idle-timeout drops #3270
Open
gsagrawal-binocs wants to merge 3 commits into
…timeout drops

Long-running non-streaming inference calls (Responses API, o-series and GPT-5.x reasoning models) hold a TCP connection idle for 300–600 s while the server generates. NAT gateways silently drop idle connections in this window (AWS NAT Gateway at ~350 s, GCP Cloud NAT at ~120 s, home routers at 60–300 s), causing the client to hang indefinitely: the default SDK timeout never fires because it measures time since the last received byte, and a NAT-dropped connection sends no further bytes.

Enable SO_KEEPALIVE with 60 s idle/interval probes on the default httpx transport for both sync and async clients. This matches the pattern already used by the Anthropic Python SDK. Applied via kwargs.setdefault so any caller that passes a custom transport is completely unaffected.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fa0246e7d4
Address two review findings:
- Bump httpx lower bound from 0.23.0 to 0.25.0; socket_options on HTTPTransport/AsyncHTTPTransport was added in httpx 0.25.0 and would raise TypeError on older allowed installs
- Build the keepalive transport with limits from kwargs so the SDK's DEFAULT_CONNECTION_LIMITS (1000) is preserved; caller-supplied transport is still respected via the "transport" not in kwargs guard

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
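For readers following the second finding, here is a minimal sketch of the guard it describes. It is not the PR's actual diff: `with_keepalive_transport` and `ASSUMED_MAX_CONNECTIONS` are hypothetical names, and only the 1000 figure is taken from the commit message. As far as I understand httpx, client-level `limits` are applied only when httpx builds its own transport, which is why an explicitly constructed transport needs them passed in.

```python
import socket
from typing import Any, Dict

# Stand-in for the SDK's DEFAULT_CONNECTION_LIMITS; only the 1000 figure
# comes from the commit message, the name is hypothetical.
ASSUMED_MAX_CONNECTIONS = 1000


def with_keepalive_transport(kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Attach a keepalive-enabled transport unless the caller supplied one."""
    if "transport" in kwargs:
        # A caller-supplied transport is left exactly as it was passed.
        return kwargs

    # Imported lazily so the guard above can be exercised without httpx.
    # socket_options needs httpx >= 0.25.0 according to the commit message.
    import httpx

    limits = kwargs.get("limits", httpx.Limits(max_connections=ASSUMED_MAX_CONNECTIONS))
    kwargs["transport"] = httpx.HTTPTransport(
        limits=limits,
        socket_options=[(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)],
    )
    return kwargs
```

The async client would presumably do the same with `httpx.AsyncHTTPTransport`.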
Author
@apcha-oai could you please review this PR?
Summary
Non-streaming OpenAI API calls hang indefinitely when run behind a NAT gateway. The OpenAI server successfully generates the response (visible in the dashboard), but the client never receives it because the TCP connection was silently dropped mid-generation.
Root cause: the default httpx transport has no TCP keepalive (SO_KEEPALIVE is off). During a long non-streaming call, the TCP connection sits idle while the server generates. NAT gateways silently drop idle connections: AWS NAT Gateway at ~350 s, GCP Cloud NAT at ~120 s, home routers at 60–300 s.

With o-series and GPT-5.x models under medium/high reasoning, server-side generation routinely takes 300–700 s, well past these thresholds. The client hangs indefinitely because the default SDK timeout measures time since the last received byte, and a NAT-dropped connection never sends another byte.
This affects any deployment behind NAT — EKS, ECS, Cloud Run, GKE, and even local development behind a home router.
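The timing argument can be sketched as a deliberately simplified model: a NAT entry is assumed to be dropped once the flow has been silent for the full idle timeout, and each keepalive probe is assumed to reset that timer. The function name is hypothetical and the timeout figures are the approximate ones quoted in this PR, not values measured here.

```python
from typing import Optional


def connection_survives(
    nat_idle_timeout_s: float,
    generation_s: float,
    keepalive_idle_s: Optional[float] = None,
) -> bool:
    """Return True if the NAT mapping outlives the server-side generation."""
    # Without keepalive the connection is silent for the whole generation;
    # with keepalive the longest silent stretch is one probe interval.
    if keepalive_idle_s is None:
        longest_silence_s = generation_s
    else:
        longest_silence_s = min(generation_s, keepalive_idle_s)
    return longest_silence_s < nat_idle_timeout_s


# A 600 s generation behind the timeouts quoted in this PR.
for name, timeout_s in [("AWS NAT Gateway", 350), ("GCP Cloud NAT", 120)]:
    print(name, connection_survives(timeout_s, 600), connection_survives(timeout_s, 600, 60))
```

Under this model a router at the low end of the quoted home-router range (60 s) is borderline with 60 s probes, so a shorter probe interval may be needed in that case.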
Fix
Enable TCP keepalive on the default httpx transport for both sync (_DefaultHttpxClient) and async (_DefaultAsyncHttpxClient) clients in _base_client.py.

Applied via kwargs.setdefault so any caller that passes a custom transport is completely unaffected.

This is identical to the pattern already used by the Anthropic Python SDK.
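The code block from the PR description did not survive in this copy, so the following is only a sketch of what 60 s keepalive socket options might look like, not the PR's actual change. The helper names are hypothetical. The per-platform constants are looked up defensively because their names differ (Linux exposes TCP_KEEPIDLE, macOS exposes the idle time as TCP_KEEPALIVE).

```python
import socket
from typing import List, Tuple

SocketOption = Tuple[int, int, int]


def keepalive_socket_options(idle_s: int = 60, interval_s: int = 60) -> List[SocketOption]:
    """Build (level, option, value) tuples that enable TCP keepalive."""
    options: List[SocketOption] = [(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)]

    # Seconds of idleness before the first probe; constant name varies by OS.
    idle_opt = getattr(socket, "TCP_KEEPIDLE", None)
    if idle_opt is None:
        idle_opt = getattr(socket, "TCP_KEEPALIVE", None)
    if idle_opt is not None:
        options.append((socket.IPPROTO_TCP, idle_opt, idle_s))

    # Seconds between subsequent probes, where the platform exposes it.
    interval_opt = getattr(socket, "TCP_KEEPINTVL", None)
    if interval_opt is not None:
        options.append((socket.IPPROTO_TCP, interval_opt, interval_s))

    return options


def keepalive_enabled(options: List[SocketOption]) -> bool:
    """Apply the options to a throwaway TCP socket and read SO_KEEPALIVE back."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        for level, name, value in options:
            sock.setsockopt(level, name, value)
        # Some platforms report a nonzero flag other than 1, so compare with 0.
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
```

A list like this is the shape httpx's HTTPTransport and AsyncHTTPTransport accept for their socket_options parameter.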
Tests
Added to tests/test_client.py for both TestOpenAI and TestAsyncOpenAI:
- test_default_transport_has_tcp_keepalive: asserts SO_KEEPALIVE=1 is set on the default transport
- test_custom_http_client_transport_is_not_overridden: asserts a caller-supplied http_client is not replaced
Reproducer
See the linked issue for a standalone reproducer script demonstrating the hang: #3269