
Fix gRPC thread leak on failed topic writer reconnect #845

Merged
vgvoleg merged 4 commits into main from fix/topic-writer-reconnect-thread-leak
Jun 30, 2026

Conversation

vgvoleg (Member) commented Jun 30, 2026

Problem

On connection loss during reconnect, WriterAsyncIOStream.create() raised after stream.start() without closing the stream, stranding a gRPC consumption thread on every attempt → unbounded growth → RuntimeError: can't start new thread. Happens even without any write() call.

Fix

Close the stream on failure, mirroring the already-fixed ReaderStream.create().
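
As a rough illustration of the close-on-failure pattern described above, here is a minimal sketch. `FakeStream` and `SketchWriter` are hypothetical stand-ins, not the actual ydb classes, and the real `create()` may differ in its details; the sketch only shows a stream that has been started being released when startup fails afterwards.

```python
import asyncio


class FakeStream:
    """Hypothetical stand-in for the gRPC stream wrapper (not the ydb class)."""

    def __init__(self):
        self.started = False
        self.closed = False

    async def start(self):
        # The real wrapper would begin consuming the RPC here.
        self.started = True

    def close(self):
        self.closed = True


class SketchWriter:
    """Hypothetical stand-in for WriterAsyncIOStream."""

    @classmethod
    async def create(cls, stream, fail_handshake=False):
        await stream.start()
        try:
            writer = cls()
            await writer._start(stream, fail_handshake)
        except BaseException:
            # Startup failed after stream.start(): release the stream, then re-raise.
            stream.close()
            raise
        return writer

    async def _start(self, stream, fail_handshake):
        self._stream = stream
        if fail_handshake:
            raise ConnectionError("connection lost during init handshake")


async def demo():
    failed, healthy = FakeStream(), FakeStream()
    try:
        await SketchWriter.create(failed, fail_handshake=True)
    except ConnectionError:
        pass
    await SketchWriter.create(healthy)
    # The failed stream was closed; the healthy one stays open for the writer.
    return failed.closed, healthy.closed
```

Calling `asyncio.run(demo())` exercises both the failing and the successful path. Catching `BaseException` in the sketch is a design choice so that cancellation also releases the stream.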

WriterAsyncIOStream.create() left the stream open when startup failed after stream.start(), leaking a gRPC consumer thread per reconnect attempt. Close it on failure (mirrors ReaderStream.create) and add a regression test.
@codecov-commenter

codecov-commenter commented Jun 30, 2026


Codecov Report

❌ Patch coverage is 92.85714% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 81.05%. Comparing base (e4c0347) to head (02228a7).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ydb/_topic_writer/topic_writer_asyncio.py | 92.85% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files


@@           Coverage Diff           @@
##             main     #845   +/-   ##
=======================================
  Coverage   81.05%   81.05%           
=======================================
  Files          94       94           
  Lines       12101    12109    +8     
  Branches     1182     1184    +2     
=======================================
+ Hits         9808     9815    +7     
+ Misses       1836     1835    -1     
- Partials      457      459    +2     
| Flag | Coverage Δ |
| --- | --- |
| integration | 79.07% <78.57%> (-0.02%) ⬇️ |
| unit | 47.12% <78.57%> (+0.41%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
| --- | --- |
| ydb/_topic_writer/topic_writer_asyncio.py | 88.90% <92.85%> (-0.03%) ⬇️ |

... and 1 file with indirect coverage changes


Copilot AI left a comment


Pull request overview

Fixes a gRPC consumer-thread leak in the topic writer reconnection path by ensuring WriterAsyncIOStream.create() closes the underlying GrpcWrapperAsyncIO stream when initialization fails after stream.start(). Adds a regression test that reproduces the leak using a real in-process gRPC streaming RPC (so the gRPC consumption thread is actually spawned).

Changes:

  • Wrap WriterAsyncIOStream.create() initialization in try/except and close the stream on failure (mirrors ReaderStream.create() behavior).
  • Add an in-process gRPC server–based regression test that repeatedly triggers create failures and asserts no consumer threads remain stranded.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| ydb/_topic_writer/topic_writer_asyncio.py | Ensures streams started during writer creation are closed on init failure to prevent stranded gRPC consumption threads. |
| ydb/_topic_writer/topic_writer_asyncio_test.py | Adds a regression test using an in-process gRPC StreamWrite endpoint to detect the thread leak. |


Comment thread ydb/_topic_writer/topic_writer_asyncio_test.py Outdated
Comment thread ydb/_topic_writer/topic_writer_asyncio_test.py Outdated

@robot-vibe-db robot-vibe-db Bot left a comment


AI Review Summary

Verdict: ✅ No critical issues found

Critical issues

No critical issues found.

Other findings

  • Minor | Medium: Cleanup logic in create() diverges from the ReaderStream.create() pattern it claims to mirror; close() is unsafe for partially-initialized writers — ydb/_topic_writer/topic_writer_asyncio.py:864
  • Minor | Medium: WriterAsyncIOStream.close() unconditionally accesses self._stream without a guard, unlike ReaderStream.close() — ydb/_topic_writer/topic_writer_asyncio.py:836
  • Nit | Medium: Thread-leak detection in the test relies on stack-frame inspection of specific filenames and method names — ydb/_topic_writer/topic_writer_asyncio_test.py:1053

This review was generated automatically. Critical issues require attention; other findings are advisory.
If this comment was useful, please give it a 👍 — it helps us improve the review bot.

Comment thread ydb/_topic_writer/topic_writer_asyncio.py Outdated
Comment thread ydb/_topic_writer/topic_writer_asyncio_test.py Outdated
@robot-vibe-db

robot-vibe-db Bot commented Jun 30, 2026


Full analysis log

Analysis performed by claude, claude-opus-4-6.

Set _stream at the start of _start() and guard close() (mirrors ReaderStream); match the exact __next__ code object for leak detection and wait for the test gRPC server to stop.

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment thread ydb/_topic_writer/topic_writer_asyncio_test.py
Comment thread ydb/_topic_writer/topic_writer_asyncio_test.py

@robot-vibe-db robot-vibe-db Bot left a comment


AI Review Summary

Verdict: ✅ No critical issues found

Critical issues

No critical issues found.

Other findings

  • Nit | Medium: Stale comment in existing test test_init_timeout_behavior — ydb/_topic_writer/topic_writer_asyncio_test.py:288

    Line 288 says "Don't close writer since _start failed and stream was never set". After this PR, _start() assigns self._stream = stream as its first action (line 893), so _stream is now set even when _start() fails mid-handshake. The comment should be updated, and ideally the test should call await writer.close() for consistency with the new cleanup pattern. (The mock stream is still cleaned up by the stream fixture teardown, so this is cosmetic.)

Notes on the review:

The fix is well-targeted and correctly mirrors the already-proven ReaderStream.create() pattern (line 586 in topic_reader_asyncio.py). The three changes form a coherent unit:

  1. create() try/except — ensures the stream is closed on any failure after stream.start(), whether the failure is in the WriterAsyncIOStream constructor or in _start().
  2. _stream assignment moved earlier in _start() — guarantees close() can reach the stream even if the init handshake fails (timeout, status error, connection loss).
  3. getattr guard in close() — safely handles the case where close() is called before _start() runs (since _stream is not set in __init__).
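
Points 2 and 3 above can be sketched as follows. This is an illustrative sketch under stated assumptions: `GuardedWriter` and `FakeStream` are hypothetical names, and the real `close()` does more than what is shown here.

```python
import asyncio


class FakeStream:
    """Hypothetical stream that only records whether it was closed."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class GuardedWriter:
    """Hypothetical writer showing early _stream assignment and a guarded close()."""

    def __init__(self):
        self._closed = False  # _stream is deliberately not set here

    async def _start(self, stream, fail_handshake=False):
        # Assign first, so close() can reach the stream even if the handshake fails.
        self._stream = stream
        if fail_handshake:
            raise TimeoutError("init handshake timed out")

    async def close(self):
        if self._closed:
            return
        self._closed = True
        # getattr guard: close() may run before _start() ever assigned _stream.
        stream = getattr(self, "_stream", None)
        if stream is not None:
            stream.close()


async def demo():
    never_started = GuardedWriter()
    await never_started.close()  # must not raise AttributeError

    stream = FakeStream()
    half_started = GuardedWriter()
    try:
        await half_started._start(stream, fail_handshake=True)
    except TimeoutError:
        await half_started.close()
    return stream.closed
```

Without the early assignment, the failed handshake would leave `_stream` unset and `close()` would have nothing to release.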

The regression test is thorough: it uses a real in-process gRPC server (not mocks) to reproduce the actual thread leak, and detects stranded threads by matching the exact AsyncQueueToSyncIteratorAsyncIO.__next__ code object in sys._current_frames().
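
The detection idea can be illustrated with a self-contained sketch. It assumes a hypothetical `BlockingIterator` in place of AsyncQueueToSyncIteratorAsyncIO and a plain thread in place of the gRPC consumption thread; the helper names are made up for illustration and are not taken from the test in this PR.

```python
import sys
import threading
import time


class BlockingIterator:
    """Hypothetical stand-in for a sync iterator that a consumer thread blocks in."""

    def __init__(self):
        self._release = threading.Event()

    def __iter__(self):
        return self

    def __next__(self):
        # Blocks until released, like a consumer waiting for the next request.
        self._release.wait()
        raise StopIteration

    def release(self):
        self._release.set()


def count_threads_inside(code):
    """Count threads that have a frame executing exactly this code object."""
    count = 0
    for frame in list(sys._current_frames().values()):
        while frame is not None:
            if frame.f_code is code:
                count += 1
                break
            frame = frame.f_back
    return count


def demo():
    it = BlockingIterator()
    code = BlockingIterator.__next__.__code__
    baseline = count_threads_inside(code)

    worker = threading.Thread(target=lambda: list(it), daemon=True)
    worker.start()

    # Poll until the worker is parked inside __next__ (bounded wait).
    deadline = time.monotonic() + 5
    while count_threads_inside(code) == baseline and time.monotonic() < deadline:
        time.sleep(0.01)
    stranded = count_threads_inside(code) - baseline

    it.release()
    worker.join(timeout=5)
    remaining = count_threads_inside(code) - baseline
    return stranded, remaining
```

Comparing against a baseline rather than zero keeps the check robust when unrelated threads from earlier tests are still parked in the same code. Matching on the code object identity avoids depending on filenames or method-name strings.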


This review was generated automatically. Critical issues require attention; other findings are advisory.
If this comment was useful, please give it a 👍 — it helps us improve the review bot.

@robot-vibe-db

robot-vibe-db Bot commented Jun 30, 2026


Full analysis log

Analysis performed by claude, claude-opus-4-6.

@github-actions


🌋 SLO Test Results

🟢 2 workload(s) tested — All thresholds passed

Commit: 2346362 · View run

| Workload | Thresholds | Duration | Report |
| --- | --- | --- | --- |
| sync-table | 🟢 OK | 10m 3s | 📄 Report |
| sync-query | 🟢 OK | 10m 11s | 📄 Report |

Generated by ydb-slo-action

Restrict the test gRPC handler to StreamWrite and assert against a baseline thread count.
@vgvoleg vgvoleg merged commit 47e1578 into main Jun 30, 2026
39 checks passed
@vgvoleg vgvoleg deleted the fix/topic-writer-reconnect-thread-leak branch June 30, 2026 10:43

3 participants