Refactor topic reader/writer onto a shared stream stack by vgvoleg · Pull Request #842 · ydb-platform/ydb-python-sdk

vgvoleg · 2026-06-26T10:15:30Z

Extracts the reconnect + bidi-stream lifecycle shared by the topic reader and writer into two base classes in ydb/_topic_common — StreamReconnector and StreamConnection. The reader/writer reconnectors and stream objects become thin subclasses.

Why

one reconnect loop instead of two near-duplicate copies
single live stream, no zombie — structural: the connection is built synchronously and owns its gRPC stream before the first network await, so a cancel mid-handshake always closes it (this now covers the writer too, which had no such guard)
_run(conn) = conn.wait_error() symmetric for reader and writer

Fixes

writer reconnect backoff: attempt was reset every loop iteration → retry storm at the backoff floor; now grows correctly
flush-on-close deadlock
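
The backoff fix can be illustrated with a minimal sketch. The `backoff_delay` schedule and both helper functions are hypothetical and are not taken from the SDK; they only show why resetting `attempt` inside the loop pins every retry to the backoff floor.

```python
from typing import List


def backoff_delay(attempt: int, base: float = 0.05, cap: float = 5.0) -> float:
    """Exponential backoff without jitter (hypothetical schedule, for illustration)."""
    return min(cap, base * (2 ** attempt))


def delays_with_reset(failures: int) -> List[float]:
    """Old behaviour as described: the counter is reset on every loop iteration."""
    delays = []
    for _ in range(failures):
        attempt = 0  # bug: reset inside the loop, so the delay never grows
        delays.append(backoff_delay(attempt))
        attempt += 1
    return delays


def delays_without_reset(failures: int) -> List[float]:
    """Fixed behaviour: the counter is declared once, before the loop."""
    delays = []
    attempt = 0
    for _ in range(failures):
        delays.append(backoff_delay(attempt))
        attempt += 1
    return delays
```

With four consecutive failures the first variant sleeps for the floor value each time, while the second doubles the delay on every attempt.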

Notes

public API unchanged
unit + topic integration green; mypy on the topic code is clean
architecture write-up: ydb/_topic_common/STREAM_DESIGN.md

Draft — preserving the work, not for merge yet.

codecov-commenter · 2026-06-26T10:27:29Z

Codecov Report

❌ Patch coverage is 91.88192% with 22 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.03%. Comparing base (f2c742d) to head (5bac08f).

Files with missing lines	Patch %	Lines
ydb/_topic_reader/topic_reader_asyncio.py	78.94%	4 Missing and 8 partials ⚠️
ydb/_topic_writer/topic_writer_asyncio.py	89.61%	6 Missing and 2 partials ⚠️
ydb/_topic_common/_stream_reconnector.py	97.72%	1 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #842      +/-   ##
==========================================
+ Coverage   81.01%   81.03%   +0.02%     
==========================================
  Files          94       96       +2     
  Lines       12091    12099       +8     
  Branches     1182     1185       +3     
==========================================
+ Hits         9795     9804       +9     
- Misses       1837     1840       +3     
+ Partials      459      455       -4

Flag	Coverage Δ
integration	`78.97% <80.44%> (-0.12%)`	⬇️
unit	`46.52% <86.34%> (+0.19%)`	⬆️

Files with missing lines	Coverage Δ
ydb/_topic_common/_stream_connection.py	`100.00% <100.00%> (ø)`
ydb/_topic_common/_stream_reconnector.py	`97.72% <97.72%> (ø)`
ydb/_topic_writer/topic_writer_asyncio.py	`88.06% <89.61%> (-0.87%)`	⬇️
ydb/_topic_reader/topic_reader_asyncio.py	`85.37% <78.94%> (-1.57%)`	⬇️

... and 1 file with indirect coverage changes

robot-vibe-db · 2026-06-26T10:35:47Z

@vgvoleg Please set up your personal token, see instructions: https://nda.ya.ru/t/dur-L1y77biDC2

Copilot

Pull request overview

Refactors the topic reader/writer asyncio implementations to share a common bidi-stream lifecycle + reconnect loop via new base classes in ydb/_topic_common (StreamReconnector, StreamConnection), reducing duplication and addressing reconnect/close-ordering issues.

Changes:

Introduces StreamReconnector (shared reconnect loop/backoff/close ordering) and StreamConnection (shared stream ownership, connect handshake, update-token loop).
Migrates topic reader and writer asyncio reconnectors/streams to thin subclasses using the shared stack.
Updates and extends unit tests to validate the new structural “no zombie stream on handshake cancel” guarantee.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

File	Description
ydb/_topic_writer/topic_writer_asyncio.py	Writer migrated to `StreamReconnector`/`StreamConnection`; reconnect hooks and close/flush ordering refactored.
ydb/_topic_writer/topic_writer_asyncio_test.py	Tests updated for `_new_connection`-based reconnect; adds handshake-cancel stream-close test.
ydb/_topic_reader/topic_reader_asyncio.py	Reader migrated to `StreamReconnector`/`StreamConnection`; reconnect loop replaced by shared hooks.
ydb/_topic_reader/topic_reader_asyncio_test.py	Tests updated for `_new_connection`-based reconnect; adds handshake-cancel stream-close test and close cleanup.
ydb/_topic_common/STREAM_DESIGN.md	Documents the shared stream stack architecture, hooks, signals, and invariants.
ydb/_topic_common/common_test.py	Adds unit tests covering StreamReconnector/StreamConnection base behavior (reconnect, cancellation, close swallowing).
ydb/_topic_common/_stream_reconnector.py	New shared reconnect loop implementation with hooks for concrete reader/writer logic.
ydb/_topic_common/_stream_connection.py	New shared connection lifecycle (sync wrapper ownership, connect handshake, update-token loop, per-conn death signal).

robot-vibe-db

AI Review Summary

Verdict: ✅ No critical issues found

Critical issues

No critical issues found.

Other findings

Minor | Medium: asyncio.sleep(self._update_token_interval or 0) creates a hot loop when _update_token_interval is None — ydb/_topic_common/_stream_connection.py:74
Minor | Medium: self._conn = conn before handshake exposes half-initialized connection to concurrent consumers during failed handshakes — ydb/_topic_common/_stream_reconnector.py:85
Nit | High: Legacy create() + _start() paths allocate a duplicate GrpcWrapperAsyncIO from StreamConnection.__init__ that is immediately overridden — ydb/_topic_reader/topic_reader_asyncio.py:492 / ydb/_topic_writer/topic_writer_asyncio.py:787
Nit | Medium: _stream_reader backward-compatibility property on ReaderReconnector appears unused — can be removed unless external callers depend on it — ydb/_topic_reader/topic_reader_asyncio.py:249
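
The first finding above concerns `asyncio.sleep(interval or 0)` when the interval is `None`. One possible guard is sketched below; `update_token_loop` and its `refresh` callback are hypothetical names, and this is not necessarily how the PR resolved the finding.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def update_token_loop(
    interval: Optional[float],
    refresh: Callable[[], Awaitable[None]],
) -> None:
    """Refresh the token every `interval` seconds; do nothing if no interval is set."""
    if interval is None:
        # Hypothetical guard: without it, `asyncio.sleep(None or 0)` yields and
        # returns at once, so the loop would spin as fast as the event loop allows.
        return
    while True:
        await asyncio.sleep(interval)
        await refresh()
```

Returning early keeps the "no interval configured" case out of the loop entirely, instead of relying on a zero-length sleep.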

This review was generated automatically. Critical issues require attention; other findings are advisory.
If this comment was useful, please give it a 👍 — it helps us improve the review bot.

robot-vibe-db · 2026-06-26T11:06:03Z

Full analysis log

Analysis performed by claude, claude-opus-4-6.

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

+    def _set_first_error(self, err: BaseException) -> None:
+        """Record the first error that ended this connection; later errors are ignored."""
+        try:
+            self._first_error.set_result(err)
+        except asyncio.InvalidStateError:
+            pass
+        self._on_first_error()
+
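
The "first error wins" pattern in the hunk above can be sketched in isolation. `FirstErrorSignal` is a hypothetical stand-in, not the SDK class; it stores the error with `set_result` so that waiters receive it as a value, and a second call is ignored through `asyncio.InvalidStateError`.

```python
import asyncio


class FirstErrorSignal:
    """Hypothetical stand-in for a per-connection death signal."""

    def __init__(self) -> None:
        self._first_error: asyncio.Future = asyncio.get_running_loop().create_future()

    def set_first_error(self, err: BaseException) -> None:
        try:
            self._first_error.set_result(err)
        except asyncio.InvalidStateError:
            pass  # a first error is already recorded; later ones are dropped

    async def wait_error(self) -> BaseException:
        # The error comes back as a result, so awaiting does not raise it.
        return await self._first_error


async def demo_first_error() -> BaseException:
    signal = FirstErrorSignal()
    signal.set_first_error(ConnectionError("first"))
    signal.set_first_error(TimeoutError("second"))  # ignored
    return await signal.wait_error()
```

Running `asyncio.run(demo_first_error())` returns the `ConnectionError`; the later `TimeoutError` never replaces it.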


+    Subclasses must set up their own attributes BEFORE calling ``super().__init__()``,
+    because that call schedules the connection loop which immediately uses ``_connect``.
+    """


+        self._loop = asyncio.get_running_loop()
+        self._stub = stub
+        self._method = method
+        # Built (not started) here so the connection owns its transport before connect()'s first
+        # network await — that is what makes the no-zombie guarantee structural. The legacy
+        # _start(stream, ...) injection path overrides this with an externally provided stream.
+        self._stream: IGrpcWrapperAsyncIO = GrpcWrapperAsyncIO(from_proto)
+        self._background_tasks: Set[asyncio.Task] = set()


    def receive_batch_with_tx_nowait(self, tx: "BaseQueryTxContext", max_messages: Optional[int] = None):
-        if self._stream_reader is None:
+        if self._conn is None:
            return None
-        batch = self._stream_reader.receive_batch_nowait(
+        batch = self._conn.receive_batch_nowait(
            max_messages=max_messages,
        )


Extract the reconnect and bidi-stream lifecycle shared by the topic reader and writer into two base classes in ydb/_topic_common:

- StreamReconnector: the single reconnect loop, backoff, fatal signal and close ordering, via _new_connection()/_handshake()/_run() hooks. Because _new_connection() is synchronous and the connection owns its gRPC stream before the first network await, "one live stream, no zombie" becomes structural: a cancel mid-handshake always closes the stream (this now covers the writer too, which had no such guard).
- StreamConnection: the bidi-stream lifecycle (gRPC wrapper, connect = start + init handshake, update-token loop, per-connection wait_error).

The reader/writer reconnectors and stream objects become thin subclasses; _run(conn) = conn.wait_error() for both.

Also fixes the writer reconnect backoff (attempt was reset every loop iteration, a retry storm at the backoff floor) and a flush-on-close deadlock.

robot-vibe-db · 2026-06-26T12:55:06Z

+attempt = 0
+while not closed:
+    conn = self._new_connection()      # SYNC: builds the connection, which owns its gRPC stream
+    self._conn = conn                  # owned BEFORE the first network await


Severity: Minor
Confidence: High

The pseudocode shows self._conn = conn before await self._handshake(conn), but the actual implementation in _stream_reconnector.py:84-89 assigns _conn after the handshake succeeds. The structural no-zombie guarantee relies on the local conn variable being available in the finally block, not on early _conn assignment. As written, this pseudocode could mislead maintainers into thinking consumers can observe the connection before the handshake completes (they cannot — _conn is only published after a full handshake).

Suggested fix: swap the two lines in the pseudocode to match the actual code:

conn = self._new_connection()  # SYNC
try:
    await self._handshake(conn)
    self._conn = conn  # published only AFTER successful handshake
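
The ordering discussed in this comment can be tried out with a self-contained sketch. `FakeConn`, `Reconnector` and `connect_once` are hypothetical simplifications, not the SDK implementation; the point is that the local `conn` is closed in `finally` even when the task is cancelled mid-handshake, and that `_conn` is published only after the handshake.

```python
import asyncio
from typing import List, Optional


class FakeConn:
    """Hypothetical connection that only records whether it was closed."""

    def __init__(self) -> None:
        self.closed = False

    async def close(self) -> None:
        self.closed = True


class Reconnector:
    def __init__(self, handshake_delay: float) -> None:
        self._conn: Optional[FakeConn] = None
        self.created: List[FakeConn] = []
        self._handshake_delay = handshake_delay

    def _new_connection(self) -> FakeConn:
        conn = FakeConn()  # synchronous: no await before we own the connection
        self.created.append(conn)
        return conn

    async def _handshake(self, conn: FakeConn) -> None:
        await asyncio.sleep(self._handshake_delay)  # stand-in for the init exchange

    async def connect_once(self) -> None:
        conn = self._new_connection()
        try:
            await self._handshake(conn)
            self._conn = conn          # published only after a successful handshake
            await asyncio.sleep(3600)  # stand-in for _run(conn)
        finally:
            self._conn = None
            await conn.close()         # runs on success, error and cancellation


async def demo_cancel_mid_handshake() -> Reconnector:
    reconnector = Reconnector(handshake_delay=3600)
    task = asyncio.create_task(reconnector.connect_once())
    await asyncio.sleep(0)  # let the task reach the handshake await
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return reconnector
```

After `demo_cancel_mid_handshake()` the single created connection is closed and `_conn` was never visible to consumers.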

robot-vibe-db · 2026-06-26T12:55:20Z

+    # ------------------------------------------------------------------ the one reconnect loop
+
+    async def _connection_loop(self) -> None:
+        attempt = 0


Severity: Minor
Confidence: Medium

The PR description mentions two important bug fixes: (1) writer reconnect backoff storm (attempt was reset every iteration in the old code, now fixed by declaring attempt = 0 outside the while loop at line 74), and (2) flush-on-close deadlock (writer now flushes before _closed=True at lines 334-336).

Both fixes look correct, but neither has a regression test that would catch a reintroduction. Consider adding:

A test that verifies attempt grows across consecutive retriable connect failures (e.g., assert the backoff sleep increases on the 2nd and 3rd failures rather than staying at the floor).

A test that reproduces the deadlock scenario: enqueue a message, drop the connection during flush(), and verify the reconnect loop brings up a new connection to complete the flush (rather than deadlocking because _closed=True prevents reconnection).
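
A regression test for the flush-on-close ordering might start from a toy model like the one below. The `Writer` class here is hypothetical and far simpler than the SDK writer: it has one unacked message, its first connection drops, and only a later connection can complete the flush.

```python
import asyncio


class Writer:
    """Toy writer: one unacked message and a first connection that drops."""

    def __init__(self) -> None:
        self._closed = False
        self._flushed = asyncio.Event()
        self.connections = 0
        self._loop_task = asyncio.get_running_loop().create_task(self._connection_loop())

    async def _connection_loop(self) -> None:
        while not self._closed:
            self.connections += 1
            if self.connections == 1:
                await asyncio.sleep(0.01)  # first connection drops before the ack
                continue
            self._flushed.set()  # a later connection delivers and acks the message
            return

    async def flush(self) -> None:
        await self._flushed.wait()

    async def close(self, flush_first: bool) -> None:
        if flush_first:
            await self.flush()   # the loop may still reconnect while we wait
            self._closed = True
        else:
            self._closed = True  # the loop exits on its next iteration
            await self.flush()   # nothing is left to ack the message
        await self._loop_task


async def close_completes(flush_first: bool) -> bool:
    writer = Writer()
    await asyncio.sleep(0)  # let the first connection start
    try:
        await asyncio.wait_for(writer.close(flush_first), timeout=0.5)
        return True
    except asyncio.TimeoutError:
        return False
```

In this model `close_completes(True)` finishes after one reconnect, while `close_completes(False)` only ends because of the timeout, which is the deadlock shape the comment describes.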

robot-vibe-db · 2026-06-26T12:55:33Z

-                    "writer reconnector %s retry in %s seconds",
-                    self._id,
-                    err_info.sleep_timeout_seconds,
-                )


Severity: Nit
Confidence: High

This self._state_changed.set() is redundant: the base class _connection_loop already calls self._state_changed.set() on _stream_reconnector.py:92, immediately after await self._on_connected(conn) returns. Since there is no await between the two .set() calls, they fire in the same event-loop turn and the duplicate is a no-op.

Harmless, but removing it avoids the false impression that signaling here is load-bearing (it isn't — the base class handles it).
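
The claim that a second `set()` with no `await` in between has no extra effect can be checked with plain `asyncio.Event`; the waiter and counter below are illustrative only.

```python
import asyncio


async def wakeups_after_double_set() -> int:
    state_changed = asyncio.Event()
    wakeups = 0

    async def waiter() -> None:
        nonlocal wakeups
        await state_changed.wait()
        wakeups += 1

    task = asyncio.create_task(waiter())
    await asyncio.sleep(0)  # let the waiter block on the event
    state_changed.set()     # first signal
    state_changed.set()     # second signal in the same turn: already set
    await task
    return wakeups
```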

robot-vibe-db · 2026-06-26T12:56:00Z

AI Review Summary

Verdict: ✅ No critical issues found

Critical issues

No critical issues found.

Other findings

Minor | High: STREAM_DESIGN.md pseudocode shows self._conn = conn before await self._handshake(conn), but the actual code assigns _conn after the handshake succeeds. The structural guarantee relies on the local conn variable in finally, not on early _conn assignment. Misleading for maintainers. — ydb/_topic_common/STREAM_DESIGN.md:29
Minor | Medium: The two bug fixes described in the PR (writer backoff storm and flush-on-close deadlock) are correct but lack regression tests that would catch reintroduction. — ydb/_topic_common/_stream_reconnector.py:74
Nit | High: Redundant self._state_changed.set() in WriterAsyncIOReconnector._on_connected() — the base class already signals the event on the next line after _on_connected returns. — ydb/_topic_writer/topic_writer_asyncio.py:487

Detailed notes:

The refactoring is well-designed. The two base classes (StreamReconnector and StreamConnection) cleanly separate the reconnect lifecycle from the per-protocol stream logic. Key observations from the full review:

Backoff fix is correct: The old writer had attempt = 0 inside the while True loop, preventing backoff growth. The base class correctly declares it before the loop and resets only on a successful connect.
Flush-on-close deadlock fix is correct: The writer now flushes before self._closed = True, allowing the connection loop to reconnect during flush if needed. The old code set _closed=True first, making reconnection impossible during flush.
No-zombie guarantee is structural and correct: _new_connection() creates the connection synchronously, finally closes the local conn, and CancelledError during _close_connection is re-raised (preventing silent swallowing). Tests verify this for both reader and writer.
Error handling migration from set_exception/exception() to set_result/result() is correct: All call sites (_check_stop, wait_init, wait_stop, _on_fatal) have been properly adapted.
Message replay across reconnects is correct: The writer's _messages deque retains unacked messages, the new _send_loop resends them, and the last_seq_no dedup skips stale items from _new_messages.
Test coverage is good: New tests cover reconnect-on-retriable, CancelledError reconnect, finally-swallows-close-error, close-without-connection, handshake-cancel-closes-stream (structural no-zombie), and close-during-reconnect-does-not-hang.
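
The replay-and-dedup behaviour described in the list above can be sketched with integer sequence numbers. `messages_to_send` is a hypothetical helper, and modelling messages as bare `seq_no` values is an assumption made for brevity.

```python
from collections import deque
from typing import Deque, Iterable, List


def messages_to_send(unacked: Deque[int], new_messages: Iterable[int]) -> List[int]:
    """Replay unacked messages first, then append only items newer than the last one."""
    to_send = list(unacked)
    last_seq_no = to_send[-1] if to_send else 0
    for seq_no in new_messages:
        if seq_no <= last_seq_no:
            continue  # stale: already covered by the replayed messages
        to_send.append(seq_no)
        last_seq_no = seq_no
    return to_send
```

For example, replaying unacked messages 3, 4, 5 while the pending queue still holds 4, 5, 6, 7 sends each sequence number exactly once.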

This review was generated automatically. Critical issues require attention; other findings are advisory.
If this comment was useful, please give it a 👍 — it helps us improve the review bot.

robot-vibe-db · 2026-06-26T12:56:24Z

Full analysis log

Analysis performed by claude, claude-opus-4-6.

vgvoleg added the need_ai_review label Jun 26, 2026

robot-vibe-db Bot added ai_review_failed and removed need_ai_review labels Jun 26, 2026

vgvoleg force-pushed the refactor/topic-stream-reconnector branch from b6ab1c8 to f2eaa09 June 26, 2026 10:40

vgvoleg added need_ai_review and removed ai_review_failed labels Jun 26, 2026

robot-vibe-db Bot removed the need_ai_review label Jun 26, 2026

vgvoleg requested a review from Copilot June 26, 2026 10:53

robot-vibe-db Bot added the ai_review_in_process label Jun 26, 2026

Copilot started reviewing on behalf of vgvoleg June 26, 2026 10:54

Copilot AI reviewed Jun 26, 2026

Comment thread ydb/_topic_writer/topic_writer_asyncio.py

Comment thread ydb/_topic_common/_stream_connection.py

robot-vibe-db Bot reviewed Jun 26, 2026

Comment thread ydb/_topic_common/_stream_connection.py Outdated

Comment thread ydb/_topic_common/_stream_reconnector.py Outdated

Comment thread ydb/_topic_common/_stream_connection.py Outdated

Comment thread ydb/_topic_reader/topic_reader_asyncio.py Outdated

robot-vibe-db Bot added ai_reviewed and removed ai_review_in_process labels Jun 26, 2026

vgvoleg force-pushed the refactor/topic-stream-reconnector branch from f2eaa09 to 73c2519 June 26, 2026 11:16

vgvoleg requested a review from Copilot June 26, 2026 11:56

vgvoleg added need_ai_review and removed ai_reviewed labels Jun 26, 2026

Copilot started reviewing on behalf of vgvoleg June 26, 2026 11:59

Copilot AI reviewed Jun 26, 2026

robot-vibe-db Bot added ai_review_in_process and removed need_ai_review labels Jun 26, 2026

vgvoleg force-pushed the refactor/topic-stream-reconnector branch from 73c2519 to 5bac08f June 26, 2026 12:40

robot-vibe-db Bot reviewed Jun 26, 2026

robot-vibe-db Bot added ai_reviewed and removed ai_review_in_process labels Jun 26, 2026
