
Refactor topic reader/writer onto a shared stream stack#842

Draft
vgvoleg wants to merge 1 commit into main from refactor/topic-stream-reconnector


Conversation


@vgvoleg vgvoleg commented Jun 26, 2026

Member

Extracts the reconnect + bidi-stream lifecycle shared by the topic reader and writer into two base classes in ydb/_topic_common: StreamReconnector and StreamConnection. The reader/writer reconnectors and stream objects become thin subclasses.

Why

  • one reconnect loop instead of two near-duplicate copies
  • single live stream, no zombie — structural: the connection is built synchronously and owns its gRPC stream before the first network await, so a cancel mid-handshake always closes it (this now covers the writer too, which had no such guard)
  • _run(conn) = conn.wait_error() symmetric for reader and writer

Fixes

  • writer reconnect backoff: attempt was reset every loop iteration → retry storm at the backoff floor; now grows correctly
  • flush-on-close deadlock
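
The backoff fix in the first bullet can be illustrated with a minimal sketch. The `backoff_delay` helper and its floor and cap values are hypothetical and are not taken from the SDK; only the placement of `attempt = 0` relative to the loop mirrors what the description says changed.

```python
from typing import List


def backoff_delay(attempt: int, floor: float = 1.0, cap: float = 30.0) -> float:
    """Hypothetical exponential backoff: floor * 2**attempt, capped at `cap`."""
    return min(cap, floor * (2 ** attempt))


def delays_buggy(failures: int) -> List[float]:
    """Models the old writer as described: attempt is reset inside the loop."""
    delays = []
    for _ in range(failures):
        attempt = 0  # reset on every iteration, so the delay never leaves the floor
        delays.append(backoff_delay(attempt))
        attempt += 1  # has no effect: overwritten at the top of the next iteration
    return delays


def delays_fixed(failures: int) -> List[float]:
    """Models the fix: attempt is declared once, before the loop."""
    attempt = 0
    delays = []
    for _ in range(failures):
        delays.append(backoff_delay(attempt))
        attempt += 1  # survives across iterations, so the delay grows
    return delays
```

With four consecutive failures, `delays_buggy(4)` stays at the floor on every retry, while `delays_fixed(4)` doubles each time until it reaches the cap.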

Notes

  • public API unchanged
  • unit + topic integration green; mypy on the topic code is clean
  • architecture write-up: ydb/_topic_common/STREAM_DESIGN.md

Draft — preserving the work, not for merge yet.

@codecov-commenter

codecov-commenter commented Jun 26, 2026


Codecov Report

❌ Patch coverage is 91.88192% with 22 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.03%. Comparing base (f2c742d) to head (5bac08f).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ydb/_topic_reader/topic_reader_asyncio.py | 78.94% | 4 Missing and 8 partials ⚠️ |
| ydb/_topic_writer/topic_writer_asyncio.py | 89.61% | 6 Missing and 2 partials ⚠️ |
| ydb/_topic_common/_stream_reconnector.py | 97.72% | 1 Missing and 1 partial ⚠️ |
Additional details and impacted files


@@            Coverage Diff             @@
##             main     #842      +/-   ##
==========================================
+ Coverage   81.01%   81.03%   +0.02%     
==========================================
  Files          94       96       +2     
  Lines       12091    12099       +8     
  Branches     1182     1185       +3     
==========================================
+ Hits         9795     9804       +9     
- Misses       1837     1840       +3     
+ Partials      459      455       -4     
| Flag | Coverage Δ |
| --- | --- |
| integration | 78.97% <80.44%> (-0.12%) ⬇️ |
| unit | 46.52% <86.34%> (+0.19%) ⬆️ |


| Files with missing lines | Coverage Δ |
| --- | --- |
| ydb/_topic_common/_stream_connection.py | 100.00% <100.00%> (ø) |
| ydb/_topic_common/_stream_reconnector.py | 97.72% <97.72%> (ø) |
| ydb/_topic_writer/topic_writer_asyncio.py | 88.06% <89.61%> (-0.87%) ⬇️ |
| ydb/_topic_reader/topic_reader_asyncio.py | 85.37% <78.94%> (-1.57%) ⬇️ |

... and 1 file with indirect coverage changes


@robot-vibe-db

robot-vibe-db Bot commented Jun 26, 2026


@vgvoleg Please set up your personal token, see instructions: https://nda.ya.ru/t/dur-L1y77biDC2

Copilot AI left a comment


Pull request overview

Refactors the topic reader/writer asyncio implementations to share a common bidi-stream lifecycle + reconnect loop via new base classes in ydb/_topic_common (StreamReconnector, StreamConnection), reducing duplication and addressing reconnect/close-ordering issues.

Changes:

  • Introduces StreamReconnector (shared reconnect loop/backoff/close ordering) and StreamConnection (shared stream ownership, connect handshake, update-token loop).
  • Migrates topic reader and writer asyncio reconnectors/streams to thin subclasses using the shared stack.
  • Updates and extends unit tests to validate the new structural “no zombie stream on handshake cancel” guarantee.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| ydb/_topic_writer/topic_writer_asyncio.py | Writer migrated to StreamReconnector/StreamConnection; reconnect hooks and close/flush ordering refactored. |
| ydb/_topic_writer/topic_writer_asyncio_test.py | Tests updated for _new_connection-based reconnect; adds handshake-cancel stream-close test. |
| ydb/_topic_reader/topic_reader_asyncio.py | Reader migrated to StreamReconnector/StreamConnection; reconnect loop replaced by shared hooks. |
| ydb/_topic_reader/topic_reader_asyncio_test.py | Tests updated for _new_connection-based reconnect; adds handshake-cancel stream-close test and close cleanup. |
| ydb/_topic_common/STREAM_DESIGN.md | Documents the shared stream stack architecture, hooks, signals, and invariants. |
| ydb/_topic_common/common_test.py | Adds unit tests covering StreamReconnector/StreamConnection base behavior (reconnect, cancellation, close swallowing). |
| ydb/_topic_common/_stream_reconnector.py | New shared reconnect loop implementation with hooks for concrete reader/writer logic. |
| ydb/_topic_common/_stream_connection.py | New shared connection lifecycle (sync wrapper ownership, connect handshake, update-token loop, per-conn death signal). |


Comment thread ydb/_topic_writer/topic_writer_asyncio.py
Comment thread ydb/_topic_common/_stream_connection.py

@robot-vibe-db robot-vibe-db Bot left a comment


AI Review Summary

Verdict: ✅ No critical issues found

Critical issues

No critical issues found.

Other findings

  • Minor | Medium: asyncio.sleep(self._update_token_interval or 0) creates a hot loop when _update_token_interval is None - ydb/_topic_common/_stream_connection.py:74
  • Minor | Medium: self._conn = conn before handshake exposes half-initialized connection to concurrent consumers during failed handshakes — ydb/_topic_common/_stream_reconnector.py:85
  • Nit | High: Legacy create() + _start() paths allocate a duplicate GrpcWrapperAsyncIO from StreamConnection.__init__ that is immediately overridden — ydb/_topic_reader/topic_reader_asyncio.py:492 / ydb/_topic_writer/topic_writer_asyncio.py:787
  • Nit | Medium: _stream_reader backward-compatibility property on ReaderReconnector appears unused — can be removed unless external callers depend on it — ydb/_topic_reader/topic_reader_asyncio.py:249
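
The first finding above can be reproduced with a small sketch. `asyncio.sleep(0)` only yields to the event loop and returns on the next turn, so a loop that sleeps `interval or 0` spins when the interval is `None`. The function names below are hypothetical, and `should_run_update_loop` is only one possible guard, not necessarily the fix adopted in the PR.

```python
import asyncio
from typing import Optional


async def count_iterations(interval: Optional[float], budget: float) -> int:
    """Count how often the loop body runs within `budget` seconds."""
    iterations = 0

    async def loop() -> None:
        nonlocal iterations
        while True:
            # The pattern flagged in the finding: None collapses to sleep(0).
            await asyncio.sleep(interval or 0)
            iterations += 1

    task = asyncio.create_task(loop())
    await asyncio.sleep(budget)
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return iterations


def should_run_update_loop(interval: Optional[float]) -> bool:
    """One possible guard (an assumption): start the loop only for a positive interval."""
    return interval is not None and interval > 0
```

Running `asyncio.run(count_iterations(None, 0.05))` returns a large count, while a ten-second interval produces no iterations in the same budget.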

This review was generated automatically. Critical issues require attention; other findings are advisory.
If this comment was useful, please give it a 👍 — it helps us improve the review bot.

Comment thread ydb/_topic_common/_stream_connection.py Outdated
Comment thread ydb/_topic_common/_stream_reconnector.py Outdated
Comment thread ydb/_topic_common/_stream_connection.py Outdated
Comment thread ydb/_topic_reader/topic_reader_asyncio.py Outdated
@robot-vibe-db

robot-vibe-db Bot commented Jun 26, 2026


Full analysis log

Analysis performed by claude, claude-opus-4-6.

Copilot AI left a comment


Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

Comment on lines +55 to +62
def _set_first_error(self, err: BaseException) -> None:
    """Record the first error that ended this connection; later errors are ignored."""
    try:
        self._first_error.set_result(err)
    except asyncio.InvalidStateError:
        pass
    self._on_first_error()
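
The "first error wins" pattern in the snippet above can be sketched in isolation. The `FirstError` class and `demo` coroutine are hypothetical names for illustration; the only behavior relied on is that `Future.set_result` raises `asyncio.InvalidStateError` when the future is already done. Storing the exception as a result rather than with `set_exception` also avoids asyncio's "Future exception was never retrieved" log when nobody awaits the future, though whether that motivated the PR is an assumption.

```python
import asyncio
from typing import Tuple


class FirstError:
    """Hypothetical holder that keeps only the first error reported to it."""

    def __init__(self) -> None:
        # Must be constructed while an event loop is running.
        self._future: "asyncio.Future[BaseException]" = (
            asyncio.get_running_loop().create_future()
        )

    def set(self, err: BaseException) -> bool:
        """Return True if `err` was recorded, False if an earlier error already won."""
        try:
            self._future.set_result(err)
        except asyncio.InvalidStateError:
            return False
        return True

    async def wait(self) -> BaseException:
        """Wait until some error has been recorded and return it (not raise it)."""
        return await self._future


async def demo() -> Tuple[bool, bool, str]:
    holder = FirstError()
    first = holder.set(ConnectionError("stream reset"))
    second = holder.set(TimeoutError("late failure"))  # ignored: already set
    err = await holder.wait()
    return first, second, type(err).__name__
```

Calling `asyncio.run(demo())` shows that the second error is dropped and the waiter observes the first one.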

Comment on lines +26 to +28
Subclasses must set up their own attributes BEFORE calling ``super().__init__()``,
because that call schedules the connection loop which immediately uses ``_connect``.
"""
Comment thread ydb/_topic_common/STREAM_DESIGN.md
Comment on lines +34 to +41
self._loop = asyncio.get_running_loop()
self._stub = stub
self._method = method
# Built (not started) here so the connection owns its transport before connect()'s first
# network await — that is what makes the no-zombie guarantee structural. The legacy
# _start(stream, ...) injection path overrides this with an externally provided stream.
self._stream: IGrpcWrapperAsyncIO = GrpcWrapperAsyncIO(from_proto)
self._background_tasks: Set[asyncio.Task] = set()
Comment on lines 269 to 274
 def receive_batch_with_tx_nowait(self, tx: "BaseQueryTxContext", max_messages: Optional[int] = None):
-    if self._stream_reader is None:
+    if self._conn is None:
         return None
-    batch = self._stream_reader.receive_batch_nowait(
+    batch = self._conn.receive_batch_nowait(
         max_messages=max_messages,
     )
Extract the reconnect and bidi-stream lifecycle shared by the topic
reader and writer into two base classes in ydb/_topic_common:

- StreamReconnector: the single reconnect loop, backoff, fatal signal
  and close ordering, via _new_connection()/_handshake()/_run() hooks.
  Because _new_connection() is synchronous and the connection owns its
  gRPC stream before the first network await, "one live stream, no
  zombie" becomes structural — a cancel mid-handshake always closes the
  stream (this now covers the writer too, which had no such guard).
- StreamConnection: the bidi-stream lifecycle (gRPC wrapper, connect =
  start + init handshake, update-token loop, per-connection wait_error).

The reader/writer reconnectors and stream objects become thin
subclasses; _run(conn) = conn.wait_error() for both.

Also fixes the writer reconnect backoff (attempt was reset every loop
iteration, a retry storm at the backoff floor) and a flush-on-close
deadlock.
@vgvoleg vgvoleg force-pushed the refactor/topic-stream-reconnector branch from 73c2519 to 5bac08f Compare June 26, 2026 12:40
Comment thread ydb/_topic_common/STREAM_DESIGN.md Outdated
attempt = 0
while not closed:
    conn = self._new_connection()  # SYNC: builds the connection, which owns its gRPC stream
    self._conn = conn              # owned BEFORE the first network await


Severity: Minor
Confidence: High

The pseudocode shows self._conn = conn before await self._handshake(conn), but the actual implementation in _stream_reconnector.py:84-89 assigns _conn after the handshake succeeds. The structural no-zombie guarantee relies on the local conn variable being available in the finally block, not on early _conn assignment. As written, this pseudocode could mislead maintainers into thinking consumers can observe the connection before the handshake completes (they cannot — _conn is only published after a full handshake).

Suggested fix: swap the two lines in the pseudocode to match the actual code:

conn = self._new_connection()      # SYNC
try:
    await self._handshake(conn)
    self._conn = conn              # published only AFTER successful handshake
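
The ordering described in this comment can be sketched as a runnable toy. `FakeConn` and `ToyReconnector` are hypothetical stand-ins, not the SDK classes; the sketch only shows that closing the local `conn` in `finally` covers a cancel during the handshake, while the shared attribute is published only after the handshake succeeds.

```python
import asyncio
from typing import List, Optional, Tuple


class FakeConn:
    """Hypothetical connection that owns its stream from construction."""

    def __init__(self) -> None:
        self.closed = False

    async def handshake(self) -> None:
        await asyncio.Event().wait()  # never completes: simulates a stalled handshake

    async def wait_error(self) -> None:
        await asyncio.Event().wait()

    async def close(self) -> None:
        self.closed = True


class ToyReconnector:
    def __init__(self) -> None:
        self.conn: Optional[FakeConn] = None  # what consumers would observe
        self.created: List[FakeConn] = []

    def _new_connection(self) -> FakeConn:
        # Synchronous: no await between creation and the try block below.
        conn = FakeConn()
        self.created.append(conn)
        return conn

    async def run_once(self) -> None:
        conn = self._new_connection()
        try:
            await conn.handshake()
            self.conn = conn  # published only after a successful handshake
            await conn.wait_error()
        finally:
            self.conn = None
            await conn.close()  # the local variable guarantees the close


async def cancel_mid_handshake() -> Tuple[Optional[FakeConn], bool]:
    reconnector = ToyReconnector()
    task = asyncio.create_task(reconnector.run_once())
    await asyncio.sleep(0)  # let the task reach the handshake await
    published = reconnector.conn
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return published, reconnector.created[0].closed
```

Running `asyncio.run(cancel_mid_handshake())` shows that nothing was published during the handshake and that the connection was still closed after the cancel.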

# ------------------------------------------------------------------ the one reconnect loop

async def _connection_loop(self) -> None:
    attempt = 0


Severity: Minor
Confidence: Medium

The PR description mentions two important bug fixes: (1) writer reconnect backoff storm (attempt was reset every iteration in the old code, now fixed by declaring attempt = 0 outside the while loop at line 74), and (2) flush-on-close deadlock (writer now flushes before _closed=True at lines 334-336).

Both fixes look correct, but neither has a regression test that would catch a reintroduction. Consider adding:

  • A test that verifies attempt grows across consecutive retriable connect failures (e.g., assert the backoff sleep increases on the 2nd and 3rd failures rather than staying at the floor).
  • A test that reproduces the deadlock scenario: enqueue a message, drop the connection during flush(), and verify the reconnect loop brings up a new connection to complete the flush (rather than deadlocking because _closed=True prevents reconnection).
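
The second suggested test can be approximated with a toy model of the close ordering. `ToyWriter` is hypothetical and far simpler than the real writer: it starts disconnected, reconnects after a short delay unless it is already closed, and `flush()` waits for a connection. The timeout is only there to make the hang observable.

```python
import asyncio


class ToyWriter:
    """Hypothetical writer that starts with a dropped connection."""

    def __init__(self) -> None:
        self._closed = False
        self._connected = asyncio.Event()
        self._loop_task = asyncio.create_task(self._connection_loop())

    async def _connection_loop(self) -> None:
        await asyncio.sleep(0.01)  # simulated reconnect delay
        if not self._closed:       # a closed writer never reconnects
            self._connected.set()

    async def flush(self) -> None:
        await self._connected.wait()  # pending data needs a live connection

    async def close(self, flush_first: bool) -> None:
        if flush_first:
            await self.flush()   # ordering described as the fix
            self._closed = True
        else:
            self._closed = True  # ordering described as the old behavior
            await self.flush()


async def close_completes(flush_first: bool) -> bool:
    writer = ToyWriter()
    try:
        await asyncio.wait_for(writer.close(flush_first), timeout=0.2)
        return True
    except asyncio.TimeoutError:
        return False
    finally:
        writer._loop_task.cancel()
```

In this model `close_completes(True)` finishes, while `close_completes(False)` times out because the reconnect is suppressed before the flush can complete.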

"writer reconnector %s retry in %s seconds",
self._id,
err_info.sleep_timeout_seconds,
)


The reason will be displayed to describe this comment to others. Learn more.

Severity: Nit
Confidence: High

This self._state_changed.set() is redundant: the base class _connection_loop already calls self._state_changed.set() on _stream_reconnector.py:92, immediately after await self._on_connected(conn) returns. Since there is no await between the two .set() calls, they fire in the same event-loop turn and the duplicate is a no-op.

Harmless, but removing it avoids the false impression that signaling here is load-bearing (it isn't — the base class handles it).
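
That a second `set()` without an intervening `await` is a no-op can be checked with a short sketch. The coroutine name is hypothetical; it relies only on `asyncio.Event.set()` being idempotent once the flag is already set.

```python
import asyncio


async def wakeups_after_double_set() -> int:
    event = asyncio.Event()
    wakeups = 0

    async def waiter() -> None:
        nonlocal wakeups
        await event.wait()
        wakeups += 1

    task = asyncio.create_task(waiter())
    await asyncio.sleep(0)  # let the waiter block on the event
    event.set()
    event.set()             # second call: the flag is already set, nothing changes
    await task
    return wakeups
```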

@robot-vibe-db

robot-vibe-db Bot commented Jun 26, 2026


AI Review Summary

Verdict: ✅ No critical issues found

Critical issues

No critical issues found.

Other findings

  • Minor | High: STREAM_DESIGN.md pseudocode shows self._conn = conn before await self._handshake(conn), but the actual code assigns _conn after the handshake succeeds. The structural guarantee relies on the local conn variable in finally, not on early _conn assignment. Misleading for maintainers. — ydb/_topic_common/STREAM_DESIGN.md:29
  • Minor | Medium: The two bug fixes described in the PR (writer backoff storm and flush-on-close deadlock) are correct but lack regression tests that would catch reintroduction. — ydb/_topic_common/_stream_reconnector.py:74
  • Nit | High: Redundant self._state_changed.set() in WriterAsyncIOReconnector._on_connected() — the base class already signals the event on the next line after _on_connected returns. — ydb/_topic_writer/topic_writer_asyncio.py:487

Detailed notes:

The refactoring is well-designed. The two base classes (StreamReconnector and StreamConnection) cleanly separate the reconnect lifecycle from the per-protocol stream logic. Key observations from the full review:

  1. Backoff fix is correct: The old writer had attempt = 0 inside the while True loop, preventing backoff growth. The base class correctly declares it before the loop and resets only on a successful connect.
  2. Flush-on-close deadlock fix is correct: The writer now flushes before self._closed = True, allowing the connection loop to reconnect during flush if needed. The old code set _closed=True first, making reconnection impossible during flush.
  3. No-zombie guarantee is structural and correct: _new_connection() creates the connection synchronously, finally closes the local conn, and CancelledError during _close_connection is re-raised (preventing silent swallowing). Tests verify this for both reader and writer.
  4. Error handling migration from set_exception/exception() to set_result/result() is correct: All call sites (_check_stop, wait_init, wait_stop, _on_fatal) have been properly adapted.
  5. Message replay across reconnects is correct: The writer's _messages deque retains unacked messages, the new _send_loop resends them, and the last_seq_no dedup skips stale items from _new_messages.
  6. Test coverage is good: New tests cover reconnect-on-retriable, CancelledError reconnect, finally-swallows-close-error, close-without-connection, handshake-cancel-closes-stream (structural no-zombie), and close-during-reconnect-does-not-hang.
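
Point 5 above (replay of unacked messages with seq_no dedup) can be modeled with a toy buffer. `ToyBuffer` and its method names are hypothetical and only follow the description in the review note; the real writer's data structures may differ.

```python
from collections import deque
from typing import Deque, List


class ToyBuffer:
    """Hypothetical model of the writer's unacked deque plus new-message queue."""

    def __init__(self) -> None:
        self.unacked: Deque[int] = deque()  # seq_nos written but not yet acked
        self.new: Deque[int] = deque()      # seq_nos queued for the send loop

    def write(self, seq_no: int) -> None:
        self.unacked.append(seq_no)
        self.new.append(seq_no)

    def ack(self, seq_no: int) -> None:
        # Drop everything up to and including the acked seq_no.
        while self.unacked and self.unacked[0] <= seq_no:
            self.unacked.popleft()

    def send_on_connection(self) -> List[int]:
        """What a fresh connection's send loop would transmit, in order."""
        sent = list(self.unacked)        # replay everything still unacked
        last_seq_no = sent[-1] if sent else 0
        while self.new:
            seq_no = self.new.popleft()
            if seq_no <= last_seq_no:
                continue                 # stale: already covered by the replay
            sent.append(seq_no)
        return sent
```

For example, after writing 1, 2, 3 and sending them, acking 1, and writing 4 while disconnected, the next connection sends 2, 3, 4 exactly once each.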

This review was generated automatically. Critical issues require attention; other findings are advisory.
If this comment was useful, please give it a 👍 — it helps us improve the review bot.



3 participants