First batch of metadata improvements#1526
Pull request overview
This PR refactors the metadata handling code by splitting the large metadata.rs file into three focused modules and rearchitecting the control connection management. The key change replaces the ConnectionPool-based control connection with a direct ControlConnection and introduces ControlConnectionState as the single source of truth for connection status.
Key Changes:
- Split `metadata.rs` into `mod.rs` (types), `reader.rs` (`MetadataReader`), and `fetching.rs` (querying logic)
- Replaced `ConnectionPool` with a direct `ControlConnection` that lives for the connection's lifetime
- Introduced a `ControlConnectionState` enum to track working/broken connection state with associated data
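The "state with associated data" idea can be sketched roughly like this (a hypothetical shape; the actual variant and field names in the driver may differ):

```rust
/// Placeholder for the driver's metadata-fetch error type (illustrative only).
#[derive(Debug, Clone, PartialEq)]
pub struct MetadataError(pub String);

/// Hypothetical sketch: a single source of truth for the control
/// connection's status, where each variant carries the data relevant
/// to that state.
#[derive(Debug)]
pub enum ControlConnectionState {
    /// The connection is up; data tied to this connection lives here.
    Working {
        endpoint: String, // address of the node we are connected to
    },
    /// The connection is broken; the error that broke it is stored.
    Broken(MetadataError),
}

impl ControlConnectionState {
    pub fn is_working(&self) -> bool {
        matches!(self, ControlConnectionState::Working { .. })
    }
}
```

Because the state enum owns the per-connection data, there is no separate boolean flag that can drift out of sync with the actual connection.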
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `scylla/tests/integration/session/retries.rs` | Added proxy rule cleanup before DDL statement to prevent interference with control connection |
| `scylla/src/network/mod.rs` | Changed `open_connection` visibility from `#[cfg(test)]` to `pub(crate)` for use in metadata reader |
| `scylla/src/errors.rs` | Added `ConnectionUnexpectedlyDropped` error variant for control connection-specific error case |
| `scylla/src/cluster/worker.rs` | Removed separate server events and control connection repair channels; now uses unified `wait_for_control_connection_event()` API |
| `scylla/src/cluster/metadata/reader.rs` | New file containing `MetadataReader` with `ControlConnectionState` management and metadata fetching orchestration |
| `scylla/src/cluster/metadata/mod.rs` | New module file with public type definitions (`Peer`, `Keyspace`, `Table`, etc.) and documentation |
| `scylla/src/cluster/metadata/fetching.rs` | Reorganized file containing internal metadata querying implementation; made `query_metadata` `pub(crate)` |
```rust
// TODO: Should this be a new variant of BrokenConnectionErrorKind?
// This doesn't really fit well because it's specific to CC.
// OTOH if we make it a new type, then we lose a guarantee that BrokenConnectionError is
// backed by BrokenConnectionErrorKind.
```
The TODO comment on lines 151-154 has been addressed by adding the ConnectionUnexpectedlyDropped variant to BrokenConnectionErrorKind in errors.rs (lines 782-787). Consider removing this TODO comment since the issue has been resolved.
```rust
// TODO: Should this be a new variant of BrokenConnectionErrorKind?
// This doesn't really fit well because it's specific to CC.
// OTOH if we make it a new type, then we lose a guarantee that BrokenConnectionError is
// backed by BrokenConnectionErrorKind.
```
This TODO is intended for reviewers. I did add the variant, but I'm not sure if this is the correct solution here.
`scylla/src/errors.rs` (outdated)

```rust
    )]
    ChannelError,

    /// Specific to Control Connection. We tried to poll recevier of error oneshot channel,
```
There's a typo in the comment: "recevier" should be "receiver".
```diff
- /// Specific to Control Connection. We tried to poll recevier of error oneshot channel,
+ /// Specific to Control Connection. We tried to poll receiver of error oneshot channel,
```
I don't see this resolved.
WDYM? I just checked, and commit "Move channel polling to MetadataReader" introduces the correct version.
Force-pushed from 028776a to 055a887.
`scylla/src/errors.rs` (outdated)

```rust
    )]
    ChannelError,

    /// Specific to Control Connection. We tried to poll recevier of error oneshot channel,
```
I don't see this resolved.
It was failing for me. The way I understand it, the proxy rules were preventing the schema agreement performed after the DROP from succeeding.
The metadata module is too big and contains components with different purposes. To make it easier to understand, I want to split it. This commit moves all functions and structures related to fetching and parsing metadata from a server to a separate module.
The metadata module is too big: it contains various components with different purposes. I want to split it to make it easier to understand. This commit splits off the metadata reader. Now only definitions of metadata structures remain in the metadata module.
Force-pushed from 055a887 to 1f26e54.
Rebased on main, comments not addressed yet.
I see 3 possible advantages of the connection pool:

1. It uses mpsc for error notifications, instead of oneshot. This makes worker code simpler.
2. It automatically reconnects when the connection is lost.
3. Creating it is not async; it creates connections in the background.

(1) can be worked around, as this commit proves. (2) is not really used by us: when receiving an error notification, we immediately refresh metadata, which should almost always happen before the pool reconnect succeeds, so the pool will be dropped. (3) is not a problem: the `MetadataReader` constructor is already async because of contact point resolving.

The big disadvantage of the current design (with the pool) is that it makes it impossible to order CQL events relative to CC drops and reconnects. This prevents implementing features (like event channels, or partial metadata fetches) in a robust way, and also makes the code more difficult to understand. The other disadvantage is that the pool returns a `Connection`. This forces us to create a `ControlConnection` on each fetch, instead of when the connection is created. This in turn makes it more difficult to make metadata reading use prepared statements (by caching them in `ControlConnection`). I also think any contributor will agree that storing `ControlConnection` instead of `Connection` or `ConnectionPool` in `MetadataReader` makes more conceptual sense.

A side effect of this change is that the control connection will no longer be included in metrics about connections. This will affect, for example, tests in cpp-driver. Interestingly, this matches the behavior of the legacy cpp-driver. If there is demand for bringing those metrics back, we could do that without using a connection pool.

This commit removes this use of `ConnectionPool`. Instead, `Connection` is now used.
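The oneshot-style error notification discussed in point (1) can be emulated with std channels (a sketch only; the driver itself uses async channels, and all names below are made up). It also shows the "task ended without reporting" case that motivates an error variant like `ConnectionUnexpectedlyDropped`:

```rust
use std::sync::mpsc;
use std::thread;

/// Illustrative stand-in for the driver's broken-connection error.
#[derive(Debug, PartialEq)]
struct BrokenConnectionError(&'static str);

/// The connection task owns the sender and fires it at most once when
/// the connection breaks; the metadata reader polls the receiver.
fn spawn_failing_connection() -> mpsc::Receiver<BrokenConnectionError> {
    let (error_tx, error_rx) = mpsc::channel();
    thread::spawn(move || {
        // ... connection work; on failure, report the error once.
        let _ = error_tx.send(BrokenConnectionError("connection reset"));
    });
    error_rx
}

/// If the task ends without sending anything, the receiver observes a
/// disconnected channel -- the "connection unexpectedly dropped" case.
fn spawn_vanishing_connection() -> mpsc::Receiver<BrokenConnectionError> {
    let (error_tx, error_rx) = mpsc::channel::<BrokenConnectionError>();
    thread::spawn(move || drop(error_tx));
    error_rx
}
```

With a single-shot channel per connection, the receiver's disconnect state cleanly distinguishes "reported an error" from "silently died", which an mpsc shared across reconnects cannot do.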
It makes more sense conceptually that we store `ControlConnection` instead of creating it on each fetch.
This advances the goal of storing stuff tied to a certain connection next to such connection.
Force-pushed from 1f26e54 to 30a71a0.
Thanks to this, cc state field no longer needs to be pub(crate), which improves separation between `ClusterWorker` and `MetadataReader`.
Just a minor refactor for clarity.
I want to make `ControlConnectionState` the source of truth about whether the CC is connected, with the goal of `ClusterWorker` using that (instead of a mutable variable set to false when a fetch fails). This requires setting the state to Broken in `fetch_metadata`, but there we only have a `MetadataError` when it fails (which makes sense, of course). To solve this, let's store the `MetadataError` in the CC state. This introduces a bit of code duplication, which I'll solve in the next commits.
Now `ControlConnectionState` should always be correct, and so there is no need to have a mut variable in `ClusterWorker`.
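A minimal sketch of that idea (hypothetical names, not the driver's actual code): the fetch path records the failure in the state, so callers can query the state instead of tracking a separate mutable flag:

```rust
/// Illustrative stand-in for the driver's metadata-fetch error.
#[derive(Debug, Clone)]
struct MetadataError(String);

enum ControlConnectionState {
    Working,
    /// A failed fetch only yields a `MetadataError`, so the Broken
    /// variant stores it.
    Broken(MetadataError),
}

struct MetadataReader {
    state: ControlConnectionState,
}

impl MetadataReader {
    /// Record a fetch outcome; the state itself is the source of truth,
    /// so no caller needs its own "is it broken?" variable.
    fn record_fetch_result(&mut self, result: Result<(), MetadataError>) {
        if let Err(err) = result {
            self.state = ControlConnectionState::Broken(err);
        }
    }

    fn is_connected(&self) -> bool {
        matches!(self.state, ControlConnectionState::Working)
    }
}
```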
Endpoint is another piece of state related to a specific connection, so let's store it in the state. This slightly increases the amount of (cheap) clone calls; I'll solve that in the next commit.
The result of `make_control_connection` is always transformed into `ControlConnectionState`. To reduce code repetition and clone calls, we can return `ControlConnectionState` from `make_control_connection` directly.
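The shape of that change might look like this (a hedged sketch with made-up signatures, not the driver's real ones): the conversion into `ControlConnectionState` happens once, inside `make_control_connection`, instead of at every call site:

```rust
#[derive(Debug, Clone, PartialEq)]
struct Endpoint(String);

#[derive(Debug)]
struct ConnectionError(String);

#[derive(Debug)]
enum ControlConnectionState {
    Working { endpoint: Endpoint },
    Broken { endpoint: Endpoint, error: ConnectionError },
}

/// Before, this returned a bare connection result and every caller
/// converted it into a state, cloning the endpoint each time. Returning
/// the state directly removes that repetition and the extra clones.
fn make_control_connection(
    endpoint: Endpoint,
    connect_result: Result<(), ConnectionError>,
) -> ControlConnectionState {
    match connect_result {
        Ok(()) => ControlConnectionState::Working { endpoint },
        Err(error) => ControlConnectionState::Broken { endpoint, error },
    }
}
```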
This is nicer than having those modules prefixed with `metadata_`.
Force-pushed from 30a71a0 to fd72f91.
wprzytula left a comment:
Good job! 💯
On the way to robust cluster state management, both in Rust Driver and all wrapper drivers. 🚀
This PR performs some refactors in metadata code (mostly in MetadataReader).
The biggest changes:
- Split `metadata.rs` into 3 files, because it was too big and I couldn't comprehend it.
- Removed `ConnectionPool`. See the commit message for in-depth reasoning about that.
- Introduced `ControlConnectionState`, which is now the source of truth about the status of the CC, and stores all the stuff related to a single connection (event channels, endpoint, etc.).
- `ControlConnection` is now created for the duration of the connection's lifetime, not once per fetch. This makes "Internal reusable CQL statements should be prepared" (#417) easier.

Unblocks: #417
Fixes: #1409
Pre-review checklist
- [x] I added relevant tests for new features and bug fixes. — no new features / bug fixes, so nothing to test.
- [x] I have provided docstrings for the public items that I want to introduce. — no new items.
- [x] I have adjusted the documentation in `/docs/source/`. — nothing to adjust.
- [x] I added `Fixes:` annotations to the PR description.