First batch of metadata improvements#1526

Merged
Lorak-mmk merged 15 commits into scylladb:main from Lorak-mmk:metadata-improvements
Jan 8, 2026

Conversation


@Lorak-mmk Lorak-mmk commented Jan 7, 2026

This PR performs some refactors in the metadata code (mostly in MetadataReader).
The biggest changes:

  • Split metadata.rs into 3 files, because it was too big and I couldn't comprehend it.
  • Removed the usage of ConnectionPool. See the commit message for in-depth reasoning about that.
  • Introduced ControlConnectionState, which is now the source of truth about the status of the CC, and stores all the state related to a single connection (event channels, endpoint, etc.).
  • ControlConnection is now created for the duration of the connection's lifetime, not once per fetch. This makes #417 (Internal reusable CQL statements should be prepared) easier.

Unblocks: #417
Fixes: #1409

Pre-review checklist

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • I added relevant tests for new features and bug fixes. - no new features / bug fixes, so nothing to test.
  • All commits compile, pass static checks and pass tests.
  • PR description sums up the changes and reasons why they should be introduced.
  • I have provided docstrings for the public items that I want to introduce. - no new items
  • I have adjusted the documentation in ./docs/source/. - nothing to adjust
  • I added appropriate Fixes: annotations to PR description.

@github-actions

github-actions bot commented Jan 7, 2026

cargo semver-checks found no API-breaking changes in this PR.
Checked commit: fd72f91

@Lorak-mmk Lorak-mmk requested review from Copilot and wprzytula and removed request for wprzytula January 7, 2026 12:02
Copilot AI left a comment

Pull request overview

This PR refactors the metadata handling code by splitting the large metadata.rs file into three focused modules and rearchitecting the control connection management. The key change replaces the ConnectionPool-based control connection with a direct ControlConnection and introduces ControlConnectionState as the single source of truth for connection status.

Key Changes:

  • Split metadata.rs into mod.rs (types), reader.rs (MetadataReader), and fetching.rs (querying logic)
  • Replaced ConnectionPool with direct ControlConnection that lives for the connection's lifetime
  • Introduced ControlConnectionState enum to track working/broken connection state with associated data

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Summary per file:

  • scylla/tests/integration/session/retries.rs: Added proxy rule cleanup before the DDL statement to prevent interference with the control connection
  • scylla/src/network/mod.rs: Changed open_connection visibility from #[cfg(test)] to pub(crate) for use in the metadata reader
  • scylla/src/errors.rs: Added ConnectionUnexpectedlyDropped error variant for a control-connection-specific error case
  • scylla/src/cluster/worker.rs: Removed separate server events and control connection repair channels; now uses the unified wait_for_control_connection_event() API
  • scylla/src/cluster/metadata/reader.rs: New file containing MetadataReader with ControlConnectionState management and metadata fetching orchestration
  • scylla/src/cluster/metadata/mod.rs: New module file with public type definitions (Peer, Keyspace, Table, etc.) and documentation
  • scylla/src/cluster/metadata/fetching.rs: Reorganized file containing the internal metadata querying implementation; made query_metadata pub(crate)


Comment on lines +151 to +154
// TODO: Should this be a new variant of BrokenConnectionErrorKind?
// This doesn't really fit well because its specific to CC.
// OTOH if we make it a new type, then we lose a guarantee that BrokenConnectionError is
// backed by BrokenConnectionErrorKind.
Copilot AI Jan 7, 2026
The TODO comment on lines 151-154 has been addressed by adding the ConnectionUnexpectedlyDropped variant to BrokenConnectionErrorKind in errors.rs (lines 782-787). Consider removing this TODO comment since the issue has been resolved.

Suggested change
// TODO: Should this be a new variant of BrokenConnectionErrorKind?
// This doesn't really fit well because its specific to CC.
// OTOH if we make it a new type, then we lose a guarantee that BrokenConnectionError is
// backed by BrokenConnectionErrorKind.

Collaborator Author
This TODO is intended for reviewers. I did add the variant, but I'm not sure if this is the correct solution here.

)]
ChannelError,

/// Specific to Control Connection. We tried to poll recevier of error oneshot channel,
Copilot AI Jan 7, 2026
There's a typo in the comment: "recevier" should be "receiver".

Suggested change
/// Specific to Control Connection. We tried to poll recevier of error oneshot channel,
/// Specific to Control Connection. We tried to poll receiver of error oneshot channel,

Collaborator
I don't see this resolved.

Collaborator Author
WDYM? I just checked, and commit "Move channel polling to MetadataReader" introduces the correct version.

@Lorak-mmk Lorak-mmk force-pushed the metadata-improvements branch from 028776a to 055a887 Compare January 7, 2026 12:21
@Lorak-mmk Lorak-mmk requested a review from wprzytula January 7, 2026 13:46
@Lorak-mmk Lorak-mmk added this to the 1.5.0 milestone Jan 7, 2026

It was failing for me. The way I understand it, the rules were preventing the schema agreement check performed after the DROP from succeeding.
The metadata module is too big, and contains components with different purposes. In order to make it easier to understand, I want to split it. This commit moves all functions / structures related to fetching and parsing metadata from a server to a separate module.
The metadata module is too big. It contains various components that have different purposes. I want to split it to make it easier to understand. This commit splits off the metadata reader. Now only the definitions of metadata structures remain in the metadata module.
@Lorak-mmk Lorak-mmk force-pushed the metadata-improvements branch from 055a887 to 1f26e54 Compare January 7, 2026 20:06
@Lorak-mmk
Collaborator Author

Rebased on main, comments not addressed yet.

I see 3 possible advantages of the connection pool:
 1. It uses mpsc for error notifications, instead of oneshot. This makes the worker code simpler.
 2. It automatically reconnects when the connection is lost.
 3. Creating it is not async; it creates connections in the background.

(1) can be worked around, as this commit proves.
(2) is not really used by us. When receiving an error notification, we immediately refresh metadata, which should almost always happen before a pool reconnect succeeds, so the pool will be dropped.
(3) is not a problem. The MetadataReader constructor is already async because of contact point resolving.

The big disadvantage of the current design (with the pool) is that it makes it impossible to order CQL events relative to CC drops and reconnects. This prevents implementing features (like event channels, or partial metadata fetches) in a robust way, and also makes the code more difficult to understand.

The other disadvantage is that the pool returns a `Connection`. This forces us to create a `ControlConnection` on each fetch, instead of when the connection is created. That in turn makes it more difficult to make metadata reading use prepared statements (by caching them in `ControlConnection`). I also think any contributor will agree that storing a `ControlConnection` instead of a `Connection` or `ConnectionPool` in `MetadataReader` makes more conceptual sense.

A side effect of this change is that the control connection will no longer be included in metrics about connections. This will affect, for example, tests in cpp-driver. Interestingly, this matches the behavior of the legacy cpp-driver. If there is demand for bringing those metrics back, we could do that without using a connection pool.

This commit removes this use of `ConnectionPool`. Instead, a plain `Connection` is now used.
It makes more sense conceptually to store a `ControlConnection` than to create one on each fetch. This advances the goal of storing state tied to a certain connection next to that connection.
@Lorak-mmk Lorak-mmk force-pushed the metadata-improvements branch from 1f26e54 to 30a71a0 Compare January 8, 2026 11:39
@Lorak-mmk Lorak-mmk requested a review from wprzytula January 8, 2026 11:40
Thanks to this, the CC state field no longer needs to be pub(crate), which improves separation between `ClusterWorker` and `MetadataReader`.
Just a minor refactor for clarity.
I want to make ControlConnectionState the source of truth about the CC being connected or not, with the goal of ClusterWorker using that (instead of a mutable variable set to false when a fetch fails).
This requires setting the state to Broken in `fetch_metadata`, but we only have a `MetadataError` there when it fails (which makes sense, of course). To solve this, let's store the `MetadataError` in the CC state. It introduces a bit of code duplication, which I'll solve in the next commits.
Now `ControlConnectionState` should always be correct, and so there is no need to have a mutable variable in `ClusterWorker`.
The endpoint is another piece of state related to a specific connection. Let's store it in the state.

This slightly increases the amount of (cheap) clone calls. I'll solve that in the next commit.
The result of `make_control_connection` is always transformed into a `ControlConnectionState`. In order to reduce code repetition, and clone calls, we can return `ControlConnectionState` from `make_control_connection` directly.
This is nicer than having those modules prefixed with `metadata_`.
@Lorak-mmk Lorak-mmk force-pushed the metadata-improvements branch from 30a71a0 to fd72f91 Compare January 8, 2026 11:59
@Lorak-mmk Lorak-mmk requested a review from wprzytula January 8, 2026 12:32
Collaborator

@wprzytula wprzytula left a comment

Good job! 💯

On the way to robust cluster state management, both in Rust Driver and all wrapper drivers. 🚀

@Lorak-mmk Lorak-mmk merged commit 00b3399 into scylladb:main Jan 8, 2026
11 checks passed
@Lorak-mmk Lorak-mmk deleted the metadata-improvements branch January 8, 2026 12:48
@wprzytula wprzytula mentioned this pull request Jan 29, 2026

Development

Successfully merging this pull request may close these issues.

Insufficient guarantees regarding ordering of events and control connection breakage.
