Skip to content

fix: expand contact point hostnames to all DNS IPs at connection time (DRIVER-201) — Part 1/2#889

Draft
nikagra wants to merge 3 commits into
scylladb:scylla-4.xfrom
nikagra:fix/DRIVER-201-contact-point-dns-expansion
Draft

fix: expand contact point hostnames to all DNS IPs at connection time (DRIVER-201) — Part 1/2#889
nikagra wants to merge 3 commits into
scylladb:scylla-4.xfrom
nikagra:fix/DRIVER-201-contact-point-dns-expansion

Conversation

@nikagra
Copy link
Copy Markdown

@nikagra nikagra commented May 15, 2026

Problem

When `RESOLVE_CONTACT_POINTS=false` (the default), a contact point hostname was stored as a single unresolved `InetSocketAddress`. At connection time the load-balancing query plan contained exactly one `Node` per hostname, so only the first IP returned by DNS was ever tried. If that IP was non-responsive the driver raised `AllNodesFailedException` with no fallback to other IPs the hostname might resolve to.

This is particularly impactful in dynamic DNS environments where a hostname maps to multiple nodes and the first one may be temporarily unavailable.

Fixes DRIVER-201.

Note: This is part 1 of 2, implementing the initial contact endpoints fix per @dkropachev's architectural direction. Part 2 will expand the `EndPoint` API with `resolveAll()` to address the broader connection path.

Changes

`DefaultDriverOption` / `TypedDriverOption`

  • `RESOLVE_CONTACT_POINTS` is now `@Deprecated`. Contact points are always kept as unresolved hostnames; the option is a no-op.

`SessionBuilder`

  • Removed the `RESOLVE_CONTACT_POINTS` read; `ContactPoints.merge()` is now always called with `resolve=false`, deferring all DNS expansion to connection time.

`MetadataManager`

  • New `getResolvedContactPoints()` method: for each contact point backed by an unresolved hostname, calls `InetAddress.getAllByName()` to obtain all IPs and creates a synthetic `DefaultNode` for each. Already-resolved and non-`InetSocketAddress` endpoints are returned as-is.

`LoadBalancingPolicyWrapper`

  • `newQueryPlan()` (in `BEFORE_INIT`/`DURING_INIT` states) and `newControlReconnectionQueryPlan()` now call `getResolvedContactPoints()` instead of `getContactPoints()`, so the query plan contains one entry per resolved IP. The driver naturally falls back to the next candidate if one is unreachable.

Tests

  • 4 new `MetadataManagerTest` cases covering: null state, already-resolved passthrough, single-hostname DNS expansion, multi-endpoint expansion.
  • `LoadBalancingPolicyWrapperTest` updated to stub `getResolvedContactPoints()`.
  • New `MockResolverIT.should_connect_when_first_dns_entry_is_non_responsive`: 2-node CCM cluster on `127.0.1.x`; first DNS entry points to a non-existent IP (`127.0.1.11`); asserts session opens successfully against the real nodes.

… (DRIVER-201)

Addresses the initial-contact-endpoints aspect of DRIVER-201.

Problem: with RESOLVE_CONTACT_POINTS=false (the default), a contact
point hostname was stored as a single unresolved InetSocketAddress.
At connection time the load-balancing query plan contained exactly one
Node per hostname, so only the first IP returned by DNS was ever tried.
If that IP was non-responsive the driver raised AllNodesFailedException
with no fallback to other IPs the hostname might resolve to.

Solution (per @dkropachev's architectural direction):
- Deprecate DefaultDriverOption.RESOLVE_CONTACT_POINTS. Contact points
  are now always kept as unresolved hostnames (resolve=false is hardcoded
  in SessionBuilder), deferring DNS expansion to connection time.
- Add MetadataManager.getResolvedContactPoints(): for each contact point
  backed by an unresolved hostname it calls InetAddress.getAllByName() to
  expand the hostname to all known IPs, creating a synthetic DefaultNode
  for each IP. Already-resolved or non-InetSocketAddress endpoints pass
  through unchanged.
- LoadBalancingPolicyWrapper now calls getResolvedContactPoints() instead
  of getContactPoints() in newQueryPlan() (BEFORE/DURING_INIT states) and
  newControlReconnectionQueryPlan(), so the query plan contains one node
  per resolved IP and the driver naturally falls back to the next IP when
  one is unreachable.

Tests:
- 4 new MetadataManagerTest cases covering null state, already-resolved
  passthrough, single-hostname expansion, and multi-endpoint expansion.
- LoadBalancingPolicyWrapperTest updated to stub getResolvedContactPoints().
- New MockResolverIT.should_connect_when_first_dns_entry_is_non_responsive
  integration test: first DNS entry is a non-existent IP, session must
  open successfully against the remaining real IPs.
nikagra added 2 commits May 15, 2026 20:37
…VER-201)

newControlReconnectionQueryPlan() now creates copies of the original
contact-point nodes (with their unresolved hostname endpoints) instead
of synthetic nodes with resolved IPs. This ensures the control channel
carries the hostname endpoint, which is preserved in metadata after
topology refresh.

DNS expansion for connection fallback is handled by ChannelFactory
(PR scylladb#890), so the control-reconnection path does not need to inject
resolved-IP nodes into the query plan.

Also adds getContactPoints() stub back to LoadBalancingPolicyWrapperTest
so tests that cover the control-reconnect path continue to pass.
Before-init query plan now uses getContactPoints() (original unresolved
hostname nodes) instead of getResolvedContactPoints(). The DNS expansion
to all IPs happens at the ChannelFactory level (PR scylladb#890), so expanding
here was redundant and broke should_connect_with_mocked_hostname by
replacing hostname endpoints with resolved-IP endpoints.

Also remove the should_connect_when_first_dns_entry_is_non_responsive
integration test from this PR; it belongs in PR scylladb#890 where ChannelFactory
expansion actually enables it to pass.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant