Skip to content

loop with DNS lookup when trying for SwitchLocation#7779

Merged
leftwo merged 4 commits into
mainfrom
alan/nexue-retry-dendrite
Mar 21, 2025
Merged

loop with DNS lookup when trying for SwitchLocation#7779
leftwo merged 4 commits into
mainfrom
alan/nexue-retry-dendrite

Conversation

@leftwo

@leftwo leftwo commented Mar 11, 2025

Copy link
Copy Markdown
Contributor

Currently, callers of map_switch_zone_addrs() first get the IP for ServiceName::Dendrite from DNS, then loop (forever) trying to translate that IP into a SwitchLocation. Under normal conditions, this is fine. However, if a sled has been expunged, or a new sled is being added, it's possible that what is returned in:

let switch_zone_addresses = match resolver
            .lookup_all_ipv6(ServiceName::Dendrite)
            .await

Will change. If that changes happens after we start looping in map_switch_zone_addrs(), then the loop will go on forever looking for something that is no longer correct.

To fix this we put the lookup_all_ipv6 into the loop by using the function switch_zone_address_mappings() instead. switch_zone_address_mappings()'s loop includes the call to lookup addresses in DNS will call map_switch_zone_addrs(). This allows us to include the DNS lookup inside the loop.

Most places where we called map_switch_zone_addrs() were also using the same lookup_all_ipv6() call, so transitioning them to call switch_zone_address_mappings() will just drop right in.

A fix for #7739

@leftwo leftwo marked this pull request as ready for review March 12, 2025 18:12
@leftwo leftwo changed the title WIP for switch zone retry loop with DNS lookup when trying for SwitchLocation Mar 12, 2025
@leftwo leftwo requested a review from internet-diglett March 13, 2025 23:52
@leftwo

leftwo commented Mar 13, 2025

Copy link
Copy Markdown
Contributor Author

Levon, This might be related to #5092
And, to #5201

This solution may not be all that you want to fix those, or, moving us in the wrong direction.

@davepacheco

Copy link
Copy Markdown
Collaborator

This might be a useful step but yeah I don't think it will fix the case of a cold boot of the rack while one Scrimlet is down. In that case, as I understand things, the control plane will not come up because of this, and repeating the DNS lookup won't help because DNS will still be reporting the same thing.

@leftwo leftwo added expunge expunge sled or disk issues Update System Replacing old bits with newer, cooler bits labels Mar 14, 2025
@leftwo leftwo self-assigned this Mar 19, 2025

@bnaecker bnaecker left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks OK to me, one suggestion about backoff but not blocking.

Comment thread nexus/src/app/mod.rs
}
Err(e) => {
warn!(log, "Failed to map switch zone addr: {e}, retrying");
tokio::time::sleep(std::time::Duration::from_secs(2)).await;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may want to use a backoff policy here, something like the "local service policy".

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 2 seconds is what the loop had before, so I'm not making it worse then I found it :)
I agree a backoff policy here would be better, but I could not quite figure out how to detangle
the two actions in this loop to fit into the retry setup and I want this change landed before 14
if possible.

Comment thread nexus/src/app/mod.rs Outdated
log,
"Failed to map switch zone addr: {e}"
);
tokio::time::sleep(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same note here, may want to back off more carefully.

Comment thread nexus/src/app/mod.rs Outdated
log,
"Failed to map switch zone addr: {e}"
);
tokio::time::sleep(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same, may want backoff.

@leftwo leftwo merged commit 1100ed4 into main Mar 21, 2025
@leftwo leftwo deleted the alan/nexue-retry-dendrite branch March 21, 2025 21:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

expunge expunge sled or disk issues Update System Replacing old bits with newer, cooler bits

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants