Skip to content

scrimlet-reconcilers: add mgd BGP config reconciler#10565

Open
jgallagher wants to merge 5 commits into
john/feature-sled-agent-scrimlet-reconcilersfrom
john/scrimlet-reconcilers-mgd-bgp
Open

scrimlet-reconcilers: add mgd BGP config reconciler#10565
jgallagher wants to merge 5 commits into
john/feature-sled-agent-scrimlet-reconcilersfrom
john/scrimlet-reconcilers-mgd-bgp

Conversation

@jgallagher

Copy link
Copy Markdown
Contributor

This is a significant departure from the sync_switch_configuration bg task's implementation of BGP config reconciliation. While adding proptests (which tend to hit unusual cases - here, it was "changing the ASN that owns a given peer"), we discovered:

  1. mgd's bgp_apply endpoint is a little buggy (the specific thing I hit is fixed by bgp_apply fixes maghemite#773)
  2. The structure of the bgp_apply endpoint, because it takes the config for a single ASN, doesn't really allow for easy "remove the BGP config" or "change the config from ASN x to ASN y".

Based on oxidecomputer/maghemite#773 (comment), this PR doesn't use the bgp_apply endpoint, and instead makes multiple smaller calls to the various REST endpoints that comprise the BGP config. This would be much racier in Nexus, where different Nexus instances could be competing (potentially with out of date information if the config just changed), but here the one and only client is the local sled-agent, so we don't have to worry about that problem.

@jgallagher jgallagher added the networking Related to the networking. label Jun 8, 2026

@internet-diglett internet-diglett left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good to me, just a few things to clarify

Comment on lines +499 to +505
for (asn, config) in routers_to_create {
record_count!(
client.create_router(&config.to_mgd_router(*asn)).await,
routers_created,
format!("failed to create router with asn {asn}"),
);
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case where we fail to delete children, we don't delete the router.
Likewise, could it be argued that if we fail to create the router, we shouldn't try to create its children?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's fair. I didn't do that because it was a little messy (we'd have to keep track of which routers succeeded and which failed), and I assumed all those subsequent creations would also fail (more or less harmlessly). If you think it's worth doing I can see if I can make it not-too-messy.

Comment thread sled-agent/scrimlet-reconcilers/src/mgd_reconciler/bgp_reconciler.rs Outdated
Comment thread sled-agent/scrimlet-reconcilers/src/mgd_reconciler/bgp_reconciler.rs Outdated
Comment on lines +1569 to +1571
// TODO-correctness Nexus historically used the port name as the
// group name. Does this matter?
group: port_name.clone(),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +1608 to +1609
src_addr: None,
src_port: None,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note for myself to plumb this through on #10572

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

networking Related to the networking.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants