Skip to content

[dns-server] servers should be able to use older configs#8278

Merged
iximeow merged 2 commits into
mainfrom
ixi/upgrade-into-soa
Jun 5, 2025
Merged

[dns-server] servers should be able to use older configs#8278
iximeow merged 2 commits into
mainfrom
ixi/upgrade-into-soa

Conversation

@iximeow

@iximeow iximeow commented Jun 5, 2025

Copy link
Copy Markdown
Member

PR #8047 comes with an unfortunate upgrade-only bug: before an upgrade, a system's DNS servers would write out a configuration without the new serial field added in 8047. When upgraded, the DNS servers would then try to load that config, see it is missing a serial field, and error for every query.

We expect to replace the previous-format configuration immediately after upgrade by regenerating a blueprint for the current system and executing it. But we should be able to use the previous-format configuration anyway, so that DNS functions enough to get a control plane capable of planning and executing that blueprint.

I'm going to test out the upgrade flow imminently by hand somewhere, so we can be certain this will work on dogfood. Draft as I've not actually tried an upgrade with this change yet, but I'm very confident in it.

If we merge this, we should not merge #8272. If we merge #8272, then this will conflict with main, but would be a clean patch on top of a revert #8272 commit where both can go in as a new squash commit. I'm fine either way, more focused on making sure upgrades still.. work.

PR #8047 comes with an unfortunate upgrade-only bug: before an upgrade,
a system's DNS servers would write out a configuration without the new
`serial` field added in 8047. When upgraded, the DNS servers would then
try to load that config, see it is missing a `serial` field, and error
for every query.

We expect to replace the previous-format configuration immediately after
upgrade by regenerating a blueprint for the current system and executing
it. But we should be able to use the previous-format configuration
anyway, so that DNS functions enough to get a control plane capable of
planning and executing that blueprint.
@iximeow iximeow requested a review from davepacheco June 5, 2025 03:22
so handle finding an old config at this point as well
@iximeow iximeow marked this pull request as ready for review June 5, 2025 10:13
@iximeow

iximeow commented Jun 5, 2025

Copy link
Copy Markdown
Member Author

on dublin, i first clean-slated and loaded the same TUF from dogfood on 05/30: 7547bfa56. this came up, as expected. then i mupdated to 99ffcbe2b, same as dogfood the DNS configuration was unusable, so Nexus never came up. then i mupdated to fe447e8, the first commit in this branch. (handwaving a bit; one sled didn't update and techports were in a state i didn't fully understand, but eventually i got all four sleds in dublin to the right version.) at this point a few interesting things became apparent.

first, the control plane comes up. fe447e8 does let us use the old config format.

second, regenerating a blueprint from a current inventory and omdb nexus blueprints diff --exit-code current-target <regenerated blueprint> shows no change. this is correct, because in the new Nexus we reinterpret the effect of the prior blueprint from RSS. so, in this sense blueprint diff may be misleading at upgrade time, and i'll add a note to at least the dogfood notes about this.

the misleading diff is not actually an issue, because when we execute the blueprint we compute a diff against what's in CRDB, not the previous blueprint. we correctly determine there are changes to be made, and send those updates out.

third, the update does not become effectual in dublin's DNS servers. this wasn't obvious to debug; the log here just looks like

09:36:48.407Z INFO dns-server (store): attempting generation update                                                 
    new_generation = 3                                                                                                                                                                                                                   
    req_id = f116e6fb-dce6-4cb8-a62c-2ddee530db4f                                                                                                                                                                                        
09:36:48.407Z INFO dns-server (store): pruning trees for generations newer than 2                                                                                                                                                        
09:36:48.407Z INFO dns-server (store): pruning tree                                                                                                                                                                                      
    reason = too new                                                                                                                                                                                                                     
    tree_name = generation_3_zone_dublin.eng.oxide.computer                                                         
09:36:48.408Z ERRO dns-server (store): failed update                                                                                                                                                                                     
    error_message = internal error                                                                                                                                                                                                       
    new_generation = 3                                                                                                                                                                                                                   
    req_id = f116e6fb-dce6-4cb8-a62c-2ddee530db4f                                                                                                                                                                                        
09:36:48.408Z INFO dns-server (http): request completed                                                                                                                                                                                  
    error_message_external = Internal Server Error                                                                                                                                                                                       
    error_message_internal = internal error                                                                                                                                                                                              
    latency_us = 908                                                                                                                                                                                                                     
    local_addr = [fd00:1122:3344:101::4]:5353                                                                                                                                                                                            
    method = PUT                                                                                                                                                                                                                         
    remote_addr = [fd00:1122:3344:103::4]:64471                                                                                                                                                                                          
    req_id = f116e6fb-dce6-4cb8-a62c-2ddee530db4f                                                                                                                                                                                        
    response_code = 500                                                                                                                                                                                                                  
    uri = /config     

so i pulled the DNS server state from /data/dns and /var/svc/manifest/site/external_dns to my workstation to adjust dns-server and learn more. that led me pretty quickly to the second place we parse CurrentConfig, which also needs to be fallback-aware.

i think we ought to have a better #[error(...)] on InternalError than the current #[error("internal error")] - as is we accidentally forget about the actual error. i'll take care of that tomorrow too.

with the second CurrentConfig parsing spot also tolerating old configs, dns-server accepts a new config locally (and serves the SOA record! yay). if dublin hasn't been repurposed, with 9725f54 and a new blueprint we should see DNS configuration update as one would expect, with the new NS/SOA records present too.


i've been wondering how to test this, and i'm still not really sure. i kind of want to be able to grab an old dns-server in CI, set up some state against it, swap it for whatever new dns-server has been built, and make sure that it's basically functional. this is morally similar to Propolis' migrate-from-base tests, but i'm not quite sure how to plumb that up here. obviously discovering this on the far side of wedging dogfood is not.. good.

@davepacheco davepacheco left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for not catching this!

The intent here is that zones never have to deal with persistent data written by older (or newer) versions. This won't come up during a real upgrade, but unfortunately does come up with MUPdate.

@askfongjojo askfongjojo linked an issue Jun 5, 2025 that may be closed by this pull request
@iximeow iximeow merged commit 09ae666 into main Jun 5, 2025
16 checks passed
@iximeow iximeow deleted the ixi/upgrade-into-soa branch June 5, 2025 16:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

internal_dns went into maintenance during rack upgrade

2 participants