coordinated Nexus quiesce by davepacheco · Pull Request #9010 · oxidecomputer/omicron

davepacheco · 2025-09-06T04:36:26Z

Implements coordinated Nexus quiesce per RFD 588.

This PR only changes the existing Nexus quiesce process to do what RFD 588 documents. There are several other things that need to happen for a successful handoff. I will create follow-on PRs for those.

Fixes #8859
Fixes #8857
Fixes #8796
Fixes #8795
Fixes #8971
Fixes #8858

davepacheco · 2025-09-15T20:38:30Z

With this PR as of f7c92b3, I was able to do a semi-manual handoff and also write a working live test. However, I had to fix several other issues along the way, some of which overlap with #8936. I'll split these out into separate PRs and coordinate with @smklein.

davepacheco · 2025-09-15T22:28:36Z

This is now ready for review.

Landing this PR should pose minimal risk to existing systems, because existing systems do not quiesce until after #8936. By the time we do that, I hope we'll have more complete testing in place (e.g., a successful live test). I need to put together a few more PRs for that.

jgallagher · 2025-09-16T17:29:25Z

+
+        let db_nexus_ids: BTreeSet<_> = nexus_ids
+            .iter()
+            .cloned()


Nit - since UUIDs impl Copy, I think this could be

Suggested change

-            .cloned()
+            .copied()

(The generated code is no different, but .copied() makes it more obvious that this is cheap.)

Fixed in e8eb204.
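
For readers following the nit, here is a minimal sketch of the difference. `NexusId` is a hypothetical stand-in for a `Copy` UUID newtype, not the type used in the PR. Both functions build the same set; `copied()` only compiles when the element type is `Copy`, which is what makes the cheapness visible at the call site.

```rust
use std::collections::BTreeSet;

// Hypothetical stand-in for a UUID newtype. Like typical UUID types, it is `Copy`.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct NexusId(u128);

// Works for any `Clone` type, so a reader cannot tell whether each clone is cheap.
fn collect_cloned(ids: &[NexusId]) -> BTreeSet<NexusId> {
    ids.iter().cloned().collect()
}

// Requires `NexusId: Copy`, so each element is known to be a plain bitwise copy.
fn collect_copied(ids: &[NexusId]) -> BTreeSet<NexusId> {
    ids.iter().copied().collect()
}
```

Calling either function on the same slice yields an identical `BTreeSet`, with duplicates collapsed.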

jgallagher · 2025-09-16T17:31:16Z

+        Ok(count)
+    }
+
+    /// Updates the "last_drained_blueprint_id" for the given Nexus id


Wrong docstring on this method?

Good catch. Fixed in e8eb204.

jgallagher · 2025-09-16T17:32:55Z

        }
    }

+    pub async fn extra_datastore(&self, log: &Logger) -> Arc<DataStore> {


Why do we need this, as opposed to cloning the return value of datastore()?

The datastore (really, the underlying pool) has in-memory state associated with being quiesced. In our test, we want independent instances of this so that they can be quiesced independently.

edit: I'll add a comment about this.

Ah, fair enough. Probably worth at least a doc comment noting how this is different? I'm tempted to suggest a more descriptive name too, but I'm not sure what. independent_datastore() maybe?

Comment added in e8eb204.
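
A rough illustration of the distinction discussed above, using only the standard library. `FakeDataStore` and its `quiesced` flag are hypothetical stand-ins for the real datastore and pool state. The point is only that cloning an `Arc` shares the in-memory quiesce state, while a separately constructed instance does not.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for a datastore whose pool keeps quiesce state in memory.
struct FakeDataStore {
    quiesced: AtomicBool,
}

impl FakeDataStore {
    fn new() -> Arc<Self> {
        Arc::new(FakeDataStore { quiesced: AtomicBool::new(false) })
    }

    fn quiesce(&self) {
        self.quiesced.store(true, Ordering::SeqCst);
    }

    fn is_quiesced(&self) -> bool {
        self.quiesced.load(Ordering::SeqCst)
    }
}

/// Quiesces one datastore and reports whether (a) an `Arc` clone of it and
/// (b) an independently constructed instance observe that quiesce.
fn quiesce_visibility() -> (bool, bool) {
    let original = FakeDataStore::new();
    let shared = Arc::clone(&original); // same underlying state
    let independent = FakeDataStore::new(); // separate state

    original.quiesce();
    (shared.is_quiesced(), independent.is_quiesced())
}
```

In this sketch the clone observes the quiesce and the independent instance does not, which is the property the test needs in order to quiesce instances separately.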

jgallagher · 2025-09-16T17:37:46Z

    /// Channel for TUF repository artifacts to be replicated out to sleds
    pub tuf_artifact_replication_rx: mpsc::Receiver<ArtifactsWithPlan>,
+    /// Channel for exposing the latest loaded blueprint
+    pub blueprint_load_tx:


Just making sure I understand: we have this field now because in nexus/app/mod.rs, we need to construct this channel to get a handle to the receiver before we've set up the background task system?
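
If that reading is right, the ordering problem would look roughly like the sketch below. Everything here is an assumption for illustration: `QuiesceWatcher` and `BackgroundTaskChannels` are hypothetical names, a `u64` stands in for a blueprint, and `std::sync::mpsc` stands in for whatever channel type the PR actually uses. The channel is created first so the consumer can hold the receiver, and the sender is handed to the background task side when it starts later.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical consumer that needs the receiving end at construction time.
struct QuiesceWatcher {
    blueprint_rx: mpsc::Receiver<u64>,
}

// Hypothetical bundle of channel endpoints passed to background tasks later.
struct BackgroundTaskChannels {
    blueprint_load_tx: mpsc::Sender<u64>,
}

fn run() -> Vec<u64> {
    // 1. Create the channel before any background task exists.
    let (tx, rx) = mpsc::channel();

    // 2. The consumer can be built right away because it only needs the receiver.
    let watcher = QuiesceWatcher { blueprint_rx: rx };

    // 3. Later, the background task system starts and is given the sender.
    let channels = BackgroundTaskChannels { blueprint_load_tx: tx };
    let loader = thread::spawn(move || {
        for blueprint in 1..=3u64 {
            channels
                .blueprint_load_tx
                .send(blueprint)
                .expect("receiver should still be alive");
        }
        // `channels` is dropped here, which closes the channel.
    });
    loader.join().expect("loader thread panicked");

    // The receiver drains everything sent, then stops once the sender is gone.
    watcher.blueprint_rx.iter().collect()
}
```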

smklein · 2025-09-16T00:06:23Z

+            .map_err(|e| public_error_from_diesel(e, ErrorHandler::Server))
+    }
+
+    /// Updates the "last_drained_blueprint_id" for the given Nexus id


Nit, should we update this to return a bool? It's always going to be 0 or 1 rows updated, right?

Alternatively - we could just return an error if the "count == 0", right?

I don't think we want to return an error for the 0 case because that could just mean we've already updated it to this blueprint id and that's fine. The caller needs to retry errors, but not that case.

I can see the appeal of the bool. I want to log the value either way, which feels slightly more idiomatic at the caller level, so I'm going to leave it.
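
To make the trade-off concrete, here is a minimal in-memory sketch of the behavior described above. `FakeDb`, `UpdateError`, and the conditional-update logic are assumptions for illustration, not the PR's Diesel query. A count of zero is a normal outcome (nothing needed to change), errors are the only case the caller retries, and the caller logs the count either way.

```rust
use std::collections::BTreeMap;

// Hypothetical error type standing in for a transient database failure.
#[derive(Debug, PartialEq)]
enum UpdateError {
    Unavailable,
}

// Hypothetical in-memory table: Nexus id -> last drained blueprint id.
struct FakeDb {
    available: bool,
    last_drained: BTreeMap<u32, u64>,
}

impl FakeDb {
    /// Returns the number of rows changed (0 or 1). Zero is not an error:
    /// it can mean the row already holds this blueprint id.
    fn update_last_drained(
        &mut self,
        nexus_id: u32,
        blueprint_id: u64,
    ) -> Result<usize, UpdateError> {
        if !self.available {
            return Err(UpdateError::Unavailable);
        }
        match self.last_drained.get_mut(&nexus_id) {
            Some(current) => {
                if *current == blueprint_id {
                    Ok(0)
                } else {
                    *current = blueprint_id;
                    Ok(1)
                }
            }
            None => Ok(0),
        }
    }
}

/// What a caller might log: the count on success, a retry note on error.
fn describe(result: &Result<usize, UpdateError>) -> String {
    match result {
        Ok(count) => format!("updated {} row(s); done", count),
        Err(e) => format!("error {:?}; will retry", e),
    }
}
```

A `bool` return would carry the same information for a single-row update; the sketch keeps `usize` only to mirror the signature shown in the diff.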

smklein · 2025-09-16T00:07:59Z

+    pub async fn database_nexus_access_update_quiesced(
+        &self,
+        nexus_id: OmicronZoneUuid,
+    ) -> Result<usize, Error> {


Same comment here about returning a usize - since we're indexing on nexus_id already, seems like we'll either perform the update successfully or not, and can return that more idiomatically than a usize.

davepacheco added 8 commits September 5, 2025 16:22

WIP: pretty good changes so far

6545d73

WIP: cont

cba808d

pass OpContext through usefully

8dce7c1

start fixing/writing tests

7ad5d55

fix test

40f75ac

WIP: live test for handoff

b57236b

Merge branch 'main' into dap/quiesce-with-db

ee285d7

final timeout was too short due to transient Cockroach issue on a4x2

f7c92b3

davepacheco added 2 commits September 15, 2025 14:04

remove stuff that will go into separate PRs

d45a995

clean up / refactor

bdf36a7

davepacheco requested review from jgallagher and smklein September 15, 2025 22:24

davepacheco self-assigned this Sep 15, 2025

davepacheco marked this pull request as ready for review September 15, 2025 22:28

fix wrong comment

0830d57

davepacheco mentioned this pull request Sep 16, 2025

write correct db_metadata_nexus records during blueprint execution #9023

Merged

Merge remote-tracking branch 'origin/main' into dap/quiesce-with-db

fa6d78f

jgallagher reviewed Sep 16, 2025

review feedback

e8eb204

jgallagher approved these changes Sep 16, 2025

smklein reviewed Sep 16, 2025

davepacheco enabled auto-merge (squash) September 16, 2025 18:37

davepacheco merged commit 2163cfa into main Sep 16, 2025
16 checks passed

davepacheco deleted the dap/quiesce-with-db branch September 16, 2025 22:42

charliepark pushed a commit that referenced this pull request Sep 19, 2025

coordinated Nexus quiesce (#9010)

4567fc5

3 participants
