coordinated Nexus quiesce#9010
Conversation
|
This is now ready for review. Note that there should be minimal risk / impact to existing systems by landing this PR because existing systems do not quiesce until after #8936. By the time we do that, I hope we'll have more complete testing in place (e.g., a successful live test). I need to put together a few more PRs for that. |
|
|
||
| let db_nexus_ids: BTreeSet<_> = nexus_ids | ||
| .iter() | ||
| .cloned() |
There was a problem hiding this comment.
Nit - since UUIDs impl Copy, I think this could be
| .cloned() | |
| .copied() |
(which is not different in terms of generated code but it's more obvious that this is cheap)
| Ok(count) | ||
| } | ||
|
|
||
| /// Updates the "last_drained_blueprint_id" for the given Nexus id |
There was a problem hiding this comment.
Wrong docstring on this method?
| } | ||
| } | ||
|
|
||
| pub async fn extra_datastore(&self, log: &Logger) -> Arc<DataStore> { |
There was a problem hiding this comment.
Why do we need this, as opposed to cloning the return value of datastore()?
There was a problem hiding this comment.
The datastore (really, the underlying pool) has in-memory state associated with being quiesced. In our test, we want independent instances of this so that they can be quiesced independently.
edit: I'll add a comment about this.
There was a problem hiding this comment.
Ah, fair enough. Probably worth at least a doc comment noting how this is different? I'm tempted to suggest a more descriptive name too, but I'm not sure what. independent_datastore() maybe?
| /// Channel for TUF repository artifacts to be replicated out to sleds | ||
| pub tuf_artifact_replication_rx: mpsc::Receiver<ArtifactsWithPlan>, | ||
| /// Channel for exposing the latest loaded blueprint | ||
| pub blueprint_load_tx: |
There was a problem hiding this comment.
Just making sure I understand: we have this field now because in nexus/app/mod.rs, we need to construct this channel to get a handle to the receiver before we've set up the background task system?
| .map_err(|e| public_error_from_diesel(e, ErrorHandler::Server)) | ||
| } | ||
|
|
||
| /// Updates the "last_drained_blueprint_id" for the given Nexus id |
There was a problem hiding this comment.
Nit, should we update this to return a bool? It's always going to be 0 or 1 rows updated, right?
Alternatively - we could just return an error if the "count == 0", right?
There was a problem hiding this comment.
I don't think we want to return an error for the 0 case because that could just mean we've already updated it to this blueprint id and that's fine. The caller needs to retry errors, but not that case.
I can see the appeal of the bool. I want to log the value either way, which feels slightly more idiomatic at the caller level, so I'm going to leave it.
| pub async fn database_nexus_access_update_quiesced( | ||
| &self, | ||
| nexus_id: OmicronZoneUuid, | ||
| ) -> Result<usize, Error> { |
There was a problem hiding this comment.
Same comment here about returning a usize - since we're indexing on nexus_id already, seems like we'll either perform the update successfully or not, and can return that more idiomatically than a usize.
Implements coordinated Nexus quiesce per RFD 588.
This PR only changes the existing Nexus quiesce process to do what RFD 588 documents. There are several other things that need to happen for a successful handoff. I will create follow-on PRs for those.
Fixes #8859
Fixes #8857
Fixes #8796
Fixes #8795
Fixes #8971
Fixes #8858