Add omdb command to forcefully quiesce db_metadata_nexus records#9036

Merged
smklein merged 8 commits into main from omdb_nexus_quiesce on Sep 30, 2025

smklein commented Sep 16, 2025

Collaborator

Builds on #9034

Adds a command to handle permanent Nexus failure during handoff.

Fixes #9008

Comment thread nexus/db-queries/src/db/datastore/db_metadata.rs Outdated
Base automatically changed from omdb_nexus_gen to main September 19, 2025 00:58

davepacheco left a comment

Collaborator

It would be nice to do a confirmation and maybe an attempt to check whether Nexus is up, similar to what we do for dangerous saga operations:

    // Add a confirmation prompt reminding the caller of the risks of this
    // injection
    let text = r#"
WARNING: Injecting an error into a saga will (hopefully) cause it to be
unwound, but if the actions into which errors are injected have taken effect,
those effects will not be undone. This can result in corruption of control
plane state, even if the Nexus assigned to this saga is not currently running.
You should only do this if:
- you've stopped Nexus and then verified that the currently-running nodes
either have no side effects, have not made any changes to the system, or
you've already undone them by hand
- this is a development system whose state can be wiped
"#;
    if should_print_color {
        println!("{}", text.red().bold());
    } else {
        println!("{text}");
    }

    // Before doing anything: find the current SEC for the saga, and ping it to
    // ensure that the Nexus is down.
    if !args.bypass_sec_check {
        let saga: Saga = {
            use nexus_db_schema::schema::saga::dsl;
            dsl::saga
                .filter(dsl::id.eq(args.saga_id))
                .first_async(&*conn)
                .await?
        };
        let status = get_saga_sec_status(omdb, opctx, &saga).await;
        status.display_message(should_print_color);
        status.into_result()?;
    } else {
        let text = "Skipping check of whether the Nexus assigned to this saga \
            is running. If this Nexus is running, the control plane state managed \
            by this saga may become corrupted!";
        if should_print_color {
            println!("{}", text.red().bold());
        } else {
            println!("{text}");
        }
    }

    // Before making any changes, ask for confirmation
    let mut prompt = ConfirmationPrompt::new();
    prompt.read_and_validate("y/N", "y")?;
    drop(prompt);

There are enough warning signs here, and I don't think this is likely a tool that people will stumble on accidentally, that I wouldn't call this a blocker.

Comment thread dev-tools/omdb/src/bin/omdb/db/db_metadata.rs Outdated
Comment thread dev-tools/omdb/src/bin/omdb/db/db_metadata.rs Outdated

smklein commented Sep 29, 2025

Collaborator Author

I went ahead and incorporated a confirmation prompt, but this breaks automated tests, so I also added an optional skip_confirmation argument.
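
For readers following along, here is a minimal sketch of how a confirmation prompt with a skip flag could look. This is not the code merged in this PR: `confirm_destructive`, its parameters, and the warning text are hypothetical names chosen for illustration, and the real omdb command uses its own `ConfirmationPrompt` helper. The sketch takes the input and output streams as parameters so that tests can either feed a scripted answer or bypass the prompt entirely.

```rust
use std::io::{self, BufRead, Cursor, Write};

/// Hypothetical helper (not the omdb API): prints a warning and returns
/// Ok(true) only if the caller may proceed with the destructive operation.
fn confirm_destructive<R: BufRead, W: Write>(
    mut input: R,
    mut output: W,
    warning: &str,
    skip_confirmation: bool,
) -> io::Result<bool> {
    // Always show the warning, even when the prompt itself is skipped.
    writeln!(output, "{warning}")?;

    if skip_confirmation {
        // Automated tests have no interactive terminal, so they opt out.
        writeln!(output, "Skipping confirmation as requested.")?;
        return Ok(true);
    }

    write!(output, "Proceed? [y/N] ")?;
    output.flush()?;

    // Default to "no": only an explicit "y" counts as consent.
    let mut line = String::new();
    input.read_line(&mut line)?;
    Ok(line.trim() == "y")
}

fn main() -> io::Result<()> {
    // Placeholder warning text, assumed for this example.
    let warning = "WARNING: this operation forcibly changes handoff state.";

    // Scripted input stands in for an operator typing at a terminal.
    let yes = confirm_destructive(Cursor::new("y\n"), io::sink(), warning, false)?;
    let no = confirm_destructive(Cursor::new("\n"), io::sink(), warning, false)?;
    let skipped = confirm_destructive(Cursor::new(""), io::sink(), warning, true)?;

    println!("yes={yes} no={no} skipped={skipped}");
    Ok(())
}
```

The default-deny behavior mirrors the "y/N" convention in the saga code quoted above: an empty line or anything other than "y" aborts.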

smklein enabled auto-merge (squash) September 29, 2025 23:45
smklein merged commit 6871607 into main Sep 30, 2025
16 checks passed
smklein deleted the omdb_nexus_quiesce branch September 30, 2025 01:56

Development

Successfully merging this pull request may close these issues.

[nexus handoff] omdb commands for viewing, forcibly updating the handoff state