Skip to content

[nexus] ereport restart ID table with first-seen timestamps (good version)#10618

Merged
hawkw merged 33 commits into
mainfrom
eliza/ereport-restart-order-v2-final
Jun 18, 2026
Merged

[nexus] ereport restart ID table with first-seen timestamps (good version)#10618
hawkw merged 33 commits into
mainfrom
eliza/ereport-restart-order-v2-final

Conversation

@hawkw

@hawkw hawkw commented Jun 12, 2026

Copy link
Copy Markdown
Member

This branch is yet another attempt to come up with a mechanism for determining a partial ordering of ereport restart IDs. This time, we do it by maintaining a table of ereport restart IDs, along with the timestamp at which they were first seen. Unlike previous approaches, this table is not actually intended to be indexed by physical location or part number/serial number of the reporter. Instead, it is intended to be read as a sitrep analysis input, and used to compare two ereports by ID to see which restart ID was first observed earlier. The new table is populated by changing the query that inserts ereports into the database to a CTE which also inserts a record into the new ereporter_restart table, if one does not already exist for that restart ID.

This allows us to durably record the timestamp at which the first ereport from a reporter was observed, without requiring us to do a costly scan over all ereports at read-time. Storing the first-seen timestamp in a separate table also avoids the risk of having that timestamp change if earlier ereports are deleted.

This closes #10614, which it supersedes.

@hawkw hawkw left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some notes to reviewers.

Comment on lines +352 to +456
/// Returns a query for inserting ereports into the database and updating
/// the restart history table if the restart ID of the provided ereports is
/// not already present.
///
/// This is performed in a single atomic CTE, in order to ensure that the
/// earliest `time_collected` timestamp in the `ereports` table for a given
/// restart ID will always match the `time_first_seen` timestamp in the
/// restart history table, even if Nexus crashes midway through inserting
/// ereports.
fn ereports_insert_query(
restart_id: EreporterRestartUuid,
time_collected: chrono::DateTime<chrono::Utc>,
reporter: fm::Reporter,
ereports: Vec<Ereport>,
) -> impl RunnableQuery<i64> {
/// This is basically just a big pile of ceremony for combining two
/// little Diesel queries into a CTE...
struct EreportInsertQuery<IE, IR> {
insert_ereports: IE,
insert_reporter: IR,
slot: Option<SqlU16>,
}

impl<IE, IR> QueryId for EreportInsertQuery<IE, IR> {
type QueryId = ();
const HAS_STATIC_QUERY_ID: bool = false;
}

impl<IE, IR> Query for EreportInsertQuery<IE, IR> {
type SqlType = sql_types::BigInt;
}

impl<IE, IR> diesel::RunQueryDsl<DbConnection> for EreportInsertQuery<IE, IR> {}

impl<IE, IR> QueryFragment<Pg> for EreportInsertQuery<IE, IR>
where
IE: QueryFragment<Pg>,
IR: QueryFragment<Pg>,
{
fn walk_ast<'b>(
&'b self,
mut out: AstPass<'_, 'b, Pg>,
) -> QueryResult<()> {
out.push_sql("WITH inserted_ereports AS ( ");
self.insert_ereports.walk_ast(out.reborrow())?;

out.push_sql("), inserted_reporter AS (");
self.insert_reporter.walk_ast(out.reborrow())?;
out.push_sql(" ON CONFLICT (id) DO ");
// If we have a slot number, update it so that a previously-null
// slot number is filled in; if we do not, do nothing on
// conflict so a previously non-NULL slot is not clobbered.
if let Some(ref slot) = self.slot {
out.push_sql("UPDATE SET \"slot\" = ");
out.push_bind_param::<sql_types::Int4, _>(slot)?;
} else {
out.push_sql("NOTHING");
}
// We don't actually need this, but `WITH` clauses have to
// return something, sooo....
out.push_sql(" RETURNING id) ");
out.push_sql("SELECT count(*) FROM inserted_ereports");
Ok(())
}
}

let (reporter, slot_type, slot) = match reporter {
fm::Reporter::HostOs { slot, .. } => {
(EreporterType::Host, SpType::Sled, slot.map(SqlU16::from))
}
fm::Reporter::Sp { sp_type, slot } => {
let sp_type = sp_type.into();
let slot = SqlU16::from(slot);
(EreporterType::Sp, sp_type, Some(slot))
}
};

// The query fragment to insert ereports into the `ereport` table.
let insert_ereports = diesel::insert_into(dsl::ereport)
.values(ereports)
// Some or all of the ereports collected in this batch may already
// exist in the database because they were ingested by another
// Nexus. If the same ENAs exist for this restart ID, that's fine;
// don't overwrite them.
.on_conflict((dsl::restart_id, dsl::ena))
.do_nothing()
.returning(dsl::ena);
// Query fragment to insert the reporter restart entry into the
// ereporter restart table, or update the existing entry's slot column
// if one exists and the slot column is null. The null behavior will be
// added by the `walk_ast()` method on `EreporterInsertQuery`, because
// it depends on whether or not there is a slot number to insert, and I
// couldn't figure out how to get diesel to let me type erase an INSERT
// statement that may have one of multiple ON CONFLICT clauses...
let insert_reporter = diesel::insert_into(
restart_dsl::ereporter_restart,
)
.values(crate::db::model::EreporterRestart {
id: restart_id.into(),
time_first_seen: time_collected,
reporter,
slot_type,
slot: slot.map(SpMgsSlot::from),
});
EreportInsertQuery { insert_ereports, insert_reporter, slot }

@hawkw hawkw Jun 16, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is kind of the core of this change. We've taken the query for inserting ereports into the database and turned it into a CTE which both inserts ereports into the ereport table, and inserts the restart ID into the ereporter_restart table, if it has not already been added.

It's necessary for these two operations to be done atomically, because if they are non-atomic, it is possible for a Nexus to die, or go out to lunch for a bit, between inserting the ereport records and creating the ereporter_restart record for a tranche of ereports from a not-yet-seen restart ID. If this happens, another Nexus could insert some more ereports from that restart ID, and try to put something in the restart ID table with a timestamp that is later than the actual first time ereports from that ID were observed. Thus, the CTE --- because I wanted to see if I could do it without a transaction.

Comment on lines +1356 to +1362
datastore
.ereports_insert(
opctx,
host0_restart_id,
host0_first_seen,
fm::Reporter::HostOs { sled, slot: None },
vec![ereport_data(host0_restart_id, 1, host0_first_seen)],

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something I don't love about this change: now that the ereports_insert function requires the restart_id and time_collected to be passed as arguments to ereports_insert, they are duplicated between the method arguments and the multiple EreportData values passed in the iterator. This is a bit of a shame, since the values are now duplicated, and it's possible to pass ereport data that has different restart IDs or time_collecteds from the ones passed as arguments.

I have an additional change that factors these out of EreportData and always uses the value from the argument to ereports_insert to fill them in when turning the EreportDatas into db::model::Ereports. I felt like it would be better to open a separate PR for that, because it touches a bunch of otherwise unrelated files (mostly tests).

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. If we normalized these fields so that they aren't present in the ereport table, we would also avoid the duplication in the Rust API for it.

Comment on lines 341 to 348
let latest = self
.latest_ereport_id_on_conn(&conn, reporter)
.await
.map_err(|e| {
e.internal_context(format!(
"failed to refresh latest ereport ID for {reporter}"
"failed to refresh latest ereport ID for {reporter}",
))
})?;

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i thought about trying to fold this bit into the CTE, but realized it was not actually important to do that, and actually should be separate --- it means we are likelier to discover a subsequent restart should another Nexus have inserted a newer one while we were inserting ours.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's weird that the API of this function does not document that "latest" is being returned, which may not be the ereport we are inserting.

Admittedly I think that was a problem before this PR, but still

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that there ought to be a comment explaining this, and will add one.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you think of 3c289b5 ?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding, I appreciate it!

Comment on lines +310 to +320
// Check if this is a reporter we know about, and issue a
// warning if it is not.
let id = ereport.id();
if !builder.ereporter_restarts().contains_key(&id.restart_id) {
let msg = format!(
"ereport {id} has a restart ID not contained in the \
`ereporter_restart` table"
);
slog::warn!(&opctx.log, "{msg}");
warnings.push(msg);
}

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could completely throw out these ereports, but...it seems mean to me.

hawkw added 3 commits June 16, 2026 12:43
This was intended for use by `omdb`, but it's doing its own slightly
different thing.
@hawkw hawkw changed the title copy of ereport restart ID ordering table v2 final 2 usethis FINAL real version 2 FINAL [nexus] ereport restart ID table with first-seen timestamps (good version) Jun 16, 2026
@hawkw hawkw marked this pull request as ready for review June 16, 2026 20:03
@hawkw hawkw requested a review from smklein June 16, 2026 20:03
@hawkw hawkw requested a review from mergeconflict June 16, 2026 20:04
@hawkw hawkw added the fault-management Everything related to the fault-management initiative (RFD480 and others) label Jun 16, 2026
@hawkw hawkw added this to the 21 milestone Jun 16, 2026
Comment on lines +151 to 152
#[clap(long = "serial")]
serial: Option<String>,

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was always intended to be --serial rather than a positional argument, but i had forgotten to make it one ages ago. whoops. 😅

Comment thread schema/crdb/dbinit.sql

-- Table tracking the timestamp of the first ereport received from each restart
-- ID.
CREATE TABLE IF NOT EXISTS omicron.public.ereporter_restart (

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When are we deleting these rows? I'm trying to understand the full lifetime here, especially because this table will gather additional rows every time we reboot

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heh, that's a great question...we aren't. 😅

The reason for that is that we are also not deleting ereports yet, and I would like the lifetime of rows in this table to be tied to the lifetime of the ereports we need the restart record in order to interpret...ideally, I would like us to to delete rows in this table once the last ereport from a particular restart ID has been deleted and there is a newer observed restart in the table (implying that no subsequent ereports from the older restart will be ingested).

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like us to to delete rows in this table once the last ereport from a particular restart ID has been deleted and there is a newer observed restart in the table (implying that no subsequent ereports from the older restart will be ingested).

hrmmm I think that the precondition of "the last ereport from a particular restart ID" is a good start, but "there is a newer observed restart in the table" might not work.

What about a case where, for example, a sled dies and we expunge it? Then all ereports from the that "sled_id" will halt, and be gone forever?

If we aren't going to resolve such a thing in this PR, can we at least file an issue to track the eventual deletion of these rows?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree that the deletion conditions are more complex than only "there is a newer restart from the same location in the table". I was going to say that I would leave a comment about this table on "the existing issue for ereport deletion" but...I can't seem to find an existing issue. So I'll open one and try and discuss this PR there as well.

Comment on lines +11 to +21
-- Whether this ereport was generated by SP firmware or the host OS.
reporter omicron.public.ereporter_type NOT NULL,
-- The type of the physical slot occupied by the reporter.
slot_type omicron.public.sp_type NOT NULL,
-- The number of the physical slot occupied by the reporter.
--
-- For sled host OS reporters, this may be NULL if the sled's location is
-- not known to the system when the ereport was received. If the physical
-- location of the sled is determined later, subsequent attempts to insert
-- ereports will update this field.
slot INT4,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems kinda like a bummer that we're duplicating a bit of this "reporter / slot_type / slot" info between the ereports and this table. IMO ideally we'd have:

  • ereporter_restart contains this info, and a UUID
  • ereports reference the ereporter UUID

However, I'm not going to block the PR on such a change. Just seems like a place where the schema allows for skew, that we could prevent by using "just UUIDs" as the FK from ereport -> ereporter_restart.

(For example: what if omicron.public.ereport has restart_id = X, but says it's in slot 5, while ereporter_retart claims to have the same restart_id = X, but says it's in slot 6?)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, so the reason I put this data here is because I am hoping we can drop it from the ereport table in a subsequent PR, as it is static and should not change once a restart ID has been observed. I didn't want to do that in this branch, but I wanted to reserve the ability to do this later. I don't love that the current schema allows us to represent a bunch of ereports from a single restart ID which have multiple slots, because this should never actually happen.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On a related note, I'd like to be able to do a similar normalization to the part number/serial-number fields, since they could also retroactively be added to nullable fields in this table as they become known.

Unfortunately, while MGS learns of these from the SP in a separate exchange of messages, it currently "de-normalizes" the part number, serial number, and other Hubris metadata into fields on every ereport in the response. At the time, this felt like a good idea, but now, I'm regretting it. It would be nice to change the MGS API for this so that these fields were in a top-level field in the message rather than splatted out onto every ereport in the body. Then, we could take a single optional part number and serial number as arguments to ereports_insert and have that function set a field on the restart record if it is non-null. This would be much nicer because, again, it changes the schema to not represent nonsensical behaviors like serial numbers changing within a restart (which shouldn't happen; the only possible transition is for them to go from null to non-null).

Comment on lines 341 to 348
let latest = self
.latest_ereport_id_on_conn(&conn, reporter)
.await
.map_err(|e| {
e.internal_context(format!(
"failed to refresh latest ereport ID for {reporter}"
"failed to refresh latest ereport ID for {reporter}",
))
})?;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's weird that the API of this function does not document that "latest" is being returned, which may not be the ereport we are inserting.

Admittedly I think that was a problem before this PR, but still

Comment on lines +1356 to +1362
datastore
.ereports_insert(
opctx,
host0_restart_id,
host0_first_seen,
fm::Reporter::HostOs { sled, slot: None },
vec![ereport_data(host0_restart_id, 1, host0_first_seen)],

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +20 to +21
-- the (schema-prohibited, domain-impossible) case of a single restart ID having
-- inconsistent reporter/slot_type values across its ereports.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with your analysis here (and the usage of "MAX" / "ON CONFLICT DO NOTHING") - as long as we don't have a single restart ID claiming to have multiple potential slots. Reaaaaaally hope that isn't happening in the field anywhere!

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. Luckily basically the only way I can think of that happening with the current code is if we generated a colliding restart ID, which...I really hope doesn't happen.

@hawkw hawkw enabled auto-merge (squash) June 18, 2026 19:52
@hawkw hawkw merged commit 9fc857e into main Jun 18, 2026
19 checks passed
@hawkw hawkw deleted the eliza/ereport-restart-order-v2-final branch June 18, 2026 21:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

fault-management Everything related to the fault-management initiative (RFD480 and others)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants