[nexus] ereport restart ID table with first-seen timestamps (good version) by hawkw · Pull Request #10618 · oxidecomputer/omicron

hawkw · 2026-06-12T17:33:09Z

This branch is yet another attempt to come up with a mechanism for determining a partial ordering of ereport restart IDs. This time, we do it by maintaining a table of ereport restart IDs, along with the timestamp at which they were first seen. Unlike previous approaches, this table is not actually intended to be indexed by physical location or part number/serial number of the reporter. Instead, it is intended to be read as a sitrep analysis input, and used to compare two ereports by ID to see which restart ID was first observed earlier. The new table is populated by changing the query that inserts ereports into the database to a CTE which also inserts a record into the new ereporter_restart table, if one does not already exist for that restart ID.

This allows us to durably record the timestamp at which the first ereport from a reporter was observed, without requiring us to do a costly scan over all ereports at read-time. Storing the first-seen timestamp in a separate table also avoids the risk of having that timestamp change if earlier ereports are deleted.
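
To illustrate the comparison described above, here is a minimal sketch in plain Rust. `RestartId`, `RestartTable`, and `compare_restarts` are hypothetical names, timestamps are plain integers rather than `chrono` values, and none of this is the code in this branch; it only shows why the result is a partial ordering.

```rust
use std::cmp::Ordering;
use std::collections::HashMap;

/// Hypothetical stand-in for a restart ID (the PR uses a typed UUID).
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct RestartId(pub u128);

/// Hypothetical stand-in for the loaded `ereporter_restart` rows:
/// restart ID -> first-seen time, here as seconds since the Unix epoch.
pub type RestartTable = HashMap<RestartId, i64>;

/// Compare two restart IDs by when each was first observed.
///
/// Returns `None` when the two cannot be ordered: either ID is missing
/// from the table, or two distinct IDs share the same first-seen time.
pub fn compare_restarts(
    table: &RestartTable,
    a: RestartId,
    b: RestartId,
) -> Option<Ordering> {
    if a == b {
        return Some(Ordering::Equal);
    }
    let first_a = table.get(&a)?;
    let first_b = table.get(&b)?;
    match first_a.cmp(first_b) {
        // Distinct restarts with identical timestamps are incomparable.
        Ordering::Equal => None,
        ord => Some(ord),
    }
}
```

A caller would treat `None` as "unknown order" rather than as equality, which is roughly what the unknown-restart-ID warning in the loader is guarding against.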

This closes #10614, which it supersedes.

hawkw

Some notes to reviewers.

hawkw · 2026-06-16T19:25:47Z

+    /// Returns a query for inserting ereports into the database and updating
+    /// the restart history table if the restart ID of the provided ereports is
+    /// not already present.
+    ///
+    /// This is performed in a single atomic CTE, in order to ensure that the
+    /// earliest `time_collected` timestamp in the `ereports` table for a given
+    /// restart ID will always match the `time_first_seen` timestamp in the
+    /// restart history table, even if Nexus crashes midway through inserting
+    /// ereports.
+    fn ereports_insert_query(
+        restart_id: EreporterRestartUuid,
+        time_collected: chrono::DateTime<chrono::Utc>,
+        reporter: fm::Reporter,
+        ereports: Vec<Ereport>,
+    ) -> impl RunnableQuery<i64> {
+        /// This is basically just a big pile of ceremony for combining two
+        /// little Diesel queries into a CTE...
+        struct EreportInsertQuery<IE, IR> {
+            insert_ereports: IE,
+            insert_reporter: IR,
+            slot: Option<SqlU16>,
+        }
+
+        impl<IE, IR> QueryId for EreportInsertQuery<IE, IR> {
+            type QueryId = ();
+            const HAS_STATIC_QUERY_ID: bool = false;
+        }
+
+        impl<IE, IR> Query for EreportInsertQuery<IE, IR> {
+            type SqlType = sql_types::BigInt;
+        }
+
+        impl<IE, IR> diesel::RunQueryDsl<DbConnection> for EreportInsertQuery<IE, IR> {}
+
+        impl<IE, IR> QueryFragment<Pg> for EreportInsertQuery<IE, IR>
+        where
+            IE: QueryFragment<Pg>,
+            IR: QueryFragment<Pg>,
+        {
+            fn walk_ast<'b>(
+                &'b self,
+                mut out: AstPass<'_, 'b, Pg>,
+            ) -> QueryResult<()> {
+                out.push_sql("WITH inserted_ereports AS ( ");
+                self.insert_ereports.walk_ast(out.reborrow())?;
+
+                out.push_sql("), inserted_reporter AS (");
+                self.insert_reporter.walk_ast(out.reborrow())?;
+                out.push_sql(" ON CONFLICT (id) DO ");
+                // If we have a slot number, update it so that a previously-null
+                // slot number is filled in; if we do not, do nothing on
+                // conflict so a previously non-NULL slot is not clobbered.
+                if let Some(ref slot) = self.slot {
+                    out.push_sql("UPDATE SET \"slot\" = ");
+                    out.push_bind_param::<sql_types::Int4, _>(slot)?;
+                } else {
+                    out.push_sql("NOTHING");
+                }
+                // We don't actually need this, but `WITH` clauses have to
+                // return something, sooo....
+                out.push_sql(" RETURNING id) ");
+                out.push_sql("SELECT count(*) FROM inserted_ereports");
+                Ok(())
+            }
+        }
+
+        let (reporter, slot_type, slot) = match reporter {
+            fm::Reporter::HostOs { slot, .. } => {
+                (EreporterType::Host, SpType::Sled, slot.map(SqlU16::from))
+            }
+            fm::Reporter::Sp { sp_type, slot } => {
+                let sp_type = sp_type.into();
+                let slot = SqlU16::from(slot);
+                (EreporterType::Sp, sp_type, Some(slot))
+            }
+        };
+
+        // The query fragment to insert ereports into the `ereport` table.
+        let insert_ereports = diesel::insert_into(dsl::ereport)
+            .values(ereports)
+            // Some or all of the ereports collected in this batch may already
+            // exist in the database because they were ingested by another
+            // Nexus. If the same ENAs exist for this restart ID, that's fine;
+            // don't overwrite them.
+            .on_conflict((dsl::restart_id, dsl::ena))
+            .do_nothing()
+            .returning(dsl::ena);
+        // Query fragment to insert the reporter restart entry into the
+        // ereporter restart table, or update the existing entry's slot column
+        // if one exists and the slot column is null. The null behavior will be
+        // added by the `walk_ast()` method on `EreportInsertQuery`, because
+        // it depends on whether or not there is a slot number to insert, and I
+        // couldn't figure out how to get diesel to let me type erase an INSERT
+        // statement that may have one of multiple ON CONFLICT clauses...
+        let insert_reporter = diesel::insert_into(
+            restart_dsl::ereporter_restart,
+        )
+        .values(crate::db::model::EreporterRestart {
+            id: restart_id.into(),
+            time_first_seen: time_collected,
+            reporter,
+            slot_type,
+            slot: slot.map(SpMgsSlot::from),
+        });
+        EreportInsertQuery { insert_ereports, insert_reporter, slot }


This is kind of the core of this change. We've taken the query for inserting ereports into the database and turned it into a CTE which both inserts ereports into the ereport table, and inserts the restart ID into the ereporter_restart table, if it has not already been added.

It's necessary for these two operations to be done atomically, because if they are non-atomic, it is possible for a Nexus to die, or go out to lunch for a bit, between inserting the ereport records and creating the ereporter_restart record for a tranche of ereports from a not-yet-seen restart ID. If this happens, another Nexus could insert some more ereports from that restart ID, and try to put something in the restart ID table with a timestamp that is later than the actual first time ereports from that ID were observed. Thus, the CTE --- because I wanted to see if I could do it without a transaction.
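
For readers who want to see the overall shape of the statement, here is a rough sketch that assembles similar SQL text with plain string building. The two inner `INSERT` statements, their column lists, and the `$n` placeholders are illustrative assumptions; in the PR those fragments are generated by Diesel, and `ereports_insert_sql` is a hypothetical name.

```rust
/// Build the text of a CTE shaped like the one described above.
///
/// The column lists and `$n` placeholders in the two INSERTs are
/// illustrative only; they are not the statements Diesel generates.
pub fn ereports_insert_sql(has_slot: bool) -> String {
    let mut sql = String::new();
    // First CTE arm: insert the ereports, ignoring ones already present.
    sql.push_str("WITH inserted_ereports AS ( ");
    sql.push_str(
        "INSERT INTO ereport (restart_id, ena, time_collected) \
         VALUES ($1, $2, $3) \
         ON CONFLICT (restart_id, ena) DO NOTHING RETURNING ena",
    );
    // Second CTE arm: record the restart ID if it is not yet known.
    sql.push_str("), inserted_reporter AS (");
    sql.push_str(
        "INSERT INTO ereporter_restart (id, time_first_seen, slot) \
         VALUES ($1, $3, $4)",
    );
    sql.push_str(" ON CONFLICT (id) DO ");
    if has_slot {
        // A known slot is written to the existing row.
        sql.push_str("UPDATE SET \"slot\" = $4");
    } else {
        // No slot to offer, so leave any existing row untouched.
        sql.push_str("NOTHING");
    }
    sql.push_str(" RETURNING id) ");
    // The statement as a whole reports how many ereports were new.
    sql.push_str("SELECT count(*) FROM inserted_ereports");
    sql
}
```

Because both inserts live in one statement, the database either applies both or neither, which is the property the comment above relies on.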

hawkw · 2026-06-16T19:30:55Z

+        datastore
+            .ereports_insert(
+                opctx,
+                host0_restart_id,
+                host0_first_seen,
+                fm::Reporter::HostOs { sled, slot: None },
+                vec![ereport_data(host0_restart_id, 1, host0_first_seen)],


This is something I don't love about this change: now that ereports_insert takes restart_id and time_collected as arguments, those values are duplicated between the method arguments and each of the EreportData values passed in the iterator. That's a bit of a shame, since it is now possible to pass ereport data whose restart ID or time_collected differs from the ones passed as arguments.

I have an additional change that factors these out of EreportData and always uses the value from the argument to ereports_insert to fill them in when turning the EreportDatas into db::model::Ereports. I felt like it would be better to open a separate PR for that, because it touches a bunch of otherwise unrelated files (mostly tests).

This seems somewhat related to https://github.com/oxidecomputer/omicron/pull/10618/changes#r3424619788 as well?

Yeah. If we normalized these fields so that they aren't present in the ereport table, we would also avoid the duplication in the Rust API for it.
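
A minimal sketch of the factoring discussed in this thread, assuming simplified types: `EreportPayload`, `EreportRow`, and `to_rows` are hypothetical names, and the fields are reduced to a few primitives. The point is only that the batch-wide values have a single source, so a mismatch cannot be expressed.

```rust
/// Hypothetical per-ereport payload with the batch-wide fields removed.
#[derive(Clone, Debug, PartialEq)]
pub struct EreportPayload {
    pub ena: u64,
    pub class: Option<String>,
}

/// Hypothetical database row. `restart_id` and `time_collected` are
/// always copied from the batch arguments, never from the payload.
#[derive(Clone, Debug, PartialEq)]
pub struct EreportRow {
    pub restart_id: u128,
    pub time_collected: i64,
    pub ena: u64,
    pub class: Option<String>,
}

/// Turn a batch of payloads into rows, stamping every row with the
/// same restart ID and collection time.
pub fn to_rows(
    restart_id: u128,
    time_collected: i64,
    payloads: Vec<EreportPayload>,
) -> Vec<EreportRow> {
    payloads
        .into_iter()
        .map(|p| EreportRow {
            restart_id,
            time_collected,
            ena: p.ena,
            class: p.class,
        })
        .collect()
}
```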

hawkw · 2026-06-16T19:38:29Z

        let latest = self
            .latest_ereport_id_on_conn(&conn, reporter)
            .await
            .map_err(|e| {
                e.internal_context(format!(
-                    "failed to refresh latest ereport ID for {reporter}"
+                    "failed to refresh latest ereport ID for {reporter}",
                ))
            })?;


i thought about trying to fold this bit into the CTE, but realized it was not actually important to do that, and actually should be separate --- it means we are likelier to discover a subsequent restart should another Nexus have inserted a newer one while we were inserting ours.

I think it's weird that the API of this function does not document that "latest" is being returned, which may not be the ereport we are inserting.

Admittedly I think that was a problem before this PR, but still

I agree that there ought to be a comment explaining this, and will add one.

what do you think of 3c289b5 ?

Thanks for adding, I appreciate it!

hawkw · 2026-06-16T19:40:41Z

+                // Check if this is a reporter we know about, and issue a
+                // warning if it is not.
+                let id = ereport.id();
+                if !builder.ereporter_restarts().contains_key(&id.restart_id) {
+                    let msg = format!(
+                        "ereport {id} has a restart ID not contained in the \
+                         `ereporter_restart` table"
+                    );
+                    slog::warn!(&opctx.log, "{msg}");
+                    warnings.push(msg);
+                }


we could completely throw out these ereports, but...it seems mean to me.

hawkw · 2026-06-16T23:16:02Z

+    #[clap(long = "serial")]
    serial: Option<String>,


this was always intended to be --serial rather than a positional argument, but i had forgotten to make it one ages ago. whoops. 😅

smklein · 2026-06-16T23:21:34Z


+-- Table tracking the timestamp of the first ereport received from each restart
+-- ID.
+CREATE TABLE IF NOT EXISTS omicron.public.ereporter_restart (


When are we deleting these rows? I'm trying to understand the full lifetime here, especially because this table will gather additional rows every time we reboot

Heh, that's a great question...we aren't. 😅

The reason for that is that we are also not deleting ereports yet, and I would like the lifetime of rows in this table to be tied to the lifetime of the ereports we need the restart record in order to interpret...ideally, I would like us to delete rows in this table once the last ereport from a particular restart ID has been deleted and there is a newer observed restart in the table (implying that no subsequent ereports from the older restart will be ingested).

> I would like us to delete rows in this table once the last ereport from a particular restart ID has been deleted and there is a newer observed restart in the table (implying that no subsequent ereports from the older restart will be ingested).

hrmmm I think that the precondition of "the last ereport from a particular restart ID" is a good start, but "there is a newer observed restart in the table" might not work.

What about a case where, for example, a sled dies and we expunge it? Then all ereports from that "sled_id" will halt, and be gone forever?

If we aren't going to resolve such a thing in this PR, can we at least file an issue to track the eventual deletion of these rows?

Yeah, I agree that the deletion conditions are more complex than only "there is a newer restart from the same location in the table". I was going to say that I would leave a comment about this table on "the existing issue for ereport deletion" but...I can't seem to find an existing issue. So I'll open one and try and discuss this PR there as well.
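
Purely as a way to make the conditions in this thread concrete, here is one possible reading written as a predicate. Nothing here is decided policy: `RestartStatus`, its fields, and `can_delete_restart` are hypothetical, and treating an expunged reporter as sufficient is an assumption drawn from the question above, not something the PR implements.

```rust
/// Hypothetical summary of what is known about one restart ID.
pub struct RestartStatus {
    /// Number of ereports from this restart ID still in the database.
    pub remaining_ereports: u64,
    /// Whether a newer restart from the same location has been observed.
    pub superseded: bool,
    /// Whether the reporter's hardware has been expunged (assumed input).
    pub reporter_expunged: bool,
}

/// One possible deletion rule: the row is only removable once no
/// ereports refer to it, and there is some reason to believe no more
/// will arrive (a newer restart, or the reporter is gone for good).
pub fn can_delete_restart(s: &RestartStatus) -> bool {
    s.remaining_ereports == 0 && (s.superseded || s.reporter_expunged)
}
```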

smklein · 2026-06-16T23:30:13Z

+    -- Whether this ereport was generated by SP firmware or the host OS.
+    reporter omicron.public.ereporter_type NOT NULL,
+    -- The type of the physical slot occupied by the reporter.
+    slot_type omicron.public.sp_type NOT NULL,
+    -- The number of the physical slot occupied by the reporter.
+    --
+    -- For sled host OS reporters, this may be NULL if the sled's location is
+    -- not known to the system when the ereport was received. If the physical
+    -- location of the sled is determined later, subsequent attempts to insert
+    -- ereports will update this field.
+    slot INT4,


It seems kinda like a bummer that we're duplicating a bit of this "reporter / slot_type / slot" info between the ereports and this table. IMO ideally we'd have:

ereporter_restart contains this info, and a UUID

ereports reference the ereporter UUID

However, I'm not going to block the PR on such a change. Just seems like a place where the schema allows for skew, that we could prevent by using "just UUIDs" as the FK from ereport -> ereporter_restart.

(For example: what if omicron.public.ereport has restart_id = X, but says it's in slot 5, while ereporter_restart claims to have the same restart_id = X, but says it's in slot 6?)

Yeah, so the reason I put this data here is because I am hoping we can drop it from the ereport table in a subsequent PR, as it is static and should not change once a restart ID has been observed. I didn't want to do that in this branch, but I wanted to reserve the ability to do this later. I don't love that the current schema allows us to represent a bunch of ereports from a single restart ID which have multiple slots, because this should never actually happen.

On a related note, I'd like to be able to do a similar normalization to the part number/serial-number fields, since they could also retroactively be added to nullable fields in this table as they become known.

Unfortunately, while MGS learns of these from the SP in a separate exchange of messages, it currently "de-normalizes" the part number, serial number, and other Hubris metadata into fields on every ereport in the response. At the time, this felt like a good idea, but now, I'm regretting it. It would be nice to change the MGS API for this so that these fields were in a top-level field in the message rather than splatted out onto every ereport in the body. Then, we could take a single optional part number and serial number as arguments to ereports_insert and have that function set a field on the restart record if it is non-null. This would be much nicer because, again, it changes the schema to not represent nonsensical behaviors like serial numbers changing within a restart (which shouldn't happen; the only possible transition is for them to go from null to non-null).
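
The "null to non-null only" rule described above could be sketched as a small merge function. This is an illustration of the proposed semantics for fields such as a serial number, not the behavior of the slot column in this PR; `fill_once` and `IdentityConflict` are hypothetical names.

```rust
/// Error returned when an already-known value would be changed.
#[derive(Debug, PartialEq, Eq)]
pub struct IdentityConflict {
    pub existing: String,
    pub incoming: String,
}

/// Merge a newly reported value into the stored one.
///
/// The only permitted change is from `None` to `Some`; a stored value
/// is never cleared, and a different incoming value is an error.
pub fn fill_once(
    existing: Option<String>,
    incoming: Option<String>,
) -> Result<Option<String>, IdentityConflict> {
    match (existing, incoming) {
        // Nothing stored yet: take whatever was reported (maybe nothing).
        (None, incoming) => Ok(incoming),
        // Nothing new reported: keep what is stored.
        (Some(e), None) => Ok(Some(e)),
        // Same value reported again: no change.
        (Some(e), Some(i)) if e == i => Ok(Some(e)),
        // A different value within one restart should never happen.
        (Some(e), Some(i)) => {
            Err(IdentityConflict { existing: e, incoming: i })
        }
    }
}
```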

smklein · 2026-06-16T23:35:07Z

        let latest = self
            .latest_ereport_id_on_conn(&conn, reporter)
            .await
            .map_err(|e| {
                e.internal_context(format!(
-                    "failed to refresh latest ereport ID for {reporter}"
+                    "failed to refresh latest ereport ID for {reporter}",
                ))
            })?;


I think it's weird that the API of this function does not document that "latest" is being returned, which may not be the ereport we are inserting.

Admittedly I think that was a problem before this PR, but still

smklein · 2026-06-16T23:38:52Z

+        datastore
+            .ereports_insert(
+                opctx,
+                host0_restart_id,
+                host0_first_seen,
+                fm::Reporter::HostOs { sled, slot: None },
+                vec![ereport_data(host0_restart_id, 1, host0_first_seen)],


This seems somewhat related to https://github.com/oxidecomputer/omicron/pull/10618/changes#r3424619788 as well?

smklein · 2026-06-18T19:12:12Z

+-- the (schema-prohibited, domain-impossible) case of a single restart ID having
+-- inconsistent reporter/slot_type values across its ereports.


I agree with your analysis here (and the usage of "MAX" / "ON CONFLICT DO NOTHING") - as long as we don't have a single restart ID claiming to have multiple potential slots. Reaaaaaally hope that isn't happening in the field anywhere!

Yeah. Luckily basically the only way I can think of that happening with the current code is if we generated a colliding restart ID, which...I really hope doesn't happen.

hawkw added 23 commits June 12, 2026 10:32

okay i think this might actually work

319da58

diesel hell

b2fff05

okay get the latest query in there too

5602f03

move stuff into other stuff

972cfc2

okay i can't quite get the latest query inlined into the CTE, that's fine actually

675e2ed

whoa it actually works

c98befb

have a restart list query

7aa683a

let's have some tests for that as well

7cd919c

clean up ereports_insert_query

3cbade2

stop trying to add the "latest ENA" fetch to the CTE since that was too annoying, comments, nicer errors

migration, index, constraints

7e29d33

actually nothing is using the slot index

92ebeb5

we can add it later if we do want it

wip add to loader

68a2f39

finish loader

82c5cb4

gotta also add it here

3732a29

data migration test

9f1437b

Merge branch 'main' into eliza/ereport-restart-order-v2-final

91df26b

update omdb command to use restart table as authortiative source

dd76be4

Merge branch 'main' into eliza/ereport-restart-order-v2-final

97a5487

issue warnings for ereports with unknown restart IDs

30d72f2

post merge test fixup

6b157b2

nexus-fm now depends on a db type sigh

c063bd2

restart query will also do a full scan

6269baa

you have to actually get the rpaths thing correct lol

efe895f

hawkw commented Jun 16, 2026

hawkw added 3 commits June 16, 2026 12:43

fix typo in migration test

8b01405

some commentary improvements

0b96394

remove some dead code

4de2c50

This was intended for use by `omdb`, but it's doing its own slightly different thing.

hawkw changed the title ~~copy of ereport restart ID ordering table v2 final 2 usethis FINAL real version 2 FINAL~~ [nexus] ereport restart ID table with first-seen timestamps (good version) Jun 16, 2026

hawkw marked this pull request as ready for review June 16, 2026 20:03

hawkw requested a review from smklein June 16, 2026 20:03

hawkw requested a review from mergeconflict June 16, 2026 20:04

hawkw added the fault-management Everything related to the fault-management initiative (RFD480 and others) label Jun 16, 2026

hawkw added this to the 21 milestone Jun 16, 2026

hawkw mentioned this pull request Jun 16, 2026

[2/n][fm] factor restart_id and time_created out of EreportData #10631

Open

hawkw added 3 commits June 16, 2026 14:32

BLERGH THERES EVEN MORE COPY PASTE MISTAKES IN HERE

875fc5b

UGH I TYPED THE WRONG ONE KMS KMS KMS KMS

59cc9fb

fixup omdb db ereport reporters arguments, some expectorates

f9f2e34

hawkw commented Jun 16, 2026

smklein reviewed Jun 16, 2026

hawkw added 3 commits June 18, 2026 09:33

Merge branch 'main' into eliza/ereport-restart-order-v2-final

22417d8

post merge expectoration

8376b52

comment explaining ereports insert return value

3c289b5

smklein reviewed Jun 18, 2026

smklein approved these changes Jun 18, 2026

Merge branch 'main' into eliza/ereport-restart-order-v2-final

30370df

hawkw enabled auto-merge (squash) June 18, 2026 19:52

hawkw merged commit 9fc857e into main Jun 18, 2026
19 checks passed

hawkw deleted the eliza/ereport-restart-order-v2-final branch June 18, 2026 21:25
