[nexus] ereport restart ID table with first-seen timestamps (good version)#10618
Conversation
stop trying to add the "latest ENA" fetch to the CTE since that was too annoying, comments, nicer errors
we can add it later if we do want it
| /// Returns a query for inserting ereports into the database and updating | ||
| /// the restart history table if the restart ID of the provided ereports is | ||
| /// not already present. | ||
| /// | ||
| /// This is performed in a single atomic CTE, in order to ensure that the | ||
| /// earliest `time_collected` timestamp in the `ereports` table for a given | ||
| /// restart ID will always match the `time_first_seen` timestamp in the | ||
| /// restart history table, even if Nexus crashes midway through inserting | ||
| /// ereports. | ||
| fn ereports_insert_query( | ||
| restart_id: EreporterRestartUuid, | ||
| time_collected: chrono::DateTime<chrono::Utc>, | ||
| reporter: fm::Reporter, | ||
| ereports: Vec<Ereport>, | ||
| ) -> impl RunnableQuery<i64> { | ||
| /// This is basically just a big pile of ceremony for combining two | ||
| /// little Diesel queries into a CTE... | ||
| struct EreportInsertQuery<IE, IR> { | ||
| insert_ereports: IE, | ||
| insert_reporter: IR, | ||
| slot: Option<SqlU16>, | ||
| } | ||
|
|
||
| impl<IE, IR> QueryId for EreportInsertQuery<IE, IR> { | ||
| type QueryId = (); | ||
| const HAS_STATIC_QUERY_ID: bool = false; | ||
| } | ||
|
|
||
| impl<IE, IR> Query for EreportInsertQuery<IE, IR> { | ||
| type SqlType = sql_types::BigInt; | ||
| } | ||
|
|
||
| impl<IE, IR> diesel::RunQueryDsl<DbConnection> for EreportInsertQuery<IE, IR> {} | ||
|
|
||
| impl<IE, IR> QueryFragment<Pg> for EreportInsertQuery<IE, IR> | ||
| where | ||
| IE: QueryFragment<Pg>, | ||
| IR: QueryFragment<Pg>, | ||
| { | ||
| fn walk_ast<'b>( | ||
| &'b self, | ||
| mut out: AstPass<'_, 'b, Pg>, | ||
| ) -> QueryResult<()> { | ||
| out.push_sql("WITH inserted_ereports AS ( "); | ||
| self.insert_ereports.walk_ast(out.reborrow())?; | ||
|
|
||
| out.push_sql("), inserted_reporter AS ("); | ||
| self.insert_reporter.walk_ast(out.reborrow())?; | ||
| out.push_sql(" ON CONFLICT (id) DO "); | ||
| // If we have a slot number, update it so that a previously-null | ||
| // slot number is filled in; if we do not, do nothing on | ||
| // conflict so a previously non-NULL slot is not clobbered. | ||
| if let Some(ref slot) = self.slot { | ||
| out.push_sql("UPDATE SET \"slot\" = "); | ||
| out.push_bind_param::<sql_types::Int4, _>(slot)?; | ||
| } else { | ||
| out.push_sql("NOTHING"); | ||
| } | ||
| // We don't actually need this, but `WITH` clauses have to | ||
| // return something, sooo.... | ||
| out.push_sql(" RETURNING id) "); | ||
| out.push_sql("SELECT count(*) FROM inserted_ereports"); | ||
| Ok(()) | ||
| } | ||
| } | ||
|
|
||
| let (reporter, slot_type, slot) = match reporter { | ||
| fm::Reporter::HostOs { slot, .. } => { | ||
| (EreporterType::Host, SpType::Sled, slot.map(SqlU16::from)) | ||
| } | ||
| fm::Reporter::Sp { sp_type, slot } => { | ||
| let sp_type = sp_type.into(); | ||
| let slot = SqlU16::from(slot); | ||
| (EreporterType::Sp, sp_type, Some(slot)) | ||
| } | ||
| }; | ||
|
|
||
| // The query fragment to insert ereports into the `ereport` table. | ||
| let insert_ereports = diesel::insert_into(dsl::ereport) | ||
| .values(ereports) | ||
| // Some or all of the ereports collected in this batch may already | ||
| // exist in the database because they were ingested by another | ||
| // Nexus. If the same ENAs exist for this restart ID, that's fine; | ||
| // don't overwrite them. | ||
| .on_conflict((dsl::restart_id, dsl::ena)) | ||
| .do_nothing() | ||
| .returning(dsl::ena); | ||
| // Query fragment to insert the reporter restart entry into the | ||
| // ereporter restart table, or update the existing entry's slot column | ||
| // if one exists and the slot column is null. The null behavior will be | ||
| // added by the `walk_ast()` method on `EreporterInsertQuery`, because | ||
| // it depends on whether or not there is a slot number to insert, and I | ||
| // couldn't figure out how to get diesel to let me type erase an INSERT | ||
| // statement that may have one of multiple ON CONFLICT clauses... | ||
| let insert_reporter = diesel::insert_into( | ||
| restart_dsl::ereporter_restart, | ||
| ) | ||
| .values(crate::db::model::EreporterRestart { | ||
| id: restart_id.into(), | ||
| time_first_seen: time_collected, | ||
| reporter, | ||
| slot_type, | ||
| slot: slot.map(SpMgsSlot::from), | ||
| }); | ||
| EreportInsertQuery { insert_ereports, insert_reporter, slot } |
There was a problem hiding this comment.
This is kind of the core of this change. We've taken the query for inserting ereports into the database and turned it into a CTE which both inserts ereports into the ereport table, and inserts the restart ID into the ereporter_restart table, if it has not already been added.
It's necessary for these two operations to be done atomically, because if they are non-atomic, it is possible for a Nexus to die, or go out to lunch for a bit, between inserting the ereport records and creating the ereporter_restart record for a tranche of ereports from a not-yet-seen restart ID. If this happens, another Nexus could insert some more ereports from that restart ID, and try to put something in the restart ID table with a timestamp that is later than the actual first time ereports from that ID were observed. Thus, the CTE --- because I wanted to see if I could do it without a transaction.
| datastore | ||
| .ereports_insert( | ||
| opctx, | ||
| host0_restart_id, | ||
| host0_first_seen, | ||
| fm::Reporter::HostOs { sled, slot: None }, | ||
| vec![ereport_data(host0_restart_id, 1, host0_first_seen)], |
There was a problem hiding this comment.
This is something I don't love about this change: now that the ereports_insert function requires the restart_id and time_collected to be passed as arguments to ereports_insert, they are duplicated between the method arguments and the multiple EreportData values passed in the iterator. This is a bit of a shame, since the values are now duplicated, and it's possible to pass ereport data that has different restart IDs or time_collecteds from the ones passed as arguments.
I have an additional change that factors these out of EreportData and always uses the value from the argument to ereports_insert to fill them in when turning the EreportDatas into db::model::Ereports. I felt like it would be better to open a separate PR for that, because it touches a bunch of otherwise unrelated files (mostly tests).
There was a problem hiding this comment.
This seems somewhat related to https://github.com/oxidecomputer/omicron/pull/10618/changes#r3424619788 as well?
There was a problem hiding this comment.
Yeah. If we normalized these fields so that they aren't present in the ereport table, we would also avoid the duplication in the Rust API for it.
| let latest = self | ||
| .latest_ereport_id_on_conn(&conn, reporter) | ||
| .await | ||
| .map_err(|e| { | ||
| e.internal_context(format!( | ||
| "failed to refresh latest ereport ID for {reporter}" | ||
| "failed to refresh latest ereport ID for {reporter}", | ||
| )) | ||
| })?; |
There was a problem hiding this comment.
i thought about trying to fold this bit into the CTE, but realized it was not actually important to do that, and actually should be separate --- it means we are likelier to discover a subsequent restart should another Nexus have inserted a newer one while we were inserting ours.
There was a problem hiding this comment.
I think it's weird that the API of this function does not document that "latest" is being returned, which may not be the ereport we are inserting.
Admittedly I think that was a problem before this PR, but still
There was a problem hiding this comment.
I agree that there ought to be a comment explaining this, and will add one.
There was a problem hiding this comment.
Thanks for adding, I appreciate it!
| // Check if this is a reporter we know about, and issue a | ||
| // warning if it is not. | ||
| let id = ereport.id(); | ||
| if !builder.ereporter_restarts().contains_key(&id.restart_id) { | ||
| let msg = format!( | ||
| "ereport {id} has a restart ID not contained in the \ | ||
| `ereporter_restart` table" | ||
| ); | ||
| slog::warn!(&opctx.log, "{msg}"); | ||
| warnings.push(msg); | ||
| } |
There was a problem hiding this comment.
we could completely throw out these ereports, but...it seems mean to me.
This was intended for use by `omdb`, but it's doing its own slightly different thing.
| #[clap(long = "serial")] | ||
| serial: Option<String>, |
There was a problem hiding this comment.
this was always intended to be --serial rather than a positional argument, but i had forgotten to make it one ages ago. whoops. 😅
|
|
||
| -- Table tracking the timestamp of the first ereport received from each restart | ||
| -- ID. | ||
| CREATE TABLE IF NOT EXISTS omicron.public.ereporter_restart ( |
There was a problem hiding this comment.
When are we deleting these rows? I'm trying to understand the full lifetime here, especially because this table will gather additional rows every time we reboot
There was a problem hiding this comment.
Heh, that's a great question...we aren't. 😅
The reason for that is that we are also not deleting ereports yet, and I would like the lifetime of rows in this table to be tied to the lifetime of the ereports we need the restart record in order to interpret...ideally, I would like us to to delete rows in this table once the last ereport from a particular restart ID has been deleted and there is a newer observed restart in the table (implying that no subsequent ereports from the older restart will be ingested).
There was a problem hiding this comment.
I would like us to to delete rows in this table once the last ereport from a particular restart ID has been deleted and there is a newer observed restart in the table (implying that no subsequent ereports from the older restart will be ingested).
hrmmm I think that the precondition of "the last ereport from a particular restart ID" is a good start, but "there is a newer observed restart in the table" might not work.
What about a case where, for example, a sled dies and we expunge it? Then all ereports from the that "sled_id" will halt, and be gone forever?
If we aren't going to resolve such a thing in this PR, can we at least file an issue to track the eventual deletion of these rows?
There was a problem hiding this comment.
Yeah, I agree that the deletion conditions are more complex than only "there is a newer restart from the same location in the table". I was going to say that I would leave a comment about this table on "the existing issue for ereport deletion" but...I can't seem to find an existing issue. So I'll open one and try and discuss this PR there as well.
| -- Whether this ereport was generated by SP firmware or the host OS. | ||
| reporter omicron.public.ereporter_type NOT NULL, | ||
| -- The type of the physical slot occupied by the reporter. | ||
| slot_type omicron.public.sp_type NOT NULL, | ||
| -- The number of the physical slot occupied by the reporter. | ||
| -- | ||
| -- For sled host OS reporters, this may be NULL if the sled's location is | ||
| -- not known to the system when the ereport was received. If the physical | ||
| -- location of the sled is determined later, subsequent attempts to insert | ||
| -- ereports will update this field. | ||
| slot INT4, |
There was a problem hiding this comment.
It seems kinda like a bummer that we're duplicating a bit of this "reporter / slot_type / slot" info between the ereports and this table. IMO ideally we'd have:
- ereporter_restart contains this info, and a UUID
- ereports reference the ereporter UUID
However, I'm not going to block the PR on such a change. Just seems like a place where the schema allows for skew, that we could prevent by using "just UUIDs" as the FK from ereport -> ereporter_restart.
(For example: what if omicron.public.ereport has restart_id = X, but says it's in slot 5, while ereporter_retart claims to have the same restart_id = X, but says it's in slot 6?)
There was a problem hiding this comment.
Yeah, so the reason I put this data here is because I am hoping we can drop it from the ereport table in a subsequent PR, as it is static and should not change once a restart ID has been observed. I didn't want to do that in this branch, but I wanted to reserve the ability to do this later. I don't love that the current schema allows us to represent a bunch of ereports from a single restart ID which have multiple slots, because this should never actually happen.
There was a problem hiding this comment.
On a related note, I'd like to be able to do a similar normalization to the part number/serial-number fields, since they could also retroactively be added to nullable fields in this table as they become known.
Unfortunately, while MGS learns of these from the SP in a separate exchange of messages, it currently "de-normalizes" the part number, serial number, and other Hubris metadata into fields on every ereport in the response. At the time, this felt like a good idea, but now, I'm regretting it. It would be nice to change the MGS API for this so that these fields were in a top-level field in the message rather than splatted out onto every ereport in the body. Then, we could take a single optional part number and serial number as arguments to ereports_insert and have that function set a field on the restart record if it is non-null. This would be much nicer because, again, it changes the schema to not represent nonsensical behaviors like serial numbers changing within a restart (which shouldn't happen; the only possible transition is for them to go from null to non-null).
| let latest = self | ||
| .latest_ereport_id_on_conn(&conn, reporter) | ||
| .await | ||
| .map_err(|e| { | ||
| e.internal_context(format!( | ||
| "failed to refresh latest ereport ID for {reporter}" | ||
| "failed to refresh latest ereport ID for {reporter}", | ||
| )) | ||
| })?; |
There was a problem hiding this comment.
I think it's weird that the API of this function does not document that "latest" is being returned, which may not be the ereport we are inserting.
Admittedly I think that was a problem before this PR, but still
| datastore | ||
| .ereports_insert( | ||
| opctx, | ||
| host0_restart_id, | ||
| host0_first_seen, | ||
| fm::Reporter::HostOs { sled, slot: None }, | ||
| vec![ereport_data(host0_restart_id, 1, host0_first_seen)], |
There was a problem hiding this comment.
This seems somewhat related to https://github.com/oxidecomputer/omicron/pull/10618/changes#r3424619788 as well?
| -- the (schema-prohibited, domain-impossible) case of a single restart ID having | ||
| -- inconsistent reporter/slot_type values across its ereports. |
There was a problem hiding this comment.
I agree with your analysis here (and the usage of "MAX" / "ON CONFLICT DO NOTHING") - as long as we don't have a single restart ID claiming to have multiple potential slots. Reaaaaaally hope that isn't happening in the field anywhere!
There was a problem hiding this comment.
Yeah. Luckily basically the only way I can think of that happening with the current code is if we generated a colliding restart ID, which...I really hope doesn't happen.
This branch is yet another attempt to come up with a mechanism for determining a partial ordering of ereport restart IDs. This time, we do it by maintaining a table of ereport restart IDs, along with the timestamp at which they were first seen. Unlike previous approaches, this table is not actually intended to be indexed by physical location or part number/serial number of the reporter. Instead, it is intended to be read as a sitrep analysis input, and used to compare two ereports by ID to see which restart ID was first observed earlier. The new table is populated by changing the query that inserts ereports into the database to a CTE which also inserts a record into the new
ereporter_restarttable, if one does not already exist for that restart ID.This allows us to durably record the timestamp at which the first ereport from a reporter was observed, without requiring us to do a costly scan over all ereports at read-time. Storing the first-seen timestamp in a separate table also avoids the risk of having that timestamp change if earlier ereports are deleted.
This closes #10614, which it supersedes.