once more unto the breach of "how do we determine the ordering of ereport restart IDs"#10614
once more unto the breach of "how do we determine the ordering of ereport restart IDs"#10614hawkw wants to merge 3 commits into
Conversation
my query planner is dying
| let pool = db.pool(); | ||
| let conn = pool.claim().await.unwrap(); | ||
|
|
||
| let query = DataStore::ereport_unmarked_class_totals_query(); |
There was a problem hiding this comment.
This is not the ereporter_restart_history query?
There was a problem hiding this comment.
AGH, that's a copy-paste mishap on my part :sweat
| self.first_seen_at | ||
| .cmp(&other.first_seen_at) |
There was a problem hiding this comment.
Do we trust that within the context of a Restart object, we'll be comparing times from one "lifetime" of an ereport reporter?
I'm asking because I want to understand: Is this immune to e.g. small clock skew between systems? E.g., this couldn't have an issue where "two Nexuses with slightly different notions of time are writing down events from a crash-looping service"
| dsl::slot_type, | ||
| max(dsl::slot), | ||
| count(dsl::slot).aggregate_distinct(), | ||
| min(dsl::time_collected), |
There was a problem hiding this comment.
This min(dsl::time_collected) value becomes first_seen, which then becomes part of Restart ordering.
Does this make the assumption that we'll never prune ereports for a "restarted component" while it's alive?
For example, suppose a host boots, and makes some ereports:
| Time | Ereport From Host 1 | Ereport from Host 2 |
|---|---|---|
| 1 | A | ... |
| 2 | ... | X |
| 3 | B | Y |
As currently written, if we prune "A", then the Restart value of host 1 would change relative to host 2. Do we not care, because of other pieces of the restart ID (like the slot ordering)?
| assert!( | ||
| !explanation.contains("FULL SCAN"), | ||
| "Found an unexpected FULL SCAN: {}", | ||
| explanation | ||
| ); |
There was a problem hiding this comment.
You may want to look into using assert_uses_partial_index_only with the lookup_ereports_by_serial index. I think that this only scans over "live ereports" which is okay? but also, I guess, as an FYI: this is scanning over all live ereports
…sion) (#10618) This branch is yet another attempt to come up with a mechanism for determining a partial ordering of ereport restart IDs. This time, we do it by maintaining a table of ereport restart IDs, along with the timestamp at which they were first seen. Unlike previous approaches, this table is not actually intended to be indexed by physical location or part number/serial number of the reporter. Instead, it is intended to be read as a sitrep analysis input, and used to compare two ereports by ID to see which restart ID was first observed earlier. The new table is populated by changing the query that inserts ereports into the database to a CTE which also inserts a record into the new `ereporter_restart` table, if one does not already exist for that restart ID. This allows us to durably record the timestamp at which the first ereport from a reporter was observed, without requiring us to do a costly scan over all ereports at read-time. Storing the first-seen timestamp in a separate table also avoids the risk of having that timestamp *change* if earlier ereports are deleted. This closes #10614, which it supersedes.
this needs: