Skip to content

once more unto the breach of "how do we determine the ordering of ereport restart IDs"#10614

Closed
hawkw wants to merge 3 commits into
mainfrom
eliza/restart-order-good-version
Closed

once more unto the breach of "how do we determine the ordering of ereport restart IDs"#10614
hawkw wants to merge 3 commits into
mainfrom
eliza/restart-order-good-version

Conversation

@hawkw

@hawkw hawkw commented Jun 11, 2026

Copy link
Copy Markdown
Member

this needs:

  • testing
  • plumbing into sitrep inputs
  • probably some refactoring of the restart locations models and stuff

let pool = db.pool();
let conn = pool.claim().await.unwrap();

let query = DataStore::ereport_unmarked_class_totals_query();

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the ereporter_restart_history query?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AGH, that's a copy-paste mishap on my part :sweat

Comment on lines +89 to +90
self.first_seen_at
.cmp(&other.first_seen_at)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we trust that within the context of a Restart object, we'll be comparing times from one "lifetime" of an ereport reporter?

I'm asking because I want to understand: Is this immune to e.g. small clock skew between systems? E.g., this couldn't have an issue where "two Nexuses with slightly different notions of time are writing down events from a crash-looping service"

dsl::slot_type,
max(dsl::slot),
count(dsl::slot).aggregate_distinct(),
min(dsl::time_collected),

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This min(dsl::time_collected) value becomes first_seen, which then becomes part of Restart ordering.

Does this make the assumption that we'll never prune ereports for a "restarted component" while it's alive?

For example, suppose a host boots, and makes some ereports:

Time Ereport From Host 1 Ereport from Host 2
1 A ...
2 ... X
3 B Y

As currently written, if we prune "A", then the Restart value of host 1 would change relative to host 2. Do we not care, because of other pieces of the restart ID (like the slot ordering)?

Comment on lines +1050 to +1054
assert!(
!explanation.contains("FULL SCAN"),
"Found an unexpected FULL SCAN: {}",
explanation
);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may want to look into using assert_uses_partial_index_only with the lookup_ereports_by_serial index. I think that this only scans over "live ereports" which is okay? but also, I guess, as an FYI: this is scanning over all live ereports

@hawkw hawkw closed this in #10618 Jun 18, 2026
hawkw added a commit that referenced this pull request Jun 18, 2026
…sion) (#10618)

This branch is yet another attempt to come up with a mechanism for
determining a partial ordering of ereport restart IDs. This time, we do
it by maintaining a table of ereport restart IDs, along with the
timestamp at which they were first seen. Unlike previous approaches,
this table is not actually intended to be indexed by physical location
or part number/serial number of the reporter. Instead, it is intended to
be read as a sitrep analysis input, and used to compare two ereports by
ID to see which restart ID was first observed earlier. The new table is
populated by changing the query that inserts ereports into the database
to a CTE which also inserts a record into the new `ereporter_restart`
table, if one does not already exist for that restart ID.

This allows us to durably record the timestamp at which the first
ereport from a reporter was observed, without requiring us to do a
costly scan over all ereports at read-time. Storing the first-seen
timestamp in a separate table also avoids the risk of having that
timestamp *change* if earlier ereports are deleted.

This closes #10614, which it supersedes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants