Inventory read: report more fine-grained errors and retry on (some) failures by jgallagher · Pull Request #10595 · oxidecomputer/omicron

jgallagher · 2026-06-10T22:07:54Z

We now return a "not found" error if attempting to read an inventory collection that doesn't exist or is deleting concurrently with our read. Fixes #8336:

% omdb db inventory collections show 3d7d9e30-611c-42db-8eb9-e0f1c7b60631
note: database URL not specified.  Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using database URL postgresql://root@[::1]:45975/omicron?sslmode=disable
note: database schema version matches expected (265.0.0)
Error: fetching collection 3d7d9e30-611c-42db-8eb9-e0f1c7b60631

Caused by:
    Not found: inventory collection not found: 3d7d9e30-611c-42db-8eb9-e0f1c7b60631

Adds a single retry to inventory_get_latest_collection() in case it's called concurrently with a delete; this should fix #10573.

…ailures We now return a "not found" error if attempting to read an inventory collection that doesn't exist or is deleting concurrently with our read. Fixes #8336. Adds a single retry to `inventory_get_latest_collection()` in case it's called concurrently with a delete; this should fix #10573.

…etter-errors

hawkw · 2026-06-16T16:04:15Z

+                        "prev-collection-id" => %collection_id,
+                        "prev-collection-err" => InlineErrorChain::new(&err),


super nitpicky: i feel like we almost always use _ in structured log record fields, right?

Suggested change

"prev-collection-id" => %collection_id,

"prev-collection-err" => InlineErrorChain::new(&err),

"prev_collection_id" => %collection_id,

"prev_collection_err" => InlineErrorChain::new(&err),

hawkw · 2026-06-16T16:05:26Z

+                            opctx,
+                            new_collection_id,
+                        )
+                        .await?,


I wonder if errors returned here ought to have an internal_context noting that we attempted to read that inventory after retrying a failure to read an older inventory collection?

hawkw · 2026-06-16T16:06:03Z

+        // This method has quite a lot of error handling. Most fall into one of
+        // three categories:


Thank you for writing all of this down <3

hawkw · 2026-06-16T16:07:18Z

+        async fn err_checking_for_deletion_impl<
+            F1: FnOnce() -> String,
+            F2: FnOnce() -> Error,
+        >(
+            conn: &async_bb8_diesel::Connection<DbConnection>,
+            id: CollectionUuid,
+            make_internal_error_message: F1,
+            make_notfound_error: F2,
+        ) -> Error {


super nitpicky: this could maybe be written a bit more concisely as:

Suggested change

async fn err_checking_for_deletion_impl<

F1: FnOnce() -> String,

F2: FnOnce() -> Error,

>(

conn: &async_bb8_diesel::Connection<DbConnection>,

id: CollectionUuid,

make_internal_error_message: F1,

make_notfound_error: F2,

) -> Error {

async fn err_checking_for_deletion_impl(

conn: &async_bb8_diesel::Connection<DbConnection>,

id: CollectionUuid,

make_internal_error_message: impl FnOnce() -> String,

make_notfound_error: impl FnOnce() -> Error,

) -> Error {

since I don't think we actually need to refer to F1 or F2 anyplace.

hawkw · 2026-06-16T16:08:44Z

+        macro_rules! err_checking_for_deletion {
+            ($($arg:tt)+) => {
+                err_checking_for_deletion_impl(
+                    &conn, id, || format!($($arg)*), make_notfound_err,
+                ).await
+            };
+        }


unimportant style nit: i might have put the macro definition before the err_checking_for_deletion_impl function so that the code reads topohraphically as a "here is the thing you call, and then here is the implementation details of that"?

hawkw · 2026-06-16T16:10:34Z

+            .collect::<Result<BTreeMap<_, _>, ()>>()
+        else {
+            return Err(err_checking_for_deletion!(
+                "missing baseboard that we should have fetched",


i realize that this is basically reimplementing the previous code without changing the error message, but it occurs to me that it might be nice to include the missing baseboard ID here and in subsequent similar errors --- if we ever actually saw that happen i can imagine wanting to be able to track it down while debugging?

hawkw · 2026-06-16T16:15:54Z

+        // We expect to find exactly 1 collection. 0 means the collection
+        // doesn't exist (also reasonable); any other number means we got
+        // multiple rows with the same primary key - not possible.
+        match collections.len() {
+            0 => Ok(None),
+            1 => Ok(Some(collections.into_iter().next().unwrap())),
+            n => Err(Error::internal_error(format!(
+                "found {n} collections with id {id}: \
+                 this should be impossible!"
+            ))),
+        }


huh, is this pattern of using limit(2) and checking for duplicate primary keys, rather than just using limit(1), a thing we should be doing elsewhere? is there a reason we ought to be concerned about this sort of thing happening in the wild, or is this just extreme defensive programming?

i'm asking because in FM, we have a very similar fm_sitrep_read_metadata_on_conn method that doesn't handle duplicate pkeys, and i'm wondering if i ought to be.

Good question, I'm not sure! It looks like this dates back to when reading an inventory was first introduced (#4291) - @davepacheco might have an opinion? My current opinion is that this is an unnecessary (but pretty harmless, other than the confusion it produces when someone reads it) extreme defensive programming bit. If crdb allows duplicate primary keys, way too much is going to go off the rails for this one to matter much.

That was my thinking as well, which was why I asked. Well, that and my misreading the diff and thinking it was new in this branch. 😅

This was a pattern I used early on in Omicron (I'm surprised as late as this) and "extreme defensive programming" is a fair explanation. I had hoped at one point (very early on) that the DataStore's abstractions would just all do this without folks having to think about it everywhere (i.e., this kind of query would go through a common code path). I think that'd be worth it in an ideal world, but it's probably not worth having this confusing construction scattered all over the place.

jgallagher requested review from davepacheco and smklein June 10, 2026 22:07

Merge remote-tracking branch 'origin/main' into john/inventory-read-b…

ccbf264

…etter-errors

hawkw reviewed Jun 16, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inventory read: report more fine-grained errors and retry on (some) failures#10595

Inventory read: report more fine-grained errors and retry on (some) failures#10595
jgallagher wants to merge 2 commits into
mainfrom
john/inventory-read-better-errors

jgallagher commented Jun 10, 2026

Uh oh!

hawkw Jun 16, 2026

Uh oh!

hawkw Jun 16, 2026

Uh oh!

hawkw Jun 16, 2026

Uh oh!

hawkw Jun 16, 2026

Uh oh!

hawkw Jun 16, 2026

Uh oh!

hawkw Jun 16, 2026

Uh oh!

hawkw Jun 16, 2026

Uh oh!

jgallagher Jun 17, 2026

Uh oh!

hawkw Jun 17, 2026

Uh oh!

davepacheco Jun 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		"prev-collection-id" => %collection_id,
		"prev-collection-err" => InlineErrorChain::new(&err),

		// This method has quite a lot of error handling. Most fall into one of
		// three categories:

Conversation

jgallagher commented Jun 10, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants