Inventory read: report more fine-grained errors and retry on (some) failures#10595
Inventory read: report more fine-grained errors and retry on (some) failures#10595jgallagher wants to merge 2 commits into
Conversation
| "prev-collection-id" => %collection_id, | ||
| "prev-collection-err" => InlineErrorChain::new(&err), |
There was a problem hiding this comment.
super nitpicky: i feel like we almost always use _ in structured log record fields, right?
| "prev-collection-id" => %collection_id, | |
| "prev-collection-err" => InlineErrorChain::new(&err), | |
| "prev_collection_id" => %collection_id, | |
| "prev_collection_err" => InlineErrorChain::new(&err), |
| opctx, | ||
| new_collection_id, | ||
| ) | ||
| .await?, |
There was a problem hiding this comment.
I wonder if errors returned here ought to have an internal_context noting that we attempted to read that inventory after retrying a failure to read an older inventory collection?
| // This method has quite a lot of error handling. Most fall into one of | ||
| // three categories: |
There was a problem hiding this comment.
Thank you for writing all of this down <3
| async fn err_checking_for_deletion_impl< | ||
| F1: FnOnce() -> String, | ||
| F2: FnOnce() -> Error, | ||
| >( | ||
| conn: &async_bb8_diesel::Connection<DbConnection>, | ||
| id: CollectionUuid, | ||
| make_internal_error_message: F1, | ||
| make_notfound_error: F2, | ||
| ) -> Error { |
There was a problem hiding this comment.
super nitpicky: this could maybe be written a bit more concisely as:
| async fn err_checking_for_deletion_impl< | |
| F1: FnOnce() -> String, | |
| F2: FnOnce() -> Error, | |
| >( | |
| conn: &async_bb8_diesel::Connection<DbConnection>, | |
| id: CollectionUuid, | |
| make_internal_error_message: F1, | |
| make_notfound_error: F2, | |
| ) -> Error { | |
| async fn err_checking_for_deletion_impl( | |
| conn: &async_bb8_diesel::Connection<DbConnection>, | |
| id: CollectionUuid, | |
| make_internal_error_message: impl FnOnce() -> String, | |
| make_notfound_error: impl FnOnce() -> Error, | |
| ) -> Error { |
since I don't think we actually need to refer to F1 or F2 anyplace.
| macro_rules! err_checking_for_deletion { | ||
| ($($arg:tt)+) => { | ||
| err_checking_for_deletion_impl( | ||
| &conn, id, || format!($($arg)*), make_notfound_err, | ||
| ).await | ||
| }; | ||
| } |
There was a problem hiding this comment.
unimportant style nit: i might have put the macro definition before the err_checking_for_deletion_impl function so that the code reads topohraphically as a "here is the thing you call, and then here is the implementation details of that"?
| .collect::<Result<BTreeMap<_, _>, ()>>() | ||
| else { | ||
| return Err(err_checking_for_deletion!( | ||
| "missing baseboard that we should have fetched", |
There was a problem hiding this comment.
i realize that this is basically reimplementing the previous code without changing the error message, but it occurs to me that it might be nice to include the missing baseboard ID here and in subsequent similar errors --- if we ever actually saw that happen i can imagine wanting to be able to track it down while debugging?
| // We expect to find exactly 1 collection. 0 means the collection | ||
| // doesn't exist (also reasonable); any other number means we got | ||
| // multiple rows with the same primary key - not possible. | ||
| match collections.len() { | ||
| 0 => Ok(None), | ||
| 1 => Ok(Some(collections.into_iter().next().unwrap())), | ||
| n => Err(Error::internal_error(format!( | ||
| "found {n} collections with id {id}: \ | ||
| this should be impossible!" | ||
| ))), | ||
| } |
There was a problem hiding this comment.
huh, is this pattern of using limit(2) and checking for duplicate primary keys, rather than just using limit(1), a thing we should be doing elsewhere? is there a reason we ought to be concerned about this sort of thing happening in the wild, or is this just extreme defensive programming?
i'm asking because in FM, we have a very similar fm_sitrep_read_metadata_on_conn method that doesn't handle duplicate pkeys, and i'm wondering if i ought to be.
There was a problem hiding this comment.
Good question, I'm not sure! It looks like this dates back to when reading an inventory was first introduced (#4291) - @davepacheco might have an opinion? My current opinion is that this is an unnecessary (but pretty harmless, other than the confusion it produces when someone reads it) extreme defensive programming bit. If crdb allows duplicate primary keys, way too much is going to go off the rails for this one to matter much.
There was a problem hiding this comment.
That was my thinking as well, which was why I asked. Well, that and my misreading the diff and thinking it was new in this branch. 😅
There was a problem hiding this comment.
This was a pattern I used early on in Omicron (I'm surprised as late as this) and "extreme defensive programming" is a fair explanation. I had hoped at one point (very early on) that the DataStore's abstractions would just all do this without folks having to think about it everywhere (i.e., this kind of query would go through a common code path). I think that'd be worth it in an ideal world, but it's probably not worth having this confusing construction scattered all over the place.
We now return a "not found" error if attempting to read an inventory collection that doesn't exist or is deleting concurrently with our read. Fixes #8336:
Adds a single retry to
inventory_get_latest_collection()in case it's called concurrently with a delete; this should fix #10573.