better handle undo actions that fail#138
Conversation
andrewjstone
left a comment
There was a problem hiding this comment.
This looks correct to me, and should alleviate some pain in deployments hopefully. I expect we'll get better at writing sagas where undo steps cannot fail over time.
| live_state | ||
| .undo_errors | ||
| .insert(self.node_id, self.state.0.clone()) | ||
| .expect_none("undo node failed twice (storing error)"); |
There was a problem hiding this comment.
NIt: The assert(!live_state.undo_errors.contains_key(&self.node_id)) above is an equivalent assertion. I'd probably remove the former.
There was a problem hiding this comment.
Yes, good call. Fixed.
| .undo_errors | ||
| .insert(node_id, error.clone()) | ||
| .expect_none( | ||
| "recovered node twice (undo failure case)", |
There was a problem hiding this comment.
same as above with duplicate assertions
There was a problem hiding this comment.
Fixed. Thanks for catching these.
| +-- undone: (subsaga end) (produces "server_alloc") | ||
| +-- undone: InstanceConfigure (produces "instance_configure") | ||
| +-- undone: VolumeAttach (produces "volume_attach") | ||
| +-- failed: InstanceBoot (produces "instance_boot") |
There was a problem hiding this comment.
Just to clarify my understanding here: If InstanceBoot fails because of an undo failure, the rest of the actions above it won't actually be undone, right? I got confused, because they are all marked with "undone".
There was a problem hiding this comment.
Good question. In this example, InstanceBoot failed because of an action failure. Then, while unwinding, InstanceCreate's undo action failed. So those middle ones (like VolumeAttach) will have been undone.
For background, see #26. This is a first step towards resolving that. The basic plan here is that if an undo action fails, then the executor will stop dispatching any new tasks, wait for any outstanding tasks to complete, and then come to rest. The result looks similar to what happens if the saga finishes. You can tell from the result type (see
SagaResultErr) whether this is what happened.For an example, see the output from the new smoke test.
I wrestled a little bit about whether "Stuck" should be a new terminal state (separate from "Done"). I tried that at first but it just made the code trickier and wasn't adding anything. It may make sense to separate these better in the final View type but I figured let's see how this plays out in practice.
Here's what's still to-do here:
After this, I think the remaining part of fixing #26 will be to change the return type for undo actions to
UndoActionErrorso that callers are forced to writeUndoActionError::PermanentError. I hope that will guide people away from returning an error here out of convenience, since the impact of doing so is significant.