Add some jitter to pkg/retry #2444
Reviewer's Guide: This PR augments the existing retry mechanism by injecting a random jitter of ±5% into each retry delay calculation, preventing synchronized retry storms without altering the overall expected delay duration.
Hey @mtrmac - I've reviewed your changes and they look great!
Here's what I looked at during the review
- 🟢 General issues: all looks good
- 🟢 Security: all looks good
- 🟢 Testing: all looks good
- 🟢 Complexity: all looks good
- 🟢 Documentation: all looks good
Force-pushed from d833c2e to 0edf0bc
Luap99 left a comment:
Why exactly is this needed?
This option is exposed to end users directly via --retry-delay in the podman CLI and containers.conf; magically changing the value to add jitter seems unexpected. At the very least I would expect the documentation to reflect it, and I would also rather have it be the minimum delay, so that we never wait less than the user asked for.
And I guess this will break podman tests as well that match the log message
libimage/manifests/manifests_test.go (outdated)

```diff
-	// Must show warning where libimage is going to retry 5 times with 1s delay
-	assert.Contains(t, logString, "Failed, retrying in 1s ... (1/5)", "warning not matched")
+	// Must show warning where libimage is going to retry 5 times with ~1s delay
+	assert.Regexp(t, `Failed, retrying in (1\.[0-9]*s|[0-9].*ms) ... \(1/5\)`, logString, "warning not matched")
```
This feels aesthetically unappealing to me at the very least. If I set the delay to 1s and it logs some ms I will be somewhat confused.
Yes … OTOH being inaccurate about the delay could be very confusing when analyzing logs; and the API lets users specify the timeout to an arbitrary time.Duration precision.
I think “if the user-specified or default timeout is specified in whole seconds, round to 10 ms” could work well here — almost all users are specifying whole seconds, or using the default ((2^attempt) whole seconds); and the hypothetical extreme users would not lose precision.
Compare the retries / jitter discussion in https://sre.google/sre-book/addressing-cascading-failures/ . I think being considerate to servers would be nice.
I agree that this is unexpected. At the API level it would be trivial to add an opt-in, but this is mostly controlled directly from the CLI, and I don’t think it makes much sense to make this a user-visible option; I’d recommend opting all 3 callers in. So, upsides / downsides. I don’t feel that strongly about this, and I wouldn’t mind closing this PR. Let’s make this decision, and then I can fine-tune.
That’s fair.
I don’t really have a preference.
Thanks, that is good to know.
I agree that an option for this doesn't add much value; exposing it to users would just add extra docs that nobody is going to care about. Most sleep APIs only guarantee a minimum sleep time anyway, so maybe adding jitter of 100-105% of the requested delay would be good enough? And then maybe we could "lie" in the log line and just log what the user requested. Looking at the podman tests I see at least 4 that check for the exact timestamp; sure, we can update them to use a regex too, but it would not look nice. So lying about the exact timestamp might be a good option to avoid end users seeing the jitter. Anyhow, I don't feel strongly for or against this either. Maybe we need some more opinions, @containers/podman-maintainers?
Good point about the lower-bound guarantee. It’s not typical for a log line to differ from what the code actually does, but in terms of actual promises it would be perfectly legitimate.
... so that a short failure does not trigger a cascade of synchronized retries at exactly the same time. Generally, delays are always a lower bound (the OS doesn't promise to run non-realtime code at _exactly_ the requested times), so technically we are allowed to add this delay.

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
Updated to only add a delay, to not log it in the user output (and to add a debug log line).
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: giuseppe, Luap99, mtrmac, sourcery-ai[bot]

The full list of commands accepted by this bot can be found here. The pull request process is described here.
... so that a short failure does not trigger a cascade of synchronized retries at exactly the same time.
Keep the expected duration of the retry unchanged; vary it 5% up or down.