NO-JIRA: Improve e2e tests reliability #1434
Conversation
@rikatz: This pull request explicitly references no jira issue.

Details: In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
📝 Walkthrough

This pull request modifies multiple end-to-end test files to alter cleanup behavior and error handling patterns. The primary change across most test files converts error handling in deferred cleanup blocks from fatal test failures (`t.Fatalf`) to non-fatal error logging (`t.Errorf`). Additionally, operator_test.go introduces a `newLoadBalancerController` helper function for creating ingress controllers with LoadBalancer publishing strategy, set_delete_test.go adds atomic counter support for thread-safe pod counting, and util_test.go refactors cleanup logic with updated polling patterns and revised deletion retry semantics.

Important: Pre-merge checks failed. Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (10 passed)
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing…
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
test/e2e/forwarded_header_policy_test.go (1)
33-33: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
`testPodCount` data race – same class of bug as `podCount` in `set_delete_test.go`, but left unfixed.

All four `TestForwardedHeaderPolicy*` tests call `t.Parallel()` and each invokes `testRouteHeaders` one or more times. `testRouteHeaders` mutates the package-level `testPodCount` with `testPodCount++` (line 56), which is not atomic. This is the identical race pattern the PR fixes in `set_delete_test.go` (`var podCount int` → `atomic.Int32`), but `testPodCount` was missed.

`go test -race` will flag this with parallel builds.

🐛 Proposed fix (mirrors the `set_delete_test.go` fix)
+import "sync/atomic" -// testPodCount is a counter that is used to give each test pod a distinct name. -var testPodCount int +// testPodCount is a counter that is used to give each test pod a distinct name. +var testPodCount atomic.Int32Then in
testRouteHeaders:- testPodCount++ - name := fmt.Sprintf("%s%d", route.Name, testPodCount) + count := testPodCount.Add(1) + name := fmt.Sprintf("%s%d", route.Name, count)🤖 Prompt for AI Agents
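For illustration, here is a minimal, self-contained sketch of the pattern the fix relies on. The names (`podCounter`, `uniquePodName`, the two test functions) are hypothetical and not from the repository; the point is that an `atomic.Int32` package-level counter can be bumped safely from parallel tests, while a plain `int` with `++` cannot.

```go
// file: counter_example_test.go (hypothetical)
package example

import (
	"fmt"
	"sync/atomic"
	"testing"
)

// podCounter gives each test pod a distinct name; it is atomic so that
// parallel tests can increment it without a data race.
var podCounter atomic.Int32

func uniquePodName(base string) string {
	// Add returns the new value, so each caller gets a distinct suffix.
	return fmt.Sprintf("%s%d", base, podCounter.Add(1))
}

func TestExampleA(t *testing.T) {
	t.Parallel()
	t.Log(uniquePodName("echo-pod-"))
}

func TestExampleB(t *testing.T) {
	t.Parallel()
	t.Log(uniquePodName("echo-pod-"))
}
```

Running this with `go test -race` passes, whereas the plain-`int` version is flagged as a race.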
🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/e2e/forwarded_header_policy_test.go` at line 33, the package-level mutable counter `testPodCount` is subject to a data race when tests run in parallel; change its declaration to use an atomic integer type (e.g., `atomic.Int32` or `atomic.Int64`) and update all increments and reads in `testRouteHeaders` and any other places to use atomic methods (e.g., `Add`/`Load`) instead of direct `++`/reads; ensure the imports include `sync/atomic`, replace `testPodCount++` with the atomic increment call, and replace non-atomic reads with atomic loads.

test/e2e/util_test.go (1)
1010-1020: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle the bounded delete wait result before recreating the pod.

Line 1013 introduces a timeout, but the return value is discarded. If the old pod is still present after 2 minutes, this branch still recreates `clientPod`, which reintroduces the same-name race this block is trying to avoid and usually fails later with a less actionable error.

Suggested fix
```diff
 	// Wait for deletion to prevent a race condition.
 	deleteCtx, deleteCancel := context.WithTimeout(context.Background(), 2*time.Minute)
 	defer deleteCancel()
-	wait.PollUntilContextTimeout(deleteCtx, 5*time.Second, 2*time.Minute, true, func(ctx context.Context) (bool, error) {
+	if err := wait.PollUntilContextTimeout(deleteCtx, 5*time.Second, 2*time.Minute, true, func(ctx context.Context) (bool, error) {
 		err = kclient.Get(ctx, clientPodName, clientPod)
 		if !errors.IsNotFound(err) {
 			t.Logf("waiting for %q: to be deleted", clientPodName)
 			return false, nil
 		}
 		return true, nil
-	})
+	}); err != nil {
+		t.Fatalf("timed out waiting for pod %q to be deleted: %v", clientPodName, err)
+	}
 	clientPod = clientPodSpec.DeepCopy()
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/e2e/util_test.go` around lines 1010-1020, the call to `wait.PollUntilContextTimeout` that waits for deletion discards its return value, so the test proceeds to recreate `clientPod` even if deletion timed out. Capture the return value (`err := wait.PollUntilContextTimeout(...)`) and check it: if it returns a timeout or non-nil error, fail the test (e.g., `t.Fatalf` or `t.Fatal`) with a clear message that includes `clientPodName` and the error so we do not recreate the pod while the old one still exists; modify the block around `wait.PollUntilContextTimeout` to use the returned error and abort instead of continuing when deletion wasn't confirmed.
🧹 Nitpick comments (1)
test/e2e/forwarded_header_policy_test.go (1)
158-182: 💤 Low value

LGTM – echo resource cleanup conversions are correct.

Note that the three defers here (lines 158–182) do not guard against `errors.IsNotFound`, while the `clientPod` defer inside `testRouteHeaders` does. This asymmetry is pre-existing and not introduced by this PR, but worth making consistent to avoid spurious error logs if a namespace cascade-deletes these resources before the defers run.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/e2e/forwarded_header_policy_test.go` around lines 158-182, the defer cleanup for echoPod/echoService/echoRoute currently logs any delete error but doesn't ignore NotFound; update the three defer blocks that call `kclient.Delete` for echoPod, echoService, and echoRoute to match the clientPod defer behavior in `testRouteHeaders` by checking `errors.IsNotFound(err)` and skipping the `t.Errorf` when the error is a NotFound error, while still reporting other errors. Locate the defer closures referencing echoPod, echoService, and echoRoute and adjust their error handling around `kclient.Delete` accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 41a61b56-1c23-4173-b21c-23739486e870
📒 Files selected for processing (10)
test/e2e/configurable_route_test.go, test/e2e/forwarded_header_policy_test.go, test/e2e/hsts_policy_test.go, test/e2e/http_header_buffer_test.go, test/e2e/http_header_name_case_adjustment_test.go, test/e2e/operator_test.go, test/e2e/route_metrics_test.go, test/e2e/router_status_test.go, test/e2e/set_delete_test.go, test/e2e/util_test.go
```diff
 defer func() {
 	if err := kclient.Get(context.TODO(), types.NamespacedName{Namespace: "", Name: "cluster"}, ingress); err != nil {
-		t.Fatalf("failed to get ingress resource: %v", err)
+		t.Errorf("failed to get ingress resource: %v", err)
```
If it is safe to continue even if Get failed, do we need the Get at all?
Please consider using updateIngressConfigSpecWithRetryOnConflict instead. If we can ultimately delete eventuallyUpdateIngressSpec, so much the better.
| "k8s.io/apimachinery/pkg/util/wait" | ||
|
|
||
| "k8s.io/client-go/kubernetes" | ||
| "sync/atomic" |
"sync/atomic" should be grouped with the other standard package imports.
```diff
-	// Wait for deletion to prevent a race condition. Use PollInfinite since we are already in a Poll.
-	wait.PollInfinite(5*time.Second, func() (bool, error) {
-		err = kclient.Get(context.TODO(), clientPodName, clientPod)
+	// Wait for deletion to prevent a race condition.
+	deleteCtx, deleteCancel := context.WithTimeout(context.Background(), 2*time.Minute)
+	defer deleteCancel()
+	wait.PollUntilContextTimeout(deleteCtx, 5*time.Second, 2*time.Minute, true, func(ctx context.Context) (bool, error) {
+		err = kclient.Get(ctx, clientPodName, clientPod)
```
This change could use some explanation in the commit message and possibly a code comment.
Also, while you are changing the inner polling loop to use a context, would it make sense to change the outer polling loop to use a context as well, and have the inner polling loop's context derive from the outer loop's?
@coderabbitai, please explain this change.
@Miciah, happy to explain the two sub-changes in this hunk.
1. Inverted error check (bug fix)
The original guard was:
```go
if err := kclient.Delete(context.TODO(), clientPod); err != nil && errors.IsNotFound(err) {
	t.Fatalf(...)
}
```

This condition is true only when `err != nil` and the error is NotFound. That's backwards:

- A `NotFound` error means the pod is already gone, which is perfectly fine and no reason to fatal.
- Any other real error (network issue, API server 500, etc.) would make `errors.IsNotFound(err)` return `false`, so the condition would be `false` and the error would be silently swallowed.

The fix `!errors.IsNotFound(err)` inverts the guard:

- `NotFound` → ignored (pod already gone, that's OK).
- Any other error → fatal (correctly surfaces real failures).
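Spelled out with the same identifiers as the hunk above, the corrected guard reads:

```go
// Tolerate NotFound (the pod is already gone); surface anything else.
if err := kclient.Delete(context.TODO(), clientPod); err != nil && !errors.IsNotFound(err) {
	t.Fatalf("failed to delete pod %s/%s: %v", clientPod.Namespace, clientPod.Name, err)
}
```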
2. Replace wait.PollInfinite with a bounded wait
wait.PollInfinite has no deadline, so if the pod's deletion ever stalls (e.g., a stuck finalizer, an API server hiccup), the entire test hangs forever and blocks the CI queue. The replacement uses a 2-minute context to bound that wait.
On Miciah's suggestions
Commit message / code comment — Agreed. At minimum a comment explaining why the timeout exists (unbounded wait risk) and why NotFound is tolerated (pod already deleted) would make the intent clear for future readers.
Outer loop context propagation — This is a great point. The outer loop currently uses wait.PollImmediate (no context). If it were converted to wait.PollUntilContextTimeout with, say, a 10-minute context, the inner deletion wait could derive from it:
```go
// Outer loop: passes its context in
err = wait.PollUntilContextTimeout(outerCtx, 10*time.Second, 10*time.Minute, false, func(outerCtx context.Context) (bool, error) {
	...
	if clientPod.Status.Phase == corev1.PodFailed || clientPod.Status.Phase == corev1.PodSucceeded {
		if err := kclient.Delete(outerCtx, clientPod); err != nil && !errors.IsNotFound(err) {
			t.Fatalf(...)
		}
		// Inner deadline = min(outerCtx deadline, now+2m)
		deleteCtx, deleteCancel := context.WithTimeout(outerCtx, 2*time.Minute)
		defer deleteCancel()
		wait.PollUntilContextTimeout(deleteCtx, 5*time.Second, 2*time.Minute, true, func(ctx context.Context) (bool, error) {
			...
		})
		...
	}
	...
})
```

This way, if the outer 10-minute deadline fires (or the test context is cancelled), the inner deletion wait is also cancelled automatically, with no orphaned goroutines or hanging polls.
One additional note — as currently written, defer deleteCancel() is inside the poll callback. Because defer fires at function return (not loop-iteration end), if the pod cycles through Failed/Succeeded multiple times before succeeding, each restart accumulates a deferred cancel that won't run until verifyInternalIngressController returns. The contexts will expire on their own due to the 2-minute timeout, but it's cleaner to call deleteCancel() explicitly right after the inner wait.PollUntilContextTimeout call completes, rather than using defer here.
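A generic, self-contained sketch of that last point, not the PR's code (the names and the trivially-true condition are placeholders): cancel the inner context explicitly once the inner wait returns, instead of deferring it inside a callback that may run many times.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// pollOnce bounds a single wait with its own 2-minute budget derived from the
// caller's context, and releases the timer as soon as the wait returns rather
// than deferring the cancel (which would accumulate if this body were itself
// a poll callback that runs repeatedly).
func pollOnce(outerCtx context.Context) error {
	innerCtx, cancel := context.WithTimeout(outerCtx, 2*time.Minute)
	err := wait.PollUntilContextTimeout(innerCtx, 5*time.Second, 2*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			// Replace with the real deletion/readiness check.
			return true, nil
		})
	cancel() // explicit, not deferred
	return err
}

func main() {
	fmt.Println(pollOnce(context.Background()))
}
```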
@coderabbitai, the original comment, "Use PollInfinite since we are already in a Poll.", suggests that the outer polling loop's timing out would cause both the outer polling loop and the inner polling loop to be terminated. Is it not the case that the outer PollImmediate's timeout will terminate the inner PollInfinite?
🧩 Analysis chain
🌐 Web query:
kubernetes apimachinery wait.PollImmediate implementation timeout channel condition synchronous
💡 Result:
wait.PollImmediate / PollImmediateWithContext implementation (k8s.io/apimachinery wait, v0.35.4)

- `PollImmediate(interval, timeout, condition)` is just a wrapper that calls `PollImmediateWithContext(context.Background(), ...)`. `PollImmediateWithContext` then calls the legacy internal `poll(ctx, immediate=true, wait=poller(interval, timeout), condition)`. [1]
- "Immediate" behavior: `poll()` runs `condition(ctx)` synchronously once before it starts waiting/ticking. If that call returns `(done=true, nil)` it returns immediately. If it returns an error, that error is returned immediately. [1]
- Timeout mechanism ("timeout channel"): `poller(interval, timeout)` starts a goroutine that:
  - creates a `time.Ticker(interval)`
  - creates a `time.NewTimer(timeout)` only if `timeout != 0`
  - closes the output channel when either the timer fires (`<-after`), `ctx.Done()` fires, or the goroutine exits. [1]
- If `timeout == 0`, it's treated as "infinite" (no timer channel is set up). [1]
- Tick channel semantics (often missed):
  - The tick "signal" channel is unbuffered.
  - On each ticker tick, `poller` does a non-blocking send (`select { case ch <- struct{}{}: default: }`), so ticks are dropped if the consumer isn't ready. [1]
- Where the condition is actually evaluated (synchronous vs async): the condition is evaluated by `waitForWithContext(...)`, which loops and, upon each received tick (or the final "channel closed" receive), calls `runConditionWithCrashProtectionWithContext(ctx, fn)` synchronously in the same goroutine. There is no separate goroutine per condition evaluation. [2]
- What happens on timeout / ctx cancellation (legacy behavior):
  - In the legacy `poll()` / `waitForWithContext()` path, if `ctx.Done()` is closed, they intentionally return `ErrWaitTimeout` (to preserve backward compatibility) rather than returning `ctx.Err()`. [1][2]
  - Because Go `select` is pseudo-random among ready cases, `waitForWithContext` notes that `fn` may still run one or more times after `ctx.Done()` is closed, though it will eventually return. [2]
- Modern replacement (recommended): `PollUntilContextTimeout` builds a deadline context (`context.WithTimeout`) and uses `loopConditionUntilContext`, which returns `ctx.Err()` (e.g., `context.DeadlineExceeded`) and includes an explicit `ctx.Err()` check each loop to avoid repeatedly "missing" cancellation with short timers. [3]

Sources:

- [1] poll.go (`PollImmediate` / `poll` / `poller`): https://github.com/kubernetes/apimachinery/raw/refs/tags/v0.35.4/pkg/util/wait/poll.go
- [2] wait.go (`waitForWithContext` and its select/condition loop): https://raw.githubusercontent.com/kubernetes/apimachinery/refs/tags/v0.35.4/pkg/util/wait/wait.go
- [3] loop.go (`loopConditionUntilContext`, cancellation semantics): https://raw.githubusercontent.com/kubernetes/apimachinery/refs/tags/v0.35.4/pkg/util/wait/loop.go
@Miciah, the original comment was incorrect — the outer PollImmediate's timeout does not terminate the inner PollInfinite.
Here is why. The condition function is evaluated by waitForWithContext, which calls it synchronously in the same goroutine — there is no separate goroutine per condition evaluation.
The timeout mechanism works like this: poller(interval, timeout) starts a background goroutine that creates a time.NewTimer(timeout) and closes the output channel when the timer fires or the context is cancelled. That "timeout fired" signal is delivered via a channel close — but the calling goroutine is the one that must receive that signal by returning to the waitForWithContext select loop.
Since the condition is called synchronously, if it blocks inside wait.PollInfinite, the calling goroutine never returns to the select loop, and therefore never sees the timeout channel being closed. The timer fires into a channel that has no one listening, and the entire call hangs indefinitely.
In other words: the original comment, "Use PollInfinite since we are already in a Poll", was based on a mistaken assumption. The two polls are not linked in any way that would allow the outer timeout to preempt the inner one. This is precisely why the PR's fix — replacing PollInfinite with PollUntilContextTimeout using an explicit 2-minute deadline — is necessary and correct.
As a follow-up to our earlier discussion, the modern replacement PollUntilContextTimeout builds a deadline context via context.WithTimeout and uses loopConditionUntilContext, which returns ctx.Err() and includes an explicit ctx.Err() check each loop — making it the right primitive here. Having the inner context derive from the outer loop's context (as I suggested earlier) would make cancellation propagate cleanly if the outer loop is ever converted as well.
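To make the failure mode concrete, here is a small self-contained illustration. It is not code from this repository; it only assumes `k8s.io/apimachinery` is on the module path and uses the deprecated helpers discussed above.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	start := time.Now()
	// Outer poll with a 2-second timeout.
	err := wait.PollImmediate(100*time.Millisecond, 2*time.Second, func() (bool, error) {
		// Inner unbounded wait runs synchronously in the same goroutine,
		// so the outer 2-second timer cannot preempt it.
		wait.PollInfinite(100*time.Millisecond, func() (bool, error) {
			return time.Since(start) > 5*time.Second, nil
		})
		return true, nil
	})
	// Prints roughly 5s, not 2s: the outer timeout never interrupted the
	// inner poll, because the condition blocked the polling goroutine.
	fmt.Println(time.Since(start), err)
}
```

If the inner condition never became true, the program would hang forever despite the outer 2-second timeout, which is exactly the risk the PR's bounded wait removes.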
Thanks for explaining, @coderabbitai! But one thing—
> Having the inner context derive from the outer loop's context (as I suggested earlier)
Heyyy, I made that suggestion and then asked you about the change. * grin *.
```diff
 defer func() {
 	if err := kclient.Delete(context.TODO(), clientPod); err != nil {
 		if !errors.IsNotFound(err) {
-			t.Fatalf("failed to delete pod %s/%s: %v", clientPod.Namespace, clientPod.Name, err)
+			t.Errorf("failed to delete pod %s/%s: %v", clientPod.Namespace, clientPod.Name, err)
```
I'm tempted to suggest replacing every defer-Delete-t.Error with deleteWithRetryOnError. Also, you could make deleteWithRetryOnError call t.Error itself to save the caller the trouble. Last, we use t.Cleanup instead of defer in new code, so all together, we could change this:
```go
defer func() {
	if err := kclient.Delete(context.TODO(), foo); err != nil {
		if !errors.IsNotFound(err) {
			t.Errorf("failed to delete foo %s/%s: %v", foo.Namespace, foo.Name, err)
		}
	}
}()
```

into this:
```go
t.Cleanup(func() { deleteWithRetryOnError(t, ctx, foo, 2*time.Minute) })
```

But I get it if you want to keep this PR simple. We can do further cleanups in a followup.
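Purely to illustrate the shape of that suggestion: the sketch below is hypothetical and the repository's real `deleteWithRetryOnError` signature may differ. It shows a helper that retries the delete, tolerates NotFound, and reports failures itself so callers only need a one-line `t.Cleanup`. The client and object types are assumed to be the controller-runtime interfaces the tests already use.

```go
package e2e

import (
	"context"
	"testing"
	"time"

	"k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/util/wait"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteAndReport is a hypothetical helper in the spirit of the suggestion:
// it retries deletion until the timeout, tolerates NotFound, and calls
// t.Errorf itself so callers only need a one-line t.Cleanup.
func deleteAndReport(t *testing.T, ctx context.Context, c client.Client, obj client.Object, timeout time.Duration) {
	t.Helper()
	err := wait.PollUntilContextTimeout(ctx, 5*time.Second, timeout, true,
		func(ctx context.Context) (bool, error) {
			if err := c.Delete(ctx, obj); err != nil && !errors.IsNotFound(err) {
				t.Logf("retrying delete of %s/%s: %v", obj.GetNamespace(), obj.GetName(), err)
				return false, nil
			}
			return true, nil
		})
	if err != nil {
		t.Errorf("failed to delete %s/%s: %v", obj.GetNamespace(), obj.GetName(), err)
	}
}
```

A caller would then write something close to `t.Cleanup(func() { deleteAndReport(t, context.Background(), kclient, foo, 2*time.Minute) })`, roughly matching the one-liner above.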
@rikatz: all tests passed! Full PR test history. Your PR dashboard.

Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
What
Fix multiple correctness bugs and resource leak patterns in e2e tests that cause flakes, hangs, and test infrastructure pollution.
Why
A full codebase review identified several categories of e2e test defects.
Changes
exists" (nonsensical for delete).
- Made the shared pod counter atomic to fix a data race between TestSetIngressControllerResponseHeaders and TestSetRouteResponseHeaders.
- `t.Fatalf` in a deferred cleanup block aborts the remaining cleanup and leaks resources. Switched to `t.Errorf`, which logs the failure but allows cleanup to complete.