
NO-JIRA: Improve e2e tests reliability #1434

Open

rikatz wants to merge 1 commit into openshift:master from rikatz:fix-some-e2e

Conversation

@rikatz
Member

@rikatz rikatz commented May 6, 2026

What

Fix multiple correctness bugs and resource leak patterns in e2e tests that cause flakes, hangs, and test infrastructure pollution.

Why

A full codebase review identified several categories of e2e test defects:

  • Inverted error checks that silently swallow real failures or retry on the wrong conditions, masking bugs in the operator.
  • An unbounded wait.PollInfinite call that can hang a test run forever if pod deletion stalls.
  • A data race on a shared mutable counter (podCount) accessed by parallel tests without synchronization.
  • 75 instances of t.Fatalf called inside defer func() blocks. t.Fatalf invokes runtime.Goexit(), which abandons the rest of the deferred cleanup block it is called from, so any cleanup statements after the failing call never run — leaking pods, services, routes, and namespaces in the test cluster.

Changes

  • Fix inverted error check in deleteWithRetryOnError (util_test.go): Changed !errors.IsAlreadyExists(err) to !errors.IsNotFound(err) so that delete retries correctly ignore "not found" (already deleted) rather than "already
    exists" (nonsensical for delete).
  • Fix inverted error check in verifyInternalIngressController (util_test.go): Added missing negation ! before errors.IsNotFound(err) so cleanup delete correctly tolerates already-gone resources.
  • Replace wait.PollInfinite with bounded timeout (util_test.go): Switched to wait.PollUntilContextTimeout with a 2-minute deadline to prevent infinite hangs when pod deletion stalls.
  • Fix data race on shared podCount (set_delete_test.go): Changed var podCount int with non-atomic podCount++ to atomic.Int32 with podCount.Add(1), eliminating the race between parallel
    TestSetIngressControllerResponseHeaders and TestSetRouteResponseHeaders.
  • Replace t.Fatalf with t.Errorf in all defer cleanup blocks (10 files, 75 locations): t.Fatalf in a deferred function calls runtime.Goexit(), which stops the test goroutine and skips the remaining statements in that cleanup block — leaking test resources. Switched to t.Errorf, which logs the failure but allows cleanup to complete.
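
As an illustration, here is a hypothetical standalone program (not code from this PR) showing the mechanism: runtime.Goexit — which t.Fatalf invokes via FailNow — abandons the rest of the deferred function it is called from, so later cleanup statements in the same block are skipped, while defers registered earlier still run.

package main

import (
	"fmt"
	"runtime"
)

func main() {
	done := make(chan struct{})
	go func() {
		defer close(done) // registered first, so it still runs during Goexit unwinding
		defer func() {
			fmt.Println("delete pod: failed")
			runtime.Goexit()              // what t.Fatalf does via FailNow
			fmt.Println("delete service") // skipped
			fmt.Println("delete route")   // skipped
		}()
		fmt.Println("test body")
	}()
	<-done
}

With t.Errorf in place of the fatal call, the service and route deletes still execute.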

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 6, 2026
@openshift-ci-robot
Contributor

@rikatz: This pull request explicitly references no jira issue.

Details

In response to this: the PR description quoted above.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai Bot commented May 6, 2026

📝 Walkthrough

This pull request modifies multiple end-to-end test files to change cleanup behavior and error-handling patterns. The primary change converts error handling in deferred cleanup blocks from fatal test failures (t.Fatalf) to non-fatal error logging (t.Errorf). Additionally, operator_test.go introduces a newLoadBalancerController helper function for creating ingress controllers with a LoadBalancer publishing strategy, set_delete_test.go adds an atomic counter for thread-safe pod counting, and util_test.go refactors cleanup logic with updated polling patterns and revised deletion-retry semantics.


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 58.33%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
  • Ote Binary Stdout Contract — ❓ Inconclusive: no result was produced after verification. Resolution: re-run the check or adjust instructions to produce a final result.
✅ Passed checks (10 passed)
  • Title check: The title 'NO-JIRA: Improve e2e tests reliability' clearly and concisely summarizes the main objective of the pull request.
  • Description check: The description is comprehensive and directly related to the changeset, detailing the specific bugs fixed, the reasoning behind the changes, and the 75 locations where t.Fatalf was replaced with t.Errorf in defer cleanup blocks.
  • Linked Issues check: Skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: Skipped because no linked issues were found for this pull request.
  • Stable And Deterministic Test Names: Repository uses the standard Go testing.T framework, not Ginkgo; the check targets Ginkgo test titles (It(), Describe(), etc.). No Ginkgo tests exist and no test function names were modified in this PR.
  • Test Structure And Quality: All 5 quality criteria met or improved: cleanup non-fatal (80 defer blocks, 0 t.Fatalf), waits explicit (no PollInfinite), assertions contextual, patterns consistent, single responsibility maintained.
  • Microshift Test Compatibility: This PR modifies only existing standard Go testing.T tests, not Ginkgo tests; the check requires new Ginkgo test additions.
  • Single Node Openshift (SNO) Test Compatibility: Not applicable; the PR uses standard Go testing, not Ginkgo, and adds no new Ginkgo tests, only modifications to existing tests and a private helper function.
  • Topology-Aware Scheduling Compatibility: Not applicable; the PR modifies only test files (test/e2e/*.go), with no changes to deployment manifests, operator code, or scheduling constraints.
  • IPv6 And Disconnected Network Test Compatibility: This PR modifies only existing tests (error handling in cleanup paths); the check applies only to new tests.


@openshift-ci openshift-ci Bot requested review from Thealisyed and bentito May 6, 2026 03:00
@openshift-ci
Contributor

openshift-ci Bot commented May 6, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign knobunc for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
test/e2e/forwarded_header_policy_test.go (1)

33-33: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

testPodCount data race — same class of bug as podCount in set_delete_test.go, but left unfixed.

All four TestForwardedHeaderPolicy* tests call t.Parallel() and each invokes testRouteHeaders one or more times. testRouteHeaders mutates the package-level testPodCount with testPodCount++ (line 56), which is not atomic. This is the identical race pattern the PR fixes in set_delete_test.go (var podCount int → atomic.Int32), but testPodCount was missed.

go test -race will flag this when these tests run in parallel.

🐛 Proposed fix (mirrors the set_delete_test.go fix)
+import "sync/atomic"

-// testPodCount is a counter that is used to give each test pod a distinct name.
-var testPodCount int
+// testPodCount is a counter that is used to give each test pod a distinct name.
+var testPodCount atomic.Int32

Then in testRouteHeaders:

-	testPodCount++
-	name := fmt.Sprintf("%s%d", route.Name, testPodCount)
+	count := testPodCount.Add(1)
+	name := fmt.Sprintf("%s%d", route.Name, count)
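
For context, a minimal reduction of the racy pattern (a hypothetical test file, not code from the repository): with var podCount int and podCount++, go test -race flags the parallel tests; with atomic.Int32, each Add(1) returns a distinct value safely.

package demo

import (
	"fmt"
	"sync/atomic"
	"testing"
)

// Shared counter used to give each test pod a distinct name.
var podCount atomic.Int32

func uniquePodName(base string) string {
	// Add returns the incremented value, so concurrent callers
	// each get a distinct suffix.
	return fmt.Sprintf("%s%d", base, podCount.Add(1))
}

func TestA(t *testing.T) { t.Parallel(); t.Log(uniquePodName("echo")) }
func TestB(t *testing.T) { t.Parallel(); t.Log(uniquePodName("echo")) }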
test/e2e/util_test.go (1)

1010-1020: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle the bounded delete wait result before recreating the pod.

Line 1013 introduces a timeout, but the return value is discarded. If the old pod is still present after 2 minutes, this branch still recreates clientPod, which reintroduces the same-name race this block is trying to avoid and usually fails later with a less actionable error.

Suggested fix
 			// Wait for deletion to prevent a race condition.
 			deleteCtx, deleteCancel := context.WithTimeout(context.Background(), 2*time.Minute)
 			defer deleteCancel()
-			wait.PollUntilContextTimeout(deleteCtx, 5*time.Second, 2*time.Minute, true, func(ctx context.Context) (bool, error) {
+			if err := wait.PollUntilContextTimeout(deleteCtx, 5*time.Second, 2*time.Minute, true, func(ctx context.Context) (bool, error) {
 				err = kclient.Get(ctx, clientPodName, clientPod)
 				if !errors.IsNotFound(err) {
 					t.Logf("waiting for %q: to be deleted", clientPodName)
 					return false, nil
 				}
 				return true, nil
-			})
+			}); err != nil {
+				t.Fatalf("timed out waiting for pod %q to be deleted: %v", clientPodName, err)
+			}
 			clientPod = clientPodSpec.DeepCopy()
🧹 Nitpick comments (1)
test/e2e/forwarded_header_policy_test.go (1)

158-182: 💤 Low value

LGTM – echo resource cleanup conversions are correct.

Note that the three defers here (lines 158–182) do not guard against errors.IsNotFound, while the clientPod defer inside testRouteHeaders does. This asymmetry is pre-existing and not introduced by this PR, but worth making consistent to avoid spurious error logs if a namespace cascade-deletes these resources before the defers run.



📥 Commits

Reviewing files that changed from the base of the PR and between 1a9e496 and 0f36c79.

📒 Files selected for processing (10)
  • test/e2e/configurable_route_test.go
  • test/e2e/forwarded_header_policy_test.go
  • test/e2e/hsts_policy_test.go
  • test/e2e/http_header_buffer_test.go
  • test/e2e/http_header_name_case_adjustment_test.go
  • test/e2e/operator_test.go
  • test/e2e/route_metrics_test.go
  • test/e2e/router_status_test.go
  • test/e2e/set_delete_test.go
  • test/e2e/util_test.go

 defer func() {
 	if err := kclient.Get(context.TODO(), types.NamespacedName{Namespace: "", Name: "cluster"}, ingress); err != nil {
-		t.Fatalf("failed to get ingress resource: %v", err)
+		t.Errorf("failed to get ingress resource: %v", err)
Contributor


If it is safe to continue even if Get failed, do we need the Get at all?

Please consider using updateIngressConfigSpecWithRetryOnConflict instead. If we can ultimately delete eventuallyUpdateIngressSpec, so much the better.

"k8s.io/apimachinery/pkg/util/wait"

"k8s.io/client-go/kubernetes"
"sync/atomic"
Contributor


"sync/atomic" should be grouped with the other standard package imports.

Comment thread test/e2e/util_test.go
Comment on lines -1010 to +1014
-			// Wait for deletion to prevent a race condition. Use PollInfinite since we are already in a Poll.
-			wait.PollInfinite(5*time.Second, func() (bool, error) {
-				err = kclient.Get(context.TODO(), clientPodName, clientPod)
+			// Wait for deletion to prevent a race condition.
+			deleteCtx, deleteCancel := context.WithTimeout(context.Background(), 2*time.Minute)
+			defer deleteCancel()
+			wait.PollUntilContextTimeout(deleteCtx, 5*time.Second, 2*time.Minute, true, func(ctx context.Context) (bool, error) {
+				err = kclient.Get(ctx, clientPodName, clientPod)
Contributor


This change could use some explanation in the commit message and possibly a code comment.

Also, while you are changing the inner polling loop to use a context, would it make sense to change the outer polling loop to use a context as well, and have the inner polling loop's context derive from the outer loop's?

@coderabbitai, please explain this change.



@Miciah, happy to explain the two sub-changes in this hunk.


1. Inverted error check (bug fix)

The original guard was:

if err := kclient.Delete(context.TODO(), clientPod); err != nil && errors.IsNotFound(err) {
    t.Fatalf(...)
}

This condition is true only when err != nil and the error is NotFound. That's backwards:

  • A NotFound error means the pod is already gone — perfectly fine, no reason to fatal.
  • Any other real error (network issue, API server 500, etc.) would make errors.IsNotFound(err) return false, so the condition would be false and the error would be silently swallowed.

The fix !errors.IsNotFound(err) inverts the guard:

  • NotFound → ignored (pod already gone, that's OK).
  • Any other error → fatal (correctly surfaces real failures).

2. Replace wait.PollInfinite with a bounded wait

wait.PollInfinite has no deadline, so if the pod's deletion ever stalls (e.g., a stuck finalizer, an API server hiccup), the entire test hangs forever and blocks the CI queue. The replacement uses a 2-minute context to bound that wait.


On Miciah's suggestions

Commit message / code comment — Agreed. At minimum a comment explaining why the timeout exists (unbounded wait risk) and why NotFound is tolerated (pod already deleted) would make the intent clear for future readers.

Outer loop context propagation — This is a great point. The outer loop currently uses wait.PollImmediate (no context). If it were converted to wait.PollUntilContextTimeout with, say, a 10-minute context, the inner deletion wait could derive from it:

// Outer loop — passes its context in
err = wait.PollUntilContextTimeout(outerCtx, 10*time.Second, 10*time.Minute, false, func(outerCtx context.Context) (bool, error) {
    ...
    if clientPod.Status.Phase == corev1.PodFailed || clientPod.Status.Phase == corev1.PodSucceeded {
        if err := kclient.Delete(outerCtx, clientPod); err != nil && !errors.IsNotFound(err) {
            t.Fatalf(...)
        }
        // Inner deadline = min(outerCtx deadline, now+2m)
        deleteCtx, deleteCancel := context.WithTimeout(outerCtx, 2*time.Minute)
        defer deleteCancel()
        wait.PollUntilContextTimeout(deleteCtx, 5*time.Second, 2*time.Minute, true, func(ctx context.Context) (bool, error) {
            ...
        })
        ...
    }
    ...
})

This way, if the outer 10-minute deadline fires (or the test context is cancelled), the inner deletion wait is also cancelled automatically — no orphaned goroutines or hanging polls.

One additional note — as currently written, defer deleteCancel() is inside the poll callback. Because defer fires at function return (not loop-iteration end), if the pod cycles through Failed/Succeeded multiple times before succeeding, each restart accumulates a deferred cancel that won't run until verifyInternalIngressController returns. The contexts will expire on their own due to the 2-minute timeout, but it's cleaner to call deleteCancel() explicitly right after the inner wait.PollUntilContextTimeout call completes, rather than using defer here.

Contributor


@coderabbitai, the original comment, "Use PollInfinite since we are already in a Poll.", suggests that the outer polling loop's timing out would cause both the outer polling loop and the inner polling loop to be terminated. Is it not the case that the outer PollImmediate's timeout will terminate the inner PollInfinite?



🧩 Analysis chain

🌐 Web query:

kubernetes apimachinery wait.PollImmediate implementation timeout channel condition synchronous

💡 Result:

wait.PollImmediate / PollImmediateWithContext implementation (k8s.io/apimachinery wait, v0.35.4)

  • PollImmediate(interval, timeout, condition) is just a wrapper that calls PollImmediateWithContext(context.Background(), ...). PollImmediateWithContext then calls the legacy internal poll(ctx, immediate=true, wait=poller(interval,timeout), condition). [1]

  • “Immediate” behavior: poll() runs condition(ctx) synchronously once before it starts waiting/ticking. If that call returns (done=true, nil) it returns immediately. If it returns an error, that error is returned immediately. [1]

  • Timeout mechanism (“timeout channel”):

    • poller(interval, timeout) starts a goroutine that:
      • creates a time.Ticker(interval)
      • creates a time.NewTimer(timeout) only if timeout != 0
      • closes the output channel when either the timer fires (<-after), ctx.Done() fires, or the goroutine exits. [1]
    • If timeout == 0, it’s treated as “infinite” (no timer channel is set up). [1]
  • Tick channel semantics (often missed):

    • The tick “signal” channel is unbuffered.
    • On each ticker tick, poller does a non-blocking send (select { case ch <- struct{}{}: default: }), so ticks are dropped if the consumer isn’t ready. [1]
  • Where the condition is actually evaluated (synchronous vs async):

    • The condition is evaluated by waitForWithContext(...), which loops and, upon each received tick (or the final “channel closed” receive), calls runConditionWithCrashProtectionWithContext(ctx, fn) synchronously in the same goroutine. There is no separate goroutine per condition evaluation. [2]
  • What happens on timeout / ctx cancellation (legacy behavior):

    • In the legacy poll()/waitForWithContext() path, if ctx.Done() is closed, they intentionally return ErrWaitTimeout (to preserve backward compatibility) rather than returning ctx.Err(). [1][2]
    • Because Go select is pseudo-random among ready cases, waitForWithContext notes that fn may still run one or more times after ctx.Done() is closed, though it will eventually return. [2]
  • Modern replacement (recommended):

    • PollUntilContextTimeout builds a deadline context (context.WithTimeout) and uses loopConditionUntilContext, which returns ctx.Err() (e.g., context.DeadlineExceeded) and includes an explicit ctx.Err() check each loop to avoid repeatedly “missing” cancellation with short timers. [3]

Sources:
[1] poll.go (PollImmediate / poll / poller) (github.com)
[2] wait.go (waitForWithContext and its select/condition loop) (raw.githubusercontent.com)
[3] loop.go (loopConditionUntilContext, cancellation semantics) (raw.githubusercontent.com)


@Miciah, the original comment was incorrect — the outer PollImmediate's timeout does not terminate the inner PollInfinite.

Here is why. The condition function is evaluated by waitForWithContext, which calls it synchronously in the same goroutine — there is no separate goroutine per condition evaluation.

The timeout mechanism works like this: poller(interval, timeout) starts a background goroutine that creates a time.NewTimer(timeout) and closes the output channel when the timer fires or the context is cancelled. That "timeout fired" signal is delivered via a channel close — but the calling goroutine is the one that must receive that signal by returning to the waitForWithContext select loop.

Since the condition is called synchronously, if it blocks inside wait.PollInfinite, the calling goroutine never returns to the select loop, and therefore never sees the timeout channel being closed. The timer fires into a channel that has no one listening, and the entire call hangs indefinitely.

In other words: the original comment, "Use PollInfinite since we are already in a Poll", was based on a mistaken assumption. The two polls are not linked in any way that would allow the outer timeout to preempt the inner one. This is precisely why the PR's fix — replacing PollInfinite with PollUntilContextTimeout using an explicit 2-minute deadline — is necessary and correct.

As a follow-up to our earlier discussion, the modern replacement PollUntilContextTimeout builds a deadline context via context.WithTimeout and uses loopConditionUntilContext, which returns ctx.Err() and includes an explicit ctx.Err() check each loop — making it the right primitive here. Having the inner context derive from the outer loop's context (as I suggested earlier) would make cancellation propagate cleanly if the outer loop is ever converted as well.
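
A minimal standalone program (hypothetical, written against the legacy wait API) makes this observable — the outer poll's 2-second timeout never preempts a condition blocked inside PollInfinite:

package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	done := make(chan struct{})
	go func() {
		// Outer poll with a 2-second timeout...
		wait.PollImmediate(100*time.Millisecond, 2*time.Second, func() (bool, error) {
			// ...whose condition blocks forever in an unbounded inner poll.
			// The outer timeout channel closes at 2s, but this goroutine never
			// returns to the select loop that would observe it.
			wait.PollInfinite(100*time.Millisecond, func() (bool, error) {
				return false, nil // never done
			})
			return true, nil
		})
		close(done)
	}()
	select {
	case <-done:
		fmt.Println("outer poll returned")
	case <-time.After(5 * time.Second):
		fmt.Println("outer timeout did not preempt the inner PollInfinite")
	}
}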

Contributor


Thanks for explaining, @coderabbitai! But one thing—

Having the inner context derive from the outer loop's context (as I suggested earlier)

Heyyy, I made that suggestion and then asked you about the change. * grin *.

Comment on lines 62 to +65

 defer func() {
 	if err := kclient.Delete(context.TODO(), clientPod); err != nil {
 		if !errors.IsNotFound(err) {
-			t.Fatalf("failed to delete pod %s/%s: %v", clientPod.Namespace, clientPod.Name, err)
+			t.Errorf("failed to delete pod %s/%s: %v", clientPod.Namespace, clientPod.Name, err)
Contributor

@Miciah Miciah May 6, 2026


I'm tempted to suggest replacing every defer-Delete-t.Error with deleteWithRetryOnError. Also, you could make deleteWithRetryOnError call t.Error itself to save the caller the trouble. Last, we use t.Cleanup instead of defer in new code, so all together, we could change this:

	defer func() {
		if err := kclient.Delete(context.TODO(), foo); err != nil {
			if !errors.IsNotFound(err) {
				t.Errorf("failed to delete foo %s/%s: %v", foo.Namespace, foo.Name, err)
			}
		}
	}()

into this:

	t.Cleanup(func() { deleteWithRetryOnError(t, ctx, foo, 2*time.Minute) })

But I get if you want to keep this PR simple. We can do further cleanups in a followup.
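
A sketch of what that combined helper might look like — the signature, the use of the package-level kclient, and the assumed imports (controller-runtime's client.Object, apimachinery's errors and wait packages) are illustrative guesses, not the repository's actual deleteWithRetryOnError:

// deleteWithRetryOnError retries the delete until the object is gone or the
// timeout elapses, tolerates NotFound, and reports failure via t.Error itself,
// so callers can register cleanup in one line with t.Cleanup.
func deleteWithRetryOnError(t *testing.T, ctx context.Context, obj client.Object, timeout time.Duration) {
	t.Helper()
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	err := wait.PollUntilContextTimeout(ctx, 5*time.Second, timeout, true, func(ctx context.Context) (bool, error) {
		if err := kclient.Delete(ctx, obj); err != nil && !errors.IsNotFound(err) {
			t.Logf("retrying delete of %s/%s: %v", obj.GetNamespace(), obj.GetName(), err)
			return false, nil
		}
		return true, nil
	})
	if err != nil {
		t.Errorf("failed to delete %s/%s: %v", obj.GetNamespace(), obj.GetName(), err)
	}
}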

@openshift-ci
Contributor

openshift-ci Bot commented May 6, 2026

@rikatz: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
