NO-ISSUE:UPSTREAM: <carry>: Fix: Race condition in ClusterExtension cleanup by kuiwang02 · Pull Request #533 · openshift/operator-framework-operator-controller

kuiwang02 · 2025-10-21T08:08:22Z

Fix: Race condition in ClusterExtension cleanup timeout for singleownnamespace tests

Why / Problem Statement

The test "[sig-olmv1][OCPFeatureGate:NewOLMOwnSingleNamespace] OLMv1 operator installation support for ownNamespace and single namespace watch mode with operator should install cluster extensions successfully in both watch modes" was failing intermittently with a 60-second timeout during ClusterExtension cleanup.

This is a race condition issue, not a regression introduced by recent changes. The test has a pre-existing robustness problem where asynchronous Kubernetes deletion time (variable: 45-120s depending on cluster load, resources, and finalizers) races against a fixed timeout (constant: 60s). The test passes when deletion completes quickly (<60s) and fails when it takes longer (>60s).

Failure evidence:

[FAILED] - /build/openshift/tests-extension/pkg/helpers/cluster_extension.go:185
Timed out after 60.039s.
Cleanup ClusterExtension install-webhook-bothns-ownns-ce-tz9c failed to delete

Timeline:
- 05:33:22 - Delete ClusterExtension called
- 05:34:22 - Timeout (60 seconds later)
- ClusterExtension status: DeletionTimestamp set, but object still exists with foregroundDeletion finalizer

Root causes:

1. Insufficient timeout for foreground deletion: A ClusterExtension with the foregroundDeletion finalizer must wait for the complete deletion chain (Deployment → ReplicaSet → Pods with 30s graceful shutdown + CRD instances + ServiceAccount + RBAC). This can legitimately take 60-120 seconds, but the timeout was hardcoded to 60s.
2. Kubernetes Delete() is asynchronous: client.Delete() returns almost immediately (~50ms), as soon as the API server accepts the request, but the actual deletion happens in the background (45-90s later). The test did not wait for the deletion to actually complete.
3. No wait between scenario iterations: The test runs two scenarios sequentially (singleNamespace, then ownNamespace) but only called Delete() without waiting for IsNotFound, so the next scenario could start before the previous resources were fully cleaned up.

This is NOT introduced by PR #524: Analysis of PR #524 shows it only changed which operator is tested (quay-operator → singleown-operator) and added in-cluster builds. The deletion logic and timeout remained unchanged. PR #524 simply exposed this pre-existing race condition by changing environmental factors that made deletion slightly slower.

What / Solution

This PR fixes the race condition by implementing two changes to make the test robust against timing variations:

Changes Made

1. Increase ClusterExtension cleanup timeout (Required Fix)

File: pkg/helpers/cluster_extension.go:185

  Eventually(func() bool {
      err := k8sClient.Get(ctx, client.ObjectKey{Name: ce.Name}, &olmv1.ClusterExtension{})
      return errors.IsNotFound(err)
- }).WithTimeout(1*time.Minute).WithPolling(2*time.Second).Should(BeTrue(),
+ }).WithTimeout(3*time.Minute).WithPolling(2*time.Second).Should(BeTrue(),
      "Cleanup ClusterExtension %s failed to delete", ce.Name)

Rationale:

- Foreground deletion legitimately takes 60-120 seconds in production clusters
- 3 minutes provides sufficient buffer for pod graceful shutdown, finalizer processing, and CRD cleanup
- Still fails fast enough (within 3 minutes) to detect real deletion issues
- Addresses the core race condition: variable async deletion time vs fixed timeout

2. Wait for namespace deletion between scenarios (Defense in Depth)

File: test/olmv1-singleownnamespace.go

Added import:

+ "k8s.io/apimachinery/pkg/api/errors"

Added wait logic after namespace deletion (lines 476-492):

By(fmt.Sprintf("waiting for namespace %s to be fully deleted before next scenario", installNamespace))
Eventually(func(g Gomega) {
    ns := &corev1.Namespace{}
    err := k8sClient.Get(ctx, client.ObjectKey{Name: installNamespace}, ns)
    g.Expect(err).To(HaveOccurred(), "expected namespace %s to be deleted", installNamespace)
    g.Expect(errors.IsNotFound(err)).To(BeTrue(), "expected NotFound error for namespace %s", installNamespace)
}).WithTimeout(2 * time.Minute).WithPolling(2 * time.Second).Should(Succeed())

if watchNSObj != nil {
    By(fmt.Sprintf("waiting for watch namespace %s to be fully deleted before next scenario", watchNamespace))
    Eventually(func(g Gomega) {
        ns := &corev1.Namespace{}
        err := k8sClient.Get(ctx, client.ObjectKey{Name: watchNamespace}, ns)
        g.Expect(err).To(HaveOccurred(), "expected namespace %s to be deleted", watchNamespace)
        g.Expect(errors.IsNotFound(err)).To(BeTrue(), "expected NotFound error for namespace %s", watchNamespace)
    }).WithTimeout(2 * time.Minute).WithPolling(2 * time.Second).Should(Succeed())
}

Rationale:

- Ensures namespace is actually deleted (IsNotFound), not just that Delete() call succeeded
- Prevents resource conflicts between scenario iterations
- Properly handles Kubernetes asynchronous deletion semantics
- Provides complete isolation between test scenarios

Key Technical Decisions

3-minute timeout for ClusterExtension cleanup
- Decision: Increase from 60s to 180s
- Rationale: Based on analysis of foreground deletion chain timing (Deployment → ReplicaSet → Pods with 30s graceful shutdown + finalizers). 180s provides comfortable buffer while still detecting real issues.
- Alternatives considered: 120s was considered but 180s chosen for extra margin in slow clusters

Wait for IsNotFound instead of trusting Delete() success
- Decision: Add explicit Eventually wait checking errors.IsNotFound() after namespace deletion
- Rationale: In Kubernetes, Delete() is asynchronous - it returns when API server accepts the request, not when deletion completes. Must poll for IsNotFound to confirm actual deletion.
- Alternatives considered: Using time.Sleep() was rejected as an anti-pattern (hardcoded timing assumptions)

2-minute timeout for namespace deletion wait
- Decision: Use 120s timeout for namespace cleanup verification
- Rationale: Namespace deletion typically faster than ClusterExtension (no complex finalizers), but needs buffer for various resources within namespace to clean up
- Alternatives considered: 60s rejected as potentially too short in slow environments

Benefits of Combined Fix

Aspect	Before	After
ClusterExtension cleanup	60s (insufficient)	180s (sufficient)
Scenario isolation	No wait (race condition)	Wait for IsNotFound (guaranteed)
Async handling	Assumes Delete() = deleted	Properly waits for actual deletion
Robustness	Timing-dependent (flaky)	State-dependent (reliable)
Debugging	Vague timeout errors	Clear error messages with namespace names

Testing

INFO[0194] Found 0 must-gather tests                    
started: 0/1/5 "[sig-olmv1][OCPFeatureGate:NewOLMOwnSingleNamespace] OLMv1 operator installation should reject invalid watch namespace configuration and update the status conditions accordingly should fail to install the ClusterExtension when watch namespace is invalid"

started: 0/2/5 "[sig-olmv1][OCPFeatureGate:NewOLMOwnSingleNamespace] OLMv1 operator installation support for ownNamespace and single namespace watch mode with operator should install cluster extensions successfully in both watch modes"

started: 0/3/5 "[sig-olmv1][OCPFeatureGate:NewOLMOwnSingleNamespace] OLMv1 operator installation support for ownNamespace watch mode with operator should install a cluster extension successfully"

started: 0/4/5 "[sig-olmv1][OCPFeatureGate:NewOLMOwnSingleNamespace] OLMv1 operator installation support for singleNamespace watch mode with operator should install a cluster extension successfully"


passed: (40.7s) 2025-10-21T07:48:00 "[sig-olmv1][OCPFeatureGate:NewOLMOwnSingleNamespace] OLMv1 operator installation support for ownNamespace watch mode with operator should install a cluster extension successfully"


passed: (46s) 2025-10-21T07:48:05 "[sig-olmv1][OCPFeatureGate:NewOLMOwnSingleNamespace] OLMv1 operator installation should reject invalid watch namespace configuration and update the status conditions accordingly should fail to install the ClusterExtension when watch namespace is invalid"


passed: (51.2s) 2025-10-21T07:48:11 "[sig-olmv1][OCPFeatureGate:NewOLMOwnSingleNamespace] OLMv1 operator installation support for singleNamespace watch mode with operator should install a cluster extension successfully"


passed: (1m17s) 2025-10-21T07:48:36 "[sig-olmv1][OCPFeatureGate:NewOLMOwnSingleNamespace] OLMv1 operator installation support for ownNamespace and single namespace watch mode with operator should install cluster extensions successfully in both watch modes"

started: 0/5/5 "[sig-olmv1][OCPFeatureGate:NewOLMOwnSingleNamespace][Serial] OLMv1 operator installation support for ownNamespace watch mode with an operator that does not support ownNamespace installation mode should fail to install a cluster extension successfully"


passed: (38.3s) 2025-10-21T07:49:21 "[sig-olmv1][OCPFeatureGate:NewOLMOwnSingleNamespace][Serial] OLMv1 operator installation support for ownNamespace watch mode with an operator that does not support ownNamespace installation mode should fail to install a cluster extension successfully"

Shutting down the monitor
Collecting data.
INFO[0325] Starting CollectData for all monitor tests   
INFO[0325]   Starting CollectData for [Monitor:watch-namespaces][Jira:"Test Framework"] monitor test watch-namespaces collection 
INFO[0325]   Finished CollectData for [Monitor:watch-namespaces][Jira:"Test Framework"] monitor test watch-namespaces collection 
INFO[0325] Finished CollectData for all monitor tests   
Computing intervals.
Evaluating tests.
Cleaning up.
INFO[0325] beginning cleanup                             monitorTest=watch-namespaces
Serializing results.
Writing to storage.
  m.startTime = 2025-10-21 15:47:11.194084 +0800 CST m=+194.609326834
  m.stopTime  = 2025-10-21 15:49:21.634841 +0800 CST m=+325.051185959
Processing monitorTest: watch-namespaces
  finalIntervals size = 10
  first interval time: From = 2025-10-21 15:47:11.202394 +0800 CST m=+194.617636834; To = 2025-10-21 15:47:11.202394 +0800 CST m=+194.617636834
  last interval time: From = 2025-10-21 15:49:21.632643 +0800 CST m=+325.048988168; To = 2025-10-21 15:49:21.632643 +0800 CST m=+325.048988168
Writing junits.
Writing JUnit report to e2e-monitor-tests__20251021-074409.xml
5 pass, 0 flaky, 0 skip (5m12s)

Assisted-by: Claude Code

openshift-ci-robot · 2025-10-21T08:08:26Z

@kuiwang02: This pull request explicitly references no jira issue.

openshift-ci · 2025-10-21T08:09:38Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kuiwang02
Once this PR has been reviewed and has the lgtm label, please assign perdasilva for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

DOWNSTREAM_OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

kuiwang02 · 2025-10-21T08:20:10Z

/payload-aggregate periodic-ci-openshift-release-master-ci-4.21-e2e-gcp-ovn-techpreview 5

openshift-ci · 2025-10-21T08:20:13Z

@kuiwang02: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-master-ci-4.21-e2e-gcp-ovn-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/c3130730-ae56-11f0-8ca2-47048bb63af9-0

kuiwang02 · 2025-10-21T08:20:32Z

/payload-aggregate periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ipi-ovn-ipv6-techpreview 5

openshift-ci · 2025-10-21T08:20:35Z

@kuiwang02: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ipi-ovn-ipv6-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/d0093450-ae56-11f0-8264-d6e3474f0c3b-0

kuiwang02 · 2025-10-21T08:21:13Z

/payload-aggregate periodic-ci-openshift-release-master-nightly-4.21-e2e-azure-ovn-runc-techpreview 5

openshift-ci · 2025-10-21T08:21:16Z

@kuiwang02: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-master-nightly-4.21-e2e-azure-ovn-runc-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/e8acbae0-ae56-11f0-8ec9-6cccbfc04901-0

camilamacedo86 · 2025-10-21T08:36:34Z

 					err := k8sClient.Get(ctx, client.ObjectKey{Name: ce.Name}, &olmv1.ClusterExtension{})
 					return errors.IsNotFound(err)
-				}).WithTimeout(1*time.Minute).WithPolling(2*time.Second).Should(BeTrue(), "Cleanup ClusterExtension %s failed to delete", ce.Name)
+				}).WithTimeout(3*time.Minute).WithPolling(2*time.Second).Should(BeTrue(), "Cleanup ClusterExtension %s failed to delete", ce.Name)


These tests report a signal to Sippy; to avoid sending a bad signal, we use big timeouts.

Suggested change

- }).WithTimeout(3*time.Minute).WithPolling(2*time.Second).Should(BeTrue(), "Cleanup ClusterExtension %s failed to delete", ce.Name)
+ }).WithTimeout(5*time.Minute).WithPolling(3*time.Second).Should(BeTrue(), "Cleanup ClusterExtension %s failed to delete", ce.Name)

You can check the other cases.

The timeouts should be generous: set them to 5 minutes with a polling interval of 3 seconds.
These tests return signals to Sippy and can block other teams, so we cannot afford to fail because of a cleanup timeout.
And yes, this was merged before; it should be 5 minutes like the others.

camilamacedo86 · 2025-10-21T08:58:36Z

+						err := k8sClient.Get(ctx, client.ObjectKey{Name: watchNamespace}, ns)
+						g.Expect(err).To(HaveOccurred(), "expected namespace %s to be deleted", watchNamespace)
+						g.Expect(errors.IsNotFound(err)).To(BeTrue(), "expected NotFound error for namespace %s", watchNamespace)
+					}).WithTimeout(2 * time.Minute).WithPolling(2 * time.Second).Should(Succeed())


Why do we need to wait for the namespace to be deleted before running the other scenario?

Each scenario has its own unique bundles (CE, etc.), so the scenarios can run in parallel and are not marked as SERIAL. Therefore, they should not impact each other.
Because of that, there is no reason to block other teams or raise concern if a namespace takes longer to delete; this can happen for known Kubernetes reasons. We should not send a bad signal to Sippy or block other teams because of it.

camilamacedo86

Thank you for looking into that.
But to address the flake:

See: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-[…]ly-4.21-e2e-azure-ovn-runc-techpreview/1980106371709800448
fail [/build/openshift/tests-extension/pkg/helpers/cluster_extension.go:185]: Timed out after 60.039s.
Cleanup ClusterExtension install-webhook-bothns-ownns-ce-tz9c failed to delete
Expected
    <bool>: false
to be true

We should:
-> Not wait for the deletion of the CE
-> Warn, but not fail

Note that Kubernetes can, for many reasons, take longer to remove resources, and that is normal.
We no longer have a SERIAL test, so each scenario can run in parallel and is fully isolated.
That means that if the ClusterExtension (CE) is not removed right away, it should not impact any other test.

Therefore, we should not risk sending a bad signal to Sippy or blocking other teams because of it.

kuiwang02 · 2025-10-21T09:12:20Z

/close

openshift-ci · 2025-10-21T09:12:38Z

@kuiwang02: Closed this PR.

UPSTREAM: <carry>: Fix: Race condition in ClusterExtension cleanup timeout for singleownnamespace tests

fad0bdf

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Oct 21, 2025

openshift-ci Bot requested review from camilamacedo86 and perdasilva October 21, 2025 08:09

camilamacedo86 reviewed Oct 21, 2025

View reviewed changes

camilamacedo86 suggested changes Oct 21, 2025

View reviewed changes

openshift-ci Bot assigned camilamacedo86 Oct 21, 2025

openshift-ci Bot closed this Oct 21, 2025
