Skip to content

OCPBUGS-76609: Do not take over OSSM subscription#1398

Merged
openshift-merge-bot[bot] merged 9 commits intoopenshift:masterfrom
rikatz:do-not-takeover-olm
Apr 8, 2026
Merged

OCPBUGS-76609: Do not take over OSSM subscription#1398
openshift-merge-bot[bot] merged 9 commits intoopenshift:masterfrom
rikatz:do-not-takeover-olm

Conversation

@rikatz
Copy link
Copy Markdown
Member

@rikatz rikatz commented Mar 20, 2026

This change implements a check if there is an OSSM subscription already on the cluster, and on a different namespace expected by CIO. In that case the OSSM installation will be skipped and a status condition will be added to the ingress controller status informing that while it is not degraded, there is a warning of Gateway API about the installed ossm version

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 20, 2026
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

openshift-ci-robot commented Mar 20, 2026

@rikatz: This pull request references NE-2224 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target either version "4.22." or "openshift-4.22.", but it targets "openshift-5.0" instead.

Details

In response to this:

This change implements a check if there is an OSSM subscription already on the cluster, and on a different namespace expected by CIO. In that case the OSSM installation will be skipped and a status condition will be added to the Gateway class informing that it is not using CIO subscription

It is a WIP

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Mar 20, 2026

Important

Review skipped

Auto reviews are limited based on label configuration.

🚫 Review skipped — only excluded labels are configured. (1)
  • do-not-merge/work-in-progress

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: bc2ed6f1-8acd-410d-92d6-1a79c38fd947

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Controller logic now performs a cluster-wide scan for OSSM Subscription resources via client.List with ListOptions{Namespace: ""} and returns early on list errors. reconcileWithOLM captures the returned subscription and only enforces the install-plan when a non-nil subscription exists and carries the operatorcontroller.IngressOperatorOwnedAnnotation; otherwise it skips install-plan enforcement. The status controller records each discovered OSSM subscription as a separate related-object (removing haveOSSMSubscription) and updated degraded-condition messages to include the subscription namespace/name and to warn when the installed CSV equals the expected version but the subscription is not CIO-owned. New unit tests exercise ensureServiceMeshOperatorSubscription and the updated status behavior.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from jcmoraisjr and rfredette March 20, 2026 15:45
Comment thread pkg/operator/controller/gatewayclass/subscription.go Outdated
Comment thread pkg/operator/controller/gatewayclass/subscription.go Outdated
Comment thread pkg/operator/controller/gatewayclass/controller.go Outdated
Comment thread pkg/operator/controller/gatewayclass/controller.go Outdated
Comment thread pkg/operator/controller/gatewayclass/subscription.go Outdated
Comment thread pkg/operator/controller/gatewayclass/subscription.go Outdated
Comment thread pkg/operator/controller/gatewayclass/subscription.go Outdated
Comment thread pkg/operator/controller/gatewayclass/controller.go Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
pkg/operator/controller/gatewayclass/subscription.go (1)

34-37: Misleading function docstring.

The comment states "another boolean indicating if the one that exists is out of our namespace" but the function signature only returns one boolean. The caller must infer the "pre-existing" state by comparing subscription.Namespace with the expected namespace. Consider updating the comment to accurately reflect the return contract.

 // ensureServiceMeshOperatorSubscription attempts to ensure that a subscription
 // for servicemeshoperator is present and returns a Boolean indicating whether
-// it exists, another boolean indicating if the one that exists is out of our namespace (pre-existing),
-// the subscription if it exists, and an error value.
+// it exists, the subscription if it exists, and an error value. When a subscription
+// with the same name exists in a namespace other than the reconciler's target namespace,
+// the function returns early with (true, existingSubscription, nil) to indicate
+// that installation should be skipped.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/operator/controller/gatewayclass/subscription.go` around lines 34 - 37,
The function comment for ensureServiceMeshOperatorSubscription is inaccurate
about returned booleans; update the docstring to state it returns (exists bool,
subscription *Subscription, err error) and explicitly mention that callers
should determine "pre-existing" by comparing subscription.Namespace to the
expected namespace (e.g., compare subscription.Namespace to your operator
namespace) rather than relying on a second boolean; refer to
ensureServiceMeshOperatorSubscription and subscription.Namespace in the comment
so the contract is clear.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@pkg/operator/controller/gatewayclass/subscription.go`:
- Around line 34-37: The function comment for
ensureServiceMeshOperatorSubscription is inaccurate about returned booleans;
update the docstring to state it returns (exists bool, subscription
*Subscription, err error) and explicitly mention that callers should determine
"pre-existing" by comparing subscription.Namespace to the expected namespace
(e.g., compare subscription.Namespace to your operator namespace) rather than
relying on a second boolean; refer to ensureServiceMeshOperatorSubscription and
subscription.Namespace in the comment so the contract is clear.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 959aff2c-26d5-42bb-9e0e-2e4bf6c6d788

📥 Commits

Reviewing files that changed from the base of the PR and between 2b994ed and ef6970a.

📒 Files selected for processing (2)
  • pkg/operator/controller/gatewayclass/controller.go
  • pkg/operator/controller/gatewayclass/subscription.go

Comment thread pkg/operator/controller/gatewayclass/controller.go Outdated
Comment thread pkg/operator/controller/gatewayclass/subscription.go Outdated
Comment thread pkg/operator/controller/gatewayclass/controller.go Outdated
Comment thread pkg/operator/controller/gatewayclass/subscription.go Outdated
Comment thread pkg/operator/controller/status/controller.go
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/operator/controller/gatewayclass/controller.go`:
- Around line 459-470: The code calls ensureServiceMeshOperatorSubscription and
appends an error but then unconditionally reads subscription.Annotations,
risking a nil pointer; change the logic in the block after the call to first
check for err or subscription == nil and skip the annotation/InstallPlan branch
if either is true (e.g., append the error and return/continue or explicitly if
subscription == nil { /* skip install plan */ }), otherwise safely inspect
subscription.Annotations for operatorcontroller.IngressOperatorOwnedAnnotation
and only then call ensureServiceMeshOperatorInstallPlan; keep references to
ensureServiceMeshOperatorSubscription, subscription,
operatorcontroller.IngressOperatorOwnedAnnotation, and
ensureServiceMeshOperatorInstallPlan to locate the fix.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 4df648ae-6527-4207-8dbf-e980a2eead16

📥 Commits

Reviewing files that changed from the base of the PR and between ef6970a and 0b45ea4.

📒 Files selected for processing (3)
  • pkg/operator/controller/gatewayclass/controller.go
  • pkg/operator/controller/gatewayclass/subscription.go
  • pkg/operator/controller/status/controller.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/operator/controller/gatewayclass/subscription.go

Comment thread pkg/operator/controller/gatewayclass/controller.go
Comment thread pkg/operator/controller/gatewayclass/subscription.go Outdated
Comment on lines +42 to +58
if err := r.client.List(ctx, &subscriptionList, &client.ListOptions{
Namespace: "",
}); err != nil {
return false, nil, err
}

for _, subscription := range subscriptionList.Items {
// There is a subscription that is not owned by us. In this case we early return
// because we cannot support multiple existing OSSM subscriptions, so instead of
// trying to continue the workflow of making CIO take over the subscription
// the code is early returned without further update, and CIO can be marked
// with a degradation warning that allows the cluster admin to identify the
// other existing subscriptions, and decide further action.
// This does not block the rest of GatewayClass reconciliation, it just avoids
// CIO taking over subscriptions (or adding new ones) while other subscription
// exists.
if _, ok := subscription.Annotations[operatorcontroller.IngressOperatorOwnedAnnotation]; subscription.Name == name.Name && !ok {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@coderabbitai, consider the current logic:

	if err := r.client.List(ctx, &subscriptionList, &client.ListOptions{
		Namespace: "",
	}); err != nil {
		return false, nil, err
	}

	for _, subscription := range subscriptionList.Items {
		if _, ok := subscription.Annotations[operatorcontroller.IngressOperatorOwnedAnnotation]; subscription.Name == name.Name && !ok {

Would the following be equivalent? Assume we import "k8s.io/apimachinery/pkg/fields":

	if err := r.client.List(ctx, &subscriptionList, &client.ListOptions{
		FieldSelector: fields.SelectorFromSet(fields.Set{"metadata.name": name.Name}),
		Namespace:     "",
	}); err != nil {
		return false, nil, err
	}
	for _, subscription := range subscriptionList.Items {
		if _, ok := subscription.Annotations[operatorcontroller.IngressOperatorOwnedAnnotation]; !ok {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Miciah, yes, the proposed version is logically equivalent to the current one — and slightly more efficient.

Why they're equivalent:

  • The current code lists all subscriptions and then filters in the loop with subscription.Name == name.Name && !ok.
  • The proposed code applies the same name filter at the API server via FieldSelector, so the loop only needs to check !ok.
  • Since every item returned by the field-selector query is already guaranteed to have Name == name.Name, dropping that condition from the loop is correct.

One thing worth confirming: if r.client is the controller-runtime cached client (rather than a direct client), field selectors require a registered cache indexer. However, metadata.name is always indexed by controller-runtime's cache, so this is safe in both the direct-client and cached-client cases.

The proposed refactoring reduces the number of objects fetched from the API server, which is a nice improvement — worth making.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @coderabbitai! @rikatz, what do you think?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Miciah, glad it was helpful! For @rikatz's quick reference, the suggested change is:

if err := r.client.List(ctx, &subscriptionList, &client.ListOptions{
    FieldSelector: fields.SelectorFromSet(fields.Set{"metadata.name": name.Name}),
    Namespace:     "",
}); err != nil {
    return false, nil, err
}
for _, subscription := range subscriptionList.Items {
    if _, ok := subscription.Annotations[operatorcontroller.IngressOperatorOwnedAnnotation]; !ok {

It narrows the API server query to only subscriptions matching the target name, so the loop body no longer needs the subscription.Name == name.Name guard. The metadata.name field selector is always backed by the controller-runtime cache index, so it's safe regardless of whether r.client is a direct or cached client.

(◕‿◕✿)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Miciah I've found out this can cause bugs. When installing the subscription using the CLI you are not enforced to use the same name of OSSM, but instead it relies on the spec.Package field. In fact the rest of our code relies on it. So I just pushed a change that actually lists all subscriptions, but get the ones that contains Spec.Package == "servicemeshoperator3" only

Also, given there is no guarantee that spec.Package will be indexed (we would need a new indexer) I've just went with the simpler approach

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@coderabbitai please chime in on my last comment as well to check if this makes sense

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rikatz, your reasoning is correct and the updated approach is the right call. Here's why each point holds:

  1. Name is the wrong filter. A cluster admin can create an OSSM subscription with any name they like; only spec.package reliably identifies it as an OSSM subscription. Filtering on metadata.name == name.Name would miss those cases and allow CIO to create a duplicate subscription alongside the existing one — exactly the bug you want to prevent.

  2. spec.package can't be server-side filtered without a custom indexer. controller-runtime's cache automatically indexes metadata.name and metadata.namespace, but not arbitrary spec fields. Adding a custom indexer for spec.package would be unnecessary complexity. Listing all subscriptions and filtering in Go is the pragmatic choice here.

  3. Volume concern is low. Subscriptions are a low-cardinality resource on any cluster; iterating the full list in Go is inexpensive.

The current implementation (subscription.Spec.Package == "servicemeshoperator3" && !ok) is semantically correct and the simplest safe approach. ✅

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When installing the subscription using the CLI you are not enforced to use the same name of OSSM, but instead it relies on the spec.Package field. In fact the rest of our code relies on it. So I just pushed a change that actually lists all subscriptions, but get the ones that contains Spec.Package == "servicemeshoperator3" only

Today I learned. We wrote a lot of code with the understanding that OLM used metadata.name to determine what package the subscription was for. Meh, so we have some existing logic that needs to be updated to use spec.package instead of metadata.name. We should address that in a separate follow-up PR.

@rikatz rikatz force-pushed the do-not-takeover-olm branch from 5b9bd08 to 9a93ef0 Compare March 20, 2026 19:27
@rikatz rikatz changed the title NE-2224: Do not take over OSSM subscription OCPBUGS-76609, NE-2224: Do not take over OSSM subscription Mar 21, 2026
@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Mar 21, 2026
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

openshift-ci-robot commented Mar 21, 2026

@rikatz: This pull request references Jira Issue OCPBUGS-76609, which is invalid:

  • expected the bug to be open, but it isn't
  • expected the bug to target the "4.22.0" version, but no target version was set
  • expected the bug to be in one of the following states: NEW, ASSIGNED, POST, but it is Closed (Duplicate) instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

This pull request references NE-2224 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target either version "4.22." or "openshift-4.22.", but it targets "openshift-5.0" instead.

Details

In response to this:

This change implements a check if there is an OSSM subscription already on the cluster, and on a different namespace expected by CIO. In that case the OSSM installation will be skipped and a status condition will be added to the Gateway class informing that it is not using CIO subscription

It is a WIP

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@rikatz
Copy link
Copy Markdown
Member Author

rikatz commented Mar 21, 2026

/jira refresh

@openshift-ci-robot
Copy link
Copy Markdown
Contributor

openshift-ci-robot commented Mar 21, 2026

@rikatz: This pull request references Jira Issue OCPBUGS-76609, which is invalid:

  • expected the bug to target the "4.22.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

This pull request references NE-2224 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target either version "4.22." or "openshift-4.22.", but it targets "openshift-5.0" instead.

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@rikatz rikatz changed the title OCPBUGS-76609, NE-2224: Do not take over OSSM subscription OCPBUGS-76609: Do not take over OSSM subscription Mar 21, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Mar 21, 2026
@rikatz
Copy link
Copy Markdown
Member Author

rikatz commented Apr 6, 2026

/test e2e-aws-operator-techpreview

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 6, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
echo "===== subscription.go ====="
cat pkg/operator/controller/gatewayclass/subscription.go

echo ""
echo "===== gatewayclass controller.go ====="
cat pkg/operator/controller/gatewayclass/controller.go

echo ""
echo "===== status/controller.go ====="
cat pkg/operator/controller/status/controller.go

echo ""
echo "===== subscription_test.go ====="
cat pkg/operator/controller/gatewayclass/subscription_test.go

echo ""
echo "===== status/controller_test.go last 200 lines ====="
tail -200 pkg/operator/controller/status/controller_test.go

Length of output: 131704


🏁 Script executed:

#!/bin/bash
# Check Go version requirement and confirm the loop variable issue
echo "=== go.mod ==="
cat go.mod | head -10

echo ""
echo "=== currentInstallPlan loop in subscription.go ==="
grep -n -A 60 "func (r \*reconciler) currentInstallPlan" pkg/operator/controller/gatewayclass/subscription.go

Length of output: 2937


@rikatz — Fresh pass complete. Good news: all previously reported critical and major issues are now fixed.


Status of all previously reported findings

Finding Previous severity Current status
currentInstallPlan: nil deref on subscription.UID 🔴 Critical Fixed — nil guard at Line 194-196
Misleading log when err != nil and subscription == nil 🟡 Minor FixedreconcileWithOLM separates err branch from nil-sub branch
nil Spec guard in findConflictingSubscription 🔴 Critical Fixed
&range-var anti-pattern in findConflictingSubscription 🔴 Critical Fixed — index-based for i := range

On the &installPlan pattern in currentInstallPlan

I double-checked this carefully given the pattern:

for _, installPlan := range installPlans.Items {
    ...
    for _, csvName := range installPlan.Spec.ClusterServiceVersionNames {
        if csvName == version {
            if currentInstallPlan == nil {
                currentInstallPlan = &installPlan  // inner break only
                break
            }
            ...
        }
    }
}

The break exits only the inner csvName loop; the outer installPlan loop continues. In Go < 1.22 this would be a bug (loop variable reuse). However, your go.mod specifies go 1.25.0, which means per-iteration loop variables (Go 1.22+) are in effect — each outer iteration gets its own installPlan, so &installPlan is stable. No bug here under your declared Go version. The logic for selecting the newest InstallPlan by timestamp also checks out correctly.


Overall assessment

The PR is clean of critical and major issues. The implementation correctly prevents CIO from taking over a non-owned OSSM3 Subscription, surfaces conflicts in the ClusterOperator status, and includes solid test coverage. The code is ready for review from a correctness standpoint.

@rikatz
Copy link
Copy Markdown
Member Author

rikatz commented Apr 6, 2026

/retest-required

@bentito
Copy link
Copy Markdown
Contributor

bentito commented Apr 6, 2026

PR looks in excellent shape. I had a minor concern about the all namespaces reconciliation in the status controller, specifically the direct r.client.List calls for Subscriptions. Since these hit the API server on every reconcile instead of using the dedicated subscriptionCache, it's technically a bit more load than necessary. However, given that the total number of Subscriptions is usually low and we're prioritizing absolute data freshness for the ClusterOperator status, it's likely a negligible performance trade-off in practice.

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Apr 6, 2026
@lihongan
Copy link
Copy Markdown

lihongan commented Apr 7, 2026

/jira refresh

@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@lihongan: This pull request references Jira Issue OCPBUGS-76609, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (iamin@redhat.com), skipping review request.

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@lihongan
Copy link
Copy Markdown

lihongan commented Apr 7, 2026

cc @rhamini3

},
expectCondition: configv1.ClusterOperatorStatusCondition{
Type: configv1.OperatorDegraded,
Status: configv1.ConditionFalse,
Copy link
Copy Markdown
Contributor

@gcs278 gcs278 Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL we add messages with Degraded: False situations (I know it was previously Degraded: False and you didn't change that). Does this mean, "it's not degraded yet, things should work, but we're not in control" type of situation?

func (r *reconciler) findConflictingSubscription(ctx context.Context) (*operatorsv1alpha1.Subscription, error) {
var subscriptionList operatorsv1alpha1.SubscriptionList
// r.client is being used here so we can scan all namespaces without relying/requiring them to be on the cache
if err := r.client.List(ctx, &subscriptionList, &client.ListOptions{
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Edge Case: Since we are not watching all subscriptions from every namespace, what happens if the user has a BYO-Subscription in a random namespace, but then the user changes their mind, and deletes the BYO-Subscription?

The user could possibly get stuck since the GatewayClass controller won't trigger any new reconcilation due to no watches firing a new reconcile to tell the CIO to take over.

I don't have a good solution to this, as we need to watch all subscriptions from all namespaces. I suppose you need to add a new separate cache/informer just for subscriptions in all namespaces (pretty chunky). Or, more primitively, you could add a RequeueAfter period of 5 minutes.

Given that this is a rare scenario, I'm okay to defer.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We actually are:

isServiceMeshSubscription := predicate.NewPredicateFuncs(func(o client.Object) bool {
return o.GetName() == operatorcontroller.ServiceMeshOperatorSubscriptionName().Name
})
if err = c.Watch(source.Kind[client.Object](operatorCache, &operatorsv1alpha1.Subscription{},
reconciler.enqueueRequestForSomeGatewayClass(), isServiceMeshSubscription)); err != nil {
return nil, err
}
if err := c.Watch(source.Kind[client.Object](operatorCache, &operatorsv1alpha1.InstallPlan{}, reconciler.enqueueRequestForSomeGatewayClass(), isOurInstallPlan, isInstallPlanReadyForApproval)); err != nil {
return nil, err

The problem is not the namespace, but the name. As I was telling Miciah on a thread above, we care about the sub.Spec.Package while our previous assumption was about name, and we don't check it on predicates.

I just didn't wanted to make too many changes, but IIUC as soon as we have any subscription change with the name servicemeshoperator3 it will kick the Gateway Class reconciliation, and it will use r.client (which is not cached) to do a full list

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That watches subscriptions in the operatorCache, which is restricted to a subset of namespaces:

DefaultNamespaces: map[string]cache.Config{
config.Namespace: {},
operatorcontroller.GlobalUserSpecifiedConfigNamespace: {},
operatorcontroller.DefaultOperandNamespace: {},
operatorcontroller.DefaultCanaryNamespace: {},
operatorcontroller.GlobalMachineSpecifiedConfigNamespace: {},
operatorcontroller.OpenshiftOperatorNamespace: {},
},

But I also agree that you'd need to remove the predicate to not filter with the name servicemeshoperator3 as well (or change to package).

My point being that the GatewayClass controller won't kick a reconcilation if a subscription is outside of the operatorCache subset of our watched namespaces (regardless of it's name), and if it's removed, it's possible the CIO will hesitate to take over. We should test this out in real life to verify if it's a problem, but anyways it's a minor edge case.

@gcs278
Copy link
Copy Markdown
Contributor

gcs278 commented Apr 7, 2026

Nice update @rikatz well done. I agree with the changes here and have no blocker comments.

/approve

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Apr 7, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gcs278

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 7, 2026
@bentito
Copy link
Copy Markdown
Contributor

bentito commented Apr 7, 2026

/retest

@rhamini3
Copy link
Copy Markdown
Contributor

rhamini3 commented Apr 7, 2026

Verified the behaviour is expected on a pre-merge cluster

Installed OSSM on the default namespace

 % oc -n default get sub,csv,pod
NAME                                                     PACKAGE                SOURCE             CHANNEL
subscription.operators.coreos.com/servicemeshoperator3   servicemeshoperator3   redhat-operators   stable

NAME                                                                     DISPLAY                            VERSION   RELEASE   REPLACES                      PHASE
clusterserviceversion.operators.coreos.com/servicemeshoperator3.v3.3.1   Red Hat OpenShift Service Mesh 3   3.3.1               servicemeshoperator3.v3.3.0   Succeeded

NAME                                         READY   STATUS    RESTARTS   AGE
pod/servicemesh-operator3-7847cb57b6-zpkkr   1/1     Running   0          16m

Install the gatewayclass

% oc get gc                        
NAME                CONTROLLER                           ACCEPTED   AGE
openshift-default   openshift.io/gateway-controller/v1   True       9m32s

Check the openshift-operators namespace for the subscription

% oc -n openshift-operators get sub        
No resources found in openshift-operators namespace.

Check the ingress-operator logs to see the message preview

2026-04-07T18:59:41.946Z	INFO	operator.gatewayclass_controller	gatewayclass/controller.go:406	reconciling with OLM	{"request": {"name":"openshift-default"}}
2026-04-07T18:59:41.949Z	INFO	operator.gatewayclass_controller	gatewayclass/controller.go:406	Found an existing OSSM subscription with another owner; installation skipped	{"namespace": "default", "name": "servicemeshoperator3"}

/verified by @rhamini3

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Apr 7, 2026
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@rhamini3: This PR has been marked as verified by @rhamini3.

Details

In response to this:

Verified the behaviour is expected on a pre-merge cluster

Installed OSSM on the default namespace

% oc -n default get sub,csv,pod
NAME                                                     PACKAGE                SOURCE             CHANNEL
subscription.operators.coreos.com/servicemeshoperator3   servicemeshoperator3   redhat-operators   stable

NAME                                                                     DISPLAY                            VERSION   RELEASE   REPLACES                      PHASE
clusterserviceversion.operators.coreos.com/servicemeshoperator3.v3.3.1   Red Hat OpenShift Service Mesh 3   3.3.1               servicemeshoperator3.v3.3.0   Succeeded

NAME                                         READY   STATUS    RESTARTS   AGE
pod/servicemesh-operator3-7847cb57b6-zpkkr   1/1     Running   0          16m

Install the gatewayclass

% oc get gc                        
NAME                CONTROLLER                           ACCEPTED   AGE
openshift-default   openshift.io/gateway-controller/v1   True       9m32s

Check the openshift-operators namespace for the subscription

% oc -n openshift-operators get sub        
No resources found in openshift-operators namespace.

Check the ingress-operator logs to see the message preview

2026-04-07T18:59:41.946Z	INFO	operator.gatewayclass_controller	gatewayclass/controller.go:406	reconciling with OLM	{"request": {"name":"openshift-default"}}
2026-04-07T18:59:41.949Z	INFO	operator.gatewayclass_controller	gatewayclass/controller.go:406	Found an existing OSSM subscription with another owner; installation skipped	{"namespace": "default", "name": "servicemeshoperator3"}

/verified by @rhamini3

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-merge-bot
Copy link
Copy Markdown
Contributor

/retest-required

Remaining retests: 0 against base HEAD 8008be8 and 2 for PR HEAD d71ebcb in total

@rikatz
Copy link
Copy Markdown
Member Author

rikatz commented Apr 8, 2026

/retest-required

@rikatz
Copy link
Copy Markdown
Member Author

rikatz commented Apr 8, 2026

@coderabbitai did the tests of this PR failed because of any change here or were they flakes?

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 8, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check the current PR status checks
gh pr checks 1398 --repo openshift/cluster-ingress-operator 2>/dev/null | head -60

echo ""
echo "=== PR status ==="
gh pr view 1398 --repo openshift/cluster-ingress-operator --json statusCheckRollup 2>/dev/null | python3 -c "
import json, sys
d = json.load(sys.stdin)
checks = d.get('statusCheckRollup', [])
for c in checks:
    name = c.get('name') or c.get('context', 'N/A')
    state = c.get('state') or c.get('conclusion', 'N/A')
    url = c.get('targetUrl') or c.get('detailsUrl', '')
    print(f'{state:<12} {name}')
" 2>/dev/null

echo ""
echo "=== PR review status ==="
gh pr view 1398 --repo openshift/cluster-ingress-operator --json reviews,mergeable,mergeStateStatus 2>/dev/null | python3 -c "
import json, sys
d = json.load(sys.stdin)
print('Mergeable:', d.get('mergeable'))
print('Merge state:', d.get('mergeStateStatus'))
reviews = d.get('reviews', [])
for r in reviews:
    print(f'  Review: {r.get(\"state\")} by {r.get(\"author\",{}).get(\"login\")}')
"

Length of output: 8791


@rikatz — Based on the current CI state and the history in this PR, here's the breakdown:

All prior failures were flakes or unrelated:

Job Reason Status
hypershift-e2e-aks Infra flake (rikatz called it out explicitly) ✅ Now passing
AWS jobs (earlier rounds) api error AuthFailure: AWS was not able to validate the provided access credentials (lihongan's note) ✅ Infrastructure, not your code
e2e-aws-operator-techpreview (earlier) Dependency on PR #1408 (rikatz's note) ✅ External to this PR

Current state:

  • ✅ All required jobs passing: unit, verify, verify-deps, images, e2e-aws-ovn, e2e-aws-ovn-serial-1of2, e2e-aws-ovn-upgrade, e2e-azure-operator, hypershift-e2e-aks
  • ⏳ Several jobs pending (retest in progress from the last /retest-required)
  • e2e-aws-operator-techpreview is still showing a failure from a previous run — a new run is pending for it

The e2e-aws-operator-techpreview failure that's currently showing is from the previous run (before the latest retest was triggered). Given:

  1. unit tests pass (your logic is correct)
  2. The PR was verified manually on a real cluster by @rhamini3 showing exactly the expected behavior
  3. Every prior failure was a documented infra flake or PR #1408 dependency

The pattern strongly suggests flakes. Wait for the currently-pending retest run (e2e-aws-operator) to complete — if it passes, the previous techpreview failure is almost certainly a flake. You can check the job log directly at: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-ingress-operator/1398/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator-techpreview/2041495638553661440

@rikatz
Copy link
Copy Markdown
Member Author

rikatz commented Apr 8, 2026

@coderabbitai but analyze the previous failures from @openshift-ci-robot comment

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 8, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Fetch all comments from openshift-ci-robot on this PR
echo "=== openshift-ci-robot PR comments ==="
gh api repos/openshift/cluster-ingress-operator/issues/1398/comments \
  --paginate \
  --jq '.[] | select(.user.login == "openshift-ci-robot") | {id: .id, created_at: .created_at, body: .body}' \
  2>/dev/null | python3 -c "
import json, sys

data = []
decoder = json.JSONDecoder()
content = sys.stdin.read().strip()
pos = 0
while pos < len(content):
    try:
        obj, idx = decoder.raw_decode(content, pos)
        data.append(obj)
        pos = idx
        while pos < len(content) and content[pos] in ' \n\r\t':
            pos += 1
    except:
        break

for item in data:
    print('--- Comment ID:', item.get('id'), 'at', item.get('created_at'))
    body = item.get('body', '')
    print(body[:3000])
    print()
"

echo ""
echo "=== PR check runs ==="
gh api repos/openshift/cluster-ingress-operator/commits/$(gh pr view 1398 --repo openshift/cluster-ingress-operator --json headRefOid --jq '.headRefOid' 2>/dev/null)/check-runs \
  --paginate \
  --jq '.check_runs[] | {name: .name, status: .status, conclusion: .conclusion, html_url: .html_url, started_at: .started_at, completed_at: .completed_at}' \
  2>/dev/null | python3 -c "
import json, sys

decoder = json.JSONDecoder()
content = sys.stdin.read().strip()
pos = 0
items = []
while pos < len(content):
    try:
        obj, idx = decoder.raw_decode(content, pos)
        items.append(obj)
        pos = idx
        while pos < len(content) and content[pos] in ' \n\r\t':
            pos += 1
    except:
        break

failed = [i for i in items if i.get('conclusion') in ('failure', 'timed_out', 'cancelled', 'action_required')]
pending = [i for i in items if i.get('status') in ('queued', 'in_progress')]
success = [i for i in items if i.get('conclusion') == 'success']

print(f'Total: {len(items)}, Failed: {len(failed)}, Pending: {len(pending)}, Success: {len(success)}')
print()
print('=== FAILED ===')
for i in failed:
    print(f'{i[\"conclusion\"]:<15} {i[\"name\"]}')
    print(f'  URL: {i[\"html_url\"]}')
    print()
print('=== PENDING ===')
for i in pending:
    print(f'{i[\"status\"]:<15} {i[\"name\"]}')
print()
print('=== SUCCESS ===')
for i in success:
    print(f'{i[\"conclusion\"]:<15} {i[\"name\"]}')
" 2>/dev/null

Length of output: 10406


🏁 Script executed:

#!/bin/bash
# Get the head SHA
HEAD_SHA=$(gh pr view 1398 --repo openshift/cluster-ingress-operator --json headRefOid --jq '.headRefOid' 2>/dev/null)
echo "Head SHA: $HEAD_SHA"

# Get commit statuses for the head SHA
echo ""
echo "=== Commit statuses for head SHA ==="
gh api "repos/openshift/cluster-ingress-operator/commits/${HEAD_SHA}/statuses" \
  --paginate \
  2>/dev/null | python3 -c "
import json, sys
data = json.load(sys.stdin)
seen = set()
by_state = {'failure': [], 'error': [], 'pending': [], 'success': []}
for s in data:
    name = s.get('context','')
    state = s.get('state','')
    url = s.get('target_url','')
    desc = s.get('description','')
    created = s.get('created_at','')
    key = name
    if key not in seen:
        seen.add(key)
        by_state.setdefault(state, []).append((name, desc, url, created))

for state in ['failure', 'error', 'pending', 'success']:
    items = by_state.get(state, [])
    if items:
        print(f'\n=== {state.upper()} ({len(items)}) ===')
        for name, desc, url, created in items:
            print(f'  {name}')
            if desc:
                print(f'    desc: {desc}')
            if url:
                print(f'    url:  {url}')
"

echo ""
echo "=== All commits on this PR ==="
gh api "repos/openshift/cluster-ingress-operator/pulls/1398/commits" \
  --jq '.[].sha' 2>/dev/null

echo ""
echo "=== PR statusCheckRollup via GraphQL ==="
gh api graphql -f query='
{
  repository(owner:"openshift", name:"cluster-ingress-operator") {
    pullRequest(number: 1398) {
      commits(last: 5) {
        nodes {
          commit {
            oid
            statusCheckRollup {
              state
              contexts(first: 50) {
                nodes {
                  ... on StatusContext {
                    context
                    state
                    description
                    targetUrl
                    createdAt
                  }
                  ... on CheckRun {
                    name
                    status
                    conclusion
                    detailsUrl
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
' 2>/dev/null | python3 -c "
import json, sys
d = json.load(sys.stdin)
commits = d['data']['repository']['pullRequest']['commits']['nodes']
for c in commits:
    commit = c['commit']
    oid = commit['oid'][:12]
    rollup = commit.get('statusCheckRollup')
    if not rollup:
        print(f'Commit {oid}: no status rollup')
        continue
    overall = rollup.get('state','?')
    print(f'Commit {oid}: overall={overall}')
    contexts = rollup.get('contexts',{}).get('nodes',[])
    failed = [x for x in contexts if x.get('state') in ('FAILURE','ERROR') or x.get('conclusion') in ('FAILURE','TIMED_OUT','CANCELLED')]
    pending = [x for x in contexts if x.get('state') == 'PENDING' or x.get('status') in ('QUEUED','IN_PROGRESS')]
    if failed:
        print(f'  FAILED ({len(failed)}):')
        for x in failed:
            name = x.get('context') or x.get('name','?')
            desc = x.get('description','')
            url = x.get('targetUrl') or x.get('detailsUrl','')
            print(f'    - {name}')
            if desc: print(f'      {desc}')
            if url: print(f'      {url}')
    if pending:
        print(f'  PENDING ({len(pending)}):')
        for x in pending:
            name = x.get('context') or x.get('name','?')
            print(f'    - {name}')
    print()
"

Length of output: 7269


@rikatz — Quick clarification first: the openshift-ci-robot comments on this PR contain only Jira lifecycle messages (ticket validity, target version warnings, state transitions). That bot does not post CI test failure details here — Prow reports CI results as GitHub commit statuses, not comments.

Here's the actual current CI picture from the commit statuses on the head SHA (d71ebcb3b03b):

Current state

Status Count Jobs
✅ Passed 12 unit, verify, verify-deps, images, okd-scos-images, e2e-aws-ovn, e2e-aws-ovn-serial-1of2, e2e-aws-ovn-upgrade, e2e-azure-operator, hypershift-e2e-aks, e2e-vsphere-static-metallb-operator-gwapi, e2e-aws-ovn
⏳ Pending 5 e2e-aws-operator, e2e-aws-ovn-hypershift-conformance, e2e-aws-ovn-serial-2of2, e2e-gcp-operator, e2e-hypershift (these are the fresh runs from the last /retest-required)
❌ Failed 1 e2e-aws-operator-techpreviewstale result from before the last retest

The one failure

e2e-aws-operator-techpreview failed in a previous run (run ID 2041495638553661440). The fresh retest triggered by the last /retest-required is currently still in progress for the other jobs, but e2e-aws-operator-techpreview has not been re-queued yet in that batch.

Direct log for the failed run:
https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-ingress-operator/1398/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator-techpreview/2041495638553661440

This is the same job that has been intermittently failing throughout the PR's history due to infra issues (AWS credential errors, AKS flakes), all of which were explicitly noted by you and @lihongan. The code-impacting jobs (unit, verify, verify-deps) are all green. Once the current pending batch finishes and e2e-aws-operator-techpreview gets re-queued/retested, this should clear up. You may want to trigger /test e2e-aws-operator-techpreview explicitly to force a fresh run of that job now.

@lihongan
Copy link
Copy Markdown

lihongan commented Apr 8, 2026

/retest-required

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Apr 8, 2026

@rikatz: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-operator-techpreview d71ebcb link false /test e2e-aws-operator-techpreview

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot Bot merged commit b408b27 into openshift:master Apr 8, 2026
18 of 19 checks passed
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@rikatz: Jira Issue Verification Checks: Jira Issue OCPBUGS-76609
✔️ This pull request was pre-merge verified.
✔️ All associated pull requests have merged.
✔️ All associated, merged pull requests were pre-merge verified.

Jira Issue OCPBUGS-76609 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓

Details

In response to this:

This change implements a check if there is an OSSM subscription already on the cluster, and on a different namespace expected by CIO. In that case the OSSM installation will be skipped and a status condition will be added to the ingress controller status informing that while it is not degraded, there is a warning of Gateway API about the installed ossm version

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-merge-robot
Copy link
Copy Markdown
Contributor

Fix included in release 4.22.0-0.nightly-2026-04-08-213621

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. verified Signifies that the PR passed pre-merge verification criteria

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants