OCPBUGS-76609: Do not take over OSSM subscription#1398
OCPBUGS-76609: Do not take over OSSM subscription#1398openshift-merge-bot[bot] merged 9 commits intoopenshift:masterfrom
Conversation
|
@rikatz: This pull request references NE-2224 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target either version "4.22." or "openshift-4.22.", but it targets "openshift-5.0" instead. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Important Review skippedAuto reviews are limited based on label configuration. 🚫 Review skipped — only excluded labels are configured. (1)
Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Repository YAML (base), Organization UI (inherited) Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughController logic now performs a cluster-wide scan for OSSM ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
pkg/operator/controller/gatewayclass/subscription.go (1)
34-37: Misleading function docstring.The comment states "another boolean indicating if the one that exists is out of our namespace" but the function signature only returns one boolean. The caller must infer the "pre-existing" state by comparing
subscription.Namespacewith the expected namespace. Consider updating the comment to accurately reflect the return contract.// ensureServiceMeshOperatorSubscription attempts to ensure that a subscription // for servicemeshoperator is present and returns a Boolean indicating whether -// it exists, another boolean indicating if the one that exists is out of our namespace (pre-existing), -// the subscription if it exists, and an error value. +// it exists, the subscription if it exists, and an error value. When a subscription +// with the same name exists in a namespace other than the reconciler's target namespace, +// the function returns early with (true, existingSubscription, nil) to indicate +// that installation should be skipped.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/operator/controller/gatewayclass/subscription.go` around lines 34 - 37, The function comment for ensureServiceMeshOperatorSubscription is inaccurate about returned booleans; update the docstring to state it returns (exists bool, subscription *Subscription, err error) and explicitly mention that callers should determine "pre-existing" by comparing subscription.Namespace to the expected namespace (e.g., compare subscription.Namespace to your operator namespace) rather than relying on a second boolean; refer to ensureServiceMeshOperatorSubscription and subscription.Namespace in the comment so the contract is clear.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@pkg/operator/controller/gatewayclass/subscription.go`:
- Around line 34-37: The function comment for
ensureServiceMeshOperatorSubscription is inaccurate about returned booleans;
update the docstring to state it returns (exists bool, subscription
*Subscription, err error) and explicitly mention that callers should determine
"pre-existing" by comparing subscription.Namespace to the expected namespace
(e.g., compare subscription.Namespace to your operator namespace) rather than
relying on a second boolean; refer to ensureServiceMeshOperatorSubscription and
subscription.Namespace in the comment so the contract is clear.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 959aff2c-26d5-42bb-9e0e-2e4bf6c6d788
📒 Files selected for processing (2)
pkg/operator/controller/gatewayclass/controller.gopkg/operator/controller/gatewayclass/subscription.go
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/operator/controller/gatewayclass/controller.go`:
- Around line 459-470: The code calls ensureServiceMeshOperatorSubscription and
appends an error but then unconditionally reads subscription.Annotations,
risking a nil pointer; change the logic in the block after the call to first
check for err or subscription == nil and skip the annotation/InstallPlan branch
if either is true (e.g., append the error and return/continue or explicitly if
subscription == nil { /* skip install plan */ }), otherwise safely inspect
subscription.Annotations for operatorcontroller.IngressOperatorOwnedAnnotation
and only then call ensureServiceMeshOperatorInstallPlan; keep references to
ensureServiceMeshOperatorSubscription, subscription,
operatorcontroller.IngressOperatorOwnedAnnotation, and
ensureServiceMeshOperatorInstallPlan to locate the fix.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 4df648ae-6527-4207-8dbf-e980a2eead16
📒 Files selected for processing (3)
pkg/operator/controller/gatewayclass/controller.gopkg/operator/controller/gatewayclass/subscription.gopkg/operator/controller/status/controller.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/operator/controller/gatewayclass/subscription.go
| if err := r.client.List(ctx, &subscriptionList, &client.ListOptions{ | ||
| Namespace: "", | ||
| }); err != nil { | ||
| return false, nil, err | ||
| } | ||
|
|
||
| for _, subscription := range subscriptionList.Items { | ||
| // There is a subscription that is not owned by us. In this case we early return | ||
| // because we cannot support multiple existing OSSM subscriptions, so instead of | ||
| // trying to continue the workflow of making CIO take over the subscription | ||
| // the code is early returned without further update, and CIO can be marked | ||
| // with a degradation warning that allows the cluster admin to identify the | ||
| // other existing subscriptions, and decide further action. | ||
| // This does not block the rest of GatewayClass reconciliation, it just avoids | ||
| // CIO taking over subscriptions (or adding new ones) while other subscription | ||
| // exists. | ||
| if _, ok := subscription.Annotations[operatorcontroller.IngressOperatorOwnedAnnotation]; subscription.Name == name.Name && !ok { |
There was a problem hiding this comment.
@coderabbitai, consider the current logic:
if err := r.client.List(ctx, &subscriptionList, &client.ListOptions{
Namespace: "",
}); err != nil {
return false, nil, err
}
for _, subscription := range subscriptionList.Items {
if _, ok := subscription.Annotations[operatorcontroller.IngressOperatorOwnedAnnotation]; subscription.Name == name.Name && !ok {Would the following be equivalent? Assume we import "k8s.io/apimachinery/pkg/fields":
if err := r.client.List(ctx, &subscriptionList, &client.ListOptions{
FieldSelector: fields.SelectorFromSet(fields.Set{"metadata.name": name.Name}),
Namespace: "",
}); err != nil {
return false, nil, err
}
for _, subscription := range subscriptionList.Items {
if _, ok := subscription.Annotations[operatorcontroller.IngressOperatorOwnedAnnotation]; !ok {There was a problem hiding this comment.
@Miciah, yes, the proposed version is logically equivalent to the current one — and slightly more efficient.
Why they're equivalent:
- The current code lists all subscriptions and then filters in the loop with
subscription.Name == name.Name && !ok. - The proposed code applies the same name filter at the API server via
FieldSelector, so the loop only needs to check!ok. - Since every item returned by the field-selector query is already guaranteed to have
Name == name.Name, dropping that condition from the loop is correct.
One thing worth confirming: if r.client is the controller-runtime cached client (rather than a direct client), field selectors require a registered cache indexer. However, metadata.name is always indexed by controller-runtime's cache, so this is safe in both the direct-client and cached-client cases.
The proposed refactoring reduces the number of objects fetched from the API server, which is a nice improvement — worth making.
There was a problem hiding this comment.
Thanks, @coderabbitai! @rikatz, what do you think?
There was a problem hiding this comment.
@Miciah, glad it was helpful! For @rikatz's quick reference, the suggested change is:
if err := r.client.List(ctx, &subscriptionList, &client.ListOptions{
FieldSelector: fields.SelectorFromSet(fields.Set{"metadata.name": name.Name}),
Namespace: "",
}); err != nil {
return false, nil, err
}
for _, subscription := range subscriptionList.Items {
if _, ok := subscription.Annotations[operatorcontroller.IngressOperatorOwnedAnnotation]; !ok {It narrows the API server query to only subscriptions matching the target name, so the loop body no longer needs the subscription.Name == name.Name guard. The metadata.name field selector is always backed by the controller-runtime cache index, so it's safe regardless of whether r.client is a direct or cached client.
(◕‿◕✿)
There was a problem hiding this comment.
@Miciah I've found out this can cause bugs. When installing the subscription using the CLI you are not enforced to use the same name of OSSM, but instead it relies on the spec.Package field. In fact the rest of our code relies on it. So I just pushed a change that actually lists all subscriptions, but get the ones that contains Spec.Package == "servicemeshoperator3" only
Also, given there is no guarantee that spec.Package will be indexed (we would need a new indexer) I've just went with the simpler approach
There was a problem hiding this comment.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
There was a problem hiding this comment.
@coderabbitai please chime in on my last comment as well to check if this makes sense
There was a problem hiding this comment.
@rikatz, your reasoning is correct and the updated approach is the right call. Here's why each point holds:
-
Name is the wrong filter. A cluster admin can create an OSSM subscription with any name they like; only
spec.packagereliably identifies it as an OSSM subscription. Filtering onmetadata.name == name.Namewould miss those cases and allow CIO to create a duplicate subscription alongside the existing one — exactly the bug you want to prevent. -
spec.packagecan't be server-side filtered without a custom indexer. controller-runtime's cache automatically indexesmetadata.nameandmetadata.namespace, but not arbitrary spec fields. Adding a custom indexer forspec.packagewould be unnecessary complexity. Listing all subscriptions and filtering in Go is the pragmatic choice here. -
Volume concern is low. Subscriptions are a low-cardinality resource on any cluster; iterating the full list in Go is inexpensive.
The current implementation (subscription.Spec.Package == "servicemeshoperator3" && !ok) is semantically correct and the simplest safe approach. ✅
There was a problem hiding this comment.
When installing the subscription using the CLI you are not enforced to use the same name of OSSM, but instead it relies on the spec.Package field. In fact the rest of our code relies on it. So I just pushed a change that actually lists all subscriptions, but get the ones that contains
Spec.Package == "servicemeshoperator3"only
Today I learned. We wrote a lot of code with the understanding that OLM used metadata.name to determine what package the subscription was for. Meh, so we have some existing logic that needs to be updated to use spec.package instead of metadata.name. We should address that in a separate follow-up PR.
5b9bd08 to
9a93ef0
Compare
|
@rikatz: This pull request references Jira Issue OCPBUGS-76609, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. This pull request references NE-2224 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target either version "4.22." or "openshift-4.22.", but it targets "openshift-5.0" instead. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/jira refresh |
|
@rikatz: This pull request references Jira Issue OCPBUGS-76609, which is invalid:
Comment This pull request references NE-2224 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target either version "4.22." or "openshift-4.22.", but it targets "openshift-5.0" instead. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/test e2e-aws-operator-techpreview |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: Length of output: 131704 🏁 Script executed: Length of output: 2937
Status of all previously reported findings
On the
|
|
/retest-required |
|
PR looks in excellent shape. I had a minor concern about the all namespaces reconciliation in the status controller, specifically the direct /lgtm |
|
/jira refresh |
|
@lihongan: This pull request references Jira Issue OCPBUGS-76609, which is valid. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira (iamin@redhat.com), skipping review request. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
cc @rhamini3 |
| }, | ||
| expectCondition: configv1.ClusterOperatorStatusCondition{ | ||
| Type: configv1.OperatorDegraded, | ||
| Status: configv1.ConditionFalse, |
There was a problem hiding this comment.
TIL we add messages with Degraded: False situations (I know it was previously Degraded: False and you didn't change that). Does this mean, "it's not degraded yet, things should work, but we're not in control" type of situation?
| func (r *reconciler) findConflictingSubscription(ctx context.Context) (*operatorsv1alpha1.Subscription, error) { | ||
| var subscriptionList operatorsv1alpha1.SubscriptionList | ||
| // r.client is being used here so we can scan all namespaces without relying/requiring them to be on the cache | ||
| if err := r.client.List(ctx, &subscriptionList, &client.ListOptions{ |
There was a problem hiding this comment.
Edge Case: Since we are not watching all subscriptions from every namespace, what happens if the user has a BYO-Subscription in a random namespace, but then the user changes their mind, and deletes the BYO-Subscription?
The user could possibly get stuck since the GatewayClass controller won't trigger any new reconcilation due to no watches firing a new reconcile to tell the CIO to take over.
I don't have a good solution to this, as we need to watch all subscriptions from all namespaces. I suppose you need to add a new separate cache/informer just for subscriptions in all namespaces (pretty chunky). Or, more primitively, you could add a RequeueAfter period of 5 minutes.
Given that this is a rare scenario, I'm okay to defer.
There was a problem hiding this comment.
We actually are:
cluster-ingress-operator/pkg/operator/controller/gatewayclass/controller.go
Lines 168 to 176 in 8008be8
The problem is not the namespace, but the name. As I was telling Miciah on a thread above, we care about the sub.Spec.Package while our previous assumption was about name, and we don't check it on predicates.
I just didn't wanted to make too many changes, but IIUC as soon as we have any subscription change with the name servicemeshoperator3 it will kick the Gateway Class reconciliation, and it will use r.client (which is not cached) to do a full list
There was a problem hiding this comment.
That watches subscriptions in the operatorCache, which is restricted to a subset of namespaces:
cluster-ingress-operator/pkg/operator/operator.go
Lines 157 to 164 in eeb09ed
But I also agree that you'd need to remove the predicate to not filter with the name servicemeshoperator3 as well (or change to package).
My point being that the GatewayClass controller won't kick a reconcilation if a subscription is outside of the operatorCache subset of our watched namespaces (regardless of it's name), and if it's removed, it's possible the CIO will hesitate to take over. We should test this out in real life to verify if it's a problem, but anyways it's a minor edge case.
|
Nice update @rikatz well done. I agree with the changes here and have no blocker comments. /approve |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: gcs278 The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/retest |
|
Verified the behaviour is expected on a pre-merge cluster Installed OSSM on the Install the gatewayclass Check the Check the ingress-operator logs to see the message preview /verified by @rhamini3 |
|
@rhamini3: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/retest-required |
|
@coderabbitai did the tests of this PR failed because of any change here or were they flakes? |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: Length of output: 8791
All prior failures were flakes or unrelated:
Current state:
The
The pattern strongly suggests flakes. Wait for the currently-pending retest run ( |
|
@coderabbitai but analyze the previous failures from @openshift-ci-robot comment |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: Length of output: 10406 🏁 Script executed: Length of output: 7269
Here's the actual current CI picture from the commit statuses on the head SHA ( Current state
The one failure
Direct log for the failed run: This is the same job that has been intermittently failing throughout the PR's history due to infra issues (AWS credential errors, AKS flakes), all of which were explicitly noted by you and |
|
/retest-required |
|
@rikatz: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
b408b27
into
openshift:master
|
@rikatz: Jira Issue Verification Checks: Jira Issue OCPBUGS-76609 Jira Issue OCPBUGS-76609 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Fix included in release 4.22.0-0.nightly-2026-04-08-213621 |
This change implements a check if there is an OSSM subscription already on the cluster, and on a different namespace expected by CIO. In that case the OSSM installation will be skipped and a status condition will be added to the ingress controller status informing that while it is not degraded, there is a warning of Gateway API about the installed ossm version