OCPBUGS-62517: enable downstream pdb and update replicas to meet MCO drain #576
jianzhangbjz wants to merge 1 commit into openshift:main
Conversation
@jianzhangbjz: This pull request references Jira Issue OCPBUGS-62517, which is valid. 3 validation(s) were run on this bug. No GitHub users were found matching the public email listed for the QA contact in Jira (jiazha@redhat.com), skipping review request. The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
[APPROVALNOTIFIER] This PR is APPROVED

This pull request has been approved by: jianzhangbjz. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
I'm just using this PR for testing. The upstream PR: operator-framework/operator-controller#2371
/payload-job periodic-ci-openshift-release-master-ci-4.21-e2e-azure-ovn-upgrade
@jianzhangbjz: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command.

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/74cbcd50-d0ca-11f0-9907-eb8e705acca9-0
All tests passed. See: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/openshift-operator-framework-operator-controller-576-ci-4.21-e2e-azure-ovn-upgrade/1996437818976505856
tmshort left a comment:

/hold

This should not be making changes outside of the openshift directory. The replicas value should be added to the files in openshift/helm. Upstream PR operator-framework/operator-controller#2371 needs to be approved and merged first.
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. If this issue is safe to close now please do so with /close.

/lifecycle stale
Walkthrough

This pull request increases controller deployment replica counts from 1 to 2 for catalogd-controller-manager and operator-controller-controller-manager across Helm charts and manifests, adds pre-upgrade setup infrastructure via a new Bash script, introduces PodDisruptionBudgets for these critical controllers in OpenShift deployments, and updates test helpers with longer timeouts and comprehensive post-upgrade E2E tests.

Estimated code review effort: 3 (Moderate), ~30 minutes
/payload-job periodic-ci-openshift-release-master-ci-4.22-e2e-azure-ovn-upgrade

@jianzhangbjz: trigger 0 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
Actionable comments posted: 1
🧹 Nitpick comments (3)
openshift/catalogd/manifests.yaml (1)

845-850: Stale comment: maxSurge calculation is now outdated.

The comment says "Allow temporary 2 pods (1 + 1)", but with replicas: 2 the maximum during a rolling update is 3 pods (2 + 1).

📝 Suggested fix

```diff
- maxSurge: 1 # Allow temporary 2 pods (1 + 1) for zero-downtime updates
+ maxSurge: 1 # Allow temporary 3 pods (2 + 1) for zero-downtime updates
```

🤖 Prompt for AI Agents: Verify each finding against the current code and only fix it if needed. In `openshift/catalogd/manifests.yaml` around lines 845-850, the inline comment next to rollingUpdate.maxSurge is outdated: with replicas: 2 and maxSurge: 1 the cluster can temporarily have 3 pods (2 + 1), not 2. Update the comment to "Allow temporary 3 pods (2 + 1) for zero-downtime updates", or replace it with a generic note like "Allow one extra pod (replicas + maxSurge) for zero-downtime updates" so the comment stays correct even if replicas change. Modify only the comment text, leaving the values for replicas, type, rollingUpdate, and maxUnavailable unchanged.

openshift/operator-controller/manifests-experimental.yaml (1)
1204-1209: Same stale maxSurge comment as the other manifests.

📝 Suggested fix

```diff
- maxSurge: 1 # Allow temporary 2 pods (1 + 1) for zero-downtime updates
+ maxSurge: 1 # Allow temporary 3 pods (2 + 1) for zero-downtime updates
```

🤖 Prompt for AI Agents: Verify each finding against the current code and only fix it if needed. In `openshift/operator-controller/manifests-experimental.yaml` around lines 1204-1209, the inline comment on rollingUpdate.maxSurge is stale: with replicas: 2 and maxSurge: 1 the cluster can temporarily run 3 pods, not "2 pods (1 + 1)". Update the comment on the rollingUpdate.maxSurge line to accurately reflect the behavior (e.g., "Allow one extra pod (replicas + 1) for zero-downtime updates") so it matches the values for replicas, strategy.type (RollingUpdate), and rollingUpdate.maxSurge.

test/upgrade-e2e/post_upgrade_test.go (1)
160-171: Silent return on nil condition may mask issues in CI.

When cond is nil at line 164, the function returns without failing the assertion, which is intentional for retry logic. However, if the Progressing condition is never set (due to a bug), the test will eventually time out without indicating which specific check failed.

Consider adding a log statement to aid debugging:

📝 Optional improvement for debuggability

```diff
  cond = apimeta.FindStatusCondition(clusterCatalog.Status.Conditions, ocv1.TypeProgressing)
  if cond == nil {
+     t.Log("Progressing condition not yet present, retrying...")
      return
  }
```

🤖 Prompt for AI Agents: Verify each finding against the current code and only fix it if needed. In `test/upgrade-e2e/post_upgrade_test.go` around lines 160-171, the polling callback silently returns when cond == nil, which can hide why the poll eventually times out. Emit a debug/log message before returning (e.g., using t.Logf or the test logger) when apimeta.FindStatusCondition(clusterCatalog.Status.Conditions, ocv1.TypeProgressing) yields nil, so the test still retries but leaves diagnostics showing the clusterCatalog and catalogdManagerPod identifiers/timestamps. Keep the existing retry behavior and only add the log.
📒 Files selected for processing (16)

- hack/test/pre-upgrade-setup.sh
- helm/olmv1/templates/deployment-olmv1-system-catalogd-controller-manager.yml
- helm/olmv1/templates/deployment-olmv1-system-operator-controller-controller-manager.yml
- helm/olmv1/values.yaml
- manifests/experimental-e2e.yaml
- manifests/experimental.yaml
- manifests/standard-e2e.yaml
- manifests/standard.yaml
- openshift/catalogd/manifests-experimental.yaml
- openshift/catalogd/manifests.yaml
- openshift/helm/catalogd.yaml
- openshift/helm/operator-controller.yaml
- openshift/operator-controller/manifests-experimental.yaml
- openshift/operator-controller/manifests.yaml
- test/helpers/helpers.go
- test/upgrade-e2e/post_upgrade_test.go
```go
package upgradee2e

import (
	"bufio"
	"context"
	"fmt"
	"strings"
	"testing"
	"time"

	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
	appsv1 "k8s.io/api/apps/v1"
	coordinationv1 "k8s.io/api/coordination/v1"
	corev1 "k8s.io/api/core/v1"
	apimeta "k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/apimachinery/pkg/util/wait"
	"sigs.k8s.io/controller-runtime/pkg/client"

	ocv1 "github.com/operator-framework/operator-controller/api/v1"
	testutil "github.com/operator-framework/operator-controller/internal/shared/util/test"
)

const (
	artifactName = "operator-controller-upgrade-e2e"
	container    = "manager"
	// pollDuration is set to 3 minutes for upgrade tests because upgrades cause pod restarts.
	// With LeaderElectionReleaseOnCancel: true, graceful shutdowns take ~26s (RetryPeriod).
	// In the worst case (pod crash), leader election can take up to 163s (LeaseDuration: 137s + RetryPeriod: 26s).
	pollDuration = 3 * time.Minute
	pollInterval = time.Second
)
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
# Verify that required package-level variables are defined in the upgradee2e package
echo "Checking for variable definitions in test/upgrade-e2e/"
rg -n "var\s+(c|cfg|kclientset|testClusterCatalogName|testClusterExtensionName)\s" test/upgrade-e2e/
rg -n "^\s*(c|cfg|kclientset|testClusterCatalogName|testClusterExtensionName)\s*(=|client\.Client)" test/upgrade-e2e/
```

Repository: openshift/operator-framework-operator-controller
Length of output: 146

🏁 Script executed:

```shell
# List all files in the test/upgrade-e2e directory
find test/upgrade-e2e/ -type f -name "*.go" | head -20
```

Length of output: 162

🏁 Script executed:

```shell
# Check the full post_upgrade_test.go file to see what variables are actually used
cat -n test/upgrade-e2e/post_upgrade_test.go
```

Length of output: 17150

🏁 Script executed:

```shell
# Search more broadly for these variable usages in the entire test/upgrade-e2e directory
rg -i "(c|cfg|kclientset|testClusterCatalogName|testClusterExtensionName)" test/upgrade-e2e/ --type go
```

Length of output: 22039

🏁 Script executed:

```shell
# Read the upgrade_test.go file
cat -n test/upgrade-e2e/upgrade_test.go | head -100
```

Length of output: 1193

🏁 Script executed:

```shell
# Search for variable declarations at package level in upgrade_test.go
rg "var\s+(c|cfg|kclientset|testClusterCatalogName|testClusterExtensionName)" test/upgrade-e2e/upgrade_test.go -A 2
```

Length of output: 74

🏁 Script executed:

```shell
# Search all .go files in test/ directory for these variable declarations
find test/ -name "*.go" -type f -exec grep -l "var.*\(c\|cfg\|kclientset\|testClusterCatalogName\|testClusterExtensionName\)" {} \;
```

Length of output: 269

🏁 Script executed:

```shell
# Check if there's a setup/init file or any other Go file in upgrade-e2e directory
ls -la test/upgrade-e2e/
```

Length of output: 461

🏁 Script executed:

```shell
# Search more broadly for where these variables might be assigned/initialized
rg "^\s*(c|cfg|kclientset|testClusterCatalogName|testClusterExtensionName)\s*=" test/upgrade-e2e/ -A 1
```

Length of output: 74

🏁 Script executed:

```shell
# Check test/utils.go for variable declarations
cat -n test/utils.go | head -100
```

Length of output: 3281

🏁 Script executed:

```shell
# Search in test/e2e/steps for the variable declarations
rg "var\s+(c|cfg|kclientset|testClusterCatalogName|testClusterExtensionName)" test/e2e/steps/
```

Length of output: 378

🏁 Script executed:

```shell
# Check if there are any build tags in post_upgrade_test.go that might import external vars
head -20 test/upgrade-e2e/post_upgrade_test.go
```

Length of output: 553

🏁 Script executed:

```shell
# Look for these variables in ALL test files
rg -l "testClusterCatalogName|testClusterExtensionName" test/ --type go
```

Length of output: 129

🏁 Script executed:

```shell
# Search for "var c " or "c =" or "var cfg" patterns in all test files
rg "var\s+c\s+|var\s+cfg\s+|var\s+kclientset\s+" test/ --type go
```

Length of output: 74

🏁 Script executed:

```shell
# Check if post_upgrade_test.go even compiles by looking for any variable declarations before the functions
sed -n '27,103p' test/upgrade-e2e/post_upgrade_test.go
```

Length of output: 3776

🏁 Script executed:

```shell
# Search for any "c =" or "cfg =" initialization patterns
rg "^\s*(c|cfg|kclientset|testClusterCatalogName|testClusterExtensionName)\s*:?=" test/upgrade-e2e/ --multiline
```

Length of output: 74

🏁 Script executed:

```shell
# Look at what the actual error would be - check the README in upgrade-e2e
cat test/upgrade-e2e/README.md
```

Length of output: 1998
Define missing package-level variables.

The test file uses c (client), cfg (config), kclientset, testClusterCatalogName, and testClusterExtensionName without declaring them anywhere in the upgradee2e package. These variables are referenced throughout the test functions but have no declarations in either post_upgrade_test.go or upgrade_test.go, which will cause compilation failures. They must be declared as package-level variables, likely in a setup file such as suite_test.go or init_test.go.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/upgrade-e2e/post_upgrade_test.go` around lines 1 - 35, Declare the
missing package-level variables so the tests compile: add declarations for c
(controller-runtime client.Client), cfg (*rest.Config), kclientset
(kubernetes.Interface or *kubernetes.Clientset depending on usage), and
testClusterCatalogName/testClusterExtensionName (strings) at package scope in
the test package, and initialize them in your test setup (e.g., TestMain or your
existing suite setup function) before tests run; ensure you import the correct
types (sigs.k8s.io/controller-runtime/pkg/client for Client and
k8s.io/client-go/kubernetes for the clientset) and assign cfg and kclientset
from the test env bootstrap so existing references to c, cfg, kclientset,
testClusterCatalogName, and testClusterExtensionName in post_upgrade_test.go
compile and run.
@jianzhangbjz: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
To address: the OLM PodDisruptionBudget (PDB) blocks the MCO node drain.
More on Slack: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1764766284275389
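The drain interaction can be sketched with a minimal PodDisruptionBudget. This is an illustrative fragment, not the actual manifest from this PR; the name, namespace, and labels are assumptions. With replicas: 1 and a budget requiring one available pod, `disruptionsAllowed` stays at 0 and the MCO's eviction calls are refused indefinitely; raising replicas to 2 lets one pod be evicted while the other keeps serving.

```yaml
# Illustrative sketch only -- name, namespace, and labels are assumed,
# not copied from this PR's manifests.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: operator-controller-controller-manager
  namespace: openshift-operator-controller    # assumed namespace
spec:
  # With replicas: 2, disruptionsAllowed is 1, so a node drain can always
  # evict one pod and make progress. With replicas: 1 this same budget
  # would leave disruptionsAllowed at 0 and block the drain.
  minAvailable: 1
  selector:
    matchLabels:
      control-plane: operator-controller-controller-manager   # assumed label
```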