OCPBUGS-81452: prevent webhook rollout stalls by issuing certificates earlier#684

Closed
bandrade wants to merge 1 commit into openshift:main from bandrade:feature/OCPBUGS-66965-rollout-stall

@bandrade
Contributor

What changed

  • move cert-manager Certificate objects into the infrastructure phase
  • keep Issuer in infrastructure and leave workload Deployment objects in deploy
  • update phase sorting tests to cover the new ordering
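
A minimal sketch of the kind of group/kind-to-phase lookup this change touches. The names gkPhaseMap, PhaseInfrastructure, and PhaseDeploy come from this PR; the Phase and GroupKind types, the string values, the phaseFor helper, and the default for unknown kinds are assumptions made so the example is self-contained, not the actual code in phase.go.

```go
package main

import "fmt"

// Phase names a rollout stage. The string values are illustrative only.
type Phase string

const (
	PhaseInfrastructure Phase = "infrastructure"
	PhaseDeploy         Phase = "deploy"
)

// GroupKind is a local stand-in for an API group/kind pair (hypothetical
// type, used here so the sketch needs no external dependencies).
type GroupKind struct {
	Group string
	Kind  string
}

// gkPhaseMap mirrors the shape of the mapping described in this PR.
var gkPhaseMap = map[GroupKind]Phase{
	{Group: "cert-manager.io", Kind: "Issuer"}:      PhaseInfrastructure,
	{Group: "cert-manager.io", Kind: "Certificate"}: PhaseInfrastructure, // was PhaseDeploy before this change
	{Group: "apps", Kind: "Deployment"}:             PhaseDeploy,
}

// phaseFor returns the phase for a group/kind. Falling back to PhaseDeploy
// for unknown kinds is an assumption of this sketch.
func phaseFor(gk GroupKind) Phase {
	if p, ok := gkPhaseMap[gk]; ok {
		return p
	}
	return PhaseDeploy
}

func main() {
	fmt.Println(phaseFor(GroupKind{Group: "cert-manager.io", Kind: "Certificate"}))
	fmt.Println(phaseFor(GroupKind{Group: "apps", Kind: "Deployment"}))
}
```

With this mapping, a Certificate resolves to the same phase as its Issuer, while the Deployment stays in the later phase.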

Why

Webhook installs can stall in RollingOut when certificate issuance and deployment availability are gated within the same phase. Starting certificate issuance earlier shortens the rollout critical path for webhook-backed operators.

Impact

This reduces the chance that ClusterExtension installations for webhook operators remain stuck waiting for the generated deployment to become available.

Root cause

The rollout path allowed cert-manager Certificate objects to be applied in the same phase as the webhook deployment. For operators that mount the serving cert secret, that can delay deployment availability long enough to hit external rollout timeouts.
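
To illustrate the ordering effect, here is a rough sketch of phase-ordered application, assuming the infrastructure phase is processed before the deploy phase. The object type, the phaseOrder table, and the applyOrder helper are hypothetical and exist only for this example; the real applier may gate and order phases differently.

```go
package main

import (
	"fmt"
	"sort"
)

// object is a simplified stand-in for a rendered manifest (hypothetical).
type object struct {
	Kind  string
	Name  string
	Phase string
}

// phaseOrder encodes the assumed ordering: infrastructure before deploy.
var phaseOrder = map[string]int{"infrastructure": 0, "deploy": 1}

// applyOrder returns a copy of objs sorted so that earlier phases come
// first. The stable sort keeps the input order within a phase.
func applyOrder(objs []object) []object {
	out := append([]object(nil), objs...)
	sort.SliceStable(out, func(i, j int) bool {
		return phaseOrder[out[i].Phase] < phaseOrder[out[j].Phase]
	})
	return out
}

func main() {
	objs := []object{
		{Kind: "Deployment", Name: "webhook", Phase: "deploy"},
		{Kind: "Certificate", Name: "webhook-serving-cert", Phase: "infrastructure"},
		{Kind: "Issuer", Name: "selfsigned", Phase: "infrastructure"},
	}
	for _, o := range applyOrder(objs) {
		fmt.Printf("%s: %s/%s\n", o.Phase, o.Kind, o.Name)
	}
}
```

Once the Certificate sits in the earlier phase, issuance can begin before the Deployment that mounts the serving cert secret is applied, rather than alongside it.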

Validation

  • go test -tags containers_image_openpgp ./internal/operator-controller/applier -run Test_PhaseSort

@openshift-ci
Contributor

openshift-ci bot commented Mar 30, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 30, 2026
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Mar 30, 2026
@openshift-ci-robot

@bandrade: This pull request references Jira Issue OCPBUGS-66965, which is invalid:

  • expected the bug to be open, but it isn't
  • expected the bug to target the "4.22.0" version, but no target version was set
  • expected the bug to be in one of the following states: NEW, ASSIGNED, POST, but it is Closed (Won't Do) instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai bot commented Mar 30, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 970a302a-fe74-430d-bce1-ab2e6695bb3b

📥 Commits

Reviewing files that changed from the base of the PR and between 3a91c53 and 57d188a.

📒 Files selected for processing (2)
  • internal/operator-controller/applier/phase.go
  • internal/operator-controller/applier/phase_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • internal/operator-controller/applier/phase.go
  • internal/operator-controller/applier/phase_test.go

Walkthrough

Certificate resources from group cert-manager.io were moved from PhaseDeploy to PhaseInfrastructure in the phase mapping; corresponding unit tests were updated to expect Certificate in PhaseInfrastructure.

Changes

Cohort / File(s) | Summary
Phase sorting logic: internal/operator-controller/applier/phase.go | Updated gkPhaseMap so {Kind: "Certificate", Group: "cert-manager.io"} maps to PhaseInfrastructure instead of PhaseDeploy, changing computed phases for Certificate objects.
Phase sorting tests: internal/operator-controller/applier/phase_test.go | Added a cert-manager.io/v1 Certificate unstructured object to test inputs, renamed a test case, and updated expected phase groupings so Certificate appears under applier.PhaseInfrastructure rather than applier.PhaseDeploy.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

@openshift-ci
Contributor

openshift-ci bot commented Mar 30, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: bandrade
Once this PR has been reviewed and has the lgtm label, please assign jianzhangbjz for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@bandrade bandrade changed the title OCPBUGS-66965: prevent webhook rollout stalls by issuing certificates earlier OCPBUGS-81452: prevent webhook rollout stalls by issuing certificates earlier Mar 31, 2026
@openshift-ci-robot openshift-ci-robot added the jira/severity-low Referenced Jira bug's severity is low for the branch this PR is targeting. label Mar 31, 2026
@openshift-ci-robot

@bandrade: This pull request references Jira Issue OCPBUGS-81452, which is invalid:

  • expected the bug to target the "4.22.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

@bandrade bandrade marked this pull request as ready for review March 31, 2026 00:45
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 31, 2026
@bandrade bandrade force-pushed the feature/OCPBUGS-66965-rollout-stall branch from 16707ab to 3a91c53 Compare March 31, 2026 01:08
@bandrade bandrade force-pushed the feature/OCPBUGS-66965-rollout-stall branch from 3a91c53 to 57d188a Compare March 31, 2026 05:02
Contributor

@camilamacedo86 camilamacedo86 left a comment

Hi @bandrade

This code is maintained upstream and synced to this repository.
So we need to make these changes in https://github.com/operator-framework/operator-controller instead. Could you please push them there?

@openshift-ci
Contributor

openshift-ci bot commented Mar 31, 2026

@bandrade: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@tmshort
Contributor

tmshort commented Mar 31, 2026

/hold
Until #687 or #682 merges

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 31, 2026
@bandrade bandrade closed this Mar 31, 2026
@openshift-ci-robot

@bandrade: This pull request references Jira Issue OCPBUGS-81452. The bug has been updated to no longer refer to the pull request using the external bug tracker. All external bug links have been closed. The bug has been moved to the NEW state.

Labels

do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. jira/severity-low Referenced Jira bug's severity is low for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

4 participants