OCPBUGS-86299: skip upstream LB tests on dual-stack and patch NLB service IPFamilyPolicy#466
OCPBUGS-86299: skip upstream LB tests on dual-stack and patch NLB service IPFamilyPolicy#466mtulio wants to merge 2 commits into
Conversation
The AWSServiceLBNetworkSecurityGroup e2e tests fail on HyperShift because the test binary cannot reach AWS APIs (VPC private endpoint DNS) and the cloud-config ConfigMap has a different name and location than on standalone clusters. AWS endpoint resolution: - Force public regional endpoints (BaseEndpoint) on ELBv2 and EC2 clients to bypass VPC private endpoint DNS that is unreachable from the CI management cluster. - Validate cfg.Region against the AWS region pattern and fall back to infrastructure/cluster status.platformStatus.aws.region when the SDK returns a non-region value (e.g. CI lease UUID from LEASED_RESOURCE). Cloud-config validation (HyperShift-aware): - Detect External topology via Infrastructure resource to choose the right cloud-config source. - On HyperShift: read ConfigMap aws-cloud-config from the HCP namespace on the management cluster (via HYPERSHIFT_MANAGEMENT_CLUSTER_KUBECONFIG and HYPERSHIFT_MANAGEMENT_CLUSTER_NAMESPACE env vars). - On standalone: read ConfigMap cloud-conf from openshift-cloud-controller-manager namespace (existing behavior). - Extract cloud-config helpers (GetCloudConfig, IsConfigPresentCloudConfig, IsNLBSecurityGroupModeManaged, GetKubeClient) into common/helper.go without Ginkgo control-flow calls so they are safe to use from both spec and non-spec contexts. Skip mechanism: - Add SKIP_MANAGEMENT_CLUSTER_TESTS=true env var to gracefully skip tests requiring management cluster access when the kubeconfig is not available. Documentation: - Replace stale docs/dev/ote-ccm-aws.md with e2e-ote-ccm-aws.md covering corrected paths, batch test execution, HyperShift setup (env vars, NLB ingress patching), and the skip flag. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… IPFamilyPolicy Upstream cloud-provider-aws load balancer tests do not support dual-stack clusters yet. Detect dual-stack configuration from the cloud-config (ipFamilies key) at startup and exclude upstream LB tests when the cluster is dual-stack. When detection fails, upstream LB tests are also excluded to avoid false positives (fail-closed). For the downstream AWSServiceLBNetworkSecurityGroup tests, patch the NLB service with IPFamilyPolicy=RequireDualStack when dual-stack is detected so the service matches the cluster's network configuration. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
@mtulio: This pull request references Jira Issue OCPBUGS-86299, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
WalkthroughThis PR adds infrastructure to support OpenShift e2e testing for the AWS Cloud Controller Manager on HyperShift and dual-stack clusters. It introduces shared test helpers for cloud-config retrieval and cluster topology detection, implements region-aware AWS client configuration, integrates dual-stack detection into test selection, and provides comprehensive documentation for the test binary operations. ChangesTest infrastructure for HyperShift and dual-stack
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 11 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (11 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
Skipping CI for Draft Pull Request. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
@mtulio: This pull request references Jira Issue OCPBUGS-86299, which is valid. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@mtulio: This pull request references Jira Issue OCPBUGS-86299, which is valid. 3 validation(s) were run on this bug
The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-installer-dualstack-ipv6-primary-techpreview |
|
@mtulio: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/dfe82140-5551-11f1-976e-fc5230192bd3-0 |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
openshift-tests/ccm-aws-tests/main.go (1)
50-65: 💤 Low valueConsider using a context with timeout for dual-stack detection.
The detection uses
context.TODO()which has no timeout. If the cluster is unreachable or slow, this could block indefinitely during test initialization. A bounded context would make startup behavior more predictable.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@openshift-tests/ccm-aws-tests/main.go` around lines 50 - 65, Replace the unbounded context.TODO() used in the dual-stack detection call chain with a bounded context: create a context with timeout (e.g. context.WithTimeout(context.Background(), 30*time.Second)), defer its cancel, and pass that context into common.GetCloudConfig instead of context.TODO(); ensure error handling/log messages for the LoadConfig -> kclientset.NewForConfig -> common.GetCloudConfig -> common.IsDualStack sequence still log the error (including timeout) and that isDualStackCluster and dualStackDetectionReady are only set when detection completes successfully, and keep the existing log.Debugf("Dual-stack cluster detected: %v", isDualStackCluster) after successful detection.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/dev/e2e-ote-ccm-aws.md`:
- Around line 166-168: The quoted test name in the HyperShift run example passed
to the run-test command is missing a closing quote around Managed; update the
test string used in the
./openshift-tests/bin/cloud-controller-manager-aws-tests-ext run-test invocation
(the long bracketed test selector starting with
"[cloud-provider-aws-e2e-openshift] ... with 'Managed value in cloud-config") to
close the single quote around Managed (e.g., "with 'Managed' value in
cloud-config ...") so the quoted test name is well-formed.
In `@openshift-tests/ccm-aws-tests/e2e/common/helper.go`:
- Around line 161-164: The error returned when fetching the ConfigMap via
mgmtClient.CoreV1().ConfigMaps(hcpNamespace).Get(...) drops the underlying err;
update the return to include/wrap the original error (e.g., append ": %w" or
include err with %v) so the failure from Get is preserved and actionable; change
the fmt.Errorf call that currently formats "failed to get HCP cloud-config
ConfigMap %s/%s" to include the err and wrap it using the original err variable
(referencing hcpNamespace and hcpCloudConfigName to locate the site).
- Around line 151-159: The error returns for BuildConfigFromFlags and
NewForConfig discard the underlying error; update the error wrapping in the
blocks where restConfig is built (clientcmd.BuildConfigFromFlags with
mgmtKubeconfig) and where mgmtClient is created (clientset.NewForConfig) to
include the original err (e.g. use fmt.Errorf with %w or include %v) so callers
see the underlying failure details; locate the two error branches around
restConfig and mgmtClient and wrap or append err to the returned fmt.Errorf
messages.
---
Nitpick comments:
In `@openshift-tests/ccm-aws-tests/main.go`:
- Around line 50-65: Replace the unbounded context.TODO() used in the dual-stack
detection call chain with a bounded context: create a context with timeout (e.g.
context.WithTimeout(context.Background(), 30*time.Second)), defer its cancel,
and pass that context into common.GetCloudConfig instead of context.TODO();
ensure error handling/log messages for the LoadConfig -> kclientset.NewForConfig
-> common.GetCloudConfig -> common.IsDualStack sequence still log the error
(including timeout) and that isDualStackCluster and dualStackDetectionReady are
only set when detection completes successfully, and keep the existing
log.Debugf("Dual-stack cluster detected: %v", isDualStackCluster) after
successful detection.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: ec60e4a6-62e6-4798-a902-08de766c83c7
📒 Files selected for processing (7)
docs/dev/e2e-ote-ccm-aws.mddocs/dev/ote-ccm-aws.mdopenshift-tests/ccm-aws-tests/e2e/aws/helper.goopenshift-tests/ccm-aws-tests/e2e/aws/loadbalancer.goopenshift-tests/ccm-aws-tests/e2e/common/helper.goopenshift-tests/ccm-aws-tests/go.modopenshift-tests/ccm-aws-tests/main.go
💤 Files with no reviewable changes (1)
- docs/dev/ote-ccm-aws.md
| ./openshift-tests/bin/cloud-controller-manager-aws-tests-ext run-test \ | ||
| "[cloud-provider-aws-e2e-openshift] loadbalancer NLB [OCPFeatureGate:AWSServiceLBNetworkSecurityGroup] should have NLBSecurityGroupMode with 'Managed value in cloud-config [Suite:openshift/conformance/parallel]" | ||
| ``` |
There was a problem hiding this comment.
Fix malformed quoted test name in the HyperShift run example.
The example test string appears to miss a closing quote around Managed (with 'Managed value...), which can cause failed copy/paste execution for run-test.
Suggested doc fix
./openshift-tests/bin/cloud-controller-manager-aws-tests-ext run-test \
- "[cloud-provider-aws-e2e-openshift] loadbalancer NLB [OCPFeatureGate:AWSServiceLBNetworkSecurityGroup] should have NLBSecurityGroupMode with 'Managed value in cloud-config [Suite:openshift/conformance/parallel]"
+ "[cloud-provider-aws-e2e-openshift] loadbalancer NLB [OCPFeatureGate:AWSServiceLBNetworkSecurityGroup] should have NLBSecurityGroupMode with 'Managed' value in cloud-config [Suite:openshift/conformance/parallel]"📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| ./openshift-tests/bin/cloud-controller-manager-aws-tests-ext run-test \ | |
| "[cloud-provider-aws-e2e-openshift] loadbalancer NLB [OCPFeatureGate:AWSServiceLBNetworkSecurityGroup] should have NLBSecurityGroupMode with 'Managed value in cloud-config [Suite:openshift/conformance/parallel]" | |
| ``` | |
| ./openshift-tests/bin/cloud-controller-manager-aws-tests-ext run-test \ | |
| "[cloud-provider-aws-e2e-openshift] loadbalancer NLB [OCPFeatureGate:AWSServiceLBNetworkSecurityGroup] should have NLBSecurityGroupMode with 'Managed' value in cloud-config [Suite:openshift/conformance/parallel]" |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/dev/e2e-ote-ccm-aws.md` around lines 166 - 168, The quoted test name in
the HyperShift run example passed to the run-test command is missing a closing
quote around Managed; update the test string used in the
./openshift-tests/bin/cloud-controller-manager-aws-tests-ext run-test invocation
(the long bracketed test selector starting with
"[cloud-provider-aws-e2e-openshift] ... with 'Managed value in cloud-config") to
close the single quote around Managed (e.g., "with 'Managed' value in
cloud-config ...") so the quoted test name is well-formed.
| restConfig, err := clientcmd.BuildConfigFromFlags("", mgmtKubeconfig) | ||
| if err != nil { | ||
| return nil, fmt.Errorf("failed to load management cluster kubeconfig") | ||
| } | ||
|
|
||
| mgmtClient, err := clientset.NewForConfig(restConfig) | ||
| if err != nil { | ||
| return nil, fmt.Errorf("failed to create management cluster client") | ||
| } |
There was a problem hiding this comment.
Include underlying errors in error messages for better diagnostics.
Lines 153 and 158 discard the underlying error, making debugging harder when these calls fail.
Proposed fix
restConfig, err := clientcmd.BuildConfigFromFlags("", mgmtKubeconfig)
if err != nil {
- return nil, fmt.Errorf("failed to load management cluster kubeconfig")
+ return nil, fmt.Errorf("failed to load management cluster kubeconfig: %w", err)
}
mgmtClient, err := clientset.NewForConfig(restConfig)
if err != nil {
- return nil, fmt.Errorf("failed to create management cluster client")
+ return nil, fmt.Errorf("failed to create management cluster client: %w", err)
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| restConfig, err := clientcmd.BuildConfigFromFlags("", mgmtKubeconfig) | |
| if err != nil { | |
| return nil, fmt.Errorf("failed to load management cluster kubeconfig") | |
| } | |
| mgmtClient, err := clientset.NewForConfig(restConfig) | |
| if err != nil { | |
| return nil, fmt.Errorf("failed to create management cluster client") | |
| } | |
| restConfig, err := clientcmd.BuildConfigFromFlags("", mgmtKubeconfig) | |
| if err != nil { | |
| return nil, fmt.Errorf("failed to load management cluster kubeconfig: %w", err) | |
| } | |
| mgmtClient, err := clientset.NewForConfig(restConfig) | |
| if err != nil { | |
| return nil, fmt.Errorf("failed to create management cluster client: %w", err) | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@openshift-tests/ccm-aws-tests/e2e/common/helper.go` around lines 151 - 159,
The error returns for BuildConfigFromFlags and NewForConfig discard the
underlying error; update the error wrapping in the blocks where restConfig is
built (clientcmd.BuildConfigFromFlags with mgmtKubeconfig) and where mgmtClient
is created (clientset.NewForConfig) to include the original err (e.g. use
fmt.Errorf with %w or include %v) so callers see the underlying failure details;
locate the two error branches around restConfig and mgmtClient and wrap or
append err to the returned fmt.Errorf messages.
| cm, err := mgmtClient.CoreV1().ConfigMaps(hcpNamespace).Get(ctx, hcpCloudConfigName, metav1.GetOptions{}) | ||
| if err != nil { | ||
| return nil, fmt.Errorf("failed to get HCP cloud-config ConfigMap %s/%s", hcpNamespace, hcpCloudConfigName) | ||
| } |
There was a problem hiding this comment.
Include underlying error for ConfigMap fetch failure.
Proposed fix
cm, err := mgmtClient.CoreV1().ConfigMaps(hcpNamespace).Get(ctx, hcpCloudConfigName, metav1.GetOptions{})
if err != nil {
- return nil, fmt.Errorf("failed to get HCP cloud-config ConfigMap %s/%s", hcpNamespace, hcpCloudConfigName)
+ return nil, fmt.Errorf("failed to get HCP cloud-config ConfigMap %s/%s: %w", hcpNamespace, hcpCloudConfigName, err)
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| cm, err := mgmtClient.CoreV1().ConfigMaps(hcpNamespace).Get(ctx, hcpCloudConfigName, metav1.GetOptions{}) | |
| if err != nil { | |
| return nil, fmt.Errorf("failed to get HCP cloud-config ConfigMap %s/%s", hcpNamespace, hcpCloudConfigName) | |
| } | |
| cm, err := mgmtClient.CoreV1().ConfigMaps(hcpNamespace).Get(ctx, hcpCloudConfigName, metav1.GetOptions{}) | |
| if err != nil { | |
| return nil, fmt.Errorf("failed to get HCP cloud-config ConfigMap %s/%s: %w", hcpNamespace, hcpCloudConfigName, err) | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@openshift-tests/ccm-aws-tests/e2e/common/helper.go` around lines 161 - 164,
The error returned when fetching the ConfigMap via
mgmtClient.CoreV1().ConfigMaps(hcpNamespace).Get(...) drops the underlying err;
update the return to include/wrap the original error (e.g., append ": %w" or
include err with %v) so the failure from Get is preserved and actionable; change
the fmt.Errorf call that currently formats "failed to get HCP cloud-config
ConfigMap %s/%s" to include the err and wrap it using the original err variable
(referencing hcpNamespace and hcpCloudConfigName to locate the site).
Summary
Upstream
cloud-provider-awsload balancer tests do not support dual-stack clusters — they create single-stack NLB services that fail when the cluster network requires dual-stack IPFamilyPolicy. This PR detects dual-stack configuration from the CCM cloud-config and adapts the test suite accordingly.Changes
Exclude upstream LB tests on dual-stack clusters: Detect dual-stack at startup by reading the
ipFamilieskey from the cloud-config ConfigMap. When the cluster is dual-stack, upstream[cloud-provider-aws-e2e] loadbalancertests are excluded from the spec selector. When detection fails, upstream LB tests are also excluded (fail-closed) to avoid false positives.Patch NLB services for dual-stack: The downstream
createServiceNLBhelper now setsIPFamilyPolicy=RequireDualStackon NLB services when dual-stack is detected, so theAWSServiceLBNetworkSecurityGrouptests create services matching the cluster's network configuration.Add
IsDualStackhelper: Newcommon.IsDualStack()function parses theipFamiliesconfig key from the cloud-config ConfigMap and returns true when both IPv4 and IPv6 are present.Dependencies
Test plan
list testsshould not contain[cloud-provider-aws-e2e] loadbalancerentries)AWSServiceLBNetworkSecurityGrouptests create dual-stack NLB services on dual-stack clusters🤖 Generated with Claude Code
Summary by CodeRabbit
Documentation
New Features
Bug Fixes