
OCPBUGS-74497: Add UserAgent to Azure SDK client telemetry options #1400

Open
Nirshal wants to merge 1 commit into openshift:master from Nirshal:OCPBUGS-74497-azure-useragent-telemetry

Conversation

Nirshal commented Mar 24, 2026

What this PR does / why we need it

The Cluster Ingress Operator is not setting the ApplicationID in the Azure SDK
TelemetryOptions when creating Azure ARM SDK clients and credential clients.
This means Azure API requests from CIO do not include proper application
identification in the User-Agent header for request tracing and telemetry purposes.

This PR adds policy.TelemetryOptions with ApplicationID to:

  • ARM dns RecordSetsClient
  • ARM dns PrivateRecordSetsClient
  • Azure credential clients (UserAssignedIdentityCredential, WorkloadIdentityCredential, ClientSecretCredential)
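
To illustrate what the missing identification means on the wire, here is a minimal, stdlib-only Go sketch. It is not the Azure SDK's implementation and not the code in this PR: `appIDTransport`, `roundTripFunc`, `observedUserAgent`, and the `example-sdk/1.0` base string are hypothetical names chosen for illustration. It only shows the general idea of prefixing an application ID to the User-Agent header, which is the effect the description above attributes to `ApplicationID` in `policy.TelemetryOptions`.

```go
package main

import (
	"fmt"
	"net/http"
)

// appIDTransport is a hypothetical RoundTripper that prefixes an application
// ID to the outgoing User-Agent header. It is an illustration only.
type appIDTransport struct {
	appID string
	next  http.RoundTripper
}

func (t appIDTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	// Clone the request so the caller's copy is not mutated.
	r := req.Clone(req.Context())
	if ua := r.Header.Get("User-Agent"); ua == "" {
		r.Header.Set("User-Agent", t.appID)
	} else {
		r.Header.Set("User-Agent", t.appID+" "+ua)
	}
	return t.next.RoundTrip(r)
}

// roundTripFunc adapts a plain function to http.RoundTripper so the sketch
// can run without any network access.
type roundTripFunc func(*http.Request) (*http.Response, error)

func (f roundTripFunc) RoundTrip(r *http.Request) (*http.Response, error) { return f(r) }

// observedUserAgent sends one request through appIDTransport and returns the
// User-Agent value that the fake backend received.
func observedUserAgent(appID, baseUA string) (string, error) {
	var seen string
	fake := roundTripFunc(func(r *http.Request) (*http.Response, error) {
		seen = r.Header.Get("User-Agent")
		return &http.Response{
			StatusCode: http.StatusOK,
			Header:     http.Header{},
			Body:       http.NoBody,
			Request:    r,
		}, nil
	})

	client := &http.Client{Transport: appIDTransport{appID: appID, next: fake}}
	req, err := http.NewRequest(http.MethodGet, "https://example.invalid/", nil)
	if err != nil {
		return "", err
	}
	req.Header.Set("User-Agent", baseUA)

	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	resp.Body.Close()
	return seen, nil
}

func main() {
	ua, err := observedUserAgent("[cio-useragent]", "example-sdk/1.0")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Println(ua) // prints "[cio-useragent] example-sdk/1.0"
}
```

The fake backend stands in for the Azure API so that a check like this can run in a unit test without credentials or network access.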

Which issue(s) this PR fixes

Fixes https://issues.redhat.com/browse/OCPBUGS-74497

Special notes for your reviewer

  • The ApplicationID value is set to [cio-useragent] as suggested in the Jira issue. Note that the AWS provider uses a different format that includes the operator release version
    (e.g., OpenShift/<version> (ingress-operator)), but the Azure SDK enforces a maximum of 24 characters with no spaces for this field. Additionally, the release version is not currently
    available in the Azure client package and would require propagating it from the controller. Happy to align to a different format if preferred.
  • The Jira issue only mentions the ARM DNS clients, but the same problem exists in the credential clients (auth.go). This PR fixes both.
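
The format constraint mentioned in the first note can be sketched as a small check. This is a hypothetical helper, not something the Azure SDK or this PR provides: `validateApplicationID` and `maxApplicationIDLen` are made-up names, and the version string in the AWS-style candidate is illustrative. It assumes only what the note states, a 24-character limit and no spaces.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// maxApplicationIDLen reflects the 24-character limit described in the PR notes.
const maxApplicationIDLen = 24

// validateApplicationID is a hypothetical helper that checks a candidate
// application ID against the two constraints named in the PR notes.
func validateApplicationID(id string) error {
	if id == "" {
		return errors.New("application ID is empty")
	}
	if len(id) > maxApplicationIDLen {
		return fmt.Errorf("application ID %q is %d bytes, limit is %d",
			id, len(id), maxApplicationIDLen)
	}
	if strings.ContainsRune(id, ' ') {
		return fmt.Errorf("application ID %q contains a space", id)
	}
	return nil
}

func main() {
	candidates := []string{
		"[cio-useragent]",                     // value used in this PR
		"OpenShift/4.22.0 (ingress-operator)", // AWS-style format, illustrative version
	}
	for _, id := range candidates {
		if err := validateApplicationID(id); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted:", id)
	}
}
```

Under these assumptions the value chosen in the PR passes, while the AWS-style string fails on length before the space check is even reached.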

Checklist

  • Subject and description added to both the commit and PR
  • Relevant issues referenced
  • Unit tests included

…when authenticating to Azure Cloud

Set policy.TelemetryOptions with ApplicationID for ARM dns RecordSetsClient,
PrivateRecordSetsClient, and Azure credential clients.
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 24, 2026

openshift-ci Bot commented Mar 24, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Mar 24, 2026
openshift-ci-robot commented

@Nirshal: This pull request references Jira Issue OCPBUGS-74497, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @wewang58

The bug has been updated to refer to the pull request using the external bug tracker.

@openshift-ci openshift-ci Bot requested a review from wewang58 March 24, 2026 15:04

coderabbitai Bot commented Mar 24, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 81fcf17e-6d44-45cb-b5fa-8ee64e55f90c

📥 Commits

Reviewing files that changed from the base of the PR and between dd97cd3 and a14c606.

📒 Files selected for processing (2)
  • pkg/dns/azure/client/auth.go
  • pkg/dns/azure/client/client.go

📝 Walkthrough

Azure SDK telemetry configuration has been added to the Azure DNS client code. The changes introduce policy.TelemetryOptions{ApplicationID: "[cio-useragent]"} into the azcore.ClientOptions across credential initialization paths in the authentication module, including user-assigned managed identity, workload identity, and client secret credentials. Similar telemetry options are also configured in the record set client initialization functions. These modifications include a new import for the Azure SDK policy package and require no alterations to public APIs or control flow logic.

openshift-ci Bot commented Mar 24, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign frobware for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Nirshal Nirshal marked this pull request as ready for review March 25, 2026 10:28
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 25, 2026

Nirshal commented Mar 25, 2026

/retest

Nirshal commented Mar 26, 2026

According to Claude's e2e-analyze skill:
"The failure is clear. Let me provide the analysis.

Error: Failed to acquire lease for "hypershift-aws-quota-slice": resources not found
Summary: The e2e test never ran. The CI job failed during the infrastructure setup phase while trying to acquire an AWS resource lease from the shared CI pool. After waiting the full 2-hour
timeout, all 30 lease slots remained occupied (0 free, 30 leased), so the job was unable to provision a cluster and timed out.
Evidence: Build log shows: "Failed to acquire resource, current capacity: 0 free, 30 leased" and "could not run steps: step e2e-aws-ovn-hypershift-conformance failed: failed to acquire
lease for 'hypershift-aws-quota-slice': resources not found". Job ran for 2h19m41s before failing with reason 'executing_graph:step_failed:utilizing_lease:acquiring_lease'.
Additional evidence: This is a transient CI infrastructure issue, not a code problem. The build and image steps all succeeded (src build in 2m3s, cluster-ingress-operator build in 4m15s,
release image created successfully). The failure occurred entirely at the lease acquisition step, meaning the test code from PR #1400 on openshift/cluster-ingress-operator was never
executed. A retry when CI load is lower should succeed."

/test e2e-aws-ovn-hypershift-conformance

Nirshal commented Mar 26, 2026

/test e2e-aws-ovn-hypershift-conformance

Nirshal commented Mar 31, 2026

/retest

Nirshal commented Apr 1, 2026

/retest

openshift-ci Bot commented Apr 1, 2026

@Nirshal: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-operator-techpreview | a14c606 | link | false | /test e2e-aws-operator-techpreview |

wewang58 commented Apr 2, 2026

/verified by @wewang58

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Apr 2, 2026
openshift-ci-robot commented

@wewang58: This PR has been marked as verified by @wewang58.

Nirshal commented Apr 2, 2026

I had Claude check the ci/prow/e2e-aws-operator-techpreview failure, and it is unrelated to this PR's changes:

Now I have all the information needed. Here's the analysis:

  ---
  Error: TestAll/parallel/TestUnsupportedConfigOverride failed (310.05s) — subtests
  "dynamic-config-manager" and "max-dynamic-servers" both timed out waiting for a condition.

  Summary: The test TestUnsupportedConfigOverride verifies that unsupported config overrides
  (ROUTER_HAPROXY_CONFIG_MANAGER and ROUTER_MAX_DYNAMIC_SERVERS) are NOT set in the initial
  router deployment when the DynamicConfigurationManager feature gate is enabled. Both subtests
  timed out (~110s each) waiting for the deployment to NOT contain these environment variables,
  suggesting the router deployment was incorrectly setting these variables when the
  DynamicConfigurationManager feature gate was enabled.

  This failure is UNRELATED to PR #1400 (which only adds Azure SDK User-Agent telemetry to
  DNS/credential clients). The test is a pre-existing TechPreview e2e test for the
  DynamicConfigurationManager feature gate behavior in the HAProxy router. The "contstats"
  subtest of the same parent test PASSED, confirming partial functionality.

  Evidence:
  - operator_test.go:3671: "expected initial deployment not to set
    ROUTER_HAPROXY_CONFIG_MANAGER=true: timed out waiting for the condition"
  - operator_test.go:3671: "expected initial deployment not to set
    ROUTER_MAX_DYNAMIC_SERVERS=1: timed out waiting for the condition"
  - operator_test.go:3635: "DynamicConfigurationManager feature gate is enabled for this test"
  - Test result: FAIL TestAll/parallel/TestUnsupportedConfigOverride/dynamic-config-manager (112.59s)
  - Test result: FAIL TestAll/parallel/TestUnsupportedConfigOverride/max-dynamic-servers (110.61s)
  - Test result: PASS TestAll/parallel/TestUnsupportedConfigOverride/contstats (84.73s)

  Additional evidence:
  - The PR changes only Azure SDK telemetry in pkg/dns/azure/client/ — completely unrelated
    to HAProxy router configuration or feature gates.
  - The serial test suite PASSED entirely (2232.97s), suggesting no systemic infrastructure issue.
  - Multiple other parallel tests showed resource contention symptoms (ContainerCreating delays,
    DNS resolution failures for ELB endpoints), indicating the cluster was under heavy load
    during parallel execution, which may have contributed to the timeout.
  - The failure appears to be a flaky/pre-existing issue with the DynamicConfigurationManager
    feature gate test in TechPreview mode, where the router deployment retains unsupported
    config override env vars longer than the test timeout allows.

Nirshal commented Apr 2, 2026

@davidesalerno @Thealisyed can you provide some help with this PR? I need the proper labels to merge

Labels

jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. verified Signifies that the PR passed pre-merge verification criteria
