OSAC: add AAP job failure diagnostics to e2e-vmaas CI #79650

Open
omer-vishlitzky wants to merge 4 commits into openshift:main from omer-vishlitzky:osac-e2e-vmaas-debug-diagnostics

@omer-vishlitzky
Contributor

@omer-vishlitzky omer-vishlitzky commented May 23, 2026

Summary

  • e2e-vmaas has a ~50% failure rate: AAP provision jobs crash (rc=None) with zero visibility into the crash reason
  • Adds diagnostics to the gather and test steps to capture the missing data

Changes

Gather step (osac-project-gather-commands.sh):

  • Query AAP REST API for failed/errored jobs — saves result_traceback, job_explanation, stdout per job
  • Collect automation-job pod descriptions (exit codes, OOMKill events)
  • Collect instance group capacity data
  • Collect VirtualNetwork/Subnet/SecurityGroup YAML status
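
The job-collection idea above can be sketched offline. This is a minimal illustration, not the script in this PR: it assumes the job list payload has a `results` array whose entries carry `id`, `status`, `result_traceback`, and `job_explanation`, and that failed jobs report a status of `failed` or `error`. The function name and the sample payload are hypothetical; the file-name pattern is modeled on `job-*-detail.json`.

```python
import json
import tempfile
from pathlib import Path

# Assumed status values for jobs worth collecting.
FAILED_STATUSES = {"failed", "error"}


def save_failed_job_details(payload: dict, out_dir: Path) -> list[Path]:
    """Write one detail file per failed/errored job and return the paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for job in payload.get("results", []):
        if job.get("status") not in FAILED_STATUSES:
            continue
        detail = {
            "id": job["id"],
            "status": job["status"],
            "result_traceback": job.get("result_traceback", ""),
            "job_explanation": job.get("job_explanation", ""),
        }
        path = out_dir / f"job-{job['id']}-detail.json"
        path.write_text(json.dumps(detail, indent=2))
        written.append(path)
    return written


# Hand-written sample payload, not real CI output.
sample = {
    "results": [
        {"id": 1, "status": "successful"},
        {"id": 2, "status": "error",
         "result_traceback": "Traceback (sample)", "job_explanation": ""},
    ]
}
artifact_dir = Path(tempfile.mkdtemp()) / "aap-jobs"
paths = save_failed_job_details(sample, artifact_dir)
print([p.name for p in paths])
```

Keeping the filter separate from the HTTP call makes it possible to check the selection logic against a saved response without a live controller.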

Test step (osac-project-cluster-tool-test-commands.sh):

  • Pre-test resource baseline: node resources, pod count, LVM thin pool, disk usage
  • Monkey-patch poll_until via conftest.py injection to dump resource state at the exact moment of timeout
  • Post-test trap handler: node/pod resources, stuck resources, automation-job pod exit codes
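
The poll_until monkey-patch can be pictured as a thin wrapper. The sketch below is an assumption about the general shape, not the contents of the injected helper: `with_timeout_diagnostics`, `fake_poll_until`, and the dump callback are hypothetical stand-ins.

```python
import functools


def with_timeout_diagnostics(poll_fn, dump_fn):
    """Wrap poll_fn so dump_fn runs when it raises TimeoutError."""

    @functools.wraps(poll_fn)
    def wrapper(*args, **kwargs):
        try:
            return poll_fn(*args, **kwargs)
        except TimeoutError:
            try:
                dump_fn()
            except Exception as exc:  # diagnostics must never mask the timeout
                print(f"diagnostics failed: {exc}")
            raise  # re-raise the original TimeoutError

    return wrapper


# Stand-in for the real poll helper (hypothetical signature).
def fake_poll_until(condition, attempts=3):
    for _ in range(attempts):
        if condition():
            return True
    raise TimeoutError("condition not met")


dumped = []
patched = with_timeout_diagnostics(fake_poll_until, lambda: dumped.append("snapshot"))

ok = patched(lambda: True)
try:
    patched(lambda: False)
    timed_out = False
except TimeoutError:
    timed_out = True
print(ok, timed_out, dumped)
```

In a conftest.py the wrapper would be assigned back onto the module attribute. Any module that already did `from ... import poll_until` before the patch keeps its reference to the unwrapped function, so the patch has to run before those imports.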

What this gets us

On the next CI failure, Prow artifacts will contain:

  • aap-jobs/job-*-detail.json: result_traceback + job_explanation (the exact crash reason)
  • aap-jobs/job-*-stdout.txt: which Ansible task was running when the pod died
  • aap-jobs/pod-*-describe.txt: container exit codes, OOMKill events
  • Pre/post resource snapshots: CPU/memory pressure data
  • Inline timeout diagnostics: resource state at the exact moment of failure

Test plan

  • Rehearsal job passes (step scripts are syntactically valid)
  • On a failing run: verify AAP job artifacts appear in gathered logs
  • On a failing run: verify timeout diagnostics appear in build log

This PR enhances the osac-project e2e-vmaas CI (ci-operator step-registry under ci-operator/step-registry/osac-project/...) to add targeted AAP (Ansible Automation Platform / Automation Controller) job failure diagnostics and timed resource snapshots so intermittent, opaque AAP provision job crashes (rc=None / no visibility) can be diagnosed.

What is affected

  • CI steps used by the osac-project e2e-vmaas workflow: cluster-tool boot, test and gather steps (ci-operator/step-registry/osac-project/boot/, .../test/, .../gather/).

Practical behavior changes

  • Gather step (ci-operator/step-registry/osac-project/gather/osac-project-gather-commands.sh)

    • Creates artifacts/aap-jobs and, when the AAP route + API token exist, queries the Automation Controller API to collect failed/errored jobs, per-job detail JSON (includes result_traceback and job_explanation), per-job stdout (Ansible task stdout), per-job events, and instance-group data.
    • Exports JSON + pod-describe for pods labeled ansible_job (container exit codes, OOMKill/reason) into aap-jobs/.
    • Continues broader cluster-state collection and adds networking resource YAMLs (VirtualNetwork/Subnet/SecurityGroup).
  • Test step (ci-operator/step-registry/osac-project/cluster-tool/test/osac-project-cluster-tool-test-commands.sh)

    • Captures a pre-test resource baseline on the remote test machine: oc adm top nodes, top pods (by memory), pod counts, LVM thin pool stats, and disk usage.
    • Installs /tmp/patch_helpers.py and appends a monkey-patch into tests/conftest.py to override tests.core.runner.poll_until so that a TimeoutError triggers an immediate dump of targeted diagnostics (resource YAMLs, oc adm top outputs, ansible_job pod lists, warning events, etc.) at the exact timeout moment.
    • Adds an EXIT trap to collect post-test diagnostics: post-test node/pod top outputs, stuck resources, and automation-job pod terminated exit codes/reasons.
    • Runs vmaas tests with the helper injected so timeouts produce inline diagnostics.
  • Boot step (ci-operator/step-registry/osac-project/cluster-tool/boot/osac-project-cluster-tool-boot-commands.sh)

    • Mounts patched installer scripts (prepare-aap.sh and refresh-after-snapshot.sh) into the installer runtime:
      • prepare-aap.sh: captures raw curl responses before jq and reports clearer errors when token responses are non-JSON or missing to avoid silent jq failures.
      • refresh-after-snapshot.sh: expands stabilization/synchronization after refresh to wait for AutomationController reconciliation (Successful reason) or fall back to waiting for the osac-aap-controller-task deployment rollout before proceeding—reducing mid-refresh controller restarts that can disrupt AAP jobs.
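
The container exit codes and OOMKill reasons mentioned for the ansible_job pods live in the pod's `status.containerStatuses`. As a rough sketch of how they could be pulled from a saved pod JSON (the function name and sample pod are hypothetical, and this is not the code from the gather script):

```python
def terminated_containers(pod: dict) -> list[dict]:
    """Return name, exit code, and reason for every terminated container."""
    found = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        # A restarted container keeps its previous exit in lastState.
        for key in ("state", "lastState"):
            term = cs.get(key, {}).get("terminated")
            if term:
                found.append({
                    "container": cs["name"],
                    "exit_code": term.get("exitCode"),
                    "reason": term.get("reason", ""),
                })
    return found


# Hand-written sample, shaped like `oc get pod -o json` output.
sample_pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "worker",
                "state": {"terminated": {"exitCode": 137, "reason": "OOMKilled"}},
                "lastState": {},
            },
            {"name": "sidecar", "state": {"running": {}}, "lastState": {}},
        ]
    }
}
exits = terminated_containers(sample_pod)
print(exits)
```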

Additional commit fix

  • The commit fixes a permission problem for patched/heredoc-created scripts by making them executable (chmod +x) before mounting into containers, preventing "Permission denied" (exit code 126) failures.

Expected CI artifacts on failing runs

  • artifacts/.../aap-jobs/failed-jobs.json
  • artifacts/.../aap-jobs/job-*-detail.json (result_traceback, job_explanation)
  • artifacts/.../aap-jobs/job-*-stdout.txt (Ansible stdout)
  • artifacts/.../aap-jobs/job-*-events.json
  • artifacts/.../aap-jobs/instance-groups.json
  • artifacts/.../aap-jobs/pod-*.json and pod-*-describe.txt (container exit codes, OOMKill info)
  • Pre/post resource snapshots and inline timeout diagnostics emitted by the conftest monkey-patch (oc adm top outputs, warning events, resource YAML excerpts).

Test plan / Validation

  • Rehearsal jobs validate syntax; on failing runs the new artifacts and inline timeout diagnostics should appear for root-cause analysis.

Impact

  • Removes a major blind spot for AAP job failures by surfacing controller/job-level tracebacks and Ansible stdout and correlating them with node/pod/storage and controller rollout state (OOMKills, disk/LVM pressure, mid-refresh controller restarts), improving diagnosability of intermittent provisioning crashes.

e2e-vmaas has a 50% failure rate with provision jobs crashing
(rc=None) but no visibility into why. Add diagnostics to capture
the crash reason from AAP API and cluster state on failure.

Gather step: query AAP REST API for failed job details
(result_traceback, job_explanation, stdout), collect automation-job
pod descriptions (exit codes, OOMKill events), and instance group
capacity. Also collect VirtualNetwork/Subnet/SecurityGroup status.

Test step: add pre-test resource baseline (node resources, storage),
monkey-patch poll_until to dump resource state at exact moment of
timeout, and collect post-test diagnostics in the trap handler.
@openshift-ci openshift-ci Bot requested review from danmanor and jhernand May 23, 2026 11:57
@coderabbitai
Contributor

coderabbitai Bot commented May 23, 2026

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 26993eb2-269b-4c8a-b401-c899ad80da86

📥 Commits

Reviewing files that changed from the base of the PR and between b3f95e7 and 58b24fa.

📒 Files selected for processing (1)
  • ci-operator/step-registry/osac-project/cluster-tool/boot/osac-project-cluster-tool-boot-commands.sh

Walkthrough

Adds pre-test cluster/resource baselining and a Python timeout-instrumentation helper, mounts and injects that helper into the test container, expands post-test remote diagnostics, collects AAP failed-job data and pod artifacts, exports networking resources, and waits for the AAP controller rollout during boot.

Changes

OSAC test execution and diagnostics gathering

| Layer / File(s) | Summary |
| --- | --- |
| Pre-test resource baseline and timeout diagnostics helper / ci-operator/step-registry/osac-project/cluster-tool/test/osac-project-cluster-tool-test-commands.sh | Records node metrics, pod counts, lvs and disk usage; creates /tmp/patch_helpers.py that monkey-patches tests.core.runner.poll_until to emit targeted timeout diagnostics before re-raising. |
| Test execution with helper integration / ci-operator/step-registry/osac-project/cluster-tool/test/osac-project-cluster-tool-test-commands.sh | Mounts the Python helper into the vmaas test container and appends it to tests/conftest.py before running pytest (replaces direct pytest invocation). |
| Post-test diagnostics collection / ci-operator/step-registry/osac-project/cluster-tool/test/osac-project-cluster-tool-test-commands.sh | collect_artifacts now runs remote diagnostics: oc adm top for nodes and pods (top 20 by memory), wide resource listings, and per-automation-job pod terminated container exit codes and reasons. |
| AAP job failure and controller-API diagnostics / ci-operator/step-registry/osac-project/gather/osac-project-gather-commands.sh | Creates aap-jobs artifact dir, extracts AAP route and API token from cluster resources, and (when present) calls the AAP controller API to download failed-job lists, per-job details, stdout, events, and instance-group data into artifacts. |
| AAP-labeled pod artifacts / ci-operator/step-registry/osac-project/gather/osac-project-gather-commands.sh | Iterates pods labeled ansible_job and saves each pod's JSON and oc describe pod output into aap-jobs. |
| Networking resource artifact collection / ci-operator/step-registry/osac-project/gather/osac-project-gather-commands.sh | Exports virtualnetwork, subnet, and securitygroup resources as YAML artifacts. |
| Boot: wait for AAP controller rollout / ci-operator/step-registry/osac-project/cluster-tool/boot/osac-project-cluster-tool-boot-commands.sh | Adds patched prepare-aap.sh and refresh-after-snapshot.sh with improved token parsing and expanded stabilization/post-phase reconciliation checks; mounts these into the installer image and adds oc rollout status deployment/osac-aap-controller-task -n "${NAMESPACE}" --timeout=300s as a fallback synchronization step. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • openshift/release#79422: Modifies the same boot script injection point for prepare/refresh logic, taking the opposite approach (removal of injected overrides).
  • openshift/release#79365: Also touches refresh flow and AAP override/rollout handling in the boot process.

Suggested labels

rehearsals-ack

Suggested reviewers

  • danmanor
  • eranco74
🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (11 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'OSAC: add AAP job failure diagnostics to e2e-vmaas CI' accurately summarizes the main change: adding AAP job failure diagnostics collection to CI tests for the OSAC e2e-vmaas pipeline. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | PR modifies only shell scripts in CI infrastructure; no Ginkgo test definitions or Go test files are present or modified, making this check not applicable. |
| Test Structure And Quality | ✅ Passed | PR contains only shell scripts for CI automation, not Ginkgo test code. Custom check for Ginkgo test structure and quality is not applicable. |
| Microshift Test Compatibility | ✅ Passed | PR adds no Ginkgo e2e tests; it only modifies CI infrastructure shell scripts for artifact collection and diagnostics. Check only applies to new e2e tests. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | No Ginkgo e2e tests are added in this PR. All modifications are to bash shell scripts used for CI/CD automation and diagnostic collection, not Go test code with Ginkgo test definitions. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PR modifies three CI step-registry shell scripts without deployment manifests, operator code, or scheduling constraints (no affinity, topology keys, node selectors, replicas, taints). |
| Ote Binary Stdout Contract | ✅ Passed | PR modifies only shell scripts and embedded Python patches for CI orchestration. The OTE Binary Stdout Contract check applies to Go OTE test binaries. This PR contains no Go OTE binaries. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | No new Ginkgo e2e tests are added in this PR. The three modified files are bash CI orchestration scripts, not test code. The check is not applicable. |

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 23, 2026
@omer-vishlitzky
Contributor Author

/pj-rehearse pull-ci-osac-project-osac-operator-main-e2e-vmaas pull-ci-osac-project-fulfillment-service-main-e2e-vmaas pull-ci-osac-project-osac-test-infra-main-e2e-vmaas pull-ci-osac-project-osac-installer-main-e2e-vmaas pull-ci-osac-project-osac-aap-main-e2e-vmaas

@openshift-merge-bot
Contributor

@omer-vishlitzky: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@omer-vishlitzky
Contributor Author

/pj-rehearse pull-ci-osac-project-fulfillment-service-main-e2e-vmaas

@openshift-merge-bot
Contributor

@omer-vishlitzky: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@omer-vishlitzky
Contributor Author

/pj-rehearse pull-ci-osac-project-osac-operator-main-e2e-vmaas pull-ci-osac-project-fulfillment-service-main-e2e-vmaas pull-ci-osac-project-osac-test-infra-main-e2e-vmaas pull-ci-osac-project-osac-installer-main-e2e-vmaas pull-ci-osac-project-osac-aap-main-e2e-vmaas

@openshift-merge-bot
Contributor

@omer-vishlitzky: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

The AAP operator triggers controller-task deployment rollouts
during the refresh-after-snapshot sequence. If tests start before
the rollout completes, the old pod is terminated mid-test, its
Redis sidecar socket vanishes, and running AAP jobs crash with
redis.exceptions.ConnectionError on /var/run/redis/redis.sock.

Wait for the rollout to finish after refresh, before declaring
boot complete. This matches how the refresh script already waits
for fulfillment deployment rollouts.
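
What "rollout finished" means can be stated against the Deployment's status fields. The sketch below is a simplified approximation of the checks a rollout-status wait performs, written against a Deployment JSON object; it is not the logic of `oc rollout status` itself, and the function name and sample objects are hypothetical.

```python
def rollout_complete(deployment: dict) -> bool:
    """Return True when no old pods remain and all new pods are available."""
    spec_replicas = deployment.get("spec", {}).get("replicas", 1)
    status = deployment.get("status", {})
    generation = deployment.get("metadata", {}).get("generation", 0)

    if status.get("observedGeneration", 0) < generation:
        return False  # controller has not processed the latest spec yet
    updated = status.get("updatedReplicas", 0)
    if updated < spec_replicas:
        return False  # new pods still being created
    if status.get("replicas", 0) > updated:
        return False  # old pods still terminating
    return status.get("availableReplicas", 0) >= updated


# Hand-written samples: one old pod still present, then the settled state.
mid_rollout = {
    "metadata": {"generation": 4},
    "spec": {"replicas": 1},
    "status": {"observedGeneration": 4, "replicas": 2,
               "updatedReplicas": 1, "availableReplicas": 1},
}
finished = {
    "metadata": {"generation": 4},
    "spec": {"replicas": 1},
    "status": {"observedGeneration": 4, "replicas": 1,
               "updatedReplicas": 1, "availableReplicas": 1},
}
print(rollout_complete(mid_rollout), rollout_complete(finished))
```

The `mid_rollout` case is the window described above: the new pod is already available, but the old pod (and its Redis sidecar) is still about to be terminated.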
@omer-vishlitzky
Contributor Author

/pj-rehearse pull-ci-osac-project-osac-operator-main-e2e-vmaas pull-ci-osac-project-fulfillment-service-main-e2e-vmaas pull-ci-osac-project-osac-test-infra-main-e2e-vmaas pull-ci-osac-project-osac-installer-main-e2e-vmaas pull-ci-osac-project-osac-aap-main-e2e-vmaas

@openshift-merge-bot
Contributor

@omer-vishlitzky: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@openshift-ci
Contributor

openshift-ci Bot commented May 23, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: omer-vishlitzky

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@omer-vishlitzky
Contributor Author

/pj-rehearse pull-ci-osac-project-osac-operator-main-e2e-vmaas

@openshift-merge-bot
Contributor

@omer-vishlitzky: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@omer-vishlitzky
Contributor Author

/pj-rehearse pull-ci-osac-project-osac-operator-main-e2e-vmaas pull-ci-osac-project-fulfillment-service-main-e2e-vmaas pull-ci-osac-project-osac-installer-main-e2e-vmaas pull-ci-osac-project-osac-aap-main-e2e-vmaas

@openshift-merge-bot
Contributor

@omer-vishlitzky: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

Embed verbatim copies of refresh-after-snapshot.sh and
prepare-aap.sh from osac-installer main, with two fixes:

prepare-aap.sh: capture curl response before jq so we see
what AAP returns when it responds with non-JSON (was causing
silent "parse error: Invalid numeric literal" crash).
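
The capture-then-parse pattern is language independent. Here is a small Python sketch of the same idea, under the assumption that the token lives in a top-level `token` field; the exception class, function name, and sample bodies are hypothetical and do not come from prepare-aap.sh.

```python
import json


class TokenResponseError(Exception):
    """Raised when the token endpoint returns something unusable."""


def extract_token(raw_body: str) -> str:
    """Parse a captured response body and return the token, or fail loudly."""
    try:
        data = json.loads(raw_body)
    except json.JSONDecodeError as exc:
        # Show what the server actually sent instead of a bare parse error.
        raise TokenResponseError(
            f"non-JSON response from AAP: {raw_body[:200]!r}"
        ) from exc
    token = data.get("token") if isinstance(data, dict) else None
    if not token:
        raise TokenResponseError(
            f"response has no 'token' field: {raw_body[:200]!r}"
        )
    return token


good = extract_token('{"token": "abc123"}')
try:
    extract_token("<html>502 Bad Gateway</html>")
    error_text = ""
except TokenResponseError as err:
    error_text = str(err)
print(good)
print(error_text)
```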

refresh-after-snapshot.sh: after step [8/8], wait for the
AutomationController to reach Successful status before
declaring refresh complete. The AAP operator triggers multiple
async controller-task rollouts that kill the Redis sidecar
socket on the old pod, crashing in-flight provision jobs.
Waiting for Successful ensures all rollouts are done.
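
A sketch of the wait described above, assuming the AutomationController object exposes `status.conditions` entries with a `reason` field that becomes `Successful`. The exact condition layout is an assumption, and `has_successful_condition`, `wait_until`, and the sample states are hypothetical names for illustration.

```python
import time


def has_successful_condition(resource: dict) -> bool:
    """True when any status condition reports reason 'Successful'."""
    conditions = resource.get("status", {}).get("conditions", [])
    return any(c.get("reason") == "Successful" for c in conditions)


def wait_until(fetch, predicate, timeout_s, interval_s=5.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll fetch() until predicate passes; return False on timeout."""
    deadline = clock() + timeout_s
    while True:
        if predicate(fetch()):
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Simulated sequence of observed states; a real caller would query the cluster.
states = iter([
    {"status": {"conditions": [{"type": "Running", "reason": "Running"}]}},
    {"status": {"conditions": [{"type": "Running", "reason": "Successful"}]}},
])
reconciled = wait_until(lambda: next(states), has_successful_condition,
                        timeout_s=60, sleep=lambda _seconds: None)
gave_up = wait_until(lambda: {}, has_successful_condition,
                     timeout_s=0, sleep=lambda _seconds: None)
print(reconciled, gave_up)
```

Returning False on timeout rather than raising leaves room for a fallback step, such as waiting on the controller-task deployment rollout.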
@omer-vishlitzky
Contributor Author

/pj-rehearse pull-ci-osac-project-osac-operator-main-e2e-vmaas pull-ci-osac-project-fulfillment-service-main-e2e-vmaas pull-ci-osac-project-osac-test-infra-main-e2e-vmaas pull-ci-osac-project-osac-installer-main-e2e-vmaas pull-ci-osac-project-osac-aap-main-e2e-vmaas

@openshift-merge-bot
Contributor

@omer-vishlitzky: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

The heredoc-created scripts need chmod +x before being
mounted into the container, otherwise the refresh script
fails with "Permission denied" (exit code 126).
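
The failure mode is easy to reproduce outside CI. The sketch below assumes a POSIX shell is available as `sh` and that the temporary directory is not mounted noexec; the script name is hypothetical. It writes a script without the execute bit, invokes it through the shell, then repeats after adding the bit.

```python
import os
import shlex
import stat
import subprocess
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "refresh-sketch.sh")  # hypothetical file name
    with open(script, "w") as fh:
        fh.write("#!/bin/sh\nexit 0\n")
    os.chmod(script, 0o644)  # readable but not executable

    cmd = ["sh", "-c", shlex.quote(script)]
    # The shell finds the file but cannot execute it.
    before = subprocess.run(cmd, stderr=subprocess.DEVNULL).returncode

    # Roughly what `chmod +x` does: add execute for owner, group, and others.
    mode = os.stat(script).st_mode
    os.chmod(script, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    after = subprocess.run(cmd, stderr=subprocess.DEVNULL).returncode

print(before, after)
```

Exit status 126 is the shell's convention for "command found but not executable", which is why the refresh step failed before any of its own logic ran.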
@omer-vishlitzky
Contributor Author

/pj-rehearse pull-ci-osac-project-osac-operator-main-e2e-vmaas pull-ci-osac-project-fulfillment-service-main-e2e-vmaas pull-ci-osac-project-osac-test-infra-main-e2e-vmaas pull-ci-osac-project-osac-installer-main-e2e-vmaas pull-ci-osac-project-osac-aap-main-e2e-vmaas

@openshift-merge-bot
Contributor

@omer-vishlitzky: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@openshift-merge-bot
Contributor

[REHEARSALNOTIFIER]
@omer-vishlitzky: the pj-rehearse plugin accommodates running rehearsal tests for the changes in this PR. Expand 'Interacting with pj-rehearse' for usage details. The following rehearsable tests have been affected by this change:

| Test name | Repo | Type | Reason |
| --- | --- | --- | --- |
| pull-ci-osac-project-fulfillment-service-main-e2e-vmaas | osac-project/fulfillment-service | presubmit | Registry content changed |
| pull-ci-osac-project-osac-installer-main-e2e-vmaas | osac-project/osac-installer | presubmit | Registry content changed |
| pull-ci-osac-project-osac-test-infra-main-e2e-vmaas | osac-project/osac-test-infra | presubmit | Registry content changed |
| pull-ci-osac-project-osac-aap-main-e2e-vmaas | osac-project/osac-aap | presubmit | Registry content changed |
| pull-ci-osac-project-osac-operator-main-e2e-vmaas | osac-project/osac-operator | presubmit | Registry content changed |
| periodic-ci-osac-project-osac-test-infra-main-e2e-metal-vmaas-compute-instance-cli-fields | N/A | periodic | Registry content changed |
| periodic-ci-osac-project-osac-test-infra-main-e2e-metal-vmaas-compute-instance-delete-during-provision | N/A | periodic | Registry content changed |
| periodic-ci-osac-project-osac-test-infra-main-e2e-metal-vmaas-compute-instance-restart | N/A | periodic | Registry content changed |
| periodic-ci-osac-project-osac-test-infra-main-e2e-metal-vmaas-compute-instance-restart-negative | N/A | periodic | Registry content changed |
| periodic-ci-osac-project-osac-test-infra-main-e2e-metal-vmaas-subnet-lifecycle | N/A | periodic | Registry content changed |
| periodic-ci-osac-project-osac-test-infra-main-e2e-metal-vmaas-virtual-network-lifecycle | N/A | periodic | Registry content changed |
| periodic-ci-osac-project-osac-test-infra-main-e2e-metal-vmaas-compute-instance-creation | N/A | periodic | Registry content changed |
| periodic-ci-osac-project-osac-test-infra-main-e2e-metal-vmaas-compute-instance-api-fields | N/A | periodic | Registry content changed |

Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.
