OSAC: add AAP job failure diagnostics to e2e-vmaas CI #79650
omer-vishlitzky wants to merge 4 commits into
e2e-vmaas has a 50% failure rate, with provision jobs crashing (rc=None) and no visibility into why. Add diagnostics to capture the crash reason from the AAP API and the cluster state on failure.

Gather step: query the AAP REST API for failed job details (result_traceback, job_explanation, stdout), collect automation-job pod descriptions (exit codes, OOMKill events), and record instance group capacity. Also collect VirtualNetwork/Subnet/SecurityGroup status.

Test step: add a pre-test resource baseline (node resources, storage), monkey-patch poll_until to dump resource state at the exact moment of timeout, and collect post-test diagnostics in the trap handler.
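The failed-job query described for the gather step can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `fetch_json` callable, the `/api/v2` prefix, and the function name are assumptions, and the actual gather script is a shell script. Only the field names `result_traceback` and `job_explanation` come from the commit message.

```python
from typing import Any, Callable, Dict, List


def summarize_failed_jobs(
    fetch_json: Callable[[str], Dict[str, Any]],
    api_root: str = "/api/v2",  # assumption: adjust to the API prefix of the deployed AAP
) -> List[Dict[str, Any]]:
    """Return the crash-relevant fields for each failed job.

    fetch_json is a hypothetical helper that performs an authenticated GET
    against the controller and returns the decoded JSON body.
    """
    # Only the first page of results is read here; a real collector would
    # follow the paginated "next" links as well.
    listing = fetch_json(f"{api_root}/jobs/?status=failed")
    summaries = []
    for job in listing.get("results", []):
        detail = fetch_json(f"{api_root}/jobs/{job['id']}/")
        summaries.append({
            "id": detail.get("id"),
            "name": detail.get("name"),
            "status": detail.get("status"),
            "job_explanation": detail.get("job_explanation", ""),
            "result_traceback": detail.get("result_traceback", ""),
        })
    return summaries
```

Passing the HTTP layer in as a callable keeps the sketch independent of how the CI step authenticates, and makes it easy to exercise with canned responses.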
Walkthrough
Adds pre-test cluster/resource baselining and a Python timeout-instrumentation helper, mounts and injects that helper into the test container, expands post-test remote diagnostics, collects AAP failed-job data and pod artifacts, exports networking resources, and waits for the AAP controller rollout during boot.
Changes
OSAC test execution and diagnostics gathering
/pj-rehearse pull-ci-osac-project-osac-operator-main-e2e-vmaas pull-ci-osac-project-fulfillment-service-main-e2e-vmaas pull-ci-osac-project-osac-test-infra-main-e2e-vmaas pull-ci-osac-project-osac-installer-main-e2e-vmaas pull-ci-osac-project-osac-aap-main-e2e-vmaas
@omer-vishlitzky: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.
/pj-rehearse pull-ci-osac-project-fulfillment-service-main-e2e-vmaas
The AAP operator triggers controller-task deployment rollouts during the refresh-after-snapshot sequence. If tests start before the rollout completes, the old pod is terminated mid-test, its Redis sidecar socket vanishes, and running AAP jobs crash with redis.exceptions.ConnectionError on /var/run/redis/redis.sock. Wait for the rollout to finish after refresh, before declaring boot complete. This matches how the refresh script already waits for fulfillment deployment rollouts.
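The wait described above could look roughly like this when driven from Python. It is a sketch under stated assumptions: the deployment and namespace names in the usage example are placeholders, the `run` parameter exists only so the command can be tested without a cluster, and the actual boot step is a shell script that may implement the wait differently.

```python
import subprocess


def wait_for_rollout(deployment, namespace, timeout="10m", run=subprocess.run):
    """Block until `oc rollout status` reports that the rollout has finished."""
    cmd = [
        "oc", "rollout", "status", f"deployment/{deployment}",
        "-n", namespace,
        f"--timeout={timeout}",
    ]
    # check=True raises CalledProcessError if the rollout fails or the
    # timeout expires, so the boot step stops instead of starting tests
    # against a pod that is about to be replaced.
    run(cmd, check=True)
    return cmd
```

A call such as `wait_for_rollout("example-controller-task", "example-namespace")` (both names hypothetical) would go after the refresh sequence and before boot is declared complete.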
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: omer-vishlitzky
/pj-rehearse pull-ci-osac-project-osac-operator-main-e2e-vmaas
/pj-rehearse pull-ci-osac-project-osac-operator-main-e2e-vmaas pull-ci-osac-project-fulfillment-service-main-e2e-vmaas pull-ci-osac-project-osac-installer-main-e2e-vmaas pull-ci-osac-project-osac-aap-main-e2e-vmaas
Embed verbatim copies of refresh-after-snapshot.sh and prepare-aap.sh from osac-installer main, with two fixes:

prepare-aap.sh: capture the curl response before passing it to jq, so we can see what AAP returns when it responds with non-JSON (this was causing a silent "parse error: Invalid numeric literal" crash).

refresh-after-snapshot.sh: after step [8/8], wait for the AutomationController to reach Successful status before declaring the refresh complete. The AAP operator triggers multiple async controller-task rollouts that kill the Redis sidecar socket on the old pod, crashing in-flight provision jobs. Waiting for Successful ensures all rollouts are done.
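The prepare-aap.sh fix follows a general pattern: keep the raw response body before handing it to a JSON parser, so a non-JSON reply is reported rather than lost. The actual fix is in shell with curl and jq; the snippet below is only a Python analogue of the same idea, and the function name and the 500-character preview limit are assumptions.

```python
import json


def parse_json_response(body: str, source: str = "AAP"):
    """Parse a response body as JSON, surfacing the raw text if parsing fails."""
    try:
        return json.loads(body)
    except json.JSONDecodeError as err:
        # Include a bounded preview of what the server actually sent, so an
        # HTML error page or empty reply is visible in the CI log.
        preview = body[:500]
        raise ValueError(
            f"{source} returned a non-JSON response ({err.msg}): {preview!r}"
        ) from err
```

With this shape, a gateway error page shows up in the exception message instead of only a bare parser error.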
The heredoc-created scripts need chmod +x before being mounted into the container, otherwise the refresh script fails with "Permission denied" (exit code 126).
Summary
- e2e-vmaas provision jobs crash (rc=None) but there is zero visibility into the crash reason
Changes
- Gather step (osac-project-gather-commands.sh): query the AAP REST API for failed jobs and save result_traceback, job_explanation, and stdout per job
- Test step (osac-project-cluster-tool-test-commands.sh): monkey-patch poll_until via conftest.py injection to dump resource state at the exact moment of timeout
What this gets us
On the next CI failure, Prow artifacts will contain:
- aap-jobs/job-*-detail.json: result_traceback + job_explanation (the exact crash reason)
- aap-jobs/job-*-stdout.txt: which Ansible task was running when the pod died
- aap-jobs/pod-*-describe.txt: container exit codes, OOMKill events
Test plan
This PR enhances the osac-project e2e-vmaas CI (ci-operator step-registry under ci-operator/step-registry/osac-project/...) to add targeted AAP (Ansible Automation Platform / Automation Controller) job failure diagnostics and timed resource snapshots so intermittent, opaque AAP provision job crashes (rc=None / no visibility) can be diagnosed.
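The timeout instrumentation mentioned in this PR (wrapping poll_until so that resource state is dumped when a wait times out) can be sketched as a generic wrapper. The real signature of poll_until and the exception it raises are not shown in this conversation, so the sketch assumes it raises TimeoutError; `with_timeout_dump` and `dump_state` are hypothetical names.

```python
import functools


def with_timeout_dump(poll_fn, dump_state, timeout_exc=TimeoutError):
    """Wrap poll_fn so that dump_state() runs when poll_fn raises timeout_exc."""

    @functools.wraps(poll_fn)
    def wrapper(*args, **kwargs):
        try:
            return poll_fn(*args, **kwargs)
        except timeout_exc:
            try:
                dump_state()
            except Exception as err:
                # A failing diagnostic must never hide the original timeout.
                print(f"diagnostic dump failed: {err}")
            raise  # re-raise the timeout so the test still fails as before

    return wrapper
```

In a conftest.py, the patch would amount to reassigning the attribute on the module that defines poll_until, for example `helpers.poll_until = with_timeout_dump(helpers.poll_until, dump_cluster_state)`, where `helpers` and `dump_cluster_state` are placeholders. Modules that already ran `from helpers import poll_until` before the patch keep a reference to the unwrapped function, so the patch has to be applied before those imports.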
What is affected
Practical behavior changes
Gather step (ci-operator/step-registry/osac-project/gather/osac-project-gather-commands.sh)
Test step (ci-operator/step-registry/osac-project/cluster-tool/test/osac-project-cluster-tool-test-commands.sh)
Boot step (ci-operator/step-registry/osac-project/cluster-tool/boot/osac-project-cluster-tool-boot-commands.sh)
Additional commit fix
Expected CI artifacts on failing runs
Test plan / Validation
Impact