OSAC: add AAP job failure diagnostics to e2e-vmaas CI (2)#79651
OSAC: add AAP job failure diagnostics to e2e-vmaas CI (2)#79651omer-vishlitzky wants to merge 4 commits into
Conversation
e2e-vmaas has a 50% failure rate with provision jobs crashing (rc=None) but no visibility into why. Add diagnostics to capture the crash reason from AAP API and cluster state on failure. Gather step: query AAP REST API for failed job details (result_traceback, job_explanation, stdout), collect automation-job pod descriptions (exit codes, OOMKill events), and instance group capacity. Also collect VirtualNetwork/Subnet/SecurityGroup status. Test step: add pre-test resource baseline (node resources, storage), monkey-patch poll_until to dump resource state at exact moment of timeout, and collect post-test diagnostics in the trap handler.
WalkthroughAdds pre-test baselines, a Python helper that instruments test timeouts to emit targeted Kubernetes diagnostics, mounts that helper into the test container, expands remote post-test diagnostics (node/pod/resource listings and ansible-job container exit info), collects AAP failed-job details and pod diagnostics, exports virtualization networking YAMLs, and waits for the AAP controller-task rollout during boot. ChangesEnhanced Test Diagnostics and Artifact Collection
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 11 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (11 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
/pj-rehearse pull-ci-osac-project-osac-operator-main-e2e-vmaas pull-ci-osac-project-fulfillment-service-main-e2e-vmaas pull-ci-osac-project-osac-test-infra-main-e2e-vmaas pull-ci-osac-project-osac-installer-main-e2e-vmaas pull-ci-osac-project-osac-aap-main-e2e-vmaas |
|
@omer-vishlitzky: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@ci-operator/step-registry/osac-project/cluster-tool/test/osac-project-cluster-tool-test-commands.sh`:
- Around line 106-111: The test loop prints only r.stdout so error output is
lost; update the cmds loop to print both stdout and stderr from subprocess.run
(e.g., include r.stderr in the print call) and handle timeouts by catching
subprocess.TimeoutExpired and printing e.stdout and e.stderr as well;
specifically modify the for cmd in cmds: block that calls subprocess.run(...) to
emit f"--- {cmd} ---\n{r.stdout}\n{r.stderr}" to sys.stderr (truncated if
desired) and in the except Exception as e: branch, detect
subprocess.TimeoutExpired (or print getattr(e, 'stdout', '') and getattr(e,
'stderr', '')) so timeout failures also dump stderr.
- Around line 103-104: The timeout-diagnostics commands use invalid invocations
("kubectl adm top nodes" and "kubectl adm top pods -n {ns} --sort-by=memory");
change them to the correct kubectl subcommand names ("kubectl top nodes" and
"kubectl top pods -n {ns} --sort-by=memory") so the functions that build the
diagnostics list (the strings containing "kubectl adm top nodes" and "kubectl
adm top pods -n {ns} --sort-by=memory") will run successfully and produce
node/pod metrics.
In
`@ci-operator/step-registry/osac-project/gather/osac-project-gather-commands.sh`:
- Around line 92-126: The new AAP gather block runs without errexit so setup
failures can be skipped; add set -e (preferably set -o errexit or set -euo
pipefail consistent with other step scripts) at the start of that remote heredoc
before the AAP block to enable errexit, and then remove the blanket "|| true"
only from critical setup commands (e.g. mkdir -p "${ARTIFACT_DIR}/aap-jobs",
AAP_ROUTE=$(oc get route ...), AAP_TOKEN=$(oc get secret ...)) so those fail the
script on error; keep "|| true" on the best-effort collectors/outputs (the curl
and oc get pod/json/describe lines that write into ${ARTIFACT_DIR}) to preserve
diagnostics gathering.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: a692b313-7c18-407b-95af-8ef3092d920a
📒 Files selected for processing (2)
ci-operator/step-registry/osac-project/cluster-tool/test/osac-project-cluster-tool-test-commands.shci-operator/step-registry/osac-project/gather/osac-project-gather-commands.sh
| "kubectl adm top nodes", | ||
| f"kubectl adm top pods -n {ns} --sort-by=memory", |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
Does kubectlsupport the subcommand sequencekubectl adm top, or is the correct syntax kubectl top? If ocis used instead, what is the correctoc adm top syntax?
💡 Result:
The correct syntax for the Kubernetes command-line tool is kubectl top [1][2]. The sequence kubectl adm top is not a valid command [1][3]. When using the OpenShift CLI (oc), the correct syntax is oc adm top [4][5]. This command is used to analyze resources managed by the platform and provides usage statistics for various resources, such as nodes and pods [4][6]. Common examples for the OpenShift CLI include: - oc adm top node: Displays resource (CPU/memory) usage of nodes [5][6]. - oc adm top pod: Displays resource (CPU/memory) usage of pods [5][7]. The oc adm top command internally utilizes kubectl top functionality to retrieve these metrics [8].
Citations:
- 1: https://kubernetes.io/docs/reference/kubectl/generated/kubectl_top/
- 2: https://signoz.io/blog/kubectl-top/
- 3: https://manpages.debian.org/bullseye/kubernetes-client/kubectl-top.1.en.html
- 4: https://manpages.opensuse.org/Tumbleweed/oc/oc-adm-top.1.en.html
- 5: https://docs.okd.io/latest/cli_reference/openshift_cli/administrator-cli-commands.html
- 6: https://github.com/RedHatWorkshops/openshiftv3-ops-workshop/blob/master/oc_adm_top.md
- 7: https://manpages.opensuse.org/Tumbleweed/oc/oc-adm-top-pod.1
- 8: https://www.redhat.com/en/blog/using-oc-adm-top-to-monitor-memory-usage
Fix invalid kubectl metrics commands in timeout diagnostics
kubectl adm top ... is not a valid kubectl invocation, so the timeout diagnostics will fail and produce no node/pod metrics.
Suggested fix
- "kubectl adm top nodes",
- f"kubectl adm top pods -n {ns} --sort-by=memory",
+ "kubectl top nodes",
+ f"kubectl top pods -n {ns} --sort-by=memory",📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| "kubectl adm top nodes", | |
| f"kubectl adm top pods -n {ns} --sort-by=memory", | |
| "kubectl top nodes", | |
| f"kubectl top pods -n {ns} --sort-by=memory", |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@ci-operator/step-registry/osac-project/cluster-tool/test/osac-project-cluster-tool-test-commands.sh`
around lines 103 - 104, The timeout-diagnostics commands use invalid invocations
("kubectl adm top nodes" and "kubectl adm top pods -n {ns} --sort-by=memory");
change them to the correct kubectl subcommand names ("kubectl top nodes" and
"kubectl top pods -n {ns} --sort-by=memory") so the functions that build the
diagnostics list (the strings containing "kubectl adm top nodes" and "kubectl
adm top pods -n {ns} --sort-by=memory") will run successfully and produce
node/pod metrics.
| for cmd in cmds: | ||
| try: | ||
| r = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=15) | ||
| print(f"--- {cmd} ---\n{r.stdout[:3000]}", file=sys.stderr, flush=True) | ||
| except Exception as e: | ||
| print(f"--- {cmd} FAILED: {e} ---", file=sys.stderr, flush=True) |
There was a problem hiding this comment.
Include stderr in the timeout dump.
subprocess.run(..., capture_output=True) captures both streams, but this code only prints r.stdout. Any failed kubectl call becomes an empty section instead of showing the actual error.
Suggested fix
for cmd in cmds:
try:
r = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=15)
- print(f"--- {cmd} ---\n{r.stdout[:3000]}", file=sys.stderr, flush=True)
+ output = "\n".join(part for part in (r.stdout.strip(), r.stderr.strip()) if part)
+ print(
+ f"--- {cmd} (rc={r.returncode}) ---\n{output[:3000]}",
+ file=sys.stderr,
+ flush=True,
+ )
except Exception as e:
print(f"--- {cmd} FAILED: {e} ---", file=sys.stderr, flush=True)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@ci-operator/step-registry/osac-project/cluster-tool/test/osac-project-cluster-tool-test-commands.sh`
around lines 106 - 111, The test loop prints only r.stdout so error output is
lost; update the cmds loop to print both stdout and stderr from subprocess.run
(e.g., include r.stderr in the print call) and handle timeouts by catching
subprocess.TimeoutExpired and printing e.stdout and e.stderr as well;
specifically modify the for cmd in cmds: block that calls subprocess.run(...) to
emit f"--- {cmd} ---\n{r.stdout}\n{r.stderr}" to sys.stderr (truncated if
desired) and in the except Exception as e: branch, detect
subprocess.TimeoutExpired (or print getattr(e, 'stdout', '') and getattr(e,
'stderr', '')) so timeout failures also dump stderr.
| echo "=== Collecting AAP job failure diagnostics ===" | ||
| mkdir -p "${ARTIFACT_DIR}/aap-jobs" | ||
|
|
||
| AAP_ROUTE=$(oc get route osac-aap -n "${E2E_NAMESPACE}" -o jsonpath='{.spec.host}' 2>/dev/null || true) | ||
| AAP_TOKEN=$(oc get secret osac-aap-api-token -n "${E2E_NAMESPACE}" -o jsonpath='{.data.token}' 2>/dev/null | base64 -d || true) | ||
|
|
||
| if [[ -n "${AAP_ROUTE}" && -n "${AAP_TOKEN}" ]]; then | ||
| AUTH="Authorization: Bearer ${AAP_TOKEN}" | ||
| BASE="https://${AAP_ROUTE}/api/controller/v2" | ||
|
|
||
| curl -sk -H "${AUTH}" "${BASE}/jobs/?status__in=error,failed&order_by=-finished&page_size=20" \ | ||
| > "${ARTIFACT_DIR}/aap-jobs/failed-jobs.json" 2>&1 || true | ||
|
|
||
| for JOB_ID in $(jq -r '.results[].id' "${ARTIFACT_DIR}/aap-jobs/failed-jobs.json" 2>/dev/null | head -10); do | ||
| curl -sk -H "${AUTH}" "${BASE}/jobs/${JOB_ID}/" \ | ||
| > "${ARTIFACT_DIR}/aap-jobs/job-${JOB_ID}-detail.json" 2>&1 || true | ||
| curl -sk -H "${AUTH}" "${BASE}/jobs/${JOB_ID}/stdout/?format=txt" \ | ||
| > "${ARTIFACT_DIR}/aap-jobs/job-${JOB_ID}-stdout.txt" 2>&1 || true | ||
| curl -sk -H "${AUTH}" "${BASE}/jobs/${JOB_ID}/job_events/?order_by=-counter&page_size=30" \ | ||
| > "${ARTIFACT_DIR}/aap-jobs/job-${JOB_ID}-events.json" 2>&1 || true | ||
| done | ||
|
|
||
| curl -sk -H "${AUTH}" "${BASE}/instance_groups/" \ | ||
| > "${ARTIFACT_DIR}/aap-jobs/instance-groups.json" 2>&1 || true | ||
| fi | ||
|
|
||
| for POD in $(oc get pods -n "${E2E_NAMESPACE}" -l ansible_job --no-headers -o custom-columns=NAME:.metadata.name 2>/dev/null); do | ||
| oc get pod "${POD}" -n "${E2E_NAMESPACE}" -o json > "${ARTIFACT_DIR}/aap-jobs/pod-${POD}.json" 2>&1 || true | ||
| oc describe pod "${POD}" -n "${E2E_NAMESPACE}" > "${ARTIFACT_DIR}/aap-jobs/pod-${POD}-describe.txt" 2>&1 || true | ||
| done | ||
|
|
||
| echo "=== Collecting networking resource status ===" | ||
| oc get virtualnetwork -n "${E2E_NAMESPACE}" -o yaml > "${ARTIFACT_DIR}/virtualnetworks.yaml" 2>&1 || true | ||
| oc get subnet -n "${E2E_NAMESPACE}" -o yaml > "${ARTIFACT_DIR}/subnets.yaml" 2>&1 || true | ||
| oc get securitygroup -n "${E2E_NAMESPACE}" -o yaml > "${ARTIFACT_DIR}/securitygroups.yaml" 2>&1 || true |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win
Run the new gather diagnostics under errexit too.
This new block still executes inside a remote heredoc that defaults to nounset + pipefail only, so setup failures here can be skipped silently unless each command is manually guarded. Please enable set -o errexit for that remote shell and keep || true only on the best-effort collectors.
Suggested fix
timeout -s 9 10m ssh -F "${SHARED_DIR}/ssh_config" ci_machine bash -s "${E2E_NAMESPACE}" "${REMOTE_ARTIFACT_DIR}" <<'REMOTE_EOF'
+set -o errexit
set -o nounset
set -o pipefailAs per coding guidelines "Step registry script files must use set -euo pipefail (without -x) as default and only enable -x when actively debugging".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@ci-operator/step-registry/osac-project/gather/osac-project-gather-commands.sh`
around lines 92 - 126, The new AAP gather block runs without errexit so setup
failures can be skipped; add set -e (preferably set -o errexit or set -euo
pipefail consistent with other step scripts) at the start of that remote heredoc
before the AAP block to enable errexit, and then remove the blanket "|| true"
only from critical setup commands (e.g. mkdir -p "${ARTIFACT_DIR}/aap-jobs",
AAP_ROUTE=$(oc get route ...), AAP_TOKEN=$(oc get secret ...)) so those fail the
script on error; keep "|| true" on the best-effort collectors/outputs (the curl
and oc get pod/json/describe lines that write into ${ARTIFACT_DIR}) to preserve
diagnostics gathering.
|
@omer-vishlitzky: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
/pj-rehearse pull-ci-osac-project-osac-operator-main-e2e-vmaas pull-ci-osac-project-fulfillment-service-main-e2e-vmaas pull-ci-osac-project-osac-test-infra-main-e2e-vmaas pull-ci-osac-project-osac-installer-main-e2e-vmaas pull-ci-osac-project-osac-aap-main-e2e-vmaas |
The AAP operator triggers controller-task deployment rollouts during the refresh-after-snapshot sequence. If tests start before the rollout completes, the old pod is terminated mid-test, its Redis sidecar socket vanishes, and running AAP jobs crash with redis.exceptions.ConnectionError on /var/run/redis/redis.sock. Wait for the rollout to finish after refresh, before declaring boot complete. This matches how the refresh script already waits for fulfillment deployment rollouts.
|
/pj-rehearse pull-ci-osac-project-osac-operator-main-e2e-vmaas pull-ci-osac-project-fulfillment-service-main-e2e-vmaas pull-ci-osac-project-osac-test-infra-main-e2e-vmaas pull-ci-osac-project-osac-installer-main-e2e-vmaas pull-ci-osac-project-osac-aap-main-e2e-vmaas |
|
@omer-vishlitzky: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: omer-vishlitzky The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/pj-rehearse pull-ci-osac-project-osac-operator-main-e2e-vmaas pull-ci-osac-project-fulfillment-service-main-e2e-vmaas pull-ci-osac-project-osac-installer-main-e2e-vmaas pull-ci-osac-project-osac-aap-main-e2e-vmaas |
|
@omer-vishlitzky: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
Embed verbatim copies of refresh-after-snapshot.sh and prepare-aap.sh from osac-installer main, with two fixes: prepare-aap.sh: capture curl response before jq so we see what AAP returns when it responds with non-JSON (was causing silent "parse error: Invalid numeric literal" crash). refresh-after-snapshot.sh: after step [8/8], wait for the AutomationController to reach Successful status before declaring refresh complete. The AAP operator triggers multiple async controller-task rollouts that kill the Redis sidecar socket on the old pod, crashing in-flight provision jobs. Waiting for Successful ensures all rollouts are done.
|
/pj-rehearse pull-ci-osac-project-osac-operator-main-e2e-vmaas pull-ci-osac-project-fulfillment-service-main-e2e-vmaas pull-ci-osac-project-osac-test-infra-main-e2e-vmaas pull-ci-osac-project-osac-installer-main-e2e-vmaas pull-ci-osac-project-osac-aap-main-e2e-vmaas |
|
@omer-vishlitzky: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
The heredoc-created scripts need chmod +x before being mounted into the container, otherwise the refresh script fails with "Permission denied" (exit code 126).
|
/pj-rehearse pull-ci-osac-project-osac-operator-main-e2e-vmaas pull-ci-osac-project-fulfillment-service-main-e2e-vmaas pull-ci-osac-project-osac-test-infra-main-e2e-vmaas pull-ci-osac-project-osac-installer-main-e2e-vmaas pull-ci-osac-project-osac-aap-main-e2e-vmaas |
|
@omer-vishlitzky: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
[REHEARSALNOTIFIER]
Interacting with pj-rehearseComment: Once you are satisfied with the results of the rehearsals, comment: |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
ci-operator/step-registry/osac-project/cluster-tool/boot/osac-project-cluster-tool-boot-commands.sh (2)
213-225:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAvoid printing the raw token endpoint response.
Lines 219 and 224 log the body returned by an authenticated token-creation call. Even truncated, that can expose the AAP route or token material in CI logs. Please redact sensitive fields before logging, or persist the body outside the log stream.
🔒 Proposed fix
AAP_RESPONSE=$(curl -sk -X POST \ -u "admin:${AAP_ADMIN_PASSWORD}" \ -H "Content-Type: application/json" \ -d '{"description": "osac-operator", "scope": "write"}' \ "${AAP_URL}/api/gateway/v1/tokens/") +REDACTED_AAP_RESPONSE=$(printf '%s' "${AAP_RESPONSE}" | sed -E \ + -e 's/"token"[[:space:]]*:[[:space:]]*"[^"]*"/"token":"<redacted>"/g' \ + -e 's#https?://[^"[:space:]]+#<redacted-url>`#g`') AAP_TOKEN=$(echo "${AAP_RESPONSE}" | jq -r '.token') || { - echo "ERROR: AAP gateway returned non-JSON response: ${AAP_RESPONSE:0:500}" + echo "ERROR: AAP gateway returned non-JSON response: ${REDACTED_AAP_RESPONSE:0:500}" exit 1 } if [[ -z "${AAP_TOKEN}" || "${AAP_TOKEN}" == "null" ]]; then - echo "Failed to create AAP API token. Response: ${AAP_RESPONSE:0:500}" + echo "Failed to create AAP API token. Response: ${REDACTED_AAP_RESPONSE:0:500}" exit 1 fiAs per coding guidelines, "Protect sensitive information in step registry scripts - never echo or print passwords, tokens, API keys, cluster URLs, or kubeconfig contents".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ci-operator/step-registry/osac-project/cluster-tool/boot/osac-project-cluster-tool-boot-commands.sh` around lines 213 - 225, The script currently echoes raw AAP_RESPONSE and slices it into logs when token creation fails; update the token creation/check block (the AAP_RESPONSE and AAP_TOKEN handling) to never print raw response contents: instead parse and mask sensitive fields (e.g., remove or replace .token and any URL fields via jq) before any echo, or write the full response to a secure file/secret store and only log a non-sensitive stub like "REDACTED_RESPONSE" or a masked summary; ensure the failure messages referencing AAP_RESPONSE use the masked summary variable rather than the raw AAP_RESPONSE.
433-436:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUse the condition
statusas the readiness signal (notreason) (ci-operator/step-registry/osac-project/cluster-tool/boot/osac-project-cluster-tool-boot-commands.sh:433-436). Kubernetes conditions usestatusas the satisfaction/truth value (True/False/Unknown);reasonis diagnostic text and can change independently.Proposed fix
-retry_until 300 10 '[[ "$(oc get automationcontroller osac-aap-controller -n '"${INSTALLER_NAMESPACE}"' -o jsonpath='"'"'{.status.conditions[?(@.type=="Successful")].reason}'"'"' 2>/dev/null)" == "Successful" ]]' || { +retry_until 300 10 '[[ "$(oc get automationcontroller osac-aap-controller -n '"${INSTALLER_NAMESPACE}"' -o jsonpath='"'"'{.status.conditions[?(@.type=="Successful")].status}'"'"' 2>/dev/null)" == "True" ]]' || { echo "WARNING: AAP operator did not reach Successful state, waiting for controller-task rollout instead..." oc rollout status deployment/osac-aap-controller-task -n "${INSTALLER_NAMESPACE}" --timeout=300s }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ci-operator/step-registry/osac-project/cluster-tool/boot/osac-project-cluster-tool-boot-commands.sh` around lines 433 - 436, The check in the retry_until call uses the condition's reason instead of its truth value; change the jsonpath used in the oc get command inside retry_until to read .status.conditions[?(@.type=="Successful")].status and compare against "True" (i.e. update the oc get automationcontroller osac-aap-controller ... -o jsonpath to use .status rather than .reason) so readiness is based on condition.status; keep the existing fallback that logs the warning and calls oc rollout status deployment/osac-aap-controller-task -n "${INSTALLER_NAMESPACE}" --timeout=300s unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In
`@ci-operator/step-registry/osac-project/cluster-tool/boot/osac-project-cluster-tool-boot-commands.sh`:
- Around line 213-225: The script currently echoes raw AAP_RESPONSE and slices
it into logs when token creation fails; update the token creation/check block
(the AAP_RESPONSE and AAP_TOKEN handling) to never print raw response contents:
instead parse and mask sensitive fields (e.g., remove or replace .token and any
URL fields via jq) before any echo, or write the full response to a secure
file/secret store and only log a non-sensitive stub like "REDACTED_RESPONSE" or
a masked summary; ensure the failure messages referencing AAP_RESPONSE use the
masked summary variable rather than the raw AAP_RESPONSE.
- Around line 433-436: The check in the retry_until call uses the condition's
reason instead of its truth value; change the jsonpath used in the oc get
command inside retry_until to read
.status.conditions[?(@.type=="Successful")].status and compare against "True"
(i.e. update the oc get automationcontroller osac-aap-controller ... -o jsonpath
to use .status rather than .reason) so readiness is based on condition.status;
keep the existing fallback that logs the warning and calls oc rollout status
deployment/osac-aap-controller-task -n "${INSTALLER_NAMESPACE}" --timeout=300s
unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 11628275-9d16-4378-b957-454d5080714c
📒 Files selected for processing (1)
ci-operator/step-registry/osac-project/cluster-tool/boot/osac-project-cluster-tool-boot-commands.sh
Summary
rc=None) but zero visibility into the crash reasonChanges
Gather step (
osac-project-gather-commands.sh):result_traceback,job_explanation,stdoutper jobTest step (
osac-project-cluster-tool-test-commands.sh):poll_untilvia conftest.py injection to dump resource state at exact moment of timeoutTest plan
This PR enhances OpenShift CI for the OSAC project (e2e-vmaas jobs) by adding targeted diagnostics during boot, test, and gather steps to capture missing data when AAP provision jobs crash or time out. It's a duplicate PR to increase rehearsal coverage and contains only CI/infrastructure script changes (no public API or function signature changes).
Systems affected
Key practical changes and impact
Gather step (ci-operator/.../gather/osac-project-gather-commands.sh)
Test step (ci-operator/.../cluster-tool/test/osac-project-cluster-tool-test-commands.sh)
Boot/refresh step (ci-operator/.../cluster-tool/boot/osac-project-cluster-tool-boot-commands.sh)
Test plan / expected behavior
Risk/impact
Primary files changed