OSAC: add AAP job failure diagnostics to e2e-vmaas CI (2)#79651

Open
omer-vishlitzky wants to merge 4 commits into openshift:main from omer-vishlitzky:osac-e2e-vmaas-debug-diagnostics-2

Conversation

@omer-vishlitzky
Contributor

@omer-vishlitzky omer-vishlitzky commented May 23, 2026

Summary

  • e2e-vmaas has a ~50% failure rate: AAP provision jobs crash (rc=None), and there is zero visibility into the crash reason
  • Adds diagnostics to the gather and test steps to capture the missing data
  • Duplicate PR for additional rehearsal coverage

Changes

Gather step (osac-project-gather-commands.sh):

  • Query AAP REST API for failed/errored jobs — saves result_traceback, job_explanation, stdout per job
  • Collect automation-job pod descriptions (exit codes, OOMKill events)
  • Collect instance group capacity data
  • Collect VirtualNetwork/Subnet/SecurityGroup YAML status
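
Once the per-job detail files have been gathered, the fields named above can be pulled out offline for triage. A minimal sketch, assuming each detail document is a JSON object carrying `id`, `status`, `job_explanation`, and `result_traceback`; the helper name `summarize_job_detail` and the sample payload are hypothetical and not part of this PR:

```python
import json


def summarize_job_detail(detail: dict, max_chars: int = 500) -> dict:
    """Pick the fields most useful for triage out of one job-detail document."""

    def clip(value) -> str:
        # Missing or null fields become empty strings; long text is truncated.
        return (value or "")[:max_chars]

    return {
        "id": detail.get("id"),
        "status": detail.get("status"),
        "job_explanation": clip(detail.get("job_explanation")),
        "result_traceback": clip(detail.get("result_traceback")),
    }


# Hypothetical sample standing in for a gathered job-<id>-detail.json file.
sample = json.loads(
    '{"id": 42, "status": "error", "job_explanation": "", '
    '"result_traceback": "Traceback (most recent call last): ..."}'
)
summary = summarize_job_detail(sample)
print(summary)
```

Using `.get()` throughout keeps the helper tolerant of detail files that are incomplete or that contain an error body instead of a job document.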

Test step (osac-project-cluster-tool-test-commands.sh):

  • Pre-test resource baseline: node resources, pod count, LVM thin pool, disk usage
  • Monkey-patch poll_until via conftest.py injection to dump resource state at exact moment of timeout
  • Post-test trap handler: node/pod resources, stuck resources, automation-job pod exit codes
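
The monkey-patching approach described above can be sketched roughly as follows. This is an illustration only: the real `tests.core.runner` module and the actual signature of `poll_until` are not shown in this PR, so the stand-in `runner` namespace, the `description` parameter, and `dump_diagnostics` are all assumptions.

```python
import functools
import sys
import types

# Stand-in for tests.core.runner; the real module is not shown in this PR.
runner = types.SimpleNamespace()


def _original_poll_until(check, description=""):
    """Hypothetical original: raises TimeoutError when the check never passes."""
    if not check():
        raise TimeoutError(f"timed out waiting for {description}")
    return True


runner.poll_until = _original_poll_until
dumped = []  # records which timeouts triggered a dump, for inspection


def dump_diagnostics(description):
    """Placeholder for the kubectl/oc calls the real helper would make."""
    dumped.append(description)
    print(f"=== timeout diagnostics for: {description} ===", file=sys.stderr)


def patch_poll_until(module):
    """Replace module.poll_until with a wrapper that dumps state on timeout."""
    original = module.poll_until

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        try:
            return original(*args, **kwargs)
        except TimeoutError:
            # Dump state at the moment of the timeout, then re-raise so the
            # test still fails exactly as it did before the patch.
            description = kwargs.get("description") or (args[1] if len(args) > 1 else "")
            dump_diagnostics(description)
            raise

    module.poll_until = wrapper


patch_poll_until(runner)
```

Re-raising the original exception is the important design point: the patch only adds output and should not change which tests pass or fail.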

Test plan

  • Rehearsal job passes (step scripts are syntactically valid)
  • On a failing run: verify AAP job artifacts appear in gathered logs
  • On a failing run: verify timeout diagnostics appear in build log

This PR enhances OpenShift CI for the OSAC project (e2e-vmaas jobs) by adding targeted diagnostics during boot, test, and gather steps to capture missing data when AAP provision jobs crash or time out. It's a duplicate PR to increase rehearsal coverage and contains only CI/infrastructure script changes (no public API or function signature changes).

Systems affected

  • OpenShift CI step-registry for the OSAC project: ci-operator/step-registry/osac-project (gather, cluster-tool/test, cluster-tool/boot).
  • Artifacts and logs produced by e2e-vmaas CI runs (gathered to the job artifacts).

Key practical changes and impact

  • Gather step (ci-operator/.../gather/osac-project-gather-commands.sh)

    • Adds an "aap-jobs" diagnostics phase: queries the AAP Automation Controller REST API (using osac-aap route + token) to save lists of failed/errored jobs and per-job details (job JSON, result_traceback/job_explanation, stdout, job events) and instance group data.
    • Captures automation-job pod JSON and describe output and collects instance/compute status.
    • Adds YAML outputs for VirtualNetwork, Subnet, and SecurityGroup to help cloud/network provisioning debugging.
    • All new collection commands use "|| true" so missing data won't abort artifact collection.
  • Test step (ci-operator/.../cluster-tool/test/osac-project-cluster-tool-test-commands.sh)

    • Records a pre-test resource baseline on the remote test host: node metrics, pod count, LVM thin pool usage, and disk usage.
    • Injects a generated Python helper into tests (appended to tests/conftest.py) that monkey-patches tests.core.runner.poll_until to catch TimeoutError and immediately dump targeted timeout-time Kubernetes diagnostics for the timed-out resource (resource YAML, related pods, warning events, node/pod metrics).
    • Adds post-test trap diagnostics that collect node/pod resource usage, list stuck compute/network resources, and enumerate automation-job pod exit codes/termination reasons, and copies junit artifacts back to the job.
  • Boot/refresh step (ci-operator/.../cluster-tool/boot/osac-project-cluster-tool-boot-commands.sh)

    • Mounts patched prepare-aap.sh and refresh-after-snapshot.sh into the installer run:
      • prepare-aap: preserves raw AAP gateway token-creation HTTP response and reports a truncated response when not valid JSON (avoids opaque jq failures).
      • refresh-after-snapshot: waits for the AutomationController to report a Successful condition (or falls back to waiting for the osac-aap-controller-task deployment rollout up to 300s) before declaring refresh complete, reducing race conditions that can cause transient Redis-sidecar/socket issues and AAP job crashes.
    • Commit fixes ensure heredoc-created patched scripts are made executable (chmod +x) before mounting to avoid "Permission denied" (exit code 126).

Test plan / expected behavior

  • Rehearsal runs validate CI syntax; this PR was duplicated to increase rehearsal coverage.
  • On failing runs, the job artifacts should include AAP job outputs (failed-job JSON, per-job stdout/events, pod describes) and timeout-time diagnostics should be printed to the build log to enable root-cause analysis.
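
When a pod JSON artifact is available, exit codes and termination reasons can be extracted from it without access to the cluster. A minimal sketch, relying on the standard Kubernetes pod status layout (`status.containerStatuses[].state.terminated`); the function name and the sample pod are hypothetical:

```python
def terminated_containers(pod: dict) -> list:
    """Return (container, exit_code, reason) for every recorded termination."""
    found = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        # A container that already restarted keeps its previous termination
        # under lastState; a finished one reports it under state.
        for key in ("state", "lastState"):
            term = (cs.get(key) or {}).get("terminated")
            if term:
                found.append((cs.get("name"), term.get("exitCode"), term.get("reason")))
    return found


# Hypothetical pod document, shaped like the output of `oc get pod -o json`.
sample_pod = {
    "metadata": {"name": "automation-job-123"},
    "status": {
        "containerStatuses": [
            {
                "name": "worker",
                "state": {"terminated": {"exitCode": 137, "reason": "OOMKilled"}},
                "lastState": {},
            },
            {"name": "sidecar", "state": {"running": {}}, "lastState": {}},
        ]
    },
}
print(terminated_containers(sample_pod))
```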

Risk/impact

  • Low risk: changes add logging, artifact collection, and safer error handling only. New commands tolerate missing data to avoid introducing CI failures.

Primary files changed

  • ci-operator/step-registry/osac-project/gather/osac-project-gather-commands.sh
  • ci-operator/step-registry/osac-project/cluster-tool/test/osac-project-cluster-tool-test-commands.sh
  • ci-operator/step-registry/osac-project/cluster-tool/boot/osac-project-cluster-tool-boot-commands.sh

e2e-vmaas has a 50% failure rate with provision jobs crashing
(rc=None) but no visibility into why. Add diagnostics to capture
the crash reason from AAP API and cluster state on failure.

Gather step: query AAP REST API for failed job details
(result_traceback, job_explanation, stdout), collect automation-job
pod descriptions (exit codes, OOMKill events), and instance group
capacity. Also collect VirtualNetwork/Subnet/SecurityGroup status.

Test step: add pre-test resource baseline (node resources, storage),
monkey-patch poll_until to dump resource state at exact moment of
timeout, and collect post-test diagnostics in the trap handler.
@coderabbitai
Contributor

coderabbitai Bot commented May 23, 2026

Walkthrough

Adds pre-test baselines, a Python helper that instruments test timeouts to emit targeted Kubernetes diagnostics, mounts that helper into the test container, expands remote post-test diagnostics (node/pod/resource listings and ansible-job container exit info), collects AAP failed-job details and pod diagnostics, exports virtualization networking YAMLs, and waits for the AAP controller-task rollout during boot.

Changes

Enhanced Test Diagnostics and Artifact Collection

| Layer | File(s) | Summary |
| --- | --- | --- |
| Pre-test baseline | ci-operator/step-registry/osac-project/cluster-tool/test/osac-project-cluster-tool-test-commands.sh | Captures initial node metrics, namespace pod count, lvs output, and /home disk usage before tests. |
| Timeout instrumentation helper | ci-operator/step-registry/osac-project/cluster-tool/test/osac-project-cluster-tool-test-commands.sh | Generates /tmp/patch_helpers.py that monkey-patches tests.core.runner.poll_until to catch TimeoutError, infer resource type from the poll description, and emit resource-specific YAML, related pods, warning events, and metrics before re-raising. |
| Test execution with helper integration and remote post-test diagnostics | ci-operator/step-registry/osac-project/cluster-tool/test/osac-project-cluster-tool-test-commands.sh | Mounts the helper into the vmaas test container, appends it into tests/conftest.py, runs pytest with JUnit output, and expands collect_artifacts to SSH into the remote host and collect node/pod metrics, top pods by memory, lists of computeinstance/virtualnetwork/subnet/securitygroup, and per-container terminated exit codes/reasons for ansible-job pods. |
| AAP failed-job collection and pod diagnostics | ci-operator/step-registry/osac-project/gather/osac-project-gather-commands.sh | Creates aap-jobs/, retrieves AAP route host and decoded API token, and, if present, uses authenticated curl calls to download failed/error job listings, per-job details (including stdout and events) and instance group data. Exports each ansible_job-labeled pod's JSON and describe output. |
| Networking resource exports | ci-operator/step-registry/osac-project/gather/osac-project-gather-commands.sh | Writes YAML artifacts for virtual networks, subnets, and security groups in the target namespace. |
| Boot: prepare-aap patch | ci-operator/step-registry/osac-project/cluster-tool/boot/osac-project-cluster-tool-boot-commands.sh | Creates a patched prepare-aap script that captures the raw AAP gateway token response (AAP_RESPONSE) and improves parsing/error reporting when the response is not valid JSON. |
| Boot: refresh wait and installer mount | ci-operator/step-registry/osac-project/cluster-tool/boot/osac-project-cluster-tool-boot-commands.sh | Adds a post-refresh wait that checks the osac-aap-controller automationcontroller for a Successful reason or falls back to oc rollout status deployment/osac-aap-controller-task with a 300s timeout; mounts patched scripts into the installer pod run. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

lgtm, rehearsals-ack

Suggested reviewers

  • danmanor
  • eranco74
  • trewest
🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (11 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately reflects the main objective of the PR: adding AAP job failure diagnostics to the e2e-vmaas CI pipeline, with specific focus on gathering AAP job data and test-time diagnostics. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | PR modifies only bash scripts in CI operator step registry. No Go test files with Ginkgo declarations (It(), Describe(), Context()) were modified. |
| Test Structure And Quality | ✅ Passed | Custom check for Ginkgo test structure is not applicable to this PR which contains only shell scripts for CI/test orchestration, not Go/Ginkgo test code. |
| Microshift Test Compatibility | ✅ Passed | PR modifies only bash shell scripts (CI/CD diagnostics) with no Ginkgo e2e tests added. Check applies only to new Go test definitions, not to CI infrastructure scripts. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | PR modifies only CI infrastructure shell scripts (not Ginkgo tests) - no new Ginkgo e2e tests found, so SNO compatibility check is not applicable. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | No deployment manifests, operator code, or controllers were added or modified. All changes are bash CI scripts for testing and diagnostics, which contain no scheduling constraints. |
| Ote Binary Stdout Contract | ✅ Passed | The PR modifies only shell scripts and YAML CI config files, not Go binaries. OTE Binary Stdout Contract applies only to Go binaries, making this check not applicable to the PR. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | No Ginkgo e2e tests are added in this PR. Changes are CI automation shell scripts that execute pytest (Python tests), not Go Ginkgo tests. Custom check is not applicable. |

@openshift-ci openshift-ci Bot requested review from danmanor and jhernand May 23, 2026 11:59
@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 23, 2026
@omer-vishlitzky
Contributor Author

/pj-rehearse pull-ci-osac-project-osac-operator-main-e2e-vmaas pull-ci-osac-project-fulfillment-service-main-e2e-vmaas pull-ci-osac-project-osac-test-infra-main-e2e-vmaas pull-ci-osac-project-osac-installer-main-e2e-vmaas pull-ci-osac-project-osac-aap-main-e2e-vmaas

@openshift-merge-bot
Contributor

@omer-vishlitzky: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@ci-operator/step-registry/osac-project/cluster-tool/test/osac-project-cluster-tool-test-commands.sh`:
- Around line 106-111: The test loop prints only r.stdout so error output is
lost; update the cmds loop to print both stdout and stderr from subprocess.run
(e.g., include r.stderr in the print call) and handle timeouts by catching
subprocess.TimeoutExpired and printing e.stdout and e.stderr as well;
specifically modify the for cmd in cmds: block that calls subprocess.run(...) to
emit f"--- {cmd} ---\n{r.stdout}\n{r.stderr}" to sys.stderr (truncated if
desired) and in the except Exception as e: branch, detect
subprocess.TimeoutExpired (or print getattr(e, 'stdout', '') and getattr(e,
'stderr', '')) so timeout failures also dump stderr.
- Around line 103-104: The timeout-diagnostics commands use invalid invocations
("kubectl adm top nodes" and "kubectl adm top pods -n {ns} --sort-by=memory");
change them to the correct kubectl subcommand names ("kubectl top nodes" and
"kubectl top pods -n {ns} --sort-by=memory") so the functions that build the
diagnostics list (the strings containing "kubectl adm top nodes" and "kubectl
adm top pods -n {ns} --sort-by=memory") will run successfully and produce
node/pod metrics.

In
`@ci-operator/step-registry/osac-project/gather/osac-project-gather-commands.sh`:
- Around line 92-126: The new AAP gather block runs without errexit so setup
failures can be skipped; add set -e (preferably set -o errexit or set -euo
pipefail consistent with other step scripts) at the start of that remote heredoc
before the AAP block to enable errexit, and then remove the blanket "|| true"
only from critical setup commands (e.g. mkdir -p "${ARTIFACT_DIR}/aap-jobs",
AAP_ROUTE=$(oc get route ...), AAP_TOKEN=$(oc get secret ...)) so those fail the
script on error; keep "|| true" on the best-effort collectors/outputs (the curl
and oc get pod/json/describe lines that write into ${ARTIFACT_DIR}) to preserve
diagnostics gathering.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: a692b313-7c18-407b-95af-8ef3092d920a

📥 Commits

Reviewing files that changed from the base of the PR and between 1ccb439 and 0300d79.

📒 Files selected for processing (2)
  • ci-operator/step-registry/osac-project/cluster-tool/test/osac-project-cluster-tool-test-commands.sh
  • ci-operator/step-registry/osac-project/gather/osac-project-gather-commands.sh

Comment on lines +103 to +104
"kubectl adm top nodes",
f"kubectl adm top pods -n {ns} --sort-by=memory",
⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

Does kubectl support the subcommand sequence kubectl adm top, or is the correct syntax kubectl top? If oc is used instead, what is the correct oc adm top syntax?

💡 Result:

The correct syntax for the Kubernetes command-line tool is kubectl top [1][2]. The sequence kubectl adm top is not a valid command [1][3]. When using the OpenShift CLI (oc), the correct syntax is oc adm top [4][5]. This command is used to analyze resources managed by the platform and provides usage statistics for various resources, such as nodes and pods [4][6]. Common examples for the OpenShift CLI include:

  • oc adm top node: Displays resource (CPU/memory) usage of nodes [5][6].
  • oc adm top pod: Displays resource (CPU/memory) usage of pods [5][7].

The oc adm top command internally utilizes kubectl top functionality to retrieve these metrics [8].

Citations:


Fix invalid kubectl metrics commands in timeout diagnostics
kubectl adm top ... is not a valid kubectl invocation, so the timeout diagnostics will fail and produce no node/pod metrics.

Suggested fix
-        "kubectl adm top nodes",
-        f"kubectl adm top pods -n {ns} --sort-by=memory",
+        "kubectl top nodes",
+        f"kubectl top pods -n {ns} --sort-by=memory",
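
Because the two CLIs place `top` differently, one way to avoid this class of mistake is to build the command strings from the CLI name. A small sketch under stated assumptions: `top_commands` and `pick_cli` are hypothetical helpers, not part of the PR, and the choice to prefer `oc` when both binaries exist is arbitrary.

```python
import shutil


def top_commands(cli: str, ns: str) -> list:
    """Build the node/pod metrics commands for the given CLI binary name."""
    if cli == "oc":
        prefix = "oc adm top"  # OpenShift nests `top` under `adm`
    elif cli == "kubectl":
        prefix = "kubectl top"  # kubectl exposes `top` directly
    else:
        raise ValueError(f"unsupported CLI: {cli!r}")
    return [f"{prefix} nodes", f"{prefix} pods -n {ns} --sort-by=memory"]


def pick_cli(which=shutil.which) -> str:
    """Prefer oc when it is on PATH, otherwise fall back to kubectl."""
    return "oc" if which("oc") else "kubectl"


print(top_commands("kubectl", "demo-ns"))
```

The `which` parameter is injectable so the selection logic can be exercised without either binary installed.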

Comment on lines +106 to +111
    for cmd in cmds:
        try:
            r = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=15)
            print(f"--- {cmd} ---\n{r.stdout[:3000]}", file=sys.stderr, flush=True)
        except Exception as e:
            print(f"--- {cmd} FAILED: {e} ---", file=sys.stderr, flush=True)
⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Include stderr in the timeout dump.

subprocess.run(..., capture_output=True) captures both streams, but this code only prints r.stdout. Any failed kubectl call becomes an empty section instead of showing the actual error.

Suggested fix
     for cmd in cmds:
         try:
             r = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=15)
-            print(f"--- {cmd} ---\n{r.stdout[:3000]}", file=sys.stderr, flush=True)
+            output = "\n".join(part for part in (r.stdout.strip(), r.stderr.strip()) if part)
+            print(
+                f"--- {cmd} (rc={r.returncode}) ---\n{output[:3000]}",
+                file=sys.stderr,
+                flush=True,
+            )
         except Exception as e:
             print(f"--- {cmd} FAILED: {e} ---", file=sys.stderr, flush=True)
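
The suggested fix covers commands that exit with an error, but a command that hits the timeout raises `subprocess.TimeoutExpired` instead, and any partial output then lives on the exception object. The sketch below extends the same idea to that case. It is an illustration, not the PR's code: `run_and_report` and `_text` are hypothetical names, and it takes an argv list rather than a `shell=True` string so it can run anywhere Python is installed.

```python
import subprocess
import sys


def _text(value) -> str:
    # TimeoutExpired may carry None or bytes even when text=True was requested.
    if value is None:
        return ""
    if isinstance(value, bytes):
        return value.decode(errors="replace")
    return value


def run_and_report(cmd, timeout=15, limit=3000) -> str:
    """Run cmd (an argv list) and return one report section with both streams."""
    try:
        r = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
        body = "\n".join(p for p in (r.stdout.strip(), r.stderr.strip()) if p)
        return f"--- {cmd} (rc={r.returncode}) ---\n{body[:limit]}"
    except subprocess.TimeoutExpired as e:
        parts = (_text(e.stdout).strip(), _text(e.stderr).strip())
        body = "\n".join(p for p in parts if p)
        return f"--- {cmd} TIMED OUT after {timeout}s ---\n{body[:limit]}"
    except OSError as e:
        # Covers a missing or non-executable binary.
        return f"--- {cmd} FAILED: {e} ---"


# A command that writes only to stderr and exits non-zero.
failing = [sys.executable, "-c", "import sys; sys.stderr.write('boom'); sys.exit(3)"]
report = run_and_report(failing)
print(report, file=sys.stderr)
```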

Comment on lines +92 to +126
echo "=== Collecting AAP job failure diagnostics ==="
mkdir -p "${ARTIFACT_DIR}/aap-jobs"

AAP_ROUTE=$(oc get route osac-aap -n "${E2E_NAMESPACE}" -o jsonpath='{.spec.host}' 2>/dev/null || true)
AAP_TOKEN=$(oc get secret osac-aap-api-token -n "${E2E_NAMESPACE}" -o jsonpath='{.data.token}' 2>/dev/null | base64 -d || true)

if [[ -n "${AAP_ROUTE}" && -n "${AAP_TOKEN}" ]]; then
  AUTH="Authorization: Bearer ${AAP_TOKEN}"
  BASE="https://${AAP_ROUTE}/api/controller/v2"

  curl -sk -H "${AUTH}" "${BASE}/jobs/?status__in=error,failed&order_by=-finished&page_size=20" \
    > "${ARTIFACT_DIR}/aap-jobs/failed-jobs.json" 2>&1 || true

  for JOB_ID in $(jq -r '.results[].id' "${ARTIFACT_DIR}/aap-jobs/failed-jobs.json" 2>/dev/null | head -10); do
    curl -sk -H "${AUTH}" "${BASE}/jobs/${JOB_ID}/" \
      > "${ARTIFACT_DIR}/aap-jobs/job-${JOB_ID}-detail.json" 2>&1 || true
    curl -sk -H "${AUTH}" "${BASE}/jobs/${JOB_ID}/stdout/?format=txt" \
      > "${ARTIFACT_DIR}/aap-jobs/job-${JOB_ID}-stdout.txt" 2>&1 || true
    curl -sk -H "${AUTH}" "${BASE}/jobs/${JOB_ID}/job_events/?order_by=-counter&page_size=30" \
      > "${ARTIFACT_DIR}/aap-jobs/job-${JOB_ID}-events.json" 2>&1 || true
  done

  curl -sk -H "${AUTH}" "${BASE}/instance_groups/" \
    > "${ARTIFACT_DIR}/aap-jobs/instance-groups.json" 2>&1 || true
fi

for POD in $(oc get pods -n "${E2E_NAMESPACE}" -l ansible_job --no-headers -o custom-columns=NAME:.metadata.name 2>/dev/null); do
  oc get pod "${POD}" -n "${E2E_NAMESPACE}" -o json > "${ARTIFACT_DIR}/aap-jobs/pod-${POD}.json" 2>&1 || true
  oc describe pod "${POD}" -n "${E2E_NAMESPACE}" > "${ARTIFACT_DIR}/aap-jobs/pod-${POD}-describe.txt" 2>&1 || true
done

echo "=== Collecting networking resource status ==="
oc get virtualnetwork -n "${E2E_NAMESPACE}" -o yaml > "${ARTIFACT_DIR}/virtualnetworks.yaml" 2>&1 || true
oc get subnet -n "${E2E_NAMESPACE}" -o yaml > "${ARTIFACT_DIR}/subnets.yaml" 2>&1 || true
oc get securitygroup -n "${E2E_NAMESPACE}" -o yaml > "${ARTIFACT_DIR}/securitygroups.yaml" 2>&1 || true
🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Run the new gather diagnostics under errexit too.

This new block still executes inside a remote heredoc that defaults to nounset + pipefail only, so setup failures here can be skipped silently unless each command is manually guarded. Please enable set -o errexit for that remote shell and keep || true only on the best-effort collectors.

Suggested fix
 timeout -s 9 10m ssh -F "${SHARED_DIR}/ssh_config" ci_machine bash -s "${E2E_NAMESPACE}" "${REMOTE_ARTIFACT_DIR}" <<'REMOTE_EOF'
+set -o errexit
 set -o nounset
 set -o pipefail

As per coding guidelines "Step registry script files must use set -euo pipefail (without -x) as default and only enable -x when actively debugging".


@openshift-ci
Contributor

openshift-ci Bot commented May 23, 2026

@omer-vishlitzky: all tests passed!


@omer-vishlitzky
Contributor Author

/pj-rehearse pull-ci-osac-project-osac-operator-main-e2e-vmaas pull-ci-osac-project-fulfillment-service-main-e2e-vmaas pull-ci-osac-project-osac-test-infra-main-e2e-vmaas pull-ci-osac-project-osac-installer-main-e2e-vmaas pull-ci-osac-project-osac-aap-main-e2e-vmaas

The AAP operator triggers controller-task deployment rollouts
during the refresh-after-snapshot sequence. If tests start before
the rollout completes, the old pod is terminated mid-test, its
Redis sidecar socket vanishes, and running AAP jobs crash with
redis.exceptions.ConnectionError on /var/run/redis/redis.sock.

Wait for the rollout to finish after refresh, before declaring
boot complete. This matches how the refresh script already waits
for fulfillment deployment rollouts.
@omer-vishlitzky
Contributor Author

/pj-rehearse pull-ci-osac-project-osac-operator-main-e2e-vmaas pull-ci-osac-project-fulfillment-service-main-e2e-vmaas pull-ci-osac-project-osac-test-infra-main-e2e-vmaas pull-ci-osac-project-osac-installer-main-e2e-vmaas pull-ci-osac-project-osac-aap-main-e2e-vmaas

@openshift-merge-bot
Contributor

@omer-vishlitzky: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@openshift-ci
Contributor

openshift-ci Bot commented May 23, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: omer-vishlitzky

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@omer-vishlitzky
Contributor Author

/pj-rehearse pull-ci-osac-project-osac-operator-main-e2e-vmaas pull-ci-osac-project-fulfillment-service-main-e2e-vmaas pull-ci-osac-project-osac-installer-main-e2e-vmaas pull-ci-osac-project-osac-aap-main-e2e-vmaas

@openshift-merge-bot
Contributor

@omer-vishlitzky: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

Embed verbatim copies of refresh-after-snapshot.sh and
prepare-aap.sh from osac-installer main, with two fixes:

prepare-aap.sh: capture curl response before jq so we see
what AAP returns when it responds with non-JSON (was causing
silent "parse error: Invalid numeric literal" crash).

refresh-after-snapshot.sh: after step [8/8], wait for the
AutomationController to reach Successful status before
declaring refresh complete. The AAP operator triggers multiple
async controller-task rollouts that kill the Redis sidecar
socket on the old pod, crashing in-flight provision jobs.
Waiting for Successful ensures all rollouts are done.
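
The wait described in this commit message amounts to polling a status until a condition with reason Successful appears or a deadline passes. A minimal sketch of that loop, assuming the resource exposes the conventional `status.conditions[].reason` layout (the actual AutomationController status shape is not shown here); `has_successful_condition`, `wait_for`, and the sample states are hypothetical:

```python
import time


def has_successful_condition(resource: dict) -> bool:
    """True when any status condition reports the reason 'Successful'."""
    conditions = resource.get("status", {}).get("conditions", [])
    return any(c.get("reason") == "Successful" for c in conditions)


def wait_for(check, timeout_s, interval_s=5.0, clock=time.monotonic, sleep=time.sleep) -> bool:
    """Poll check() until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Hypothetical sequence of observed states: still reconciling, then done.
states = iter(
    [
        {"status": {"conditions": [{"reason": "Running"}]}},
        {"status": {"conditions": [{"reason": "Successful"}]}},
    ]
)
ready = wait_for(
    lambda: has_successful_condition(next(states)),
    timeout_s=300,
    sleep=lambda seconds: None,  # skip real sleeping in this demonstration
)
print("ready:", ready)
```

A caller would treat a `False` return as the signal to try a fallback or to fail the boot step, rather than letting tests start against a controller that is still rolling out.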
@omer-vishlitzky
Contributor Author

/pj-rehearse pull-ci-osac-project-osac-operator-main-e2e-vmaas pull-ci-osac-project-fulfillment-service-main-e2e-vmaas pull-ci-osac-project-osac-test-infra-main-e2e-vmaas pull-ci-osac-project-osac-installer-main-e2e-vmaas pull-ci-osac-project-osac-aap-main-e2e-vmaas

@openshift-merge-bot
Contributor

@omer-vishlitzky: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

The heredoc-created scripts need chmod +x before being
mounted into the container, otherwise the refresh script
fails with "Permission denied" (exit code 126).
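
The failure mode is easy to reproduce: a file written from a heredoc gets no execute bit unless one is added explicitly. A short sketch of the check and the fix on a POSIX system, using a temporary file as a stand-in for the patched script; `make_executable` is a hypothetical helper that approximates `chmod +x` (it ignores the umask):

```python
import os
import stat
import tempfile


def make_executable(path: str) -> None:
    """Add the execute bits for user, group, and other."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)


# Stand-in for a script generated from a heredoc.
with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
    f.write("#!/bin/sh\necho patched\n")
    script = f.name

before = os.stat(script).st_mode & 0o111  # execute bits before the fix
make_executable(script)
after = os.stat(script).st_mode & 0o111  # execute bits after the fix
os.unlink(script)

print(f"execute bits before={before:o} after={after:o}")
```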
@omer-vishlitzky
Contributor Author

/pj-rehearse pull-ci-osac-project-osac-operator-main-e2e-vmaas pull-ci-osac-project-fulfillment-service-main-e2e-vmaas pull-ci-osac-project-osac-test-infra-main-e2e-vmaas pull-ci-osac-project-osac-installer-main-e2e-vmaas pull-ci-osac-project-osac-aap-main-e2e-vmaas

@openshift-merge-bot
Contributor

@omer-vishlitzky: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@openshift-merge-bot
Contributor

[REHEARSALNOTIFIER]
@omer-vishlitzky: the pj-rehearse plugin accommodates running rehearsal tests for the changes in this PR. Expand 'Interacting with pj-rehearse' for usage details. The following rehearsable tests have been affected by this change:

| Test name | Repo | Type | Reason |
| --- | --- | --- | --- |
| pull-ci-osac-project-fulfillment-service-main-e2e-vmaas | osac-project/fulfillment-service | presubmit | Registry content changed |
| pull-ci-osac-project-osac-aap-main-e2e-vmaas | osac-project/osac-aap | presubmit | Registry content changed |
| pull-ci-osac-project-osac-operator-main-e2e-vmaas | osac-project/osac-operator | presubmit | Registry content changed |
| pull-ci-osac-project-osac-installer-main-e2e-vmaas | osac-project/osac-installer | presubmit | Registry content changed |
| pull-ci-osac-project-osac-test-infra-main-e2e-vmaas | osac-project/osac-test-infra | presubmit | Registry content changed |
| periodic-ci-osac-project-osac-test-infra-main-e2e-metal-vmaas-compute-instance-cli-fields | N/A | periodic | Registry content changed |
| periodic-ci-osac-project-osac-test-infra-main-e2e-metal-vmaas-compute-instance-delete-during-provision | N/A | periodic | Registry content changed |
| periodic-ci-osac-project-osac-test-infra-main-e2e-metal-vmaas-compute-instance-restart | N/A | periodic | Registry content changed |
| periodic-ci-osac-project-osac-test-infra-main-e2e-metal-vmaas-compute-instance-restart-negative | N/A | periodic | Registry content changed |
| periodic-ci-osac-project-osac-test-infra-main-e2e-metal-vmaas-subnet-lifecycle | N/A | periodic | Registry content changed |
| periodic-ci-osac-project-osac-test-infra-main-e2e-metal-vmaas-virtual-network-lifecycle | N/A | periodic | Registry content changed |
| periodic-ci-osac-project-osac-test-infra-main-e2e-metal-vmaas-compute-instance-creation | N/A | periodic | Registry content changed |
| periodic-ci-osac-project-osac-test-infra-main-e2e-metal-vmaas-compute-instance-api-fields | N/A | periodic | Registry content changed |
Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

Contributor

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
ci-operator/step-registry/osac-project/cluster-tool/boot/osac-project-cluster-tool-boot-commands.sh (2)

213-225: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid printing the raw token endpoint response.

Lines 219 and 224 log the body returned by an authenticated token-creation call. Even truncated, that can expose the AAP route or token material in CI logs. Please redact sensitive fields before logging, or persist the body outside the log stream.

🔒 Proposed fix
 AAP_RESPONSE=$(curl -sk -X POST \
     -u "admin:${AAP_ADMIN_PASSWORD}" \
     -H "Content-Type: application/json" \
     -d '{"description": "osac-operator", "scope": "write"}' \
     "${AAP_URL}/api/gateway/v1/tokens/")
+REDACTED_AAP_RESPONSE=$(printf '%s' "${AAP_RESPONSE}" | sed -E \
+    -e 's/"token"[[:space:]]*:[[:space:]]*"[^"]*"/"token":"<redacted>"/g' \
+    -e 's#https?://[^"[:space:]]+#<redacted-url>#g')
 AAP_TOKEN=$(echo "${AAP_RESPONSE}" | jq -r '.token') || {
-    echo "ERROR: AAP gateway returned non-JSON response: ${AAP_RESPONSE:0:500}"
+    echo "ERROR: AAP gateway returned non-JSON response: ${REDACTED_AAP_RESPONSE:0:500}"
     exit 1
 }

 if [[ -z "${AAP_TOKEN}" || "${AAP_TOKEN}" == "null" ]]; then
-    echo "Failed to create AAP API token. Response: ${AAP_RESPONSE:0:500}"
+    echo "Failed to create AAP API token. Response: ${REDACTED_AAP_RESPONSE:0:500}"
     exit 1
 fi

As per coding guidelines, "Protect sensitive information in step registry scripts - never echo or print passwords, tokens, API keys, cluster URLs, or kubeconfig contents".
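
To see what the proposed redaction does to a response body, here is a minimal sketch that wraps the same two `sed` expressions in a helper and runs it on a made-up payload. The function name `redact_response`, the sample JSON, and the `aap.example.test` host are assumptions for illustration only; they are not part of the step script.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the step script): masks token values and URLs
# in a response body so it can be logged without leaking secrets or routes.
redact_response() {
    printf '%s' "$1" | sed -E \
        -e 's/"token"[[:space:]]*:[[:space:]]*"[^"]*"/"token":"<redacted>"/g' \
        -e 's#https?://[^"[:space:]]+#<redacted-url>#g'
}

# Made-up payload for illustration; a real gateway response may differ.
sample='{"token": "abc123", "url": "https://aap.example.test/api/gateway/v1/tokens/7/"}'
redact_response "${sample}"
echo
# prints: {"token":"<redacted>", "url": "<redacted-url>"}
```

This approach only masks fields that match the two patterns. If the response can carry other sensitive keys, parsing it with `jq` and deleting those keys explicitly is likely the safer choice.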

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@ci-operator/step-registry/osac-project/cluster-tool/boot/osac-project-cluster-tool-boot-commands.sh`
around lines 213 - 225, The script currently echoes raw AAP_RESPONSE and slices
it into logs when token creation fails; update the token creation/check block
(the AAP_RESPONSE and AAP_TOKEN handling) to never print raw response contents:
instead parse and mask sensitive fields (e.g., remove or replace .token and any
URL fields via jq) before any echo, or write the full response to a secure
file/secret store and only log a non-sensitive stub like "REDACTED_RESPONSE" or
a masked summary; ensure the failure messages referencing AAP_RESPONSE use the
masked summary variable rather than the raw AAP_RESPONSE.

433-436: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use the condition status as the readiness signal (not reason) (ci-operator/step-registry/osac-project/cluster-tool/boot/osac-project-cluster-tool-boot-commands.sh:433-436). Kubernetes conditions use status as the satisfaction/truth value (True/False/Unknown); reason is diagnostic text and can change independently.

Proposed fix
-retry_until 300 10 '[[ "$(oc get automationcontroller osac-aap-controller -n '"${INSTALLER_NAMESPACE}"' -o jsonpath='"'"'{.status.conditions[?(@.type=="Successful")].reason}'"'"' 2>/dev/null)" == "Successful" ]]' || {
+retry_until 300 10 '[[ "$(oc get automationcontroller osac-aap-controller -n '"${INSTALLER_NAMESPACE}"' -o jsonpath='"'"'{.status.conditions[?(@.type=="Successful")].status}'"'"' 2>/dev/null)" == "True" ]]' || {
     echo "WARNING: AAP operator did not reach Successful state, waiting for controller-task rollout instead..."
     oc rollout status deployment/osac-aap-controller-task -n "${INSTALLER_NAMESPACE}" --timeout=300s
 }
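
The same check can be factored into a small helper, which keeps the quoting out of the `retry_until` string. This is a sketch only: the function name `condition_is_true` and its argument order are assumptions, and it presumes `oc` is on the `PATH` and already logged in to the cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the step script): succeeds only when the named
# condition on a resource reports status "True". The reason field is ignored.
# Usage: condition_is_true <kind> <name> <namespace> <condition-type>
condition_is_true() {
    local kind="$1" name="$2" ns="$3" type="$4"
    local status
    # Read only the truth value of the condition, never its reason.
    status=$(oc get "${kind}" "${name}" -n "${ns}" \
        -o jsonpath="{.status.conditions[?(@.type==\"${type}\")].status}" 2>/dev/null) || return 1
    [ "${status}" = "True" ]
}
```

A call such as `condition_is_true automationcontroller osac-aap-controller "${INSTALLER_NAMESPACE}" Successful` could then be the command that `retry_until` evaluates, assuming `retry_until` evaluates its argument in a shell where the function is defined.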
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@ci-operator/step-registry/osac-project/cluster-tool/boot/osac-project-cluster-tool-boot-commands.sh`
around lines 433 - 436, The check in the retry_until call uses the condition's
reason instead of its truth value; change the jsonpath used in the oc get
command inside retry_until to read
.status.conditions[?(@.type=="Successful")].status and compare against "True"
(i.e. update the oc get automationcontroller osac-aap-controller ... -o jsonpath
to use .status rather than .reason) so readiness is based on condition.status;
keep the existing fallback that logs the warning and calls oc rollout status
deployment/osac-aap-controller-task -n "${INSTALLER_NAMESPACE}" --timeout=300s
unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 11628275-9d16-4378-b957-454d5080714c

📥 Commits

Reviewing files that changed from the base of the PR and between 3099928 and d331cd8.

📒 Files selected for processing (1)
  • ci-operator/step-registry/osac-project/cluster-tool/boot/osac-project-cluster-tool-boot-commands.sh

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files.
