fix(smoketests): retry blueprint creation on transient infra-side build failures #802

Draft
jrvb-rl wants to merge 2 commits into main from jrvb/smoketest-blueprint-retry-on-infra-flake

Conversation

jrvb-rl commented Jun 12, 2026

Contributor

Summary

Smoketest tests/smoketests/object-oriented/blueprint.test.ts > blueprint lifecycle periodically reports 6 failures in a single run despite only one underlying problem: a single transient build failure in the shared beforeAll blueprint cascades through all 6 tests in the describe block.

Investigation of the most recent failure (smoketest dev run #9756, attempt 2) and a local repro against dev confirmed the failures are infra-side flakes in the dev cluster builder, not test logic or SDK bugs.

Observed failure modes (from blueprint-operator Honeycomb + builder pod Loki logs)

Two distinct transient infra errors were observed. Both surface as Blueprint <id> is in non-complete state failed on the SDK side:

  1. BuildKit → in-cluster docker registry mirror i/o timeout (lifecycle baseline, blueprint bpt_33X84htzPGU4fQ29egPnV, pod build-019ebc1f-87df-7f43-8100-1a134ec5f1f1):

    error: failed to solve: DeadlineExceeded: ubuntu:22.04: failed to resolve source metadata
    for docker.io/library/ubuntu:22.04: failed to do request: Head "https://172.20.0.44:8080/v2/library/ubuntu/manifests/22.04?ns=docker.io":
    dial tcp 172.20.0.44:8080: i/o timeout
    
  2. stage-context → S3 transit failure (object-storage build-context test, blueprint bpt_33X875hDJsmL7EbizwCOL, pod build-019ebc20-f710-76e0-8100-f13268a2dbd5):

    ERROR 1/2 downloading object failed id="obj_33X82m5cEBsxuiB4oHF6V"
    err=error sending request for url (https://rl-code-repos-dev-us-west-2.s3.us-west-2.amazonaws.com/...)
    error: invalid local: resolve : lstat /workspace/named-contexts: no such file or directory
    

A local repro against dev hit the same shape on a fresh blueprint (bpt_33X92zJeloDjpfj6ICiJE) within minutes. Honeycomb shows ~12 build_failed spans in a 30-minute window across several distinct blueprints — sustained, not one-off.

Fix

Adds createBlueprintWithRetry in tests/smoketests/utils.ts:

  • Catches the exact Blueprint <id> is in non-complete state failed RunloopError shape.
  • Best-effort delete() of the failed blueprint to avoid leaks.
  • Retries up to 3 attempts with a 5s backoff (configurable per-call).
  • Other errors (auth, validation, etc.) re-throw immediately — only the post-poll terminal-failed shape is retried.

Wraps the six sdk.blueprint.create call sites in tests/smoketests/object-oriented/blueprint.test.ts. Other smoketest files can adopt the same helper in follow-ups.
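
The retry behaviour listed above can be sketched as follows. This is a minimal illustration, not the code added by this PR: the option names, the `deleteBlueprint` hook, and the idea of recovering the failed blueprint's id from the error message are assumptions made so the example is self-contained. Only the error text, the default of 3 attempts, and the 5s backoff come from the description.

```typescript
// Hypothetical sketch; names and shapes are assumptions, not the PR's code.
interface RetryOptions {
  maxAttempts?: number; // default 3, per the description above
  backoffMs?: number; // default 5000, per the description above
  // Assumed hook for the best-effort cleanup of a failed blueprint.
  deleteBlueprint?: (id: string) => Promise<void>;
}

// Matches the terminal-failed message quoted in the summary and captures the id.
const TERMINAL_FAILED = /Blueprint (\S+) is in non-complete state failed/;

function failedBlueprintId(err: unknown): string | undefined {
  const message = err instanceof Error ? err.message : String(err);
  return TERMINAL_FAILED.exec(message)?.[1];
}

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function createBlueprintWithRetry<T>(
  create: () => Promise<T>,
  options: RetryOptions = {},
): Promise<T> {
  const { maxAttempts = 3, backoffMs = 5000, deleteBlueprint } = options;
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await create();
    } catch (err) {
      const failedId = failedBlueprintId(err);
      // Auth, validation, and any other error shape re-throws immediately.
      if (failedId === undefined) throw err;
      lastError = err;
      // Best-effort cleanup: a failed delete must not hide the build error.
      if (deleteBlueprint) {
        await deleteBlueprint(failedId).catch(() => undefined);
      }
      // No backoff after the final attempt.
      if (attempt < maxAttempts) await sleep(backoffMs);
    }
  }
  // Every attempt terminal-failed, so the last build error is surfaced.
  throw lastError;
}
```

A call site would pass a closure around its existing sdk.blueprint.create call as `create`. Taking the creation step as a closure keeps the helper independent of the blueprint parameters, which differ between the six call sites.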

Why this isn't masking a real bug

A genuinely broken Dockerfile resolves to the same terminal-failed state on every attempt, so all retries fail and the test still fails. Only the transient infra shape (which by definition recovers between attempts) becomes invisible.

Out of scope

  • The 2 remaining failures observed in legacy tests/smoketests/blueprints.test.ts are a different problem: the Jest test timeout (SHORT_TIMEOUT = 120s) is shorter than the inner longPoll.timeoutMs (30 min), and the suite is intentionally test-order-dependent.
  • Backend-side resiliency in the blueprint builder (S3 download attempts beyond 2; registry-mirror reachability) is outside this repo's scope.

Test plan

  • TypeScript compiles cleanly (tsc --noEmit).
  • Local re-run of tests/smoketests/object-oriented/blueprint.test.ts -t lifecycle against dev: 6/6 pass after change (the same beforeAll flake reproduced in the pre-change run in the same dev session).
  • CI smoketest dev re-run on this branch — sanity check under load.

🤖 Generated with Claude Code

…ld failures

Dev-cluster blueprint builds intermittently fail with infra-side errors —
BuildKit i/o timeouts resolving base-image manifests from the in-cluster
registry mirror, and stage-context S3 download transit errors. The SDK
surfaces those as a RunloopError ("Blueprint <id> is in non-complete state
failed"), which previously cascaded through the shared `beforeAll` fixture
in `object-oriented/blueprint.test.ts` and failed all six lifecycle tests
on a single build flake.

Adds `createBlueprintWithRetry` in `tests/smoketests/utils.ts` that
catches that exact error shape, best-effort deletes the failed blueprint,
and retries (default 3 attempts, 5s backoff). Deterministic build errors
fail every attempt and still surface as a test failure — only flakes
recover.

Wraps the six `sdk.blueprint.create` call sites in the object-oriented
blueprint smoketest.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

github-actions Bot commented Jun 12, 2026

❌ Object Smoke Tests Failed

Test Results

❌ Some smoke tests failed

Failed Tests:

  • smoketest: object-oriented blueprint › blueprint build context with object storage and .dockerignore › creates blueprint with build_context_dir parameter (string path)

Please fix the failing tests before checking coverage.

📋 View full test logs

…iagnostic

Per review: silently recovering from blueprint build flakes hides infra
problems that also impact customers. Change the helper to PROBE for the
underlying class of failure (transient vs persistent) but always surface
a test failure, with a message that names the classification and points
at where to investigate.

The probe distinguishes:
  - TRANSIENT infra flake: first attempt failed, a subsequent attempt
    recovered. Likely the recurring registry-mirror / S3 transient
    described in the helper docstring.
  - PERSISTENT infra failure: every attempt terminal-failed. More
    likely a broken Dockerfile, a sustained outage, or a real bug.

Either way the test fails red. The recovered blueprint (if any) is
deleted so downstream steps don't accidentally see it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
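
The probe-and-fail behaviour described in this commit message might look roughly like the sketch below. It is an illustration under stated assumptions rather than the committed helper: the function name, the error class, the argument order, and the wording of the failure message are all hypothetical. The sketch also assumes that a first attempt that succeeds returns normally, and that only the recovered blueprint is deleted.

```typescript
// Hypothetical sketch of the probe; all names here are assumptions.
type Classification = "TRANSIENT" | "PERSISTENT";

class BlueprintBuildProbeError extends Error {
  classification: Classification;
  attempts: number;
  firstError: unknown;

  constructor(classification: Classification, attempts: number, firstError: unknown) {
    super(
      `Blueprint build failed: ${classification} infra failure ` +
        `(${attempts} attempt(s)). Check blueprint-operator traces and builder pod logs.`,
    );
    this.name = "BlueprintBuildProbeError";
    this.classification = classification;
    this.attempts = attempts;
    this.firstError = firstError;
  }
}

const isTerminalFailed = (err: unknown): boolean =>
  /Blueprint \S+ is in non-complete state failed/.test(
    err instanceof Error ? err.message : String(err),
  );

const wait = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function probeBlueprintBuild(
  create: () => Promise<{ id: string }>,
  deleteBlueprint: (id: string) => Promise<void>,
  maxAttempts = 3,
  backoffMs = 5000,
): Promise<{ id: string }> {
  let firstError: unknown;
  try {
    return await create(); // healthy path: nothing to classify
  } catch (err) {
    if (!isTerminalFailed(err)) throw err; // unrelated errors pass through
    firstError = err;
  }
  // From here on the test fails regardless; further attempts only classify.
  for (let attempt = 2; attempt <= maxAttempts; attempt++) {
    await wait(backoffMs);
    const recovered = await create().catch((err: unknown) => {
      if (!isTerminalFailed(err)) throw err;
      return undefined;
    });
    if (recovered !== undefined) {
      // Delete the recovered blueprint so downstream steps never see it.
      await deleteBlueprint(recovered.id).catch(() => undefined);
      throw new BlueprintBuildProbeError("TRANSIENT", attempt, firstError);
    }
  }
  throw new BlueprintBuildProbeError("PERSISTENT", maxAttempts, firstError);
}
```

Carrying the classification as a field on the thrown error, as well as in its message, would let a reporter group failures by class without parsing text.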