fix(smoketests): retry blueprint creation on transient infra-side build failures #802
Draft
jrvb-rl wants to merge 2 commits into
…ld failures
Dev-cluster blueprint builds intermittently fail with infra-side errors —
BuildKit i/o timeouts resolving base-image manifests from the in-cluster
registry mirror, and stage-context S3 download transit errors. The SDK
surfaces those as a RunloopError ("Blueprint <id> is in non-complete state
failed"), which previously cascaded through the shared `beforeAll` fixture
in `object-oriented/blueprint.test.ts` and failed all six lifecycle tests
on a single build flake.
Adds `createBlueprintWithRetry` in `tests/smoketests/utils.ts` that
catches that exact error shape, best-effort deletes the failed blueprint,
and retries (default 3 attempts, 5s backoff). Deterministic build errors
fail every attempt and still surface as a test failure — only flakes
recover.
Wraps the six `sdk.blueprint.create` call sites in the object-oriented
blueprint smoketest.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
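The retry behaviour described in this commit message can be sketched roughly as follows. This is not the PR's actual `createBlueprintWithRetry`: the name `createWithRetrySketch`, the injected `create` and `deleteById` callbacks, the `BlueprintLike` type, and the regular expression used to pull the blueprint id out of the error message are all assumptions made for illustration. Only the defaults (3 attempts, 5s backoff) and the quoted error text come from the commit message.

```typescript
// Hypothetical minimal shape; the real SDK types are not shown in this PR.
interface BlueprintLike {
  id: string;
}

interface RetrySketchOptions {
  maxAttempts?: number; // commit message default: 3 attempts
  backoffMs?: number; // commit message default: 5s between attempts
}

// Assumed message shape, based on the text quoted in the commit message:
// "Blueprint <id> is in non-complete state failed".
const FAILED_BUILD_SHAPE = /Blueprint (\S+) is in non-complete state failed/;

const sleepMs = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function createWithRetrySketch<T extends BlueprintLike>(
  create: () => Promise<T>,
  deleteById: (id: string) => Promise<void>,
  options: RetrySketchOptions = {},
): Promise<T> {
  const maxAttempts = options.maxAttempts ?? 3;
  const backoffMs = options.backoffMs ?? 5000;
  let lastError: unknown;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await create();
    } catch (err) {
      const match = err instanceof Error ? FAILED_BUILD_SHAPE.exec(err.message) : null;
      // Anything that is not the known build-failure shape surfaces immediately.
      if (!match) throw err;
      lastError = err;

      // Best-effort cleanup of the failed blueprint; cleanup errors are ignored.
      const failedId = match[1];
      if (failedId) await deleteById(failedId).catch(() => undefined);

      if (attempt < maxAttempts) await sleepMs(backoffMs);
    }
  }

  // Every attempt hit the build-failure shape: rethrow the last error so the test still fails.
  throw lastError;
}
```

Passing `create` and `deleteById` as callbacks keeps the sketch independent of the SDK. A call site would pass something like `() => sdk.blueprint.create(params)`, where `params` is whatever the test already uses.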
❌ Object Smoke Tests Failed
Test Results
❌ Some smoke tests failed
Failed Tests:
Please fix the failing tests before checking coverage.
…iagnostic
Per review: silently recovering from blueprint build flakes hides infra
problems that also impact customers. Change the helper to PROBE for the
underlying class of failure (transient vs persistent) but always surface
a test failure, with a message that names the classification and points
at where to investigate.
The probe distinguishes:
- TRANSIENT infra flake: first attempt failed, a subsequent attempt
recovered. Likely the recurring registry-mirror / S3 transient
described in the helper docstring.
- PERSISTENT infra failure: every attempt terminal-failed. More
likely a broken Dockerfile, a sustained outage, or a real bug.
Either way the test fails red. The recovered blueprint (if any) is
deleted so downstream steps don't accidentally see it.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
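A rough sketch of the probe-and-always-fail behaviour from this second commit message, under stated assumptions: `createOrDiagnoseSketch`, `BuildDiagnosticError`, the injected callbacks, and the wording of the diagnostic messages are hypothetical and do not come from the PR's code. The sketch returns the blueprint only when the first attempt succeeds. Once any attempt has failed with the known build-failure shape, it keeps probing, deletes a recovered blueprint if one appears, and throws an error that names the classification.

```typescript
// Assumed message shape, taken from the text quoted earlier in the PR.
const BUILD_FAILED_SHAPE = /Blueprint \S+ is in non-complete state failed/;

type Classification = "TRANSIENT" | "PERSISTENT";

// Hypothetical error type that carries the probe's verdict.
class BuildDiagnosticError extends Error {
  classification: Classification;

  constructor(classification: Classification, detail: string) {
    super(`${classification} infra ${classification === "TRANSIENT" ? "flake" : "failure"}: ${detail}`);
    this.name = "BuildDiagnosticError";
    this.classification = classification;
  }
}

const pause = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function createOrDiagnoseSketch<T extends { id: string }>(
  create: () => Promise<T>,
  deleteById: (id: string) => Promise<void>,
  maxAttempts = 3,
  backoffMs = 5000,
): Promise<T> {
  let failures = 0;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    // Capture the outcome as a value so the diagnostic throw below is never caught here.
    const outcome = await create().then(
      (value) => ({ ok: true as const, value }),
      (error: unknown) => ({ ok: false as const, error }),
    );

    if (!outcome.ok) {
      const err = outcome.error;
      // Errors outside the known build-failure shape are not probed; they surface as-is.
      if (!(err instanceof Error) || !BUILD_FAILED_SHAPE.test(err.message)) throw err;
      failures++;
      if (attempt < maxAttempts) await pause(backoffMs);
      continue;
    }

    // Clean first attempt: hand the blueprint to the test as usual.
    if (failures === 0) return outcome.value;

    // Recovered after a failure: delete the recovered blueprint, then fail red anyway.
    await deleteById(outcome.value.id).catch(() => undefined);
    throw new BuildDiagnosticError(
      "TRANSIENT",
      `${failures} attempt(s) failed, attempt ${attempt} recovered; check builder and registry-mirror logs`,
    );
  }

  throw new BuildDiagnosticError(
    "PERSISTENT",
    `all ${maxAttempts} attempts failed; check the Dockerfile and builder health`,
  );
}
```

Capturing each attempt's outcome as a value, rather than throwing inside a `try` block, is a design choice in this sketch so that the diagnostic error cannot be swallowed by the same handler that counts build failures.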
Summary
Smoketest `tests/smoketests/object-oriented/blueprint.test.ts > blueprint lifecycle` periodically reports 6 failures in a single run despite only one underlying problem: a single transient build failure in the shared `beforeAll` blueprint cascades through all 6 tests in the describe block.
Investigation of the most recent failure (smoketest dev run #9756, attempt 2) and a local repro against dev confirmed the failures are infra-side flakes in the dev cluster builder, not test logic or SDK bugs.
Observed failure modes (from blueprint-operator Honeycomb + builder pod Loki logs)
Two distinct transient infra errors, both of which surface as `Blueprint <id> is in non-complete state failed` on the SDK side:
- BuildKit → in-cluster docker registry mirror i/o timeout (lifecycle baseline, blueprint `bpt_33X84htzPGU4fQ29egPnV`, pod `build-019ebc1f-87df-7f43-8100-1a134ec5f1f1`)
- stage-context → S3 transit failure (object-storage build-context test, blueprint `bpt_33X875hDJsmL7EbizwCOL`, pod `build-019ebc20-f710-76e0-8100-f13268a2dbd5`)
A local repro against dev hit the same shape on a fresh blueprint (`bpt_33X92zJeloDjpfj6ICiJE`) within minutes. Honeycomb shows ~12 `build_failed` spans in a 30-minute window across several distinct blueprints: sustained, not one-off.
Fix
Adds `createBlueprintWithRetry` in `tests/smoketests/utils.ts`:
- Catches the `Blueprint <id> is in non-complete state failed` `RunloopError` shape.
- Best-effort `delete()` of the failed blueprint to avoid leaks.
Wraps the six `sdk.blueprint.create` call sites in `tests/smoketests/object-oriented/blueprint.test.ts`. Other smoketest files can adopt the same helper in follow-ups.
Why this isn't masking a real bug
A genuinely-broken Dockerfile resolves to the same terminal-failed state on every attempt — all retries fail, the test still fails. Only the transient infra shape (which by definition recovers between attempts) becomes invisible.
Out of scope
Failures in `tests/smoketests/blueprints.test.ts` are a different problem: the Jest test timeout (`SHORT_TIMEOUT = 120s`) is shorter than the inner `longPoll.timeoutMs` (30 min), and the suite is intentionally test-order-dependent.
Test plan
- Typecheck (`tsc --noEmit`).
- `tests/smoketests/object-oriented/blueprint.test.ts -t lifecycle` against dev: 6/6 pass after change (the same `beforeAll` flake reproduced in the pre-change run in the same dev session).
🤖 Generated with Claude Code