fix(ci): raise AVM check-circuit per-tx timeout to 120s #23749

Closed
AztecBot wants to merge 1 commit into next from cb/avm-check-circuit-timeout-120s

@AztecBot AztecBot commented May 31, 2026

Collaborator

Problem

The avm-check-circuit job in run 26703197886 failed on next with exit code 124 (timeout).

The job runs bb-avm avm_check_circuit on every dumped e2e AVM input in parallel, each under a fixed 30s per-tx timeout (yarn-project/end-to-end/bootstrap.sh). Every input passed in 4–6s except e2e_multiple_blobs tx 0x241c8baa…, which was killed at the 30s wall (ran 35s, code: 124), failing the whole job.
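A minimal sketch of how a per-input deadline yields exit code 124, assuming GNU coreutils `timeout` is what enforces the limit (the run log above shows a `timeout:` message). Here `sleep` stands in for the real `bb-avm avm_check_circuit` call, the durations are shortened, and `run_with_deadline` is a hypothetical helper, not a function from bootstrap.sh.

```shell
#!/usr/bin/env bash
# Stand-in for one check-circuit run: `sleep` replaces the real bb-avm call.
# run_with_deadline is a hypothetical helper, not taken from bootstrap.sh.
run_with_deadline() {
  local deadline=$1 work_seconds=$2
  timeout "$deadline" sleep "$work_seconds"
}

# A fast input finishes well inside its deadline.
fast_rc=0
run_with_deadline 2s 0.1 || fast_rc=$?

# A slow input outlives its deadline; timeout sends TERM and reports 124.
slow_rc=0
run_with_deadline 1s 3 || slow_rc=$?

echo "fast input exit code: $fast_rc"
echo "slow input exit code: $slow_rc"
```

The slow case reports 124 regardless of what the wrapped command would have returned, which is why a timed-out input is indistinguishable from a failed check at the job level.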

Root cause

That tx produces a much larger circuit (~700,560 rows vs. tiny traces for the others). From the run log of the killed job:

04:57:43 Generating trace... (mem: 824 MiB)
04:58:06 Checking circuit... (mem: 3883 MiB)          <- trace generation alone took ~23s
04:58:06 Running check (with skippable) circuit over 700560 rows.
04:58:12 timeout: sending signal TERM to command 'bash'

Simulation + trace generation alone consumed ~23s on the 2-CPU isolation container, leaving the circuit check no room before the 30s deadline. This is exactly the situation the existing in-code WARNING comment anticipated ("transactions could need more CPU and MEM than we allocate by default … they might start timing out"). The 30s value has been unchanged since the feature was introduced (#18747), so this is a heavy tx finally crossing the threshold, not a regression.
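The ~23s figure can be checked directly from the log timestamps quoted above. The sketch below is illustrative only: `to_seconds` is a hypothetical helper, and it assumes all three timestamps fall on the same day.

```shell
#!/usr/bin/env bash
# Convert an HH:MM:SS timestamp to seconds since midnight.
# 10# forces base 10 so that values like "04" are not parsed as octal.
to_seconds() {
  local h m s
  IFS=: read -r h m s <<< "$1"
  echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# Timestamps copied from the run log of the killed job.
trace_start=$(to_seconds 04:57:43)   # "Generating trace..."
check_start=$(to_seconds 04:58:06)   # "Checking circuit..."
term_sent=$(to_seconds 04:58:12)     # "timeout: sending signal TERM"

trace_gen=$(( check_start - trace_start ))
check_ran=$(( term_sent - check_start ))

echo "trace generation: ${trace_gen}s"
echo "circuit check before TERM: ${check_ran}s"
```

By these timestamps the circuit check had run for only 6s over 700560 rows when TERM arrived.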

Fix

Raise the per-tx timeout from 30s to 120s — ample headroom over the ~35s observed for the heaviest tx while small txs still finish in seconds.

Resources are deliberately left at the default. With up to 64 jobs running in parallel on a 128-CPU host, the containers already use --cpus=2 (≈128 CPUs total); raising --cpus would oversubscribe the runner. A longer timeout is resource-neutral — it only changes the kill deadline, not how much CPU/MEM each run consumes.
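The CPU budget behind this decision is simple arithmetic. In the sketch below, the 64-job and 128-CPU figures come from the description above; `--cpus=4` is a hypothetical alternative used only for comparison, and `cpu_demand` is an illustrative helper.

```shell
#!/usr/bin/env bash
# Total CPUs requested when every parallel job runs in its own container.
cpu_demand() {
  local jobs=$1 cpus_per_job=$2
  echo $(( jobs * cpus_per_job ))
}

host_cpus=128      # runner size stated in the PR description
parallel_jobs=64   # maximum concurrent jobs stated in the PR description

# --cpus=2 is the current setting; --cpus=4 is a hypothetical bump.
for cpus_per_job in 2 4; do
  demand=$(cpu_demand "$parallel_jobs" "$cpus_per_job")
  if (( demand > host_cpus )); then
    verdict="oversubscribed"
  else
    verdict="fits"
  fi
  echo "--cpus=$cpus_per_job: $demand of $host_cpus CPUs ($verdict)"
done
```

At `--cpus=2` the runner is already fully committed, so any increase would have to be paired with lower parallelism.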

The outdated warning comment is updated to describe the actual behavior.

Update (2026-06-03) — still recurring, please land

The same failure hit next again in run 26863710723 (commit 64f5310). Confirmed identical root cause from the CI dashboard log: every input passed in ~4–6s except e2e_multiple_blobs tx 0x0b21460a…, killed at the 30s wall (33s, code: 124), which fail-fast (--halt now,fail=1) propagated as exit 124 to the whole job.
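The propagation described above can be illustrated without GNU parallel. The sketch below is a sequential emulation of fail-fast behavior, not the actual `--halt now,fail=1` invocation: the durations and the 1s deadline are made up, and `sleep` stands in for the real check.

```shell
#!/usr/bin/env bash
# Hypothetical per-input durations in seconds; the third exceeds the deadline.
durations=(0.1 0.1 3 0.1)
deadline=1s

job_rc=0
checked=0
for d in "${durations[@]}"; do
  checked=$(( checked + 1 ))
  rc=0
  timeout "$deadline" sleep "$d" || rc=$?
  if (( rc != 0 )); then
    # Fail fast: stop at the first failure and surface its exit code.
    job_rc=$rc
    break
  fi
done

echo "inputs checked: $checked, job exit code: $job_rc"
```

The fourth input is never run, and the job as a whole reports the 124 produced by the single slow input.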

This PR's change is exactly the right fix and still applies cleanly, but it has been sitting in draft since 2026-05-31 — which is why the nightly keeps failing and auto-dispatching duplicate fix attempts (multiple cb/avm-check-circuit-* branches). Recommend marking it ready for review and merging; the stale sibling branches/PRs can then be closed.

Update (2026-06-05) — recurred again, still the right fix

Hit next a third time in run 26995416365 (commit 91df1ab, merge-queue). Same fingerprint from the CI dashboard log (http://ci.aztec-labs.com/1780635325767557): every input PASSED in 3–5s except

FAILED ... e2e_multiple_blobs/avm-circuit-inputs-tx-0x1311eceb….bin (35s) (code: 124)
parallel: This job failed:
run_test_cmd '…:ISOLATE=1:TIMEOUT=30s:NAME=avm_cc_e2e_multiple_blobs_0x1311eceb …'

A separate session independently reached the identical 30s→120s fix. This PR is the canonical version — please mark it ready for review and merge, then the stale cb/avm-check-circuit-* sibling branches/PRs can be closed. The nightly/merge-queue will keep failing and re-dispatching ClaudeBox sessions until it lands.


Created by claudebox · group: slackbot

The avm-check-circuit job runs bb-avm avm_check_circuit on every dumped
e2e AVM input under a fixed 30s per-tx timeout. The e2e_multiple_blobs tx
produces a ~700k-row trace whose simulation + trace generation alone takes
~23s on the 2-CPU isolation container, and the subsequent circuit check
pushed the run past 30s (observed 35s, killed with code 124), failing the
whole job while every other input passed in 4-6s.

This is the scenario the existing in-code warning anticipated. Raise the
timeout to 120s to give ample headroom for the heaviest txs. Resources are
left unchanged: with up to 64 jobs in parallel on a 128-CPU host, bumping
--cpus would oversubscribe the runner, and a longer timeout is
resource-neutral since small txs still finish in seconds.

@AztecBot AztecBot added the claudebox (Owned by claudebox. it can push to this PR.) and ci-draft (Run CI on draft PRs.) labels May 31, 2026
@AztecBot

Collaborator Author

Automatically closing this stale claudebox draft PR (no updates for 5+ days). Re-open if still needed.
