fix(prover-client): increment retry count on timeout re-enqueue to prevent infinite loop by spypsy · Pull Request #22355 · AztecProtocol/aztec-packages

spypsy · 2026-04-07T10:35:46Z

Summary

Timed-out proving jobs in ProvingBroker.reEnqueueExpiredJobs() were being re-enqueued without incrementing the retry count or checking maxRetries, creating an infinite retry loop for jobs that consistently time out.
Now the timeout path increments the retry counter and rejects jobs that exceed maxRetries, matching the behavior of the error retry path in #reportProvingJobError.

…event infinite loop

BEGIN_COMMIT_OVERRIDE fix(stdlib): use bigint arithmetic in GasFees.mul() for non-integer scalars (#22383) fix(node-lib): reuse existing fileStore in snapshot sync instead of recreating (#22375) fix: gate req/resp data protocols for unauthenticated peers (#22406) fix(p2p): use per-batch ops array in AztecDatastore.batch() (#22357) chore(pipeline): spartan config (#21285) chore: add claude skill to send txs (#22439) feat(pipeline): minimize deadzone w cross slot attesting (#21435) fix(p2p): avoid 32-bit overflow in attestation pool block position key (#22412) fix(prover-client): increment retry count on timeout re-enqueue to prevent infinite loop (#22355) fix: remove redundant p2pClient.start() call (#22438) chore: add kubectl binary to spartan .gitignore (#22454) END_COMMIT_OVERRIDE

…event infinite loop (#22355) ## Summary - Timed-out proving jobs in `ProvingBroker.reEnqueueExpiredJobs()` were being re-enqueued without incrementing the retry count or checking `maxRetries`, creating an infinite retry loop for jobs that consistently time out. - Now the timeout path increments the retry counter and rejects jobs that exceed `maxRetries`, matching the behavior of the error retry path in `#reportProvingJobError`. Fixes [A-715](https://linear.app/aztec-labs/issue/A-715/audit-35-timed-out-proving-jobs-re-enqueue-without-incrementing-retry)

…ke (AztecProtocol#23047) Flagging `ProvingBroker > Retries > does not retry if job is stale` as a flake in `.test_patterns.yml`. Failure surfaced on an unrelated wallet PR — `dbanks12`'s wallet refactor — at http://ci.aztec-labs.com/64a972aafaa40dd0. ## Failure ``` ● ProvingBroker › Retries › does not retry if job is stale Store is closed > 99 | throw new Error('Store is closed'); | ^ at AztecLMDBStoreV2.transactionAsync (yarn-project/kv-store/dest/lmdb-v2/store.js:99:19) at SingleEpochDatabase.transactionAsync [as batchWrite] (yarn-project/prover-client/src/proving_broker/proving_broker_database/persisted.ts:45:22) at KVBrokerDatabase.batchWrite [as commitWrites] (yarn-project/prover-client/src/proving_broker/proving_broker_database/persisted.ts:120:14) ``` The broker tries to commit the final `reportProvingJobError` write after the per-epoch LMDB store has already been closed (the test advances the epoch from 1 → 3, which causes the epoch-1 store to be torn down). The race is between the epoch advance / cleanup path and the final error write — a timing flake, not a logic bug. ## Owner Test was authored by `@alexghr` in AztecProtocol#9400 (`feat: new proving broker implementation`) and most recently edited by `@alexghr` in AztecProtocol#22508 (`fix(prover-client): don't mark in-progress epoch N jobs as stale when epoch N+1 starts`). `@spypsy` has also recently fixed retries-related races in this file (AztecProtocol#21842, AztecProtocol#22355). Pinging Alex as primary owner; tag Spyros if it's actually a retry-counter race rather than a store-lifecycle race. ## Other branches Spot-checked the most recent failed runs on `merge-train/fairies` and `merge-train/spartan` — none of them hit this same `proving_broker` / `Store is closed` failure in the data window I sampled. The flake has only been observed on the one wallet PR run linked above so far. ## Pattern entry The new entry uses both `regex` (test file path) and `error_regex` (`does not retry if job is stale|Store is closed`) so unrelated failures in `proving_broker.test.ts` still fail CI — only this specific timing race gets quarantined to a Slack ping. --- *Created by [claudebox](https://claudebox.work/v2/sessions/b4b6eb63ff789d29) · group: `aztec`*

spypsy marked this pull request as ready for review April 7, 2026 10:36

fix(prover-client): increment retry count on timeout re-enqueue to pr…

2208dee

…event infinite loop

spypsy force-pushed the spyros/a-715-audit-35-timed-out-proving-jobs-re-enqueue-without branch from 6d96e65 to 2208dee Compare April 7, 2026 11:35

ludamad approved these changes Apr 9, 2026

View reviewed changes

ludamad merged commit 56b32d5 into merge-train/spartan Apr 9, 2026
11 checks passed

ludamad deleted the spyros/a-715-audit-35-timed-out-proving-jobs-re-enqueue-without branch April 9, 2026 23:13

AztecBot mentioned this pull request Apr 9, 2026

feat: merge-train/spartan #22428

Merged

AztecBot mentioned this pull request May 7, 2026

chore(ci): flag ProvingBroker "does not retry if job is stale" as flake #23047

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(prover-client): increment retry count on timeout re-enqueue to prevent infinite loop#22355

fix(prover-client): increment retry count on timeout re-enqueue to prevent infinite loop#22355
ludamad merged 1 commit into
merge-train/spartanfrom
spyros/a-715-audit-35-timed-out-proving-jobs-re-enqueue-without

spypsy commented Apr 7, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

spypsy commented Apr 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

spypsy commented Apr 7, 2026 •

edited

Loading