Skip to content

fix(prover-client): increment retry count on timeout re-enqueue to prevent infinite loop#22355

Merged
ludamad merged 1 commit into
merge-train/spartanfrom
spyros/a-715-audit-35-timed-out-proving-jobs-re-enqueue-without
Apr 9, 2026
Merged

fix(prover-client): increment retry count on timeout re-enqueue to prevent infinite loop#22355
ludamad merged 1 commit into
merge-train/spartanfrom
spyros/a-715-audit-35-timed-out-proving-jobs-re-enqueue-without

Conversation

@spypsy

@spypsy spypsy commented Apr 7, 2026

Copy link
Copy Markdown
Member

Summary

  • Timed-out proving jobs in ProvingBroker.reEnqueueExpiredJobs() were being re-enqueued without incrementing the retry count or checking maxRetries, creating an infinite retry loop for jobs that consistently time out.
  • Now the timeout path increments the retry counter and rejects jobs that exceed maxRetries, matching the behavior of the error retry path in #reportProvingJobError.

Fixes A-715

@spypsy spypsy marked this pull request as ready for review April 7, 2026 10:36
@spypsy spypsy force-pushed the spyros/a-715-audit-35-timed-out-proving-jobs-re-enqueue-without branch from 6d96e65 to 2208dee Compare April 7, 2026 11:35
@ludamad ludamad merged commit 56b32d5 into merge-train/spartan Apr 9, 2026
11 checks passed
@ludamad ludamad deleted the spyros/a-715-audit-35-timed-out-proving-jobs-re-enqueue-without branch April 9, 2026 23:13
github-merge-queue Bot pushed a commit that referenced this pull request Apr 10, 2026
BEGIN_COMMIT_OVERRIDE
fix(stdlib): use bigint arithmetic in GasFees.mul() for non-integer
scalars (#22383)
fix(node-lib): reuse existing fileStore in snapshot sync instead of
recreating (#22375)
fix: gate req/resp data protocols for unauthenticated peers (#22406)
fix(p2p): use per-batch ops array in AztecDatastore.batch() (#22357)
chore(pipeline): spartan config (#21285)
chore: add claude skill to send txs (#22439)
feat(pipeline): minimize deadzone w cross slot attesting (#21435)
fix(p2p): avoid 32-bit overflow in attestation pool block position key
(#22412)
fix(prover-client): increment retry count on timeout re-enqueue to
prevent infinite loop (#22355)
fix: remove redundant p2pClient.start() call (#22438)
chore: add kubectl binary to spartan .gitignore (#22454)
END_COMMIT_OVERRIDE
critesjosh pushed a commit that referenced this pull request Apr 14, 2026
…event infinite loop (#22355)

## Summary

- Timed-out proving jobs in `ProvingBroker.reEnqueueExpiredJobs()` were
being re-enqueued without incrementing the retry count or checking
`maxRetries`, creating an infinite retry loop for jobs that consistently
time out.
- Now the timeout path increments the retry counter and rejects jobs
that exceed `maxRetries`, matching the behavior of the error retry path
in `#reportProvingJobError`.

Fixes
[A-715](https://linear.app/aztec-labs/issue/A-715/audit-35-timed-out-proving-jobs-re-enqueue-without-incrementing-retry)
rangozd pushed a commit to rangozd/aztec-packages that referenced this pull request May 16, 2026
…ke (AztecProtocol#23047)

Flagging `ProvingBroker > Retries > does not retry if job is stale` as a
flake in `.test_patterns.yml`. Failure surfaced on an unrelated wallet
PR — `dbanks12`'s wallet refactor — at
http://ci.aztec-labs.com/64a972aafaa40dd0.

## Failure

```
● ProvingBroker › Retries › does not retry if job is stale

  Store is closed

  > 99 |             throw new Error('Store is closed');
        |                   ^

  at AztecLMDBStoreV2.transactionAsync (yarn-project/kv-store/dest/lmdb-v2/store.js:99:19)
  at SingleEpochDatabase.transactionAsync [as batchWrite]
    (yarn-project/prover-client/src/proving_broker/proving_broker_database/persisted.ts:45:22)
  at KVBrokerDatabase.batchWrite [as commitWrites]
    (yarn-project/prover-client/src/proving_broker/proving_broker_database/persisted.ts:120:14)
```

The broker tries to commit the final `reportProvingJobError` write after
the per-epoch LMDB store has already been closed (the test advances the
epoch from 1 → 3, which causes the epoch-1 store to be torn down). The
race is between the epoch advance / cleanup path and the final error
write — a timing flake, not a logic bug.

## Owner

Test was authored by `@alexghr` in AztecProtocol#9400 (`feat: new proving broker
implementation`) and most recently edited by `@alexghr` in AztecProtocol#22508
(`fix(prover-client): don't mark in-progress epoch N jobs as stale when
epoch N+1 starts`). `@spypsy` has also recently fixed retries-related
races in this file (AztecProtocol#21842, AztecProtocol#22355). Pinging Alex as primary owner; tag
Spyros if it's actually a retry-counter race rather than a
store-lifecycle race.

## Other branches

Spot-checked the most recent failed runs on `merge-train/fairies` and
`merge-train/spartan` — none of them hit this same `proving_broker` /
`Store is closed` failure in the data window I sampled. The flake has
only been observed on the one wallet PR run linked above so far.

## Pattern entry

The new entry uses both `regex` (test file path) and `error_regex`
(`does not retry if job is stale|Store is closed`) so unrelated failures
in `proving_broker.test.ts` still fail CI — only this specific timing
race gets quarantined to a Slack ping.

---
*Created by
[claudebox](https://claudebox.work/v2/sessions/b4b6eb63ff789d29) ·
group: `aztec`*
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants