v0.3.0 — production embedded release#5
Covers the embedded-housekeeping correctness fix (retries/schedules/crash recovery silently broken in library mode), the run_worker helper, enhanced Axum example, operational docs, crash-survival demo, website, and v0.3.0 release. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Housekeeping: drop the Drop-guard handle (footgun); tie loop lifetime to RustQueue via ref-counted Arc, abort on last drop. start_housekeeping returns Result, uses Handle::try_current, sets flag only after spawn.
- run_worker: fix move-after-use (capture job_id), observe shutdown only while idle (never around handler), truncate fail error strings, log + continue on ack/fail errors.
- Pull minimal auto-heartbeat into v0.3.0 (was deferred): without it run_worker corrupts any job exceeding stall_timeout; also fixes the demo.
- Expose embedded schedule CRUD + heartbeat + get_dlq_jobs; derive Clone for RustQueue; reject zero tick_interval.
- Add tests for the new race-prone areas. Website kept in v0.3.0 per owner.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
15 tasks: Arc/Clone refactor, HousekeepingState + builder knobs, start_housekeeping, embedded schedule/heartbeat/dlq API, run_worker + auto-heartbeat, crash-recovery demo, Axum example, ops docs, website, version bump, release. TDD with full code for the engine tasks. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds RustQueue::run_worker and run_worker_with_shutdown methods in src/worker.rs. Sequential pull/ack/fail managed loop with graceful shutdown support. Makes stall_timeout pub(crate) so worker.rs can access it. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Code review found JoinHandle::drop only detaches (not aborts), so a panic in the user handler skipped hb.abort() and left a ghost heartbeat task that kept the job's heartbeat fresh forever — stall detection could never reclaim it. Wrap both the per-job heartbeat and the shutdown-watcher in an AbortOnDrop RAII guard (same pattern as HousekeepingState). Also abort the watcher on early `?` return, and widen the heartbeat margin to stall_timeout/2 clamped to [200ms, 30s]. Adds a panic regression test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
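The guard pattern from the commit above can be sketched with the standard library alone. This is a minimal illustration, not the code in src/worker.rs: the `Abort` trait, `FakeHandle`, and `aborted_after` are hypothetical names introduced here, and the fake handle stands in for a `tokio::task::JoinHandle`, whose `abort()` method the real guard would call. The point it shows is that a `Drop` impl runs during panic unwinding, so the abort still happens when the handler panics.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical abstraction over "a task handle that can be cancelled".
/// In the real code this role is played by tokio's JoinHandle::abort().
trait Abort {
    fn abort(&self);
}

/// RAII guard: aborts the wrapped handle when the guard is dropped,
/// whether the scope ends normally, via an early return, or by a panic.
struct AbortOnDrop<H: Abort>(H);

impl<H: Abort> Drop for AbortOnDrop<H> {
    fn drop(&mut self) {
        self.0.abort();
    }
}

/// Hypothetical stand-in for a heartbeat task handle; it only records
/// whether abort() was called.
struct FakeHandle {
    aborted: Arc<AtomicBool>,
}

impl Abort for FakeHandle {
    fn abort(&self) {
        self.aborted.store(true, Ordering::SeqCst);
    }
}

/// Runs `handler` while a guard is alive and reports whether the handle
/// had been aborted once the scope was left (normally or by unwinding).
fn aborted_after(handler: impl FnOnce()) -> bool {
    let aborted = Arc::new(AtomicBool::new(false));
    let handle = FakeHandle { aborted: Arc::clone(&aborted) };
    // catch_unwind plays the role of the runtime catching a task panic.
    let _ = catch_unwind(AssertUnwindSafe(move || {
        let _guard = AbortOnDrop(handle);
        handler();
    }));
    aborted.load(Ordering::SeqCst)
}
```

Calling `aborted_after(|| panic!("handler failed"))` returns `true`, the same as a handler that returns normally. A bare handle with an explicit `abort()` after the handler would be skipped by the panic.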
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…hutdown Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add /production guide page covering crash-only design, durability modes, run_worker ergonomics, retries/DLQ, housekeeping, graceful shutdown, and crash-recovery walkthrough. Wires /production route in dashboard/mod.rs. Adds Production nav link to all 8 pages (docs/ + dashboard/static/). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bumps all five version locations (Cargo.toml, node + python SDKs, openapi.rs) and adds CHANGELOG with the v0.3.0 entry: embedded housekeeping fix + run_worker/start_housekeeping + embedded schedule API. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e38c1e6c4a
// Heartbeat well inside the stall window: half the timeout, clamped so we
// neither hammer storage (floor) nor drift on long timeouts (ceiling).
let hb_interval =
    (self.stall_timeout / 2).clamp(Duration::from_millis(200), Duration::from_secs(30));
Keep heartbeat interval below configured stall timeout
This fixed floor can make heartbeats slower than stall detection: if a caller sets stall_timeout below 200ms, the worker still heartbeats every 200ms, so detect_stalls may fail and requeue a job that is still running before the first heartbeat is sent. That can trigger duplicate processing/side effects for long-running handlers under low-latency stall settings. Either validate a minimum stall_timeout in the builder or derive hb_interval so it is always strictly less than the configured stall timeout.
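One way to satisfy the second option in this review comment is sketched below, under assumptions: `heartbeat_interval` is a hypothetical free function, not an existing RustQueue API, and the 200ms floor and 30s ceiling are copied from the reviewed line. The floor is applied only when half the timeout already reaches it, so the result is always strictly below `stall_timeout`. Timeouts too small to halve return `None`, which a builder could turn into a validation error.

```rust
use std::time::Duration;

/// Floor and ceiling taken from the reviewed clamp() call.
const FLOOR: Duration = Duration::from_millis(200);
const CEILING: Duration = Duration::from_secs(30);

/// Hypothetical helper: derive a heartbeat interval that is always
/// strictly less than `stall_timeout`.
///
/// Returns None when the timeout is so small (0 or 1ns) that half of it
/// rounds down to zero; a zero interval is not usable as a timer period.
fn heartbeat_interval(stall_timeout: Duration) -> Option<Duration> {
    let half = stall_timeout / 2;
    if half.is_zero() {
        return None;
    }
    if half < FLOOR {
        // Short timeouts: skip the floor so the heartbeat stays inside
        // the stall window instead of being raised above it.
        Some(half)
    } else {
        // Normal and long timeouts: half the window, capped by the ceiling.
        Some(half.min(CEILING))
    }
}
```

With a 100ms `stall_timeout` this yields 50ms, where the original `clamp` would yield 200ms, which is longer than the timeout itself. The trade-off is more frequent storage writes for callers who deliberately choose very short stall windows.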
Summary
Makes RustQueue's embedded mode actually deliver on the durability story, and adds an ergonomic worker entry point. Built spec-first (`docs/superpowers/specs/2026-05-22-production-embedded-design.md`), reviewed by Codex against the real source, executed via TDD.
The bug this fixes
`RustQueue::redb(...).build()` never started the background scheduler, so in embedded/library mode these documented features silently didn't work:
- a job in `Delayed` was never promoted back
- after `kill -9`, a job stayed `Active` forever
What's new
- `run_worker` / `run_worker_with_shutdown`: managed worker: pull → handler → ack-on-`Ok` / fail-on-`Err`, auto-heartbeats the in-flight job, auto-starts housekeeping, drains on shutdown.
- `start_housekeeping()`: idempotent, runtime-checked, lifetime-bound (aborts when the last `RustQueue` clone drops).
- `.stall_timeout()` / `.tick_interval()`; `RustQueue: Clone`; embedded schedule CRUD + `heartbeat` + `get_dlq_jobs`.
- `examples/crash_recovery.rs` (survive `kill -9`), production-shaped `examples/axum_background_jobs.rs`.
- `docs/production.md` operational guide + website Production page.
Correctness note
A code review caught that `JoinHandle::drop` only detaches, so a panicking handler would leak a heartbeat task and keep a job `Active` forever. Fixed with an `AbortOnDrop` RAII guard (same pattern as `HousekeepingState`); regression test added.
Verification (local)
- `cargo test --features sqlite` → 354 pass, 0 fail
- `cargo clippy --all-targets --features sqlite,postgres,otel -- -D warnings` → clean
- `cargo fmt --check` → clean
- `cargo audit --ignore RUSTSEC-2023-0071` → clean
Not a breaking change
No public API removed; the housekeeping change is framed as a bug fix.
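For readers unfamiliar with the pull → handler → ack/fail shape that `run_worker` is described as having, here is a deliberately simplified, synchronous sketch. Everything in it is an assumption for illustration: `MockQueue`, `Job`, `run_until_empty`, `truncate_error`, and the 32-character limit are hypothetical and do not reflect the actual RustQueue types or signatures. It omits heartbeats, housekeeping, and shutdown handling entirely.

```rust
use std::collections::VecDeque;

/// Hypothetical job record; the real job type is not shown in this PR.
#[derive(Debug, Clone, PartialEq)]
struct Job {
    id: u64,
    payload: String,
}

/// Hypothetical in-memory stand-in for the queue, recording outcomes.
#[derive(Default)]
struct MockQueue {
    pending: VecDeque<Job>,
    acked: Vec<u64>,
    failed: Vec<(u64, String)>,
}

/// Assumed limit for stored error strings; the real value is not stated.
const MAX_ERROR_CHARS: usize = 32;

/// Truncate by characters so multi-byte text is never split mid-char.
fn truncate_error(msg: &str, max_chars: usize) -> String {
    msg.chars().take(max_chars).collect()
}

/// Sequential loop: pull a job, run the handler, ack on Ok, fail on Err.
/// Returns the number of jobs processed.
fn run_until_empty<F>(queue: &mut MockQueue, mut handler: F) -> usize
where
    F: FnMut(&Job) -> Result<(), String>,
{
    let mut processed = 0;
    while let Some(job) = queue.pending.pop_front() {
        // Capture the id up front so it stays available for ack/fail
        // regardless of what happens to the job value afterwards.
        let job_id = job.id;
        match handler(&job) {
            Ok(()) => queue.acked.push(job_id),
            Err(e) => queue
                .failed
                .push((job_id, truncate_error(&e, MAX_ERROR_CHARS))),
        }
        processed += 1;
    }
    processed
}
```

Feeding it one job whose handler succeeds and one whose handler returns a long error leaves the first id in `acked` and the second in `failed` with a truncated message.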
🤖 Generated with Claude Code