fix(broker): make delivery handling durable and observable #1073

Merged
willwashburn merged 4 commits into main from fix/broker-delivery-durability
Jun 11, 2026

@willwashburn

Member

Summary

Fixes four delivery-durability defects in the Rust broker (crates/broker), keeping the diff scoped to delivery semantics — no changes to PTY injection, parsing, or routing.

1. Timeout-fallback acks no longer counted as verified successes

  • pty_worker records a new DeliveryOutcome::Unverified in the injection throttle on echo-verification timeout: it breaks the consecutive-success streak (so unverified deliveries can never drive the delay down) without backing off like a failure.
  • The broker's delivery_verified handler now reads the worker frame's verification/reason fields, logs timeout fallbacks at info, and forwards verification: "timeout_fallback" (plus reason) on the emitted event. The echo-verified path explicitly sends verification: "echo".
  • BrokerEvent::DeliveryVerified gains optional verification/reason fields, mirrored in @agent-relay/harness-driver protocol types.
  • The fallback ack behavior is unchanged — re-injection stays disabled to avoid duplicate deliveries.
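
The throttle behavior described in this section can be sketched as follows. This is a minimal illustration, not the broker's code: only `DeliveryOutcome::Unverified`, `ThrottleState::record`, and `consecutive_successes` come from the PR; the `Success` and `Failed` variant names, the `delay` field, and all constants are assumptions made for the example.

```rust
use std::time::Duration;

/// Outcome of one injection attempt. `Unverified` is the variant this PR adds;
/// the `Success` and `Failed` names are assumed for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DeliveryOutcome {
    Success,
    Unverified,
    Failed,
}

// Illustrative constants, not the broker's real values.
const MIN_DELAY: Duration = Duration::from_millis(100);
const MAX_DELAY: Duration = Duration::from_millis(5_000);
const SUCCESSES_TO_SPEED_UP: u32 = 3;

/// Hypothetical throttle state tracking the current delay and success streak.
#[derive(Debug)]
pub struct ThrottleState {
    pub delay: Duration,
    pub consecutive_successes: u32,
}

impl ThrottleState {
    pub fn new() -> Self {
        Self { delay: MIN_DELAY, consecutive_successes: 0 }
    }

    pub fn record(&mut self, outcome: DeliveryOutcome) {
        match outcome {
            // A full streak of verified successes halves the delay.
            DeliveryOutcome::Success => {
                self.consecutive_successes += 1;
                if self.consecutive_successes >= SUCCESSES_TO_SPEED_UP {
                    self.delay = (self.delay / 2).max(MIN_DELAY);
                    self.consecutive_successes = 0;
                }
            }
            // Neutral outcome: the streak is broken, the delay is left alone.
            DeliveryOutcome::Unverified => {
                self.consecutive_successes = 0;
            }
            // A failure breaks the streak and doubles the delay.
            DeliveryOutcome::Failed => {
                self.consecutive_successes = 0;
                self.delay = (self.delay * 2).min(MAX_DELAY);
            }
        }
    }
}
```

The point of the third arm is that a timeout fallback is neither evidence of health nor evidence of failure, so it only resets the streak.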

2. Graceful shutdown no longer drops pending deliveries

  • shutdown_runtime previously did pending_deliveries.clear() and deleted the pending file. It now calls persist_pending_on_shutdown: a non-empty map is written back via the existing atomic temp-file-rename writer (warn-level log with count) so the next start redelivers; the file is only removed when the map is actually empty. Without --persist it warns that the deliveries will be lost.
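
A rough sketch of the shutdown path described above, using only the standard library. The name `persist_pending_on_shutdown` comes from the PR, but its signature here, the `write_atomic` helper, the `BTreeMap<String, String>` stand-in for the pending map, and the tab-separated snapshot format are all assumptions; the real broker uses its existing writer and `tracing` for logs.

```rust
use std::collections::BTreeMap;
use std::fs;
use std::io::{self, Write};
use std::path::Path;

/// Hypothetical helper: write a sibling temp file, flush it to disk, then
/// rename it over the destination so readers never see a partial file.
fn write_atomic(path: &Path, contents: &str) -> io::Result<()> {
    let tmp = path.with_extension("tmp");
    let mut file = fs::File::create(&tmp)?;
    file.write_all(contents.as_bytes())?;
    file.sync_all()?;
    fs::rename(&tmp, path)
}

/// Sketch of the shutdown behavior; returns how many deliveries were written.
pub fn persist_pending_on_shutdown(
    path: &Path,
    pending: &BTreeMap<String, String>,
    persist: bool,
) -> io::Result<usize> {
    if !persist {
        // Without --persist nothing is written; only warn about the loss.
        if !pending.is_empty() {
            eprintln!("warn: {} pending deliveries will be lost", pending.len());
        }
        return Ok(0);
    }
    if pending.is_empty() {
        // Only an empty map removes the file; a missing file is not an error.
        match fs::remove_file(path) {
            Ok(()) => {}
            Err(e) if e.kind() == io::ErrorKind::NotFound => {}
            Err(e) => return Err(e),
        }
        return Ok(0);
    }
    // Assumed snapshot format for the sketch: one "id<TAB>body" line per entry.
    let mut snapshot = String::new();
    for (id, body) in pending {
        snapshot.push_str(id);
        snapshot.push('\t');
        snapshot.push_str(body);
        snapshot.push('\n');
    }
    write_atomic(path, &snapshot)?;
    eprintln!("warn: persisted {} pending deliveries for redelivery", pending.len());
    Ok(pending.len())
}
```

The rename step is what makes the write atomic on typical POSIX filesystems, provided the temp file lives on the same filesystem as the destination.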

3. Pending deliveries persisted on every mutation, not per tick

  • New PendingDeliveryStore wraps the pending map with Deref/DerefMut dirty tracking — any mutable access (insert, ack/remove, retry bookkeeping) marks it dirty, and the event loop flushes the snapshot right after the mutating event. The 500ms tick-time snapshot in maintenance.rs is removed as redundant.
  • --persist default kept off deliberately: the flag also gates state/lock/PID files, MCP config injection mode, and the ephemeral owner-lease shutdown path (lease_duration is only armed when --persist is absent). Flipping the default would change ephemeral one-shot SDK sessions well beyond delivery durability, so persistence remains opt-in for the serve path.
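
The dirty-tracking wrapper can be illustrated with a small self-contained sketch. `PendingDeliveryStore` and `take_dirty` are names from this PR; the `HashMap<String, String>` payload is a placeholder assumption, since the real map's key and value types are not shown here.

```rust
use std::collections::HashMap;
use std::ops::{Deref, DerefMut};

/// Sketch of a map wrapper that records whether it was mutably accessed.
#[derive(Debug, Default)]
pub struct PendingDeliveryStore {
    map: HashMap<String, String>, // placeholder types for the example
    dirty: bool,
}

impl PendingDeliveryStore {
    /// Report whether the map was mutably accessed since the last call,
    /// and reset the flag.
    pub fn take_dirty(&mut self) -> bool {
        std::mem::take(&mut self.dirty)
    }
}

impl Deref for PendingDeliveryStore {
    type Target = HashMap<String, String>;

    // Shared access (len, get, iter) leaves the flag untouched.
    fn deref(&self) -> &Self::Target {
        &self.map
    }
}

impl DerefMut for PendingDeliveryStore {
    // Any mutable access (insert, remove, get_mut) marks the store dirty,
    // even if the caller ends up changing nothing.
    fn deref_mut(&mut self) -> &mut Self::Target {
        self.dirty = true;
        &mut self.map
    }
}
```

Marking on `deref_mut` is deliberately conservative: it can cause a redundant flush after a no-op mutable borrow, but it cannot miss a real mutation, which is the property the event-loop flush relies on.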

4. Queue-cap evictions emit a real event

  • queue_inbound_for_delivery_mode now returns InboundQueueResult carrying the evicted sender when the per-worker cap (MAX_PENDING_PER_WORKER = 256) forces out the oldest message. Both call sites (HTTP API send and relaycast inbound) emit BrokerEvent::DeliveryDropped { name, count: 1, reason }, matching the existing delivery_dropped TS event shape that assertNoDroppedDeliveries checks.
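
As a rough sketch of the eviction flow: the names `MAX_PENDING_PER_WORKER`, `InboundQueueResult`, `evicted_from`, and `delivery_dropped_event_for_eviction` appear in this PR, while `QueuedMessage`, the simplified `queue_inbound` function, the `DeliveryDropped` struct, the meaning of `name` (taken here to be the worker name), and the reason string are assumptions for illustration. The real result type also carries a routing outcome, which is omitted.

```rust
use std::collections::VecDeque;

pub const MAX_PENDING_PER_WORKER: usize = 256;

/// Hypothetical queued message; only the sender matters for this sketch.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct QueuedMessage {
    pub from: String,
    pub body: String,
}

#[derive(Debug, PartialEq, Eq)]
pub struct InboundQueueResult {
    /// Sender of the message forced out by the cap, if any.
    pub evicted_from: Option<String>,
}

/// Stand-in for the `BrokerEvent::DeliveryDropped { name, count, reason }` payload.
#[derive(Debug, PartialEq, Eq)]
pub struct DeliveryDropped {
    pub name: String,
    pub count: usize,
    pub reason: String,
}

/// Enqueue a message, evicting the oldest one when the queue is at the cap.
pub fn queue_inbound(
    queue: &mut VecDeque<QueuedMessage>,
    msg: QueuedMessage,
) -> InboundQueueResult {
    let evicted_from = if queue.len() >= MAX_PENDING_PER_WORKER {
        queue.pop_front().map(|old| old.from)
    } else {
        None
    };
    queue.push_back(msg);
    InboundQueueResult { evicted_from }
}

/// Build the event a call site would emit when an eviction happened.
pub fn delivery_dropped_event_for_eviction(
    worker: &str,
    result: &InboundQueueResult,
) -> Option<DeliveryDropped> {
    result.evicted_from.as_ref().map(|from| DeliveryDropped {
        name: worker.to_string(),
        count: 1,
        reason: format!("queue cap reached; evicted oldest message from {from}"),
    })
}
```

Returning the evicted sender from the queueing function, rather than emitting inside it, keeps the queue logic free of event plumbing and lets each call site decide how to surface the drop.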

Deferred

  • Dedup-cache persistence (optional in the task): skipped to keep the diff focused; restart + pending-file replay still passes through the in-memory TTL dedup only for the current process lifetime.

Verification

  • cargo fmt --check -p agent-relay-broker — clean
  • cargo clippy -p agent-relay-broker --all-targets — no new warnings (one pre-existing args.get(0) warning in snippets.rs, untouched)
  • cargo test -p agent-relay-broker — 709 passed, 0 failed (scoped to the broker crate; workspace-wide build not run)
  • New unit tests: throttle neutrality of Unverified, eviction surfacing at the cap, shutdown persistence round-trip / empty-map file removal / no-persist no-write, PendingDeliveryStore dirty tracking, delivery_verified protocol round-trips with and without verification
  • packages/harness-driver/src/protocol.ts change is additive optional fields; tsc reports no errors in protocol.ts (other pre-existing errors come from unbuilt @agent-relay/sdk workspace declarations)

🤖 Generated with Claude Code

Four delivery-durability fixes in the Rust broker:

- Timeout-fallback acks are no longer conflated with verified
  successes: pty_worker records a new DeliveryOutcome::Unverified in
  the injection throttle (breaks the success streak without backing
  off), and the broker forwards verification/reason on
  delivery_verified events so unverified deliveries are observable.
  The fallback ack itself is kept — re-injection stays disabled to
  avoid duplicate deliveries.
- Graceful shutdown no longer clears pending deliveries: a non-empty
  pending map is persisted (atomic temp-file rename) for redelivery on
  the next start, with a warn-level count; the pending file is only
  removed when the map is actually empty.
- Pending deliveries are persisted on every map mutation (enqueue,
  ack/remove, retry bookkeeping) via a dirty-tracking
  PendingDeliveryStore flushed by the event loop, instead of only on
  the 500ms maintenance tick.
- Per-worker queue-cap evictions (MAX_PENDING_PER_WORKER) now emit a
  delivery_dropped broker event instead of only a tracing warning;
  queue_inbound_for_delivery_mode surfaces the evicted sender to both
  the HTTP API and relaycast inbound call sites.

BrokerEvent::DeliveryVerified gains optional verification/reason
fields, mirrored in @agent-relay/harness-driver protocol types.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@willwashburn willwashburn requested a review from khaliqgant as a code owner June 9, 2026 20:05

@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Contributor

Walkthrough

The broker's delivery-state durability improves through a dirty-tracked persistent storage abstraction, queue-cap evictions surface as broker events, and timeout-fallback acknowledgments are distinguished from verified deliveries via a new Unverified outcome and explicit verification signaling in protocol events.

Changes

Delivery Durability & Observability

| Layer | File(s) | Summary |
| --- | --- | --- |
| Unverified Delivery Outcome State | crates/broker/src/broker/delivery_verification.rs | DeliveryOutcome gains Unverified variant. ThrottleState::record resets consecutive_successes on unverified outcomes while keeping delay unchanged, so timeout-fallback acks don't falsely accelerate throttle recovery. Tests validate that unverified outcomes break success streaks and preserve the current delay. |
| Pending Delivery Persistence Architecture | crates/broker/src/runtime/delivery.rs, crates/broker/src/runtime/event_loop.rs, crates/broker/src/runtime/init.rs, crates/broker/src/runtime/maintenance.rs | PendingDeliveryStore wraps pending-delivery state with dirty-tracking on DerefMut mutations. Event loop flushes persisted snapshots after each event when dirty; maintenance tick removes its duplicate tick-time save. Graceful shutdown now persists any remaining pending deliveries for redelivery on restart, removing the file only when empty. |
| Inbound Queue Eviction Result Type | crates/broker/src/runtime/delivery.rs | InboundQueueResult enriches queue routing outcomes with evicted_from: Option<String>. queue_inbound_for_delivery_mode now tracks and returns which sender was evicted when the per-worker queue hits its 256-message cap. Helper delivery_dropped_event_for_eviction constructs broker event payloads for cap-triggered evictions. |
| Eviction Event Emission | crates/broker/src/runtime/api.rs, crates/broker/src/runtime/relaycast_events.rs | API and relaycast request handlers capture InboundQueueResult, check evicted_from, and emit delivery_dropped broker events before proceeding. Evictions are now explicitly surfaced as observable events. |
| Worker Verification Mode Signaling | crates/broker/src/pty_worker.rs, crates/broker/src/runtime/worker_events.rs | pty_worker emits "verification": "echo" for successful echo verification and records DeliveryOutcome::Unverified for timeout-fallback acks (with info-level logging). worker_events handler extracts verification mode and optional reason from payloads, propagating them in outgoing delivery_verified events to distinguish echo-verified from timeout-fallback outcomes. |
| Protocol Extensions | crates/broker/src/protocol.rs, packages/harness-driver/src/protocol.ts | BrokerEvent::DeliveryVerified gains optional verification and reason fields (omitted from wire JSON when None). Tests validate default and timeout-fallback round-trips. TypeScript client types updated to reflect extended broker event schema. |
| Test Suites & Documentation | crates/broker/src/runtime/tests.rs, CHANGELOG.md, .agentworkforce/trajectories/... | Queue tests refactored to assert evicted_from and match on result.outcome. New coverage for queue cap eviction behavior, pending delivery persistence lifecycle, and PendingDeliveryStore dirty-tracking. Changelog documents four fixed defects. Trajectory records design decisions and PR conflict resolution. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • khaliqgant

🐰 Pending state now persists through shutdowns sweet,
Unverified acks don't race the throttle beat,
Queue overflows surface their cry,
With "verification" fields racing by,
Durability and observability complete! ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'fix(broker): make delivery handling durable and observable' directly captures the main PR objective: fixing four delivery-durability defects and improving observability through event emission, which is the core purpose of the changeset. |
| Description check | ✅ Passed | The description comprehensively covers all four fixes, verification steps, and deferred work. It follows the template with a detailed Summary section and includes test verification, exceeding the basic template requirements. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint install failed due to a network error.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
packages/harness-driver/src/protocol.ts (1)

348-350: ⚡ Quick win

Narrow verification to known literals instead of plain string.

Using 'echo' | 'timeout_fallback' here makes downstream handling safer and self-documenting.

Suggested fix
-      /** 'echo' when confirmed in PTY output, 'timeout_fallback' when acked unverified. */
-      verification?: string;
+      /** 'echo' when confirmed in PTY output, 'timeout_fallback' when acked unverified. */
+      verification?: 'echo' | 'timeout_fallback';
       reason?: string;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/harness-driver/src/protocol.ts` around lines 348 - 350, The
verification property on the relevant type/interface should be narrowed from a
plain string to a union of the known literal values; change the type of
verification to 'echo' | 'timeout_fallback' (instead of string) in the interface
definition in protocol.ts and update any consumers that assume arbitrary strings
(e.g., code reading verification, switch/case or comparisons) to handle these
two literals explicitly or add a fallback branch to keep behavior unchanged;
ensure TypeScript compiles and adjust tests/types that relied on broader typing.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: af257686-ffad-4c59-bf1b-41203c2964b8

📥 Commits

Reviewing files that changed from the base of the PR and between 7e9a44a and 447c501.

📒 Files selected for processing (17)
  • .agentworkforce/trajectories/active/traj_b1jrutolckfb/trajectory.json
  • .agentworkforce/trajectories/completed/2026-06/traj_b1jrutolckfb.trace.json
  • .agentworkforce/trajectories/completed/2026-06/traj_b1jrutolckfb/summary.md
  • .agentworkforce/trajectories/completed/2026-06/traj_b1jrutolckfb/trajectory.json
  • CHANGELOG.md
  • crates/broker/src/broker/delivery_verification.rs
  • crates/broker/src/protocol.rs
  • crates/broker/src/pty_worker.rs
  • crates/broker/src/runtime/api.rs
  • crates/broker/src/runtime/delivery.rs
  • crates/broker/src/runtime/event_loop.rs
  • crates/broker/src/runtime/init.rs
  • crates/broker/src/runtime/maintenance.rs
  • crates/broker/src/runtime/relaycast_events.rs
  • crates/broker/src/runtime/tests.rs
  • crates/broker/src/runtime/worker_events.rs
  • packages/harness-driver/src/protocol.ts
💤 Files with no reviewable changes (1)
  • .agentworkforce/trajectories/active/traj_b1jrutolckfb/trajectory.json

Comment thread CHANGELOG.md
Comment on lines +125 to +128
- `agent-relay-broker` no longer discards pending deliveries on graceful shutdown: a non-empty pending map is persisted for redelivery on the next start, and the pending file is only removed when nothing is pending.
- `agent-relay-broker` persists pending deliveries on every change (enqueue, ack, retry) instead of only on the 500ms maintenance tick, so a crash between ticks cannot lose queued messages.
- `agent-relay-broker` timeout-fallback delivery acks are now observable: `delivery_verified` events carry `verification: "timeout_fallback"` plus a reason, and unverified deliveries no longer count as successes in the injection throttle.
- `agent-relay-broker` emits a `delivery_dropped` event when the per-worker pending queue cap (256) evicts the oldest message, instead of only logging a warning.


🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Remove implementation details per coding guidelines.

The changelog entries include implementation details that should be omitted according to the guideline "Omit issue/PR links, internal notes, and implementation details." As per coding guidelines, changelog entries should be concise and impact-first, focusing on user-visible changes.

📝 Suggested simplifications
-agent-relay-broker no longer discards pending deliveries on graceful shutdown: a non-empty pending map is persisted for redelivery on the next start, and the pending file is only removed when nothing is pending.
+agent-relay-broker no longer discards pending deliveries on graceful shutdown: pending deliveries are persisted for redelivery on the next start.
-agent-relay-broker persists pending deliveries on every change (enqueue, ack, retry) instead of only on the 500ms maintenance tick, so a crash between ticks cannot lose queued messages.
+agent-relay-broker persists pending deliveries immediately on every change (enqueue, ack, retry) instead of only periodically, preventing message loss on crash.
-agent-relay-broker timeout-fallback delivery acks are now observable: delivery_verified events carry verification: "timeout_fallback" plus a reason, and unverified deliveries no longer count as successes in the injection throttle.
+agent-relay-broker timeout-fallback delivery acks are now observable: delivery_verified events carry verification: "timeout_fallback" plus a reason, distinguishing them from echo-verified successes.
-agent-relay-broker emits a delivery_dropped event when the per-worker pending queue cap (256) evicts the oldest message, instead of only logging a warning.
+agent-relay-broker emits a delivery_dropped event when the per-worker pending queue capacity is exceeded and the oldest message is evicted, instead of only logging a warning.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@CHANGELOG.md` around lines 125 - 128, Rewrite the four CHANGELOG.md bullets
for agent-relay-broker to remove implementation/internal details and make them
impact-first: consolidate each into a concise user-visible change for
"agent-relay-broker" (e.g., persist pending deliveries across restarts; persist
pending state more durably to prevent loss on crashes; make timeout-fallback
acks observable via delivery_verified with verification: \"timeout_fallback\"
and include reason; emit delivery_dropped when the per-worker queue evicts
messages), omitting mentions of maintenance ticks, file removal logic, eviction
cap values (256), and wording about logging vs events — keep only the observable
behavioral changes and their user-facing effects.

Source: Coding guidelines

Comment thread crates/broker/src/protocol.rs
Comment on lines +401 to +406
/// "echo" when confirmed in PTY output, "timeout_fallback" when the
/// delivery was acked without echo verification.
#[serde(default, skip_serializing_if = "Option::is_none")]
verification: Option<String>,
#[serde(default, skip_serializing_if = "Option::is_none")]
reason: Option<String>,


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Align WorkerToBroker::DeliveryVerified with the new verification metadata.

Line 401 introduces verification semantics at the protocol layer, but WorkerToBroker::DeliveryVerified (Line 542) still omits verification/reason. This creates a Rust↔TS contract drift and can drop fields on typed round-trips.

Suggested fix
@@
     DeliveryVerified {
         delivery_id: DeliveryId,
         event_id: EventId,
+        #[serde(default, skip_serializing_if = "Option::is_none")]
+        verification: Option<String>,
+        #[serde(default, skip_serializing_if = "Option::is_none")]
+        reason: Option<String>,
     },
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/broker/src/protocol.rs` around lines 401 - 406, The DeliveryVerified
enum variant in WorkerToBroker must be extended to include the new verification
and reason metadata so the Rust protocol matches the struct fields; update the
WorkerToBroker::DeliveryVerified variant signature to add verification:
Option<String> and reason: Option<String> (with the same serde(default,
skip_serializing_if = "Option::is_none") semantics as the surrounding struct),
then update every construction, pattern-match, and serialization/deserialization
site that creates or deconstructs WorkerToBroker::DeliveryVerified (e.g.,
builders, matches, and tests) to accept and forward these two Option<String>
fields so they are preserved across Rust↔TS round-trips.

Comment thread crates/broker/src/runtime/event_loop.rs
Comment on lines +127 to +137
fn flush_pending_deliveries(&mut self) {
    if !self.pending_deliveries.take_dirty() || !self.paths.persist {
        return;
    }
    if let Err(error) = save_pending_deliveries(&self.paths.pending, &self.pending_deliveries) {
        tracing::warn!(
            path = %self.paths.pending.display(),
            error = %error,
            "failed to persist pending deliveries"
        );
    }


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve dirty state when snapshot persistence fails.

Line 128 clears the dirty flag before the write attempt, and Line 131 logs failures without restoring it. If that write fails, pending-delivery mutations can remain unsaved until another mutation occurs.

💡 Suggested fix
diff --git a/crates/broker/src/runtime/delivery.rs b/crates/broker/src/runtime/delivery.rs
@@
 impl PendingDeliveryStore {
@@
     pub(crate) fn take_dirty(&mut self) -> bool {
         std::mem::take(&mut self.dirty)
     }
+
+    pub(crate) fn mark_dirty(&mut self) {
+        self.dirty = true;
+    }
 }
diff --git a/crates/broker/src/runtime/event_loop.rs b/crates/broker/src/runtime/event_loop.rs
@@
     fn flush_pending_deliveries(&mut self) {
-        if !self.pending_deliveries.take_dirty() || !self.paths.persist {
+        if !self.paths.persist || !self.pending_deliveries.take_dirty() {
             return;
         }
         if let Err(error) = save_pending_deliveries(&self.paths.pending, &self.pending_deliveries) {
             tracing::warn!(
                 path = %self.paths.pending.display(),
                 error = %error,
                 "failed to persist pending deliveries"
             );
+            self.pending_deliveries.mark_dirty();
         }
     }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/broker/src/runtime/event_loop.rs` around lines 127 - 137, The current
flush_pending_deliveries clears the dirty flag by calling
self.pending_deliveries.take_dirty() before attempting save_pending_deliveries,
so if the write fails the dirty state is lost; change the flow in
flush_pending_deliveries to first check the dirty state without clearing (use an
is_dirty-like check instead of take_dirty), proceed with
save_pending_deliveries(&self.paths.pending, &self.pending_deliveries), and only
clear/take the dirty flag (call take_dirty() or an explicit clear) after a
successful save; if save_pending_deliveries returns Err, leave the dirty flag
set so pending mutations will be retried later (refer to
flush_pending_deliveries, pending_deliveries.take_dirty(),
save_pending_deliveries, and self.paths.persist).
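
The check-then-clear ordering requested in this comment could look roughly like the sketch below. It is a simplified, standalone illustration: `DirtyFlag`, `is_dirty`, `clear_dirty`, and `flush_if_dirty` are hypothetical names, and the `save` closure stands in for `save_pending_deliveries`; only `mark_dirty` mirrors a name from the suggested fix.

```rust
use std::io;

/// Hypothetical stand-in for the dirty bit inside PendingDeliveryStore.
#[derive(Debug, Default)]
pub struct DirtyFlag {
    dirty: bool,
}

impl DirtyFlag {
    pub fn mark_dirty(&mut self) {
        self.dirty = true;
    }

    /// Inspect the flag without resetting it.
    pub fn is_dirty(&self) -> bool {
        self.dirty
    }

    pub fn clear_dirty(&mut self) {
        self.dirty = false;
    }
}

/// Flush when dirty; the flag is cleared only after `save` succeeds.
/// Returns true if a snapshot was written.
pub fn flush_if_dirty<F>(flag: &mut DirtyFlag, persist: bool, save: F) -> bool
where
    F: FnOnce() -> io::Result<()>,
{
    // Checking `persist` first means the flag is never consumed needlessly.
    if !persist || !flag.is_dirty() {
        return false;
    }
    match save() {
        Ok(()) => {
            flag.clear_dirty();
            true
        }
        Err(error) => {
            // The flag stays set, so the next flush retries the write.
            eprintln!("warn: failed to persist pending deliveries: {error}");
            false
        }
    }
}
```

Compared with re-marking after a failed `take_dirty()`, this variant never has a window in which the flag is clear while the on-disk snapshot is stale.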

willwashburn and others added 2 commits June 11, 2026 12:44
…urability

# Conflicts:
#	.agentworkforce/trajectories/active/traj_b1jrutolckfb/trajectory.json
#	CHANGELOG.md
#	crates/broker/src/runtime/tests.rs

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
crates/broker/src/runtime/relaycast_events.rs (1)

275-277: ⚡ Quick win

Consider extracting token seeding logic to reduce duplication.

The token seeding block appears identically in both the primary spawn path (lines 275-277) and the fallback spawn path (lines 505-507). While the duplication is localized and the logic is simple, extracting it into a helper would improve maintainability if the seeding behavior evolves.

♻️ Suggested refactor
+async fn maybe_seed_or_register_agent(
+    workspace_http: &WorkspaceHttpClient,
+    name: &str,
+    cli: &str,
+    ws_value: &Value,
+) -> Option<String> {
+    // Prefer a token supplied with the spawn request; otherwise pre-register.
+    if let Some(token) = relaycast_ws_spawn_token(ws_value) {
+        seed_supplied_agent_token(workspace_http, name, &token);
+        Some(token)
+    } else {
+        const REG_TIMEOUT: Duration = Duration::from_secs(3);
+        match tokio::time::timeout(
+            REG_TIMEOUT,
+            workspace_http.register_agent_token(name, Some(cli)),
+        )
+        .await
+        {
+            Ok(Ok(token)) => {
+                tracing::info!(worker = %name, "pre-registered agent via broker for WS spawn");
+                Some(token)
+            }
+            Ok(Err(error)) => {
+                tracing::warn!(worker = %name, error = %error, "WS spawn pre-registration failed");
+                None
+            }
+            Err(_) => {
+                tracing::warn!(worker = %name, "WS spawn pre-registration timed out (3s)");
+                None
+            }
+        }
+    }
+}

Then replace both blocks with:

let worker_relay_key = maybe_seed_or_register_agent(&workspace_http, &name, &cli, &ws_value).await;

Also applies to: 505-507

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/broker/src/runtime/relaycast_events.rs` around lines 275 - 277,
Extract the duplicated token seeding logic into a single async helper (e.g.,
maybe_seed_or_register_agent) that accepts the same context used in both places
(references to workspace_http, name, cli, ws_value) and returns the token or
worker key; move the existing relaycast_ws_spawn_token call and
seed_supplied_agent_token invocation into that helper and await it where the
original blocks were, then replace both occurrences (the primary spawn path and
the fallback spawn path) with a call like let worker_relay_key =
maybe_seed_or_register_agent(&workspace_http, &name, &cli, &ws_value).await to
remove duplication and centralize seeding behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: d7ce28c0-2969-463a-a7a3-0908ddd5605f

📥 Commits

Reviewing files that changed from the base of the PR and between 447c501 and 3dfcc88.

⛔ Files ignored due to path filters (1)
  • package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (10)
  • .agentworkforce/trajectories/active/traj_b1jrutolckfb/trajectory.json
  • CHANGELOG.md
  • crates/broker/src/protocol.rs
  • crates/broker/src/pty_worker.rs
  • crates/broker/src/runtime/api.rs
  • crates/broker/src/runtime/delivery.rs
  • crates/broker/src/runtime/relaycast_events.rs
  • crates/broker/src/runtime/tests.rs
  • crates/broker/src/runtime/worker_events.rs
  • packages/harness-driver/src/protocol.ts
🚧 Files skipped from review as they are similar to previous changes (8)
  • crates/broker/src/runtime/worker_events.rs
  • packages/harness-driver/src/protocol.ts
  • CHANGELOG.md
  • crates/broker/src/runtime/api.rs
  • crates/broker/src/protocol.rs
  • crates/broker/src/pty_worker.rs
  • crates/broker/src/runtime/delivery.rs
  • crates/broker/src/runtime/tests.rs

@willwashburn willwashburn merged commit b74b476 into main Jun 11, 2026
2 checks passed
@willwashburn willwashburn deleted the fix/broker-delivery-durability branch June 11, 2026 17:00