fix: graceful browser shutdown and network error retries by vsumner · Pull Request #463 · spacedriveapp/spacebot

vsumner · 2026-03-21T01:56:46Z

Summary

Fixes browser crashes when workers fail due to transient API errors.

Changes

Error Handling

Add "error sending request" to retriable error patterns to handle DNS failures, connection refused, and other transient network errors
Workers now retry with exponential backoff instead of failing immediately

Browser Cleanup

Remove broken keepalive code (tried to clone non-Clone Browser struct)
Implement graceful browser shutdown in Drop implementation with 5s timeout
Abort handler task after close completes or times out to prevent Chrome force-kill
Fixes "WS Connection error: ResetWithoutClosingHandshake" and core dumps

Tests

Add comprehensive test suite (15 tests total):
- Routing (7 tests): Error classification, HTTP errors, timeouts, server phrases
- Browser (8 tests): Handler state clearing, persistent session preservation, CHROME_FLAGS parsing, SingletonLock lifecycle
All 15 new tests run without Chrome (state management only)

Testing

All 695 tests pass (688 existing + 7 routing + 8 browser)
Clippy clean
Gate PR checks pass

Notes (Optional)

Regression Protection:

The persistent_profile check prevents regression of PR feat: persistent browser sessions with configurable close policy #309 (persistent browser sessions)
Handler task now conditionally clears state: full cleanup for temp sessions, preserves data for persistent sessions
Drop timeouts browser.close() after 5s to prevent hanging during worker exit

Implementation Details:

Uses chained let-else with && for clean resource acquisition in Drop
Sequential cleanup: close (5s timeout) → abort handler → remove temp dir
Best-effort async cleanup; chromiumoxide force-kill fallback if runtime unavailable
No breaking changes to public API

coderabbitai · 2026-03-21T01:57:00Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

Walkthrough

is_retriable_error in src/llm/routing.rs was expanded to match the lowercase substring "error sending request" and gained unit tests. src/tools/browser.rs updates BrowserState Drop to attempt async close and abort the handler task, appends whitespace-split CHROME_FLAGS to launch args, and improves handler-task logging and state clearing.

Changes

Cohort / File(s)	Summary
Error classification & tests `src/llm/routing.rs`	Extended `is_retriable_error` to also match the lowercase substring `"error sending request"`; added `#[cfg(test)]` unit tests covering multiple retriable patterns, non-retriable cases, and `is_rate_limit_error`.
Browser lifecycle & handler management `src/tools/browser.rs`	`Drop for BrowserState` now takes ownership of browser/handler/user_data_dir and, when a Tokio runtime exists, spawns a single async task to attempt `browser.close()` with a timeout, abort the handler task, and remove the user-data directory; handler loop now logs errors/warnings and clears or preserves shared state depending on persistence; `BrowserContext::ensure_launched` appends whitespace-split `CHROME_FLAGS` to launch arguments; comprehensive `#[cfg(test)]` coverage added.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

fix: browser context management — dedup snapshots, consecutive loop guard, overflow resilience #359 — Similar extension to is_retriable_error patterns in src/llm/routing.rs.
feat: persistent browser sessions with configurable close policy #309 — Overlapping browser lifecycle, handler-task shutdown, and persistent-session handling changes in src/tools/browser.rs.
feat: auto-download Chrome via fetcher, unify Docker image, fix singleton lock #268 — Related Drop/cleanup refactor and CHROME_FLAGS handling in src/tools/browser.rs.

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically summarizes the main changes: graceful browser shutdown and network error retries are the two primary objectives addressed in the changeset.
Description check	✅ Passed	The description is directly related to the changeset, providing detailed explanations of error handling improvements, browser cleanup fixes, and test coverage that align with the actual code changes across both files.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/browser-graceful-shutdown

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Copilot

Pull request overview

This PR aims to improve runtime stability by (1) making Chromium shutdown/cleanup more reliable and (2) retrying transient network/API failures instead of failing workers immediately, with accompanying Nix packaging updates.

Changes:

Add additional retriable error detection for transient network failures and introduce a focused unit test suite for routing retry classification.
Improve browser lifecycle handling (graceful shutdown attempt, handler error logging) and add Chrome stability flags + env-driven flags.
Update Nix build inputs/packaging (crane/nixpkgs bumps, vendoring/patch hook adjustments, expanded fileset inputs).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

Show a summary per file

File	Description
`src/tools/browser.rs`	Attempts graceful browser shutdown on drop, adds stability flags and CHROME_FLAGS support, improves handler task logging.
`src/llm/routing.rs`	Expands retriable error pattern matching and adds unit tests for retriable/rate-limit detection.
`nix/default.nix`	Refactors crane build/test derivations, adds an imap-proto replacement hook, updates hashes and wrapper env.
`flake.nix`	Expands the Nix fileset inputs to include additional embedded docs/assets and vendored sources.
`flake.lock`	Updates locked revisions/hashes for flake inputs.
`.cargo/config.toml`	Changes Cargo configuration to set the crates.io registry index.

Comments suppressed due to low confidence (1)

src/tools/browser.rs:691

In Drop, the browser close is spawned but not awaited, and the code proceeds to abort the handler task and delete user_data_dir immediately afterward. This can race: the browser may still be running/closing while its profile directory is removed, undermining the goal of preventing Chromium force-kills/core dumps and potentially causing shutdown failures. Consider moving close + handler abort + temp-dir cleanup into a single spawned task (or otherwise ensuring close() completes or times out before deleting the directory).

        // Close browser gracefully before cleaning up temp directory.
        // This prevents chromiumoxide's force-kill which causes core dumps.
        if let Some(mut browser) = self.browser.take()
            && let Ok(handle) = tokio::runtime::Handle::try_current()
        {
            // Spawn async close task - best effort, don't wait for completion
            handle.spawn(async move {
                if let Err(error) = browser.close().await {
                    tracing::debug!(%error, "failed to close browser during drop");
                }
            });
        }
        // Note: if no runtime is available, we can't close async -
        // chromiumoxide will force-kill on drop, which is acceptable

        // Abort handler task to prevent it from outliving the browser
        if let Some(task) = self._handler_task.take() {
            task.abort();
        }

        if let Some(dir) = self.user_data_dir.take() {
            if let Ok(handle) = tokio::runtime::Handle::try_current() {
                handle.spawn_blocking(move || {
                    if let Err(error) = std::fs::remove_dir_all(&dir) {
                        tracing::debug!(
                            path = %dir.display(),
                            %error,
                            "failed to clean up browser user data dir"
                        );
                    }
                });
            } else if let Err(error) = std::fs::remove_dir_all(&dir) {

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

nix/default.nix (1)
176-208: Consider tightening the version pattern match.

The hook replaces any imap-proto-* directory with the imap-proto-0.10.2 patch. This is intentionally broad to catch directories in various cargo locations, but a few observations:

The wildcard imap-proto-* will match any version. If another version (e.g., imap-proto-0.11.0) is ever introduced as a transitive dependency, it would also be replaced with 0.10.2, which could cause version mismatches.

Suppressing stderr with 2>/dev/null on find is fine for missing directories, but consider whether important errors might be hidden.

If the project only ever uses imap-proto-0.10.2 (as confirmed by Cargo.lock), this is acceptable. Otherwise, consider matching the exact version:
♻️ Optional: More precise pattern matching
-    find /build -type d -name "imap-proto-*" 2>/dev/null | while read dir; do
+    find /build -type d -name "imap-proto-0.10.2" 2>/dev/null | while read dir; do
Apply similarly to the other find commands.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nix/default.nix` around lines 176 - 208, The postPatchImapProto hook
currently finds directories named "imap-proto-*" and unconditionally replaces
them with the vendor copy, which can accidentally replace other versions; update
the find patterns in postPatchImapProto to match the exact version name
"imap-proto-0.10.2" instead of the wildcard (apply to all three find
invocations) and keep the rest of the logic (rm -rf, cp -r
${cargoSrc}/vendor/imap-proto-0.10.2, chmod) the same; also remove or reduce the
global stderr suppression (the "2>/dev/null" on the find calls) so real errors
from find are not hidden while still handling missing directories gracefully,
referencing the postPatchImapProto function and variables CARGO_HOME and
cargoVendorDir in your edits.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/tools/browser.rs`:
- Around line 660-678: The close and handler-abort are racing: instead of
fire-and-forgetting browser.close() (via tokio::runtime::Handle::try_current()
spawn) and then immediately calling self._handler_task.take().abort(), ensure
the close completes (or at least is driven) before aborting the handler. Change
the logic that uses self.browser.take() and Handle::try_current() to spawn
browser.close() and wait for its JoinHandle to complete (for example by using
the runtime handle to block_on awaiting the join or awaiting the join inside a
spawned task that you then block_on) and only after that call
self._handler_task.take().abort(); refer to self.browser, browser.close(),
tokio::runtime::Handle::try_current(), self._handler_task and task.abort() when
locating the code to update.

---

Nitpick comments:
In `@nix/default.nix`:
- Around line 176-208: The postPatchImapProto hook currently finds directories
named "imap-proto-*" and unconditionally replaces them with the vendor copy,
which can accidentally replace other versions; update the find patterns in
postPatchImapProto to match the exact version name "imap-proto-0.10.2" instead
of the wildcard (apply to all three find invocations) and keep the rest of the
logic (rm -rf, cp -r ${cargoSrc}/vendor/imap-proto-0.10.2, chmod) the same; also
remove or reduce the global stderr suppression (the "2>/dev/null" on the find
calls) so real errors from find are not hidden while still handling missing
directories gracefully, referencing the postPatchImapProto function and
variables CARGO_HOME and cargoVendorDir in your edits.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 67c087f2-66f1-426f-a2da-8caf135660e4

📥 Commits

Reviewing files that changed from the base of the PR and between bcdb44a and 1b39687.

⛔ Files ignored due to path filters (2)

.cargo/config.toml is excluded by !**/*.toml
flake.lock is excluded by !**/*.lock, !**/*.lock

📒 Files selected for processing (4)

flake.nix
nix/default.nix
src/llm/routing.rs
src/tools/browser.rs

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

src/tools/browser.rs (1)
660-690: ⚠️ Potential issue | 🟠 Major

Serialize browser close, handler abort, and profile cleanup.

Handle::spawn returns immediately, so the handler abort and the remove_dir_all path below can both run before browser.close() has actually driven a graceful shutdown. That means this drop path can still tear down the handler or temp profile while Chrome is live, which is the failure mode this change is trying to avoid. Please sequence close → abort → cleanup inside the spawned task, ideally with a short timeout around close().
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/tools/browser.rs` around lines 660 - 690, The current drop spawns
browser.close() asynchronously while aborting the handler and deleting
user_data_dir immediately, risking tearing down resources before close
completes; fix by serializing close → abort → cleanup inside a single spawned
async task: take self.browser, self._handler_task, and self.user_data_dir, then
if tokio::runtime::Handle::try_current() is Ok spawn an async move that (1)
calls tokio::time::timeout with a short duration around browser.close().await
and logs on timeout/error, (2) after close completes (or times out) calls
task.abort() on the taken _handler_task, and (3) uses
tokio::task::spawn_blocking to run std::fs::remove_dir_all on the taken
user_data_dir and log errors; remove the separate handler.abort() and standalone
spawn_blocking calls so cleanup is only performed after close finishes or times
out.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/tools/browser.rs`:
- Around line 1167-1174: The handler task must clear the shared BrowserState
when the handler reaches a terminal/failure state so ensure_launched() doesn't
falsely think a dead session is live; inside the tokio::spawn closure that
currently polls handler.next() (the block with handler_task and the
tracing::error!/warn! calls), acquire the same BrowserState lock/handle and
reset its fields (set browser to None, clear pages, clear active_target or flip
any launched flag) whenever you observe Err(error) or when the stream ends,
ensuring this happens before returning from the task so subsequent
ensure_launched() calls will relaunch a fresh browser session.

---

Duplicate comments:
In `@src/tools/browser.rs`:
- Around line 660-690: The current drop spawns browser.close() asynchronously
while aborting the handler and deleting user_data_dir immediately, risking
tearing down resources before close completes; fix by serializing close → abort
→ cleanup inside a single spawned async task: take self.browser,
self._handler_task, and self.user_data_dir, then if
tokio::runtime::Handle::try_current() is Ok spawn an async move that (1) calls
tokio::time::timeout with a short duration around browser.close().await and logs
on timeout/error, (2) after close completes (or times out) calls task.abort() on
the taken _handler_task, and (3) uses tokio::task::spawn_blocking to run
std::fs::remove_dir_all on the taken user_data_dir and log errors; remove the
separate handler.abort() and standalone spawn_blocking calls so cleanup is only
performed after close finishes or times out.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ee54ea1e-e9a3-4b00-b589-fe4939d8bd6e

📥 Commits

Reviewing files that changed from the base of the PR and between 6992ba2 and 0ce85a1.

📒 Files selected for processing (2)

src/llm/routing.rs
src/tools/browser.rs

- Add "error sending request" to retriable error patterns to handle transient network failures (DNS, connection refused, etc.) - Remove broken keepalive code that tried to clone non-Clone Browser - Implement graceful browser shutdown in Drop to prevent Chrome core dumps - Add comprehensive test suite (7 tests) for error classification - Fix clippy warning about collapsible if statements Fixes WebSocket connection errors when workers fail due to transient API errors, preventing Chrome from being force-killed and dumping core.

vsumner · 2026-03-21T12:52:11Z

@coderabbitai review to trigger a single review.

coderabbitai · 2026-03-21T12:52:26Z

@vsumner Sure, I'll review the changes now!

🧠 Learnings used

Learnt from: CR
Repo: spacedriveapp/spacebot PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-08T04:12:21.175Z
Learning: For changes in async/stateful paths (worker lifecycle, cancellation, retrigger, recall cache behavior), include explicit race/terminal-state reasoning in the PR summary and run targeted tests in addition to `just gate-pr`

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

🧹 Nitpick comments (1)

src/tools/browser.rs (1)
2607-2641: Consider using std::sync::Mutex or serial_test for env var tests.

The unsafe { std::env::set_var(...) } blocks mutate shared process state. While the comment says "Tests run sequentially," Rust's test runner executes tests in parallel by default. If other tests read CHROME_FLAGS, this could cause flaky behavior.

Options:

Run this specific test with cargo test -- --test-threads=1

Use the serial_test crate with #[serial] attribute

Save/restore the original env var value in a mutex-guarded block

This is a minor concern since the env var is unlikely to be read by other tests, but worth noting for robustness.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/tools/browser.rs` around lines 2607 - 2641, The test_chrome_flags_parsing
mutates global process env with unsafe set_var calls causing potential
flakiness; make it deterministic by either adding the serial_test attribute (add
serial_test crate and annotate the function with #[serial]) or by guarding env
changes with a Mutex and saving/restoring the original CHROME_FLAGS value: read
original with std::env::var_os("CHROME_FLAGS"), set with std::env::set_var(...)
(no unsafe), run assertions, then restore the original (use std::env::remove_var
if it was None or std::env::set_var to restore); remove all unsafe blocks if you
adopt mutex/save-restore or add #[serial] above test_chrome_flags_parsing if
using serial_test.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/tools/browser.rs`:
- Around line 2607-2641: The test_chrome_flags_parsing mutates global process
env with unsafe set_var calls causing potential flakiness; make it deterministic
by either adding the serial_test attribute (add serial_test crate and annotate
the function with #[serial]) or by guarding env changes with a Mutex and
saving/restoring the original CHROME_FLAGS value: read original with
std::env::var_os("CHROME_FLAGS"), set with std::env::set_var(...) (no unsafe),
run assertions, then restore the original (use std::env::remove_var if it was
None or std::env::set_var to restore); remove all unsafe blocks if you adopt
mutex/save-restore or add #[serial] above test_chrome_flags_parsing if using
serial_test.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 92f607df-20f0-419b-9272-7b2dea4cb26b

📥 Commits

Reviewing files that changed from the base of the PR and between 92c52fe and 3b83d6d.

📒 Files selected for processing (2)

src/llm/routing.rs
src/tools/browser.rs

vsumner requested review from Copilot and jamiepine March 21, 2026 01:57

Copilot started reviewing on behalf of vsumner March 21, 2026 01:57 View session

vsumner force-pushed the fix/browser-graceful-shutdown branch from 1b39687 to 6992ba2 Compare March 21, 2026 01:59

Copilot AI reviewed Mar 21, 2026

View reviewed changes

tembo Bot reviewed Mar 21, 2026

View reviewed changes

Comment thread src/tools/browser.rs

Comment thread src/tools/browser.rs

coderabbitai Bot reviewed Mar 21, 2026

View reviewed changes

Comment thread src/tools/browser.rs Outdated

vsumner force-pushed the fix/browser-graceful-shutdown branch 3 times, most recently from 7863dac to 92c52fe Compare March 21, 2026 02:22

coderabbitai Bot reviewed Mar 21, 2026

View reviewed changes

Comment thread src/tools/browser.rs

vsumner force-pushed the fix/browser-graceful-shutdown branch 3 times, most recently from f032f20 to 26d52da Compare March 21, 2026 12:35

vsumner force-pushed the fix/browser-graceful-shutdown branch from 26d52da to 3b83d6d Compare March 21, 2026 12:45

coderabbitai Bot reviewed Mar 21, 2026

View reviewed changes

jamiepine approved these changes Mar 21, 2026

View reviewed changes

jamiepine merged commit 6b3ed72 into main Mar 21, 2026
4 checks passed

coderabbitai Bot mentioned this pull request Mar 23, 2026

fix: resolve worker tool call errors (browser_evaluate, browser_close, compaction) #475

Open

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: graceful browser shutdown and network error retries#463

fix: graceful browser shutdown and network error retries#463
jamiepine merged 1 commit into
mainfrom
fix/browser-graceful-shutdown

vsumner commented Mar 21, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Mar 21, 2026 •

edited

Loading

Reviews paused

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

vsumner commented Mar 21, 2026

Uh oh!

coderabbitai Bot commented Mar 21, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

vsumner commented Mar 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Error Handling

Browser Cleanup

Tests

Testing

Notes (Optional)

Uh oh!

coderabbitai Bot commented Mar 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

vsumner commented Mar 21, 2026

Uh oh!

coderabbitai Bot commented Mar 21, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

vsumner commented Mar 21, 2026 •

edited

Loading

coderabbitai Bot commented Mar 21, 2026 •

edited

Loading