Skip to content

fix: graceful browser shutdown and network error retries#463

Merged
jamiepine merged 1 commit into
mainfrom
fix/browser-graceful-shutdown
Mar 21, 2026
Merged

fix: graceful browser shutdown and network error retries#463
jamiepine merged 1 commit into
mainfrom
fix/browser-graceful-shutdown

Conversation

@vsumner

@vsumner vsumner commented Mar 21, 2026

Copy link
Copy Markdown
Collaborator

Summary

Fixes browser crashes when workers fail due to transient API errors.

Changes

Error Handling

  • Add "error sending request" to retriable error patterns to handle DNS failures, connection refused, and other transient network errors
  • Workers now retry with exponential backoff instead of failing immediately

Browser Cleanup

  • Remove broken keepalive code (tried to clone non-Clone Browser struct)
  • Implement graceful browser shutdown in Drop implementation with 5s timeout
  • Abort handler task after close completes or times out to prevent Chrome force-kill
  • Fixes "WS Connection error: ResetWithoutClosingHandshake" and core dumps

Tests

  • Add comprehensive test suite (15 tests total):
    • Routing (7 tests): Error classification, HTTP errors, timeouts, server phrases
    • Browser (8 tests): Handler state clearing, persistent session preservation, CHROME_FLAGS parsing, SingletonLock lifecycle
  • All 15 new tests run without Chrome (state management only)

Testing

  • All 695 tests pass (688 existing + 7 routing + 8 browser)
  • Clippy clean
  • Gate PR checks pass

Notes (Optional)

Regression Protection:

  • The persistent_profile check prevents regression of PR feat: persistent browser sessions with configurable close policy #309 (persistent browser sessions)
  • Handler task now conditionally clears state: full cleanup for temp sessions, preserves data for persistent sessions
  • Drop timeouts browser.close() after 5s to prevent hanging during worker exit

Implementation Details:

  • Uses chained let-else with && for clean resource acquisition in Drop
  • Sequential cleanup: close (5s timeout) → abort handler → remove temp dir
  • Best-effort async cleanup; chromiumoxide force-kill fallback if runtime unavailable
  • No breaking changes to public API

@coderabbitai

coderabbitai Bot commented Mar 21, 2026

Copy link
Copy Markdown
Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

is_retriable_error in src/llm/routing.rs was expanded to match the lowercase substring "error sending request" and gained unit tests. src/tools/browser.rs updates BrowserState Drop to attempt async close and abort the handler task, appends whitespace-split CHROME_FLAGS to launch args, and improves handler-task logging and state clearing.

Changes

Cohort / File(s) Summary
Error classification & tests
src/llm/routing.rs
Extended is_retriable_error to also match the lowercase substring "error sending request"; added #[cfg(test)] unit tests covering multiple retriable patterns, non-retriable cases, and is_rate_limit_error.
Browser lifecycle & handler management
src/tools/browser.rs
Drop for BrowserState now takes ownership of browser/handler/user_data_dir and, when a Tokio runtime exists, spawns a single async task to attempt browser.close() with a timeout, abort the handler task, and remove the user-data directory; handler loop now logs errors/warnings and clears or preserves shared state depending on persistence; BrowserContext::ensure_launched appends whitespace-split CHROME_FLAGS to launch arguments; comprehensive #[cfg(test)] coverage added.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically summarizes the main changes: graceful browser shutdown and network error retries are the two primary objectives addressed in the changeset.
Description check ✅ Passed The description is directly related to the changeset, providing detailed explanations of error handling improvements, browser cleanup fixes, and test coverage that align with the actual code changes across both files.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/browser-graceful-shutdown

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to improve runtime stability by (1) making Chromium shutdown/cleanup more reliable and (2) retrying transient network/API failures instead of failing workers immediately, with accompanying Nix packaging updates.

Changes:

  • Add additional retriable error detection for transient network failures and introduce a focused unit test suite for routing retry classification.
  • Improve browser lifecycle handling (graceful shutdown attempt, handler error logging) and add Chrome stability flags + env-driven flags.
  • Update Nix build inputs/packaging (crane/nixpkgs bumps, vendoring/patch hook adjustments, expanded fileset inputs).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

Show a summary per file
File Description
src/tools/browser.rs Attempts graceful browser shutdown on drop, adds stability flags and CHROME_FLAGS support, improves handler task logging.
src/llm/routing.rs Expands retriable error pattern matching and adds unit tests for retriable/rate-limit detection.
nix/default.nix Refactors crane build/test derivations, adds an imap-proto replacement hook, updates hashes and wrapper env.
flake.nix Expands the Nix fileset inputs to include additional embedded docs/assets and vendored sources.
flake.lock Updates locked revisions/hashes for flake inputs.
.cargo/config.toml Changes Cargo configuration to set the crates.io registry index.
Comments suppressed due to low confidence (1)

src/tools/browser.rs:691

  • In Drop, the browser close is spawned but not awaited, and the code proceeds to abort the handler task and delete user_data_dir immediately afterward. This can race: the browser may still be running/closing while its profile directory is removed, undermining the goal of preventing Chromium force-kills/core dumps and potentially causing shutdown failures. Consider moving close + handler abort + temp-dir cleanup into a single spawned task (or otherwise ensuring close() completes or times out before deleting the directory).
        // Close browser gracefully before cleaning up temp directory.
        // This prevents chromiumoxide's force-kill which causes core dumps.
        if let Some(mut browser) = self.browser.take()
            && let Ok(handle) = tokio::runtime::Handle::try_current()
        {
            // Spawn async close task - best effort, don't wait for completion
            handle.spawn(async move {
                if let Err(error) = browser.close().await {
                    tracing::debug!(%error, "failed to close browser during drop");
                }
            });
        }
        // Note: if no runtime is available, we can't close async -
        // chromiumoxide will force-kill on drop, which is acceptable

        // Abort handler task to prevent it from outliving the browser
        if let Some(task) = self._handler_task.take() {
            task.abort();
        }

        if let Some(dir) = self.user_data_dir.take() {
            if let Ok(handle) = tokio::runtime::Handle::try_current() {
                handle.spawn_blocking(move || {
                    if let Err(error) = std::fs::remove_dir_all(&dir) {
                        tracing::debug!(
                            path = %dir.display(),
                            %error,
                            "failed to clean up browser user data dir"
                        );
                    }
                });
            } else if let Err(error) = std::fs::remove_dir_all(&dir) {

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/tools/browser.rs
Comment thread src/tools/browser.rs

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
nix/default.nix (1)

176-208: Consider tightening the version pattern match.

The hook replaces any imap-proto-* directory with the imap-proto-0.10.2 patch. This is intentionally broad to catch directories in various cargo locations, but a few observations:

  1. The wildcard imap-proto-* will match any version. If another version (e.g., imap-proto-0.11.0) is ever introduced as a transitive dependency, it would also be replaced with 0.10.2, which could cause version mismatches.

  2. Suppressing stderr with 2>/dev/null on find is fine for missing directories, but consider whether important errors might be hidden.

If the project only ever uses imap-proto-0.10.2 (as confirmed by Cargo.lock), this is acceptable. Otherwise, consider matching the exact version:

♻️ Optional: More precise pattern matching
-    find /build -type d -name "imap-proto-*" 2>/dev/null | while read dir; do
+    find /build -type d -name "imap-proto-0.10.2" 2>/dev/null | while read dir; do

Apply similarly to the other find commands.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nix/default.nix` around lines 176 - 208, The postPatchImapProto hook
currently finds directories named "imap-proto-*" and unconditionally replaces
them with the vendor copy, which can accidentally replace other versions; update
the find patterns in postPatchImapProto to match the exact version name
"imap-proto-0.10.2" instead of the wildcard (apply to all three find
invocations) and keep the rest of the logic (rm -rf, cp -r
${cargoSrc}/vendor/imap-proto-0.10.2, chmod) the same; also remove or reduce the
global stderr suppression (the "2>/dev/null" on the find calls) so real errors
from find are not hidden while still handling missing directories gracefully,
referencing the postPatchImapProto function and variables CARGO_HOME and
cargoVendorDir in your edits.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/tools/browser.rs`:
- Around line 660-678: The close and handler-abort are racing: instead of
fire-and-forgetting browser.close() (via tokio::runtime::Handle::try_current()
spawn) and then immediately calling self._handler_task.take().abort(), ensure
the close completes (or at least is driven) before aborting the handler. Change
the logic that uses self.browser.take() and Handle::try_current() to spawn
browser.close() and wait for its JoinHandle to complete (for example by using
the runtime handle to block_on awaiting the join or awaiting the join inside a
spawned task that you then block_on) and only after that call
self._handler_task.take().abort(); refer to self.browser, browser.close(),
tokio::runtime::Handle::try_current(), self._handler_task and task.abort() when
locating the code to update.

---

Nitpick comments:
In `@nix/default.nix`:
- Around line 176-208: The postPatchImapProto hook currently finds directories
named "imap-proto-*" and unconditionally replaces them with the vendor copy,
which can accidentally replace other versions; update the find patterns in
postPatchImapProto to match the exact version name "imap-proto-0.10.2" instead
of the wildcard (apply to all three find invocations) and keep the rest of the
logic (rm -rf, cp -r ${cargoSrc}/vendor/imap-proto-0.10.2, chmod) the same; also
remove or reduce the global stderr suppression (the "2>/dev/null" on the find
calls) so real errors from find are not hidden while still handling missing
directories gracefully, referencing the postPatchImapProto function and
variables CARGO_HOME and cargoVendorDir in your edits.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 67c087f2-66f1-426f-a2da-8caf135660e4

📥 Commits

Reviewing files that changed from the base of the PR and between bcdb44a and 1b39687.

⛔ Files ignored due to path filters (2)
  • .cargo/config.toml is excluded by !**/*.toml
  • flake.lock is excluded by !**/*.lock, !**/*.lock
📒 Files selected for processing (4)
  • flake.nix
  • nix/default.nix
  • src/llm/routing.rs
  • src/tools/browser.rs

Comment thread src/tools/browser.rs Outdated
@vsumner vsumner force-pushed the fix/browser-graceful-shutdown branch 3 times, most recently from 7863dac to 92c52fe Compare March 21, 2026 02:22

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
src/tools/browser.rs (1)

660-690: ⚠️ Potential issue | 🟠 Major

Serialize browser close, handler abort, and profile cleanup.

Handle::spawn returns immediately, so the handler abort and the remove_dir_all path below can both run before browser.close() has actually driven a graceful shutdown. That means this drop path can still tear down the handler or temp profile while Chrome is live, which is the failure mode this change is trying to avoid. Please sequence close → abort → cleanup inside the spawned task, ideally with a short timeout around close().

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/tools/browser.rs` around lines 660 - 690, The current drop spawns
browser.close() asynchronously while aborting the handler and deleting
user_data_dir immediately, risking tearing down resources before close
completes; fix by serializing close → abort → cleanup inside a single spawned
async task: take self.browser, self._handler_task, and self.user_data_dir, then
if tokio::runtime::Handle::try_current() is Ok spawn an async move that (1)
calls tokio::time::timeout with a short duration around browser.close().await
and logs on timeout/error, (2) after close completes (or times out) calls
task.abort() on the taken _handler_task, and (3) uses
tokio::task::spawn_blocking to run std::fs::remove_dir_all on the taken
user_data_dir and log errors; remove the separate handler.abort() and standalone
spawn_blocking calls so cleanup is only performed after close finishes or times
out.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/tools/browser.rs`:
- Around line 1167-1174: The handler task must clear the shared BrowserState
when the handler reaches a terminal/failure state so ensure_launched() doesn't
falsely think a dead session is live; inside the tokio::spawn closure that
currently polls handler.next() (the block with handler_task and the
tracing::error!/warn! calls), acquire the same BrowserState lock/handle and
reset its fields (set browser to None, clear pages, clear active_target or flip
any launched flag) whenever you observe Err(error) or when the stream ends,
ensuring this happens before returning from the task so subsequent
ensure_launched() calls will relaunch a fresh browser session.

---

Duplicate comments:
In `@src/tools/browser.rs`:
- Around line 660-690: The current drop spawns browser.close() asynchronously
while aborting the handler and deleting user_data_dir immediately, risking
tearing down resources before close completes; fix by serializing close → abort
→ cleanup inside a single spawned async task: take self.browser,
self._handler_task, and self.user_data_dir, then if
tokio::runtime::Handle::try_current() is Ok spawn an async move that (1) calls
tokio::time::timeout with a short duration around browser.close().await and logs
on timeout/error, (2) after close completes (or times out) calls task.abort() on
the taken _handler_task, and (3) uses tokio::task::spawn_blocking to run
std::fs::remove_dir_all on the taken user_data_dir and log errors; remove the
separate handler.abort() and standalone spawn_blocking calls so cleanup is only
performed after close finishes or times out.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ee54ea1e-e9a3-4b00-b589-fe4939d8bd6e

📥 Commits

Reviewing files that changed from the base of the PR and between 6992ba2 and 0ce85a1.

📒 Files selected for processing (2)
  • src/llm/routing.rs
  • src/tools/browser.rs

Comment thread src/tools/browser.rs
@vsumner vsumner force-pushed the fix/browser-graceful-shutdown branch 3 times, most recently from f032f20 to 26d52da Compare March 21, 2026 12:35
- Add "error sending request" to retriable error patterns to handle
  transient network failures (DNS, connection refused, etc.)
- Remove broken keepalive code that tried to clone non-Clone Browser
- Implement graceful browser shutdown in Drop to prevent Chrome core dumps
- Add comprehensive test suite (7 tests) for error classification
- Fix clippy warning about collapsible if statements

Fixes WebSocket connection errors when workers fail due to transient
API errors, preventing Chrome from being force-killed and dumping core.
@vsumner vsumner force-pushed the fix/browser-graceful-shutdown branch from 26d52da to 3b83d6d Compare March 21, 2026 12:45
@vsumner

vsumner commented Mar 21, 2026

Copy link
Copy Markdown
Collaborator Author

@coderabbitai review to trigger a single review.

@coderabbitai

coderabbitai Bot commented Mar 21, 2026

Copy link
Copy Markdown
Contributor

@vsumner Sure, I'll review the changes now!


🧠 Learnings used
Learnt from: CR
Repo: spacedriveapp/spacebot PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-08T04:12:21.175Z
Learning: For changes in async/stateful paths (worker lifecycle, cancellation, retrigger, recall cache behavior), include explicit race/terminal-state reasoning in the PR summary and run targeted tests in addition to `just gate-pr`
✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
src/tools/browser.rs (1)

2607-2641: Consider using std::sync::Mutex or serial_test for env var tests.

The unsafe { std::env::set_var(...) } blocks mutate shared process state. While the comment says "Tests run sequentially," Rust's test runner executes tests in parallel by default. If other tests read CHROME_FLAGS, this could cause flaky behavior.

Options:

  1. Run this specific test with cargo test -- --test-threads=1
  2. Use the serial_test crate with #[serial] attribute
  3. Save/restore the original env var value in a mutex-guarded block

This is a minor concern since the env var is unlikely to be read by other tests, but worth noting for robustness.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/tools/browser.rs` around lines 2607 - 2641, The test_chrome_flags_parsing
mutates global process env with unsafe set_var calls causing potential
flakiness; make it deterministic by either adding the serial_test attribute (add
serial_test crate and annotate the function with #[serial]) or by guarding env
changes with a Mutex and saving/restoring the original CHROME_FLAGS value: read
original with std::env::var_os("CHROME_FLAGS"), set with std::env::set_var(...)
(no unsafe), run assertions, then restore the original (use std::env::remove_var
if it was None or std::env::set_var to restore); remove all unsafe blocks if you
adopt mutex/save-restore or add #[serial] above test_chrome_flags_parsing if
using serial_test.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/tools/browser.rs`:
- Around line 2607-2641: The test_chrome_flags_parsing mutates global process
env with unsafe set_var calls causing potential flakiness; make it deterministic
by either adding the serial_test attribute (add serial_test crate and annotate
the function with #[serial]) or by guarding env changes with a Mutex and
saving/restoring the original CHROME_FLAGS value: read original with
std::env::var_os("CHROME_FLAGS"), set with std::env::set_var(...) (no unsafe),
run assertions, then restore the original (use std::env::remove_var if it was
None or std::env::set_var to restore); remove all unsafe blocks if you adopt
mutex/save-restore or add #[serial] above test_chrome_flags_parsing if using
serial_test.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 92f607df-20f0-419b-9272-7b2dea4cb26b

📥 Commits

Reviewing files that changed from the base of the PR and between 92c52fe and 3b83d6d.

📒 Files selected for processing (2)
  • src/llm/routing.rs
  • src/tools/browser.rs

@jamiepine jamiepine merged commit 6b3ed72 into main Mar 21, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants