fix: graceful browser shutdown and network error retries#463
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
Walkthrough
Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
1b39687 to
6992ba2
Compare
There was a problem hiding this comment.
Pull request overview
This PR aims to improve runtime stability by (1) making Chromium shutdown/cleanup more reliable and (2) retrying transient network/API failures instead of failing workers immediately, with accompanying Nix packaging updates.
Changes:
- Add additional retriable error detection for transient network failures and introduce a focused unit test suite for routing retry classification.
- Improve browser lifecycle handling (graceful shutdown attempt, handler error logging) and add Chrome stability flags + env-driven flags.
- Update Nix build inputs/packaging (crane/nixpkgs bumps, vendoring/patch hook adjustments, expanded fileset inputs).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
src/tools/browser.rs |
Attempts graceful browser shutdown on drop, adds stability flags and CHROME_FLAGS support, improves handler task logging. |
src/llm/routing.rs |
Expands retriable error pattern matching and adds unit tests for retriable/rate-limit detection. |
nix/default.nix |
Refactors crane build/test derivations, adds an imap-proto replacement hook, updates hashes and wrapper env. |
flake.nix |
Expands the Nix fileset inputs to include additional embedded docs/assets and vendored sources. |
flake.lock |
Updates locked revisions/hashes for flake inputs. |
.cargo/config.toml |
Changes Cargo configuration to set the crates.io registry index. |
Comments suppressed due to low confidence (1)
src/tools/browser.rs:691
- In
Drop, the browser close is spawned but not awaited, and the code proceeds to abort the handler task and deleteuser_data_dirimmediately afterward. This can race: the browser may still be running/closing while its profile directory is removed, undermining the goal of preventing Chromium force-kills/core dumps and potentially causing shutdown failures. Consider moving close + handler abort + temp-dir cleanup into a single spawned task (or otherwise ensuringclose()completes or times out before deleting the directory).
// Close browser gracefully before cleaning up temp directory.
// This prevents chromiumoxide's force-kill which causes core dumps.
if let Some(mut browser) = self.browser.take()
&& let Ok(handle) = tokio::runtime::Handle::try_current()
{
// Spawn async close task - best effort, don't wait for completion
handle.spawn(async move {
if let Err(error) = browser.close().await {
tracing::debug!(%error, "failed to close browser during drop");
}
});
}
// Note: if no runtime is available, we can't close async -
// chromiumoxide will force-kill on drop, which is acceptable
// Abort handler task to prevent it from outliving the browser
if let Some(task) = self._handler_task.take() {
task.abort();
}
if let Some(dir) = self.user_data_dir.take() {
if let Ok(handle) = tokio::runtime::Handle::try_current() {
handle.spawn_blocking(move || {
if let Err(error) = std::fs::remove_dir_all(&dir) {
tracing::debug!(
path = %dir.display(),
%error,
"failed to clean up browser user data dir"
);
}
});
} else if let Err(error) = std::fs::remove_dir_all(&dir) {
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
nix/default.nix (1)
176-208: Consider tightening the version pattern match.The hook replaces any
imap-proto-*directory with theimap-proto-0.10.2patch. This is intentionally broad to catch directories in various cargo locations, but a few observations:
The wildcard
imap-proto-*will match any version. If another version (e.g.,imap-proto-0.11.0) is ever introduced as a transitive dependency, it would also be replaced with0.10.2, which could cause version mismatches.Suppressing stderr with
2>/dev/nullonfindis fine for missing directories, but consider whether important errors might be hidden.If the project only ever uses
imap-proto-0.10.2(as confirmed by Cargo.lock), this is acceptable. Otherwise, consider matching the exact version:♻️ Optional: More precise pattern matching
- find /build -type d -name "imap-proto-*" 2>/dev/null | while read dir; do + find /build -type d -name "imap-proto-0.10.2" 2>/dev/null | while read dir; doApply similarly to the other
findcommands.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nix/default.nix` around lines 176 - 208, The postPatchImapProto hook currently finds directories named "imap-proto-*" and unconditionally replaces them with the vendor copy, which can accidentally replace other versions; update the find patterns in postPatchImapProto to match the exact version name "imap-proto-0.10.2" instead of the wildcard (apply to all three find invocations) and keep the rest of the logic (rm -rf, cp -r ${cargoSrc}/vendor/imap-proto-0.10.2, chmod) the same; also remove or reduce the global stderr suppression (the "2>/dev/null" on the find calls) so real errors from find are not hidden while still handling missing directories gracefully, referencing the postPatchImapProto function and variables CARGO_HOME and cargoVendorDir in your edits.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/tools/browser.rs`:
- Around line 660-678: The close and handler-abort are racing: instead of
fire-and-forgetting browser.close() (via tokio::runtime::Handle::try_current()
spawn) and then immediately calling self._handler_task.take().abort(), ensure
the close completes (or at least is driven) before aborting the handler. Change
the logic that uses self.browser.take() and Handle::try_current() to spawn
browser.close() and wait for its JoinHandle to complete (for example by using
the runtime handle to block_on awaiting the join or awaiting the join inside a
spawned task that you then block_on) and only after that call
self._handler_task.take().abort(); refer to self.browser, browser.close(),
tokio::runtime::Handle::try_current(), self._handler_task and task.abort() when
locating the code to update.
---
Nitpick comments:
In `@nix/default.nix`:
- Around line 176-208: The postPatchImapProto hook currently finds directories
named "imap-proto-*" and unconditionally replaces them with the vendor copy,
which can accidentally replace other versions; update the find patterns in
postPatchImapProto to match the exact version name "imap-proto-0.10.2" instead
of the wildcard (apply to all three find invocations) and keep the rest of the
logic (rm -rf, cp -r ${cargoSrc}/vendor/imap-proto-0.10.2, chmod) the same; also
remove or reduce the global stderr suppression (the "2>/dev/null" on the find
calls) so real errors from find are not hidden while still handling missing
directories gracefully, referencing the postPatchImapProto function and
variables CARGO_HOME and cargoVendorDir in your edits.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 67c087f2-66f1-426f-a2da-8caf135660e4
⛔ Files ignored due to path filters (2)
.cargo/config.tomlis excluded by!**/*.tomlflake.lockis excluded by!**/*.lock,!**/*.lock
📒 Files selected for processing (4)
flake.nixnix/default.nixsrc/llm/routing.rssrc/tools/browser.rs
7863dac to
92c52fe
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/tools/browser.rs (1)
660-690:⚠️ Potential issue | 🟠 MajorSerialize browser close, handler abort, and profile cleanup.
Handle::spawnreturns immediately, so the handler abort and theremove_dir_allpath below can both run beforebrowser.close()has actually driven a graceful shutdown. That means this drop path can still tear down the handler or temp profile while Chrome is live, which is the failure mode this change is trying to avoid. Please sequence close → abort → cleanup inside the spawned task, ideally with a short timeout aroundclose().🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/tools/browser.rs` around lines 660 - 690, The current drop spawns browser.close() asynchronously while aborting the handler and deleting user_data_dir immediately, risking tearing down resources before close completes; fix by serializing close → abort → cleanup inside a single spawned async task: take self.browser, self._handler_task, and self.user_data_dir, then if tokio::runtime::Handle::try_current() is Ok spawn an async move that (1) calls tokio::time::timeout with a short duration around browser.close().await and logs on timeout/error, (2) after close completes (or times out) calls task.abort() on the taken _handler_task, and (3) uses tokio::task::spawn_blocking to run std::fs::remove_dir_all on the taken user_data_dir and log errors; remove the separate handler.abort() and standalone spawn_blocking calls so cleanup is only performed after close finishes or times out.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/tools/browser.rs`:
- Around line 1167-1174: The handler task must clear the shared BrowserState
when the handler reaches a terminal/failure state so ensure_launched() doesn't
falsely think a dead session is live; inside the tokio::spawn closure that
currently polls handler.next() (the block with handler_task and the
tracing::error!/warn! calls), acquire the same BrowserState lock/handle and
reset its fields (set browser to None, clear pages, clear active_target or flip
any launched flag) whenever you observe Err(error) or when the stream ends,
ensuring this happens before returning from the task so subsequent
ensure_launched() calls will relaunch a fresh browser session.
---
Duplicate comments:
In `@src/tools/browser.rs`:
- Around line 660-690: The current drop spawns browser.close() asynchronously
while aborting the handler and deleting user_data_dir immediately, risking
tearing down resources before close completes; fix by serializing close → abort
→ cleanup inside a single spawned async task: take self.browser,
self._handler_task, and self.user_data_dir, then if
tokio::runtime::Handle::try_current() is Ok spawn an async move that (1) calls
tokio::time::timeout with a short duration around browser.close().await and logs
on timeout/error, (2) after close completes (or times out) calls task.abort() on
the taken _handler_task, and (3) uses tokio::task::spawn_blocking to run
std::fs::remove_dir_all on the taken user_data_dir and log errors; remove the
separate handler.abort() and standalone spawn_blocking calls so cleanup is only
performed after close finishes or times out.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: ee54ea1e-e9a3-4b00-b589-fe4939d8bd6e
📒 Files selected for processing (2)
src/llm/routing.rssrc/tools/browser.rs
f032f20 to
26d52da
Compare
- Add "error sending request" to retriable error patterns to handle transient network failures (DNS, connection refused, etc.) - Remove broken keepalive code that tried to clone non-Clone Browser - Implement graceful browser shutdown in Drop to prevent Chrome core dumps - Add comprehensive test suite (7 tests) for error classification - Fix clippy warning about collapsible if statements Fixes WebSocket connection errors when workers fail due to transient API errors, preventing Chrome from being force-killed and dumping core.
26d52da to
3b83d6d
Compare
|
@coderabbitai review to trigger a single review. |
|
🧠 Learnings used✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/tools/browser.rs (1)
2607-2641: Consider usingstd::sync::Mutexorserial_testfor env var tests.The
unsafe { std::env::set_var(...) }blocks mutate shared process state. While the comment says "Tests run sequentially," Rust's test runner executes tests in parallel by default. If other tests readCHROME_FLAGS, this could cause flaky behavior.Options:
- Run this specific test with
cargo test -- --test-threads=1- Use the
serial_testcrate with#[serial]attribute- Save/restore the original env var value in a mutex-guarded block
This is a minor concern since the env var is unlikely to be read by other tests, but worth noting for robustness.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/tools/browser.rs` around lines 2607 - 2641, The test_chrome_flags_parsing mutates global process env with unsafe set_var calls causing potential flakiness; make it deterministic by either adding the serial_test attribute (add serial_test crate and annotate the function with #[serial]) or by guarding env changes with a Mutex and saving/restoring the original CHROME_FLAGS value: read original with std::env::var_os("CHROME_FLAGS"), set with std::env::set_var(...) (no unsafe), run assertions, then restore the original (use std::env::remove_var if it was None or std::env::set_var to restore); remove all unsafe blocks if you adopt mutex/save-restore or add #[serial] above test_chrome_flags_parsing if using serial_test.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@src/tools/browser.rs`:
- Around line 2607-2641: The test_chrome_flags_parsing mutates global process
env with unsafe set_var calls causing potential flakiness; make it deterministic
by either adding the serial_test attribute (add serial_test crate and annotate
the function with #[serial]) or by guarding env changes with a Mutex and
saving/restoring the original CHROME_FLAGS value: read original with
std::env::var_os("CHROME_FLAGS"), set with std::env::set_var(...) (no unsafe),
run assertions, then restore the original (use std::env::remove_var if it was
None or std::env::set_var to restore); remove all unsafe blocks if you adopt
mutex/save-restore or add #[serial] above test_chrome_flags_parsing if using
serial_test.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 92f607df-20f0-419b-9272-7b2dea4cb26b
📒 Files selected for processing (2)
src/llm/routing.rssrc/tools/browser.rs
Summary
Fixes browser crashes when workers fail due to transient API errors.
Changes
Error Handling
Browser Cleanup
Tests
Testing
Notes (Optional)
Regression Protection:
persistent_profilecheck prevents regression of PR feat: persistent browser sessions with configurable close policy #309 (persistent browser sessions)Implementation Details:
&&for clean resource acquisition in Drop