feat(browser): proxy v2 — real fetch + lxml rewrite + cookie jar (PR 3/10)#302
Conversation
Whole-branch Opus review caught a Critical bug + 3 Important items: - cookie_jar.py (Critical): persist_response_cookies stripped no leading dot from the cookie's domain, but real Set-Cookie with Domain=github.com stores as `.github.com` per cookielib. Lookup via urlparse hostname (`github.com`, no dot) missed the row. Sites with explicit Domain= cookies (most real logins) appeared to authenticate but immediately lost the session on the next request. Strip leading dot on persist + defensively on lookup. Regression test uses real http.cookiejar.Cookie path (the existing test bypassed the bug via Cookies().set() which produces non-leading-dot domains). - store.py (Important): add_profile used bare INSERT, racing on the PRIMARY KEY constraint when ensure_default_profiles fired concurrently for a brand-new user. Switched to INSERT OR IGNORE so the bootstrap is genuinely idempotent at the storage layer. - app.py (Important): expanded the placeholder-cookie-key comment with an explicit MIGRATION HAZARD note for PR 5+. SQLCipher rejects an existing DB if the key changes, so swapping in per-user Argon2 keys without a rekey or wipe path silently corrupts existing user cookies. - proxy.py (Important): wrapped the redirect-walk in asyncio.wait_for so the documented 15s total-deadline-including-redirects is now actually enforced (was per-operation; 6 hops x 15s = 90s worst case). Per-hop timeout reduced to 5s.
|
Warning Rate limit exceeded
To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (7)
📝 WalkthroughWalkthroughThis PR adds the ChangesBrowser Proxy Feature Implementation
Sequence DiagramsequenceDiagram
participant Client
participant Proxy as Proxy Route<br/>(proxy.py)
participant Jar as Cookie Jar<br/>(cookie_jar.py)
participant Store as Browser Store<br/>(store.py)
participant Upstream as Upstream Server
participant Rewriter as HTML Rewriter<br/>(rewriter.py)
participant Injector as Head Injector<br/>(injector.py)
Client->>Proxy: GET /api/desktop/browser/proxy?url=X&profile_id=Y
Proxy->>Store: ensure_default_profiles(user_id)
Proxy->>Store: get_profile_or_404(user_id, profile_id)
Proxy->>Proxy: SSRF validation on URL
Proxy->>Jar: load_jar_for_request(user_id, profile_id, host)
Jar->>Store: list_cookies(user_id, profile_id, host)
Store-->>Jar: cookies[]
Jar-->>Proxy: httpx.Cookies jar
Proxy->>Upstream: fetch(url, cookies=jar)
Upstream-->>Proxy: response (HTML/binary)
Proxy->>Jar: persist_response_cookies(response)
Jar->>Store: set_cookie(user_id, profile_id, cookie)
Store-->>Jar: ✓
alt response is HTML
Proxy->>Rewriter: rewrite_html(html, base_url, proxy_fn)
Rewriter-->>Proxy: rewritten_html
Proxy->>Injector: inject_into_head(html, ws_url)
Injector-->>Proxy: injected_html
Proxy->>Proxy: Apply strict CSP header
end
Proxy->>Proxy: Strip set-cookie & CSP headers
Proxy-->>Client: response (HTML/binary, no cookies)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Review rate limit: 0/1 reviews remaining, refill in 36 minutes and 9 seconds.Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
🧹 Nitpick comments (2)
tests/routes/desktop_browser/test_proxy_fetch.py (1)
117-142: ⚡ Quick winThis doesn't verify persistence yet.
The test proves header stripping, but not that
Set-Cookiewas actually written to the server-side jar. A brokenpersist_response_cookies()path would still pass. Follow this with a second proxied request and assert the stored cookie is replayed upstream, or inspectapp.state.browser_cookie_storedirectly.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/routes/desktop_browser/test_proxy_fetch.py` around lines 117 - 142, The test only checks header stripping but not that persist_response_cookies actually stored cookies; after the first proxied GET to "/api/desktop/browser/proxy" (in test_persists_set_cookie_to_jar) either make a second proxied request to the same upstream URL and assert the upstream mock receives a "Cookie: session=abc123" header, or directly inspect app.state.browser_cookie_store (or the server-side cookie jar object used by persist_response_cookies) to assert it contains the "session=abc123" entry; update the test to call the proxy twice (or examine app.state.browser_cookie_store) and add the appropriate assertion so a broken persist_response_cookies path will fail.tests/routes/desktop_browser/test_proxy_shell.py (1)
107-122: ⚡ Quick winMake this assertion deterministic.
assert resp.status_code != 501still passes on unrelated regressions like a profile/bootstrap failure or a route error. Mock the upstream fetch and assert a concrete proxy outcome here, so this test only passes when the request actually reaches the fetch pipeline.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/routes/desktop_browser/test_proxy_shell.py` around lines 107 - 122, The test currently only patches DNS and asserts resp.status_code != 501 which is nondeterministic; instead mock the upstream fetch call used by the desktop_browser proxy handler so the route actually reaches the fetch pipeline and returns a predictable response. In test_valid_request_now_attempts_fetch patch the real fetch function the route calls (e.g., the module-level fetch helper or the HTTP client method used by tinyagentos.routes.desktop_browser — for example patch tinyagentos.routes.desktop_browser.fetch_url or httpx.AsyncClient.get if that’s used) to return a fixed successful Response (status 200 and known body), keep the existing getaddrinfo patch, then assert resp.status_code == 200 (and optionally assert the response body matches the mocked body) so the test only passes when the proxy forwards to the fetch pipeline.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tinyagentos/app.py`:
- Around line 317-323: The code uses a hardcoded SQLCipher key
(_placeholder_cookie_key) when constructing BrowserCookieStore, which makes all
installs share the same encryption key; instead, generate or retrieve a unique
per-installation key and pass that to BrowserCookieStore: e.g., on first run
create a strong random key (use secrets.token_bytes/ token_hex or a KDF) and
persist it in a restricted-permission local file or the OS keyring, on
subsequent runs read that stored key, and fall back to an environment variable
only for recoverability; update the initialization flow around
BrowserCookieStore.init() and app.state.browser_cookie_store to use the
retrieved/generated key rather than the hardcoded _placeholder_cookie_key.
In `@tinyagentos/routes/desktop_browser/cookie_jar.py`:
- Line 80: The current line using "if cookie.expires" treats expires==0 as falsy
and stores None; change the check to "cookie.expires is not None" when computing
expires_at so zero-expiry is preserved, and add logic in the cookie persistence
path (where BrowserCookieStore.set_cookie is called) to detect when expires_at
is in the past and call a new BrowserCookieStore.delete_cookie(user_id,
profile_id, host, path, name) instead of set_cookie; also modify
BrowserCookieStore.get_cookies to filter out expired rows (e.g. add "AND
(expires_at IS NULL OR expires_at > current_unix_time)" semantics) so expired
cookies are not returned.
- Around line 42-44: The SQL match in BrowserCookieStore.get_cookies currently
uses an exact equality (host = ?) so a cookie stored for "github.com" won't
match requests to "gist.github.com"; change the WHERE clause to include a
subdomain match instead of only equality—e.g. WHERE (host = :stored_host OR
:request_host LIKE '%' || host) so a stored host like "github.com" will match
request_host "gist.github.com" but not vice versa; keep the existing
host.lstrip(".") handling in load_jar_for_request and ensure parameter ordering
matches the new predicate in get_cookies (update BrowserCookieStore.get_cookies
in store.py and any SQL-binding code accordingly), and add/adjust tests to cover
domain-scoped cookies matching subdomains.
In `@tinyagentos/routes/desktop_browser/proxy.py`:
- Around line 114-126: The code currently calls http.get(current_url) and
accesses response.content (e.g., variable _resp) which buffers the whole
upstream body; change to streaming requests using http.stream("GET",
current_url, cookies=jar) and inspect headers from the response context
(Content-Type) before consuming the body; for non-HTML responses forward the
stream to the client in chunks (using response.aiter_bytes()/iter_bytes())
without storing into memory, and for HTML responses read into a bounded
accumulator up to a defined MAX_HTML_BYTES (reject or error if exceeded) before
performing rewrite/injection; apply the same streaming-and-bounded-read pattern
to the other occurrences that use response.content (lines referenced around
192-194 and 211-217), and keep unique symbols in mind: httpx.AsyncClient,
http.stream, current_url, load_jar_for_request, _resp, and any downstream code
that expects the full body so it can work with the bounded HTML payload only.
---
Nitpick comments:
In `@tests/routes/desktop_browser/test_proxy_fetch.py`:
- Around line 117-142: The test only checks header stripping but not that
persist_response_cookies actually stored cookies; after the first proxied GET to
"/api/desktop/browser/proxy" (in test_persists_set_cookie_to_jar) either make a
second proxied request to the same upstream URL and assert the upstream mock
receives a "Cookie: session=abc123" header, or directly inspect
app.state.browser_cookie_store (or the server-side cookie jar object used by
persist_response_cookies) to assert it contains the "session=abc123" entry;
update the test to call the proxy twice (or examine
app.state.browser_cookie_store) and add the appropriate assertion so a broken
persist_response_cookies path will fail.
In `@tests/routes/desktop_browser/test_proxy_shell.py`:
- Around line 107-122: The test currently only patches DNS and asserts
resp.status_code != 501 which is nondeterministic; instead mock the upstream
fetch call used by the desktop_browser proxy handler so the route actually
reaches the fetch pipeline and returns a predictable response. In
test_valid_request_now_attempts_fetch patch the real fetch function the route
calls (e.g., the module-level fetch helper or the HTTP client method used by
tinyagentos.routes.desktop_browser — for example patch
tinyagentos.routes.desktop_browser.fetch_url or httpx.AsyncClient.get if that’s
used) to return a fixed successful Response (status 200 and known body), keep
the existing getaddrinfo patch, then assert resp.status_code == 200 (and
optionally assert the response body matches the mocked body) so the test only
passes when the proxy forwards to the fetch pipeline.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 13054bc1-f2cd-4c72-b2b2-0bb8f4fb3bfe
📒 Files selected for processing (17)
pyproject.tomltests/conftest.pytests/routes/desktop_browser/test_cookie_jar.pytests/routes/desktop_browser/test_injector.pytests/routes/desktop_browser/test_profile.pytests/routes/desktop_browser/test_proxy_fetch.pytests/routes/desktop_browser/test_proxy_shell.pytests/routes/desktop_browser/test_rewriter.pytests/routes/desktop_browser/test_store_tenancy.pytinyagentos/app.pytinyagentos/routes/desktop_browser/cookie_jar.pytinyagentos/routes/desktop_browser/copilot.jstinyagentos/routes/desktop_browser/injector.pytinyagentos/routes/desktop_browser/profile.pytinyagentos/routes/desktop_browser/proxy.pytinyagentos/routes/desktop_browser/rewriter.pytinyagentos/routes/desktop_browser/store.py
| rows = await cookie_store.get_cookies( | ||
| user_id=user_id, profile_id=profile_id, host=host.lstrip("."), | ||
| ) |
There was a problem hiding this comment.
Domain-scoped cookies won't be sent to subdomains — breaks auth for many real sites.
get_cookies uses an exact SQL WHERE host = ? match. A cookie stored for github.com (after stripping the leading dot from .github.com) will never be found when the proxy fetches gist.github.com or api.github.com. Per RFC 6265, Domain=github.com must apply to all subdomains, which is standard practice for auth cookies on virtually every major site.
The fix belongs in BrowserCookieStore.get_cookies (store.py), but since load_jar_for_request is the call site establishing this pattern, tracking the gap here:
🐛 Proposed fix in `BrowserCookieStore.get_cookies`
- "WHERE user_id = ? AND profile_id = ? AND host = ?",
- (user_id, profile_id, host),
+ "WHERE user_id = ? AND profile_id = ? AND (host = ? OR ? LIKE '%.' || host)",
+ (user_id, profile_id, host, host),'gist.github.com' LIKE '%.github.com' → True, so a stored github.com cookie is correctly included in requests to any subdomain. The reverse is not matched (a gist.github.com cookie is never sent to github.com), which is correct per spec.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tinyagentos/routes/desktop_browser/cookie_jar.py` around lines 42 - 44, The
SQL match in BrowserCookieStore.get_cookies currently uses an exact equality
(host = ?) so a cookie stored for "github.com" won't match requests to
"gist.github.com"; change the WHERE clause to include a subdomain match instead
of only equality—e.g. WHERE (host = :stored_host OR :request_host LIKE '%' ||
host) so a stored host like "github.com" will match request_host
"gist.github.com" but not vice versa; keep the existing host.lstrip(".")
handling in load_jar_for_request and ensure parameter ordering matches the new
predicate in get_cookies (update BrowserCookieStore.get_cookies in store.py and
any SQL-binding code accordingly), and add/adjust tests to cover domain-scoped
cookies matching subdomains.
| async with httpx.AsyncClient( | ||
| follow_redirects=False, timeout=_HOP_TIMEOUT, | ||
| ) as http: | ||
| for hop in range(_MAX_REDIRECTS + 1): | ||
| host = urlparse(current_url).hostname or "" | ||
|
|
||
| jar = await load_jar_for_request( | ||
| cookie_store, user_id=user_id, profile_id=profile_id, host=host, | ||
| ) | ||
|
|
||
| try: | ||
| _resp = await http.get(current_url, cookies=jar) | ||
| except httpx.HTTPError as e: |
There was a problem hiding this comment.
Avoid buffering arbitrary upstream bodies in memory.
http.get() reads the full upstream response before we know whether it's HTML, and both branches then use response.content. For a user-supplied proxy URL, a large download or media asset becomes an unbounded memory spike on the server. Non-HTML responses should be streamed, and HTML responses should have a size cap before rewrite/injection.
Also applies to: 192-194, 211-217
🧰 Tools
🪛 Ruff (0.15.12)
[warning] 117-117: Loop control variable hop not used within loop body
Rename unused hop to _hop
(B007)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tinyagentos/routes/desktop_browser/proxy.py` around lines 114 - 126, The code
currently calls http.get(current_url) and accesses response.content (e.g.,
variable _resp) which buffers the whole upstream body; change to streaming
requests using http.stream("GET", current_url, cookies=jar) and inspect headers
from the response context (Content-Type) before consuming the body; for non-HTML
responses forward the stream to the client in chunks (using
response.aiter_bytes()/iter_bytes()) without storing into memory, and for HTML
responses read into a bounded accumulator up to a defined MAX_HTML_BYTES (reject
or error if exceeded) before performing rewrite/injection; apply the same
streaming-and-bounded-read pattern to the other occurrences that use
response.content (lines referenced around 192-194 and 211-217), and keep unique
symbols in mind: httpx.AsyncClient, http.stream, current_url,
load_jar_for_request, _resp, and any downstream code that expects the full body
so it can work with the bounded HTML payload only.
| jar = await load_jar_for_request( | ||
| cookie_store, user_id=user_id, profile_id=profile_id, host=host, | ||
| ) | ||
|
|
||
| try: | ||
| _resp = await http.get(current_url, cookies=jar) | ||
| except httpx.HTTPError as e: | ||
| _logger.info("browser proxy fetch error: err=%s", e) | ||
| return None | ||
|
|
||
| # Persist any cookies set by this hop | ||
| await persist_response_cookies( | ||
| cookie_store, _resp.cookies, | ||
| user_id=user_id, profile_id=profile_id, | ||
| ) |
There was a problem hiding this comment.
Host-only cookie loading loses real browser cookie semantics.
This call path only loads cookies by exact host, but the current jar helpers persist .example.com as example.com and then look up by exact host. That means cookies scoped to a parent domain will not survive redirects across subdomains, and the loader has no scheme input to enforce secure-only replay rules on later http:// hops. Common login flows like www.example.com -> app.example.com will break, and scheme downgrades can replay cookies too broadly. Please move this integration to full cookie matching semantics before merge. Cross-file evidence: tinyagentos/routes/desktop_browser/cookie_jar.py:29-54 and tinyagentos/routes/desktop_browser/cookie_jar.py:57-82.
… cap Five Major findings from CodeRabbit's PR 302 review: - app.py: replace universal "0"*64 SQLCipher key with a per-install random key persisted to data_dir/browser_cookie_key.hex (mode 0o600) generated via secrets.token_hex(32). Override via env var TAOS_BROWSER_COOKIE_KEY_HEX for recovery. Removes the at-rest protection regression where every install shared the same key. - store.py: get_cookies now matches subdomains per RFC 6265 ((host = ? OR ? LIKE '%.' || host)), so Set-Cookie: Domain=github.com reaches gist.github.com / api.github.com etc. — fixes auth on every major site. Also adds expiry filter so expired rows aren't returned. - store.py: new delete_cookie method for explicit cookie removal. - cookie_jar.py: `if cookie.expires` → `cookie.expires is not None` so server-sent expires=0 (cookie deletion) is preserved instead of being silently treated as a session cookie. When persisting a past-dated cookie, route to delete_cookie instead of set_cookie so logout actually invalidates the session. - proxy.py: 10 MB response size cap before HTML rewrite / non-HTML pass-through. Without it, a user-supplied URL pointing at a large download would OOM the proxy. +6 regression tests covering subdomain cascade (both directions), expired-cookie filtering, explicit delete_cookie, server-driven cookie deletion via past expiry, and the response size cap.
Third of ten PRs implementing BrowserApp v2 per the design doc.
First user-visible deliverable of BrowserApp v2. Replaces PR 2's 501 stub with the real proxy fetch pipeline — a logged-in user can hit a URL via
/api/desktop/browser/proxyand get back the rewritten HTML with their cookies attached and the copilot script injected.Summary
profile.py— Profile CRUD + default Personal/Work bootstrap.ensure_default_profilesis genuinely idempotent (usesINSERT OR IGNORE); safe to call on every request.cookie_jar.py— httpx adapter wrapping PR 1'sBrowserCookieStore. Per-request load/persist pattern bridges sync httpx + async SQLCipher store. Multi-user isolation enforced. Strips cookielib's leading-dot domain convention soSet-Cookie: Domain=github.com(stored as.github.com) is retrievable on the next request viaurlparse(...).hostname(no dot).rewriter.py— lxml-based DOM rewriter. Handleshref/src/actionon every tag,srcset(preserving descriptors), inlinestyle="background-image: url(...)",<style>tag contents, and<meta http-equiv="refresh" content="N;url=...">. Skipsdata:/javascript:/mailto:/tel:/blob:/about:/#schemes.injector.py— Inserts<script src="/__taos/copilot.js">and<meta name="taos-copilot-ws" content="...">into<head>. Idempotent.copilot.js— Stub asset. Full RPC bridge + DOM extraction + annotation rendering land in PR 6.proxy.py— Full orchestrator: auth → profile bootstrap → SSRF → cookie jar load → httpx fetch (manual redirect walk with per-hop SSRF re-check) → rewrite + inject + CSP for HTML / pass-through for non-HTML → cookie persist → response. 15s total deadline enforced viaasyncio.wait_for(per-hop budget 5s)./__taos/copilot.js— Static FileResponse route serving the copilot script.app.py—BrowserStore+BrowserCookieStorewired onapp.statein lifespan, with placeholder cookie key. Migration hazard for PR 5+ documented inline (SQLCipher rejects DB if key changes — PR 5 must rekey or wipe).What this does not land
Test Plan
pytest tests/routes/desktop_browser/test_profile.py -v— 6 testspytest tests/routes/desktop_browser/test_cookie_jar.py -v— 6 tests (incl. regression test for the leading-dot domain bug using realhttp.cookiejar.Cookieparsing)pytest tests/routes/desktop_browser/test_rewriter.py -v— 16 cases (parametrized; covers href/src/srcset/style/meta-refresh/scheme-skipping)pytest tests/routes/desktop_browser/test_injector.py -v— 5 testspytest tests/routes/desktop_browser/test_proxy_fetch.py -v— 7 integration tests (respx-mocked) covering HTML rewriting, copilot injection, CSP application, non-HTML pass-through, cookie persistence, redirect SSRF re-check, static asset servepytest tests/routes/desktop_browser/— 123 total, all greenpytest tests/routes/ tests/test_secrets.py— 205 tests pass, no regressions outside scopeSpec
docs/superpowers/specs/2026-05-03-browser-app-v2-design.md§3.2 (data flow), §4.3 (proxy v2), §9 (security model).Cumulative shipping arc:
Notes for reviewers
"0" * 64) — auth integration isn't there yet. Store is genuinely encrypted; only the key derivation is deferred. PR 5+ wires it to the login password. Migration hazard documented inline in app.py so PR 5+ author doesn't ship a silent data-loss bug.httpx.AsyncClient(follow_redirects=False)+ a hand loop with SSRF re-check on every Location header. Critical for security: a redirect to a private IP must be blocked even if the initial URL was public. Testtest_redirect_target_revalidated_against_ssrfcovers this.asyncio.wait_for— a malicious chain can't hold the proxy for 6 × 15s = 90s.Set-Cookie: Domain=github.compreviously stored as.github.com(leading dot, cookielib convention) but lookup usedurlparse(...).hostname(no dot) → exact-equality miss → user appeared logged in but lost the session on the next request. Strip leading dot on persist (and defensively on lookup). New regression test uses realhttp.cookiejar.Cookiepath.Summary by CodeRabbit
New Features
Tests
Dependencies