fix(security): reject SSRF-smuggling URL characters in spider_tools._validate_url #1578

Merged: MervinPraison merged 1 commit into main from fix/spider-tools-ssrf-hardening on Apr 28, 2026

@MervinPraison

Owner

Summary

Hardens SpiderTools._validate_url against an SSRF bypass that exploits parser disagreement between urllib.parse.urlparse and the underlying HTTP client (requests, httpx).

Threat

A URL such as

http://127.0.0.1:6666\@1.1.1.1

parses with hostname 1.1.1.1 via urllib.parse.urlparse (so any allow/deny check that consults parsed.hostname sees a public IP) but is actually dispatched to 127.0.0.1:6666 by requests / httpx, because those clients treat the backslash differently and re-resolve the authority. The result: hostname-based SSRF guards are silently bypassed and the agent ends up issuing requests to internal services on the local host.

ASCII control characters (NUL, CR, LF, DEL, …) in the authority section can produce a similar parser disagreement and have been used in HTTP request smuggling and CRLF-injection attacks.

>>> from urllib.parse import urlparse
>>> urlparse("http://127.0.0.1:6666\\@1.1.1.1").hostname
'1.1.1.1'                       # <- what the SSRF allow-list sees
>>> import requests
>>> requests.get("http://127.0.0.1:6666\\@1.1.1.1")  # <- actual destination
# attempts to connect to 127.0.0.1:6666

Fix

Before urlparse runs, reject any URL that:

  1. is not a str,
  2. contains a backslash anywhere, or
  3. contains any ASCII control character (codepoint < 0x20 or == 0x7f).

These rejections return False early, so the existing urlparse + IP / private / metadata / internal-domain checks below them remain unchanged.

if not isinstance(url, str):
    return False
if "\\" in url or any(ord(c) < 0x20 or ord(c) == 0x7f for c in url):
    return False
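
A minimal, self-contained sketch of the guard above, in case it helps to try the checks in isolation. The helper name `has_smuggling_chars` is hypothetical and does not appear in spider_tools.py; it only mirrors the two conditions shown in the snippet.

```python
def has_smuggling_chars(url: object) -> bool:
    """Return True if `url` should be rejected before urlparse sees it.

    Hypothetical helper for illustration; it mirrors the early-reject guard.
    """
    if not isinstance(url, str):
        return True  # non-string input is rejected outright
    if "\\" in url:
        return True  # backslash anywhere in the URL
    # ASCII control characters: codepoints below 0x20, plus DEL (0x7f)
    return any(ord(c) < 0x20 or ord(c) == 0x7F for c in url)


if __name__ == "__main__":
    samples = [
        "http://127.0.0.1:6666\\@1.1.1.1",   # advisory payload
        "https://example.com/a\r\nHost: x",  # CR+LF in the URL
        "https://example.com/path?q=1",      # ordinary public URL
        None,                                # non-string input
    ]
    for sample in samples:
        verdict = "reject" if has_smuggling_chars(sample) else "continue to urlparse"
        print(repr(sample), "->", verdict)
```

Only the third sample gets past the guard; it would then go through the existing hostname checks as before.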

Tests

src/praisonai-agents/tests/unit/tools/test_spider_url_validation.py (new — 6 tests):

| Test | Asserts |
| --- | --- |
| test_rejects_backslash_smuggle_in_authority | The real-world advisory payload `http://127.0.0.1:6666\@1.1.1.1` is rejected |
| test_rejects_backslash_anywhere_in_url | Backslash in path/query also rejected |
| test_rejects_control_characters | NUL and CR+LF in URL rejected |
| test_allows_normal_public_url | https://example.com/path?q=1 still allowed (regression) |
| test_still_blocks_loopback | 127.0.0.1 and localhost still blocked (regression) |
| test_rejects_non_string_input | None and int rejected without crashing |

$ PYTHONPATH=src/praisonai-agents:src/praisonai pytest \
    src/praisonai-agents/tests/unit/tools/test_spider_url_validation.py -q
6 passed in 0.37s

AGENTS.md conformance

  • Core SDK scoped: only touches praisonaiagents.tools.spider_tools. No wrapper changes, no protocol changes.
  • Backward compatible: tightens an existing security check; only rejects URLs that were never expected to be valid (and that no legitimate caller would construct).
  • No performance impact: two O(n) scans of the URL string ahead of the existing urlparse call. Negligible cost vs. an HTTP roundtrip.
  • Fail safe: defaults to denying the suspect input.
  • Test discipline: 6 new tests covering both the bypass and the regressions.

Files changed

| File | Δ |
| --- | --- |
| src/praisonai-agents/praisonaiagents/tools/spider_tools.py | +11 (early-reject guard with comment) |
| src/praisonai-agents/tests/unit/tools/test_spider_url_validation.py | new (45 lines, 6 tests) |

…ate_url

A URL such as 'http://127.0.0.1:6666\\@1.1.1.1' parses with hostname
'1.1.1.1' via urllib.parse.urlparse but is dispatched to '127.0.0.1' by
requests/httpx. Hostname-based SSRF allow/deny checks that trust
urlparse alone can therefore be smuggled past with a backslash in the
authority section, exposing localhost services. ASCII control characters
in the URL (newline, NUL, DEL, etc.) can produce similar parser
disagreement and HTTP request smuggling.

Reject any URL containing a backslash anywhere or any ASCII control
character (codepoint < 0x20 or == 0x7f) before urlparse runs. Also
reject non-string input early.

Tests in tests/unit/tools/test_spider_url_validation.py cover:
  - the real-world advisory payload 'http://127.0.0.1:6666\\@1.1.1.1'
  - backslash anywhere in the URL
  - NUL and CR/LF control characters
  - non-string (None / int) input
  - regression: normal public URLs still allowed
  - regression: existing loopback/localhost block still fires

All 6 tests pass; the existing _validate_url contract for IP/private/
metadata/internal-domain blocking is preserved.
Copilot AI review requested due to automatic review settings April 28, 2026 13:05
@gemini-code-assist

Contributor

Warning

Gemini encountered an error creating the review. You can try again by commenting /gemini review.

@MervinPraison

Owner Author

@claude You are the Lead Engineer. If the branch is under MervinPraison/PraisonAI (not a fork), you are able to make modifications to this branch and push directly. SCOPE: Focus ONLY on Python packages (praisonaiagents, praisonai). Do NOT modify praisonai-rust or praisonai-ts. Read ALL analysis and reviews above carefully (Gemini, CodeRabbit, Qodo, Copilot, etc).

Phase 1: Review per AGENTS.md

  1. Protocol-driven: check heavy implementations vs core SDK
  2. Backward compatible: ensure zero feature regressions
  3. Performance: no hot-path regressions

Phase 2: FIX Valid Issues

  4. For any VALID bugs or architectural flaws found by Gemini, CodeRabbit, Qodo, Copilot, or any other reviewer: implement the fix
  5. Push all code fixes directly to THIS branch (do NOT create a new PR)
  6. Comment a summary of exact files modified and what you skipped

Phase 3: Final Verdict

  7. If all issues are resolved, approve the PR / close the Issue
  8. If blocking issues remain, request changes / leave clear action items

@coderabbitai

coderabbitai Bot commented Apr 28, 2026

Contributor

Caution

Review failed

An error occurred during the review process. Please try again later.


@MervinPraison

Owner Author

@copilot Do a thorough review of this PR. Read ALL existing reviewer comments above from Qodo, Coderabbit, and Gemini first — incorporate their findings.

Review areas:

  1. Bloat check: Are changes minimal and focused? Any unnecessary code or scope creep?
  2. Security: Any hardcoded secrets, unsafe eval/exec, missing input validation?
  3. Performance: Any module-level heavy imports? Hot-path regressions?
  4. Tests: Are tests included? Do they cover the changes adequately?
  5. Backward compat: Any public API changes without deprecation?
  6. Code quality: DRY violations, naming conventions, error handling?
  7. Address reviewer feedback: If Qodo, Coderabbit, or Gemini flagged valid issues, include them in your review
  8. Suggest specific improvements with code examples where possible

@greptile-apps

greptile-apps Bot commented Apr 28, 2026


Greptile Summary

This PR hardens SpiderTools._validate_url against a real SSRF bypass where urllib.parse.urlparse and requests disagree on the effective destination host when a URL contains a backslash or ASCII control characters. The fix is minimal, correctly placed before urlparse runs, and is accompanied by a thorough regression test suite covering both the bypass payload and the existing loopback/private-IP guards.

Confidence Score: 5/5

Safe to merge — the fix is correct, targeted, and well-tested with no behavioral regressions.

No P0 or P1 issues found. The backslash and control-character guards are placed correctly before urlparse, closing the stated parser-disagreement SSRF vector. All existing hostname checks are preserved. The only finding is a P2 hardening suggestion about percent-encoded backslash (%5C), which does not block merge.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/praisonai-agents/praisonaiagents/tools/spider_tools.py | Adds early-return guards in _validate_url to reject non-string inputs, backslashes, and ASCII control characters before urlparse runs, correctly closing the parser-disagreement SSRF bypass vector. |
| src/praisonai-agents/tests/unit/tools/test_spider_url_validation.py | New test file with 6 targeted regression tests covering the bypass payload, control characters, normal URLs, loopback blocking, and non-string inputs. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["_validate_url(url)"] --> B{"isinstance(url, str)?"}
    B -- No --> REJECT1["return False"]
    B -- Yes --> C{"Contains backslash OR\nASCII control char?"}
    C -- Yes --> REJECT2["return False\n(SSRF smuggling guard — NEW)"]
    C -- No --> D["urlparse(url)"]
    D --> E{"scheme in\nhttp/https?"}
    E -- No --> REJECT3["return False"]
    E -- Yes --> F{"hostname\npresent?"}
    F -- No --> REJECT4["return False"]
    F -- Yes --> G{"localhost /\nloopback?"}
    G -- Yes --> REJECT5["return False"]
    G -- No --> H{"Private /\nreserved IP?"}
    H -- Yes --> REJECT6["return False"]
    H -- No --> I{"Internal\ndomain suffix?"}
    I -- Yes --> REJECT7["return False"]
    I -- No --> J{"Metadata\nservice IP?"}
    J -- Yes --> REJECT8["return False"]
    J -- No --> ALLOW["return True"]
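
The flow above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the code in spider_tools.py: the function name `validate_url_sketch`, the suffix list `INTERNAL_SUFFIXES`, and the set `METADATA_HOSTS` are all hypothetical, and the real checks may differ in detail.

```python
import ipaddress
from urllib.parse import urlparse

# Assumed values for illustration only; the real lists may differ.
INTERNAL_SUFFIXES = (".local", ".internal")
METADATA_HOSTS = {"169.254.169.254"}


def validate_url_sketch(url: object) -> bool:
    """Follow the flowchart: early guards first, then hostname checks."""
    # Early-reject guards, before any parsing happens.
    if not isinstance(url, str):
        return False
    if "\\" in url or any(ord(c) < 0x20 or ord(c) == 0x7F for c in url):
        return False

    try:
        parsed = urlparse(url)
        host = parsed.hostname
    except ValueError:
        return False  # e.g. malformed bracketed IPv6 authority
    if parsed.scheme not in ("http", "https"):
        return False
    if not host:
        return False
    host = host.lower()

    # Treat the host as an IP literal when possible; otherwise it is a name.
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        ip = None

    if host == "localhost" or (ip is not None and ip.is_loopback):
        return False
    if ip is not None and (ip.is_private or ip.is_reserved or ip.is_link_local):
        return False
    if host.endswith(INTERNAL_SUFFIXES):
        return False
    if host in METADATA_HOSTS:
        return False
    return True


if __name__ == "__main__":
    for candidate in (
        "https://example.com/path?q=1",
        "http://127.0.0.1:6666\\@1.1.1.1",
        "http://localhost/",
        "http://10.0.0.5/",
    ):
        print(candidate, "->", validate_url_sketch(candidate))
```

Note that this sketch checks only the literal hostname; it does not resolve DNS, so a public name that resolves to a private address is outside what it covers.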

Reviews (1): Last reviewed commit: "fix(security): reject SSRF-smuggling URL..."

# parses as host ``1.1.1.1`` but is dispatched to ``127.0.0.1``).
if not isinstance(url, str):
    return False
if "\\" in url or any(ord(c) < 0x20 or ord(c) == 0x7f for c in url):

P2 Consider also rejecting percent-encoded backslash

The current guard catches a literal backslash (\, 0x5C) but not its percent-encoded form %5C. Depending on how requests decodes the authority section before resolving the connection, a URL using %5C instead of a literal backslash in the same smuggling pattern could still trigger a parser disagreement. Adding a decode-then-check pass would close that gap proactively:

import urllib.parse as _up

raw_check = _up.unquote(url)
if "\\" in raw_check or any(ord(c) < 0x20 or ord(c) == 0x7f for c in raw_check):
    return False

This is a hardening suggestion — the literal-backslash bypass is fully fixed as written.
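
To illustrate the decode-then-check idea, here is a small runnable sketch. The helper names `_has_bad_char` and `has_smuggling_chars_decoded` are hypothetical. One trade-off worth weighing, offered as an observation rather than something established in this thread: because `unquote` decodes the whole string, this variant also rejects URLs that carry `%5C` or `%0A` in a path or query, and it decodes only once, so a double-encoded `%255C` would not be flagged.

```python
from urllib.parse import unquote


def _has_bad_char(text: str) -> bool:
    """True if text contains a backslash or an ASCII control character."""
    return "\\" in text or any(ord(c) < 0x20 or ord(c) == 0x7F for c in text)


def has_smuggling_chars_decoded(url: str) -> bool:
    """Check both the raw URL and its percent-decoded form (decoded once)."""
    return _has_bad_char(url) or _has_bad_char(unquote(url))


if __name__ == "__main__":
    encoded_payload = "http://127.0.0.1:6666%5C@1.1.1.1"
    print(unquote(encoded_payload))                      # shows the decoded backslash
    print(has_smuggling_chars_decoded(encoded_payload))  # flagged by the decoded pass
    print(has_smuggling_chars_decoded("https://example.com/a%20b"))  # space is allowed
```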

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@MervinPraison merged commit 004dcfe into main on Apr 28, 2026
25 of 26 checks passed
@MervinPraison deleted the fix/spider-tools-ssrf-hardening branch on April 28, 2026 16:44
MervinPraison added a commit that referenced this pull request Apr 29, 2026
…er/push/ag2) (#1580)

* fix(tests): clear 25 pre-existing test-infrastructure failures

Targeted, test-only cleanup. **No production code changes.**

Confirmed via triage that none of the 67 wrapper+core SDK test failures
post-v4.6.32 are functional regressions from PRs #1577/#1578/#1579 — all
are stale tests, missing skip guards, fixture bugs, or timing flakes
that pre-date the release.

This commit fixes the four highest-confidence categories:

* sandbox/test_sandlock_sandbox.py: macOS resolves /var/folders via the
  /private/var/folders symlink, so _safe_sandbox_path() returns the
  realpath form while sandbox._temp_dir holds the unresolved mkdtemp
  output. Compare via os.path.realpath() so the assertion holds on both
  macOS and Linux. Implementation is correct and unchanged.

* test_profiler_advanced.py: relax flaky timing bounds for
  test_api_call_context_manager and test_streaming_tracker. time.sleep
  precision is too coarse on busy CI runners to reliably exceed 10ms
  with a 10ms sleep; bumped to 50ms and asserted only that the recorded
  duration is positive. Matches AGENTS.md guidance that tests must not
  depend on timing.

* test_push_client.py: PushClient._send checks
  self._transport.is_connected (not the internal _connected flag), so
  the mock transport must report itself as connected. The fixture set
  c._connected=True but forgot mock_transport.connected=True, causing
  every send-path test to raise ConnectionError. Single-line fixture
  fix unblocks 8 tests.

* test_ag2_adapter.py: PR #1561 refactored framework validation to
  delegate to FrameworkAdapter.is_available(), which performs a real
  ag2 import. Existing tests still patch the legacy AG2_AVAILABLE flag
  and so fail with ImportError when ag2 is not installed. Added
  pytest.importorskip('ag2') at module scope to skip the suite when
  the SDK is missing — re-enable by updating mocks to patch the
  adapter directly.

Verified locally:
- 12 sandbox + profiler tests: PASS (was 3 failing)
- 10 push_client tests: PASS (was 8 failing)
- 14 ag2 tests: SKIP when ag2 missing (was 14 failing)

Net wrapper-suite improvement: 25 fewer failures (38 -> 13).

* fix(tests): strengthen timing assertions and sandbox path validation

Addresses reviewer feedback from CodeRabbit:
- Profiler tests: Change timing assertions from >0 to >=5ms to catch unit errors
- Sandbox tests: Tighten path prefix assertion with os.sep for directory boundaries

These changes improve test reliability and catch potential regressions while
maintaining tolerance for CI timing variations.

Co-authored-by: Mervin Praison <MervinPraison@users.noreply.github.com>

---------

Co-authored-by: Cascade <cascade@windsurf.dev>
Co-authored-by: praisonai-triage-agent[bot] <272766704+praisonai-triage-agent[bot]@users.noreply.github.com>
Co-authored-by: Mervin Praison <MervinPraison@users.noreply.github.com>
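
For readers unfamiliar with the macOS symlink issue mentioned in the sandbox bullet of the commit message above, the realpath-plus-separator comparison can be sketched as follows. The helper name `is_inside` is hypothetical and is not taken from the test suite; it only shows why both sides are resolved and why `os.sep` is appended before the prefix check.

```python
import os
import tempfile


def is_inside(base_dir: str, candidate: str) -> bool:
    """True if candidate resolves to base_dir or to a path beneath it."""
    # Resolve symlinks on both sides so /var/... and /private/var/... compare equal.
    base = os.path.realpath(base_dir)
    target = os.path.realpath(candidate)
    # Appending os.sep keeps "/tmp/abc-evil" from matching the prefix "/tmp/abc".
    return target == base or target.startswith(base + os.sep)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(is_inside(tmp, os.path.join(tmp, "work", "out.txt")))  # path under tmp
        print(is_inside(tmp, tmp + "-evil"))                         # sibling directory
```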