Scrape any WordPress website into structured, AI-ready local files. The output is designed for corpus preparation, style analysis, retrieval-augmented generation (RAG), fine-tuning datasets, and content audits.
scraped-site/
  pages/          # One Markdown file per page, with YAML front matter
  posts/          # One Markdown file per post (when --include-posts is used)
  corpus.jsonl    # One JSON object per document for model pipelines
  manifest.json   # Crawl metadata, settings, shield events
  index.md        # Human-readable table of all captured URLs
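
As a rough illustration of how the corpus.jsonl file in the layout above might be consumed, the sketch below streams it one record at a time. The function names iter_corpus and total_words are hypothetical, and the word_count field is assumed from the sample record shown in this README.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Iterator


def iter_corpus(path: str | Path) -> Iterator[dict]:
    """Yield one document dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)


def total_words(path: str | Path) -> int:
    """Sum the word_count field (assumed name), treating a missing field as 0."""
    return sum(doc.get("word_count", 0) for doc in iter_corpus(path))
```

Reading line by line keeps memory use flat for large crawls, which is the usual reason to prefer JSONL over a single JSON array.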
python3 -m venv .venv
source .venv/bin/activate
pip install -e .

Requires Python 3.10+.
pip install -e ".[botasaurus]"

If using a local botasaurus checkout:
pip install -e .
pip install -e /path/to/botasaurus

# Scrape via WordPress REST API (fastest, cleanest output)
wp-ai-scrape https://example.com -o scraped/example
# Include blog posts alongside pages
wp-ai-scrape https://example.com --include-posts -o scraped/example
# Use Botasaurus for JS-rendered or protected sites
wp-ai-scrape https://example.com --renderer botasaurus -o scraped/example

The scraper uses a priority-ordered strategy:
- WordPress REST API (default): Fetches structured content from /wp-json/wp/v2/pages and /wp-json/wp/v2/posts. This produces the cleanest output because WordPress returns pre-structured HTML with metadata.
- HTML Crawl (fallback): If the REST API is unavailable (disabled, restricted, or non-WordPress site), the scraper discovers URLs via sitemaps and crawls same-domain links, extracting main content from each page.
When --renderer botasaurus is selected, the HTML crawl uses a three-tier anti-detection strategy (see below).
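
The REST-first strategy with a crawl fallback can be sketched roughly as follows. This is not the scraper's actual code: the names http_fetch, iter_rest_items, and collect are hypothetical, as is the "html_crawl" label. The sketch relies on two documented WordPress REST API behaviors: per_page is capped at 100, and the X-WP-TotalPages response header reports the page count.

```python
from __future__ import annotations

import json
import urllib.error
import urllib.parse
import urllib.request
from typing import Callable, Iterator

# Returns (items, total_pages) for one REST page. Injectable so the
# pagination loop can be exercised without network access.
Fetcher = Callable[[str], tuple[list, int]]


def http_fetch(url: str, timeout: float = 20.0) -> tuple[list, int]:
    """Fetch one REST page and read the total page count from the headers."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        total = int(resp.headers.get("X-WP-TotalPages", "1"))
        return json.load(resp), total


def iter_rest_items(base_url: str, kind: str = "pages",
                    fetch: Fetcher = http_fetch,
                    per_page: int = 100) -> Iterator[dict]:
    """Walk /wp-json/wp/v2/<kind> page by page until the last page."""
    endpoint = base_url.rstrip("/") + f"/wp-json/wp/v2/{kind}"
    page, total = 1, 1
    while page <= total:
        query = urllib.parse.urlencode({"per_page": per_page, "page": page})
        items, total = fetch(f"{endpoint}?{query}")
        yield from items
        page += 1


def collect(base_url: str, fetch: Fetcher = http_fetch) -> tuple[str, list]:
    """Try the REST API first; signal a fallback to HTML crawling on failure."""
    try:
        return "wordpress_rest_api", list(iter_rest_items(base_url, fetch=fetch))
    except (urllib.error.URLError, ValueError):
        # HTTP errors, connection failures, or a non-JSON response.
        return "html_crawl", []  # hypothetical label; the crawl itself is not shown
```

Passing the fetcher as a parameter is a design choice made for this sketch, so the loop can be tested offline.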
When the site uses JavaScript rendering or has anti-bot protections, the botasaurus renderer provides a tiered escalation approach:
Tier 1 (stealth request): Uses botasaurus's @request decorator, which sends browser-like HTTP requests with proper TLS fingerprinting and Google referrer headers. This is fast (no browser launch) and handles connection-level Cloudflare challenges. Most WordPress pages succeed here.
Tier 2 (browser): Launches a real Chromium browser and navigates via a Google referrer (google_get) to appear as organic traffic. It checks is_bot_detected() after page load, and uses block_images=True for speed and reuse_driver=True to keep the browser session warm.
Tier 3 (Cloudflare bypass): For pages with JS challenges or Turnstile CAPTCHAs, uses bypass_cloudflare=True. This tier is triggered automatically when Tier 2 detects bot detection, or can be forced with --bypass-cloudflare.
If any page triggers shield detection during a crawl, the scraper remembers this and automatically escalates all subsequent pages to the bypass tier. This means a site that's partially protected won't require manual intervention.
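
The escalation and its "sticky" memory can be illustrated with a small, library-free sketch. It does not call botasaurus: each tier is a plain callable, and TieredFetcher and shield_seen are hypothetical names. As a simplification, a None return from any tier is treated as a shield detection.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Optional

# A tier takes a URL and returns HTML, or None when a shield blocked it.
Tier = Callable[[str], Optional[str]]


@dataclass
class TieredFetcher:
    stealth_request: Tier  # Tier 1: browser-like HTTP request
    browser: Tier          # Tier 2: real browser
    bypass: Tier           # Tier 3: Cloudflare bypass
    shield_seen: bool = False
    log: list = field(default_factory=list)

    def fetch(self, url: str) -> Optional[str]:
        """Try tiers in order; after any shield, go straight to bypass."""
        if self.shield_seen:
            tiers = [("bypass", self.bypass)]
        else:
            tiers = [
                ("stealth_request", self.stealth_request),
                ("browser", self.browser),
                ("bypass", self.bypass),
            ]
        for name, tier in tiers:
            html = tier(url)
            self.log.append((url, name, html is not None))
            if html is not None:
                return html
            # Remember the shield so later pages skip the cheaper tiers.
            self.shield_seen = True
        return None  # every tier was blocked; the caller can record the event
```

Once shield_seen is set, every later call costs a single bypass attempt instead of three, which matches the intent that a partially protected site needs no manual intervention.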
For large crawls, --batch-size N processes N URLs per browser session without relaunching, significantly reducing overhead:
wp-ai-scrape https://example.com \
--renderer botasaurus \
--batch-size 10 \
--max-pages 500

wp-ai-scrape <url> [options]
| Option | Default | Description |
|---|---|---|
| url | (required) | WordPress site URL |
| -o, --output | scraped-site | Output directory |
| --max-pages | 200 | Maximum documents to save |
| --include-posts | off | Include blog posts alongside pages |
| --no-rest | off | Skip REST API, crawl HTML directly |
| --delay | 0.2 | Seconds between requests |
| --timeout | 20 | HTTP request timeout in seconds |
| --ignore-robots | off | Ignore robots.txt (use only with permission) |

| Option | Default | Description |
|---|---|---|
| --renderer | requests | requests for plain HTTP, botasaurus for tiered anti-detection |
| --botasaurus-path | none | Path to local botasaurus checkout (if not pip-installed) |
| --render-wait | 2.0 | Seconds to wait after page navigation |

| Option | Default | Description |
|---|---|---|
| --headless | on | Run browser without visible window |
| --no-headless | off | Show browser window (debugging) |
| --proxy | none | Proxy URL (e.g. http://user:pass@host:port) |
| --block-images | on | Skip image downloads for faster loads |
| --no-block-images | off | Allow image loading |
| --no-stealth-request | off | Skip Tier 1, go straight to browser |
| --bypass-cloudflare | off | Force Cloudflare bypass on all pages |
| --batch-size | 0 | URLs per browser session (0 = one at a time) |

Each page or post becomes a Markdown file with YAML front matter:
---
id: "page-about-a1b2c3d4"
title: "About Us"
url: "https://example.com/about/"
source_type: "page"
date: "2024-03-15T10:00:00"
modified: "2024-06-01T14:30:00"
author: "Jane Smith"
scraped_at: "2026-05-10T14:30:00+00:00"
word_count: 842
---
# About Us
Company description and content here...

corpus.jsonl holds one JSON object per line, suitable for streaming into model pipelines:
{"id":"page-about-a1b2c3d4","url":"https://example.com/about/","title":"About Us","source_type":"page","text":"plain text...","markdown":"# About Us\n...","word_count":842}

manifest.json records the crawl configuration and any shield events encountered:
{
"base_url": "https://example.com",
"scraped_at": "2026-05-10T14:30:00+00:00",
"method": "wordpress_rest_api",
"renderer": "botasaurus",
"document_count": 47,
"shield_events": []
}

The scraper recognizes and gracefully handles anti-bot protections without attempting to circumvent unauthorized barriers. Detected systems include:
- Cloudflare WAF and Turnstile
- Datadome
- Akamai Bot Manager
- Imperva/Incapsula
- Sucuri Website Firewall
- Wordfence Security
- Generic CAPTCHA and rate-limit pages
When a shield is detected, the URL is skipped and logged in manifest.json under shield_events. With the botasaurus renderer, the scraper will auto-escalate to the bypass tier before giving up.
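
After a crawl, the logged shield events can be inspected with a few lines of Python. This is a minimal sketch that assumes only the top-level manifest.json keys shown in this README. The structure of individual shield_events entries is not documented here, so the sketch only counts them. The function name summarize_manifest is hypothetical.

```python
import json
from pathlib import Path


def summarize_manifest(output_dir: str) -> dict:
    """Report the crawl method, document count, and number of shield events."""
    manifest_path = Path(output_dir) / "manifest.json"
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    events = manifest.get("shield_events", [])
    return {
        "method": manifest.get("method"),
        "document_count": manifest.get("document_count", 0),
        # Entries are treated as opaque; only their count is used.
        "shield_event_count": len(events),
    }
```

A non-zero shield_event_count suggests that some URLs were skipped and that a rerun with the botasaurus renderer may be worth trying.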
WordPress sites typically publish a sitemap index at /sitemap_index.xml that contains references to child sitemaps (e.g. page-sitemap.xml, post-sitemap.xml). The scraper handles this automatically:
- Tries common sitemap paths: sitemap_index.xml, sitemap.xml, page-sitemap.xml, post-sitemap.xml
- If it finds a sitemap index (contains <sitemap> elements), it recursively fetches each child sitemap
- From each child sitemap, extracts the <url><loc> entries as seed URLs for the crawl
- Recursion is capped at 3 levels deep to prevent infinite loops
For example, given https://www.example.com/sitemap_index.xml containing:
<sitemapindex>
<sitemap><loc>https://www.example.com/page-sitemap.xml</loc></sitemap>
<sitemap><loc>https://www.example.com/post-sitemap.xml</loc></sitemap>
</sitemapindex>

The scraper will fetch both child sitemaps and extract all page/post URLs from them, using those as the starting points for the crawl.
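
The recursive sitemap handling can be approximated with the standard library alone. In this sketch, collect_urls and the injectable fetch callable are hypothetical names. Reading "3 levels deep" as depths 0 through 2 is an assumption, and the seen set is an extra guard against self-referencing indexes.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET
from typing import Callable, Optional

MAX_DEPTH = 3  # assumed reading of "capped at 3 levels deep"


def _local(tag: str) -> str:
    """Drop an XML namespace prefix such as {http://...}loc."""
    return tag.rsplit("}", 1)[-1]


def collect_urls(sitemap_url: str,
                 fetch: Callable[[str], Optional[str]],
                 depth: int = 0,
                 seen: Optional[set] = None) -> list[str]:
    """Return page URLs from a sitemap, following sitemap indexes."""
    seen = set() if seen is None else seen
    if depth >= MAX_DEPTH or sitemap_url in seen:
        return []
    seen.add(sitemap_url)
    body = fetch(sitemap_url)  # fetch returns XML text, or None if unavailable
    if body is None:
        return []
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return []
    urls: list[str] = []
    for child in root:
        loc = next((el.text.strip() for el in child
                    if _local(el.tag) == "loc" and el.text), None)
        if loc is None:
            continue
        kind = _local(child.tag)
        if kind == "sitemap":  # index entry: recurse into the child sitemap
            urls.extend(collect_urls(loc, fetch, depth + 1, seen))
        elif kind == "url":    # regular entry: a seed URL for the crawl
            urls.append(loc)
    return urls
```

Stripping the namespace lets the same code handle both the bare example above and sitemaps that declare the sitemaps.org namespace.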
- Respects robots.txt by default (override with --ignore-robots)
- Only follows same-domain links
- Skips: wp-admin, wp-login, wp-content, wp-includes, feeds, cart/checkout, search, tag/category archives, and binary file extensions
- Strips tracking parameters (utm_*, fbclid, gclid) from URLs
- Deduplicates by URL and content hash
- Images are not downloaded; alt text and captions are preserved inline
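
A few of the rules above (tracking-parameter stripping, the same-domain check, path skipping, and deduplication) can be sketched as follows. This covers only a subset: search pages and binary file extensions are left out. All names here (normalize_url, should_crawl, Deduper, SKIP_SEGMENTS) are hypothetical, and the choice of SHA-256 for the content hash is an assumption.

```python
from __future__ import annotations

import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_KEYS = {"fbclid", "gclid"}
# Assumed path segments for the skip rules listed above.
SKIP_SEGMENTS = {"wp-admin", "wp-content", "wp-includes", "feed",
                 "cart", "checkout", "tag", "category"}


def normalize_url(url: str) -> str:
    """Remove utm_*, fbclid, and gclid query parameters; keep the rest."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if not k.lower().startswith("utm_") and k.lower() not in TRACKING_KEYS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(query), parts.fragment))


def should_crawl(url: str, base_url: str) -> bool:
    """Allow only same-domain URLs outside the skipped WordPress paths."""
    parts, base = urlsplit(url), urlsplit(base_url)
    if parts.netloc.lower() != base.netloc.lower():
        return False
    segments = [s for s in parts.path.lower().split("/") if s]
    return not any(s in SKIP_SEGMENTS or s.startswith("wp-login")
                   for s in segments)


class Deduper:
    """Reject a document if its URL or its content hash was already seen."""

    def __init__(self) -> None:
        self.urls: set[str] = set()
        self.hashes: set[str] = set()

    def is_new(self, url: str, text: str) -> bool:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if url in self.urls or digest in self.hashes:
            return False
        self.urls.add(url)
        self.hashes.add(digest)
        return True
```

Normalizing before deduplicating matters: two links that differ only in utm_* parameters then collapse to the same URL key.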
wp-ai-scrape https://acme.com \
--include-posts \
--max-pages 100 \
-o corpus/acme

wp-ai-scrape https://protected-site.com \
--renderer botasaurus \
--bypass-cloudflare \
--no-rest \
-o corpus/protected

wp-ai-scrape https://big-site.com \
--renderer botasaurus \
--batch-size 20 \
--max-pages 1000 \
--delay 0.5 \
-o corpus/bigsite

wp-ai-scrape https://example.com \
--renderer botasaurus \
--no-headless \
--no-stealth-request \
-o scraped/debug

A Flask-based GUI with a live dashboard is included in the webapp/ directory. It provides a two-agent parallel architecture with real-time Server-Sent Events (SSE) progress tracking.
pip install -e ".[webapp]"
cd webapp
./run.sh

Then open http://localhost:8080. Features include live discovery/scraper counters, per-site timestamped output directories, a content viewer with search, and a retry button for failed pages. See webapp/README.md for details.
wordpress-scraper/
  wp_ai_scraper/
    __init__.py              # Package metadata
    cli.py                   # CLI entry point, scraper class, utilities
    botasaurus_renderer.py   # Tiered rendering with botasaurus
  webapp/
    app.py                   # Flask app with two-agent architecture
    templates/index.html     # Apple Liquid Glass UI
    run.sh                   # One-command launcher
    requirements.txt         # Webapp-specific deps
  pyproject.toml             # Build config and dependencies
  README.md                  # This file
  ARCHITECTURE.md            # Code design and data flow
  CONTRIBUTING.md            # Development guide
  CHANGELOG.md               # Release history
MIT