WordPress AI Scraper

Scrape any WordPress website into structured, AI-ready local files. The output is designed for corpus preparation, style analysis, retrieval-augmented generation (RAG), fine-tuning datasets, and content audits.

What It Produces

scraped-site/
  pages/          # One Markdown file per page, with YAML front matter
  posts/          # One Markdown file per post (when --include-posts is used)
  corpus.jsonl    # One JSON object per document for model pipelines
  manifest.json   # Crawl metadata, settings, shield events
  index.md        # Human-readable table of all captured URLs

Installation

Basic (HTTP-only scraping)

python3 -m venv .venv
source .venv/bin/activate
pip install -e .

Requires Python 3.10+.

With Botasaurus (JavaScript rendering + anti-detection)

pip install -e ".[botasaurus]"

If using a local botasaurus checkout:

pip install -e .
pip install -e /path/to/botasaurus

Quick Start

# Scrape via WordPress REST API (fastest, cleanest output)
wp-ai-scrape https://example.com -o scraped/example

# Include blog posts alongside pages
wp-ai-scrape https://example.com --include-posts -o scraped/example

# Use Botasaurus for JS-rendered or protected sites
wp-ai-scrape https://example.com --renderer botasaurus -o scraped/example

How It Works

The scraper uses a priority-ordered strategy:

1. WordPress REST API (default): Fetches structured content from /wp-json/wp/v2/pages and /wp-json/wp/v2/posts. This produces the cleanest output because WordPress returns pre-structured HTML with metadata.
2. HTML Crawl (fallback): If the REST API is unavailable (disabled, restricted, or non-WordPress site), the scraper discovers URLs via sitemaps and crawls same-domain links, extracting main content from each page.

When --renderer botasaurus is selected, the HTML crawl uses a three-tier anti-detection strategy (see below).
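
The REST-first, crawl-fallback order can be sketched as follows. This is a minimal illustration, not the scraper's actual code: `rest_urls`, `choose_method`, the injected `fetch_json` callable, and the `"html_crawl"` label are hypothetical names. Only the two REST routes and the `"wordpress_rest_api"` method string come from this README.

```python
from typing import Callable, Optional

# REST collection routes named in this README.
REST_ROUTES = {"page": "/wp-json/wp/v2/pages", "post": "/wp-json/wp/v2/posts"}


def rest_urls(base_url: str, include_posts: bool = False) -> list[str]:
    """Build the REST collection URLs to try, pages first."""
    kinds = ["page", "post"] if include_posts else ["page"]
    return [base_url.rstrip("/") + REST_ROUTES[kind] for kind in kinds]


def choose_method(
    base_url: str,
    fetch_json: Callable[[str], Optional[list]],
    include_posts: bool = False,
    no_rest: bool = False,
) -> str:
    """Return "wordpress_rest_api" if the REST API answers, else "html_crawl"."""
    if not no_rest:
        for url in rest_urls(base_url, include_posts):
            items = fetch_json(url)
            # Assumption: a usable endpoint returns a non-empty JSON list.
            if isinstance(items, list) and items:
                return "wordpress_rest_api"
    # Hypothetical label for the sitemap + link crawl fallback.
    return "html_crawl"
```

Passing the fetcher in as a callable keeps the decision logic testable without network access; a real fetcher would return `None` on HTTP errors or non-JSON responses.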

Botasaurus Integration

When the site uses JavaScript rendering or has anti-bot protections, the botasaurus renderer provides a tiered escalation approach:

Tier 1: Stealth HTTP Request

Uses botasaurus's @request decorator, which sends browser-like HTTP requests with realistic TLS fingerprints and Google referrer headers. It is fast (no browser launch) and handles connection-level Cloudflare challenges. Most WordPress pages succeed at this tier.

Tier 2: Browser with Google Referrer

Launches a real Chromium browser and navigates via a Google referrer (google_get) so the visit appears as organic traffic. After the page loads, it checks is_bot_detected(). It uses block_images=True for speed and reuse_driver=True to keep the browser session warm.

Tier 3: Browser with Cloudflare Bypass

For pages with JavaScript challenges or Turnstile CAPTCHAs, this tier uses bypass_cloudflare=True. It is triggered automatically when the Tier 2 check reports a bot challenge, and it can be forced on every page with --bypass-cloudflare.

Auto-Escalation

If any page triggers shield detection during a crawl, the scraper remembers this and automatically escalates all subsequent pages to the bypass tier. This means a site that's partially protected won't require manual intervention.
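
A minimal sketch of the escalation logic, assuming each tier is a plain callable that returns HTML or `None` when blocked. `TieredFetcher` and its field names are hypothetical and do not reflect the real `botasaurus_renderer.py`. The sticky `escalated` flag models the "remembers this" behavior described above.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# A tier takes a URL and returns HTML, or None when a shield blocked it.
Tier = Callable[[str], Optional[str]]


@dataclass
class TieredFetcher:
    stealth_request: Tier    # Tier 1: stealth HTTP request
    browser: Tier            # Tier 2: browser with Google referrer
    browser_bypass: Tier     # Tier 3: browser with Cloudflare bypass
    escalated: bool = False  # sticky once any page hits a shield

    def fetch(self, url: str) -> Optional[str]:
        if self.escalated:
            # A shield was seen earlier in the crawl: go straight to Tier 3.
            tiers = [self.browser_bypass]
        else:
            tiers = [self.stealth_request, self.browser, self.browser_bypass]
        for tier in tiers:
            html = tier(url)
            if html is not None:
                return html
            # Assumption: any blocked tier counts as a shield detection.
            self.escalated = True
        return None  # every tier was blocked; the caller would log a shield event
```

In this sketch a single blocked page costs up to three attempts, while every later page costs one, which is the trade-off auto-escalation makes.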

Batch Mode

For large crawls, --batch-size N processes N URLs per browser session instead of relaunching the browser for each URL, which reduces startup overhead:

wp-ai-scrape https://example.com \
  --renderer botasaurus \
  --batch-size 10 \
  --max-pages 500
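
Conceptually, batching splits the URL queue into groups that share one browser session. The helper below is a hypothetical sketch of that grouping, not the scraper's implementation. It treats `0`, the documented default, as one URL at a time.

```python
from typing import Iterator


def batches(urls: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield URL groups; each group would share one browser session."""
    # 0 (the documented default) means one URL at a time.
    step = batch_size if batch_size > 0 else 1
    for start in range(0, len(urls), step):
        yield urls[start:start + step]
```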

CLI Reference

wp-ai-scrape <url> [options]

Core Options

| Option | Default | Description |
| --- | --- | --- |
| `url` | (required) | WordPress site URL |
| `-o, --output` | `scraped-site` | Output directory |
| `--max-pages` | `200` | Maximum documents to save |
| `--include-posts` | off | Include blog posts alongside pages |
| `--no-rest` | off | Skip REST API, crawl HTML directly |
| `--delay` | `0.2` | Seconds between requests |
| `--timeout` | `20` | HTTP request timeout in seconds |
| `--ignore-robots` | off | Ignore robots.txt (use only with permission) |

Renderer Options

| Option | Default | Description |
| --- | --- | --- |
| `--renderer` | `requests` | `requests` for plain HTTP, `botasaurus` for tiered anti-detection |
| `--botasaurus-path` | none | Path to local botasaurus checkout (if not pip-installed) |
| `--render-wait` | `2.0` | Seconds to wait after page navigation |

Botasaurus-Specific Options

| Option | Default | Description |
| --- | --- | --- |
| `--headless` | on | Run browser without visible window |
| `--no-headless` | off | Show browser window (debugging) |
| `--proxy` | none | Proxy URL (e.g. `http://user:pass@host:port`) |
| `--block-images` | on | Skip image downloads for faster loads |
| `--no-block-images` | off | Allow image loading |
| `--no-stealth-request` | off | Skip Tier 1, go straight to browser |
| `--bypass-cloudflare` | off | Force Cloudflare bypass on all pages |
| `--batch-size` | `0` | URLs per browser session (0 = one at a time) |

Output Format

Markdown Files

Each page or post becomes a Markdown file with YAML front matter:

---
id: "page-about-a1b2c3d4"
title: "About Us"
url: "https://example.com/about/"
source_type: "page"
date: "2024-03-15T10:00:00"
modified: "2024-06-01T14:30:00"
author: "Jane Smith"
scraped_at: "2026-05-10T14:30:00+00:00"
word_count: 842
---

# About Us

Company description and content here...
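
For downstream tooling, the front matter can be separated from the body with a few lines of Python. This sketch assumes the flat `key: value` layout shown in the sample above, where every value is a quoted string or a bare integer. `split_front_matter` is a hypothetical helper, and a full YAML parser would be the safer choice for anything more complex.

```python
import json


def split_front_matter(text: str) -> tuple[dict, str]:
    """Split a scraped Markdown file into (metadata, body)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    meta: dict = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Everything after the closing delimiter is the Markdown body.
            body = "\n".join(lines[i + 1:]).lstrip("\n")
            return meta, body
        if not line.strip():
            continue
        key, _, raw = line.partition(":")
        raw = raw.strip()
        try:
            # Quoted strings and bare integers in the sample are valid JSON scalars.
            meta[key.strip()] = json.loads(raw)
        except json.JSONDecodeError:
            meta[key.strip()] = raw
    return {}, text  # no closing delimiter found
```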

corpus.jsonl

One JSON object per line, suitable for streaming into model pipelines:

{"id":"page-about-a1b2c3d4","url":"https://example.com/about/","title":"About Us","source_type":"page","text":"plain text...","markdown":"# About Us\n...","word_count":842}

manifest.json

Records crawl configuration and any shield events encountered:

{
  "base_url": "https://example.com",
  "scraped_at": "2026-05-10T14:30:00+00:00",
  "method": "wordpress_rest_api",
  "renderer": "botasaurus",
  "document_count": 47,
  "shield_events": []
}

Shield Detection

The scraper recognizes and gracefully handles anti-bot protections without attempting to circumvent unauthorized barriers. Detected systems include:

Cloudflare WAF and Turnstile
Datadome
Akamai Bot Manager
Imperva/Incapsula
Sucuri Website Firewall
Wordfence Security
Generic CAPTCHA and rate-limit pages

When a shield is detected, the URL is skipped and logged in manifest.json under shield_events. With the botasaurus renderer, the scraper first escalates to the bypass tier and skips the URL only if that attempt is also blocked.
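
Shield detection of this kind is often done by looking for vendor-specific strings in the response. The sketch below illustrates the idea. The marker strings, the `detect_shield` and `record_event` helpers, and the event shape are all assumptions for illustration, not the scraper's actual detection rules. Akamai and generic CAPTCHA pages are left out because no reliable single marker is assumed for them.

```python
from typing import Optional

# Illustrative markers only (assumptions); a real detector needs a maintained list.
SHIELD_MARKERS: dict[str, tuple[str, ...]] = {
    "cloudflare": ("challenges.cloudflare.com", "cf-chl-", "just a moment..."),
    "datadome": ("captcha-delivery.com",),
    "imperva": ("_incapsula_resource",),
    "sucuri": ("sucuri website firewall",),
    "wordfence": ("generated by wordfence",),
}


def detect_shield(status: int, html: str) -> Optional[str]:
    """Return a shield name if the response looks like a block page, else None."""
    body = html.lower()
    for name, markers in SHIELD_MARKERS.items():
        if any(marker in body for marker in markers):
            return name
    if status == 429:  # HTTP 429 Too Many Requests
        return "rate_limit"
    return None


def record_event(events: list[dict], url: str, shield: str) -> None:
    """Append a skipped URL to the list written under shield_events (hypothetical shape)."""
    events.append({"url": url, "shield": shield})
```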

Sitemap Handling

WordPress sites typically publish a sitemap index at /sitemap_index.xml that contains references to child sitemaps (e.g. page-sitemap.xml, post-sitemap.xml). The scraper handles this automatically:

Tries common sitemap paths: sitemap_index.xml, sitemap.xml, page-sitemap.xml, post-sitemap.xml
If it finds a sitemap index (contains <sitemap> elements), it recursively fetches each child sitemap
From each child sitemap, extracts the <url><loc> entries as seed URLs for the crawl
Recursion is capped at 3 levels deep to prevent infinite loops

For example, given https://www.example.com/sitemap_index.xml containing:

<sitemapindex>
  <sitemap><loc>https://www.example.com/page-sitemap.xml</loc></sitemap>
  <sitemap><loc>https://www.example.com/post-sitemap.xml</loc></sitemap>
</sitemapindex>

The scraper will fetch both child sitemaps and extract all page/post URLs from them, using those as the starting points for the crawl.
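
The recursive walk described above can be sketched with the standard library's XML parser. This is an illustration under stated assumptions: `collect_urls` and the injected `fetch` callable are hypothetical, and the cap is interpreted as allowing three levels of sitemaps counting the starting one. Tags are compared by local name because real sitemaps usually declare the sitemaps.org XML namespace.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Optional

MAX_DEPTH = 3  # recursion cap from the list above


def _local(tag: str) -> str:
    """Drop the XML namespace: '{ns}loc' -> 'loc'."""
    return tag.rsplit("}", 1)[-1]


def collect_urls(
    sitemap_url: str,
    fetch: Callable[[str], Optional[str]],
    depth: int = 0,
    seen: Optional[set[str]] = None,
) -> list[str]:
    """Return page URLs from a sitemap, following sitemap indexes recursively."""
    seen = set() if seen is None else seen
    if depth >= MAX_DEPTH or sitemap_url in seen:
        return []
    seen.add(sitemap_url)
    xml_text = fetch(sitemap_url)
    if not xml_text:
        return []
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return []
    urls: list[str] = []
    for entry in root:
        loc = next(
            (c.text.strip() for c in entry if _local(c.tag) == "loc" and c.text),
            None,
        )
        if loc is None:
            continue
        if _local(entry.tag) == "sitemap":
            # Index entry: descend into the child sitemap.
            urls.extend(collect_urls(loc, fetch, depth + 1, seen))
        elif _local(entry.tag) == "url":
            urls.append(loc)
    return urls
```

The `seen` set guards against sitemaps that reference each other, so the depth cap is a second line of defense rather than the only one.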

Crawl Behavior

Respects robots.txt by default (override with --ignore-robots)
Only follows same-domain links
Skips: wp-admin, wp-login, wp-content, wp-includes, feeds, cart/checkout, search, tag/category archives, and binary file extensions
Strips tracking parameters (utm_*, fbclid, gclid) from URLs
Deduplicates by URL and content hash
Images are not downloaded; alt text and captions are preserved inline
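
Several of these rules are simple URL and hash operations. The sketch below shows one way they might look. `clean_url`, `should_crawl`, `Deduper`, and the exact skip patterns are hypothetical, and the substring matching is deliberately crude (for instance, `/feed` would also match `/feedback/`).

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_KEYS = {"fbclid", "gclid"}
# Assumed path fragments for the skip list; the real patterns may differ.
SKIP_PARTS = (
    "/wp-admin", "/wp-login", "/wp-content", "/wp-includes",
    "/feed", "/cart", "/checkout", "/tag/", "/category/",
)


def clean_url(url: str) -> str:
    """Drop utm_*, fbclid, gclid, and the fragment; lowercase the host."""
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_KEYS and not key.startswith("utm_")
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(query), "")
    )


def should_crawl(url: str, base_host: str) -> bool:
    """Same-domain check plus the skip list."""
    parts = urlsplit(url)
    if parts.netloc.lower() != base_host.lower():
        return False
    return not any(part in parts.path for part in SKIP_PARTS)


class Deduper:
    """Reject a document if its URL or its content hash was already seen."""

    def __init__(self) -> None:
        self.urls: set[str] = set()
        self.hashes: set[str] = set()

    def is_new(self, url: str, text: str) -> bool:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if url in self.urls or digest in self.hashes:
            return False
        self.urls.add(url)
        self.hashes.add(digest)
        return True
```

Hashing the extracted text rather than the raw HTML would let the content check catch pages that differ only in boilerplate, such as the same article served under two permalinks.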

Examples

Scrape a marketing site for brand voice analysis

wp-ai-scrape https://acme.com \
  --include-posts \
  --max-pages 100 \
  -o corpus/acme

Scrape a Cloudflare-protected WordPress site

wp-ai-scrape https://protected-site.com \
  --renderer botasaurus \
  --bypass-cloudflare \
  --no-rest \
  -o corpus/protected

Large crawl with batch optimization

wp-ai-scrape https://big-site.com \
  --renderer botasaurus \
  --batch-size 20 \
  --max-pages 1000 \
  --delay 0.5 \
  -o corpus/bigsite

Debug with visible browser

wp-ai-scrape https://example.com \
  --renderer botasaurus \
  --no-headless \
  --no-stealth-request \
  -o scraped/debug

Web Interface

A Flask-based GUI with a live dashboard is included in the webapp/ directory. It provides a two-agent parallel architecture with real-time Server-Sent Events (SSE) progress tracking.

pip install -e ".[webapp]"
cd webapp
./run.sh

Then open http://localhost:8080. Features include live discovery/scraper counters, per-site timestamped output directories, content viewer with search, and a retry button for failed pages. See webapp/README.md for details.

Project Structure

wordpress-scraper/
  wp_ai_scraper/
    __init__.py               # Package metadata
    cli.py                    # CLI entry point, scraper class, utilities
    botasaurus_renderer.py    # Tiered rendering with botasaurus
  webapp/
    app.py                    # Flask app with two-agent architecture
    templates/index.html      # Apple Liquid Glass UI
    run.sh                    # One-command launcher
    requirements.txt          # Webapp-specific deps
  pyproject.toml              # Build config and dependencies
  README.md                   # This file
  ARCHITECTURE.md             # Code design and data flow
  CONTRIBUTING.md             # Development guide
  CHANGELOG.md                # Release history

License

MIT
