lanec/wordpress-scraper

WordPress AI Scraper

Scrape any WordPress website into structured, AI-ready local files. The output is designed for corpus preparation, style analysis, retrieval-augmented generation (RAG), fine-tuning datasets, and content audits.

What It Produces

scraped-site/
  pages/          # One Markdown file per page, with YAML front matter
  posts/          # One Markdown file per post (when --include-posts is used)
  corpus.jsonl    # One JSON object per document for model pipelines
  manifest.json   # Crawl metadata, settings, shield events
  index.md        # Human-readable table of all captured URLs

Installation

Basic (HTTP-only scraping)

python3 -m venv .venv
source .venv/bin/activate
pip install -e .

Requires Python 3.10+.

With Botasaurus (JavaScript rendering + anti-detection)

pip install -e ".[botasaurus]"

If using a local botasaurus checkout:

pip install -e .
pip install -e /path/to/botasaurus

Quick Start

# Scrape via WordPress REST API (fastest, cleanest output)
wp-ai-scrape https://example.com -o scraped/example

# Include blog posts alongside pages
wp-ai-scrape https://example.com --include-posts -o scraped/example

# Use Botasaurus for JS-rendered or protected sites
wp-ai-scrape https://example.com --renderer botasaurus -o scraped/example

How It Works

The scraper uses a priority-ordered strategy:

  1. WordPress REST API (default): Fetches structured content from /wp-json/wp/v2/pages and /wp-json/wp/v2/posts. This produces the cleanest output because WordPress returns pre-structured HTML with metadata.

  2. HTML Crawl (fallback): If the REST API is unavailable (disabled, restricted, or non-WordPress site), the scraper discovers URLs via sitemaps and crawls same-domain links, extracting main content from each page.

When --renderer botasaurus is selected, the HTML crawl uses a three-tier anti-detection strategy (see below).
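The REST-first, crawl-fallback order described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `fetch` and `crawl_html` callables, the function names, and the `"html_crawl"` label are assumptions made for the example. The `per_page`/`page` query parameters and the `X-WP-TotalPages` response header are part of the standard WordPress REST API.

```python
from typing import Callable, Iterator

# Hypothetical fetcher type: URL -> (status_code, response_headers, parsed_json_list)
Fetch = Callable[[str], tuple[int, dict, list]]


def iter_rest_documents(base_url: str, kind: str, fetch: Fetch,
                        per_page: int = 100) -> Iterator[dict]:
    """Yield every item from /wp-json/wp/v2/<kind>, following pagination."""
    page = 1
    while True:
        url = (f"{base_url.rstrip('/')}/wp-json/wp/v2/{kind}"
               f"?per_page={per_page}&page={page}")
        status, headers, items = fetch(url)
        if status != 200:
            # Disabled, restricted, or not a WordPress site
            raise RuntimeError(f"REST API unavailable ({status}) for {url}")
        yield from items
        # WordPress reports the number of result pages in this header
        total_pages = int(headers.get("X-WP-TotalPages", "1"))
        if page >= total_pages:
            break
        page += 1


def collect(base_url: str, fetch: Fetch, crawl_html: Callable[[str], list]):
    """Try the REST API first; fall back to the HTML crawl if it fails."""
    try:
        return "wordpress_rest_api", list(iter_rest_documents(base_url, "pages", fetch))
    except RuntimeError:
        # "html_crawl" is an illustrative label, not a documented manifest value
        return "html_crawl", crawl_html(base_url)
```

Passing the fetcher in as a parameter keeps the sketch independent of any particular HTTP client.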

Botasaurus Integration

When the site uses JavaScript rendering or has anti-bot protections, the botasaurus renderer provides a tiered escalation approach:

Tier 1: Stealth HTTP Request

Uses botasaurus's @request decorator, which sends browser-like HTTP requests with browser-like TLS fingerprinting and Google referrer headers. No browser is launched, so this tier is fast, and it handles connection-level Cloudflare challenges. Most WordPress pages succeed here.

Tier 2: Browser with Google Referrer

Launches a real Chromium browser and navigates via a Google referrer (google_get) so the visit looks like organic traffic. After the page loads, it checks is_bot_detected(). It uses block_images=True for speed and reuse_driver=True to keep the browser session warm.

Tier 3: Browser with Cloudflare Bypass

For pages with JS challenges or Turnstile CAPTCHAs, uses bypass_cloudflare=True. This tier is triggered automatically when Tier 2 reports that the page was flagged as a bot, or it can be forced with --bypass-cloudflare.

Auto-Escalation

If any page triggers shield detection during a crawl, the scraper remembers this and automatically escalates all subsequent pages to the bypass tier. This means a site that's partially protected won't require manual intervention.
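One way to picture the tiered escalation and its "sticky" behavior is the sketch below. It is a simplified model under stated assumptions, not the code in botasaurus_renderer.py. Each tier is a hypothetical callable that returns HTML or `None` when it is blocked, and the shape of the recorded shield events is invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical tier signature: URL -> HTML, or None when a shield blocks the request
Tier = Callable[[str], Optional[str]]


@dataclass
class TieredFetcher:
    stealth_request: Tier        # Tier 1: stealth HTTP request
    browser: Tier                # Tier 2: browser with Google referrer
    browser_bypass: Tier         # Tier 3: browser with Cloudflare bypass
    force_bypass: bool = False   # mirrors --bypass-cloudflare
    shield_events: list = field(default_factory=list)  # illustrative record shape

    def fetch(self, url: str) -> Optional[str]:
        if not self.force_bypass:
            tiers = (("stealth_request", self.stealth_request),
                     ("browser", self.browser))
            for name, tier in tiers:
                html = tier(url)
                if html is not None:
                    return html
                self.shield_events.append({"url": url, "tier": name})
            # Both lower tiers were blocked: remember it for the rest of the crawl
            self.force_bypass = True
        return self.browser_bypass(url)
```

Once `force_bypass` flips to `True`, later URLs skip Tiers 1 and 2 entirely, which is the auto-escalation behavior described above.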

Batch Mode

For large crawls, --batch-size N processes N URLs per browser session without relaunching the browser between them, which cuts per-URL startup overhead:

wp-ai-scrape https://example.com \
  --renderer botasaurus \
  --batch-size 10 \
  --max-pages 500
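The grouping behind batch mode can be illustrated with a small helper. This is a sketch of the idea only, with a hypothetical function name. It treats a batch size of 0 as "one URL at a time", matching the documented default.

```python
from typing import Iterator, Sequence


def batched_urls(urls: Sequence[str], batch_size: int) -> Iterator[list[str]]:
    """Split URLs into groups that would share one browser session.

    A batch_size of 0 (or less) falls back to one URL per group.
    """
    step = batch_size if batch_size > 0 else 1
    for start in range(0, len(urls), step):
        yield list(urls[start:start + step])
```

With the command above, 500 URLs at a batch size of 10 would be split into 50 groups.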

CLI Reference

wp-ai-scrape <url> [options]

Core Options

| Option | Default | Description |
|---|---|---|
| url | (required) | WordPress site URL |
| -o, --output | scraped-site | Output directory |
| --max-pages | 200 | Maximum documents to save |
| --include-posts | off | Include blog posts alongside pages |
| --no-rest | off | Skip REST API, crawl HTML directly |
| --delay | 0.2 | Seconds between requests |
| --timeout | 20 | HTTP request timeout in seconds |
| --ignore-robots | off | Ignore robots.txt (use only with permission) |

Renderer Options

| Option | Default | Description |
|---|---|---|
| --renderer | requests | requests for plain HTTP, botasaurus for tiered anti-detection |
| --botasaurus-path | none | Path to local botasaurus checkout (if not pip-installed) |
| --render-wait | 2.0 | Seconds to wait after page navigation |

Botasaurus-Specific Options

| Option | Default | Description |
|---|---|---|
| --headless | on | Run browser without visible window |
| --no-headless | off | Show browser window (debugging) |
| --proxy | none | Proxy URL (e.g. http://user:pass@host:port) |
| --block-images | on | Skip image downloads for faster loads |
| --no-block-images | off | Allow image loading |
| --no-stealth-request | off | Skip Tier 1, go straight to browser |
| --bypass-cloudflare | off | Force Cloudflare bypass on all pages |
| --batch-size | 0 | URLs per browser session (0 = one at a time) |

Output Format

Markdown Files

Each page or post becomes a Markdown file with YAML front matter:

---
id: "page-about-a1b2c3d4"
title: "About Us"
url: "https://example.com/about/"
source_type: "page"
date: "2024-03-15T10:00:00"
modified: "2024-06-01T14:30:00"
author: "Jane Smith"
scraped_at: "2026-05-10T14:30:00+00:00"
word_count: 842
---

# About Us

Company description and content here...
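For downstream tooling, a file in this shape can be split into metadata and body with a few lines of Python. The helper below is a hypothetical sketch that only handles the flat `key: value` layout shown above, where every value happens to be a JSON-compatible scalar. For richer front matter, a real YAML parser such as PyYAML would be the safer choice.

```python
import json


def split_front_matter(text: str) -> tuple[dict, str]:
    """Return (metadata, body) for a document that starts with a '---' block."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    try:
        end = lines.index("---", 1)  # position of the closing delimiter
    except ValueError:
        return {}, text              # unterminated block: treat it all as body
    meta = {}
    for line in lines[1:end]:
        if not line.strip():
            continue
        # Split on the first colon only, so URLs and timestamps stay intact
        key, _, raw = line.partition(":")
        # Quoted strings and bare integers in the example are valid JSON scalars
        meta[key.strip()] = json.loads(raw.strip())
    body = "\n".join(lines[end + 1:]).strip()
    return meta, body
```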

corpus.jsonl

One JSON object per line, suitable for streaming into model pipelines:

{"id":"page-about-a1b2c3d4","url":"https://example.com/about/","title":"About Us","source_type":"page","text":"plain text...","markdown":"# About Us\n...","word_count":842}
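Because each line is an independent JSON object, the file can be consumed lazily without loading the whole corpus into memory. The sketch below assumes only the field names visible in the example record. The function names are illustrative.

```python
import json
from typing import Iterator, Optional


def iter_corpus(path: str) -> Iterator[dict]:
    """Yield one document dict per non-empty line of a corpus.jsonl file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def total_words(path: str, source_type: Optional[str] = None) -> int:
    """Sum word_count across documents, optionally filtered by source_type."""
    return sum(
        doc.get("word_count", 0)
        for doc in iter_corpus(path)
        if source_type is None or doc.get("source_type") == source_type
    )
```

For example, `total_words("scraped/example/corpus.jsonl", "page")` would report how many words came from pages rather than posts.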

manifest.json

Records crawl configuration and any shield events encountered:

{
  "base_url": "https://example.com",
  "scraped_at": "2026-05-10T14:30:00+00:00",
  "method": "wordpress_rest_api",
  "renderer": "botasaurus",
  "document_count": 47,
  "shield_events": []
}

Shield Detection

The scraper recognizes and gracefully handles anti-bot protections without attempting to circumvent unauthorized barriers. Detected systems include:

  • Cloudflare WAF and Turnstile
  • Datadome
  • Akamai Bot Manager
  • Imperva/Incapsula
  • Sucuri Website Firewall
  • Wordfence Security
  • Generic CAPTCHA and rate-limit pages

When a shield is detected, the URL is skipped and logged in manifest.json under shield_events. With the botasaurus renderer, the scraper first auto-escalates to the bypass tier and skips the URL only if that also fails.

Sitemap Handling

WordPress sites typically publish a sitemap index at /sitemap_index.xml that contains references to child sitemaps (e.g. page-sitemap.xml, post-sitemap.xml). The scraper handles this automatically:

  1. Tries common sitemap paths: sitemap_index.xml, sitemap.xml, page-sitemap.xml, post-sitemap.xml
  2. If it finds a sitemap index (contains <sitemap> elements), it recursively fetches each child sitemap
  3. From each child sitemap, extracts the <url><loc> entries as seed URLs for the crawl
  4. Recursion is capped at 3 levels deep to prevent infinite loops

For example, given https://www.example.com/sitemap_index.xml containing:

<sitemapindex>
  <sitemap><loc>https://www.example.com/page-sitemap.xml</loc></sitemap>
  <sitemap><loc>https://www.example.com/post-sitemap.xml</loc></sitemap>
</sitemapindex>

The scraper will fetch both child sitemaps and extract all page/post URLs from them, using those as the starting points for the crawl.
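The recursion described above can be sketched with the standard library's XML parser. This is an illustrative reconstruction rather than the scraper's own code. The `fetch` callable (URL to XML text, or `None` on failure) and the function names are assumptions. The depth cap of 3 follows the rule stated in the list.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Optional

MAX_DEPTH = 3  # recursion cap, as described above


def _local(tag: str) -> str:
    """Drop an XML namespace prefix such as '{http://...}loc'."""
    return tag.rsplit("}", 1)[-1]


def _child_locs(root: ET.Element, parent_name: str) -> list[str]:
    """Collect <loc> text from direct <sitemap> or <url> children of the root."""
    found = []
    for parent in root:
        if _local(parent.tag) != parent_name:
            continue
        for child in parent:
            if _local(child.tag) == "loc" and child.text:
                found.append(child.text.strip())
    return found


def collect_sitemap_urls(url: str, fetch: Callable[[str], Optional[str]],
                         depth: int = 0, seen: Optional[set] = None) -> list[str]:
    """Return page URLs from a sitemap, following sitemap indexes recursively."""
    seen = set() if seen is None else seen
    if depth >= MAX_DEPTH or url in seen:
        return []
    seen.add(url)
    xml_text = fetch(url)
    if xml_text is None:
        return []
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return []
    if _local(root.tag) == "sitemapindex":
        urls = []
        for child_url in _child_locs(root, "sitemap"):
            urls.extend(collect_sitemap_urls(child_url, fetch, depth + 1, seen))
        return urls
    return _child_locs(root, "url")
```

The `seen` set guards against a sitemap index that references itself, in addition to the depth cap.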

Crawl Behavior

  • Respects robots.txt by default (override with --ignore-robots)
  • Only follows same-domain links
  • Skips: wp-admin, wp-login, wp-content, wp-includes, feeds, cart/checkout, search, tag/category archives, and binary file extensions
  • Strips tracking parameters (utm_*, fbclid, gclid) from URLs
  • Deduplicates by URL and content hash
  • Images are not downloaded; alt text and captions are preserved inline
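Several of these rules are simple URL and text transformations. The sketch below shows one plausible way to express them with the standard library. The skip list is partial and illustrative, dropping the URL fragment is an added assumption, and the choice of SHA-256 over whitespace-normalized text is likewise an assumption rather than the scraper's documented hashing scheme.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_KEYS = {"fbclid", "gclid"}
# Partial, illustrative skip list based on the bullets above
SKIP_SEGMENTS = {"wp-admin", "wp-login.php", "wp-content", "wp-includes",
                 "feed", "cart", "checkout", "tag", "category"}


def normalize_url(url: str) -> str:
    """Strip utm_*, fbclid, and gclid parameters; also drops the fragment."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith("utm_") and key.lower() not in TRACKING_KEYS
    ]
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path,
                       urlencode(kept), ""))


def should_crawl(url: str, base_host: str) -> bool:
    """Allow only same-domain URLs whose path avoids the skipped sections."""
    parts = urlsplit(url)
    if parts.netloc.lower() != base_host.lower():
        return False
    segments = {segment for segment in parts.path.lower().split("/") if segment}
    return not (segments & SKIP_SEGMENTS)


def content_hash(text: str) -> str:
    """Hash whitespace-normalized text so reflowed duplicates collide."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

Keeping a set of normalized URLs and a set of content hashes then covers both deduplication checks: one catches the same page reached through different tracking links, the other catches identical content served at different URLs.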

Examples

Scrape a marketing site for brand voice analysis

wp-ai-scrape https://acme.com \
  --include-posts \
  --max-pages 100 \
  -o corpus/acme

Scrape a Cloudflare-protected WordPress site

wp-ai-scrape https://protected-site.com \
  --renderer botasaurus \
  --bypass-cloudflare \
  --no-rest \
  -o corpus/protected

Large crawl with batch optimization

wp-ai-scrape https://big-site.com \
  --renderer botasaurus \
  --batch-size 20 \
  --max-pages 1000 \
  --delay 0.5 \
  -o corpus/bigsite

Debug with visible browser

wp-ai-scrape https://example.com \
  --renderer botasaurus \
  --no-headless \
  --no-stealth-request \
  -o scraped/debug

Web Interface

A Flask-based GUI with a live dashboard is included in the webapp/ directory. It uses a two-agent parallel architecture and reports progress in real time over Server-Sent Events (SSE).

pip install -e ".[webapp]"
cd webapp
./run.sh

Then open http://localhost:8080. Features include live discovery/scraper counters, per-site timestamped output directories, a content viewer with search, and a retry button for failed pages. See webapp/README.md for details.

Project Structure

wordpress-scraper/
  wp_ai_scraper/
    __init__.py               # Package metadata
    cli.py                    # CLI entry point, scraper class, utilities
    botasaurus_renderer.py    # Tiered rendering with botasaurus
  webapp/
    app.py                    # Flask app with two-agent architecture
    templates/index.html      # Apple Liquid Glass UI
    run.sh                    # One-command launcher
    requirements.txt          # Webapp-specific deps
  pyproject.toml              # Build config and dependencies
  README.md                   # This file
  ARCHITECTURE.md             # Code design and data flow
  CONTRIBUTING.md             # Development guide
  CHANGELOG.md                # Release history

License

MIT
