A simple Python web crawler that extracts text content from websites and converts it to Markdown format. The crawler stays within the same domain and doesn't follow external links.
- Extracts text content from web pages
- Converts HTML to Markdown format
- Stays within the same domain (no external links)
- Simple class-based structure
- Easy to use command-line interface
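The same-domain rule can be sketched with the standard library alone. This is a minimal illustration, not the code from `crawler.py`; the helper name `is_internal` is an assumption:

```python
from urllib.parse import urljoin, urlparse

def is_internal(start_url: str, link: str) -> bool:
    """Return True if `link` resolves to the same domain as `start_url`."""
    # Resolve relative links ("/about") against the starting page first.
    absolute = urljoin(start_url, link)
    return urlparse(absolute).netloc == urlparse(start_url).netloc
```

Relative links like `/about` pass the check because they resolve onto the starting domain, while absolute links to other hosts are rejected.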
This project uses uv for dependency management. Make sure you have uv installed.
```
# Install dependencies
uv sync
```

Run the crawler with a starting URL:

```
uv run crawler.py https://example.com
```

The crawler will:
- Start from the provided URL
- Find all internal links on the same domain
- Extract text content from each page
- Convert the content to Markdown format
- Save all content to `output.md`
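The steps above amount to a breadth-first traversal of the site. A minimal sketch, with fetching abstracted behind a `fetch` callable so the loop itself is network-free (the actual structure of `crawler.py` may differ; in the real crawler, `fetch` would wrap `requests` and BeautifulSoup):

```python
from collections import deque
from urllib.parse import urljoin, urlparse

def crawl(start_url, fetch):
    """Visit every same-domain page reachable from start_url.

    `fetch(url)` must return (text_content, list_of_links).
    Returns a dict mapping each visited URL to its extracted text.
    """
    domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue:
        url = queue.popleft()
        text, links = fetch(url)
        pages[url] = text
        for link in links:
            absolute = urljoin(url, link)
            # Stay within the starting domain and skip already-queued pages.
            if urlparse(absolute).netloc == domain and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages
```

Tracking `seen` at enqueue time, rather than at visit time, prevents the same URL from entering the queue twice when multiple pages link to it.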
The crawler creates an `output.md` file containing all extracted text content, organized by page with headers showing the original URL.
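The per-page layout might look like the sketch below; the exact header format and function names are assumptions, not necessarily what `crawler.py` emits:

```python
def page_section(url: str, markdown_body: str) -> str:
    """Format one crawled page as a Markdown section headed by its URL."""
    return f"# {url}\n\n{markdown_body}\n\n"

def write_output(pages: dict, path: str = "output.md") -> None:
    """Concatenate every page's section into a single output file."""
    with open(path, "w", encoding="utf-8") as f:
        for url, body in pages.items():
            f.write(page_section(url, body))
```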
```
# Crawl a simple website
uv run crawler.py https://httpbin.org
```

The result will be saved in `output.md` with all pages from the httpbin.org domain.
- `requests`: for fetching web pages
- `beautifulsoup4`: for parsing HTML and finding links
- `markdownify`: for converting HTML to Markdown
```
├── crawler.py       # Main crawler implementation
├── pyproject.toml   # Project configuration and dependencies
├── README.md        # This file
└── uv.lock          # Dependency lock file
```