# Python Web Crawler

A simple Python web crawler that extracts text content from websites and converts it to Markdown format. The crawler stays within the same domain and doesn't follow external links.

## Features

- Extracts text content from web pages
- Converts HTML to Markdown
- Stays within the starting domain (external links are not followed)
- Simple class-based structure
- Easy-to-use command-line interface
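The same-domain rule comes down to comparing hostnames after resolving each link against the page it was found on. A minimal sketch using only the standard library (the helper name `is_internal` is illustrative, not necessarily what `crawler.py` defines):

```python
from urllib.parse import urljoin, urlparse

def is_internal(link: str, base_url: str) -> bool:
    """True if `link` resolves to the same domain as `base_url`."""
    # Resolve relative links such as "/about" against the current page.
    resolved = urljoin(base_url, link)
    return urlparse(resolved).netloc == urlparse(base_url).netloc

print(is_internal("/about", "https://example.com/"))              # True
print(is_internal("https://other.org/", "https://example.com/"))  # False
```

Note that a strict hostname comparison treats subdomains such as `www.example.com` as external, since their hostnames differ.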

## Installation

This project uses uv for dependency management. Make sure you have uv installed.

```sh
# Install dependencies
uv sync
```

## Usage

Run the crawler with a starting URL:

```sh
uv run crawler.py https://example.com
```

The crawler will:

  1. Start from the provided URL
  2. Find all internal links on the same domain
  3. Extract text content from each page
  4. Convert the content to Markdown format
  5. Save all content to output.md
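The steps above amount to a breadth-first traversal over a queue of same-domain URLs. The sketch below uses only the standard library to stay self-contained (the real crawler uses requests and BeautifulSoup rather than `html.parser`, and these function names are illustrative); `fetch` is passed in as a callable so the loop itself works with any page source:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkParser(HTMLParser):
    """Collect href values from <a> tags (a stdlib stand-in for BeautifulSoup)."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def internal_links(html: str, base_url: str) -> list[str]:
    """Return absolute same-domain URLs found in `html`, in document order."""
    parser = LinkParser()
    parser.feed(html)
    domain = urlparse(base_url).netloc
    return [
        absolute
        for href in parser.links
        if urlparse(absolute := urljoin(base_url, href)).netloc == domain
    ]

def crawl(start_url: str, fetch, max_pages: int = 50) -> dict:
    """Breadth-first crawl; `fetch` is a callable mapping a URL to its HTML."""
    visited, queue, pages = set(), [start_url], {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        if url in visited:
            continue
        visited.add(url)
        html = fetch(url)
        pages[url] = html  # Markdown conversion would happen per page afterwards.
        queue.extend(internal_links(html, url))
    return pages
```

In the real implementation, `fetch` would be something like `lambda url: requests.get(url).text`, and each page's HTML would be run through markdownify before being written to `output.md`.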

## Output

The crawler creates an `output.md` file containing all extracted text content, organized by page, with headers showing each page's original URL.
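The exact layout depends on the implementation, but the file looks roughly like this, with each page's URL as a heading followed by its converted text:

```markdown
# https://example.com/

Welcome to Example. This domain is for use in illustrative examples...

# https://example.com/about

About page text, converted from HTML to Markdown...
```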

## Example

```sh
# Crawl a simple website
uv run crawler.py https://httpbin.org
```

The result is saved to `output.md`, containing all pages from the httpbin.org domain.

## Dependencies

- `requests`: For fetching web pages
- `beautifulsoup4`: For parsing HTML and finding links
- `markdownify`: For converting HTML to Markdown

## Project Structure

```
├── crawler.py       # Main crawler implementation
├── pyproject.toml   # Project configuration and dependencies
├── README.md        # This file
└── uv.lock          # Dependency lock file
```
