# Python Web Crawler

A simple Python web crawler that extracts text content from websites and converts it to Markdown format. The crawler stays within the same domain and doesn't follow external links.

## Features

- Extracts text content from web pages
- Converts HTML to Markdown
- Stays within the starting domain (external links are not followed)
- Simple class-based structure
- Easy-to-use command-line interface
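The same-domain rule comes down to comparing hostnames after resolving each link against the page it was found on. A minimal sketch using only the standard library (the helper name `is_internal` is illustrative, not necessarily what `crawler.py` defines):

```python
from urllib.parse import urljoin, urlparse

def is_internal(link: str, base_url: str) -> bool:
    """True if `link` resolves to the same domain as `base_url`."""
    # Resolve relative links such as "/about" against the current page.
    resolved = urljoin(base_url, link)
    return urlparse(resolved).netloc == urlparse(base_url).netloc

print(is_internal("/about", "https://example.com/"))              # True
print(is_internal("https://other.org/", "https://example.com/"))  # False
```

Note that a strict hostname comparison treats subdomains such as `www.example.com` as external, since their hostnames differ.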

## Installation

This project uses uv for dependency management. Make sure you have uv installed.

```sh
# Install dependencies
uv sync
```

## Usage

Run the crawler with a starting URL:

```sh
uv run crawler.py https://example.com
```

The crawler will:

  1. Start from the provided URL
  2. Find all internal links on the same domain
  3. Extract text content from each page
  4. Convert the content to Markdown format
  5. Save all content to output.md
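The steps above amount to a breadth-first traversal over a queue of same-domain URLs. The sketch below uses only the standard library to stay self-contained (the real crawler uses requests and BeautifulSoup rather than `html.parser`, and these function names are illustrative); `fetch` is passed in as a callable so the loop itself works with any page source:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkParser(HTMLParser):
    """Collect href values from <a> tags (a stdlib stand-in for BeautifulSoup)."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def internal_links(html: str, base_url: str) -> list[str]:
    """Return absolute same-domain URLs found in `html`, in document order."""
    parser = LinkParser()
    parser.feed(html)
    domain = urlparse(base_url).netloc
    return [
        absolute
        for href in parser.links
        if urlparse(absolute := urljoin(base_url, href)).netloc == domain
    ]

def crawl(start_url: str, fetch, max_pages: int = 50) -> dict:
    """Breadth-first crawl; `fetch` is a callable mapping a URL to its HTML."""
    visited, queue, pages = set(), [start_url], {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        if url in visited:
            continue
        visited.add(url)
        html = fetch(url)
        pages[url] = html  # Markdown conversion would happen per page afterwards.
        queue.extend(internal_links(html, url))
    return pages
```

In the real implementation, `fetch` would be something like `lambda url: requests.get(url).text`, and each page's HTML would be run through markdownify before being written to `output.md`.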

## Output

The crawler creates an `output.md` file containing all extracted text content, organized by page, with headers showing each page's original URL.
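The exact layout depends on the implementation, but the file looks roughly like this, with each page's URL as a heading followed by its converted text:

```markdown
# https://example.com/

Welcome to Example. This domain is for use in illustrative examples...

# https://example.com/about

About page text, converted from HTML to Markdown...
```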

## Example

```sh
# Crawl a simple website
uv run crawler.py https://httpbin.org
```

The result is saved to `output.md`, containing all pages from the httpbin.org domain.

## Dependencies

- `requests`: For fetching web pages
- `beautifulsoup4`: For parsing HTML and finding links
- `markdownify`: For converting HTML to Markdown

## Project Structure

```
├── crawler.py       # Main crawler implementation
├── pyproject.toml   # Project configuration and dependencies
├── README.md        # This file
└── uv.lock          # Dependency lock file
```
