Web Content Processor is an integrated tool for web scraping, link extraction, and content postprocessing. It provides a user-friendly interface for various web content management tasks.
- Web Scraping: Extract content from websites with advanced scrolling and JavaScript handling.
- Link Extraction: Retrieve internal links from web pages.
- Content Postprocessing: Convert scraped content to Markdown format.
- User-friendly Interface: Easy-to-use Gradio-based web interface.
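To illustrate the postprocessing step (converting scraped content to Markdown), here is a minimal, stdlib-only sketch. It is an assumption for illustration: the project's actual `postprocessor.py` may use a different approach, and the `MarkdownConverter`/`to_markdown` names are hypothetical.

```python
from html.parser import HTMLParser

class MarkdownConverter(HTMLParser):
    """Minimal HTML-to-Markdown sketch: handles h1/h2 headings,
    paragraphs, and links. Hypothetical, not the project's converter."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.parts.append("# ")
        elif tag == "h2":
            self.parts.append("## ")
        elif tag == "a":
            self._href = dict(attrs).get("href", "")
            self.parts.append("[")

    def handle_endtag(self, tag):
        if tag in ("h1", "h2", "p"):
            self.parts.append("\n\n")  # blank line after block elements
        elif tag == "a":
            self.parts.append(f"]({self._href})")
            self._href = None

    def handle_data(self, data):
        self.parts.append(data)

def to_markdown(html: str) -> str:
    converter = MarkdownConverter()
    converter.feed(html)
    return "".join(converter.parts).strip()
```

A real converter would need to cover many more tags (lists, code, images); this sketch only shows the shape of the idea.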
- Python 3.7 or higher
- pip (Python package installer)
Clone the repository:
git clone https://github.com/JackSmack1971/web_content_processor.git
cd web_content_processor
Create a virtual environment (optional but recommended):
python -m venv venv
source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
Install the required packages:
pip install -r requirements.txt
Install the project in editable mode:
pip install -e .
To start the Web Content Processor, run:
python main.py
This will launch the Gradio interface in your default web browser. The interface consists of three main tabs:
- Scrape: Upload a file containing URLs to scrape content from multiple web pages.
- Extract Links: Enter a URL to extract all internal links from that web page.
- Postprocess: Convert scraped content from text format to Markdown.
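The project's acknowledgments credit Beautiful Soup for HTML parsing, but the core idea behind the Extract Links tab can be sketched with the standard library alone: collect every `href`, resolve it against the page URL, and keep only those on the same host. This is an illustrative assumption, not the actual `link_extractor.py` logic, and the `extract_internal_links` name is hypothetical.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collect href attribute values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

def extract_internal_links(page_url: str, html: str) -> List[str]:
    """Return absolute URLs that share the page's host, i.e. internal links."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    internal = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        if urlparse(absolute).netloc == host:
            internal.append(absolute)
    return internal
```

Note that relative links such as `/about` resolve to the page's own host, so they count as internal.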
You can customize the behavior of the Web Content Processor by modifying the configuration files:
- configs/config.yaml: YAML-format configuration
- configs/config.json: JSON-format configuration
These files allow you to adjust settings such as:
- Maximum number of worker threads
- User agent string for web requests
- Scrolling behavior for web scraping
- Supported file types for postprocessing
- And more...
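As a rough illustration of what such a configuration might look like, here is a sketch in YAML. The key names below are assumptions for illustration only; check configs/config.yaml for the actual names used by the project.

```yaml
# Illustrative example -- key names in the real configs/config.yaml may differ.
max_workers: 4
user_agent: "Mozilla/5.0 (compatible; WebContentProcessor/1.0)"
scrolling:
  enabled: true
  pause_seconds: 1.5
postprocessing:
  supported_file_types:
    - .txt
    - .html
```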
web_content_processor/
│
├── src/
│ ├── __init__.py
│ ├── config_manager.py
│ ├── scraper.py
│ ├── postprocessor.py
│ ├── link_extractor.py
│ └── integrated_app.py
│
├── tests/
│ ├── __init__.py
│ ├── test_config_manager.py
│ ├── test_scraper.py
│ ├── test_postprocessor.py
│ ├── test_link_extractor.py
│ └── test_integrated_app.py
│
├── configs/
│ ├── config.yaml
│ └── config.json
│
├── data/
│ ├── input/
│ └── output/
│
├── logs/
│ └── app.log
│
├── docs/
│ └── README.md
│
├── requirements.txt
├── setup.py
└── main.py
To contribute to the Web Content Processor, please follow these steps:
- Fork the repository on GitHub.
- Create a new branch for your feature or bug fix.
- Write your code and tests.
- Run the tests to ensure everything is working:
python -m unittest discover tests
- Submit a pull request with your changes.
To run the test suite, execute:
python -m unittest discover tests
Logs are stored in the logs/ directory. The main application log file is app.log.
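For readers unfamiliar with Python's logging module, a file-based setup along these lines produces a log like logs/app.log. This is a generic sketch, not the project's actual logging configuration, and the `setup_logging` helper is hypothetical.

```python
import logging
from pathlib import Path

def setup_logging(log_dir: str = "logs") -> logging.Logger:
    """Configure a logger that writes timestamped records to <log_dir>/app.log.
    Sketch only: the project's real setup may differ."""
    Path(log_dir).mkdir(exist_ok=True)
    logger = logging.getLogger("web_content_processor")
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(Path(log_dir) / "app.log")
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

With this in place, `logger.info("scraped 10 pages")` appends a timestamped line to the log file.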
This project is licensed under the MIT License - see the LICENSE file for details.
- Gradio for the web interface framework.
- Playwright for web automation and scraping.
- Beautiful Soup for HTML parsing.
For any queries or suggestions, please open an issue on the GitHub repository or contact the maintainers directly.
Happy web content processing!