AutoForge Lab is a containerized Python automation and crawling stack built for engineers who want a clean, extensible pipeline for:
- Web crawling
- Structured extraction
- Validation + normalization
- Storage
- Scheduled jobs
- Browser automation (Playwright + Selenium)
- Testable, modular OOP pipelines
It is designed to be practical, hackable, and self-hostable — not bloated, not vendor-locked.
Features:
- FastAPI backend API
- PostgreSQL storage
- APScheduler worker jobs
- OOP crawler pipeline: Collector → Extractor → Validator → Store
- Robots.txt enforcement
- Per-host throttling
- Playwright collector
- Selenium collector
- Pandas-ready data handling
- Pytest test suite
- Dockerized full stack
- Simple React + Vite frontend dashboard
- Script-driven dev workflow (no Make required)
Architecture:

```
Collectors
├── RequestsCollector
├── PlaywrightCollector
└── SeleniumCollector
      ↓
Extractors
  → parse titles, links, fields
      ↓
Validators
  → normalize + clean + reject bad records
      ↓
Store Layer
  → PostgreSQL persistence
      ↓
API + UI
  → FastAPI + React dashboard
      ↓
Scheduler Worker
  → recurring crawl jobs
```
Tech stack:
- Python 3.11
- FastAPI
- SQLAlchemy
- APScheduler
- Pytest
- Pandas
- Playwright
- Selenium
- PostgreSQL 16
- React
- Vite
- TypeScript
- Docker
- docker-compose
Quick start:

```bash
git clone <your-repo-url>
cd autoforge-lab
./scripts/up.sh
```

Services:
- Backend API → http://localhost:8000
- Frontend → http://localhost:5173
- Database → localhost:5432

Health check:

```bash
curl http://localhost:8000/health
```

Expected:

```json
{ "ok": true }
```

Dev scripts:

```bash
./scripts/up.sh
./scripts/down.sh
./scripts/logs.sh
./scripts/logs-api.sh
./scripts/logs-worker.sh
./scripts/test.sh
./scripts/lint.sh
./scripts/format.sh
./scripts/types.sh
```

Each crawl flows through composable stages:
Collector: responsible only for fetching content. Examples:
- RequestsCollector → HTTP
- PlaywrightCollector → headless browser
- SeleniumCollector → full browser automation

Extractor: parses raw content into structured fields.

Validator: cleans, normalizes, and rejects invalid records.

Store: writes validated records to the database.
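The four stages above can be sketched as minimal base classes plus a runner. All names here are illustrative stand-ins, not the project's actual API:

```python
# Illustrative sketch of the four pipeline stages (not the real classes).
from dataclasses import dataclass
from typing import Optional

@dataclass
class Record:
    url: str
    title: str

class Collector:
    """Fetches raw content only; no parsing happens here."""
    def fetch(self, url: str) -> str:
        raise NotImplementedError

class Extractor:
    """Parses raw content into structured fields."""
    def extract(self, raw: str, url: str) -> Record:
        raise NotImplementedError

class Validator:
    """Normalizes records; returns None to reject a record."""
    def validate(self, record: Record) -> Optional[Record]:
        raise NotImplementedError

class Store:
    """Persists validated records."""
    def save(self, record: Record) -> None:
        raise NotImplementedError

def run_pipeline(url: str, collector: Collector, extractor: Extractor,
                 validator: Validator, store: Store) -> None:
    raw = collector.fetch(url)
    record = extractor.extract(raw, url)
    valid = validator.validate(record)
    if valid is not None:  # rejected records never reach the store
        store.save(valid)
```

Keeping each stage single-purpose is what makes swapping a RequestsCollector for a PlaywrightCollector a config change rather than a rewrite.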
The worker container runs scheduled jobs:
- Crawl samplers
- Record refresh jobs
- Pipeline runs
Default interval: every 15 minutes.
Manual trigger:
```bash
docker-compose exec worker python -c \
  "from app.scheduler.jobs import run_crawl_sampler; run_crawl_sampler()"
```

Crawling guardrails:
- robots.txt checks
- crawl blocking when disallowed
- per-host throttling
- timeout enforcement
- structured logging
- no stealth scraping patterns
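As a rough, stdlib-only illustration of the first three guardrails, a per-host robots.txt cache plus a minimum-delay throttle might look like the following. `PolitenessGuard` and its defaults are hypothetical, not the project's actual implementation:

```python
# Hypothetical guardrail helper: robots.txt checks + per-host throttling.
import time
import urllib.robotparser
from urllib.parse import urlparse

class PolitenessGuard:
    def __init__(self, user_agent: str = "AutoForgeBot", min_delay: float = 1.0):
        self.user_agent = user_agent
        self.min_delay = min_delay  # seconds between requests to the same host
        self._parsers: dict[str, urllib.robotparser.RobotFileParser] = {}
        self._last_hit: dict[str, float] = {}

    def allowed(self, url: str) -> bool:
        """Fetches and caches robots.txt per host, then checks the URL."""
        host = urlparse(url).netloc
        rp = self._parsers.get(host)
        if rp is None:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url(f"https://{host}/robots.txt")
            rp.read()  # one network fetch per host, then cached
            self._parsers[host] = rp
        return rp.can_fetch(self.user_agent, url)

    def throttle(self, url: str) -> None:
        """Sleeps just enough to keep min_delay between same-host requests."""
        host = urlparse(url).netloc
        elapsed = time.monotonic() - self._last_hit.get(host, float("-inf"))
        if elapsed < self.min_delay:
            time.sleep(self.min_delay - elapsed)
        self._last_hit[host] = time.monotonic()
```

A collector would call `guard.allowed(url)` before fetching and `guard.throttle(url)` around each request; URLs disallowed by robots.txt are simply skipped.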
Records can be exported and processed with Pandas easily:

```python
import pandas as pd

df = pd.read_json("records.json")
```

Designed for downstream analytics and ML pipelines.
Pytest suite included. It runs inside the container for consistency:

```bash
./scripts/test.sh
```

CI runs pytest + coverage on pull requests.
Add a new collector in `app/crawling/collectors/my_collector.py`: subclass the base collector and plug it into the pipeline config.
Add a new extractor or validator the same way — the pipeline is intentionally modular.
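A hypothetical new collector might look like the sketch below. The `BaseCollector` name and `fetch()` signature are assumptions here; match them to the real base class in `app/crawling/collectors`:

```python
# Hypothetical collector: names and signatures are assumptions, not the real API.
import urllib.request

class BaseCollector:
    """Stand-in for the project's actual base collector."""
    def fetch(self, url: str) -> str:
        raise NotImplementedError

class MyCollector(BaseCollector):
    """Fetches a page over plain HTTP. No parsing here, per the best practices."""
    def fetch(self, url: str) -> str:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")
```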
- Keep collectors dumb — no parsing inside them
- Put normalization in validators
- Keep routes thin — push logic to services
- Always test new extractors with pytest fixtures
- Respect robots.txt — don’t remove the guardrails
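As an example of the fixture guideline above, an extractor test against canned HTML might look like this; `extract_title` and the sample markup are invented for illustration:

```python
# Illustrative pytest test for a (hypothetical) title extractor.
import pytest

SAMPLE_HTML = "<html><head><title>Hello</title></head></html>"

def extract_title(raw: str) -> str:
    """Naive title grab, for illustration only."""
    start = raw.index("<title>") + len("<title>")
    return raw[start:raw.index("</title>")]

@pytest.fixture
def sample_html() -> str:
    return SAMPLE_HTML

def test_extract_title(sample_html: str) -> None:
    assert extract_title(sample_html) == "Hello"
```

Canned fixtures like this keep extractor tests fast and network-free.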
See:
- CONTRIBUTING.md
- CODE_QUALITY_CHECKLIST.md
- SECURITY.md
Pull requests welcome if they keep the architecture clean and responsibility boundaries intact.