A web crawling system specifically designed for scraping job postings and providing a web interface to view the collected data.
- Overview
- System Requirements
- Environment Setup
- Installation
- Usage
- Project Structure
- Coding Style Guidelines
- Common Issues & Troubleshooting
- Contributing
## Overview

This project consists of two main components:
- Core Crawler - Web scraper that collects job postings from Apple's careers site
- Web Interface - Flask-based web application to view and manage the scraped data
## System Requirements

- Python: 3.8 or higher (recommended: Python 3.9+)
- Operating System: Windows 10+, macOS 10.15+, or Linux (Ubuntu 18.04+)
- Memory: At least 2GB RAM available
- Storage: 500MB+ free space for scraped data
- Internet: Stable internet connection for web scraping
## Environment Setup

### Windows

1. Install Python:
   - Download Python from python.org
   - Important: Check "Add Python to PATH" during installation
   - Verify installation: Open Command Prompt and run
     ```bash
     python --version
     ```
2. Install Git (optional but recommended):
   - Download from git-scm.com
   - Use default settings during installation
3. Set up virtual environment:
   ```bash
   # Open Command Prompt as Administrator (recommended)
   python -m pip install --upgrade pip
   python -m pip install virtualenv
   ```
### macOS

1. Install Python:
   - Option 1 (Recommended): Install via Homebrew
     ```bash
     # Install Homebrew if not already installed
     /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
     # Install Python
     brew install python
     ```
   - Option 2: Download from python.org
2. Verify installation:
   ```bash
   python3 --version
   pip3 --version
   ```
3. Install developer tools (if needed):
   ```bash
   xcode-select --install
   ```
### Linux

```bash
# For Debian/Ubuntu
# Update package list
sudo apt update
# Install Python and pip
sudo apt install python3 python3-pip python3-venv
# Install additional dependencies
sudo apt install curl git
```

```bash
# For CentOS/RHEL
sudo yum install python3 python3-pip
# For Fedora
sudo dnf install python3 python3-pip
# Install git
sudo yum install git   # CentOS/RHEL
sudo dnf install git   # Fedora
```

## Installation

Clone the repository:

```bash
# using SSH
git clone git@github.com:your-username/crawl_sys.git
cd crawl_sys
```

Windows:
```bash
python -m venv venv
venv\Scripts\activate
```

macOS/Linux:
```bash
python3 -m venv venv
source venv/bin/activate
```

Note: You should see `(venv)` in your terminal prompt when the virtual environment is activated.
```bash
# Upgrade pip first
pip install --upgrade pip
# Install project dependencies
pip install -r requirements.txt
```

```bash
# Copy environment template
cp .env.sample .env
# For Windows Command Prompt
copy .env.sample .env
```
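The exact keys expected by the project are defined in `.env.sample` and are not listed here. As a purely illustrative sketch (the variable name `DATA_DIR` and the use of the `python-dotenv` package are assumptions, not part of this project), values copied into `.env` can be read from Python like this:

```python
# Hypothetical example of reading values from .env; the real keys live in .env.sample.
import os

try:
    # python-dotenv is an assumption here and may not be listed in requirements.txt
    from dotenv import load_dotenv
    load_dotenv()  # load key=value pairs from .env into the process environment
except ImportError:
    pass  # fall back to variables already exported in the shell

DATA_DIR = os.getenv("DATA_DIR", "data")  # "DATA_DIR" is a made-up key for illustration
print(f"Scraped data will be stored under: {DATA_DIR}")
```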
```bash
# Create data directories
mkdir -p data/apple
# For Windows
mkdir data
mkdir data\apple
```

## Usage

Run the crawler:

```bash
# Activate virtual environment first
source venv/bin/activate # Linux/macOS
# or
venv\Scripts\activate # Windows
# Set Python path
export PYTHONPATH=. # Linux/macOS
set PYTHONPATH=. # Windows
# Run the crawler
python core/apple_crawl.py
```

Run the web interface:

```bash
# In the same terminal with activated virtual environment
python web/main.py
```

The web application will be available at http://localhost:5000
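To see how the pieces fit together, here is a minimal sketch of what a Flask entry point such as `web/main.py` might look like. It is not the project's actual implementation; the `/jobs` route and the JSON-loading details are assumptions based on the file layout shown in the next section.

```python
# Minimal sketch of a Flask entry point (not the actual web/main.py).
# Assumes the crawler has already written data/apple_job_detail.json.
import json
from pathlib import Path

from flask import Flask, jsonify

app = Flask(__name__)
DATA_FILE = Path("data/apple_job_detail.json")


@app.route("/jobs")  # route name is an assumption for illustration
def list_jobs():
    if not DATA_FILE.exists():
        return jsonify({"error": "no data yet - run the crawler first"}), 404
    with DATA_FILE.open(encoding="utf-8") as f:
        return jsonify(json.load(f))


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000, debug=True)
```

In this sketch the web layer only reads what the crawler has already written, which matches the separation between `core/` and `web/` described below.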
## Project Structure

```
crawl_sys/
├── core/                      # Core crawling functionality
│   └── apple_crawl.py         # Apple job scraper implementation
├── data/                      # Crawled data storage (git-ignored)
│   ├── apple/                 # Raw HTML pages from Apple jobs site
│   └── apple_job_detail.json  # Processed job data in JSON format
├── web/                       # Web interface
│   └── main.py                # Flask application entry point
├── .env.sample                # Environment variables template
├── .gitignore                 # Git ignore patterns
├── requirements.txt           # Python dependencies
├── CLAUDE.md                  # AI assistant project context
└── README.md                  # This file
```
## Coding Style Guidelines

- Follow PEP 8 style guide
- Use 4 spaces for indentation (not tabs)
- Line length: 79 characters maximum
- Use meaningful variable and function names
- Add docstrings for functions and classes
```python
import requests


def fetch_job_details(job_id: str) -> dict:
    """
    Fetch job details from Apple's API.

    Args:
        job_id (str): The unique identifier for the job

    Returns:
        dict: Job details or empty dict if failed
    """
    headers = {
        "User-Agent": "Mozilla/5.0 (compatible crawler)",
        "Accept": "application/json"
    }
    try:
        response = requests.get(
            f"https://jobs.apple.com/api/v1/jobDetails/{job_id}",
            headers=headers
        )
        response.raise_for_status()
        return response.json()
    except requests.RequestException as e:
        print(f"Error fetching job {job_id}: {e}")
        return {}
```

```python
# Standard library imports
import json
import os
from datetime import datetime
# Third-party imports
import requests
from bs4 import BeautifulSoup
from flask import Flask
# Local imports
from core.utils import helper_function
```

## Common Issues & Troubleshooting

Issue: `python: command not found`

```bash
# Solution for macOS/Linux
which python3
# Use python3 instead of python
# Solution for Windows
# Reinstall Python with "Add to PATH" option checked
```

Issue: `pip: command not found`

```bash
# Linux/macOS
python3 -m pip --version
# Windows
python -m pip --version
```

Issue: Virtual environment not activating

```bash
# Make sure you're in the project directory
pwd # Should show path ending with crawl_sys
# Try absolute path (replace with your actual path)
source /full/path/to/venv/bin/activate
```

Issue: `ModuleNotFoundError` despite installing requirements

```bash
# Make sure virtual environment is activated
which python # Should point to venv/bin/python
# Reinstall requirements
pip install -r requirements.txt --force-reinstall
```

Issue: HTTP 403/429 errors
- The site may be blocking requests due to rate limiting
- Try adding delays between requests (see the sketch below)
- Check if your IP is temporarily blocked
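One simple way to add such delays is a randomized pause before every request plus a longer back-off when the server answers with HTTP 429. The sketch below only illustrates the idea; it is not taken from `core/apple_crawl.py`.

```python
# Sketch of polite request pacing (illustrative only, not code from core/apple_crawl.py).
import random
import time
from typing import Optional

import requests


def fetch_with_delay(url: str, min_delay: float = 1.0, max_delay: float = 3.0,
                     max_retries: int = 3) -> Optional[requests.Response]:
    """Fetch a URL, pausing before each attempt and backing off on HTTP 429."""
    for attempt in range(max_retries):
        time.sleep(random.uniform(min_delay, max_delay))  # pause before every request
        response = requests.get(url, timeout=30)
        if response.status_code == 429:                   # rate limited: wait longer, retry
            time.sleep(30 * (attempt + 1))
            continue
        response.raise_for_status()
        return response
    return None
```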
Issue: No data being saved

```bash
# Check if data directory exists
ls -la data/
# If not, create it
mkdir -p data/apple
```

### Common Mistakes

1. Not activating virtual environment
   - Always activate venv before running commands
   - Look for `(venv)` in the terminal prompt
2. Using system Python instead of the virtual environment
   ```bash
   # Wrong
   /usr/bin/python core/apple_crawl.py
   # Correct
   python core/apple_crawl.py  # with venv activated
   ```
3. Missing PYTHONPATH
   ```bash
   # Add this to your shell profile for a permanent solution
   echo 'export PYTHONPATH=.' >> ~/.bashrc   # Linux
   echo 'export PYTHONPATH=.' >> ~/.zshrc    # macOS with zsh
   ```
4. Permission errors on Windows
   - Run Command Prompt as Administrator when installing packages
   - Or use the `--user` flag: `pip install --user package-name`
5. Git line ending issues (Windows)
   ```bash
   git config --global core.autocrlf true
   ```
### Performance Notes

- For large-scale crawling: Add delays between requests to avoid being blocked
- Memory usage: The crawler processes pages sequentially to manage memory
- Data storage: Raw HTML files are stored separately from the processed JSON (see the sketch below)
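As a rough illustration of that split (the helper name is made up; this is not the crawler's actual code), raw HTML can be written under `data/apple/` while extracted fields are appended to `data/apple_job_detail.json`:

```python
# Sketch of the two-tier storage layout described above (names are illustrative).
import json
from pathlib import Path

RAW_DIR = Path("data/apple")                      # raw HTML pages
DETAIL_FILE = Path("data/apple_job_detail.json")  # processed job data


def save_job(job_id: str, html: str, details: dict) -> None:
    """Store the raw HTML page and the processed details separately."""
    RAW_DIR.mkdir(parents=True, exist_ok=True)
    (RAW_DIR / f"{job_id}.html").write_text(html, encoding="utf-8")

    existing = []
    if DETAIL_FILE.exists():
        existing = json.loads(DETAIL_FILE.read_text(encoding="utf-8"))
    existing.append(details)
    DETAIL_FILE.write_text(json.dumps(existing, ensure_ascii=False, indent=2),
                           encoding="utf-8")
```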
If you encounter issues not covered here:
- Check Python version: `python --version` (should be 3.8+)
- Check installed packages: `pip list`
- Check virtual environment: `which python`
- Review error messages carefully - they often contain the solution
- Search for similar issues online with the specific error message
## Contributing

- Fork the repository
- Create a feature branch: `git checkout -b feature-name`
- Follow the coding style guidelines
- Test your changes thoroughly
- Submit a pull request
- Ensure all dependencies are in `requirements.txt`
- Test on your local environment
- Follow PEP 8 style guidelines
- Add comments for complex logic
- Update documentation if needed