A web crawling system specifically designed for scraping job postings and providing a web interface to view the collected data.
- Overview
- System Requirements
- Environment Setup
- Installation
- Usage
- Project Structure
- Coding Style Guidelines
- Common Issues & Troubleshooting
- Contributing
## Overview

This project consists of two main components:
- Core Crawler - Web scraper that collects job postings from Apple's careers site
- Web Interface - Flask-based web application to view and manage the scraped data
## System Requirements

- Python: 3.8 or higher (recommended: Python 3.9+)
- Operating System: Windows 10+, macOS 10.15+, or Linux (Ubuntu 18.04+)
- Memory: At least 2GB RAM available
- Storage: 500MB+ free space for scraped data
- Internet: Stable internet connection for web scraping
## Environment Setup

### Windows

1. Install Python:
   - Download Python from python.org
   - Important: Check "Add Python to PATH" during installation
   - Verify installation: Open Command Prompt and run
     ```bash
     python --version
     ```
2. Install Git (optional but recommended):
   - Download from git-scm.com
   - Use default settings during installation
3. Set up virtual environment:
   ```bash
   # Open Command Prompt as Administrator (recommended)
   python -m pip install --upgrade pip
   python -m pip install virtualenv
   ```
### macOS

1. Install Python:
   - Option 1 (Recommended): Install via Homebrew
     ```bash
     # Install Homebrew if not already installed
     /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
     # Install Python
     brew install python
     ```
   - Option 2: Download from python.org
2. Verify installation:
   ```bash
   python3 --version
   pip3 --version
   ```
3. Install developer tools (if needed):
   ```bash
   xcode-select --install
   ```
### Linux

```bash
# For Debian/Ubuntu
# Update package list
sudo apt update
# Install Python and pip
sudo apt install python3 python3-pip python3-venv
# Install additional dependencies
sudo apt install curl git
```

```bash
# For CentOS/RHEL
sudo yum install python3 python3-pip
# For Fedora
sudo dnf install python3 python3-pip
# Install git
sudo yum install git   # CentOS/RHEL
sudo dnf install git   # Fedora
```

## Installation

Clone the repository:

```bash
# using SSH
git clone git@github.com:your-username/crawl_sys.git
cd crawl_sys
```

Windows:
```bash
python -m venv venv
venv\Scripts\activate
```

macOS/Linux:
```bash
python3 -m venv venv
source venv/bin/activate
```

Note: You should see `(venv)` in your terminal prompt when the virtual environment is activated.
```bash
# Upgrade pip first
pip install --upgrade pip
# Install project dependencies
pip install -r requirements.txt
```

```bash
# Copy environment template
cp .env.sample .env
# For Windows Command Prompt
copy .env.sample .env
```
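The exact keys expected by the project are defined in `.env.sample` and are not listed here. As a purely illustrative sketch (the variable name `DATA_DIR` and the use of the `python-dotenv` package are assumptions, not part of this project), values copied into `.env` can be read from Python like this:

```python
# Hypothetical example of reading values from .env; the real keys live in .env.sample.
import os

try:
    # python-dotenv is an assumption here and may not be listed in requirements.txt
    from dotenv import load_dotenv
    load_dotenv()  # load key=value pairs from .env into the process environment
except ImportError:
    pass  # fall back to variables already exported in the shell

DATA_DIR = os.getenv("DATA_DIR", "data")  # "DATA_DIR" is a made-up key for illustration
print(f"Scraped data will be stored under: {DATA_DIR}")
```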
```bash
# Create data directories
mkdir -p data/apple
# For Windows
mkdir data
mkdir data\apple
```

## Usage

Run the crawler:

```bash
# Activate virtual environment first
source venv/bin/activate # Linux/macOS
# or
venv\Scripts\activate # Windows
# Set Python path
export PYTHONPATH=. # Linux/macOS
set PYTHONPATH=. # Windows
# Run the crawler
python core/apple_crawl.py
```

Run the web interface:

```bash
# In the same terminal with activated virtual environment
python web/main.py
```

The web application will be available at http://localhost:5000
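To see how the pieces fit together, here is a minimal sketch of what a Flask entry point such as `web/main.py` might look like. It is not the project's actual implementation; the `/jobs` route and the JSON-loading details are assumptions based on the file layout shown in the next section.

```python
# Minimal sketch of a Flask entry point (not the actual web/main.py).
# Assumes the crawler has already written data/apple_job_detail.json.
import json
from pathlib import Path

from flask import Flask, jsonify

app = Flask(__name__)
DATA_FILE = Path("data/apple_job_detail.json")


@app.route("/jobs")  # route name is an assumption for illustration
def list_jobs():
    if not DATA_FILE.exists():
        return jsonify({"error": "no data yet - run the crawler first"}), 404
    with DATA_FILE.open(encoding="utf-8") as f:
        return jsonify(json.load(f))


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000, debug=True)
```

In this sketch the web layer only reads what the crawler has already written, which matches the separation between `core/` and `web/` described below.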
## Project Structure

```
crawl_sys/
├── core/                      # Core crawling functionality
│   └── apple_crawl.py         # Apple job scraper implementation
├── data/                      # Crawled data storage (git-ignored)
│   ├── apple/                 # Raw HTML pages from Apple jobs site
│   └── apple_job_detail.json  # Processed job data in JSON format
├── web/                       # Web interface
│   └── main.py                # Flask application entry point
├── .env.sample                # Environment variables template
├── .gitignore                 # Git ignore patterns
├── requirements.txt           # Python dependencies
├── CLAUDE.md                  # AI assistant project context
└── README.md                  # This file
```
## Coding Style Guidelines

- Follow PEP 8 style guide
- Use 4 spaces for indentation (not tabs)
- Line length: 79 characters maximum
- Use meaningful variable and function names
- Add docstrings for functions and classes
```python
import requests


def fetch_job_details(job_id: str) -> dict:
    """
    Fetch job details from Apple's API.

    Args:
        job_id (str): The unique identifier for the job

    Returns:
        dict: Job details or empty dict if failed
    """
    headers = {
        "User-Agent": "Mozilla/5.0 (compatible crawler)",
        "Accept": "application/json"
    }
    try:
        response = requests.get(
            f"https://jobs.apple.com/api/v1/jobDetails/{job_id}",
            headers=headers
        )
        response.raise_for_status()
        return response.json()
    except requests.RequestException as e:
        print(f"Error fetching job {job_id}: {e}")
        return {}
```

```python
# Standard library imports
import json
import os
from datetime import datetime
# Third-party imports
import requests
from bs4 import BeautifulSoup
from flask import Flask
# Local imports
from core.utils import helper_function
```

## Common Issues & Troubleshooting

Issue: `python: command not found`

```bash
# Solution for macOS/Linux
which python3
# Use python3 instead of python
# Solution for Windows
# Reinstall Python with "Add to PATH" option checked
```

Issue: `pip: command not found`

```bash
# Linux/macOS
python3 -m pip --version
# Windows
python -m pip --version
```

Issue: Virtual environment not activating

```bash
# Make sure you're in the project directory
pwd # Should show path ending with crawl_sys
# Try absolute path (replace with your actual path)
source /full/path/to/venv/bin/activate
```

Issue: `ModuleNotFoundError` despite installing requirements

```bash
# Make sure virtual environment is activated
which python # Should point to venv/bin/python
# Reinstall requirements
pip install -r requirements.txt --force-reinstall
```

Issue: HTTP 403/429 errors
- The site may be blocking requests due to rate limiting
- Try adding delays between requests (see the sketch below)
- Check if your IP is temporarily blocked
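One simple way to add such delays is a randomized pause before every request plus a longer back-off when the server answers with HTTP 429. The sketch below only illustrates the idea; it is not taken from `core/apple_crawl.py`.

```python
# Sketch of polite request pacing (illustrative only, not code from core/apple_crawl.py).
import random
import time
from typing import Optional

import requests


def fetch_with_delay(url: str, min_delay: float = 1.0, max_delay: float = 3.0,
                     max_retries: int = 3) -> Optional[requests.Response]:
    """Fetch a URL, pausing before each attempt and backing off on HTTP 429."""
    for attempt in range(max_retries):
        time.sleep(random.uniform(min_delay, max_delay))  # pause before every request
        response = requests.get(url, timeout=30)
        if response.status_code == 429:                   # rate limited: wait longer, retry
            time.sleep(30 * (attempt + 1))
            continue
        response.raise_for_status()
        return response
    return None
```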
Issue: No data being saved

```bash
# Check if data directory exists
ls -la data/
# If not, create it
mkdir -p data/apple
```

### Common Mistakes

1. Not activating virtual environment
   - Always activate venv before running commands
   - Look for `(venv)` in the terminal prompt
2. Using system Python instead of the virtual environment
   ```bash
   # Wrong
   /usr/bin/python core/apple_crawl.py
   # Correct
   python core/apple_crawl.py  # with venv activated
   ```
3. Missing PYTHONPATH
   ```bash
   # Add this to your shell profile for a permanent solution
   echo 'export PYTHONPATH=.' >> ~/.bashrc   # Linux
   echo 'export PYTHONPATH=.' >> ~/.zshrc    # macOS with zsh
   ```
4. Permission errors on Windows
   - Run Command Prompt as Administrator when installing packages
   - Or use the `--user` flag: `pip install --user package-name`
5. Git line ending issues (Windows)
   ```bash
   git config --global core.autocrlf true
   ```
### Performance Notes

- For large-scale crawling: Add delays between requests to avoid being blocked
- Memory usage: The crawler processes pages sequentially to manage memory
- Data storage: Raw HTML files are stored separately from the processed JSON (see the sketch below)
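As a rough illustration of that split (the helper name is made up; this is not the crawler's actual code), raw HTML can be written under `data/apple/` while extracted fields are appended to `data/apple_job_detail.json`:

```python
# Sketch of the two-tier storage layout described above (names are illustrative).
import json
from pathlib import Path

RAW_DIR = Path("data/apple")                      # raw HTML pages
DETAIL_FILE = Path("data/apple_job_detail.json")  # processed job data


def save_job(job_id: str, html: str, details: dict) -> None:
    """Store the raw HTML page and the processed details separately."""
    RAW_DIR.mkdir(parents=True, exist_ok=True)
    (RAW_DIR / f"{job_id}.html").write_text(html, encoding="utf-8")

    existing = []
    if DETAIL_FILE.exists():
        existing = json.loads(DETAIL_FILE.read_text(encoding="utf-8"))
    existing.append(details)
    DETAIL_FILE.write_text(json.dumps(existing, ensure_ascii=False, indent=2),
                           encoding="utf-8")
```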
If you encounter issues not covered here:
- Check Python version: `python --version` (should be 3.8+)
- Check installed packages: `pip list`
- Check virtual environment: `which python`
- Review error messages carefully - they often contain the solution
- Search for similar issues online with the specific error message
## Contributing

- Fork the repository
- Create a feature branch: `git checkout -b feature-name`
- Follow the coding style guidelines
- Test your changes thoroughly
- Submit a pull request
- Ensure all dependencies are in `requirements.txt`
- Test on your local environment
- Follow PEP 8 style guidelines
- Add comments for complex logic
- Update documentation if needed