
Job Crawler System

A web crawling system that scrapes job postings and provides a web interface for viewing the collected data.

Table of Contents

  • Overview
  • System Requirements
  • Environment Setup
  • Installation
  • Usage
  • Project Structure
  • Coding Style Guidelines
  • Common Issues & Troubleshooting
  • Performance Tips
  • Getting Help
  • Contributing

Overview

This project consists of two main components:

  1. Core Crawler - Web scraper that collects job postings from Apple's careers site
  2. Web Interface - Flask-based web application to view and manage scraped data

System Requirements

  • Python: 3.8 or higher (recommended: Python 3.9+)
  • Operating System: Windows 10+, macOS 10.15+, or Linux (Ubuntu 18.04+)
  • Memory: At least 2GB RAM available
  • Storage: 500MB+ free space for scraped data
  • Internet: Stable internet connection for web scraping

Environment Setup

Windows Setup

  1. Install Python:

    • Download Python from python.org
    • Important: Check "Add Python to PATH" during installation
    • Verify installation: Open Command Prompt and run python --version
  2. Install Git (optional but recommended):

    • Download from git-scm.com
    • Use default settings during installation
  3. Set up virtual environment:

    # Open Command Prompt as Administrator (recommended)
    python -m pip install --upgrade pip
    python -m pip install virtualenv

macOS Setup

  1. Install Python:

    • Option 1 (Recommended): Install via Homebrew
      # Install Homebrew if not already installed
      /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
      
      # Install Python
      brew install python
    • Option 2: Download from python.org
  2. Verify installation:

    python3 --version
    pip3 --version
  3. Install developer tools (if needed):

    xcode-select --install

Linux Setup

Ubuntu/Debian:

# Update package list
sudo apt update

# Install Python and pip
sudo apt install python3 python3-pip python3-venv

# Install additional dependencies
sudo apt install curl git

CentOS/RHEL/Fedora:

# For CentOS/RHEL
sudo yum install python3 python3-pip

# For Fedora
sudo dnf install python3 python3-pip

# Install git
sudo yum install git  # CentOS/RHEL
sudo dnf install git  # Fedora

Installation

Step 1: Clone the Repository

# Using SSH
git clone git@github.com:your-username/crawl_sys.git

cd crawl_sys

Step 2: Create Virtual Environment

Windows:

python -m venv venv
venv\Scripts\activate

macOS/Linux:

python3 -m venv venv
source venv/bin/activate

Note: You should see (venv) in your terminal prompt when the virtual environment is activated.

Step 3: Install Dependencies

# Upgrade pip first
pip install --upgrade pip

# Install project dependencies
pip install -r requirements.txt

Step 4: Environment Configuration

# Copy environment template
cp .env.sample .env

# For Windows Command Prompt
copy .env.sample .env
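
The variable names are project-specific, so check .env.sample for the real keys. If the project loads them with python-dotenv (a common choice; confirm against requirements.txt), reading a value looks roughly like this (CRAWL_DELAY_SECONDS is a hypothetical name used only for illustration):

# Hedged sketch: load .env into the environment, then read a setting.
# Assumes python-dotenv; CRAWL_DELAY_SECONDS is a hypothetical key,
# the real names are listed in .env.sample.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
delay = float(os.environ.get("CRAWL_DELAY_SECONDS", "1.0"))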

Step 5: Create Required Directories

# Create data directories
mkdir -p data/apple

# For Windows
mkdir data
mkdir data\apple

Usage

Running the Crawler

# Activate virtual environment first
source venv/bin/activate  # Linux/macOS
# or
venv\Scripts\activate     # Windows

# Set Python path
export PYTHONPATH=.       # Linux/macOS
set PYTHONPATH=.         # Windows

# Run the crawler
python core/apple_crawl.py

Running the Web Interface

# In the same terminal with activated virtual environment
python web/main.py

The web application will be available at http://localhost:5000
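
For orientation, here is a minimal sketch of a Flask entry point in the shape of web/main.py (the actual file may differ; the /jobs route is illustrative, and the JSON path follows the Project Structure section below):

# Hedged sketch, not the project's actual web/main.py: serve the
# processed job data as JSON. The /jobs route name is illustrative.
import json

from flask import Flask, jsonify

app = Flask(__name__)


@app.route("/jobs")
def jobs():
    with open("data/apple_job_detail.json", encoding="utf-8") as f:
        return jsonify(json.load(f))


if __name__ == "__main__":
    app.run(port=5000, debug=True)  # Flask's default development port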

Project Structure

crawl_sys/
├── core/                      # Core crawling functionality
│   └── apple_crawl.py         # Apple job scraper implementation
├── data/                      # Crawled data storage (git-ignored)
│   ├── apple/                 # Raw HTML pages from Apple jobs site
│   └── apple_job_detail.json  # Processed job data in JSON format
├── web/                       # Web interface
│   └── main.py                # Flask application entry point
├── .env.sample                # Environment variables template
├── .gitignore                 # Git ignore patterns
├── requirements.txt           # Python dependencies
├── CLAUDE.md                  # AI assistant project context
└── README.md                  # This file

Coding Style Guidelines

Python Code Style

  • Follow PEP 8 style guide
  • Use 4 spaces for indentation (not tabs)
  • Line length: 79 characters maximum
  • Use meaningful variable and function names
  • Add docstrings for functions and classes

Example Code Style:

import requests


def fetch_job_details(job_id: str) -> dict:
    """
    Fetch job details from Apple's API.

    Args:
        job_id (str): The unique identifier for the job

    Returns:
        dict: Job details, or an empty dict on failure
    """
    url = f"https://jobs.apple.com/api/v1/jobDetails/{job_id}"
    headers = {
        "User-Agent": "Mozilla/5.0 (compatible crawler)",
        "Accept": "application/json",
    }

    try:
        # A timeout keeps a stalled request from hanging the crawler.
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        return response.json()
    except requests.RequestException as e:
        print(f"Error fetching job {job_id}: {e}")
        return {}

Import Organization:

# Standard library imports
import json
import os
from datetime import datetime

# Third-party imports
import requests
from bs4 import BeautifulSoup
from flask import Flask

# Local imports
from core.utils import helper_function

Common Issues & Troubleshooting

Python Installation Issues

Issue: python: command not found

# Solution for macOS/Linux
which python3
# Use python3 instead of python

# Solution for Windows
# Reinstall Python with "Add to PATH" option checked

Issue: pip: command not found

# Linux/macOS
python3 -m pip --version

# Windows
python -m pip --version

Virtual Environment Issues

Issue: Virtual environment not activating

# Make sure you're in the project directory
pwd  # Should show path ending with crawl_sys

# Try absolute path (replace with your actual path)
source /full/path/to/venv/bin/activate

Issue: ModuleNotFoundError despite installing requirements

# Make sure virtual environment is activated
which python  # Should point to venv/bin/python

# Reinstall requirements
pip install -r requirements.txt --force-reinstall

Crawling Issues

Issue: HTTP 403/429 errors

  • The site may be blocking requests due to rate limiting
  • Try adding delays between requests (see the sketch below)
  • Check if your IP is temporarily blocked
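
A minimal sketch of that pacing approach, with exponential backoff on 403/429 responses (the function name and delay values are illustrative, not the project's actual configuration):

# Hedged sketch: pace requests and back off on 403/429 responses.
# Names and delay values are illustrative only.
import time

import requests


def get_with_backoff(url, headers=None, max_retries=3, base_delay=2.0):
    """GET a URL, retrying with exponential backoff on 403/429."""
    response = None
    for attempt in range(max_retries):
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code not in (403, 429):
            break
        time.sleep(base_delay * 2 ** attempt)  # wait 2s, 4s, 8s, ...
    return response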

Issue: No data being saved

# Check if data directory exists
ls -la data/
# If not, create it
mkdir -p data/apple

Common Beginner Mistakes

  1. Not activating virtual environment

    • Always activate venv before running commands
    • Look for (venv) in terminal prompt
  2. Using system Python instead of virtual environment

    # Wrong
    /usr/bin/python core/apple_crawl.py
    
    # Correct
    python core/apple_crawl.py  # with venv activated
  3. Missing PYTHONPATH (see the check after this list)

    # Add this to your shell profile for a permanent solution
    # Note: '.' resolves at run time, so run commands from the project root
    echo 'export PYTHONPATH=.' >> ~/.bashrc  # Linux
    echo 'export PYTHONPATH=.' >> ~/.zshrc   # macOS with zsh
  4. Permission errors on Windows

    • Run Command Prompt as Administrator when installing packages
    • Or use --user flag: pip install --user package-name
  5. Git line ending issues (Windows)

    git config --global core.autocrlf true
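
For the PYTHONPATH item above (mistake 3), a quick, side-effect-free way to check that the project's packages resolve (run it from the repository root):

# Hedged check: verifies the project's top-level packages are importable
# without executing any crawler code.
import importlib.util

for name in ("core", "web"):
    found = importlib.util.find_spec(name) is not None
    print(f"{name}: {'found' if found else 'NOT found, check PYTHONPATH'}")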

Performance Tips

  • For large-scale crawling: Add delays between requests to avoid being blocked
  • Memory usage: The crawler processes pages sequentially to manage memory
  • Data storage: Raw HTML files are stored separately from processed JSON
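
The storage split in the last tip can be sketched briefly (paths match the Project Structure section; naming raw pages by job id is an illustrative choice, not necessarily the crawler's):

# Hedged sketch of the raw/processed storage split described above.
import json
from pathlib import Path

RAW_DIR = Path("data/apple")                    # raw HTML pages
PROCESSED = Path("data/apple_job_detail.json")  # parsed records


def save_job(job_id: str, html: str, record: dict) -> None:
    """Store the raw page, then merge the parsed record into the JSON file."""
    RAW_DIR.mkdir(parents=True, exist_ok=True)
    (RAW_DIR / f"{job_id}.html").write_text(html, encoding="utf-8")

    records = []
    if PROCESSED.exists():
        records = json.loads(PROCESSED.read_text(encoding="utf-8"))
    records.append(record)
    PROCESSED.write_text(
        json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8"
    )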

Getting Help

If you encounter issues not covered here:

  1. Check Python version: python --version (should be 3.8+)
  2. Check installed packages: pip list
  3. Check virtual environment: which python
  4. Review error messages carefully - they often contain the solution
  5. Search for similar issues online with the specific error message

Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature-name
  3. Follow the coding style guidelines
  4. Test your changes thoroughly
  5. Submit a pull request

Before Contributing:

  • Ensure all dependencies are in requirements.txt
  • Test on your local environment
  • Follow PEP 8 style guidelines
  • Add comments for complex logic
  • Update documentation if needed
