ArXiv Paper Collector

An automated Python tool that fetches the latest papers from arXiv related to electronic structure theory and artificial intelligence, filters them by keywords, and generates formatted PDF reports using LaTeX.

Features

Automated Daily Collection: Runs automatically at scheduled times
Keyword Filtering: Filters papers by customizable keywords
LaTeX Reports: Generates professional PDF reports with hyperlinks
Cross-Platform: Works on Windows, macOS, and Linux
Portable: Installer scripts check prerequisites, install dependencies, and create output directories
Configurable: Simple YAML-based configuration

Prerequisites

Requirement	Version	Check Command
Python	3.8+	`python --version`
pip	Latest	`pip --version`
LaTeX	Any	`pdflatex --version`
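
The same checks can be scripted. The following is a minimal sketch, not part of the project: `check_prerequisites` is a hypothetical helper that compares the running interpreter against the minimum version in the table and looks for `pip` and `pdflatex` on PATH.

```python
import shutil
import sys

MIN_PYTHON = (3, 8)  # minimum version from the prerequisites table


def check_prerequisites(min_python=MIN_PYTHON, commands=("pip", "pdflatex")):
    """Return a dict mapping each requirement name to True (ok) or False."""
    results = {"python": sys.version_info[:2] >= min_python}
    for name in commands:
        # shutil.which returns the executable path, or None if it is not on PATH
        results[name] = shutil.which(name) is not None
    return results


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

A `MISSING` entry for `pdflatex` points to the LaTeX Installation section.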

Installation

Quick Install (Recommended)

Linux/macOS

# 1. Clone the repository
git clone https://github.com/YOUR_USERNAME/arxiv-paper-collector.git
cd arxiv-paper-collector

# 2. Run the installer
chmod +x install.sh
./install.sh

# 3. Run the program
./run.sh --run

Windows

REM 1. Clone the repository
git clone https://github.com/YOUR_USERNAME/arxiv-paper-collector.git
cd arxiv-paper-collector

REM 2. Run the installer
install.bat

REM 3. Run the program
run.bat --run

The installer will:

✓ Check Python version
✓ Check LaTeX installation
✓ Install Python dependencies
✓ Create output directories
✓ Update configuration

Manual Install

If the automated installer doesn't work, follow these steps:

Step 1: Install Python Dependencies

pip install -r requirements.txt

Packages installed:

arxiv - arXiv API client
PyYAML - Configuration parsing
Jinja2 - Template engine
python-dateutil - Date handling
colorlog - Colored logging
schedule - Task scheduling
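
To confirm that the packages above are importable, a small check like the following can help. It is a sketch with a hypothetical helper name (`missing_packages`); the mapping reflects the fact that some pip package names differ from their import names (for example, PyYAML is imported as `yaml`).

```python
import importlib.util

# pip package name -> import name
REQUIRED_MODULES = {
    "arxiv": "arxiv",
    "PyYAML": "yaml",
    "Jinja2": "jinja2",
    "python-dateutil": "dateutil",
    "colorlog": "colorlog",
    "schedule": "schedule",
}


def missing_packages(modules=REQUIRED_MODULES):
    """Return the pip package names whose import module cannot be found."""
    return [
        package
        for package, module in modules.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    print("All packages found" if not missing else f"Missing: {', '.join(missing)}")
```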

Step 2: Install LaTeX

See LaTeX Installation below.

Step 3: Verify Installation

python main.py --help

LaTeX Installation

LaTeX is required for PDF generation. Choose your platform:

macOS

# Option 1: MacTeX without the GUI applications
brew install --cask mactex-no-gui

# Option 2: Full MacTeX
brew install --cask mactex

Download size: several GB for either option (mactex-no-gui omits only the GUI applications)

Ubuntu/Debian

# Basic LaTeX installation
sudo apt-get update
sudo apt-get install texlive-latex-extra

# Full installation (recommended)
sudo apt-get install texlive-full

Download size: ~500MB (basic) or ~4GB (full)

Fedora/RHEL

sudo dnf install texlive-scheme-full

Windows

Option 1: MiKTeX (Recommended)

Download from miktex.org
Run the installer
Choose "Complete" installation

Option 2: TeX Live

Download from tug.org
Run install-tl-windows.bat
Follow the installation wizard

Verify LaTeX Installation

pdflatex --version    # Should show version info
xelatex --version     # Alternative engine (better Unicode support)

Troubleshooting LaTeX:

If LaTeX is not found after installation, restart your terminal/command prompt
On Windows, MiKTeX may ask to install packages on first use - click "Yes"

Usage

Basic Commands

# Run once immediately
python main.py --run

# Using launcher scripts (recommended)
./run.sh --run       # Linux/macOS
run.bat --run        # Windows

# Edit keywords in config file
python main.py --edit-keywords

# Show next scheduled run time
python main.py --status

# Start scheduled daemon (runs daily at configured time)
python main.py --daemon
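
For intuition about what `--status` reports, the next run time can be derived from the configured hour and minute (`schedule.hour` and `schedule.minute` in config.yaml). This is a minimal sketch under stated assumptions: `next_run` is a hypothetical function, and it works on naive datetimes, so the `timezone` setting is not handled here.

```python
from datetime import datetime, timedelta


def next_run(now, hour=10, minute=0):
    """Return the next datetime at hour:minute strictly after `now`."""
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        # Today's slot has already passed, so schedule for tomorrow
        candidate += timedelta(days=1)
    return candidate


if __name__ == "__main__":
    print("Next scheduled run:", next_run(datetime.now()))
```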

Command Line Options

Option	Short	Description
`--run`	`-r`	Run the paper collector once immediately
`--daemon`	`-d`	Run as a daemon with scheduled execution
`--config`	`-c`	Path to configuration file
`--status`	`-s`	Show scheduler status
`--edit-keywords`		Open config file in default editor
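
The options in this table map naturally onto `argparse`. The sketch below is an illustration of that mapping, not the actual contents of main.py; `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a parser exposing the options listed in the table above."""
    parser = argparse.ArgumentParser(prog="main.py", description="ArXiv Paper Collector")
    parser.add_argument("-r", "--run", action="store_true",
                        help="Run the paper collector once immediately")
    parser.add_argument("-d", "--daemon", action="store_true",
                        help="Run as a daemon with scheduled execution")
    parser.add_argument("-c", "--config", metavar="PATH", default=None,
                        help="Path to configuration file")
    parser.add_argument("-s", "--status", action="store_true",
                        help="Show scheduler status")
    # No short form is listed for this option
    parser.add_argument("--edit-keywords", action="store_true",
                        help="Open config file in default editor")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```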

First Run Example

$ ./run.sh --run

============================================================
Starting ArXiv Paper Collector
============================================================

Step 1: Fetching papers from arXiv...
Found 394 papers

Step 2: Filtering papers by keywords...
  electronic_structure: 327 papers
  artificial_intelligence: 317 papers
  uncategorized: 14 papers

Step 3: Generating LaTeX document...

Step 4: Compiling PDF...
PDF generated successfully: output/papers/arxiv_papers_2026-01-12.pdf

============================================================
Collection completed successfully!
============================================================
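
Step 4 in the output above compiles the generated .tex file. The sketch below shows one way such a step could work; it is not taken from the project. `build_compile_command` and `compile_pdf` are hypothetical names, the parameters mirror the `latex` settings in config.yaml, and treating `attempts` as "retry until a PDF is produced" is an assumption.

```python
import subprocess
from pathlib import Path


def build_compile_command(tex_file, output_dir, engine="xelatex"):
    """Build the command line for one LaTeX compilation pass."""
    return [
        engine,
        "-interaction=nonstopmode",          # do not pause for input on errors
        f"-output-directory={output_dir}",   # write the PDF and aux files here
        str(tex_file),
    ]


def compile_pdf(tex_file, output_dir, engine="xelatex", attempts=2, max_compile_time=60):
    """Run the engine up to `attempts` times; return True if a PDF was produced."""
    command = build_compile_command(tex_file, output_dir, engine)
    pdf_path = Path(output_dir) / (Path(tex_file).stem + ".pdf")
    for _ in range(attempts):
        try:
            result = subprocess.run(command, capture_output=True, timeout=max_compile_time)
        except subprocess.TimeoutExpired:
            continue  # this attempt exceeded max_compile_time; try again
        except FileNotFoundError:
            return False  # the engine is not installed or not on PATH
        if result.returncode == 0 and pdf_path.exists():
            return True
    return False
```

Running in `nonstopmode` matters for unattended use: without it, a LaTeX error would leave the process waiting for keyboard input until the timeout expires.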

Configuration

Config File Locations

The program searches for configuration files in this order:

Current directory: ./config.yaml
User home: ~/.arxiv-collector/config.yaml
System config:
- Linux/macOS: ~/.config/arxiv-collector/config.yaml
- Windows: %APPDATA%\arxiv-collector\config.yaml
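
The search order above can be expressed as a short lookup. This is a sketch assuming first-match-wins semantics; `candidate_config_paths` and `find_config` are hypothetical names, not functions from the project.

```python
import os
import sys
from pathlib import Path


def candidate_config_paths(cwd=None, home=None):
    """Return config locations in the documented search order."""
    cwd = Path(cwd) if cwd else Path.cwd()
    home = Path(home) if home else Path.home()
    paths = [
        cwd / "config.yaml",                      # 1. current directory
        home / ".arxiv-collector" / "config.yaml",  # 2. user home
    ]
    # 3. platform-specific location
    if sys.platform.startswith("win"):
        appdata = os.environ.get("APPDATA")
        if appdata:
            paths.append(Path(appdata) / "arxiv-collector" / "config.yaml")
    else:
        paths.append(home / ".config" / "arxiv-collector" / "config.yaml")
    return paths


def find_config(cwd=None, home=None):
    """Return the first existing config file, or None if there is none."""
    for path in candidate_config_paths(cwd, home):
        if path.is_file():
            return path
    return None
```

With this order, a config.yaml in the working directory shadows any user-level file, which is worth remembering when edits to the user config appear to have no effect.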

Creating a User Config

Method 1: Using the editor

python main.py --edit-keywords

Method 2: Manual creation

# Create config directory
mkdir -p ~/.config/arxiv-collector

# Copy default config
cp config.yaml ~/.config/arxiv-collector/

# Edit the file
nano ~/.config/arxiv-collector/config.yaml

Config File Explained

# ============================================
# KEYWORDS - Papers matching these will be collected
# ============================================
keywords:
  electronic_structure:
    - "electronic structure"
    - "density functional theory"
    - "DFT"
    - "quantum chemistry"
    - "ab initio"
    - "first-principles"
    - "Hartree-Fock"
    - "post-Hartree-Fock"
    - "coupled cluster"

  artificial_intelligence:
    - "machine learning"
    - "neural network"
    - "deep learning"
    - "artificial intelligence"
    - "AI"
    - "graph neural network"
    - "GNN"
    - "transformer"
    - "reinforcement learning"

# ============================================
# ARXIV CATEGORIES - Which categories to search
# ============================================
arxiv_categories:
  - "physics.comp-ph"      # Computational Physics
  - "physics.chem-ph"      # Chemical Physics
  - "cond-mat.str-el"      # Strongly Correlated Electrons
  - "cond-mat.mtrl-sci"    # Materials Science
  - "cs.LG"                # Machine Learning
  - "cs.AI"                # Artificial Intelligence

# ============================================
# TIME SETTINGS
# ============================================
days_back: 1                # How many days back to search (1=yesterday, 7=last week)
schedule:
  hour: 10                  # Run at 10:00 AM
  minute: 0
  timezone: "Asia/Shanghai" # Your timezone

# ============================================
# OUTPUT SETTINGS
# ============================================
output:
  pdf_dir: "output/papers"      # Where to save PDFs
  latex_dir: "output/latex"     # Where to save .tex files
  filename_format: "arxiv_papers_{date}.pdf"

# ============================================
# LATEX SETTINGS
# ============================================
latex:
  engine: "xelatex"        # Engine: pdflatex, xelatex, lualatex
                           # Use xelatex for better Unicode support
  max_compile_time: 60     # Max seconds to wait for compilation
  attempts: 2              # Number of compilation attempts

# ============================================
# PAPER LIMITS
# ============================================
max_papers: 50             # Maximum papers per group in report
abstract_max_length: 1000  # Maximum abstract length in characters

# ============================================
# LOGGING
# ============================================
logging:
  level: "INFO"            # DEBUG, INFO, WARNING, ERROR
  log_file: "output/collector.log"
  console_output: true
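
To illustrate how the `keywords`, `max_papers`, and `abstract_max_length` settings could interact, here is a minimal sketch. It assumes case-insensitive substring matching and a paper represented as a dict with `title` and `abstract` keys; both are assumptions, and `group_papers` is a hypothetical name rather than the project's filter.

```python
def group_papers(papers, keywords, max_papers=50, abstract_max_length=1000):
    """Assign each paper to every keyword group it matches.

    Papers matching no group go to "uncategorized". Each group is capped at
    `max_papers`, and abstracts are cut to `abstract_max_length` characters.
    """
    groups = {name: [] for name in keywords}
    groups["uncategorized"] = []
    for paper in papers:
        text = f"{paper['title']} {paper['abstract']}".lower()
        matched = [
            name
            for name, terms in keywords.items()
            if any(term.lower() in text for term in terms)
        ]
        trimmed = dict(paper, abstract=paper["abstract"][:abstract_max_length])
        for name in matched or ["uncategorized"]:
            if len(groups[name]) < max_papers:
                groups[name].append(trimmed)
    return groups
```

Under substring matching, short keywords such as "AI" or "DFT" also match inside longer words (for example, "AI" inside "contain"), which inflates group sizes. A word-boundary regular expression would be a stricter alternative.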

Common arXiv Categories

Category	Description
`physics.comp-ph`	Computational Physics
`physics.chem-ph`	Chemical Physics
`cond-mat.str-el`	Strongly Correlated Electrons
`cond-mat.mtrl-sci`	Materials Science
`cs.LG`	Machine Learning
`cs.AI`	Artificial Intelligence
`cs.CV`	Computer Vision
`cs.CL`	Computation and Language
`stat.ML`	Machine Learning (Statistics)
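
In the arXiv API query syntax, a category is selected with the `cat:` field prefix and terms can be combined with `OR`. The sketch below builds such a query string from a list like `arxiv_categories`; `build_category_query` is a hypothetical helper, and whether the collector builds its query exactly this way is an assumption.

```python
def build_category_query(categories):
    """Join category codes into an arXiv API search query string."""
    if not categories:
        raise ValueError("at least one category is required")
    return " OR ".join(f"cat:{code}" for code in categories)


if __name__ == "__main__":
    print(build_category_query(["physics.comp-ph", "cs.LG"]))
```

A string of this form can be passed as the `query` argument of `arxiv.Search` in the arxiv package.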

Troubleshooting

Problem: "ModuleNotFoundError: No module named 'arxiv'"

Solution:

pip install -r requirements.txt

Or install individually:

pip install arxiv PyYAML Jinja2 python-dateutil colorlog schedule

Problem: "pdflatex: command not found"

Solution: LaTeX is not installed or not in PATH.

macOS:

brew install mactex-no-gui
# Restart terminal after installation

Ubuntu:

sudo apt-get install texlive-latex-extra

Windows: Run the MiKTeX installer again and choose "Add/Remove" → "Add MiKTeX to PATH"

Problem: "LaTeX Error: Unicode character not supported"

Solution: Change the LaTeX engine to xelatex in config.yaml:

latex:
  engine: "xelatex"  # Changed from "pdflatex"

Problem: "No papers found"

Possible causes:

days_back is too small
The configured arxiv_categories have no new papers in that window
Keywords are too specific

Solutions:

# Increase days_back
days_back: 7  # Search last 7 days instead of 1

# Add more categories
arxiv_categories:
  - "cs.AI"
  - "cs.LG"
  - "physics.comp-ph"
  - "physics.chem-ph"

# Use broader keywords
keywords:
  my_group:
    - "learning"      # Broader than "machine learning"
    - "network"       # Broader than "neural network"
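
To see why `days_back` matters, consider the date window it implies. This is a sketch assuming the window is measured back from the current time against each paper's publication timestamp; `within_days_back` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone


def within_days_back(published, days_back=1, now=None):
    """Return True if `published` falls within the last `days_back` days."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days_back)
    return cutoff <= published <= now


if __name__ == "__main__":
    two_days_ago = datetime.now(timezone.utc) - timedelta(days=2)
    print(within_days_back(two_days_ago, days_back=1))  # outside a 1-day window
    print(within_days_back(two_days_ago, days_back=7))  # inside a 7-day window
```

arXiv does not announce new listings every day of the week, so a one-day window can legitimately come back empty even when everything is configured correctly.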

Problem: PDF compilation fails with "! Emergency stop"

Cause: LaTeX hit a fatal error in the generated .tex file (for example, an unescaped special character in a title or abstract)

Solutions:

Check the log file: output/collector.log
Try a different LaTeX engine: xelatex or lualatex
Reduce max_papers to limit report size
Reduce abstract_max_length to avoid long abstracts
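
One plausible source of fatal LaTeX errors in generated reports is text containing characters such as `%`, `&`, or `_`. The sketch below shows a common way to escape them before they reach a template. `escape_latex` is a hypothetical helper; whether the project's templates already escape their input is not known here.

```python
# Replacement for each character that has special meaning in LaTeX
_LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "&": r"\&",
    "%": r"\%",
    "$": r"\$",
    "#": r"\#",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}


def escape_latex(text):
    """Escape LaTeX special characters, one character at a time."""
    # Mapping per character avoids re-escaping the output of earlier replacements
    return "".join(_LATEX_SPECIALS.get(ch, ch) for ch in text)


if __name__ == "__main__":
    print(escape_latex("50% of A_1 & B"))
```

Note the trade-off: arXiv abstracts often contain intentional LaTeX math such as `$E_0$`, and escaping everything renders that markup literally instead of typesetting it.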

Problem: "Permission denied" when running scripts

Linux/macOS:

chmod +x install.sh run.sh

Problem: Virtual environment issues

Recreate virtual environment:

# Remove old venv (Linux/macOS; on Windows use: rmdir /s /q venv)
rm -rf venv

# Create new one
python3 -m venv venv

# Activate
source venv/bin/activate  # Linux/macOS
# venv\Scripts\activate   # Windows

# Reinstall
pip install -r requirements.txt

Examples

Example 1: Collect papers for a specific topic

Edit config.yaml:

keywords:
  my_research:
    - "quantum computing"
    - "qubit"
    - "entanglement"

arxiv_categories:
  - "quant-ph"

Run:

python main.py --run

Example 2: Scheduled daily collection

Using cron (Linux/macOS):

# Edit crontab
crontab -e

# Add this line to run daily at 10 AM
0 10 * * * cd /path/to/arxiv-paper-collector && python3 main.py --run >> output/cron.log 2>&1

Using systemd (Linux):

# Create service file
sudo nano /etc/systemd/system/arxiv-collector.service

Add:

[Unit]
Description=ArXiv Paper Collector
After=network.target

[Service]
Type=simple
User=your_username
WorkingDirectory=/path/to/arxiv-paper-collector
ExecStart=/usr/bin/python3 main.py --daemon
Restart=always

[Install]
WantedBy=multi-user.target

Enable:

sudo systemctl enable arxiv-collector
sudo systemctl start arxiv-collector

Example 3: Using a virtual environment

# Create venv
python3 -m venv venv

# Activate
source venv/bin/activate  # Linux/macOS
# venv\Scripts\activate   # Windows

# Install
pip install -r requirements.txt

# Run
python main.py --run

System Integration

Windows Task Scheduler

Open Task Scheduler (taskschd.msc)
Click "Create Basic Task"
Name: "ArXiv Paper Collector"
Trigger: "Daily" at "10:00 AM"
Action: "Start a program"
- Program: python
- Arguments: main.py --run
- Start in: C:\path\to\arxiv-paper-collector

LaunchAgents (macOS)

Create ~/Library/LaunchAgents/com.arxiv.collector.plist:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.arxiv.collector</string>
    <key>ProgramArguments</key>
    <array>
        <string>/usr/bin/python3</string>
        <string>/path/to/arxiv-paper-collector/main.py</string>
        <string>--run</string>
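    </array>
    <!-- Run from the repository so ./config.yaml and the relative output/ paths resolve -->
    <key>WorkingDirectory</key>
    <string>/path/to/arxiv-paper-collector</string>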
    <key>StartCalendarInterval</key>
    <dict>
        <key>Hour</key>
        <integer>10</integer>
        <key>Minute</key>
        <integer>0</integer>
    </dict>
</dict>
</plist>

Load:

launchctl load ~/Library/LaunchAgents/com.arxiv.collector.plist

Development

# Install development dependencies
pip install -e ".[dev]"

# Format code
black .

# Run tests
pytest

Contributing

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

License

MIT License - see LICENSE file for details.

Acknowledgments

arXiv for open access to scientific papers
arxiv.py for Python API access
Jinja2 for the template engine
