An automated Python tool that fetches the latest papers from arXiv related to electronic structure theory and artificial intelligence, filters them by keywords, and generates formatted PDF reports using LaTeX.
- Automated Daily Collection: Runs automatically at scheduled times
- Keyword Filtering: Filters papers by customizable keywords
- LaTeX Reports: Generates professional PDF reports with hyperlinks
- Cross-Platform: Works on Windows, macOS, and Linux
- Portable: Self-contained with auto-installation
- Configurable: Simple YAML-based configuration
| Requirement | Version | Check Command |
|---|---|---|
| Python | 3.8+ | python --version |
| pip | Latest | pip --version |
| LaTeX | Any | pdflatex --version |
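The checks in the table can also be scripted. The following is a minimal sketch and not part of the project itself: the helper name `check_prerequisites` is hypothetical, and it only tests whether each command is on PATH rather than validating the pip or LaTeX version.

```python
import shutil
import sys

MIN_PYTHON = (3, 8)  # minimum version from the requirements table


def check_prerequisites(commands=("pip", "pdflatex")):
    """Return a dict mapping each requirement name to True or False.

    Hypothetical helper: checks the running interpreter's version and
    whether each listed command can be found on PATH.
    """
    results = {"python": sys.version_info[:2] >= MIN_PYTHON}
    for command in commands:
        # shutil.which returns None when the executable is not on PATH
        results[command] = shutil.which(command) is not None
    return results


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

A missing `pdflatex` entry usually means LaTeX is not installed or the terminal has not been restarted since installation.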
# 1. Clone the repository
git clone https://github.com/YOUR_USERNAME/arxiv-paper-collector.git
cd arxiv-paper-collector
# 2. Run the installer
chmod +x install.sh
./install.sh
# 3. Run the program
./run.sh --run
REM 1. Clone the repository
git clone https://github.com/YOUR_USERNAME/arxiv-paper-collector.git
cd arxiv-paper-collector
REM 2. Run the installer
install.bat
REM 3. Run the program
run.bat --run
The installer will:
- ✓ Check Python version
- ✓ Check LaTeX installation
- ✓ Install Python dependencies
- ✓ Create output directories
- ✓ Update configuration
If the automated installer doesn't work, follow these steps:
pip install -r requirements.txt
Packages installed:
- arxiv - arXiv API client
- PyYAML - Configuration parsing
- Jinja2 - Template engine
- python-dateutil - Date handling
- colorlog - Colored logging
- schedule - Task scheduling
See LaTeX Installation below.
python main.py --help
LaTeX is required for PDF generation. Choose your platform:
# Option 1: MacTeX without the GUI applications (somewhat smaller download)
brew install mactex-no-gui
# Option 2: Full MacTeX
brew install --cask mactex
Download size: several GB for either option (mactex-no-gui omits only the GUI applications)
# Basic LaTeX installation
sudo apt-get update
sudo apt-get install texlive-latex-extra
# Full installation (recommended)
sudo apt-get install texlive-full
Download size: ~500MB (basic) or ~4GB (full)
sudo dnf install texlive-scheme-full
Option 1: MiKTeX (Recommended)
- Download from miktex.org
- Run the installer
- Choose "Complete" installation
Option 2: TeX Live
- Download from tug.org
- Run install-tl-windows.bat
- Follow the installation wizard
pdflatex --version # Should show version info
xelatex --version # Alternative engine (better Unicode support)
Troubleshooting LaTeX:
- If LaTeX is not found after installation, restart your terminal/command prompt
- On Windows, MiKTeX may ask to install packages on first use - click "Yes"
# Run once immediately
python main.py --run
# Using launcher scripts (recommended)
./run.sh --run # Linux/macOS
run.bat --run # Windows
# Edit keywords in config file
python main.py --edit-keywords
# Show next scheduled run time
python main.py --status
# Start scheduled daemon (runs daily at configured time)
python main.py --daemon
| Option | Short | Description |
|---|---|---|
| --run | -r | Run the paper collector once immediately |
| --daemon | -d | Run as a daemon with scheduled execution |
| --config | -c | Path to configuration file |
| --status | -s | Show scheduler status |
| --edit-keywords | | Open config file in default editor |
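For readers extending the command line, here is a minimal sketch of how options like these could be declared with `argparse`. It is an illustration under assumptions, not the project's actual `main.py`; the function name `build_parser` and the help texts are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="Collect arXiv papers and generate a PDF report.",
    )
    parser.add_argument("-r", "--run", action="store_true",
                        help="Run the paper collector once immediately")
    parser.add_argument("-d", "--daemon", action="store_true",
                        help="Run as a daemon with scheduled execution")
    parser.add_argument("-c", "--config", metavar="PATH", default=None,
                        help="Path to configuration file")
    parser.add_argument("-s", "--status", action="store_true",
                        help="Show scheduler status")
    # No short form is documented for this option
    parser.add_argument("--edit-keywords", action="store_true",
                        help="Open config file in default editor")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration
args = build_parser().parse_args(["--run", "--config", "my_config.yaml"])
print(args.run, args.daemon, args.config)
```

Note that `argparse` exposes `--edit-keywords` as the attribute `edit_keywords`.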
$ ./run.sh --run
============================================================
Starting ArXiv Paper Collector
============================================================
Step 1: Fetching papers from arXiv...
Found 394 papers
Step 2: Filtering papers by keywords...
electronic_structure: 327 papers
artificial_intelligence: 317 papers
uncategorized: 14 papers
Step 3: Generating LaTeX document...
Step 4: Compiling PDF...
PDF generated successfully: output/papers/arxiv_papers_2026-01-12.pdf
============================================================
Collection completed successfully!
============================================================
The program searches for configuration files in this order:
- Current directory: ./config.yaml
- User home: ~/.arxiv-collector/config.yaml
- System config:
  - Linux/macOS: ~/.config/arxiv-collector/config.yaml
  - Windows: %APPDATA%\arxiv-collector\config.yaml
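The search order above can be sketched as follows. This is an assumption-laden illustration rather than the tool's own code: the function names `candidate_config_paths` and `find_config` are hypothetical, and it assumes the first existing file wins.

```python
import os
from pathlib import Path


def candidate_config_paths(cwd=None, home=None, platform=None, appdata=None):
    """List config locations in the documented search order."""
    cwd = Path(cwd) if cwd is not None else Path.cwd()
    home = Path(home) if home is not None else Path.home()
    platform = platform if platform is not None else os.name
    paths = [
        cwd / "config.yaml",                         # current directory
        home / ".arxiv-collector" / "config.yaml",   # user home
    ]
    if platform == "nt":  # Windows: %APPDATA%\arxiv-collector\config.yaml
        appdata = appdata if appdata is not None else os.environ.get("APPDATA")
        if appdata:
            paths.append(Path(appdata) / "arxiv-collector" / "config.yaml")
    else:  # Linux/macOS: ~/.config/arxiv-collector/config.yaml
        paths.append(home / ".config" / "arxiv-collector" / "config.yaml")
    return paths


def find_config(paths):
    """Return the first path that exists as a file, or None."""
    for path in paths:
        if path.is_file():
            return path
    return None


if __name__ == "__main__":
    print(find_config(candidate_config_paths()))
```

Under this reading, a `config.yaml` in the current directory would shadow the copies in the home and system locations.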
Method 1: Using the editor
python main.py --edit-keywords
Method 2: Manual creation
# Create config directory
mkdir -p ~/.config/arxiv-collector
# Copy default config
cp config.yaml ~/.config/arxiv-collector/
# Edit the file
nano ~/.config/arxiv-collector/config.yaml
# ============================================
# KEYWORDS - Papers matching these will be collected
# ============================================
keywords:
  electronic_structure:
    - "electronic structure"
    - "density functional theory"
    - "DFT"
    - "quantum chemistry"
    - "ab initio"
    - "first-principles"
    - "Hartree-Fock"
    - "post-Hartree-Fock"
    - "coupled cluster"
  artificial_intelligence:
    - "machine learning"
    - "neural network"
    - "deep learning"
    - "artificial intelligence"
    - "AI"
    - "graph neural network"
    - "GNN"
    - "transformer"
    - "reinforcement learning"
# ============================================
# ARXIV CATEGORIES - Which categories to search
# ============================================
arxiv_categories:
  - "physics.comp-ph" # Computational Physics
  - "physics.chem-ph" # Chemical Physics
  - "cond-mat.str-el" # Strongly Correlated Electrons
  - "cond-mat.mtrl-sci" # Materials Science
  - "cs.LG" # Machine Learning
  - "cs.AI" # Artificial Intelligence
# ============================================
# TIME SETTINGS
# ============================================
days_back: 1 # How many days back to search (1=yesterday, 7=last week)
schedule:
  hour: 10 # Run at 10:00 AM
  minute: 0
  timezone: "Asia/Shanghai" # Your timezone
# ============================================
# OUTPUT SETTINGS
# ============================================
output:
  pdf_dir: "output/papers" # Where to save PDFs
  latex_dir: "output/latex" # Where to save .tex files
  filename_format: "arxiv_papers_{date}.pdf"
# ============================================
# LATEX SETTINGS
# ============================================
latex:
  engine: "xelatex" # Engine: pdflatex, xelatex, lualatex
  # Use xelatex for better Unicode support
  max_compile_time: 60 # Max seconds to wait for compilation
  attempts: 2 # Number of compilation attempts
# ============================================
# PAPER LIMITS
# ============================================
max_papers: 50 # Maximum papers per group in report
abstract_max_length: 1000 # Maximum abstract length in characters
# ============================================
# LOGGING
# ============================================
logging:
  level: "INFO" # DEBUG, INFO, WARNING, ERROR
  log_file: "output/collector.log"
  console_output: true
| Category | Description |
|---|---|
| physics.comp-ph | Computational Physics |
| physics.chem-ph | Chemical Physics |
| cond-mat.str-el | Strongly Correlated Electrons |
| cond-mat.mtrl-sci | Materials Science |
| cs.LG | Machine Learning |
| cs.AI | Artificial Intelligence |
| cs.CV | Computer Vision |
| cs.CL | Computation and Language |
| stat.ML | Machine Learning (Statistics) |
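As an illustration of how category codes like these and `days_back` could be combined into an arXiv API search string, here is a minimal sketch. The `cat:` prefix and the `submittedDate:[... TO ...]` range are arXiv API query syntax; the function name `build_arxiv_query` is hypothetical, and the assumption that the window ends at the current UTC time may not match how the tool defines it.

```python
from datetime import datetime, timedelta, timezone


def build_arxiv_query(categories, days_back, now=None):
    """Build an arXiv API search string for the given categories and window.

    Assumption: the window ends at `now` (UTC) and starts `days_back`
    days earlier.
    """
    if not categories:
        raise ValueError("at least one category is required")
    now = now or datetime.now(timezone.utc)
    start = now - timedelta(days=days_back)
    # arXiv expects range timestamps formatted as YYYYMMDDHHMM
    stamp = "%Y%m%d%H%M"
    cats = " OR ".join(f"cat:{c}" for c in categories)
    window = f"submittedDate:[{start.strftime(stamp)} TO {now.strftime(stamp)}]"
    return f"({cats}) AND {window}"


query = build_arxiv_query(
    ["physics.comp-ph", "cs.LG"],
    days_back=1,
    now=datetime(2026, 1, 12, 10, 0, tzinfo=timezone.utc),
)
print(query)
# (cat:physics.comp-ph OR cat:cs.LG) AND submittedDate:[202601111000 TO 202601121000]
```

A string of this form could be passed as the `query` argument of `arxiv.Search` from the `arxiv` package; whether the tool builds its query this way is not confirmed by this README.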
Solution:
pip install -r requirements.txt
Or install individually:
pip install arxiv PyYAML Jinja2 python-dateutil colorlog schedule
Solution: LaTeX is not installed or not in PATH.
macOS:
brew install mactex-no-gui
# Restart terminal after installation
Ubuntu:
sudo apt-get install texlive-latex-extra
Windows: Run the MiKTeX installer again and choose "Add/Remove" → "Add MiKTeX to PATH"
Solution: Change the LaTeX engine to xelatex in config.yaml:
latex:
  engine: "xelatex" # Changed from "pdflatex"
Possible causes:
- days_back is too small
- arxiv_categories don't have new papers
- Keywords are too specific
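To see why overly specific keywords can leave a report empty, the following is a minimal sketch of keyword grouping. It assumes case-insensitive whole-word matching against the title and abstract; the project's actual matching rule may differ, and `compile_keywords` and `group_papers` are hypothetical names.

```python
import re


def compile_keywords(keywords):
    """Compile each group's keywords into one case-insensitive pattern."""
    patterns = {}
    for group, words in keywords.items():
        if not words:
            continue  # an empty group can never match
        # \b word boundaries keep short keywords such as "AI" from
        # matching inside longer words such as "maintain"
        alternatives = "|".join(re.escape(w) for w in words)
        patterns[group] = re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)
    return patterns


def group_papers(papers, keywords):
    """Assign each paper to every group whose keywords appear in its text."""
    patterns = compile_keywords(keywords)
    groups = {group: [] for group in keywords}
    groups["uncategorized"] = []
    for paper in papers:
        text = f"{paper['title']} {paper['abstract']}"
        matched = [g for g, p in patterns.items() if p.search(text)]
        # Papers that match no group fall back to "uncategorized"
        for group in matched or ["uncategorized"]:
            groups[group].append(paper["title"])
    return groups


papers = [
    {"title": "Neural network potentials for DFT", "abstract": "We train a model."},
    {"title": "Lattice QCD at finite temperature", "abstract": "We maintain accuracy."},
]
keywords = {
    "electronic_structure": ["DFT", "Hartree-Fock"],
    "artificial_intelligence": ["neural network", "AI"],
}
print(group_papers(papers, keywords))
```

Under this whole-word rule a keyword such as "neural network" does not match the plural "neural networks", which is one way a keyword list can turn out narrower than intended. A paper can also land in more than one group, so group counts may add up to more than the number of papers fetched.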
Solutions:
# Increase days_back
days_back: 7 # Search last 7 days instead of 1
# Add more categories
arxiv_categories:
  - "cs.AI"
  - "cs.LG"
  - "physics.comp-ph"
  - "physics.chem-ph"
# Use broader keywords
keywords:
  my_group:
    - "learning" # Broader than "machine learning"
    - "network" # Broader than "neural network"
Causes: LaTeX syntax error in generated template
Solutions:
- Check the log file: output/collector.log
- Try a different LaTeX engine: xelatex or lualatex
- Reduce max_papers to limit report size
- Reduce abstract_max_length to avoid long abstracts
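A common source of LaTeX errors in generated reports is unescaped special characters in titles or abstracts. The sketch below shows one way such text could be truncated and escaped before it reaches a template. It is an illustration only: `escape_latex` and `prepare_abstract` are hypothetical names, and the project's own template handling may work differently.

```python
import re

# Characters with special meaning in LaTeX and their escaped forms
_LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "&": r"\&",
    "%": r"\%",
    "$": r"\$",
    "#": r"\#",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}
_SPECIALS_RE = re.compile("|".join(re.escape(ch) for ch in _LATEX_SPECIALS))


def escape_latex(text):
    """Escape LaTeX special characters in a single pass.

    One pass matters: the braces in an inserted \\textbackslash{} must not
    be escaped again by a later replacement.
    """
    return _SPECIALS_RE.sub(lambda m: _LATEX_SPECIALS[m.group(0)], text)


def prepare_abstract(abstract, max_length=1000):
    """Truncate first, then escape, so a cut never splits an escape sequence."""
    if len(abstract) > max_length:
        abstract = abstract[:max_length].rstrip() + "..."
    return escape_latex(abstract)


print(prepare_abstract("Accuracy improves by 50% for E_0 & E_1"))
```

Escaping everything also turns intentional math such as `$E_0$` into literal text, so a real pipeline might treat math segments separately.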
Linux/macOS:
chmod +x install.sh run.sh
Recreate virtual environment:
# Remove old venv
rm -rf venv
# Create new one
python3 -m venv venv
# Activate
source venv/bin/activate # Linux/macOS
# venv\Scripts\activate # Windows (use instead of the line above)
# Reinstall
pip install -r requirements.txt
Edit config.yaml:
keywords:
  my_research:
    - "quantum computing"
    - "qubit"
    - "entanglement"
arxiv_categories:
  - "quant-ph"
Run:
python main.py --run
Using cron (Linux/macOS):
# Edit crontab
crontab -e
# Add this line to run daily at 10 AM
0 10 * * * cd /path/to/arxiv-paper-collector && python3 main.py --run >> output/cron.log 2>&1
Using systemd (Linux):
# Create service file
sudo nano /etc/systemd/system/arxiv-collector.service
Add:
[Unit]
Description=ArXiv Paper Collector
After=network.target
[Service]
Type=simple
User=your_username
WorkingDirectory=/path/to/arxiv-paper-collector
ExecStart=/usr/bin/python3 main.py --daemon
Restart=always
[Install]
WantedBy=multi-user.target
Enable:
sudo systemctl enable arxiv-collector
sudo systemctl start arxiv-collector
# Create venv
python3 -m venv venv
# Activate
source venv/bin/activate # Linux/macOS
# venv\Scripts\activate # Windows
# Install
pip install -r requirements.txt
# Run
python main.py --run
- Open Task Scheduler (taskschd.msc)
- Click "Create Basic Task"
- Name: "ArXiv Paper Collector"
- Trigger: "Daily" at "10:00 AM"
- Action: "Start a program"
  - Program: python
  - Arguments: main.py --run
  - Start in: C:\path\to\arxiv-paper-collector
Create ~/Library/LaunchAgents/com.arxiv.collector.plist:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key>
<string>com.arxiv.collector</string>
<key>ProgramArguments</key>
<array>
<string>/usr/bin/python3</string>
<string>/path/to/arxiv-paper-collector/main.py</string>
<string>--run</string>
</array>
<key>StartCalendarInterval</key>
<dict>
<key>Hour</key>
<integer>10</integer>
<key>Minute</key>
<integer>0</integer>
</dict>
<key>WorkingDirectory</key>
<string>/path/to/arxiv-paper-collector</string>
</dict>
</plist>
Load:
launchctl load ~/Library/LaunchAgents/com.arxiv.collector.plist
# Install development dependencies
pip install -e ".[dev]"
# Format code
black .
# Run tests
pytest
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
MIT License - see LICENSE file for details.