Commit f712524

STOKES-DOT and claude committed
Initial commit: ArXiv Paper Collector
Features:
- Automated daily arXiv paper fetching
- Keyword-based filtering (electronic structure & AI)
- LaTeX PDF report generation
- Configurable via YAML
- Scheduled execution support

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
0 parents  commit f712524

11 files changed: +1863 -0 lines changed

README.md

Lines changed: 290 additions & 0 deletions
@@ -0,0 +1,290 @@
# ArXiv Paper Collector

[![Python](https://img.shields.io/badge/Python-3.8%2B-blue)](https://www.python.org/)
[![License](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)

An automated Python tool that fetches the latest arXiv papers on electronic structure theory and artificial intelligence, filters them by keyword, and generates formatted PDF reports with LaTeX.

## Features

- **Automated Daily Collection**: Runs once a day at a configurable time (default: 10:00, `Asia/Shanghai`)
- **Keyword Filtering**: Filters papers by customizable keyword groups (electronic structure, AI/ML)
- **LaTeX Reports**: Generates PDF reports with each paper's title, authors, abstract, and arXiv link
- **Configurable**: Keywords and settings live in a single YAML file
- **Portable**: Self-contained Python package with few dependencies (a LaTeX distribution must be installed separately)

## Project Structure

```
arxiv-paper-collector/
├── config.yaml             # Configuration file (edit keywords here)
├── main.py                 # Main entry point
├── requirements.txt        # Python dependencies
├── README.md               # This file
├── modules/
│   ├── __init__.py
│   ├── arxiv_fetcher.py    # arXiv API integration
│   ├── paper_filter.py     # Keyword filtering
│   ├── latex_generator.py  # LaTeX document generation
│   ├── pdf_compiler.py     # PDF compilation
│   └── scheduler.py        # Task scheduling
├── templates/
│   └── paper_report.tex    # LaTeX template
└── output/
    ├── papers/             # Generated PDFs
    ├── latex/              # Intermediate LaTeX files
    └── collector.log       # Log file
```

## Installation

### Prerequisites

- Python 3.8 or higher
- LaTeX distribution (TeX Live, MiKTeX, or MacTeX)
- Git (for cloning)

### Step 1: Clone the Repository

```bash
git clone https://github.com/YOUR_USERNAME/arxiv-paper-collector.git
cd arxiv-paper-collector
```

### Step 2: Install Python Dependencies

```bash
pip install -r requirements.txt
```

### Step 3: Verify LaTeX Installation

```bash
pdflatex --version
```

If LaTeX is not installed, install it:
- **macOS**: `brew install mactex`
- **Ubuntu/Debian**: `sudo apt-get install texlive-full`
- **Windows**: Download and install [MiKTeX](https://miktex.org/)

## Usage

### Quick Start

Run the collector once immediately:

```bash
python main.py --run
```

### Edit Keywords

To customize the keywords used for filtering papers:

```bash
python main.py --edit-keywords
```

Or edit `config.yaml` directly:

```yaml
keywords:
  electronic_structure:
    - "electronic structure"
    - "density functional theory"
    - "DFT"
    - "quantum chemistry"
    # Add your keywords here...

  artificial_intelligence:
    - "machine learning"
    - "neural network"
    - "deep learning"
    # Add your keywords here...
```

### Scheduled Execution

Run as a daemon, which starts the built-in scheduler:

```bash
python main.py --daemon
```

The daemon runs the collector daily at the time specified in `config.yaml` (default: 10:00 in the configured timezone, `Asia/Shanghai`).

### Check Status

```bash
python main.py --status
```

### Command Line Options

| Option | Description |
|--------|-------------|
| `--run, -r` | Run the paper collector once immediately |
| `--daemon, -d` | Run as a daemon with scheduled execution |
| `--config, -c` | Path to configuration file (default: `config.yaml`) |
| `--status, -s` | Show scheduler status |
| `--edit-keywords` | Open config file in default editor |

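The options in the table above could be wired up with `argparse` roughly as follows. This is only a sketch: the real `main.py` is not shown here, so the `build_parser` name and the decision to make the action flags mutually exclusive are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (hypothetical layout)."""
    parser = argparse.ArgumentParser(description="ArXiv Paper Collector")
    parser.add_argument("--config", "-c", default="config.yaml",
                        help="Path to configuration file")
    # Assumption: only one action makes sense per invocation.
    actions = parser.add_mutually_exclusive_group()
    actions.add_argument("--run", "-r", action="store_true",
                         help="Run the paper collector once immediately")
    actions.add_argument("--daemon", "-d", action="store_true",
                         help="Run as a daemon with scheduled execution")
    actions.add_argument("--status", "-s", action="store_true",
                         help="Show scheduler status")
    actions.add_argument("--edit-keywords", action="store_true",
                         help="Open config file in default editor")
    return parser


args = build_parser().parse_args(["--run", "-c", "custom.yaml"])
print(args.run, args.config, args.edit_keywords)  # True custom.yaml False
```

With this layout, `argparse` exposes `--edit-keywords` as `args.edit_keywords`, and combining two actions (for example `--run --daemon`) is rejected with a usage error.
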
## Configuration

The `config.yaml` file contains all settings:

### Keywords

Define keyword groups for filtering papers (matching is case-insensitive):

```yaml
keywords:
  electronic_structure:
    - "electronic structure"
    - "DFT"
  artificial_intelligence:
    - "machine learning"
    - "AI"
```

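How a paper might be assigned to keyword groups can be sketched as below. The function name `match_groups` and the whole-word rule are assumptions, not necessarily what `modules/paper_filter.py` does. The word boundary is there because a plain case-insensitive substring test would let a short keyword such as "AI" match inside a word like "constraint".

```python
import re
from typing import Dict, List


def match_groups(text: str, keywords: Dict[str, List[str]]) -> List[str]:
    """Return the names of keyword groups with at least one whole-word hit."""
    matched = []
    for group, words in keywords.items():
        for word in words:
            # The lookarounds stop "AI" from matching inside "constraint".
            pattern = r"(?<!\w)" + re.escape(word) + r"(?!\w)"
            if re.search(pattern, text, flags=re.IGNORECASE):
                matched.append(group)
                break  # one hit is enough for this group
    return matched


keywords = {
    "electronic_structure": ["electronic structure", "DFT"],
    "artificial_intelligence": ["machine learning", "AI"],
}
print(match_groups("Machine learning corrections to DFT energies", keywords))
# ['electronic_structure', 'artificial_intelligence']
print(match_groups("Constraint handling in sparse solvers", keywords))
# []
```
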
### arXiv Categories

Specify which arXiv categories to search:

```yaml
arxiv_categories:
  - "physics.comp-ph"  # Computational Physics
  - "physics.chem-ph"  # Chemical Physics
  - "cs.LG"            # Machine Learning
```

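The arXiv API accepts category filters of the form `cat:<category>` combined with `OR`. A minimal sketch of turning the list above into such a query string follows; the helper name `build_category_query` is hypothetical, and how `modules/arxiv_fetcher.py` actually builds its query is not shown in this commit view.

```python
from typing import List


def build_category_query(categories: List[str]) -> str:
    """Join categories into an arXiv API search expression."""
    return " OR ".join(f"cat:{category}" for category in categories)


print(build_category_query(["physics.comp-ph", "physics.chem-ph", "cs.LG"]))
# cat:physics.comp-ph OR cat:physics.chem-ph OR cat:cs.LG
```

The resulting string is the kind of value that the `arxiv` library's `Search(query=...)` expects.
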
### Schedule

Set the daily run time:

```yaml
schedule:
  hour: 10
  minute: 0
  timezone: "Asia/Shanghai"
```

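To illustrate what these three settings mean, the sketch below computes the next run time from the current time. It is not the scheduler's actual code; `next_run` is a hypothetical helper, and a fixed UTC+8 offset stands in for `Asia/Shanghai`, which observes no daylight saving time.

```python
from datetime import datetime, timedelta, timezone


def next_run(now: datetime, hour: int, minute: int) -> datetime:
    """Return the next occurrence of hour:minute in the timezone of `now`."""
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        # Today's slot has already passed, so schedule for tomorrow.
        candidate += timedelta(days=1)
    return candidate


shanghai = timezone(timedelta(hours=8))  # stand-in for "Asia/Shanghai"
print(next_run(datetime(2024, 1, 15, 9, 30, tzinfo=shanghai), 10, 0))
# 2024-01-15 10:00:00+08:00
print(next_run(datetime(2024, 1, 15, 10, 30, tzinfo=shanghai), 10, 0))
# 2024-01-16 10:00:00+08:00
```

For zones with daylight saving time, a named zone from `zoneinfo.ZoneInfo` (Python 3.9+) or `dateutil.tz.gettz` would be a safer choice than a fixed offset.
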
### Output

Configure output directories:

```yaml
output:
  pdf_dir: "output/papers"
  latex_dir: "output/latex"
```

### LaTeX Compilation

Configure the LaTeX engine, the per-run time limit, and the number of attempts:

```yaml
latex:
  engine: "pdflatex"    # Options: pdflatex, xelatex, lualatex
  max_compile_time: 60  # Maximum compilation time in seconds
  attempts: 2           # Number of compilation attempts
```

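One way these settings could drive the compiler is sketched below. The function names are hypothetical, and treating `attempts` as "retry until one run succeeds" is an assumption; `modules/pdf_compiler.py` may instead use the extra runs as additional LaTeX passes.

```python
import subprocess
from typing import List


def build_command(engine: str, tex_path: str, output_dir: str) -> List[str]:
    """Assemble a non-interactive compile command for the chosen engine."""
    return [
        engine,
        "-interaction=nonstopmode",  # never pause to prompt on errors
        "-halt-on-error",            # exit at the first error
        f"-output-directory={output_dir}",
        tex_path,
    ]


def compile_tex(tex_path: str, output_dir: str, engine: str = "pdflatex",
                max_compile_time: int = 60, attempts: int = 2) -> bool:
    """Try to compile up to `attempts` times; True once a run succeeds."""
    command = build_command(engine, tex_path, output_dir)
    for _ in range(attempts):
        try:
            result = subprocess.run(command, capture_output=True,
                                    timeout=max_compile_time)
        except subprocess.TimeoutExpired:
            continue  # count a timeout as a failed attempt
        if result.returncode == 0:
            return True
    return False
```

A call such as `compile_tex("output/latex/report.tex", "output/papers")` would raise `FileNotFoundError` if the engine is not on `PATH`, which is the situation the `pdflatex --version` check is meant to catch.
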
## System Integration

### Cron (Linux/macOS)

Add to crontab (`crontab -e`). Cron uses the system's local time, so `0 10 * * *` means 10:00 on the host clock:

```bash
0 10 * * * cd /path/to/arxiv-paper-collector && /usr/bin/python3 main.py --run >> output/cron.log 2>&1
```

### systemd (Linux)

Create `/etc/systemd/system/arxiv-collector.service`:

```ini
[Unit]
Description=ArXiv Paper Collector
After=network.target

[Service]
Type=simple
User=your_username
WorkingDirectory=/path/to/arxiv-paper-collector
ExecStart=/usr/bin/python3 main.py --daemon
Restart=always

[Install]
WantedBy=multi-user.target
```

Reload systemd so it picks up the new unit file, then enable and start the service:

```bash
sudo systemctl daemon-reload
sudo systemctl enable arxiv-collector
sudo systemctl start arxiv-collector
```

## Output

The tool generates:

1. **PDF Report**: `output/papers/arxiv_papers_YYYY-MM-DD.pdf`
   - Grouped by keyword categories
   - Contains title, authors, abstract, and arXiv links

2. **LaTeX Source**: `output/latex/arxiv_papers_YYYY-MM-DD.tex`
   - Can be customized or compiled manually

3. **Log File**: `output/collector.log`
   - Detailed run information for debugging

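The dated file names above follow from the `filename_format: "arxiv_papers_{date}.pdf"` setting in `config.yaml`. A minimal sketch of how the placeholder could be filled is shown here; `report_path` is a hypothetical helper, and the ISO date format is inferred from the `YYYY-MM-DD` pattern.

```python
from datetime import date
from pathlib import PurePosixPath


def report_path(pdf_dir: str, filename_format: str, day: date) -> str:
    """Fill the {date} placeholder with an ISO date and join with pdf_dir."""
    filename = filename_format.format(date=day.isoformat())
    return str(PurePosixPath(pdf_dir) / filename)


print(report_path("output/papers", "arxiv_papers_{date}.pdf", date(2024, 1, 15)))
# output/papers/arxiv_papers_2024-01-15.pdf
```
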
## Dependencies

- `arxiv` - arXiv API client
- `PyYAML` - Configuration file parsing
- `Jinja2` - LaTeX template engine
- `python-dateutil` - Date handling

## Troubleshooting

### LaTeX Compilation Fails

- Ensure LaTeX is installed: `pdflatex --version`
- Check the log file: `output/collector.log`
- Try a different engine in `config.yaml` (`xelatex` or `lualatex`)

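Unescaped special characters such as `%`, `&`, and `_` in titles or abstracts are a frequent cause of LaTeX errors in generated documents. Whether `modules/latex_generator.py` already escapes them is not visible here, so the sketch below is only an illustration of the idea. `escape_latex` and `shorten` are hypothetical helpers, and `max_length` corresponds to `abstract_max_length` in `config.yaml`.

```python
LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "&": r"\&",
    "%": r"\%",
    "$": r"\$",
    "#": r"\#",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}


def escape_latex(text: str) -> str:
    """Escape LaTeX special characters one character at a time."""
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in text)


def shorten(text: str, max_length: int) -> str:
    """Cut text to max_length characters and mark the cut with an ellipsis."""
    if len(text) <= max_length:
        return text
    return text[:max_length].rstrip() + "..."


# Truncate first, then escape, so an escape sequence is never cut in half.
print(escape_latex(shorten("50% of H_2 & O_2", 1000)))
# 50\% of H\_2 \& O\_2
```

This approach also escapes `$`, so inline math in an abstract would appear as literal text rather than being typeset.
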
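### No Papers Found

- Check the `days_back` setting in `config.yaml` (default: 1); arXiv does not announce new papers every day (for example over weekends), so a one-day window can be empty
- Verify that the arXiv category identifiers are spelled correctly
- Check that the keywords are not too specific
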
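The effect of `days_back` can be pictured as a cutoff on each paper's publication time. The sketch below assumes that interpretation; `recent_only` and the `(title, published)` tuples are illustrative only, not the tool's data model.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Tuple


def recent_only(papers: List[Tuple[str, datetime]], days_back: int,
                now: datetime) -> List[str]:
    """Keep titles of papers published within the last `days_back` days."""
    cutoff = now - timedelta(days=days_back)
    return [title for title, published in papers if published >= cutoff]


now = datetime(2024, 1, 15, 10, 0, tzinfo=timezone.utc)
papers = [
    ("New paper", datetime(2024, 1, 14, 18, 0, tzinfo=timezone.utc)),
    ("Old paper", datetime(2024, 1, 12, 18, 0, tzinfo=timezone.utc)),
]
print(recent_only(papers, days_back=1, now=now))  # ['New paper']
print(recent_only(papers, days_back=3, now=now))  # ['New paper', 'Old paper']
```

Raising `days_back` widens the window, which is why it is the first setting to check when a report comes back empty.
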
### Permission Errors

- Ensure the output directories are writable
- If you run the script directly (`./main.py`), make it executable: `chmod +x main.py`

## Contributing

Contributions are welcome! Please:

1. Fork the repository
2. Create a feature branch
3. Submit a pull request

## License

MIT License. See the `LICENSE` file for details.

## Author

Jiaoyuan

## Acknowledgments

- [arXiv](https://arxiv.org/) for open access to scientific papers
- [arxiv Python library](https://github.com/lukasschwab/arxiv.py) for API access

config.yaml

Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@
# Arxiv Paper Collector Configuration File
# You can modify the keywords and other settings here

# Keywords for filtering papers (case-insensitive)
keywords:
  electronic_structure:
    - "electronic structure"
    - "density functional theory"
    - "DFT"
    - "quantum chemistry"
    - "ab initio"
    - "first-principles"
    - "Hartree-Fock"
    - "post-Hartree-Fock"
    - "coupled cluster"
    - "CI"
    - "quantum Monte Carlo"

  artificial_intelligence:
    - "machine learning"
    - "neural network"
    - "deep learning"
    - "artificial intelligence"
    - "AI"
    - "graph neural network"
    - "GNN"
    - "transformer"
    - "reinforcement learning"
    - "supervised learning"

# Arxiv categories to search
arxiv_categories:
  - "physics.comp-ph"    # Computational Physics
  - "physics.chem-ph"    # Chemical Physics
  - "cond-mat.str-el"    # Strongly Correlated Electrons
  - "cond-mat.mtrl-sci"  # Materials Science
  - "cs.LG"              # Machine Learning
  - "cs.AI"              # Artificial Intelligence

# Time settings
schedule:
  hour: 10  # Run at 10:00 AM
  minute: 0
  timezone: "Asia/Shanghai"

# Date range for paper search (days back from today)
days_back: 1

# Output settings
output:
  pdf_dir: "output/papers"
  latex_dir: "output/latex"
  filename_format: "arxiv_papers_{date}.pdf"

# LaTeX compilation settings
latex:
  engine: "pdflatex"    # Options: pdflatex, xelatex, lualatex
  max_compile_time: 60  # Maximum compilation time in seconds
  attempts: 2           # Number of compilation attempts

# Logging settings
logging:
  level: "INFO"  # DEBUG, INFO, WARNING, ERROR
  log_file: "output/collector.log"
  console_output: true

# Paper limits
max_papers: 50             # Maximum number of papers to include in report
abstract_max_length: 1000  # Maximum abstract length in report
