Qyer Forum Scraper

A Python web scraper that automates crawling post content from the Qyer forum (https://bbs.qyer.com/).

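Features

Automatically crawls Qyer forum post lists and full post content
Handles dynamically loaded content (the "Load more" button)
Works around anti-scraping measures (random delays, User-Agent rotation, and similar)
Saves post content as Markdown
Supports resuming interrupted crawls and retrying on errors
Generates a detailed crawl statistics report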
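
The random delay, User-Agent rotation, and retry behaviour listed above could be combined roughly as follows. This is a minimal sketch, not the code in antibot.py: the names USER_AGENTS, pick_user_agent, random_delay, and fetch_with_retries are hypothetical, the User-Agent strings are placeholders, and fetch stands for whatever callable actually loads a page.

```python
import random
import time

# Hypothetical User-Agent pool (placeholder values; keep real ones up to date).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]


def pick_user_agent():
    """Return a randomly chosen User-Agent string from the pool."""
    return random.choice(USER_AGENTS)


def random_delay(delay_range, sleep=time.sleep):
    """Sleep for a random duration within delay_range (seconds) and return it."""
    low, high = delay_range
    seconds = random.uniform(low, high)
    sleep(seconds)
    return seconds


def fetch_with_retries(fetch, url, max_retries=3, delay_range=(1.0, 3.0),
                       sleep=time.sleep):
    """Call fetch(url, headers=...) up to max_retries times.

    Each attempt waits a random delay first and uses a freshly picked
    User-Agent. `fetch` is an assumed callable supplied by the caller.
    """
    last_error = None
    for _attempt in range(max_retries):
        random_delay(delay_range, sleep=sleep)
        try:
            return fetch(url, headers={"User-Agent": pick_user_agent()})
        except Exception as exc:  # broad on purpose: any failure triggers a retry
            last_error = exc
    raise RuntimeError(
        f"giving up on {url} after {max_retries} attempts"
    ) from last_error
```

Passing sleep as a parameter keeps the helper easy to test without real waiting; in normal use the default time.sleep applies.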

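Project Structure

qyer_scraper/
├── __init__.py          # Package initializer
├── models.py            # Data model definitions
├── exceptions.py        # Exception classes
├── config.py            # Configuration manager
├── browser.py           # Browser manager
├── parser.py            # Content parser
├── antibot.py           # Anti-scraping handler
├── processor.py         # Data processor
├── storage.py           # Storage manager
├── progress.py          # Progress/state manager
└── scraper.py           # Main scraper controller

tests/                   # Test directory
├── __init__.py
└── conftest.py          # pytest fixtures and configuration

config.json.template     # Configuration file template
requirements.txt         # Dependency list
pytest.ini               # pytest configuration
main.py                  # Main entry point
README.md                # Project documentation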

Installing Dependencies

pip install -r requirements.txt

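Configuration

Copy the configuration template:

cp config.json.template config.json

Then adjust the parameters as needed:

base_url: base URL of the Qyer forum
output_directory: directory for output files
delay_range: range of the delay between requests (seconds)
max_retries: maximum number of retries
browser_options: browser configuration options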
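
As an illustration of how these parameters might be read, the sketch below loads config.json and falls back to defaults for missing keys. It is not the code in config.py: load_config and DEFAULTS are hypothetical names, the default values are assumptions, and delay_range is assumed to be a two-element [min, max] list.

```python
import json
from pathlib import Path

# Assumed defaults; the real template may use different values.
DEFAULTS = {
    "base_url": "https://bbs.qyer.com/",
    "output_directory": "output",
    "delay_range": [1.0, 3.0],   # assumed shape: [min_seconds, max_seconds]
    "max_retries": 3,
    "browser_options": {},       # keys depend on the browser manager
}


def load_config(path="config.json"):
    """Merge the JSON file at `path` (if present) over DEFAULTS and validate."""
    config = dict(DEFAULTS)
    file = Path(path)
    if file.exists():
        config.update(json.loads(file.read_text(encoding="utf-8")))

    low, high = config["delay_range"]
    if not 0 <= low <= high:
        raise ValueError("delay_range must be [min, max] with 0 <= min <= max")
    if config["max_retries"] < 0:
        raise ValueError("max_retries must not be negative")
    return config
```

Validating up front means a mistyped delay range fails at startup rather than partway through a crawl.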

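Usage

Basic Usage

# Start a new crawl
python main.py

# Use a specific configuration file
python main.py --config my_config.json

# Show verbose output
python main.py --verbose

Resuming Interrupted Crawls

# List all resumable sessions
python main.py --list-sessions

# Show details for a specific session
python main.py --session-info SESSION_ID

# Resume crawling for the given session
python main.py --resume SESSION_ID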
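
For orientation, the flags above could be declared with argparse along these lines. This is only a sketch of one possible layout, not the parser in main.py: build_parser is a hypothetical name, the default of config.json for --config is an assumption, and so is treating the three session flags as mutually exclusive.

```python
import argparse


def build_parser():
    """Build a command-line parser for the flags documented above."""
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="Qyer forum scraper",
    )
    # Assumed default: the config.json copied from the template.
    parser.add_argument("--config", default="config.json",
                        help="path to the configuration file")
    parser.add_argument("--verbose", action="store_true",
                        help="show verbose output")

    # Assumption: only one session-related action per invocation.
    session = parser.add_mutually_exclusive_group()
    session.add_argument("--list-sessions", action="store_true",
                         help="list all resumable sessions")
    session.add_argument("--session-info", metavar="SESSION_ID",
                         help="show details for a session")
    session.add_argument("--resume", metavar="SESSION_ID",
                         help="resume the given session")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```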

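Session Management

The scraper can resume an interrupted crawl. If a run stops partway (for example because of a network error or a user interrupt), it can continue from the point where it stopped:

Automatic progress saving: the scraper saves its progress automatically, including the list of processed posts and success/failure counts
Session recovery: use --resume SESSION_ID to resume a previously interrupted crawl
Progress overview: use --list-sessions to see all resumable sessions and their progress
Automatic cleanup: session files older than 7 days are removed automatically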
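
The session behaviour described above can be pictured as one small JSON file per session. The sketch below is an assumption about how that might work, not the code in progress.py: the function names, the file layout (one SESSION_ID.json per session), and the field names are all hypothetical.

```python
import json
import time
from pathlib import Path

MAX_AGE_SECONDS = 7 * 24 * 3600  # sessions older than 7 days are cleaned up


def save_progress(progress_dir, session_id, processed_urls, succeeded, failed):
    """Write the session state to <progress_dir>/<session_id>.json."""
    directory = Path(progress_dir)
    directory.mkdir(parents=True, exist_ok=True)
    state = {
        "session_id": session_id,
        "processed_urls": list(processed_urls),
        "succeeded": succeeded,
        "failed": failed,
        "updated_at": time.time(),
    }
    path = directory / f"{session_id}.json"
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state, ensure_ascii=False), encoding="utf-8")
    tmp.replace(path)  # swap in the finished file so a crash cannot leave half a file
    return path


def load_progress(progress_dir, session_id):
    """Read a previously saved session state."""
    path = Path(progress_dir) / f"{session_id}.json"
    return json.loads(path.read_text(encoding="utf-8"))


def pending_urls(all_urls, state):
    """Return the URLs not yet processed in this session, keeping their order."""
    done = set(state["processed_urls"])
    return [url for url in all_urls if url not in done]


def cleanup_old_sessions(progress_dir, now=None, max_age=MAX_AGE_SECONDS):
    """Delete session files not updated within max_age; return their IDs."""
    now = time.time() if now is None else now
    removed = []
    for path in Path(progress_dir).glob("*.json"):
        state = json.loads(path.read_text(encoding="utf-8"))
        if now - state.get("updated_at", 0) > max_age:
            path.unlink()
            removed.append(path.stem)
    return removed
```

On resume, a crawler built this way would call load_progress, pass the full post list through pending_urls, and continue with what remains.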

Running Tests

# Run all tests
pytest

# Run unit tests
pytest -m unit

# Run property-based tests
pytest -m property

# Show test coverage
pytest --cov=qyer_scraper

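Output Structure

Crawled content is saved under the configured output directory:

output/
├── posts/               # Post content in Markdown format
├── metadata/            # Metadata in JSON format
├── progress/            # Progress state files (used for resuming)
└── logs/                # Log files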
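
To make the layout concrete, the sketch below writes one post into posts/ and its metadata into metadata/. It is an illustration only, not the code in storage.py: save_post and safe_filename are hypothetical names, and the post fields (id, title, author, url, content) and the file naming scheme are assumptions.

```python
import json
import re
from pathlib import Path


def safe_filename(title, max_length=60):
    """Replace whitespace and characters that are invalid in file names."""
    name = re.sub(r'[\\/:*?"<>|\s]+', "_", title).strip("_")
    return name[:max_length] or "untitled"


def save_post(output_dir, post):
    """Write a post as Markdown plus a JSON metadata file.

    `post` is assumed to be a dict with id, title, author, url and content.
    Returns the paths of the Markdown and metadata files.
    """
    root = Path(output_dir)
    posts_dir = root / "posts"
    metadata_dir = root / "metadata"
    posts_dir.mkdir(parents=True, exist_ok=True)
    metadata_dir.mkdir(parents=True, exist_ok=True)

    stem = f"{post['id']}_{safe_filename(post['title'])}"
    markdown = "\n".join([
        f"# {post['title']}",
        "",
        f"Author: {post['author']}",
        f"Source: {post['url']}",
        "",
        post["content"],
        "",
    ])
    post_path = posts_dir / f"{stem}.md"
    post_path.write_text(markdown, encoding="utf-8")

    # Metadata holds everything except the body text.
    metadata = {key: value for key, value in post.items() if key != "content"}
    metadata_path = metadata_dir / f"{stem}.json"
    metadata_path.write_text(
        json.dumps(metadata, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return post_path, metadata_path
```

Prefixing the file name with the post ID keeps names unique even when two posts share a title.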

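Notes

Respect the site's robots.txt and terms of use
Set reasonable delays to avoid putting excessive load on the server
Prefer running the scraper during off-peak hours
Update the User-Agent list and other anti-detection measures regularly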

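Development Status

The project is currently under development. The core framework is in place; the actual scraping logic will be implemented in later tasks.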

