Hourly collection of conflict-related news on Sudan from national, regional, and international sources. The toolset aggregates full articles, links, dates, images, and metadata into a transparent, queryable dataset for research, monitoring, and decision support.
Why this exists: Existing datasets (e.g., ACLED, UCDP) are valuable but can be delayed and opaque about sources. This scraper emphasizes timeliness (hourly jobs) and source transparency (URLs + full text where allowed).
- Multi-source coverage – National, regional, and international outlets (APIs + static/dynamic websites).
- Hourly automation – Production runs via cron/Task Scheduler; incremental updates (see the crontab sketch after this list).
- Transparent data – Store source URL + article text (where legally permitted).
- De-duplication – URL-based duplicate detection, with update-on-collision logic (sketch below).
- Export friendly – CSV/Excel examples for quick analysis using pandas (sketch below).
- Modular design – Add or modify crawlers independently (`src/crawlers/`, `src/utils/`; sketch below).
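To schedule hourly runs on Linux/macOS, a crontab entry along these lines would work; the working directory, virtual environment path, and entry script (`run_scraper.py`) are hypothetical placeholders, so substitute your actual locations:

```
# Run the scraper at the top of every hour and append output to a log.
0 * * * * cd /path/to/sudan_web_scraper && .venv/bin/python run_scraper.py >> scraper.log 2>&1
```

On Windows, the same cadence can be configured as an hourly trigger in Task Scheduler.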
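The update-on-collision logic can be sketched as an upsert keyed on the article URL. This is a minimal illustration assuming a hypothetical SQLite table; the repository's actual storage layer may differ:

```python
import sqlite3

def upsert_article(conn: sqlite3.Connection, url: str, text: str, fetched_at: str) -> None:
    """Insert a new article, or refresh its text/timestamp if the URL is already stored."""
    conn.execute(
        """
        INSERT INTO articles (url, text, fetched_at)
        VALUES (?, ?, ?)
        ON CONFLICT(url) DO UPDATE SET
            text = excluded.text,
            fetched_at = excluded.fetched_at
        """,
        (url, text, fetched_at),
    )
    conn.commit()

conn = sqlite3.connect("articles.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS articles "
    "(url TEXT PRIMARY KEY, text TEXT, fetched_at TEXT)"
)
# Running this twice with the same URL updates the row instead of duplicating it.
upsert_article(conn, "https://example.com/story", "Full article text...", "2025-08-01T12:00:00Z")
```

Making the URL the primary key turns duplicate detection into a constraint the database enforces, so hourly re-crawls simply refresh existing rows.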
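Exports can be pulled straight into pandas for quick analysis. The file name and column names below (`date`, `source`) are illustrative assumptions rather than the repository's actual schema:

```python
import pandas as pd

# Load the exported CSV, parsing the publication date column.
df = pd.read_csv("sudan_articles.csv", parse_dates=["date"])

# Articles per outlet, most prolific sources first.
print(df["source"].value_counts())

# Daily article counts, useful for spotting coverage spikes.
print(df.set_index("date").resample("D").size().tail())

# Share a filtered slice as Excel (requires the openpyxl package).
df[df["date"] >= "2025-01-01"].to_excel("sudan_articles_2025.xlsx", index=False)
```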
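New sources plug in as standalone modules under `src/crawlers/`. The sketch below shows one plausible shape for that shared contract; the interface actually used in this repository may differ:

```python
from dataclasses import dataclass

@dataclass
class Article:
    url: str
    title: str
    text: str
    published: str  # ISO-8601 date string

class BaseCrawler:
    """Shared contract: every crawler returns a list of Article records."""
    source_name: str = "base"

    def fetch(self) -> list[Article]:
        raise NotImplementedError

class ExampleOutletCrawler(BaseCrawler):
    """Hypothetical crawler for a single outlet."""
    source_name = "example-outlet"

    def fetch(self) -> list[Article]:
        # A real crawler would request the outlet's site or API here.
        return [
            Article(
                url="https://example.com/story",
                title="Example headline",
                text="Full article text...",
                published="2025-08-01",
            )
        ]
```

Because each crawler is self-contained, adding an outlet means adding one module without touching the others.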
Requirements:
- Python 3.11+
- Dependencies listed in `requirements.txt` or `environment.yml`
Clone the repository and create the environment:
```
git clone https://github.com/stccenter/sudan_web_scraper
cd sudan_web_scraper
conda env create -f environment.yml
conda activate sudan-scraper
```

Or manually install with pip:
```
python -m venv .venv
source .venv/bin/activate   # On Windows: .venv\Scripts\activate
pip install -r requirements.txt
```

If you use this toolset, please cite:

```bibtex
@article{sudan_scraper_2025,
title = {Automating Data Collection to Support Conflict Analysis: Scraping the Internet for Monitoring Hourly Conflict in Sudan},
author = {Yahya Masri and Anusha Srirenganathan and Samir Ahmed and others},
year = {2025},
note = {Under revision},
url = {https://github.com/stccenter/sudan_web_scraper}
}
```