Sudan Web Scraper

Hourly collection of conflict-related news on Sudan from national, regional, and international sources. The toolset aggregates full articles, links, dates, images, and metadata into a transparent, queryable dataset for research, monitoring, and decision support.

Why this exists: Existing datasets (e.g., ACLED, UCDP) are valuable but can be delayed and opaque about sources. This scraper emphasizes timeliness (hourly jobs) and source transparency (URLs + full text where allowed).


Key Features

  • Multi-source coverage – National, regional, and international outlets (APIs + static/dynamic websites).
  • Hourly automation – Production via cron/Task Scheduler; incremental updates.
  • Transparent data – Source URL stored with the article text (where legally permitted).
  • De-duplication – URL-based duplicate detection, with update-on-collision logic.
  • Export friendly – CSV/Excel examples for quick analysis using Pandas.
  • Modular design – Add or modify crawlers independently (src/crawlers/, src/utils/).
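
The URL-based de-duplication with update-on-collision mentioned above can be pictured roughly as follows. This is a minimal sketch, not the repository's actual implementation: the `Article` fields, the `upsert_articles` function, and the in-memory dict store are all hypothetical names chosen for illustration, and the real crawlers may normalize URLs or persist records differently.

```python
from dataclasses import dataclass


@dataclass
class Article:
    """Hypothetical record shape; the repository's real schema may differ."""
    url: str
    title: str
    text: str
    scraped_at: str  # ISO 8601 timestamp of the scrape


def upsert_articles(store: dict[str, Article], batch: list[Article]) -> tuple[int, int]:
    """Merge a freshly scraped batch into the store, keyed by URL.

    A URL not seen before is inserted; a URL already present is
    overwritten with the newer record (update-on-collision).
    Returns (inserted, updated) counts for logging.
    """
    inserted = updated = 0
    for article in batch:
        if article.url in store:
            updated += 1
        else:
            inserted += 1
        store[article.url] = article
    return inserted, updated
```

Keying on the URL keeps each hourly run incremental: re-scraping an article that was edited after publication refreshes its text instead of creating a second row.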

Installation

Requirements

  • Python 3.11+
  • Dependencies listed in requirements.txt or environment.yml

Setup

Clone the repository and create the environment:

git clone https://github.com/stccenter/sudan_web_scraper
cd sudan_web_scraper
conda env create -f environment.yml
conda activate sudan-scraper

Or manually install with pip:

python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -r requirements.txt

Citation

@article{sudan_scraper_2025,
  title   = {Automating Data Collection to Support Conflict Analysis: Scraping the Internet for Monitoring Hourly Conflict in Sudan},
  author  = {Yahya Masri and Anusha Srirenganathan and Samir Ahmed and others},
  year    = {2025},
  note    = {Revisions},
  url     = {https://github.com/stccenter/sudan_web_scraper}
}
