Detection and Discovery of Misinformation Sources using Attributed Webgraphs

Interactive News Webgraph

To explore the webgraph related to these news sites, check out our interactive webgraph exploration tool built on top of the CommonCrawl dataset.

Introduction

These scripts can be used to train classifiers on the NewsSEO dataset and are based on the research paper "Detection and Discovery of Misinformation Sources using Attributed Webgraphs" [PDF]. If you use, extend, or build upon this project, please cite the following paper (upcoming at ICWSM 2024):

@article{carragher2024detection,
  title={Detection and Discovery of Misinformation Sources using Attributed Webgraphs},
  author={Carragher, Peter and Williams, Evan M and Carley, Kathleen M},
  journal={arXiv preprint arXiv:2401.02379},
  year={2024}
}

Inputs

Follow the readme to populate the data directory with the NewsSEO dataset
Webgraph data & SEO attributes have been pulled from ahrefs.com
Labels have been scraped from mediabiasfactcheck.com using this open-source scraper

Environment Setup

pip3 install -r requirements.txt
# Generate edge weights
cd analysis && python3 weights.py && cd ../ 
# Run GNN weight scheme experiments
python3 gnns/train.py 0 
# Run GNN top N backlink experiments
python3 gnns/train.py 1
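
For scripted runs, the commands above could be wrapped in a small Python driver. This is only a sketch: the names `EXPERIMENTS`, `build_commands` and `run_experiment` are hypothetical and not part of the repository, and the mapping of `0` and `1` to experiments is taken from the comments above.

```python
import subprocess
import sys

# Mode argument for gnns/train.py, as described in the comments above.
EXPERIMENTS = {
    "weight_schemes": "0",   # GNN weight scheme experiments
    "top_n_backlinks": "1",  # GNN top N backlink experiments
}


def build_commands(experiment, python=sys.executable):
    """Return (argv, working_directory) pairs for one experiment."""
    if experiment not in EXPERIMENTS:
        raise ValueError(f"unknown experiment: {experiment!r}")
    return [
        # Generate edge weights first (run from inside analysis/).
        ([python, "weights.py"], "analysis"),
        # Then train the GNN from the repository root.
        ([python, "gnns/train.py", EXPERIMENTS[experiment]], "."),
    ]


def run_experiment(experiment):
    """Run each step in order and stop at the first failure."""
    for argv, cwd in build_commands(experiment):
        subprocess.run(argv, cwd=cwd, check=True)
```

Calling `run_experiment("weight_schemes")` from the repository root would then be equivalent to running the weight generation step followed by `python3 gnns/train.py 0`.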

Outputs

This code is provided as is, and neither the author nor the university is responsible for maintaining it. It provides the following functionality:

A classifier that predicts the reliability of news sources
A classifier that predicts the political leaning of news sources
A discovery system that finds more unreliable news sources from an initial list of news sites
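
To give a rough idea of how the discovery functionality fits together, the sketch below follows outlinks from a seed list and keeps the unseen domains that a classifier scores as likely unreliable. It is not the algorithm from the paper (see analysis/link_scheme_identification.ipynb and flat_models/discovery.ipynb for that). The function name, the `outlinks` dictionary format, the `score_fn` callable and the `threshold` parameter are all assumptions made for illustration.

```python
from collections import Counter


def discover_candidates(seed_sites, outlinks, score_fn, threshold=0.5):
    """Rank unseen domains linked from the seed sites.

    seed_sites: iterable of known unreliable domains.
    outlinks:   dict mapping a domain to the domains it links to
                (hypothetical format, not the NewsSEO schema).
    score_fn:   callable returning an estimated probability that a
                domain is unreliable (stand-in for a trained classifier).
    """
    seeds = set(seed_sites)
    link_counts = Counter()
    for site in seeds:
        # Count each seed at most once per target domain.
        for target in set(outlinks.get(site, ())):
            if target not in seeds:
                link_counts[target] += 1

    scored = [(domain, score_fn(domain), count)
              for domain, count in link_counts.items()]
    flagged = [item for item in scored if item[1] >= threshold]
    # Highest score first, then most seed links, then name for stability.
    return sorted(flagged, key=lambda item: (-item[1], -item[2], item[0]))
```

Each returned tuple holds the candidate domain, its score, and the number of seed sites linking to it, so the output can be reviewed manually before any domain is added to the seed list.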

More specifically, the repository is organized as follows:

analysis:
- blogping.ipynb: blogping / user generated features analysis
- country.ipynb: country and continent breakdown of URLs based on hosting IP addresses
- link_scheme_identification.ipynb: Algorithm 1 (link scheme identification) of the misinformation source discovery system
- weights.py: generate and save edge weight schemes to file
- weight_distributions.ipynb: plots for distribution of edge weights as computed in weights.py
data_collection:
- Ahrefs R API scripts:
  - ahref_backlinks.R: fetch backlinks for a given list of domains
  - ahref_outlinks.R: fetch outlinks for a given list of domains
  - ahref_nodes.R: fetch attributes for a given list of domains
evaluation:
- headlines.ipynb: a scraper built on top of newspaper3k that outputs a .json file compatible with label_studio for manual evaluation of news articles
- krippendorf.ipynb: inter-annotator agreement between domain level reliability ratings
- label_studio_config.xml: configuration for label studio that can create labeling jobs from json output of headlines.ipynb
flat_models:
- bias_removal.ipynb: analysis of bias removal techniques for political bias in dataset
- discovery.ipynb: runs the misinfo and bias classifiers on the link scheme outlinks.
  - First, outlinks must be generated from output of link_scheme_identification.ipynb.
  - Also trains the news source classifier.
- train_classifiers.ipynb: training & analysis of the misinfo & bias classifiers
gnns:
- For weighted experiments, first run weights.py on both backlinks and outlinks
- model.py: defines the GCN used for predicting reliability & bias labels on the webgraphs
- seo_import.py: sets up data imports, labels & batching for webgraphs
- train.py: trains the GCN defined in model.py using the data imported via seo_import.py for all tasks, network variants & edge weightings
- results.ipynb: plots the results of the GNN experiments: heatmap & top N backlink plots (see results folder)
politicalnews:
- For evaluation, we also train our models on the PoliticalNews dataset as described in Castelo et al., 2019
- discovery_seo.ipynb: implementation of the partial F1 evaluation metric as defined by Chen & Freire, 2020
- train_misinfo.ipynb: comparison of misinfo classifiers trained with SEO features on the MBFC and politicalnews datasets respectively
survival_rates:
- url_list/: folder containing url lists used to train the parked domain classifiers
- parked_domain_sample.ipynb: scrapes positive examples of parked domains from sedo.com
- *_features.csv: features generated from the Parked Domain Classifier
- parked_domain_classifier.ipynb: trains the parked domain classifier with features from Vissers et al., 2015
- requests.ipynb: sends GET requests to domain lists & checks responses
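
For readers unfamiliar with the GCN mentioned under gnns above, the following sketch shows a single graph convolution step in the commonly used formulation ReLU(D^-1/2 (A + I) D^-1/2 H W). It is written in plain Python for readability and is not the code in gnns/model.py. The function names are hypothetical, and the sketch assumes a symmetric adjacency matrix with non-negative edge weights, which may differ from how the repository treats directed backlinks and outlinks.

```python
import math


def normalize_adjacency(adj):
    """Add self-loops, then apply symmetric degree normalization."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    # The self-loop keeps every degree at 1 or more for non-negative weights.
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def gcn_layer(adj, features, weight):
    """One propagation step: aggregate neighbors, project, apply ReLU."""
    aggregated = matmul(normalize_adjacency(adj), features)
    projected = matmul(aggregated, weight)
    return [[max(value, 0.0) for value in row] for row in projected]
```

Here `adj` would hold the (optionally weighted) links between domains, `features` the per-domain SEO attributes, and `weight` the learned projection matrix. Stacking two such layers and ending with a softmax is a typical setup for node classification.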

License

BSD 3-Clause License

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
