Zoro-chi/Categorizing-fake-news


Fake News Detection - Practical Project

Overview

This project tackles the critical challenge of identifying fake news on social media platforms through Natural Language Processing (NLP) and Machine Learning techniques. The notebook provides a comprehensive analysis pipeline from data exploration to model evaluation.

Project Scenario

Imagine you're working for a social media company concerned about the growing amount of fake news circulating on its platform. This project investigates how fake news can be recognized and develops data-driven methods to detect it.

Dataset

File: fake_news_data.csv

The dataset contains news articles labeled as either "Fake News" or "Factual News", allowing for supervised learning classification.

Project Structure

1. Data Import & Exploration

  • Load and inspect the fake news dataset
  • Visualize class distribution (fake vs. factual articles)
  • Identify potential data quality issues

2. Part-of-Speech (POS) Tagging

  • Process text using spaCy's NLP pipeline
  • Extract POS tags for linguistic analysis
  • Compare grammatical patterns between fake and factual news
  • Analyze most common nouns and their frequencies

3. Named Entity Recognition (NER)

  • Extract named entities (people, organizations, locations, dates)
  • Visualize top entities in fake vs. factual news
  • Identify patterns in entity usage across news types

4. Text Preprocessing

  • Remove location tags and standardize formatting
  • Convert text to lowercase
  • Remove punctuation marks
  • Filter out stopwords
  • Tokenize text into words
  • Apply lemmatization for word normalization
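A dependency-free sketch of the cleaning steps above (lowercasing, punctuation removal, tokenization, stopword filtering); the notebook itself uses NLTK's full stopword list and WordNet lemmatizer, and the tiny stopword set here is only a toy illustration:

```python
# Toy version of the preprocessing pipeline; the notebook uses NLTK's
# stopword list and lemmatizer, which this sketch deliberately omits.
import string

STOPWORDS = {"the", "a", "an", "in", "on", "is", "was", "to", "of", "and"}

def preprocess(text: str) -> list[str]:
    text = text.lower()                                               # lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # strip punctuation
    tokens = text.split()                                             # tokenize
    return [t for t in tokens if t not in STOPWORDS]                  # drop stopwords

print(preprocess("The President spoke, briefly, in Washington."))
# ['president', 'spoke', 'briefly', 'washington']
```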

5. N-gram Analysis

  • Analyze unigrams (single words) frequency
  • Analyze bigrams (two-word phrases) patterns
  • Visualize most common word combinations
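The unigram and bigram counting above needs nothing beyond the standard library once the text is tokenized (NLTK's `ngrams` helper does the same pairing):

```python
# Counting unigrams and bigrams from a token list with the standard library.
from collections import Counter

tokens = ["donald", "trump", "spoke", "donald", "trump", "won"]

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))  # adjacent token pairs

print(unigrams.most_common(2))   # [('donald', 2), ('trump', 2)]
print(bigrams.most_common(1))    # [(('donald', 'trump'), 2)]
```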

6. Sentiment Analysis

  • Apply VADER sentiment analyzer
  • Calculate compound sentiment scores
  • Categorize articles as positive, neutral, or negative
  • Compare sentiment distribution between fake and factual news
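VADER's compound score falls in [-1, 1], and the conventional cutoffs for the three categories are ±0.05. Computing the score itself requires the vaderSentiment (or NLTK) package, so this sketch shows only the categorization step:

```python
# Categorize a VADER compound score using the conventional +/-0.05 cutoffs.
def categorize(compound: float) -> str:
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"

print(categorize(0.62))   # positive
print(categorize(-0.3))   # negative
print(categorize(0.0))    # neutral
```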

7. Topic Modeling

LDA (Latent Dirichlet Allocation)

  • Create document-term matrix
  • Test multiple topic counts using coherence scores
  • Train LDA model to discover latent topics
  • Evaluate topic quality and interpretability
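The notebook uses Gensim for LDA (which also supplies the coherence scores mentioned above); an equivalent minimal sketch with scikit-learn on a toy corpus looks like this:

```python
# Minimal LDA sketch with scikit-learn on a toy corpus; the notebook
# itself uses Gensim's LdaModel plus coherence scoring.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "trump clinton election president",
    "school student county teacher",
    "trump president white house",
    "student school classroom county",
]

vec = CountVectorizer()
dtm = vec.fit_transform(docs)            # document-term matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(dtm)      # per-document topic weights

print(doc_topics.shape)                  # (4, 2)
```

In the notebook, the `n_components` value is swept across several candidates and the coherence-maximizing count is kept.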

LSA (Latent Semantic Analysis) with TF-IDF

  • Apply TF-IDF vectorization for word importance weighting
  • Use LSI model for better topic separation
  • Optimize number of topics based on coherence metrics
  • Generate more distinct and interpretable topics
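LSA is TF-IDF weighting followed by a truncated SVD; the notebook uses Gensim's LsiModel, but the same idea can be sketched with scikit-learn:

```python
# Minimal LSA sketch: TF-IDF vectorization followed by truncated SVD.
# The notebook uses Gensim's LsiModel; the decomposition is equivalent.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "trump clinton obama president",
    "school student county",
    "flynn russian email department",
    "press media coverage story",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)            # TF-IDF weighted document-term matrix

lsa = TruncatedSVD(n_components=2, random_state=0)
topics = lsa.fit_transform(X)            # documents projected onto 2 latent topics

print(topics.shape)                      # (4, 2)
```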

8. Classification Models

Data Preparation

  • Create Bag-of-Words representation using CountVectorizer
  • Split data into training (70%) and testing (30%) sets
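The vectorize-and-split step above, sketched on illustrative toy data (the real notebook reads the text and label columns from fake_news_data.csv):

```python
# Bag-of-Words vectorization plus a stratified 70/30 split.
# The texts and labels here are illustrative stand-ins for the dataset.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

texts = ["fake claim about aliens", "official report released today"] * 10
labels = ["Fake News", "Factual News"] * 10

X = CountVectorizer().fit_transform(texts)   # Bag-of-Words matrix
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.3, random_state=42, stratify=labels
)

print(X_train.shape[0], X_test.shape[0])     # 14 6
```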

Model 1: Logistic Regression

  • Train linear classification model
  • Evaluate accuracy and performance metrics
  • Generate classification report (precision, recall, F1-score)

Model 2: Support Vector Machine (SVM)

  • Train an SGDClassifier (hinge loss) as a linear SVM
  • Compare performance with Logistic Regression
  • Identify best model for production deployment
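Both classifiers can be compared side by side on the same Bag-of-Words features; `SGDClassifier` with hinge loss trains a linear SVM, as in the notebook (the toy data below is perfectly separable, so both models score 1.0 here):

```python
# Side-by-side comparison of Logistic Regression and a linear SVM
# (SGDClassifier with hinge loss) on toy, perfectly separable data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

texts = ["shocking secret exposed", "government publishes report"] * 20
labels = [1, 0] * 20   # 1 = Fake News, 0 = Factual News

X = CountVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.3, random_state=42, stratify=labels
)

accs = {}
for model in (LogisticRegression(), SGDClassifier(loss="hinge", random_state=42)):
    preds = model.fit(X_train, y_train).predict(X_test)
    accs[type(model).__name__] = accuracy_score(y_test, preds)

print(accs)
```

On the real dataset the two models differ (83.33% vs. 86.67% in the results below), which is what motivates picking the SVM.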

Key Technologies & Libraries

NLP Libraries

  • spaCy: Industrial-strength NLP (POS tagging, NER, tokenization)
  • NLTK: Text preprocessing (stopwords, lemmatization, tokenization)
  • VADER Sentiment: Rule-based sentiment analysis

Machine Learning

  • scikit-learn: Classification models, vectorization, evaluation
  • Gensim: Topic modeling (LDA, LSA, TF-IDF)

Data Analysis & Visualization

  • pandas: Data manipulation and analysis
  • matplotlib & seaborn: Data visualization

Installation & Requirements

Install required packages:

pip install -r requirements.txt

Download spaCy language model:

python -m spacy download en_core_web_sm

How to Use

  1. Open the notebook: Launch Jupyter Notebook or VS Code
  2. Run cells sequentially: Execute cells from top to bottom
  3. Explore results: Review visualizations and model outputs
  4. Modify parameters: Experiment with different preprocessing steps or model configurations

Key Findings & Insights

Linguistic Patterns

  • Analyze differences in POS tag distributions
  • Identify characteristic word usage in fake vs. factual news
  • Discover entity mention patterns

Sentiment Differences

  • Compare emotional tone between news types
  • Identify if fake news tends to be more polarized

Topic Themes

Most Common Bigrams (After Preprocessing):

(donald, trump)        92
(united, state)        80
(white, house)         72
(president, donald)    42
(hillary, clinton)     31
(new, york)            31
(image, via)           29
(supreme, court)       29
(official, said)       26
(food, stamp)          24

LDA Topic Modeling Results (6 Topics): The analysis discovered 6 distinct topics in fake news articles, each characterized by key terms like "trump", "clinton", "president", "state", "republican", and various political themes.

LSA Topic Modeling with TF-IDF (5 Topics): LSA produced more interpretable topics including:

  • Political figures: trump, clinton, obama, president
  • Educational themes: school, student, county
  • Government operations: flynn, russian, email, department
  • Media and press coverage

Model Performance

Logistic Regression

  • Accuracy: 83.33%
              precision    recall  f1-score   support

Factual News       0.82      0.82      0.82        28
   Fake News       0.84      0.84      0.84        32

    accuracy                           0.83        60
   macro avg       0.83      0.83      0.83        60
weighted avg       0.83      0.83      0.83        60

Support Vector Machine (SVM)

  • Accuracy: 86.67%
              precision    recall  f1-score   support

Factual News       0.81      0.93      0.87        28
   Fake News       0.93      0.81      0.87        32

    accuracy                           0.87        60
   macro avg       0.87      0.87      0.87        60
weighted avg       0.87      0.87      0.87        60

Winner: The SVM achieved the better overall accuracy (86.67%) along with strong precision for fake news detection (93%).

Results Communication

The notebook includes visualizations and metrics suitable for presenting to stakeholders:

  • Clear bar charts and plots
  • Detailed classification reports
  • Coherence score plots for topic optimization
  • Entity frequency visualizations

Future Enhancements

  • Test additional classification algorithms (Random Forest, Neural Networks)
  • Incorporate deep learning models (BERT, transformers)
  • Expand feature engineering (readability scores, writing style metrics)
  • Perform cross-validation for robust evaluation
  • Deploy model as a web service or API

Author Notes

This practical project demonstrates a complete NLP pipeline for fake news detection, covering:

  • ✅ Data exploration and preprocessing
  • ✅ Linguistic analysis (POS, NER)
  • ✅ Sentiment analysis
  • ✅ Topic modeling
  • ✅ Machine learning classification
  • ✅ Model evaluation and comparison

The comprehensive approach provides valuable insights into distinguishing fake from factual news through multiple analytical lenses.


Dataset: fake_news_data.csv
Project Type: NLP Classification & Analysis

About

A Natural Language Processing project for categorizing fake news and factual news.
