This project tackles the critical challenge of identifying fake news on social media platforms through Natural Language Processing (NLP) and Machine Learning techniques. The notebook provides a comprehensive analysis pipeline from data exploration to model deployment.
Imagine you're working for a social media company concerned about the growing amount of fake news circulating on its platform. This project investigates how fake news can be recognized and creates methods to identify it through data-driven analysis.
File: fake_news_data.csv
The dataset contains news articles labeled as either "Fake News" or "Factual News", allowing for supervised learning classification.
- Load and inspect the fake news dataset
- Visualize class distribution (fake vs. factual articles)
- Identify potential data quality issues
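The exploration steps above can be sketched as follows. This is a minimal sketch: the column names `text` and `label` are assumptions about the schema of `fake_news_data.csv`, and the inline DataFrame stands in for the real CSV load.

```python
# Sketch of the load/inspect step; "text" and "label" are assumed column names.
import pandas as pd

# In the notebook this would be: df = pd.read_csv("fake_news_data.csv")
df = pd.DataFrame({
    "text": [
        "BREAKING: shocking claim goes viral",
        "Officials confirmed the budget figures on Tuesday",
        "You won't believe what happened next",
    ],
    "label": ["Fake News", "Factual News", "Fake News"],
})

df.info()                # dtypes and non-null counts: a quick quality check
print(df.isna().sum())   # missing values per column
class_counts = df["label"].value_counts()
print(class_counts)      # class balance, the input to the distribution bar chart
```

Plotting `class_counts` with `class_counts.plot(kind="bar")` produces the fake-vs-factual distribution chart.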
- Process text using spaCy's NLP pipeline
- Extract POS tags for linguistic analysis
- Compare grammatical patterns between fake and factual news
- Analyze most common nouns and their frequencies
- Extract named entities (people, organizations, locations, dates)
- Visualize top entities in fake vs. factual news
- Identify patterns in entity usage across news types
- Remove location tags and standardize formatting
- Convert text to lowercase
- Remove punctuation marks
- Filter out stopwords
- Tokenize text into words
- Apply lemmatization for word normalization
- Analyze unigrams (single words) frequency
- Analyze bigrams (two-word phrases) patterns
- Visualize most common word combinations
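Bigram counting over already-preprocessed token lists can be sketched in plain Python; the sample token lists below are invented.

```python
# Count adjacent word pairs (bigrams) across preprocessed documents.
from collections import Counter

docs = [
    ["donald", "trump", "white", "house"],
    ["president", "donald", "trump"],
    ["white", "house", "official", "said"],
]

# zip(tokens, tokens[1:]) pairs each token with its successor
bigram_counts = Counter(
    pair for tokens in docs for pair in zip(tokens, tokens[1:])
)
print(bigram_counts.most_common(3))
```

Unigram counts work the same way with `Counter(token for tokens in docs for token in tokens)`, and `most_common(n)` feeds the frequency bar charts directly.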
- Apply VADER sentiment analyzer
- Calculate compound sentiment scores
- Categorize articles as positive, neutral, or negative
- Compare sentiment distribution between fake and factual news
- Create document-term matrix
- Test multiple topic counts using coherence scores
- Train LDA model to discover latent topics
- Evaluate topic quality and interpretability
- Apply TF-IDF vectorization for word importance weighting
- Use LSA (LSI) model for better topic separation
- Optimize number of topics based on coherence metrics
- Generate more distinct and interpretable topics
- Create Bag-of-Words representation using CountVectorizer
- Split data into training (70%) and testing (30%) sets
- Train linear classification model
- Evaluate accuracy and performance metrics
- Generate classification report (precision, recall, F1-score)
- Train SGDClassifier for linear SVM
- Compare performance with Logistic Regression
- Identify best model for production deployment
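The classification stage above can be sketched end-to-end with scikit-learn. The texts and labels below are synthetic stand-ins for the real dataset, so the scores they produce are not the notebook's results.

```python
# Bag-of-words + Logistic Regression vs. linear SVM (SGDClassifier) sketch.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic, trivially separable stand-in data
texts = [f"shocking viral hoax story number {i}" for i in range(10)] \
      + [f"officials reported verified budget figures {i}" for i in range(10)]
labels = ["Fake News"] * 10 + ["Factual News"] * 10

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels)

vectorizer = CountVectorizer()                 # Bag-of-Words representation
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

log_reg = LogisticRegression().fit(X_train_bow, y_train)
svm = SGDClassifier(loss="hinge", random_state=42).fit(X_train_bow, y_train)

acc_lr = accuracy_score(y_test, log_reg.predict(X_test_bow))
acc_svm = accuracy_score(y_test, svm.predict(X_test_bow))
report = classification_report(y_test, svm.predict(X_test_bow))
print(f"Logistic Regression accuracy: {acc_lr:.2f}")
print(f"Linear SVM accuracy: {acc_svm:.2f}")
print(report)
```

`loss="hinge"` is what makes `SGDClassifier` a linear SVM; swapping the loss changes the model family without changing the rest of the pipeline.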
- spaCy: Industrial-strength NLP (POS tagging, NER, tokenization)
- NLTK: Text preprocessing (stopwords, lemmatization, tokenization)
- VADER Sentiment: Rule-based sentiment analysis
- scikit-learn: Classification models, vectorization, evaluation
- Gensim: Topic modeling (LDA, LSA, TF-IDF)
- pandas: Data manipulation and analysis
- matplotlib & seaborn: Data visualization
Install required packages:

```bash
pip install -r requirements.txt
```

Download the spaCy language model:

```bash
python -m spacy download en_core_web_sm
```

- Open the notebook: Launch Jupyter Notebook or VS Code
- Run cells sequentially: Execute cells from top to bottom
- Explore results: Review visualizations and model outputs
- Modify parameters: Experiment with different preprocessing steps or model configurations
- Analyze differences in POS tag distributions
- Identify characteristic word usage in fake vs. factual news
- Discover entity mention patterns
- Compare emotional tone between news types
- Identify if fake news tends to be more polarized
Most Common Bigrams (After Preprocessing):

```
(donald, trump)      92
(united, state)      80
(white, house)       72
(president, donald)  42
(hillary, clinton)   31
(new, york)          31
(image, via)         29
(supreme, court)     29
(official, said)     26
(food, stamp)        24
```
LDA Topic Modeling Results (6 Topics): The analysis discovered six distinct topics in the fake news articles, each characterized by key terms such as "trump", "clinton", "president", "state", and "republican", largely reflecting political themes.
LSA Topic Modeling with TF-IDF (5 Topics): LSA produced more interpretable topics including:
- Political figures: trump, clinton, obama, president
- Educational themes: school, student, county
- Government operations: flynn, russian, email, department
- Media and press coverage
Logistic Regression:
- Accuracy: 83.33%

```
              precision  recall  f1-score  support
Factual News       0.82    0.82      0.82       28
Fake News          0.84    0.84      0.84       32
accuracy                             0.83       60
macro avg          0.83    0.83      0.83       60
weighted avg       0.83    0.83      0.83       60
```
Linear SVM (SGDClassifier):
- Accuracy: 86.67%

```
              precision  recall  f1-score  support
Factual News       0.81    0.93      0.87       28
Fake News          0.93    0.81      0.87       32
accuracy                             0.87       60
macro avg          0.87    0.87      0.87       60
weighted avg       0.87    0.87      0.87       60
```
Winner: the linear SVM achieved the higher overall accuracy (86.67% vs. 83.33%) and stronger precision on fake news detection (0.93).
The notebook includes visualizations and metrics suitable for presenting to stakeholders:
- Clear bar charts and plots
- Detailed classification reports
- Coherence score plots for topic optimization
- Entity frequency visualizations
- Test additional classification algorithms (Random Forest, Neural Networks)
- Incorporate deep learning models (BERT, transformers)
- Expand feature engineering (readability scores, writing style metrics)
- Perform cross-validation for robust evaluation
- Deploy model as a web service or API
This practical project demonstrates a complete NLP pipeline for fake news detection, covering:
- ✅ Data exploration and preprocessing
- ✅ Linguistic analysis (POS, NER)
- ✅ Sentiment analysis
- ✅ Topic modeling
- ✅ Machine learning classification
- ✅ Model evaluation and comparison
The comprehensive approach provides valuable insights into distinguishing fake from factual news through multiple analytical lenses.
Dataset: fake_news_data.csv
Project Type: NLP Classification & Analysis