This project tackles the critical challenge of identifying fake news on social media platforms through Natural Language Processing (NLP) and Machine Learning techniques. The notebook provides a comprehensive analysis pipeline from data exploration to model deployment.
Imagine you're working for a social media company concerned about the growing amount of fake news circulating on its platform. This project investigates how fake news can be recognized and creates methods to identify it through data-driven analysis.
File: fake_news_data.csv
The dataset contains news articles labeled as either "Fake News" or "Factual News", allowing for supervised learning classification.
- Load and inspect the fake news dataset
- Visualize class distribution (fake vs. factual articles)
- Identify potential data quality issues
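The exploration steps above can be sketched as follows. This is a minimal sketch: the column names `text` and `label` are assumptions about the schema of `fake_news_data.csv`, and the inline DataFrame stands in for the real CSV load.

```python
# Sketch of the load/inspect step; "text" and "label" are assumed column names.
import pandas as pd

# In the notebook this would be: df = pd.read_csv("fake_news_data.csv")
df = pd.DataFrame({
    "text": [
        "BREAKING: shocking claim goes viral",
        "Officials confirmed the budget figures on Tuesday",
        "You won't believe what happened next",
    ],
    "label": ["Fake News", "Factual News", "Fake News"],
})

df.info()                # dtypes and non-null counts: a quick quality check
print(df.isna().sum())   # missing values per column
class_counts = df["label"].value_counts()
print(class_counts)      # class balance, the input to the distribution bar chart
```

Plotting `class_counts` with `class_counts.plot(kind="bar")` produces the fake-vs-factual distribution chart.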
- Process text using spaCy's NLP pipeline
- Extract POS tags for linguistic analysis
- Compare grammatical patterns between fake and factual news
- Analyze most common nouns and their frequencies
- Extract named entities (people, organizations, locations, dates)
- Visualize top entities in fake vs. factual news
- Identify patterns in entity usage across news types
- Remove location tags and standardize formatting
- Convert text to lowercase
- Remove punctuation marks
- Filter out stopwords
- Tokenize text into words
- Apply lemmatization for word normalization
- Analyze unigrams (single words) frequency
- Analyze bigrams (two-word phrases) patterns
- Visualize most common word combinations
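Bigram counting over already-preprocessed token lists can be sketched in plain Python; the sample token lists below are invented.

```python
# Count adjacent word pairs (bigrams) across preprocessed documents.
from collections import Counter

docs = [
    ["donald", "trump", "white", "house"],
    ["president", "donald", "trump"],
    ["white", "house", "official", "said"],
]

# zip(tokens, tokens[1:]) pairs each token with its successor
bigram_counts = Counter(
    pair for tokens in docs for pair in zip(tokens, tokens[1:])
)
print(bigram_counts.most_common(3))
```

Unigram counts work the same way with `Counter(token for tokens in docs for token in tokens)`, and `most_common(n)` feeds the frequency bar charts directly.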
- Apply VADER sentiment analyzer
- Calculate compound sentiment scores
- Categorize articles as positive, neutral, or negative
- Compare sentiment distribution between fake and factual news
- Create document-term matrix
- Test multiple topic counts using coherence scores
- Train LDA model to discover latent topics
- Evaluate topic quality and interpretability
- Apply TF-IDF vectorization for word importance weighting
- Use LSA (LSI) model for better topic separation
- Optimize number of topics based on coherence metrics
- Generate more distinct and interpretable topics
- Create Bag-of-Words representation using CountVectorizer
- Split data into training (70%) and testing (30%) sets
- Train linear classification model
- Evaluate accuracy and performance metrics
- Generate classification report (precision, recall, F1-score)
- Train SGDClassifier for linear SVM
- Compare performance with Logistic Regression
- Identify best model for production deployment
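The classification stage above can be sketched end-to-end with scikit-learn. The texts and labels below are synthetic stand-ins for the real dataset, so the scores they produce are not the notebook's results.

```python
# Bag-of-words + Logistic Regression vs. linear SVM (SGDClassifier) sketch.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic, trivially separable stand-in data
texts = [f"shocking viral hoax story number {i}" for i in range(10)] \
      + [f"officials reported verified budget figures {i}" for i in range(10)]
labels = ["Fake News"] * 10 + ["Factual News"] * 10

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels)

vectorizer = CountVectorizer()                 # Bag-of-Words representation
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

log_reg = LogisticRegression().fit(X_train_bow, y_train)
svm = SGDClassifier(loss="hinge", random_state=42).fit(X_train_bow, y_train)

acc_lr = accuracy_score(y_test, log_reg.predict(X_test_bow))
acc_svm = accuracy_score(y_test, svm.predict(X_test_bow))
report = classification_report(y_test, svm.predict(X_test_bow))
print(f"Logistic Regression accuracy: {acc_lr:.2f}")
print(f"Linear SVM accuracy: {acc_svm:.2f}")
print(report)
```

`loss="hinge"` is what makes `SGDClassifier` a linear SVM; swapping the loss changes the model family without changing the rest of the pipeline.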
- spaCy: Industrial-strength NLP (POS tagging, NER, tokenization)
- NLTK: Text preprocessing (stopwords, lemmatization, tokenization)
- VADER Sentiment: Rule-based sentiment analysis
- scikit-learn: Classification models, vectorization, evaluation
- Gensim: Topic modeling (LDA, LSA, TF-IDF)
- pandas: Data manipulation and analysis
- matplotlib & seaborn: Data visualization
Install required packages:

```bash
pip install -r requirements.txt
```

Download the spaCy language model:

```bash
python -m spacy download en_core_web_sm
```

- Open the notebook: Launch Jupyter Notebook or VS Code
- Run cells sequentially: Execute cells from top to bottom
- Explore results: Review visualizations and model outputs
- Modify parameters: Experiment with different preprocessing steps or model configurations
- Analyze differences in POS tag distributions
- Identify characteristic word usage in fake vs. factual news
- Discover entity mention patterns
- Compare emotional tone between news types
- Identify if fake news tends to be more polarized
Most Common Bigrams (After Preprocessing):

```
(donald, trump)      92
(united, state)      80
(white, house)       72
(president, donald)  42
(hillary, clinton)   31
(new, york)          31
(image, via)         29
(supreme, court)     29
(official, said)     26
(food, stamp)        24
```
LDA Topic Modeling Results (6 Topics): The analysis discovered six distinct topics in the fake news articles, each characterized by key terms such as "trump", "clinton", "president", "state", and "republican", largely reflecting political themes.
LSA Topic Modeling with TF-IDF (5 Topics): LSA produced more interpretable topics including:
- Political figures: trump, clinton, obama, president
- Educational themes: school, student, county
- Government operations: flynn, russian, email, department
- Media and press coverage
Logistic Regression:
- Accuracy: 83.33%

```
              precision  recall  f1-score  support
Factual News       0.82    0.82      0.82       28
Fake News          0.84    0.84      0.84       32
accuracy                             0.83       60
macro avg          0.83    0.83      0.83       60
weighted avg       0.83    0.83      0.83       60
```
Linear SVM (SGDClassifier):
- Accuracy: 86.67%

```
              precision  recall  f1-score  support
Factual News       0.81    0.93      0.87       28
Fake News          0.93    0.81      0.87       32
accuracy                             0.87       60
macro avg          0.87    0.87      0.87       60
weighted avg       0.87    0.87      0.87       60
```
Winner: the linear SVM achieved the higher overall accuracy (86.67% vs. 83.33%) and stronger precision on fake news detection (0.93).
The notebook includes visualizations and metrics suitable for presenting to stakeholders:
- Clear bar charts and plots
- Detailed classification reports
- Coherence score plots for topic optimization
- Entity frequency visualizations
- Test additional classification algorithms (Random Forest, Neural Networks)
- Incorporate deep learning models (BERT, transformers)
- Expand feature engineering (readability scores, writing style metrics)
- Perform cross-validation for robust evaluation
- Deploy model as a web service or API
This practical project demonstrates a complete NLP pipeline for fake news detection, covering:
- ✅ Data exploration and preprocessing
- ✅ Linguistic analysis (POS, NER)
- ✅ Sentiment analysis
- ✅ Topic modeling
- ✅ Machine learning classification
- ✅ Model evaluation and comparison
The comprehensive approach provides valuable insights into distinguishing fake from factual news through multiple analytical lenses.
Dataset: fake_news_data.csv
Project Type: NLP Classification & Analysis