monochandan/Text-CheckWorthy-Classification

Check-Worthiness Estimation of Text Data

With the rapid rise of online misinformation, it's increasingly important to prioritize which claims are worth fact-checking. Check-worthiness estimation tackles this by classifying whether a statement, such as a tweet or debate quote, merits verification. However, challenges such as subjectivity, data imbalance, and linguistic ambiguity make this task difficult.

While recent benchmarks like CheckThat! Lab at CLEF 2024 have been dominated by transformer-based and LLM-based models (e.g., RoBERTa, GPT-4, LLaMA2), these often demand high computational resources. Whether using transformer-based models or traditional machine learning approaches, my focus was on addressing this problem efficiently and in a trustworthy manner, while keeping in mind the variety and complexity of textual data. This project explores whether ensemble-based traditional ML models, supported by resampling techniques, can remain competitive. I have also experimented with QLoRA for memory-efficient fine-tuning of large models, offering a practical alternative to resource-heavy approaches.

This research addresses the growing need to identify claims worth fact-checking, especially in the age of widespread misinformation. Focusing on English-language datasets from U.S. presidential debate transcripts, we apply a range of resampling methods to tackle data imbalance and explore multiple machine learning approaches from traditional models to fine-tuned LLMs using memory-efficient techniques like QLoRA.

Key contributions include:

  • Handling large amounts of imbalanced text data.

  • Prompt engineering (few-shot, zero-shot) for data labeling and data processing.

  • Scraping tweet data to validate the model, with prompt engineering for label automation.

  • Leveraging data pruning techniques with LLM prompt engineering and various NLP libraries.

  • Evaluation of linguistic, contextual, and semantic features.

  • Ensemble strategies that significantly boost performance.

  • Benchmarking on CLEF 2024's CheckThat! dataset and additional tweet data.

  • Preliminary results show ensemble-based classical models outperform current CLEF 2024 LLM-based baselines.

  • Fine-tuning a state-of-the-art LLM with QLoRA also gives better performance than LoRA.

Custom Ensemble Learning

Implemented advanced blending and stacking ensembles using:

  • Manual out-of-fold training and prediction logic
  • Integration of diverse base models (e.g., XGBoost, Logistic Regression, Decision Tree, Random Forest, AdaBoost, Gradient Boosting, LightGBM, KNN)
  • Final meta-learner trained on base model predictions

Achieved improved F1 scores compared to traditional ensemble methods (e.g., VotingClassifier).
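The out-of-fold stacking logic described above can be sketched as follows. This is a minimal illustration and not the repository's implementation: `NearestCentroid`, `kfold_indices`, `out_of_fold_scores`, `fit_stack`, and `predict_stack` are hypothetical names, the toy data is synthetic, and the tiny centroid classifier only stands in for real base models and a real meta-learner. The key point is that the meta-learner is trained on scores each base model produced for rows it did not see during fitting.

```python
import math
import random


class NearestCentroid:
    """Hypothetical stand-in estimator: predicts the class with the closer mean."""

    def __init__(self, columns=None):
        self.columns = columns  # feature columns to use; None means all

    def _select(self, row):
        return list(row) if self.columns is None else [row[c] for c in self.columns]

    def fit(self, X, y):
        self.centroids = {}
        for label in (0, 1):
            rows = [self._select(r) for r, t in zip(X, y) if t == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict_proba(self, X):
        out = []
        for row in X:
            v = self._select(row)
            d0, d1 = (sum((a - b) ** 2 for a, b in zip(v, self.centroids[k]))
                      for k in (0, 1))
            p1 = 1 / (1 + math.exp(d1 - d0))  # closer to class 1 -> p1 > 0.5
            out.append([1 - p1, p1])
        return out

    def predict(self, X):
        return [1 if p[1] >= 0.5 else 0 for p in self.predict_proba(X)]


def kfold_indices(n_rows, n_folds, seed=0):
    """Shuffle row indices and deal them into n_folds disjoint folds."""
    order = list(range(n_rows))
    random.Random(seed).shuffle(order)
    return [order[i::n_folds] for i in range(n_folds)]


def out_of_fold_scores(make_model, X, y, folds):
    """Score each row with a model that did not see that row during fit."""
    scores = [0.0] * len(X)
    for fold in folds:
        held_out = set(fold)
        train = [i for i in range(len(X)) if i not in held_out]
        model = make_model().fit([X[i] for i in train], [y[i] for i in train])
        for i, p in zip(fold, model.predict_proba([X[i] for i in fold])):
            scores[i] = float(p[1])  # positive-class probability
    return scores


def fit_stack(base_factories, make_meta, X, y, n_folds=5, seed=0):
    """Fit the meta-learner on out-of-fold base scores, then refit the bases."""
    folds = kfold_indices(len(X), n_folds, seed)
    columns = [out_of_fold_scores(f, X, y, folds) for f in base_factories]
    meta = make_meta().fit([list(r) for r in zip(*columns)], y)
    bases = [f().fit(X, y) for f in base_factories]  # final models use all rows
    return bases, meta


def predict_stack(bases, meta, X):
    """Feed the base models' positive-class scores to the meta-learner."""
    columns = [[float(p[1]) for p in b.predict_proba(X)] for b in bases]
    return list(meta.predict([list(r) for r in zip(*columns)]))


# Toy, fully synthetic data: class 1 has larger values in both columns.
X = [[(i % 2) + 0.1 * (i % 5), 1.5 * (i % 2) - 0.1 * (i % 5)] for i in range(40)]
y = [i % 2 for i in range(40)]
bases, meta = fit_stack(
    [lambda: NearestCentroid([0]), lambda: NearestCentroid([1])],
    NearestCentroid, X, y)
preds = predict_stack(bases, meta, X)
print(sum(p == t for p, t in zip(preds, y)) / len(y))
```

Because the helper functions only call `fit`, `predict_proba`, and `predict`, estimators that expose the same methods and return themselves from `fit` (as scikit-learn classifiers do) should be usable in place of the stand-in class.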

LLMs used so far:

  • gemini-1.5-flash - prompt engineering for class-label automation on scraped tweets.
  • BERT - text classification.
  • multilingual BERT
  • XLM-RoBERTa
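A few-shot labeling prompt of the kind mentioned for gemini-1.5-flash might be assembled as in the sketch below. Everything here is an assumption made for illustration: the instruction wording, the two example statements, the Yes/No answer format, and the helper names `build_prompt` and `parse_label` are hypothetical, and the call to the model itself is deliberately left out. Passing an empty example list gives the zero-shot variant.

```python
# Hypothetical few-shot examples; real ones would come from labeled data.
FEW_SHOT_EXAMPLES = [
    ("Unemployment fell to 3.5 percent last year.", "Yes"),
    ("Thank you all for being here tonight.", "No"),
]


def build_prompt(text, examples=FEW_SHOT_EXAMPLES):
    """Build a labeling prompt; pass examples=() for a zero-shot prompt."""
    lines = [
        "Decide whether each statement is worth fact-checking.",
        "Answer with exactly one word: Yes or No.",
        "",
    ]
    for example_text, example_label in examples:
        lines += [f"Statement: {example_text}", f"Answer: {example_label}", ""]
    lines += [f"Statement: {text}", "Answer:"]
    return "\n".join(lines)


def parse_label(reply):
    """Map a free-form reply to 1 (check-worthy), 0, or None if unclear."""
    words = reply.strip().split()
    if not words:
        return None
    first = words[0].strip(".,!:;\"'").lower()
    return {"yes": 1, "no": 0}.get(first)


prompt = build_prompt("The vaccine was approved in December 2020.")
print(prompt)
print(parse_label("Yes."))
```

Returning `None` for unclear replies keeps ambiguous model output out of the labeled set instead of silently forcing it into a class.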

Classical models used so far (hyperparameter-tuned for English, Arabic, and Spanish):

  • Random Forest
  • XGBoost (XGB)
  • Decision Tree
  • KNN
  • Gradient Boosting (GDB)
  • LightGBM (LGBM)
  • AdaBoost (ADB)

Benchmark Dataset from CLEF 2024


English

| Dataset | Not CheckWorthy | CheckWorthy |
| --- | --- | --- |
| Training Data | 17087 | 5413 |
| UnderSampled Training Data | 7189 | 5408 |
| Dev Data (hyperparameter tuning) | 794 | 238 |
| Dev Test Data (model test) | 210 | 108 |
| Test Data (model test) | 253 | 88 |
| Tweet Data (model test) | 124 | 107 |
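The README does not say which resampling method produced the undersampled training row above. As one plausible reading, the sketch below shows plain random undersampling of the majority class down to a chosen majority-to-minority ratio. The function name `undersample`, its parameters, and the toy data are hypothetical.

```python
import random


def undersample(texts, labels, ratio=1.0, majority_label=0, seed=0):
    """Randomly drop majority rows until majority <= ratio * minority."""
    rng = random.Random(seed)
    majority = [i for i, label in enumerate(labels) if label == majority_label]
    minority = [i for i, label in enumerate(labels) if label != majority_label]
    keep_n = min(len(majority), round(ratio * len(minority)))
    # Keep every minority row plus a random subset of majority rows,
    # restoring the original row order afterwards.
    kept = sorted(rng.sample(majority, keep_n) + minority)
    return [texts[i] for i in kept], [labels[i] for i in kept]


# Toy data: 12 majority rows (label 0) and 4 minority rows (label 1).
texts = [f"sentence {i}" for i in range(16)]
labels = [0] * 12 + [1] * 4
new_texts, new_labels = undersample(texts, labels, ratio=1.5)
print(new_labels.count(0), new_labels.count(1))
```

A ratio above 1.0 leaves the classes mildly imbalanced rather than forcing an exact 50/50 split, which discards fewer majority examples.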

Trained on benchmark dataset and tested on test data

| Models | Accuracy | Precision | Recall | F1 Score |
| --- | --- | --- | --- | --- |
| XGBoost | 0.830 | 0.8222 | 0.830 | 0.821 |
| Gradient Boosting | 0.798 | 0.789 | 0.798 | 0.768 |
| LightGBM | 0.8333 | 0.837 | 0.833 | 0.8111 |
| Logistic Regression | 0.748 | 0.816 | 0.748 | 0.763 |
| Voting Classifier (Soft: XGB + GDB + LR) | 0.833 | 0.825 | 0.833 | 0.825 |
| Voting Classifier (Hard: XGB + LGBM + ADB + LR) | 0.836 | 0.850 | 0.836 | 0.811 |
| Voting Classifier (Hard: XGB + GDB + LR) | 0.878 | 0.881 | 0.877 | 0.878 |
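In nearly every row of the result tables the Recall column equals the Accuracy column. That is what support-weighted averaging over both classes produces, so the sketch below assumes the metrics are weighted averages; the README does not state the averaging mode, and `weighted_metrics` is a hypothetical helper written for illustration.

```python
def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1."""
    n = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    precision = recall = f1 = 0.0
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        support = sum(t == label for t in y_true)
        prec = tp / predicted if predicted else 0.0  # undefined counts as 0
        rec = tp / support
        f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        weight = support / n  # each class weighted by its share of the rows
        precision += weight * prec
        recall += weight * rec
        f1 += weight * f
    return accuracy, precision, recall, f1


# Toy labels: 6 negatives and 4 positives.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
accuracy, precision, recall, f1 = weighted_metrics(y_true, y_pred)
print(accuracy, precision, recall, f1)
```

Weighted recall always equals accuracy, because each class's recall is multiplied by its share of the rows. In scikit-learn the same averaging is selected with `average="weighted"` in `precision_recall_fscore_support`.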

Performance of classical ML models trained on the undersampled English dataset and tested on the dev test dataset

| Model | Accuracy | Precision | Recall | F1 |
| --- | --- | --- | --- | --- |
| Decision Tree | 0.709 | 0.714 | 0.709 | 0.711 |
| KNN | 0.671 | 0.657 | 0.671 | 0.660 |
| XGBoost | 0.867 | 0.872 | 0.867 | 0.868 |
| Random Forest | 0.728 | 0.717 | 0.728 | 0.714 |
| Gradient Boosting | 0.861 | 0.860 | 0.861 | 0.857 |
| LightGBM | 0.845 | 0.843 | 0.845 | 0.842 |
| AdaBoost | 0.813 | 0.815 | 0.813 | 0.803 |
| Logistic Regression | 0.851 | 0.874 | 0.851 | 0.855 |
| Voting Classifier (Soft: XGB + GDB + LR) | 0.884 | 0.889 | 0.883 | 0.884 |
| Voting Classifier (Hard: XGB + LGBM + ADB + LR) | 0.864 | 0.863 | 0.864 | 0.862 |
| Voting Classifier (Hard: XGB + GDB + LR) | 0.877 | 0.881 | 0.877 | 0.878 |
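The "Soft" and "Hard" labels on the voting rows refer to two ways of combining base models. The sketch below shows the difference on made-up probabilities. The helper names `hard_vote` and `soft_vote` are hypothetical, and the tie-breaking rule (smallest label wins) is an assumption chosen to match scikit-learn's documented behavior for `VotingClassifier`.

```python
from collections import Counter


def hard_vote(predictions):
    """predictions: one list of class labels per model; majority label per row."""
    result = []
    for row in zip(*predictions):
        counts = Counter(row)
        best = max(counts.values())
        # On a tie, pick the smallest label.
        result.append(min(label for label, c in counts.items() if c == best))
    return result


def soft_vote(probabilities, threshold=0.5):
    """probabilities: one list of positive-class probabilities per model."""
    return [1 if sum(row) / len(row) >= threshold else 0
            for row in zip(*probabilities)]


# Three models, two rows. In row 0 two models lean negative, one is confident.
probs = [[0.45, 0.6], [0.45, 0.8], [0.9, 0.2]]
labels = [[1 if p >= 0.5 else 0 for p in model] for model in probs]
hard = hard_vote(labels)
soft = soft_vote(probs)
print(hard, soft)
```

In the first row the two weakly negative models outvote the confident one under hard voting, while averaging probabilities lets that confidence carry the decision. This is one reason soft and hard ensembles of the same models can score differently.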

Performance of different models on the original dataset and evaluated on dev test data

| Model | Accuracy | Precision | Recall | F1 |
| --- | --- | --- | --- | --- |
| Decision Tree | 0.718 | 0.720 | 0.718 | 0.719 |
| KNN | 0.687 | 0.685 | 0.687 | 0.615 |
| XGBoost | 0.791 | 0.824 | 0.791 | 0.764 |
| Random Forest | 0.753 | 0.746 | 0.753 | 0.746 |
| Gradient Boosting | 0.801 | 0.831 | 0.801 | 0.777 |
| LightGBM | 0.807 | 0.828 | 0.807 | 0.788 |
| AdaBoost | 0.785 | 0.814 | 0.785 | 0.757 |
| Logistic Regression | 0.851 | 0.857 | 0.851 | 0.844 |
| Voting Classifier (Soft: XGB + GDB + LR) | 0.813 | 0.840 | 0.813 | 0.793 |

Performance of different classifiers on the undersampled dataset (evaluated on test data):

| Model | Accuracy | Precision | Recall | F1 |
| --- | --- | --- | --- | --- |
| KNN | 0.669 | 0.641 | 0.669 | 0.653 |
| XGBoost | 0.718 | 0.785 | 0.718 | 0.735 |
| Gradient Boosting | 0.780 | 0.790 | 0.780 | 0.784 |
| LightGBM | 0.812 | 0.812 | 0.812 | 0.812 |
| AdaBoost | 0.7333 | 0.701 | 0.733 | 0.709 |
| Logistic Regression | 0.724 | 0.814 | 0.724 | 0.741 |
| Voting Classifier (Soft: XGB + GDB + LR) | 0.777 | 0.818 | 0.777 | 0.788 |
| Voting Classifier (Hard: XGB + LGBM + ADB + LR) | 0.806 | 0.808 | 0.806 | 0.807 |

Performance of classifiers trained on the undersampled dataset, evaluated on tweet data

Query for tweet data scraping:

QUERY = "COVID OR vaccine OR pandemic OR lockdown OR #COVID19 OR " \
        "#VaccineMandate elections OR Biden OR Trump OR Joe Biden OR " \
        "DOGE OR FBI OR Donald Trump OR Ukraine OR Russia OR " \
        "Middle East Crisis OR South Asia Crisis OR UN meeting OR" \
        " US Congress OR US Republic OR geopolitics OR War OR" \
        " #Politics OR #Election2025 OR #UkraineRussiaWar OR" \
        " #Trump OR #DOGE OR #Trump2025 #MiddleEastCrisis OR" \
        " #Geopolitics OR #USRepublic OR #USCongress OR #Bangladesh (COVID OR" \
        " OR OR vaccine OR OR OR pandemic OR OR OR lockdown OR OR OR" \
        " #COVID19 OR OR OR #VaccineMandate OR elections OR OR OR" \
        " Biden OR OR OR Trump OR OR OR Joe OR Biden OR OR OR" \
        " DOGE OR OR OR FBI OR OR OR Donald OR Trump OR OR OR" \
        " Ukraine OR OR OR Russia OR OR OR Middle OR " \
        "East OR Crisis OR OR OR South OR Asia OR Crisis OR OR OR" \
        " UN OR meeting OR OR OR US OR Congress OR OR OR US OR" \
        " Republic OR OR OR geopolitics OR OR OR War OR OR OR" \
        " #Politics OR OR OR #Election2025 OR OR OR #UkraineRussiaWar OR OR OR" \
        " #Trump OR OR OR #DOGE OR OR OR #Trump2025 OR #MiddleEastCrisis OR OR OR" \
        " #Geopolitics OR OR OR #USRepublic OR OR OR #USCongress OR OR OR #Bangladesh)" \
        " lang:en until:2022-12-31 since:2020-01-01"
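The repeated `OR OR OR` runs in the query above look like the result of joining a term list that already contained `OR` tokens or empty entries. A small builder that deduplicates terms and joins them once avoids this. The sketch is an illustration only: `build_query` is a hypothetical helper, the term list is shortened, and quoting multi-word phrases is an assumption about the intended match (the original query leaves them unquoted).

```python
def build_query(terms, lang="en", since="2020-01-01", until="2022-12-31"):
    """Join search terms with OR and append the language and date filters."""
    cleaned = [t.strip() for t in terms if t.strip()]   # drop empty entries
    unique = list(dict.fromkeys(cleaned))               # dedupe, keep order
    # Quote multi-word phrases so they are treated as one unit (assumption).
    quoted = [f'"{t}"' if " " in t else t for t in unique]
    return f"({' OR '.join(quoted)}) lang:{lang} until:{until} since:{since}"


query = build_query(["COVID", "Joe Biden", "", "#DOGE", "COVID"])
print(query)
```

Keeping the terms in a list also makes it easier to review what was actually searched when the scraped tweets are later used as an evaluation set.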

| Model | Accuracy | Precision | Recall | F1 |
| --- | --- | --- | --- | --- |
| Decision Tree | 0.641 | 0.698 | 0.641 | 0.625 |
| XGBoost | 0.623 | 0.659 | 0.623 | 0.613 |
| Gradient Boosting | 0.654 | 0.653 | 0.654 | 0.653 |
| LightGBM | 0.649 | 0.648 | 0.649 | 0.648 |
| AdaBoost | 0.632 | 0.639 | 0.632 | 0.616 |
| Logistic Regression | 0.688 | 0.740 | 0.688 | 0.678 |
| Voting Classifier (Soft: XGB + GDB + LR) | 0.658 | 0.675 | 0.658 | 0.656 |
| Voting Classifier (Hard: XGB + LGBM + ADB + LR) | 0.671 | 0.670 | 0.671 | 0.670 |

Performance of LLMs on the English dataset across different evaluation datasets:

| Model | Dataset | Accuracy | Precision | F1 |
| --- | --- | --- | --- | --- |
| DBERTa (8 epochs) | Test Data | 0.836 | 0.834 | 0.819 |
| DBERTa (8 epochs) | Dev Test Data | 0.851 | 0.861 | 0.843 |
| DBERTa (8 epochs) | Tweet Data | 0.779 | 0.788 | 0.775 |
| BERT (QLoRA, undersampled, hyperparameter-tuned with Optuna, 40 epochs) | Test Data | 0.839 | 0.834 | 0.825 |
| BERT (QLoRA, undersampled, hyperparameter-tuned with Optuna, 40 epochs) | Dev Test Data | 0.804 | 0.819 | 0.8786 |
| BERT (QLoRA, undersampled, hyperparameter-tuned with Optuna, 40 epochs) | Tweet Data | 0.745 | 0.750 | 0.741 |
| RoBERTa-Base (25 epochs) | Test Data | 0.821 | 0.818 | 0.819 |
| RoBERTa-Base (25 epochs) | Dev Test Data | 0.848 | 0.847 | 0.847 |
| RoBERTa-Base (25 epochs) | Tweet Data | 0.736 | 0.739 | 0.736 |

Dutch

| Dataset | CheckWorthy | Not CheckWorthy |
| --- | --- | --- |
| Training Data | 405 | 590 |
| Dev Data (hyperparameter tuning) | 102 | 150 |
| Dev Test Data (model test) | 316 | 350 |
| Test Data (model test) | 397 | 603 |

Result

| Models | Accuracy | Precision | Recall | F1 Score |
| --- | --- | --- | --- | --- |
| KNN Classifier | 0.518 | 0.521 | 0.518 | 0.518 |
| Decision Tree Classifier | 0.538 | 0.545 | 0.538 | 0.534 |
| LightGBM | 0.506 | 0.506 | 0.506 | 0.506 |
| Gradient Boosting | 0.498 | 0.493 | 0.498 | 0.492 |
| Random Forest | 0.494 | 0.500 | 0.494 | 0.489 |
| Voting (Soft: XGB + GDB + LR) | 0.505 | 0.507 | 0.505 | 0.504 |
| Voting (Soft: Decision Tree + KNN + Random Forest + XGB) | 0.508 | 0.522 | 0.508 | 0.490 |
| Voting (Hard: Decision Tree + KNN + Random Forest + XGB) | 0.520 | 0.525 | 0.520 | 0.517 |

Arabic

| Dataset | Not CheckWorthy | CheckWorthy |
| --- | --- | --- |
| Training Data | 5090 | 2243 |
| Dev Data (hyperparameter tuning) | 682 | 411 |
| Dev Test Data (model test) | 377 | 123 |

Best performance on the Arabic dataset (Dev-Test data):

| Models | Accuracy | Precision | Recall | F1 Score |
| --- | --- | --- | --- | --- |
| Random Forest Classifier | 0.754 | 0.569 | 0.754 | 0.648 |
| XGB Classifier | 0.572 | 0.698 | 0.572 | 0.602 |
| Logistic Regression | 0.538 | 0.700 | 0.538 | 0.568 |
| Voting Classifier (Soft: RF + XGB + DT + KNN) | 0.486 | 0.706 | 0.486 | 0.508 |
| Voting Classifier (Hard: RF + XGB + DT + KNN) | 0.478 | 0.703 | 0.478 | 0.499 |
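When classes are imbalanced, it helps to compare each row against a classifier that always predicts the majority class. The sketch below computes that baseline from class counts alone, assuming support-weighted averaging and counting the undefined minority-class precision as 0; `majority_baseline` is a hypothetical helper. With the Arabic dev-test counts (377 not check-worthy, 123 check-worthy) the baseline rounds to the same four values as the Random Forest row above. That suggests, but does not prove, that the model predicted only the majority class on this split.

```python
def majority_baseline(n_majority, n_minority):
    """Weighted metrics of a classifier that always predicts the majority class."""
    share = n_majority / (n_majority + n_minority)
    # Majority class: precision = share, recall = 1. Minority class: all zeros.
    majority_f1 = 2 * share / (1 + share)
    return {
        "accuracy": share,
        "precision": share * share,   # weight * precision of the majority class
        "recall": share,              # weight * recall (1.0) of the majority class
        "f1": share * majority_f1,
    }


baseline = majority_baseline(377, 123)
print({name: round(value, 3) for name, value in baseline.items()})
```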

Spanish

| Dataset | Not CheckWorthy | CheckWorthy |
| --- | --- | --- |
| Training Data | 16862 | 3182 |
| Dev Data (hyperparameter tuning) | 4296 | 704 |
| Dev Test Data (model test) | 4491 | 509 |

Best-performing models on the Spanish dataset (dev-test):

| Models | Accuracy | Precision | Recall | F1 Score |
| --- | --- | --- | --- | --- |
| Gradient Boosting | 0.900 | 0.872 | 0.900 | 0.874 |
| LightGBM | 0.878 | 0.835 | 0.878 | 0.852 |
| AdaBoost | 0.898 | 0.863 | 0.898 | 0.859 |
| XGB Classifier | 0.848 | 0.845 | 0.848 | 0.846 |
| Voting Classifier (Soft: GDB + LGBM + ADB) | 0.894 | 0.854 | 0.894 | 0.861 |
| Voting Classifier (Hard: GDB + LGBM + ADB) | 0.897 | 0.860 | 0.897 | 0.861 |
| Voting Classifier (Soft: GDB + XGB + LR) | 0.871 | 0.855 | 0.871 | 0.862 |
| KNN | 0.849 | 0.835 | 0.849 | 0.842 |
| Decision Tree Classifier | 0.840 | 0.842 | 0.840 | 0.841 |

Multilingual Data (Arabic, Spanish, English, Dutch)

MultiLingual BERT (PEFT) QLoRA
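As a reminder of why QLoRA is memory-efficient, the low-rank update it trains can be written as below. The pretrained weight matrix W_0 stays frozen (in QLoRA it is stored in 4-bit precision and dequantized for the forward pass) and only the small matrices A and B are trained. The rank r = 8 in the worked arithmetic is an illustrative assumption, not a setting taken from this repository; 768 is the hidden size of BERT-base models.

```latex
h = W_0 x + \frac{\alpha}{r} B A x,
\qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)

\text{trainable parameters per adapted matrix: } r(d + k) \quad \text{versus} \quad dk

d = k = 768,\; r = 8: \quad \frac{r(d + k)}{dk} = \frac{12288}{589824} \approx 0.021
```

Under these assumptions each adapted matrix trains roughly 2% of the parameters that full fine-tuning of that matrix would.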

Performance of Multilingual BERT (trained with QLoRA) on merged dataset

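| Epochs | Datasets | Accuracy | Precision | F1 Score |
| --- | --- | --- | --- | --- |
| 3 | Test | 0.865 | 0.852 | 0.847 |
| 3 | Dev Test | 0.834 | 0.820 | 0.819 |
| 8 | Test | 0.872 | 0.862 | 0.858 |
| 8 | Dev Test | 0.842 | 0.831 | 0.832 |
| 20 | Test | 0.879 | 0.869 | 0.868 |
| 20 | Dev Test | 0.846 | 0.835 | 0.836 |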

This result has already surpassed the best scores from CLEF 2024, as seen here.

[Image: @checkThat]
