With the rapid rise of online misinformation, it is increasingly important to prioritize which claims are worth fact-checking. Check-worthiness estimation tackles this by classifying whether a statement, such as a tweet or a debate quote, merits verification. However, subjectivity, data imbalance, and linguistic ambiguity make this task difficult.
Recent benchmarks such as the CheckThat! Lab at CLEF 2024 have been dominated by transformer-based and LLM-based models (e.g., RoBERTa, GPT-4, LLaMA2), which often demand high computational resources. Whether using transformer-based models or traditional machine learning approaches, my focus was on addressing this problem efficiently and in a trustworthy manner, while keeping in mind the variety and complexity of textual data. This project explores whether ensembles of traditional ML models, supported by resampling techniques, can remain competitive. I have also experimented with QLoRA for memory-efficient fine-tuning of large models, offering a practical alternative to resource-heavy approaches.
This research addresses the growing need to identify claims worth fact-checking, especially in the age of widespread misinformation. Focusing on English-language datasets from U.S. presidential debate transcripts, we apply a range of resampling methods to tackle data imbalance and explore multiple machine learning approaches from traditional models to fine-tuned LLMs using memory-efficient techniques like QLoRA.

- How to handle a large amount of imbalanced text data.
- Prompt engineering (few-shot, zero-shot) for data labeling and data processing.
- Tweet data scraping to validate the model, with prompt engineering for label automation.
- Leveraging data pruning techniques with LLM prompt engineering and various NLP libraries.
- Evaluation of linguistic, contextual, and semantic features.
- Ensemble strategies that significantly boost performance.
- Benchmarking on CLEF 2024's CheckThat! dataset and additional tweet data.
- Preliminary results show ensemble-based classical models outperform current CLEF 2024 LLM-based baselines.
- Fine-tuning a state-of-the-art LLM with QLoRA also gives better performance compared to LoRA.

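
The few-shot and zero-shot prompting used for label automation could look roughly like the sketch below. It only builds the prompt text and parses a reply; the instruction wording, the example statements, and the helper names `build_prompt` and `parse_label` are all hypothetical, and no call to any model API is shown.

```python
# Hypothetical few-shot examples; illustrative only, not taken from the project's data.
FEW_SHOT_EXAMPLES = [
    ("Unemployment fell to 3.5 percent last year.", "Yes"),
    ("I love spending time with my family.", "No"),
]


def build_prompt(text, examples=FEW_SHOT_EXAMPLES):
    """Assemble a prompt asking for a Yes/No check-worthiness label."""
    lines = [
        "Decide whether the statement contains a factual claim worth fact-checking.",
        "Answer with exactly one word: Yes or No.",
        "",
    ]
    for example_text, label in examples:
        lines += [f"Statement: {example_text}", f"Answer: {label}", ""]
    lines += [f"Statement: {text}", "Answer:"]
    return "\n".join(lines)


def parse_label(response):
    """Map a raw model reply to 1 (check-worthy), 0 (not), or None if unclear."""
    words = response.strip().lower().split()
    first = words[0].strip(".,!:;\"'") if words else ""
    return {"yes": 1, "no": 0}.get(first)


few_shot_prompt = build_prompt("The senator voted against the bill in 2019.")
zero_shot_prompt = build_prompt("The senator voted against the bill in 2019.", examples=[])
```

Passing an empty `examples` list gives the zero-shot variant. Returning `None` for unclear replies makes it easy to route those tweets to manual review instead of guessing a label.
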
Custom Ensemble Learning
Implemented advanced blending and stacking ensembles using:
- Manual out-of-fold training and prediction logic
- Integration of diverse base models (e.g., XGBoost, Logistic Regression, Decision Tree, Random Forest, Ada Boost, Gradient Boosting, Light GBM, KNN)
- Final meta-learner trained on base model predictions

Achieved improved F1 scores compared to traditional ensemble methods (e.g., VotingClassifier).

- gemini-1.5-flash - prompt engineering for class label automation of scraped tweets.
- BERT - LLM model for text classification.
- multilingual BERT
- xlm-RoBERTa
- Random Forest
- XGB
- Decision Tree
- KNN
- GDB
- LGBM
- ADB
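
The manual out-of-fold stacking described under Custom Ensemble Learning can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual code: `fit_stacking` and `predict_stacking` are hypothetical helper names, the data is synthetic, and only scikit-learn models are used as stand-ins (XGBoost and LightGBM are left out to avoid extra dependencies).

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, train_test_split


def fit_stacking(base_models, meta_model, X, y, n_splits=5, seed=42):
    """Build out-of-fold predictions per base model, then fit the meta-learner."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    oof = np.zeros((X.shape[0], len(base_models)))
    for j, model in enumerate(base_models):
        for train_idx, valid_idx in skf.split(X, y):
            fold_model = clone(model).fit(X[train_idx], y[train_idx])
            # Each row is predicted only by a model that never saw it.
            oof[valid_idx, j] = fold_model.predict_proba(X[valid_idx])[:, 1]
    fitted_bases = [clone(m).fit(X, y) for m in base_models]  # refit on all rows
    fitted_meta = clone(meta_model).fit(oof, y)
    return fitted_bases, fitted_meta


def predict_stacking(fitted_bases, fitted_meta, X):
    """Stack base-model probabilities and let the meta-learner decide."""
    features = np.column_stack([m.predict_proba(X)[:, 1] for m in fitted_bases])
    return fitted_meta.predict(features)


# Synthetic, imbalanced stand-in data (not the CLEF dataset).
X, y = make_classification(n_samples=600, n_features=20,
                           weights=[0.75, 0.25], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

bases = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=50, random_state=0),
    GradientBoostingClassifier(random_state=0),
]
fitted_bases, fitted_meta = fit_stacking(bases, LogisticRegression(), X_train, y_train)
y_pred = predict_stacking(fitted_bases, fitted_meta, X_test)
print("weighted F1:", round(f1_score(y_test, y_pred, average="weighted"), 3))
```

Training the meta-learner on out-of-fold predictions rather than in-sample predictions keeps it from learning how well each base model memorized the training rows.
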
Benchmark Dataset from CLEF 2024

| Dataset | Not CheckWorthy | CheckWorthy |
|---|---|---|
| Training Data | 17087 | 5413 |
| UnderSampled Training Data | 7189 | 5408 |
| Dev Data (Hyperparameter Tuning) | 794 | 238 |
| Dev Test Data (Model Test) | 210 | 108 |
| Test Data (Model Test) | 253 | 88 |
| Tweet Data (Model Test) | 124 | 107 |
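
The table shows the Not CheckWorthy class reduced from 17087 to 7189 rows by undersampling. The exact procedure is not stated here, so the sketch below assumes plain random undersampling of the majority class to a chosen size; the small change in the CheckWorthy count is not explained in the source and is not modelled. The function name `undersample` and the toy data are hypothetical.

```python
import random
from collections import Counter


def undersample(texts, labels, majority_label, target_size, seed=42):
    """Randomly keep `target_size` majority-class rows; keep every other row."""
    rng = random.Random(seed)
    majority_idx = [i for i, y in enumerate(labels) if y == majority_label]
    other_idx = [i for i, y in enumerate(labels) if y != majority_label]
    kept = rng.sample(majority_idx, min(target_size, len(majority_idx)))
    indices = sorted(kept + other_idx)  # preserve the original ordering
    return [texts[i] for i in indices], [labels[i] for i in indices]


# Toy data: 8 "No" (not check-worthy) vs 3 "Yes" (check-worthy).
texts = [f"sentence {i}" for i in range(11)]
labels = ["No"] * 8 + ["Yes"] * 3

new_texts, new_labels = undersample(texts, labels, majority_label="No", target_size=4)
print(Counter(new_labels))  # Counter({'No': 4, 'Yes': 3})
```

Only the training split should be resampled this way; the dev and test splits keep their original class balance so that reported scores reflect the real distribution.
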

| Models | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| XGBoost | 0.830 | 0.8222 | 0.830 | 0.821 |
| Gradient Boosting | 0.798 | 0.789 | 0.798 | 0.768 |
| Light GBM | 0.8333 | 0.837 | 0.833 | 0.8111 |
| Logistic Regression | 0.748 | 0.816 | 0.748 | 0.763 |
| Voting Classifier (Soft: XGB + GDB + LR) | 0.833 | 0.825 | 0.833 | 0.825 |
| Voting Classifier (Hard: XGB + LGBM + ADB + LR) | 0.836 | 0.850 | 0.836 | 0.811 |
| Voting Classifier (Hard: XGB + GDB + LR) | 0.878 | 0.881 | 0.877 | 0.878 |
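
In the result tables the Recall column closely tracks Accuracy, which is what support-weighted averaging produces (for example `average="weighted"` in scikit-learn). Assuming that is how the scores were computed, which the source does not state, a dependency-free sketch of the calculation looks like this. The function name and the toy labels are hypothetical.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1."""
    support = Counter(y_true)
    n = len(y_true)
    precision = recall = f1 = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        p_c = tp / predicted if predicted else 0.0
        r_c = tp / count
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = count / n  # class share of the true labels
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
metrics = weighted_metrics(y_true, y_pred)
print({name: round(value, 3) for name, value in metrics.items()})
```

Weighted recall always equals accuracy, because each class's recall is multiplied by that class's share of the data. On imbalanced data this means the weighted scores are dominated by the majority class, so per-class F1 for the check-worthy class is worth reporting alongside them.
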

| Model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| Decision Tree | 0.709 | 0.714 | 0.709 | 0.711 |
| KNN | 0.671 | 0.657 | 0.671 | 0.660 |
| XGBoost | 0.867 | 0.872 | 0.867 | 0.868 |
| Random Forest | 0.728 | 0.717 | 0.728 | 0.714 |
| Gradient Boosting | 0.861 | 0.860 | 0.861 | 0.857 |
| Light GBM | 0.845 | 0.843 | 0.845 | 0.842 |
| Ada Boost | 0.813 | 0.815 | 0.813 | 0.803 |
| Logistic Regression | 0.851 | 0.874 | 0.851 | 0.855 |
| Voting Classifier (Soft: XGB + GDB + LR) | 0.884 | 0.889 | 0.883 | 0.884 |
| Voting Classifier (Hard: XGB + LGBM + ADB + LR) | 0.864 | 0.863 | 0.864 | 0.862 |
| Voting Classifier (Hard: XGB + GDB + LR) | 0.877 | 0.881 | 0.877 | 0.878 |
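
The difference between the hard and soft voting rows can be illustrated without any ML library. The sketch below assumes each model outputs a positive-class probability per sentence; the probability values are made up for illustration, and the helpers `hard_vote` and `soft_vote` are hypothetical stand-ins for what scikit-learn's `VotingClassifier` does with `voting="hard"` and `voting="soft"`.

```python
def hard_vote(probabilities, threshold=0.5):
    """Majority vote over each model's thresholded label (ties go to class 0)."""
    n_models = len(probabilities)
    votes = [sum(p >= threshold for p in row) for row in zip(*probabilities)]
    return [1 if v * 2 > n_models else 0 for v in votes]


def soft_vote(probabilities, threshold=0.5):
    """Average the positive-class probabilities, then threshold once."""
    n_models = len(probabilities)
    return [1 if sum(row) / n_models >= threshold else 0
            for row in zip(*probabilities)]


# Hypothetical positive-class probabilities from three models for four sentences.
xgb = [0.90, 0.45, 0.20, 0.55]
gdb = [0.80, 0.45, 0.30, 0.60]
lr = [0.70, 0.95, 0.10, 0.10]

hard = hard_vote([xgb, gdb, lr])
soft = soft_vote([xgb, gdb, lr])
print(hard)  # [1, 0, 0, 1]
print(soft)  # [1, 1, 0, 0]
```

The second and fourth sentences show where the two schemes disagree: one very confident model can swing a soft vote, while hard voting only counts how many models cross the threshold.
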

| Model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| Decision Tree | 0.718 | 0.720 | 0.718 | 0.719 |
| KNN | 0.687 | 0.685 | 0.687 | 0.615 |
| XGBoost | 0.791 | 0.824 | 0.791 | 0.764 |
| Random Forest | 0.753 | 0.746 | 0.753 | 0.746 |
| Gradient Boosting | 0.801 | 0.831 | 0.801 | 0.777 |
| Light GBM | 0.807 | 0.828 | 0.807 | 0.788 |
| Ada Boost | 0.785 | 0.814 | 0.785 | 0.757 |
| Logistic Regression | 0.851 | 0.857 | 0.851 | 0.844 |
| Voting Classifier (Soft: XGB + GDB + LR) | 0.813 | 0.840 | 0.813 | 0.793 |

| Model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| KNN | 0.669 | 0.641 | 0.669 | 0.653 |
| XGBoost | 0.718 | 0.785 | 0.718 | 0.735 |
| Gradient Boosting | 0.780 | 0.790 | 0.780 | 0.784 |
| Light GBM | 0.812 | 0.812 | 0.812 | 0.812 |
| Ada Boost | 0.7333 | 0.701 | 0.733 | 0.709 |
| Logistic Regression | 0.724 | 0.814 | 0.724 | 0.741 |
| Voting Classifier (Soft: XGB + GDB + LR) | 0.777 | 0.818 | 0.777 | 0.788 |
| Voting Classifier (Hard: XGB + LGBM + ADB + LR) | 0.806 | 0.808 | 0.806 | 0.807 |

    QUERY = "COVID OR vaccine OR pandemic OR lockdown OR #COVID19 OR " \
            "#VaccineMandate elections OR Biden OR Trump OR Joe Biden OR " \
            "DOGE OR FBI OR Donald Trump OR Ukraine OR Russia OR " \
            "Middle East Crisis OR South Asia Crisis OR UN meeting OR" \
            " US Congress OR US Republic OR geopolitics OR War OR" \
            " #Politics OR #Election2025 OR #UkraineRussiaWar OR" \
            " #Trump OR #DOGE OR #Trump2025 #MiddleEastCrisis OR" \
            " #Geopolitics OR #USRepublic OR #USCongress OR #Bangladesh (COVID OR" \
            " OR OR vaccine OR OR OR pandemic OR OR OR lockdown OR OR OR" \
            " #COVID19 OR OR OR #VaccineMandate OR elections OR OR OR" \
            " Biden OR OR OR Trump OR OR OR Joe OR Biden OR OR OR" \
            " DOGE OR OR OR FBI OR OR OR Donald OR Trump OR OR OR" \
            " Ukraine OR OR OR Russia OR OR OR Middle OR " \
            "East OR Crisis OR OR OR South OR Asia OR Crisis OR OR OR" \
            " UN OR meeting OR OR OR US OR Congress OR OR OR US OR" \
            " Republic OR OR OR geopolitics OR OR OR War OR OR OR" \
            " #Politics OR OR OR #Election2025 OR OR OR #UkraineRussiaWar OR OR OR" \
            " #Trump OR OR OR #DOGE OR OR OR #Trump2025 OR #MiddleEastCrisis OR OR OR" \
            " #Geopolitics OR OR OR #USRepublic OR OR OR #USCongress OR OR OR #Bangladesh)" \
            " lang:en until:2022-12-31 since:2020-01-01"

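
A search string of this shape can also be assembled from a keyword list, which avoids doubled operators and missing `OR`s between terms. The following is an illustrative alternative rather than the code used to collect the tweet data: `build_query` is a hypothetical helper, the term list is a short excerpt, and quoting multi-word phrases is an assumption about the search syntax of the scraping tool.

```python
def build_query(terms, lang="en", since="2020-01-01", until="2022-12-31"):
    """Join unique search terms with OR and append language and date filters."""
    seen = set()
    operands = []
    for term in terms:
        term = term.strip()
        if not term or term in seen:  # drop blanks and duplicates
            continue
        seen.add(term)
        # Assumption: quoted phrases are treated as a single operand.
        operands.append(f'"{term}"' if " " in term else term)
    return f"({' OR '.join(operands)}) lang:{lang} until:{until} since:{since}"


TERMS = ["COVID", "vaccine", "#COVID19", "Joe Biden", "Middle East Crisis", "COVID"]
query = build_query(TERMS)
print(query)
# (COVID OR vaccine OR #COVID19 OR "Joe Biden" OR "Middle East Crisis") lang:en until:2022-12-31 since:2020-01-01
```
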
| Model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| Decision Tree | 0.641 | 0.698 | 0.641 | 0.625 |
| XGBoost | 0.623 | 0.659 | 0.623 | 0.613 |
| Gradient Boosting | 0.654 | 0.653 | 0.654 | 0.653 |
| Light GBM | 0.649 | 0.648 | 0.649 | 0.648 |
| Ada Boost | 0.632 | 0.639 | 0.632 | 0.616 |
| Logistic Regression | 0.688 | 0.740 | 0.688 | 0.678 |
| Voting Classifier (Soft: XGB + GDB + LR) | 0.658 | 0.675 | 0.658 | 0.656 |
| Voting Classifier (Hard: XGB + LGBM + ADB + LR) | 0.671 | 0.670 | 0.671 | 0.670 |

| Model | Dataset | Accuracy | Precision | F1 |
|---|---|---|---|---|
| DBERTa (epoch : 8) | Test Data | 0.836 | 0.834 | 0.819 |
| DBERTa (epoch : 8) | Dev Test Data | 0.851 | 0.861 | 0.843 |
| DBERTa (epoch : 8) | Tweet Data | 0.779 | 0.788 | 0.775 |
| BERT (QLoRA, undersampled, hyperparameters tuned with Optuna, 40 epochs) | Test Data | 0.839 | 0.834 | 0.825 |
| BERT (QLoRA, undersampled, hyperparameters tuned with Optuna, 40 epochs) | Dev Test Data | 0.804 | 0.819 | 0.8786 |
| BERT (QLoRA, undersampled, hyperparameters tuned with Optuna, 40 epochs) | Tweet Data | 0.745 | 0.750 | 0.741 |
| RoBERTa - Base (25 epochs) | Test Data | 0.821 | 0.818 | 0.819 |
| RoBERTa - Base (25 epochs) | Dev Test Data | 0.848 | 0.847 | 0.847 |
| RoBERTa - Base (25 epochs) | Tweet Data | 0.736 | 0.739 | 0.736 |
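
To give a sense of why the QLoRA row counts as memory-efficient: LoRA freezes each adapted weight matrix and trains only two low-rank factors, and QLoRA additionally keeps the frozen base weights in 4-bit precision. The arithmetic below is a back-of-the-envelope sketch for BERT-base, not the project's configuration. The rank and the choice of adapted matrices are assumptions (neither is stated here), and biases are ignored.

```python
def lora_trainable_params(d_out, d_in, rank):
    """Parameters in the two low-rank factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


hidden = 768           # BERT-base hidden size
layers = 12            # BERT-base encoder layers
rank = 8               # assumed LoRA rank (not stated in the source)
adapted_per_layer = 2  # assumption: adapt only the query and value projections

full = hidden * hidden
lora = lora_trainable_params(hidden, hidden, rank)
total_lora = lora * adapted_per_layer * layers
total_full = full * adapted_per_layer * layers

print(f"per matrix: {lora} vs {full} weights ({lora / full:.2%})")
print(f"all adapted matrices: {total_lora} vs {total_full}")
```

Under these assumptions each adapted 768 x 768 matrix needs 12288 trainable weights instead of 589824, roughly 2% of the original, which is where most of the optimizer-state and gradient memory savings come from.
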

| Dataset | CheckWorthy | Not CheckWorthy |
|---|---|---|
| Training Data | 405 | 590 |
| Dev Data (Hyperparameter Tuning) | 102 | 150 |
| Dev Test Data (Model Test) | 316 | 350 |
| Test Data (Model Test) | 397 | 603 |

| Models | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| KNN Classifier | 0.518 | 0.521 | 0.518 | 0.518 |
| Decision Tree Classifier | 0.538 | 0.545 | 0.538 | 0.534 |
| Light GBM | 0.506 | 0.506 | 0.506 | 0.506 |
| Gradient Boosting | 0.498 | 0.493 | 0.498 | 0.492 |
| Random Forest | 0.494 | 0.500 | 0.494 | 0.489 |
| Voting (Soft: XGB + GDB + LR) | 0.505 | 0.507 | 0.505 | 0.504 |
| Voting (Soft: Decision Tree + KNN + Random Forest + XGB) | 0.508 | 0.522 | 0.508 | 0.490 |
| Voting (Hard: Decision Tree + KNN + Random Forest + XGB) | 0.520 | 0.525 | 0.520 | 0.517 |

| Dataset | Not CheckWorthy | CheckWorthy |
|---|---|---|
| Training Data | 5090 | 2243 |
| Dev Data (Hyperparameter Tuning) | 682 | 411 |
| Dev Test Data (Model Test) | 377 | 123 |

| Models | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| Random Forest Classifier | 0.754 | 0.569 | 0.754 | 0.648 |
| XGB Classifier | 0.572 | 0.698 | 0.572 | 0.602 |
| Logistic Regression | 0.538 | 0.700 | 0.538 | 0.568 |
| Voting Classifier (Soft: RF + XGB + DT + KNN) | 0.486 | 0.706 | 0.486 | 0.508 |
| Voting Classifier (Hard: RF + XGB + DT + KNN) | 0.478 | 0.703 | 0.478 | 0.499 |

| Dataset | Not CheckWorthy | CheckWorthy |
|---|---|---|
| Training Data | 16862 | 3182 |
| Dev Data (Hyperparameter Tuning) | 4296 | 704 |
| Dev Test Data (Model Test) | 4491 | 509 |

| Models | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| Gradient Boosting | 0.900 | 0.872 | 0.900 | 0.874 |
| Light GBM | 0.878 | 0.835 | 0.878 | 0.852 |
| AdaBoost | 0.898 | 0.863 | 0.898 | 0.859 |
| XGB Classifier | 0.848 | 0.845 | 0.848 | 0.846 |
| Voting Classifier (Soft: GDB + LGBM + ADB) | 0.894 | 0.854 | 0.894 | 0.861 |
| Voting Classifier (Hard: GDB + LGBM + ADB) | 0.897 | 0.860 | 0.897 | 0.861 |
| Voting Classifier (Soft: GDB + XGB + LR) | 0.871 | 0.855 | 0.871 | 0.862 |
| KNN | 0.849 | 0.835 | 0.849 | 0.842 |
| Decision Tree Classifier | 0.840 | 0.842 | 0.840 | 0.841 |

| Epochs | Datasets | Accuracy | Precision | F1 Score |
|---|---|---|---|---|
| 3 | Test | 0.865 | 0.852 | 0.847 |
| 3 | Dev Test | 0.834 | 0.820 | 0.819 |
| 8 | Test | 0.872 | 0.862 | 0.858 |
| 8 | Dev Test | 0.842 | 0.831 | 0.832 |
| 20 | Test | 0.879 | 0.869 | 0.868 |
| 20 | Dev Test | 0.846 | 0.835 | 0.836 |
This result has already surpassed the best scores from CLEF 2024, as seen here.
