Check-Worthiness Estimation of Text Data

With the rapid rise of online misinformation, it's increasingly important to prioritize which claims are worth fact-checking. Check-worthiness estimation tackles this by classifying whether a statement, such as a tweet or a debate quote, merits verification. However, subjectivity, data imbalance, and linguistic ambiguity make this task difficult.

While recent benchmarks like CheckThat! Lab at CLEF 2024 have been dominated by transformer-based and LLM-based models (e.g., RoBERTa, GPT-4, LLaMA2), these often demand substantial computational resources. Whether using transformer-based models or traditional machine learning approaches, my focus was on addressing this problem efficiently and in a trustworthy manner, while keeping in mind the variety and complexity of textual data. This project explores whether ensembles of traditional ML models, supported by resampling techniques, can remain competitive. I have also experimented with QLoRA for memory-efficient fine-tuning of large models, which offers a practical alternative to resource-heavy approaches.

This research addresses the growing need to identify claims worth fact-checking, especially in the age of widespread misinformation. Focusing on English-language datasets from U.S. presidential debate transcripts, we apply a range of resampling methods to tackle data imbalance and explore multiple machine learning approaches from traditional models to fine-tuned LLMs using memory-efficient techniques like QLoRA.

Key contributions include:

Handling large amounts of imbalanced text data.
Prompt engineering (few-shot, zero-shot) for data labeling and data processing.
Tweet data scraping to validate the models, with prompt engineering for label automation.
Leveraging data pruning techniques with LLM prompt engineering and various NLP libraries.
Evaluation of linguistic, contextual, and semantic features.
Ensemble strategies that significantly boost performance.
Benchmarking on CLEF 2024's CheckThat! dataset and additional tweet data.
Preliminary results show ensemble-based classical models outperform current CLEF 2024 LLM-based baselines.
Fine-tuning a state-of-the-art LLM with QLoRA also gives better performance than LoRA.
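As an illustration of the resampling step, here is a minimal sketch of random undersampling in plain Python. The function name `undersample`, the `ratio` parameter, and the toy data are assumptions made for this example; the exact procedure and ratio used in the project are not stated in this README.

```python
import random
from collections import Counter

def undersample(texts, labels, majority_label=0, ratio=1.0, seed=42):
    """Randomly drop majority-class rows until majority <= ratio * minority.

    Hypothetical helper for illustration only; not the project's own code.
    """
    rng = random.Random(seed)
    majority_idx = [i for i, y in enumerate(labels) if y == majority_label]
    minority_idx = [i for i, y in enumerate(labels) if y != majority_label]
    keep_n = min(len(majority_idx), int(ratio * len(minority_idx)))
    # Keep every minority row plus a random subset of the majority rows.
    kept = sorted(rng.sample(majority_idx, keep_n) + minority_idx)
    return [texts[i] for i in kept], [labels[i] for i in kept]

# Toy data: 8 "not check-worthy" (0) vs 2 "check-worthy" (1) sentences.
texts = [f"sentence {i}" for i in range(10)]
labels = [0] * 8 + [1] * 2
X_res, y_res = undersample(texts, labels, ratio=1.5)
print(Counter(y_res))
```

Only the training split should be resampled this way; dev and test splits keep their original class distribution so the reported scores stay comparable.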

Custom Ensemble Learning

Implemented advanced blending and stacking ensembles using:

Manual out-of-fold training and prediction logic
Integration of diverse base models (e.g., XGBoost, Logistic Regression, Decision Tree, Random Forest, AdaBoost, Gradient Boosting, LightGBM, KNN)
Final meta-learner trained on base model predictions
Achieved improved F1 scores compared to traditional ensemble methods (e.g., VotingClassifier).
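The out-of-fold logic described above can be sketched roughly as follows. This is a simplified illustration, not the repository's implementation: `kfold_indices`, `out_of_fold_features`, and the toy `CentroidModel` are hypothetical names, and the sketch only assumes that each base model exposes `fit` and `predict_proba` in the scikit-learn style.

```python
import random

def kfold_indices(n, n_splits, seed=0):
    """Shuffle row indices and deal them into n_splits disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[k::n_splits] for k in range(n_splits)]

def out_of_fold_features(model_factories, X, y, n_splits=5, seed=0):
    """Return an n x m matrix; column j holds model j's out-of-fold P(class 1)."""
    n = len(X)
    oof = [[0.0] * len(model_factories) for _ in range(n)]
    for held_out in kfold_indices(n, n_splits, seed):
        held = set(held_out)
        train = [i for i in range(n) if i not in held]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        X_ho = [X[i] for i in held_out]
        for j, make_model in enumerate(model_factories):
            model = make_model()          # fresh model per fold, no leakage
            model.fit(X_tr, y_tr)
            for i, p in zip(held_out, model.predict_proba(X_ho)):
                oof[i][j] = float(p[1])
    return oof

class CentroidModel:
    """Toy stand-in base learner scoring one feature by distance to class means."""
    def __init__(self, feature):
        self.feature = feature
    def fit(self, X, y):
        pos = [x[self.feature] for x, t in zip(X, y) if t == 1]
        neg = [x[self.feature] for x, t in zip(X, y) if t == 0]
        self.mu1, self.mu0 = sum(pos) / len(pos), sum(neg) / len(neg)
        return self
    def predict_proba(self, X):
        out = []
        for x in X:
            d0 = abs(x[self.feature] - self.mu0)
            d1 = abs(x[self.feature] - self.mu1)
            p1 = d0 / (d0 + d1) if d0 + d1 else 0.5
            out.append([1 - p1, p1])
        return out

# Toy data: two numeric features, five rows per class.
X = [[i, 10 - i] for i in range(10)]
y = [0] * 5 + [1] * 5
factories = [lambda: CentroidModel(0), lambda: CentroidModel(1)]
oof = out_of_fold_features(factories, X, y, n_splits=5)
# The meta-learner sees only predictions made on rows the base models never trained on.
meta = CentroidModel(0).fit(oof, y)
```

Because every row's meta-feature comes from a model that did not see that row, the meta-learner is less likely to overfit to base models that memorized the training data.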

LLMs used so far:

gemini-1.5-flash: prompt engineering to automate class labeling of scraped tweets.

BERT: language model fine-tuned for text classification.
multilingual BERT
XLM-RoBERTa
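To make the zero-shot versus few-shot labeling idea concrete, here is a small sketch of how such prompts could be assembled and how a model reply could be mapped to a class label. The prompt wording, the example sentences, and the helper names `build_prompt` and `parse_label` are all assumptions for illustration; the actual prompts sent to gemini-1.5-flash are not shown in this README, and no API call is made here.

```python
# Hypothetical few-shot examples; the project's real examples are not documented here.
FEW_SHOT_EXAMPLES = [
    ("Unemployment fell to 3.5% last year.", "Yes"),
    ("Good evening, and thank you all for coming.", "No"),
]

def build_prompt(text, examples=()):
    """Build a zero-shot prompt (no examples) or a few-shot prompt (with examples)."""
    lines = [
        "Decide whether the text contains a factual claim worth fact-checking.",
        "Answer with exactly one word: Yes or No.",
        "",
    ]
    for ex_text, ex_label in examples:
        lines += [f"Text: {ex_text}", f"Answer: {ex_label}", ""]
    lines += [f"Text: {text}", "Answer:"]
    return "\n".join(lines)

def parse_label(reply):
    """Map a free-form reply to 1 (check-worthy), 0 (not), or None if unclear."""
    words = reply.strip().lower().split()
    if not words:
        return None
    token = words[0].strip(".,:;!\"'")
    return {"yes": 1, "no": 0}.get(token)

zero_shot = build_prompt("The vaccine was approved in 2020.")
few_shot = build_prompt("The vaccine was approved in 2020.", FEW_SHOT_EXAMPLES)
print(few_shot)
```

Returning `None` for unclear replies lets those tweets be routed to manual review instead of being silently assigned a class.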

Classical models used so far (hyperparameter-tuned for English, Arabic, and Spanish):

Random Forest
XGBoost (XGB)
Decision Tree
KNN
Gradient Boosting (GDB)
LightGBM (LGBM)
AdaBoost (ADB)

Benchmark Dataset from CLEF 2024

English

Data Used

Dataset	Not Check-Worthy	Check-Worthy
Training Data	17087	5413
Undersampled Training Data	7189	5408
Dev Data (Hyperparameter Tuning)	794	238
Dev Test Data (Model Test)	210	108
Test Data (Model Test)	253	88
Tweet Data (Model Test)	124	107

Trained on the benchmark dataset and tested on the test data

Models	Accuracy	Precision	Recall	F1 Score
XGBoost	0.830	0.8222	0.830	0.821
Gradient Boosting	0.798	0.789	0.798	0.768
Light GBM	0.8333	0.837	0.833	0.8111
Logistic Regression	0.748	0.816	0.748	0.763
Voting Classifier (Soft: XGB + GDB + LR)	0.833	0.825	0.833	0.825
Voting Classifier (Hard: XGB + LGBM + ADB + LR)	0.836	0.850	0.836	0.811
Voting Classifier (Hard: XGB + GDB + LR)	0.878	0.881	0.877	0.878
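In the table above, the Recall column matches Accuracy (up to rounding) in every row. That is what support-weighted averaging produces, because weighted recall is algebraically identical to accuracy. Assuming that is how the scores were computed (the README does not say so explicitly), the following sketch shows the calculation on made-up labels; `weighted_scores` is a hypothetical helper, not a function from this repository.

```python
from collections import Counter

def weighted_scores(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1."""
    n = len(y_true)
    support = Counter(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    precision = recall = f1 = 0.0
    for cls, count in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        predicted = sum(p == cls for p in y_pred)
        prec_c = tp / predicted if predicted else 0.0
        rec_c = tp / count
        f1_c = 2 * prec_c * rec_c / (prec_c + rec_c) if prec_c + rec_c else 0.0
        weight = count / n  # each class contributes in proportion to its support
        precision += weight * prec_c
        recall += weight * rec_c
        f1 += weight * f1_c
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up labels: 6 negatives, 4 positives.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
scores = weighted_scores(y_true, y_pred)
print(scores)
```

One consequence is that weighted scores can look healthy even when the minority (check-worthy) class is poorly recalled, so the per-class F1 for the positive class is worth reporting alongside them.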

Performance of classical ML models trained on the undersampled English dataset and tested on the dev-test dataset

Model	Accuracy	Precision	Recall	F1
Decision Tree	0.709	0.714	0.709	0.711
KNN	0.671	0.657	0.671	0.660
XGBoost	0.867	0.872	0.867	0.868
Random Forest	0.728	0.717	0.728	0.714
Gradient Boosting	0.861	0.860	0.861	0.857
Light GBM	0.845	0.843	0.845	0.842
Ada Boost	0.813	0.815	0.813	0.803
Logistic Regression	0.851	0.874	0.851	0.855
Voting Classifier (Soft: XGB + GDB + LR)	0.884	0.889	0.883	0.884
Voting Classifier (Hard: XGB + LGBM + ADB + LR)	0.864	0.863	0.864	0.862
Voting Classifier (Hard: XGB + GDB + LR)	0.877	0.881	0.877	0.878
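The soft and hard voting rows above differ in how the base models' outputs are combined. A minimal sketch of the two rules for binary classification follows; the function names and the probabilities are made up for illustration, and the tie-breaking toward class 0 is an assumption of this sketch.

```python
def soft_vote(probas):
    """Average the models' P(class 1) and threshold the mean at 0.5."""
    return int(sum(probas) / len(probas) >= 0.5)

def hard_vote(probas):
    """Each model casts one vote; a strict majority is needed for class 1."""
    votes = [int(p >= 0.5) for p in probas]
    return int(sum(votes) * 2 > len(votes))  # ties fall back to class 0

# One confident model and two mildly negative ones (made-up probabilities).
probas = [0.90, 0.45, 0.40]
print("soft:", soft_vote(probas))  # mean is about 0.583, so class 1
print("hard:", hard_vote(probas))  # votes are 1, 0, 0, so class 0
```

Soft voting lets a confident model outweigh two hesitant ones, while hard voting ignores confidence entirely, which is one reason the two variants can rank differently across evaluation sets.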

Performance of different models on the original dataset and evaluated on dev test data

Model	Accuracy	Precision	Recall	F1
Decision Tree	0.718	0.720	0.718	0.719
KNN	0.687	0.685	0.687	0.615
XGBoost	0.791	0.824	0.791	0.764
Random Forest	0.753	0.746	0.753	0.746
Gradient Boosting	0.801	0.831	0.801	0.777
Light GBM	0.807	0.828	0.807	0.788
Ada Boost	0.785	0.814	0.785	0.757
Logistic Regression	0.851	0.857	0.851	0.844
Voting Classifier (Soft: XGB + GDB + LR)	0.813	0.840	0.813	0.793

Performance of different classifiers on the undersampled dataset (evaluated on test data):

Model	Accuracy	Precision	Recall	F1
KNN	0.669	0.641	0.669	0.653
XGBoost	0.718	0.785	0.718	0.735
Gradient Boosting	0.780	0.790	0.780	0.784
Light GBM	0.812	0.812	0.812	0.812
Ada Boost	0.7333	0.701	0.733	0.709
Logistic Regression	0.724	0.814	0.724	0.741
Voting Classifier (Soft: XGB + GDB + LR)	0.777	0.818	0.777	0.788
Voting Classifier (Hard: XGB + LGBM + ADB + LR)	0.806	0.808	0.806	0.807

Performance of classifiers trained on the undersampled dataset, evaluated on tweet data

Query for tweet data scraping:

QUERY = "COVID OR vaccine OR pandemic OR lockdown OR #COVID19 OR " \
    "#VaccineMandate elections OR Biden OR Trump OR Joe Biden OR " \
    "DOGE OR FBI OR Donald Trump OR Ukraine OR Russia OR " \
    "Middle East Crisis OR South Asia Crisis OR UN meeting OR" \
    " US Congress OR US Republic OR geopolitics OR War OR" \
    " #Politics OR #Election2025 OR #UkraineRussiaWar OR" \
    " #Trump OR #DOGE OR #Trump2025 #MiddleEastCrisis OR" \
    " #Geopolitics OR #USRepublic OR #USCongress OR #Bangladesh (COVID OR" \
    " OR OR vaccine OR OR OR pandemic OR OR OR lockdown OR OR OR" \
    " #COVID19 OR OR OR #VaccineMandate OR elections OR OR OR" \
    " Biden OR OR OR Trump OR OR OR Joe OR Biden OR OR OR" \
    " DOGE OR OR OR FBI OR OR OR Donald OR Trump OR OR OR" \
    " Ukraine OR OR OR Russia OR OR OR Middle OR " \
    "East OR Crisis OR OR OR South OR Asia OR Crisis OR OR OR" \
    " UN OR meeting OR OR OR US OR Congress OR OR OR US OR" \
    " Republic OR OR OR geopolitics OR OR OR War OR OR OR" \
    " #Politics OR OR OR #Election2025 OR OR OR #UkraineRussiaWar OR OR OR" \
    " #Trump OR OR OR #DOGE OR OR OR #Trump2025 OR #MiddleEastCrisis OR OR OR" \
    " #Geopolitics OR OR OR #USRepublic OR OR OR #USCongress OR OR OR #Bangladesh)" \
    " lang:en until:2022-12-31 since:2020-01-01"
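The query above contains repeated `OR` tokens and a duplicated term list, which suggests it was expanded automatically at some point. As a possible cleaner alternative, the sketch below builds an equivalent-looking query from a keyword list. `build_query` is a hypothetical helper, and it assumes multi-word terms are meant as exact phrases (hence the quotes), which may differ from how the original query was interpreted by the search backend.

```python
def build_query(terms, lang="en", since=None, until=None):
    """Join search terms with OR and append language and date filters."""
    unique_terms = list(dict.fromkeys(terms))  # drop duplicates, keep order
    # Assumption: multi-word terms are exact phrases, so they are quoted.
    quoted = [f'"{t}"' if " " in t else t for t in unique_terms]
    query = "(" + " OR ".join(quoted) + ")"
    if lang:
        query += f" lang:{lang}"
    if until:
        query += f" until:{until}"
    if since:
        query += f" since:{since}"
    return query

# A shortened term list taken from the query above.
terms = ["COVID", "vaccine", "Joe Biden", "#Trump", "COVID"]
query = build_query(terms, since="2020-01-01", until="2022-12-31")
print(query)
```

Generating the string from a list keeps the keyword set easy to review and avoids malformed operators when terms are added or removed.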

Model	Accuracy	Precision	Recall	F1
Decision Tree	0.641	0.698	0.641	0.625
XGBoost	0.623	0.659	0.623	0.613
Gradient Boosting	0.654	0.653	0.654	0.653
Light GBM	0.649	0.648	0.649	0.648
Ada Boost	0.632	0.639	0.632	0.616
Logistic Regression	0.688	0.740	0.688	0.678
Voting Classifier (Soft: XGB + GDB + LR)	0.658	0.675	0.658	0.656
Voting Classifier (Hard: XGB + LGBM + ADB + LR)	0.671	0.670	0.671	0.670

Performance of LLMs on the English dataset across different evaluation datasets:

Model	Dataset	Accuracy	Precision	F1
DBERTa (8 epochs)	Test Data	0.836	0.834	0.819
DBERTa (8 epochs)	Dev Test Data	0.851	0.861	0.843
DBERTa (8 epochs)	Tweet Data	0.779	0.788	0.775
BERT (QLoRA, undersampled, hyperparameters tuned with Optuna, 40 epochs)	Test Data	0.839	0.834	0.825
BERT (QLoRA, undersampled, hyperparameters tuned with Optuna, 40 epochs)	Dev Test Data	0.804	0.819	0.8786
BERT (QLoRA, undersampled, hyperparameters tuned with Optuna, 40 epochs)	Tweet Data	0.745	0.750	0.741
RoBERTa - Base (25 epochs)	Test Data	0.821	0.818	0.819
RoBERTa - Base (25 epochs)	Dev Test Data	0.848	0.847	0.847
RoBERTa - Base (25 epochs)	Tweet Data	0.736	0.739	0.736

Dutch

Data Used

Dataset	Check-Worthy	Not Check-Worthy
Training Data	405	590
Dev Data (Hyperparameter Tuning)	102	150
Dev Test Data (Model Test)	316	350
Test Data (Model Test)	397	603

Result

Models	Accuracy	Precision	Recall	F1 Score
KNN Classifier	0.518	0.521	0.518	0.518
Decision Tree Classifier	0.538	0.545	0.538	0.534
Light GBM	0.506	0.506	0.506	0.506
Gradient Boosting	0.498	0.493	0.498	0.492
Random Forest	0.494	0.500	0.494	0.489
Voting Classifier (Soft: XGB + GDB + LR)	0.505	0.507	0.505	0.504
Voting Classifier (Soft: Decision Tree + KNN + Random Forest + XGB)	0.508	0.522	0.508	0.490
Voting Classifier (Hard: Decision Tree + KNN + Random Forest + XGB)	0.520	0.525	0.520	0.517

Arabic

Data Used

Dataset	Not Check-Worthy	Check-Worthy
Training Data	5090	2243
Dev Data (Hyperparameter Tuning)	682	411
Dev Test Data (Model Test)	377	123

Best performance on the Arabic dataset (Dev-Test data):

Models	Accuracy	Precision	Recall	F1 Score
Random Forest Classifier	0.754	0.569	0.754	0.648
XGB Classifier	0.572	0.698	0.572	0.602
Logistic Regression	0.538	0.700	0.538	0.568
Voting Classifier (Soft: RF + XGB + DT + KNN)	0.486	0.706	0.486	0.508
Voting Classifier (Hard: RF + XGB + DT + KNN)	0.478	0.703	0.478	0.499

Spanish

Data Used

Dataset	Not Check-Worthy	Check-Worthy
Training Data	16862	3182
Dev Data (Hyperparameter Tuning)	4296	704
Dev Test Data (Model Test)	4491	509

Best-performing models on the Spanish dataset (dev-test):

Models	Accuracy	Precision	Recall	F1 Score
Gradient Boosting	0.900	0.872	0.900	0.874
Light GBM	0.878	0.835	0.878	0.852
AdaBoost	0.898	0.863	0.898	0.859
XGB Classifier	0.848	0.845	0.848	0.846
Voting Classifier (Soft: GDB + LGBM + ADB)	0.894	0.854	0.894	0.861
Voting Classifier (Hard: GDB + LGBM + ADB)	0.897	0.860	0.897	0.861
Voting Classifier (Soft: GDB + XGB + LR)	0.871	0.855	0.871	0.862
KNN	0.849	0.835	0.849	0.842
Decision Tree Classifier	0.840	0.842	0.840	0.841

Multilingual Data (Arabic, Spanish, English, Dutch)

MultiLingual BERT (PEFT) QLoRA
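For orientation, a QLoRA setup for multilingual BERT with the Hugging Face `transformers`, `peft`, and `bitsandbytes` stack might look roughly like the sketch below. The checkpoint name, the rank, alpha, dropout, and target modules are illustrative assumptions and not the tuned values behind the results reported here; running the builder also requires a CUDA-capable GPU.

```python
# Illustrative values only; NOT the project's tuned hyperparameters.
QLORA_SETTINGS = {
    "base_model": "bert-base-multilingual-cased",  # assumed checkpoint
    "r": 8,
    "lora_alpha": 16,
    "lora_dropout": 0.1,
    "target_modules": ["query", "value"],  # BERT self-attention projections
}

def build_qlora_classifier(settings=QLORA_SETTINGS, num_labels=2):
    """Load a 4-bit quantized base model and attach trainable LoRA adapters."""
    # Imported lazily so this file can be read and imported without the GPU stack.
    import torch
    from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig
    from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,                     # store frozen base weights in 4 bits
        bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
        bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForSequenceClassification.from_pretrained(
        settings["base_model"],
        num_labels=num_labels,
        quantization_config=bnb_config,
    )
    model = prepare_model_for_kbit_training(model)
    lora_config = LoraConfig(
        task_type=TaskType.SEQ_CLS,
        r=settings["r"],
        lora_alpha=settings["lora_alpha"],
        lora_dropout=settings["lora_dropout"],
        target_modules=settings["target_modules"],
    )
    # Only the adapter weights (and the classification head) are trained.
    return get_peft_model(model, lora_config)
```

The memory saving comes from keeping the base weights frozen in 4-bit form while gradients flow only through the small low-rank adapter matrices.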

Performance of Multilingual BERT (trained with QLoRA) on merged dataset

Epochs	Datasets	Accuracy	Precision	F1 Score
3	Test	0.865	0.852	0.847
3	Dev Test	0.834	0.820	0.819
8	Test	0.872	0.862	0.858
8	Dev Test	0.842	0.831	0.832
20	Test	0.879	0.869	0.868
20	Dev Test	0.846	0.835	0.836

This result has already surpassed the best scores from CLEF 2024, as seen here.

@checkThat

