This project focuses on the task of identifying authors in Persian literature by using fine-tuned BERT language models. The dataset comprises texts from 10 authors in the same genre, and the models are evaluated through 5-fold cross-validation, yielding various performance metrics. The project also compares the results with traditional machine learning techniques.
The dataset includes texts from 10 different Persian authors writing in the same genre. Each author has 30 documents, each containing exactly 500 words. Metadata includes:
- Author Name: The name of the author.
- Text Content: The 500-word document from the author.
- Additional Information: Any relevant metadata.
Dataset construction steps:
- Web Scraping: Used Beautiful Soup in Python to scrape texts from online sources.
- Document Selection: Selected 30 texts per author, ensuring diversity of works and authors within the shared genre.
- Text Cleaning: Processed the text to meet the 500-word limit and cleaned unnecessary characters.
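The cleaning and trimming step above can be sketched as follows. This is an illustrative helper, not the project's actual `dataset_creation.py`; the character filter and word limit are assumptions based on the 500-word specification.

```python
import re

def clean_and_trim(text: str, word_limit: int = 500) -> str:
    """Clean a scraped text and trim it to a fixed word count.

    Illustrative sketch: keeps Persian/Arabic-block characters, word
    characters, whitespace, and common punctuation (including Persian
    ? and ,); the exact filter used by the project may differ.
    """
    # Drop characters outside the allowed set (HTML debris, symbols, etc.)
    text = re.sub(r"[^\u0600-\u06FF\w\s.,!?؟،]", " ", text)
    # Collapse runs of whitespace left over from HTML extraction
    text = re.sub(r"\s+", " ", text).strip()
    # Enforce the 500-word document length
    return " ".join(text.split()[:word_limit])
```

For example, a scraped page longer than 500 words is truncated, and stray symbols are stripped before counting words.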
The author identification task was carried out using various pre-trained BERT models from Hugging Face, specifically fine-tuned for this problem:
- Pre-trained Model: BERT-based models fine-tuned on our dataset.
- Tokenization: Used each pre-trained model's own tokenizer, which handles Persian text.
- Fine-Tuning: Modified the output layer for the classification task.
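The fine-tuning setup above can be sketched with the Hugging Face `transformers` API. The checkpoint name below (ParsBERT) is an assumption; substitute whichever Persian BERT variant the project actually uses. Passing `num_labels=10` is what replaces the pre-trained output layer with a fresh 10-way classification head.

```python
def build_label_maps(authors):
    """Map author names to integer class ids for the classification head."""
    label2id = {name: i for i, name in enumerate(sorted(authors))}
    id2label = {i: name for name, i in label2id.items()}
    return label2id, id2label

def build_model(model_name="HooshvareLab/bert-fa-base-uncased", num_labels=10):
    """Load a Persian BERT checkpoint with a fresh classification head.

    The checkpoint name is an assumption (ParsBERT); the project may use
    a different Persian model from the Hugging Face hub.
    """
    # Imported lazily so the label-mapping helper above is usable on its own.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # num_labels swaps the pre-trained output layer for a 10-class classifier.
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=num_labels
    )
    return tokenizer, model
```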
The models were trained and evaluated using 5-fold cross-validation. Key experiments include:
- Model Performance: Measured accuracy, F1 Score, precision, and recall for each model.
- Parameter Tuning: Analyzed the impact of fine-tuning parameters (e.g., learning rate, batch size) on model performance.
- Document Length: Investigated the effect of varying document lengths on accuracy.
- Stopword Removal: Evaluated the impact of excluding stopwords on model performance.
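The 5-fold scheme above can be sketched with scikit-learn's `StratifiedKFold`, which keeps the author distribution balanced across folds. The `texts`/`labels` interface is an assumption about how the dataset is loaded.

```python
from sklearn.model_selection import StratifiedKFold

def make_folds(texts, labels, n_splits=5, seed=42):
    """Return (train_idx, test_idx) pairs for stratified 5-fold CV.

    Stratification ensures each of the 10 authors appears in every
    fold's train and test split in the same proportion.
    """
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return list(skf.split(texts, labels))
```

With 10 authors and 30 documents each, every fold tests on 60 documents (6 per author) and trains on the remaining 240.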
The performance metrics for each experiment are summarized as follows:
- Accuracy: [Results]
- F1 Score: [Results]
- Precision & Recall: [Results]
- Confusion Matrix: Included in the report for detailed analysis.
Additionally, the project compares BERT’s performance with traditional machine learning models (e.g., SVM, Random Forest), highlighting the advantages and limitations of each approach.
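A traditional baseline of this kind can be sketched as a TF-IDF + linear SVM pipeline. The character n-gram features are a common stylometry choice, not necessarily the configuration the project used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

def baseline_svm():
    """Character n-gram TF-IDF features fed to a linear SVM.

    Illustrative baseline only; the project's SVM/Random Forest
    hyperparameters may differ.
    """
    return Pipeline([
        ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(2, 4))),
        ("svm", LinearSVC()),
    ])
```

The same `fit`/`predict` interface makes it easy to run this baseline through the identical 5-fold splits used for BERT.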
To set up the environment for this project, follow these steps:
- Clone the repository:
  git clone https://github.com/yourusername/author-identification-persian
  cd author-identification-persian
- Install the required dependencies:
  pip install -r requirements.txt
- Set up the environment for Hugging Face models:
  pip install transformers
- Dataset Creation:
  - Run the dataset_creation.py script to create the dataset:
    python dataset_creation.py
  - The dataset will be saved in the /data folder.
- Fine-Tuning BERT Model:
  - Run the fine_tune_bert.py script to fine-tune the BERT model:
    python fine_tune_bert.py
- Model Evaluation:
  - After training, run the evaluation script to obtain performance metrics:
    python evaluate_model.py