This project focuses on the task of identifying authors in Persian literature by using fine-tuned BERT language models. The dataset comprises texts from 10 authors in the same genre, and the models are evaluated through 5-fold cross-validation, yielding various performance metrics. The project also compares the results with traditional machine learning techniques.
The dataset includes texts from 10 different Persian authors writing in the same genre. Each author has 30 documents, each containing exactly 500 words. Metadata includes:
- Author Name: The name of the author.
- Text Content: The 500-word document from the author.
- Additional Information: Any relevant metadata.
Dataset construction steps:
- Web Scraping: Used Beautiful Soup in Python to scrape texts from online sources.
- Document Selection: Selected 30 texts per author, ensuring diversity of works and authors within the shared genre.
- Text Cleaning: Processed the text to meet the 500-word limit and cleaned unnecessary characters.
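The cleaning and trimming step above can be sketched as follows. This is an illustrative helper, not the project's actual `dataset_creation.py`; the character filter and word limit are assumptions based on the 500-word specification.

```python
import re

def clean_and_trim(text: str, word_limit: int = 500) -> str:
    """Clean a scraped text and trim it to a fixed word count.

    Illustrative sketch: keeps Persian/Arabic-block characters, word
    characters, whitespace, and common punctuation (including Persian
    ? and ,); the exact filter used by the project may differ.
    """
    # Drop characters outside the allowed set (HTML debris, symbols, etc.)
    text = re.sub(r"[^\u0600-\u06FF\w\s.,!?؟،]", " ", text)
    # Collapse runs of whitespace left over from HTML extraction
    text = re.sub(r"\s+", " ", text).strip()
    # Enforce the 500-word document length
    return " ".join(text.split()[:word_limit])
```

For example, a scraped page longer than 500 words is truncated, and stray symbols are stripped before counting words.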
The author identification task was carried out using various pre-trained BERT models from Hugging Face, specifically fine-tuned for this problem:
- Pre-trained Model: BERT-based models fine-tuned on our dataset.
- Tokenization: Used each pre-trained model's own tokenizer, which handles Persian text.
- Fine-Tuning: Modified the output layer for the classification task.
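The fine-tuning setup above can be sketched with the Hugging Face `transformers` API. The checkpoint name below (ParsBERT) is an assumption; substitute whichever Persian BERT variant the project actually uses. Passing `num_labels=10` is what replaces the pre-trained output layer with a fresh 10-way classification head.

```python
def build_label_maps(authors):
    """Map author names to integer class ids for the classification head."""
    label2id = {name: i for i, name in enumerate(sorted(authors))}
    id2label = {i: name for name, i in label2id.items()}
    return label2id, id2label

def build_model(model_name="HooshvareLab/bert-fa-base-uncased", num_labels=10):
    """Load a Persian BERT checkpoint with a fresh classification head.

    The checkpoint name is an assumption (ParsBERT); the project may use
    a different Persian model from the Hugging Face hub.
    """
    # Imported lazily so the label-mapping helper above is usable on its own.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # num_labels swaps the pre-trained output layer for a 10-class classifier.
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=num_labels
    )
    return tokenizer, model
```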
The models were trained and evaluated using 5-fold cross-validation. Key experiments include:
- Model Performance: Measured accuracy, F1 Score, precision, and recall for each model.
- Parameter Tuning: Analyzed the impact of fine-tuning parameters (e.g., learning rate, batch size) on model performance.
- Document Length: Investigated the effect of varying document lengths on accuracy.
- Stopword Removal: Evaluated the impact of excluding stopwords on model performance.
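The 5-fold scheme above can be sketched with scikit-learn's `StratifiedKFold`, which keeps the author distribution balanced across folds. The `texts`/`labels` interface is an assumption about how the dataset is loaded.

```python
from sklearn.model_selection import StratifiedKFold

def make_folds(texts, labels, n_splits=5, seed=42):
    """Return (train_idx, test_idx) pairs for stratified 5-fold CV.

    Stratification ensures each of the 10 authors appears in every
    fold's train and test split in the same proportion.
    """
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return list(skf.split(texts, labels))
```

With 10 authors and 30 documents each, every fold tests on 60 documents (6 per author) and trains on the remaining 240.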
The performance metrics for each experiment are summarized as follows:
- Accuracy: [Results]
- F1 Score: [Results]
- Precision & Recall: [Results]
- Confusion Matrix: Included in the report for detailed analysis.
Additionally, the project compares BERT’s performance with traditional machine learning models (e.g., SVM, Random Forest), highlighting the advantages and limitations of each approach.
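A traditional baseline of this kind can be sketched as a TF-IDF + linear SVM pipeline. The character n-gram features are a common stylometry choice, not necessarily the configuration the project used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

def baseline_svm():
    """Character n-gram TF-IDF features fed to a linear SVM.

    Illustrative baseline only; the project's SVM/Random Forest
    hyperparameters may differ.
    """
    return Pipeline([
        ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(2, 4))),
        ("svm", LinearSVC()),
    ])
```

The same `fit`/`predict` interface makes it easy to run this baseline through the identical 5-fold splits used for BERT.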
To set up the environment for this project, follow these steps:
- Clone the repository:
  git clone https://github.com/yourusername/author-identification-persian
  cd author-identification-persian
- Install the required dependencies:
  pip install -r requirements.txt
- Set up the environment for Hugging Face models:
  pip install transformers
- Dataset Creation:
  - Run the dataset_creation.py script to create the dataset:
    python dataset_creation.py
  - The dataset will be saved in the /data folder.
- Fine-Tuning BERT Model:
  - Run the fine_tune_bert.py script to fine-tune the BERT model:
    python fine_tune_bert.py
- Model Evaluation:
  - After training, run the evaluation script to obtain performance metrics:
    python evaluate_model.py