
sayedshaun/wsd


[Figure] Copyright (c), Barba and Blevins.

Dual Architecture pipeline for Word Sense Disambiguation (WSD)

This project implements algorithms and tools for Word Sense Disambiguation (WSD), the task of determining the correct meaning of a word based on its context in a sentence. It provides datasets, evaluation scripts, and models to facilitate research and development in natural language processing applications where accurate word sense interpretation is essential.

Project Structure

📁 wsd
    ├── 📄 .gitignore
    ├── 📄 LICENSE
    ├── 📄 README.md
    ├── 📄 config.yaml
    ├── 📄 requirements.txt
    ├── 📄 download.sh
    │
    ├── 🧠 Core Modules
    │   ├── 📄 model.py
    │   ├── 📄 predict.py
    │   ├── 📄 train.py
    │
    ├── 🧰 Utilities
    │   ├── 📄 train_utils.py
    │   ├── 📄 utils.py
    │   ├── 📄 wn_utils.py
    │
    ├── 📊 Data
    │   ├── 📄 dataset.py
    │   ├── 📄 data_builder.py

Setup

This project requires Python 3.10. Install the dependencies with:

pip install -r requirements.txt

Training

python train.py -c config.yaml

with the following configuration (config.yaml):

train_data_dir: data/Training_Corpora/SemCor        # Training dataset dir
val_data_dir: data/Evaluation_Datasets/semeval2007  # Validation dataset dir
model_name: distilbert-base-uncased                 # Huggingface model name
output_dir: output                                  # output directory to save checkpoints
num_sense: 5                                        # Recommended 4/5
max_seq_len: 512                                    # Between [1, 512]
batch_size: 16                                      # Batch size for training
lr: 0.00001                                         # Learning rate
weight_decay: 0.01                                  # Weight decay for optimizer
epochs: 3                                           # Number of epochs
logging_step: 2000                                  # After how many steps to log
precision: fp16                                     # [fp16, fp32, bf16]
warmup_ratio: 0.1                                   # Fraction of training steps used for warmup
grad_clip: 1.0                                      # Gradient clipping factor
pos_tag: ALL                                        # [ALL, NOUN, VERB, ADJ, ADV]
device: cuda                                        # [cpu, cuda]
seed: 1234                                          # [int,  none]
report_to: wandb                                    # [wandb, none]
architecture: span                                  # [span, cosine]
do_predict: true                                    # predict after training
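
Before launching a long run, it can help to sanity-check the configuration against the ranges given in the comments above. The following is a minimal sketch, not part of the repository: `validate_config` and `warmup_steps` are hypothetical helpers, the config is passed as a plain dict (for example, the result of parsing config.yaml), and the warmup calculation assumes that `warmup_ratio` is a fraction of the total optimizer steps with no gradient accumulation.

```python
import math

# Values copied from the config.yaml listing above.
README_DEFAULTS = {
    "num_sense": 5,
    "max_seq_len": 512,
    "batch_size": 16,
    "epochs": 3,
    "precision": "fp16",
    "warmup_ratio": 0.1,
    "pos_tag": "ALL",
    "device": "cuda",
    "report_to": "wandb",
    "architecture": "span",
}

# Allowed values as listed in the config comments.
CHOICES = {
    "precision": {"fp16", "fp32", "bf16"},
    "pos_tag": {"ALL", "NOUN", "VERB", "ADJ", "ADV"},
    "device": {"cpu", "cuda"},
    "report_to": {"wandb", "none"},
    "architecture": {"span", "cosine"},
}


def validate_config(cfg):
    """Return a list of problems; an empty list means nothing was flagged."""
    problems = []
    for key, allowed in CHOICES.items():
        value = str(cfg.get(key))
        if value not in allowed:
            problems.append(f"{key}={value!r} is not one of {sorted(allowed)}")
    if not 1 <= int(cfg.get("max_seq_len", 0)) <= 512:
        problems.append("max_seq_len must be between 1 and 512")
    if not 0.0 <= float(cfg.get("warmup_ratio", 0.0)) <= 1.0:
        problems.append("warmup_ratio must be between 0 and 1")
    return problems


def warmup_steps(cfg, num_examples):
    """Assumed meaning of warmup_ratio: a share of all optimizer steps."""
    steps_per_epoch = math.ceil(num_examples / cfg["batch_size"])
    total_steps = steps_per_epoch * cfg["epochs"]
    return int(total_steps * cfg["warmup_ratio"])


if __name__ == "__main__":
    for problem in validate_config(README_DEFAULTS):
        print("config problem:", problem)
```

Returning a list of problems rather than raising on the first one makes it easier to fix several config mistakes in a single pass.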

Evaluation

python predict.py \
    --data_dir "data/Evaluation_Datasets/semeval2015" \
    --model_name "distilbert-base-uncased" \
    --weight_dir "output/semeval2007" \
    --pos "ALL" \
    --seed 1234 \
    --num_sense 5 \
    --max_length 256 \
    --batch_size 32 \
    --architecture "cosine"
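
To evaluate one checkpoint on several benchmarks, the command above can be generated per dataset. This is a rough sketch under assumptions: `build_predict_cmd` is a hypothetical helper, the flags are exactly those shown above, and the directories for semeval2013, senseval2, and senseval3 are assumed to follow the same `data/Evaluation_Datasets/<name>` pattern as semeval2007 and semeval2015.

```python
import shlex

# Dataset names taken from the results tables; the directory layout for
# some of them is an assumption (see the paragraph above).
EVAL_SETS = ["semeval2007", "semeval2013", "semeval2015", "senseval2", "senseval3"]


def build_predict_cmd(dataset, architecture, weight_dir,
                      data_root="data/Evaluation_Datasets"):
    """Build the predict.py argument list for one evaluation dataset."""
    return [
        "python", "predict.py",
        "--data_dir", f"{data_root}/{dataset}",
        "--model_name", "distilbert-base-uncased",
        "--weight_dir", weight_dir,
        "--pos", "ALL",
        "--seed", "1234",
        "--num_sense", "5",
        "--max_length", "256",
        "--batch_size", "32",
        "--architecture", architecture,
    ]


if __name__ == "__main__":
    # Print the commands for review instead of running them.
    for name in EVAL_SETS:
        print(shlex.join(build_predict_cmd(name, "cosine", "output/semeval2007")))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the printed commands look right.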

Evaluation Results

Both architectures fine-tune the pretrained distilbert-base-uncased model on the SemCor dataset.

Span Extraction

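| Dataset     | Loss  | Start F1 | End F1 | Exact Match | Joint F1 | POS | Architecture |
|-------------|-------|----------|--------|-------------|----------|-----|--------------|
| ALL         | 0.512 | 0.8129   | 0.8170 | 0.7962      | 0.8087   | ALL | span         |
| semeval2007 | 0.517 | 0.8088   | 0.8088 | 0.7934      | 0.8037   | ALL | span         |
| semeval2013 | 0.524 | 0.8054   | 0.8096 | 0.7835      | 0.7995   | ALL | span         |
| semeval2015 | 0.611 | 0.7916   | 0.7955 | 0.7769      | 0.7880   | ALL | span         |
| senseval2   | 0.527 | 0.8094   | 0.8146 | 0.7927      | 0.8056   | ALL | span         |
| senseval3   | 0.476 | 0.8151   | 0.8232 | 0.8043      | 0.8142   | ALL | span         |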
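
In the span extraction results, the Joint F1 column matches the arithmetic mean of Start F1, End F1, and Exact Match to four decimal places. This is an observation from the reported numbers, not a definition confirmed by the repository. The short check below reproduces it, with `mean_of_components` and `max_deviation` as illustrative names.

```python
# (Start F1, End F1, Exact Match, Joint F1) per dataset, copied from the table.
SPAN_RESULTS = {
    "ALL":         (0.8129, 0.8170, 0.7962, 0.8087),
    "semeval2007": (0.8088, 0.8088, 0.7934, 0.8037),
    "semeval2013": (0.8054, 0.8096, 0.7835, 0.7995),
    "semeval2015": (0.7916, 0.7955, 0.7769, 0.7880),
    "senseval2":   (0.8094, 0.8146, 0.7927, 0.8056),
    "senseval3":   (0.8151, 0.8232, 0.8043, 0.8142),
}


def mean_of_components(start_f1, end_f1, exact_match):
    """Unweighted mean of the three span metrics."""
    return (start_f1 + end_f1 + exact_match) / 3


def max_deviation(results):
    """Largest gap between the reported Joint F1 and the component mean."""
    return max(
        abs(mean_of_components(start, end, exact) - joint)
        for start, end, exact, joint in results.values()
    )


if __name__ == "__main__":
    # The gap stays within the rounding error of four-decimal reporting.
    print(f"max deviation: {max_deviation(SPAN_RESULTS):.6f}")
```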

Cosine Similarity

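| Dataset     | Loss   | F1     | Precision | Recall | Accuracy | POS | Architecture |
|-------------|--------|--------|-----------|--------|----------|-----|--------------|
| ALL         | 0.5684 | 0.8024 | 0.8024    | 0.8024 | 0.8024   | ALL | cosine       |
| semeval2007 | 0.5524 | 0.8066 | 0.8066    | 0.8066 | 0.8066   | ALL | cosine       |
| semeval2013 | 0.4821 | 0.8303 | 0.8303    | 0.8303 | 0.8303   | ALL | cosine       |
| semeval2015 | 0.6726 | 0.7965 | 0.7965    | 0.7965 | 0.7965   | ALL | cosine       |
| senseval2   | 0.5827 | 0.7993 | 0.7993    | 0.7993 | 0.7993   | ALL | cosine       |
| senseval3   | 0.5064 | 0.8000 | 0.8000    | 0.8000 | 0.8000   | ALL | cosine       |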
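
In the cosine similarity results, F1, Precision, Recall, and Accuracy are identical in every row. One plausible explanation, which the repository does not confirm, is micro-averaging: when each instance receives exactly one predicted label, every wrong prediction counts as one false positive and one false negative, so the four metrics coincide. The toy example below illustrates this with made-up labels; `micro_metrics` is a hypothetical helper, not the project's evaluation code.

```python
def micro_metrics(y_true, y_pred):
    """Micro-averaged precision, recall, F1, and accuracy for single-label data."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    # Pool true/false positives and false negatives over all labels.
    for label in labels:
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label:
                fp += 1
            elif truth == label:
                fn += 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return precision, recall, f1, accuracy


if __name__ == "__main__":
    # Made-up sense labels, for illustration only.
    gold = ["a", "b", "a", "c", "b"]
    predicted = ["a", "b", "c", "c", "a"]
    print(micro_metrics(gold, predicted))
```

Macro-averaged metrics would generally differ from accuracy, so identical columns hint at micro-averaging rather than prove it.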

Finetuned weights are available here

Dataset Details

  • SemCor
  • SemCor+OMSTI
  • SemEval 2007
  • SemEval 2013
  • SemEval 2015
  • Senseval 2
  • Senseval 3

References

If you use this repository in your work, please cite it as:

@misc{wsd,
    author       = {Md Abu Sayed Shaun},
    title        = {Dual Architecture pipeline for Word Sense Disambiguation (WSD)},
    year         = {2025},
    howpublished = {\url{https://github.com/sayedshaun/wsd}},
    note         = {GitHub repository}
}
