AMOGHA1140/english_hindi_translation


data/ - folder containing the data used for training and for creating the vocabulary

The data is compiled from the following three sources:

  1. https://www.kaggle.com/datasets/preetviradiya/english-hindi-dataset/data
  2. https://www.kaggle.com/datasets/parvmodi/english-to-hindi-machine-translation-dataset
  3. The IIT Bombay English-Hindi parallel corpus

The sources are concatenated line by line into two final files: one containing the English sentences and the other the corresponding Hindi sentences.
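Assuming each source ships as a pair of aligned English/Hindi line files, the concatenation step can be sketched as follows (function and file names are illustrative, not taken from the repo):

```python
import tempfile
from pathlib import Path

def concatenate_corpora(source_pairs, en_out, hi_out):
    """Append each source's aligned English/Hindi files, line by line,
    into two combined files (one per language)."""
    with open(en_out, "w", encoding="utf-8") as en_f, \
         open(hi_out, "w", encoding="utf-8") as hi_f:
        for en_path, hi_path in source_pairs:
            en_lines = Path(en_path).read_text(encoding="utf-8").splitlines()
            hi_lines = Path(hi_path).read_text(encoding="utf-8").splitlines()
            # Parallel files must stay aligned, or sentence pairs break.
            assert len(en_lines) == len(hi_lines)
            en_f.write("\n".join(en_lines) + "\n")
            hi_f.write("\n".join(hi_lines) + "\n")

# Tiny demonstration with made-up two-source data.
tmp = Path(tempfile.mkdtemp())
(tmp / "a.en").write_text("hello\n", encoding="utf-8")
(tmp / "a.hi").write_text("नमस्ते\n", encoding="utf-8")
(tmp / "b.en").write_text("thank you\n", encoding="utf-8")
(tmp / "b.hi").write_text("धन्यवाद\n", encoding="utf-8")
concatenate_corpora([(tmp / "a.en", tmp / "a.hi"), (tmp / "b.en", tmp / "b.hi")],
                    tmp / "train.en", tmp / "train.hi")
combined_en = (tmp / "train.en").read_text(encoding="utf-8")
```

The assert guards against a misaligned source silently corrupting every sentence pair after it.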

vocab/ - contains the vocabulary and token mappings, created with a SentencePiece BPE model. The vocabulary size differs between model1 and model2.

weights/ - my model's weights, along with the optimizer's and scheduler's state, saved at various steps during training

model1 - a smaller model, built to verify that all components work correctly

model2 - created by increasing the size of all components of model1, together with a modified loss (label smoothing)

model1's configuration:

embedding_dim= 128
vocab_size = 8192
n_layers = 3
n_heads = 4
max_len = 128
ffn_hidden_dim = 4*128
dropout_prob=0.1
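A rough back-of-envelope parameter count for these configurations (my own estimate, not from the repo; it ignores biases, layer norms, and positional embeddings, and assumes a single tied embedding/output projection over the shared vocabulary):

```python
def transformer_params(d, vocab, n_layers, ffn):
    """Approximate parameter count for a standard encoder-decoder Transformer."""
    emb = vocab * d                     # tied embedding / output projection
    attn = 4 * d * d                    # Q, K, V, and output projections
    enc_layer = attn + 2 * d * ffn      # self-attention + 2-layer FFN
    dec_layer = 2 * attn + 2 * d * ffn  # self- and cross-attention + FFN
    return emb + n_layers * (enc_layer + dec_layer)

print(transformer_params(128, 8192, 3, 4 * 128))   # model1: ~2.4M parameters
print(transformer_params(512, 12288, 4, 4 * 512))  # model2: ~35.7M parameters
```

The gap (roughly 15x) is almost entirely the jump from d=128 to d=512, since every weight matrix scales with d or d².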

weights/optimizer_state - optimizer state during model1's training
weights/model_weights1.pth - model1's weights
weights/lr_schduler_state1.pth - model1's LR scheduler state

vocab/old_model_vocab.model - model1's vocab (size 8192)
vocab/new_model_vocab.model - model2's vocab (size 12288)
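Resuming training from these checkpoint files might look like the sketch below (a stand-in nn.Linear replaces the actual Transformer, and the files are written to a temp directory; only the file names mirror the listing above):

```python
import os
import tempfile

import torch
from torch import nn, optim

# Stand-in model; the real repo would construct its Transformer here.
model = nn.Linear(8, 8)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10)

# Save the three states separately, mirroring the files listed above.
ckpt_dir = tempfile.mkdtemp()
torch.save(model.state_dict(), os.path.join(ckpt_dir, "model_weights1.pth"))
torch.save(optimizer.state_dict(), os.path.join(ckpt_dir, "optimizer_state.pth"))
torch.save(scheduler.state_dict(), os.path.join(ckpt_dir, "lr_schduler_state1.pth"))

# Resume: reload all three so training continues exactly where it stopped.
model.load_state_dict(torch.load(os.path.join(ckpt_dir, "model_weights1.pth")))
optimizer.load_state_dict(torch.load(os.path.join(ckpt_dir, "optimizer_state.pth")))
scheduler.load_state_dict(torch.load(os.path.join(ckpt_dir, "lr_schduler_state1.pth")))
```

Restoring only the weights loses the Adam moments and the LR schedule position, which is exactly the problem the checkpoint-averaging note under References runs into.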

model2's configuration:

embedding_dim= 512
vocab_size = 12288
n_layers = 4
n_heads = 4
max_len = 128
ffn_hidden_dim = 4*embedding_dim
dropout_prob=0.1

  1. It is trained with label smoothing, unlike model1.
  2. It does not use L2 regularization.

Notes:

model1 performs very poorly; it existed only to validate the implementation. Its main limitations:
- a 128-dimensional embedding is simply not enough for a language like Hindi
- N=3 layers is not a deep enough Transformer; it can and should be increased
- the 8192-token vocabulary can be increased
- the context size could perhaps be increased

References

Revisiting Checkpoint Averaging for Neural Machine Translation - https://aclanthology.org/2022.findings-aacl.18.pdf
I could not make use of the first and second moments of the weights because the optimizer checkpoints were not saved.

Attention Is All You Need - https://arxiv.org/abs/1706.03762

Masked Label Smoothing - https://arxiv.org/pdf/2203.02889
Since my English and Hindi vocabularies are combined, it makes no sense to assign probability mass to English tokens in the output. The English tokens were therefore "masked" out, and the small smoothing probability was distributed over the remaining Hindi tokens only.
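A minimal NumPy sketch of that idea (the token-id layout and function name are hypothetical; a boolean mask marks the Hindi portion of the shared vocabulary, and every gold id is assumed to be a Hindi token):

```python
import numpy as np

def masked_smooth_targets(gold_ids, hindi_mask, eps=0.1):
    """Smoothed one-hot targets whose epsilon mass covers Hindi tokens only.

    hindi_mask: bool array of shape (vocab_size,), True where the token is Hindi.
    English positions receive exactly zero probability.
    """
    vocab_size = hindi_mask.shape[0]
    n_hindi = int(hindi_mask.sum())
    targets = np.zeros((len(gold_ids), vocab_size))
    # Spread eps uniformly over the non-gold Hindi tokens...
    targets[:, hindi_mask] = eps / (n_hindi - 1)
    # ...and give the gold token the remaining 1 - eps.
    targets[np.arange(len(gold_ids)), gold_ids] = 1.0 - eps
    return targets

# Toy shared vocab: ids 0-1 are English, ids 2-5 are Hindi.
mask = np.array([False, False, True, True, True, True])
t = masked_smooth_targets(np.array([2, 5]), mask)
```

Each row still sums to 1, so the cross-entropy target remains a valid distribution, but no probability is wasted on tokens that can never appear in Hindi output.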
