data/ - folder containing the data used for training and for creating the vocabulary.
The data is built from the following 3 sources:-
- https://www.kaggle.com/datasets/preetviradiya/english-hindi-dataset/data
- https://www.kaggle.com/datasets/parvmodi/english-to-hindi-machine-translation-dataset
- IITB English-Hindi parallel corpus
The sources are concatenated line by line into 2 final files: one with the English sentences and one with the corresponding Hindi sentences.
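The line-by-line concatenation can be sketched as follows (the helper name and file names are hypothetical, not the repo's actual script):

```python
from pathlib import Path

def concat_parallel(sources, en_out="train.en", hi_out="train.hi"):
    """Concatenate (english_file, hindi_file) pairs line by line,
    keeping the two output files aligned sentence-for-sentence."""
    with open(en_out, "w", encoding="utf-8") as fe, \
         open(hi_out, "w", encoding="utf-8") as fh:
        for en_path, hi_path in sources:
            en_lines = Path(en_path).read_text(encoding="utf-8").splitlines()
            hi_lines = Path(hi_path).read_text(encoding="utf-8").splitlines()
            # a misaligned pair would silently corrupt the training data
            assert len(en_lines) == len(hi_lines), f"misaligned pair: {en_path}"
            for en, hi in zip(en_lines, hi_lines):
                fe.write(en.strip() + "\n")
                fh.write(hi.strip() + "\n")
```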
vocab/ - This contains the vocabulary and token mappings, created with a SentencePiece BPE model. The vocabulary size differs between model1 and model2.
weights/ - my model's weights, along with the optimizer's and scheduler's state, at various steps during training
model1 - A smaller model, built to make sure all components are working correctly
model2 - Created by increasing the size of all components of model1, alongside a modified loss (label smoothing)
model1's configuration:-
embedding_dim = 128
vocab_size = 8192
n_layers = 3
n_heads = 4
max_len = 128
ffn_hidden_dim = 4*128
dropout_prob = 0.1
weights/optimizer_state - optimizer's state during model1's training
weights/model_weights1.pth - model1's weights
weights/lr_schduler_state1.pth - model1's lr scheduler's state
vocab/old_model_vocab.model - model1's vocab (size 8192)
vocab/new_model_vocab.model - model2's (the new model's) vocab (size 12288)
model2's configuration:-
embedding_dim = 512
vocab_size = 12288
n_layers = 4
n_heads = 4
max_len = 128
ffn_hidden_dim = 4*embedding_dim
dropout_prob = 0.1
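For reference, a configuration like the one above can be sketched as a model skeleton. This is a hypothetical wrapper built on torch.nn.Transformer, not this repo's actual module; class and parameter names are assumptions:

```python
import torch
import torch.nn as nn

class TranslationModel(nn.Module):
    """Sketch of an encoder-decoder with the hyperparameters listed above."""
    def __init__(self, vocab_size=12288, embedding_dim=512, n_layers=4,
                 n_heads=4, max_len=128, dropout_prob=0.1):
        super().__init__()
        ffn_hidden_dim = 4 * embedding_dim
        self.embed = nn.Embedding(vocab_size, embedding_dim)
        self.pos = nn.Embedding(max_len, embedding_dim)  # learned positions
        self.transformer = nn.Transformer(
            d_model=embedding_dim, nhead=n_heads,
            num_encoder_layers=n_layers, num_decoder_layers=n_layers,
            dim_feedforward=ffn_hidden_dim, dropout=dropout_prob,
            batch_first=True)
        self.out = nn.Linear(embedding_dim, vocab_size)

    def forward(self, src_ids, tgt_ids):
        src_pos = torch.arange(src_ids.size(1), device=src_ids.device)
        tgt_pos = torch.arange(tgt_ids.size(1), device=tgt_ids.device)
        src = self.embed(src_ids) + self.pos(src_pos)
        tgt = self.embed(tgt_ids) + self.pos(tgt_pos)
        # causal mask so each target position only attends to the past
        T = tgt_ids.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.transformer(src, tgt, tgt_mask=causal)
        return self.out(h)  # logits over the shared vocabulary
```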
- It is trained with label smoothing, unlike model1
- It doesn't use L2 regularization
model1 performs very badly. It was built only to validate the implementation, and it has some major issues:-
- A 128-dimensional embedding is just not enough for a language like Hindi.
- N=3 is not a deep enough transformer; it can and should be increased.
- The 8192 vocab size can be increased.
- The context size could perhaps be increased.
Revisiting Checkpoint Averaging for Neural Machine Translation - https://aclanthology.org/2022.findings-aacl.18.pdf
I couldn't make use of the first and second moments of the weights, as the optimizer checkpoints were not saved
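Even without optimizer moments, plain uniform averaging of saved weight checkpoints is still possible. A minimal sketch, assuming each checkpoint file is a plain model state dict (the helper name is hypothetical):

```python
import torch

def average_checkpoints(paths):
    """Uniformly average model weights across several checkpoint files.
    Uses only the parameters themselves, no optimizer state needed."""
    avg = None
    for p in paths:
        state = torch.load(p, map_location="cpu")
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in avg:
                avg[k] += state[k].float()
    for k in avg:
        avg[k] /= len(paths)  # divide the running sum by the count
    return avg
```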
Attention Is All You Need - https://arxiv.org/abs/1706.03762
Masked Label Smoothing - https://arxiv.org/pdf/2203.02889
As my English and Hindi vocabularies are combined into one, it doesn't make sense to assign probability mass to English tokens in the output. So the English tokens were "masked" out, and the small smoothing probability was assigned only to the remaining Hindi tokens.
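A minimal sketch of this masked variant, assuming a boolean hindi_mask over the shared vocabulary (the function and argument names are hypothetical, not the repo's actual loss code):

```python
import torch
import torch.nn.functional as F

def masked_label_smoothing_loss(logits, targets, hindi_mask, eps=0.1):
    """Label smoothing that spreads the smoothing mass only over Hindi
    tokens. hindi_mask: bool tensor of shape [vocab_size]; English
    tokens in the shared vocabulary receive zero probability."""
    batch, vocab_size = logits.size(0), logits.size(-1)
    # eps is distributed uniformly over the Hindi subset only
    smooth = hindi_mask.float() * (eps / hindi_mask.sum())
    smooth = smooth.unsqueeze(0).expand(batch, vocab_size).clone()
    # the gold token gets the remaining (1 - eps) mass
    smooth.scatter_add_(1, targets.unsqueeze(1),
                        torch.full((batch, 1), 1.0 - eps))
    log_probs = F.log_softmax(logits, dim=-1)
    return -(smooth * log_probs).sum(dim=-1).mean()
```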