data/ - folder containing the data used for training and for creating the vocabulary.
The data is built from the following 3 sources:-
- https://www.kaggle.com/datasets/preetviradiya/english-hindi-dataset/data
- https://www.kaggle.com/datasets/parvmodi/english-to-hindi-machine-translation-dataset
- IITB English-Hindi parallel corpus
The sources are concatenated line by line into 2 final files: one with the English sentences and one with the corresponding Hindi sentences.
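The line-by-line concatenation can be sketched as follows (the helper name and file names are hypothetical, not the repo's actual script):

```python
from pathlib import Path

def concat_parallel(sources, en_out="train.en", hi_out="train.hi"):
    """Concatenate (english_file, hindi_file) pairs line by line,
    keeping the two output files aligned sentence-for-sentence."""
    with open(en_out, "w", encoding="utf-8") as fe, \
         open(hi_out, "w", encoding="utf-8") as fh:
        for en_path, hi_path in sources:
            en_lines = Path(en_path).read_text(encoding="utf-8").splitlines()
            hi_lines = Path(hi_path).read_text(encoding="utf-8").splitlines()
            # a misaligned pair would silently corrupt the training data
            assert len(en_lines) == len(hi_lines), f"misaligned pair: {en_path}"
            for en, hi in zip(en_lines, hi_lines):
                fe.write(en.strip() + "\n")
                fh.write(hi.strip() + "\n")
```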
vocab/ - This contains the vocabulary and token mappings, created with a SentencePiece BPE model. The vocabulary size differs between model1 and model2.
weights/ - my model's weights, along with the optimizer's and scheduler's state, at various steps during training
model1 - A smaller model, built to make sure all components are working correctly
model2 - Created by increasing the size of all components of model1, alongside a modified loss (label smoothing)
model1's configuration:-
embedding_dim = 128
vocab_size = 8192
n_layers = 3
n_heads = 4
max_len = 128
ffn_hidden_dim = 4*128
dropout_prob = 0.1
weights/optimizer_state - optimizer's state during model1's training
weights/model_weights1.pth - model1's weights
weights/lr_schduler_state1.pth - model1's lr scheduler's state
vocab/old_model_vocab.model - model1's vocab (size 8192)
vocab/new_model_vocab.model - model2's (the new model's) vocab (size 12288)
model2's configuration:-
embedding_dim = 512
vocab_size = 12288
n_layers = 4
n_heads = 4
max_len = 128
ffn_hidden_dim = 4*embedding_dim
dropout_prob = 0.1
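For reference, a configuration like the one above can be sketched as a model skeleton. This is a hypothetical wrapper built on torch.nn.Transformer, not this repo's actual module; class and parameter names are assumptions:

```python
import torch
import torch.nn as nn

class TranslationModel(nn.Module):
    """Sketch of an encoder-decoder with the hyperparameters listed above."""
    def __init__(self, vocab_size=12288, embedding_dim=512, n_layers=4,
                 n_heads=4, max_len=128, dropout_prob=0.1):
        super().__init__()
        ffn_hidden_dim = 4 * embedding_dim
        self.embed = nn.Embedding(vocab_size, embedding_dim)
        self.pos = nn.Embedding(max_len, embedding_dim)  # learned positions
        self.transformer = nn.Transformer(
            d_model=embedding_dim, nhead=n_heads,
            num_encoder_layers=n_layers, num_decoder_layers=n_layers,
            dim_feedforward=ffn_hidden_dim, dropout=dropout_prob,
            batch_first=True)
        self.out = nn.Linear(embedding_dim, vocab_size)

    def forward(self, src_ids, tgt_ids):
        src_pos = torch.arange(src_ids.size(1), device=src_ids.device)
        tgt_pos = torch.arange(tgt_ids.size(1), device=tgt_ids.device)
        src = self.embed(src_ids) + self.pos(src_pos)
        tgt = self.embed(tgt_ids) + self.pos(tgt_pos)
        # causal mask so each target position only attends to the past
        T = tgt_ids.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.transformer(src, tgt, tgt_mask=causal)
        return self.out(h)  # logits over the shared vocabulary
```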
- It is trained with label smoothing, unlike model1
- It doesn't use L2 regularization
model1 performs very badly. It was built only to validate the implementation, and it has some major issues:-
- A 128-dimensional embedding is just not enough for a language like Hindi.
- N=3 is not a deep enough transformer; it can and should be increased.
- The 8192 vocab size can be increased.
- The context size could perhaps be increased.
Revisiting Checkpoint Averaging for Neural Machine Translation - https://aclanthology.org/2022.findings-aacl.18.pdf
I couldn't make use of the first and second moments of the weights, as the optimizer checkpoints were not saved
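Even without optimizer moments, plain uniform averaging of saved weight checkpoints is still possible. A minimal sketch, assuming each checkpoint file is a plain model state dict (the helper name is hypothetical):

```python
import torch

def average_checkpoints(paths):
    """Uniformly average model weights across several checkpoint files.
    Uses only the parameters themselves, no optimizer state needed."""
    avg = None
    for p in paths:
        state = torch.load(p, map_location="cpu")
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in avg:
                avg[k] += state[k].float()
    for k in avg:
        avg[k] /= len(paths)  # divide the running sum by the count
    return avg
```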
Attention Is All You Need - https://arxiv.org/abs/1706.03762
Masked Label Smoothing - https://arxiv.org/pdf/2203.02889
As my English and Hindi vocabularies are combined into one, it doesn't make sense to assign probability mass to English tokens in the output. So the English tokens were "masked" out, and the small smoothing probability was assigned only to the remaining Hindi tokens.
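A minimal sketch of this masked variant, assuming a boolean hindi_mask over the shared vocabulary (the function and argument names are hypothetical, not the repo's actual loss code):

```python
import torch
import torch.nn.functional as F

def masked_label_smoothing_loss(logits, targets, hindi_mask, eps=0.1):
    """Label smoothing that spreads the smoothing mass only over Hindi
    tokens. hindi_mask: bool tensor of shape [vocab_size]; English
    tokens in the shared vocabulary receive zero probability."""
    batch, vocab_size = logits.size(0), logits.size(-1)
    # eps is distributed uniformly over the Hindi subset only
    smooth = hindi_mask.float() * (eps / hindi_mask.sum())
    smooth = smooth.unsqueeze(0).expand(batch, vocab_size).clone()
    # the gold token gets the remaining (1 - eps) mass
    smooth.scatter_add_(1, targets.unsqueeze(1),
                        torch.full((batch, 1), 1.0 - eps))
    log_probs = F.log_softmax(logits, dim=-1)
    return -(smooth * log_probs).sum(dim=-1).mean()
```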