This is an educational experiment (I am the one being educated here) around the famous Attention Is All You Need paper. I use a dataset of 130k French-English sentence pairs to train a machine translation model that translates from French to English, as described in this Kaggle competition.
Since French and English share the same script and a large part of their vocabulary, I use a single Byte-Pair Encoding tokenizer for both languages, trained on the same dataset. I settled on the following hyperparameters after some tweaking:
```python
num_layers = 4
num_heads = 4
d_model = 128
# Longer sequences (after tokenization) get excluded from the dataset
max_seq_len = 64
vocab_size = 8_000
```

We can probably get much better results with proper hyperparameter tuning, but I have no idea how to do that in a more systematic way than educated guessing and trying.
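As a side note on the shared tokenizer: one way to train it is with the Hugging Face `tokenizers` library, as in the minimal sketch below. This is only an illustration, not necessarily what `bpe.py` in this repo does; the corpus path and the special-token names are placeholders.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# One tokenizer for both languages: train it on a file containing both the
# French and the English sides of the corpus (the path is a placeholder).
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(
    vocab_size=8_000,
    special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"],
)
tokenizer.train(files=["data/fr_en_sentences.txt"], trainer=trainer)
tokenizer.save("bpe_tokenizer.json")

# Loading it back later is a one-liner:
# tokenizer = Tokenizer.from_file("bpe_tokenizer.json")
```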
On my laptop (with a cheap dedicated GPU from 8 years ago), about an hour of training gets a BLEU score of 60% on the Kaggle test set. A sample of the translations:
```
> vous n'êtes pas assez agé pour conduire.
you're not old enough to drive.
> nous sommes encore à la maison.
we're still at home.
> je suis heureux de l'entendre.
i'm glad to hear it.
> je suis nul au golf.
i'm not at golf.
> j'ai pris la photo.
i took the picture.
```

The file architecture is the following:
- In `attention.py` we define a very generic multi-head attention `nn.Module` that handles all the common flavours of attention (self-attention, cross-attention, etc.) at once. The mathematical formulas are so close that it is reasonable to use a single API for all of them (a rough sketch is given after this list).
- In `transformers.py` we define a standard `TransformerBlock` as described in the paper above (also sketched below).
- In `decoder.py` and `encoder.py` we define two `nn.Module`s following a standard seq2seq architecture, each containing multiple `TransformerBlock` layers plus the remaining work (embeddings, cross-attention in the decoder, etc.).
- In `seq2seq.py` we define the full translation model; all the training heavy lifting is done there: data-loading utilities, the training loop, testing, and generating the final answers (a greedy-decoding sketch closes this section).
- In `bpe.py` there is a single function that trains and loads the BPE tokenizer.
- In `my_utils.py` there are a few utilities that PyTorch does not provide out of the box.
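To make the single-API idea concrete, here is a minimal sketch of what such a module can look like. This is not the code from `attention.py`; the class name, the argument names, and the boolean mask convention (`True` means "may attend") are assumptions made for the example.

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """One module for every flavour of attention:
    self-attention  -> forward(x, x)
    cross-attention -> forward(decoder_x, encoder_out)"""

    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def _split_heads(self, x: torch.Tensor) -> torch.Tensor:
        # (batch, seq, d_model) -> (batch, heads, seq, d_head)
        batch, seq, _ = x.shape
        return x.view(batch, seq, self.num_heads, self.d_head).transpose(1, 2)

    def forward(self, query, key_value, mask=None):
        # Queries come from one sequence, keys/values possibly from another.
        q = self._split_heads(self.q_proj(query))
        k = self._split_heads(self.k_proj(key_value))
        v = self._split_heads(self.v_proj(key_value))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        if mask is not None:
            # Boolean mask, True = "may attend"; broadcasts over the head dim.
            scores = scores.masked_fill(~mask, float("-inf"))
        out = scores.softmax(dim=-1) @ v              # (batch, heads, seq_q, d_head)
        out = out.transpose(1, 2).contiguous()        # (batch, seq_q, heads, d_head)
        out = out.view(out.size(0), out.size(1), -1)  # (batch, seq_q, d_model)
        return self.out_proj(out)
```

With this signature, cross-attention in the decoder is just `mha(decoder_states, encoder_states)` with a padding mask, and causal self-attention only changes the mask, not the code.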
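Building on that, a sketch of a block in the spirit of the paper: post-norm, with an optional cross-attention step so the same class can serve both the encoder and the decoder. It reuses the `MultiHeadAttention` sketched above; the real `TransformerBlock` may be organised differently, and `d_ff` is an assumed value.

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Post-norm block: self-attention, optional cross-attention (decoder only),
    then a position-wise feed-forward network, each followed by add & norm."""

    def __init__(self, d_model: int, num_heads: int, d_ff: int = 512, dropout: float = 0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)  # unused in encoder blocks
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(3)])
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_out=None, self_mask=None, cross_mask=None):
        x = self.norms[0](x + self.dropout(self.self_attn(x, x, self_mask)))
        if enc_out is not None:  # decoder blocks also attend to the encoder output
            x = self.norms[1](x + self.dropout(self.cross_attn(x, enc_out, cross_mask)))
        return self.norms[2](x + self.dropout(self.ff(x)))
```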
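Finally, generating a translation at inference time boils down to a greedy decoding loop. The sketch below is again illustrative only: the `model.encoder` / `model.decoder` call signatures, the special-token names, and the assumption that the decoder returns vocabulary logits are all placeholders; `seq2seq.py` has its own version of this.

```python
import torch

@torch.no_grad()
def greedy_translate(model, tokenizer, sentence: str, max_len: int = 64) -> str:
    """Encode the French sentence once, then grow the English side one
    token at a time, always picking the most likely next token."""
    model.eval()
    bos_id = tokenizer.token_to_id("[BOS]")
    eos_id = tokenizer.token_to_id("[EOS]")
    src = torch.tensor([tokenizer.encode(sentence).ids])  # (1, src_len)
    tgt = torch.tensor([[bos_id]])                        # (1, 1), starts with BOS
    memory = model.encoder(src)   # assumed: the encoder maps token ids to hidden states
    for _ in range(max_len - 1):
        logits = model.decoder(tgt, memory)               # assumed: (1, tgt_len, vocab)
        next_id = logits[0, -1].argmax().item()           # most likely next token
        tgt = torch.cat([tgt, torch.tensor([[next_id]])], dim=1)
        if next_id == eos_id:
            break
    return tokenizer.decode(tgt[0].tolist(), skip_special_tokens=True)
```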