GPT-2 Benchmark

Hey there! Are you interested in LLMs? Do you like experimenting with neural networks, implementing different ideas and testing them out? Would you like to do that for a living? Then you're in the right place! This is an official open test for people interested in joining BottleCapAI.

This project is a fork of Modded-NanoGPT ❤️, rewritten with minimal changes to run on a single GPU (e.g. RTX 3090/4090).

📌 About BottleCapAI

At BottleCapAI, we’re making large language models radically more efficient — aiming for 100× improvements over today’s approaches. 🚀

👥 Founders

Tomas Mikolov – creator of word2vec, pioneer of neural language models.
Jaroslav Beck – co-founder of Beat Games (Beat Saber, 10M+ copies sold, acquired by Meta).
David Herel – creator of Thinking Tokens, co-founder of an AI trading startup, and Amazon Alexa Prize finalist.

🌍 Our vision

Training frontier LLMs costs tens of millions today. Our new algorithms already cut that by ~50% — and we’re just getting started. We’re building a European hub to push AI forward through algorithms, not brute force.

📧 hey(at)bottlecapai.com · 🌐 bottlecapai.com

🏆 First Competition has Finished.

Your self-paced submissions are always welcome! Competition or no competition, let us know if you have a speedup! 💪

Update November 13: We have winners! See the announcement 🏆

1st place: $3,000 USD Jan Chleboun
2nd place: $2,000 USD Andrej Nosov
3rd place: $1,000 USD Dominik Jurko

Competition deadline: 11.11.2025. Prizes were awarded based on the best validated results shared via the submission process below. Ties may be broken by total training time and clarity of write-up.

Objective

Prototype your idea on a subset of the FineWeb dataset using 1 GPU.
(Optional) goal: reach a validation loss of ≤ 3.3821 faster than the baseline.

You can achieve this by:

making your model faster (so that it sees more data in shorter time)
making your training more efficient (so that your model makes better progress in fewer steps).

What's the point?

We're not here to optimize learning rates and torch.compile flags.
We're here to explore algorithmic ideas that might scale, and if that means writing your own CUDA kernel, even better.

This benchmark is meant for:

People with limited hardware
People with ideas and curiosity

You're encouraged to try new techniques to speed up language modeling, including but not limited to:

Modify the loss function
Add auxiliary losses (multi-token prediction?)
Modify the architecture (Mixture of Experts? Different attention?)
Come up with a different training algorithm
Modify the training data
New architecture!

You're not expected to:

Just bump up the learning rate
Beat everyone with hyperparameter magic
Do 50 runs to grid search Adam betas
Benchmark arcane PyTorch flags
Copy speedups from Modded-NanoGPT
Modify a specific hidden layer size to align better with the number of Tensor Cores on your GPU

We're interested in your own ideas, not how well you can copy others'. These ideas should be general and work on different setups, not be hardcoded to a very specific one.

You have a budget of 5B tokens available for training, but the baseline only uses 2.5B, so you've got room to train on more data if you make your model faster, or on less but better data.

The dataset is pre-tokenized so that you don't have to do that yourself (saves time), but if you want to explore the original text, you can decode it using the GPT-2 tokenizer (tiktoken.get_encoding("gpt2")).
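As a rough illustration of the decoding step, here is a minimal sketch. The helper name `decode_tokens` is made up for this example and is not part of the repository; it assumes you have already loaded some token IDs into a Python sequence (how the shards are stored on disk is not covered here). Only `tiktoken.get_encoding("gpt2")` and its `decode` method come from tiktoken itself.

```python
def decode_tokens(token_ids, encoding=None):
    """Turn a sequence of GPT-2 token IDs back into text.

    `decode_tokens` is a hypothetical helper for illustration only.
    If no encoding object is passed, the GPT-2 encoding is loaded
    from tiktoken (this may download the vocabulary on first use).
    """
    if encoding is None:
        # Imported lazily so the helper can be reused with any object
        # that exposes a compatible .decode(list_of_ints) method.
        import tiktoken

        encoding = tiktoken.get_encoding("gpt2")
    # Cast to plain ints in case the IDs come from a NumPy array.
    return encoding.decode([int(t) for t in token_ids])
```

With tiktoken installed, something like `print(decode_tokens(tokens[:64]))` should show the first few dozen tokens of a shard as readable text, where `tokens` is whatever array of IDs you loaded.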

Running the baseline

To run the baseline, run the following commands.

git clone https://github.com/BottleCapAI/modded-nanogpt && cd modded-nanogpt
pip install -r requirements.txt

# you can skip this if you don't want to use W&B, in which case you should remove the --log_wandb argument from run.sh
wandb login
wandb sync wandb/run-20250410_203158-64s1zc1w # synchronizes the baseline run to your W&B account for reference

python data/cached_fineweb10B.py
./run.sh

Benchmarks

Below is a reference leaderboard. Beating it is awesome, but sub-baseline runs are still valuable when they demonstrate a creative idea.

Train a neural network to ≤ 3.3821 validation loss on FineWeb using 1 GPU.

#	Record time	Description	Date (DD/MM/YY)	Log	Contributors
1	5.401 hours	baseline	11/04/25	log	contributors-of-modded-nanoGPT
2	4.86 hours	gated embedding projection	20/08/25	-	adam-osusky
3	3.88 hours	custom attention mask, increased context length, variable context length	28/06/25	-	filipmihal

Note: The baseline used one RTX 4090. It took 4768 steps (iterations) and used 2.5B tokens in total.
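To get a feel for these numbers, the short calculation below relates the baseline's step count to the token budget. The 2.5B, 4768 and 5B figures come from this README; the value of 2**19 tokens per step is only an inference from them (2.5B is a rounded number), so treat it as an assumption and check the actual settings in run.sh.

```python
BASELINE_TOKENS = 2.5e9        # tokens used by the baseline (rounded, from this README)
BASELINE_STEPS = 4768          # baseline steps (from this README)
TOKEN_BUDGET = 5_000_000_000   # total training budget (from this README)

# Average tokens consumed per optimizer step in the baseline run.
tokens_per_step = BASELINE_TOKENS / BASELINE_STEPS

# ASSUMPTION: the average is very close to 2**19, which suggests (but does
# not prove) that the effective batch is 2**19 tokens per step.
ASSUMED_TOKENS_PER_STEP = 2**19

# How many steps of that size fit into the full budget.
max_steps_in_budget = TOKEN_BUDGET // ASSUMED_TOKENS_PER_STEP

print(f"average tokens per step: {tokens_per_step:,.0f}")
print(f"steps that fit in the budget: {max_steps_in_budget}")
```

In other words, under that assumption a model that is twice as fast per step could take roughly twice as many steps as the baseline before running out of data.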

Rules

Optional: reach validation loss ≤ 3.3821 in shorter time.
Do not introduce new datasets, but feel free to modify the current one.
Document your idea in IDEA.md (motivation, method, results). Negative results are welcome—share what you learned!

If you use a GPU other than an RTX 4090, benchmark the baseline on it and compare your result against that. For example, if the baseline takes 10 hours on your setup but your solution takes only 8 hours, that is the speedup you can report to us! Keep the comparison fair: if you increase the learning rate for your solution, try increasing it for the baseline too.
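One way to turn the two timings into a single number is sketched below. The README does not define the formula, so this assumes "percentage speedup" means the relative reduction in wall-clock time; `speedup_percent` is a hypothetical helper, not something from the repository.

```python
def speedup_percent(baseline_hours, solution_hours):
    """Relative reduction in training time, in percent.

    ASSUMPTION: speedup is measured as time saved relative to the
    baseline measured on the same hardware. Negative values mean the
    solution was slower than the baseline.
    """
    if baseline_hours <= 0 or solution_hours <= 0:
        raise ValueError("training times must be positive")
    return 100.0 * (baseline_hours - solution_hours) / baseline_hours


# The example from the text: baseline 10 hours, solution 8 hours.
print(f"{speedup_percent(10, 8):.1f}%")
# A slower run simply yields a negative number.
print(f"{speedup_percent(10, 12):.1f}%")
```

If you prefer to report a ratio instead (10 hours / 8 hours = 1.25x), say so explicitly in your write-up so the two conventions are not confused.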

Submission

To submit your results, run:

git bundle create <first name>-<last name>.bundle --all

Then send your .bundle file to hey(at)bottlecapai.com with the subject in the format: <first name>-<last name> <percentage speedup (don't worry if it's negative)>.

Didn’t beat the baseline? No worries – send the bundle anyway plus a short IDEA.md describing:
• what you tried & why • what worked • what didn’t.

Beat the baseline? Great! Add a RESULTS.md with timing, settings, and hardware so others can reproduce it.

At this moment, we are interested mainly in candidates willing to relocate to Prague. (If you’re an exceptional fit, we’re happy to discuss possible support options.)

Technical Notes

While this project is designed to run on 1 GPU, there are a few things to keep in mind:

Batch Size, Sequence Length and Gradient Accumulation: The current setup requires ~13 GB of GPU memory. If you don't have that much (or have no GPU at all, in which case we suggest Google Colab), you may need to tune down some hyperparameters. We recommend starting with the validation batch size: lowering it does not affect model performance, although validation will take a bit longer. Next, you can reduce the training batch size and compensate by increasing gradient accumulation to keep the same effective batch size. If you do change the effective batch size, be careful with the learning rate and other hyperparameters.
torch.compile Considerations: On some RTX cards, aggressive kernel auto-tuning via torch.compile can lead to shared memory issues. If you encounter errors or persistent warnings (e.g., about insufficient SMs for max autotune GEMM mode), you may have to disable torch.compile or adjust your model settings accordingly. Although this may lead to slightly slower performance, it typically resolves hardware compatibility issues.
Multi-GPU Runs: This code should be ready for distributed training if you happen to have access to multiple GPUs. In that case, make sure that Gradient Accumulation Steps is divisible by the number of GPUs.
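The batch-size bookkeeping in the notes above can be sketched as follows. Both helpers and all numbers are illustrative and not taken from train_gpt2.py or run.sh. The sketch also assumes that the accumulation steps are a global count split evenly across GPUs, which is one reading of the divisibility requirement.

```python
def rescale_accumulation(micro_batch, accum_steps, new_micro_batch):
    """Accumulation steps that keep micro_batch * accum_steps unchanged.

    Hypothetical helper: the effective batch size is treated as the
    product of the per-pass batch and the number of accumulation steps.
    """
    effective = micro_batch * accum_steps
    if new_micro_batch <= 0 or effective % new_micro_batch != 0:
        raise ValueError("new micro batch must evenly divide the effective batch")
    return effective // new_micro_batch


def per_gpu_accumulation(accum_steps, num_gpus):
    """Split a global accumulation count evenly across GPUs.

    ASSUMPTION: each GPU performs accum_steps / num_gpus passes per
    optimizer step, hence the divisibility requirement.
    """
    if num_gpus <= 0 or accum_steps % num_gpus != 0:
        raise ValueError("accumulation steps must be divisible by the number of GPUs")
    return accum_steps // num_gpus


# Illustrative numbers only: halving the micro batch doubles the
# accumulation steps, so the effective batch stays at 32 * 8 = 256.
new_steps = rescale_accumulation(micro_batch=32, accum_steps=8, new_micro_batch=16)
print(new_steps)
print(per_gpu_accumulation(new_steps, num_gpus=4))
```

Keeping the effective batch size fixed this way means the learning rate and schedule should not need retuning; only throughput and memory use change.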

Comment on the target metric

The target metric is cross-entropy loss on the FineWeb validation set. The goal of the speedrun is to obtain a probability model of language which assigns a probability of at least math.exp(-3.3821 * 1048576) to the first 1,048,576 tokens of the FineWeb validation set. Hence, we allow evaluation at any sequence length, so long as we still have a valid probability model of language on the entire validation set.
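A small sketch of what that expression means numerically. Evaluated literally, the probability is far too small for a double-precision float, so in practice the comparison is done in log space. The helper `meets_target` is hypothetical and simply restates the threshold in terms of per-token losses (in nats, as implied by the use of math.exp).

```python
import math

TARGET_LOSS = 3.3821         # mean cross-entropy per token, in nats
NUM_VAL_TOKENS = 1_048_576   # 2**20 validation tokens

# Log-probability of the whole validation prefix at exactly the target loss.
total_log_prob = -TARGET_LOSS * NUM_VAL_TOKENS

# Evaluated directly, the probability underflows to 0.0 in floating point.
print(math.exp(total_log_prob))

# More readable equivalents of the same threshold.
perplexity = math.exp(TARGET_LOSS)
bits_per_token = TARGET_LOSS / math.log(2)
print(f"perplexity ~ {perplexity:.2f}, bits per token ~ {bits_per_token:.3f}")


def meets_target(per_token_losses):
    """True if the summed loss implies a probability at or above the target.

    Hypothetical helper: compares total negative log-likelihood against
    TARGET_LOSS times the number of tokens, which is equivalent to
    comparing the mean loss against TARGET_LOSS.
    """
    return math.fsum(per_token_losses) <= TARGET_LOSS * len(per_token_losses)
```

Because only the total log-probability matters, the loss can be accumulated over chunks of any length, as long as every token is conditioned only on tokens that precede it.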
