From 10084f5e7fc70bde194c74353b8d2ef15c62afe1 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Tue, 22 Oct 2024 09:59:47 -0400
Subject: [PATCH 01/87] misc updates

---
 .github/PULL_REQUEST_TEMPLATE.md | 21 +++++++++++++--------
 CODE_OF_CONDUCT.md               | 10 +++++-----
 CONTRIBUTING.md                  | 12 ++++++------
 docs/README.md                   |  0
 4 files changed, 24 insertions(+), 19 deletions(-)
 create mode 100644 docs/README.md

diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md
index 82f811b4c..43cca0b59 100644
--- a/.github/PULL_REQUEST_TEMPLATE.md
+++ b/.github/PULL_REQUEST_TEMPLATE.md
@@ -25,40 +25,45 @@ List the key changes introduced in this PR:
 1. Change A
 2. Change B
 
-# ✅ Checklist
+## ✅ Checklist
 
 Make sure the following tasks are completed before submitting the PR:
 
-### General:
-- [ ] 📜 I have read and followed the [contributing guidelines](CONTRIBUTING.md).
+### General
+
+- [ ] 📜 I have read and followed the [contributing guidelines](https://github.com/ServiceNow/Fast-LLM/blob/main/CONTRIBUTING.md).
+- [ ] 🏷️ I am using a clear and descriptive title that follows the [PR title guidelines](https://servicenow.github.io/Fast-LLM/developers/pr-title-guidelines).
 - [ ] 🎉 The functionality is complete, and I have tested the changes.
 - [ ] 📝 I have updated the documentation if needed.
 - [ ] ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
 - [ ] 🧩 I have commented my code, especially in hard-to-understand areas.
 
-### Dependencies and Configuration:
+### Dependencies and Configuration
+
 - [ ] 🐋 I have updated the Docker configuration or dependencies, if applicable.
 - [ ] 🔄 I have ensured compatibility with the existing setup after dependency changes.
 
-### Testing:
+### Testing
+
 - [ ] 🧪 I have added or updated tests to cover my changes.
 - [ ] ✔️ New and existing tests pass locally with my changes.
 - [ ] 🚦 I have tested these changes on GPUs and verified training stability.
 - [ ] 🏋️ I have tested the changes on realistic training workloads, if applicable.
 
-### Performance Impact:
+### Performance Impact
+
 - [ ] 📊 I have run benchmarks where applicable to evaluate the performance impact.
 - [ ] ✅ The benchmarks show no performance regression.
 - [ ] 🚀 The benchmarks indicate a potential performance improvement.
 - [ ] ⚠️ The benchmarks indicate a potential performance degradation.
 - [ ] 📈 I have provided benchmark results and detailed any performance impact below, if applicable.
 
-# 📊 Performance Impact Details
+## 📊 Performance Impact Details
 
 If there is any impact on performance, describe it and provide benchmark results, if applicable:
 
 ---
 
-# 📝 Additional Notes
+## 🗒️ Additional Notes
 
 Include any additional context, information, or considerations here, such as known issues, follow-up tasks, or backward compatibility concerns.
diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md
index b3b61bc8b..0bea0b279 100644
--- a/CODE_OF_CONDUCT.md
+++ b/CODE_OF_CONDUCT.md
@@ -11,7 +11,7 @@ Communities thrive when members support each other and provide useful feedback.
 - User Contributions must not include material that is defamatory, obscene, indecent, abusive, offensive, harassing, violent, hateful, inflammatory or otherwise objectionable.
 - Lively and collegial discussions are always encouraged in a healthy community. It is okay to argue facts but not okay to argue personalities or personal beliefs.
 - Do not use text formats such as all caps or bold that may be read as annoying, rude or send a strong message.
-- Do not publish anyone’s private personal information without their explicit consent.
+- Do not publish anyone's private personal information without their explicit consent.
 - Avoid using abbreviations or terminology that others may not understand. An abbreviation may mean something to you but in another context or country, it may have another meaning.
 - Be accountable for your actions by correcting your mistakes and indicating where you have changed a previous post of yours.
 - Mark content as correct and helpful, and provide feedback. If you read a discussion post that you find helpful, we encourage you to leave a positive vote and comment in the replies. If you find a post that is unhelpful, please provide more information in the issue comments.
@@ -27,15 +27,15 @@ ServiceNow suggests the following technical support pathways for open-source pro
 3. Search the Discussions.
 4. Search the project knowledge base or Wiki for known errors, useful solutions, and troubleshooting tips.
 5. Check the project guidelines in the [`CONTRIBUTING.md`](CONTRIBUTING.md) file if you would like details on how you can submit a change. Community contributions are valued and appreciated!
-6. Log an Issue if it hasn’t already been logged. If the issue has already been logged by another user, vote it up, and add a comment with additional or missing information. Do your best to choose the correct category when logging a new issue. This will make it easier to differentiate bugs from new feature requests or ideas. If after logging an issue you find the solution, please close your issue and provide a comment with the solution. This will help the project owners and other users.
+6. Log an Issue if it hasn't already been logged. If the issue has already been logged by another user, vote it up, and add a comment with additional or missing information. Do your best to choose the correct category when logging a new issue. This will make it easier to differentiate bugs from new feature requests or ideas. If after logging an issue you find the solution, please close your issue and provide a comment with the solution. This will help the project owners and other users.
 7. Contact the project team contributors of the project to see if they can help as a last resort only.
 
 **Repositories**
 
 - Read and follow the license instructions
-- Remember to include citations if you use someone else’s work in your own project. Use the [`CITATION.cff`](CITATION.cff) to find the correct project citation reference.
-- ‘Star’ project repos to save for future reference.
-- ‘Watch’ project repos to get notifications of changes – this can get noisy for some projects, so only watch the ones you really need to track closely.
+- Remember to include citations if you use someone else's work in your own project. Use the [`CITATION.cff`](CITATION.cff) to find the correct project citation reference.
+- 'Star' project repos to save for future reference.
+- 'Watch' project repos to get notifications of changes – this can get noisy for some projects, so only watch the ones you really need to track closely.
 
 **Enforcement and reporting**
 
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index e5ab7694f..e4676fc52 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -8,9 +8,9 @@ If you have questions or want to start a discussion, feel free to [open a discus
 
 To get started with contributing to Fast-LLM, follow these steps to set up your environment:
 
-1. **Set Up the Development Environment**: Fast-LLM is built on [PyTorch](https://pytorch.org/) and [Triton](https://triton-lang.org/). Check out our [setup guide](https://servicenow.github.io/Fast-LLM/development/setup) for instructions on getting everything ready, including the development environment and dependencies.
-2. **Learn Our Best Practices**: Get familiar with our [development best practices](https://servicenow.github.io/Fast-LLM/development/dev-practices/), which cover code style, pre-commit hooks, and testing strategies.
-3. **Launch Fast-LLM Locally or with Docker**: Need help getting started? Follow the instructions in the [launching section](https://servicenow.github.io/Fast-LLM/development/launching) to get Fast-LLM up and running.
+1. **Set Up the Development Environment**: Fast-LLM is built on [PyTorch](https://pytorch.org/) and [Triton](https://triton-lang.org/). Check out our [setup guide](https://servicenow.github.io/Fast-LLM/developers/setup) for instructions on getting everything ready, including the development environment and dependencies.
+2. **Learn Our Best Practices**: Get familiar with our [development best practices](https://servicenow.github.io/Fast-LLM/developers/dev-practices/), which cover code style, pre-commit hooks, and testing strategies.
+3. **Launch Fast-LLM Locally or with Docker**: Need help getting started? Follow the instructions in the [launching section](https://servicenow.github.io/Fast-LLM/developers/launching) to get Fast-LLM up and running.
 
 ## How to Report a Bug 🐞
 
@@ -31,7 +31,7 @@ Before diving into code, [open an issue](https://github.com/ServiceNow/Fast-LLM/
 2. **Clone Your Fork Locally**: Use `git clone` to bring the code to your local machine.
 3. **Create a New Branch**: Name your branch descriptively, such as `feature/awesome-feature` or `fix/nasty-bug`.
 4. **Make Your Changes**: Work your magic! Don't forget to add or update tests, benchmarks, or configurations as needed.
-5. **Create a Properly Titled Pull Request**: When you're ready to open a PR, make sure to use a clear and descriptive title that follows our [PR title guidelines](https://servicenow.github.io/Fast-LLM/development/pr-title-guidelines). This title will become the commit message for the squashed merge.
+5. **Create a Properly Titled Pull Request**: When you're ready to open a PR, make sure to use a clear and descriptive title that follows our [PR title guidelines](https://servicenow.github.io/Fast-LLM/developers/pr-title-guidelines). This title will become the commit message for the squashed merge.
 6. **Push to Your Fork**: Push the branch to your GitHub fork.
 7. **Open a Pull Request**: [Submit a pull request](https://github.com/ServiceNow/Fast-LLM/compare) to the `main` branch. Reference the original issue number and provide a brief summary of your changes.
 
@@ -39,14 +39,14 @@ Before diving into code, [open an issue](https://github.com/ServiceNow/Fast-LLM/
 
 Here are some tips to ensure your pull request gets reviewed and merged promptly:
 
-- **Follow our coding standards**: Stick to our [development best practices](https://servicenow.github.io/Fast-LLM/development/dev-practices/) to keep the code clean and consistent.
+- **Follow our coding standards**: Stick to our [development best practices](https://servicenow.github.io/Fast-LLM/developers/dev-practices/) to keep the code clean and consistent.
 - **Write tests**: Verify your changes with unit tests for new features or bug fixes.
 - **Test on GPUs and real-world workloads**: Since Fast-LLM is all about training large language models, make sure your changes work smoothly in GPU environments and on typical training setups.
 - **Run benchmarks and performance tests**: Make sure your changes don't slow things down. If there's any impact on performance, provide benchmark results to back it up.
 - **Avoid introducing new issues**: Check that there are no new runtime warnings, type checker errors, linting problems, or unhandled edge cases.
 - **Comment non-trivial code**: Make your code easy to understand for others.
 - **Keep sensitive data out**: Make sure your code or commit messages don't expose private or proprietary information.
-- **Use the [PR template](https://github.com/ServiceNow/Fast-LLM/blob/main/.github/pull_request_template.md)**: Complete the checklist to make sure everything is in order before hitting submit.
+- **Use the [PR template](https://github.com/ServiceNow/Fast-LLM/blob/main/.github/PULL_REQUEST_TEMPLATE.md)**: Complete the checklist to make sure everything is in order before hitting submit.
 
 ## Seeking Help or Clarification
 
diff --git a/docs/README.md b/docs/README.md
new file mode 100644
index 000000000..e69de29bb

From 91cd526060062d4f078dc068d5c9404298635686 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Tue, 22 Oct 2024 10:00:07 -0400
Subject: [PATCH 02/87] revamp landing page

---
 docs/index.md | 45 ++++++++++++++++++++++++++++++++++++---------
 1 file changed, 36 insertions(+), 9 deletions(-)

diff --git a/docs/index.md b/docs/index.md
index aedfb6ee3..b30e452cd 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -6,22 +6,49 @@ hide:
   - feedback
 ---
 
-# Fast-LLM
+Welcome to **Fast-LLM**, the cutting-edge open-source library built for training large language models (LLMs) with exceptional speed, scalability, and customization. Developed by ServiceNow Research's Foundation Models Lab, Fast-LLM is engineered to meet the rigorous demands of professional AI teams, research institutions, and enterprises pushing the limits of generative AI. Whether you're training models for groundbreaking research or high-stakes production, Fast-LLM empowers you to achieve unparalleled results.
 
-Welcome to Fast-LLM, an innovative library designed for training large language models with an emphasis on speed, flexibility, and convenience. Developed by ServiceNow Research's Foundation Models Lab, Fast-LLM is tailored to meet the rigorous demands of enterprise AI solutions, providing a foundation for our bespoke generative AI applications.
+## Why Fast-LLM?
 
-## Key Features
+Fast-LLM is purpose-built for serious AI practitioners who need more than off-the-shelf solutions. It is designed to handle the most demanding language model training tasks, offering a robust, flexible, and high-performance alternative to commercial frameworks like Megatron-LM or NeMo.
 
-- **Speed**: Fast-LLM delivers unparalleled training throughput, achieving speeds up to 4,000 tokens/s/GPU for Mixtral-8x7B and nearly 9,000 tokens/s/GPU for Mistral-7B, facilitating rapid model development and iteration.
-- **Flexibility**: The library supports a diverse array of model architectures including, but not limited to, GPT, StarCoder, Llama, Mistral, and Mixtral. It is designed to be adaptable, allowing for easy expansion and customization to a broad range of models and training scenarios.
-- **Convenience**: Designed with the user in mind, Fast-LLM aims to be straightforward and intuitive, enabling researchers and developers to focus more on innovation and less on the complexities of the tooling.
+### Key Features
+
+- **🚀 Speed Like No Other:** Achieve record-breaking training throughput with Fast-LLM. For instance, train Mistral-7B at nearly **9,800 tokens/s/GPU** on a 4-node cluster with 32 H100 GPUs. Our optimized kernels, advanced parallelism, and memory-efficient techniques drastically reduce training time and cost.
+- **📡 Unmatched Scalability:** Fast-LLM scales seamlessly from a single GPU to large compute clusters, supporting 3D parallelism (data, tensor, and pipeline), sequence length parallelism, and ZeRO-1, ZeRO-2, and ZeRO-3 techniques for maximum memory efficiency. Scale to the size you need without sacrificing performance.
+- **🎛️ Total Flexibility:** Fast-LLM is compatible with all major language model architectures, including GPT, Llama, Mistral, StarCoder, and Mixtral. Its modular design enables extensive customization of model architectures, optimizers, data loaders, and training loops, giving you full control over your training workflows.
+- **📦 Seamless Integration:** Fast-LLM integrates smoothly with popular libraries such as [Hugging Face Transformers](https://huggingface.co/transformers), making it easy to leverage existing models and datasets while benefiting from our optimizations.
+- **🛠️ Professional-Grade Tools:** Fast-LLM supports mixed precision training, large batch training, and gradient accumulation, all while maintaining reproducibility through deterministic behavior. Our pre-built Docker images, YAML-based configurations, and command-line interface make setup straightforward, so you can focus on what matters most—innovating with AI.
+
+## The Fast-LLM Advantage
+
+Designed for professionals who demand speed, scale, and customization, Fast-LLM is not just another library, it's a platform for powering the next generation of AI breakthroughs. Here's what sets Fast-LLM apart:
+
+- **Purpose-Built for Large-Scale AI:** Unlike generic frameworks, Fast-LLM is optimized specifically for training large language models, with features tuned for massive compute clusters and high-throughput workflows.
+- **Openness Without Compromise:** Our commitment to open-source ensures that you can customize and extend Fast-LLM to suit your specific needs, without the limitations of proprietary software.
+- **Community-Driven Development:** While our focus is on professionals and enterprise users, we believe in open innovation. Fast-LLM's development is transparent, and we actively welcome contributions that help make our platform even more powerful.
 
 ## Project Scope and Objectives
 
-Fast-LLM seeks to provide a high-quality alternative to existing frameworks such as Megatron-LM and NeMo. It is compatible with 3D parallelism and is designed to integrate seamlessly with Huggingface Transformers, promoting not only efficient model training but also straightforward model deployment and inference.
+Fast-LLM is designed to be the go-to solution for those training the most sophisticated language models. Our objectives include:
+
+- **Accelerating Training Workflows:** By leveraging optimized kernel efficiency, advanced parallelism, and custom memory management techniques, we aim to deliver the fastest LLM training experience available.
+- **Supporting a Broad Range of Architectures:** Fast-LLM offers built-in support for GPT, Llama, StarCoder, Mistral, Mixtral, and more, with an architecture-agnostic approach that allows users to easily adapt the framework to emerging models.
+- **Enabling Seamless Integration and Deployment:** From training to deployment, Fast-LLM integrates effortlessly with existing ML pipelines, including Hugging Face Transformers and Kubernetes-based clusters.
+- **Advancing LLM Research and Production-Readiness:** With support for mixed precision training, ZeRO optimizations, and reproducibility features, Fast-LLM is equipped for both cutting-edge research and mission-critical production environments.
 
 ## Collaboration and Contribution
 
-The project is set for open-sourcing in Q2 2024, inviting contributions from the community in areas such as testing, bug fixes, new features, and documentation. We are especially interested in enhancements related to custom kernels using OpenAI's Triton JIT compiler and adaptations for alternative hardware platforms like AMD and Intel.
+As we continue to expand Fast-LLM, we're looking for contributions from the community to help shape its future. We welcome:
+
+- **Testing and Bug Fixes:** Help us identify issues and improve stability.
+- **Feature Development:** Contribute new capabilities, such as custom kernels or support for alternative hardware like AMD and Intel.
+- **Documentation and Tutorials:** Make Fast-LLM more accessible by improving our [documentation](https://servicenow.github.io/Fast-LLM) and writing practical guides.
+
+Fast-LLM is more than just software; it's a community. Get involved by exploring our [contribution guidelines](https://github.com/ServiceNow/Fast-LLM/blob/main/CONTRIBUTING.md) and engaging with us on [GitHub Discussions](https://github.com/ServiceNow/Fast-LLM/discussions).
+
+## Getting Started
+
+Ready to dive in? Check out our [quickstart guide](quickstart.md) for an overview of how to set up and run Fast-LLM on different platforms, including Slurm and Kubernetes. Explore the [examples](examples/) section for pre-configured setups to help you get started quickly with your own training experiments.
 
-For more details on getting involved or using Fast-LLM, please refer to our [contribution guidelines](https://github.com/ServiceNow/Fast-LLM/CONTRIBUTING.md) and the subsequent sections of this documentation.
+For any questions or issues, don't hesitate to open an [issue](https://github.com/ServiceNow/Fast-LLM/issues) or reach out to the community. We're here to help you accelerate your LLM training to full speed.

From df5e09d4ed955313063d28fa55f41846e861ec2d Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Tue, 22 Oct 2024 10:00:18 -0400
Subject: [PATCH 03/87] add about us section

---
 docs/about-us.md | 41 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)
 create mode 100644 docs/about-us.md

diff --git a/docs/about-us.md b/docs/about-us.md
new file mode 100644
index 000000000..a3c9878fb
--- /dev/null
+++ b/docs/about-us.md
@@ -0,0 +1,41 @@
+---
+title: About Us
+---
+
+Welcome to Fast-LLM! We are a global team of engineers, researchers, and AI professionals led by the Foundation Models Lab at [ServiceNow Research](https://www.servicenow.com/research/), dedicated to advancing large language models (LLMs) and providing the highest-performance tools for serious users. Designed with professionals, research institutions, and enterprises in mind, Fast-LLM offers the speed, scalability, and flexibility needed to train the biggest and most complex models. Our commitment to open-source ensures that you have full control over your workflows, without the limitations or compromises of commercial frameworks.
+
+## Our Mission
+
+Our mission is to deliver a best-in-class library for training large-scale language models, combining cutting-edge performance with robust, customizable features. Fast-LLM is built to meet the needs of researchers and organizations who push the boundaries of generative AI, enabling them to train state-of-the-art models more efficiently. By optimizing training workflows and scaling to massive compute clusters, we help professionals unlock the full potential of LLMs, reducing costs and time-to-deployment for ambitious AI projects.
+
+## Our Vision
+
+We envision Fast-LLM as the go-to solution for serious AI practitioners who require more than what typical frameworks can offer. Our goal is to empower research institutions, corporate AI teams, and universities to train sophisticated models that exceed the capabilities of standard tools. By creating a highly performant and customizable library, we aim to be the backbone of cutting-edge AI research and development, equipping experts with the tools they need to tackle the toughest training challenges.
+
+## Our Values
+
+At Fast-LLM, we adhere to a set of guiding principles that define our approach:
+
+- **Performance-Driven:** We are relentless in our pursuit of speed and efficiency. Fast-LLM is built to reduce training time and scale to the largest clusters, enabling our users to achieve breakthrough results faster.
+- **Professional-Grade Customization:** We understand that serious AI work demands flexibility. Fast-LLM is designed for extensive customization, allowing users to tailor every aspect of the training process to their unique needs.
+- **Open Innovation:** While we cater to advanced users, our commitment to open-source ensures that innovation remains accessible. We believe in building a community where professionals can collaborate and contribute to shaping the future of AI.
+- **Reliability at Scale:** Fast-LLM is built with rigorous standards to support production-level workloads. We prioritize stability, reproducibility, and robustness, ensuring that your models can scale from research to real-world applications seamlessly.
+
+## Meet the Team
+
+Fast-LLM is led by the Foundation Models Lab at [ServiceNow Research](https://www.servicenow.com/research/), with development driven by a dedicated group of professionals who bring extensive expertise in AI, machine learning, and distributed systems. While the project direction is guided by the Foundation Models Lab, contributions come from a growing network of researchers, developers, and industry experts worldwide. Here are some of the key members leading the project:
+
+- **Joel Lamy Poirier** - Lead Developer and maintainer, ServiceNow Research: Joel spearheads the core development, ensuring that Fast-LLM delivers on its promise of speed and scalability.
+- **Torsten Scholak** - Research Science Lead, ServiceNow Research: Torsten leads our research efforts, driving the scientific innovations that keep Fast-LLM at the forefront of AI training.
+
+Our core team includes members affiliated with ServiceNow Research, as well as other contributors who bring unique perspectives and skills to the project. We welcome new participants from the broader AI community who share our vision of creating the best tools for training large-scale language models.
+
+## Get Involved
+
+Fast-LLM is an open-source project that thrives on collaboration. If you're a professional or researcher looking to contribute, there are many ways to get involved:
+
+- **Code Contributions:** Dive into our [contribution guidelines](CONTRIBUTING.md) to learn how you can help improve Fast-LLM.
+- **Discussion and Ideas:** Join us on [GitHub Discussions](https://github.com/ServiceNow/Fast-LLM/discussions) to share your insights, ask questions, or discuss new features.
+- **Documentation and Tutorials:** Help us expand our [documentation](https://servicenow.github.io/Fast-LLM/), making it even more valuable for other professionals.
+
+If you're serious about training large language models, Fast-LLM is here to help you push the limits. We look forward to your contributions and feedback as we continue to make LLM training faster and better.

From 585bb894a75ef3bda66671a8294eaaac26adda29 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Tue, 22 Oct 2024 10:00:27 -0400
Subject: [PATCH 04/87] add developers corner

---
 docs/developers/dev-practices.md       |  0
 docs/developers/launching.md           |  0
 docs/developers/pr-title-guidelines.md | 13 +++++++++++++
 docs/developers/setup.md               |  0
 4 files changed, 13 insertions(+)
 create mode 100644 docs/developers/dev-practices.md
 create mode 100644 docs/developers/launching.md
 create mode 100644 docs/developers/pr-title-guidelines.md
 create mode 100644 docs/developers/setup.md
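
Note: the title convention added in docs/developers/pr-title-guidelines.md
could be checked mechanically. The following is a minimal sketch, assuming
titles take the bracketed form used in the guideline's examples
(`[feat] ...`). The names `check_pr_title`, `KEYWORDS`, and `VAGUE_TITLES`
are hypothetical, and nothing in this series wires such a check into CI.

```python
import re

# Keywords listed in the guideline (assumed to be the complete set).
KEYWORDS = ("feat", "fix", "perf", "refactor", "docs", "build")

# "[keyword] description", mirroring the bracketed examples in the guideline.
TITLE_RE = re.compile(r"^\[(" + "|".join(KEYWORDS) + r")\] (\S.*)$")

# Vague descriptions the guideline explicitly warns against.
VAGUE_TITLES = {"fix bug", "update code"}


def check_pr_title(title: str) -> bool:
    """Return True if the title starts with a known keyword and is not vague."""
    match = TITLE_RE.match(title)
    if match is None:
        return False
    description = match.group(2).strip().rstrip(".").lower()
    return description not in VAGUE_TITLES
```

With this sketch, `check_pr_title("[fix] resolve #123 memory leak in training loop")`
passes, while a bare "Fix bug" or a title with an unlisted keyword is rejected.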

diff --git a/docs/developers/dev-practices.md b/docs/developers/dev-practices.md
new file mode 100644
index 000000000..e69de29bb
diff --git a/docs/developers/launching.md b/docs/developers/launching.md
new file mode 100644
index 000000000..e69de29bb
diff --git a/docs/developers/pr-title-guidelines.md b/docs/developers/pr-title-guidelines.md
new file mode 100644
index 000000000..fce109568
--- /dev/null
+++ b/docs/developers/pr-title-guidelines.md
@@ -0,0 +1,13 @@
+# PR Title Guidelines ✏️
+
+Since we squash commits when merging pull requests, the PR title will become the commit message for the squashed commit. To ensure a clear and consistent project history, follow these guidelines for naming your PR:
+
+1. **Use a concise yet descriptive title**: The title should summarize the key change or feature introduced. Avoid vague titles like "Fix bug" or "Update code."
+2. **Start with a keyword**: Use keywords to categorize the type of change. For example:
+   - **feat:** for new features (e.g., `[feat] add support for mixed-precision training`)
+   - **fix:** for bug fixes (e.g., `[fix] resolve memory leak during backpropagation`)
+   - **perf:** for performance improvements (e.g., `[perf] optimize gradient accumulation step`)
+   - **refactor:** for code refactoring (e.g., `[refactor] clean up data loader module`)
+   - **docs:** for documentation changes (e.g., `[docs] update contributing guidelines`)
+   - **build:** for changes to the build process or dependencies (e.g., `[build] bump PyTorch version`)
+3. **Reference the issue number (if applicable)**: If the PR is related to a specific issue, include the issue number in the title (e.g., `[fix] resolve #123 memory leak in training loop`).
diff --git a/docs/developers/setup.md b/docs/developers/setup.md
new file mode 100644
index 000000000..e69de29bb

From 62a3b223eb4b5604508ddf7faac646dce97516b4 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Tue, 22 Oct 2024 12:19:40 -0400
Subject: [PATCH 05/87] add docs README

---
 docs/README.md | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/docs/README.md b/docs/README.md
index e69de29bb..c70897016 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -0,0 +1,11 @@
+# Fast-LLM Documentation Sources
+
+This folder contains the source files for the Fast-LLM documentation. The contents here are used to generate the rendered documentation, which is automatically updated and published whenever changes are pushed to the `main` branch.
+
+## 📚 Access the Rendered Documentation
+
+To view the complete, rendered documentation, please visit the [Fast-LLM Documentation Site](https://servicenow.github.io/Fast-LLM).
+
+## Contributing to the Documentation
+
+If you'd like to contribute to the Fast-LLM documentation, feel free to edit these source files and submit a pull request. The changes will be reflected on the rendered documentation site after they are merged into the `main` branch.

From 257ba2dfb1a3d4243f215b9f1718c39d8ef3598a Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Tue, 22 Oct 2024 12:47:42 -0400
Subject: [PATCH 06/87] improve landing page

---
 docs/index.md | 40 +++++++++++++++++++++++++---------------
 1 file changed, 25 insertions(+), 15 deletions(-)

diff --git a/docs/index.md b/docs/index.md
index b30e452cd..3ddd12de0 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -6,27 +6,37 @@ hide:
   - feedback
 ---
 
-Welcome to **Fast-LLM**, the cutting-edge open-source library built for training large language models (LLMs) with exceptional speed, scalability, and customization. Developed by ServiceNow Research's Foundation Models Lab, Fast-LLM is engineered to meet the rigorous demands of professional AI teams, research institutions, and enterprises pushing the limits of generative AI. Whether you're training models for groundbreaking research or high-stakes production, Fast-LLM empowers you to achieve unparalleled results.
+Welcome to **Fast-LLM**, the cutting-edge open-source library built for training large language models (LLMs) with exceptional speed, scalability, and customization. Developed by [ServiceNow Research](https://www.servicenow.com/research/)'s Foundation Models Lab, Fast-LLM is engineered to meet the rigorous demands of professional AI teams, research institutions, and enterprises pushing the limits of generative AI. Whether you're training models for groundbreaking research or high-stakes production, Fast-LLM empowers you to achieve unparalleled results.
 
 ## Why Fast-LLM?
 
-Fast-LLM is purpose-built for serious AI practitioners who need more than off-the-shelf solutions. It is designed to handle the most demanding language model training tasks, offering a robust, flexible, and high-performance alternative to commercial frameworks like Megatron-LM or NeMo.
+Fast-LLM is designed for professionals who demand speed, scalability, and customization in training large language models. It goes beyond off-the-shelf solutions to meet the rigorous requirements of large-scale AI projects, offering a robust, flexible, and high-performance alternative to frameworks like NVIDIA NeMo Megatron. With Fast-LLM, you can train your most sophisticated models while optimizing for both performance and cost.
+
+### The Fast-LLM Advantage
+
+Fast-LLM isn't just another library; it's a platform for powering the next generation of AI breakthroughs. Here's what sets it apart:
+
+- **🚀 Purpose-Built for Large-Scale AI:** Optimized specifically for training large language models at scale, Fast-LLM comes with features fine-tuned for massive compute clusters and high-throughput workflows. It supports advanced parallelism techniques, ZeRO optimizations, and high-throughput kernels, making it ideal for handling the most demanding training tasks.
+- **💰 Cost Efficiency That Sets Fast-LLM Apart:** Fast-LLM's optimizations translate directly into significant cost savings:
+  - **Lower Training Costs:** Fast-LLM achieves higher throughput per GPU, reducing the number of hours needed to complete training tasks. For example, training a Mistral-7B model can be up to xx% cheaper compared to other frameworks due to faster processing (insert exact point of reference here).
+  - **More Tokens for Your Budget:** Train on significantly more data within the same budget, up to xx% more tokens per dollar—leading to better-trained models and higher-quality results (insert exact point of reference here).
+  [Learn more about Fast-LLM's cost efficiency and see detailed comparisons](cost-efficiency.md).
+- **🔓 Openness Without Compromise:** Our commitment to open-source ensures that you can customize and extend Fast-LLM to suit your specific needs without the limitations of proprietary software. Fast-LLM gives you full control over your training workflows, from experimentation to production.
+- **🌍 Community-Driven Development:** While our focus is on professionals and enterprise users, we believe in open innovation. Fast-LLM's development is transparent, and we actively welcome contributions that make our platform even more powerful and versatile.
+
+### Built for the Most Demanding Training Tasks
+
+Fast-LLM is engineered to handle complex AI projects with ease, offering a scalable solution that supports various model architectures, including Llama, Mistral, StarCoder, and Mixtral. Whether you're training on a single GPU or a multi-node cluster, Fast-LLM adapts to your setup and scales effortlessly to meet your requirements.
 
 ### Key Features
 
+Fast-LLM offers all the features you need to accelerate your LLM training to full speed:
+
 - **🚀 Speed Like No Other:** Achieve record-breaking training throughput with Fast-LLM. For instance, train Mistral-7B at nearly **9,800 tokens/s/GPU** on a 4-node cluster with 32 H100 GPUs. Our optimized kernels, advanced parallelism, and memory-efficient techniques drastically reduce training time and cost.
 - **📡 Unmatched Scalability:** Fast-LLM scales seamlessly from a single GPU to large compute clusters, supporting 3D parallelism (data, tensor, and pipeline), sequence length parallelism, and ZeRO-1, ZeRO-2, and ZeRO-3 techniques for maximum memory efficiency. Scale to the size you need without sacrificing performance.
 - **🎛️ Total Flexibility:** Fast-LLM is compatible with all major language model architectures, including GPT, Llama, Mistral, StarCoder, and Mixtral. Its modular design enables extensive customization of model architectures, optimizers, data loaders, and training loops, giving you full control over your training workflows.
 - **📦 Seamless Integration:** Fast-LLM integrates smoothly with popular libraries such as [Hugging Face Transformers](https://huggingface.co/transformers), making it easy to leverage existing models and datasets while benefiting from our optimizations.
-- **🛠️ Professional-Grade Tools:** Fast-LLM supports mixed precision training, large batch training, and gradient accumulation, all while maintaining reproducibility through deterministic behavior. Our pre-built Docker images, YAML-based configurations, and command-line interface make setup straightforward, so you can focus on what matters most—innovating with AI.
-
-## The Fast-LLM Advantage
-
-Designed for professionals who demand speed, scale, and customization, Fast-LLM is not just another library, it's a platform for powering the next generation of AI breakthroughs. Here's what sets Fast-LLM apart:
-
-- **Purpose-Built for Large-Scale AI:** Unlike generic frameworks, Fast-LLM is optimized specifically for training large language models, with features tuned for massive compute clusters and high-throughput workflows.
-- **Openness Without Compromise:** Our commitment to open-source ensures that you can customize and extend Fast-LLM to suit your specific needs, without the limitations of proprietary software.
-- **Community-Driven Development:** While our focus is on professionals and enterprise users, we believe in open innovation. Fast-LLM's development is transparent, and we actively welcome contributions that help make our platform even more powerful.
+- **🛠️ Professional-Grade Tools:** Fast-LLM supports mixed precision training, large batch training, and gradient accumulation, all while maintaining reproducibility through deterministic behavior. Our pre-built Docker images, YAML-based configurations, and command-line interface make setup straightforward, so you can focus on what matters most: innovating with AI.
 
 ## Project Scope and Objectives
 
@@ -34,8 +44,8 @@ Fast-LLM is designed to be the go-to solution for those training the most sophis
 
 - **Accelerating Training Workflows:** By leveraging optimized kernel efficiency, advanced parallelism, and custom memory management techniques, we aim to deliver the fastest LLM training experience available.
 - **Supporting a Broad Range of Architectures:** Fast-LLM offers built-in support for GPT, Llama, StarCoder, Mistral, Mixtral, and more, with an architecture-agnostic approach that allows users to easily adapt the framework to emerging models.
-- **Enabling Seamless Integration and Deployment:** From training to deployment, Fast-LLM integrates effortlessly with existing ML pipelines, including Hugging Face Transformers and Kubernetes-based clusters.
-- **Advancing LLM Research and Production-Readiness:** With support for mixed precision training, ZeRO optimizations, and reproducibility features, Fast-LLM is equipped for both cutting-edge research and mission-critical production environments.
+- **Enabling Seamless Integration and Deployment:** From training to deployment, Fast-LLM integrates effortlessly with existing ML pipelines, including [Hugging Face Transformers](https://huggingface.co/transformers) and [Kubernetes](https://kubernetes.io)-based clusters.
+- **Advancing LLM Research and Production-Readiness:** With support for mixed precision training, Zero Redundancy Optimizer (ZeRO) techniques, and reproducibility features, Fast-LLM is equipped for both cutting-edge research and mission-critical production workloads.
 
 ## Collaboration and Contribution
 
@@ -45,10 +55,10 @@ As we continue to expand Fast-LLM, we're looking for contributions from the comm
 - **Feature Development:** Contribute new capabilities, such as custom kernels or support for alternative hardware like AMD and Intel.
 - **Documentation and Tutorials:** Make Fast-LLM more accessible by improving our [documentation](https://servicenow.github.io/Fast-LLM) and writing practical guides.
 
-Fast-LLM is more than just software—it's a community. Get involved by exploring our [contribution guidelines](https://github.com/ServiceNow/Fast-LLM/CONTRIBUTING.md) and engaging with us on [GitHub Discussions](). 
+Fast-LLM is more than just software; it's a community. Get involved by exploring our [contribution guidelines](https://github.com/ServiceNow/Fast-LLM/blob/main/CONTRIBUTING.md) and engaging with us on [GitHub Discussions](https://github.com/ServiceNow/Fast-LLM/discussions).
 
 ## Getting Started
 
-Ready to dive in? Check out our [quickstart guide](quickstart.md) for an overview of how to set up and run Fast-LLM on different platforms, including Slurm and Kubernetes. Explore the [examples](examples/) section for pre-configured setups to help you get started quickly with your own training experiments.
+Ready to dive in? Check out our [quickstart guide](quickstart.md) for an overview of how to set up and run Fast-LLM on different platforms, including [Slurm](https://slurm.schedmd.com) and [Kubernetes](https://kubernetes.io). Explore the [examples](https://github.com/ServiceNow/Fast-LLM/tree/main/examples) for pre-configured setups to help you get started quickly with your own training experiments.
 
 For any questions or issues, don't hesitate to open an [issue](https://github.com/ServiceNow/Fast-LLM/issues) or reach out to the community. We're here to help you accelerate your LLM training to full speed.

From a99d56dea4798cebdcc8f33fa724d680d7567936 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Tue, 22 Oct 2024 12:50:26 -0400
Subject: [PATCH 07/87] improve landing page

---
 docs/index.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/index.md b/docs/index.md
index 3ddd12de0..711a32517 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -10,7 +10,7 @@ Welcome to **Fast-LLM**, the cutting-edge open-source library built for training
 
 ## Why Fast-LLM?
 
-Fast-LLM is designed for professionals who demand speed, scalability, and customization in training large language models. It goes beyond off-the-shelf solutions to meet the rigorous requirements of large-scale AI projects, offering a robust, flexible, and high-performance alternative to frameworks like NVIDIA NeMo Megatron. With Fast-LLM, you can train your most sophisticated models while optimizing for both performance and cost.
+Fast-LLM is designed for professionals who demand speed, scalability, and customization in training large language models. It goes beyond off-the-shelf solutions to meet the rigorous requirements of large-scale AI projects, offering a robust, flexible, and high-performance open-source alternative to commercial frameworks like NVIDIA NeMo Megatron. With Fast-LLM, you can train your most sophisticated models while optimizing for both performance and cost.
 
 ### The Fast-LLM Advantage
 

From dc8e1e62393d7f89c58e4321edb8a6e9c9998f22 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Tue, 22 Oct 2024 13:32:35 -0400
Subject: [PATCH 08/87] improve landing page

---
 docs/index.md | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/docs/index.md b/docs/index.md
index 711a32517..f6f315406 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -17,11 +17,16 @@ Fast-LLM is designed for professionals who demand speed, scalability, and custom
 Fast-LLM isn't just another library, it's a platform for powering the next generation of AI breakthroughs. Here's what sets it apart:
 
 - **🚀 Purpose-Built for Large-Scale AI:** Optimized specifically for training large language models at scale, Fast-LLM comes with features fine-tuned for massive compute clusters and high-throughput workflows. It supports advanced parallelism techniques, ZeRO optimizations, and high-throughput kernels, making it ideal for handling the most demanding training tasks.
+
 - **💰 Cost Efficiency That Sets Fast-LLM Apart:** Fast-LLM's optimizations translate directly into significant cost savings:
+
   - **Lower Training Costs:** Fast-LLM achieves higher throughput per GPU, reducing the number of hours needed to complete training tasks. For example, training a Mistral-7B model can be up to xx% cheaper compared to other frameworks due to faster processing (insert exact point of reference here).
   - **More Tokens for Your Budget:** Train on significantly more data within the same budget, up to xx% more tokens per dollar—leading to better-trained models and higher-quality results (insert exact point of reference here).
+
   [Learn more about Fast-LLM's cost efficiency and see detailed comparisons](cost-efficiency.md).
+
 - **🔓 Openness Without Compromise:** Our commitment to open-source ensures that you can customize and extend Fast-LLM to suit your specific needs without the limitations of proprietary software. Fast-LLM gives you full control over your training workflows, from experimentation to production.
+
 - **🌍 Community-Driven Development:** While our focus is on professionals and enterprise users, we believe in open innovation. Fast-LLM's development is transparent, and we actively welcome contributions that make our platform even more powerful and versatile.
 
 ### Built for the Most Demanding Training Tasks
@@ -33,9 +38,13 @@ Fast-LLM is engineered to handle complex AI projects with ease, offering a scala
 Fast-LLM offers all the features you need to accelerate your LLM training to full speed:
 
 - **🚀 Speed Like No Other:** Achieve record-breaking training throughput with Fast-LLM. For instance, train Mistral-7B at nearly **9,800 tokens/s/GPU** on a 4-node cluster with 32 H100 GPUs. Our optimized kernels, advanced parallelism, and memory-efficient techniques drastically reduce training time and cost.
+
 - **📡 Unmatched Scalability:** Fast-LLM scales seamlessly from a single GPU to large compute clusters, supporting 3D parallelism (data, tensor, and pipeline), sequence length parallelism, and ZeRO-1, ZeRO-2, and ZeRO-3 techniques for maximum memory efficiency. Scale to the size you need without sacrificing performance.
+
 - **🎛️ Total Flexibility:** Fast-LLM is compatible with all major language model architectures, including GPT, Llama, Mistral, StarCoder, and Mixtral. Its modular design enables extensive customization of model architectures, optimizers, data loaders, and training loops, giving you full control over your training workflows.
+
 - **📦 Seamless Integration:** Fast-LLM integrates smoothly with popular libraries such as [Hugging Face Transformers](https://huggingface.co/transformers), making it easy to leverage existing models and datasets while benefiting from our optimizations.
+
 - **🛠️ Professional-Grade Tools:** Fast-LLM supports mixed precision training, large batch training, and gradient accumulation, all while maintaining reproducibility through deterministic behavior. Our pre-built Docker images, YAML-based configurations, and command-line interface make setup straightforward, so you can focus on what matters most: innovating with AI.
 
 ## Project Scope and Objectives

From 4b47d517db24b2df2305f92296332bbd00ce1445 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Tue, 22 Oct 2024 13:32:51 -0400
Subject: [PATCH 09/87] add cost-efficiency comparison

---
 docs/cost-efficiency.md | 54 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 54 insertions(+)
 create mode 100644 docs/cost-efficiency.md
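
Note on the arithmetic behind a comparison like the one in `docs/cost-efficiency.md`: the derived columns (training time, total cost, tokens per budget) can be computed from three inputs, namely per-GPU throughput, GPU count, and the assumed price per GPU-hour. The Python sketch below shows one way to do that calculation. It assumes throughput stays constant over the whole run and that billing is linear in GPU-hours. The function names are illustrative only and are not part of Fast-LLM.

```python
SECONDS_PER_HOUR = 3600


def cluster_cost_per_hour(num_gpus: int, usd_per_gpu_hour: float) -> float:
    """Hourly cost of the whole cluster in USD."""
    return num_gpus * usd_per_gpu_hour


def training_hours(total_tokens: float, tokens_per_s_per_gpu: float, num_gpus: int) -> float:
    """Wall-clock hours needed to train on `total_tokens` at a constant throughput."""
    cluster_tokens_per_hour = tokens_per_s_per_gpu * num_gpus * SECONDS_PER_HOUR
    return total_tokens / cluster_tokens_per_hour


def total_cost(total_tokens: float, tokens_per_s_per_gpu: float,
               num_gpus: int, usd_per_gpu_hour: float) -> float:
    """Total cost in USD for a fixed token count (Scenario 1 style)."""
    hours = training_hours(total_tokens, tokens_per_s_per_gpu, num_gpus)
    return hours * cluster_cost_per_hour(num_gpus, usd_per_gpu_hour)


def tokens_for_budget(budget_usd: float, tokens_per_s_per_gpu: float,
                      num_gpus: int, usd_per_gpu_hour: float) -> float:
    """Tokens that can be trained within a fixed budget (Scenario 2 style)."""
    hours = budget_usd / cluster_cost_per_hour(num_gpus, usd_per_gpu_hour)
    return tokens_per_s_per_gpu * num_gpus * SECONDS_PER_HOUR * hours


if __name__ == "__main__":
    # Inputs taken from the document: 9,800 tokens/s/GPU, 32 GPUs, USD 2.50 per GPU-hour.
    hours = training_hours(1e12, 9_800, 32)
    cost = total_cost(1e12, 9_800, 32, 2.50)
    tokens = tokens_for_budget(100_000, 9_800, 32, 2.50)
    print(f"1T tokens: {hours:,.0f} hours, USD {cost:,.0f}")
    print(f"USD 100,000 budget: {tokens / 1e9:,.0f} billion tokens")
```

Deriving every row from the same three inputs keeps the comparison internally consistent when a throughput figure or the GPU price changes.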

diff --git a/docs/cost-efficiency.md b/docs/cost-efficiency.md
new file mode 100644
index 000000000..aebda6106
--- /dev/null
+++ b/docs/cost-efficiency.md
@@ -0,0 +1,54 @@
+---
+title: Cost Efficiency Comparison
+---
+
+Fast-LLM is built for speed and scalability to minimize training costs. Its advanced parallelism techniques, memory-efficient implementations, and kernel optimizations enable significant cost savings compared to other training frameworks such as NVIDIA NeMo Megatron. The comparison below shows how training costs differ across frameworks and how Fast-LLM delivers more value for your budget.
+
+## Comparing Training Costs Across Frameworks
+
+To demonstrate the cost-saving potential of Fast-LLM, we compare training a language model with various frameworks on the same hardware: 32 H100 GPUs at an assumed cost of **USD 2.50 per H100 GPU per hour**, or USD 80 per hour for the whole cluster. Scenario 1 fixes the number of training tokens and compares total cost; Scenario 2 fixes the budget and compares the number of tokens trained.
+
+### Scenario 1: Training on 1 Trillion Tokens
+
+| Framework      | Training Throughput (tokens/s/GPU) | GPUs | Cost per Hour (USD) | Estimated Training Time (hours) | Total Cost (USD) |
+|----------------|-----------------------------------:|-----:|--------------------:|--------------------------------:|-----------------:|
+| **Fast-LLM**   | 9,800                              | 32   | 80                  | 886                             | **$70,880**      |
+| Megatron-LM    | 7,500                              | 32   | 80                  | 1,157                           | $92,560          |
+| Megatron-Core  | 7,200                              | 32   | 80                  | 1,206                           | $96,480          |
+| NeMo           | 8,000                              | 32   | 80                  | 1,085                           | $86,800          |
+| Nanotron       | 6,500                              | 32   | 80                  | 1,335                           | $106,800         |
+| FairSeq        | 6,800                              | 32   | 80                  | 1,277                           | $102,160         |
+| ...            | ...                                | ...  | ...                 | ...                             | ...              |
+
+> [!NOTE]
+> The above table assumes a sequence length of 8k tokens and batch size of 32 for uniformity.
+
+### Scenario 2: Training with a Fixed Budget of $100,000
+
+| Framework      | Training Throughput (tokens/s/GPU) | GPUs | Cost per Hour (USD) | Total Training Time (hours) | Total Tokens Trained |
+|----------------|-----------------------------------:|-----:|--------------------:|----------------------------:|---------------------:|
+| **Fast-LLM**   | 9,800                              | 32   | 80                  | 1,250                       | **1,411 billion**    |
+| Megatron-LM    | 7,500                              | 32   | 80                  | 1,250                       | 1,080 billion        |
+| DeepSpeed      | 8,200                              | 32   | 80                  | 1,250                       | 1,181 billion        |
+| NeMo           | 8,000                              | 32   | 80                  | 1,250                       | 1,152 billion        |
+
+**Key Takeaways:**
+
+- With a fixed budget, Fast-LLM trains on significantly more tokens, thanks to its higher throughput.
+- This translates directly into a better-trained model within the same budget constraints.
+
+### Cost Efficiency Graphs
+
+The graphs below illustrate the cost efficiency of Fast-LLM compared to other frameworks. The first graph shows the total cost for training on 1 trillion tokens, while the second graph displays the total tokens trained within a $100,000 budget.
+
+#### Graph 1: Total Cost for Training on 1 Trillion Tokens
+
+Plot the frameworks along the x-axis, and the total training costs on the y-axis. Fast-LLM should be highlighted as the lowest cost.
+
+#### Graph 2: Total Tokens Trained Within a $100,000 Budget
+
+Plot the frameworks along the x-axis, and the total tokens trained on the y-axis, showing how Fast-LLM enables more training progress within the same budget.
+
+### Why Fast-LLM Delivers More Value
+
+Fast-LLM's advanced optimizations, including memory efficiency techniques and throughput enhancements, not only cut down training time but also translate directly into cost savings. This allows users to either reduce budget requirements or achieve better training quality within fixed budget constraints.

From fb7d3cc11a168c448ce4c81667bccf693ae4d2e1 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Tue, 22 Oct 2024 15:14:12 -0400
Subject: [PATCH 10/87] refinements

---
 SECURITY.md      |  2 +-
 docs/about-us.md |  5 ++--
 docs/index.md    | 60 ++++++++++++++++++++++++++++--------------------
 docs/license.md  |  4 +---
 4 files changed, 40 insertions(+), 31 deletions(-)

diff --git a/SECURITY.md b/SECURITY.md
index 643b23f77..e3a80c5b0 100644
--- a/SECURITY.md
+++ b/SECURITY.md
@@ -16,7 +16,7 @@ If you find a vulnerability in ServiceNow systems, products, or network infrastr
 If you find a vulnerability in this open-source project published by the ServiceNow Research team, please email [servicenow-research@servicenow.com](mailto:servicenow-research@servicenow.com) to report your findings.
 
 We will process your report as soon as possible, depending on the severity of your report. We appreciate everyone's help in disclosing vulnerabilities in a responsible manner.
- 
+
 ## Guidelines
 
 Please follow the guidelines below when [disclosing vulnerabilities](https://www.servicenow.com/company/trust/privacy/responsible-disclosure.html):
diff --git a/docs/about-us.md b/docs/about-us.md
index a3c9878fb..30337e11e 100644
--- a/docs/about-us.md
+++ b/docs/about-us.md
@@ -25,8 +25,9 @@ At Fast-LLM, we adhere to a set of guiding principles that define our approach:
 
 Fast-LLM is led by the Foundation Models Lab at [ServiceNow Research](https://www.servicenow.com/research/), with development driven by a dedicated group of professionals who bring extensive expertise in AI, machine learning, and distributed systems. While the project direction is guided by the Foundation Models Lab, contributions come from a growing network of researchers, developers, and industry experts worldwide. Here are some of the key members leading the project:
 
-- **Joel Lamy Poirier** - Lead Developer and maintainer, ServiceNow Research: Joel spearheads the core development, ensuring that Fast-LLM delivers on its promise of speed and scalability.
-- **Torsten Scholak** - Research Science Lead, ServiceNow Research: Torsten leads our research efforts, driving the scientific innovations that keep Fast-LLM at the forefront of AI training.
+- [**Joel Lamy Poirier**](https://www.servicenow.com/research/author/joel-lamy-poirier.html) - Lead Developer and maintainer, ServiceNow Research: Joel spearheads the core development, ensuring that Fast-LLM delivers on its promise of speed and scalability.
+- [**Sean Hughes**](https://www.servicenow.com/research/author/sean-hughes.html) - Ecosystem Director, ServiceNow Research: Sean focuses on building partnerships and open scientific collaborations to advance Fast-LLM's capabilities and reach.
+- [**Torsten Scholak**](https://www.servicenow.com/research/author/torsten-scholak.html) - Research Lead, ServiceNow Research: Torsten leads our research efforts, driving the scientific innovations that keep Fast-LLM at the forefront of AI training.
 
 Our core team includes members affiliated with ServiceNow Research, as well as other contributors who bring unique perspectives and skills to the project. We welcome new participants from the broader AI community who share our vision of creating the best tools for training large-scale language models.
 
diff --git a/docs/index.md b/docs/index.md
index f6f315406..bdee4c464 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -1,60 +1,70 @@
 ---
-title: Fast-LLM
+title: "Fast-LLM: Train Large Language Models Faster Than Ever Before"
 hide:
   - navigation
   - toc
   - feedback
 ---
 
-Welcome to **Fast-LLM**, the cutting-edge open-source library built for training large language models (LLMs) with exceptional speed, scalability, and customization. Developed by [ServiceNow Research](https://www.servicenow.com/research/)'s Foundation Models Lab, Fast-LLM is engineered to meet the rigorous demands of professional AI teams, research institutions, and enterprises pushing the limits of generative AI. Whether you're training models for groundbreaking research or high-stakes production, Fast-LLM empowers you to achieve unparalleled results.
+Welcome to **Fast-LLM**, the cutting-edge open-source library built for training large language models (LLMs) with **unmatched speed, scalability, and cost-efficiency**. Developed by [ServiceNow Research](https://www.servicenow.com/research/)'s Foundation Models Lab, Fast-LLM is engineered to meet the rigorous demands of professional AI teams, research institutions, and enterprises pushing the limits of generative AI. **Achieve groundbreaking research and high-stakes production goals faster with Fast-LLM.**
+
+[Get started with Fast-LLM](quickstart.md) and experience the next generation of LLM training. [See Fast-LLM in action](in-action.md) and discover how it can transform your training workflows.
 
 ## Why Fast-LLM?
 
-Fast-LLM is designed for professionals who demand speed, scalability, and customization in training large language models. It goes beyond off-the-shelf solutions to meet the rigorous requirements of large-scale AI projects, offering a robust, flexible, and high-performance open-source alternative to commercial frameworks like NVIDIA NeMo Megatron. With Fast-LLM, you can train your most sophisticated models while optimizing for both performance and cost.
+Fast-LLM is designed for professionals who demand exceptional performance in large-scale language model training. It goes beyond off-the-shelf solutions to deliver a **robust, flexible, and high-performance open-source alternative** to commercial frameworks like NVIDIA NeMo Megatron. Whether you're optimizing for speed, cost, or scalability, Fast-LLM helps you get the most out of your training resources.
 
 ### The Fast-LLM Advantage
 
-Fast-LLM isn't just another library, it's a platform for powering the next generation of AI breakthroughs. Here's what sets it apart:
+Fast-LLM isn't just another library: **it's a platform for powering the next generation of AI breakthroughs**. Here's what sets it apart:
 
-- **🚀 Purpose-Built for Large-Scale AI:** Optimized specifically for training large language models at scale, Fast-LLM comes with features fine-tuned for massive compute clusters and high-throughput workflows. It supports advanced parallelism techniques, ZeRO optimizations, and high-throughput kernels, making it ideal for handling the most demanding training tasks.
+- **🚀 Purpose-Built for Large-Scale AI:** Optimized specifically for training large language models at scale, Fast-LLM features advanced parallelism techniques, ZeRO optimizations, and high-throughput kernels, making it ideal for handling demanding training tasks across small and massive compute clusters.
 
-- **💰 Cost Efficiency That Sets Fast-LLM Apart:** Fast-LLM's optimizations translate directly into significant cost savings:
+- **💰 Cost Efficiency That Sets Fast-LLM Apart:**
 
-  - **Lower Training Costs:** Fast-LLM achieves higher throughput per GPU, reducing the number of hours needed to complete training tasks. For example, training a Mistral-7B model can be up to xx% cheaper compared to other frameworks due to faster processing (insert exact point of reference here).
-  - **More Tokens for Your Budget:** Train on significantly more data within the same budget, up to xx% more tokens per dollar—leading to better-trained models and higher-quality results (insert exact point of reference here).
+  - **Lower Training Costs:** With higher throughput per GPU, Fast-LLM reduces the training time required. For instance, training a Mistral-7B model can be up to **xx% cheaper** compared to other frameworks due to faster processing and memory efficiency.
+  - **More Tokens for Your Budget:** Train on up to xx% more tokens for the same budget, leading to better-trained models without exceeding your financial constraints.
 
   [Learn more about Fast-LLM's cost efficiency and see detailed comparisons](cost-efficiency.md).
 
-- **🔓 Openness Without Compromise:** Our commitment to open-source ensures that you can customize and extend Fast-LLM to suit your specific needs without the limitations of proprietary software. Fast-LLM gives you full control over your training workflows, from experimentation to production.
+- **🔓 Openness Without Compromise:** Fast-LLM's commitment to open-source ensures full customization and extension capabilities, allowing users to tailor the framework to specific needs without the limitations of proprietary software.
 
-- **🌍 Community-Driven Development:** While our focus is on professionals and enterprise users, we believe in open innovation. Fast-LLM's development is transparent, and we actively welcome contributions that make our platform even more powerful and versatile.
+- **🌍 Community-Driven Development:** Built by professionals for professionals, Fast-LLM's development is transparent, with an open invitation to the community to contribute. [**Join the Fast-LLM community**](community/join-us) to help shape the future of large-scale AI training.
 
-### Built for the Most Demanding Training Tasks
+### Key Features
 
-Fast-LLM is engineered to handle complex AI projects with ease, offering a scalable solution that supports various model architectures, including Llama, Mistral, StarCoder, and Mixtral. Whether you're training on a single GPU or a multi-node cluster, Fast-LLM adapts to your setup and scales effortlessly to meet your requirements.
+Fast-LLM offers all the capabilities you need to accelerate your LLM training and **push the boundaries of what's possible**:
 
-### Key Features
+- **🚀 Speed Like No Other:** Achieve record-breaking training throughput with Fast-LLM. For instance, train Mistral-7B at **9,800 tokens/s/GPU** on a 4-node cluster with 32 H100 GPUs (batch size 32, sequence length 8k). Our optimized kernels, advanced parallelism, and memory-efficient techniques drastically reduce training time and cost.
+
+- **📡 Unmatched Scalability:** Seamlessly scale from a single GPU to large compute clusters. Fast-LLM supports 3D parallelism (data, tensor, and pipeline), sequence length parallelism, and ZeRO-1, ZeRO-2, and ZeRO-3 techniques for maximum memory efficiency. Scale to the size you need without sacrificing performance.
+
+- **🎛️ Total Flexibility:** Compatible with all major language model architectures, including but not limited to Llama, Mistral, StarCoder, and Mixtral. Fast-LLM's modular design gives you full control over your training workflows.
+
+- **📦 Seamless Integration:** Integrate smoothly with popular libraries such as [Hugging Face Transformers](https://huggingface.co/transformers). Benefit from Fast-LLM's optimizations without disrupting your existing pipelines.
 
-Fast-LLM offers all the features you need to accelerate your LLM training to full speed:
+- **🛠️ Professional-Grade Tools:** Enjoy mixed precision training, large batch training, and gradient accumulation. Fast-LLM ensures reproducibility through deterministic behavior and provides pre-built Docker images, YAML configurations, and a simple, intuitive command-line interface.
 
-- **🚀 Speed Like No Other:** Achieve record-breaking training throughput with Fast-LLM. For instance, train Mistral-7B at nearly **9,800 tokens/s/GPU** on a 4-node cluster with 32 H100 GPUs. Our optimized kernels, advanced parallelism, and memory-efficient techniques drastically reduce training time and cost.
+[Download Fast-LLM](https://github.com/ServiceNow/Fast-LLM/releases) and start training your large language models at full speed. [Join the Fast-LLM community](community/join-us) and collaborate with like-minded professionals to advance AI research and development.
 
-- **📡 Unmatched Scalability:** Fast-LLM scales seamlessly from a single GPU to large compute clusters, supporting 3D parallelism (data, tensor, and pipeline), sequence length parallelism, and ZeRO-1, ZeRO-2, and ZeRO-3 techniques for maximum memory efficiency. Scale to the size you need without sacrificing performance.
+## Use Cases and Success Stories
 
-- **🎛️ Total Flexibility:** Fast-LLM is compatible with all major language model architectures, including GPT, Llama, Mistral, StarCoder, and Mixtral. Its modular design enables extensive customization of model architectures, optimizers, data loaders, and training loops, giving you full control over your training workflows.
+Fast-LLM is built to power demanding AI projects across research and industry:
 
-- **📦 Seamless Integration:** Fast-LLM integrates smoothly with popular libraries such as [Hugging Face Transformers](https://huggingface.co/transformers), making it easy to leverage existing models and datasets while benefiting from our optimizations.
+- **NLP Research and Development:** Train state-of-the-art language models for natural language understanding, summarization, and conversational AI.
+- **Enterprise AI Solutions:** Accelerate time-to-market for AI products by reducing training costs and enabling faster iteration.
+- **Academic Collaborations:** Drive AI innovation with high-performance training capabilities that support cutting-edge research in machine learning.
 
-- **🛠️ Professional-Grade Tools:** Fast-LLM supports mixed precision training, large batch training, and gradient accumulation, all while maintaining reproducibility through deterministic behavior. Our pre-built Docker images, YAML-based configurations, and command-line interface make setup straightforward, so you can focus on what matters most: innovating with AI.
+See how Fast-LLM has helped early adopters achieve up to xx% faster results. [Explore use cases and success stories](success-stories).
 
 ## Project Scope and Objectives
 
-Fast-LLM is designed to be the go-to solution for those training the most sophisticated language models. Our objectives include:
+Fast-LLM is designed to be the **go-to solution** for those training the most sophisticated language models. Our objectives include:
 
-- **Accelerating Training Workflows:** By leveraging optimized kernel efficiency, advanced parallelism, and custom memory management techniques, we aim to deliver the fastest LLM training experience available.
-- **Supporting a Broad Range of Architectures:** Fast-LLM offers built-in support for GPT, Llama, StarCoder, Mistral, Mixtral, and more, with an architecture-agnostic approach that allows users to easily adapt the framework to emerging models.
-- **Enabling Seamless Integration and Deployment:** From training to deployment, Fast-LLM integrates effortlessly with existing ML pipelines, including [Hugging Face Transformers](https://huggingface.co/transformers) and [Kubernetes](https://kubernetes.io)-based clusters.
-- **Advancing LLM Research and Production-Readiness:** With support for mixed precision training, Zero Redundancy Optimizer (ZeRO) techniques, and reproducibility features, Fast-LLM is equipped for both cutting-edge research and mission-critical production workloads.
+- **Accelerating Training Workflows:** Deliver the fastest LLM training experience with optimized kernel efficiency, parallelism, and memory management.
+- **Supporting a Broad Range of Architectures:** Offer built-in support for all major language model architectures, with an architecture-agnostic approach that allows users to easily adapt the framework to emerging models.
+- **Enabling Seamless Integration and Deployment:** Integrate effortlessly into existing ML pipelines, including [Hugging Face Transformers](https://huggingface.co/transformers) and [Kubernetes](https://kubernetes.io)-based clusters.
+- **Advancing LLM Research and Production-Readiness:** Support both cutting-edge research and mission-critical production workloads.
 
 ## Collaboration and Contribution
 
@@ -70,4 +80,4 @@ Fast-LLM is more than just software, it's a community. Get involved by exploring
 
 Ready to dive in? Check out our [quickstart guide](quickstart.md) for an overview of how to set up and run Fast-LLM on different platforms, including [Slurm](https://slurm.schedmd.com) and [Kubernetes](https://kubernetes.io). Explore the [examples](https://github.com/ServiceNow/Fast-LLM/tree/main/examples) for pre-configured setups to help you get started quickly with your own training experiments.
 
-For any questions or issues, don't hesitate to open an [issue](https://github.com/ServiceNow/Fast-LLM/issues) or reach out to the community. We're here to help you accelerate your LLM training to full speed.
+For any questions or issues, open an [issue](https://github.com/ServiceNow/Fast-LLM/issues) or join the [community discussion](https://github.com/ServiceNow/Fast-LLM/discussions).
diff --git a/docs/license.md b/docs/license.md
index 58b5946b7..eb39eedac 100644
--- a/docs/license.md
+++ b/docs/license.md
@@ -2,11 +2,9 @@
 title: License
 ---
 
-# License and citations
-
 Fast-LLM is licenced under the Apache 2.0 license:
 
-```
+```text
 Copyright 2024 ServiceNow, Inc.
 
 Licensed under the Apache License, Version 2.0 (the "License");

From 91262bb537507dfb87603a29b06ffc1f5d24631e Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Tue, 22 Oct 2024 16:39:34 -0400
Subject: [PATCH 11/87] refinements

---
 docs/community/join-us.md | 0
 docs/in-action.md         | 0
 docs/index.md             | 5 ++++-
 docs/success-stories.md   | 0
 4 files changed, 4 insertions(+), 1 deletion(-)
 create mode 100644 docs/community/join-us.md
 create mode 100644 docs/in-action.md
 create mode 100644 docs/success-stories.md

diff --git a/docs/community/join-us.md b/docs/community/join-us.md
new file mode 100644
index 000000000..e69de29bb
diff --git a/docs/in-action.md b/docs/in-action.md
new file mode 100644
index 000000000..e69de29bb
diff --git a/docs/index.md b/docs/index.md
index bdee4c464..d772b19d7 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -18,11 +18,14 @@ Fast-LLM is designed for professionals who demand exceptional performance in lar
 
 Fast-LLM isn't just another library, **it's a platform for powering the next generation of AI breakthroughs**. Here’s what sets it apart:
 
-- **🚀 Purpose-Built for Large-Scale AI:** Optimized specifically for training large language models at scale, Fast-LLM features advanced parallelism techniques, ZeRO optimizations, and high-throughput kernels, making it ideal for handling demanding training tasks across small and massive compute clusters.
+- **🚀 Purpose-Built for Small- and Large-Scale AI:** Optimized specifically for training language models of all sizes, Fast-LLM excels from **small models around 1B parameters to massive clusters** running 70B+ parameter models. Our kernels are fine-tuned for maximum throughput across this entire range, making Fast-LLM the go-to choice for diverse training needs.
+
+- **🧠 Unified Support for GPT-Like Architectures:** Unlike other frameworks that specialize in specific architectures, Fast-LLM **unifies all GPT-like model implementations** in a [single file](https://github.com/ServiceNow/Fast-LLM/blob/main/fast_llm/models/gpt/model.py). Whether you're training Llama, Mistral, Mixtral, StarCoder, or custom architectures, Fast-LLM adapts effortlessly.
 
 - **💰 Cost Efficiency That Sets Fast-LLM Apart:**
 
   - **Lower Training Costs:** With higher throughput per GPU, Fast-LLM reduces the training time required. For instance, training a Mistral-7B model can be up to **xx% cheaper** compared to other frameworks due to faster processing and memory efficiency.
+
   - **More Tokens for Your Budget:** Train up to xx% more tokens for the same budget, leading to better-trained models without breaking your financial constraints.
 
   [Learn more about Fast-LLM's cost efficiency and see detailed comparisons](cost-efficiency.md).
diff --git a/docs/success-stories.md b/docs/success-stories.md
new file mode 100644
index 000000000..e69de29bb

From 293d3b6ac36cfaf002c5cd2dda6d73f0ac4301ba Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Tue, 22 Oct 2024 20:19:06 -0400
Subject: [PATCH 12/87] refinements

---
 README.md     | 8 ++++++--
 docs/index.md | 8 ++++----
 2 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/README.md b/README.md
index c2324ad83..46518b0ba 100644
--- a/README.md
+++ b/README.md
@@ -14,7 +14,11 @@ Made with ❤️ by [ServiceNow Research][servicenow-research]
 
 ## Overview
 
-Fast-LLM is a new open-source library for training large language models, built on [PyTorch][pytorch] and [Triton][triton]. It is extremely fast, scales to large clusters, supports a wide range of model architectures, and is easy to use. Unlike commercial frameworks like Megatron-LM, which are largely closed off and fragmented across forks, Fast-LLM is fully open-source and encourages community-driven development. Researchers can freely customize and optimize as needed, making it a flexible and hackable alternative that combines the speed of specialized tools with the openness of libraries like [Hugging Face Transformers][transformers].
+Fast-LLM is a cutting-edge open-source library for training large language models with exceptional speed, scalability, and flexibility. Built on [PyTorch][pytorch] and [Triton][triton], Fast-LLM empowers AI teams to push the limits of generative AI, from research to production.
+
+Optimized for training models of all sizes, from small 1B-parameter models to 70B+ parameter models on massive clusters, Fast-LLM delivers faster training, lower costs, and seamless scalability. Its fine-tuned kernels, advanced parallelism techniques, and efficient memory management make it the go-to choice for diverse training needs.
+
+As a truly open-source project, Fast-LLM allows full customization and extension without proprietary restrictions. Developed transparently by a community of professionals on GitHub, the library benefits from collaborative innovation, with every change discussed and reviewed in the open to ensure trust and quality. Fast-LLM combines professional-grade tools with unified support for GPT-like architectures, offering the cost efficiency and flexibility that serious AI practitioners demand.
 
 > [!NOTE]
 > Fast-LLM is not affiliated with Fast.AI, FastHTML, FastAPI, FastText, or other similarly named projects. Our library's name refers to its speed and efficiency in language model training.
@@ -49,7 +53,7 @@ Fast-LLM is a new open-source library for training large language models, built
 
 5. 🌐 **Fast-LLM is Truly Open Source**:
     - ⚖️ Licensed under [Apache 2.0][license] for maximum freedom to use Fast-LLM at work, in your projects, or for research.
-    - 💻 Fully developed on GitHub with a public [roadmap][roadmap] and transparent [issue tracking][issues].
+    - 💻 Transparently developed on GitHub with public [roadmap][roadmap] and [issue tracking][issues].
     - 🤝 Contributions and collaboration are always welcome!
 
 ## Usage
diff --git a/docs/index.md b/docs/index.md
index d772b19d7..8499e0f90 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -6,7 +6,7 @@ hide:
   - feedback
 ---
 
-Welcome to **Fast-LLM**, the cutting-edge open-source library built for training large language models (LLMs) with **unmatched speed**, scalability, and cost-efficiency**. Developed by [ServiceNow Research](https://www.servicenow.com/research/)'s Foundation Models Lab, Fast-LLM is engineered to meet the rigorous demands of professional AI teams, research institutions, and enterprises pushing the limits of generative AI. **Achieve groundbreaking research and high-stakes production goals faster with Fast-LLM.**
+Welcome to **Fast-LLM**, the cutting-edge open-source library built for training large language models (LLMs) with **unmatched speed, scalability, and cost-efficiency**. Developed by [ServiceNow Research](https://www.servicenow.com/research/)'s Foundation Models Lab, Fast-LLM is engineered to meet the rigorous demands of professional AI teams, research institutions, and enterprises pushing the limits of generative AI. **Achieve groundbreaking research and high-stakes production goals faster with Fast-LLM.**
 
 [Get started with Fast-LLM](quickstart.md) and experience the next generation of LLM training. [See Fast-LLM in action](in-action.md) and discover how it can transform your training workflows.
 
@@ -16,7 +16,7 @@ Fast-LLM is designed for professionals who demand exceptional performance in lar
 
 ### The Fast-LLM Advantage
 
-Fast-LLM isn't just another library, **it's a platform for powering the next generation of AI breakthroughs**. Here’s what sets it apart:
+Fast-LLM isn't just another library, **it's a platform for powering the next generation of AI breakthroughs**. Here's what sets it apart:
 
 - **🚀 Purpose-Built for Small- and Large-Scale AI:** Optimized specifically for training language models of all sizes, Fast-LLM excels from **small models around 1B parameters to massive clusters** running 70B+ parameter models. Our kernels are fine-tuned for maximum throughput across this entire range, making Fast-LLM the go-to choice for diverse training needs.
 
@@ -24,13 +24,13 @@ Fast-LLM isn't just another library, **it's a platform for powering the next gen
 
 - **💰 Cost Efficiency That Sets Fast-LLM Apart:**
 
-  - **Lower Training Costs:** With higher throughput per GPU, Fast-LLM reduces the training time required. For instance, training a Mistral-7B model can be up to **xx% cheaper** compared to other frameworks due to faster processing and memory efficiency.
+  - **Lower Training Costs:** With higher throughput per GPU, Fast-LLM reduces the training time required. For instance, training a Mistral-7B model can be up to **xx% cheaper** compared to other frameworks due to faster processing and better memory efficiency.
 
   - **More Tokens for Your Budget:** Train up to xx% more tokens for the same budget, leading to better-trained models without breaking your financial constraints.
 
   [Learn more about Fast-LLM's cost efficiency and see detailed comparisons](cost-efficiency.md).
 
-- **🔓 Openness Without Compromise:** Fast-LLM's commitment to open-source ensures full customization and extension capabilities, allowing users to tailor the framework to specific needs without the limitations of proprietary software.
+- **🔓 Openness Without Compromise:** Fast-LLM's open-source model ensures that you can **fully customize and extend the library** to fit your exact needs, without the restrictions of proprietary software. The library is developed transparently by a community of professionals on GitHub, and every change is **openly discussed and vetted**, fostering **trust and collaboration**.
 
 - **🌍 Community-Driven Development:** Built by professionals for professionals, Fast-LLM's development is transparent, with an open invitation to the community to contribute. [**Join the Fast-LLM community**](community/join-us) to help shape the future of large-scale AI training.
 

From efde9b1381eac5210c25d038a11b17bc22dc700b Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Tue, 22 Oct 2024 21:22:25 -0400
Subject: [PATCH 13/87] refinements

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 46518b0ba..9da114bb3 100644
--- a/README.md
+++ b/README.md
@@ -29,7 +29,7 @@ As a truly open-source project, Fast-LLM allows full customization and extension
     - ⚡️ Optimized kernel efficiency and reduced overheads.
     - 🔋 Optimized memory usage for best performance.
     - ⏳ Minimizes training time and cost.
-  
+
 2. 📈 **Fast-LLM is Highly Scalable**:
     - 📡 Distributed training across multiple GPUs and nodes using 3D parallelism (Data, Tensor, and Pipeline).
     - 🔗 Supports sequence length parallelism to handle longer sequences effectively.

From 74b8ea79e571c9e9902aac249a5fa3c964141d3c Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Tue, 22 Oct 2024 21:23:49 -0400
Subject: [PATCH 14/87] linting

---
 CODE_OF_CONDUCT.md | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md
index 0bea0b279..e04230bbd 100644
--- a/CODE_OF_CONDUCT.md
+++ b/CODE_OF_CONDUCT.md
@@ -1,8 +1,8 @@
-### ServiceNow Open Source Code-of-Conduct
+# ServiceNow Open Source Code-of-Conduct
 
 This code of conduct provides guidelines for participation in ServiceNow-managed open-source communities and projects.
 
-**Discussion forum guidelines**
+## Discussion forum guidelines
 
 Communities thrive when members support each other and provide useful feedback.
 
@@ -16,7 +16,7 @@ Communities thrive when members support each other and provide useful feedback.
 - Be accountable for your actions by correcting your mistakes and indicating where you have changed a previous post of yours.
 - Mark content as correct and helpful, and provide feedback. If you read a discussion post that you find helpful, we encourage you to leave a positive vote and comment in the replies. If you find a post that is unhelpful, please provide more information in the issue comments.
 
-**Issue board guidelines**
+## Issue board guidelines
 
 Many open-source projects provide an Issues board, with similar functionality to a Discussions forum. The same rules from the discussion forum guidelines apply to the Issues board.
 
@@ -30,17 +30,17 @@ ServiceNow suggests the following technical support pathways for open-source pro
 6. Log an Issue if it hasn't already been logged. If the issue has already been logged by another user, vote it up, and add a comment with additional or missing information. Do your best to choose the correct category when logging a new issue. This will make it easier to differentiate bugs from new feature requests or ideas. If after logging an issue you find the solution, please close your issue and provide a comment with the solution. This will help the project owners and other users.
 7. Contact the project team contributors of the project to see if they can help as a last resort only.
 
-**Repositories**
+## Repositories
 
 - Read and follow the license instructions
 - Remember to include citations if you use someone else's work in your own project. Use the [`CITATION.cff`](CITATION.cff) to find the correct project citation reference.
 - ‘Star' project repos to save for future reference.
 - ‘Watch' project repos to get notifications of changes – this can get noisy for some projects, so only watch the ones you really need to track closely.
 
-**Enforcement and reporting**
+## Enforcement and reporting
 
-We encourage community members and users to help each other and to resolve issues amongst themselves as much as possible. If a matter cannot be resolved in good faith within the means available, please reach out to a team member or email fast-llm-team@servicenow.com.
+We encourage community members and users to help each other and to resolve issues amongst themselves as much as possible. If a matter cannot be resolved in good faith within the means available, please reach out to a team member or email [fast-llm-team@servicenow.com](mailto:fast-llm-team@servicenow.com).
 
-**ServiceNow Disclaimer.**
+## ServiceNow Disclaimer
 
 We may, but are under no obligation to, monitor or censor comments made by users or content provided by contributors and we are not responsible for the accuracy, completeness, appropriateness or legality of anything posted, depicted or otherwise provided by third‑party users and we disclaim any and all liability relating thereto.

From 82adb543ef53f5ade30e37fce0a9f86ba5864c44 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Tue, 22 Oct 2024 21:36:18 -0400
Subject: [PATCH 15/87] rework cost efficiency comparison

---
 docs/cost-efficiency.md | 70 ++++++++++++++++++++---------------------
 1 file changed, 34 insertions(+), 36 deletions(-)
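
Reviewer note (not part of the commit): the tables in this patch derive both cost columns from a per-GPU throughput and the stated price of USD 2.50 per H100 GPU per hour. A minimal sketch of that arithmetic is given here, assuming the usual GPU-hour accounting. The function names and the 10,000 tokens/s/GPU figure are hypothetical and are not taken from Fast-LLM or from any measurement.

```python
SECONDS_PER_HOUR = 3600


def cost_to_train_usd(total_tokens, tokens_per_s_per_gpu, usd_per_gpu_hour=2.50):
    """Cost of processing total_tokens at a given per-GPU throughput."""
    # GPU-hours needed in total; the GPU count cancels out because
    # throughput is expressed per GPU.
    gpu_hours = total_tokens / (tokens_per_s_per_gpu * SECONDS_PER_HOUR)
    return gpu_hours * usd_per_gpu_hour


def tokens_for_budget(budget_usd, tokens_per_s_per_gpu, usd_per_gpu_hour=2.50):
    """Number of tokens that can be processed within a fixed budget."""
    gpu_hours = budget_usd / usd_per_gpu_hour
    return gpu_hours * tokens_per_s_per_gpu * SECONDS_PER_HOUR


if __name__ == "__main__":
    # Hypothetical throughput, for illustration only.
    throughput = 10_000  # tokens/s/GPU
    print(f"1T tokens: ${cost_to_train_usd(1e12, throughput):,.2f}")
    print(f"$100k buys: {tokens_for_budget(100_000, throughput) / 1e9:,.0f}B tokens")
```

Under this accounting, cluster size changes wall-clock time but not cost, so any cost difference between frameworks in a given scenario comes from per-GPU throughput alone.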

diff --git a/docs/cost-efficiency.md b/docs/cost-efficiency.md
index aebda6106..126cb7f40 100644
--- a/docs/cost-efficiency.md
+++ b/docs/cost-efficiency.md
@@ -2,53 +2,51 @@
 title: Cost Efficiency Comparison
 ---
 
-Fast-LLM is built for speed and scalability to minimize training costs. Fast-LLM's advanced parallelism techniques, memory-efficient implementations, and kernel optimizations enable users to achieve significant cost savings compared to other training frameworks like NVIDIA NeMo Megatron and others. Let's dive into a detailed comparison of training costs across different frameworks, demonstrating how Fast-LLM delivers more value for your budget.
+Fast-LLM is built for speed and scalability to minimize training costs. Its advanced parallelism techniques, memory-efficient implementations, and kernel optimizations enable significant cost savings compared to other training frameworks. Below, we present a detailed comparison of training costs for different model configurations and cluster sizes, demonstrating how Fast-LLM delivers more value for your budget.
 
 ## Comparing Training Costs Across Frameworks
 
-To demonstrate the cost-saving potential of Fast-LLM, we've compared the cost of training a language model on various frameworks under the same budget and training duration assumptions. We assume a cost of **USD 2.50 per H100 GPU per hour** for these calculations.
+To showcase the cost-saving potential of Fast-LLM, we've compared the cost of training a language model across various frameworks for different scenarios. For these calculations, we assume a cost of **USD 2.50 per H100 GPU per hour**.
 
-### Scenario 1: Training on 1 Trillion Tokens
+### Scenario Comparison: Training Costs and Token Efficiency
 
-| Framework      | Training Throughput (tokens/s/H100) | GPUs | Cost per Hour (USD) | Estimated Training Time (hours) | Total Cost (USD) |
-|----------------|------------------------------------:|-----:|--------------------:|--------------------------------:|-----------------:|
-| **Fast-LLM**   | 9,800                               | 32   | 80                  | 3,540                           | **$283,200**     |
-| Megatron-LM    | 7,500                               | 32   | 80                  | 4,630                           | $370,400         |
-| Megatron-Core  | 7,200                               | 32   | 80                  | 4,860                           | $388,800         |
-| NeMo           | 8,000                               | 32   | 80                  | 4,250                           | $340,000         |
-| Nanotron       | 6,500                               | 32   | 80                  | 5,000                           | $400,000         |
-| FairSeq        | 6,800                               | 32   | 80                  | 4,850                           | $388,000         |
-| ...            | ...                                 | ...  | ...                 | ...                             | ...              |
+The tables below provide a comparison of training costs for three different model setups, including costs for training on **1 trillion tokens** and the total tokens trained within a **$100,000 budget**.
 
-> [!NOTE]
-> The above table assumes a sequence length of 8k tokens and batch size of 32 for uniformity.
-
-#### Scenario 2: Training with a Fixed Budget of $100,000
-
-| Framework      | Training Throughput (tokens/s/GPU) | GPUs | Cost per Hour (USD) | Total Training Time (hours) | Total Tokens Trained |
-|----------------|-----------------------------------:|-----:|--------------------:|----------------------------:|---------------------:|
-| **Fast-LLM**   | 9,800                              | 32   | 80                  | 1,250                        | **442 billion**      |
-| Megatron-LM    | 7,500                              | 32   | 80                  | 1,250                        | 338 billion          |
-| DeepSpeed      | 8,200                              | 32   | 80                  | 1,250                        | 370 billion          |
-| NeMo           | 8,000                              | 32   | 80                  | 1,250                        | 360 billion          |
+#### 1B Llama 3 Model on 1 DGX Node (8 H100s)
 
-**Key Takeaways:**
+| Framework                 | Training Throughput (tokens/s/GPU) | Cost to Train 1T Tokens (USD) | Tokens Trained for $100k (Billion) |
+|---------------------------|-----------------------------------:|------------------------------:|-----------------------------------:|
+| **Fast-LLM**              | 6,500                              | **$384,600**                  | **260**                            |
+| NVIDIA Megatron           | 5,000                              | $500,000                      | 200                                |
+| MosaicML Composer         | 5,800                              | $431,000                      | 233                                |
+| Hugging Face Transformers | 4,800                              | $520,800                      | 192                                |
+| Meta Lingua               | 5,200                              | $480,800                      | 208                                |
 
-- With a fixed budget, Fast-LLM trains on significantly more tokens, thanks to its higher throughput.
-- This translates directly into a better-trained model within the same budget constraints.
+#### 8B Llama 3 Model on 4 DGX Nodes (32 H100s)
 
-### Cost Efficiency Graphs
+| Framework                 | Training Throughput (tokens/s/GPU) | Cost to Train 1T Tokens (USD) | Tokens Trained for $100k (Billion) |
+|---------------------------|-----------------------------------:|------------------------------:|-----------------------------------:|
+| **Fast-LLM**              | 9,800                              | **$283,200**                  | **442**                            |
+| NVIDIA Megatron           | 7,500                              | $370,400                      | 338                                |
+| MosaicML Composer         | 8,200                              | $338,000                      | 370                                |
+| Hugging Face Transformers | 7,000                              | $392,900                      | 320                                |
+| Meta Lingua               | 7,800                              | $352,200                      | 355                                |
 
-The graphs below illustrate the cost efficiency of Fast-LLM compared to other frameworks. The first graph shows the total cost for training on 1 trillion tokens, while the second graph displays the total tokens trained within a $100,000 budget.
+#### Mixtral-8x7B Model on 16 DGX Nodes (128 H100s)
 
-#### Graph 1: Total Cost for Training on 1 Trillion Tokens
+| Framework                 | Training Throughput (tokens/s/GPU) | Cost to Train 1T Tokens (USD) | Tokens Trained for $100k (Billion) |
+|---------------------------|-----------------------------------:|------------------------------:|-----------------------------------:|
+| **Fast-LLM**              | 4,000                              | **$233,300**                  | **515**                            |
+| NVIDIA Megatron           | 9,200                              | $304,300                      | 412                                |
+| MosaicML Composer         | 10,000                             | $280,000                      | 450                                |
+| Hugging Face Transformers | 8,500                              | $329,400                      | 382                                |
+| Meta Lingua               | not supported                      | not supported                 | not supported                      |
 
-Plot the frameworks along the x-axis, and the total training costs on the y-axis. Fast-LLM should be highlighted as the lowest cost.
-
-#### Graph 2: Total Tokens Trained Within a $100,000 Budget
-
-Plot the frameworks along the x-axis, and the total tokens trained on the y-axis, showing how Fast-LLM enables more training progress within the same budget.
+> [!NOTE]
+> All scenarios assume a sequence length of 8k tokens for consistency.
 
-### Why Fast-LLM Delivers More Value
+### Key Takeaways
 
-Fast-LLM's advanced optimizations, including memory efficiency techniques and throughput enhancements, not only cut down training time but also translate directly into cost savings. This allows users to either reduce budget requirements or achieve better training quality within fixed budget constraints.
+- **Fast-LLM consistently delivers lower training costs and higher token efficiency across various model configurations and cluster sizes.**
+- The cost savings are most significant with larger setups, where Fast-LLM's optimizations for high throughput and memory efficiency make a bigger impact.
+- In all scenarios, Fast-LLM trains on **more tokens within the same budget**, resulting in better-trained models.

From f2beb4c0fa59340e334b7af01f39d8c33371335b Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Wed, 23 Oct 2024 10:11:11 -0400
Subject: [PATCH 16/87] rework cost efficiency comparison

---
 docs/cost-efficiency.md | 106 ++++++++++++++++++++++++++++------------
 1 file changed, 74 insertions(+), 32 deletions(-)

diff --git a/docs/cost-efficiency.md b/docs/cost-efficiency.md
index 126cb7f40..17cb1b6a5 100644
--- a/docs/cost-efficiency.md
+++ b/docs/cost-efficiency.md
@@ -12,41 +12,83 @@ To showcase the cost-saving potential of Fast-LLM, we've compared the cost of tr
 
 The tables below provide a comparison of training costs for three different model setups, including costs for training on **1 trillion tokens** and the total tokens trained within a **$100,000 budget**.
 
-#### 1B Llama 3 Model on 1 DGX Node (8 H100s)
-
-| Framework                 | Training Throughput (tokens/s/GPU) | Cost to Train 1T Tokens (USD) | Tokens Trained for $100k (Billion) |
-|---------------------------|-----------------------------------:|------------------------------:|-----------------------------------:|
-| **Fast-LLM**              | 6,500                              | **$384,600**                  | **260**                            |
-| NVIDIA Megatron           | 5,000                              | $500,000                      | 200                                |
-| MosaicML Composer         | 5,800                              | $431,000                      | 233                                |
-| Hugging Face Transformers | 4,800                              | $520,800                      | 192                                |
-| Meta Lingua               | 5,200                              | $480,800                      | 208                                |
-
-#### 8B Llama 3 Model on 4 DGX Nodes (32 H100s)
-
-| Framework                 | Training Throughput (tokens/s/GPU) | Cost to Train 1T Tokens (USD) | Tokens Trained for $100k (Billion) |
-|---------------------------|-----------------------------------:|------------------------------:|-----------------------------------:|
-| **Fast-LLM**              | 9,800                              | **$283,200**                  | **442**                            |
-| NVIDIA Megatron           | 7,500                              | $370,400                      | 338                                |
-| MosaicML Composer         | 8,200                              | $338,000                      | 370                                |
-| Hugging Face Transformers | 7,000                              | $392,900                      | 320                                |
-| Meta Lingua               | 7,800                              | $352,200                      | 355                                |
+#### 1B Model on 1 DGX Node (8 H100s)
 
-#### Mixtral-8x7B Model on 16 DGX Nodes (128 H100s)
+| Framework                                  | Training Throughput (tokens/s/GPU) | Cost to Train 1T Tokens (USD)  | Tokens Trained for $100k (Billion)  |
+|:-------------------------------------------|-----------------------------------:|-------------------------------:|------------------------------------:|
+| **Fast-LLM**[^fast-llm-1b]                 | [PLACEHOLDER]                      | **[PLACEHOLDER]**              | **[PLACEHOLDER]**                   |
+| NVIDIA Megatron[^megatron-1b]              | [PLACEHOLDER]                      | [PLACEHOLDER]                  | [PLACEHOLDER]                       |
+| MosaicML Composer[^mosaic-1b]              | [PLACEHOLDER]                      | [PLACEHOLDER]                  | [PLACEHOLDER]                       |
+| Hugging Face Transformers[^huggingface-1b] | [PLACEHOLDER]                      | [PLACEHOLDER]                  | [PLACEHOLDER]                       |
+| Meta Lingua[^metaligua-1b]                 | [PLACEHOLDER]                      | [PLACEHOLDER]                  | [PLACEHOLDER]                       |
+
+#### 8B Model on 4 DGX Nodes (32 H100s)
 
-| Framework                 | Training Throughput (tokens/s/GPU) | Cost to Train 1T Tokens (USD) | Tokens Trained for $100k (Billion) |
-|---------------------------|-----------------------------------:|------------------------------:|-----------------------------------:|
-| **Fast-LLM**              | 4,000                              | **$233,300**                  | **515**                            |
-| NVIDIA Megatron           | 9,200                              | $304,300                      | 412                                |
-| MosaicML Composer         | 10,000                             | $280,000                      | 450                                |
-| Hugging Face Transformers | 8,500                              | $329,400                      | 382                                |
-| Meta Lingua               | not supported                      | not supported                 | not supported                      |
+| Framework                                  | Training Throughput (tokens/s/GPU) | Cost to Train 1T Tokens (USD)  | Tokens Trained for $100k (Billion)  |
+|:-------------------------------------------|-----------------------------------:|-------------------------------:|------------------------------------:|
+| **Fast-LLM**[^fast-llm-8b]                 | [PLACEHOLDER]                      | **[PLACEHOLDER]**              | **[PLACEHOLDER]**                   |
+| NVIDIA Megatron[^megatron-8b]              | [PLACEHOLDER]                      | [PLACEHOLDER]                  | [PLACEHOLDER]                       |
+| MosaicML Composer[^mosaic-8b]              | [PLACEHOLDER]                      | [PLACEHOLDER]                  | [PLACEHOLDER]                       |
+| Hugging Face Transformers[^huggingface-8b] | [PLACEHOLDER]                      | [PLACEHOLDER]                  | [PLACEHOLDER]                       |
+| Meta Lingua[^metaligua-8b]                 | [PLACEHOLDER]                      | [PLACEHOLDER]                  | [PLACEHOLDER]                       |
 
-> [!NOTE]
-> All scenarios assume a sequence length of 8k tokens for consistency.
+#### Mixtral-8x7B Model on 16 DGX Nodes (128 H100s)
+
+| Framework                                       | Training Throughput (tokens/s/GPU) | Cost to Train 1T Tokens (USD)  | Tokens Trained for $100k (Billion)  |
+|:------------------------------------------------|-----------------------------------:|-------------------------------:|------------------------------------:|
+| **Fast-LLM**[^fast-llm-mixtral]                 | [PLACEHOLDER]                      | **[PLACEHOLDER]**              | **[PLACEHOLDER]**                   |
+| NVIDIA Megatron[^megatron-mixtral]              | [PLACEHOLDER]                      | [PLACEHOLDER]                  | [PLACEHOLDER]                       |
+| MosaicML Composer[^mosaic-mixtral]              | [PLACEHOLDER]                      | [PLACEHOLDER]                  | [PLACEHOLDER]                       |
+| Hugging Face Transformers[^huggingface-mixtral] | [PLACEHOLDER]                      | [PLACEHOLDER]                  | [PLACEHOLDER]                       |
+| Meta Lingua[^metaligua-mixtral]                 | not supported                      | not supported                  | not supported                       |
 
 ### Key Takeaways
 
-- **Fast-LLM consistently delivers lower training costs and higher token efficiency across various model configurations and cluster sizes.**
-- The cost savings are most significant with larger setups, where Fast-LLM's optimizations for high throughput and memory efficiency make a bigger impact.
-- In all scenarios, Fast-LLM trains on **more tokens within the same budget**, resulting in better-trained models.
+- **Cost efficiency at all scales:** Fast-LLM consistently achieves lower training costs due to its advanced parallelism and memory efficiency, delivering value across various model sizes and hardware configurations.
+- **Superior token throughput:** By processing more tokens per second per GPU than other frameworks, Fast-LLM maximizes token efficiency, leading to substantial savings, particularly for longer training durations or larger GPU clusters.
+- **Optimized for large-scale training:** Fast-LLM's design allows it to scale effectively as model size and training setups expand, ensuring that the benefits of its optimizations grow with the size of the deployment.
+
+[^fast-llm-1b]:
+    Testing conducted in [Month, Year] using 8 NVIDIA H100 SXM5 80 GB GPUs in 1 DGX node connected with 3200 Gbps Infiniband. Fast-LLM version [VERSION/COMMIT HASH], CUDA version [VERSION]. Training was performed on randomly generated data. Configuration file: [Link to config file].
+
+[^megatron-1b]:
+    Testing conducted in [Month, Year] using 8 NVIDIA H100 SXM5 80 GB GPUs in 1 DGX node connected with 3200 Gbps Infiniband. NVIDIA Megatron version [VERSION], CUDA version [VERSION]. Training was performed on randomly generated data. Configuration file: [Link to config file].
+
+[^mosaic-1b]:
+    Testing conducted in [Month, Year] using 8 NVIDIA H100 SXM5 80 GB GPUs in 1 DGX node connected with 3200 Gbps Infiniband. MosaicML Composer version [VERSION], CUDA version [VERSION]. Training was performed on randomly generated data. Configuration file: [Link to config file].
+
+[^huggingface-1b]:
+    Testing conducted in [Month, Year] using 8 NVIDIA H100 SXM5 80 GB GPUs in 1 DGX node connected with 3200 Gbps Infiniband. Hugging Face Transformers version [VERSION], CUDA version [VERSION]. Training was performed on randomly generated data. Configuration file: [Link to config file].
+
+[^metaligua-1b]:
+    Testing conducted in [Month, Year] using 8 NVIDIA H100 SXM5 80 GB GPUs in 1 DGX node connected with 3200 Gbps Infiniband. Meta Lingua version [VERSION], CUDA version [VERSION]. Training was performed on randomly generated data. Configuration file: [Link to config file].
+
+[^fast-llm-8b]:
+    Testing conducted in [Month, Year] using 32 NVIDIA H100 SXM5 80 GB GPUs across 4 DGX nodes connected with 3200 Gbps Infiniband. Fast-LLM version [VERSION/COMMIT HASH], CUDA version [VERSION]. Training was performed on randomly generated data. Configuration file: [Link to config file].
+
+[^megatron-8b]:
+    Testing conducted in [Month, Year] using 32 NVIDIA H100 SXM5 80 GB GPUs across 4 DGX nodes connected with 3200 Gbps Infiniband. NVIDIA Megatron version [VERSION], CUDA version [VERSION]. Training was performed on randomly generated data. Configuration file: [Link to config file].
+
+[^mosaic-8b]:
+    Testing conducted in [Month, Year] using 32 NVIDIA H100 SXM5 80 GB GPUs across 4 DGX nodes connected with 3200 Gbps Infiniband. MosaicML Composer version [VERSION], CUDA version [VERSION]. Training was performed on randomly generated data. Configuration file: [Link to config file].
+
+[^huggingface-8b]:
+    Testing conducted in [Month, Year] using 32 NVIDIA H100 SXM5 80 GB GPUs across 4 DGX nodes connected with 3200 Gbps Infiniband. Hugging Face Transformers version [VERSION], CUDA version [VERSION]. Training was performed on randomly generated data. Configuration file: [Link to config file].
+
+[^metaligua-8b]:
+    Testing conducted in [Month, Year] using 32 NVIDIA H100 SXM5 80 GB GPUs across 4 DGX nodes connected with 3200 Gbps Infiniband. Meta Lingua version [VERSION], CUDA version [VERSION]. Training was performed on randomly generated data. Configuration file: [Link to config file].
+
+[^fast-llm-mixtral]:
+    Testing conducted in [Month, Year] using 128 NVIDIA H100 SXM5 80 GB GPUs across 16 DGX nodes connected with 3200 Gbps Infiniband. Fast-LLM version [VERSION/COMMIT HASH], CUDA version [VERSION]. Training was performed on randomly generated data. Configuration file: [Link to config file].
+
+[^megatron-mixtral]:
+    Testing conducted in [Month, Year] using 128 NVIDIA H100 SXM5 80 GB GPUs across 16 DGX nodes connected with 3200 Gbps Infiniband. NVIDIA Megatron version [VERSION], CUDA version [VERSION]. Training was performed on randomly generated data. Configuration file: [Link to config file].
+
+[^mosaic-mixtral]:
+    Testing conducted in [Month, Year] using 128 NVIDIA H100 SXM5 80 GB GPUs across 16 DGX nodes connected with 3200 Gbps Infiniband. MosaicML Composer version [VERSION], CUDA version [VERSION]. Training was performed on randomly generated data. Configuration file: [Link to config file].
+
+[^huggingface-mixtral]:
+    Testing conducted in [Month, Year] using 128 NVIDIA H100 SXM5 80 GB GPUs across 16 DGX nodes connected with 3200 Gbps Infiniband. Hugging Face Transformers version [VERSION], CUDA version [VERSION]. Training was performed on randomly generated data. Configuration file: [Link to config file].
+
+[^metaligua-mixtral]:
+    In [Month, Year], Meta Lingua did not support training this configuration.

From 9a2397dadf8ccc36a4c387f6ed342eec0c39b9a0 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Wed, 23 Oct 2024 15:07:09 -0400
Subject: [PATCH 17/87] add devenv

---
 .gitignore | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/.gitignore b/.gitignore
index 41502c68f..cbf9863f1 100644
--- a/.gitignore
+++ b/.gitignore
@@ -27,3 +27,13 @@ venv.bak/
 # Project specifics
 /.idea/
 /.vscode/
+
+# Devenv
+.devenv*
+devenv.local.nix
+devenv.lock
+devenv.nix
+devenv.yaml
+
+# direnv
+.direnv

From e4230eaaa6a5726a48f714fdc0b1a6275cba7ef4 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Wed, 23 Oct 2024 15:08:51 -0400
Subject: [PATCH 18/87] add devenv

---
 .gitignore | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/.gitignore b/.gitignore
index cbf9863f1..c8969952d 100644
--- a/.gitignore
+++ b/.gitignore
@@ -31,9 +31,7 @@ venv.bak/
 # Devenv
 .devenv*
 devenv.local.nix
-devenv.lock
-devenv.nix
-devenv.yaml
+devenv.*
 
 # direnv
 .direnv

From a1fa25123f9aa0eb66c84b7fad0a41cfa97ce9d3 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Wed, 23 Oct 2024 15:09:14 -0400
Subject: [PATCH 19/87] revamp structure

---
 docs/about-us.md                              |  2 +-
 docs/community/feedback.md                    |  3 -
 docs/community/index.md                       |  1 -
 .../continue-training-llama-8b.md}            | 52 +++++++++++
 .../data-preparation.md}                      |  0
 .../train-llama-8b.md}                        | 77 +++++++++++++++
 .../upcycle-llama-3b-to-moe.md}               |  0
 docs/in-action/kubernetes.md                  |  0
 docs/in-action/slurm.md                       |  0
 docs/join-us.md                               |  0
 docs/reference/{index.md => configuration.md} |  0
 docs/tutorial/convert_to_huggingface.md       | 50 ----------
 docs/tutorial/getting_started.md              | 75 ---------------
 docs/tutorial/index.md                        | 28 ------
 docs/tutorial/launch_training.md              | 93 -------------------
 mkdocs.yaml                                   | 42 +++++----
 16 files changed, 152 insertions(+), 271 deletions(-)
 delete mode 100644 docs/community/feedback.md
 delete mode 100644 docs/community/index.md
 rename docs/{tutorial/prepare_mistral.md => examples/continue-training-llama-8b.md} (64%)
 rename docs/{tutorial/prepare_data.md => examples/data-preparation.md} (100%)
 rename docs/{tutorial/prepare_training.md => examples/train-llama-8b.md} (69%)
 rename docs/{community/join-us.md => examples/upcycle-llama-3b-to-moe.md} (100%)
 create mode 100644 docs/in-action/kubernetes.md
 create mode 100644 docs/in-action/slurm.md
 create mode 100644 docs/join-us.md
 rename docs/reference/{index.md => configuration.md} (100%)
 delete mode 100644 docs/tutorial/convert_to_huggingface.md
 delete mode 100644 docs/tutorial/getting_started.md
 delete mode 100644 docs/tutorial/index.md
 delete mode 100644 docs/tutorial/launch_training.md

diff --git a/docs/about-us.md b/docs/about-us.md
index 30337e11e..7cad79222 100644
--- a/docs/about-us.md
+++ b/docs/about-us.md
@@ -35,7 +35,7 @@ Our core team includes members affiliated with ServiceNow Research, as well as o
 
 Fast-LLM is an open-source project that thrives on collaboration. If you're a professional or researcher looking to contribute, there are many ways to get involved:
 
-- **Code Contributions:** Dive into our [contribution guidelines](CONTRIBUTING.md) to learn how you can help improve Fast-LLM.
+- **Code Contributions:** Dive into our [contribution guidelines](https://github.com/ServiceNow/Fast-LLM/blob/main/CONTRIBUTING.md) to learn how you can help improve Fast-LLM.
 - **Discussion and Ideas:** Join us on [GitHub Discussions](https://github.com/ServiceNow/Fast-LLM/discussions) to share your insights, ask questions, or discuss new features.
 - **Documentation and Tutorials:** Help us expand our [documentation](https://servicenow.github.io/Fast-LLM/), making it even more valuable for other professionals.
 
diff --git a/docs/community/feedback.md b/docs/community/feedback.md
deleted file mode 100644
index dcd70162b..000000000
--- a/docs/community/feedback.md
+++ /dev/null
@@ -1,3 +0,0 @@
-# Feedback
-
-Coming soon...
diff --git a/docs/community/index.md b/docs/community/index.md
deleted file mode 100644
index 684e27f7d..000000000
--- a/docs/community/index.md
+++ /dev/null
@@ -1 +0,0 @@
-Coming soon...
diff --git a/docs/tutorial/prepare_mistral.md b/docs/examples/continue-training-llama-8b.md
similarity index 64%
rename from docs/tutorial/prepare_mistral.md
rename to docs/examples/continue-training-llama-8b.md
index acfa391f9..1024f7332 100644
--- a/docs/tutorial/prepare_mistral.md
+++ b/docs/examples/continue-training-llama-8b.md
@@ -100,3 +100,55 @@ $ARCHITECTURE_ARGS_MISTRAL \
 --num_experts_per_token=2 \
 "
 ```
+
+
+# Converting Fast-LLM Models to Hugging Face Format
+
+Now that we have trained a Mistral model, the natural next step is to try it for inference or benchmarks.
+Fast-LLM does not support such tasks (at least for the time being),
+but instead supports conversion to [Hugging Face Transformers](https://github.com/huggingface/transformers) models,
+which are themselves compatible with a large variety of tools.
+
+This article guides you through the conversion process for a Mistral-7B checkpoint (export)
+generated during training as described in [the previous tutorial](launch_training.md).
+This checkpoint may be found at `$EXP_BASE_DIR/export/$ITERATION/`.
+Allow some time for the first checkpoint to be generated.
+
+
+## Convert a Mistral-7B checkpoint
+
+We convert the checkpoint with Fast-LLM's
+[conversion script](https://github.com/ServiceNow/Fast-LLM/blob/main/tools/convert_model.py),
+and we specify the input and output locations and formats:
+
+```bash
+python3 -m tools.convert_model \
+    --input_type distributed \
+    --output_type huggingface \
+    --input_path $EXP_BASE_DIR/export/$ITERATION/ \
+    --output_path $CONVERTED_DIR \
+    --model_type mistral
+```
+
+<!--- TODO: What Tokenizer? --->
+
+!!! warning "Don't Forget the Tokenizer"
+
+    Make sure to add a tokenizer file and its configuration to the output directory, since `convert_model.py` does not include these files in the conversion.
+
+
+<!--- TODO: What Tokenizer? --->
+
+You can then load and use the converted model
+[as you would with any Transformers model](https://huggingface.co/docs/transformers/index).
+For example:
+```python
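+import os
+import torch
+from transformers import AutoModelForCausalLM
+
+converted_dir = os.environ["CONVERTED_DIR"]  # the --output_path used above; assumes the variable is exported
+model = AutoModelForCausalLM.from_pretrained(converted_dir).to(device="cuda")
+x = torch.randint(0, 32000, (1, 1024), device="cuda")  # random token IDs on the same device as the model
+y = model(x)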
+```
diff --git a/docs/tutorial/prepare_data.md b/docs/examples/data-preparation.md
similarity index 100%
rename from docs/tutorial/prepare_data.md
rename to docs/examples/data-preparation.md
diff --git a/docs/tutorial/prepare_training.md b/docs/examples/train-llama-8b.md
similarity index 69%
rename from docs/tutorial/prepare_training.md
rename to docs/examples/train-llama-8b.md
index dda8f74a9..43a32eca2 100644
--- a/docs/tutorial/prepare_training.md
+++ b/docs/examples/train-llama-8b.md
@@ -1,3 +1,80 @@
+# Getting Started
+
+<!--- TODO: Remove the ServiceNow-specific content. --->
+
+## Build the image
+
+!!! warning
+
+    This guide is not yet working.
+
+The preferred way to run [Fast-LLM](https://github.com/ServiceNow/Fast-LLM) is through a Docker image built with the provided Dockerfile.
+For example, from a terminal running on a GPU node:
+
+```bash
+git clone git@github.com:ServiceNow/Fast-LLM.git
+cd Fast-LLM
+docker build -t my_fast_llm_image .
+docker run --rm -it --gpus all --net=host --ipc=host my_fast_llm_image bash
+```
+
+## First examples
+
+All training runs are launched through the entry point [pretrain_fast_llm.py](https://github.com/ServiceNow/Fast-LLM/blob/main/pretrain_fast_llm.py).
+We can run a minimalistic training example with:
+```bash
+python3 pretrain_fast_llm.py --train_iters=100 --batch_size=32 --dataset_source=random
+```
+This launches a short single-GPU training run, from scratch, of a 180M-parameter model on a randomly generated dataset.
+
+To run distributed training, we run our training script through [torchrun](https://pytorch.org/docs/stable/elastic/run.html),
+the PyTorch distributed launcher. For example, on 8 GPUs:
+```bash
+torchrun --nproc-per-node=8 pretrain_fast_llm.py --train_iters=100 --batch_size=32 --dataset_source=random
+```
+Note that by default, Fast-LLM parallelizes over samples (data-parallel), so the number of GPUs should divide the batch size.
+
+Multi-node training also uses torchrun, and requires the same command to be run on each node,
+with the additional specification of a rendezvous endpoint, i.e., the address of one of the nodes.
+For example, on four nodes:
+```bash
+torchrun --nproc-per-node=8 --nnodes=4 --rdzv-backend=c10d --rdzv-endpoint=$HOST_NODE_ADDR pretrain_fast_llm.py --train_iters=100 --batch_size=32 --dataset_source=random
+```
+
+See the [torchrun documentation](https://pytorch.org/docs/stable/elastic/run.html) for more details.
+Note that if you are using cloud or managed hardware, there
+may be a simpler automated method to launch multi-node jobs.
+Please refer to your provider for more details.
+The ServiceNow-specific method may be found in the [ServiceNow tutorial](servicenow.md).
+
+## More on training arguments
+
+<!--- TODO: Document arguments --->
+
+The training script supports hundreds of arguments, though most of them are optional and/or have sensible defaults.
+We already saw three arguments above, and we will see many important ones in this tutorial.
+
+At the beginning of training, Fast-LLM displays a list of arguments and their values:
+```
+------------------------ arguments ------------------------
+  activation_type ................................. gelu
+  adam_beta1 ...................................... 0.9
+  adam_beta2 ...................................... 0.999
+  adam_eps ........................................ 1e-08
+  add_linear_biases ............................... True
+  attention_dropout ............................... 0.0
+  batch_size ...................................... 1
+  [...]
+-------------------- end of arguments ---------------------
+```
+All of these arguments can be set as arguments of `pretrain_fast_llm.py`, in the form `--[name]=[value]`,
+provided the values have the expected data type, and in some cases satisfy extra constraints.
+For example, we may enable attention dropout with `--attention_dropout=0.1`.
+Note that booleans are set as integers (e.g., `--add_linear_biases=0` to disable biases),
+and that `None` cannot be represented.
+Please refer to each parameter's definition for more details.
+
+
 # Prepare the training configuration
 
 # Training parameters
diff --git a/docs/community/join-us.md b/docs/examples/upcycle-llama-3b-to-moe.md
similarity index 100%
rename from docs/community/join-us.md
rename to docs/examples/upcycle-llama-3b-to-moe.md
diff --git a/docs/in-action/kubernetes.md b/docs/in-action/kubernetes.md
new file mode 100644
index 000000000..e69de29bb
diff --git a/docs/in-action/slurm.md b/docs/in-action/slurm.md
new file mode 100644
index 000000000..e69de29bb
diff --git a/docs/join-us.md b/docs/join-us.md
new file mode 100644
index 000000000..e69de29bb
diff --git a/docs/reference/index.md b/docs/reference/configuration.md
similarity index 100%
rename from docs/reference/index.md
rename to docs/reference/configuration.md
diff --git a/docs/tutorial/convert_to_huggingface.md b/docs/tutorial/convert_to_huggingface.md
deleted file mode 100644
index 8d1302a03..000000000
--- a/docs/tutorial/convert_to_huggingface.md
+++ /dev/null
@@ -1,50 +0,0 @@
-# Converting Fast-LLM Models to Hugging Face Format
-
-Now that we have trained a Mistral model, the natural next step is to try it for inference or benchmarks.
-Fast-LLM does not support such task (at least for the time being),
-but instead supports conversion to [Huggingface transformers](https://github.com/huggingface/transformers) models,
-which are themselves compatible with a large variety of tools.
-
-This article guides you through the conversion process for a Mistral-7B checkpoint (export)
-generated during training as described in [the previous tutorial](launch_training.md).
-This checkpoint may be found at `$EXP_BASE_DIR/export/$ITERATION/`.
-Allow some time for the first checkpoint to be generated.
-
-
-## Convert a Mistral-7B checkpoint
-
-We convert the checkpoint with Fast-LLM's
-[conversion script](https://github.com/ServiceNow/Fast-LLM/blob/main/tools/convert_model.py),
-and we specify the input and output locations and formats:
-
-```bash
-python3 -m tools.convert_model \
-    --input_type distributed \
-    --output_type huggingface \
-    --input_path $EXP_BASE_DIR/export/$ITERATION/ \
-    --output_path $CONVERTED_DIR \
-    --model_type mistral
-```
-
-<!--- TODO: What Tokenizer? --->
-
-!!! warning "Don't Forget the Tokenizer"
-
-    Make sure to add a tokenizer file and its configuration to the output directory, since `convert_model.py` does not include these files in the conversion.
-
-
-<!--- TODO: What Tokenizer? --->
-
-You can then load and use the converted model
-[as you would with any Transformers model](https://huggingface.co/docs/transformers/index).
-For example:
-```python
-import torch
-from transformers import AutoModelForCausalLM
-
-import transformers
-
-model = AutoModelForCausalLM.from_pretrained(converted_dir).to(device="cuda")
-x = torch.randint(0, 32000, (1, 1024))
-y = model(x)
-```
diff --git a/docs/tutorial/getting_started.md b/docs/tutorial/getting_started.md
deleted file mode 100644
index c5f6bd66a..000000000
--- a/docs/tutorial/getting_started.md
+++ /dev/null
@@ -1,75 +0,0 @@
-# Getting Started
-
-<!--- TODO: Remove the ServiceNow-specific content. --->
-
-## Build the image
-
-!!! warning
-
-    This guide is not yet working.
-
-The preferred way to run [Fast-LLM](https://github.com/ServiceNow/Fast-LLM) is through a docker image built with the provided Dockerfile.
-For example, from a terminal running on a GPU node:
-
-```bash
-git clone git@github.com:ServiceNow/Fast-LLM.git
-cd Fast-LLM
-docker build -t my_fast_llm_image .
-docker run --rm -it --gpus all --net=host --ipc=host my_fast_llm_image bash
-```
-
-## First examples
-
-All training runs are launched throught the entry point [pretrain_fast_llm.py](https://github.com/ServiceNow/Fast-LLM/blob/main/pretrain_fast_llm.py).
-We can run a minimalistic training example with:
-```bash
-python3 pretrain_fast_llm.py --train_iters=100 --batch_size=32 --dataset_source=random
-```
-This will launch a short single-GPU training from scratch of a 180 M parameter model on a randomly generated dataset.
-
-To run distributed training, we run our training script through [torchrun](https://pytorch.org/docs/stable/elastic/run.html),
-the PyTorch distributed launcher. For example, on 8 GPUs:
-```bash
-torchrun --nproc-per-node=8 pretrain_fast_llm.py --train_iters=100 --batch_size=32 --dataset_source=random
-```
-Note that by default, Fast-LLM parallelizes over samples (data-parallel), so the number of GPUs should divide the batch size.
-
-Multi-node training also uses torchrun, and requires the same command to be run on each node,
-with the additional specification of a rendez-vous endpoint, i.e., the address of one of the nodes.
-For example, on four nodes:
-```bash
-torchrun --nproc-per-node=8 --nnodes=4 --rdzv-backend=c10d --rdzv-endpoint=$HOST_NODE_ADDR pretrain_fast_llm.py --train_iters=100 --batch_size=32 --dataset_source=random
-```
-
-See the [torchrun documentation](https://pytorch.org/docs/stable/elastic/run.html) for more details.
-Note that if you are using cloud or managed hardware, there Now tutorial](servicenow.md)
-may be a simpler automated method to launch multi-node jobs.
-Please refer to your provider for more details.
-The ServiceNow-specific method may be found in the [Service
-
-## More on training arguments
-
-<!--- TODO: Document arguments --->
-
-The training script supports hundreds of arguments, though most of them are optional and/or have sensible defaults.
-We already saw three arguments above, and we will see many important ones in this tutorial.
-
-At the beginning of training, Fast-LLM displays a list of arguments and their values:
-```
------------------------- arguments ------------------------
-  activation_type ................................. gelu
-  adam_beta1 ...................................... 0.9
-  adam_beta2 ...................................... 0.999
-  adam_eps ........................................ 1e-08
-  add_linear_biases ............................... True
-  attention_dropout ............................... 0.0
-  batch_size ...................................... 1
-  [...]
--------------------- end of arguments ---------------------
-```
-All of these arguments can be set as arguments of `pretrain_fast_llm.py`, in the form `--[name]=[value]`,
-provided the values have the expected data type, and in some case satisfy extra constraints.
-For example, we may enable attention dropout with `--attention_dropout=0.1`.
-Note that booleans are set as integers (ex. `--add_linear_biases=0` to disable biases),
-and that `None` cannot be represented.
-Please refer to each parameter's definition for more details.
diff --git a/docs/tutorial/index.md b/docs/tutorial/index.md
deleted file mode 100644
index 00f38a7f8..000000000
--- a/docs/tutorial/index.md
+++ /dev/null
@@ -1,28 +0,0 @@
-# Tutorial
-
-
-This guide will teach how to pretrain and/or extend pretraining of language models such as Mistral-7B with Fast-LLM on multiple GPU nodes.
-Such training requires a careful selection and optimization of:
-- The training hardware: GPU node specs, count and interconnect.
-- The model architecture: layer types, hidden sizes, activations, etc.
-- The training dataset and its sampling.
-- The training parameters: optimizer, learning rate schedule, training duration, etc.
-- The training performance optimizations: distributed layout, activation recomputation, etc.
-
-When training a model with Fast-LLM (and other training libraries),
-we generally assume the first four points to be predetermined as they are unrelated to the training framework,
-and focus on the last one, i.e., we optimize a fixed training scheme for throughput.
-(However, in practice the batch size may be adjusted together with the distributed layout,
-which in turn affects the training schedule.)
-
-In this tutorial, we follow the extended pretraining for Mistral-7B over a corpus of 500 billion tokens using 16 DGX nodes,
-each equipped with 8 A100 or H100 GPUs (totalling 128 GPUs).
-We also explore some alternative settings such as training from scratch and the Mixtral-8x7B model.
-
-
-- [Getting started](getting_started.md): Get started with Fast-LLM, set up and run a first training configuration.
-- [Load Mistral-7B](prepare_mistral.md): Define the model architecture, download a checkpoint from the Huggingface Hub and load it in Fast-LLM.
-- [Prepare and load the dataset](prepare_data.md): Prepare and configure the dataset.
-- [Prepare the training configuration](prepare_training.md): Configure the optimizer, schedule, distributed layout, etc.
-- [Launch and monitor training](launch_training.md): Launch training, configure and view experiment outputs.
-- [Convert to Hugging Face](convert_to_huggingface.md): Convert to Hugging Face format and upload it to the Hugging Face model hub.
diff --git a/docs/tutorial/launch_training.md b/docs/tutorial/launch_training.md
deleted file mode 100644
index 823d2c976..000000000
--- a/docs/tutorial/launch_training.md
+++ /dev/null
@@ -1,93 +0,0 @@
-# Launch and monitor training
-
-## Requirements
-
-At this point, you should already have:
-
-- Access to a cluster with 16 DGX nodes with 8x A100/H100-80GB GPUs ([Or at least 4 GPUs](prepare_training.md)),
-connected through an Infiniband (preferred) and/or Ethernet interconnect,
-and sharing a common fast storage.
-- A [docker image](getting_started.md) for Fast-LLM, available on all nodes.
-- A local copy of the [Mistral weights](prepare_mistral.md) on the common storage
-- A [preprocessed dataset](prepare_data.md) in json format on the common storage.
-- (Optional) A Wandb account and API key.
-
-
-## Launching the experiment
-
-To launch the experiment, we perform the following on each node,
-or use a cluster-specific tool to automate the process:
-1. Launch a docker container running our docker image,
-ensuring access to all necessary hardware (GPUs, interconnects, etc.),
-and mounting the pretrained weights, dataset and an experiment directory.
-    ```bash
-   docker run --rm -it --gpus all --net=host --ipc=host [-v ...] my_fast_llm_image bash
-    ```
-2. Note the mounted paths and host address:
-    ```bash
-    export PRETRAINED_MISTRAL_PATH=...
-    export JSON_DATA_PATH=...
-    export EXP_BASE_DIR=...
-    export HOST_NODE_ADDR=...
-    ```
-3. Set up the experiment configuration as described in the previous sections:
-    ```bash
-
-    export ARCHITECTURE_ARGS_MISTRAL_PRETRAINED="\
-   --pretrained_checkpoint_type=huggingface \
-   --pretrained_checkpoint_path=$PRETRAINED_MISTRAL_PATH \
-   "
-
-   export MODEL_ARGS_MISTRAL_PRETRAINED="\
-   $ARCHITECTURE_ARGS_MISTRAL_PRETRAINED \
-   --window_size=4096 \
-   "
-
-    export DATA_ARGS="\
-    --split=9998,2,0 \
-    --dataset_source=file \
-    --data_path=$JSON_DATA_PATH \
-    "
-
-    export TRAINING_ARGS="\
-    --batch_size=128 \
-    --sequence_length=8192 \
-    --train_iters=500000 \
-    --weight_decay=0.1 \
-    --adam_beta1=0.9 \
-    --adam_beta2=0.95 \
-    --clip_grad=1.0 \
-    --lr=0.0001 \
-    --lr_warmup_iters=1000 \
-    --lr_decay_style=cosine \
-    --lr_decay_iters=500000 \
-    --min_lr=0.000003 \
-    "
-
-    export PERFORMANCE_ARGS="\
-    --training_dtype=bf16 \
-    --num_workers=8 \
-    "
-
-    export MONITORING_ARGS="\
-    --experiment_dir=$EXP_BASE_DIR \
-    --validation_iters=25 \
-    --validation_interval=1000 \
-    --max_checkpoints=5 \
-    --export_interval=25000 \
-    --log_interval=10 \
-    --log_offset=0 \
-    --checkpoint_interval=500 \
-    "
-    ```
-4. Launch the experiment:
-    ```bash
-    torchrun --nproc-per-node=8 --nnodes=16 --rdzv-backend=c10d --rdzv-endpoint=$HOST_NODE_ADDR pretrain_fast_llm.py \
-    $MODEL_ARGS_MISTRAL_PRETRAINED $DATA_ARGS $TRAINING_ARGS $PERFORMANCE_ARGS $MONITORING_ARGS
-    ```
-
-## Monitoring the experiment
-
-After launching the experiment, you may observe the progress through either stdout,
-or the log file at `[EXP_BASE_DIR]/runs/0/logs/logs_rank_000.txt`.
-If you set up Wandb logging, progress will also be reported there.
diff --git a/mkdocs.yaml b/mkdocs.yaml
index 061d9add7..684795809 100644
--- a/mkdocs.yaml
+++ b/mkdocs.yaml
@@ -149,7 +149,7 @@ plugins:
   - section-index
   - social:
       cards_layout_options:
-        color: #173a58
+        color: "#173a58"
   - git-revision-date-localized:
       enable_creation_date: true
   - git-committers:
@@ -157,22 +157,24 @@ plugins:
       branch: main
 
 nav:
-  - 🏠 Home: index.md
-  - 🚀 Getting Started:
-    - 📜 License: license.md
-  - 🍳 Tutorial:
-    - tutorial/index.md
-    - 🚀 Getting started: tutorial/getting_started.md
-    - 💨 Load Mistral-7B: tutorial/prepare_mistral.md
-    - 📊 Prepare and load the dataset: tutorial/prepare_data.md
-    - 💨 Prepare the training configuration: tutorial/prepare_training.md
-    - 🌪 Launch and monitor training: tutorial/launch_training.md
-    - 🤗 Convert to Huggingface: tutorial/convert_to_huggingface.md
-  - 🗂️ Reference:
-    - reference/index.md
-  - 🧑‍💻 Developer Guide:
-    - developers/index.md
-    - 🛠️ How to contribute: developers/contribute.md
-  - 👥 Community:
-    - community/index.md
-    - 🫶 Feedback: community/feedback.md
+  - Fast-LLM:
+    - Welcome: index.md
+    - Cost Efficiency: cost-efficiency.md
+    - Help: help.md
+    - In Action:
+      - On Slurm: in-action/slurm.md
+      - On Kubernetes: in-action/kubernetes.md
+    - Success Stories: success-stories.md
+    - License: license.md
+  - Examples:
+    - Data Preparation: examples/data-preparation.md
+    - Train Llama 8B from scratch: examples/train-llama-8b.md
+    - Continue training Llama 8B: examples/continue-training-llama-8b.md
+    - Upcycle Llama 3B to MoE: examples/upcycle-llama-3b-to-moe.md
+  - Reference:
+    - Configuration: reference/configuration.md
+  - Developers:
+    - Contributing: developers/contributing.md
+    - Best Practices: developers/best-practices.md
+  - About Us: about-us.md
+  - Join Us: join-us.md

From 9f887c124764a0e62075b274ecdf6bd670ef0eb4 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Sat, 26 Oct 2024 15:29:55 -0400
Subject: [PATCH 20/87] add quick-start guide

---
 .gitignore                        |  1 +
 CONTRIBUTING.md                   |  2 +-
 docs/blog/index.md                |  2 +
 docs/in-action.md                 |  0
 docs/in-action/kubernetes.md      | 51 +++++++++++++++++
 docs/in-action/slurm.md           | 63 +++++++++++++++++++++
 docs/index.md                     |  8 +--
 docs/quick-start.md               | 91 +++++++++++++++++++++++++++++++
 examples/fast-llm.pytorchjob.yaml |  4 +-
 mkdocs.yaml                       |  1 +
 10 files changed, 216 insertions(+), 7 deletions(-)
 create mode 100644 docs/blog/index.md
 delete mode 100644 docs/in-action.md
 create mode 100644 docs/quick-start.md
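Note (reviewer sketch, not part of this commit): the author flags the draft `docker run` command in `docs/quick-start.md` as possibly incorrect. Docker's documented usage is `docker run [OPTIONS] IMAGE [COMMAND] [ARG...]`, so the `-v` mounts have to come before the image name. The Python sketch below assembles the command in that order. `build_docker_command` is a hypothetical helper. `--nproc_per_node=gpu` assumes torchrun accepts `gpu` as a value (one worker per visible GPU). The `fast-llm train gpt --config ...` invocation is copied from the draft and is not verified here.

```python
import os
import shlex


def build_docker_command(image, mounts, inner_command):
    """Return a docker-run argument list: options first, then image, then command."""
    command = ["docker", "run", "--gpus", "all", "-it", "--rm"]
    for host_dir, container_dir in mounts:
        # Expand "~" ourselves because no shell is involved when passing a list.
        command += ["-v", f"{os.path.expanduser(host_dir)}:{container_dir}"]
    command.append(image)        # everything after the image runs inside the container
    command += inner_command
    return command


command = build_docker_command(
    image="ghcr.io/servicenow/fast-llm:latest",
    mounts=[("~/inputs", "/app/inputs"), ("~/results", "/app/results")],
    inner_command=[
        "torchrun", "--nproc_per_node=gpu", "--no_python",  # assumption: "gpu" auto-detects
        "fast-llm", "train", "gpt",
        "--config", "/app/inputs/fast-llm-config.yaml",
    ],
)

# Print a copy-pasteable shell line instead of executing it.
print(shlex.join(command))
```

Printing instead of executing keeps the sketch safe to run on a machine without Docker or GPUs.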

diff --git a/.gitignore b/.gitignore
index c8969952d..4f834433a 100644
--- a/.gitignore
+++ b/.gitignore
@@ -8,6 +8,7 @@ __pycache__/
 
 # Doc build
 .cache
+site
 
 # Distribution / packaging
 *.egg-info/
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index e7842f696..9ef1c1856 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -59,4 +59,4 @@ If you're unsure about something or need help, you've got options:
 
 We're grateful for all the awesome contributors who help make Fast-LLM better. Join our contributors' list and make your first contribution!
 
-To learn more about the team and maintainers, visit our [About page](https://servicenow.github.io/Fast-LLM/about-us/).
+To learn more about the team and maintainers, visit our [About page](https://servicenow.github.io/Fast-LLM/about-us).
diff --git a/docs/blog/index.md b/docs/blog/index.md
new file mode 100644
index 000000000..c58f16c50
--- /dev/null
+++ b/docs/blog/index.md
@@ -0,0 +1,2 @@
+# Blog
+
diff --git a/docs/in-action.md b/docs/in-action.md
deleted file mode 100644
index e69de29bb..000000000
diff --git a/docs/in-action/kubernetes.md b/docs/in-action/kubernetes.md
index e69de29bb..7255c3cf0 100644
--- a/docs/in-action/kubernetes.md
+++ b/docs/in-action/kubernetes.md
@@ -0,0 +1,51 @@
+---
+title: "Kubernetes"
+---
+
+- **Purpose:** These guides cover specific environments and configurations for deploying Fast-LLM in different setups.
+- **Content Organization:**
+  - **in-action/slurm**: Provide detailed instructions on deploying Fast-LLM on a Slurm cluster, covering multi-node setups, configuring Slurm scripts, and managing jobs.
+  - **in-action/kubernetes**: Guide for deploying Fast-LLM using Kubernetes, including creating the appropriate workloads (e.g., Job, Pod, StatefulSet), handling private Docker images, and configuring multi-node training.
+  - **File Single Node Guide Here Too?** If you include a "single-node" guide in this section as well, make it more advanced, focusing on optimizing performance, using different configurations, or tuning settings for different GPU models.
+- **Why It Makes Sense:** Organizing by deployment environment ensures users can quickly find the relevant guide based on their setup. Including both multi-node cluster guides and single-node advanced setups allows users to scale their knowledge.
+
+---
+
+We'll walk you through how to use Fast-LLM to train a large language model on a cluster with multiple nodes and GPUs. We'll show an example setup using a Slurm cluster and a Kubernetes cluster.
+
+For this demo, we will train a Mistral-7B model from scratch for 100 steps on random data. The config file `examples/mistral-4-node-benchmark.yaml` is pre-configured for a multi-node setup with 4 DGX nodes, each with 8 A100-80GB or H100-80GB GPUs.
+
+> [!NOTE]
+> Fast-LLM scales from a single GPU to large clusters. You can start small and expand based on your resources.
+
+Expect to see a significant speedup in training time compared to other libraries! For training Mistral-7B, Fast-LLM is expected to achieve a throughput of **9,800 tokens/s/H100** (batch size 32, sequence length 8k) on a 4-node cluster with 32 H100s.
+
+
+### Running Fast-LLM on a Kubernetes Cluster
+
+#### Prerequisites
+
+- A [Kubernetes](https://kubernetes.io/) cluster with at least 4 DGX nodes with 8 A100-80GB or H100-80GB GPUs each.
+- [KubeFlow](https://www.kubeflow.org/) installed.
+- Locked memory limit set to unlimited at the host level on all nodes. Ask your cluster admin to do this if needed.
+
+#### Steps
+
+1. Create a Kubernetes [PersistentVolumeClaim](https://kubernetes.io/docs/concepts/storage/persistent-volumes/) (PVC) named `fast-llm-home` that will be mounted to `/home/fast-llm` in the container using [examples/fast-llm-pvc.yaml](examples/fast-llm-pvc.yaml):
+
+    ```bash
+    kubectl apply -f examples/fast-llm-pvc.yaml
+    ```
+
+2. Create a [PyTorchJob](https://www.kubeflow.org/docs/components/training/user-guides/pytorch/) resource using the example configuration file [examples/fast-llm.pytorchjob.yaml](examples/fast-llm.pytorchjob.yaml):
+
+    ```bash
+    kubectl apply -f examples/fast-llm.pytorchjob.yaml
+    ```
+
+3. Monitor the job status:
+
+    - Use `kubectl get pytorchjobs` to see the job status.
+    - Use `kubectl logs -f fast-llm-master-0 -c pytorch` to follow the logs.
+
+That's it! You're now up and running with Fast-LLM on Kubernetes. 🚀
diff --git a/docs/in-action/slurm.md b/docs/in-action/slurm.md
index e69de29bb..6af6f6478 100644
--- a/docs/in-action/slurm.md
+++ b/docs/in-action/slurm.md
@@ -0,0 +1,63 @@
+---
+title: "Slurm"
+---
+
+- **Purpose:** These guides cover specific environments and configurations for deploying Fast-LLM in different setups.
+- **Content Organization:**
+  - **in-action/slurm**: Provide detailed instructions on deploying Fast-LLM on a Slurm cluster, covering multi-node setups, configuring Slurm scripts, and managing jobs.
+  - **in-action/kubernetes**: Guide for deploying Fast-LLM using Kubernetes, including creating the appropriate workloads (e.g., Job, Pod, StatefulSet), handling private Docker images, and configuring multi-node training.
+  - **File Single Node Guide Here Too?** If you include a "single-node" guide in this section as well, make it more advanced, focusing on optimizing performance, using different configurations, or tuning settings for different GPU models.
+- **Why It Makes Sense:** Organizing by deployment environment ensures users can quickly find the relevant guide based on their setup. Including both multi-node cluster guides and single-node advanced setups allows users to scale their knowledge.
+
+---
+
+We'll walk you through how to use Fast-LLM to train a large language model on a cluster with multiple nodes and GPUs. We'll show an example setup using a Slurm cluster and a Kubernetes cluster.
+
+For this demo, we will train a Mistral-7B model from scratch for 100 steps on random data. The config file `examples/mistral-4-node-benchmark.yaml` is pre-configured for a multi-node setup with 4 DGX nodes, each with 8 A100-80GB or H100-80GB GPUs.
+
+> [!NOTE]
+> Fast-LLM scales from a single GPU to large clusters. You can start small and expand based on your resources.
+
+Expect to see a significant speedup in training time compared to other libraries! For training Mistral-7B, Fast-LLM is expected to achieve a throughput of **9,800 tokens/s/H100** (batch size 32, sequence length 8k) on a 4-node cluster with 32 H100s.
+
+### Running Fast-LLM on a Slurm Cluster without Docker
+
+#### Prerequisites
+
+- A [Slurm](https://slurm.schedmd.com/) cluster with at least 4 DGX nodes with 8 A100-80GB or H100-80GB GPUs each.
+- CUDA 12.1 or higher.
+- Dependencies: [PyTorch][pytorch], [Triton][triton], and [Apex](https://github.com/NVIDIA/apex) installed on all nodes.
+
+#### Steps
+
+1. Deploy the [nvcr.io/nvidia/pytorch:24.07-py3](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) Docker image to all nodes (recommended), because it contains all the necessary dependencies.
+2. Install Fast-LLM on all nodes:
+
+    ```bash
+    sbatch <<EOF
+    #!/bin/bash
+    #SBATCH --nodes=$(scontrol show node | grep -c NodeName)
+    #SBATCH --ntasks-per-node=1
+    #SBATCH --ntasks=$(scontrol show node | grep -c NodeName)
+    #SBATCH --exclusive
+
+    srun bash -c 'pip install --no-cache-dir -e "git+https://github.com/ServiceNow/Fast-LLM.git#egg=llm[CORE,OPTIONAL,DEV]"'
+    EOF
+    ```
+
+3. Use the example Slurm job script [examples/fast-llm.sbat](examples/fast-llm.sbat) to submit the job to the cluster:
+
+    ```bash
+    sbatch examples/fast-llm.sbat
+    ```
+
+4. Monitor the job's progress:
+
+    - Logs: Follow `job_output.log` and `job_error.log` in your working directory for logs.
+    - Status: Use `squeue -u $USER` to see the job status.
+
+Now, you can sit back and relax while Fast-LLM trains your model at full speed! ☕
+
+
+### Running Fast-LLM on a Slurm Cluster with Docker
+
diff --git a/docs/index.md b/docs/index.md
index 8499e0f90..5d6c9e6ba 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -8,7 +8,7 @@ hide:
 
 Welcome to **Fast-LLM**, the cutting-edge open-source library built for training large language models (LLMs) with **unmatched speed, scalability, and cost-efficiency**. Developed by [ServiceNow Research](https://www.servicenow.com/research/)'s Foundation Models Lab, Fast-LLM is engineered to meet the rigorous demands of professional AI teams, research institutions, and enterprises pushing the limits of generative AI. **Achieve groundbreaking research and high-stakes production goals faster with Fast-LLM.**
 
-[Get started with Fast-LLM](quickstart.md) and experience the next generation of LLM training. [See Fast-LLM in action](in-action.md) and discover how it can transform your training workflows.
+[Start your journey with Fast-LLM](quick-start.md) and explore the future of LLM training. Dive into [real-world use cases](in-action/slurm.md) to see how Fast-LLM can elevate your training workflows.
 
 ## Why Fast-LLM?
 
@@ -28,11 +28,11 @@ Fast-LLM isn't just another library, **it's a platform for powering the next gen
 
   - **More Tokens for Your Budget:** Train up to xx% more tokens for the same budget, leading to better-trained models without breaking your financial constraints.
 
-  [Learn more about Fast-LLM's cost efficiency and see detailed comparisons](cost-efficiency.md).
+  [Learn more about Fast-LLM's cost efficiency and see detailed comparisons](cost-efficiency).
 
 - **🔓 Openness Without Compromise:** Fast-LLM's open-source model ensures that you can **fully customize and extend the library** to fit your exact needs, without the restrictions of proprietary software. Developed transparently by a community of professionals on GitHub, every change is **openly discussed and vetted**, ensuring **trust and collaboration** as you innovate with confidence, knowing the entire development process is out in the open.
 
-- **🌍 Community-Driven Development:** Built by professionals for professionals, Fast-LLM's development is transparent, with an open invitation to the community to contribute. [**Join the Fast-LLM community**](community/join-us) to help shape the future of large-scale AI training.
+- **🌍 Community-Driven Development:** Built by professionals for professionals, Fast-LLM's development is transparent, with an open invitation to the community to contribute. [**Join the Fast-LLM community**](join-us) to help shape the future of large-scale AI training.
 
 ### Key Features
 
@@ -48,7 +48,7 @@ Fast-LLM offers all the capabilities you need to accelerate your LLM training an
 
 - **🛠️ Professional-Grade Tools:** Enjoy mixed precision training, large batch training, and gradient accumulation. Fast-LLM ensures reproducibility through deterministic behavior and provides pre-built Docker images, YAML configurations, and a simple, intuitive command-line interface.
 
-[Download Fast-LLM](https://github.com/ServiceNow/Fast-LLM/releases) and start training your large language models at full speed. [Join the Fast-LLM community](community/join-us) and collaborate with like-minded professionals to advance AI research and development.
+[Download Fast-LLM](https://github.com/ServiceNow/Fast-LLM/releases) and start training your large language models at full speed. [Join the Fast-LLM community](join-us) and collaborate with like-minded professionals to advance AI research and development.
 
 ## Use Cases and Success Stories
 
diff --git a/docs/quick-start.md b/docs/quick-start.md
new file mode 100644
index 000000000..1a35aec90
--- /dev/null
+++ b/docs/quick-start.md
@@ -0,0 +1,91 @@
+---
+title: "Quick Start"
+---
+
+
+
+- **Purpose:** This section should provide an easy entry point for users who want to quickly get up and running with Fast-LLM. A single-node setup is a reasonable assumption for most users, as it doesn’t require specialized hardware or admin-level permissions on large clusters.
+- **Content Ideas:**
+  - **Single Node Setup Guide** (with GPU access, Docker, and root/privileged access assumed): Walk through the installation, setting up Docker, configuring the environment, and launching a simple training or inference task.
+  - **Running Your First Model**: Include a basic example with a small dataset to show how to use Fast-LLM on a local machine.
+  - **Troubleshooting Basics**: Common issues that users might run into when setting up on a single node.
+- **Why It Makes Sense:** Most users will likely start with a local environment to experiment with Fast-LLM, so guiding them through a single-node setup as the "Getting Started" entry point makes it more approachable.
+
+---
+
+we want to asume that the user has at least one NVIDIA GPU available in one machine, and that they have Docker installed.
+
+in that case, it's really straightforward to get started with Fast-LLM.
+
+first, let's download a pre-built Docker image with Fast-LLM:
+
+```bash
+docker pull ghcr.io/servicenow/fast-llm:latest
+```
+
+let's make two directories to store our inputs and outputs:
+
+```bash
+mkdir ~/inputs ~/results
+```
+
+then let's download a huggingface config file for training a model and save this as `~/inputs/config.json`:
+
+=== "Llama-3.2-3B-Instruct"
+
+    ```bash
+    curl -O https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct/resolve/main/config.json
+    ```
+
+=== "Qwen2.5-3B-Instruct"
+
+    ```bash
+    curl -O https://huggingface.co/Qwen/Qwen2.5-3B-Instruct/resolve/main/config.json
+    ```
+
+Now let's use this config in our Fast-LLM training configuration file:
+
+```yaml
+training:
+  train_iters: 100
+  logs:
+    interval: 10
+  validation:
+    iterations: null
+  test_iters: 0
+batch:
+  micro_batch_size: 1
+  batch_size: 32
+data:
+  format: random
+  split: [1, 0, 0]
+optimizer:
+  learning_rate:
+    base: 1.0e-05
+pretrained:
+  format: huggingface
+  path: /app/inputs
+model:
+  multi_stage:
+    zero_stage: 2
+  distributed:
+    training_dtype: bf16
+run:
+  experiment_dir: /app/results
+```
+
+save this to a file called `~/inputs/fast-llm-config.yaml`.
+this will be mounted into the Docker container when we run it.
+
+then, run the following command:
+
+```bash
+docker run --gpus all -it --rm ghcr.io/servicenow/fast-llm:latest
+        -v ~/inputs:/app/inputs
+        -v ~/results:/app/results
+        torchrun --nproc_per_node=8 --no_python fast-llm train gpt --config /app/inputs/fast-llm-config.yaml
+```
+
+[^^^ this may be incorrect, I'm not sure how to run the command]
+
+[I want the number of GPUs be auto-detected, but I'm not sure how to do that]
diff --git a/examples/fast-llm.pytorchjob.yaml b/examples/fast-llm.pytorchjob.yaml
index 9decff91f..13a7a4df8 100644
--- a/examples/fast-llm.pytorchjob.yaml
+++ b/examples/fast-llm.pytorchjob.yaml
@@ -17,7 +17,7 @@ spec:
               effect: NoSchedule
           containers:
             - name: pytorch
-              image: servicenowdocker/fast-llm:latest
+              image: ghcr.io/servicenow/fast-llm:latest
               resources:
                 limits:
                   nvidia.com/gpu: 8
@@ -77,7 +77,7 @@ spec:
               effect: NoSchedule
           containers:
             - name: pytorch
-              image: servicenowdocker/fast-llm:latest
+              image: ghcr.io/servicenow/fast-llm:latest
               resources:
                 limits:
                   nvidia.com/gpu: 8
diff --git a/mkdocs.yaml b/mkdocs.yaml
index 684795809..97a0f454e 100644
--- a/mkdocs.yaml
+++ b/mkdocs.yaml
@@ -159,6 +159,7 @@ plugins:
 nav:
   - Fast-LLM:
     - Welcome: index.md
+    - Quick Start: quick-start.md
     - Cost Efficiency: cost-efficiency.md
     - Help: help.md
     - In Action:

From 0ab5b62f924ba24cee283dbde29fb3ef51c0e7d1 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Sat, 26 Oct 2024 20:10:02 -0400
Subject: [PATCH 21/87] add prepare-dataset script

---
 tools/prepare_dataset.py | 181 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 181 insertions(+)
 create mode 100644 tools/prepare_dataset.py
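Note (reviewer sketch, not part of this commit): the script's bookkeeping has three parts. It picks the narrowest signed integer dtype that can hold the vocabulary size, derives the shard count from the total token count, and weights each shard by its share of tokens. The standalone sketch below mirrors those rules with the standard library only. The helper names and the token counts in the example are hypothetical and are not taken from the script.

```python
import math


def smallest_signed_dtype(vocab_size):
    """Name of the narrowest signed integer type whose maximum is >= vocab_size."""
    for bits in (8, 16, 32, 64):
        if vocab_size <= 2 ** (bits - 1) - 1:
            return f"int{bits}"
    raise ValueError(f"Vocabulary size {vocab_size} does not fit in int64.")


def number_of_shards(total_tokens, num_tokens_per_shard):
    """Shard count such that each shard holds roughly num_tokens_per_shard tokens."""
    return math.ceil(total_tokens / num_tokens_per_shard)


def shard_weights(shard_token_counts):
    """Each shard's weight is its fraction of the total token count."""
    total = sum(shard_token_counts)
    return [count / total for count in shard_token_counts]


# Hypothetical inputs, for illustration only.
print(smallest_signed_dtype(50_257))                  # a GPT-2-sized vocabulary
print(number_of_shards(9_050_000_000, 100_000_000))   # about 9B tokens, 100M per shard
print(shard_weights([100, 300, 600]))
```

Because the script shards by document rather than by token, real shards only approximate the target size, which is why the weights are recomputed from the token counts actually saved.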

diff --git a/tools/prepare_dataset.py b/tools/prepare_dataset.py
new file mode 100644
index 000000000..1c1b89b00
--- /dev/null
+++ b/tools/prepare_dataset.py
@@ -0,0 +1,181 @@
+import json
+import os
+from functools import cached_property
+from multiprocessing import Pool
+from pathlib import Path
+
+import numpy as np
+from datasets import load_dataset
+from tqdm import tqdm
+from transformers import AutoTokenizer, logging
+
+from fast_llm.config import Field, FieldHint, check_field, config_class
+from fast_llm.data.mmap import MMapIndexedDataset
+from fast_llm.engine.config_utils.data_type import DataType
+from fast_llm.engine.config_utils.runnable import RunnableConfig
+from fast_llm.utils import Assert
+
+logging.set_verbosity_error()
+
+
+@config_class
+class PrepareDatasetConfig(RunnableConfig):
+    dataset_name_or_path: str = Field(
+        desc="Name or path of the dataset.",
+        hint=FieldHint.core,
+    )
+    tokenizer_path_or_name: str = Field(
+        desc="Path or name of the tokenizer.",
+        hint=FieldHint.core,
+    )
+    output_dir: str = Field(
+        desc="Output directory for the processed dataset.",
+        hint=FieldHint.core,
+    )
+    num_processes_load: int = Field(
+        default=1,
+        desc="Number of workers in load_dataset() call.",
+        hint=FieldHint.optional,
+        valid=check_field(Assert.geq, 0),
+    )
+    num_processes_map: int = Field(
+        default=1,
+        desc="Number of workers in .map() call.",
+        hint=FieldHint.optional,
+        valid=check_field(Assert.geq, 0),
+    )
+    num_processes_save: int = Field(
+        default=1,
+        desc="Number of processes for saving the mmap'ed datasets.",
+        hint=FieldHint.optional,
+        valid=check_field(Assert.geq, 1),
+    )
+    num_tokens_per_shard: int = Field(
+        default=1000000000,
+        desc="Approximate number of tokens per shard.",
+        hint=FieldHint.optional,
+        valid=check_field(Assert.geq, 1),
+    )
+    dataset_config_name: None | str = Field(
+        default=None,
+        desc="Specific configuration name for the dataset.",
+        hint=FieldHint.optional,
+    )
+    dataset_split: str = Field(
+        default="train",
+        desc="Split of the dataset to use.",
+        hint=FieldHint.optional,
+    )
+    dataset_field: str = Field(
+        default="text",
+        desc="Field of the dataset to use.",
+        hint=FieldHint.optional,
+    )
+    dataset_dtype: DataType = Field(
+        default=None,
+        desc="Data type of the dataset field.",
+        hint=FieldHint.derived,
+    )
+
+    @cached_property
+    def tokenizer(self):
+        return AutoTokenizer.from_pretrained(self.tokenizer_path_or_name)
+
+    def _validate(self):
+        if self.dataset_dtype is None:
+            # Decide the dtype based on the tokenizer vocabulary size
+            vocab_size = len(self.tokenizer)
+
+            if vocab_size <= np.iinfo(np.int8).max:
+                self.dataset_dtype = DataType.int8
+            elif vocab_size <= np.iinfo(np.int16).max:
+                self.dataset_dtype = DataType.int16
+            elif vocab_size <= np.iinfo(np.int32).max:
+                self.dataset_dtype = DataType.int32
+            elif vocab_size <= np.iinfo(np.int64).max:
+                self.dataset_dtype = DataType.int64
+            else:
+                raise ValueError(
+                    f"Tokenizer vocabulary size {vocab_size} is too large for supported dtypes in MMapIndexedDataset."
+                )
+        super()._validate()
+
+    def _tokenize_text(self, text):
+        tokens = self.tokenizer(
+            text,
+            truncation=False,
+            padding=False,
+            add_special_tokens=True,
+        )["input_ids"]
+        return np.array(tokens, dtype=self.dataset_dtype.numpy)
+
+    def _tokenize_batch(self, batch):
+        input_ids = [self._tokenize_text(text) for text in batch[self.dataset_field]]
+        num_tokens = [len(x) for x in input_ids]
+        return {
+            "input_ids": input_ids,
+            "num_tokens": num_tokens,
+        }
+
+    def _save_shard(self, args) -> dict:
+        shard_idx, shard_dataset = args
+        prefix = f"shard_{shard_idx}"
+        shard_output_path = Path(self.output_dir) / prefix
+        documents = [
+            np.array(item["input_ids"], dtype=self.dataset_dtype.numpy)
+            for item in tqdm(shard_dataset, desc=f"Saving shard {shard_idx}", unit="docs")
+        ]
+        MMapIndexedDataset.write_dataset(prefix=shard_output_path, documents=documents)
+        dataset_dict = {
+            "prefix": prefix,
+            "num_documents": len(documents),
+            "num_tokens": sum(len(doc) for doc in documents),
+        }
+        return dataset_dict
+
+    def run(self):
+        # Load dataset
+        dataset = load_dataset(
+            path=self.dataset_name_or_path,
+            name=self.dataset_config_name,
+            split=self.dataset_split,
+            num_proc=self.num_processes_load,
+        )
+        if self.dataset_field not in dataset.column_names:
+            raise ValueError(f"Dataset does not have field '{self.dataset_field}'.")
+
+        # Tokenize the dataset
+        tokenized_dataset = dataset.map(
+            self._tokenize_batch,
+            batched=True,
+            num_proc=self.num_processes_map,
+            desc="Tokenizing batches",
+        )
+
+        # Calculate total number of tokens
+        total_tokens = sum(tqdm(tokenized_dataset["num_tokens"], desc="Counting tokens", unit="tokens"))
+
+        # Split dataset into shards
+        num_shards = int(np.ceil(total_tokens / self.num_tokens_per_shard))
+        shards = [
+            (i, tokenized_dataset.shard(num_shards=num_shards, index=i))
+            for i in tqdm(range(num_shards), desc="Creating shards")
+        ]
+
+        # Prepare output directory
+        os.makedirs(self.output_dir, exist_ok=True)
+
+        # Use multiprocessing to save each shard in parallel
+        with Pool(processes=self.num_processes_save) as pool:
+            dataset_dicts = pool.map(self._save_shard, shards)
+
+        # Create a metadata file
+        total_tokens = sum(dataset_dict["num_tokens"] for dataset_dict in dataset_dicts)
+        for dataset_dict in dataset_dicts:
+            dataset_dict["weight"] = float(dataset_dict["num_tokens"]) / float(total_tokens)
+        output_file = Path(self.output_dir) / "fast_llm_dataset.json"
+        output_file.write_text(json.dumps({"datasets": dataset_dicts}))
+
+
+if __name__ == "__main__":
+    PrepareDatasetConfig.parse_and_run()

From 79b840195c2a849a87cb543528bb454c26411792 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Sun, 27 Oct 2024 15:39:28 -0400
Subject: [PATCH 22/87] rewrite quick-start guide

---
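Note (reviewer sketch, not part of this commit): the rewritten guide quotes "about 500k tokens per batch" and "~300B" total tokens. The arithmetic behind those figures can be checked with a few lines of Python. The values are copied from the YAML in this patch. `num_gpus = 8` is an assumption based on the guide's hardware recommendation, and the dataset size uses the guide's own figure of 91 shards of 100M tokens.

```python
# Values copied from the quick-start training configuration.
sequence_length = 1024
batch_size = 480
train_iters = 600_000

# Assumptions: 8 GPUs (the guide's recommendation) and 91 shards of 100M tokens.
num_gpus = 8
dataset_tokens = 91 * 100_000_000

tokens_per_batch = batch_size * sequence_length
total_tokens = tokens_per_batch * train_iters
passes_over_dataset = total_tokens / dataset_tokens

print(f"tokens per batch:    {tokens_per_batch:,}")       # 491,520
print(f"total tokens:        {total_tokens:,}")           # 294,912,000,000
print(f"passes over dataset: {passes_over_dataset:.1f}")
print(f"batch divisible by GPU count: {batch_size % num_gpus == 0}")
```

Under these assumptions the run revisits the roughly 9B-token dataset many times, which is worth keeping in mind when shortening `train_iters` for a quick trial.
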
 docs/index.md       |   6 +-
 docs/quick-start.md | 175 +++++++++++++++++++++++++++++++++++---------
 2 files changed, 143 insertions(+), 38 deletions(-)
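Note (reviewer sketch, not part of this commit): the configuration annotations describe a linear warmup followed by a cosine decay that moves between `base` and `minimum`. A minimal sketch of such a schedule follows. The function `learning_rate` is hypothetical. It assumes the cosine phase runs from the end of warmup to `decay_iterations`; Fast-LLM's actual implementation may measure these intervals differently.

```python
import math


def learning_rate(step, base=6.0e-4, minimum=6.0e-5,
                  warmup_iterations=2000, decay_iterations=600_000):
    """Linear warmup to `base`, then cosine decay to `minimum` (assumed shape)."""
    if step < warmup_iterations:
        return base * step / warmup_iterations
    # Fraction of the decay phase completed, clamped so the rate stays at `minimum`.
    progress = min(1.0, (step - warmup_iterations) / (decay_iterations - warmup_iterations))
    return minimum + 0.5 * (base - minimum) * (1.0 + math.cos(math.pi * progress))


for step in (0, 1000, 2000, 301_000, 600_000):
    print(f"step {step:>7}: lr = {learning_rate(step):.2e}")
```

The defaults mirror the YAML values in this patch, so changing `train_iters` without also changing `decay_iterations` would leave the rate above `minimum` at the end of training.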

diff --git a/docs/index.md b/docs/index.md
index 5d6c9e6ba..58f8f34c3 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -1,9 +1,5 @@
 ---
 title: "Fast-LLM: Train Large Language Models Faster Than Ever Before"
-hide:
-  - navigation
-  - toc
-  - feedback
 ---
 
 Welcome to **Fast-LLM**, the cutting-edge open-source library built for training large language models (LLMs) with **unmatched speed, scalability, and cost-efficiency**. Developed by [ServiceNow Research](https://www.servicenow.com/research/)'s Foundation Models Lab, Fast-LLM is engineered to meet the rigorous demands of professional AI teams, research institutions, and enterprises pushing the limits of generative AI. **Achieve groundbreaking research and high-stakes production goals faster with Fast-LLM.**
@@ -81,6 +77,6 @@ Fast-LLM is more than just software, it's a community. Get involved by exploring
 
 ## Getting Started
 
-Ready to dive in? Check out our [quickstart guide](quickstart.md) for an overview of how to set up and run Fast-LLM on different platforms, including [Slurm](https://slurm.schedmd.com) and [Kubernetes](https://kubernetes.io). Explore the [examples](https://github.com/ServiceNow/Fast-LLM/tree/main/examples) for pre-configured setups to help you get started quickly with your own training experiments.
+Ready to dive in? Check out our [quickstart guide](quick-start) for an overview of how to set up and run Fast-LLM on different platforms, including [Slurm](https://slurm.schedmd.com) and [Kubernetes](https://kubernetes.io). Explore the [examples](https://github.com/ServiceNow/Fast-LLM/tree/main/examples) for pre-configured setups to help you get started quickly with your own training experiments.
 
 For any questions or issues, open an [issue](https://github.com/ServiceNow/Fast-LLM/issues) or join the [community discussion](https://github.com/ServiceNow/Fast-LLM/discussions).
diff --git a/docs/quick-start.md b/docs/quick-start.md
index 1a35aec90..344874148 100644
--- a/docs/quick-start.md
+++ b/docs/quick-start.md
@@ -1,35 +1,69 @@
 ---
-title: "Quick Start"
+title: "Quick Start 🚀"
 ---
 
+This guide will get you up and running with Fast-LLM on a single machine. Let's train a model and see some results!
 
+You'll need:
 
-- **Purpose:** This section should provide an easy entry point for users who want to quickly get up and running with Fast-LLM. A single-node setup is a reasonable assumption for most users, as it doesn’t require specialized hardware or admin-level permissions on large clusters.
-- **Content Ideas:**
-  - **Single Node Setup Guide** (with GPU access, Docker, and root/privileged access assumed): Walk through the installation, setting up Docker, configuring the environment, and launching a simple training or inference task.
-  - **Running Your First Model**: Include a basic example with a small dataset to show how to use Fast-LLM on a local machine.
-  - **Troubleshooting Basics**: Common issues that users might run into when setting up on a single node.
-- **Why It Makes Sense:** Most users will likely start with a local environment to experiment with Fast-LLM, so guiding them through a single-node setup as the "Getting Started" entry point makes it more approachable.
+- At least one NVIDIA GPU on your machine. We recommend 8 A100s or better for this tutorial 🤑
+- Docker (installed and running)
+- Some patience for the initial setup and training 😊
 
----
-
-we want to asume that the user has at least one NVIDIA GPU available in one machine, and that they have Docker installed.
-
-in that case, it's really straightforward to get started with Fast-LLM.
+## Step 1: Pull the Fast-LLM Docker Image
 
-first, let's download a pre-built Docker image with Fast-LLM:
+To start, grab the pre-built Fast-LLM Docker image:
 
 ```bash
 docker pull ghcr.io/servicenow/fast-llm:latest
 ```
 
-let's make two directories to store our inputs and outputs:
+## Step 2: Set Up Directories for Your Inputs and Outputs
+
+Let's create folders to store our input data and output results:
 
 ```bash
 mkdir ~/inputs ~/results
 ```
 
-then let's download a huggingface config file for training a model and save this as `~/inputs/config.json`:
+## Step 3: Preparing the Training Data
+
+For this tutorial, we'll use 9B tokens of text from the [OpenWebText](https://skylion007.github.io/OpenWebTextCorpus/) dataset. This dataset is a free approximation of the WebText data OpenAI used for GPT-2, and it's perfect for our setup!
+
+We've got a script that'll download and preprocess the dataset for you. Run it like this:
+
+!!! info inline end "What's Happening Here?"
+
+    This will grab the OpenWebText data, tokenize it with the GPT-2 tokenizer, and save it in 91 shards of 100M tokens each. Expect around 2 hours for the whole thing to finish, mainly due to tokenization. If you've got more CPU cores, try upping `num_processes_*` to speed things up.
+
+```bash
+python tools/prepare_dataset.py \
+    tokenizer_path_or_name="gpt2" \
+    dataset_name_or_path="openwebtext" \
+    dataset_split="train" \
+    dataset_field="text" \
+    output_dir="inputs" \
+    num_processes_load=4 \
+    num_processes_map=4 \
+    num_processes_save=4 \
+    num_tokens_per_shard=100000000
+```
+
+## Step 4: Choose Your Model
+
+Fast-LLM supports many GPT variants, including (but not limited to) GPT-2, Llama, Mistral, and Qwen. For this tutorial, let's train the GPT-2 model from scratch with Fully Sharded Data Parallelism (FSDP). We'll grab a configuration file from the Hugging Face Hub and save it as `~/inputs/config.json`. Run the command from inside `~/inputs`, because `curl -O` writes to the current directory:
+
+=== "GPT-2 (124M)"
+
+    ```bash
+    curl -O https://huggingface.co/openai-community/gpt2/resolve/main/config.json
+    ```
+
+=== "GPT-2 XL (1558M)"
+
+    ```bash
+    curl -O https://huggingface.co/openai-community/gpt2-xl/resolve/main/config.json
+    ```
 
 === "Llama-3.2-3B-Instruct"
 
@@ -43,28 +77,56 @@ then let's download a huggingface config file for training a model and save this
     curl -O https://huggingface.co/Qwen/Qwen2.5-3B-Instruct/resolve/main/config.json
     ```
 
-Now let's use this config in our Fast-LLM training configuration file:
+!!! tip "Model Size Matters"
+
+    Smaller models like GPT-2 (124M) will train relatively quickly, especially if you've only got a few GPUs. But if you're feeling adventurous (and patient), give the larger models a shot!
+
+## Step 5: Set Up Your Training Configuration
+
+Next, we'll create a configuration file for Fast-LLM. Save the following as `~/inputs/fast-llm-config.yaml`:
 
 ```yaml
 training:
-  train_iters: 100
+  train_iters: 600_000  # (1)!
   logs:
     interval: 10
   validation:
-    iterations: null
+    iterations: 25
+    interval: 1000
+  checkpoint:
+    interval: 1000
+    keep_latest: 5
   test_iters: 0
+  export:
+    format: huggingface
+    interval: 20_000
+  wandb:
+    project_name: fast-llm
+    entity_name: servicenow  # (2)!
+    tags: quick-start
+    alert:
+      interval: 1000
 batch:
-  micro_batch_size: 1
-  batch_size: 32
+  micro_batch_size: 1  # (3)!
+  sequence_length: 1024  # (4)!
+  batch_size: 480  # (5)!
 data:
-  format: random
-  split: [1, 0, 0]
+  format: file
+  split: [998, 2, 0]  # (6)!
 optimizer:
+  weight_decay: 0.1  # (7)!
+  beta_1: 0.9  # (8)!
+  beta_2: 0.95  # (9)!
   learning_rate:
-    base: 1.0e-05
+    base: 6.0e-04  # (10)!
+    minimum: 6.0e-05  # (11)!
+    decay_style: cosine  # (12)!
+    decay_iterations: 600_000  # (13)!
+    warmup_iterations: 2000  # (14)!
 pretrained:
   format: huggingface
-  path: /app/inputs
+  path: /app/inputs  # (15)!
+  load_weights: False  # (16)!
 model:
   multi_stage:
     zero_stage: 2
@@ -74,18 +136,65 @@ run:
   experiment_dir: /app/results
 ```
 
-save this to a file called `~/inputs/fast-llm-config.yaml`.
-this will be mounted into the Docker container when we run it.
+1. Total number of tokens will be ~300B.
+2. Replace `servicenow` with your own W&B entity name.
+3. Adjust based on GPU memory. For GPT-2 and an A100-80GB, a `micro_batch_size` of 1 should work well.
+4. Should be a power of 2 and divisible by 8. For an A100-80GB, 1024 is a good starting point.
+5. Must be divisible by the number of GPUs. At 1024 tokens per sequence, 480 corresponds to about 500k tokens per batch.
+6. 99.8% train, 0.2% validation, 0% test.
+7. L2 regularization penalty.
+8. 1st Adam optimizer parameter.
+9. 2nd Adam optimizer parameter.
+10. Peak learning rate.
+11. Should be 1/10th of base per Chinchilla.
+12. Cosine decay starting at `base` after warmup and ending at `minimum` after `decay_iterations`.
+13. Usually the same as `train_iters`.
+14. Number of steps of linear warmup.
+15. Location of the `config.json` file downloaded in Step 4.
+16. Set to `False` to train from scratch.
 
-then, run the following command:
+## Step 6: Add Your Weights & Biases API Key
+
+Save your Weights & Biases API key to `~/inputs/.wandb_api_key` so Fast-LLM can track your training progress there. You can create a free W&B account if you don't already have one.
+
+## Step 7: Launch Training
+
+Alright, the big moment! If you're on an 8-GPU machine, run the following to kick off training:
 
 ```bash
-docker run --gpus all -it --rm ghcr.io/servicenow/fast-llm:latest
-        -v ~/inputs:/app/inputs
-        -v ~/results:/app/results
-        torchrun --nproc_per_node=8 --no_python fast-llm train gpt --config /app/inputs/fast-llm-config.yaml
+docker run --gpus all -it --rm \
+    -v ~/inputs:/app/inputs \
+    -v ~/results:/app/results \
+    -e PYTHONHASHSEED=0 \
+    -e WANDB_API_KEY_PATH=/app/inputs/.wandb_api_key \
+    ghcr.io/servicenow/fast-llm:latest torchrun --nproc_per_node=8 --no_python fast-llm train gpt --config /app/inputs/fast-llm-config.yaml
 ```
 
-[^^^ this may be incorrect, I'm not sure how to run the command]
+!!! note
+
+    Setting the Python hash seed to 0 ensures consistent, reproducible ordering in hash-dependent operations across processes, which is crucial for parallel computations.
+
+Expect training to run for a few days (for a full 300B tokens). Keep an eye on the validation loss. You should see it drop as the model learns.
+
+## Tracking Your Progress with W&B 📊
+
+With Weights & Biases, you'll see the loss curve, training metrics, and more. If you follow this whole training setup, you should see the validation loss approaching the ballpark of ~2.85 (similar to the original GPT-2 model finetuned on OpenWebText).
+
+### Troubleshooting Basics 🛠️
+
+Some common issues you might run into and how to address them:
+
+- **CUDA Out of Memory**: If you encounter memory issues, try **lowering the `micro_batch_size` or `sequence_length`** in your config file.
+
+- **Low Memory Usage**: If your memory utilization is low, but you aren't close to maximum GPU usage, try **increasing the `micro_batch_size` or `sequence_length`** to better utilize your GPU.
+
+- **Low GPU Utilization or Slow Training**:
+  - **Increase `micro_batch_size`**: If memory allows, increase `micro_batch_size` to 4, 8, or even 16. Larger micro batches ensure GPUs have more work per step, which reduces idle time and improves overall utilization.
+  - **Extend `sequence_length`**: If your model and GPU memory can handle it, try increasing the `sequence_length` to 2048, 3072, or 4096. Longer sequences allow each forward pass to do more work, better engaging the GPU.
+
+- **Docker Permission Issues**: If you experience Docker permission errors, ensure Docker has the necessary permissions to access your GPUs. This can be checked by ensuring `--gpus all` is specified in your Docker run command, or by confirming that your user has access to the `docker` and `nvidia-docker` groups.
+
+## Final Thoughts
 
-[I want the number of GPUs be auto-detected, but I'm not sure how to do that]
+And that's it! You've set up, prepped data, chosen a model, configured training, and launched a full training run with Fast-LLM. From here, feel free to tweak the model, try out larger datasets, or scale things up to a multi-node setup if you're on a cluster.
+We have guides for Slurm and Kubernetes setups if distributed training is your jam. Happy training! 🚀

From 8b6ef7b5049458632107b1c0012e8dd6aff240e5 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Sun, 27 Oct 2024 15:43:19 -0400
Subject: [PATCH 23/87] rewrite quick-start guide

---
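
Note on the troubleshooting advice in this patch: raising `micro_batch_size` changes how many micro-batches each GPU runs per optimizer step. The sketch below illustrates that bookkeeping. It assumes the global `batch_size` is split evenly across data-parallel GPUs and that the remainder is covered by gradient accumulation; that is an assumption about the scheduling, not something the guide states, and `accumulation_steps` is a hypothetical helper name.

```python
def accumulation_steps(batch_size: int, micro_batch_size: int, num_gpus: int) -> int:
    """Micro-batches each GPU processes per optimizer step (hypothetical helper).

    Assumes batch_size = micro_batch_size * num_gpus * accumulation_steps.
    """
    samples_per_pass = micro_batch_size * num_gpus
    if batch_size % samples_per_pass != 0:
        raise ValueError(
            f"batch_size={batch_size} is not divisible by "
            f"micro_batch_size * num_gpus = {samples_per_pass}"
        )
    return batch_size // samples_per_pass


# Values from the quick-start config: batch_size 480 on an 8-GPU machine.
for micro_batch_size in (1, 2, 4):
    steps = accumulation_steps(480, micro_batch_size, 8)
    print(f"micro_batch_size={micro_batch_size}: {steps} accumulation steps")
```

Under the same assumption, a `micro_batch_size` of 8 would not divide a batch of 480 evenly on 8 GPUs, so `batch_size` may need adjusting alongside it.
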
 docs/quick-start.md | 14 +++++---------
 1 file changed, 5 insertions(+), 9 deletions(-)

diff --git a/docs/quick-start.md b/docs/quick-start.md
index 344874148..158a79474 100644
--- a/docs/quick-start.md
+++ b/docs/quick-start.md
@@ -180,19 +180,15 @@ Expect training to run for a few days (for a full 300B tokens). Keep an eye on t
 
 With Weights & Biases, you'll see the loss curve, training metrics, and more. If you follow this whole training setup, you should see the validation loss approaching the ballpark of ~2.85 (similar to the original GPT-2 model finetuned on OpenWebText).
 
-### Troubleshooting Basics 🛠️
+## Troubleshooting Basics 🛠️
 
-Some common issues you might run into and how to address them:
+Here are some common issues you might encounter and how to address them:
 
-- **CUDA Out of Memory**: If you encounter memory issues, try **lowering the `micro_batch_size` or `sequence_length`** in your config file.
+- **CUDA Out of Memory**: Try lowering the `micro_batch_size` or `sequence_length` in your configuration to fit within available memory.
 
-- **Low Memory Usage**: If your memory utilization is low, but you aren't close to maximum GPU usage, try **increasing the `micro_batch_size` or `sequence_length`** to better utilize your GPU.
+- **Underutilized GPU or Low Memory Usage**: If memory usage is low or GPU utilization isn't maxed out, try increasing `micro_batch_size` (to 4, 8, or 16 if memory allows) or extending `sequence_length` (up to 2048, 3072, or 4096, as memory permits). Larger batches and longer sequences help keep GPUs engaged and reduce idle time.
 
-- **Low GPU Utilization or Slow Training**:
-  - **Increase `micro_batch_size`**: If memory allows, increase `micro_batch_size` to 4, 8, or even 16. Larger micro batches ensure GPUs have more work per step, which reduces idle time and improves overall utilization.
-  - **Extend `sequence_length`**: If your model and GPU memory can handle it, try increasing the `sequence_length` to 2048, 3072, or 4096. Longer sequences allow each forward pass to do more work, better engaging the GPU.
-
-- **Docker Permission Issues**: If you experience Docker permission errors, ensure Docker has the necessary permissions to access your GPUs. This can be checked by ensuring `--gpus all` is specified in your Docker run command, or by confirming that your user has access to the `docker` and `nvidia-docker` groups.
+- **Docker Permission Issues**: If you encounter Docker permission errors, confirm that Docker has permission to access your GPUs. Use the `--gpus all` flag in your Docker run command and ensure your user has access to the `docker` and `nvidia-docker` groups.
 
 ## Final Thoughts
 

From e04d5ebb3fae67fea58bece419815443db5ed763 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Sun, 27 Oct 2024 16:21:08 -0400
Subject: [PATCH 24/87] add support for distributed data preparation

---
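
Note on the metadata step in this patch: each rank saves its own shards, and rank 0 then flattens the gathered per-rank lists and assigns every shard a weight proportional to its token count. The sketch below mirrors that step without `torch.distributed`. The function name `merge_shard_metadata` and the `prefix` key are illustrative assumptions; only `num_tokens` and `weight` appear in the patch.

```python
def merge_shard_metadata(per_rank_dicts):
    """Flatten per-rank shard lists and assign token-proportional weights."""
    # Copy each dict so the caller's per-rank lists are left untouched.
    merged = [dict(d) for rank_dicts in per_rank_dicts for d in rank_dicts]
    total_tokens = sum(d["num_tokens"] for d in merged)
    for d in merged:
        d["weight"] = d["num_tokens"] / total_tokens
    return merged


# Two ranks: rank 0 wrote two shards, rank 1 wrote one.
per_rank = [
    [{"prefix": "shard_0_0", "num_tokens": 100}, {"prefix": "shard_0_1", "num_tokens": 50}],
    [{"prefix": "shard_1_0", "num_tokens": 50}],
]
merged = merge_shard_metadata(per_rank)
print([(d["prefix"], d["weight"]) for d in merged])
# [('shard_0_0', 0.5), ('shard_0_1', 0.25), ('shard_1_0', 0.25)]
```
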
 tools/prepare_dataset.py | 77 +++++++++++++++++++++++++++++++---------
 1 file changed, 60 insertions(+), 17 deletions(-)

diff --git a/tools/prepare_dataset.py b/tools/prepare_dataset.py
index 1c1b89b00..b38a88cbc 100644
--- a/tools/prepare_dataset.py
+++ b/tools/prepare_dataset.py
@@ -5,7 +5,8 @@
 from pathlib import Path
 
 import numpy as np
-from datasets import load_dataset
+from datasets import load_dataset, load_from_disk
+from torch import distributed as dist
 from tqdm import tqdm
 from transformers import AutoTokenizer, logging
 
@@ -76,6 +77,21 @@ class PrepareDatasetConfig(RunnableConfig):
         desc="Data type of the dataset field.",
         hint=FieldHint.derived,
     )
+    rank: int = Field(
+        default=0,
+        desc="Rank of the process for distributed processing.",
+        hint=FieldHint.optional,
+    )
+    world_size: int = Field(
+        default=1,
+        desc="Total number of processes in distributed processing.",
+        hint=FieldHint.optional,
+    )
+    distributed_backend: str = Field(
+        default="gloo",
+        desc="Distributed backend for distributed processing.",
+        hint=FieldHint.optional,
+    )
 
     @cached_property
     def tokenizer(self):
@@ -119,7 +135,7 @@ def _tokenize_batch(self, batch):
 
     def _save_shard(self, args) -> dict:
         shard_idx, shard_dataset = args
-        prefix = f"shard_{shard_idx}"
+        prefix = f"shard_{self.rank}_{shard_idx}"
         shard_output_path = Path(self.output_dir) / prefix
         documents = [
             np.array(item["input_ids"], dtype=self.dataset_dtype.numpy)
@@ -134,13 +150,30 @@ def _save_shard(self, args) -> dict:
         return dataset_dict
 
     def run(self):
-        # Load dataset
-        dataset = load_dataset(
-            path=self.dataset_name_or_path,
-            name=self.dataset_config_name,
-            split=self.dataset_split,
-            num_proc=self.num_processes_load,
-        )
+        # Initialize distributed processing
+        if self.world_size > 1:
+            dist.init_process_group(backend=self.distributed_backend, rank=self.rank, world_size=self.world_size)
+
+        # Prepare output directory
+        os.makedirs(self.output_dir, exist_ok=True)
+
+        # Download dataset
+        download_dir = Path(self.output_dir) / "downloaded_dataset"
+        if self.rank == 0:
+            load_dataset(
+                path=self.dataset_name_or_path,
+                name=self.dataset_config_name,
+                split=self.dataset_split,
+                num_proc=self.num_processes_load,
+                trust_remote_code=True,
+            ).save_to_disk(download_dir, num_proc=self.num_processes_save)
+
+        # Synchronize processes to wait for the download
+        if self.world_size > 1:
+            dist.barrier()
+
+        # Load and shard the dataset
+        dataset = load_from_disk(download_dir).shard(num_shards=self.world_size, index=self.rank)
         if self.dataset_field not in dataset.column_names:
             raise ValueError(f"Dataset does not have field '{self.dataset_field}'.")
 
@@ -162,19 +195,29 @@ def run(self):
             for i in tqdm(range(num_shards), desc="Creating shards")
         ]
 
-        # Prepare output directory
-        os.makedirs(self.output_dir, exist_ok=True)
-
         # Use multiprocessing to save each shard in parallel
         with Pool(processes=self.num_processes_save) as pool:
             dataset_dicts = pool.map(self._save_shard, shards)
 
+        # Gather dataset_dicts from all ranks to rank 0
+        if self.world_size > 1:
+            all_dataset_dicts = [None] * self.world_size if self.rank == 0 else None  # must be None on non-dst ranks
+            dist.gather_object(dataset_dicts, all_dataset_dicts, dst=0)
+            if self.rank == 0:
+                dataset_dicts = [item for sublist in all_dataset_dicts for item in sublist]
+
         # Create a metadata file
-        total_tokens = sum(dataset_dict["num_tokens"] for dataset_dict in dataset_dicts)
-        for dataset_dict in dataset_dicts:
-            dataset_dict["weight"] = float(dataset_dict["num_tokens"]) / float(total_tokens)
-        output_file = Path(self.output_dir) / "fast_llm_dataset.json"
-        json.dump({"datasets": dataset_dicts}, output_file.open("w"))
+        if self.rank == 0:
+            total_tokens = sum(dataset_dict["num_tokens"] for dataset_dict in dataset_dicts)
+            for dataset_dict in dataset_dicts:
+                dataset_dict["weight"] = float(dataset_dict["num_tokens"]) / float(total_tokens)
+            output_file = Path(self.output_dir) / "fast_llm_dataset.json"
+            output_file.write_text(json.dumps({"datasets": dataset_dicts}))
+
+        # Finalize distributed processing
+        if self.world_size > 1:
+            dist.barrier()
+            dist.destroy_process_group()
 
 
 if __name__ == "__main__":

From 4f6837806f71570d738c2b20cf5620baf9b2a584 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Sun, 27 Oct 2024 16:26:27 -0400
Subject: [PATCH 25/87] rewrite quick-start guide

---
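
Note on the "~300B" figure touched by this patch: it follows from the values in the quick-start config (`train_iters`, `batch_size`, `sequence_length`). A quick check of the arithmetic, using only those three numbers:

```python
train_iters = 600_000
batch_size = 480        # sequences per optimizer step
sequence_length = 1024  # tokens per sequence

tokens_per_batch = batch_size * sequence_length
total_tokens = train_iters * tokens_per_batch

print(f"{tokens_per_batch:,} tokens per batch")       # 491,520 tokens per batch
print(f"{total_tokens / 1e9:.1f}B training tokens")   # 294.9B training tokens
```
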
 docs/quick-start.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/quick-start.md b/docs/quick-start.md
index 158a79474..8bc23c388 100644
--- a/docs/quick-start.md
+++ b/docs/quick-start.md
@@ -136,7 +136,7 @@ run:
   experiment_dir: /app/results
 ```
 
-1. Total number of tokens will be ~300B.
+1. Total number of training tokens will be ~300B.
 2. Replace `servicenow` with your own W&B entity name.
 3. Adjust based on GPU memory. For GPT-2 and an A100-80GB, a `micro_batch_size` of 1 should work well.
 4. Should be a power of 2 and divisible by 8. For an A100-80GB, 1024 is a good starting point.

From 5cb0754e636a60100047f5546007f490ed4a4063 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Sun, 27 Oct 2024 16:35:03 -0400
Subject: [PATCH 26/87] add help page

---
 docs/help.md | 53 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 53 insertions(+)
 create mode 100644 docs/help.md

diff --git a/docs/help.md b/docs/help.md
new file mode 100644
index 000000000..6a01edb96
--- /dev/null
+++ b/docs/help.md
@@ -0,0 +1,53 @@
+---
+title: "Help"
+---
+
+Welcome to the Fast-LLM Help Center! Here you'll find solutions for common hiccups, references to dive deeper, tutorials, and a few pointers for when you need some extra support. Remember, everyone runs into a snag now and then. Let's sort them out together.
+
+---
+
+## Common Issues & Gotchas 🚧
+
+Let's get ahead of those pesky gotchas! Here's a list of common issues you might run into, with some quick fixes:
+
+- **Docker Permission Denied**: If Docker isn't playing nice, make sure your user has the right permissions. You may need to add yourself to the Docker group, or even (temporarily) use `sudo`.
+  
+- **CUDA Out of Memory**: When the GPU throws a fit, it's usually a batch size problem. Try reducing `batch_size` or freeing up GPU memory from other apps.
+
+- **`torchrun` Errors**: If you see something cryptic from `torchrun`, double-check that `torch` and `torchvision` are up-to-date. Compatibility issues can sometimes cause trouble.
+
+- **NCCL Errors**: NCCL errors can be a pain. Make sure your NCCL version matches the one Fast-LLM expects. If you're using a different version, you might need to tweak the environment variables.
+
+For a deeper dive, keep an eye on our GitHub Issues page, where other users might already have tackled similar issues.
+
+---
+
+## Reference 📚
+
+For the config nerds (you know who you are), we've got a detailed **Reference Guide** covering every configuration option under the sun. Need to tweak your optimizer settings, batch sizes, or distributed training parameters? It's all in the guide.
+
+---
+
+## Tutorials 👨‍🏫
+
+We've got a couple of excellent tutorials lined up:
+
+- **Quick-Start Guide**: Perfect for getting up and running on a single GPU machine. We cover setting up Docker, running your first training job, and basic troubleshooting.
+
+- **In-Action Guides**: Ready to go big? Check out the guides for setting up Fast-LLM with Slurm and Kubernetes to tackle multi-node training. This is where Fast-LLM truly shines.
+
+---
+
+## Still Stuck? Where to Find Help 🙋
+
+Sometimes, you've tried everything, and Fast-LLM still isn't cooperating. Here's what you can do:
+
+1. **GitHub Issues & Discussions**: This is your best friend. Use the search function to see if anyone has faced a similar issue. The community (and our team) is super active, so there's a good chance you'll find an answer or get help quickly.
+
+2. **Email (last resort)**: If all else fails, drop us a line at `fast-llm-team@servicenow.com`. Seriously, only use this in rare cases. We prefere to answer on GitHub where others can benefit from the conversation too.
+
+Fast-LLM is a growing community, and you're part of it now! Your questions help us improve, and who knows, you might just help the next person who runs into the same roadblock.
+
+---
+
+And that's it! We're excited to see what you build with Fast-LLM. Happy training!

From 7b93dce8de8554de65e7fc0aaacb675c874ee05a Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Sun, 27 Oct 2024 21:43:06 -0400
Subject: [PATCH 27/87] add help page

---
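
Note on the `PYTHONHASHSEED` entry added in this patch: Python randomizes string hashes per interpreter process unless the seed is pinned. The sketch below is a standalone illustration of that behaviour, not a reproduction of Fast-LLM's desync check; the helper name `string_hash_in_child` is hypothetical.

```python
import os
import subprocess
import sys


def string_hash_in_child(seed):
    """Return hash("fast-llm") as computed by a fresh interpreter process."""
    env = dict(os.environ)
    env.pop("PYTHONHASHSEED", None)
    if seed is not None:
        env["PYTHONHASHSEED"] = seed
    result = subprocess.run(
        [sys.executable, "-c", 'print(hash("fast-llm"))'],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


# With the seed pinned to 0, every process computes the same string hash.
pinned = {string_hash_in_child("0") for _ in range(3)}
print(len(pinned))  # 1

# Without a pinned seed, separate processes usually disagree.
unpinned = {string_hash_in_child(None) for _ in range(3)}
print(len(unpinned))
```
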
 docs/help.md | 44 ++++++++++++++++++++++++++++----------------
 1 file changed, 28 insertions(+), 16 deletions(-)

diff --git a/docs/help.md b/docs/help.md
index 6a01edb96..f27895f40 100644
--- a/docs/help.md
+++ b/docs/help.md
@@ -2,52 +2,64 @@
 title: "Help"
 ---
 
-Welcome to the Fast-LLM Help Center! Here you'll find solutions for common hiccups, references to dive deeper, tutorials, and a few pointers for when you need some extra support. Remember, everyone runs into a snag now and then. Let's sort them out together.
+Welcome to the Fast-LLM Help Center! Here, you'll find fixes for common hiccups, links to dig deeper, tutorials, and pointers for when you need some extra support. Remember, everyone hits a snag now and then. Let's sort them out together and get you back to training.
 
 ---
 
 ## Common Issues & Gotchas 🚧
 
-Let's get ahead of those pesky gotchas! Here's a list of common issues you might run into, with some quick fixes:
+Let's stay one step ahead of those pesky gotchas. Here's a list of common issues and quick fixes:
 
-- **Docker Permission Denied**: If Docker isn't playing nice, make sure your user has the right permissions. You may need to add yourself to the Docker group, or even (temporarily) use `sudo`.
+- **CUDA Out of Memory**: When the GPU throws a fit, a few tweaks can help. First, try lowering `micro_batch_size` or `sequence_length` in the configuration to fit within the available memory. Still stuck? Try setting the `mlp_recompute_level` option to `activation` to save memory in the backward pass, or experiment with higher ZeRO stages for reduced memory usage. And if that's not enough, tensor or model parallelism may be your friend. We've got a guide for this, so you're covered.
+
+- **Python Hash Seed Sync Error**: Encountering an error like
+
+    ```bash
+    RuntimeError: Desync detected for barrier train begin (66830148464 != 133042721120)
+    ```
   
-- **CUDA Out of Memory**: When the GPU throws a fit, it's usually a batch size problem. Try reducing `batch_size` or freeing up GPU memory from other apps.
+    points to a hashing inconsistency. To fix it, set `PYTHONHASHSEED=0` in your environment variables. This ensures consistent hashing across processes, keeping them in sync.
 
-- **`torchrun` Errors**: If you see something cryptic from `torchrun`, double-check that `torch` and `torchvision` are up-to-date. Compatibility issues can sometimes cause trouble.
+- **`torchrun` Timeout Errors**: If you see timeout errors related to `torchrun` during rendezvous, it could be DNS resolution or a networking issue. Check that all worker nodes are communicating properly with the master node.
 
-- **NCCL Errors**: NCCL errors can be a pain. Make sure your NCCL version matches the one Fast-LLM expects. If you're using a different version, you might need to tweak the environment variables.
+- **NCCL Errors with Timeout Messages**: Oh, the joys of NCCL errors! If you see something like
+
+    ```bash
+    Watchdog caught collective operation timeout: WorkNCCL(SeqNum=408951, OpType=_ALLGATHER_BASE, … , Timeout(ms)=600000) ran for 600351 milliseconds before timing out
+    ```
+  
+    appearing across all GPU workers, it usually means one or more hosts failed to complete a NCCL operation, causing others to block. NCCL errors can be frustrating to diagnose since they rarely specify which node or GPU caused the issue. We're working on improving this by surfacing which messages and operations are in progress during these crashes to better identify any problematic hosts or GPUs. Stay tuned!
 
-For a deeper dive, keep an eye on our GitHub Issues page, where other users might already have tackled similar issues.
+For more detailed solutions, check out our GitHub Issues page. Odds are someone's already tackled a similar problem, and you might find the exact fix you need.
 
 ---
 
 ## Reference 📚
 
-For the config nerds (you know who you are), we've got a detailed **Reference Guide** covering every configuration option under the sun. Need to tweak your optimizer settings, batch sizes, or distributed training parameters? It's all in the guide.
+If you're the type who loves configurations and tweaking every detail, the [**Configuration Reference**](reference/configuration) is for you. It covers every config option you could imagine, from optimizer settings to batch sizes to distributed training parameters. It's all in there.
 
 ---
 
 ## Tutorials 👨‍🏫
 
-We've got a couple of excellent tutorials lined up:
+We've got some excellent tutorials to help you get the most out of Fast-LLM:
 
-- **Quick-Start Guide**: Perfect for getting up and running on a single GPU machine. We cover setting up Docker, running your first training job, and basic troubleshooting.
+- [**Quick-Start Guide**](quick-start): Perfect for launching Fast-LLM on a single GPU machine. We walk you through setting up Docker, running your first training job, and handling common issues.
 
-- **In-Action Guides**: Ready to go big? Check out the guides for setting up Fast-LLM with Slurm and Kubernetes to tackle multi-node training. This is where Fast-LLM truly shines.
+- [**In-Action Guides**](in-action/slurm): Ready to go big? These guides cover setting up Fast-LLM with Slurm and Kubernetes for multi-node training. This is where Fast-LLM really shows its power.
 
 ---
 
 ## Still Stuck? Where to Find Help 🙋
 
-Sometimes, you've tried everything, and Fast-LLM still isn't cooperating. Here's what you can do:
+If Fast-LLM still isn't cooperating, here's where to look next:
 
-1. **GitHub Issues & Discussions**: This is your best friend. Use the search function to see if anyone has faced a similar issue. The community (and our team) is super active, so there's a good chance you'll find an answer or get help quickly.
+1. **GitHub [Issues](https://github.com/ServiceNow/Fast-LLM/issues) & [Discussions](https://github.com/ServiceNow/Fast-LLM/discussions)**: This is your best resource. Use the search function to see if anyone has run into the same issue. The community and our team are pretty active, so you'll likely find a solution or get help quickly.
 
-2. **Email (last resort)**: If all else fails, drop us a line at `fast-llm-team@servicenow.com`. Seriously, only use this in rare cases. We prefere to answer on GitHub where others can benefit from the conversation too.
+2. **Email (last resort)**: As a final option, you can email us at [fast-llm-team@servicenow.com](mailto:fast-llm-team@servicenow.com). This is only for rare cases, though. GitHub is our go-to for answering questions, as it lets others benefit from the conversation too.
 
-Fast-LLM is a growing community, and you're part of it now! Your questions help us improve, and who knows, you might just help the next person who runs into the same roadblock.
+Fast-LLM is a growing community, and your questions and contributions help make it better for everyone. Who knows, you might just solve the next person's roadblock!
 
 ---
 
-And that's it! We're excited to see what you build with Fast-LLM. Happy training!
+That's it! We're excited to see what you build with Fast-LLM. Happy training!

From b79129c0987957f1c1f77cd87400d1c73f64caf8 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Sun, 27 Oct 2024 22:03:59 -0400
Subject: [PATCH 28/87] add starcoder2 success story

---
 docs/index.md                       |  2 +-
 docs/success-stories/starcoder-2.md | 29 +++++++++++++++++++++++++++++
 mkdocs.yaml                         |  3 ++-
 3 files changed, 32 insertions(+), 2 deletions(-)
 create mode 100644 docs/success-stories/starcoder-2.md

diff --git a/docs/index.md b/docs/index.md
index 58f8f34c3..31e62d53f 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -54,7 +54,7 @@ Fast-LLM powers the world's most advanced AI projects:
 - **Enterprise AI Solutions:** Accelerate time-to-market for AI products by reducing training costs and enabling faster iteration.
 - **Academic Collaborations:** Drive AI innovation with high-performance training capabilities that support cutting-edge research in machine learning.
 
-See how Fast-LLM has helped early adopters achieve up to xx% faster results. [Explore use cases and success stories](success-stories).
+See how Fast-LLM has helped early adopters achieve up to xx% faster results. [Explore use cases and success stories](success-stories/starcoder-2).
 
 ## Project Scope and Objectives
 
diff --git a/docs/success-stories/starcoder-2.md b/docs/success-stories/starcoder-2.md
new file mode 100644
index 000000000..13c7f3f14
--- /dev/null
+++ b/docs/success-stories/starcoder-2.md
@@ -0,0 +1,29 @@
+---
+title: "StarCoder2"
+---
+
+2023 marked a big year for our team at ServiceNow Research as we embarked on training **StarCoder2**, an open-source LLM optimized for coding tasks. This project, an evolution of the StarCoder model, aimed to create a family of models capable of handling a wide array of programming languages, achieving performance comparable to (or even surpassing) larger models in some benchmarks.
+
+Through the year, we put Fast-LLM to the test on **NVIDIA's DGX SuperCloud**, using multiple **DGX A100-80GB nodes**. The Fast-LLM framework was developed specifically to optimize the training workflow for LLMs like StarCoder 2, combining **data parallelism** with **tensor parallelism** to maximize GPU utilization, minimize idle time, and maintain high throughput across all nodes. The framework's adaptable design allowed us to scale the model on a large compute cluster seamlessly, handling everything from distributed data loading to real-time monitoring and load balancing between compute nodes.
+
+Our goal was ambitious: to train the 3-billion-parameter StarCoder2 model on **The Stack V2** dataset, a large and diverse code corpus containing repositories across more than 600 programming languages, courtesy of the Software Heritage archive. This dataset provided real-world code examples and broad coverage of programming paradigms, ensuring that StarCoder2-3B could understand context-rich coding tasks with high precision and accuracy.
+
+## The Role of Fast-LLM
+
+Fast-LLM enabled us to achieve a training throughput of **10,000 tokens per second per A100-80GB GPU**, which allowed us to reduce the expected training time by **20%** compared to the Megatron framework. This boost in scalable efficiency was made possible by Fast-LLM's optimized data pipelines and balanced load distribution, ensuring minimal latency and consistent GPU saturation across all nodes. This performance demonstrates Fast-LLM's capacity to handle a model of this scale with impressive efficiency and stability, setting a new benchmark for training large language models.
+
+Fast-LLM's adaptability shone as we trained StarCoder2-3B with a **Fill-in-the-Middle (FIM) objective**, a novel approach for the model to generate and complete code snippets in a contextually relevant way. FIM training requires dynamically structured data inputs and, therefore, efficient shuffling and sample handling—all handled seamlessly by Fast-LLM.
+
+## Technical Highlights
+
+- **16K Context Window**: StarCoder2-3B boasts a **16,384 token context window**. That's four times the length of the original model. With Fast-LLM, we integrated Grouped Query Attention (GQA) to achieve this, allowing the model to retain context over extensive code snippets, conversations, and documentation.
+  
+- **Dynamic Dataset Handling**: Training with The Stack V2 posed challenges; the dataset's sheer size and variety required Fast-LLM's efficient, adaptive sharding and fast sample batching. These features allowed us to effectively leverage our compute resources, creating a streamlined experience when dealing with billions of code tokens.
+
+- **High Throughput on DGX Nodes**: Although we're awaiting precise throughput metrics, preliminary tests showed that Fast-LLM allowed each node to perform at peak efficiency, even when processing over **4 trillion tokens** in total.
+
+## The Road Ahead
+
+The results of StarCoder2-3B have set the stage for ongoing innovation. With Fast-LLM's framework as a robust foundation, we are now exploring fine-tuning for even more targeted code generation applications, building models that offer immediate utility across development, deployment, and debugging tasks. StarCoder2-3B's performance and versatility stand as a testament to the power of Fast-LLM and to the incredible potential of open-source models in advancing the AI landscape.
+
+For more insights and technical details, please refer to our publications on StarCoder2 and Fast-LLM [Cite relevant papers].
diff --git a/mkdocs.yaml b/mkdocs.yaml
index 97a0f454e..1b1daab65 100644
--- a/mkdocs.yaml
+++ b/mkdocs.yaml
@@ -165,7 +165,8 @@ nav:
     - In Action:
       - On Slurm: in-action/slurm.md
       - On Kubernetes: in-action/kubernetes.md
-    - Success Stories: success-stories.md
+    - Success Stories:
+      - StarCoder 2: success-stories/starcoder-2.md
     - License: license.md
   - Examples:
     - Data Preparation: examples/data-preparation.md

From 4369066de0a84c9a0390c0b42d9d956719308713 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Mon, 28 Oct 2024 03:13:14 -0400
Subject: [PATCH 29/87] add starcoder2 success story

---
 docs/about-us.md                    | 12 +-----
 docs/help.md                        |  4 +-
 docs/index.md                       |  4 +-
 docs/join-us.md                     | 65 +++++++++++++++++++++++++++++
 docs/quick-start.md                 | 58 +++++++++++++------------
 docs/refs.bib                       | 13 ++++++
 docs/success-stories/starcoder-2.md | 26 +++++-------
 mkdocs.yaml                         |  6 ++-
 setup.cfg                           |  2 +
 9 files changed, 134 insertions(+), 56 deletions(-)
 create mode 100644 docs/refs.bib

diff --git a/docs/about-us.md b/docs/about-us.md
index 7cad79222..8ae43542e 100644
--- a/docs/about-us.md
+++ b/docs/about-us.md
@@ -1,5 +1,7 @@
 ---
 title: About Us
+hide:
+  - navigation
 ---
 
 Welcome to Fast-LLM! We are a global team of engineers, researchers, and AI professionals led by the Foundation Models Lab at [ServiceNow Research](https://www.servicenow.com/research/), dedicated to advancing large language models (LLMs) and providing the highest-performance tools for serious users. Designed with professionals, research institutions, and enterprises in mind, Fast-LLM offers the speed, scalability, and flexibility needed to train the biggest and most complex models. Our commitment to open-source ensures that you have full control over your workflows, without the limitations or compromises of commercial frameworks.
@@ -30,13 +32,3 @@ Fast-LLM is led by the Foundation Models Lab at [ServiceNow Research](https://ww
 - [**Torsten Scholak**](https://www.servicenow.com/research/author/torsten-scholak.html) - Research Lead, ServiceNow Research: Torsten leads our research efforts, driving the scientific innovations that keep Fast-LLM at the forefront of AI training.
 
 Our core team includes members affiliated with ServiceNow Research, as well as other contributors who bring unique perspectives and skills to the project. We welcome new participants from the broader AI community who share our vision of creating the best tools for training large-scale language models.
-
-## Get Involved
-
-Fast-LLM is an open-source project that thrives on collaboration. If you're a professional or researcher looking to contribute, there are many ways to get involved:
-
-- **Code Contributions:** Dive into our [contribution guidelines](https://github.com/ServiceNow/Fast-LLM/blob/main/CONTRIBUTING.md) to learn how you can help improve Fast-LLM.
-- **Discussion and Ideas:** Join us on [GitHub Discussions](https://github.com/ServiceNow/Fast-LLM/discussions) to share your insights, ask questions, or discuss new features.
-- **Documentation and Tutorials:** Help us expand our [documentation](https://servicenow.github.io/Fast-LLM/), making it even more valuable for other professionals.
-
-If you're serious about training large language models, Fast-LLM is here to help you push the limits. We look forward to your contributions and feedback as we continue to make LLM training faster and better.
diff --git a/docs/help.md b/docs/help.md
index f27895f40..6bb586eba 100644
--- a/docs/help.md
+++ b/docs/help.md
@@ -44,9 +44,9 @@ If you're the type who loves configurations and tweaking every detail, the [**Co
 
 We've got some excellent tutorials to help you get the most out of Fast-LLM:
 
-- [**Quick-Start Guide**](quick-start): Perfect for launching Fast-LLM on a single GPU machine. We walk you through setting up Docker, running your first training job, and handling common issues.
+- [**Quick-Start Guide**](/quick-start): Perfect for launching Fast-LLM on a single GPU machine. We walk you through setting up Docker, running your first training job, and handling common issues.
 
-- [**In-Action Guides**](in-action/slurm): Ready to go big? These guides cover setting up Fast-LLM with Slurm and Kubernetes for multi-node training. This is where Fast-LLM really shows its power.
+- [**In-Action Guides**](/in-action/slurm): Ready to go big? These guides cover setting up Fast-LLM with Slurm and Kubernetes for multi-node training. This is where Fast-LLM really shows its power.
 
 ---
 
diff --git a/docs/index.md b/docs/index.md
index 31e62d53f..556c119a4 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -1,5 +1,7 @@
 ---
 title: "Fast-LLM: Train Large Language Models Faster Than Ever Before"
+hide:
+  - navigation
 ---
 
 Welcome to **Fast-LLM**, the cutting-edge open-source library built for training large language models (LLMs) with **unmatched speed, scalability, and cost-efficiency**. Developed by [ServiceNow Research](https://www.servicenow.com/research/)'s Foundation Models Lab, Fast-LLM is engineered to meet the rigorous demands of professional AI teams, research institutions, and enterprises pushing the limits of generative AI. **Achieve groundbreaking research and high-stakes production goals faster with Fast-LLM.**
@@ -77,6 +79,6 @@ Fast-LLM is more than just software, it's a community. Get involved by exploring
 
 ## Getting Started
 
-Ready to dive in? Check out our [quickstart guide](quick-start) for an overview of how to set up and run Fast-LLM on different platforms, including [Slurm](https://slurm.schedmd.com) and [Kubernetes](https://kubernetes.io). Explore the [examples](https://github.com/ServiceNow/Fast-LLM/tree/main/examples) for pre-configured setups to help you get started quickly with your own training experiments.
+Ready to dive in? Check out our [quick-start guide](quick-start) for an overview of how to set up and run Fast-LLM on different platforms, including [Slurm](https://slurm.schedmd.com) and [Kubernetes](https://kubernetes.io). Explore the [examples](https://github.com/ServiceNow/Fast-LLM/tree/main/examples) for pre-configured setups to help you get started quickly with your own training experiments.
 
 For any questions or issues, open an [issue](https://github.com/ServiceNow/Fast-LLM/issues) or join the [community discussion](https://github.com/ServiceNow/Fast-LLM/discussions).
diff --git a/docs/join-us.md b/docs/join-us.md
index e69de29bb..510216329 100644
--- a/docs/join-us.md
+++ b/docs/join-us.md
@@ -0,0 +1,65 @@
+---
+title: Join Us
+hide:
+  - navigation
+---
+
+Fast-LLM is an open-source project driven by a community of passionate contributors. Whether you're a researcher, developer, or AI enthusiast, there's a place for you to make a real impact on the future of large-scale AI training. Join us, dive in, and help shape the tools that push the boundaries of language model training. Here's how you can get involved:
+
+---
+
+## Stay in the Loop 📬
+
+Want to keep up with the latest Fast-LLM updates and new opportunities to get involved? **Star** the Fast-LLM repository on GitHub and **watch** the project for notifications on new releases, discussions, and updates. This way, you'll always know what's happening, from new features to community initiatives.
+
+[Star](https://github.com/ServiceNow/Fast-LLM/stargazers) ⭐ and [Watch](https://github.com/ServiceNow/Fast-LLM/subscription) 👀 the Fast-LLM repo on GitHub to stay updated on new releases, discussions, and upcoming features.
+
+---
+
+## Code Contributions 🛠
+
+Fast-LLM thrives on collaboration, and we're excited to welcome new contributors! From fixing bugs to adding new features, every code contribution makes a difference. If you're just getting started, our [**Good First Issues**](https://github.com/ServiceNow/Fast-LLM/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22) on GitHub are labeled to help newcomers find approachable tasks. To set up your development environment and get oriented with Fast-LLM, check out our **Developer's Corner** for everything you need:
+
+- [**Contributing**](developers/contributing) – for setup instructions and contributing guidelines
+- [**Best Practices**](developers/best-practices) – for tips on writing clean, maintainable code
+
+Here's a quick overview of the process:
+
+1. **Fork & Clone**: Start by forking the repo and cloning it to your machine.
+2. **Set Up Your Dev Environment**: The Developer's Corner guides you through configuring your environment for maximum productivity.
+3. **Write Awesome Code**: Make your changes, document them, and follow our best practices.
+4. **Open a Pull Request**: Submit a PR to showcase your work and get feedback from our team and the community.
+
+[Explore the Developer's Corner for everything you need to get started!](developers)
+
+---
+
+## Feature Requests & Ideas 💡
+
+Got a great idea? We want to hear it! Whether it's a new feature, an enhancement, or even a moonshot idea, head over to **GitHub Discussions** to share your thoughts. Community feedback drives Fast-LLM's evolution, and your ideas can help shape the future of the project.
+
+[Share your thoughts on GitHub Discussions](https://github.com/ServiceNow/Fast-LLM/discussions)
+
+---
+
+## Testing & Feedback 🔍
+
+Your experience with Fast-LLM is invaluable, whether you're running it in production or experimenting at home. We rely on user feedback to find bugs, optimize performance, and improve documentation. Please share any bugs, performance quirks, or gaps you spot with us on GitHub Issues. This kind of feedback strengthens the entire project.
+
+[Report issues and share feedback on GitHub](https://github.com/ServiceNow/Fast-LLM/issues)
+
+---
+
+## Help & Support 🤝
+
+Love helping others? Join our **GitHub Discussions** to answer questions, help troubleshoot, or share tips. Fast-LLM is a community, and the more we support each other, the stronger we become. Helping out is a great way to get involved and learn from others too.
+
+---
+
+## Spread the Word 📣
+
+If you're excited about Fast-LLM, let the world know! Share on social media, write a blog post, or give a talk at your next tech meetup. Spreading the word helps grow our community and brings new talent into the project.
+
+---
+
+Let's push the boundaries of large-scale AI training together. We're thrilled to have you here. Welcome to the Fast-LLM community!
diff --git a/docs/quick-start.md b/docs/quick-start.md
index 8bc23c388..abc90d41d 100644
--- a/docs/quick-start.md
+++ b/docs/quick-start.md
@@ -26,30 +26,7 @@ Let's create folders to store our input data and output results:
 mkdir ~/inputs ~/results
 ```
 
-## Step 3: Preparing the Training Data
-
-For this tutorial, we'll use 9B tokens of text from the [OpenWebText](https://skylion007.github.io/OpenWebTextCorpus/) dataset. This dataset is a free approximation of the WebText data OpenAI used for GPT-2, and it's perfect for our setup!
-
-We've got a script that'll download and preprocess the dataset for you. Run it like this:
-
-!!! info inline end "What's Happening Here?"
-
-    This will grab the OpenWebText data, tokenize it with the GPT-2 tokenizer, and save it in 91 shards of 100M tokens each. Expect around 2 hours for the whole thing to finish, mainly due to tokenization. If you've got more CPU cores, try upping `num_processes_*` to speed things up.
-
-```bash
-python tools/prepare_dataset.py \                          
-    tokenizer_path_or_name="gpt2" \             
-    dataset_name_or_path="openwebtext" \                                       
-    dataset_split="train" \
-    dataset_field="text" \
-    output_dir="inputs" \ 
-    num_processes_load=4 \
-    num_processes_map=4 \
-    num_processes_save=4 \
-    num_tokens_per_shard=100000000
-```
-
-## Step 4: Choose Your Model
+## Step 3: Choose Your Model
 
 Fast-LLM supports many GPT variants, including (but not limited to) GPT-2, Llama, Mistral, and Qwen. For this tutorial, let's train the GPT-2 model from scratch with Fully Sharded Data Parallelism (FSDP). We'll grab a configuration file from Huggingface Hub and save it as `~/inputs/config.json`:
 
@@ -81,6 +58,34 @@ Fast-LLM supports many GPT variants, including (but not limited to) GPT-2, Llama
 
     Smaller models like GPT-2 (124M) will train relatively quickly, especially if you've only got a few GPUs. But if you're feeling adventurous (and patient), give the larger models a shot!
 
+## Step 4: Preparing the Training Data
+
+For this tutorial, we'll use 9B tokens of text from the [OpenWebText](https://skylion007.github.io/OpenWebTextCorpus/) dataset. This dataset is a free approximation of the WebText data OpenAI used for GPT-2, and it's perfect for our setup!
+
+We've got a script that'll download and preprocess the dataset for you. Run it like this:
+
+```bash
+docker run -it --rm -v ~/inputs:/app/inputs \
+    ghcr.io/servicenow/fast-llm:latest \
+    python tools/prepare_dataset.py \
+    tokenizer_path_or_name="gpt2" \
+    dataset_name_or_path="openwebtext" \
+    dataset_split="train" \
+    output_dir="inputs" \
+    num_processes_load=4 \
+    num_processes_map=4 \
+    num_processes_save=4 \
+    num_tokens_per_shard=100000000
+```
+
+!!! info "What's Happening Here?"
+
+    This will grab the OpenWebText data, tokenize it with the GPT-2 tokenizer, and save it in 91 shards of 100M tokens each. Expect around 2 hours for the whole thing to finish, mainly due to tokenization. If you've got more CPU cores, try upping `num_processes_*` to speed things up.
+
+!!! warning "Tokenizer Mismatch"
+
+    If you chose a different model in Step 3, make sure to adjust the `tokenizer_path_or_name` parameter to match the model's tokenizer.
+
 ## Step 5: Set Up Your Training Configuration
 
 Next, we'll create a configuration file for Fast-LLM. Save the following as `~/inputs/fast-llm-config.yaml`:
@@ -167,10 +172,11 @@ docker run --gpus all -it --rm ghcr.io/servicenow/fast-llm:latest \
     -v ~/results:/app/results \
     -e PYTHONHASHSEED=0 \
     -e WANDB_API_KEY_PATH=/app/inputs/.wandb_api_key \
-    torchrun --nproc_per_node=8 --no_python fast-llm train gpt --config /app/inputs/fast-llm-config.yaml
+    torchrun --nproc_per_node=8 --no_python \
+    fast-llm train gpt --config /app/inputs/fast-llm-config.yaml
 ```
 
-!!! note
+!!! note "Python Hash Seed"
 
     Setting the Python hash seed to 0 ensures consistent, reproducible ordering in hash-dependent operations across processes, which is crucial for parallel computations.
 
diff --git a/docs/refs.bib b/docs/refs.bib
new file mode 100644
index 000000000..49b80c424
--- /dev/null
+++ b/docs/refs.bib
@@ -0,0 +1,13 @@
+@article{li2023starcoder,
+  title={Starcoder: may the source be with you!},
+  author={Li, Raymond and Allal, Loubna Ben and Zi, Yangtian and Muennighoff, Niklas and Kocetkov, Denis and Mou, Chenghao and Marone, Marc and Akiki, Christopher and Li, Jia and Chim, Jenny and others},
+  journal={arXiv preprint arXiv:2305.06161},
+  year={2023}
+}
+
+@article{lozhkov2024starcoder,
+  title={Starcoder 2 and the stack v2: The next generation},
+  author={Lozhkov, Anton and Li, Raymond and Allal, Loubna Ben and Cassano, Federico and Lamy-Poirier, Joel and Tazi, Nouamane and Tang, Ao and Pykhtar, Dmytro and Liu, Jiawei and Wei, Yuxiang and others},
+  journal={arXiv preprint arXiv:2402.19173},
+  year={2024}
+}
\ No newline at end of file
diff --git a/docs/success-stories/starcoder-2.md b/docs/success-stories/starcoder-2.md
index 13c7f3f14..22f84a822 100644
--- a/docs/success-stories/starcoder-2.md
+++ b/docs/success-stories/starcoder-2.md
@@ -2,28 +2,24 @@
 title: "StarCoder2"
 ---
 
-2023 marked a big year for our team at ServiceNow Research as we embarked on training **StarCoder2**, an open-source LLM optimized for coding tasks. This project, an evolution of the StarCoder model, aimed to create a family of models capable of handling a wide array of programming languages, achieving performance comparable to (or even surpassing) larger models in some benchmarks.
+2023 was a transformative year for ServiceNow Research's Foundation Models Lab. Partnering with [BigCode](https://www.bigcode-project.org), we set out to build **StarCoder2** [@lozhkov2024starcoder], an open-source language model designed specifically for coding tasks. This iteration of StarCoder [@li2023starcoder] has been built to handle a wide range of programming languages with performance on par with some larger models.
 
-Through the year, we put Fast-LLM to the test on **NVIDIA's DGX SuperCloud**, using multiple **DGX A100-80GB nodes**. The Fast-LLM framework was developed specifically to optimize the training workflow for LLMs like StarCoder 2, combining **data parallelism** with **tensor parallelism** to maximize GPU utilization, minimize idle time, and maintain high throughput across all nodes. The framework's adaptable design allowed us to scale the model on a large compute cluster seamlessly, handling everything from distributed data loading to real-time monitoring and load balancing between compute nodes.
+Our goal was ambitious: to train the [3-billion-parameter StarCoder2 model](https://huggingface.co/bigcode/starcoder2-3b) on over **3 trillion tokens** from **The Stack V2**—a rich, diverse dataset compiled by BigCode from the Software Heritage archive. This data provided StarCoder2 with the breadth of real-world code examples and programming paradigms it needed to tackle complex coding tasks with high accuracy and deep contextual understanding.
 
-Our goal was ambitious: to train the 3-billion-parameter StarCoder2 model on **The Stack V2** dataset, a large and diverse code corpus containing repositories across more than 600 programming languages, courtesy of the Software Heritage archive. This dataset provided real-world code examples and broad coverage of programming paradigms, ensuring that StarCoder2-3B could understand context-rich coding tasks with high precision and accuracy.
+To bring StarCoder2 to life, we ran Fast-LLM on **NVIDIA's DGX SuperCloud**, utilizing **DGX A100-80GB nodes**. Fast-LLM allowed us to maximize GPU throughput and streamline our entire training pipeline, which made scaling StarCoder2's training across nodes a seamless experience.
 
-## The Role of Fast-LLM
+## How Fast-LLM Made StarCoder2 Possible
 
-Fast-LLM enabled us to achieve a training throughput of **10,000 tokens per second per A100-80GB GPU**, which allowed us to reduce the expected training time by **20%** compared to the Megatron framework. This boost in scalable efficiency was made possible by Fast-LLM's optimized data pipelines and balanced load distribution, ensuring minimal latency and consistent GPU saturation across all nodes. This performance demonstrates Fast-LLM's capacity to handle a model of this scale with impressive efficiency and stability, setting a new benchmark for training large language models.
+Fast-LLM was designed to maximize efficiency in large-scale language model training—especially for tasks like StarCoder2. Here's how Fast-LLM's capabilities helped us achieve our goals:
 
-Fast-LLM's adaptability shone as we trained StarCoder2-3B with a **Fill-in-the-Middle (FIM) objective**, a novel approach for the model to generate and complete code snippets in a contextually relevant way. FIM training requires dynamically structured data inputs and, therefore, efficient shuffling and sample handling—all handled seamlessly by Fast-LLM.
+- **Optimized Throughput and GPU Utilization**: Fast-LLM's data parallelism allowed each A100-80GB GPU to operate at its peak, sustaining a throughput of **10,000 tokens per second** per GPU. This boosted GPU utilization and cut training time by **20%** compared to other frameworks. Fast-LLM made sure every GPU cycle was used efficiently, cutting down on idle time across the board.
 
-## Technical Highlights
+- **Support for Long Contexts**: With Fast-LLM's built-in Grouped Query Attention (GQA), StarCoder2-3B was able to leverage a **16,384-token context window**. This is essential for code comprehension, where context often spans hundreds of lines or more. GQA enabled the model to hold extensive context across sequences, which translates into a better understanding of long code snippets, in-depth documentation, and detailed coding conversations.
 
-- **16K Context Window**: StarCoder2-3B boasts a **16,384 token context window**. That's four times the length of the original model. With Fast-LLM, we integrated Grouped Query Attention (GQA) to achieve this, allowing the model to retain context over extensive code snippets, conversations, and documentation.
-  
-- **Dynamic Dataset Handling**: Training with The Stack V2 posed challenges; the dataset's sheer size and variety required Fast-LLM's efficient, adaptive sharding and fast sample batching. These features allowed us to effectively leverage our compute resources, creating a streamlined experience when dealing with billions of code tokens.
+- **Fill-in-the-Middle (FIM) Training**: Fast-LLM supported FIM training objectives natively, allowing StarCoder2-3B to complete and understand code by predicting missing snippets in various contexts. This structure-focused training enhanced the model's performance, making it adept at understanding code structure, flow, and syntax.
 
-- **High Throughput on DGX Nodes**: Although we're awaiting precise throughput metrics, preliminary tests showed that Fast-LLM allowed each node to perform at peak efficiency, even when processing over **4 trillion tokens** in total.
+## The Takeaway
 
-## The Road Ahead
+StarCoder2-3B is the first large-scale, real-world demonstration of Fast-LLM's capabilities in specialized language model training. This project exemplifies how Fast-LLM not only powers large models but does so with adaptability and efficiency. It's not just about achieving results—it's about doing so in a way that's replicable and accessible to labs of all sizes.
 
-The results of StarCoder2-3B have set the stage for ongoing innovation. With Fast-LLM's framework as a robust foundation, we are now exploring fine-tuning for even more targeted code generation applications, building models that offer immediate utility across development, deployment, and debugging tasks. StarCoder2-3B's performance and versatility stand as a testament to the power of Fast-LLM and to the incredible potential of open-source models in advancing the AI landscape.
-
-For more insights and technical details, please refer to our publications on StarCoder2 and Fast-LLM [Cite relevant papers].
+With Fast-LLM, we've made a leap in efficiency and performance, setting the stage for future innovation in LLM training. This is just the beginning, and we're excited to see how Fast-LLM will continue to push the boundaries of language model development for coding and beyond.
diff --git a/mkdocs.yaml b/mkdocs.yaml
index 1b1daab65..211c25eca 100644
--- a/mkdocs.yaml
+++ b/mkdocs.yaml
@@ -155,10 +155,12 @@ plugins:
   - git-committers:
       repository: ServiceNow/Fast-LLM
       branch: main
+  - bibtex:
+      bib_file: "docs/refs.bib"
 
 nav:
-  - Fast-LLM:
-    - Welcome: index.md
+  - Welcome: index.md
+  - Get Started:
     - Quick Start: quick-start.md
     - Cost Efficiency: cost-efficiency.md
     - Help: help.md
diff --git a/setup.cfg b/setup.cfg
index 6c4f70fff..9c92cca93 100644
--- a/setup.cfg
+++ b/setup.cfg
@@ -48,6 +48,8 @@ DOCS =
     mkdocstrings[python]
     mkdocs-git-committers-plugin-2
     mkdocs-git-revision-date-localized-plugin
+    pypandoc_binary
+    mkdocs-bibtex
 
 [options.entry_points]
 console_scripts =

From 2262c650f8274e12825072308a3e8eebb14e9e56 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Mon, 28 Oct 2024 03:25:05 -0400
Subject: [PATCH 30/87] rewrite quick-start guide

---
 docs/quick-start.md | 46 +++++++++++++++++++++++----------------------
 1 file changed, 24 insertions(+), 22 deletions(-)
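
Note on the annotated values in the training configuration: the batch annotation ("480 corresponds to about 500k tokens") and the learning-rate annotations (linear warmup, then cosine decay from `base` to `minimum`) can be checked with a few lines of Python. This is a sketch under stated assumptions, not Fast-LLM's scheduler: the function name `learning_rate` is hypothetical, warmup is assumed to start from zero, and the cosine phase is assumed to span from the end of warmup to `decay_iterations`.

```python
import math


def learning_rate(
    step: int,
    base: float = 6.0e-4,
    minimum: float = 6.0e-5,
    warmup_iterations: int = 2000,
    decay_iterations: int = 600_000,
) -> float:
    """Linear warmup to `base`, then cosine decay to `minimum` (illustrative only)."""
    if step < warmup_iterations:
        # Assumption: warmup ramps linearly from 0 up to the peak learning rate.
        return base * step / warmup_iterations
    if step >= decay_iterations:
        # After the decay window the rate stays at its floor.
        return minimum
    # Fraction of the cosine phase completed, in [0, 1).
    progress = (step - warmup_iterations) / (decay_iterations - warmup_iterations)
    return minimum + 0.5 * (base - minimum) * (1.0 + math.cos(math.pi * progress))


# Tokens per batch for batch_size=480 at 1024 tokens per sequence.
tokens_per_batch = 480 * 1024

if __name__ == "__main__":
    print(tokens_per_batch)  # 491520, i.e. roughly 500k tokens
    for step in (0, 1000, 2000, 301_000, 600_000):
        print(step, learning_rate(step))
```

With these defaults, `minimum` is one tenth of `base`, matching the annotation on the `minimum` field.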

diff --git a/docs/quick-start.md b/docs/quick-start.md
index abc90d41d..1ec4473a6 100644
--- a/docs/quick-start.md
+++ b/docs/quick-start.md
@@ -117,21 +117,22 @@ batch:
   batch_size: 480  # (5)!
 data:
   format: file
-  split: [998, 2, 0]  # (6)!
+  path: /app/inputs/fast_llm_dataset.json  # (6)!
+  split: [998, 2, 0]  # (7)!
 optimizer:
-  weight_decay: 0.1  # (7)!
-  beta_1: 0.9  # (8)!
-  beta_2: 0.95  # (9)!
+  weight_decay: 0.1  # (8)!
+  beta_1: 0.9  # (9)!
+  beta_2: 0.95  # (10)!
   learning_rate:
-    base: 6.0e-04  # (10)!
-    minimum: 6.0e-05  # (11)!
-    decay_style: cosine  # (12)!
-    decay_iterations: 600_000  # (13)!
-    warmup_iterations: 2000  # (14)!
+    base: 6.0e-04  # (11)!
+    minimum: 6.0e-05  # (12)!
+    decay_style: cosine  # (13)!
+    decay_iterations: 600_000  # (14)!
+    warmup_iterations: 2000  # (15)!
 pretrained:
   format: huggingface
-  path: /app/inputs  # (15)!
-  load_weights: False  # (16)!
+  path: /app/inputs  # (16)!
+  load_weights: False  # (17)!
 model:
   multi_stage:
     zero_stage: 2
@@ -146,17 +147,18 @@ run:
 3. Adjust based on GPU memory. For GPT-2 and an A100-80GB, a `micro_batch_size` of 1 should work well.
 4. Should be a power of 2 and divisible by 8. For an A100-80GB, 1024 is a good starting point.
 5. Must be divisible by number of GPUs. At 1024 tokens per sequence, 480 corresponds to about ~500k tokens per batch.
-6. 99.8% train, 0.2% validation, 0% test.
-7. L2 regularization penalty.
-8. 1st Adam optimizer parameter.
-9. 2nd Adam optimizer parameter.
-10. Peak learning rate.
-11. Should be 1/10th of base per Chinchilla.
-12. Cosine decay starting at `base` after warmup and ending at `minimum` after `decay_iterations`.
-13. Usually the same as `train_iters`.
-14. Number of steps of linear warmup.
-15. Location of the `config.json` file downloaded in Step 4.
-16. Set to `False` to train from scratch.
+6. Location of the dataset metadata file generated in Step 4.
+7. 99.8% train, 0.2% validation, 0% test.
+8. L2 regularization penalty.
+9. 1st Adam optimizer parameter.
+10. 2nd Adam optimizer parameter.
+11. Peak learning rate.
+12. Should be 1/10th of base per Chinchilla.
+13. Cosine decay starting at `base` after warmup and ending at `minimum` after `decay_iterations`.
+14. Usually the same as `train_iters`.
+15. Number of steps of linear warmup.
+16. Location of the `config.json` file downloaded in Step 3.
+17. Set to `False` to train from scratch.
 
 ## Step 6: Add Your Weights & Biases API Key
 

From 60cc57a3e3e9da88a54639d336d1480a0c6285ee Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Tue, 29 Oct 2024 14:23:29 -0400
Subject: [PATCH 31/87] add disclaimer

---
 docs/cost-efficiency.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/docs/cost-efficiency.md b/docs/cost-efficiency.md
index 17cb1b6a5..7e8d77712 100644
--- a/docs/cost-efficiency.md
+++ b/docs/cost-efficiency.md
@@ -8,6 +8,10 @@ Fast-LLM is built for speed and scalability to minimize training costs. Its adva
 
 To showcase the cost-saving potential of Fast-LLM, we've compared the cost of training a language model across various frameworks for different scenarios. For these calculations, we assume a cost of **USD 2.50 per H100 GPU per hour**.
 
+!!! note "Disclaimer"
+
+    All comparisons were conducted with identical model configurations and training setups across frameworks to maintain fairness. We optimized training parameters within each framework to achieve the best possible performance. Detailed configuration files are available in the footnotes for reference. If you have questions about our methods, assumptions, or suggestions for enhancing performance on any framework, please contact us at [fast-llm-team@servicenow.com](mailto:fast-llm-team@servicenow.com).
+
 ### Scenario Comparison: Training Costs and Token Efficiency
 
 The tables below provide a comparison of training costs for three different model setups, including costs for training on **1 trillion tokens** and the total tokens trained within a **$100,000 budget**.

From 6a76020768a385c73285f5f980e4e35a5ea5f673 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Tue, 29 Oct 2024 14:26:21 -0400
Subject: [PATCH 32/87] add build instructions

---
 docs/README.md | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/docs/README.md b/docs/README.md
index c70897016..9f7f6d68e 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -6,6 +6,32 @@ This folder contains the source files for the Fast-LLM documentation. The conten
 
 To view the complete, rendered documentation, please visit the [Fast-LLM Documentation Site](https://servicenow.github.io/Fast-LLM).
 
+## Building and Serving the Documentation
+
+To build and preview the documentation locally, follow these simple steps:
+
+1. **Install the necessary dependencies:**
+
+   ```bash
+   pip install -e ".[DOCS]"
+   ```
+
+2. **Build the documentation:**
+
+   ```bash
+   mkdocs build
+   ```
+
+   This will generate the static documentation files in a `site/` folder.
+
+3. **Serve the documentation locally (with auto-reload):**
+
+   ```bash
+   mkdocs serve
+   ```
+
+   The documentation site will be served locally at [http://127.0.0.1:8000](http://127.0.0.1:8000), and any changes made to the source files will automatically trigger a rebuild.
+
 ## Contributing to the Documentation
 
 If you'd like to contribute to the Fast-LLM documentation, feel free to edit these source files and submit a pull request. The changes will be reflected on the rendered documentation site after they are merged into the `main` branch.

From 13e29bc10a50c69bfb4c81da7be0ae8315c577bf Mon Sep 17 00:00:00 2001
From: Sean Hughes <hughesthe1st@users.noreply.github.com>
Date: Wed, 30 Oct 2024 14:11:51 -0700
Subject: [PATCH 33/87] Update README.md

Added examples of contributions we welcome from the Fast-LLM community.
---
 docs/README.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/docs/README.md b/docs/README.md
index 9f7f6d68e..b0d94e310 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -35,3 +35,5 @@ To build and preview the documentation locally, follow these simple steps:
 ## Contributing to the Documentation
 
 If you'd like to contribute to the Fast-LLM documentation, feel free to edit these source files and submit a pull request. The changes will be reflected on the rendered documentation site after they are merged into the `main` branch.
+
+Your contributions could be as simple as correcting typos and spelling errors, improving existing content to give novice users more detail on a tricky step, or even adding new content that describes functionality with limited or no detailed coverage anywhere else. No matter how small, we value all contributions from the Fast-LLM community.

From f72ac0ea45779f5f5d86fe65f03f96878067ecc0 Mon Sep 17 00:00:00 2001
From: Sean Hughes <hughesthe1st@users.noreply.github.com>
Date: Wed, 30 Oct 2024 17:27:38 -0700
Subject: [PATCH 34/87] Update index.md

---
 docs/index.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/docs/index.md b/docs/index.md
index 556c119a4..ba0618c7a 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -4,21 +4,21 @@ hide:
   - navigation
 ---
 
-Welcome to **Fast-LLM**, the cutting-edge open-source library built for training large language models (LLMs) with **unmatched speed, scalability, and cost-efficiency**. Developed by [ServiceNow Research](https://www.servicenow.com/research/)'s Foundation Models Lab, Fast-LLM is engineered to meet the rigorous demands of professional AI teams, research institutions, and enterprises pushing the limits of generative AI. **Achieve groundbreaking research and high-stakes production goals faster with Fast-LLM.**
+Introducing **Fast-LLM**, the cutting-edge open-source library built for training large language models (LLMs) with **unmatched speed, scalability, and cost-efficiency**. Developed by [ServiceNow Research](https://www.servicenow.com/research/)'s Foundation Models Lab, Fast-LLM is engineered to meet the rigorous demands of professional AI researchers, AI/ML engineers, academic and industrial research institutions, and enterprise product development teams pushing the limits of generative AI. **Achieve groundbreaking research and high-stakes production goals faster with Fast-LLM.**
 
 [Start your journey with Fast-LLM](quick-start.md) and explore the future of LLM training. Dive into [real-world use cases](in-action/slurm.md) to see how Fast-LLM can elevate your training workflows.
 
 ## Why Fast-LLM?
 
-Fast-LLM is designed for professionals who demand exceptional performance in large-scale language model training. It goes beyond off-the-shelf solutions to deliver a **robust, flexible, and high-performance open-source alternative** to commercial frameworks like NVIDIA NeMo Megatron. Whether you're optimizing for speed, cost, or scalability, Fast-LLM helps you get the most out of your training resources.
+Fast-LLM is designed for professionals who demand exceptional performance and FLOPS efficiency in large-scale language model training on GPUs. Fast-LLM integrates effortlessly into existing ML pipelines and goes beyond off-the-shelf commercial frameworks, like NVIDIA NeMo Megatron, to deliver a **robust, flexible, and high-performance open-source alternative**. Whether you're optimizing for speed, cost, or scalability, Fast-LLM helps you get the most out of your training infrastructure.
 
 ### The Fast-LLM Advantage
 
 Fast-LLM isn't just another library, **it's a platform for powering the next generation of AI breakthroughs**. Here's what sets it apart:
 
-- **🚀 Purpose-Built for Small- and Large-Scale AI:** Optimized specifically for training language models of all sizes, Fast-LLM excels from **small models around 1B parameters to massive clusters** running 70B+ parameter models. Our kernels are fine-tuned for maximum throughput across this entire range, making Fast-LLM the go-to choice for diverse training needs.
+- **🚀 Purpose-Built for Small- and Large-Scale AI:** Optimized specifically for training language models of all sizes, Fast-LLM excels from **small models around 1B parameters to massive clusters running 70B+ parameter models**, with kernels that are fine-tuned for maximum throughput across this entire range. At 10B-parameter scale, Fast-LLM avoids costly 3D-paralelism through memory optimization techniques such as ZeRO and activation recomputation, whereas at 100B-parameter scale, Fast-LLM optimally supports 3D-parallelism; making Fast-LLM the go-to choice for diverse training needs.
 
-- **🧠 Unified Support for GPT-Like Architectures:** Unlike other frameworks that specialize in specific architectures, Fast-LLM **unifies all GPT-like model implementations** in a [single file](https://github.com/ServiceNow/Fast-LLM/blob/main/fast_llm/models/gpt/model.py). Whether you're training Llama, Mistral, Mixtral, StarCoder, or custom architectures, Fast-LLM adapts effortlessly.
+- **🧠 Unified Support for GPT-Like Architectures:** Unlike other frameworks that specialize in specific architectures, Fast-LLM **unifies all GPT-like model implementations** in a [single configuration file](https://github.com/ServiceNow/Fast-LLM/blob/main/fast_llm/models/gpt/model.py). Unlike HuggingFace transformers where every model has it's own, mostly independent implementation, Fast-LLM reduces coding and adapts effortlessly, even with custom architectures.
 
 - **💰 Cost Efficiency That Sets Fast-LLM Apart:**
 
@@ -28,7 +28,7 @@ Fast-LLM isn't just another library, **it's a platform for powering the next gen
 
   [Learn more about Fast-LLM's cost efficiency and see detailed comparisons](cost-efficiency).
 
-- **🔓 Openness Without Compromise:** Fast-LLM's open-source model ensures that you can **fully customize and extend the library** to fit your exact needs, without the restrictions of proprietary software. Developed transparently by a community of professionals on GitHub, every change is **openly discussed and vetted**, ensuring **trust and collaboration** as you innovate with confidence, knowing the entire development process is out in the open.
+- **🔓 Openness Without Compromise:** Fast-LLM's open-source approach ensures that you can **fully customize and extend the library** to fit your exact needs, without the restrictions of proprietary software. Developed transparently by a community of experts on GitHub, every change is **publicly discussed and vetted**, fostering **trust and collaboration** so you can innovate with confidence, knowing the entire development process and decision making is out in the open.
 
 - **🌍 Community-Driven Development:** Built by professionals for professionals, Fast-LLM's development is transparent, with an open invitation to the community to contribute. [**Join the Fast-LLM community**](join-us) to help shape the future of large-scale AI training.
 
@@ -46,7 +46,7 @@ Fast-LLM offers all the capabilities you need to accelerate your LLM training an
 
 - **🛠️ Professional-Grade Tools:** Enjoy mixed precision training, large batch training, and gradient accumulation. Fast-LLM ensures reproducibility through deterministic behavior and provides pre-built Docker images, YAML configurations, and a simple, intuitive command-line interface.
 
-[Download Fast-LLM](https://github.com/ServiceNow/Fast-LLM/releases) and start training your large language models at full speed. [Join the Fast-LLM community](join-us) and collaborate with like-minded professionals to advance AI research and development.
+[Download Fast-LLM](https://github.com/ServiceNow/Fast-LLM/releases) and start training your large language models in record time. [Join the Fast-LLM community](join-us) and collaborate with like-minded professionals to advance the state-of-the-art in AI research and development.
 
 ## Use Cases and Success Stories
 

From 5a43ed5eaf719e7dd178e5630a7cc647d0384839 Mon Sep 17 00:00:00 2001
From: Sean Hughes <hughesthe1st@users.noreply.github.com>
Date: Wed, 30 Oct 2024 17:30:36 -0700
Subject: [PATCH 35/87] Update index.md

---
 docs/index.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/index.md b/docs/index.md
index ba0618c7a..64cb4e6a4 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -18,7 +18,7 @@ Fast-LLM isn't just another library, **it's a platform for powering the next gen
 
 - **🚀 Purpose-Built for Small- and Large-Scale AI:** Optimized specifically for training language models of all sizes, Fast-LLM excels from **small models around 1B parameters to massive clusters running 70B+ parameter models**, with kernels that are fine-tuned for maximum throughput across this entire range. At 10B-parameter scale, Fast-LLM avoids costly 3D-paralelism through memory optimization techniques such as ZeRO and activation recomputation, whereas at 100B-parameter scale, Fast-LLM optimally supports 3D-parallelism; making Fast-LLM the go-to choice for diverse training needs.
 
-- **🧠 Unified Support for GPT-Like Architectures:** Unlike other frameworks that specialize in specific architectures, Fast-LLM **unifies all GPT-like model implementations** in a [single configuration file](https://github.com/ServiceNow/Fast-LLM/blob/main/fast_llm/models/gpt/model.py). Unlike HuggingFace transformers where every model has it's own, mostly independent implementation, Fast-LLM reduces coding and adapts effortlessly, even with custom architectures.
+- **🧠 Unified Support for GPT-Like Architectures:** Fast-LLM **unifies all GPT-like model implementations** in a [single configuration file](https://github.com/ServiceNow/Fast-LLM/blob/main/fast_llm/models/gpt/model.py), and unlike HuggingFace transformers, where every model has it's own, mostly independent implementation, Fast-LLM reduces coding and adapts effortlessly, even with custom architectures.
 
 - **💰 Cost Efficiency That Sets Fast-LLM Apart:**
 

From bff65065de455d2fd6899b11537e638238345bc7 Mon Sep 17 00:00:00 2001
From: Sean Hughes <hughesthe1st@users.noreply.github.com>
Date: Wed, 30 Oct 2024 17:31:16 -0700
Subject: [PATCH 36/87] Update index.md

---
 docs/index.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/index.md b/docs/index.md
index 64cb4e6a4..f0adbc2c4 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -18,7 +18,7 @@ Fast-LLM isn't just another library, **it's a platform for powering the next gen
 
 - **🚀 Purpose-Built for Small- and Large-Scale AI:** Optimized specifically for training language models of all sizes, Fast-LLM excels from **small models around 1B parameters to massive clusters running 70B+ parameter models**, with kernels that are fine-tuned for maximum throughput across this entire range. At 10B-parameter scale, Fast-LLM avoids costly 3D-paralelism through memory optimization techniques such as ZeRO and activation recomputation, whereas at 100B-parameter scale, Fast-LLM optimally supports 3D-parallelism; making Fast-LLM the go-to choice for diverse training needs.
 
-- **🧠 Unified Support for GPT-Like Architectures:** Fast-LLM **unifies all GPT-like model implementations** in a [single configuration file](https://github.com/ServiceNow/Fast-LLM/blob/main/fast_llm/models/gpt/model.py), and unlike HuggingFace transformers, where every model has it's own, mostly independent implementation, Fast-LLM reduces coding and adapts effortlessly, even with custom architectures.
+- **🧠 Unified Support for GPT-Like Architectures:** Fast-LLM **unifies all GPT-like model implementations** in a [single configuration file](https://github.com/ServiceNow/Fast-LLM/blob/main/fast_llm/models/gpt/model.py), and unlike HuggingFace transformers where every model has it's own, mostly independent implementation, Fast-LLM reduces coding and adapts effortlessly, even with custom architectures.
 
 - **💰 Cost Efficiency That Sets Fast-LLM Apart:**
 

From 271d9d19945babaae8559cbd98b08f439955eece Mon Sep 17 00:00:00 2001
From: Sean Hughes <hughesthe1st@users.noreply.github.com>
Date: Wed, 30 Oct 2024 17:31:42 -0700
Subject: [PATCH 37/87] Update index.md

---
 docs/index.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/index.md b/docs/index.md
index f0adbc2c4..637d022c1 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -18,7 +18,7 @@ Fast-LLM isn't just another library, **it's a platform for powering the next gen
 
 - **🚀 Purpose-Built for Small- and Large-Scale AI:** Optimized specifically for training language models of all sizes, Fast-LLM excels from **small models around 1B parameters to massive clusters running 70B+ parameter models**, with kernels that are fine-tuned for maximum throughput across this entire range. At 10B-parameter scale, Fast-LLM avoids costly 3D-paralelism through memory optimization techniques such as ZeRO and activation recomputation, whereas at 100B-parameter scale, Fast-LLM optimally supports 3D-parallelism; making Fast-LLM the go-to choice for diverse training needs.
 
-- **🧠 Unified Support for GPT-Like Architectures:** Fast-LLM **unifies all GPT-like model implementations** in a [single configuration file](https://github.com/ServiceNow/Fast-LLM/blob/main/fast_llm/models/gpt/model.py), and unlike HuggingFace transformers where every model has it's own, mostly independent implementation, Fast-LLM reduces coding and adapts effortlessly, even with custom architectures.
+- **🧠 Unified Support for GPT-Like Architectures:** Fast-LLM **unifies all GPT-like model implementations** in a [single configuration file](https://github.com/ServiceNow/Fast-LLM/blob/main/fast_llm/models/gpt/model.py), and unlike HuggingFace transformers where every model has it's own, mostly independent, implementation, Fast-LLM reduces coding and adapts effortlessly, even with custom architectures.
 
 - **💰 Cost Efficiency That Sets Fast-LLM Apart:**
 

From c0b8959ac2f554daf54d6cffa65505642ffcde17 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Thu, 31 Oct 2024 10:06:58 -0400
Subject: [PATCH 38/87] remove unused blog

---
 docs/blog/index.md | 2 --
 mkdocs.yaml        | 2 +-
 2 files changed, 1 insertion(+), 3 deletions(-)
 delete mode 100644 docs/blog/index.md

diff --git a/docs/blog/index.md b/docs/blog/index.md
deleted file mode 100644
index c58f16c50..000000000
--- a/docs/blog/index.md
+++ /dev/null
@@ -1,2 +0,0 @@
-# Blog
-
diff --git a/mkdocs.yaml b/mkdocs.yaml
index 211c25eca..c56d5782b 100644
--- a/mkdocs.yaml
+++ b/mkdocs.yaml
@@ -138,7 +138,7 @@ markdown_extensions:
   - pymdownx.tilde
 
 plugins:
-  - blog
+  # - blog
   - mkdocstrings:
       default_handler: python
       handlers:

From 431eefac3b01faaccff17a990063f99d3592af34 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Thu, 31 Oct 2024 10:27:28 -0400
Subject: [PATCH 39/87] add markdownlint

---
 .markdownlint.yaml      | 25 +++++++++++++++++++++++++
 .pre-commit-config.yaml |  4 ++++
 2 files changed, 29 insertions(+)
 create mode 100644 .markdownlint.yaml

diff --git a/.markdownlint.yaml b/.markdownlint.yaml
new file mode 100644
index 000000000..f8a3ff7f7
--- /dev/null
+++ b/.markdownlint.yaml
@@ -0,0 +1,25 @@
+# See https://github.com/DavidAnson/markdownlint/blob/v0.32.1/schema/.markdownlint.yaml for schema documentation
+
+# Default state for all rules
+default: true
+
+# MD007/ul-indent : Unordered list indentation : https://github.com/DavidAnson/markdownlint/blob/v0.32.1/doc/md007.md
+MD007:
+  # Spaces for indent
+  indent: 4
+  # Whether to indent the first level of the list
+  start_indented: true
+  # Spaces for first level indent (when start_indented is set)
+  start_indent: 4
+
+# MD010/no-hard-tabs : Hard tabs : https://github.com/DavidAnson/markdownlint/blob/v0.32.1/doc/md010.md
+MD010:
+  # Include code blocks
+  code_blocks: false
+  # Fenced code languages to ignore
+  ignore_code_languages: []
+  # Number of spaces for each hard tab
+  spaces_per_tab: 4
+
+# MD013/line-length : Line length : https://github.com/DavidAnson/markdownlint/blob/v0.32.1/doc/md013.md
+MD013: false
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
index f8465c521..480b669b4 100644
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -48,3 +48,7 @@ repos:
         args:
             - "--config"
             - "./pyproject.toml"
+-   repo: https://github.com/markdownlint/markdownlint
+    rev: v0.11.0
+    hooks:
+    -   id: markdownlint

From f11bbdfe4f92c794902134c76e6c8c3280aff4e1 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Thu, 31 Oct 2024 10:34:13 -0400
Subject: [PATCH 40/87] add markdownlint

---
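Note: most of this patch is a mechanical list-marker change that follows from the MD030 setting added to `.markdownlint.yaml` below (three spaces after `-` and after `1.`). The following is a minimal sketch of that transformation, not the tool that produced the patch. The helper name `pad_list_markers` is hypothetical, and the sketch assumes single-line markers and skips fenced code blocks. It does not adjust nested-list indentation (MD007).

```python
import re

# Hypothetical helper, not part of Fast-LLM or markdownlint.
# Matches an optional indent, a list marker (-, *, + or "N."), and the
# whitespace that follows it, provided some item text comes next.
_MARKER = re.compile(r"^(\s*)([-*+]|\d+\.)[ \t]+(?=\S)")
_FENCE = "`" * 3  # code-fence delimiter, built indirectly to keep this block simple


def pad_list_markers(text: str, spaces: int = 3) -> str:
    """Rewrite '- item' / '1. item' as '-   item' / '1.   item'."""
    out = []
    in_fence = False
    for line in text.splitlines():
        # Toggle fence state so list-like lines inside code blocks are left alone.
        if line.lstrip().startswith(_FENCE):
            in_fence = not in_fence
        if not in_fence:
            line = _MARKER.sub(
                lambda m: f"{m.group(1)}{m.group(2)}{' ' * spaces}", line
            )
        out.append(line)
    return "\n".join(out)
```

Applied to `- [ ] item`, the sketch yields `-   [ ] item`, which is the shape of the `+` lines in the hunks below. Lines such as `---` are left unchanged because no whitespace follows the marker character.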
 .github/PULL_REQUEST_TEMPLATE.md    | 54 ++++++++---------
 .markdownlint.yaml                  | 15 +++--
 CODE_OF_CONDUCT.md                  | 40 ++++++------
 CONTRIBUTING.md                     | 48 +++++++--------
 README.md                           | 94 ++++++++++++++---------------
 SECURITY.md                         | 16 ++---
 docs/README.md                      |  6 +-
 docs/about-us.md                    | 14 ++---
 docs/cost-efficiency.md             |  6 +-
 docs/help.md                        | 16 ++---
 docs/index.md                       | 44 +++++++-------
 docs/join-us.md                     | 12 ++--
 docs/quick-start.md                 | 46 +++++++-------
 docs/success-stories.md             |  0
 docs/success-stories/starcoder-2.md |  6 +-
 15 files changed, 212 insertions(+), 205 deletions(-)
 delete mode 100644 docs/success-stories.md

diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md
index 43cca0b59..77cb0630a 100644
--- a/.github/PULL_REQUEST_TEMPLATE.md
+++ b/.github/PULL_REQUEST_TEMPLATE.md
@@ -9,21 +9,21 @@ Closes # <!-- Insert issue number here, if applicable -->
 
 Select all that apply:
 
-- [ ] 🐛 **Bug fix** (non-breaking change that addresses a specific issue)
-- [ ] 🚀 **New feature** (non-breaking change that adds functionality)
-- [ ] ⚠️ **Breaking change** (a change that could affect existing functionality)
-- [ ] 📈 **Performance improvement/optimization** (improves speed, memory usage, or efficiency)
-- [ ] 🛠️ **Code refactor** (non-functional changes that improve code readability, structure, etc.)
-- [ ] 📦 **Dependency bump** (updates dependencies, including Dockerfile or package changes)
-- [ ] 📝 **Documentation change** (updates documentation, including new content or typo fixes)
-- [ ] 🔧 **Infrastructure/Build change** (affects build process, CI/CD, or dependencies)
+-   [ ] 🐛 **Bug fix** (non-breaking change that addresses a specific issue)
+-   [ ] 🚀 **New feature** (non-breaking change that adds functionality)
+-   [ ] ⚠️ **Breaking change** (a change that could affect existing functionality)
+-   [ ] 📈 **Performance improvement/optimization** (improves speed, memory usage, or efficiency)
+-   [ ] 🛠️ **Code refactor** (non-functional changes that improve code readability, structure, etc.)
+-   [ ] 📦 **Dependency bump** (updates dependencies, including Dockerfile or package changes)
+-   [ ] 📝 **Documentation change** (updates documentation, including new content or typo fixes)
+-   [ ] 🔧 **Infrastructure/Build change** (affects build process, CI/CD, or dependencies)
 
 ## 📝 Changes
 
 List the key changes introduced in this PR:
 
-1. Change A
-2. Change B
+1.   Change A
+2.   Change B
 
 ## ✅ Checklist
 
@@ -31,32 +31,32 @@ Make sure the following tasks are completed before submitting the PR:
 
 ### General
 
-- [ ] 📜 I have read and followed the [contributing guidelines](https://github.com/ServiceNow/Fast-LLM/blob/main/CONTRIBUTING.md).
-- [ ] 🏷️ I am using a clear and descriptive title that follows the [PR title guidelines](https://servicenow.github.io/Fast-LLM/developers/pr-title-guidelines).
-- [ ] 🎉 The functionality is complete, and I have tested the changes.
-- [ ] 📝 I have updated the documentation if needed.
-- [ ] ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
-- [ ] 🧩 I have commented my code, especially in hard-to-understand areas.
+-   [ ] 📜 I have read and followed the [contributing guidelines](https://github.com/ServiceNow/Fast-LLM/blob/main/CONTRIBUTING.md).
+-   [ ] 🏷️ I am using a clear and descriptive title that follows the [PR title guidelines](https://servicenow.github.io/Fast-LLM/developers/pr-title-guidelines).
+-   [ ] 🎉 The functionality is complete, and I have tested the changes.
+-   [ ] 📝 I have updated the documentation if needed.
+-   [ ] ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
+-   [ ] 🧩 I have commented my code, especially in hard-to-understand areas.
 
 ### Dependencies and Configuration
 
-- [ ] 🐋 I have updated the Docker configuration or dependencies, if applicable.
-- [ ] 🔄 I have ensured compatibility with the existing setup after dependency changes.
+-   [ ] 🐋 I have updated the Docker configuration or dependencies, if applicable.
+-   [ ] 🔄 I have ensured compatibility with the existing setup after dependency changes.
 
 ### Testing
 
-- [ ] 🧪 I have added or updated tests to cover my changes.
-- [ ] ✔️ New and existing tests pass locally with my changes.
-- [ ] 🚦 I have tested these changes on GPUs and verified training stability.
-- [ ] 🏋️ I have tested the changes on realistic training workloads, if applicable.
+-   [ ] 🧪 I have added or updated tests to cover my changes.
+-   [ ] ✔️ New and existing tests pass locally with my changes.
+-   [ ] 🚦 I have tested these changes on GPUs and verified training stability.
+-   [ ] 🏋️ I have tested the changes on realistic training workloads, if applicable.
 
 ### Performance Impact
 
-- [ ] 📊 I have run benchmarks where applicable to evaluate the performance impact.
-- [ ] ✅ The benchmarks show no performance regression.
-- [ ] 🚀 The benchmarks indicate a potential performance improvement.
-- [ ] ⚠️ The benchmarks indicate a potential performance degradation.
-- [ ] 📈 I have provided benchmark results and detailed any performance impact below, if applicable.
+-   [ ] 📊 I have run benchmarks where applicable to evaluate the performance impact.
+-   [ ] ✅ The benchmarks show no performance regression.
+-   [ ] 🚀 The benchmarks indicate a potential performance improvement.
+-   [ ] ⚠️ The benchmarks indicate a potential performance degradation.
+-   [ ] 📈 I have provided benchmark results and detailed any performance impact below, if applicable.
 
 ## 📊 Performance Impact Details
 
diff --git a/.markdownlint.yaml b/.markdownlint.yaml
index f8a3ff7f7..547681254 100644
--- a/.markdownlint.yaml
+++ b/.markdownlint.yaml
@@ -7,10 +7,6 @@ default: true
 MD007:
   # Spaces for indent
   indent: 4
-  # Whether to indent the first level of the list
-  start_indented: true
-  # Spaces for first level indent (when start_indented is set)
-  start_indent: 4
 
 # MD010/no-hard-tabs : Hard tabs : https://github.com/DavidAnson/markdownlint/blob/v0.32.1/doc/md010.md
 MD010:
@@ -23,3 +19,14 @@ MD010:
 
 # MD013/line-length : Line length : https://github.com/DavidAnson/markdownlint/blob/v0.32.1/doc/md013.md
 MD013: false
+
+# MD030/list-marker-space : Spaces after list markers : https://github.com/DavidAnson/markdownlint/blob/v0.32.1/doc/md030.md
+MD030:
+  # Spaces for single-line unordered list items
+  ul_single: 3
+  # Spaces for single-line ordered list items
+  ol_single: 3
+  # Spaces for multi-line unordered list items
+  ul_multi: 3
+  # Spaces for multi-line ordered list items
+  ol_multi: 3
diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md
index e04230bbd..09229fce8 100644
--- a/CODE_OF_CONDUCT.md
+++ b/CODE_OF_CONDUCT.md
@@ -6,15 +6,15 @@ This code of conduct provides guidelines for participation in ServiceNow-managed
 
 Communities thrive when members support each other and provide useful feedback.
 
-- Be polite and courteous. Respect and treat others as you would expect to be treated yourself.
-- Respect your audience. Posts should not upset, annoy, threaten, harass, abuse or embarrass other members.
-- User Contributions must not include material that is defamatory, obscene, indecent, abusive, offensive, harassing, violent, hateful, inflammatory or otherwise objectionable.
-- Lively and collegial discussions are always encouraged in a healthy community. It is okay to argue facts but not okay to argue personalities or personal beliefs.
-- Do not use text formats such as all caps or bold that may be read as annoying, rude or send a strong message.
-- Do not publish anyone's private personal information without their explicit consent.
-- Avoid using abbreviations or terminology that others may not understand. An abbreviation may mean something to you but in another context or country, it may have another meaning.
-- Be accountable for your actions by correcting your mistakes and indicating where you have changed a previous post of yours.
-- Mark content as correct and helpful, and provide feedback. If you read a discussion post that you find helpful, we encourage you to leave a positive vote and comment in the replies. If you find a post that is unhelpful, please provide more information in the issue comments.
+-   Be polite and courteous. Respect and treat others as you would expect to be treated yourself.
+-   Respect your audience. Posts should not upset, annoy, threaten, harass, abuse or embarrass other members.
+-   User Contributions must not include material that is defamatory, obscene, indecent, abusive, offensive, harassing, violent, hateful, inflammatory or otherwise objectionable.
+-   Lively and collegial discussions are always encouraged in a healthy community. It is okay to argue facts but not okay to argue personalities or personal beliefs.
+-   Do not use text formats such as all caps or bold that may be read as annoying, rude or send a strong message.
+-   Do not publish anyone's private personal information without their explicit consent.
+-   Avoid using abbreviations or terminology that others may not understand. An abbreviation may mean something to you but in another context or country, it may have another meaning.
+-   Be accountable for your actions by correcting your mistakes and indicating where you have changed a previous post of yours.
+-   Mark content as correct and helpful, and provide feedback. If you read a discussion post that you find helpful, we encourage you to leave a positive vote and comment in the replies. If you find a post that is unhelpful, please provide more information in the issue comments.
 
 ## Issue board guidelines
 
@@ -22,20 +22,20 @@ Many open-source projects provide an Issues board, with similar functionality to
 
 ServiceNow suggests the following technical support pathways for open-source projects:
 
-1. Clearly identify and document the issue or question you have.
-2. View the Documentation.
-3. Search the Discussions.
-4. Search the project knowledge base or Wiki for known errors, useful solutions, and troubleshooting tips.
-5. Check the project guidelines in the [`CONTRIBUTING.md`](CONTRIBUTING.md) file if you would like details on how you can submit a change. Community contributions are valued and appreciated!
-6. Log an Issue if it hasn't already been logged. If the issue has already been logged by another user, vote it up, and add a comment with additional or missing information. Do your best to choose the correct category when logging a new issue. This will make it easier to differentiate bugs from new feature requests or ideas. If after logging an issue you find the solution, please close your issue and provide a comment with the solution. This will help the project owners and other users.
-7. Contact the project team contributors of the project to see if they can help as a last resort only.
+1.   Clearly identify and document the issue or question you have.
+2.   View the Documentation.
+3.   Search the Discussions.
+4.   Search the project knowledge base or Wiki for known errors, useful solutions, and troubleshooting tips.
+5.   Check the project guidelines in the [`CONTRIBUTING.md`](CONTRIBUTING.md) file if you would like details on how you can submit a change. Community contributions are valued and appreciated!
+6.   Log an Issue if it hasn't already been logged. If the issue has already been logged by another user, vote it up, and add a comment with additional or missing information. Do your best to choose the correct category when logging a new issue. This will make it easier to differentiate bugs from new feature requests or ideas. If after logging an issue you find the solution, please close your issue and provide a comment with the solution. This will help the project owners and other users.
+7.   Contact the project team contributors of the project to see if they can help as a last resort only.
 
 ## Repositories
 
-- Read and follow the license instructions
-- Remember to include citations if you use someone else's work in your own project. Use the [`CITATION.cff`](CITATION.cff) to find the correct project citation reference.
-- ‘Star' project repos to save for future reference.
-- ‘Watch' project repos to get notifications of changes – this can get noisy for some projects, so only watch the ones you really need to track closely.
+-   Read and follow the license instructions
+-   Remember to include citations if you use someone else's work in your own project. Use the [`CITATION.cff`](CITATION.cff) to find the correct project citation reference.
+-   ‘Star' project repos to save for future reference.
+-   ‘Watch' project repos to get notifications of changes – this can get noisy for some projects, so only watch the ones you really need to track closely.
 
 ## Enforcement and reporting
 
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 9ef1c1856..95a11edd4 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -8,18 +8,18 @@ If you have questions or want to start a discussion, feel free to [open a discus
 
 To get started with contributing to Fast-LLM, follow these steps to set up your environment:
 
-1. **Set Up the Development Environment**: Fast-LLM is built on [PyTorch](https://pytorch.org/) and [Triton](https://triton-lang.org/). Check out our [setup guide](https://servicenow.github.io/Fast-LLM/developers/setup) for instructions on getting everything ready, including the development environment and dependencies.
-2. **Learn Our Best Practices**: Get familiar with our [development best practices](https://servicenow.github.io/Fast-LLM/developers/dev-practices/), which cover code style, pre-commit hooks, and testing strategies.
-3. **Launch Fast-LLM Locally or with Docker**: Need help getting started? Follow the instructions in the [launching section](https://servicenow.github.io/Fast-LLM/developers/launching) to get Fast-LLM up and running.
+1.   **Set Up the Development Environment**: Fast-LLM is built on [PyTorch](https://pytorch.org/) and [Triton](https://triton-lang.org/). Check out our [setup guide](https://servicenow.github.io/Fast-LLM/developers/setup) for instructions on getting everything ready, including the development environment and dependencies.
+2.   **Learn Our Best Practices**: Get familiar with our [development best practices](https://servicenow.github.io/Fast-LLM/developers/dev-practices/), which cover code style, pre-commit hooks, and testing strategies.
+3.   **Launch Fast-LLM Locally or with Docker**: Need help getting started? Follow the instructions in the [launching section](https://servicenow.github.io/Fast-LLM/developers/launching) to get Fast-LLM up and running.
 
 ## How to Report a Bug 🐞
 
 Found a bug? Let's squash it together! [Open an issue](https://github.com/ServiceNow/Fast-LLM/issues/new/choose) and select "Bug report." Please include as much information as possible:
 
-- Steps to reproduce the issue.
-- What you expected to happen versus what actually happened.
-- Logs, Fast-LLM configuration, and error messages.
-- Details about your environment setup (e.g., CUDA hardware, PyTorch version, CUDA version).
+-   Steps to reproduce the issue.
+-   What you expected to happen versus what actually happened.
+-   Logs, Fast-LLM configuration, and error messages.
+-   Details about your environment setup (e.g., CUDA hardware, PyTorch version, CUDA version).
 
 If you're familiar with the codebase, consider adding a failing unit test to demonstrate the problem (optional, but helpful!).
 
@@ -27,33 +27,33 @@ If you're familiar with the codebase, consider adding a failing unit test to dem
 
 Before diving into code, [open an issue](https://github.com/ServiceNow/Fast-LLM/issues) to discuss your proposal. This is especially important if you're planning significant changes or adding new dependencies. Once your idea is approved, follow these steps:
 
-1. **Fork the Repository**: [Fork Fast-LLM](https://github.com/ServiceNow/Fast-LLM/fork) to your own GitHub account.
-2. **Clone Your Fork Locally**: Use `git clone` to bring the code to your local machine.
-3. **Create a New Branch**: Name your branch descriptively, such as `feature/awesome-feature` or `fix/nasty-bug`.
-4. **Make Your Changes**: Work your magic! Don't forget to add or update tests, benchmarks, or configurations as needed.
-5. **Create a Properly Titled Pull Request**: When you're ready to open a PR, make sure to use a clear and descriptive title that follows our [PR title guidelines](https://servicenow.github.io/Fast-LLM/developers/pr-title-guidelines). This title will become the commit message for the squashed merge.
-6. **Push to Your Fork**: Push the branch to your GitHub fork.
-7. **Open a Pull Request**: [Submit a pull request](https://github.com/ServiceNow/Fast-LLM/compare) to the `main` branch. Reference the original issue number and provide a brief summary of your changes.
+1.   **Fork the Repository**: [Fork Fast-LLM](https://github.com/ServiceNow/Fast-LLM/fork) to your own GitHub account.
+2.   **Clone Your Fork Locally**: Use `git clone` to bring the code to your local machine.
+3.   **Create a New Branch**: Name your branch descriptively, such as `feature/awesome-feature` or `fix/nasty-bug`.
+4.   **Make Your Changes**: Work your magic! Don't forget to add or update tests, benchmarks, or configurations as needed.
+5.   **Create a Properly Titled Pull Request**: When you're ready to open a PR, make sure to use a clear and descriptive title that follows our [PR title guidelines](https://servicenow.github.io/Fast-LLM/developers/pr-title-guidelines). This title will become the commit message for the squashed merge.
+6.   **Push to Your Fork**: Push the branch to your GitHub fork.
+7.   **Open a Pull Request**: [Submit a pull request](https://github.com/ServiceNow/Fast-LLM/compare) to the `main` branch. Reference the original issue number and provide a brief summary of your changes.
 
 ### Guidelines for a Successful Pull Request
 
 Here are some tips to ensure your pull request gets reviewed and merged promptly:
 
-- **Follow our coding standards**: Stick to our [development best practices](https://servicenow.github.io/Fast-LLM/developers/dev-practices/) to keep the code clean and consistent.
-- **Write tests**: Verify your changes with unit tests for new features or bug fixes.
-- **Test on GPUs and real-world workloads**: Since Fast-LLM is all about training large language models, make sure your changes work smoothly in GPU environments and on typical training setups.
-- **Run benchmarks and performance tests**: Make sure your changes don't slow things down. If there's any impact on performance, provide benchmark results to back it up.
-- **Avoid introducing new issues**: Check that there are no new runtime warnings, type checker errors, linting problems, or unhandled edge cases.
-- **Comment non-trivial code**: Make your code easy to understand for others.
-- **Keep sensitive data out**: Make sure your code or commit messages don't expose private or proprietary information.
-- **Use the [PR template](https://github.com/ServiceNow/Fast-LLM/blob/main/.github/PULL_REQUEST_TEMPLATE.md)**: Complete the checklist to make sure everything is in order before hitting submit.
+-   **Follow our coding standards**: Stick to our [development best practices](https://servicenow.github.io/Fast-LLM/developers/dev-practices/) to keep the code clean and consistent.
+-   **Write tests**: Verify your changes with unit tests for new features or bug fixes.
+-   **Test on GPUs and real-world workloads**: Since Fast-LLM is all about training large language models, make sure your changes work smoothly in GPU environments and on typical training setups.
+-   **Run benchmarks and performance tests**: Make sure your changes don't slow things down. If there's any impact on performance, provide benchmark results to back it up.
+-   **Avoid introducing new issues**: Check that there are no new runtime warnings, type checker errors, linting problems, or unhandled edge cases.
+-   **Comment non-trivial code**: Make your code easy to understand for others.
+-   **Keep sensitive data out**: Make sure your code or commit messages don't expose private or proprietary information.
+-   **Use the [PR template](https://github.com/ServiceNow/Fast-LLM/blob/main/.github/PULL_REQUEST_TEMPLATE.md)**: Complete the checklist to make sure everything is in order before hitting submit.
 
 ## Seeking Help or Clarification
 
 If you're unsure about something or need help, you've got options:
 
-- **GitHub Discussions**: [Start a discussion](https://github.com/ServiceNow/Fast-LLM/discussions) if you need advice or just want to chat.
-- **Project Maintainers**: Mention a maintainer in an issue or pull request if you need a review or guidance.
+-   **GitHub Discussions**: [Start a discussion](https://github.com/ServiceNow/Fast-LLM/discussions) if you need advice or just want to chat.
+-   **Project Maintainers**: Mention a maintainer in an issue or pull request if you need a review or guidance.
 
 ## Contributors
 
diff --git a/README.md b/README.md
index 9da114bb3..963f22118 100644
--- a/README.md
+++ b/README.md
@@ -25,36 +25,36 @@ As a truly open-source project, Fast-LLM allows full customization and extension
 
 ## Why Fast-LLM?
 
-1. 🚀 **Fast-LLM is Blazingly Fast**:
-    - ⚡️ Optimized kernel efficiency and reduced overheads.
-    - 🔋 Optimized memory usage for best performance.
-    - ⏳ Minimizes training time and cost.
-
-2. 📈 **Fast-LLM is Highly Scalable**:
-    - 📡 Distributed training across multiple GPUs and nodes using 3D parallelism (Data, Tensor, and Pipeline).
-    - 🔗 Supports sequence length parallelism to handle longer sequences effectively.
-    - 🧠 ZeRO-1, ZeRO-2, and ZeRO-3 implementations for improved memory efficiency.
-    - 🎛️ Mixed precision training support for better performance.
-    - 🏋️‍♂️ Large batch training and gradient accumulation support.
-    - 🔄 Reproducible training with deterministic behavior.
-
-3. 🎨 **Fast-LLM is Incredibly Flexible**:
-    - 🤖 Compatible with all common language model architectures in a unified class.
-    - ⚡ Efficient dropless Mixture-of-Experts (MoE) implementation with SoTA performance.
-    - 🧩 Customizable language model architectures, data loaders, loss functions, and optimizers (in progress).
-    - 🤗 Seamless integration with [Hugging Face Transformers][transformers].
-
-4. 🎯 **Fast-LLM is Super Easy to Use**:
-    - 📦 [Pre-built Docker images](https://github.com/ServiceNow/Fast-LLM/pkgs/container/fast-llm) for quick deployment.
-    - 📝 Simple YAML configuration for hassle-free setup.
-    - 💻 Command-line interface for easy launches.
-    - 📊 Detailed logging and real-time monitoring features.
-    - 📚 Extensive [documentation][docs] and practical tutorials (in progress).
-
-5. 🌐 **Fast-LLM is Truly Open Source**:
-    - ⚖️ Licensed under [Apache 2.0][license] for maximum freedom to use Fast-LLM at work, in your projects, or for research.
-    - 💻 Transparently developed on GitHub with public [roadmap][roadmap] and [issue tracking][issues].
-    - 🤝 Contributions and collaboration are always welcome!
+1.   🚀 **Fast-LLM is Blazingly Fast**:
+    -   ⚡️ Optimized kernel efficiency and reduced overheads.
+    -   🔋 Optimized memory usage for best performance.
+    -   ⏳ Minimizes training time and cost.
+
+2.   📈 **Fast-LLM is Highly Scalable**:
+    -   📡 Distributed training across multiple GPUs and nodes using 3D parallelism (Data, Tensor, and Pipeline).
+    -   🔗 Supports sequence length parallelism to handle longer sequences effectively.
+    -   🧠 ZeRO-1, ZeRO-2, and ZeRO-3 implementations for improved memory efficiency.
+    -   🎛️ Mixed precision training support for better performance.
+    -   🏋️‍♂️ Large batch training and gradient accumulation support.
+    -   🔄 Reproducible training with deterministic behavior.
+
+3.   🎨 **Fast-LLM is Incredibly Flexible**:
+    -   🤖 Compatible with all common language model architectures in a unified class.
+    -   ⚡ Efficient dropless Mixture-of-Experts (MoE) implementation with SoTA performance.
+    -   🧩 Customizable language model architectures, data loaders, loss functions, and optimizers (in progress).
+    -   🤗 Seamless integration with [Hugging Face Transformers][transformers].
+
+4.   🎯 **Fast-LLM is Super Easy to Use**:
+    -   📦 [Pre-built Docker images](https://github.com/ServiceNow/Fast-LLM/pkgs/container/fast-llm) for quick deployment.
+    -   📝 Simple YAML configuration for hassle-free setup.
+    -   💻 Command-line interface for easy launches.
+    -   📊 Detailed logging and real-time monitoring features.
+    -   📚 Extensive [documentation][docs] and practical tutorials (in progress).
+
+5.   🌐 **Fast-LLM is Truly Open Source**:
+    -   ⚖️ Licensed under [Apache 2.0][license] for maximum freedom to use Fast-LLM at work, in your projects, or for research.
+    -   💻 Transparently developed on GitHub with public [roadmap][roadmap] and [issue tracking][issues].
+    -   🤝 Contributions and collaboration are always welcome!
 
 ## Usage
 
@@ -71,14 +71,14 @@ Expect to see a significant speedup in training time compared to other libraries
 
 #### Prerequisites
 
-- A [Slurm](https://slurm.schedmd.com/) cluster with at least 4 DGX nodes with 8 A100-80GB or H100-80GB GPUs each.
-- CUDA 12.1 or higher.
-- Dependencies: [PyTorch][pytorch], [Triton][triton], and [Apex](https://github.com/NVIDIA/apex) installed on all nodes.
+-   A [Slurm](https://slurm.schedmd.com/) cluster with at least 4 DGX nodes with 8 A100-80GB or H100-80GB GPUs each.
+-   CUDA 12.1 or higher.
+-   Dependencies: [PyTorch][pytorch], [Triton][triton], and [Apex](https://github.com/NVIDIA/apex) installed on all nodes.
 
 #### Steps
 
-1. Deploy the [nvcr.io/nvidia/pytorch:24.07-py3](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) Docker image to all nodes (recommended), because it contains all the necessary dependencies.
-2. Install Fast-LLM on all nodes:
+1.   Deploy the [nvcr.io/nvidia/pytorch:24.07-py3](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) Docker image to all nodes (recommended), because it contains all the necessary dependencies.
+2.   Install Fast-LLM on all nodes:
 
     ```bash
     sbatch <<EOF
@@ -92,16 +92,16 @@ Expect to see a significant speedup in training time compared to other libraries
     EOF
     ```
 
-3. Use the example Slurm job script [examples/fast-llm.sbat](examples/fast-llm.sbat) to submit the job to the cluster:
+3.   Use the example Slurm job script [examples/fast-llm.sbat](examples/fast-llm.sbat) to submit the job to the cluster:
 
     ```bash
     sbatch examples/fast-llm.sbat
     ```
 
-4. Monitor the job's progress:
+4.   Monitor the job's progress:
 
-    - Logs: Follow `job_output.log` and `job_error.log` in your working directory for logs.
-    - Status: Use `squeue -u $USER` to see the job status.
+    -   Logs: Follow `job_output.log` and `job_error.log` in your working directory for logs.
+    -   Status: Use `squeue -u $USER` to see the job status.
 
 Now, you can sit back and relax while Fast-LLM trains your model at full speed! ☕
 
@@ -109,28 +109,28 @@ Now, you can sit back and relax while Fast-LLM trains your model at full speed!
 
 #### Prerequisites
 
-- A [Kubernetes](https://kubernetes.io/) cluster with at least 4 DGX nodes with 8 A100-80GB or H100-80GB GPUs each.
-- [KubeFlow](https://www.kubeflow.org/) installed.
-- Locked memory limit set to unlimited at the host level on all nodes. Ask your cluster admin to do this if needed.
+-   A [Kubernetes](https://kubernetes.io/) cluster with at least 4 DGX nodes with 8 A100-80GB or H100-80GB GPUs each.
+-   [KubeFlow](https://www.kubeflow.org/) installed.
+-   Locked memory limit set to unlimited at the host level on all nodes. Ask your cluster admin to do this if needed.
 
 #### Steps
 
-1. Create a Kubernetes [PersistentVolumeClaim](https://kubernetes.io/docs/concepts/storage/persistent-volumes/) (PVC) named `fast-llm-home` that will be mounted to `/home/fast-llm` in the container using [examples/fast-llm-pvc.yaml](examples/fast-llm-pvc.yaml):
+1.   Create a Kubernetes [PersistentVolumeClaim](https://kubernetes.io/docs/concepts/storage/persistent-volumes/) (PVC) named `fast-llm-home` that will be mounted to `/home/fast-llm` in the container using [examples/fast-llm-pvc.yaml](examples/fast-llm-pvc.yaml):
 
     ```bash
     kubectl apply -f examples/fast-llm-pvc.yaml
     ```
 
-2. Create a [PyTorchJob](https://www.kubeflow.org/docs/components/training/user-guides/pytorch/) resource using the example configuration file [examples/fast-llm.pytorchjob.yaml](examples/fast-llm.pytorchjob.yaml):
+2.   Create a [PyTorchJob](https://www.kubeflow.org/docs/components/training/user-guides/pytorch/) resource using the example configuration file [examples/fast-llm.pytorchjob.yaml](examples/fast-llm.pytorchjob.yaml):
 
     ```bash
     kubectl apply -f examples/fast-llm.pytorchjob.yaml
     ```
 
-3. Monitor the job status:
+3.   Monitor the job status:
 
-    - Use `kubectl get pytorchjobs` to see the job status.
-    - Use `kubectl logs -f fast-llm-master-0 -c pytorch` to follow the logs.
+    -   Use `kubectl get pytorchjobs` to see the job status.
+    -   Use `kubectl logs -f fast-llm-master-0 -c pytorch` to follow the logs.
 
 That's it! You're now up and running with Fast-LLM on Kubernetes. 🚀
 
diff --git a/SECURITY.md b/SECURITY.md
index e3a80c5b0..2a6258dc6 100644
--- a/SECURITY.md
+++ b/SECURITY.md
@@ -21,11 +21,11 @@ We will process your report as soon as possible, depending on the severity of yo
 
 Please follow the guidelines below when [disclosing vulnerabilities](https://www.servicenow.com/company/trust/privacy/responsible-disclosure.html):
 
-- Report any potential security issue as soon as possible. ServiceNow will make every effort to quickly resolve the issue.
-- Provide sufficient detail to reproduce the vulnerability, including proof of concept. The use of ReproNow to demonstrate reproducibility is encouraged but not required.
-- Please do not disclose an issue to the public or any third party until ServiceNow has resolved it.
-- Make a good faith effort to avoid privacy violations, data destruction, and interruption or degradation of our services. Only interact with accounts you own or have explicit permission from the account holder to access.
-- Redact any language or images that may identify the program or ServiceNow customers from information about a resolved vulnerability.
-- Do not engage in disruptive testing (such as Denial of Service attacks) or any action that could impact the confidentiality, integrity, or availability of information and systems.
-- Do not engage in social engineering or phishing against customers or employees.
-- Please do not request compensation for time, materials, or discovered vulnerabilities through the Responsible Disclosure Program.
+-   Report any potential security issue as soon as possible. ServiceNow will make every effort to quickly resolve the issue.
+-   Provide sufficient detail to reproduce the vulnerability, including proof of concept. The use of ReproNow to demonstrate reproducibility is encouraged but not required.
+-   Please do not disclose an issue to the public or any third party until ServiceNow has resolved it.
+-   Make a good faith effort to avoid privacy violations, data destruction, and interruption or degradation of our services. Only interact with accounts you own or have explicit permission from the account holder to access.
+-   Redact any language or images that may identify the program or ServiceNow customers from information about a resolved vulnerability.
+-   Do not engage in disruptive testing (such as Denial of Service attacks) or any action that could impact the confidentiality, integrity, or availability of information and systems.
+-   Do not engage in social engineering or phishing against customers or employees.
+-   Please do not request compensation for time, materials, or discovered vulnerabilities through the Responsible Disclosure Program.
diff --git a/docs/README.md b/docs/README.md
index b0d94e310..e4f8e2aaf 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -10,13 +10,13 @@ To view the complete, rendered documentation, please visit the [Fast-LLM Documen
 
 To build and preview the documentation locally, follow these simple steps:
 
-1. **Install the necessary dependencies:**
+1.   **Install the necessary dependencies:**
 
    ```bash
    pip install -e ".[DOCS]"
    ```
 
-2. **Build the documentation:**
+2.   **Build the documentation:**
 
    ```bash
    mkdocs build
@@ -24,7 +24,7 @@ To build and preview the documentation locally, follow these simple steps:
 
    This will generate the static documentation files in a `site/` folder.
 
-3. **Serve the documentation locally (with auto-reload):**
+3.   **Serve the documentation locally (with auto-reload):**
 
    ```bash
    mkdocs serve
diff --git a/docs/about-us.md b/docs/about-us.md
index 8ae43542e..9b1bc2be0 100644
--- a/docs/about-us.md
+++ b/docs/about-us.md
@@ -18,17 +18,17 @@ We envision Fast-LLM as the go-to solution for serious AI practitioners who requ
 
 At Fast-LLM, we adhere to a set of guiding principles that define our approach:
 
-- **Performance-Driven:** We are relentless in our pursuit of speed and efficiency. Fast-LLM is built to reduce training time and scale to the largest clusters, enabling our users to achieve breakthrough results faster.
-- **Professional-Grade Customization:** We understand that serious AI work demands flexibility. Fast-LLM is designed for extensive customization, allowing users to tailor every aspect of the training process to their unique needs.
-- **Open Innovation:** While we cater to advanced users, our commitment to open-source ensures that innovation remains accessible. We believe in building a community where professionals can collaborate and contribute to shaping the future of AI.
-- **Reliability at Scale:** Fast-LLM is built with rigorous standards to support production-level workloads. We prioritize stability, reproducibility, and robustness, ensuring that your models can scale from research to real-world applications seamlessly.
+-   **Performance-Driven:** We are relentless in our pursuit of speed and efficiency. Fast-LLM is built to reduce training time and scale to the largest clusters, enabling our users to achieve breakthrough results faster.
+-   **Professional-Grade Customization:** We understand that serious AI work demands flexibility. Fast-LLM is designed for extensive customization, allowing users to tailor every aspect of the training process to their unique needs.
+-   **Open Innovation:** While we cater to advanced users, our commitment to open-source ensures that innovation remains accessible. We believe in building a community where professionals can collaborate and contribute to shaping the future of AI.
+-   **Reliability at Scale:** Fast-LLM is built with rigorous standards to support production-level workloads. We prioritize stability, reproducibility, and robustness, ensuring that your models can scale from research to real-world applications seamlessly.
 
 ## Meet the Team
 
 Fast-LLM is led by the Foundation Models Lab at [ServiceNow Research](https://www.servicenow.com/research/), with development driven by a dedicated group of professionals who bring extensive expertise in AI, machine learning, and distributed systems. While the project direction is guided by the Foundation Models Lab, contributions come from a growing network of researchers, developers, and industry experts worldwide. Here are some of the key members leading the project:
 
-- [**Joel Lamy Poirier**](https://www.servicenow.com/research/author/joel-lamy-poirier.html) - Lead Developer and maintainer, ServiceNow Research: Joel spearheads the core development, ensuring that Fast-LLM delivers on its promise of speed and scalability.
-- [**Sean Hughes**](https://www.servicenow.com/research/author/sean-hughes.html) - Ecosystem Director, ServiceNow Research: Sean focuses on building partnerships and open scientific collaborations to advance Fast-LLM's capabilities and reach.
-- [**Torsten Scholak**](https://www.servicenow.com/research/author/torsten-scholak.html) - Research Lead, ServiceNow Research: Torsten leads our research efforts, driving the scientific innovations that keep Fast-LLM at the forefront of AI training.
+-   [**Joel Lamy Poirier**](https://www.servicenow.com/research/author/joel-lamy-poirier.html) - Lead Developer and maintainer, ServiceNow Research: Joel spearheads the core development, ensuring that Fast-LLM delivers on its promise of speed and scalability.
+-   [**Sean Hughes**](https://www.servicenow.com/research/author/sean-hughes.html) - Ecosystem Director, ServiceNow Research: Sean focuses on building partnerships and open scientific collaborations to advance Fast-LLM's capabilities and reach.
+-   [**Torsten Scholak**](https://www.servicenow.com/research/author/torsten-scholak.html) - Research Lead, ServiceNow Research: Torsten leads our research efforts, driving the scientific innovations that keep Fast-LLM at the forefront of AI training.
 
 Our core team includes members affiliated with ServiceNow Research, as well as other contributors who bring unique perspectives and skills to the project. We welcome new participants from the broader AI community who share our vision of creating the best tools for training large-scale language models.
diff --git a/docs/cost-efficiency.md b/docs/cost-efficiency.md
index 7e8d77712..169f7da45 100644
--- a/docs/cost-efficiency.md
+++ b/docs/cost-efficiency.md
@@ -48,9 +48,9 @@ The tables below provide a comparison of training costs for three different mode
 
 ### Key Takeaways
 
-- **Cost efficiency at all scales:** Fast-LLM consistently achieves lower training costs due to its advanced parallelism and memory efficiency, delivering value across various model sizes and hardware configurations.
-- **Superior token throughput:** By processing more tokens per second per GPU than other frameworks, Fast-LLM maximizes token efficiency, leading to substantial savings, particularly for longer training durations or larger GPU clusters.
-- **Optimized for large-scale training:** Fast-LLM's design allows it to scale effectively as model size and training setups expand, ensuring that the benefits of its optimizations grow with the size of the deployment.
+-   **Cost efficiency at all scales:** Fast-LLM consistently achieves lower training costs due to its advanced parallelism and memory efficiency, delivering value across various model sizes and hardware configurations.
+-   **Superior token throughput:** By processing more tokens per second per GPU than other frameworks, Fast-LLM maximizes token efficiency, leading to substantial savings, particularly for longer training durations or larger GPU clusters.
+-   **Optimized for large-scale training:** Fast-LLM's design allows it to scale effectively as model size and training setups expand, ensuring that the benefits of its optimizations grow with the size of the deployment.
 
 [^fast-llm-1b]:
     Testing conducted in [Month, Year] using 8 NVIDIA H100 SXM5 80 GB GPUs in 1 DGX node connected with 3200 Gbps Infiniband. Fast-LLM version [VERSION/COMMIT HASH], CUDA version [VERSION]. Training was performed on randomly generated data. Configuration file: [Link to config file].
diff --git a/docs/help.md b/docs/help.md
index 6bb586eba..5b3fee3ec 100644
--- a/docs/help.md
+++ b/docs/help.md
@@ -10,9 +10,9 @@ Welcome to the Fast-LLM Help Center! Here, you'll find fixes for common hiccups,
 
 Let's stay one step ahead of those pesky gotchas. Here's a list of common issues and quick fixes:
 
-- **CUDA Out of Memory**: When the GPU throws a fit, a few tweaks can help. First, try lowering `micro_batch_size` or `sequence_length` in the configuration to fit within the available memory. Still stuck? Try setting the `mlp_recompute_level` option to `activation` to save memory in the backward pass, or experiment with higher ZeRO stages for reduced memory usage. And if that's not enough, tensor or model parallelism may be your friend. We've got a guide for this, so you're covered.
+-   **CUDA Out of Memory**: When the GPU throws a fit, a few tweaks can help. First, try lowering `micro_batch_size` or `sequence_length` in the configuration to fit within the available memory. Still stuck? Try setting the `mlp_recompute_level` option to `activation` to save memory in the backward pass, or experiment with higher ZeRO stages for reduced memory usage. And if that's not enough, tensor or model parallelism may be your friend. We've got a guide for this, so you're covered.
 
-- **Python Hash Seed Sync Error**: Encountering an error like
+-   **Python Hash Seed Sync Error**: Encountering an error like
 
     ```bash
     RuntimeError: Desync detected for barrier train begin (66830148464 != 133042721120)
@@ -20,9 +20,9 @@ Let's stay one step ahead of those pesky gotchas. Here's a list of common issues
   
     points to a hashing inconsistency. To fix it, set `PYTHONHASHSEED=0` in your environment variables. This ensures consistent hashing across processes, keeping them in sync.
 
-- **`torchrun` Timeout Errors**: If you see timeout errors related to `torchrun` during rendezvous, it could be DNS resolution or a networking issue. Check that all worker nodes are communicating properly with the master node.
+-   **`torchrun` Timeout Errors**: If you see timeout errors related to `torchrun` during rendezvous, it could be DNS resolution or a networking issue. Check that all worker nodes are communicating properly with the master node.
 
-- **NCCL Errors with Timeout Messages**: Oh, the joys of NCCL errors! If you see something like
+-   **NCCL Errors with Timeout Messages**: Oh, the joys of NCCL errors! If you see something like
 
     ```bash
     Watchdog caught collective operation timeout: WorkNCCL(SeqNum=408951, OpType=_ALLGATHER_BASE, … , Timeout(ms)=600000) ran for 600351 milliseconds before timing out
@@ -44,9 +44,9 @@ If you're the type who loves configurations and tweaking every detail, the [**Co
 
 We've got some excellent tutorials to help you get the most out of Fast-LLM:
 
-- [**Quick-Start Guide**](/quick-start): Perfect for launching Fast-LLM on a single GPU machine. We walk you through setting up Docker, running your first training job, and handling common issues.
+-   [**Quick-Start Guide**](/quick-start): Perfect for launching Fast-LLM on a single GPU machine. We walk you through setting up Docker, running your first training job, and handling common issues.
 
-- [**In-Action Guides**](/in-action/slurm): Ready to go big? These guides cover setting up Fast-LLM with Slurm and Kubernetes for multi-node training. This is where Fast-LLM really shows its power.
+-   [**In-Action Guides**](/in-action/slurm): Ready to go big? These guides cover setting up Fast-LLM with Slurm and Kubernetes for multi-node training. This is where Fast-LLM really shows its power.
 
 ---
 
@@ -54,9 +54,9 @@ We've got some excellent tutorials to help you get the most out of Fast-LLM:
 
 If Fast-LLM still isn't cooperating, here's where to look next:
 
-1. **GitHub [Issues](https://github.com/ServiceNow/Fast-LLM/issues) & [Discussions](https://github.com/ServiceNow/Fast-LLM/discussions)**: This is your best resource. Use the search function to see if anyone has run into the same issue. The community and our team are pretty active, so you'll likely find a solution or get help quickly.
+1.   **GitHub [Issues](https://github.com/ServiceNow/Fast-LLM/issues) & [Discussions](https://github.com/ServiceNow/Fast-LLM/discussions)**: This is your best resource. Use the search function to see if anyone has run into the same issue. The community and our team are pretty active, so you'll likely find a solution or get help quickly.
 
-2. **Email (last resort)**: As a final option, you can email us at [fast-llm-team@servicenow.com](mailto:fast-llm-team@servicenow.com). This is only for rare cases, though. GitHub is our go-to for answering questions, as it lets others benefit from the conversation too.
+2.   **Email (last resort)**: As a final option, you can email us at [fast-llm-team@servicenow.com](mailto:fast-llm-team@servicenow.com). This is only for rare cases, though. GitHub is our go-to for answering questions, as it lets others benefit from the conversation too.
 
 Fast-LLM is a growing community, and your questions and contributions help make it better for everyone. Who knows, you might just solve the next person's roadblock!
 
diff --git a/docs/index.md b/docs/index.md
index 637d022c1..e4e0ca2e4 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -16,35 +16,35 @@ Fast-LLM is designed for professionals who demand exceptional performance for ef
 
 Fast-LLM isn't just another library, **it's a platform for powering the next generation of AI breakthroughs**. Here's what sets it apart:
 
-- **🚀 Purpose-Built for Small- and Large-Scale AI:** Optimized specifically for training language models of all sizes, Fast-LLM excels from **small models around 1B parameters to massive clusters running 70B+ parameter models**, with kernels that are fine-tuned for maximum throughput across this entire range. At 10B-parameter scale, Fast-LLM avoids costly 3D-paralelism through memory optimization techniques such as ZeRO and activation recomputation, whereas at 100B-parameter scale, Fast-LLM optimally supports 3D-parallelism; making Fast-LLM the go-to choice for diverse training needs.
+-   **🚀 Purpose-Built for Small- and Large-Scale AI:** Optimized specifically for training language models of all sizes, Fast-LLM excels from **small models around 1B parameters to massive clusters running 70B+ parameter models**, with kernels that are fine-tuned for maximum throughput across this entire range. At 10B-parameter scale, Fast-LLM avoids costly 3D-parallelism through memory optimization techniques such as ZeRO and activation recomputation, whereas at 100B-parameter scale, Fast-LLM optimally supports 3D-parallelism, making Fast-LLM the go-to choice for diverse training needs.
 
-- **🧠 Unified Support for GPT-Like Architectures:** Fast-LLM **unifies all GPT-like model implementations** in a [single configuration file](https://github.com/ServiceNow/Fast-LLM/blob/main/fast_llm/models/gpt/model.py), and unlike HuggingFace transformers where every model has it's own, mostly independent, implementation, Fast-LLM reduces coding and adapts effortlessly, even with custom architectures.
+-   **🧠 Unified Support for GPT-Like Architectures:** Fast-LLM **unifies all GPT-like model implementations** in a [single configuration file](https://github.com/ServiceNow/Fast-LLM/blob/main/fast_llm/models/gpt/model.py). Unlike Hugging Face Transformers, where every model has its own, mostly independent implementation, Fast-LLM reduces coding and adapts effortlessly, even with custom architectures.
 
-- **💰 Cost Efficiency That Sets Fast-LLM Apart:**
+-   **💰 Cost Efficiency That Sets Fast-LLM Apart:**
 
-  - **Lower Training Costs:** With higher throughput per GPU, Fast-LLM reduces the training time required. For instance, training a Mistral-7B model can be up to **xx% cheaper** compared to other frameworks due to faster processing and better memory efficiency.
+    -   **Lower Training Costs:** With higher throughput per GPU, Fast-LLM reduces the training time required. For instance, training a Mistral-7B model can be up to **xx% cheaper** compared to other frameworks due to faster processing and better memory efficiency.
 
-  - **More Tokens for Your Budget:** Train up to xx% more tokens for the same budget, leading to better-trained models without breaking your financial constraints.
+    -   **More Tokens for Your Budget:** Train up to xx% more tokens for the same budget, leading to better-trained models without breaking your financial constraints.
 
   [Learn more about Fast-LLM's cost efficiency and see detailed comparisons](cost-efficiency).
 
-- **🔓 Openness Without Compromise:** Fast-LLM's open-source approach ensures that you can **fully customize and extend the library** to fit your exact needs, without the restrictions of proprietary software. Developed transparently by a community of experts on GitHub, every change is **publicly discussed and vetted**, fostering **trust and collaboration** so you can innovate with confidence, knowing the entire development process and decision making is out in the open.
+-   **🔓 Openness Without Compromise:** Fast-LLM's open-source approach ensures that you can **fully customize and extend the library** to fit your exact needs, without the restrictions of proprietary software. Developed transparently by a community of experts on GitHub, every change is **publicly discussed and vetted**, fostering **trust and collaboration** so you can innovate with confidence, knowing the entire development and decision-making process is out in the open.
 
-- **🌍 Community-Driven Development:** Built by professionals for professionals, Fast-LLM's development is transparent, with an open invitation to the community to contribute. [**Join the Fast-LLM community**](join-us) to help shape the future of large-scale AI training.
+-   **🌍 Community-Driven Development:** Built by professionals for professionals, Fast-LLM's development is transparent, with an open invitation to the community to contribute. [**Join the Fast-LLM community**](join-us) to help shape the future of large-scale AI training.
 
 ### Key Features
 
 Fast-LLM offers all the capabilities you need to accelerate your LLM training and **push the boundaries of what's possible**:
 
-- **🚀 Speed Like No Other:** Achieve record-breaking training throughput with Fast-LLM. For instance, train Mistral-7B at **9,800 tokens/s/GPU** on a 4-node cluster with 32 H100 GPUs (batch size 32, sequence length 8k). Our optimized kernels, advanced parallelism, and memory-efficient techniques drastically reduce training time and cost.
+-   **🚀 Speed Like No Other:** Achieve record-breaking training throughput with Fast-LLM. For instance, train Mistral-7B at **9,800 tokens/s/GPU** on a 4-node cluster with 32 H100 GPUs (batch size 32, sequence length 8k). Our optimized kernels, advanced parallelism, and memory-efficient techniques drastically reduce training time and cost.
 
-- **📡 Unmatched Scalability:** Seamlessly scale from a single GPU to large compute clusters. Fast-LLM supports 3D parallelism (data, tensor, and pipeline), sequence length parallelism, and ZeRO-1,2,3 techniques for maximum memory efficiency. Scale to the size you need without sacrificing performance.
+-   **📡 Unmatched Scalability:** Seamlessly scale from a single GPU to large compute clusters. Fast-LLM supports 3D parallelism (data, tensor, and pipeline), sequence length parallelism, and ZeRO-1,2,3 techniques for maximum memory efficiency. Scale to the size you need without sacrificing performance.
 
-- **🎛️ Total Flexibility:** Compatible with all major language model architectures, including but not limited to Llama, Mistral, StarCoder, and Mixtral. Fast-LLM's modular design gives you full control over your training workflows.
+-   **🎛️ Total Flexibility:** Compatible with all major language model architectures, including but not limited to Llama, Mistral, StarCoder, and Mixtral. Fast-LLM's modular design gives you full control over your training workflows.
 
-- **📦 Seamless Integration:** Integrate smoothly with popular libraries such as [Hugging Face Transformers](https://huggingface.co/transformers). Benefit from Fast-LLM's optimizations without disrupting your existing pipelines.
+-   **📦 Seamless Integration:** Integrate smoothly with popular libraries such as [Hugging Face Transformers](https://huggingface.co/transformers). Benefit from Fast-LLM's optimizations without disrupting your existing pipelines.
 
-- **🛠️ Professional-Grade Tools:** Enjoy mixed precision training, large batch training, and gradient accumulation. Fast-LLM ensures reproducibility through deterministic behavior and provides pre-built Docker images, YAML configurations, and a simple, intuitive command-line interface.
+-   **🛠️ Professional-Grade Tools:** Enjoy mixed precision training, large batch training, and gradient accumulation. Fast-LLM ensures reproducibility through deterministic behavior and provides pre-built Docker images, YAML configurations, and a simple, intuitive command-line interface.
 
 [Download Fast-LLM](https://github.com/ServiceNow/Fast-LLM/releases) and start training your large language models in record time. [Join the Fast-LLM community](join-us) and collaborate with like-minded professionals to advance the state-of-the-art in AI research and development.
 
@@ -52,9 +52,9 @@ Fast-LLM offers all the capabilities you need to accelerate your LLM training an
 
 Fast-LLM powers the world's most advanced AI projects:
 
-- **NLP Research and Development:** Train state-of-the-art language models for natural language understanding, summarization, and conversational AI.
-- **Enterprise AI Solutions:** Accelerate time-to-market for AI products by reducing training costs and enabling faster iteration.
-- **Academic Collaborations:** Drive AI innovation with high-performance training capabilities that support cutting-edge research in machine learning.
+-   **NLP Research and Development:** Train state-of-the-art language models for natural language understanding, summarization, and conversational AI.
+-   **Enterprise AI Solutions:** Accelerate time-to-market for AI products by reducing training costs and enabling faster iteration.
+-   **Academic Collaborations:** Drive AI innovation with high-performance training capabilities that support cutting-edge research in machine learning.
 
 See how Fast-LLM has helped early adopters achieve up to xx% faster results. [Explore use cases and success stories](success-stories/starcoder-2).
 
@@ -62,18 +62,18 @@ See how Fast-LLM has helped early adopters achieve up to xx% faster results. [Ex
 
 Fast-LLM is designed to be the **go-to solution** for those training the most sophisticated language models. Our objectives include:
 
-- **Accelerating Training Workflows:** Deliver the fastest LLM training experience with optimized kernel efficiency, parallelism, and memory management.
-- **Supporting a Broad Range of Architectures:** Offer built-in support for all major language model architectures, with an architecture-agnostic approach that allows users to easily adapt the framework to emerging models.
-- **Enabling Seamless Integration and Deployment:** Integrate effortlessly into existing ML pipelines, including [Hugging Face Transformers](https://huggingface.co/transformers) and [Kubernetes](https://kubernetes.io)-based clusters.
-- **Advancing LLM Research and Production-Readiness:** Be suitable for both cutting-edge research and mission-critical production workloads.
+-   **Accelerating Training Workflows:** Deliver the fastest LLM training experience with optimized kernel efficiency, parallelism, and memory management.
+-   **Supporting a Broad Range of Architectures:** Offer built-in support for all major language model architectures, with an architecture-agnostic approach that allows users to easily adapt the framework to emerging models.
+-   **Enabling Seamless Integration and Deployment:** Integrate effortlessly into existing ML pipelines, including [Hugging Face Transformers](https://huggingface.co/transformers) and [Kubernetes](https://kubernetes.io)-based clusters.
+-   **Advancing LLM Research and Production-Readiness:** Be suitable for both cutting-edge research and mission-critical production workloads.
 
 ## Collaboration and Contribution
 
 As we continue to expand Fast-LLM, we're looking for contributions from the community to help shape its future. We welcome:
 
-- **Testing and Bug Fixes:** Help us identify issues and improve stability.
-- **Feature Development:** Contribute new capabilities, such as custom kernels or support for alternative hardware like AMD and Intel.
-- **Documentation and Tutorials:** Make Fast-LLM more accessible by improving our [documentation](https://servicenow.github.io/Fast-LLM) and writing practical guides.
+-   **Testing and Bug Fixes:** Help us identify issues and improve stability.
+-   **Feature Development:** Contribute new capabilities, such as custom kernels or support for alternative hardware like AMD and Intel.
+-   **Documentation and Tutorials:** Make Fast-LLM more accessible by improving our [documentation](https://servicenow.github.io/Fast-LLM) and writing practical guides.
 
 Fast-LLM is more than just software, it's a community. Get involved by exploring our [contribution guidelines](https://github.com/ServiceNow/Fast-LLM/CONTRIBUTING.md) and engaging with us on [GitHub Discussions](https://github.com/ServiceNow/Fast-LLM/discussions).
 
diff --git a/docs/join-us.md b/docs/join-us.md
index 510216329..a5992649d 100644
--- a/docs/join-us.md
+++ b/docs/join-us.md
@@ -20,15 +20,15 @@ Want to keep up with the latest Fast-LLM updates and new opportunities to get in
 
 Fast-LLM thrives on collaboration, and we're excited to welcome new contributors! From fixing bugs to adding new features, every code contribution makes a difference. If you're just getting started, our [**Good First Issues**](https://github.com/ServiceNow/Fast-LLM/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22) on GitHub are labeled to help newcomers find approachable tasks. To set up your development environment and get oriented with Fast-LLM, check out our **Developer's Corner** for everything you need:
 
-- [**Contributing**](developers/contributing) – for setup instructions and contributing guidelines
-- [**Best Practices**](developers/best-practices) – for tips on writing clean, maintainable code
+-   [**Contributing**](developers/contributing) – for setup instructions and contributing guidelines
+-   [**Best Practices**](developers/best-practices) – for tips on writing clean, maintainable code
 
 Here's a quick overview of the process:
 
-1. **Fork & Clone**: Start by forking the repo and cloning it to your machine.
-2. **Set Up Your Dev Environment**: The Developer's Corner guides you through configuring your environment for maximum productivity.
-3. **Write Awesome Code**: Make your changes, document them, and follow our best practices.
-4. **Open a Pull Request**: Submit a PR to showcase your work and get feedback from our team and the community.
+1.   **Fork & Clone**: Start by forking the repo and cloning it to your machine.
+2.   **Set Up Your Dev Environment**: The Developer's Corner guides you through configuring your environment for maximum productivity.
+3.   **Write Awesome Code**: Make your changes, document them, and follow our best practices.
+4.   **Open a Pull Request**: Submit a PR to showcase your work and get feedback from our team and the community.
 
 [Explore the Developer's Corner for everything you need to get started!](developers)
 
diff --git a/docs/quick-start.md b/docs/quick-start.md
index 1ec4473a6..213f810a1 100644
--- a/docs/quick-start.md
+++ b/docs/quick-start.md
@@ -6,9 +6,9 @@ This guide will get you up and running with Fast-LLM on a single machine. Let's
 
 You'll need:
 
-- At least one NVIDIA GPU on your machine. We recommend 8 A100s or higher for this tutorial 🤑
-- Docker (installed and running)
-- Some patience for the initial setup and training 😊
+-   At least one NVIDIA GPU on your machine. We recommend 8 A100s or higher for this tutorial 🤑
+-   Docker (installed and running)
+-   Some patience for the initial setup and training 😊
 
 ## Step 1: Pull the Fast-LLM Docker Image
 
@@ -142,23 +142,23 @@ run:
   experiment_dir: /app/results
 ```
 
-1. Total number of training tokens will be ~300B.
-2. Replace `servicenow` with your own W&B entity name.
-3. Adjust based on GPU memory. For GPT-2 and an A100-80GB, a `micro_batch_size` of 1 should work well.
-4. Should be a power of 2 and divisible by 8. For an A100-80GB, 1024 is a good starting point.
-5. Must be divisible by number of GPUs. At 1024 tokens per sequence, 480 corresponds to about ~500k tokens per batch.
-6. Location of the dataset metadata file generated in Step 4.
-7. 99.8% train, 0.2% validation, 0% test.
-8. L2 regularization penalty.
-9. 1st Adam optimizer parameter.
-10. 2nd Adam optimizer parameter.
-11. Peak learning rate.
-12. Should be 1/10th of base per Chinchilla.
-13. Cosine decay starting at `base` after warmup and ending at `minimum` after `decay_iterations`.
-14. Usually the same as `train_iters`.
-15. Number of steps of linear warmup.
-16. Location of the `config.json` file downloaded in Step 4.
-17. Set to `False` to train from scratch.
+1.   Total number of training tokens will be ~300B.
+2.   Replace `servicenow` with your own W&B entity name.
+3.   Adjust based on GPU memory. For GPT-2 and an A100-80GB, a `micro_batch_size` of 1 should work well.
+4.   Should be a power of 2 and divisible by 8. For an A100-80GB, 1024 is a good starting point.
+5.   Must be divisible by the number of GPUs. At 1024 tokens per sequence, 480 corresponds to roughly 500k tokens per batch.
+6.   Location of the dataset metadata file generated in Step 4.
+7.   99.8% train, 0.2% validation, 0% test.
+8.   L2 regularization penalty.
+9.   1st Adam optimizer parameter.
+10.   2nd Adam optimizer parameter.
+11.   Peak learning rate.
+12.   Should be 1/10th of base per Chinchilla.
+13.   Cosine decay starting at `base` after warmup and ending at `minimum` after `decay_iterations`.
+14.   Usually the same as `train_iters`.
+15.   Number of steps of linear warmup.
+16.   Location of the `config.json` file downloaded in Step 4.
+17.   Set to `False` to train from scratch.
 
 ## Step 6: Add Your Weights & Biases API Key
 
@@ -192,11 +192,11 @@ With Weights & Biases, you'll see the loss curve, training metrics, and more. If
 
 Here are some common issues you might encounter and how to address them:
 
-- **CUDA Out of Memory**: Try lowering the `micro_batch_size` or `sequence_length` in your configuration to fit within available memory.
+-   **CUDA Out of Memory**: Try lowering the `micro_batch_size` or `sequence_length` in your configuration to fit within available memory.
 
-- **Underutilized GPU or Low Memory Usage**: If memory usage is low or GPU utilization isn't maxed out, try increasing `micro_batch_size` (to 4, 8, or 16 if memory allows) or extending `sequence_length` (up to 2048, 3072, or 4096, as memory permits). Larger batches and longer sequences help keep GPUs engaged and reduce idle time.
+-   **Underutilized GPU or Low Memory Usage**: If memory usage is low or GPU utilization isn't maxed out, try increasing `micro_batch_size` (to 4, 8, or 16 if memory allows) or extending `sequence_length` (up to 2048, 3072, or 4096, as memory permits). Larger batches and longer sequences help keep GPUs engaged and reduce idle time.
 
-- **Docker Permission Issues**: If you encounter Docker permission errors, confirm that Docker has permission to access your GPUs. Use the `--gpus all` flag in your Docker run command and ensure your user has access to the `docker` and `nvidia-docker` groups.
+-   **Docker Permission Issues**: If you encounter Docker permission errors, confirm that Docker has permission to access your GPUs. Use the `--gpus all` flag in your Docker run command and ensure your user has access to the `docker` and `nvidia-docker` groups.
 
 ## Final Thoughts
 
diff --git a/docs/success-stories.md b/docs/success-stories.md
deleted file mode 100644
index e69de29bb..000000000
diff --git a/docs/success-stories/starcoder-2.md b/docs/success-stories/starcoder-2.md
index 22f84a822..577d85870 100644
--- a/docs/success-stories/starcoder-2.md
+++ b/docs/success-stories/starcoder-2.md
@@ -12,11 +12,11 @@ To bring StarCoder2 to life, we ran Fast-LLM on **NVIDIA's DGX SuperCloud**, uti
 
 Fast-LLM was designed to maximize efficiency in large-scale language model training—especially for tasks like StarCoder2. Here's how Fast-LLM's capabilities helped us achieve our goals:
 
-- **Optimized Throughput and GPU Utilization**: Fast-LLM's data parallelism allowed each A100-80GB GPU to operate at its peak, sustaining **10,000 tokens per second** throughput. This boosted GPU utilization and brought down training time by **20%** compared to other frameworks. Fast-LLM made sure every GPU cycle was used efficiently, cutting down on idle time across the board.
+-   **Optimized Throughput and GPU Utilization**: Fast-LLM's data parallelism allowed each A100-80GB GPU to operate at its peak, sustaining **10,000 tokens per second** throughput. This boosted GPU utilization and brought down training time by **20%** compared to other frameworks. Fast-LLM made sure every GPU cycle was used efficiently, cutting down on idle time across the board.
 
-- **Support for Long Contexts**: With Fast-LLM's built-in Grouped Query Attention (GQA), StarCoder2-3B was able to leverage a **16,384 token context window**. This is essential for code comprehension, where context often spans hundreds of lines or more. GQA enabled the model to hold extensive context across sequences, which translates into better understanding of long code snippets, in-depth documentation, and detailed coding conversations.
+-   **Support for Long Contexts**: With Fast-LLM's built-in Grouped Query Attention (GQA), StarCoder2-3B was able to leverage a **16,384 token context window**. This is essential for code comprehension, where context often spans hundreds of lines or more. GQA enabled the model to hold extensive context across sequences, which translates into better understanding of long code snippets, in-depth documentation, and detailed coding conversations.
 
-- **Fill-in-the-Middle (FIM) Training**: Fast-LLM supported FIM training objectives natively, allowing StarCoder2-3B to complete and understand code by predicting missing snippets in various contexts. This structure-focused training enhanced the model's performance, making it adept at understanding code structure, flow, and syntax.
+-   **Fill-in-the-Middle (FIM) Training**: Fast-LLM supported FIM training objectives natively, allowing StarCoder2-3B to complete and understand code by predicting missing snippets in various contexts. This structure-focused training enhanced the model's performance, making it adept at understanding code structure, flow, and syntax.
 
 ## The Takeaway
 

From 43869e51bef3f1bed0bf6038478332932725da71 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Sun, 3 Nov 2024 13:13:01 -0500
Subject: [PATCH 41/87] separate md linting for / and /docs

---
 .markdownlint.yaml      | 15 ++++---
 CODE_OF_CONDUCT.md      | 40 +++++++++---------
 CONTRIBUTING.md         | 48 ++++++++++-----------
 README.md               | 94 ++++++++++++++++++++---------------------
 SECURITY.md             | 16 +++----
 docs/.markdownlint.yaml | 32 ++++++++++++++
 6 files changed, 140 insertions(+), 105 deletions(-)
 create mode 100644 docs/.markdownlint.yaml

diff --git a/.markdownlint.yaml b/.markdownlint.yaml
index 547681254..3b8bac640 100644
--- a/.markdownlint.yaml
+++ b/.markdownlint.yaml
@@ -6,7 +6,7 @@ default: true
 # MD007/ul-indent : Unordered list indentation : https://github.com/DavidAnson/markdownlint/blob/v0.32.1/doc/md007.md
 MD007:
   # Spaces for indent
-  indent: 4
+  indent: 2
 
 # MD010/no-hard-tabs : Hard tabs : https://github.com/DavidAnson/markdownlint/blob/v0.32.1/doc/md010.md
 MD010:
@@ -15,18 +15,21 @@ MD010:
   # Fenced code languages to ignore
   ignore_code_languages: []
   # Number of spaces for each hard tab
-  spaces_per_tab: 4
+  spaces_per_tab: 2
 
 # MD013/line-length : Line length : https://github.com/DavidAnson/markdownlint/blob/v0.32.1/doc/md013.md
 MD013: false
 
+# MD024/no-duplicate-heading : Multiple headings with the same content : https://github.com/DavidAnson/markdownlint/blob/v0.32.1/doc/md024.md
+MD024: false
+
 # MD030/list-marker-space : Spaces after list markers : https://github.com/DavidAnson/markdownlint/blob/v0.32.1/doc/md030.md
 MD030:
   # Spaces for single-line unordered list items
-  ul_single: 3
+  ul_single: 1
   # Spaces for single-line ordered list items
-  ol_single: 3
+  ol_single: 1
   # Spaces for multi-line unordered list items
-  ul_multi: 3
+  ul_multi: 1
   # Spaces for multi-line ordered list items
-  ol_multi: 3
+  ol_multi: 1
diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md
index 09229fce8..e04230bbd 100644
--- a/CODE_OF_CONDUCT.md
+++ b/CODE_OF_CONDUCT.md
@@ -6,15 +6,15 @@ This code of conduct provides guidelines for participation in ServiceNow-managed
 
 Communities thrive when members support each other and provide useful feedback.
 
--   Be polite and courteous. Respect and treat others as you would expect to be treated yourself.
--   Respect your audience. Posts should not upset, annoy, threaten, harass, abuse or embarrass other members.
--   User Contributions must not include material that is defamatory, obscene, indecent, abusive, offensive, harassing, violent, hateful, inflammatory or otherwise objectionable.
--   Lively and collegial discussions are always encouraged in a healthy community. It is okay to argue facts but not okay to argue personalities or personal beliefs.
--   Do not use text formats such as all caps or bold that may be read as annoying, rude or send a strong message.
--   Do not publish anyone's private personal information without their explicit consent.
--   Avoid using abbreviations or terminology that others may not understand. An abbreviation may mean something to you but in another context or country, it may have another meaning.
--   Be accountable for your actions by correcting your mistakes and indicating where you have changed a previous post of yours.
--   Mark content as correct and helpful, and provide feedback. If you read a discussion post that you find helpful, we encourage you to leave a positive vote and comment in the replies. If you find a post that is unhelpful, please provide more information in the issue comments.
+- Be polite and courteous. Respect and treat others as you would expect to be treated yourself.
+- Respect your audience. Posts should not upset, annoy, threaten, harass, abuse or embarrass other members.
+- User Contributions must not include material that is defamatory, obscene, indecent, abusive, offensive, harassing, violent, hateful, inflammatory or otherwise objectionable.
+- Lively and collegial discussions are always encouraged in a healthy community. It is okay to argue facts but not okay to argue personalities or personal beliefs.
+- Do not use text formats such as all caps or bold that may be read as annoying or rude, or that send a strong message.
+- Do not publish anyone's private personal information without their explicit consent.
+- Avoid using abbreviations or terminology that others may not understand. An abbreviation may mean something to you but in another context or country, it may have another meaning.
+- Be accountable for your actions by correcting your mistakes and indicating where you have changed a previous post of yours.
+- Mark content as correct and helpful, and provide feedback. If you read a discussion post that you find helpful, we encourage you to leave a positive vote and comment in the replies. If you find a post that is unhelpful, please provide more information in the issue comments.
 
 ## Issue board guidelines
 
@@ -22,20 +22,20 @@ Many open-source projects provide an Issues board, with similar functionality to
 
 ServiceNow suggests the following technical support pathways for open-source projects:
 
-1.   Clearly identify and document the issue or question you have.
-2.   View the Documentation.
-3.   Search the Discussions.
-4.   Search the project knowledge base or Wiki for known errors, useful solutions, and troubleshooting tips.
-5.   Check the project guidelines in the [`CONTRIBUTING.md`](CONTRIBUTING.md) file if you would like details on how you can submit a change. Community contributions are valued and appreciated!
-6.   Log an Issue if it hasn't already been logged. If the issue has already been logged by another user, vote it up, and add a comment with additional or missing information. Do your best to choose the correct category when logging a new issue. This will make it easier to differentiate bugs from new feature requests or ideas. If after logging an issue you find the solution, please close your issue and provide a comment with the solution. This will help the project owners and other users.
-7.   Contact the project team contributors of the project to see if they can help as a last resort only.
+1. Clearly identify and document the issue or question you have.
+2. View the Documentation.
+3. Search the Discussions.
+4. Search the project knowledge base or Wiki for known errors, useful solutions, and troubleshooting tips.
+5. Check the project guidelines in the [`CONTRIBUTING.md`](CONTRIBUTING.md) file if you would like details on how you can submit a change. Community contributions are valued and appreciated!
+6. Log an Issue if it hasn't already been logged. If the issue has already been logged by another user, vote it up, and add a comment with additional or missing information. Do your best to choose the correct category when logging a new issue. This will make it easier to differentiate bugs from new feature requests or ideas. If after logging an issue you find the solution, please close your issue and provide a comment with the solution. This will help the project owners and other users.
+7. As a last resort only, contact the project's contributors to see if they can help.
 
 ## Repositories
 
--   Read and follow the license instructions
--   Remember to include citations if you use someone else's work in your own project. Use the [`CITATION.cff`](CITATION.cff) to find the correct project citation reference.
--   ‘Star' project repos to save for future reference.
--   ‘Watch' project repos to get notifications of changes – this can get noisy for some projects, so only watch the ones you really need to track closely.
+- Read and follow the license instructions.
+- Remember to include citations if you use someone else's work in your own project. Use the [`CITATION.cff`](CITATION.cff) to find the correct project citation reference.
+- ‘Star’ project repos to save for future reference.
+- ‘Watch’ project repos to get notifications of changes – this can get noisy for some projects, so only watch the ones you really need to track closely.
 
 ## Enforcement and reporting
 
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 95a11edd4..9ef1c1856 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -8,18 +8,18 @@ If you have questions or want to start a discussion, feel free to [open a discus
 
 To get started with contributing to Fast-LLM, follow these steps to set up your environment:
 
-1.   **Set Up the Development Environment**: Fast-LLM is built on [PyTorch](https://pytorch.org/) and [Triton](https://triton-lang.org/). Check out our [setup guide](https://servicenow.github.io/Fast-LLM/developers/setup) for instructions on getting everything ready, including the development environment and dependencies.
-2.   **Learn Our Best Practices**: Get familiar with our [development best practices](https://servicenow.github.io/Fast-LLM/developers/dev-practices/), which cover code style, pre-commit hooks, and testing strategies.
-3.   **Launch Fast-LLM Locally or with Docker**: Need help getting started? Follow the instructions in the [launching section](https://servicenow.github.io/Fast-LLM/developers/launching) to get Fast-LLM up and running.
+1. **Set Up the Development Environment**: Fast-LLM is built on [PyTorch](https://pytorch.org/) and [Triton](https://triton-lang.org/). Check out our [setup guide](https://servicenow.github.io/Fast-LLM/developers/setup) for instructions on getting everything ready, including the development environment and dependencies.
+2. **Learn Our Best Practices**: Get familiar with our [development best practices](https://servicenow.github.io/Fast-LLM/developers/dev-practices/), which cover code style, pre-commit hooks, and testing strategies.
+3. **Launch Fast-LLM Locally or with Docker**: Need help getting started? Follow the instructions in the [launching section](https://servicenow.github.io/Fast-LLM/developers/launching) to get Fast-LLM up and running.
 
 ## How to Report a Bug 🐞
 
 Found a bug? Let's squash it together! [Open an issue](https://github.com/ServiceNow/Fast-LLM/issues/new/choose) and select "Bug report." Please include as much information as possible:
 
--   Steps to reproduce the issue.
--   What you expected to happen versus what actually happened.
--   Logs, Fast-LLM configuration, and error messages.
--   Details about your environment setup (e.g., CUDA hardware, PyTorch version, CUDA version).
+- Steps to reproduce the issue.
+- What you expected to happen versus what actually happened.
+- Logs, Fast-LLM configuration, and error messages.
+- Details about your environment setup (e.g., CUDA hardware, PyTorch version, CUDA version).
 
 If you're familiar with the codebase, consider adding a failing unit test to demonstrate the problem (optional, but helpful!).
 
@@ -27,33 +27,33 @@ If you're familiar with the codebase, consider adding a failing unit test to dem
 
 Before diving into code, [open an issue](https://github.com/ServiceNow/Fast-LLM/issues) to discuss your proposal. This is especially important if you're planning significant changes or adding new dependencies. Once your idea is approved, follow these steps:
 
-1.   **Fork the Repository**: [Fork Fast-LLM](https://github.com/ServiceNow/Fast-LLM/fork) to your own GitHub account.
-2.   **Clone Your Fork Locally**: Use `git clone` to bring the code to your local machine.
-3.   **Create a New Branch**: Name your branch descriptively, such as `feature/awesome-feature` or `fix/nasty-bug`.
-4.   **Make Your Changes**: Work your magic! Don't forget to add or update tests, benchmarks, or configurations as needed.
-5.   **Create a Properly Titled Pull Request**: When you're ready to open a PR, make sure to use a clear and descriptive title that follows our [PR title guidelines](https://servicenow.github.io/Fast-LLM/developers/pr-title-guidelines). This title will become the commit message for the squashed merge.
-6.   **Push to Your Fork**: Push the branch to your GitHub fork.
-7.   **Open a Pull Request**: [Submit a pull request](https://github.com/ServiceNow/Fast-LLM/compare) to the `main` branch. Reference the original issue number and provide a brief summary of your changes.
+1. **Fork the Repository**: [Fork Fast-LLM](https://github.com/ServiceNow/Fast-LLM/fork) to your own GitHub account.
+2. **Clone Your Fork Locally**: Use `git clone` to bring the code to your local machine.
+3. **Create a New Branch**: Name your branch descriptively, such as `feature/awesome-feature` or `fix/nasty-bug`.
+4. **Make Your Changes**: Work your magic! Don't forget to add or update tests, benchmarks, or configurations as needed.
+5. **Create a Properly Titled Pull Request**: When you're ready to open a PR, make sure to use a clear and descriptive title that follows our [PR title guidelines](https://servicenow.github.io/Fast-LLM/developers/pr-title-guidelines). This title will become the commit message for the squashed merge.
+6. **Push to Your Fork**: Push the branch to your GitHub fork.
+7. **Open a Pull Request**: [Submit a pull request](https://github.com/ServiceNow/Fast-LLM/compare) to the `main` branch. Reference the original issue number and provide a brief summary of your changes.
 
 ### Guidelines for a Successful Pull Request
 
 Here are some tips to ensure your pull request gets reviewed and merged promptly:
 
--   **Follow our coding standards**: Stick to our [development best practices](https://servicenow.github.io/Fast-LLM/developers/dev-practices/) to keep the code clean and consistent.
--   **Write tests**: Verify your changes with unit tests for new features or bug fixes.
--   **Test on GPUs and real-world workloads**: Since Fast-LLM is all about training large language models, make sure your changes work smoothly in GPU environments and on typical training setups.
--   **Run benchmarks and performance tests**: Make sure your changes don't slow things down. If there's any impact on performance, provide benchmark results to back it up.
--   **Avoid introducing new issues**: Check that there are no new runtime warnings, type checker errors, linting problems, or unhandled edge cases.
--   **Comment non-trivial code**: Make your code easy to understand for others.
--   **Keep sensitive data out**: Make sure your code or commit messages don't expose private or proprietary information.
--   **Use the [PR template](https://github.com/ServiceNow/Fast-LLM/blob/main/.github/PULL_REQUEST_TEMPLATE.md)**: Complete the checklist to make sure everything is in order before hitting submit.
+- **Follow our coding standards**: Stick to our [development best practices](https://servicenow.github.io/Fast-LLM/developers/dev-practices/) to keep the code clean and consistent.
+- **Write tests**: Verify your changes with unit tests for new features or bug fixes.
+- **Test on GPUs and real-world workloads**: Since Fast-LLM is all about training large language models, make sure your changes work smoothly in GPU environments and on typical training setups.
+- **Run benchmarks and performance tests**: Make sure your changes don't slow things down. If there's any impact on performance, provide benchmark results to back it up.
+- **Avoid introducing new issues**: Check that there are no new runtime warnings, type checker errors, linting problems, or unhandled edge cases.
+- **Comment non-trivial code**: Make your code easy to understand for others.
+- **Keep sensitive data out**: Make sure your code or commit messages don't expose private or proprietary information.
+- **Use the [PR template](https://github.com/ServiceNow/Fast-LLM/blob/main/.github/PULL_REQUEST_TEMPLATE.md)**: Complete the checklist to make sure everything is in order before hitting submit.
 
 ## Seeking Help or Clarification
 
 If you're unsure about something or need help, you've got options:
 
--   **GitHub Discussions**: [Start a discussion](https://github.com/ServiceNow/Fast-LLM/discussions) if you need advice or just want to chat.
--   **Project Maintainers**: Mention a maintainer in an issue or pull request if you need a review or guidance.
+- **GitHub Discussions**: [Start a discussion](https://github.com/ServiceNow/Fast-LLM/discussions) if you need advice or just want to chat.
+- **Project Maintainers**: Mention a maintainer in an issue or pull request if you need a review or guidance.
 
 ## Contributors
 
diff --git a/README.md b/README.md
index 963f22118..9da114bb3 100644
--- a/README.md
+++ b/README.md
@@ -25,36 +25,36 @@ As a truly open-source project, Fast-LLM allows full customization and extension
 
 ## Why Fast-LLM?
 
-1.   🚀 **Fast-LLM is Blazingly Fast**:
-    -   ⚡️ Optimized kernel efficiency and reduced overheads.
-    -   🔋 Optimized memory usage for best performance.
-    -   ⏳ Minimizes training time and cost.
-
-2.   📈 **Fast-LLM is Highly Scalable**:
-    -   📡 Distributed training across multiple GPUs and nodes using 3D parallelism (Data, Tensor, and Pipeline).
-    -   🔗 Supports sequence length parallelism to handle longer sequences effectively.
-    -   🧠 ZeRO-1, ZeRO-2, and ZeRO-3 implementations for improved memory efficiency.
-    -   🎛️ Mixed precision training support for better performance.
-    -   🏋️‍♂️ Large batch training and gradient accumulation support.
-    -   🔄 Reproducible training with deterministic behavior.
-
-3.   🎨 **Fast-LLM is Incredibly Flexible**:
-    -   🤖 Compatible with all common language model architectures in a unified class.
-    -   ⚡ Efficient dropless Mixture-of-Experts (MoE) implementation with SoTA performance.
-    -   🧩 Customizable language model architectures, data loaders, loss functions, and optimizers (in progress).
-    -   🤗 Seamless integration with [Hugging Face Transformers][transformers].
-
-4.   🎯 **Fast-LLM is Super Easy to Use**:
-    -   📦 [Pre-built Docker images](https://github.com/ServiceNow/Fast-LLM/pkgs/container/fast-llm) for quick deployment.
-    -   📝 Simple YAML configuration for hassle-free setup.
-    -   💻 Command-line interface for easy launches.
-    -   📊 Detailed logging and real-time monitoring features.
-    -   📚 Extensive [documentation][docs] and practical tutorials (in progress).
-
-5.   🌐 **Fast-LLM is Truly Open Source**:
-    -   ⚖️ Licensed under [Apache 2.0][license] for maximum freedom to use Fast-LLM at work, in your projects, or for research.
-    -   💻 Transparently developed on GitHub with public [roadmap][roadmap] and [issue tracking][issues].
-    -   🤝 Contributions and collaboration are always welcome!
+1. 🚀 **Fast-LLM is Blazingly Fast**:
+    - ⚡️ Optimized kernel efficiency and reduced overheads.
+    - 🔋 Optimized memory usage for best performance.
+    - ⏳ Minimizes training time and cost.
+
+2. 📈 **Fast-LLM is Highly Scalable**:
+    - 📡 Distributed training across multiple GPUs and nodes using 3D parallelism (Data, Tensor, and Pipeline).
+    - 🔗 Supports sequence length parallelism to handle longer sequences effectively.
+    - 🧠 ZeRO-1, ZeRO-2, and ZeRO-3 implementations for improved memory efficiency.
+    - 🎛️ Mixed precision training support for better performance.
+    - 🏋️‍♂️ Large batch training and gradient accumulation support.
+    - 🔄 Reproducible training with deterministic behavior.
+
+3. 🎨 **Fast-LLM is Incredibly Flexible**:
+    - 🤖 Compatible with all common language model architectures in a unified class.
+    - ⚡ Efficient dropless Mixture-of-Experts (MoE) implementation with SoTA performance.
+    - 🧩 Customizable language model architectures, data loaders, loss functions, and optimizers (in progress).
+    - 🤗 Seamless integration with [Hugging Face Transformers][transformers].
+
+4. 🎯 **Fast-LLM is Super Easy to Use**:
+    - 📦 [Pre-built Docker images](https://github.com/ServiceNow/Fast-LLM/pkgs/container/fast-llm) for quick deployment.
+    - 📝 Simple YAML configuration for hassle-free setup.
+    - 💻 Command-line interface for easy launches.
+    - 📊 Detailed logging and real-time monitoring features.
+    - 📚 Extensive [documentation][docs] and practical tutorials (in progress).
+
+5. 🌐 **Fast-LLM is Truly Open Source**:
+    - ⚖️ Licensed under [Apache 2.0][license] for maximum freedom to use Fast-LLM at work, in your projects, or for research.
+    - 💻 Transparently developed on GitHub with public [roadmap][roadmap] and [issue tracking][issues].
+    - 🤝 Contributions and collaboration are always welcome!
 
 ## Usage
 
@@ -71,14 +71,14 @@ Expect to see a significant speedup in training time compared to other libraries
 
 #### Prerequisites
 
--   A [Slurm](https://slurm.schedmd.com/) cluster with at least 4 DGX nodes with 8 A100-80GB or H100-80GB GPUs each.
--   CUDA 12.1 or higher.
--   Dependencies: [PyTorch][pytorch], [Triton][triton], and [Apex](https://github.com/NVIDIA/apex) installed on all nodes.
+- A [Slurm](https://slurm.schedmd.com/) cluster with at least 4 DGX nodes with 8 A100-80GB or H100-80GB GPUs each.
+- CUDA 12.1 or higher.
+- Dependencies: [PyTorch][pytorch], [Triton][triton], and [Apex](https://github.com/NVIDIA/apex) installed on all nodes.
 
 #### Steps
 
-1.   Deploy the [nvcr.io/nvidia/pytorch:24.07-py3](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) Docker image to all nodes (recommended), because it contains all the necessary dependencies.
-2.   Install Fast-LLM on all nodes:
+1. Deploy the [nvcr.io/nvidia/pytorch:24.07-py3](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) Docker image to all nodes (recommended), because it contains all the necessary dependencies.
+2. Install Fast-LLM on all nodes:
 
     ```bash
     sbatch <<EOF
@@ -92,16 +92,16 @@ Expect to see a significant speedup in training time compared to other libraries
     EOF
     ```
 
-3.   Use the example Slurm job script [examples/fast-llm.sbat](examples/fast-llm.sbat) to submit the job to the cluster:
+3. Use the example Slurm job script [examples/fast-llm.sbat](examples/fast-llm.sbat) to submit the job to the cluster:
 
     ```bash
     sbatch examples/fast-llm.sbat
     ```
 
-4.   Monitor the job's progress:
+4. Monitor the job's progress:
 
-    -   Logs: Follow `job_output.log` and `job_error.log` in your working directory for logs.
-    -   Status: Use `squeue -u $USER` to see the job status.
+    - Logs: Follow `job_output.log` and `job_error.log` in your working directory for logs.
+    - Status: Use `squeue -u $USER` to see the job status.
 
 Now, you can sit back and relax while Fast-LLM trains your model at full speed! ☕
 
@@ -109,28 +109,28 @@ Now, you can sit back and relax while Fast-LLM trains your model at full speed!
 
 #### Prerequisites
 
--   A [Kubernetes](https://kubernetes.io/) cluster with at least 4 DGX nodes with 8 A100-80GB or H100-80GB GPUs each.
--   [KubeFlow](https://www.kubeflow.org/) installed.
--   Locked memory limit set to unlimited at the host level on all nodes. Ask your cluster admin to do this if needed.
+- A [Kubernetes](https://kubernetes.io/) cluster with at least 4 DGX nodes with 8 A100-80GB or H100-80GB GPUs each.
+- [KubeFlow](https://www.kubeflow.org/) installed.
+- Locked memory limit set to unlimited at the host level on all nodes. Ask your cluster admin to do this if needed.
 
 #### Steps
 
-1.   Create a Kubernetes [PersistentVolumeClaim](https://kubernetes.io/docs/concepts/storage/persistent-volumes/) (PVC) named `fast-llm-home` that will be mounted to `/home/fast-llm` in the container using [examples/fast-llm-pvc.yaml](examples/fast-llm-pvc.yaml):
+1. Create a Kubernetes [PersistentVolumeClaim](https://kubernetes.io/docs/concepts/storage/persistent-volumes/) (PVC) named `fast-llm-home` that will be mounted to `/home/fast-llm` in the container using [examples/fast-llm-pvc.yaml](examples/fast-llm-pvc.yaml):
 
     ```bash
     kubectl apply -f examples/fast-llm-pvc.yaml
     ```
 
-2.   Create a [PyTorchJob](https://www.kubeflow.org/docs/components/training/user-guides/pytorch/) resource using the example configuration file [examples/fast-llm.pytorchjob.yaml](examples/fast-llm.pytorchjob.yaml):
+2. Create a [PyTorchJob](https://www.kubeflow.org/docs/components/training/user-guides/pytorch/) resource using the example configuration file [examples/fast-llm.pytorchjob.yaml](examples/fast-llm.pytorchjob.yaml):
 
     ```bash
     kubectl apply -f examples/fast-llm.pytorchjob.yaml
     ```
 
-3.   Monitor the job status:
+3. Monitor the job status:
 
-    -   Use `kubectl get pytorchjobs` to see the job status.
-    -   Use `kubectl logs -f fast-llm-master-0 -c pytorch` to follow the logs.
+    - Use `kubectl get pytorchjobs` to see the job status.
+    - Use `kubectl logs -f fast-llm-master-0 -c pytorch` to follow the logs.
 
 That's it! You're now up and running with Fast-LLM on Kubernetes. 🚀
 
diff --git a/SECURITY.md b/SECURITY.md
index 2a6258dc6..e3a80c5b0 100644
--- a/SECURITY.md
+++ b/SECURITY.md
@@ -21,11 +21,11 @@ We will process your report as soon as possible, depending on the severity of yo
 
 Please follow the guidelines below when [disclosing vulnerabilities](https://www.servicenow.com/company/trust/privacy/responsible-disclosure.html):
 
--   Report any potential security issue as soon as possible. ServiceNow will make every effort to quickly resolve the issue.
--   Provide sufficient detail to reproduce the vulnerability, including proof of concept. The use of ReproNow to demonstrate reproducibility is encouraged but not required.
--   Please do not disclose an issue to the public or any third party until ServiceNow has resolved it.
--   Make a good faith effort to avoid privacy violations, data destruction, and interruption or degradation of our services. Only interact with accounts you own or have explicit permission from the account holder to access.
--   Redact any language or images that may identify the program or ServiceNow customers from information about a resolved vulnerability.
--   Do not engage in disruptive testing (such as Denial of Service attacks) or any action that could impact the confidentiality, integrity, or availability of information and systems.
--   Do not engage in social engineering or phishing against customers or employees.
--   Please do not request compensation for time, materials, or discovered vulnerabilities through the Responsible Disclosure Program.
+- Report any potential security issue as soon as possible. ServiceNow will make every effort to quickly resolve the issue.
+- Provide sufficient detail to reproduce the vulnerability, including proof of concept. The use of ReproNow to demonstrate reproducibility is encouraged but not required.
+- Please do not disclose an issue to the public or any third party until ServiceNow has resolved it.
+- Make a good faith effort to avoid privacy violations, data destruction, and interruption or degradation of our services. Only interact with accounts you own or have explicit permission from the account holder to access.
+- Redact any language or images that may identify the program or ServiceNow customers from information about a resolved vulnerability.
+- Do not engage in disruptive testing (such as Denial of Service attacks) or any action that could impact the confidentiality, integrity, or availability of information and systems.
+- Do not engage in social engineering or phishing against customers or employees.
+- Please do not request compensation for time, materials, or discovered vulnerabilities through the Responsible Disclosure Program.
diff --git a/docs/.markdownlint.yaml b/docs/.markdownlint.yaml
new file mode 100644
index 000000000..547681254
--- /dev/null
+++ b/docs/.markdownlint.yaml
@@ -0,0 +1,32 @@
+# See https://github.com/DavidAnson/markdownlint/blob/v0.32.1/schema/.markdownlint.yaml for schema documentation
+
+# Default state for all rules
+default: true
+
+# MD007/ul-indent : Unordered list indentation : https://github.com/DavidAnson/markdownlint/blob/v0.32.1/doc/md007.md
+MD007:
+  # Spaces for indent
+  indent: 4
+
+# MD010/no-hard-tabs : Hard tabs : https://github.com/DavidAnson/markdownlint/blob/v0.32.1/doc/md010.md
+MD010:
+  # Include code blocks
+  code_blocks: false
+  # Fenced code languages to ignore
+  ignore_code_languages: []
+  # Number of spaces for each hard tab
+  spaces_per_tab: 4
+
+# MD013/line-length : Line length : https://github.com/DavidAnson/markdownlint/blob/v0.32.1/doc/md013.md
+MD013: false
+
+# MD030/list-marker-space : Spaces after list markers : https://github.com/DavidAnson/markdownlint/blob/v0.32.1/doc/md030.md
+MD030:
+  # Spaces for single-line unordered list items
+  ul_single: 3
+  # Spaces for single-line ordered list items
+  ol_single: 3
+  # Spaces for multi-line unordered list items
+  ul_multi: 3
+  # Spaces for multi-line ordered list items
+  ol_multi: 3

From dd35c5064f947e4cc09f03cf0ceeb64cf11c3619 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Tue, 5 Nov 2024 15:45:01 -0500
Subject: [PATCH 42/87] wip

---
 docs/.markdownlint.yaml                       |  4 +--
 .../{dev-practices.md => best-practices.md}   |  0
 docs/developers/contribute.md                 |  3 --
 docs/developers/contributing.md               | 19 +++++++++++
 docs/developers/index.md                      |  3 --
 docs/developers/launching.md                  |  0
 docs/developers/pr-title-guidelines.md        | 13 --------
 docs/developers/setup.md                      |  0
 docs/index.md                                 | 24 +++++++-------
 docs/join-us.md                               | 32 ++++++++++++-------
 .../continue-training-llama-8b.md             |  0
 .../{examples => recipes}/data-preparation.md |  0
 docs/{examples => recipes}/train-llama-8b.md  |  0
 .../upcycle-llama-3b-to-moe.md                |  0
 mkdocs.yaml                                   | 12 +++----
 15 files changed, 59 insertions(+), 51 deletions(-)
 rename docs/developers/{dev-practices.md => best-practices.md} (100%)
 delete mode 100644 docs/developers/contribute.md
 create mode 100644 docs/developers/contributing.md
 delete mode 100644 docs/developers/index.md
 delete mode 100644 docs/developers/launching.md
 delete mode 100644 docs/developers/pr-title-guidelines.md
 delete mode 100644 docs/developers/setup.md
 rename docs/{examples => recipes}/continue-training-llama-8b.md (100%)
 rename docs/{examples => recipes}/data-preparation.md (100%)
 rename docs/{examples => recipes}/train-llama-8b.md (100%)
 rename docs/{examples => recipes}/upcycle-llama-3b-to-moe.md (100%)

diff --git a/docs/.markdownlint.yaml b/docs/.markdownlint.yaml
index 547681254..44d5cf913 100644
--- a/docs/.markdownlint.yaml
+++ b/docs/.markdownlint.yaml
@@ -25,8 +25,8 @@ MD030:
   # Spaces for single-line unordered list items
   ul_single: 3
   # Spaces for single-line ordered list items
-  ol_single: 3
+  ol_single: 2
   # Spaces for multi-line unordered list items
   ul_multi: 3
   # Spaces for multi-line ordered list items
-  ol_multi: 3
+  ol_multi: 2
diff --git a/docs/developers/dev-practices.md b/docs/developers/best-practices.md
similarity index 100%
rename from docs/developers/dev-practices.md
rename to docs/developers/best-practices.md
diff --git a/docs/developers/contribute.md b/docs/developers/contribute.md
deleted file mode 100644
index 2adb786ca..000000000
--- a/docs/developers/contribute.md
+++ /dev/null
@@ -1,3 +0,0 @@
-# Contributing to Fast-LLM
-
-Coming soon...
diff --git a/docs/developers/contributing.md b/docs/developers/contributing.md
new file mode 100644
index 000000000..406d80154
--- /dev/null
+++ b/docs/developers/contributing.md
@@ -0,0 +1,19 @@
+# Contributing to Fast-LLM
+
+Coming soon...
+
+## PR Title Guidelines ✏️
+
+Since we squash commits when merging pull requests, the PR title will become the commit message for the squashed commit. To ensure a clear and consistent project history, follow these guidelines for naming your PR:
+
+1.  **Use a concise yet descriptive title**: The title should summarize the key change or feature introduced. Avoid vague titles like "Fix bug" or "Update code."
+2.  **Start with a keyword**: Use keywords to categorize the type of change. For example:
+
+    -   **feat:** for new features (e.g., `[feat] add support for mixed-precision training`)
+    -   **fix:** for bug fixes (e.g., `[fix] resolve memory leak during backpropagation`)
+    -   **perf:** for performance improvements (e.g., `[perf] optimize gradient accumulation step`)
+    -   **refactor:** for code refactoring (e.g., `[refactor] clean up data loader module`)
+    -   **docs:** for documentation changes (e.g., `[docs] update contributing guidelines`)
+    -   **build:** for changes to the build process or dependencies (e.g., `[build] bump PyTorch version`)
+
+3.  **Reference the issue number (if applicable)**: If the PR is related to a specific issue, include the issue number in the title (e.g., `[fix] resolve #123 memory leak in training loop`).
diff --git a/docs/developers/index.md b/docs/developers/index.md
deleted file mode 100644
index 740814854..000000000
--- a/docs/developers/index.md
+++ /dev/null
@@ -1,3 +0,0 @@
-# Developer Guides
-
-* [Contributing](contribute.md): How to contribute to Fast-LLM.
diff --git a/docs/developers/launching.md b/docs/developers/launching.md
deleted file mode 100644
index e69de29bb..000000000
diff --git a/docs/developers/pr-title-guidelines.md b/docs/developers/pr-title-guidelines.md
deleted file mode 100644
index fce109568..000000000
--- a/docs/developers/pr-title-guidelines.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# PR Title Guidelines ✏️
-
-Since we squash commits when merging pull requests, the PR title will become the commit message for the squashed commit. To ensure a clear and consistent project history, follow these guidelines for naming your PR:
-
-1. **Use a concise yet descriptive title**: The title should summarize the key change or feature introduced. Avoid vague titles like "Fix bug" or "Update code."
-2. **Start with a keyword**: Use keywords to categorize the type of change. For example:
-   - **feat:** for new features (e.g., `[feat] add support for mixed-precision training`)
-   - **fix:** for bug fixes (e.g., `[fix] resolve memory leak during backpropagation`)
-   - **perf:** for performance improvements (e.g., `[perf] optimize gradient accumulation step`)
-   - **refactor:** for code refactoring (e.g., `[refactor] clean up data loader module`)
-   - **docs:** for documentation changes (e.g., `[docs] update contributing guidelines`)
-   - **build:** for changes to the build process or dependencies (e.g., `[build] bump PyTorch version`)
-3. **Reference the issue number (if applicable)**: If the PR is related to a specific issue, include the issue number in the title (e.g., `[fix] resolve #123 memory leak in training loop`).
diff --git a/docs/developers/setup.md b/docs/developers/setup.md
deleted file mode 100644
index e69de29bb..000000000
diff --git a/docs/index.md b/docs/index.md
index e4e0ca2e4..1fe48e5f2 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -18,19 +18,19 @@ Fast-LLM isn't just another library, **it's a platform for powering the next gen
 
 -   **🚀 Purpose-Built for Small- and Large-Scale AI:** Optimized specifically for training language models of all sizes, Fast-LLM excels from **small models around 1B parameters to massive clusters running 70B+ parameter models**, with kernels that are fine-tuned for maximum throughput across this entire range. At 10B-parameter scale, Fast-LLM avoids costly 3D-paralelism through memory optimization techniques such as ZeRO and activation recomputation, whereas at 100B-parameter scale, Fast-LLM optimally supports 3D-parallelism; making Fast-LLM the go-to choice for diverse training needs.
 
--   **🧠 Unified Support for GPT-Like Architectures:** Fast-LLM **unifies all GPT-like model implementations** in a [single configuration file](https://github.com/ServiceNow/Fast-LLM/blob/main/fast_llm/models/gpt/model.py), and unlike HuggingFace transformers where every model has it's own, mostly independent, implementation, Fast-LLM reduces coding and adapts effortlessly, even with custom architectures.
+-   **🧠 Unified Support for GPT-Like Architectures:** Fast-LLM **unifies all GPT-like model implementations** in a [single Python file](https://github.com/ServiceNow/Fast-LLM/blob/main/fast_llm/models/gpt/model.py), and unlike HuggingFace transformers where every model has its own, mostly independent, implementation, Fast-LLM reduces coding and adapts effortlessly, even with custom architectures.
 
 -   **💰 Cost Efficiency That Sets Fast-LLM Apart:**
 
-    -   **Lower Training Costs:** With higher throughput per GPU, Fast-LLM reduces the training time required. For instance, training a Mistral-7B model can be up to **xx% cheaper** compared to other frameworks due to faster processing and better memory efficiency.
+    -   **Lower Training Costs:** With higher throughput per GPU, Fast-LLM reduces the training time required. For instance, training models can be cheaper compared to other frameworks due to faster processing and better memory efficiency.
 
-    -   **More Tokens for Your Budget:** Train up to xx% more tokens for the same budget, leading to better-trained models without breaking your financial constraints.
+    -   **More Tokens for Your Budget:** Train on more tokens for the same budget, leading to better-trained models without breaking your financial constraints.
 
-  [Learn more about Fast-LLM's cost efficiency and see detailed comparisons](cost-efficiency).
+    <!-- [Learn more about Fast-LLM's cost efficiency and see detailed comparisons](cost-efficiency.md). -->
 
 -   **🔓 Openness Without Compromise:** Fast-LLM's open-source approach ensures that you can **fully customize and extend the library** to fit your exact needs, without the restrictions of proprietary software. Developed transparently by a community of experts on GitHub, every change is **publicly discussed and vetted**, fostering **trust and collaboration** so you can innovate with confidence, knowing the entire development process and decision making is out in the open.
 
--   **🌍 Community-Driven Development:** Built by professionals for professionals, Fast-LLM's development is transparent, with an open invitation to the community to contribute. [**Join the Fast-LLM community**](join-us) to help shape the future of large-scale AI training.
+-   **🌍 Community-Driven Development:** Built by professionals for professionals, Fast-LLM's development is transparent, with an open invitation to the community to contribute. [**Join the Fast-LLM community**](join-us.md) to help shape the future of large-scale AI training.
 
 ### Key Features
 
@@ -46,7 +46,7 @@ Fast-LLM offers all the capabilities you need to accelerate your LLM training an
 
 -   **🛠️ Professional-Grade Tools:** Enjoy mixed precision training, large batch training, and gradient accumulation. Fast-LLM ensures reproducibility through deterministic behavior and provides pre-built Docker images, YAML configurations, and a simple, intuitive command-line interface.
 
-[Download Fast-LLM](https://github.com/ServiceNow/Fast-LLM/releases) and start training your large language models in record time. [Join the Fast-LLM community](join-us) and collaborate with like-minded professionals to advance the state-of-the-art in AI research and development.
+[Get Fast-LLM](https://github.com/ServiceNow/Fast-LLM/releases) and start training your large language models in record time. [Join the Fast-LLM community](join-us.md) and collaborate with like-minded professionals to advance the state-of-the-art in AI research and development.
 
 ## Use Cases and Success Stories
 
@@ -56,7 +56,7 @@ Fast-LLM powers the world's most advanced AI projects:
 -   **Enterprise AI Solutions:** Accelerate time-to-market for AI products by reducing training costs and enabling faster iteration.
 -   **Academic Collaborations:** Drive AI innovation with high-performance training capabilities that support cutting-edge research in machine learning.
 
-See how Fast-LLM has helped early adopters achieve up to xx% faster results. [Explore use cases and success stories](success-stories/starcoder-2).
+See how Fast-LLM has helped early adopters achieve faster results. [Explore use cases and success stories](success-stories/starcoder-2).
 
 ## Project Scope and Objectives
 
@@ -69,16 +69,16 @@ Fast-LLM is designed to be the **go-to solution** for those training the most so
 
 ## Collaboration and Contribution
 
-As we continue to expand Fast-LLM, we're looking for contributions from the community to help shape its future. We welcome:
+As Fast-LLM evolves, we invite the community to contribute and help shape its future. We welcome:
 
 -   **Testing and Bug Fixes:** Help us identify issues and improve stability.
--   **Feature Development:** Contribute new capabilities, such as custom kernels or support for alternative hardware like AMD and Intel.
--   **Documentation and Tutorials:** Make Fast-LLM more accessible by improving our [documentation](https://servicenow.github.io/Fast-LLM) and writing practical guides.
+-   **Feature Development:** Contribute new models, new training features, and new optimizations.
+-   **Documentation and Tutorials:** Make Fast-LLM more accessible by improving our documentation and writing practical guides.
 
-Fast-LLM is more than just software, it's a community. Get involved by exploring our [contribution guidelines](https://github.com/ServiceNow/Fast-LLM/CONTRIBUTING.md) and engaging with us on [GitHub Discussions](https://github.com/ServiceNow/Fast-LLM/discussions).
+Fast-LLM is more than just software, it's a community. Get involved by exploring our [contribution guidelines](developers/contributing.md) and engaging with us on [GitHub Discussions](https://github.com/ServiceNow/Fast-LLM/discussions).
 
 ## Getting Started
 
-Ready to dive in? Check out our [quick-start guide](quick-start) for an overview of how to set up and run Fast-LLM on different platforms, including [Slurm](https://slurm.schedmd.com) and [Kubernetes](https://kubernetes.io). Explore the [examples](https://github.com/ServiceNow/Fast-LLM/tree/main/examples) for pre-configured setups to help you get started quickly with your own training experiments.
+Ready to dive in? Check out our [quick-start guide](quick-start.md) for an overview of how to set up and run Fast-LLM on different platforms, including [Slurm](https://slurm.schedmd.com) and [Kubernetes](https://kubernetes.io). Explore the [examples](https://github.com/ServiceNow/Fast-LLM/tree/main/examples) for pre-configured setups to help you get started quickly with your own training experiments.
 
 For any questions or issues, open an [issue](https://github.com/ServiceNow/Fast-LLM/issues) or join the [community discussion](https://github.com/ServiceNow/Fast-LLM/discussions).
diff --git a/docs/join-us.md b/docs/join-us.md
index a5992649d..d3214f6b3 100644
--- a/docs/join-us.md
+++ b/docs/join-us.md
@@ -12,25 +12,25 @@ Fast-LLM is an open-source project driven by a community of passionate contribut
 
 Want to keep up with the latest Fast-LLM updates and new opportunities to get involved? **Star** the Fast-LLM repository on GitHub and **watch** the project for notifications on new releases, discussions, and updates. This way, you'll always know what's happening, from new features to community initiatives.
 
-[Star](https://github.com/ServiceNow/Fast-LLM/stargazers) ⭐ and [Watch](https://github.com/ServiceNow/Fast-LLM/subscription) 👀 the Fast-LLM repo on GitHub to stay updated on new releases, discussions, and upcoming features.
+[Star](https://github.com/ServiceNow/Fast-LLM/stargazers) and [Watch](https://github.com/ServiceNow/Fast-LLM/subscription) the Fast-LLM repo on GitHub to stay updated on new releases, discussions, and upcoming features.
 
 ---
 
 ## Code Contributions 🛠
 
-Fast-LLM thrives on collaboration, and we're excited to welcome new contributors! From fixing bugs to adding new features, every code contribution makes a difference. If you're just getting started, our [**Good First Issues**](https://github.com/ServiceNow/Fast-LLM/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22) on GitHub are labeled to help newcomers find approachable tasks. To set up your development environment and get oriented with Fast-LLM, check out our **Developer's Corner** for everything you need:
+Fast-LLM thrives on collaboration, and we're excited to welcome new contributors! From fixing bugs to adding new features, every code contribution makes a difference. If you're just getting started, our [Good First Issues](https://github.com/ServiceNow/Fast-LLM/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22) on GitHub are labeled to help newcomers find approachable tasks. To set up your development environment and get oriented with Fast-LLM, check out our **Developer's Corner** for everything you need:
 
--   [**Contributing**](developers/contributing) – for setup instructions and contributing guidelines
--   [**Best Practices**](developers/best-practices) – for tips on writing clean, maintainable code
+-   [**Contributing**](developers/contributing.md) – for setup instructions and contributing guidelines
+-   [**Best Practices**](developers/best-practices.md) – for tips on writing clean, maintainable code
 
 Here's a quick overview of the process:
 
-1.   **Fork & Clone**: Start by forking the repo and cloning it to your machine.
-2.   **Set Up Your Dev Environment**: The Developer's Corner guides you through configuring your environment for maximum productivity.
-3.   **Write Awesome Code**: Make your changes, document them, and follow our best practices.
-4.   **Open a Pull Request**: Submit a PR to showcase your work and get feedback from our team and the community.
+1.  **Fork & Clone**: Start by forking the repo and cloning it to your machine.
+2.  **Set Up Your Dev Environment**: The Developer's Corner guides you through configuring your environment for maximum productivity.
+3.  **Write Awesome Code**: Make your changes, document them, and follow our best practices.
+4.  **Open a Pull Request**: Submit a PR to showcase your work and get feedback from our team and the community.
 
-[Explore the Developer's Corner for everything you need to get started!](developers)
+Explore our [Developer's Corner](developers/contributing.md) for everything you need to get started!
 
 ---
 
@@ -38,7 +38,7 @@ Here's a quick overview of the process:
 
 Got a great idea? We want to hear it! Whether it's a new feature, an enhancement, or even a moonshot idea, head over to **GitHub Discussions** to share your thoughts. Community feedback drives Fast-LLM's evolution, and your ideas can help shape the future of the project.
 
-[Share your thoughts on GitHub Discussions](https://github.com/ServiceNow/Fast-LLM/discussions)
+Share your thoughts on [GitHub Discussions](https://github.com/ServiceNow/Fast-LLM/discussions).
 
 ---
 
@@ -46,13 +46,13 @@ Got a great idea? We want to hear it! Whether it's a new feature, an enhancement
 
 Your experience with Fast-LLM is invaluable, whether you're running it in production or experimenting at home. We rely on user feedback to find bugs, optimize performance, and improve documentation. Please share any bugs, performance quirks, or gaps you spot with us on GitHub Issues. This kind of feedback strengthens the entire project.
 
-[Report issues and share feedback on GitHub](https://github.com/ServiceNow/Fast-LLM/issues)
+Report issues and share feedback on [GitHub Issues](https://github.com/ServiceNow/Fast-LLM/issues).
 
 ---
 
 ## Help & Support 🤝
 
-Love helping others? Join our **GitHub Discussions** to answer questions, help troubleshoot, or share tips. Fast-LLM is a community, and the more we support each other, the stronger we become. Helping out is a great way to get involved and learn from others too.
+Love helping others? Join our [**GitHub Discussions**](https://github.com/ServiceNow/Fast-LLM/discussions) to answer questions, help troubleshoot, or share tips. Fast-LLM is a community, and the more we support each other, the stronger we become. Helping out is a great way to get involved and learn from others too.
 
 ---
 
@@ -62,4 +62,12 @@ If you're excited about Fast-LLM, let the world know! Share on social media, wri
 
 ---
 
+## Join Our Team 🌟
+
+Excited about contributing on a deeper level? The Foundation Models Lab at ServiceNow is at the forefront of large-scale AI training. We're looking for passionate individuals to push the boundaries of AI development with us. From research developers focusing on GPU optimization to visiting researchers refining our training frameworks, there's a role for everyone. Explore current opportunities and become a key player in shaping the future of AI at ServiceNow.
+
+Check out our [Careers page](https://www.servicenow.com/research/careers.html) for more information.
+
+---
+
 Let's push the boundaries of large-scale AI training together. We're thrilled to have you here. Welcome to the Fast-LLM community!
diff --git a/docs/examples/continue-training-llama-8b.md b/docs/recipes/continue-training-llama-8b.md
similarity index 100%
rename from docs/examples/continue-training-llama-8b.md
rename to docs/recipes/continue-training-llama-8b.md
diff --git a/docs/examples/data-preparation.md b/docs/recipes/data-preparation.md
similarity index 100%
rename from docs/examples/data-preparation.md
rename to docs/recipes/data-preparation.md
diff --git a/docs/examples/train-llama-8b.md b/docs/recipes/train-llama-8b.md
similarity index 100%
rename from docs/examples/train-llama-8b.md
rename to docs/recipes/train-llama-8b.md
diff --git a/docs/examples/upcycle-llama-3b-to-moe.md b/docs/recipes/upcycle-llama-3b-to-moe.md
similarity index 100%
rename from docs/examples/upcycle-llama-3b-to-moe.md
rename to docs/recipes/upcycle-llama-3b-to-moe.md
diff --git a/mkdocs.yaml b/mkdocs.yaml
index c56d5782b..3c607bf39 100644
--- a/mkdocs.yaml
+++ b/mkdocs.yaml
@@ -162,7 +162,7 @@ nav:
   - Welcome: index.md
   - Get Started:
     - Quick Start: quick-start.md
-    - Cost Efficiency: cost-efficiency.md
+    # - Cost Efficiency: cost-efficiency.md
     - Help: help.md
     - In Action:
       - On Slurm: in-action/slurm.md
@@ -170,11 +170,11 @@ nav:
     - Success Stories:
       - StarCoder 2: success-stories/starcoder-2.md
     - License: license.md
-  - Examples:
-    - Data Preparation: examples/data-preparation.md
-    - Train Llama 8B from scratch: examples/train-llama-8b.md
-    - Continue training Llama 8B: examples/continue-training-llama-8b.md
-    - Upcycle Llama 3B to MoE: examples/upcycle-llama-3b-to-moe.md
+  - Recipes:
+    - Data Preparation: recipes/data-preparation.md
+    - Train Llama 8B from scratch: recipes/train-llama-8b.md
+    - Continue training Llama 8B: recipes/continue-training-llama-8b.md
+    - Upcycle Llama 3B to MoE: recipes/upcycle-llama-3b-to-moe.md
   - Reference:
     - Configuration: reference/configuration.md
   - Developers:

From f6e163fc2619abbb3d2549bd3b2b8329809b00e8 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Wed, 6 Nov 2024 10:08:27 -0500
Subject: [PATCH 43/87] wip

---
 docs/in-action/slurm.md |  32 ++--
 docs/quick-start.md     | 334 ++++++++++++++++++++++++++--------------
 docs/refs.bib           |   2 +-
 mkdocs.yaml             |   2 +-
 4 files changed, 234 insertions(+), 136 deletions(-)

diff --git a/docs/in-action/slurm.md b/docs/in-action/slurm.md
index 6af6f6478..d92cb19f4 100644
--- a/docs/in-action/slurm.md
+++ b/docs/in-action/slurm.md
@@ -2,12 +2,12 @@
 title: "Slurm"
 ---
 
-- **Purpose:** These guides cover specific environments and configurations for deploying Fast-LLM in different setups.
-- **Content Organization:**
-  - **in-action/slurm**: Provide detailed instructions on deploying Fast-LLM on a Slurm cluster, covering multi-node setups, configuring Slurm scripts, and managing jobs.
-  - **in-action/kubernetes**: Guide for deploying Fast-LLM using Kubernetes, including creating the appropriate workloads (e.g., Job, Pod, StatefulSet), handling private Docker images, and configuring multi-node training.
-  - **File Single Node Guide Here Too?** If you include a "single-node" guide in this section as well, make it more advanced, focusing on optimizing performance, using different configurations, or tuning settings for different GPU models.
-- **Why It Makes Sense:** Organizing by deployment environment ensures users can quickly find the relevant guide based on their setup. Including both multi-node cluster guides and single-node advanced setups allows users to scale their knowledge.
+-   **Purpose:** These guides cover specific environments and configurations for deploying Fast-LLM in different setups.
+-   **Content Organization:**
+    -   **in-action/slurm**: Provide detailed instructions on deploying Fast-LLM on a Slurm cluster, covering multi-node setups, configuring Slurm scripts, and managing jobs.
+    -   **in-action/kubernetes**: Guide for deploying Fast-LLM using Kubernetes, including creating the appropriate workloads (e.g., Job, Pod, StatefulSet), handling private Docker images, and configuring multi-node training.
+    -   **File Single Node Guide Here Too?** If you include a "single-node" guide in this section as well, make it more advanced, focusing on optimizing performance, using different configurations, or tuning settings for different GPU models.
+-   **Why It Makes Sense:** Organizing by deployment environment ensures users can quickly find the relevant guide based on their setup. Including both multi-node cluster guides and single-node advanced setups allows users to scale their knowledge.
 
 ---
 
@@ -24,14 +24,14 @@ Expect to see a significant speedup in training time compared to other libraries
 
 #### Prerequisites
 
-- A [Slurm](https://slurm.schedmd.com/) cluster with at least 4 DGX nodes with 8 A100-80GB or H100-80GB GPUs each.
-- CUDA 12.1 or higher.
-- Dependencies: [PyTorch][pytorch], [Triton][triton], and [Apex](https://github.com/NVIDIA/apex) installed on all nodes.
+-   A [Slurm](https://slurm.schedmd.com/) cluster with at least 4 DGX nodes with 8 A100-80GB or H100-80GB GPUs each.
+-   CUDA 12.1 or higher.
+-   Dependencies: [PyTorch][pytorch], [Triton][triton], and [Apex](https://github.com/NVIDIA/apex) installed on all nodes.
 
 #### Steps
 
-1. Deploy the [nvcr.io/nvidia/pytorch:24.07-py3](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) Docker image to all nodes (recommended), because it contains all the necessary dependencies.
-2. Install Fast-LLM on all nodes:
+1.  Deploy the [nvcr.io/nvidia/pytorch:24.07-py3](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) Docker image to all nodes (recommended), because it contains all the necessary dependencies.
+2.  Install Fast-LLM on all nodes:
 
     ```bash
     sbatch <<EOF
@@ -45,19 +45,17 @@ Expect to see a significant speedup in training time compared to other libraries
     EOF
     ```
 
-3. Use the example Slurm job script [examples/fast-llm.sbat](examples/fast-llm.sbat) to submit the job to the cluster:
+3.  Use the example Slurm job script [examples/fast-llm.sbat](examples/fast-llm.sbat) to submit the job to the cluster:
 
     ```bash
     sbatch examples/fast-llm.sbat
     ```
 
-4. Monitor the job's progress:
+4.  Monitor the job's progress:
 
-    - Logs: Follow `job_output.log` and `job_error.log` in your working directory for logs.
-    - Status: Use `squeue -u $USER` to see the job status.
+    -   Logs: Follow `job_output.log` and `job_error.log` in your working directory for logs.
+    -   Status: Use `squeue -u $USER` to see the job status.
 
 Now, you can sit back and relax while Fast-LLM trains your model at full speed! ☕
 
-
 ### Running Fast-LLM on a Slurm Cluster with Docker
-
diff --git a/docs/quick-start.md b/docs/quick-start.md
index 213f810a1..1f37a8165 100644
--- a/docs/quick-start.md
+++ b/docs/quick-start.md
@@ -7,10 +7,10 @@ This guide will get you up and running with Fast-LLM on a single machine. Let's
 You'll need:
 
 -   At least one NVIDIA GPU on your machine. We recommend 8 A100s or higher for this tutorial 🤑
--   Docker (installed and running)
+-   Docker (installed and running). You can run without Docker, but we don't recommend it. 🐳
 -   Some patience for the initial setup and training 😊
 
-## Step 1: Pull the Fast-LLM Docker Image
+## Step 1: Pull the Fast-LLM Docker Image 🐳
 
 To start, grab the pre-built Fast-LLM Docker image:
 
@@ -18,6 +18,18 @@ To start, grab the pre-built Fast-LLM Docker image:
 docker pull ghcr.io/servicenow/fast-llm:latest
 ```
 
+This image contains everything you need to train LLMs with Fast-LLM.
+
+!!! info
+
+    Installing Fast-LLM from source is also an option:
+
+    ```sh
+    pip install --no-build-isolation "git+https://github.com/ServiceNow/Fast-LLM.git#egg=fast_llm[CORE,OPTIONAL,DEV]"
+    ```
+
+    However, we recommend the Docker image for simplicity and reproducibility.
+
 ## Step 2: Set Up Directories for Your Inputs and Outputs
 
 Let's create folders to store our input data and output results:
@@ -26,145 +38,219 @@ Let's create folders to store our input data and output results:
 mkdir ~/inputs ~/results
 ```
 
-## Step 3: Choose Your Model
-
-Fast-LLM supports many GPT variants, including (but not limited to) GPT-2, Llama, Mistral, and Qwen. For this tutorial, let's train the GPT-2 model from scratch with Fully Sharded Data Parallelism (FSDP). We'll grab a configuration file from Huggingface Hub and save it as `~/inputs/config.json`:
+## Step 3: Choose Your Model 🤖
 
-=== "GPT-2 (124M)"
+Fast-LLM supports many GPT variants, including (but not limited to) Llama, Mistral, and Mixtral. For this tutorial, let's train a Llama model with data parallelism. You can choose from two models:
 
-    ```bash
-    curl -O https://huggingface.co/openai-community/gpt2/resolve/main/config.json
-    ```
+=== "SmolLM-135M"
 
-=== "GPT-2 XL (1558M)"
+    SmolLM is a smaller, more manageable model with 135M parameters. It's perfect for testing and getting familiar with Fast-LLM. We'll grab its configuration file from the Hugging Face Hub and save it as `~/inputs/config.json`:
 
     ```bash
-    curl -O https://huggingface.co/openai-community/gpt2-xl/resolve/main/config.json
+    curl -L -O https://huggingface.co/HuggingFaceTB/SmolLM-135M/resolve/main/config.json
+    mv config.json ~/inputs
     ```
 
-=== "Llama-3.2-3B-Instruct"
+=== "Llama-3.2-1B"
 
-    ```bash
-    curl -O https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct/resolve/main/config.json
-    ```
-
-=== "Qwen2.5-3B-Instruct"
+    Llama-3.2-1B is a larger model with 1B parameters. It's more powerful but requires more resources to train. We'll grab the model from the Hugging Face Hub and save it to `~/inputs`:
 
     ```bash
-    curl -O https://huggingface.co/Qwen/Qwen2.5-3B-Instruct/resolve/main/config.json
+    git lfs install
+    git clone https://huggingface.co/meta-llama/Llama-3.2-1B ~/inputs
     ```
 
 !!! tip "Model Size Matters"
 
-    Smaller models like GPT-2 (124M) will train relatively quickly, especially if you've only got a few GPUs. But if you're feeling adventurous (and patient), give the larger models a shot!
+    Smaller models like SmolLM-135M will train relatively quickly, especially if you've only got a few GPUs. But if you're feeling adventurous (and patient), give the larger Llama-3.2-1B a shot!
 
-## Step 4: Preparing the Training Data
+## Step 4: Preparing the Training Data 📚
 
-For this tutorial, we'll use 9B tokens of text from the [OpenWebText](https://skylion007.github.io/OpenWebTextCorpus/) dataset. This dataset is a free approximation of the WebText data OpenAI used for GPT-2, and it's perfect for our setup!
+For this tutorial, we'll use 9B tokens of text from the [OpenWebText](https://skylion007.github.io/OpenWebTextCorpus/) dataset. This dataset is a free approximation of the WebText data OpenAI used for GPT-2, and it's perfect for our test run!
 
 We've got a script that'll download and preprocess the dataset for you. Run it like this:
 
-```bash
-docker run -it --rm ghcr.io/servicenow/fast-llm:latest \
-    -v ~/inputs:/app/inputs \
-    python tools/prepare_dataset.py \
-    tokenizer_path_or_name="gpt2" \
-    dataset_name_or_path="openwebtext" \
-    dataset_split="train" \
-    output_dir="inputs" \
-    num_processes_load=4 \
-    num_processes_map=4 \
-    num_processes_save=4 \
-    num_tokens_per_shard=100000000
-```
+=== "SmolLM-135M"
+
+    ```bash
+    docker run -it --rm -v ~/inputs:/app/inputs \
+        ghcr.io/servicenow/fast-llm:latest \
+        python tools/prepare_dataset.py \
+        tokenizer_path_or_name="HuggingFaceTB/SmolLM-135M" \
+        dataset_name_or_path="openwebtext" \
+        dataset_split="train" \
+        output_dir="inputs" \
+        num_processes_load=4 \
+        num_processes_map=4 \
+        num_processes_save=4 \
+        num_tokens_per_shard=100000000
+    ```
+
+=== "Llama-3.2-1B"
+
+    ```bash
+    docker run -it --rm -v ~/inputs:/app/inputs \
+        ghcr.io/servicenow/fast-llm:latest \
+        python tools/prepare_dataset.py \
+        tokenizer_path_or_name="meta-llama/Llama-3.2-1B" \
+        dataset_name_or_path="openwebtext" \
+        dataset_split="train" \
+        output_dir="inputs" \
+        num_processes_load=4 \
+        num_processes_map=4 \
+        num_processes_save=4 \
+        num_tokens_per_shard=100000000
+    ```
 
 !!! info "What's Happening Here?"
 
-    This will grab the OpenWebText data, tokenize it with the GPT-2 tokenizer, and save it in 91 shards of 100M tokens each. Expect around 2 hours for the whole thing to finish, mainly due to tokenization. If you've got more CPU cores, try upping `num_processes_*` to speed things up.
+    This will grab the OpenWebText data, tokenize it, and save it in 91 shards of 100M tokens each. Expect around 2 hours for the whole thing to finish, mainly due to tokenization. If you've got more CPU cores, try upping `num_processes_*` to speed things up.
 
-!!! warning "Tokenizer Mismatch"
+!!! tip "Use a Smaller Dataset for Testing"
 
-    If you chose a different model in Step 3, make sure to adjust the `tokenizer_path_or_name` parameter to match the model's tokenizer.
+    Since we're just testing things out, we can also use a smaller dataset. Replace `openwebtext` with `stas/openwebtext-10k` to use a small subset representing the first 10K records from the original dataset. This will speed up the process and let you see how things work without waiting for hours.
 
-## Step 5: Set Up Your Training Configuration
+## Step 5: Set Up Your Training Configuration ⚙️
 
 Next, we'll create a configuration file for Fast-LLM. Save the following as `~/inputs/fast-llm-config.yaml`:
 
-```yaml
-training:
-  train_iters: 600_000  # (1)!
-  logs:
-    interval: 10
-  validation:
-    iterations: 25
-    interval: 1000
-  checkpoint:
-    interval: 1000
-    keep_latest: 5
-  test_iters: 0
-  export:
-    format: huggingface
-    interval: 20_000
-  wandb:
-    project_name: fast-llm
-    entity_name: servicenow  # (2)!
-    tags: quick-start
-    alert:
-      interval: 1000
-batch:
-  micro_batch_size: 1  # (3)!
-  sequence_length: 1024  # (4)!
-  batch_size: 480  # (5)!
-data:
-  format: file
-  path: /app/inputs/fast_llm_dataset.json  # (6)!
-  split: [998, 2, 0]  # (7)!
-optimizer:
-  weight_decay: 0.1  # (8)!
-  beta_1: 0.9  # (9)!
-  beta_2: 0.95  # (10)!
-  learning_rate:
-    base: 6.0e-04  # (11)!
-    minimum: 6.0e-05  # (12)!
-    decay_style: cosine  # (13)!
-    decay_iterations: 600_000  # (14)!
-    warmup_iterations: 2000  # (15)!
-pretrained:
-  format: huggingface
-  path: /app/inputs  # (16)!
-  load_weights: False  # (176)!
-model:
-  multi_stage:
-    zero_stage: 2
-  distributed:
-    training_dtype: bf16
-run:
-  experiment_dir: /app/results
-```
+=== "SmolLM-135M"
+
+    ```yaml
+    training:
+      train_iters: 600_000  # (1)!
+      logs:
+        interval: 10
+      validation:
+        iterations: 25
+        interval: 1000
+      checkpoint:
+        interval: 1000
+        keep: 5
+      test_iters: 0
+      export: # (2)!
+        format: llama
+        interval: 20_000
+      wandb: # (3)!
+        project_name: fast-llm
+        entity_name: servicenow
+        tags: quick-start
+    batch:
+      micro_batch_size: 1  # (4)!
+      sequence_length: 1024
+      batch_size: 480  # (5)!
+    data:
+      format: file
+      path: /app/inputs/fast_llm_dataset.json  # (6)!
+      split: [99, 1, 0]  # (7)!
+    optimizer: # (8)!
+      weight_decay: 0.1
+      beta_1: 0.9
+      beta_2: 0.95
+      learning_rate: # (9)!
+        base: 6.0e-04
+        minimum: 6.0e-05
+        decay_style: cosine
+        decay_iterations: 600_000
+        warmup_iterations: 2000
+    pretrained:
+      format: llama  # (10)!
+      path: /app/inputs
+      load_weights: no  # (11)!
+    model:
+      multi_stage:
+        zero_stage: null  # (12)!
+      distributed:
+        training_dtype: bf16  # (13)!
+    run:
+      experiment_dir: /app/results
+    ```
+
+    1.  Total number of training tokens will be approximately 300B.
+    2.  A Llama model will be saved in Hugging Face format to the `~/results` directory every 20,000 iterations.
+    3.  Entirely optional, but it's a good idea to track your training progress with Weights & Biases. Replace `servicenow` with your own W&B entity name. If you don't want to use W&B, just remove this section.
+    4.  Adjust the number of sequences per GPU based on GPU memory. For SmolLM-135M and an A100-80GB, a `micro_batch_size` of 1 should work well.
+    5.  Must be divisible by the number of GPUs and the `micro_batch_size`. At 1024 tokens per sequence, 480 corresponds to about 500,000 tokens per batch.
+    6.  Location of the dataset metadata file generated in Step 4.
+    7.  99% train, 1% validation, 0% test. These settings need to be adjusted based on the size of your dataset. If you're using a smaller dataset, you need to increase the validation split.
+    8.  These are good default optimizer settings for training models.
+    9.  We are using a cosine decay schedule with linear warmup. After reaching the peak learning rate `base` at `warmup_iterations`, the learning rate will decay to `minimum` at `decay_iterations`, following a cosine curve. The minimum learning rate should be 1/10th of the base learning rate per Chinchilla.
+    10.  Format of the pretrained model. Since SmolLM is a Llama model, we set this to `llama`.
+    11.  We'll train SmolLM-135M from scratch. You can set this to `yes` to continue training from a checkpoint (if you put one in `~/inputs`).
+    12.  We're not using ZeRO for this tutorial, so we set `zero_stage` to `null`. You can set this to `1`, `2`, or `3` for ZeRO-1, ZeRO-2, or ZeRO-3, respectively.
+    13.  `bf16` is supported on Ampere GPUs and higher. Fast-LLM also supports `fp16`.
+
+=== "Llama-3.2-1B"
+
+    ```yaml
+    training:
+      train_iters: 600_000  # (1)!
+      logs:
+        interval: 10
+      validation:
+        iterations: 25
+        interval: 1000
+      checkpoint:
+        interval: 1000
+        keep: 5
+      test_iters: 0
+      export:  # (2)!
+        format: llama
+        interval: 20_000
+      wandb:  # (3)!
+        project_name: fast-llm
+        entity_name: servicenow
+        tags: quick-start
+    batch:
+      micro_batch_size: 1  # (4)!
+      sequence_length: 1024
+      batch_size: 480  # (5)!
+    data:
+      format: file
+      path: /app/inputs/fast_llm_dataset.json  # (6)!
+      split: [99, 1, 0]  # (7)!
+    optimizer: # (8)!
+      weight_decay: 0.1
+      beta_1: 0.9
+      beta_2: 0.95
+      learning_rate:  # (9)!
+        base: 6.0e-04
+        minimum: 6.0e-05
+        decay_style: cosine
+        decay_iterations: 600_000
+        warmup_iterations: 2000
+    pretrained:
+      format: llama  # (10)!
+      path: /app/inputs
+      load_weights: yes  # (11)!
+    model:
+      multi_stage:
+        zero_stage: null  # (12)!
+      distributed:
+        training_dtype: bf16  # (13)!
+    run:
+      experiment_dir: /app/results
+    ```
+
+    1.  Total number of training tokens will be approximately 300B.
+    2.  A Llama model will be saved in Hugging Face format to the `~/results` directory every 20,000 iterations.
+    3.  Entirely optional, but it's a good idea to track your training progress with Weights & Biases. Replace `servicenow` with your own W&B entity name. If you don't want to use W&B, just remove this section.
+    4.  Adjust the number of sequences per GPU based on GPU memory. For Llama-3.2-1B and an A100-80GB, a `micro_batch_size` of 1 should work well.
+    5.  Must be divisible by the number of GPUs and the `micro_batch_size`. At 1024 tokens per sequence, 480 corresponds to about 500,000 tokens per batch.
+    6.  Location of the dataset metadata file generated in Step 4.
+    7.  99% train, 1% validation, 0% test. These settings need to be adjusted based on the size of your dataset. If you're using a smaller dataset, you need to increase the validation split.
+    8.  These are good default optimizer settings for training models.
+    9.  We are using a cosine decay schedule with linear warmup. After reaching the peak learning rate `base` at `warmup_iterations`, the learning rate will decay to `minimum` at `decay_iterations`, following a cosine curve. The minimum learning rate should be 1/10th of the base learning rate per Chinchilla.
+    10.  Format of the pretrained model. Since it's a Llama model, we set this to `llama`.
+    11.  We want to continue training Llama-3.2-1B from a checkpoint. If you're training from scratch, set this to `no`.
+    12.  We're not using ZeRO for this tutorial, so we set `zero_stage` to `null`. You can set this to `1`, `2`, or `3` for ZeRO-1, ZeRO-2, or ZeRO-3, respectively.
+    13.  `bf16` is supported on Ampere GPUs and higher. Fast-LLM also supports `fp16`.
+
+## (Optional) Step 6: Add Your Weights & Biases API Key 🔑
+
+If you included the W&B section in your configuration, you'll need to add your API key. Save your W&B API key to `~/inputs/.wandb_api_key` so Fast-LLM can track your training progress there. You can create a free W&B account if you don't already have one.
 
-1.   Total number of training tokens will be ~300B.
-2.   Replace `servicenow` with your own W&B entity name.
-3.   Adjust based on GPU memory. For GPT-2 and an A100-80GB, a `micro_batch_size` of 1 should work well.
-4.   Should be a power of 2 and divisible by 8. For an A100-80GB, 1024 is a good starting point.
-5.   Must be divisible by number of GPUs. At 1024 tokens per sequence, 480 corresponds to about ~500k tokens per batch.
-6.   Location of the dataset metadata file generated in Step 4.
-7.   99.8% train, 0.2% validation, 0% test.
-8.   L2 regularization penalty.
-9.   1st Adam optimizer parameter.
-10.   2nd Adam optimizer parameter.
-11.   Peak learning rate.
-12.   Should be 1/10th of base per Chinchilla.
-13.   Cosine decay starting at `base` after warmup and ending at `minimum` after `decay_iterations`.
-14.   Usually the same as `train_iters`.
-15.   Number of steps of linear warmup.
-16.   Location of the `config.json` file downloaded in Step 4.
-17.   Set to `False` to train from scratch.
-
-## Step 6: Add Your Weights & Biases API Key
-
-Save your Weights & Biases API key to `~/inputs/.wandb_api_key` so Fast-LLM can track your training progress there. You can create a free W&B account if you don't already have one.
-
-## Step 7: Launch Training
+## Step 7: Launch Training 🚀
 
 Alright, the big moment! If you're on an 8-GPU machine, run the following to kick off training:
 
@@ -178,15 +264,29 @@ docker run --gpus all -it --rm ghcr.io/servicenow/fast-llm:latest \
     fast-llm train gpt --config /app/inputs/fast-llm-config.yaml
 ```
 
-!!! note "Python Hash Seed"
+!!! warning "Python Hash Seed"
+
+    Setting the Python hash seed to 0 ensures consistent, reproducible ordering in hash-dependent operations across processes. Training will fail if this isn't set.
+
+## Step 8: Track Training Progress 📊
+
+Fast-LLM will log training progress to the console every 10 iterations. You can expect to see the following throughput:
+
+=== "SmolLM-135M"
 
-    Setting the Python hash seed to 0 ensures consistent, reproducible ordering in hash-dependent operations across processes, which is crucial for parallel computations.
+    | Metric              | A100         | H100         |
+    |---------------------|-------------:|-------------:|
+    | Tokens/s            | 1,234,567    | 1,456,789    |
+    | TFLOPS              | 312          | 512          |
 
-Expect training to run for a few days (for a full 300B tokens). Keep an eye on the validation loss. You should see it drop as the model learns.
+=== "Llama-3.2-1B"
 
-## Tracking Your Progress with W&B 📊
+    | Metric              | A100         | H100         |
+    |---------------------|-------------:|-------------:|
+    | Tokens/s            | 1,234,567    | 1,456,789    |
+    | TFLOPS              | 312          | 512          |
 
-With Weights & Biases, you'll see the loss curve, training metrics, and more. If you follow this whole training setup, you should see the validation loss approaching the ballpark of ~2.85 (similar to the original GPT-2 model finetuned on OpenWebText).
+If you included the W&B section in your configuration, you can also track your training progress on the Weights & Biases dashboard. Follow the link in the console output to view your training run.
 
 ## Troubleshooting Basics 🛠️
 
diff --git a/docs/refs.bib b/docs/refs.bib
index 49b80c424..605cd6308 100644
--- a/docs/refs.bib
+++ b/docs/refs.bib
@@ -10,4 +10,4 @@ @article{lozhkov2024starcoder
   author={Lozhkov, Anton and Li, Raymond and Allal, Loubna Ben and Cassano, Federico and Lamy-Poirier, Joel and Tazi, Nouamane and Tang, Ao and Pykhtar, Dmytro and Liu, Jiawei and Wei, Yuxiang and others},
   journal={arXiv preprint arXiv:2402.19173},
   year={2024}
-}
\ No newline at end of file
+}
diff --git a/mkdocs.yaml b/mkdocs.yaml
index 3c607bf39..5fca92123 100644
--- a/mkdocs.yaml
+++ b/mkdocs.yaml
@@ -30,7 +30,7 @@ theme:
     - content.code.copy
     # - content.code.select
     # - content.footnote.tooltips
-    # - content.tabs.link
+    - content.tabs.link
     - content.tooltips
     # - header.autohide
     # - navigation.expand

From 8acb4f383763a1671e8ff3f510306d2fd4e6b706 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Wed, 6 Nov 2024 22:54:23 -0500
Subject: [PATCH 44/87] wip

---
 docs/cost-efficiency.md      |  98 -----
 docs/help.md                 |  10 +-
 docs/in-action/kubernetes.md |  51 ---
 docs/in-action/slurm.md      |  61 ---
 docs/quick-start.md          | 781 +++++++++++++++++++++++++++++++----
 mkdocs.yaml                  |   4 -
 6 files changed, 703 insertions(+), 302 deletions(-)
 delete mode 100644 docs/cost-efficiency.md
 delete mode 100644 docs/in-action/kubernetes.md
 delete mode 100644 docs/in-action/slurm.md

diff --git a/docs/cost-efficiency.md b/docs/cost-efficiency.md
deleted file mode 100644
index 169f7da45..000000000
--- a/docs/cost-efficiency.md
+++ /dev/null
@@ -1,98 +0,0 @@
----
-title: Cost Efficiency Comparison
----
-
-Fast-LLM is built for speed and scalability to minimize training costs. Its advanced parallelism techniques, memory-efficient implementations, and kernel optimizations enable significant cost savings compared to other training frameworks. Below, we present a detailed comparison of training costs for different model configurations and cluster sizes, demonstrating how Fast-LLM delivers more value for your budget.
-
-## Comparing Training Costs Across Frameworks
-
-To showcase the cost-saving potential of Fast-LLM, we've compared the cost of training a language model across various frameworks for different scenarios. For these calculations, we assume a cost of **USD 2.50 per H100 GPU per hour**.
-
-!!! note "Disclaimer"
-
-    All comparisons were conducted with identical model configurations and training setups across frameworks to maintain fairness. We optimized training parameters within each framework to achieve the best possible performance. Detailed configuration files are available in the footnotes for reference. If you have questions about our methods, assumptions, or suggestions for enhancing performance on any framework, please contact us at [fast-llm-team@servicenow.com](mailto:fast-llm-team@servicenow.com).
-
-### Scenario Comparison: Training Costs and Token Efficiency
-
-The tables below provide a comparison of training costs for three different model setups, including costs for training on **1 trillion tokens** and the total tokens trained within a **$100,000 budget**.
-
-#### 1B Model on 1 DGX Node (8 H100s)
-
-| Framework                                  | Training Throughput (tokens/s/GPU) | Cost to Train 1T Tokens (USD)  | Tokens Trained for $100k (Billion)  |
-|:-------------------------------------------|-----------------------------------:|-------------------------------:|------------------------------------:|
-| **Fast-LLM**[^fast-llm-1b]                 | [PLACEHOLDER]                      | **[PLACEHOLDER]**              | **[PLACEHOLDER]**                   |
-| NVIDIA Megatron[^megatron-1b]              | [PLACEHOLDER]                      | [PLACEHOLDER]                  | [PLACEHOLDER]                       |
-| MosaicML Composer[^mosaic-1b]              | [PLACEHOLDER]                      | [PLACEHOLDER]                  | [PLACEHOLDER]                       |
-| Hugging Face Transformers[^huggingface-1b] | [PLACEHOLDER]                      | [PLACEHOLDER]                  | [PLACEHOLDER]                       |
-| Meta Lingua[^metaligua-1b]                 | [PLACEHOLDER]                      | [PLACEHOLDER]                  | [PLACEHOLDER]                       |
-
-#### 8B Model on 4 DGX Nodes (32 H100s)
-
-| Framework                                  | Training Throughput (tokens/s/GPU) | Cost to Train 1T Tokens (USD)  | Tokens Trained for $100k (Billion)  |
-|:-------------------------------------------|-----------------------------------:|-------------------------------:|------------------------------------:|
-| **Fast-LLM**[^fast-llm-8b]                 | [PLACEHOLDER]                      | **[PLACEHOLDER]**              | **[PLACEHOLDER]**                   |
-| NVIDIA Megatron[^megatron-8b]              | [PLACEHOLDER]                      | [PLACEHOLDER]                  | [PLACEHOLDER]                       |
-| MosaicML Composer[^mosaic-8b]              | [PLACEHOLDER]                      | [PLACEHOLDER]                  | [PLACEHOLDER]                       |
-| Hugging Face Transformers[^huggingface-8b] | [PLACEHOLDER]                      | [PLACEHOLDER]                  | [PLACEHOLDER]                       |
-| Meta Lingua[^metaligua-8b]                 | [PLACEHOLDER]                      | [PLACEHOLDER]                  | [PLACEHOLDER]                       |
-
-#### Mixtral-8x7B Model on 16 DGX Nodes (128 H100s)
-
-| Framework                                       | Training Throughput (tokens/s/GPU) | Cost to Train 1T Tokens (USD)  | Tokens Trained for $100k (Billion)  |
-|:------------------------------------------------|-----------------------------------:|-------------------------------:|------------------------------------:|
-| **Fast-LLM**[^fast-llm-mixtral]                 | [PLACEHOLDER]                      | **[PLACEHOLDER]**              | **[PLACEHOLDER]**                   |
-| NVIDIA Megatron[^megatron-mixtral]              | [PLACEHOLDER]                      | [PLACEHOLDER]                  | [PLACEHOLDER]                       |
-| MosaicML Composer[^mosaic-mixtral]              | [PLACEHOLDER]                      | [PLACEHOLDER]                  | [PLACEHOLDER]                       |
-| Hugging Face Transformers[^huggingface-mixtral] | [PLACEHOLDER]                      | [PLACEHOLDER]                  | [PLACEHOLDER]                       |
-| Meta Lingua[^metaligua-mixtral]                 | not supported                      | not supported                  | not supported                       |
-
-### Key Takeaways
-
--   **Cost efficiency at all scales:** Fast-LLM consistently achieves lower training costs due to its advanced parallelism and memory efficiency, delivering value across various model sizes and hardware configurations.
--   **Superior token throughput:** By processing more tokens per second per GPU than other frameworks, Fast-LLM maximizes token efficiency, leading to substantial savings, particularly for longer training durations or larger GPU clusters.
--   **Optimized for large-scale training:** Fast-LLM's design allows it to scale effectively as model size and training setups expand, ensuring that the benefits of its optimizations grow with the size of the deployment.
-
-[^fast-llm-1b]:
-    Testing conducted in [Month, Year] using 8 NVIDIA H100 SXM5 80 GB GPUs in 1 DGX node connected with 3200 Gbps Infiniband. Fast-LLM version [VERSION/COMMIT HASH], CUDA version [VERSION]. Training was performed on randomly generated data. Configuration file: [Link to config file].
-
-[^megatron-1b]:
-    Testing conducted in [Month, Year] using 8 NVIDIA H100 SXM5 80 GB GPUs in 1 DGX node connected with 3200 Gbps Infiniband. NVIDIA Megatron version [VERSION], CUDA version [VERSION]. Training was performed on randomly generated data. Configuration file: [Link to config file].
-
-[^mosaic-1b]:
-    Testing conducted in [Month, Year] using 8 NVIDIA H100 SXM5 80 GB GPUs in 1 DGX node connected with 3200 Gbps Infiniband. MosaicML Composer version [VERSION], CUDA version [VERSION]. Training was performed on randomly generated data. Configuration file: [Link to config file].
-
-[^huggingface-1b]:
-    Testing conducted in [Month, Year] using 8 NVIDIA H100 SXM5 80 GB GPUs in 1 DGX node connected with 3200 Gbps Infiniband. Hugging Face Transformers version [VERSION], CUDA version [VERSION]. Training was performed on randomly generated data. Configuration file: [Link to config file].
-
-[^metaligua-1b]:
-    Testing conducted in [Month, Year] using 8 NVIDIA H100 SXM5 80 GB GPUs in 1 DGX node connected with 3200 Gbps Infiniband. Meta Lingua version [VERSION], CUDA version [VERSION]. Training was performed on randomly generated data. Configuration file: [Link to config file].
-
-[^fast-llm-8b]:
-    Testing conducted in [Month, Year] using 32 NVIDIA H100 SXM5 80 GB GPUs across 4 DGX nodes connected with 3200 Gbps Infiniband. Fast-LLM version [VERSION/COMMIT HASH], CUDA version [VERSION]. Training was performed on randomly generated data. Configuration file: [Link to config file].
-
-[^megatron-8b]:
-    Testing conducted in [Month, Year] using 32 NVIDIA H100 SXM5 80 GB GPUs across 4 DGX nodes connected with 3200 Gbps Infiniband. NVIDIA Megatron version [VERSION], CUDA version [VERSION]. Training was performed on randomly generated data. Configuration file: [Link to config file].
-
-[^mosaic-8b]:
-    Testing conducted in [Month, Year] using 32 NVIDIA H100 SXM5 80 GB GPUs across 4 DGX nodes connected with 3200 Gbps Infiniband. MosaicML Composer version [VERSION], CUDA version [VERSION]. Training was performed on randomly generated data. Configuration file: [Link to config file].
-
-[^huggingface-8b]:
-    Testing conducted in [Month, Year] using 32 NVIDIA H100 SXM5 80 GB GPUs across 4 DGX nodes connected with 3200 Gbps Infiniband. Hugging Face Transformers version [VERSION], CUDA version [VERSION]. Training was performed on randomly generated data. Configuration file: [Link to config file].
-
-[^metaligua-8b]:
-    Testing conducted in [Month, Year] using 32 NVIDIA H100 SXM5 80 GB GPUs across 4 DGX nodes connected with 3200 Gbps Infiniband. Meta Lingua version [VERSION], CUDA version [VERSION]. Training was performed on randomly generated data. Configuration file: [Link to config file].
-
-[^fast-llm-mixtral]:
-    Testing conducted in [Month, Year] using 128 NVIDIA H100 SXM5 80 GB GPUs across 16 DGX nodes connected with 3200 Gbps Infiniband. Fast-LLM version [VERSION/COMMIT HASH], CUDA version [VERSION]. Training was performed on randomly generated data. Configuration file: [Link to config file].
-
-[^megatron-mixtral]:
-    Testing conducted in [Month, Year] using 128 NVIDIA H100 SXM5 80 GB GPUs across 16 DGX nodes connected with 3200 Gbps Infiniband. NVIDIA Megatron version [VERSION], CUDA version [VERSION]. Training was performed on randomly generated data. Configuration file: [Link to config file].
-
-[^mosaic-mixtral]:
-    Testing conducted in [Month, Year] using 128 NVIDIA H100 SXM5 80 GB GPUs across 16 DGX nodes connected with 3200 Gbps Infiniband. MosaicML Composer version [VERSION], CUDA version [VERSION]. Training was performed on randomly generated data. Configuration file: [Link to config file].
-
-[^huggingface-mixtral]:
-    Testing conducted in [Month, Year] using 128 NVIDIA H100 SXM5 80 GB GPUs across 16 DGX nodes connected with 3200 Gbps Infiniband. Hugging Face Transformers version [VERSION], CUDA version [VERSION]. Training was performed on randomly generated data. Configuration file: [Link to config file].
-
-[^metaligua-mixtral]:
-    In [Month, Year], Meta Lingua did not support training this configuration.
diff --git a/docs/help.md b/docs/help.md
index 5b3fee3ec..0177dcddb 100644
--- a/docs/help.md
+++ b/docs/help.md
@@ -36,7 +36,7 @@ For more detailed solutions, check out our GitHub Issues page. Odds are someone'
 
 ## Reference 📚
 
-If you're the type who loves configurations and tweaking every detail, the [**Configuration Reference**](reference/configuration) is for you. It covers every config option you could imagine. From optimizer settings to batch sizes to distributed training parameters. It's all in there.
+If you're the type who loves configurations and tweaking every detail, the [**Configuration Reference**](reference/configuration.md) is for you. It covers every config option you could imagine. From optimizer settings to batch sizes to distributed training parameters. It's all in there.
 
 ---
 
@@ -44,9 +44,9 @@ If you're the type who loves configurations and tweaking every detail, the [**Co
 
 We've got some excellent tutorials to help you get the most out of Fast-LLM:
 
--   [**Quick-Start Guide**](/quick-start): Perfect for launching Fast-LLM on a single GPU machine. We walk you through setting up Docker, running your first training job, and handling common issues.
+-   [**Quick-Start Guide**](quick-start.md): Perfect for launching Fast-LLM on a single GPU machine. We walk you through setting up Docker, running your first training job, and handling common issues.
 
--   [**In-Action Guides**](/in-action/slurm): Ready to go big? These guides cover setting up Fast-LLM with Slurm and Kubernetes for multi-node training. This is where Fast-LLM really shows its power.
+-   [**Cookbook**](recipes/train-llama-8b.md): Ready to go big? These recipes cover real-world scenarios like training big models from scratch, continuing training from a checkpoint, and more. This is where Fast-LLM really shows its power.
 
 ---
 
@@ -54,9 +54,9 @@ We've got some excellent tutorials to help you get the most out of Fast-LLM:
 
 If Fast-LLM still isn't cooperating, here's where to look next:
 
-1.   **GitHub [Issues](https://github.com/ServiceNow/Fast-LLM/issues) & [Discussions](https://github.com/ServiceNow/Fast-LLM/discussions)**: This is your best resource. Use the search function to see if anyone has run into the same issue. The community and our team are pretty active, so you'll likely find a solution or get help quickly.
+1.  **GitHub [Issues](https://github.com/ServiceNow/Fast-LLM/issues) & [Discussions](https://github.com/ServiceNow/Fast-LLM/discussions)**: This is your best resource. Use the search function to see if anyone has run into the same issue. The community and our team are pretty active, so you'll likely find a solution or get help quickly.
 
-2.   **Email (last resort)**: As a final option, you can email us at [fast-llm-team@servicenow.com](mailto:fast-llm-team@servicenow.com). This is only for rare cases, though. GitHub is our go-to for answering questions, as it lets others benefit from the conversation too.
+2.  **Email (last resort)**: As a final option, you can email us at [fast-llm-team@servicenow.com](mailto:fast-llm-team@servicenow.com). This is only for rare cases, though. GitHub is our go-to for answering questions, as it lets others benefit from the conversation too.
 
 Fast-LLM is a growing community, and your questions and contributions help make it better for everyone. Who knows, you might just solve the next person's roadblock!
 
diff --git a/docs/in-action/kubernetes.md b/docs/in-action/kubernetes.md
deleted file mode 100644
index 7255c3cf0..000000000
--- a/docs/in-action/kubernetes.md
+++ /dev/null
@@ -1,51 +0,0 @@
----
-title: "Kubernetes"
----
-
-- **Purpose:** These guides cover specific environments and configurations for deploying Fast-LLM in different setups.
-- **Content Organization:**
-  - **in-action/slurm**: Provide detailed instructions on deploying Fast-LLM on a Slurm cluster, covering multi-node setups, configuring Slurm scripts, and managing jobs.
-  - **in-action/kubernetes**: Guide for deploying Fast-LLM using Kubernetes, including creating the appropriate workloads (e.g., Job, Pod, StatefulSet), handling private Docker images, and configuring multi-node training.
-  - **File Single Node Guide Here Too?** If you include a "single-node" guide in this section as well, make it more advanced, focusing on optimizing performance, using different configurations, or tuning settings for different GPU models.
-- **Why It Makes Sense:** Organizing by deployment environment ensures users can quickly find the relevant guide based on their setup. Including both multi-node cluster guides and single-node advanced setups allows users to scale their knowledge.
-
----
-
-We'll walk you through how to use Fast-LLM to train a large language model on a cluster with multiple nodes and GPUs. We'll show an example setup using a Slurm cluster and a Kubernetes cluster.
-
-For this demo, we will train a Mistral-7B model from scratch for 100 steps on random data. The config file `examples/mistral-4-node-benchmark.yaml` is pre-configured for a multi-node setup with 4 DGX nodes, each with 8 A100-80GB or H100-80GB GPUs.
-
-> [!NOTE]
-> Fast-LLM scales from a single GPU to large clusters. You can start small and expand based on your resources.
-
-Expect to see a significant speedup in training time compared to other libraries! For training Mistral-7B, Fast-LLM is expected to achieve a throughput of **9,800 tokens/s/H100** (batch size 32, sequence length 8k) on a 4-node cluster with 32 H100s.
-
-
-### Running Fast-LLM on a Kubernetes Cluster
-
-#### Prerequisites
-
-- A [Kubernetes](https://kubernetes.io/) cluster with at least 4 DGX nodes with 8 A100-80GB or H100-80GB GPUs each.
-- [KubeFlow](https://www.kubeflow.org/) installed.
-- Locked memory limit set to unlimited at the host level on all nodes. Ask your cluster admin to do this if needed.
-
-#### Steps
-
-1. Create a Kubernetes [PersistentVolumeClaim](https://kubernetes.io/docs/concepts/storage/persistent-volumes/) (PVC) named `fast-llm-home` that will be mounted to `/home/fast-llm` in the container using [examples/fast-llm-pvc.yaml](examples/fast-llm-pvc.yaml):
-
-    ```bash
-    kubectl apply -f examples/fast-llm-pvc.yaml
-    ```
-
-2. Create a [PyTorchJob](https://www.kubeflow.org/docs/components/training/user-guides/pytorch/) resource using the example configuration file [examples/fast-llm.pytorchjob.yaml](examples/fast-llm.pytorchjob.yaml):
-
-    ```bash
-    kubectl apply -f examples/fast-llm.pytorchjob.yaml
-    ```
-
-3. Monitor the job status:
-
-    - Use `kubectl get pytorchjobs` to see the job status.
-    - Use `kubectl logs -f fast-llm-master-0 -c pytorch` to follow the logs.
-
-That's it! You're now up and running with Fast-LLM on Kubernetes. 🚀
diff --git a/docs/in-action/slurm.md b/docs/in-action/slurm.md
deleted file mode 100644
index d92cb19f4..000000000
--- a/docs/in-action/slurm.md
+++ /dev/null
@@ -1,61 +0,0 @@
----
-title: "Slurm"
----
-
--   **Purpose:** These guides cover specific environments and configurations for deploying Fast-LLM in different setups.
--   **Content Organization:**
-    -   **in-action/slurm**: Provide detailed instructions on deploying Fast-LLM on a Slurm cluster, covering multi-node setups, configuring Slurm scripts, and managing jobs.
-    -   **in-action/kubernetes**: Guide for deploying Fast-LLM using Kubernetes, including creating the appropriate workloads (e.g., Job, Pod, StatefulSet), handling private Docker images, and configuring multi-node training.
-    -   **File Single Node Guide Here Too?** If you include a "single-node" guide in this section as well, make it more advanced, focusing on optimizing performance, using different configurations, or tuning settings for different GPU models.
--   **Why It Makes Sense:** Organizing by deployment environment ensures users can quickly find the relevant guide based on their setup. Including both multi-node cluster guides and single-node advanced setups allows users to scale their knowledge.
-
----
-
-We'll walk you through how to use Fast-LLM to train a large language model on a cluster with multiple nodes and GPUs. We'll show an example setup using a Slurm cluster and a Kubernetes cluster.
-
-For this demo, we will train a Mistral-7B model from scratch for 100 steps on random data. The config file `examples/mistral-4-node-benchmark.yaml` is pre-configured for a multi-node setup with 4 DGX nodes, each with 8 A100-80GB or H100-80GB GPUs.
-
-> [!NOTE]
-> Fast-LLM scales from a single GPU to large clusters. You can start small and expand based on your resources.
-
-Expect to see a significant speedup in training time compared to other libraries! For training Mistral-7B, Fast-LLM is expected to achieve a throughput of **9,800 tokens/s/H100** (batch size 32, sequence length 8k) on a 4-node cluster with 32 H100s.
-
-### Running Fast-LLM on a Slurm Cluster without Docker
-
-#### Prerequisites
-
--   A [Slurm](https://slurm.schedmd.com/) cluster with at least 4 DGX nodes with 8 A100-80GB or H100-80GB GPUs each.
--   CUDA 12.1 or higher.
--   Dependencies: [PyTorch][pytorch], [Triton][triton], and [Apex](https://github.com/NVIDIA/apex) installed on all nodes.
-
-#### Steps
-
-1.  Deploy the [nvcr.io/nvidia/pytorch:24.07-py3](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) Docker image to all nodes (recommended), because it contains all the necessary dependencies.
-2.  Install Fast-LLM on all nodes:
-
-    ```bash
-    sbatch <<EOF
-    #!/bin/bash
-    #SBATCH --nodes=$(scontrol show node | grep -c NodeName)
-    #SBATCH --ntasks-per-node=1
-    #SBATCH --ntasks=$(scontrol show node | grep -c NodeName)
-    #SBATCH --exclusive
-
-    srun bash -c 'pip install --no-cache-dir -e "git+https://github.com/ServiceNow/Fast-LLM.git#egg=llm[CORE,OPTIONAL,DEV]"'
-    EOF
-    ```
-
-3.  Use the example Slurm job script [examples/fast-llm.sbat](examples/fast-llm.sbat) to submit the job to the cluster:
-
-    ```bash
-    sbatch examples/fast-llm.sbat
-    ```
-
-4.  Monitor the job's progress:
-
-    -   Logs: Follow `job_output.log` and `job_error.log` in your working directory for logs.
-    -   Status: Use `squeue -u $USER` to see the job status.
-
-Now, you can sit back and relax while Fast-LLM trains your model at full speed! ☕
-
-### Running Fast-LLM on a Slurm Cluster with Docker
diff --git a/docs/quick-start.md b/docs/quick-start.md
index 1f37a8165..5fab9c7e5 100644
--- a/docs/quick-start.md
+++ b/docs/quick-start.md
@@ -4,113 +4,569 @@ title: "Quick Start 🚀"
 
 This guide will get you up and running with Fast-LLM on a single machine. Let's train a model and see some results!
 
-You'll need:
+## Prerequisites
 
--   At least one NVIDIA GPU on your machine. We recommend 8 A100s or higher for this tutorial 🤑
--   Docker (installed and running). You can run without without docker but we don't recommend it. 🐳
--   Some patience for the initial setup and training 😊
+To follow this guide, you'll need:
 
-## Step 1: Pull the Fast-LLM Docker Image 🐳
+-   **Hardware**: At least one NVIDIA GPU with Ampere architecture or newer. For optimal results in this tutorial, we recommend 8 A100 GPUs or better. 🤑
+-   **Software**:
+    -   **Docker** (if using the Docker setup), or
+    -   **Local Environment**: PyTorch 2.2 or later, CUDA 12.1 or later, and APEX AMP (if building from source), or
+    -   **Cluster Setup**: Access to a Slurm or Kubernetes cluster.
+-   **Time**: The initial setup and training process requires some patience. 😊
 
-To start, grab the pre-built Fast-LLM Docker image:
+## Step 1: Initial Setup 🏗️
 
-```bash
-docker pull ghcr.io/servicenow/fast-llm:latest
-```
+First, choose your environment. You can use Docker, your local environment, Slurm, or Kubernetes.
 
-This image contains everything you need to train LLMs with Fast-LLM.
+=== "Docker"
 
-!!! info
+    You selected Docker for this tutorial. We'll use the Fast-LLM Docker image to train our model, which includes all the necessary dependencies. Grab the pre-built Fast-LLM Docker image:
 
-    Installing Fast-LLM from source is also an option:
+    ```bash
+    docker pull ghcr.io/servicenow/fast-llm:latest
+    ```
+
+    Let's also create folders to store our input data and output results:
+
+    ```bash
+    mkdir ~/inputs ~/results
+    ```
+
+=== "Local Environment"
+
+    You chose to run Fast-LLM in your local environment. You should have a machine with at least one NVIDIA GPU with Ampere architecture or newer. We need to install Fast-LLM and its dependencies in your environment. Our Fast-LLM Docker image already includes all of this, and we recommend using it for simplicity and reproducibility. If you still want to install Fast-LLM in your local environment, follow the steps below.
 
-    ```sh
+    Fast-LLM depends on [CUDA](https://developer.nvidia.com/about-cuda) 12.1 or later, [PyTorch](https://pytorch.org) 2.2 or later, [APEX](https://github.com/NVIDIA/apex?tab=readme-ov-file#installation), and [OpenAI Triton](https://github.com/triton-lang/triton). Follow the instructions on their respective websites to install them. If you use [conda](https://docs.conda.io/projects/conda/en/latest/index.html), you can create a new environment and install these dependencies in it.
+    
+    Now, make sure PyTorch can access your GPU by running the following command:
+
+    ```bash
+    python -c "import torch; print(torch.cuda.is_available())"
+    ```
+
+    If APEX is correctly installed, the following command should run without errors:
+
+    ```bash
+    python -c "from amp_C import *"
+    ```
+
+    For Triton, you can verify the installation by running:
+
+    ```bash
+    python -c "import triton; print(triton.__version__)"
+    ```
+    
+    Fast-LLM also depends on [FlashAttention-2](https://github.com/Dao-AILab/flash-attention), which will be installed automatically when you install Fast-LLM:
+
+    ```bash
     pip install --no-build-isolation "git+https://github.com/ServiceNow/Fast-LLM.git#egg=fast_llm[CORE,OPTIONAL,DEV]"
     ```
 
-    However, we recommend the Docker image for simplicity and reproducibility.
+    You can verify the installation by running:
 
-## Step 2: Set Up Directories for Your Inputs and Outputs
+    ```bash
+    python -c "import flash_attn; print(flash_attn.__version__)"
+    ```
 
-Let's create folders to store our input data and output results:
+    and
 
-```bash
-mkdir ~/inputs ~/results
-```
+    ```bash
+    python -c "import fast_llm; print(fast_llm.__version__)"
+    ```
 
-## Step 3: Choose Your Model 🤖
+    At this point, you should be ready to run Fast-LLM on your local environment.
 
-Fast-LLM supports many GPT variants, including (but not limited to) Llama, Mistral, and Mixtral. For this tutorial, let's train a Llama model with data parallelism. You can choose from two models:
+    Before we continue, let's create folders to store our input data and output results:
 
-=== "SmolLM-135M"
+    ```bash
+    mkdir /mnt/inputs /mnt/results
+    ```
 
-    SmolLM is a smaller, more manageable model with 135M parameters. It's perfect for testing and getting familiar with Fast-LLM. We'll grab its configuration file from Huggingface Hub and save it as `~/inputs/config.json`:
+    If this location isn't writable, you can create the folders in your home directory:
 
     ```bash
-    curl -O https://huggingface.co/HuggingFaceTB/SmolLM-135M/resolve/main/config.json
-    mv config.json ~/inputs
+    mkdir ~/inputs ~/results
     ```
 
-=== "Llama-3.2-1B"
+    Make sure to update the paths in the following commands accordingly.
+
+=== "Slurm"
 
-    Llama is a larger model with 1B parameters. It's more powerful but requires more resources to train. We'll grab the model from Huggingface Hub and save it to `~/inputs`:
+    You selected Docker-enabled [Slurm](https://slurm.schedmd.com/) for this tutorial. The Slurm setup requires a Slurm cluster with at least one node and one GPU of Ampere architecture or newer. Slurm will use the `ghcr.io/servicenow/fast-llm:latest` Docker image to train our model. It will need a shared file system for input data and output results. We will assume that your home directory is shared across all nodes.
+
+    Let's create folders to store our input data and output results in the shared home directory:
 
     ```bash
-    git lfs install
-    git clone https://huggingface.co/meta-llama/Llama-3.2-1B ~/inputs
+    mkdir ~/inputs ~/results
     ```
 
+=== "Kubernetes"
+
+    You chose [Kubernetes](https://kubernetes.io/) with [KubeFlow](https://www.kubeflow.org/) for this tutorial. We will use a `PyTorchJob` resource to train our model with the `ghcr.io/servicenow/fast-llm:latest` Docker image and store our input data and output results in shared [persistent volume claims](https://kubernetes.io/docs/concepts/storage/persistent-volumes/) (PVCs). The Kubernetes cluster should have at least one node and one GPU of Ampere architecture or newer.
+
+    Let's now create two PVCs named `pvc-fast-llm-inputs` and `pvc-fast-llm-results` to store our input data and output results, respectively.
+    
+    Create a file named `pvc-fast-llm-inputs.yaml` with the following content:
+
+    ```yaml
+    # Persistent volume claim for Fast-LLM inputs
+    apiVersion: "v1"
+    kind: "PersistentVolumeClaim"
+    metadata:
+      name: "pvc-fast-llm-inputs"
+    spec:
+      storageClassName: local-path
+      accessModes:
+        - ReadWriteMany
+      resources:
+        requests:
+          storage: 1000Gi
+    ```
+
+    Then, create a second file named `pvc-fast-llm-results.yaml` with these contents:
+
+    ```yaml
+    # Persistent volume claim for Fast-LLM results
+    apiVersion: "v1"
+    kind: "PersistentVolumeClaim"
+    metadata:
+      name: "pvc-fast-llm-results"
+    spec:
+      storageClassName: local-path
+      accessModes:
+        - ReadWriteMany
+      resources:
+        requests:
+          storage: 1000Gi
+    ```
+
+    Apply both PVCs to your Kubernetes cluster:
+
+    ```bash
+    kubectl apply -f pvc-fast-llm-inputs.yaml
+    kubectl apply -f pvc-fast-llm-results.yaml
+    ```
+
+## Step 2: Choose Your Model 🤖
+
+Fast-LLM supports many GPT variants, including (but not limited to) Llama, Mistral, and Mixtral. For this tutorial, let's train a Llama model with data parallelism. You can choose from two models:
+
+=== "SmolLM-135M"
+
+    SmolLM is a smaller, more manageable model with 135M parameters. It's perfect for testing and getting familiar with Fast-LLM. We'll grab its configuration file from the Hugging Face Hub and save it to our inputs folder:
+
+    === "Docker"
+
+        ```bash
+        curl -O https://huggingface.co/HuggingFaceTB/SmolLM-135M/resolve/main/config.json
+        mv config.json ~/inputs
+        ```
+
+    === "Local Environment"
+
+        ```bash
+        curl -O https://huggingface.co/HuggingFaceTB/SmolLM-135M/resolve/main/config.json
+        mv config.json /mnt/inputs
+        ```
+
+    === "Slurm"
+
+        ```bash
+        curl -O https://huggingface.co/HuggingFaceTB/SmolLM-135M/resolve/main/config.json
+        mv config.json ~/inputs
+        ```
+
+    === "Kubernetes"
+
+        First, download the configuration file to your local machine:
+
+        ```bash
+        curl -O https://huggingface.co/HuggingFaceTB/SmolLM-135M/resolve/main/config.json
+        ```
+
+        Then, create a temporary pod that mounts the inputs PVC, allowing you to copy files to it. Here's a basic YAML configuration for such a pod:
+
+        ```yaml
+        apiVersion: v1
+        kind: Pod
+        metadata:
+          name: file-transfer
+        spec:
+          containers:
+            - name: file-transfer-container
+              image: ubuntu
+              command: ["sleep", "infinity"]
+              volumeMounts:
+                - mountPath: /mnt/inputs
+                  name: inputs
+          volumes:
+            - name: inputs
+              persistentVolumeClaim:
+                claimName: pvc-fast-llm-inputs
+        ```
+
+        Save this configuration to a file named `file-transfer-pod.yaml` and apply it to your Kubernetes cluster:
+
+        ```bash
+        kubectl apply -f file-transfer-pod.yaml
+        ```
+
+        Copy the configuration file to the pod:
+
+        ```bash
+        kubectl cp config.json file-transfer:/mnt/inputs
+        ```
+
+        Finally, clean up the temporary pod and configuration file:
+
+        ```bash
+        kubectl delete pod file-transfer
+        rm config.json
+        ```
+
+=== "Llama-3.2-1B"
+
+    Llama is a larger model with 1B parameters. It's more powerful but requires more resources to train. We'll grab the model from the Hugging Face Hub and save it to our inputs folder:
+
+    === "Docker"
+
+        First, sign in to your Hugging Face account:
+
+        ```bash
+        pip install huggingface_hub
+        huggingface-cli login
+        ```
+
+        Then, clone the model:
+
+        ```bash
+        git lfs install
+        git clone https://huggingface.co/meta-llama/Llama-3.2-1B ~/inputs
+        ```
+    
+    === "Local Environment"
+
+        First, sign in to your Hugging Face account:
+
+        ```bash
+        pip install huggingface_hub
+        huggingface-cli login
+        ```
+
+        Then, clone the model:
+
+        ```bash
+        git lfs install
+        git clone https://huggingface.co/meta-llama/Llama-3.2-1B /mnt/inputs
+        ```
+    
+    === "Slurm"
+
+        First, sign in to your Hugging Face account:
+
+        ```bash
+        pip install huggingface_hub
+        huggingface-cli login
+        ```
+
+        Then, clone the model:
+
+        ```bash
+        git lfs install
+        git clone https://huggingface.co/meta-llama/Llama-3.2-1B ~/inputs
+        ```
+    
+    === "Kubernetes"
+    
+        We need to create a temporary pod that mounts the inputs PVC and allows us to download the model. Here's a basic YAML configuration for such a pod:
+    
+        ```yaml
+        apiVersion: v1
+        kind: Pod
+        metadata:
+          name: clone-model
+        spec:
+          containers:
+            - name: clone-model-container
+              image: ubuntu
+              command: ["sleep", "infinity"]
+              volumeMounts:
+                - mountPath: /mnt/inputs
+                  name: inputs
+          volumes:
+            - name: inputs
+              persistentVolumeClaim:
+                claimName: pvc-fast-llm-inputs
+        ```
+
+        Save this configuration to a file named `clone-model-pod.yaml`. Next, apply this configuration to your Kubernetes cluster:
+
+        ```bash
+        kubectl apply -f clone-model-pod.yaml
+        ```
+
+        Now, enter the pod, log in to your Hugging Face account, and clone the model:
+
+        ```bash
+        kubectl exec -it clone-model -- /bin/bash
+        pip install huggingface_hub
+        huggingface-cli login
+        git lfs install
+        git clone https://huggingface.co/meta-llama/Llama-3.2-1B /mnt/inputs
+        ```
+
+        Finally, clean up the temporary pod, which is no longer needed:
+
+        ```bash
+        kubectl delete pod clone-model
+        ```
+
 !!! tip "Model Size Matters"
 
     Smaller models like SmolLM-135M will train relatively quickly, especially if you've only got a few GPUs. But if you're feeling adventurous (and patient), give the larger Llama-3.2-1B a shot!
 
-## Step 4: Preparing the Training Data 📚
+## Step 3: Prepare the Training Data 📚
 
 For this tutorial, we'll use 9B tokens of text from the [OpenWebText](https://skylion007.github.io/OpenWebTextCorpus/) dataset. This dataset is a free approximation of the WebText data OpenAI used for GPT-2, and it's perfect for our test run!
 
-We've got a script that'll download and preprocess the dataset for you. Run it like this:
-
 === "SmolLM-135M"
 
-    ```bash
-    docker run -it --rm ghcr.io/servicenow/fast-llm:latest \
-        -v ~/inputs:/app/inputs \
-        python tools/prepare_dataset.py \
-        tokenizer_path_or_name="HuggingFaceTB/SmolLM-135M" \
-        dataset_name_or_path="openwebtext" \
-        dataset_split="train" \
-        output_dir="inputs" \
-        num_processes_load=4 \
-        num_processes_map=4 \
-        num_processes_save=4 \
-        num_tokens_per_shard=100000000
-    ```
+    === "Docker"
+
+        We've got a script that'll download and preprocess the dataset for you. Run it like this:
+
+        ```bash
+        docker run -it --rm -v ~/inputs:/mnt/inputs \
+            ghcr.io/servicenow/fast-llm:latest \
+            python tools/prepare_dataset.py \
+            tokenizer_path_or_name="HuggingFaceTB/SmolLM-135M" \
+            dataset_name_or_path="openwebtext" \
+            dataset_split="train" \
+            output_dir="/mnt/inputs" \
+            num_processes_load=4 \
+            num_processes_map=4 \
+            num_processes_save=4 \
+            num_tokens_per_shard=100000000
+        ```
+    
+    === "Local Environment"
+
+        Fast-LLM ships with a [script](https://github.com/ServiceNow/Fast-LLM/blob/main/tools/prepare_dataset.py) that downloads and preprocesses the dataset for you. Download and run it like this:
+
+        ```bash
+        curl -O https://raw.githubusercontent.com/ServiceNow/Fast-LLM/main/tools/prepare_dataset.py
+        python prepare_dataset.py \
+            tokenizer_path_or_name="HuggingFaceTB/SmolLM-135M" \
+            dataset_name_or_path="openwebtext" \
+            dataset_split="train" \
+            output_dir="/mnt/inputs" \
+            num_processes_load=4 \
+            num_processes_map=4 \
+            num_processes_save=4 \
+            num_tokens_per_shard=100000000
+        ```
+    
+    === "Slurm"
+
+        Fast-LLM has got you covered with a script that'll download and preprocess the dataset for you. Run it like this:
+
+        ```bash
+        sbatch <<EOF
+        #!/bin/bash
+        #SBATCH --nodes=1
+        #SBATCH --ntasks-per-node=1
+        #SBATCH --exclusive
+        #SBATCH --output=${HOME}/results/job_output.log
+        #SBATCH --error=${HOME}/results/job_error.log
+
+        srun \
+            --container-image="ghcr.io/servicenow/fast-llm:latest" \
+            --container-mounts="${HOME}/inputs:/mnt/inputs,${HOME}/results:/mnt/results" \
+            --ntasks-per-node=\$SLURM_NTASKS_PER_NODE \
+            bash -c "
+                python tools/prepare_dataset.py \
+                    tokenizer_path_or_name='HuggingFaceTB/SmolLM-135M' \
+                    dataset_name_or_path='openwebtext' \
+                    dataset_split='train' \
+                    output_dir='/mnt/inputs' \
+                    num_processes_load=4 \
+                    num_processes_map=4 \
+                    num_processes_save=4 \
+                    num_tokens_per_shard=100000000"
+        EOF
+        ```
+
+        You can follow the job's progress by running `squeue -u $USER` and checking the logs in `~/results/job_output.log` and `~/results/job_error.log`.
+    
+    === "Kubernetes"
+
+        Fast-LLM comes with a script that'll download and preprocess the dataset for you. We will run this script in a Kubernetes job. Here's a basic configuration for the job:
+
+        ```yaml
+        apiVersion: batch/v1
+        kind: Job
+        metadata:
+          name: prepare-dataset
+        spec:
+          template:
+            spec:
+              containers:
+                - name: prepare-dataset
+                  image: ghcr.io/servicenow/fast-llm:latest
+                  command: ["python", "tools/prepare_dataset.py"]
+                  args:
+                    - tokenizer_path_or_name=HuggingFaceTB/SmolLM-135M
+                    - dataset_name_or_path=openwebtext
+                    - dataset_split=train
+                    - output_dir=/mnt/inputs
+                    - num_processes_load=4
+                    - num_processes_map=4
+                    - num_processes_save=4
+                    - num_tokens_per_shard=100000000
+                  resources:
+                    requests:
+                      cpu: 4
+                  volumeMounts:
+                    - name: inputs
+                      mountPath: /mnt/inputs
+              restartPolicy: Never
+              volumes:
+                - name: inputs
+                  persistentVolumeClaim:
+                    claimName: pvc-fast-llm-inputs
+        ```
+
+        Save this configuration to a file named `prepare-dataset-job.yaml` and apply it to your Kubernetes cluster:
+
+        ```bash
+        kubectl apply -f prepare-dataset-job.yaml
+        ```
+
+        You can follow the job's progress by running `kubectl get pods` and checking the logs with `kubectl logs job/prepare-dataset`.
 
 === "Llama-3.2-1B"
 
-    ```bash
-    docker run -it --rm ghcr.io/servicenow/fast-llm:latest \
-        -v ~/inputs:/app/inputs \
-        python tools/prepare_dataset.py \
-        tokenizer_path_or_name="meta-llama/Llama-3.2-1B" \
-        dataset_name_or_path="openwebtext" \
-        dataset_split="train" \
-        output_dir="inputs" \
-        num_processes_load=4 \
-        num_processes_map=4 \
-        num_processes_save=4 \
-        num_tokens_per_shard=100000000
-    ```
+    === "Docker"
+
+        We've got a script that'll download and preprocess the dataset for you. Run it like this:
+
+        ```bash
+        docker run -it --rm \
+            -v ~/inputs:/mnt/inputs \
+            ghcr.io/servicenow/fast-llm:latest \
+            python tools/prepare_dataset.py \
+            tokenizer_path_or_name="meta-llama/Llama-3.2-1B" \
+            dataset_name_or_path="openwebtext" \
+            dataset_split="train" \
+            output_dir="/mnt/inputs" \
+            num_processes_load=4 \
+            num_processes_map=4 \
+            num_processes_save=4 \
+            num_tokens_per_shard=100000000
+        ```
+
+    === "Local Environment"
+
+        Fast-LLM ships with a [script](https://github.com/ServiceNow/Fast-LLM/blob/main/tools/prepare_dataset.py) that downloads and preprocesses the dataset for you. Download and run it like this:
+
+        ```bash
+        curl -O https://raw.githubusercontent.com/ServiceNow/Fast-LLM/main/tools/prepare_dataset.py
+        python prepare_dataset.py \
+            tokenizer_path_or_name="meta-llama/Llama-3.2-1B" \
+            dataset_name_or_path="openwebtext" \
+            dataset_split="train" \
+            output_dir="/mnt/inputs" \
+            num_processes_load=4 \
+            num_processes_map=4 \
+            num_processes_save=4 \
+            num_tokens_per_shard=100000000
+        ```
+
+    === "Slurm"
+
+        Fast-LLM has got you covered with a script that'll download and preprocess the dataset for you. Run it like this:
+
+        ```bash
+        sbatch <<'EOF'
+        #!/bin/bash
+        #SBATCH --nodes=1
+        #SBATCH --ntasks-per-node=1
+        #SBATCH --exclusive
+        #SBATCH --output=/mnt/outputs/job_output.log
+        #SBATCH --error=/mnt/outputs/job_error.log
+
+        srun \
+            --container-image="ghcr.io/servicenow/fast-llm:latest" \
+            --container-mounts="${HOME}/inputs:/mnt/inputs,${HOME}/results:/mnt/results" \
+            --ntasks-per-node=$SLURM_NTASKS_PER_NODE \
+            bash -c "
+                python tools/prepare_dataset.py \
+                    tokenizer_path_or_name='meta-llama/Llama-3.2-1B' \
+                    dataset_name_or_path='openwebtext' \
+                    dataset_split='train' \
+                    output_dir='/mnt/inputs' \
+                    num_processes_load=4 \
+                    num_processes_map=4 \
+                    num_processes_save=4 \
+                    num_tokens_per_shard=100000000"
+        EOF
+        ```
+
+        You can follow the job's progress by running `squeue -u $USER` and checking the logs in `~/results/job_output.log` and `~/results/job_error.log`.
+
+    === "Kubernetes"
+
+        Fast-LLM comes with a script that'll download and preprocess the dataset for you. We will run this script in a Kubernetes job. Here's a basic configuration for the job:
+
+        ```yaml
+        apiVersion: batch/v1
+        kind: Job
+        metadata:
+          name: prepare-dataset
+        spec:
+          template:
+            spec:
+              containers:
+                - name: prepare-dataset
+                  image: ghcr.io/servicenow/fast-llm:latest
+                  command: ["python", "tools/prepare_dataset.py"]
+                  args:
+                    - tokenizer_path_or_name=meta-llama/Llama-3.2-1B
+                    - dataset_name_or_path=openwebtext
+                    - dataset_split=train
+                    - output_dir=/mnt/inputs
+                    - num_processes_load=4
+                    - num_processes_map=4
+                    - num_processes_save=4
+                    - num_tokens_per_shard=100000000
+                  resources:
+                    requests:
+                      cpu: 4
+                  volumeMounts:
+                    - name: inputs
+                      mountPath: /mnt/inputs
+              restartPolicy: Never
+              volumes:
+                - name: inputs
+                  persistentVolumeClaim:
+                    claimName: pvc-fast-llm-inputs
+        ```
+
+        Save this configuration to a file named `prepare-dataset-job.yaml` and apply it to your Kubernetes cluster:
+
+        ```bash
+        kubectl apply -f prepare-dataset-job.yaml
+        ```
+
+        You can follow the job's progress by running `kubectl get pods` and checking the logs with `kubectl logs job/prepare-dataset`.
 
 !!! info "What's Happening Here?"
 
-    This will grab the OpenWebText data, tokenize it, and save it in 91 shards of 100M tokens each. Expect around 2 hours for the whole thing to finish, mainly due to tokenization. If you've got more CPU cores, try upping `num_processes_*` to speed things up.
+    The `prepare_dataset.py` script will grab the OpenWebText data from the Hugging Face Hub, tokenize it, and save it in 91 shards of 100M tokens each to the input folder. Expect around 2 hours for the whole thing to finish, mainly due to tokenization. If you've got more CPU cores, try upping `num_processes_*` to speed things up.
 
 !!! tip "Use a Smaller Dataset for Testing"
 
-    Since we're just testing things out, we can also use a smaller dataset. Replace `openwebtext` with `stas/openwebtext-10k` to use a small subset representing the first 10K records from the original dataset. This will speed up the process and let you see how things work without waiting for hours.
+    If you're just testing things out, you can also use a smaller dataset. Replace `openwebtext` with `stas/openwebtext-10k` to use a small subset representing the first 10K records from the original dataset. This will speed up the process and let you see how things work without waiting for hours.
 
-## Step 5: Set Up Your Training Configuration ⚙️
+## Step 4: Configure Fast-LLM ⚙️
 
 Next, we'll create a configuration file for Fast-LLM. Save the following as `~/inputs/fast-llm-config.yaml`:
 
@@ -141,7 +597,7 @@ Next, we'll create a configuration file for Fast-LLM. Save the following as `~/i
       batch_size: 480  # (5)!
     data:
       format: file
-      path: /app/inputs/fast_llm_dataset.json  # (6)!
+      path: /mnt/inputs/fast_llm_dataset.json  # (6)!
       split: [99, 1, 0]  # (7)!
     optimizer: # (8)!
       weight_decay: 0.1
@@ -155,7 +611,7 @@ Next, we'll create a configuration file for Fast-LLM. Save the following as `~/i
         warmup_iterations: 2000
     pretrained:
       format: llama  # (10)!
-      path: /app/inputs
+      path: /mnt/inputs
       load_weights: no  # (11)!
     model:
       multi_stage:
@@ -163,7 +619,7 @@ Next, we'll create a configuration file for Fast-LLM. Save the following as `~/i
       distributed:
         training_dtype: bf16  # (13)!
     run:
-      experiment_dir: /app/results
+      experiment_dir: /mnt/results
     ```
 
     1.  Total number of training tokens will be approximately 300B.
@@ -207,7 +663,7 @@ Next, we'll create a configuration file for Fast-LLM. Save the following as `~/i
       batch_size: 480  # (5)!
     data:
       format: file
-      path: /app/inputs/fast_llm_dataset.json  # (6)!
+      path: /mnt/inputs/fast_llm_dataset.json  # (6)!
       split: [99, 1, 0]  # (7)!
     optimizer: # (8)!
       weight_decay: 0.1
@@ -221,7 +677,7 @@ Next, we'll create a configuration file for Fast-LLM. Save the following as `~/i
         warmup_iterations: 2000
     pretrained:
       format: llama  # (10)!
-      path: /app/inputs
+      path: /mnt/inputs
       load_weights: yes  # (11)!
     model:
       multi_stage:
@@ -229,7 +685,7 @@ Next, we'll create a configuration file for Fast-LLM. Save the following as `~/i
       distributed:
         training_dtype: bf16  # (13)!
     run:
-      experiment_dir: /app/results
+      experiment_dir: /mnt/results
     ```
 
     1.  Total number of training tokens will be approximately 300B.
@@ -252,17 +708,159 @@ If you included the W&B section in your configuration, you'll need to add your A
 
 ## Step 7: Launch Training 🚀
 
-Alright, the big moment! If you're on an 8-GPU machine, run the following to kick off training:
+Alright, the big moment! Let's launch the training run.
+
+=== "Docker"
+
+    If you're on an 8-GPU machine, run the following to kick off training:
+
+    ```bash
+    docker run --gpus all -it --rm \
+        -v ~/inputs:/mnt/inputs \
+        -v ~/results:/mnt/results \
+        -e PYTHONHASHSEED=0 \
+        -e WANDB_API_KEY_PATH=/mnt/inputs/.wandb_api_key \
+        ghcr.io/servicenow/fast-llm:latest \
+        torchrun --standalone --nnodes 1 --nproc_per_node=8 --no_python \
+        fast-llm train gpt --config /mnt/inputs/fast-llm-config.yaml
+    ```
+
+    Adjust `--nproc_per_node` based on the number of GPUs you have available.
+    Replace `--gpus all` with `--gpus '"device=0,1,2,3,4,5,6,7"'` if you want to use specific GPUs.
+    Remove `-e WANDB_API_KEY_PATH=/mnt/inputs/.wandb_api_key` if you're not using W&B.
+
+=== "Local Environment"
+
+    ```bash
+    export PYTHONHASHSEED=0
+    export WANDB_API_KEY_PATH=/mnt/inputs/.wandb_api_key
+    torchrun --standalone --nnodes 1 --nproc_per_node=8 --no_python \
+        fast-llm train gpt --config /mnt/inputs/fast-llm-config.yaml
+    ```
+
+=== "Slurm"
+
+    We create a Slurm batch script to run the training job. Save the following as `fast-llm.sbat`:
+
+    ```bash
+    #!/bin/bash
+    #SBATCH --job-name=fast-llm
+    #SBATCH --nodes=1
+    #SBATCH --gpus-per-node=8
+    #SBATCH --ntasks-per-node=1
+    #SBATCH --exclusive
+    #SBATCH --output=/mnt/outputs/job_output.log
+    #SBATCH --error=/mnt/outputs/job_error.log
+
+    export PYTHONHASHSEED=0
+    export WANDB_API_KEY_PATH=/mnt/inputs/.wandb_api_key
+
+    srun \
+        --container-image="ghcr.io/servicenow/fast-llm:latest" \
+        --container-mounts="${HOME}/inputs:/mnt/inputs,${HOME}/results:/mnt/results" \
+        --container-env="PYTHONHASHSEED,WANDB_API_KEY_PATH" \
+        --gpus-per-node=$SLURM_GPUS_PER_NODE \
+        --ntasks-per-node=$SLURM_NTASKS_PER_NODE \
+        bash -c "
+            torchrun \
+            --standalone \
+            --nnodes=\$SLURM_NNODES \
+            --nproc_per_node=\$SLURM_GPUS_PER_NODE \
+            --no_python \
+            fast-llm train gpt \
+            --config /mnt/inputs/fast-llm-config.yaml"
+    ```
+
+    Change the `--gpus-per-node` value to match the number of GPUs on your node.
+    If you're not using W&B, remove the references to `WARDB_API_KEY_PATH`.
+
+    Submit the job to the Slurm cluster:
+
+    ```bash
+    sbatch fast-llm.sbat
+    ```
+
+=== "Kubernetes"
+
+    We create a [PyTorchJob](https://www.kubeflow.org/docs/components/training/user-guides/pytorch/) resource with the following configuration and save it as `fast-llm.pytorchjob.yaml`:
+
+    ```yaml
+    apiVersion: "kubeflow.org/v1"
+    kind: "PyTorchJob"
+    metadata:
+      name: "fast-llm"
+    spec:
+      nprocPerNode: "8"
+      pytorchReplicaSpecs:
+        Master:
+          replicas: 1
+          restartPolicy: Never
+          template:
+            spec:
+              tolerations:
+                - key: nvidia.com/gpu
+                  value: "true"
+                  operator: Equal
+                  effect: NoSchedule
+              containers:
+                - name: pytorch
+                  image: ghcr.io/servicenow/fast-llm:latest
+                  resources:
+                    limits:
+                      nvidia.com/gpu: 8
+                      rdma/rdma_shared_device_a: 1
+                      memory: "1024Gi"
+                      cpu:
+                    requests:
+                      nvidia.com/gpu: 8
+                      rdma/rdma_shared_device_a: 1
+                      memory: "1024Gi"
+                      cpu: 128
+                  command:
+                    - /bin/bash
+                    - -c
+                    - |
+                      torchrun --standalone \
+                               --nnodes=${PET_NNODES} \
+                               --nproc_per_node=${PET_NPROC_PER_NODE} \
+                               --no_python \
+                               fast-llm train gpt \
+                               --config /mnt/inputs/fast-llm-config.yaml
+                  env:
+                    - name: PYTHONHASHSEED
+                      value: "0"
+                    - name: WANDB_API_KEY_PATH
+                      value: "/mnt/inputs/.wandb_api_key"
+                  securityContext:
+                    capabilities:
+                      add:
+                        - IPC_LOCK
+                  volumeMounts:
+                    - mountPath: /mnt/inputs
+                      name: fast-llm-inputs
+                    - mountPath: /mnt/results
+                      name: fast-llm-results
+                    - mountPath: /dev/shm
+                      name: dshm
+              volumes:
+                - name: fast-llm-inputs
+                  persistentVolumeClaim:
+                    claimName: pvc-fast-llm-inputs
+                - name: fast-llm-results
+                  persistentVolumeClaim:
+                    claimName: pvc-fast-llm-results
+                - name: dshm
+                  emptyDir:
+                    medium: Memory
+                    sizeLimit: "1024Gi"
+    ```
 
-```bash
-docker run --gpus all -it --rm ghcr.io/servicenow/fast-llm:latest \
-    -v ~/inputs:/app/inputs \
-    -v ~/results:/app/results \
-    -e PYTHONHASHSEED=0 \
-    -e WANDB_API_KEY_PATH=/app/inputs/.wandb_api_key \
-    torchrun --nproc_per_node=8 --no_python \
-    fast-llm train gpt --config /app/inputs/fast-llm-config.yaml
-```
+    Change the `nprocPerNode` value to match the number of GPUs on your node. If you're not using W&B, remove the references to `WANDB_API_KEY_PATH`.
+
+    Submit the job to the Kubernetes cluster:
+
+    ```bash
+    kubectl apply -f fast-llm.pytorchjob.yaml
+    ```
 
 !!! warning "Python Hash Seed"
 
@@ -270,11 +868,31 @@ docker run --gpus all -it --rm ghcr.io/servicenow/fast-llm:latest \
 
 ## Step 8. Track Training Progress 📊
 
-Fast-LLM will log training progress to the console every 10 iterations. You can expect to see the following throughput:
+=== "Docker"
+
+    Fast-LLM will log training progress to the console every 10 iterations.
+
+=== "Local Environment"
+
+    Fast-LLM will log training progress to the console every 10 iterations.
+
+=== "Slurm"
+
+    Use `squeue -u $USER` to see the job status.
+    Follow `job_output.log` and `job_error.log` in your working directory for logs.
+    Fast-LLM will log training progress to those files every 10 iterations.
+
+=== "Kubernetes"
+
+    Use `kubectl get pods` to see the job status.
+    Use `kubectl logs fast-llm-master-0` to check the logs.
+    Fast-LLM will log training progress to the console every 10 iterations.
+
+You can expect to see the following throughput:
 
 === "SmolLM-135M"
 
-    | Metric              | A100         | H100         |
+    | Metric              | A100-80GB    | H100         |
     |---------------------|-------------:|-------------:|
     | Tokens/s            | 1,234,567    | 1,456,789    |
     | TFLOPS              | 312          | 512          |
@@ -296,9 +914,6 @@ Here are some common issues you might encounter and how to address them:
 
 -   **Underutilized GPU or Low Memory Usage**: If memory usage is low or GPU utilization isn't maxed out, try increasing `micro_batch_size` (to 4, 8, or 16 if memory allows) or extending `sequence_length` (up to 2048, 3072, or 4096, as memory permits). Larger batches and longer sequences help keep GPUs engaged and reduce idle time.
 
--   **Docker Permission Issues**: If you encounter Docker permission errors, confirm that Docker has permission to access your GPUs. Use the `--gpus all` flag in your Docker run command and ensure your user has access to the `docker` and `nvidia-docker` groups.
-
 ## Final Thoughts
 
-And that's it! You've set up, prepped data, chosen a model, configured training, and launched a full training run with Fast-LLM. From here, feel free to tweak the model, try out larger datasets, or scale things up to a multi-node setup if you're on a cluster.
-We have guides for Slurm and Kubernetes setups if distributed training is your jam. Happy training! 🚀
+And that's it! You've set up, prepped data, chosen a model, configured training, and launched a full training run with Fast-LLM. From here, feel free to tweak the model, try out larger datasets, or scale things up to a multi-node setup if you're on a cluster. Happy training! 🚀
diff --git a/mkdocs.yaml b/mkdocs.yaml
index 5fca92123..93a9592a4 100644
--- a/mkdocs.yaml
+++ b/mkdocs.yaml
@@ -162,11 +162,7 @@ nav:
   - Welcome: index.md
   - Get Started:
     - Quick Start: quick-start.md
-    # - Cost Efficiency: cost-efficiency.md
     - Help: help.md
-    - In Action:
-      - On Slurm: in-action/slurm.md
-      - On Kubernetes: in-action/kubernetes.md
     - Success Stories:
       - StarCoder 2: success-stories/starcoder-2.md
     - License: license.md

From 1ea3422d74838a6670ae0a910917d862e72bb40a Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Wed, 6 Nov 2024 22:57:54 -0500
Subject: [PATCH 45/87] wip

---
 docs/quick-start.md | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/docs/quick-start.md b/docs/quick-start.md
index 5fab9c7e5..4a96a5f86 100644
--- a/docs/quick-start.md
+++ b/docs/quick-start.md
@@ -704,7 +704,7 @@ Next, we'll create a configuration file for Fast-LLM. Save the following as `~/i
 
 ## (Optional) Step 6: Add Your Weights & Biases API Key 🔑
 
-If you included the W&B section in your configuration, you'll need to add your API key. Save your W&B API key to `~/inputs/.wandb_api_key` so Fast-LLM can track your training progress there. You can create a free W&B account if you don't already have one.
+If you included the W&B section in your configuration, you'll need to add your API key. Save your W&B API key to `.wandb_api_key` in your inputs folder so Fast-LLM can track your training progress there. You can create a free W&B account if you don't already have one.
 
 ## Step 7: Launch Training 🚀
 
@@ -730,6 +730,8 @@ Alright, the big moment! Let's launch the training run.
 
 === "Local Environment"
 
+    If you have 8 GPUs available, run the following to start training:
+
     ```bash
     export PYTHONHASHSEED=0
     export WANDB_API_KEY_PATH=/mnt/inputs/.wandb_api_key
@@ -737,6 +739,9 @@ Alright, the big moment! Let's launch the training run.
         fast-llm train gpt --config /mnt/inputs/fast-llm-config.yaml
     ```
 
+    Adjust `--nproc_per_node` based on the number of GPUs you have available.
+    Remove `export WANDB_API_KEY_PATH=/mnt/inputs/.wandb_api_key` if you're not using W&B.
+
 === "Slurm"
 
     We create a Slurm batch script to run the training job. Save the following as `fast-llm.sbat`:

From f3eb0d67d100d25dc76fc2161eb71f82b55b69b2 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Wed, 6 Nov 2024 23:08:54 -0500
Subject: [PATCH 46/87] wip
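
The expanded `PYTHONHASHSEED` note in docs/help.md can be illustrated with a
small sketch. It relies only on standard CPython behaviour: string hashes are
salted per process unless `PYTHONHASHSEED` is set. The helper name
`string_hash_in_subprocess` and the hashed string are hypothetical and are not
part of Fast-LLM; how Fast-LLM derives its barrier values is not shown here.

```python
import os
import subprocess
import sys


def string_hash_in_subprocess(seed):
    """Return hash('train begin') as computed by a fresh Python process.

    `seed` is written to PYTHONHASHSEED; None leaves the variable unset,
    so CPython picks a random salt for that process.
    """
    env = dict(os.environ)
    env.pop("PYTHONHASHSEED", None)
    if seed is not None:
        env["PYTHONHASHSEED"] = str(seed)
    result = subprocess.run(
        [sys.executable, "-c", "print(hash('train begin'))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


# With a fixed seed, independent processes agree on the hash value.
print(string_hash_in_subprocess(0) == string_hash_in_subprocess(0))

# Without a seed, each process salts its hashes differently, so two
# workers would very likely disagree on the value they compare.
print(string_hash_in_subprocess(None), string_hash_in_subprocess(None))
```

Setting `PYTHONHASHSEED=0` for every worker corresponds to the first case.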

---
 docs/help.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/help.md b/docs/help.md
index 0177dcddb..6eed9b04f 100644
--- a/docs/help.md
+++ b/docs/help.md
@@ -10,7 +10,7 @@ Welcome to the Fast-LLM Help Center! Here, you'll find fixes for common hiccups,
 
 Let's stay one step ahead of those pesky gotchas. Here's a list of common issues and quick fixes:
 
--   **CUDA Out of Memory**: When the GPU throws a fit, a few tweaks can help. First, try lowering `micro_batch_size` or `sequence_length` in the configuration to fit within the available memory. Still stuck? Try setting the `mlp_recompute_level` option to `activation` to save memory in the backward pass, or experiment with higher ZeRO stages for reduced memory usage. And if that's not enough, tensor or model parallelism may be your friend. We've got a guide for this, so you're covered.
+-   **CUDA Out of Memory**: When the GPU throws a fit, a few tweaks can help. First, try lowering `micro_batch_size` or `sequence_length` in the configuration to fit within the available memory. Still stuck? Try setting the `mlp_recompute_level` option to `activation` or `full` to save memory in the backward pass, or experiment with higher ZeRO stages for reduced memory usage. And if that's not enough, tensor or model parallelism may be your friend.
 
 -   **Python Hash Seed Sync Error**: Encountering an error like
 
@@ -18,7 +18,7 @@ Let's stay one step ahead of those pesky gotchas. Here's a list of common issues
     RuntimeError: Desync detected for barrier train begin (66830148464 != 133042721120)
     ```
   
-    points to a hashing inconsistency. To fix it, set `PYTHONHASHSEED=0` in your environment variables. This ensures consistent hashing across processes, keeping them in sync.
+    points to a hashing inconsistency. To fix it, set `PYTHONHASHSEED=0` in your environment variables. This ensures that Python's hash seed is consistent across all processes. If these processes have different hash seeds, they'll generate different hash values, leading to desynchronization, as seen in the error message.
 
 -   **`torchrun` Timeout Errors**: If you see timeout errors related to `torchrun` during rendezvous, it could be DNS resolution or a networking issue. Check that all worker nodes are communicating properly with the master node.
 
@@ -28,7 +28,7 @@ Let's stay one step ahead of those pesky gotchas. Here's a list of common issues
     Watchdog caught collective operation timeout: WorkNCCL(SeqNum=408951, OpType=_ALLGATHER_BASE, … , Timeout(ms)=600000) ran for 600351 milliseconds before timing out
     ```
   
-    appearing across all GPU workers, it usually means one or more hosts failed to complete a NCCL operation, causing others to block. NCCL errors can be frustrating to diagnose since they rarely specify which node or GPU caused the issue. We're working on improving this by surfacing which messages and operations are in progress during these crashes to better identify any problematic hosts or GPUs. Stay tuned!
+    appearing across all GPU workers, it usually means one or more hosts failed to complete a NCCL operation, causing others to block. NCCL errors can be frustrating to diagnose since they rarely specify which node or GPU caused the issue. It is difficult to surface which messages and operations are in progress during these crashes. In most cases, the best we can do is to restart the training job and hope it doesn't happen again. If the issue persists, it might be because of network congestion or a problematic GPU. If the worker that crashed is consistent across multiple runs, it's likely a hardware issue. If you can't resolve it, open an issue on GitHub, and we'll help you troubleshoot.
 
 For more detailed solutions, check out our GitHub Issues page. Odds are someone's already tackled a similar problem, and you might find the exact fix you need.
 

From 127cd430fe57fa2c8e5ec31305afbb70aa0ee654 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Thu, 7 Nov 2024 22:11:58 -0500
Subject: [PATCH 47/87] wip

---
 docs/index.md       |  2 +-
 docs/quick-start.md | 36 ++++++++++++++++++------------------
 2 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/docs/index.md b/docs/index.md
index 1fe48e5f2..af136b674 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -6,7 +6,7 @@ hide:
 
 Introducing **Fast-LLM**, the cutting-edge open-source library built for training large language models (LLMs) with **unmatched speed, scalability, and cost-efficiency**. Developed by [ServiceNow Research](https://www.servicenow.com/research/)'s Foundation Models Lab, Fast-LLM is engineered to meet the rigorous demands of professional AI researchers, AI/ML engineers, academic and industrial research institutions, and enterprise product development teams pushing the limits of generative AI. **Achieve groundbreaking research and high-stakes production goals faster with Fast-LLM.**
 
-[Start your journey with Fast-LLM](quick-start.md) and explore the future of LLM training. Dive into [real-world use cases](in-action/slurm.md) to see how Fast-LLM can elevate your training workflows.
+[Start your journey with Fast-LLM](quick-start.md) and explore the future of LLM training. Dive into [real-world use cases](recipes/train-llama-8b.md) to see how Fast-LLM can elevate your training workflows.
 
 ## Why Fast-LLM?
 
diff --git a/docs/quick-start.md b/docs/quick-start.md
index 4a96a5f86..d0cacbc49 100644
--- a/docs/quick-start.md
+++ b/docs/quick-start.md
@@ -584,13 +584,13 @@ Next, we'll create a configuration file for Fast-LLM. Save the following as `~/i
         interval: 1000
         keep: 5
       test_iters: 0
-      export: # (2)!
+      export:  # (2)!
         format: llama
         interval: 20_000
-      wandb: # (3)!
-        project_name: fast-llm
+      wandb:  # (3)!
+        project_name: fast-llm-quickstart
+        group_name: smollm-135m
         entity_name: servicenow
-        tags: quick-start
     batch:
       micro_batch_size: 1  # (4)!
       sequence_length: 1024
@@ -612,7 +612,7 @@ Next, we'll create a configuration file for Fast-LLM. Save the following as `~/i
     pretrained:
       format: llama  # (10)!
       path: /mnt/inputs
-      load_weights: no  # (11)!
+      model_weights: no  # (11)!
     model:
       multi_stage:
         zero_stage: null  # (12)!
@@ -654,9 +654,9 @@ Next, we'll create a configuration file for Fast-LLM. Save the following as `~/i
         format: llama
         interval: 20_000
       wandb:  # (3)!
-        project_name: fast-llm
+        project_name: fast-llm-quickstart
+        group_name: llama-3.2-1B
         entity_name: servicenow
-        tags: quick-start
     batch:
       micro_batch_size: 1  # (4)!
       sequence_length: 1024
@@ -678,7 +678,7 @@ Next, we'll create a configuration file for Fast-LLM. Save the following as `~/i
     pretrained:
       format: llama  # (10)!
       path: /mnt/inputs
-      load_weights: yes  # (11)!
+      model_weights: yes  # (11)!
     model:
       multi_stage:
         zero_stage: null  # (12)!
@@ -776,7 +776,7 @@ Alright, the big moment! Let's launch the training run.
     ```
 
     Change the `--gpus-per-node` value to match the number of GPUs on your node.
-    If you're not using W&B, remove the references to `WARDB_API_KEY_PATH`.
+    If you're not using W&B, remove the references to `WANDB_API_KEY_PATH`.
 
     Submit the job to the Slurm cluster:
 
@@ -893,21 +893,21 @@ Alright, the big moment! Let's launch the training run.
     Use `kubectl logs fast-llm-master-0` to check the logs.
     Fast-LLM will log training progress to the console every 10 iterations.
 
-You can expect to see the following throughput:
+You can expect to see the following performance metrics in Fast-LLM's output:
 
 === "SmolLM-135M"
 
-    | Metric              | A100-80GB    | H100         |
-    |---------------------|-------------:|-------------:|
-    | Tokens/s            | 1,234,567    | 1,456,789    |
-    | TFLOPS              | 312          | 512          |
+    | Performance Metric  | A100 SXM4 80 GB | H100 SXM5 80 GB |
+    |---------------------|----------------:|----------------:|
+    | Tokens/s/GPU        | 1,234,567       | 1,456,789       |
+    | TFLOPS              | 312             | 512             |
 
 === "Llama-3.2-1B"
 
-    | Metric              | A100         | H100         |
-    |---------------------|-------------:|-------------:|
-    | Tokens/s            | 1,234,567    | 1,456,789    |
-    | TFLOPS              | 312          | 512          |
+    | Performance Metric  | A100 SXM4 80 GB | H100 SXM5 80 GB |
+    |---------------------|----------------:|----------------:|
+    | Tokens/s/GPU        | 1,234,567       | 1,456,789       |
+    | TFLOPS              | 312             | 512             |
 
 If you included the W&B section in your configuration, you can also track your training progress on the Weights & Biases dashboard as well. Follow the link in the console output to view your training run.
 

From c3822985cab2d29060968fb27f5ab49b7938fd63 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Thu, 7 Nov 2024 22:27:53 -0500
Subject: [PATCH 48/87] add datasets as dependency

---
 setup.cfg | 1 +
 1 file changed, 1 insertion(+)

diff --git a/setup.cfg b/setup.cfg
index f36ad457f..62206bd4e 100644
--- a/setup.cfg
+++ b/setup.cfg
@@ -32,6 +32,7 @@ OPTIONAL =
     # Huggingface tools
     transformers>=4.44.2
     hf-transfer>=0.1.8
+    datasets>=3.1.0
     # Weights and biases
     wandb>=0.17.7
     # Hydra

From 7304119ce1f5804943a2e5ef65a98288f7d79136 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Sat, 9 Nov 2024 11:47:52 -0500
Subject: [PATCH 49/87] fix GPTMemmapDataset
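
The pointer fix in `write_dataset` replaces a hard-coded factor of 2 with the
item size of the document dtype. The sketch below shows the intended
arithmetic in plain Python. It assumes that `padded_cumsum` returns a
cumulative sum with a leading zero (the helper is not shown in this patch);
`document_pointers` is a hypothetical name used only for illustration.

```python
from itertools import accumulate


def padded_cumsum(values):
    # Assumed behaviour of the helper used in memmap.py:
    # cumulative sum with a leading 0.
    return [0, *accumulate(values)]


def document_pointers(lengths, itemsize):
    # Byte offset of each document's first token in the flat token file.
    # The last length is not needed: it only marks where the file ends.
    return [offset * itemsize for offset in padded_cumsum(lengths[:-1])]


lengths = [3, 5, 2]  # tokens per document

# 2-byte tokens (e.g. uint16): this is what the old "* 2" produced.
print(document_pointers(lengths, 2))

# 4-byte tokens (e.g. int32): the old factor of 2 would have been wrong here.
print(document_pointers(lengths, 4))
```

For any dtype that is not 2 bytes wide, the old offsets pointed into the
middle of earlier documents, which is the case the new property-based test
exercises.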

---
 fast_llm/data/gpt/memmap.py  |  2 +-
 setup.cfg                    |  1 +
 tests/test_memmap_dataset.py | 33 +++++++++++++++++++++++++++++++++
 3 files changed, 35 insertions(+), 1 deletion(-)
 create mode 100644 tests/test_memmap_dataset.py

diff --git a/fast_llm/data/gpt/memmap.py b/fast_llm/data/gpt/memmap.py
index b49bb9a57..0ff4857b5 100644
--- a/fast_llm/data/gpt/memmap.py
+++ b/fast_llm/data/gpt/memmap.py
@@ -106,7 +106,7 @@ def write_dataset(cls, prefix: pathlib.Path | str, documents: list[np.ndarray]):
         dtype = documents[0].dtype
         num_documents = len(documents)
         lengths = np.array([len(document) for document in documents], dtype=np.int32)
-        pointers = padded_cumsum(lengths[:-1].astype(np.int64) * 2)
+        pointers = padded_cumsum(lengths[:-1].astype(np.int64)) * np.dtype(dtype).itemsize
         prefix.parent.mkdir(parents=True, exist_ok=True)
         with prefix.with_suffix(".idx").open("wb") as stream:
             stream.write(cls._INDEX_HEADER)
diff --git a/setup.cfg b/setup.cfg
index a353151cd..d45144a5e 100644
--- a/setup.cfg
+++ b/setup.cfg
@@ -42,6 +42,7 @@ OPTIONAL =
 DEV =
     pytest>=8.3.2
     pytest-depends>=1.0.1
+    hypothesis>=6.118.1
 
 # Required for building the documentation
 DOCS =
diff --git a/tests/test_memmap_dataset.py b/tests/test_memmap_dataset.py
new file mode 100644
index 000000000..af7153cad
--- /dev/null
+++ b/tests/test_memmap_dataset.py
@@ -0,0 +1,33 @@
+from hypothesis import given, strategies as st
+from hypothesis.extra import numpy as npst
+import numpy as np
+from tempfile import TemporaryDirectory
+from pathlib import Path
+from fast_llm.data.gpt.memmap import GPTMemmapDataset
+
+def dtype_arrays(dtype: type[np.integer], min_size: int=1, max_size: int=100) -> st.SearchStrategy:
+    return st.lists(
+        npst.arrays(
+            dtype=dtype,
+            shape=st.integers(1, 1000),
+            elements=st.integers(
+                min_value=np.iinfo(dtype).min,
+                max_value=np.iinfo(dtype).max,
+            ),
+        ),
+        min_size=min_size,
+        max_size=max_size,
+    )
+
+for dtype in [np.int8, np.uint16, np.int16, np.int32, np.int64]:
+    @given(arrays=dtype_arrays(dtype))
+    def test_gpt_memmap_dataset(arrays: list[np.ndarray]):
+        run_gpt_memmap_dataset_test(documents=arrays)
+
+def run_gpt_memmap_dataset_test(documents: list[np.ndarray]) -> None:
+    with TemporaryDirectory() as temp_dir:
+        prefix = Path(temp_dir)
+        GPTMemmapDataset.write_dataset(prefix=prefix, documents=documents)
+        dataset = GPTMemmapDataset(name="foo", prefix=prefix)
+        for i, document in enumerate(documents):
+            assert np.array_equal(dataset.get(i), document), f"Mismatch at index {i}"

From 47d453b6b7482c63e5469b26dc5bb990ea736a42 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Sat, 9 Nov 2024 12:03:14 -0500
Subject: [PATCH 50/87] fix GPTMemmapDataset

---
 tests/test_memmap_dataset.py | 36 +++++++++++++++---------------------
 1 file changed, 15 insertions(+), 21 deletions(-)

diff --git a/tests/test_memmap_dataset.py b/tests/test_memmap_dataset.py
index af7153cad..49ca44d13 100644
--- a/tests/test_memmap_dataset.py
+++ b/tests/test_memmap_dataset.py
@@ -4,30 +4,24 @@
 from tempfile import TemporaryDirectory
 from pathlib import Path
 from fast_llm.data.gpt.memmap import GPTMemmapDataset
+import pytest
 
-def dtype_arrays(dtype: type[np.integer], min_size: int=1, max_size: int=100) -> st.SearchStrategy:
+def dtype_arrays(dtype: np.dtype, min_size: int=1, max_size: int=100) -> st.SearchStrategy:
     return st.lists(
-        npst.arrays(
-            dtype=dtype,
-            shape=st.integers(1, 1000),
-            elements=st.integers(
-                min_value=np.iinfo(dtype).min,
-                max_value=np.iinfo(dtype).max,
-            ),
-        ),
+        npst.arrays(dtype=dtype, shape=st.integers(1, 1000)),
         min_size=min_size,
         max_size=max_size,
     )
 
-for dtype in [np.int8, np.uint16, np.int16, np.int32, np.int64]:
-    @given(arrays=dtype_arrays(dtype))
-    def test_gpt_memmap_dataset(arrays: list[np.ndarray]):
-        run_gpt_memmap_dataset_test(documents=arrays)
-
-def run_gpt_memmap_dataset_test(documents: list[np.ndarray]) -> None:
-    with TemporaryDirectory() as temp_dir:
-        prefix = Path(temp_dir)
-        GPTMemmapDataset.write_dataset(prefix=prefix, documents=documents)
-        dataset = GPTMemmapDataset(name="foo", prefix=prefix)
-        for i, document in enumerate(documents):
-            assert np.array_equal(dataset.get(i), document), f"Mismatch at index {i}"
+@pytest.mark.parametrize("dtype", GPTMemmapDataset._DTYPES.values())
+def test_gpt_memmap_dataset(dtype):
+    @given(documents=dtype_arrays(dtype))
+    def inner_test(documents):
+        with TemporaryDirectory() as temp_dir:
+            prefix = Path(temp_dir)
+            GPTMemmapDataset.write_dataset(prefix=prefix, documents=documents)
+            dataset = GPTMemmapDataset(name="foo", prefix=prefix)
+            for i, document in enumerate(documents):
+                assert np.array_equal(dataset.get(i), document, equal_nan=True), f"Mismatch at index {i}"
+    
+    inner_test()

From bef3a723f8d014a2e24d3546198d18d54234a69f Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Sat, 9 Nov 2024 23:00:47 -0500
Subject: [PATCH 51/87] add prepare-dataset command

---
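Reviewer note: when `dataset.data_type` is unset, `_validate` derives it from the tokenizer vocabulary size. A minimal standalone sketch of that rule follows. `smallest_signed_dtype` is a hypothetical name used only here, and it returns NumPy types rather than Fast-LLM's `DataType` members.

```python
import numpy as np


def smallest_signed_dtype(vocab_size: int) -> type[np.signedinteger]:
    """Return the narrowest signed dtype whose maximum covers `vocab_size`.

    Hypothetical helper mirroring the int16 / int32 fallback in this patch.
    """
    for candidate in (np.int16, np.int32):
        if vocab_size <= np.iinfo(candidate).max:
            return candidate
    raise ValueError(f"Vocabulary size {vocab_size} does not fit in int32.")


small = smallest_signed_dtype(32_000)   # fits below the int16 maximum of 32767
large = smallest_signed_dtype(50_257)   # exceeds the int16 maximum, so falls back to int32
```

Vocabularies between 32768 and 65535 entries would fit in `uint16`, but the patch leaves that branch commented out because `DataType` does not support it, so they are stored as `int32`.
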
 fast_llm/data/config.py           |   1 -
 fast_llm/data/tokenizer.py        |   2 +-
 fast_llm/tools/cli.py             |   4 +-
 fast_llm/tools/prepare_dataset.py | 345 ++++++++++++++++++++++++++++++
 fast_llm/utils.py                 |   8 +-
 setup.cfg                         |   3 +
 6 files changed, 358 insertions(+), 5 deletions(-)
 create mode 100644 fast_llm/tools/prepare_dataset.py
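
Reviewer note: `run` splits the tokenized data into roughly `tokens_per_shard`-sized shards and records each shard's share of the total token count as its `weight` in `fast_llm_dataset.json`. The sketch below reproduces only that arithmetic. `count_shards` and `add_weights` are hypothetical names, and integer ceiling division is used here in place of the `np.ceil` call in the patch.

```python
def count_shards(total_tokens: int, tokens_per_shard: int) -> int:
    """Number of shards needed so that no shard exceeds the target on average."""
    # Integer ceiling division; avoids float rounding for very large counts.
    return -(-total_tokens // tokens_per_shard)


def add_weights(shards: list[dict]) -> list[dict]:
    """Return copies of the shard records with a `weight` proportional to tokens."""
    total = sum(shard["num_tokens"] for shard in shards)
    return [{**shard, "weight": shard["num_tokens"] / total} for shard in shards]


num_shards = count_shards(total_tokens=2_500_000_000, tokens_per_shard=1_000_000_000)
weighted = add_weights(
    [
        {"prefix": "shard_0_0", "num_tokens": 300},
        {"prefix": "shard_0_1", "num_tokens": 100},
    ]
)
```

The weights sum to 1 across all shards gathered on rank 0, so they can be read as sampling proportions.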

diff --git a/fast_llm/data/config.py b/fast_llm/data/config.py
index 836a6d17b..5e829150b 100644
--- a/fast_llm/data/config.py
+++ b/fast_llm/data/config.py
@@ -106,7 +106,6 @@ def _validate(self):
 class TokenizerConfig(Config):
     """
     Configuration for the tokenizer.
-    Currently, the tokenizer is only needed for FIM.
     """
 
     format: str = Field(
diff --git a/fast_llm/data/tokenizer.py b/fast_llm/data/tokenizer.py
index d75aab7f1..2061d6b6f 100644
--- a/fast_llm/data/tokenizer.py
+++ b/fast_llm/data/tokenizer.py
@@ -6,7 +6,7 @@
 
 class Tokenizer:
     """
-    A Huggingface (transformers) tokenizer.
+    A wrapper around a Huggingface (transformers) tokenizer.
     """
 
     def __init__(self, config: TokenizerConfig):
diff --git a/fast_llm/tools/cli.py b/fast_llm/tools/cli.py
index 7b338953f..d3ac5e6d7 100644
--- a/fast_llm/tools/cli.py
+++ b/fast_llm/tools/cli.py
@@ -15,13 +15,15 @@ def fast_llm(args=None):
     # (Pre-)configure logging
     configure_logging()
     parser = argparse.ArgumentParser(add_help=False)
-    parser.add_argument("subcommand", choices=["train", "convert"])
+    parser.add_argument("subcommand", choices=["train", "convert", "prepare_dataset"])
     parsed, unparsed = parser.parse_known_args(args)
     try:
         if parsed.subcommand == "train":
             from fast_llm.tools.train import CliTrainingConfig as Runnable
         elif parsed.subcommand == "convert":
             from fast_llm.tools.convert import ConversionConfig as Runnable
+        elif parsed.subcommand == "prepare_dataset":
+            from fast_llm.tools.prepare_dataset import PrepareDatasetConfig as Runnable
         else:
             raise RuntimeError("Unknown subcommand")
         Runnable.parse_and_run(unparsed)
diff --git a/fast_llm/tools/prepare_dataset.py b/fast_llm/tools/prepare_dataset.py
new file mode 100644
index 000000000..80dc14263
--- /dev/null
+++ b/fast_llm/tools/prepare_dataset.py
@@ -0,0 +1,345 @@
+import abc
+import argparse
+import json
+import os
+import pathlib
+import typing
+from multiprocessing import Pool
+
+import numpy as np
+import torch.distributed
+
+from fast_llm.config import Config, Field, FieldHint, check_field, config_class
+from fast_llm.data.config import TokenizerConfig
+from fast_llm.data.gpt.memmap import GPTMemmapDataset
+from fast_llm.data.tokenizer import Tokenizer
+from fast_llm.engine.config_utils.data_type import DataType
+from fast_llm.engine.config_utils.runnable import RunnableConfig
+from fast_llm.utils import Assert, Registry
+
+
+@config_class
+class DistributedConfig(Config):
+    default_world_size: typing.ClassVar[int] = int(os.environ.get("WORLD_SIZE", 1))
+    default_rank: typing.ClassVar[int] = int(os.environ.get("RANK", 0))
+    world_size: int = Field(
+        default=None,
+        desc="Size of the world group. Typically provided by torchrun or equivalent through the `WORLD_SIZE` environment variable.",
+        hint=FieldHint.expert,
+        valid=check_field(Assert.gt, 0),
+    )
+    rank: int = Field(
+        default=None,
+        desc="Rank of the local process. Typically provided by torchrun or equivalent through the `RANK` environment variable.",
+        hint=FieldHint.expert,
+        valid=check_field(Assert.geq, 0),
+    )
+    backend: str = Field(
+        default="gloo",
+        desc="Distributed backend to use.",
+        hint=FieldHint.optional,
+        valid=check_field(Assert.incl, torch.distributed.Backend.backend_list),
+    )
+
+    def _validate(self):
+        if self.world_size is None:
+            self.world_size = self.default_world_size
+        if self.rank is None:
+            self.rank = self.default_rank
+        super()._validate()
+        Assert.in_range(self.rank, 0, self.world_size)
+
+
+@config_class()
+class DatasetPreparatorConfig(RunnableConfig):
+    _abstract = True
+    model_name: typing.ClassVar[str]
+
+    output_path: pathlib.Path = Field(
+        desc="Output directory for the processed dataset.",
+        hint=FieldHint.core,
+    )
+    distributed: DistributedConfig = Field(
+        default_factory=DistributedConfig,
+        desc="Configuration for distributed processing.",
+        hint=FieldHint.feature,
+    )
+
+    @classmethod
+    def get_dataset_preparator_class(cls) -> typing.Type["DatasetPreparator"]:
+        raise NotImplementedError
+
+    def _get_runnable(self, parsed: argparse.Namespace) -> typing.Callable[[], None]:
+        dataset_preparator = self.get_dataset_preparator_class()(config=self)
+        return dataset_preparator.run
+
+
+class DatasetPreparator(abc.ABC):
+    _abstract = True
+    _config: DatasetPreparatorConfig
+    config_class: typing.ClassVar[type[DatasetPreparatorConfig]] = DatasetPreparatorConfig
+
+    def __init__(self, config: DatasetPreparatorConfig) -> None:
+        Assert.custom(isinstance, config, self.config_class)
+        config.validate()
+        self._config = config
+
+    def run(self) -> None:
+        raise NotImplementedError
+
+
+@config_class
+class GPTDatasetConfig(Config):
+    name_or_path: str = Field(
+        desc="Name or path of the dataset.",
+        hint=FieldHint.core,
+    )
+    config_name: None | str = Field(
+        default=None,
+        desc="Specific configuration name for the dataset.",
+        hint=FieldHint.optional,
+    )
+    split: str = Field(
+        default="train",
+        desc="Split of the dataset to use.",
+        hint=FieldHint.optional,
+    )
+    field: str = Field(
+        default="text",
+        desc="Field of the dataset to use.",
+        hint=FieldHint.optional,
+    )
+    data_type: DataType = Field(
+        default=None,
+        desc="Data type of the dataset field.",
+        hint=FieldHint.derived,
+    )
+    trust_remote_code: bool = Field(
+        default=False,
+        desc="Trust remote code when downloading the dataset.",
+        hint=FieldHint.optional,
+    )
+    disable_disk_space_check: bool = Field(
+        default=False,
+        desc="Disable disk space check. Useful for environments where disk space is not accurately reported.",
+        hint=FieldHint.optional,
+    )
+
+
+@config_class()
+class GPTDatasetPreparatorConfig(DatasetPreparatorConfig):
+    _abstract = False
+    model_name: typing.ClassVar[str] = "gpt"
+
+    tokens_per_shard: int = Field(
+        default=1_000_000_000,
+        desc="Approximate number of tokens per shard.",
+        hint=FieldHint.feature,
+        valid=check_field(Assert.geq, 100_000),
+    )
+    loading_workers: int = Field(
+        default=1,
+        desc="Number of workers in load_dataset() call.",
+        hint=FieldHint.optional,
+        valid=check_field(Assert.geq, 1),
+    )
+    tokenize_workers: int = Field(
+        default=1,
+        desc="Number of workers for tokenization.",
+        hint=FieldHint.optional,
+        valid=check_field(Assert.geq, 1),
+    )
+    saving_workers: int = Field(
+        default=1,
+        desc="Number of processes for saving the data.",
+        hint=FieldHint.optional,
+        valid=check_field(Assert.geq, 1),
+    )
+    dataset: GPTDatasetConfig = Field(
+        default_factory=GPTDatasetConfig,
+        desc="Configuration for the dataset.",
+        hint=FieldHint.feature,
+    )
+    tokenizer: TokenizerConfig = Field(
+        default_factory=TokenizerConfig,
+        desc="Configuration for the tokenizer.",
+        hint=FieldHint.feature,
+    )
+    _tokenizer: Tokenizer = Field(
+        init=False,
+        desc="The tokenizer instance.",
+        hint=FieldHint.derived,
+    )
+
+    def _validate(self):
+        Assert.not_none(self.tokenizer.path)
+        self._tokenizer = Tokenizer(config=self.tokenizer)
+        if self.dataset.data_type is None:
+            # Decide the datatype based on the tokenizer vocabulary size
+            vocab_size = self._tokenizer.vocab_size
+            if vocab_size <= np.iinfo(np.int16).max:
+                self.dataset.data_type = DataType.int16
+            # elif vocab_size <= np.iinfo(np.uint16).max:
+            #     self.dataset.data_type = DataType.uint16  # Not supported by Fast-LLM's DataType
+            elif vocab_size <= np.iinfo(np.int32).max:
+                self.dataset.data_type = DataType.int32
+            else:
+                raise ValueError(f"Tokenizer vocabulary size {vocab_size} is too large. This is likely an error.")
+        super()._validate()
+
+    @classmethod
+    def get_dataset_preparator_class(cls):
+        return GPTDatasetPreparator
+
+
+class GPTDatasetPreparator(DatasetPreparator):
+    _abstract = False
+    _config: GPTDatasetPreparatorConfig
+    config_class = GPTDatasetPreparatorConfig
+
+    def _tokenize_batch(self, batch):
+        input_ids = [
+            np.array(self._tokenizer.tokenize(text), dtype=self.dataset.data_type.numpy)
+            for text in batch[self.dataset.field]
+        ]
+        num_tokens = [len(x) for x in input_ids]
+        return {
+            "input_ids": input_ids,
+            "num_tokens": num_tokens,
+        }
+
+    def _save_shard(self, args) -> dict:
+        from tqdm import tqdm
+
+        shard_idx, shard_dataset = args
+        prefix = f"shard_{self.rank}_{shard_idx}"
+        shard_output_path = self._config.output_path / prefix
+        documents = [
+            np.array(item["input_ids"], dtype=self.dataset.data_type.numpy)
+            for item in tqdm(shard_dataset, desc=f"Saving shard {shard_idx}", unit="docs")
+        ]
+        GPTMemmapDataset.write_dataset(prefix=shard_output_path, documents=documents)
+        dataset_dict = {
+            "prefix": prefix,
+            "num_documents": len(documents),
+            "num_tokens": sum(len(doc) for doc in documents),
+        }
+        return dataset_dict
+
+    def run(self):
+        import datasets
+        import transformers
+        from tqdm import tqdm
+
+        # Set transformers logging verbosity
+        transformers.logging.set_verbosity_error()
+
+        if self._config.dataset.disable_disk_space_check:
+            datasets.builder.has_sufficient_disk_space = lambda needed_bytes, directory=".": True
+
+        # Initialize distributed processing
+        if self._config.distributed.world_size > 1:
+            torch.distributed.init_process_group(
+                backend=self._config.distributed.backend,
+                rank=self._config.distributed.rank,
+                world_size=self._config.distributed.world_size,
+            )
+
+        # Prepare output directory
+        self._config.output_path.mkdir(parents=True, exist_ok=True)
+
+        # Download dataset
+        download_path = self._config.output_path / "downloaded_dataset"
+        if self._config.distributed.rank == 0:
+            datasets.load_dataset(
+                path=self._config.dataset.name_or_path,
+                name=self._config.dataset.config_name,
+                split=self._config.dataset.split,
+                num_proc=self._config.loading_workers,
+                trust_remote_code=self._config.dataset.trust_remote_code,
+            ).save_to_disk(download_path, num_proc=self._config.saving_workers)
+
+        # Synchronize processes to wait for the download
+        if self._config.distributed.world_size > 1:
+            torch.distributed.barrier()
+
+        # Load and shard the dataset
+        dataset = datasets.load_from_disk(download_path).shard(
+            num_shards=self._config.distributed.world_size,
+            index=self._config.distribted.rank,
+        )
+        if self._config.dataset.field not in dataset.column_names:
+            raise ValueError(f"Dataset does not have field '{self._config.dataset.field}'.")
+
+        # Tokenize the dataset
+        tokenized_dataset = dataset.map(
+            self._tokenize_batch,
+            batched=True,
+            num_proc=self._config.tokenize_workers,
+            desc="Tokenizing batches",
+        )
+
+        # Calculate total number of tokens
+        total_tokens = sum(tqdm(tokenized_dataset["num_tokens"], desc="Counting tokens", unit="tokens"))
+
+        # Split dataset into shards
+        num_shards = int(np.ceil(total_tokens / self._config.tokens_per_shard))
+        shards = [
+            (i, tokenized_dataset.shard(num_shards=num_shards, index=i))
+            for i in tqdm(range(num_shards), desc="Creating shards")
+        ]
+
+        # Use multiprocessing to save each shard in parallel
+        with Pool(processes=self._config.saving_workers) as pool:
+            dataset_dicts = pool.map(self._save_shard, shards)
+
+        # Gather dataset_dicts from all ranks to rank 0
+        if self._config.distributed.world_size > 1:
+            all_dataset_dicts = [None] * self._config.distributed.world_size
+            torch.distributed.gather_object(dataset_dicts, all_dataset_dicts, dst=0)
+            if self._config.distributed.rank == 0:
+                dataset_dicts = [item for sublist in all_dataset_dicts for item in sublist]
+
+        # Create a metadata file
+        if self._config.distributed.rank == 0:
+            total_tokens = sum(dataset_dict["num_tokens"] for dataset_dict in dataset_dicts)
+            for dataset_dict in dataset_dicts:
+                dataset_dict["weight"] = float(dataset_dict["num_tokens"]) / float(total_tokens)
+            output_file = self._config.output_path / "fast_llm_dataset.json"
+            json.dump({"datasets": dataset_dicts}, output_file.open("w"))
+
+        # Finalize distributed processing
+        if self._config.distributed.world_size > 1:
+            torch.distributed.barrier()
+            torch.distributed.destroy_process_group()
+
+
+dataset_preparator_registry = Registry(
+    "DatasetPreparator",
+    {
+        dataset_preparator.model_name: dataset_preparator
+        for dataset_preparator in [
+            GPTDatasetPreparatorConfig,
+        ]
+    },
+)
+
+
+class PrepareDatasetConfig(RunnableConfig):
+    @classmethod
+    def _get_parser(cls):
+        parser = super()._get_parser()
+        parser.add_argument(
+            "model_type",
+            choices=dataset_preparator_registry.keys(),
+            help="The Fast-LLM model type to prepare a dataset for. Must be registered in `dataset_preparator_registry`.",
+        )
+        return parser
+
+    @classmethod
+    def _from_parsed_args(cls, parsed: argparse.Namespace, unparsed: list[str]):
+        return dataset_preparator_registry[parsed.model_type]._from_parsed_args(parsed, unparsed)
+
+
+if __name__ == "__main__":
+    PrepareDatasetConfig.parse_and_run()
diff --git a/fast_llm/utils.py b/fast_llm/utils.py
index 937aacd8c..111d5ca1f 100644
--- a/fast_llm/utils.py
+++ b/fast_llm/utils.py
@@ -116,6 +116,10 @@ def in_range_incl(x, low, high):
     @staticmethod
     def none(x):
         assert x is None, f"Object of type {type(x)} is not None ({str(x)})"
+    
+    @staticmethod
+    def not_none(x):
+        assert x is not None, "Object is None"
 
     @staticmethod
     def empty(x):
@@ -175,8 +179,8 @@ def not_custom(fn, *args, **kwargs):
         ), f"Assertion failed: not fn({', '.join(itertools.chain((str(x) for x in args),(f'{str(k)}={str(v)}' for k,v in kwargs.items())))})"
 
 
-class Registry:
-    def __init__(self, name, data: dict):
+class Registry[_KT, _VT]:
+    def __init__(self, name: str, data: dict[_KT, _VT]):
         self._name = name
         self._data = data.copy()
 
diff --git a/setup.cfg b/setup.cfg
index d45144a5e..1cf0a541e 100644
--- a/setup.cfg
+++ b/setup.cfg
@@ -32,11 +32,14 @@ OPTIONAL =
     # Huggingface tools
     transformers>=4.44.2
     hf-transfer>=0.1.8
+    datasets>=3.1.0
     # Weights and biases
     wandb>=0.17.7
     # Hydra
     hydra-core>=1.3.2
     omegaconf>=2.3.0
+    # Miscellaneous
+    tqdm>=4.66.3
 
 # Required for testing
 DEV =

From 0ffc75c6e3039e2456810749d506e277e84b890b Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Sat, 9 Nov 2024 23:39:01 -0500
Subject: [PATCH 52/87] add prepare-dataset command

---
 fast_llm/tools/prepare_dataset.py | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fast_llm/tools/prepare_dataset.py b/fast_llm/tools/prepare_dataset.py
index 80dc14263..67f82afb5 100644
--- a/fast_llm/tools/prepare_dataset.py
+++ b/fast_llm/tools/prepare_dataset.py
@@ -199,8 +199,8 @@ class GPTDatasetPreparator(DatasetPreparator):
 
     def _tokenize_batch(self, batch):
         input_ids = [
-            np.array(self._tokenizer.tokenize(text), dtype=self.dataset.data_type.numpy)
-            for text in batch[self.dataset.field]
+            np.array(self._config._tokenizer.tokenize(text), dtype=self._config.dataset.data_type.numpy)
+            for text in batch[self._config.dataset.field]
         ]
         num_tokens = [len(x) for x in input_ids]
         return {
@@ -266,7 +266,7 @@ def run(self):
         # Load and shard the dataset
         dataset = datasets.load_from_disk(download_path).shard(
             num_shards=self._config.distributed.world_size,
-            index=self._config.distribted.rank,
+            index=self._config.distributed.rank,
         )
         if self._config.dataset.field not in dataset.column_names:
             raise ValueError(f"Dataset does not have field '{self._config.dataset.field}'.")

From fda6386538516774efed05d6a2540d708c6eb67b Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Sun, 10 Nov 2024 09:42:54 -0500
Subject: [PATCH 53/87] add prepare-dataset command

---
 fast_llm/tools/prepare_dataset.py | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/fast_llm/tools/prepare_dataset.py b/fast_llm/tools/prepare_dataset.py
index 67f82afb5..f0437dd09 100644
--- a/fast_llm/tools/prepare_dataset.py
+++ b/fast_llm/tools/prepare_dataset.py
@@ -212,10 +212,10 @@ def _save_shard(self, args) -> dict:
         from tqdm import tqdm
 
         shard_idx, shard_dataset = args
-        prefix = f"shard_{self.rank}_{shard_idx}"
+        prefix = f"shard_{self._config.distributed.rank}_{shard_idx}"
         shard_output_path = self._config.output_path / prefix
         documents = [
-            np.array(item["input_ids"], dtype=self.dataset.data_type.numpy)
+            np.array(item["input_ids"], dtype=self._config.dataset.data_type.numpy)
             for item in tqdm(shard_dataset, desc=f"Saving shard {shard_idx}", unit="docs")
         ]
         GPTMemmapDataset.write_dataset(prefix=shard_output_path, documents=documents)
@@ -295,10 +295,12 @@ def run(self):
 
         # Gather dataset_dicts from all ranks to rank 0
         if self._config.distributed.world_size > 1:
-            all_dataset_dicts = [None] * self._config.distributed.world_size
-            torch.distributed.gather_object(dataset_dicts, all_dataset_dicts, dst=0)
             if self._config.distributed.rank == 0:
+                all_dataset_dicts = [None] * self._config.distributed.world_size
+                torch.distributed.gather_object(dataset_dicts, all_dataset_dicts, dst=0)
                 dataset_dicts = [item for sublist in all_dataset_dicts for item in sublist]
+            else:
+                torch.distributed.gather_object(dataset_dicts, [], dst=0)
 
         # Create a metadata file
         if self._config.distributed.rank == 0:

From acae7d92960ac11eb9d05035a8a41a5f8c3a0a69 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Sun, 10 Nov 2024 09:58:00 -0500
Subject: [PATCH 54/87] add prepare-dataset command

---
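Reviewer note: this patch skips the download when an `ok` marker file already exists inside the download directory, and touches the marker only after `save_to_disk` returns. A minimal sketch of that pattern with the standard library follows. `run_once` and its `produce` callback are hypothetical and stand in for the `load_dataset(...).save_to_disk(...)` call.

```python
import tempfile
from pathlib import Path
from typing import Callable


def run_once(target_dir: Path, produce: Callable[[Path], None]) -> bool:
    """Run `produce(target_dir)` unless an earlier run finished; return True if it ran."""
    marker = target_dir / "ok"
    if marker.exists():
        return False
    target_dir.mkdir(parents=True, exist_ok=True)
    produce(target_dir)
    # Written last, so an interrupted run leaves no marker and is retried.
    marker.touch()
    return True


calls: list[Path] = []
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "downloaded_dataset"
    first = run_once(target, calls.append)
    second = run_once(target, calls.append)
```

Only the first call does the work; the second sees the marker and returns immediately.
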
 fast_llm/tools/prepare_dataset.py | 27 +++++++++++++++++++--------
 1 file changed, 19 insertions(+), 8 deletions(-)

diff --git a/fast_llm/tools/prepare_dataset.py b/fast_llm/tools/prepare_dataset.py
index f0437dd09..d172c7c67 100644
--- a/fast_llm/tools/prepare_dataset.py
+++ b/fast_llm/tools/prepare_dataset.py
@@ -155,6 +155,11 @@ class GPTDatasetPreparatorConfig(DatasetPreparatorConfig):
         hint=FieldHint.optional,
         valid=check_field(Assert.geq, 1),
     )
+    clean_output: bool = Field(
+        default=False,
+        desc="Remove downloaded dataset after processing.",
+        hint=FieldHint.optional,
+    )
     dataset: GPTDatasetConfig = Field(
         default_factory=GPTDatasetConfig,
         desc="Configuration for the dataset.",
@@ -248,9 +253,10 @@ def run(self):
         # Prepare output directory
         self._config.output_path.mkdir(parents=True, exist_ok=True)
 
-        # Download dataset
+        # Download dataset if necessary on rank 0
         download_path = self._config.output_path / "downloaded_dataset"
-        if self._config.distributed.rank == 0:
+        download_path_ok = download_path / "ok"
+        if self._config.distributed.rank == 0 and not download_path_ok.exists():
             datasets.load_dataset(
                 path=self._config.dataset.name_or_path,
                 name=self._config.dataset.config_name,
@@ -258,12 +264,13 @@ def run(self):
                 num_proc=self._config.loading_workers,
                 trust_remote_code=self._config.dataset.trust_remote_code,
             ).save_to_disk(download_path, num_proc=self._config.saving_workers)
+            download_path_ok.touch()
 
-        # Synchronize processes to wait for the download
+        # Synchronize processes to wait for the download to finish
         if self._config.distributed.world_size > 1:
             torch.distributed.barrier()
 
-        # Load and shard the dataset
+        # Load and shard the dataset on each rank
         dataset = datasets.load_from_disk(download_path).shard(
             num_shards=self._config.distributed.world_size,
             index=self._config.distributed.rank,
@@ -271,7 +278,7 @@ def run(self):
         if self._config.dataset.field not in dataset.column_names:
             raise ValueError(f"Dataset does not have field '{self._config.dataset.field}'.")
 
-        # Tokenize the dataset
+        # Tokenize the dataset in parallel
         tokenized_dataset = dataset.map(
             self._tokenize_batch,
             batched=True,
@@ -282,14 +289,14 @@ def run(self):
         # Calculate total number of tokens
         total_tokens = sum(tqdm(tokenized_dataset["num_tokens"], desc="Counting tokens", unit="tokens"))
 
-        # Split dataset into shards
+        # Split dataset into shards based on number of tokens
         num_shards = int(np.ceil(total_tokens / self._config.tokens_per_shard))
         shards = [
             (i, tokenized_dataset.shard(num_shards=num_shards, index=i))
             for i in tqdm(range(num_shards), desc="Creating shards")
         ]
 
-        # Use multiprocessing to save each shard in parallel
+        # Use multiprocessing to save each shard in parallel on all ranks
         with Pool(processes=self._config.saving_workers) as pool:
             dataset_dicts = pool.map(self._save_shard, shards)
 
@@ -302,7 +309,7 @@ def run(self):
             else:
                 torch.distributed.gather_object(dataset_dicts, [], dst=0)
 
-        # Create a metadata file
+        # Create a metadata file on rank 0
         if self._config.distributed.rank == 0:
             total_tokens = sum(dataset_dict["num_tokens"] for dataset_dict in dataset_dicts)
             for dataset_dict in dataset_dicts:
@@ -314,6 +321,10 @@ def run(self):
         if self._config.distributed.world_size > 1:
             torch.distributed.barrier()
             torch.distributed.destroy_process_group()
+        
+        # Clean up downloaded dataset
+        if self._config.clean_output and self._config.distributed.rank == 0:
+            download_path.unlink(missing_ok=True)
 
 
 dataset_preparator_registry = Registry(

From eb7da598608719db6f995422b3fe28e8e72719ff Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Sun, 10 Nov 2024 10:01:14 -0500
Subject: [PATCH 55/87] add prepare-dataset command

---
 fast_llm/tools/prepare_dataset.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fast_llm/tools/prepare_dataset.py b/fast_llm/tools/prepare_dataset.py
index d172c7c67..b8d18fa8a 100644
--- a/fast_llm/tools/prepare_dataset.py
+++ b/fast_llm/tools/prepare_dataset.py
@@ -155,7 +155,7 @@ class GPTDatasetPreparatorConfig(DatasetPreparatorConfig):
         hint=FieldHint.optional,
         valid=check_field(Assert.geq, 1),
     )
-    clean_output: bool = Field(
+    remove_downloads: bool = Field(
         default=False,
         desc="Remove downloaded dataset after processing.",
         hint=FieldHint.optional,
@@ -323,7 +323,7 @@ def run(self):
             torch.distributed.destroy_process_group()
         
         # Clean up downloaded dataset
-        if self._config.clean_output and self._config.distributed.rank == 0:
+        if self._config.remove_downloads and self._config.distributed.rank == 0:
             download_path.unlink(missing_ok=True)
 
 

From b5ed2f0535fafff54e77ea1e93217cb89b358d0e Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Sun, 10 Nov 2024 10:05:14 -0500
Subject: [PATCH 56/87] add prepare-dataset command

---
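Reviewer note: `download_path` is a directory, so `Path.unlink` cannot remove it; `shutil.rmtree` deletes the directory and its contents. The sketch below demonstrates the difference in a temporary directory. The exact exception subclass raised by `unlink` depends on the platform, so only `OSError` is caught.

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    download_path = Path(tmp) / "downloaded_dataset"
    download_path.mkdir()
    (download_path / "ok").touch()

    try:
        # unlink() is meant for files and symlinks, not directories.
        download_path.unlink(missing_ok=True)
        unlink_failed = False
    except OSError:
        unlink_failed = True

    # rmtree() removes the directory together with everything inside it.
    shutil.rmtree(download_path)
    removed = not download_path.exists()
```

Note that `missing_ok=True` does not help here: it only suppresses the error for a path that does not exist.
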
 fast_llm/tools/prepare_dataset.py | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fast_llm/tools/prepare_dataset.py b/fast_llm/tools/prepare_dataset.py
index b8d18fa8a..38a79221e 100644
--- a/fast_llm/tools/prepare_dataset.py
+++ b/fast_llm/tools/prepare_dataset.py
@@ -3,6 +3,7 @@
 import json
 import os
 import pathlib
+import shutil
 import typing
 from multiprocessing import Pool
 
@@ -321,10 +322,10 @@ def run(self):
         if self._config.distributed.world_size > 1:
             torch.distributed.barrier()
             torch.distributed.destroy_process_group()
-        
+
         # Clean up downloaded dataset
         if self._config.remove_downloads and self._config.distributed.rank == 0:
-            download_path.unlink(missing_ok=True)
+            shutil.rmtree(download_path)
 
 
 dataset_preparator_registry = Registry(

From c8f746a55a871652352e6e017f2d65cbdfd38998 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Sun, 10 Nov 2024 13:40:18 -0500
Subject: [PATCH 57/87] only push latest tag for commits to main

---
 .github/workflows/ci.yaml | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/.github/workflows/ci.yaml b/.github/workflows/ci.yaml
index 8629c06be..7b3accb20 100644
--- a/.github/workflows/ci.yaml
+++ b/.github/workflows/ci.yaml
@@ -57,12 +57,9 @@ jobs:
             ghcr.io/servicenow/fast-llm
           tags: |
             type=schedule
-            type=ref,event=branch
-            type=semver,pattern={{version}}
-            type=semver,pattern={{major}}.{{minor}}
-            type=semver,pattern={{major}}
+            type=pep440,pattern={{version}}
             type=sha
-            type=raw,value=latest,enabled={{github.ref == 'refs/heads/main'}}
+            type=raw,value=latest,enable={{is_default_branch}}
 
       - name: Set up Docker Buildx
         uses: docker/setup-buildx-action@v3
@@ -78,7 +75,6 @@ jobs:
         uses: docker/build-push-action@v6
         with:
           context: .
-          # push: ${{ github.event_name != 'pull_request' }}
           push: true
           tags: ${{ steps.meta.outputs.tags }}
           labels: ${{ steps.meta.outputs.labels }}

From 0f80b763fe5f1070127d9a658809ab1d3ddd861d Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Sun, 10 Nov 2024 13:59:59 -0500
Subject: [PATCH 58/87] add V100

---
 docs/quick-start.md | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/docs/quick-start.md b/docs/quick-start.md
index d0cacbc49..d8c99a519 100644
--- a/docs/quick-start.md
+++ b/docs/quick-start.md
@@ -8,7 +8,7 @@ This guide will get you up and running with Fast-LLM on a single machine. Let's
 
 To follow this guide, you'll need:
 
--   **Hardware**: At least one NVIDIA GPU with Ampere architecture or newer. For optimal results in this tutorial, we recommend 8 A100 GPUs or better. 🤑
+-   **Hardware**: At least one NVIDIA GPU with Volta architecture or newer. For optimal results in this tutorial, we recommend 8 A100 GPUs or better. 🤑
 -   **Software**:
     -   **Docker** (if using the Docker setup), or
     -   **Local Environment**: PyTorch 2.2 or later, CUDA 12.1 or later, and APEX AMP (if building from source), or
@@ -35,7 +35,7 @@ First, choose your environment. You can use Docker, your local environment, Slur
 
 === "Local Environment"
 
-    You selected to use your local environment to run Fast-LLM. You should have a machine with at least one NVIDIA GPU with Ampere architecture or newer. We need to install Fast-LLM and its dependencies in your environment. Our Fast-LLM docker image already includes all this, and we recommend using it for simplicity and reproducibility. If you still want to install Fast-LLM in your local environment, follow the steps below.
+    You selected to use your local environment to run Fast-LLM. You should have a machine with at least one NVIDIA GPU with Volta architecture or newer. We need to install Fast-LLM and its dependencies in your environment. Our Fast-LLM docker image already includes all this, and we recommend using it for simplicity and reproducibility. If you still want to install Fast-LLM in your local environment, follow the steps below.
 
     Fast-LLM depends on [CUDA](https://developer.nvidia.com/about-cuda) 12.1 or later, [PyTorch](https://pytorch.org) 2.2 or later, [APEX](https://github.com/NVIDIA/apex?tab=readme-ov-file#installation), and [OpenAI Triton](https://github.com/triton-lang/triton). Follow the instructions on their respective websites to install them. If you use [conda](https://docs.conda.io/projects/conda/en/latest/index.html), you can create a new environment and install these dependencies in it.
     
@@ -93,7 +93,7 @@ First, choose your environment. You can use Docker, your local environment, Slur
 
 === "Slurm"
 
-    You selected Docker-enabled [Slurm](https://slurm.schedmd.com/) for this tutorial. The Slurm setup requires a Slurm cluster with at least one node and one GPU of Ampere architecture or newer. Slurm will use the `ghcr.io/servicenow/fast-llm:latest` Docker image to train our model. It will need a shared file system for input data and output results. We will assume that your home directory is shared across all nodes.
+    You selected Docker-enabled [Slurm](https://slurm.schedmd.com/) for this tutorial. The Slurm setup requires a Slurm cluster with at least one node and one GPU of Volta architecture or newer. Slurm will use the `ghcr.io/servicenow/fast-llm:latest` Docker image to train our model. It will need a shared file system for input data and output results. We will assume that your home directory is shared across all nodes.
 
     Let's create a folder to store our input data and output results in the shared home directory:
 
@@ -103,7 +103,7 @@ First, choose your environment. You can use Docker, your local environment, Slur
 
 === "Kubernetes"
 
-    You selected to use [Kubernetes](https://kubernetes.io/) with [KubeFlow](https://www.kubeflow.org/) for this tutorial. We will use a `PyTorchJob` resource to train our model with the `ghcr.io/servicenow/fast-llm:latest` Docker image and store our input data and output results in shared [persistent volume claims](https://kubernetes.io/docs/concepts/storage/persistent-volumes/) (PVCs). The Kubernetes cluster should have at least one node and one GPU of Ampere architecture or newer.
+    You selected to use [Kubernetes](https://kubernetes.io/) with [KubeFlow](https://www.kubeflow.org/) for this tutorial. We will use a `PyTorchJob` resource to train our model with the `ghcr.io/servicenow/fast-llm:latest` Docker image and store our input data and output results in shared [persistent volume claims](https://kubernetes.io/docs/concepts/storage/persistent-volumes/) (PVCs). The Kubernetes cluster should have at least one node and one GPU of Volta architecture or newer.
 
     Let's now create two PVCs named `pvc-fast-llm-inputs` and `pvc-fast-llm-results` to store our input data and output results, respectively.
     
@@ -634,7 +634,7 @@ Next, we'll create a configuration file for Fast-LLM. Save the following as `~/i
     10.  Format of the pretrained model. Since SmolLM is a Llama model, we set this to `llama`.
     11.  We'll train SmolLM-135M from scratch. You can set to `yes` to continue training from a checkpoint (if you put one in `~/inputs`).
     12.  We're not using ZeRO for this tutorial, so we set `zero_stage` to `null`. You can set this to `1`, `2`, or `3` for ZeRO-1, ZeRO-2, or ZeRO-3, respectively.
-    13.  `bf16` is supported on Ampere GPUs and higher. Fast-LLM also supports `fp16`.
+    13.  `bf16` (bfloat16, or Brain Floating Point 16) is supported on Ampere GPUs and higher. On Volta GPUs, you can use `fp16` (half-precision floating point) for training instead of `bf16`.
 
 === "Llama-3.2-1B"
 
@@ -897,17 +897,17 @@ You can expect to see the following performance metrics in Fast-LLM's output:
 
 === "SmolLM-135M"
 
-    | Performance Metric  | A100 SXM4 80 GB | H100 SXM5 80 GB |
-    |---------------------|----------------:|----------------:|
-    | Tokens/s/GPU        | 1,234,567       | 1,456,789       |
-    | TFLOPS              | 312             | 512             |
+    | Performance Metric  | V100-SXM2-32GB | A100-SXM4-80GB | H100-SXM5-80GB |
+    |---------------------|---------------:|---------------:|---------------:|
+    | Tokens/s/GPU        | 1,234,567      | 1,456,789      | 1,678,901      |
+    | TFLOPS              | 312            | 512            | 768            |
 
 === "Llama-3.2-1B"
 
-    | Performance Metric  | A100 SXM4 80 GB | H100 SXM5 80 GB |
-    |---------------------|----------------:|----------------:|
-    | Tokens/s/GPU        | 1,234,567       | 1,456,789       |
-    | TFLOPS              | 312             | 512             |
+    | Performance Metric  | V100-SXM2-32GB | A100-SXM4-80GB | H100-SXM5-80GB |
+    |---------------------|---------------:|---------------:|---------------:|
+    | Tokens/s/GPU        | 1,234,567      | 1,456,789      | 1,678,901      |
+    | TFLOPS              | 312            | 512            | 768            |
 
 If you included the W&B section in your configuration, you can also track your training progress on the Weights & Biases dashboard as well. Follow the link in the console output to view your training run.
 

From e0f813ce2e4d944ad91e4c2fdd9ab93022bb187f Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Sun, 10 Nov 2024 14:01:49 -0500
Subject: [PATCH 59/87] use older generics syntax

---
 fast_llm/utils.py | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)
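
Background for this change: the bracketed form `class Registry[_KT, _VT]` is
the type parameter syntax introduced in Python 3.12 (PEP 695), while the
`typing.TypeVar` plus `typing.Generic` spelling also works on earlier
versions. A minimal sketch of the older spelling follows. The `__getitem__`
accessor and the demo values are hypothetical, added only so the example runs
on its own; the real `Registry` methods are not shown in this hunk.

```python
import typing

# Module-level type variables replace the inline [_KT, _VT] parameters.
_KT = typing.TypeVar("_KT")
_VT = typing.TypeVar("_VT")


class Registry(typing.Generic[_KT, _VT]):
    def __init__(self, name: str, data: dict[_KT, _VT]):
        self._name = name
        # Copy so later changes to the caller's dict do not leak in.
        self._data = data.copy()

    def __getitem__(self, key: _KT) -> _VT:
        # Hypothetical accessor, for illustration only.
        return self._data[key]


source = {"a": 1}
demo: Registry[str, int] = Registry("demo", source)
source["a"] = 2  # does not affect the registry's copy
value = demo["a"]
```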

diff --git a/fast_llm/utils.py b/fast_llm/utils.py
index 111d5ca1f..285c22722 100644
--- a/fast_llm/utils.py
+++ b/fast_llm/utils.py
@@ -116,7 +116,7 @@ def in_range_incl(x, low, high):
     @staticmethod
     def none(x):
         assert x is None, f"Object of type {type(x)} is not None ({str(x)})"
-    
+
     @staticmethod
     def not_none(x):
         assert x is not None, "Object is None"
@@ -179,7 +179,11 @@ def not_custom(fn, *args, **kwargs):
         ), f"Assertion failed: not fn({', '.join(itertools.chain((str(x) for x in args),(f'{str(k)}={str(v)}' for k,v in kwargs.items())))})"
 
 
-class Registry[_KT, _VT]:
+_KT = typing.TypeVar("_KT")
+_VT = typing.TypeVar("_VT")
+
+
+class Registry(typing.Generic[_KT, _VT]):
     def __init__(self, name: str, data: dict[_KT, _VT]):
         self._name = name
         self._data = data.copy()

From b88c9d3eae97e63d9e1d67ebbf99fe18d4839ccd Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Sun, 10 Nov 2024 14:15:10 -0500
Subject: [PATCH 60/87] remove user and install Fast-LLM globally

---
 Dockerfile | 36 ++++++++++++++++++++----------------
 1 file changed, 20 insertions(+), 16 deletions(-)

diff --git a/Dockerfile b/Dockerfile
index 9c3ecf492..956cafb75 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -7,28 +7,32 @@ RUN apt-get update \
     && rm -rf /var/lib/apt/lists/* \
     && git lfs install
 
-# Add a user for Fast-LLM with sudo privileges for runtime adjustments
-ARG FAST_LLM_USER_ID=1000
-RUN useradd -m -u $FAST_LLM_USER_ID -s /bin/bash fast_llm \
-    && echo 'fast_llm ALL=(ALL) NOPASSWD: ALL' >> /etc/sudoers
+# Create a generic writable home directory for arbitrary users
+RUN mkdir -p /home/user && chmod -R a+w /home/user
 
-USER fast_llm
+# Set the working directory
 WORKDIR /app
 
-# Environment settings for Python and PATH
+# Environment settings for Python and the user
 ENV PYTHONPATH=/app:/app/Megatron-LM \
-    PATH=$PATH:/home/fast_llm/.local/bin/
+    HOME=/home/user
 
-# Copy the dependency files and install dependencies
-COPY --chown=fast_llm setup.py setup.cfg pyproject.toml ./
-COPY --chown=fast_llm ./fast_llm/csrc/ fast_llm/csrc/
+# Copy the dependency files and install dependencies globally
+COPY setup.py setup.cfg pyproject.toml ./
+COPY ./fast_llm/csrc/ fast_llm/csrc/
 RUN PIP_NO_INPUT=1 pip3 install --no-cache-dir --no-build-isolation -e ".[CORE,OPTIONAL,DEV]"
 
 # Copy the rest of the code
-COPY --chown=fast_llm ./Megatron-LM Megatron-LM
-COPY --chown=fast_llm ./examples examples
-COPY --chown=fast_llm ./tests tests
-COPY --chown=fast_llm ./tools tools
+COPY ./Megatron-LM Megatron-LM
+COPY ./examples examples
+COPY ./tests tests
+COPY ./tools tools
 
-# Copy the main source code for Fast-LLM
-COPY --exclude=./fast_llm/csrc/ --chown=fast_llm ./fast_llm/ fast_llm/
+# Copy the main source code
+COPY --exclude=./fast_llm/csrc/ ./fast_llm/ fast_llm/
+
+# Ensure the source code files are writable
+RUN chmod -R a+w /app
+
+# Ensure the user can write to the home directory
+ENTRYPOINT ["/bin/bash", "-c", "export HOME=${HOME} && exec \"$@\"", "--"]

From 4df12d964b2e6a016b526f898da82a10b2256b68 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Sun, 10 Nov 2024 19:39:11 -0500
Subject: [PATCH 61/87] simplify Dockerfile

---
 Dockerfile | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/Dockerfile b/Dockerfile
index 956cafb75..e3804d596 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -7,15 +7,11 @@ RUN apt-get update \
     && rm -rf /var/lib/apt/lists/* \
     && git lfs install
 
-# Create a generic writable home directory for arbitrary users
-RUN mkdir -p /home/user && chmod -R a+w /home/user
-
 # Set the working directory
 WORKDIR /app
 
 # Environment settings for Python and the user
-ENV PYTHONPATH=/app:/app/Megatron-LM \
-    HOME=/home/user
+ENV PYTHONPATH=/app:/app/Megatron-LM
 
 # Copy the dependency files and install dependencies globally
 COPY setup.py setup.cfg pyproject.toml ./
@@ -33,6 +29,3 @@ COPY --exclude=./fast_llm/csrc/ ./fast_llm/ fast_llm/
 
 # Ensure the source code files are writable
 RUN chmod -R a+w /app
-
-# Ensure the user can write to the home directory
-ENTRYPOINT ["/bin/bash", "-c", "export HOME=${HOME} && exec \"$@\"", "--"]

From 3c5d4d9ea9d30e5e7c4f0c95db5f93cb72303fc4 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Mon, 11 Nov 2024 15:03:58 -0500
Subject: [PATCH 62/87] wip

---
 docs/README.md      |  28 ++++----
 docs/quick-start.md | 161 +++++++++++++++++++++++++++-----------------
 2 files changed, 113 insertions(+), 76 deletions(-)
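
Sanity check for the numbers in the quick-start changes below, as a rough
sketch under stated assumptions. It recomputes tokens per batch and the
estimated total training time from throughput. The function names are
hypothetical. The Llama-3.2-1B run is assumed to use the same
`batch_size: 480` and `sequence_length: 1024` as the SmolLM2 config, and
"divisible by the number of GPUs and the `micro_batch_size`" is read here as
divisible by their product, which is stricter than divisibility by each.

```python
SECONDS_PER_DAY = 86_400


def tokens_per_batch(batch_size: int, sequence_length: int) -> int:
    """Tokens consumed by one optimizer step."""
    return batch_size * sequence_length


def batch_is_divisible(batch_size: int, micro_batch_size: int, num_gpus: int) -> bool:
    """Assumed reading of the rule: batch_size % (micro_batch_size * num_gpus) == 0."""
    return batch_size % (micro_batch_size * num_gpus) == 0


def training_days(
    train_iters: int,
    batch_size: int,
    sequence_length: int,
    tokens_per_s_per_gpu: float,
    num_gpus: int,
) -> float:
    """Total tokens divided by aggregate throughput, expressed in days."""
    total_tokens = train_iters * tokens_per_batch(batch_size, sequence_length)
    seconds = total_tokens / (tokens_per_s_per_gpu * num_gpus)
    return seconds / SECONDS_PER_DAY


batch_tokens = tokens_per_batch(480, 1024)   # "about 500,000 tokens per batch"
smollm2_ok = batch_is_divisible(480, 20, 8)  # micro_batch_size from the SmolLM2 config
v100_ok = batch_is_divisible(480, 12, 8)     # micro-batch size from the V100 footnote
# Llama-3.2-1B on 8x V100: 100_000 iterations at 5680 tokens/s/GPU.
llama_days = training_days(100_000, 480, 1024, 5680, 8)
```

Under these assumptions the Llama-3.2-1B estimate comes out at roughly 12.5
days, which suggests the unit of the "total training time" entry for that
model is days.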

diff --git a/docs/README.md b/docs/README.md
index e4f8e2aaf..704c0d316 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -10,27 +10,27 @@ To view the complete, rendered documentation, please visit the [Fast-LLM Documen
 
 To build and preview the documentation locally, follow these simple steps:
 
-1.   **Install the necessary dependencies:**
+1.  **Install the necessary dependencies:**
 
-   ```bash
-   pip install -e ".[DOCS]"
-   ```
+    ```bash
+    pip install -e ".[DOCS]"
+    ```
 
-2.   **Build the documentation:**
+2.  **Build the documentation:**
 
-   ```bash
-   mkdocs build
-   ```
+    ```bash
+    mkdocs build
+    ```
 
-   This will generate the static documentation files in a `site/` folder.
+    This will generate the static documentation files in a `site/` folder.
 
-3.   **Serve the documentation locally (with auto-reload):**
+3.  **Serve the documentation locally (with auto-reload):**
 
-   ```bash
-   mkdocs serve
-   ```
+    ```bash
+    mkdocs serve
+    ```
 
-   The documentation site will be served locally at [http://127.0.0.1:8000](http://127.0.0.1:8000), and any changes made to the source files will automatically trigger a rebuild.
+    The documentation site will be served locally at [http://127.0.0.1:8000](http://127.0.0.1:8000), and any changes made to the source files will automatically trigger a rebuild.
 
 ## Contributing to the Documentation
 
diff --git a/docs/quick-start.md b/docs/quick-start.md
index d8c99a519..744286532 100644
--- a/docs/quick-start.md
+++ b/docs/quick-start.md
@@ -152,49 +152,43 @@ First, choose your environment. You can use Docker, your local environment, Slur
 
 Fast-LLM supports many GPT variants, including (but not limited to) Llama, Mistral, and Mixtral. For this tutorial, let's train a Llama model with data parallelism. You can choose from two models:
 
-=== "SmolLM-135M"
+=== "SmolLM2-135M"
 
-    SmolLM is a smaller, more manageable model with 135M parameters. It's perfect for testing and getting familiar with Fast-LLM. We'll grab its configuration file from Huggingface Hub and save it to our inputs folder:
+    SmolLM2 is a smaller, more manageable model with 135M parameters. It is similar to GPT-2 but with a few improvements, and it is a perfect choice for testing and getting familiar with Fast-LLM. We'll grab the model from the Hugging Face Hub and save it to our inputs folder.
 
     === "Docker"
 
         ```bash
-        curl -O https://huggingface.co/HuggingFaceTB/SmolLM-135M/resolve/main/config.json
-        mv config.json ~/inputs
+        git lfs install
+        git clone https://huggingface.co/HuggingFaceTB/SmolLM2-135M ~/inputs/SmolLM2-135M
         ```
 
     === "Local Environment"
 
         ```bash
-        curl -O https://huggingface.co/HuggingFaceTB/SmolLM-135M/resolve/main/config.json
-        mv config.json /mnt/inputs
+        git lfs install
+        git clone https://huggingface.co/HuggingFaceTB/SmolLM2-135M /mnt/inputs/SmolLM2-135M
         ```
 
     === "Slurm"
 
         ```bash
-        curl -O https://huggingface.co/HuggingFaceTB/SmolLM-135M/resolve/main/config.json
-        mv config.json ~/inputs
+        git lfs install
+        git clone https://huggingface.co/HuggingFaceTB/SmolLM2-135M ~/inputs/SmolLM2-135M
         ```
 
     === "Kubernetes"
 
-        First, download the configuration file to your local machine:
-
-        ```bash
-        curl -O https://huggingface.co/HuggingFaceTB/SmolLM-135M/resolve/main/config.json
-        ```
-
-        Then, create a temporary pod that mounts the inputs PVC, allowing you to copy files to it. Here's a basic YAML configuration for such a pod:
+        We need to create a temporary pod that mounts the inputs PVC and allows us to download the model. Here's a basic YAML configuration for such a pod:
 
         ```yaml
         apiVersion: v1
         kind: Pod
         metadata:
-          name: file-transfer
+          name: clone-model
         spec:
           containers:
-            - name: file-transfer-container
+            - name: clone-model-container
               image: ubuntu
               command: ["sleep", "infinity"]
               volumeMounts:
@@ -206,28 +200,33 @@ Fast-LLM supports many GPT variants, including (but not limited to) Llama, Mistr
                 claimName: pvc-fast-llm-inputs
         ```
 
-        Save this configuration to a file named `file-transfer-pod.yaml` and apply it to your Kubernetes cluster:
+        Save this configuration to a file named `clone-model-pod.yaml`. Next, apply this configuration to your Kubernetes cluster:
 
         ```bash
-        kubectl apply -f file-transfer-pod.yaml
+        kubectl apply -f clone-model-pod.yaml
         ```
 
-        Copy the configuration file to the pod:
+        Now, enter the pod and clone the model:
 
         ```bash
-        kubectl cp config.json file-transfer:/mnt/inputs
+        kubectl exec -it clone-model -- /bin/bash
+        git lfs install
+        git clone https://huggingface.co/HuggingFaceTB/SmolLM2-135M /mnt/inputs/SmolLM2-135M
         ```
 
-        Finally, clean up the temporary pod and configuration file:
+        Finally, clean up the temporary pod; it's no longer needed:
 
         ```bash
-        kubectl delete pod file-transfer
-        rm config.json
+        kubectl delete pod clone-model
         ```
 
 === "Llama-3.2-1B"
 
-    Llama is a larger model with 1B parameters. It's more powerful but requires more resources to train. We'll grab the model from the Huggingface Hub and save it to our inputs folder:
+    Llama is a larger model with 1B parameters. It's more powerful but requires more resources to train. We'll grab the model from the Hugging Face Hub and save it to our inputs folder.
+    
+    !!! note "Access Required"
+    
+        Meta gates access to the Llama model. Request access at https://huggingface.co/meta-llama/Llama-3.2-1B before you try to download it.
 
     === "Docker"
 
@@ -242,7 +241,7 @@ Fast-LLM supports many GPT variants, including (but not limited to) Llama, Mistr
 
         ```bash
         git lfs install
-        git clone https://huggingface.co/meta-llama/Llama-3.2-1B ~/inputs
+        git clone https://huggingface.co/meta-llama/Llama-3.2-1B ~/inputs/Llama-3.2-1B
         ```
     
     === "Local Environment"
@@ -258,7 +257,7 @@ Fast-LLM supports many GPT variants, including (but not limited to) Llama, Mistr
 
         ```bash
         git lfs install
-        git clone https://huggingface.co/meta-llama/Llama-3.2-1B /mnt/inputs
+        git clone https://huggingface.co/meta-llama/Llama-3.2-1B /mnt/inputs/Llama-3.2-1B
         ```
     
     === "Slurm"
@@ -274,13 +273,13 @@ Fast-LLM supports many GPT variants, including (but not limited to) Llama, Mistr
 
         ```bash
         git lfs install
-        git clone https://huggingface.co/meta-llama/Llama-3.2-1B ~/inputs
+        git clone https://huggingface.co/meta-llama/Llama-3.2-1B ~/inputs/Llama-3.2-1B
         ```
     
     === "Kubernetes"
     
         We need to create a temporary pod that mounts the inputs PVC and allows us to download the model. Here's a basic YAML configuration for such a pod:
-    
+
         ```yaml
         apiVersion: v1
         kind: Pod
@@ -313,7 +312,7 @@ Fast-LLM supports many GPT variants, including (but not limited to) Llama, Mistr
         pip install huggingface_hub
         huggingface-cli login
         git lfs install
-        git clone https://huggingface.co/meta-llama/Llama-3.2-1B /mnt/inputs
+        git clone https://huggingface.co/meta-llama/Llama-3.2-1B /mnt/inputs/Llama-3.2-1B
         ```
 
         Finally, clean up the temporary pod, it's no longer needed:
@@ -324,13 +323,13 @@ Fast-LLM supports many GPT variants, including (but not limited to) Llama, Mistr
 
 !!! tip "Model Size Matters"
 
-    Smaller models like SmolLM-135M will train relatively quickly, especially if you've only got a few GPUs. But if you're feeling adventurous (and patient), give the larger Llama-3.2-1B a shot!
+    Smaller models like SmolLM2-135M will train relatively quickly, especially if you've only got a few GPUs. But if you're feeling adventurous (and patient), give the larger Llama-3.2-1B a shot!
 
 ## Step 3: Prepare the Training Data 📚
 
 For this tutorial, we'll use 9B tokens of text from the [OpenWebText](https://skylion007.github.io/OpenWebTextCorpus/) dataset. This dataset is a free approximation of the WebText data OpenAI used for GPT-2, and it's perfect for our test run!
 
-=== "SmolLM-135M"
+=== "SmolLM2-135M"
 
     === "Docker"
 
@@ -340,7 +339,7 @@ For this tutorial, we'll use 9B tokens of text from the [OpenWebText](https://sk
         docker run -it --rm ghcr.io/servicenow/fast-llm:latest \
             -v ~/inputs:/mnt/inputs \
             python tools/prepare_dataset.py \
-            tokenizer_path_or_name="HuggingFaceTB/SmolLM-135M" \
+            tokenizer_path_or_name="HuggingFaceTB/SmolLM2-135M" \
             dataset_name_or_path="openwebtext" \
             dataset_split="train" \
             output_dir="/mnt/inputs" \
@@ -357,7 +356,7 @@ For this tutorial, we'll use 9B tokens of text from the [OpenWebText](https://sk
         ```bash
         curl -O https://raw.githubusercontent.com/ServiceNow/Fast-LLM/main/tools/prepare_dataset.py
         python prepare_dataset.py \
-            tokenizer_path_or_name="HuggingFaceTB/SmolLM-135M" \
+            tokenizer_path_or_name="HuggingFaceTB/SmolLM2-135M" \
             dataset_name_or_path="openwebtext" \
             dataset_split="train" \
             output_dir="/mnt/inputs" \
@@ -386,7 +385,7 @@ For this tutorial, we'll use 9B tokens of text from the [OpenWebText](https://sk
             --ntasks-per-node=$SLURM_NTASKS_PER_NODE \
             bash -c "
                 python tools/prepare_dataset.py \
-                    tokenizer_path_or_name='HuggingFaceTB/SmolLM-135M' \
+                    tokenizer_path_or_name='HuggingFaceTB/SmolLM2-135M' \
                     dataset_name_or_path='openwebtext' \
                     dataset_split='train' \
                     output_dir='/mnt/inputs' \
@@ -416,7 +415,7 @@ For this tutorial, we'll use 9B tokens of text from the [OpenWebText](https://sk
                   image: ghcr.io/servicenow/fast-llm:latest
                   command: ["python", "tools/prepare_dataset.py"]
                   args:
-                    - tokenizer_path_or_name=HuggingFaceTB/SmolLM-135M
+                    - tokenizer_path_or_name=HuggingFaceTB/SmolLM2-135M
                     - dataset_name_or_path=openwebtext
                     - dataset_split=train
                     - output_dir=/mnt/inputs
@@ -570,7 +569,7 @@ For this tutorial, we'll use 9B tokens of text from the [OpenWebText](https://sk
 
 Next, we'll create a configuration file for Fast-LLM. Save the following as `~/inputs/fast-llm-config.yaml`:
 
-=== "SmolLM-135M"
+=== "SmolLM2-135M"
 
     ```yaml
     training:
@@ -589,15 +588,15 @@ Next, we'll create a configuration file for Fast-LLM. Save the following as `~/i
         interval: 20_000
       wandb:  # (3)!
         project_name: fast-llm-quickstart
-        group_name: smollm-135m
+        group_name: SmolLM2-135M
         entity_name: servicenow
     batch:
-      micro_batch_size: 1  # (4)!
+      micro_batch_size: 20  # (4)!
       sequence_length: 1024
       batch_size: 480  # (5)!
     data:
       format: file
-      path: /mnt/inputs/fast_llm_dataset.json  # (6)!
+      path: /mnt/inputs/openwebtext/fast_llm_dataset.json  # (6)!
       split: [99, 1, 0]  # (7)!
     optimizer: # (8)!
       weight_decay: 0.1
@@ -611,13 +610,16 @@ Next, we'll create a configuration file for Fast-LLM. Save the following as `~/i
         warmup_iterations: 2000
     pretrained:
       format: llama  # (10)!
-      path: /mnt/inputs
+      path: /mnt/inputs/SmolLM2-135M
       model_weights: no  # (11)!
     model:
+      base_model:
+        transformer:
+          use_flash_attention: yes  # (12)!
       multi_stage:
-        zero_stage: null  # (12)!
+        zero_stage: null  # (13)!
       distributed:
-        training_dtype: bf16  # (13)!
+        training_dtype: bf16  # (14)!
     run:
       experiment_dir: /mnt/results
     ```
@@ -625,22 +627,23 @@ Next, we'll create a configuration file for Fast-LLM. Save the following as `~/i
     1.  Total number of training tokens will be approximately 300B.
     2.  A Llama model will be saved in Hugging Face format to `~/results` directory every 20,000 iterations.
     3.  Entirely optional, but it's a good idea to track your training progress with Weights & Biases. Replace `servicenow` with your own W&B entity name. If you don't want to use W&B, just remove this section.
-    4.  Adjust the number of sequences per GPU based on GPU memory. For SmolLM-135M and an A100-80GB, a `micro_batch_size` of 1 should work well.
+    4.  Adjust the number of sequences per GPU based on GPU memory. This example sets `micro_batch_size` to 20 for SmolLM2-135M; lower it if you run out of GPU memory.
     5.  Must be divisible by the number of GPUs and the `micro_batch_size`. At 1024 tokens per sequence, 480 corresponds to about 500,000 tokens per batch.
     6.  Location of the dataset metadata file generated in Step 4.
     7.  99% train, 1% validation, 0% test. These settings need to be adjusted based on the size of your dataset. If you're using a smaller dataset, you need to increase the validation split.
     8.  These are good default optimizer settings for training models.
     9.  We are using a cosine decay schedule with linear warmup. After reaching the peak learning rate `base` at `warmup_iterations`, the learning rate will decay to `minimum` at `decay_iterations`, following a cosine curve. The minimum learning rate should be 1/10th of the base learning rate per Chinchilla.
     10.  Format of the pretrained model. Since SmolLM is a Llama model, we set this to `llama`.
-    11.  We'll train SmolLM-135M from scratch. You can set to `yes` to continue training from a checkpoint (if you put one in `~/inputs`).
-    12.  We're not using ZeRO for this tutorial, so we set `zero_stage` to `null`. You can set this to `1`, `2`, or `3` for ZeRO-1, ZeRO-2, or ZeRO-3, respectively.
-    13.  `bf16` (bfloat16, or Brain Floating Point 16) is supported on Ampere GPUs and higher. On Volta GPUs, you can use `fp16` (half-precision floating point) for training instead of `bf16`.
+    11.  We'll train SmolLM2-135M from scratch. You can set to `yes` to continue training from a checkpoint (if you put one in `~/inputs`).
+    12.  If you're using Ampere GPUs or higher, you can enable FlashAttention for faster training. Otherwise, set this to `no`. The default is `yes`.
+    13.  We're not using ZeRO for this tutorial, so we set `zero_stage` to `null`. You can set this to `1`, `2`, or `3` for ZeRO-1, ZeRO-2, or ZeRO-3, respectively.
+    14.  `bf16` (bfloat16, or Brain Floating Point 16) is supported on Ampere GPUs and higher. On Volta GPUs, you can use `fp16` (half-precision floating point) for training instead of `bf16`.
 
 === "Llama-3.2-1B"
 
     ```yaml
     training:
-      train_iters: 600_000  # (1)!
+      train_iters: 100_000  # (1)!
       logs:
         interval: 10
       validation:
@@ -680,10 +683,13 @@ Next, we'll create a configuration file for Fast-LLM. Save the following as `~/i
       path: /mnt/inputs
       model_weights: yes  # (11)!
     model:
+      base_model:
+        transformer:
+          use_flash_attention: yes  # (12)!
       multi_stage:
-        zero_stage: null  # (12)!
+        zero_stage: null  # (13)!
       distributed:
-        training_dtype: bf16  # (13)!
+        training_dtype: bf16  # (14)!
     run:
       experiment_dir: /mnt/results
     ```
@@ -699,8 +705,9 @@ Next, we'll create a configuration file for Fast-LLM. Save the following as `~/i
     9.  We are using a cosine decay schedule with linear warmup. After reaching the peak learning rate `base` at `warmup_iterations`, the learning rate will decay to `minimum` at `decay_iterations`, following a cosine curve. The minimum learning rate should be 1/10th of the base learning rate per Chinchilla.
     10.  Format of the pretrained model. Since it's a Llama model, we set this to `llama`.
     11.  We want to continue training Llama-3.2-1B from a checkpoint. If you're training from scratch, set this to `no`.
-    12.  We're not using ZeRO for this tutorial, so we set `zero_stage` to `null`. You can set this to `1`, `2`, or `3` for ZeRO-1, ZeRO-2, or ZeRO-3, respectively.
-    13.  `bf16` is supported on Ampere GPUs and higher. Fast-LLM also supports `fp16`.
+    12.  If you're using Ampere GPUs or higher, you can enable FlashAttention for faster training. Otherwise, set this to `no`. The default is `yes`.
+    13.  We're not using ZeRO for this tutorial, so we set `zero_stage` to `null`. You can set this to `1`, `2`, or `3` for ZeRO-1, ZeRO-2, or ZeRO-3, respectively.
+    14.  `bf16` (bfloat16, or Brain Floating Point 16) is supported on Ampere GPUs and higher. On Volta GPUs, you can use `fp16` (half-precision floating point) for training instead of `bf16`.
 
 ## (Optional) Step 6: Add Your Weights & Biases API Key 🔑
 
@@ -895,19 +902,49 @@ Alright, the big moment! Let's launch the training run.
 
 You can expect to see the following performance metrics in Fast-LLM's output:
 
-=== "SmolLM-135M"
-
-    | Performance Metric  | V100-SXM2-32GB | A100-SXM4-80GB | H100-SXM5-80GB |
-    |---------------------|---------------:|---------------:|---------------:|
-    | Tokens/s/GPU        | 1,234,567      | 1,456,789      | 1,678,901      |
-    | TFLOPS              | 312            | 512            | 768            |
+=== "SmolLM2-135M"
+
+    | Performance Metric  | 8x V100-SXM2-32GB[^SmolLM2-V100] | 8x A100-SXM4-80GB[^SmolLM2-A100] | 8x H100-SXM5-80GB[^SmolLM2-H100] |
+    |---------------------|---------------------------------:|---------------------------------:|---------------------------------:|
+    | tokens/s/GPU        | 18300                            |                                  |                                  |
+    | tflop/s (model)     | 16.7                             |                                  |                                  |
+    | tflop/s (hardware)  | 17.0                             |                                  |                                  |
+    | total training time | 23.3 days                        |                                  |                                  |
+
+    [^SmolLM2-V100]:
+        `bf16` is not supported on V100 GPUs. Precision was set to `fp16`.
+        FlashAttention is not supported on V100 GPUs, so it was disabled.
+        Micro-batch size was set to 12.
+    [^SmolLM2-A100]:
+        Precision was set to `bf16`.
+        FlashAttention was enabled.
+        Micro-batch size was set to ???.
+    [^SmolLM2-H100]:
+        Precision was set to `bf16`.
+        FlashAttention was enabled.
+        Micro-batch size was set to ???.
 
 === "Llama-3.2-1B"
 
-    | Performance Metric  | V100-SXM2-32GB | A100-SXM4-80GB | H100-SXM5-80GB |
-    |---------------------|---------------:|---------------:|---------------:|
-    | Tokens/s/GPU        | 1,234,567      | 1,456,789      | 1,678,901      |
-    | TFLOPS              | 312            | 512            | 768            |
+    | Performance Metric  | 8x V100-SXM2-32GB[^Llama-V100] | 8x A100-SXM4-80GB[^Llama-A100] | 8x H100-SXM5-80GB[^Llama-H100] |
+    |---------------------|-------------------------------:|-------------------------------:|-------------------------------:|
+    | tokens/s/GPU        | 5680                           |                                |                                |
+    | tflop/s (model)     | 43.3                           |                                |                                |
+    | tflop/s (hardware)  | 43.4                           |                                |                                |
+    | total training time | 12.5                           |                                |                                |
+
+    [^Llama-V100]:
+        `bf16` is not supported on V100 GPUs. Precision was set to `fp16`.
+        FlashAttention is not supported on V100 GPUs, so it was disabled.
+        Micro-batch size was set to 4.
+    [^Llama-A100]:
+        Precision was set to `bf16`.
+        FlashAttention was enabled.
+        Micro-batch size was set to ???.
+    [^Llama-H100]:
+        Precision was set to `bf16`.
+        FlashAttention was enabled.
+        Micro-batch size was set to ???.
 
 If you included the W&B section in your configuration, you can also track your training progress on the Weights & Biases dashboard as well. Follow the link in the console output to view your training run.
 

From 3737bc0d7fc99954842bf2cb2628fc325d740053 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Mon, 11 Nov 2024 18:14:41 -0500
Subject: [PATCH 63/87] improvements

---
 .dockerignore |  7 +++++++
 Dockerfile    | 57 ++++++++++++++++++++++++++++++++++++++-------------
 2 files changed, 50 insertions(+), 14 deletions(-)

diff --git a/.dockerignore b/.dockerignore
index 2022ee390..0ed5480a2 100644
--- a/.dockerignore
+++ b/.dockerignore
@@ -1,4 +1,7 @@
+# Ignore everything by default
 *
+
+# Allow specific files and directories
 !setup.py
 !setup.cfg
 !Megatron-LM
@@ -7,3 +10,7 @@
 !tools
 !tests
 !pyproject.toml
+
+# Exclude Python cache directories and shared object files within included directories
+**/__pycache__/
+**/*.so
diff --git a/Dockerfile b/Dockerfile
index e3804d596..41fbd2c94 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -1,31 +1,60 @@
 # syntax=docker/dockerfile:1.7-labs
-FROM nvcr.io/nvidia/pytorch:24.07-py3
+FROM nvcr.io/nvidia/pytorch:24.07-py3 as base
 
-# Install git-lfs for Huggingface hub interaction and sudo for system adjustments
+# Install dependencies
 RUN apt-get update \
-    && apt-get install --no-install-recommends -y git-lfs sudo util-linux \
+    && apt-get install --no-install-recommends -y acl python3.10-venv git-lfs \
     && rm -rf /var/lib/apt/lists/* \
     && git lfs install
 
 # Set the working directory
 WORKDIR /app
 
-# Environment settings for Python and the user
-ENV PYTHONPATH=/app:/app/Megatron-LM
+# Set the setgid bit and default ACL for /app
+RUN chmod g+s /app && \
+    setfacl -d -m u::rwx,g::rwx,o::rwx /app && \
+    setfacl -d -m u::rw-,g::rw-,o::rw- /app
 
-# Copy the dependency files and install dependencies globally
-COPY setup.py setup.cfg pyproject.toml ./
-COPY ./fast_llm/csrc/ fast_llm/csrc/
-RUN PIP_NO_INPUT=1 pip3 install --no-cache-dir --no-build-isolation -e ".[CORE,OPTIONAL,DEV]"
+# Environment settings for the virtual environment
+ENV VIRTUAL_ENV=/app/venv
+ENV PATH="$VIRTUAL_ENV/bin:$PATH"
 
-# Copy the rest of the code
+# Create the virtual environment with system site packages
+RUN python3 -m venv $VIRTUAL_ENV --system-site-packages
+
+# Copy dependency files with universal write permissions for all users
+COPY --chmod=666 setup.py setup.cfg pyproject.toml ./
+COPY --chmod=666 ./fast_llm/csrc/ fast_llm/csrc/
+
+# Install dependencies within the virtual environment
+RUN pip install --no-cache-dir --no-build-isolation -e ".[CORE,OPTIONAL,DEV]"
+
+# Use intermediate build stage to copy the remaining source code
+FROM alpine as copy_source
+
+# Set the working directory
+WORKDIR /app
+
+# Copy remaining source code with universal write permissions
 COPY ./Megatron-LM Megatron-LM
 COPY ./examples examples
 COPY ./tests tests
 COPY ./tools tools
-
-# Copy the main source code
 COPY --exclude=./fast_llm/csrc/ ./fast_llm/ fast_llm/
 
-# Ensure the source code files are writable
-RUN chmod -R a+w /app
+RUN find Megatron-LM -type f -exec chmod 666 {} \; && \
+    find examples -type f -exec chmod 666 {} \; && \
+    find tests -type f -exec chmod 666 {} \; && \
+    find tools -type f -exec chmod 666 {} \; && \
+    find fast_llm -type f -exec chmod 666 {} \; && \
+    find . -type d -exec chmod 777 {} \;
+
+# Create a tar archive of /app with permissions preserved
+RUN tar -cf /app.tar -C /app .
+
+# Continue with the base stage
+FROM base
+
+# Copy the remaining source code from the intermediate build stage
+COPY --from=copy_source /app.tar /
+RUN tar -xf /app.tar -C /app && rm /app.tar

From 4b6b195548235d38ee85d866c0893083e8df94e8 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Mon, 11 Nov 2024 18:26:38 -0500
Subject: [PATCH 64/87] add docstring

---
 fast_llm/data/config.py | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fast_llm/data/config.py b/fast_llm/data/config.py
index 5e829150b..32d48fab5 100644
--- a/fast_llm/data/config.py
+++ b/fast_llm/data/config.py
@@ -106,6 +106,7 @@ def _validate(self):
 class TokenizerConfig(Config):
     """
     Configuration for the tokenizer.
+    The tokenizer is needed for FIM and dataset preparation.
     """
 
     format: str = Field(

From 52a6f0be9719e23f5a589a9767c1b8dd33f49659 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Mon, 11 Nov 2024 18:28:37 -0500
Subject: [PATCH 65/87] use full imports

---
 tests/test_memmap_dataset.py | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/tests/test_memmap_dataset.py b/tests/test_memmap_dataset.py
index 49ca44d13..6d27faf34 100644
--- a/tests/test_memmap_dataset.py
+++ b/tests/test_memmap_dataset.py
@@ -1,21 +1,24 @@
-from hypothesis import given, strategies as st
-from hypothesis.extra import numpy as npst
+import hypothesis
+import hypothesis.strategies
+import hypothesis.extra.numpy
 import numpy as np
 from tempfile import TemporaryDirectory
 from pathlib import Path
 from fast_llm.data.gpt.memmap import GPTMemmapDataset
 import pytest
 
-def dtype_arrays(dtype: np.dtype, min_size: int=1, max_size: int=100) -> st.SearchStrategy:
-    return st.lists(
-        npst.arrays(dtype=dtype, shape=st.integers(1, 1000)),
+
+def dtype_arrays(dtype: np.dtype, min_size: int = 1, max_size: int = 100) -> hypothesis.strategies.SearchStrategy:
+    return hypothesis.strategies.lists(
+        hypothesis.extra.numpy.arrays(dtype=dtype, shape=hypothesis.strategies.integers(1, 1000)),
         min_size=min_size,
         max_size=max_size,
     )
 
+
 @pytest.mark.parametrize("dtype", GPTMemmapDataset._DTYPES.values())
 def test_gpt_memmap_dataset(dtype):
-    @given(documents=dtype_arrays(dtype))
+    @hypothesis.given(documents=dtype_arrays(dtype))
     def inner_test(documents):
         with TemporaryDirectory() as temp_dir:
             prefix = Path(temp_dir)
@@ -23,5 +26,5 @@ def inner_test(documents):
             dataset = GPTMemmapDataset(name="foo", prefix=prefix)
             for i, document in enumerate(documents):
                 assert np.array_equal(dataset.get(i), document, equal_nan=True), f"Mismatch at index {i}"
-    
+
     inner_test()

From 55b0b88ea466946bb4f13d8c8823cc442fd9cf81 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Mon, 11 Nov 2024 18:29:17 -0500
Subject: [PATCH 66/87] use full imports

---
 tests/test_memmap_dataset.py | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/tests/test_memmap_dataset.py b/tests/test_memmap_dataset.py
index 6d27faf34..8cef40a87 100644
--- a/tests/test_memmap_dataset.py
+++ b/tests/test_memmap_dataset.py
@@ -1,12 +1,14 @@
+import pathlib
+from tempfile import TemporaryDirectory
+
 import hypothesis
-import hypothesis.strategies
 import hypothesis.extra.numpy
+import hypothesis.strategies
 import numpy as np
-from tempfile import TemporaryDirectory
-from pathlib import Path
-from fast_llm.data.gpt.memmap import GPTMemmapDataset
 import pytest
 
+from fast_llm.data.gpt.memmap import GPTMemmapDataset
+
 
 def dtype_arrays(dtype: np.dtype, min_size: int = 1, max_size: int = 100) -> hypothesis.strategies.SearchStrategy:
     return hypothesis.strategies.lists(
@@ -21,7 +23,7 @@ def test_gpt_memmap_dataset(dtype):
     @hypothesis.given(documents=dtype_arrays(dtype))
     def inner_test(documents):
         with TemporaryDirectory() as temp_dir:
-            prefix = Path(temp_dir)
+            prefix = pathlib.Path(temp_dir)
             GPTMemmapDataset.write_dataset(prefix=prefix, documents=documents)
             dataset = GPTMemmapDataset(name="foo", prefix=prefix)
             for i, document in enumerate(documents):

From 1f975d227ad586e577557a42f366ca50310efa32 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Mon, 11 Nov 2024 18:30:51 -0500
Subject: [PATCH 67/87] use full imports

---
 fast_llm/tools/prepare_dataset.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fast_llm/tools/prepare_dataset.py b/fast_llm/tools/prepare_dataset.py
index 38a79221e..16a7d9304 100644
--- a/fast_llm/tools/prepare_dataset.py
+++ b/fast_llm/tools/prepare_dataset.py
@@ -5,7 +5,7 @@
 import pathlib
 import shutil
 import typing
-from multiprocessing import Pool
+import multiprocessing
 
 import numpy as np
 import torch.distributed
@@ -298,7 +298,7 @@ def run(self):
         ]
 
         # Use multiprocessing to save each shard in parallel on all ranks
-        with Pool(processes=self._config.saving_workers) as pool:
+        with multiprocessing.Pool(processes=self._config.saving_workers) as pool:
             dataset_dicts = pool.map(self._save_shard, shards)
 
         # Gather dataset_dicts from all ranks to rank 0

From b665e914f8bff9be4b20ba93081450cc44c7d4e5 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Mon, 11 Nov 2024 18:37:44 -0500
Subject: [PATCH 68/87] don't load tokenizer during validation

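Data type inference moves from `_validate()` to `run()`, so validating a
config no longer loads the tokenizer. Below is a minimal sketch of the
selection rule, for illustration only. `infer_token_dtype` is a hypothetical
helper, not part of Fast-LLM; it uses plain integer bounds in place of
`np.iinfo` and returns dtype names as strings so that it runs without NumPy.

```python
INT16_MAX = 2**15 - 1  # same value as np.iinfo(np.int16).max
INT32_MAX = 2**31 - 1  # same value as np.iinfo(np.int32).max


def infer_token_dtype(vocab_size: int) -> str:
    """Return the name of the smallest signed dtype that fits the vocabulary."""
    if vocab_size <= INT16_MAX:
        return "int16"
    # uint16 is skipped here as well, mirroring the commented-out branch in the patch.
    if vocab_size <= INT32_MAX:
        return "int32"
    raise ValueError(f"Tokenizer vocabulary size {vocab_size} is too large. This is likely an error.")
```

An explicitly configured `data_type` bypasses this rule and is only checked
against the dtypes that `GPTMemmapDataset` supports.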
---
 fast_llm/tools/prepare_dataset.py | 39 ++++++++++++++++++-------------
 fast_llm/utils.py                 |  4 ----
 2 files changed, 23 insertions(+), 20 deletions(-)

diff --git a/fast_llm/tools/prepare_dataset.py b/fast_llm/tools/prepare_dataset.py
index 16a7d9304..cc5dc9122 100644
--- a/fast_llm/tools/prepare_dataset.py
+++ b/fast_llm/tools/prepare_dataset.py
@@ -110,10 +110,10 @@ class GPTDatasetConfig(Config):
         desc="Field of the dataset to use.",
         hint=FieldHint.optional,
     )
-    data_type: DataType = Field(
+    data_type: DataType | None = Field(
         default=None,
-        desc="Data type of the dataset field.",
-        hint=FieldHint.derived,
+        desc="Data type of the dataset field. If not provided, it will be inferred based on the tokenizer vocabulary size.",
+        hint=FieldHint.optional,
     )
     trust_remote_code: bool = Field(
         default=False,
@@ -178,19 +178,9 @@ class GPTDatasetPreparatorConfig(DatasetPreparatorConfig):
     )
 
     def _validate(self):
-        Assert.not_none(self.tokenizer.path)
-        self._tokenizer = Tokenizer(config=self.tokenizer)
-        if self.dataset.data_type is None:
-            # Decide the datatype based on the tokenizer vocabulary size
-            vocab_size = self._tokenizer.vocab_size
-            if vocab_size <= np.iinfo(np.int16).max:
-                self.dataset.data_type = DataType.int16
-            # elif vocab_size <= np.iinfo(np.uint16).max:
-            #     self.dataset.data_type = DataType.uint16  # Not supported by Fast-LLM's DataType
-            elif vocab_size <= np.iinfo(np.int32).max:
-                self.dataset.data_type = DataType.int32
-            else:
-                raise ValueError(f"Tokenizer vocabulary size {vocab_size} is too large. This is likely an error.")
+        assert self.tokenizer.path is not None
+        if self.dataset.data_type is not None:
+            Assert.incl(self.dataset.data_type.numpy, GPTMemmapDataset._DTYPES.values())
         super()._validate()
 
     @classmethod
@@ -240,9 +230,26 @@ def run(self):
         # Set transformers logging verbosity
         transformers.logging.set_verbosity_error()
 
+        # Disable disk space check if requested
         if self._config.dataset.disable_disk_space_check:
             datasets.builder.has_sufficient_disk_space = lambda needed_bytes, directory=".": True
 
+        # Load tokenizer
+        self._tokenizer = Tokenizer(config=self._config.tokenizer)
+
+        # Set data type if not provided
+        if self._config.dataset.data_type is None:
+            # Decide the datatype based on the tokenizer vocabulary size
+            vocab_size = self._tokenizer.vocab_size
+            if vocab_size <= np.iinfo(np.int16).max:
+                self._config.dataset.data_type = DataType.int16
+            # elif vocab_size <= np.iinfo(np.uint16).max:
+            #     self._config.dataset.data_type = DataType.uint16  # Not supported by Fast-LLM's DataType
+            elif vocab_size <= np.iinfo(np.int32).max:
+                self._config.dataset.data_type = DataType.int32
+            else:
+                raise ValueError(f"Tokenizer vocabulary size {vocab_size} is too large. This is likely an error.")
+
         # Initialize distributed processing
         if self._config.distributed.world_size > 1:
             torch.distributed.init_process_group(
diff --git a/fast_llm/utils.py b/fast_llm/utils.py
index 285c22722..66539efb3 100644
--- a/fast_llm/utils.py
+++ b/fast_llm/utils.py
@@ -117,10 +117,6 @@ def in_range_incl(x, low, high):
     def none(x):
         assert x is None, f"Object of type {type(x)} is not None ({str(x)})"
 
-    @staticmethod
-    def not_none(x):
-        assert x is not None, "Object is None"
-
     @staticmethod
     def empty(x):
         assert len(x) == 0, f"Not empty (len={len(x)}), {x}"

From e51677f54992abf5f861c7acd17873bacb60472d Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Mon, 11 Nov 2024 19:10:17 -0500
Subject: [PATCH 69/87] simplify

---
 Dockerfile | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/Dockerfile b/Dockerfile
index 41fbd2c94..99611493a 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -42,12 +42,8 @@ COPY ./tests tests
 COPY ./tools tools
 COPY --exclude=./fast_llm/csrc/ ./fast_llm/ fast_llm/
 
-RUN find Megatron-LM -type f -exec chmod 666 {} \; && \
-    find examples -type f -exec chmod 666 {} \; && \
-    find tests -type f -exec chmod 666 {} \; && \
-    find tools -type f -exec chmod 666 {} \; && \
-    find fast_llm -type f -exec chmod 666 {} \; && \
-    find . -type d -exec chmod 777 {} \;
+# Set permissions for all users to write to /app
+RUN chmod -R a+w /app
 
 # Create a tar archive of /app with permissions preserved
 RUN tar -cf /app.tar -C /app .

From 1f447bbb1ea81525c4bd76852e80bd3d8ba14e76 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Tue, 12 Nov 2024 08:06:38 -0500
Subject: [PATCH 70/87] simplify

---
 Dockerfile | 39 +++++++++------------------------------
 1 file changed, 9 insertions(+), 30 deletions(-)

diff --git a/Dockerfile b/Dockerfile
index 99611493a..64d2bb368 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -10,10 +10,8 @@ RUN apt-get update \
 # Set the working directory
 WORKDIR /app
 
-# Set the setgid bit and default ACL for /app
-RUN chmod g+s /app && \
-    setfacl -d -m u::rwx,g::rwx,o::rwx /app && \
-    setfacl -d -m u::rw-,g::rw-,o::rw- /app
+# Set the default ACL for /app to rwx for all users
+RUN setfacl -d -m u::rwx,g::rwx,o::rwx /app
 
 # Environment settings for the virtual environment
 ENV VIRTUAL_ENV=/app/venv
@@ -23,34 +21,15 @@ ENV PATH="$VIRTUAL_ENV/bin:$PATH"
 RUN python3 -m venv $VIRTUAL_ENV --system-site-packages
 
 # Copy dependency files with universal write permissions for all users
-COPY --chmod=666 setup.py setup.cfg pyproject.toml ./
-COPY --chmod=666 ./fast_llm/csrc/ fast_llm/csrc/
+COPY --chmod=777 setup.py setup.cfg pyproject.toml ./
+COPY --chmod=777 ./fast_llm/csrc/ fast_llm/csrc/
 
 # Install dependencies within the virtual environment
 RUN pip install --no-cache-dir --no-build-isolation -e ".[CORE,OPTIONAL,DEV]"
 
-# Use intermediate build stage to copy the remaining source code
-FROM alpine as copy_source
-
-# Set the working directory
-WORKDIR /app
-
 # Copy remaining source code with universal write permissions
-COPY ./Megatron-LM Megatron-LM
-COPY ./examples examples
-COPY ./tests tests
-COPY ./tools tools
-COPY --exclude=./fast_llm/csrc/ ./fast_llm/ fast_llm/
-
-# Set permissions for all users to write to /app
-RUN chmod -R a+w /app
-
-# Create a tar archive of /app with permissions preserved
-RUN tar -cf /app.tar -C /app .
-
-# Continue with the base stage
-FROM base
-
-# Copy the remaining source code from the intermediate build stage
-COPY --from=copy_source /app.tar /
-RUN tar -xf /app.tar -C /app && rm /app.tar
+COPY --chmod=777 ./Megatron-LM Megatron-LM
+COPY --chmod=777 ./examples examples
+COPY --chmod=777 ./tests tests
+COPY --chmod=777 ./tools tools
+COPY --chmod=777 --exclude=./fast_llm/csrc/ ./fast_llm/ fast_llm/

From fb50c13c0fae7ffecb33859383cd97566fd45010 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Tue, 12 Nov 2024 08:35:19 -0500
Subject: [PATCH 71/87] address comments

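`_DistributedConfig` falls back to the `WORLD_SIZE` and `RANK` environment
variables when `world_size` or `rank` is left unset. Below is a minimal
sketch of that fallback, for illustration only. `resolve_distributed` is a
hypothetical helper, not part of Fast-LLM. It reads the environment at call
time, whereas the config class captures the defaults once at class
definition, and it assumes `Assert.in_range` is half-open.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_distributed(
    world_size: Optional[int] = None,
    rank: Optional[int] = None,
    environ: Mapping[str, str] = os.environ,
) -> Tuple[int, int]:
    """Fill unset values from the environment, then apply the same sanity checks."""
    if world_size is None:
        world_size = int(environ.get("WORLD_SIZE", 1))
    if rank is None:
        rank = int(environ.get("RANK", 0))
    # Mirrors Assert.gt(world_size, 0) and Assert.in_range(rank, 0, world_size).
    assert world_size > 0, f"Invalid world size: {world_size}"
    assert 0 <= rank < world_size, f"Rank {rank} is outside [0, {world_size})"
    return world_size, rank
```

Under `torchrun`, both variables are set for every process, so each rank
resolves its own values without extra configuration.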
---
 fast_llm/data/config.py           | 72 +++++++++++++++++++++++++++++
 fast_llm/data/prepare.py          |  0
 fast_llm/tools/prepare_dataset.py | 75 ++-----------------------------
 fast_llm/utils.py                 |  8 ++--
 setup.cfg                         |  2 +-
 5 files changed, 80 insertions(+), 77 deletions(-)
 create mode 100644 fast_llm/data/prepare.py

diff --git a/fast_llm/data/config.py b/fast_llm/data/config.py
index 59476eb44..7add60925 100644
--- a/fast_llm/data/config.py
+++ b/fast_llm/data/config.py
@@ -1,9 +1,12 @@
 import abc
+import argparse
 import enum
+import os
 import pathlib
 import typing
 
 from fast_llm.config import Config, Field, FieldHint, check_field, config_class, skip_valid_if_none
+from fast_llm.engine.config_utils.runnable import RunnableConfig
 from fast_llm.engine.distributed.config import PhaseType
 from fast_llm.engine.schedule.config import BatchConfig
 from fast_llm.utils import Assert
@@ -186,3 +189,72 @@ def __getitem__(self, index: int):
     @abc.abstractmethod
     def __len__(self):
         pass
+
+
+@config_class
+class _DistributedConfig(Config):
+    # TODO: Unify with fast_llm.engine.distributed.config.DistributedConfig
+
+    default_world_size: typing.ClassVar[int] = int(os.environ.get("WORLD_SIZE", 1))
+    default_rank: typing.ClassVar[int] = int(os.environ.get("RANK", 0))
+    world_size: int = Field(
+        default=None,
+        desc="Size of the world group. Typically provided by torchrun or equivalent through the `WORLD_SIZE` environment variable.",
+        hint=FieldHint.expert,
+        valid=check_field(Assert.gt, 0),
+    )
+    rank: int = Field(
+        default=None,
+        desc="Rank of the local process. Typically provided by torchrun or equivalent through the `RANK` environment variable.",
+        hint=FieldHint.expert,
+        valid=check_field(Assert.geq, 0),
+    )
+    backend: str = Field(
+        default="gloo",
+        desc="Distributed backend to use.",
+        hint=FieldHint.optional,
+    )
+
+    def _validate(self):
+        if self.world_size is None:
+            self.world_size = self.default_world_size
+        if self.rank is None:
+            self.rank = self.default_rank
+        super()._validate()
+        Assert.in_range(self.rank, 0, self.world_size)
+
+
+@config_class()
+class DatasetPreparatorConfig(RunnableConfig):
+    preparator_name: typing.ClassVar[str]
+
+    output_path: pathlib.Path = Field(
+        desc="Output directory for the processed dataset.",
+        hint=FieldHint.core,
+    )
+    distributed: _DistributedConfig = Field(
+        default_factory=_DistributedConfig,
+        desc="Configuration for distributed processing.",
+        hint=FieldHint.feature,
+    )
+
+    @classmethod
+    def get_dataset_preparator_class(cls) -> typing.Type["DatasetPreparator"]:
+        raise NotImplementedError
+
+    def _get_runnable(self, parsed: argparse.Namespace) -> typing.Callable[[], None]:
+        dataset_preparator = self.get_dataset_preparator_class()(config=self)
+        return dataset_preparator.run
+
+
+class DatasetPreparator(abc.ABC):
+    _config: DatasetPreparatorConfig
+    config_class: typing.ClassVar[type[DatasetPreparatorConfig]] = DatasetPreparatorConfig
+
+    def __init__(self, config: DatasetPreparatorConfig) -> None:
+        Assert.custom(isinstance, config, self.config_class)
+        config.validate()
+        self._config = config
+
+    def run(self) -> None:
+        raise NotImplementedError
diff --git a/fast_llm/data/prepare.py b/fast_llm/data/prepare.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/fast_llm/tools/prepare_dataset.py b/fast_llm/tools/prepare_dataset.py
index cc5dc9122..8e4283aad 100644
--- a/fast_llm/tools/prepare_dataset.py
+++ b/fast_llm/tools/prepare_dataset.py
@@ -19,75 +19,6 @@
 from fast_llm.utils import Assert, Registry
 
 
-@config_class
-class DistributedConfig(Config):
-    default_world_size: typing.ClassVar[int] = int(os.environ.get("WORLD_SIZE", 1))
-    default_rank: typing.ClassVar[int] = int(os.environ.get("RANK", 0))
-    world_size: int = Field(
-        default=None,
-        desc="Size of the world group. Typically provided by torchrun or equivalent through the `WORLD_SIZE` environment variable.",
-        hint=FieldHint.expert,
-        valid=check_field(Assert.gt, 0),
-    )
-    rank: int = Field(
-        default=None,
-        desc="Rank of the local process. Typically provided by torchrun or equivalent through the `RANK` environment variable.",
-        hint=FieldHint.expert,
-        valid=check_field(Assert.geq, 0),
-    )
-    backend: str = Field(
-        default="gloo",
-        desc="Distributed backend to use.",
-        hint=FieldHint.optional,
-        valid=check_field(Assert.incl, torch.distributed.Backend.backend_list),
-    )
-
-    def _validate(self):
-        if self.world_size is None:
-            self.world_size = self.default_world_size
-        if self.rank is None:
-            self.rank = self.default_rank
-        super()._validate()
-        Assert.in_range(self.rank, 0, self.world_size)
-
-
-@config_class()
-class DatasetPreparatorConfig(RunnableConfig):
-    _abstract = True
-    model_name: typing.ClassVar[str]
-
-    output_path: pathlib.Path = Field(
-        desc="Output directory for the processed dataset.",
-        hint=FieldHint.core,
-    )
-    distributed: DistributedConfig = Field(
-        default_factory=DistributedConfig,
-        desc="Configuration for distributed processing.",
-        hint=FieldHint.feature,
-    )
-
-    @classmethod
-    def get_dataset_preparator_class(cls) -> typing.Type["DatasetPreparator"]:
-        raise NotImplementedError
-
-    def _get_runnable(self, parsed: argparse.Namespace) -> typing.Callable[[], None]:
-        dataset_preparator = self.get_dataset_preparator_class()(config=self)
-        return dataset_preparator.run
-
-
-class DatasetPreparator(abc.ABC):
-    _abstract = True
-    _config: DatasetPreparatorConfig
-    config_class: typing.ClassVar[type[DatasetPreparatorConfig]] = DatasetPreparatorConfig
-
-    def __init__(self, config: DatasetPreparatorConfig) -> None:
-        Assert.custom(isinstance, config, self.config_class)
-        config.validate()
-        self._config = config
-
-    def run(self) -> None:
-        raise NotImplementedError
-
 
 @config_class
 class GPTDatasetConfig(Config):
@@ -130,13 +61,13 @@ class GPTDatasetConfig(Config):
 @config_class()
 class GPTDatasetPreparatorConfig(DatasetPreparatorConfig):
     _abstract = False
-    model_name: typing.ClassVar[str] = "gpt"
+    preparator_name: typing.ClassVar[str] = "gpt_memmap"
 
     tokens_per_shard: int = Field(
-        default=1_000_000_000,
+        default=10**9,
         desc="Approximate number of tokens per shard.",
         hint=FieldHint.feature,
-        valid=check_field(Assert.geq, 100_000),
+        valid=check_field(Assert.geq, 10**5),
     )
     loading_workers: int = Field(
         default=1,
diff --git a/fast_llm/utils.py b/fast_llm/utils.py
index 66539efb3..5ae1c5d0d 100644
--- a/fast_llm/utils.py
+++ b/fast_llm/utils.py
@@ -175,12 +175,12 @@ def not_custom(fn, *args, **kwargs):
         ), f"Assertion failed: not fn({', '.join(itertools.chain((str(x) for x in args),(f'{str(k)}={str(v)}' for k,v in kwargs.items())))})"
 
 
-_KT = typing.TypeVar("_KT")
-_VT = typing.TypeVar("_VT")
+_KeyType = typing.TypeVar("_KeyType")
+_ValueType = typing.TypeVar("_ValueType")
 
 
-class Registry(typing.Generic[_KT, _VT]):
-    def __init__(self, name: str, data: dict[_KT, _VT]):
+class Registry(typing.Generic[_KeyType, _ValueType]):
+    def __init__(self, name: str, data: dict[_KeyType, _ValueType]):
         self._name = name
         self._data = data.copy()
 
diff --git a/setup.cfg b/setup.cfg
index 1cf0a541e..68f3a0645 100644
--- a/setup.cfg
+++ b/setup.cfg
@@ -38,7 +38,7 @@ OPTIONAL =
     # Hydra
     hydra-core>=1.3.2
     omegaconf>=2.3.0
-    # Miscaleaneous
+    # Miscellaneous
     tqdm>=4.66.3
 
 # Required for testing

From 33067c81cc87a734272620c0dab24737575bfcf8 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Tue, 12 Nov 2024 08:37:00 -0500
Subject: [PATCH 72/87] address comments

---
 fast_llm/tools/prepare_dataset.py | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fast_llm/tools/prepare_dataset.py b/fast_llm/tools/prepare_dataset.py
index 8e4283aad..3339636f6 100644
--- a/fast_llm/tools/prepare_dataset.py
+++ b/fast_llm/tools/prepare_dataset.py
@@ -1,8 +1,5 @@
-import abc
 import argparse
 import json
-import os
-import pathlib
 import shutil
 import typing
 import multiprocessing

From dbc221c6dcfc17d5e6e0a59701a2d23cc192407a Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Tue, 12 Nov 2024 08:48:59 -0500
Subject: [PATCH 73/87] address comments

---
 fast_llm/data/auto.py        |  12 ++
 fast_llm/data/gpt/prepare.py | 260 +++++++++++++++++++++++++++++++++++
 fast_llm/data/prepare.py     |   0
 3 files changed, 272 insertions(+)
 create mode 100644 fast_llm/data/auto.py
 create mode 100644 fast_llm/data/gpt/prepare.py
 delete mode 100644 fast_llm/data/prepare.py

diff --git a/fast_llm/data/auto.py b/fast_llm/data/auto.py
new file mode 100644
index 000000000..c7467fba0
--- /dev/null
+++ b/fast_llm/data/auto.py
@@ -0,0 +1,12 @@
+from fast_llm.data.gpt.prepare import GPTDatasetPreparatorConfig
+from fast_llm.utils import Registry
+
+dataset_preparator_registry = Registry(
+    "DatasetPreparator",
+    {
+        dataset_preparator.preparator_name: dataset_preparator
+        for dataset_preparator in [
+            GPTDatasetPreparatorConfig,
+        ]
+    },
+)
diff --git a/fast_llm/data/gpt/prepare.py b/fast_llm/data/gpt/prepare.py
new file mode 100644
index 000000000..6a7ad6d00
--- /dev/null
+++ b/fast_llm/data/gpt/prepare.py
@@ -0,0 +1,260 @@
+import json
+import multiprocessing
+import shutil
+import typing
+
+import numpy as np
+import torch.distributed
+
+from fast_llm.config import Config, Field, FieldHint, check_field, config_class
+from fast_llm.data.config import DatasetPreparator, DatasetPreparatorConfig, TokenizerConfig
+from fast_llm.data.gpt.memmap import GPTMemmapDataset
+from fast_llm.data.tokenizer import Tokenizer
+from fast_llm.engine.config_utils.data_type import DataType
+from fast_llm.utils import Assert
+
+
+@config_class
+class GPTDatasetConfig(Config):
+    name_or_path: str = Field(
+        desc="Name or path of the dataset.",
+        hint=FieldHint.core,
+    )
+    config_name: None | str = Field(
+        default=None,
+        desc="Specific configuration name for the dataset.",
+        hint=FieldHint.optional,
+    )
+    split: str = Field(
+        default="train",
+        desc="Split of the dataset to use.",
+        hint=FieldHint.optional,
+    )
+    field: str = Field(
+        default="text",
+        desc="Field of the dataset to use.",
+        hint=FieldHint.optional,
+    )
+    data_type: DataType | None = Field(
+        default=None,
+        desc="Data type of the dataset field. If not provided, it will be inferred based on the tokenizer vocabulary size.",
+        hint=FieldHint.optional,
+    )
+    trust_remote_code: bool = Field(
+        default=False,
+        desc="Trust remote code when downloading the dataset.",
+        hint=FieldHint.optional,
+    )
+    disable_disk_space_check: bool = Field(
+        default=False,
+        desc="Disable disk space check. Useful for environments where disk space is not accurately reported.",
+        hint=FieldHint.optional,
+    )
+
+
+@config_class()
+class GPTDatasetPreparatorConfig(DatasetPreparatorConfig):
+    _abstract = False
+    preparator_name: typing.ClassVar[str] = "gpt_memmap"
+
+    tokens_per_shard: int = Field(
+        default=10**9,
+        desc="Approximate number of tokens per shard.",
+        hint=FieldHint.feature,
+        valid=check_field(Assert.geq, 10**5),
+    )
+    loading_workers: int = Field(
+        default=1,
+        desc="Number of workers in load_dataset() call.",
+        hint=FieldHint.optional,
+        valid=check_field(Assert.geq, 1),
+    )
+    tokenize_workers: int = Field(
+        default=1,
+        desc="Number of workers for tokenization.",
+        hint=FieldHint.optional,
+        valid=check_field(Assert.geq, 1),
+    )
+    saving_workers: int = Field(
+        default=1,
+        desc="Number of processes for saving the data.",
+        hint=FieldHint.optional,
+        valid=check_field(Assert.geq, 1),
+    )
+    remove_downloads: bool = Field(
+        default=False,
+        desc="Remove downloaded dataset after processing.",
+        hint=FieldHint.optional,
+    )
+    dataset: GPTDatasetConfig = Field(
+        default_factory=GPTDatasetConfig,
+        desc="Configuration for the dataset.",
+        hint=FieldHint.feature,
+    )
+    tokenizer: TokenizerConfig = Field(
+        default_factory=TokenizerConfig,
+        desc="Configuration for the tokenizer.",
+        hint=FieldHint.feature,
+    )
+    _tokenizer: Tokenizer = Field(
+        init=False,
+        desc="The tokenizer instance.",
+        hint=FieldHint.derived,
+    )
+
+    def _validate(self):
+        assert self.tokenizer.path is not None
+        if self.dataset.data_type is not None:
+            Assert.incl(self.dataset.data_type.numpy, GPTMemmapDataset._DTYPES.values())
+        super()._validate()
+
+    @classmethod
+    def get_dataset_preparator_class(cls):
+        return GPTDatasetPreparator
+
+
+class GPTDatasetPreparator(DatasetPreparator):
+    _abstract = False
+    _config: GPTDatasetPreparatorConfig
+    config_class = GPTDatasetPreparatorConfig
+
+    def _tokenize_batch(self, batch):
+        input_ids = [
+            np.array(self._tokenizer.tokenize(text), dtype=self._config.dataset.data_type.numpy)
+            for text in batch[self._config.dataset.field]
+        ]
+        num_tokens = [len(x) for x in input_ids]
+        return {
+            "input_ids": input_ids,
+            "num_tokens": num_tokens,
+        }
+
+    def _save_shard(self, args) -> dict:
+        from tqdm import tqdm
+
+        shard_idx, shard_dataset = args
+        prefix = f"shard_{self._config.distributed.rank}_{shard_idx}"
+        shard_output_path = self._config.output_path / prefix
+        documents = [
+            np.array(item["input_ids"], dtype=self._config.dataset.data_type.numpy)
+            for item in tqdm(shard_dataset, desc=f"Saving shard {shard_idx}", unit="docs")
+        ]
+        GPTMemmapDataset.write_dataset(prefix=shard_output_path, documents=documents)
+        dataset_dict = {
+            "prefix": prefix,
+            "num_documents": len(documents),
+            "num_tokens": sum(len(doc) for doc in documents),
+        }
+        return dataset_dict
+
+    def run(self):
+        import datasets
+        import transformers
+        from tqdm import tqdm
+
+        # Set transformers logging verbosity
+        transformers.logging.set_verbosity_error()
+
+        # Disable disk space check if requested
+        if self._config.dataset.disable_disk_space_check:
+            datasets.builder.has_sufficient_disk_space = lambda needed_bytes, directory=".": True
+
+        # Load tokenizer
+        self._tokenizer = Tokenizer(config=self.tokenizer)
+
+        # Set data type if not provided
+        if self.dataset.data_type is None:
+            # Decide the datatype based on the tokenizer vocabulary size
+            vocab_size = self._tokenizer.vocab_size
+            if vocab_size <= np.iinfo(np.int16).max:
+                self.dataset.data_type = DataType.int16
+            # elif vocab_size <= np.iinfo(np.uint16).max:
+            #     self.dataset.data_type = DataType.uint16  # Not supported by Fast-LLM's DataType
+            elif vocab_size <= np.iinfo(np.int32).max:
+                self.dataset.data_type = DataType.int32
+            else:
+                raise ValueError(f"Tokenizer vocabulary size {vocab_size} is too large. This is likely an error.")
+
+        # Initialize distributed processing
+        if self._config.distributed.world_size > 1:
+            torch.distributed.init_process_group(
+                backend=self._config.distributed.backend,
+                rank=self._config.distributed.rank,
+                world_size=self._config.distributed.world_size,
+            )
+
+        # Prepare output directory
+        self._config.output_path.mkdir(parents=True, exist_ok=True)
+
+        # Download dataset if necessary on rank 0
+        download_path = self._config.output_path / "downloaded_dataset"
+        download_path_ok = download_path / "ok"
+        if self._config.distributed.rank == 0 and not download_path_ok.exists():
+            datasets.load_dataset(
+                path=self._config.dataset.name_or_path,
+                name=self._config.dataset.config_name,
+                split=self._config.dataset.split,
+                num_proc=self._config.loading_workers,
+                trust_remote_code=self._config.dataset.trust_remote_code,
+            ).save_to_disk(download_path, num_proc=self._config.saving_workers)
+            download_path_ok.touch()
+
+        # Synchronize processes to wait for the download to finish
+        if self._config.distributed.world_size > 1:
+            torch.distributed.barrier()
+
+        # Load and shard the dataset on each rank
+        dataset = datasets.load_from_disk(download_path).shard(
+            num_shards=self._config.distributed.world_size,
+            index=self._config.distributed.rank,
+        )
+        if self._config.dataset.field not in dataset.column_names:
+            raise ValueError(f"Dataset does not have field '{self._config.dataset.field}'.")
+
+        # Tokenize the dataset in parallel
+        tokenized_dataset = dataset.map(
+            self._tokenize_batch,
+            batched=True,
+            num_proc=self._config.tokenize_workers,
+            desc="Tokenizing batches",
+        )
+
+        # Calculate total number of tokens
+        total_tokens = sum(tqdm(tokenized_dataset["num_tokens"], desc="Counting tokens", unit="tokens"))
+
+        # Split dataset into shards based on number of tokens
+        num_shards = int(np.ceil(total_tokens / self._config.tokens_per_shard))
+        shards = [
+            (i, tokenized_dataset.shard(num_shards=num_shards, index=i))
+            for i in tqdm(range(num_shards), desc="Creating shards")
+        ]
+
+        # Use multiprocessing to save each shard in parallel on all ranks
+        with multiprocessing.Pool(processes=self._config.saving_workers) as pool:
+            dataset_dicts = pool.map(self._save_shard, shards)
+
+        # Gather dataset_dicts from all ranks to rank 0
+        if self._config.distributed.world_size > 1:
+            if self._config.distributed.rank == 0:
+                all_dataset_dicts = [None] * self._config.distributed.world_size
+                torch.distributed.gather_object(dataset_dicts, all_dataset_dicts, dst=0)
+                dataset_dicts = [item for sublist in all_dataset_dicts for item in sublist]
+            else:
+                torch.distributed.gather_object(dataset_dicts, [], dst=0)
+
+        # Create a metadata file on rank 0
+        if self._config.distributed.rank == 0:
+            total_tokens = sum(dataset_dict["num_tokens"] for dataset_dict in dataset_dicts)
+            for dataset_dict in dataset_dicts:
+                dataset_dict["weight"] = float(dataset_dict["num_tokens"]) / float(total_tokens)
+            output_file = self._config.output_path / "fast_llm_dataset.json"
+            json.dump({"datasets": dataset_dicts}, output_file.open("w"))
+
+        # Finalize distributed processing
+        if self._config.distributed.world_size > 1:
+            torch.distributed.barrier()
+            torch.distributed.destroy_process_group()
+
+        # Clean up downloaded dataset
+        if self._config.remove_downloads and self._config.distributed.rank == 0:
+            shutil.rmtree(download_path)
diff --git a/fast_llm/data/prepare.py b/fast_llm/data/prepare.py
deleted file mode 100644
index e69de29bb..000000000

From a2ae05109f82d55507c86ad01be7a8d88d8d3b06 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Tue, 12 Nov 2024 09:12:37 -0500
Subject: [PATCH 74/87] address comments

---
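Note on the fast_llm/config.py hunks: they only add messages to the
RuntimeError raised when a validated config is mutated or has an attribute
deleted. That read-only behaviour is also why this patch moves the tokenizer
and the resolved data type off the config and onto the preparator. The sketch
below illustrates the pattern only; `FrozenAfterValidate` and `_MISSING` are
hypothetical names, and this is not Fast-LLM's actual `Config` implementation.

```python
_MISSING = object()  # hypothetical sentinel for "attribute not set yet"


class FrozenAfterValidate:
    """Hypothetical stand-in for a config that becomes read-only once validated."""

    def validate(self):
        # Bypass our own __setattr__ so the flag can still be written.
        object.__setattr__(self, "_validated", True)
        return self

    def __setattr__(self, key, value):
        if getattr(self, "_validated", False):
            # Re-setting the exact same object is tolerated, as the patched code's comment describes.
            if getattr(self, key, _MISSING) is value:
                return
            raise RuntimeError(f"Cannot set attribute `{key}` after validation.")
        super().__setattr__(key, value)

    def __delattr__(self, key):
        if getattr(self, "_validated", False):
            raise RuntimeError(f"Cannot delete attribute `{key}` after validation.")
        super().__delattr__(key)
```

With a class like this, an assignment such as `config.data_type = ...` after
`validate()` raises with a message naming the attribute, which is the failure
the original `self.dataset.data_type = DataType.int16` would likely have hit.
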
 fast_llm/config.py                |   4 +-
 fast_llm/data/auto.py             |   6 +-
 fast_llm/data/config.py           |   1 +
 fast_llm/data/gpt/prepare.py      |  42 +++--
 fast_llm/tools/prepare_dataset.py | 272 +-----------------------------
 5 files changed, 27 insertions(+), 298 deletions(-)
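
Note on the data-type handling in fast_llm/data/gpt/prepare.py: the preparator
now stores the resolved type in `self._data_type`, using the configured value
when one is given and otherwise inferring it from the tokenizer vocabulary
size. A minimal standalone sketch of that rule follows. It uses plain NumPy
dtypes in place of Fast-LLM's `DataType`, and the function name
`infer_token_dtype` is hypothetical.

```python
import numpy as np


def infer_token_dtype(vocab_size, configured=None):
    """Return the configured dtype, or the smallest signed dtype that fits the vocabulary."""
    if configured is not None:
        # An explicit setting always wins over inference.
        return np.dtype(configured)
    if vocab_size <= np.iinfo(np.int16).max:
        return np.dtype(np.int16)
    # uint16 is skipped, mirroring the patch (not supported by Fast-LLM's DataType).
    if vocab_size <= np.iinfo(np.int32).max:
        return np.dtype(np.int32)
    raise ValueError(f"Tokenizer vocabulary size {vocab_size} is too large.")
```

Under this rule a 32,000-token vocabulary is stored as int16, while anything
above 32,767 falls through to int32, doubling the size of the memmap shards.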

diff --git a/fast_llm/config.py b/fast_llm/config.py
index 66d9c5303..a3211de75 100644
--- a/fast_llm/config.py
+++ b/fast_llm/config.py
@@ -301,7 +301,7 @@ def __setattr__(self, key, value):
                 # Allow setting the exact same object to facilitate setup of cross-dependencies.
                 # Ex. allow re-setting cross-dependencies of already validated sub-configs.
                 return
-            raise RuntimeError()
+            raise RuntimeError(f"Cannot set attribute `{key}` after validation.")
         super().__setattr__(key, value)
 
     def __delattr__(self, key):
@@ -309,7 +309,7 @@ def __delattr__(self, key):
         Make the class read-only after validation.
         """
         if getattr(self, "_validated", False):
-            raise RuntimeError()
+            raise RuntimeError(f"Cannot delete attribute `{key}` after validation.")
         super().__delattr__(key)
 
     def validate(self, *, _is_validating=False):
diff --git a/fast_llm/data/auto.py b/fast_llm/data/auto.py
index c7467fba0..682056966 100644
--- a/fast_llm/data/auto.py
+++ b/fast_llm/data/auto.py
@@ -1,12 +1,12 @@
-from fast_llm.data.gpt.prepare import GPTDatasetPreparatorConfig
+from fast_llm.data.gpt.prepare import GPTMemmapDatasetPreparatorConfig
 from fast_llm.utils import Registry
 
 dataset_preparator_registry = Registry(
     "DatasetPreparator",
     {
-        dataset_preparator.model_name: dataset_preparator
+        dataset_preparator.preparator_name: dataset_preparator
         for dataset_preparator in [
-            GPTDatasetPreparatorConfig,
+            GPTMemmapDatasetPreparatorConfig,
         ]
     },
 )
diff --git a/fast_llm/data/config.py b/fast_llm/data/config.py
index 7add60925..891076705 100644
--- a/fast_llm/data/config.py
+++ b/fast_llm/data/config.py
@@ -256,5 +256,6 @@ def __init__(self, config: DatasetPreparatorConfig) -> None:
         config.validate()
         self._config = config
 
+    @abc.abstractmethod
     def run(self) -> None:
         raise NotImplementedError
diff --git a/fast_llm/data/gpt/prepare.py b/fast_llm/data/gpt/prepare.py
index 6a7ad6d00..fd1e82f5a 100644
--- a/fast_llm/data/gpt/prepare.py
+++ b/fast_llm/data/gpt/prepare.py
@@ -15,7 +15,7 @@
 
 
 @config_class
-class GPTDatasetConfig(Config):
+class GPTHuggingfaceDatasetConfig(Config):
     name_or_path: str = Field(
         desc="Name or path of the dataset.",
         hint=FieldHint.core,
@@ -53,8 +53,7 @@ class GPTDatasetConfig(Config):
 
 
 @config_class()
-class GPTDatasetPreparatorConfig(DatasetPreparatorConfig):
-    _abstract = False
+class GPTMemmapDatasetPreparatorConfig(DatasetPreparatorConfig):
     preparator_name: typing.ClassVar[str] = "gpt_memmap"
 
     tokens_per_shard: int = Field(
@@ -86,8 +85,8 @@ class GPTDatasetPreparatorConfig(DatasetPreparatorConfig):
         desc="Remove downloaded dataset after processing.",
         hint=FieldHint.optional,
     )
-    dataset: GPTDatasetConfig = Field(
-        default_factory=GPTDatasetConfig,
+    dataset: GPTHuggingfaceDatasetConfig = Field(
+        default_factory=GPTHuggingfaceDatasetConfig,
         desc="Configuration for the dataset.",
         hint=FieldHint.feature,
     )
@@ -96,11 +95,6 @@ class GPTDatasetPreparatorConfig(DatasetPreparatorConfig):
         desc="Configuration for the tokenizer.",
         hint=FieldHint.feature,
     )
-    _tokenizer: Tokenizer = Field(
-        init=False,
-        desc="The tokenizer instance.",
-        hint=FieldHint.derived,
-    )
 
     def _validate(self):
         assert self.tokenizer.path is not None
@@ -110,17 +104,19 @@ def _validate(self):
 
     @classmethod
     def get_dataset_preparator_class(cls):
-        return GPTDatasetPreparator
+        return GPTMemmapDatasetPreparator
+
 
+class GPTMemmapDatasetPreparator(DatasetPreparator):
+    _config: GPTMemmapDatasetPreparatorConfig
+    config_class = GPTMemmapDatasetPreparatorConfig
 
-class GPTDatasetPreparator(DatasetPreparator):
-    _abstract = False
-    _config: GPTDatasetPreparatorConfig
-    config_class = GPTDatasetPreparatorConfig
+    _tokenizer: Tokenizer
+    _data_type: DataType
 
     def _tokenize_batch(self, batch):
         input_ids = [
-            np.array(self._config._tokenizer.tokenize(text), dtype=self._config.dataset.data_type.numpy)
+            np.array(self._tokenizer.tokenize(text), dtype=self._data_type.numpy)
             for text in batch[self._config.dataset.field]
         ]
         num_tokens = [len(x) for x in input_ids]
@@ -136,7 +132,7 @@ def _save_shard(self, args) -> dict:
         prefix = f"shard_{self._config.distributed.rank}_{shard_idx}"
         shard_output_path = self._config.output_path / prefix
         documents = [
-            np.array(item["input_ids"], dtype=self._config.dataset.data_type.numpy)
+            np.array(item["input_ids"], dtype=self._data_type.numpy)
             for item in tqdm(shard_dataset, desc=f"Saving shard {shard_idx}", unit="docs")
         ]
         GPTMemmapDataset.write_dataset(prefix=shard_output_path, documents=documents)
@@ -160,20 +156,22 @@ def run(self):
             datasets.builder.has_sufficient_disk_space = lambda needed_bytes, directory=".": True
 
         # Load tokenizer
-        self._tokenizer = Tokenizer(config=self.tokenizer)
+        self._tokenizer = Tokenizer(config=self._config.tokenizer)
 
         # Set data type if not provided
-        if self.dataset.data_type is None:
+        if self._config.dataset.data_type is None:
             # Decide the datatype based on the tokenizer vocabulary size
             vocab_size = self._tokenizer.vocab_size
             if vocab_size <= np.iinfo(np.int16).max:
-                self.dataset.data_type = DataType.int16
+                self._data_type = DataType.int16
             # elif vocab_size <= np.iinfo(np.uint16).max:
-            #     self.dataset.data_type = DataType.uint16  # Not supported by Fast-LLM's DataType
+            #     self._data_type = DataType.uint16  # Not supported by Fast-LLM's DataType
             elif vocab_size <= np.iinfo(np.int32).max:
-                self.dataset.data_type = DataType.int32
+                self._data_type = DataType.int32
             else:
                 raise ValueError(f"Tokenizer vocabulary size {vocab_size} is too large. This is likely an error.")
+        else:
+            self._data_type = self._config.dataset.data_type
 
         # Initialize distributed processing
         if self._config.distributed.world_size > 1:
diff --git a/fast_llm/tools/prepare_dataset.py b/fast_llm/tools/prepare_dataset.py
index 3339636f6..aafe26902 100644
--- a/fast_llm/tools/prepare_dataset.py
+++ b/fast_llm/tools/prepare_dataset.py
@@ -1,277 +1,7 @@
 import argparse
-import json
-import shutil
-import typing
-import multiprocessing
 
-import numpy as np
-import torch.distributed
-
-from fast_llm.config import Config, Field, FieldHint, check_field, config_class
-from fast_llm.data.config import TokenizerConfig
-from fast_llm.data.gpt.memmap import GPTMemmapDataset
-from fast_llm.data.tokenizer import Tokenizer
-from fast_llm.engine.config_utils.data_type import DataType
+from fast_llm.data.auto import dataset_preparator_registry
 from fast_llm.engine.config_utils.runnable import RunnableConfig
-from fast_llm.utils import Assert, Registry
-
-
-
-@config_class
-class GPTDatasetConfig(Config):
-    name_or_path: str = Field(
-        desc="Name or path of the dataset.",
-        hint=FieldHint.core,
-    )
-    config_name: None | str = Field(
-        default=None,
-        desc="Specific configuration name for the dataset.",
-        hint=FieldHint.optional,
-    )
-    split: str = Field(
-        default="train",
-        desc="Split of the dataset to use.",
-        hint=FieldHint.optional,
-    )
-    field: str = Field(
-        default="text",
-        desc="Field of the dataset to use.",
-        hint=FieldHint.optional,
-    )
-    data_type: DataType | None = Field(
-        default=None,
-        desc="Data type of the dataset field. If not provided, it will be inferred based on the tokenizer vocabulary size.",
-        hint=FieldHint.optional,
-    )
-    trust_remote_code: bool = Field(
-        default=False,
-        desc="Trust remote code when downloading the dataset.",
-        hint=FieldHint.optional,
-    )
-    disable_disk_space_check: bool = Field(
-        default=False,
-        desc="Disable disk space check. Useful for environments where disk space is not accurately reported.",
-        hint=FieldHint.optional,
-    )
-
-
-@config_class()
-class GPTDatasetPreparatorConfig(DatasetPreparatorConfig):
-    _abstract = False
-    preparator_name: typing.ClassVar[str] = "gpt_memmap"
-
-    tokens_per_shard: int = Field(
-        default=10**9,
-        desc="Approximate number of tokens per shard.",
-        hint=FieldHint.feature,
-        valid=check_field(Assert.geq, 10**5),
-    )
-    loading_workers: int = Field(
-        default=1,
-        desc="Number of workers in load_dataset() call.",
-        hint=FieldHint.optional,
-        valid=check_field(Assert.geq, 1),
-    )
-    tokenize_workers: int = Field(
-        default=1,
-        desc="Number of workers for tokenization.",
-        hint=FieldHint.optional,
-        valid=check_field(Assert.geq, 1),
-    )
-    saving_workers: int = Field(
-        default=1,
-        desc="Number of processes for saving the data.",
-        hint=FieldHint.optional,
-        valid=check_field(Assert.geq, 1),
-    )
-    remove_downloads: bool = Field(
-        default=False,
-        desc="Remove downloaded dataset after processing.",
-        hint=FieldHint.optional,
-    )
-    dataset: GPTDatasetConfig = Field(
-        default_factory=GPTDatasetConfig,
-        desc="Configuration for the dataset.",
-        hint=FieldHint.feature,
-    )
-    tokenizer: TokenizerConfig = Field(
-        default_factory=TokenizerConfig,
-        desc="Configuration for the tokenizer.",
-        hint=FieldHint.feature,
-    )
-    _tokenizer: Tokenizer = Field(
-        init=False,
-        desc="The tokenizer instance.",
-        hint=FieldHint.derived,
-    )
-
-    def _validate(self):
-        assert self.tokenizer.path is not None
-        if self.dataset.data_type is not None:
-            Assert.incl(self.dataset.data_type.numpy, GPTMemmapDataset._DTYPES.values())
-        super()._validate()
-
-    @classmethod
-    def get_dataset_preparator_class(cls):
-        return GPTDatasetPreparator
-
-
-class GPTDatasetPreparator(DatasetPreparator):
-    _abstract = False
-    _config: GPTDatasetPreparatorConfig
-    config_class = GPTDatasetPreparatorConfig
-
-    def _tokenize_batch(self, batch):
-        input_ids = [
-            np.array(self._config._tokenizer.tokenize(text), dtype=self._config.dataset.data_type.numpy)
-            for text in batch[self._config.dataset.field]
-        ]
-        num_tokens = [len(x) for x in input_ids]
-        return {
-            "input_ids": input_ids,
-            "num_tokens": num_tokens,
-        }
-
-    def _save_shard(self, args) -> dict:
-        from tqdm import tqdm
-
-        shard_idx, shard_dataset = args
-        prefix = f"shard_{self._config.distributed.rank}_{shard_idx}"
-        shard_output_path = self._config.output_path / prefix
-        documents = [
-            np.array(item["input_ids"], dtype=self._config.dataset.data_type.numpy)
-            for item in tqdm(shard_dataset, desc=f"Saving shard {shard_idx}", unit="docs")
-        ]
-        GPTMemmapDataset.write_dataset(prefix=shard_output_path, documents=documents)
-        dataset_dict = {
-            "prefix": prefix,
-            "num_documents": len(documents),
-            "num_tokens": sum(len(doc) for doc in documents),
-        }
-        return dataset_dict
-
-    def run(self):
-        import datasets
-        import transformers
-        from tqdm import tqdm
-
-        # Set transformers logging verbosity
-        transformers.logging.set_verbosity_error()
-
-        # Disable disk space check if requested
-        if self._config.dataset.disable_disk_space_check:
-            datasets.builder.has_sufficient_disk_space = lambda needed_bytes, directory=".": True
-
-        # Load tokenizer
-        self._tokenizer = Tokenizer(config=self.tokenizer)
-
-        # Set data type if not provided
-        if self.dataset.data_type is None:
-            # Decide the datatype based on the tokenizer vocabulary size
-            vocab_size = self._tokenizer.vocab_size
-            if vocab_size <= np.iinfo(np.int16).max:
-                self.dataset.data_type = DataType.int16
-            # elif vocab_size <= np.iinfo(np.uint16).max:
-            #     self.dataset.data_type = DataType.uint16  # Not supported by Fast-LLM's DataType
-            elif vocab_size <= np.iinfo(np.int32).max:
-                self.dataset.data_type = DataType.int32
-            else:
-                raise ValueError(f"Tokenizer vocabulary size {vocab_size} is too large. This is likely an error.")
-
-        # Initialize distributed processing
-        if self._config.distributed.world_size > 1:
-            torch.distributed.init_process_group(
-                backend=self._config.distributed.backend,
-                rank=self._config.distributed.rank,
-                world_size=self._config.distributed.world_size,
-            )
-
-        # Prepare output directory
-        self._config.output_path.mkdir(parents=True, exist_ok=True)
-
-        # Download dataset if necessary on rank 0
-        download_path = self._config.output_path / "downloaded_dataset"
-        download_path_ok = download_path / "ok"
-        if self._config.distributed.rank == 0 and not download_path_ok.exists():
-            datasets.load_dataset(
-                path=self._config.dataset.name_or_path,
-                name=self._config.dataset.config_name,
-                split=self._config.dataset.split,
-                num_proc=self._config.loading_workers,
-                trust_remote_code=self._config.dataset.trust_remote_code,
-            ).save_to_disk(download_path, num_proc=self._config.saving_workers)
-            download_path_ok.touch()
-
-        # Synchronize processes to wait for the download to finish
-        if self._config.distributed.world_size > 1:
-            torch.distributed.barrier()
-
-        # Load and shard the dataset on each rank
-        dataset = datasets.load_from_disk(download_path).shard(
-            num_shards=self._config.distributed.world_size,
-            index=self._config.distributed.rank,
-        )
-        if self._config.dataset.field not in dataset.column_names:
-            raise ValueError(f"Dataset does not have field '{self._config.dataset.field}'.")
-
-        # Tokenize the dataset in parallel
-        tokenized_dataset = dataset.map(
-            self._tokenize_batch,
-            batched=True,
-            num_proc=self._config.tokenize_workers,
-            desc="Tokenizing batches",
-        )
-
-        # Calculate total number of tokens
-        total_tokens = sum(tqdm(tokenized_dataset["num_tokens"], desc="Counting tokens", unit="tokens"))
-
-        # Split dataset into shards based on number of tokens
-        num_shards = int(np.ceil(total_tokens / self._config.tokens_per_shard))
-        shards = [
-            (i, tokenized_dataset.shard(num_shards=num_shards, index=i))
-            for i in tqdm(range(num_shards), desc="Creating shards")
-        ]
-
-        # Use multiprocessing to save each shard in parallel on all ranks
-        with multiprocessing.Pool(processes=self._config.saving_workers) as pool:
-            dataset_dicts = pool.map(self._save_shard, shards)
-
-        # Gather dataset_dicts from all ranks to rank 0
-        if self._config.distributed.world_size > 1:
-            if self._config.distributed.rank == 0:
-                all_dataset_dicts = [None] * self._config.distributed.world_size
-                torch.distributed.gather_object(dataset_dicts, all_dataset_dicts, dst=0)
-                dataset_dicts = [item for sublist in all_dataset_dicts for item in sublist]
-            else:
-                torch.distributed.gather_object(dataset_dicts, [], dst=0)
-
-        # Create a metadata file on rank 0
-        if self._config.distributed.rank == 0:
-            total_tokens = sum(dataset_dict["num_tokens"] for dataset_dict in dataset_dicts)
-            for dataset_dict in dataset_dicts:
-                dataset_dict["weight"] = float(dataset_dict["num_tokens"]) / float(total_tokens)
-            output_file = self._config.output_path / "fast_llm_dataset.json"
-            json.dump({"datasets": dataset_dicts}, output_file.open("w"))
-
-        # Finalize distributed processing
-        if self._config.distributed.world_size > 1:
-            torch.distributed.barrier()
-            torch.distributed.destroy_process_group()
-
-        # Clean up downloaded dataset
-        if self._config.remove_downloads and self._config.distributed.rank == 0:
-            shutil.rmtree(download_path)
-
-
-dataset_preparator_registry = Registry(
-    "DatasetPreparator",
-    {
-        dataset_preparator.model_name: dataset_preparator
-        for dataset_preparator in [
-            GPTDatasetPreparatorConfig,
-        ]
-    },
-)
 
 
 class PrepareDatasetConfig(RunnableConfig):

From d68ce822bccd5085a3a03bac2a33c82e5bb79dcd Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Tue, 12 Nov 2024 14:53:40 -0500
Subject: [PATCH 75/87] fix link

---
 docs/index.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/index.md b/docs/index.md
index af136b674..0171e79f2 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -56,7 +56,7 @@ Fast-LLM powers the world's most advanced AI projects:
 -   **Enterprise AI Solutions:** Accelerate time-to-market for AI products by reducing training costs and enabling faster iteration.
 -   **Academic Collaborations:** Drive AI innovation with high-performance training capabilities that support cutting-edge research in machine learning.
 
-See how Fast-LLM has helped early adopters achieve faster results. [Explore use cases and success stories](success-stories/starcoder-2).
+See how Fast-LLM has helped early adopters achieve faster results. [Explore use cases and success stories](success-stories/starcoder-2.md).
 
 ## Project Scope and Objectives
 

From 2fad03c70777a29e18776c84f3390051c064824a Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Tue, 12 Nov 2024 20:47:37 -0500
Subject: [PATCH 76/87] clean up

---
 fast_llm/data/config.py      |  73 ----------
 fast_llm/data/gpt/prepare.py | 258 -----------------------------------
 tools/prepare_dataset.py     | 224 ------------------------------
 3 files changed, 555 deletions(-)
 delete mode 100644 fast_llm/data/gpt/prepare.py
 delete mode 100644 tools/prepare_dataset.py

diff --git a/fast_llm/data/config.py b/fast_llm/data/config.py
index 891076705..59476eb44 100644
--- a/fast_llm/data/config.py
+++ b/fast_llm/data/config.py
@@ -1,12 +1,9 @@
 import abc
-import argparse
 import enum
-import os
 import pathlib
 import typing
 
 from fast_llm.config import Config, Field, FieldHint, check_field, config_class, skip_valid_if_none
-from fast_llm.engine.config_utils.runnable import RunnableConfig
 from fast_llm.engine.distributed.config import PhaseType
 from fast_llm.engine.schedule.config import BatchConfig
 from fast_llm.utils import Assert
@@ -189,73 +186,3 @@ def __getitem__(self, index: int):
     @abc.abstractmethod
     def __len__(self):
         pass
-
-
-@config_class
-class _DistributedConfig(Config):
-    # TODO: Unify with fast_llm.engine.distributed.config.DistributedConfig
-
-    default_world_size: typing.ClassVar[int] = int(os.environ.get("WORLD_SIZE", 1))
-    default_rank: typing.ClassVar[int] = int(os.environ.get("RANK", 0))
-    world_size: int = Field(
-        default=None,
-        desc="Size of the world group. Typically provided by torchrun or equivalent through the `WORLD_SIZE` environment variable.",
-        hint=FieldHint.expert,
-        valid=check_field(Assert.gt, 0),
-    )
-    rank: int = Field(
-        default=None,
-        desc="Rank of the local process. Typically provided by torchrun or equivalent through the `RANK` environment variable.",
-        hint=FieldHint.expert,
-        valid=check_field(Assert.geq, 0),
-    )
-    backend: str = Field(
-        default="gloo",
-        desc="Distributed backend to use.",
-        hint=FieldHint.optional,
-    )
-
-    def _validate(self):
-        if self.world_size is None:
-            self.world_size = self.default_world_size
-        if self.rank is None:
-            self.rank = self.default_rank
-        super()._validate()
-        Assert.in_range(self.rank, 0, self.world_size)
-
-
-@config_class()
-class DatasetPreparatorConfig(RunnableConfig):
-    preparator_name: typing.ClassVar[str]
-
-    output_path: pathlib.Path = Field(
-        desc="Output directory for the processed dataset.",
-        hint=FieldHint.core,
-    )
-    distributed: _DistributedConfig = Field(
-        default_factory=_DistributedConfig,
-        desc="Configuration for distributed processing.",
-        hint=FieldHint.feature,
-    )
-
-    @classmethod
-    def get_dataset_preparator_class(cls) -> typing.Type["DatasetPreparator"]:
-        raise NotImplementedError
-
-    def _get_runnable(self, parsed: argparse.Namespace) -> typing.Callable[[], None]:
-        dataset_preparator = self.get_dataset_preparator_class()(config=self)
-        return dataset_preparator.run
-
-
-class DatasetPreparator(abc.ABC):
-    _config: DatasetPreparatorConfig
-    config_class: typing.ClassVar[type[DatasetPreparatorConfig]] = DatasetPreparatorConfig
-
-    def __init__(self, config: DatasetPreparatorConfig) -> None:
-        Assert.custom(isinstance, config, self.config_class)
-        config.validate()
-        self._config = config
-
-    @abc.abstractmethod
-    def run(self) -> None:
-        raise NotImplementedError
diff --git a/fast_llm/data/gpt/prepare.py b/fast_llm/data/gpt/prepare.py
deleted file mode 100644
index fd1e82f5a..000000000
--- a/fast_llm/data/gpt/prepare.py
+++ /dev/null
@@ -1,258 +0,0 @@
-import json
-import multiprocessing
-import shutil
-import typing
-
-import numpy as np
-import torch.distributed
-
-from fast_llm.config import Config, Field, FieldHint, check_field, config_class
-from fast_llm.data.config import DatasetPreparator, DatasetPreparatorConfig, TokenizerConfig
-from fast_llm.data.gpt.memmap import GPTMemmapDataset
-from fast_llm.data.tokenizer import Tokenizer
-from fast_llm.engine.config_utils.data_type import DataType
-from fast_llm.utils import Assert
-
-
-@config_class
-class GPTHuggingfaceDatasetConfig(Config):
-    name_or_path: str = Field(
-        desc="Name or path of the dataset.",
-        hint=FieldHint.core,
-    )
-    config_name: None | str = Field(
-        default=None,
-        desc="Specific configuration name for the dataset.",
-        hint=FieldHint.optional,
-    )
-    split: str = Field(
-        default="train",
-        desc="Split of the dataset to use.",
-        hint=FieldHint.optional,
-    )
-    field: str = Field(
-        default="text",
-        desc="Field of the dataset to use.",
-        hint=FieldHint.optional,
-    )
-    data_type: DataType | None = Field(
-        default=None,
-        desc="Data type of the dataset field. If not provided, it will be inferred based on the tokenizer vocabulary size.",
-        hint=FieldHint.optional,
-    )
-    trust_remote_code: bool = Field(
-        default=False,
-        desc="Trust remote code when downloading the dataset.",
-        hint=FieldHint.optional,
-    )
-    disable_disk_space_check: bool = Field(
-        default=False,
-        desc="Disable disk space check. Useful for environments where disk space is not accurately reported.",
-        hint=FieldHint.optional,
-    )
-
-
-@config_class()
-class GPTMemmapDatasetPreparatorConfig(DatasetPreparatorConfig):
-    preparator_name: typing.ClassVar[str] = "gpt_memmap"
-
-    tokens_per_shard: int = Field(
-        default=10**9,
-        desc="Approximate number of tokens per shard.",
-        hint=FieldHint.feature,
-        valid=check_field(Assert.geq, 10**5),
-    )
-    loading_workers: int = Field(
-        default=1,
-        desc="Number of workers in load_dataset() call.",
-        hint=FieldHint.optional,
-        valid=check_field(Assert.geq, 1),
-    )
-    tokenize_workers: int = Field(
-        default=1,
-        desc="Number of workers for tokenization.",
-        hint=FieldHint.optional,
-        valid=check_field(Assert.geq, 1),
-    )
-    saving_workers: int = Field(
-        default=1,
-        desc="Number of processes for saving the data.",
-        hint=FieldHint.optional,
-        valid=check_field(Assert.geq, 1),
-    )
-    remove_downloads: bool = Field(
-        default=False,
-        desc="Remove downloaded dataset after processing.",
-        hint=FieldHint.optional,
-    )
-    dataset: GPTHuggingfaceDatasetConfig = Field(
-        default_factory=GPTHuggingfaceDatasetConfig,
-        desc="Configuration for the dataset.",
-        hint=FieldHint.feature,
-    )
-    tokenizer: TokenizerConfig = Field(
-        default_factory=TokenizerConfig,
-        desc="Configuration for the tokenizer.",
-        hint=FieldHint.feature,
-    )
-
-    def _validate(self):
-        assert self.tokenizer.path is not None
-        if self.dataset.data_type is not None:
-            Assert.incl(self.dataset.data_type.numpy, GPTMemmapDataset._DTYPES.values())
-        super()._validate()
-
-    @classmethod
-    def get_dataset_preparator_class(cls):
-        return GPTMemmapDatasetPreparator
-
-
-class GPTMemmapDatasetPreparator(DatasetPreparator):
-    _config: GPTMemmapDatasetPreparatorConfig
-    config_class = GPTMemmapDatasetPreparatorConfig
-
-    _tokenizer: Tokenizer
-    _data_type: DataType
-
-    def _tokenize_batch(self, batch):
-        input_ids = [
-            np.array(self._tokenizer.tokenize(text), dtype=self._data_type.numpy)
-            for text in batch[self._config.dataset.field]
-        ]
-        num_tokens = [len(x) for x in input_ids]
-        return {
-            "input_ids": input_ids,
-            "num_tokens": num_tokens,
-        }
-
-    def _save_shard(self, args) -> dict:
-        from tqdm import tqdm
-
-        shard_idx, shard_dataset = args
-        prefix = f"shard_{self._config.distributed.rank}_{shard_idx}"
-        shard_output_path = self._config.output_path / prefix
-        documents = [
-            np.array(item["input_ids"], dtype=self._data_type.numpy)
-            for item in tqdm(shard_dataset, desc=f"Saving shard {shard_idx}", unit="docs")
-        ]
-        GPTMemmapDataset.write_dataset(prefix=shard_output_path, documents=documents)
-        dataset_dict = {
-            "prefix": prefix,
-            "num_documents": len(documents),
-            "num_tokens": sum(len(doc) for doc in documents),
-        }
-        return dataset_dict
-
-    def run(self):
-        import datasets
-        import transformers
-        from tqdm import tqdm
-
-        # Set transformers logging verbosity
-        transformers.logging.set_verbosity_error()
-
-        # Disable disk space check if requested
-        if self._config.dataset.disable_disk_space_check:
-            datasets.builder.has_sufficient_disk_space = lambda needed_bytes, directory=".": True
-
-        # Load tokenizer
-        self._tokenizer = Tokenizer(config=self._config.tokenizer)
-
-        # Set data type if not provided
-        if self._config.dataset.data_type is None:
-            # Decide the datatype based on the tokenizer vocabulary size
-            vocab_size = self._tokenizer.vocab_size
-            if vocab_size <= np.iinfo(np.int16).max:
-                self._data_type = DataType.int16
-            # elif vocab_size <= np.iinfo(np.uint16).max:
-            #     self._data_type = DataType.uint16  # Not supported by Fast-LLM's DataType
-            elif vocab_size <= np.iinfo(np.int32).max:
-                self._data_type = DataType.int32
-            else:
-                raise ValueError(f"Tokenizer vocabulary size {vocab_size} is too large. This is likely an error.")
-        else:
-            self._data_type = self._config.dataset.data_type
-
-        # Initialize distributed processing
-        if self._config.distributed.world_size > 1:
-            torch.distributed.init_process_group(
-                backend=self._config.distributed.backend,
-                rank=self._config.distributed.rank,
-                world_size=self._config.distributed.world_size,
-            )
-
-        # Prepare output directory
-        self._config.output_path.mkdir(parents=True, exist_ok=True)
-
-        # Download dataset if necessary on rank 0
-        download_path = self._config.output_path / "downloaded_dataset"
-        download_path_ok = download_path / "ok"
-        if self._config.distributed.rank == 0 and not download_path_ok.exists():
-            datasets.load_dataset(
-                path=self._config.dataset.name_or_path,
-                name=self._config.dataset.config_name,
-                split=self._config.dataset.split,
-                num_proc=self._config.loading_workers,
-                trust_remote_code=self._config.dataset.trust_remote_code,
-            ).save_to_disk(download_path, num_proc=self._config.saving_workers)
-            download_path_ok.touch()
-
-        # Synchronize processes to wait for the download to finish
-        if self._config.distributed.world_size > 1:
-            torch.distributed.barrier()
-
-        # Load and shard the dataset on each rank
-        dataset = datasets.load_from_disk(download_path).shard(
-            num_shards=self._config.distributed.world_size,
-            index=self._config.distributed.rank,
-        )
-        if self._config.dataset.field not in dataset.column_names:
-            raise ValueError(f"Dataset does not have field '{self._config.dataset.field}'.")
-
-        # Tokenize the dataset in parallel
-        tokenized_dataset = dataset.map(
-            self._tokenize_batch,
-            batched=True,
-            num_proc=self._config.tokenize_workers,
-            desc="Tokenizing batches",
-        )
-
-        # Calculate total number of tokens
-        total_tokens = sum(tqdm(tokenized_dataset["num_tokens"], desc="Counting tokens", unit="tokens"))
-
-        # Split dataset into shards based on number of tokens
-        num_shards = int(np.ceil(total_tokens / self._config.tokens_per_shard))
-        shards = [
-            (i, tokenized_dataset.shard(num_shards=num_shards, index=i))
-            for i in tqdm(range(num_shards), desc="Creating shards")
-        ]
-
-        # Use multiprocessing to save each shard in parallel on all ranks
-        with multiprocessing.Pool(processes=self._config.saving_workers) as pool:
-            dataset_dicts = pool.map(self._save_shard, shards)
-
-        # Gather dataset_dicts from all ranks to rank 0
-        if self._config.distributed.world_size > 1:
-            if self._config.distributed.rank == 0:
-                all_dataset_dicts = [None] * self._config.distributed.world_size
-                torch.distributed.gather_object(dataset_dicts, all_dataset_dicts, dst=0)
-                dataset_dicts = [item for sublist in all_dataset_dicts for item in sublist]
-            else:
-                torch.distributed.gather_object(dataset_dicts, [], dst=0)
-
-        # Create a metadata file on rank 0
-        if self._config.distributed.rank == 0:
-            total_tokens = sum(dataset_dict["num_tokens"] for dataset_dict in dataset_dicts)
-            for dataset_dict in dataset_dicts:
-                dataset_dict["weight"] = float(dataset_dict["num_tokens"]) / float(total_tokens)
-            output_file = self._config.output_path / "fast_llm_dataset.json"
-            json.dump({"datasets": dataset_dicts}, output_file.open("w"))
-
-        # Finalize distributed processing
-        if self._config.distributed.world_size > 1:
-            torch.distributed.barrier()
-            torch.distributed.destroy_process_group()
-
-        # Clean up downloaded dataset
-        if self._config.remove_downloads and self._config.distributed.rank == 0:
-            shutil.rmtree(download_path)
diff --git a/tools/prepare_dataset.py b/tools/prepare_dataset.py
deleted file mode 100644
index b38a88cbc..000000000
--- a/tools/prepare_dataset.py
+++ /dev/null
@@ -1,224 +0,0 @@
-import json
-import os
-from functools import cached_property
-from multiprocessing import Pool
-from pathlib import Path
-
-import numpy as np
-from datasets import load_dataset, load_from_disk
-from torch import distributed as dist
-from tqdm import tqdm
-from transformers import AutoTokenizer, logging
-
-from fast_llm.config import Field, FieldHint, check_field, config_class
-from fast_llm.data.mmap import MMapIndexedDataset
-from fast_llm.engine.config_utils.data_type import DataType
-from fast_llm.engine.config_utils.runnable import RunnableConfig
-from fast_llm.utils import Assert
-
-logging.set_verbosity_error()
-
-
-@config_class
-class PrepareDatasetConfig(RunnableConfig):
-    dataset_name_or_path: str = Field(
-        desc="Name or path of the dataset.",
-        hint=FieldHint.core,
-    )
-    tokenizer_path_or_name: str = Field(
-        desc="Path or name of the tokenizer.",
-        hint=FieldHint.core,
-    )
-    output_dir: str = Field(
-        desc="Output directory for the processed dataset.",
-        hint=FieldHint.core,
-    )
-    num_processes_load: int = Field(
-        default=1,
-        desc="Number of workers in load_dataset() call.",
-        hint=FieldHint.optional,
-        valid=check_field(Assert.geq, 0),
-    )
-    num_processes_map: int = Field(
-        default=1,
-        desc="Number of workers in .map() call.",
-        hint=FieldHint.optional,
-        valid=check_field(Assert.geq, 0),
-    )
-    num_processes_save: int = Field(
-        default=1,
-        desc="Number of processes for saving the mmap'ed datasets.",
-        hint=FieldHint.optional,
-        valid=check_field(Assert.geq, 0),
-    )
-    num_tokens_per_shard: int = Field(
-        default=1000000000,
-        desc="Approximate number of tokens per shard.",
-        hint=FieldHint.optional,
-        valid=check_field(Assert.geq, 1),
-    )
-    dataset_config_name: None | str = Field(
-        default=None,
-        desc="Specific configuration name for the dataset.",
-        hint=FieldHint.optional,
-    )
-    dataset_split: str = Field(
-        default="train",
-        desc="Split of the dataset to use.",
-        hint=FieldHint.optional,
-    )
-    dataset_field: str = Field(
-        default="text",
-        desc="Field of the dataset to use.",
-        hint=FieldHint.optional,
-    )
-    dataset_dtype: DataType = Field(
-        default=None,
-        desc="Data type of the dataset field.",
-        hint=FieldHint.derived,
-    )
-    rank: int = Field(
-        default=0,
-        desc="Rank of the process for distributed processing.",
-        hint=FieldHint.optional,
-    )
-    world_size: int = Field(
-        default=1,
-        desc="Total number of processes in distributed processing.",
-        hint=FieldHint.optional,
-    )
-    distributed_backend: str = Field(
-        default="gloo",
-        desc="Distributed backend for distributed processing.",
-        hint=FieldHint.optional,
-    )
-
-    @cached_property
-    def tokenizer(self):
-        return AutoTokenizer.from_pretrained(self.tokenizer_path_or_name)
-
-    def _validate(self):
-        if self.dataset_dtype is None:
-            # Decide the dtype based on the tokenizer vocabulary size
-            vocab_size = len(self.tokenizer)
-
-            if vocab_size <= np.iinfo(np.int8).max:
-                self.dataset_dtype = DataType.int8
-            elif vocab_size <= np.iinfo(np.int16).max:
-                self.dataset_dtype = DataType.int16
-            elif vocab_size <= np.iinfo(np.int32).max:
-                self.dataset_dtype = DataType.int32
-            elif vocab_size <= np.iinfo(np.int64).max:
-                self.dataset_dtype = DataType.int64
-            else:
-                raise ValueError(
-                    f"Tokenizer vocabulary size {vocab_size} is too large for supported dtypes in MMapIndexedDataset."
-                )
-        super()._validate()
-
-    def _tokenize_text(self, text):
-        tokens = self.tokenizer(
-            text,
-            truncation=False,
-            padding=False,
-            add_special_tokens=True,
-        )["input_ids"]
-        return np.array(tokens, dtype=self.dataset_dtype.numpy)
-
-    def _tokenize_batch(self, batch):
-        input_ids = [self._tokenize_text(text) for text in batch[self.dataset_field]]
-        num_tokens = [len(x) for x in input_ids]
-        return {
-            "input_ids": input_ids,
-            "num_tokens": num_tokens,
-        }
-
-    def _save_shard(self, args) -> dict:
-        shard_idx, shard_dataset = args
-        prefix = f"shard_{self.rank}_{shard_idx}"
-        shard_output_path = Path(self.output_dir) / prefix
-        documents = [
-            np.array(item["input_ids"], dtype=self.dataset_dtype.numpy)
-            for item in tqdm(shard_dataset, desc=f"Saving shard {shard_idx}", unit="docs")
-        ]
-        MMapIndexedDataset.write_dataset(prefix=shard_output_path, documents=documents)
-        dataset_dict = {
-            "prefix": prefix,
-            "num_documents": len(documents),
-            "num_tokens": sum(len(doc) for doc in documents),
-        }
-        return dataset_dict
-
-    def run(self):
-        # Initialize distributed processing
-        if self.world_size > 1:
-            dist.init_process_group(backend=self.distributed_backend, rank=self.rank, world_size=self.world_size)
-
-        # Prepare output directory
-        os.makedirs(self.output_dir, exist_ok=True)
-
-        # Download dataset
-        download_dir = Path(self.output_dir) / "downloaded_dataset"
-        if self.rank == 0:
-            load_dataset(
-                path=self.dataset_name_or_path,
-                name=self.dataset_config_name,
-                split=self.dataset_split,
-                num_proc=self.num_processes_load,
-                trust_remote_code=True,
-            ).save_to_disk(download_dir, num_proc=self.num_processes_save)
-
-        # Synchronize processes to wait for the download
-        if self.world_size > 1:
-            dist.barrier()
-
-        # Load and shard the dataset
-        dataset = load_from_disk(download_dir).shard(num_shards=self.world_size, index=self.rank)
-        if self.dataset_field not in dataset.column_names:
-            raise ValueError(f"Dataset does not have field '{self.dataset_field}'.")
-
-        # Tokenize the dataset
-        tokenized_dataset = dataset.map(
-            self._tokenize_batch,
-            batched=True,
-            num_proc=self.num_processes_map,
-            desc="Tokenizing batches",
-        )
-
-        # Calculate total number of tokens
-        total_tokens = sum(tqdm(tokenized_dataset["num_tokens"], desc="Counting tokens", unit="tokens"))
-
-        # Split dataset into shards
-        num_shards = int(np.ceil(total_tokens / self.num_tokens_per_shard))
-        shards = [
-            (i, tokenized_dataset.shard(num_shards=num_shards, index=i))
-            for i in tqdm(range(num_shards), desc="Creating shards")
-        ]
-
-        # Use multiprocessing to save each shard in parallel
-        with Pool(processes=self.num_processes_save) as pool:
-            dataset_dicts = pool.map(self._save_shard, shards)
-
-        # Gather dataset_dicts from all ranks to rank 0
-        if self.world_size > 1:
-            all_dataset_dicts = [None] * self.world_size
-            dist.gather_object(dataset_dicts, all_dataset_dicts, dst=0)
-            if self.rank == 0:
-                dataset_dicts = [item for sublist in all_dataset_dicts for item in sublist]
-
-        # Create a metadata file
-        if self.rank == 0:
-            total_tokens = sum(dataset_dict["num_tokens"] for dataset_dict in dataset_dicts)
-            for dataset_dict in dataset_dicts:
-                dataset_dict["weight"] = float(dataset_dict["num_tokens"]) / float(total_tokens)
-            output_file = Path(self.output_dir) / "fast_llm_dataset.json"
-            json.dump({"datasets": dataset_dicts}, output_file.open("w"))
-
-        # Finalize distributed processing
-        if self.world_size > 1:
-            dist.barrier()
-            dist.destroy_process_group()
-
-
-if __name__ == "__main__":
-    PrepareDatasetConfig.parse_and_run()

From 223bab07546d1b05eaadb4c5346bf2bbbabfd9f3 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Tue, 12 Nov 2024 20:48:31 -0500
Subject: [PATCH 77/87] clean up

---
 setup.cfg | 1 -
 1 file changed, 1 deletion(-)

diff --git a/setup.cfg b/setup.cfg
index 14e2618c4..5429dc919 100644
--- a/setup.cfg
+++ b/setup.cfg
@@ -45,7 +45,6 @@ OPTIONAL =
 DEV =
     pytest>=8.3.2
     pytest-depends>=1.0.1
-    hypothesis>=6.118.1
 
 # Required for building the documentation
 DOCS =

From 94008ea555fa22191f481537a8dca4bc53b12531 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Wed, 13 Nov 2024 14:22:24 -0500
Subject: [PATCH 78/87] wip

---
 docs/quick-start.md | 430 ++++++++++++++++----------------------------
 1 file changed, 151 insertions(+), 279 deletions(-)

diff --git a/docs/quick-start.md b/docs/quick-start.md
index 744286532..7425fab7a 100644
--- a/docs/quick-start.md
+++ b/docs/quick-start.md
@@ -148,6 +148,55 @@ First, choose your environment. You can use Docker, your local environment, Slur
     kubectl apply -f pvc-fast-llm-results.yaml
     ```
 
+    We also need to create a temporary pod that mounts the inputs and results PVCs so that we can copy files to and from them. Here's a basic YAML configuration for such a pod:
+
+    ```yaml
+    # Temporary pod to manage input data and results
+    apiVersion: v1
+    kind: Pod
+    metadata:
+      name: fast-llm-data-management
+    spec:
+      containers:
+        - name: fast-llm-data-management-container
+          image: ubuntu
+          command: ["sleep", "infinity"]
+          volumeMounts:
+            - mountPath: /mnt/inputs
+              name: inputs
+            - mountPath: /mnt/results
+              name: results
+      volumes:
+        - name: inputs
+          persistentVolumeClaim:
+            claimName: pvc-fast-llm-inputs
+        - name: results
+          persistentVolumeClaim:
+            claimName: pvc-fast-llm-results
+    ```
+
+    Save this configuration to a file named `pod-fast-llm-data-management.yaml`. Next, apply this configuration to your Kubernetes cluster:
+
+    ```bash
+    kubectl apply -f pod-fast-llm-data-management.yaml
+    ```
+
+    This pod will allow you to copy files to and from the inputs and results PVCs. You can access it by running:
+
+    ```bash
+    kubectl exec -it fast-llm-data-management -- /bin/bash
+    ```
+
+    !!! note "Cleaning up unused resources"
+    
+        At the very end of this guide, clean up the data management pod to avoid unnecessary resource consumption by running:
+
+        ```bash
+        kubectl delete pod fast-llm-data-management
+        ```
+
+        Don't run this just yet, though. You'll need this pod throughout the guide.
+
 ## Step 2: Choose Your Model 🤖
 
 Fast-LLM supports many GPT variants, including (but not limited to) Llama, Mistral, and Mixtral. For this tutorial, let's train a Llama model with data parallelism. You can choose from two models:
@@ -179,54 +228,19 @@ Fast-LLM supports many GPT variants, including (but not limited to) Llama, Mistr
 
     === "Kubernetes"
 
-        We need to create a temporary pod that mounts the inputs PVC and allows us to download the model. Here's a basic YAML configuration for such a pod:
-
-        ```yaml
-        apiVersion: v1
-        kind: Pod
-        metadata:
-          name: clone-model
-        spec:
-          containers:
-            - name: clone-model-container
-              image: ubuntu
-              command: ["sleep", "infinity"]
-              volumeMounts:
-                - mountPath: /mnt/inputs
-                  name: inputs
-          volumes:
-            - name: inputs
-              persistentVolumeClaim:
-                claimName: pvc-fast-llm-inputs
-        ```
-
-        Save this configuration to a file named `clone-model-pod.yaml`. Next, apply this configuration to your Kubernetes cluster:
-
-        ```bash
-        kubectl apply -f clone-model-pod.yaml
-        ```
-
-        Now, enter the pod, log in to your Hugging Face account, and clone the model:
-
         ```bash
-        kubectl exec -it clone-model -- /bin/bash
+        kubectl exec -it fast-llm-data-management -- /bin/bash
         git lfs install
         git clone https://huggingface.co/HuggingFaceTB/SmolLM2-135M /mnt/inputs/SmolLM2-135M
         ```
 
-        Finally, clean up the temporary pod, it's no longer needed:
-
-        ```bash
-        kubectl delete pod clone-model
-        ```
-
 === "Llama-3.2-1B"
 
     Llama is a larger model with 1B parameters. It's more powerful but requires more resources to train. We'll grab the model from the Huggingface Hub and save it to our inputs folder.
     
     !!! note "Access Required"
     
-        Meta gates access to the Llama model. You need to request access to the model from Meta before you can download it at https://huggingface.co/meta-llama/Llama-3.2-1B.
+        Meta gates access to their Llama models. You need to request access to the model from Meta before you can download it at https://huggingface.co/meta-llama/Llama-3.2-1B.
 
     === "Docker"
 
@@ -278,47 +292,19 @@ Fast-LLM supports many GPT variants, including (but not limited to) Llama, Mistr
     
     === "Kubernetes"
     
-        We need to create a temporary pod that mounts the inputs PVC and allows us to download the model. Here's a basic YAML configuration for such a pod:
-
-        ```yaml
-        apiVersion: v1
-        kind: Pod
-        metadata:
-          name: clone-model
-        spec:
-          containers:
-            - name: clone-model-container
-              image: ubuntu
-              command: ["sleep", "infinity"]
-              volumeMounts:
-                - mountPath: /mnt/inputs
-                  name: inputs
-          volumes:
-            - name: inputs
-              persistentVolumeClaim:
-                claimName: pvc-fast-llm-inputs
-        ```
-
-        Save this configuration to a file named `clone-model-pod.yaml`. Next, apply this configuration to your Kubernetes cluster:
-
-        ```bash
-        kubectl apply -f clone-model-pod.yaml
-        ```
-
-        Now, enter the pod, log in to your Hugging Face account, and clone the model:
+        First, sign in to your Hugging Face account:
 
         ```bash
-        kubectl exec -it clone-model -- /bin/bash
+        kubectl exec -it fast-llm-data-management -- /bin/bash
         pip install huggingface_hub
         huggingface-cli login
-        git lfs install
-        git clone https://huggingface.co/meta-llama/Llama-3.2-1B /mnt/inputs/Llama-3.2-1B
         ```
-
-        Finally, clean up the temporary pod, it's no longer needed:
+        
+        Then, clone the model:
 
         ```bash
-        kubectl delete pod clone-model
+        git lfs install
+        git clone https://huggingface.co/meta-llama/Llama-3.2-1B /mnt/inputs/Llama-3.2-1B
         ```
 
 !!! tip "Model Size Matters"
@@ -329,237 +315,123 @@ Fast-LLM supports many GPT variants, including (but not limited to) Llama, Mistr
 
 For this tutorial, we'll use 9B tokens of text from the [OpenWebText](https://skylion007.github.io/OpenWebTextCorpus/) dataset. This dataset is a free approximation of the WebText data OpenAI used for GPT-2, and it's perfect for our test run!
 
+Create a configuration file for the dataset preparation. Copy the following content:
+
 === "SmolLM2-135M"
 
-    === "Docker"
+    ```yaml
+    output_path: /mnt/inputs/openwebtext-SmolLM2
 
-        We've got a script that'll download and preprocess the dataset for you. Run it like this:
+    loading_workers: 4
+    tokenize_workers: 4
+    saving_workers: 4
 
-        ```bash
-        docker run -it --rm ghcr.io/servicenow/fast-llm:latest \
-            -v ~/inputs:/mnt/inputs \
-            python tools/prepare_dataset.py \
-            tokenizer_path_or_name="HuggingFaceTB/SmolLM2-135M" \
-            dataset_name_or_path="openwebtext" \
-            dataset_split="train" \
-            output_dir="/mnt/inputs" \
-            num_processes_load=4 \
-            num_processes_map=4 \
-            num_processes_save=4 \
-            num_tokens_per_shard=100000000
-        ```
-    
-    === "Local Environment"
+    dataset:
+      path: openwebtext
 
-        Fast-LLM ships with a [script](https://github.com/ServiceNow/Fast-LLM/blob/main/tools/prepare_dataset.py) that downloads and preprocesses the dataset for you. Download and run it like this:
+    tokenizer:
+      path: /mnt/inputs/SmolLM2-135M/tokenizer.json
 
-        ```bash
-        curl -O https://raw.githubusercontent.com/ServiceNow/Fast-LLM/main/tools/prepare_dataset.py
-        python prepare_dataset.py \
-            tokenizer_path_or_name="HuggingFaceTB/SmolLM2-135M" \
-            dataset_name_or_path="openwebtext" \
-            dataset_split="train" \
-            output_dir="/mnt/inputs" \
-            num_processes_load=4 \
-            num_processes_map=4 \
-            num_processes_save=4 \
-            num_tokens_per_shard=100000000
-        ```
-    
-    === "Slurm"
-
-        Fast-LLM has got you covered with a script that'll download and preprocess the dataset for you. Run it like this:
-
-        ```bash
-        sbatch <<EOF
-        #!/bin/bash
-        # SBATCH --nodes=1
-        # SBATCH --ntasks-per-node=1
-        # SBATCH --exclusive
-        # SBATCH --output=/mnt/outputs/job_output.log
-        # SBATCH --error=/mnt/outputs/job_error.log
-
-        srun \
-            --container-image="ghcr.io/servicenow/fast-llm:latest" \
-            --container-mounts="${HOME}/inputs:/mnt/inputs,${HOME}/results:/mnt/results" \
-            --ntasks-per-node=$SLURM_NTASKS_PER_NODE \
-            bash -c "
-                python tools/prepare_dataset.py \
-                    tokenizer_path_or_name='HuggingFaceTB/SmolLM2-135M' \
-                    dataset_name_or_path='openwebtext' \
-                    dataset_split='train' \
-                    output_dir='/mnt/inputs' \
-                    num_processes_load=4 \
-                    num_processes_map=4 \
-                    num_processes_save=4 \
-                    num_tokens_per_shard=100000000"
-        EOF
-        ```
-
-        You can follow the job's progress by running `squeue -u $USER` and checking the logs in `~/results/job_output.log` and `~/results/job_error.log`.
-    
-    === "Kubernetes"
+    remove_downloads: false
+    ```
 
-        Fast-LLM comes with a script that'll download and preprocess the dataset for you. We will run this script in a Kubernetes job. Here's a basic configuration for the job:
+=== "Llama-3.2-1B"
 
-        ```yaml
-        apiVersion: batch/v1
-        kind: Job
-        metadata:
-          name: prepare-dataset
-        spec:
-          template:
-            spec:
-              containers:
-                - name: prepare-dataset
-                  image: ghcr.io/servicenow/fast-llm:latest
-                  command: ["python", "tools/prepare_dataset.py"]
-                  args:
-                    - tokenizer_path_or_name=HuggingFaceTB/SmolLM2-135M
-                    - dataset_name_or_path=openwebtext
-                    - dataset_split=train
-                    - output_dir=/mnt/inputs
-                    - num_processes_load=4
-                    - num_processes_map=4
-                    - num_processes_save=4
-                    - num_tokens_per_shard=100000000
-                  resources:
-                    requests:
-                      cpu: 4
-                  volumeMounts:
-                    - name: inputs
-                      mountPath: /mnt/inputs
-              volumes:
-                - name: inputs
-                  persistentVolumeClaim:
-                    claimName: pvc-fast-llm-inputs
-        ```
+    ```yaml
+    output_path: /mnt/inputs/openwebtext-Llama
 
-        Save this configuration to a file named `prepare-dataset-job.yaml` and apply it to your Kubernetes cluster:
+    loading_workers: 4
+    tokenize_workers: 4
+    saving_workers: 4
 
-        ```bash
-        kubectl apply -f prepare-dataset-job.yaml
-        ```
+    dataset:
+      path: openwebtext
+    
+    tokenizer:
+      path: /mnt/inputs/Llama-3.2-1B/tokenizer.json
+    
+    remove_downloads: false
+    ```
 
-        You can follow the job's progress by running `kubectl get pods` and checking the logs with `kubectl logs prepare-dataset`.
+Save this file as `prepare-config.yaml` in your inputs folder.
 
-=== "Llama-3.2-1B"
+Fast-LLM ships with a `prepare` command that'll download and preprocess the dataset for you. Run it like this:
 
-    === "Docker"
+=== "Docker"
 
-        We've got a script that'll download and preprocess the dataset for you. Run it like this:
+    ```bash
+    docker run -it --rm -v ~/inputs:/mnt/inputs \
+        ghcr.io/servicenow/fast-llm:latest \
+        fast-llm prepare --config /mnt/inputs/prepare-config.yaml
+    ```
 
-        ```bash
-        docker run -it --rm ghcr.io/servicenow/fast-llm:latest \
-            -v ~/inputs:/mnt/inputs \
-            python tools/prepare_dataset.py \
-            tokenizer_path_or_name="meta-llama/Llama-3.2-1B" \
-            dataset_name_or_path="openwebtext" \
-            dataset_split="train" \
-            output_dir="inputs" \
-            num_processes_load=4 \
-            num_processes_map=4 \
-            num_processes_save=4 \
-            num_tokens_per_shard=100000000
-        ```
-    
-    === "Local Environment"
+=== "Local Environment"
 
-        Fast-LLM ships with a [script](https://github.com/ServiceNow/Fast-LLM/blob/main/tools/prepare_dataset.py) that downloads and preprocesses the dataset for you. Download and run it like this:
+    ```bash
+    fast-llm prepare --config /mnt/inputs/prepare-config.yaml
+    ```
 
-        ```bash
-        curl -O https://raw.githubusercontent.com/ServiceNow/Fast-LLM/main/tools/prepare_dataset.py
-        python prepare_dataset.py \
-            tokenizer_path_or_name="meta-llama/Llama-3.2-1B" \
-            dataset_name_or_path="openwebtext" \
-            dataset_split="train" \
-            output_dir="/mnt/inputs" \
-            num_processes_load=4 \
-            num_processes_map=4 \
-            num_processes_save=4 \
-            num_tokens_per_shard=100000000
-        ```
+=== "Slurm"
 
-    === "Slurm"
+    ```bash
+    sbatch <<'EOF'
+    #!/bin/bash
+    #SBATCH --nodes=1
+    #SBATCH --ntasks-per-node=1
+    #SBATCH --exclusive
+    #SBATCH --output=/mnt/results/job_output.log
+    #SBATCH --error=/mnt/results/job_error.log
 
-        Fast-LLM has got you covered with a script that'll download and preprocess the dataset for you. Run it like this:
+    srun \
+        --container-image="ghcr.io/servicenow/fast-llm:latest" \
+        --container-mounts="${HOME}/inputs:/mnt/inputs" \
+        --ntasks-per-node=$SLURM_NTASKS_PER_NODE \
+        bash -c "fast-llm prepare --config /mnt/inputs/prepare-config.yaml"
+    EOF
+    ```
 
-        ```bash
-        sbatch <<EOF
-        #!/bin/bash
-        # SBATCH --nodes=1
-        # SBATCH --ntasks-per-node=1
-        # SBATCH --exclusive
-        # SBATCH --output=/mnt/outputs/job_output.log
-        # SBATCH --error=/mnt/outputs/job_error.log
-
-        srun \
-            --container-image="ghcr.io/servicenow/fast-llm:latest" \
-            --container-mounts="${HOME}/inputs:/mnt/inputs,${HOME}/results:/mnt/results" \
-            --ntasks-per-node=$SLURM_NTASKS_PER_NODE \
-            bash -c "
-                python tools/prepare_dataset.py \
-                    tokenizer_path_or_name='meta-llama/Llama-3.2-1B' \
-                    dataset_name_or_path='openwebtext' \
-                    dataset_split='train' \
-                    output_dir='/mnt/inputs' \
-                    num_processes_load=4 \
-                    num_processes_map=4 \
-                    num_processes_save=4 \
-                    num_tokens_per_shard=100000000"
-        EOF
-        ```
+    You can follow the job's progress by running `squeue -u $USER` and checking the logs in `job_output.log` and `job_error.log` in your results folder.
 
-        You can follow the job's progress by running `squeue -u $USER` and checking the logs in `~/results/job_output.log` and `~/results/job_error.log`.
+=== "Kubernetes"
 
-    === "Kubernetes"
+    ```bash
+    kubectl apply -f prepare-job.yaml
+    ```
 
-        Fast-LLM comes with a script that'll download and preprocess the dataset for you. We will run this script in a Kubernetes job. Here's a basic configuration for the job:
+    where `prepare-job.yaml` is a file containing the following configuration:
 
-        ```yaml
-        apiVersion: batch/v1
-        kind: Job
-        metadata:
-          name: prepare-dataset
+    ```yaml
+    apiVersion: batch/v1
+    kind: Job
+    metadata:
+      name: fast-llm-prepare
+    spec:
+      template:
         spec:
-          template:
-            spec:
-              containers:
-                - name: prepare-dataset
-                  image: ghcr.io/servicenow/fast-llm:latest
-                  command: ["python", "tools/prepare_dataset.py"]
-                  args:
-                    - tokenizer_path_or_name=meta-llama/Llama-3.2-1B
-                    - dataset_name_or_path=openwebtext
-                    - dataset_split=train
-                    - output_dir=/mnt/inputs
-                    - num_processes_load=4
-                    - num_processes_map=4
-                    - num_processes_save=4
-                    - num_tokens_per_shard=100000000
-                  resources:
-                    requests:
-                      cpu: 4
-                  volumeMounts:
-                    - name: inputs
-                      mountPath: /mnt/inputs
-              volumes:
+          containers:
+            - name: fast-llm-prepare-container
+              image: ghcr.io/servicenow/fast-llm:latest
+              command: ["fast-llm", "prepare"]
+              args:
+                - "--config"
+                - "/mnt/inputs/prepare-config.yaml"
+              resources:
+                requests:
+                  cpu: 4
+              volumeMounts:
                 - name: inputs
-                  persistentVolumeClaim:
-                    claimName: pvc-fast-llm-inputs
-        ```
-
-        Save this configuration to a file named `prepare-dataset-job.yaml` and apply it to your Kubernetes cluster:
-
-        ```bash
-        kubectl apply -f prepare-dataset-job.yaml
-        ```
+                  mountPath: /mnt/inputs
+          volumes:
+            - name: inputs
+              persistentVolumeClaim:
+                claimName: pvc-fast-llm-inputs
+    ```
 
-        You can follow the job's progress by running `kubectl get pods` and checking the logs with `kubectl logs prepare-dataset`.
+    You can follow the job's progress by running `kubectl get pods` and checking the logs with `kubectl logs fast-llm-prepare`.
 
 !!! info "What's Happening Here?"
 
-    The `prepare_dataset.py` script will grab the OpenWebText data from the Huggingface Hub, tokenize it, and save it in 91 shards of 100M tokens each to the input folder. Expect around 2 hours for the whole thing to finish, mainly due to tokenization. If you've got more CPU cores, try upping `num_processes_*` to speed things up.
+    The `prepare` command will grab the OpenWebText data from the Huggingface Hub, tokenize it, and save it in 91 shards of 100M tokens each to the input folder. Expect around 2 hours for the whole thing to finish, mainly due to tokenization. If you've got more CPU cores, try upping the number of workers to speed things up.
 
 !!! tip "Use a Smaller Dataset for Testing"
 
@@ -567,7 +439,7 @@ For this tutorial, we'll use 9B tokens of text from the [OpenWebText](https://sk
 
 ## Step 4: Configure Fast-LLM ⚙️
 
-Next, we'll create a configuration file for Fast-LLM. Save the following as `~/inputs/fast-llm-config.yaml`:
+Next, we'll create a configuration file for Fast-LLM. Save the following as `train-config.yaml` in your inputs folder:
 
 === "SmolLM2-135M"
 
@@ -596,7 +468,7 @@ Next, we'll create a configuration file for Fast-LLM. Save the following as `~/i
       batch_size: 480  # (5)!
     data:
       format: file
-      path: /mnt/inputs/openwebtext/fast_llm_dataset.json  # (6)!
+      path: /mnt/inputs/openwebtext-SmolLM2/fast_llm_dataset.json  # (6)!
       split: [99, 1, 0]  # (7)!
     optimizer: # (8)!
       weight_decay: 0.1
@@ -621,7 +493,7 @@ Next, we'll create a configuration file for Fast-LLM. Save the following as `~/i
       distributed:
         training_dtype: bf16  # (14)!
     run:
-      experiment_dir: /mnt/results
+      experiment_dir: /mnt/results/SmolLM2-135M
     ```
 
     1.  Total number of training tokens will be approximately 300B.
@@ -666,9 +538,9 @@ Next, we'll create a configuration file for Fast-LLM. Save the following as `~/i
       batch_size: 480  # (5)!
     data:
       format: file
-      path: /mnt/inputs/fast_llm_dataset.json  # (6)!
+      path: /mnt/inputs/openwebtext-Llama/fast_llm_dataset.json  # (6)!
       split: [99, 1, 0]  # (7)!
-    optimizer: # (8)!
+    optimizer:  # (8)!
       weight_decay: 0.1
       beta_1: 0.9
       beta_2: 0.95
@@ -680,7 +552,7 @@ Next, we'll create a configuration file for Fast-LLM. Save the following as `~/i
         warmup_iterations: 2000
     pretrained:
       format: llama  # (10)!
-      path: /mnt/inputs
+      path: /mnt/inputs/Llama-3.2-1B
       model_weights: yes  # (11)!
     model:
       base_model:
@@ -691,7 +563,7 @@ Next, we'll create a configuration file for Fast-LLM. Save the following as `~/i
       distributed:
         training_dtype: bf16  # (14)!
     run:
-      experiment_dir: /mnt/results
+      experiment_dir: /mnt/results/Llama-3.2-1B
     ```
 
     1.  Total number of training tokens will be approximately 300B.

From 763d84332dabed4f35452e465cfe4f3fb202eb52 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Wed, 13 Nov 2024 15:24:43 -0500
Subject: [PATCH 79/87] update dependencies

---
 docs/README.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
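
Reviewer note (not part of the commit): this patch documents a system-level `libcairo` requirement for building the docs. The sketch below is one way to check for it before running the build. It is not part of Fast-LLM or mkdocs-material; the helper name `cairo_available` is hypothetical, and it uses only the Python standard library. A negative result is a hint rather than proof, since the loader may miss libraries installed in non-standard locations.

```python
import ctypes.util


def cairo_available() -> bool:
    """Return True if the dynamic loader can locate a Cairo shared library."""
    # find_library returns a library name or path, or None when nothing is found.
    return ctypes.util.find_library("cairo") is not None


if __name__ == "__main__":
    if cairo_available():
        print("libcairo found")
    else:
        print("libcairo not found; install it before building the docs")
```

If the check fails even though Cairo is installed, the image-processing requirements page linked in the patch is the place to look for platform-specific setup.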

diff --git a/docs/README.md b/docs/README.md
index 704c0d316..1db83b67c 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -13,9 +13,11 @@ To build and preview the documentation locally, follow these simple steps:
 1.  **Install the necessary dependencies:**
 
     ```bash
-    pip install -e ".[DOCS]"
+    pip install --no-build-isolation -e ".[DOCS]"
     ```
 
+    You also need `libcairo` installed on your system for image processing. See <https://squidfunk.github.io/mkdocs-material/plugins/requirements/image-processing/> for installation instructions.
+
 2.  **Build the documentation:**
 
     ```bash

From af3f1f042b8c259a2c0b3a10f1eff820f6423c05 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Thu, 14 Nov 2024 13:25:37 -0500
Subject: [PATCH 80/87] wip

---
 docs/quick-start.md | 65 ++++++++++++++++++++++++---------------------
 1 file changed, 35 insertions(+), 30 deletions(-)

diff --git a/docs/quick-start.md b/docs/quick-start.md
index 7425fab7a..0a6158b35 100644
--- a/docs/quick-start.md
+++ b/docs/quick-start.md
@@ -8,12 +8,12 @@ This guide will get you up and running with Fast-LLM on a single machine. Let's
 
 To follow this guide, you'll need:
 
--   **Hardware**: At least one NVIDIA GPU with Volta architecture or newer. For optimal results in this tutorial, we recommend 8 A100 GPUs or better. 🤑
+-   **Hardware**: At least one NVIDIA GPU with Volta architecture or newer. We wrote this guide with an 8-GPU machine of Ampere or Hopper architecture in mind.
 -   **Software**:
     -   **Docker** (if using the Docker setup), or
     -   **Local Environment**: PyTorch 2.2 or later, CUDA 12.1 or later, and APEX AMP (if building from source), or
-    -   **Cluster Setup**: Access to a Slurm or Kubernetes cluster.
--   **Time**: The initial setup and training process requires some patience. 😊
+    -   **Cluster Setup**: Access to a Kubernetes or Docker-enabled Slurm cluster.
+-   **Time**: The initial setup and training process requires a little patience. 😊
 
 ## Step 1: Initial Setup 🏗 ️
 
@@ -21,7 +21,7 @@ First, choose your environment. You can use Docker, your local environment, Slur
 
 === "Docker"
 
-    You selected Docker for this tutorial. We'll use the Fast-LLM Docker image to train our model, which includes all the necessary dependencies. Grab the pre-built Fast-LLM Docker image:
+    You selected Docker for this tutorial. We'll use the Fast-LLM Docker image to train our model, which includes all the necessary dependencies. Grab the [pre-built Fast-LLM Docker image](https://github.com/ServiceNow/Fast-LLM/pkgs/container/fast-llm) from GitHub's container registry (GHCR):
 
     ```bash
     docker pull ghcr.io/servicenow/fast-llm:latest
@@ -35,7 +35,7 @@ First, choose your environment. You can use Docker, your local environment, Slur
 
 === "Local Environment"
 
-    You selected to use your local environment to run Fast-LLM. You should have a machine with at least one NVIDIA GPU with Volta architecture or newer. We need to install Fast-LLM and its dependencies in your environment. Our Fast-LLM docker image already includes all this, and we recommend using it for simplicity and reproducibility. If you still want to install Fast-LLM in your local environment, follow the steps below.
+    You're setting up Fast-LLM in your machine's local environment. This means you'll need to install Fast-LLM and its dependencies. For simplicity and reproducibility, we recommend using the Fast-LLM Docker image instead. It's preconfigured with everything you need. But if you're set on a local installation, follow the steps below.
 
     Fast-LLM depends on [CUDA](https://developer.nvidia.com/about-cuda) 12.1 or later, [PyTorch](https://pytorch.org) 2.2 or later, [APEX](https://github.com/NVIDIA/apex?tab=readme-ov-file#installation), and [OpenAI Triton](https://github.com/triton-lang/triton). Follow the instructions on their respective websites to install them. If you use [conda](https://docs.conda.io/projects/conda/en/latest/index.html), you can create a new environment and install these dependencies in it.
     
@@ -93,7 +93,7 @@ First, choose your environment. You can use Docker, your local environment, Slur
 
 === "Slurm"
 
-    You selected Docker-enabled [Slurm](https://slurm.schedmd.com/) for this tutorial. The Slurm setup requires a Slurm cluster with at least one node and one GPU of Volta architecture or newer. Slurm will use the `ghcr.io/servicenow/fast-llm:latest` Docker image to train our model. It will need a shared file system for input data and output results. We will assume that your home directory is shared across all nodes.
+    You've chosen Docker-enabled [Slurm](https://slurm.schedmd.com/) for this tutorial. Slurm will pull the `ghcr.io/servicenow/fast-llm:latest` Docker image to train the model. Just make sure there's a shared file system for both input data and output results. We'll assume your home directory is accessible across all nodes.
 
     Let's create a folder to store our input data and output results in the shared home directory:
 
@@ -103,10 +103,10 @@ First, choose your environment. You can use Docker, your local environment, Slur
 
 === "Kubernetes"
 
-    You selected to use [Kubernetes](https://kubernetes.io/) with [KubeFlow](https://www.kubeflow.org/) for this tutorial. We will use a `PyTorchJob` resource to train our model with the `ghcr.io/servicenow/fast-llm:latest` Docker image and store our input data and output results in shared [persistent volume claims](https://kubernetes.io/docs/concepts/storage/persistent-volumes/) (PVCs). The Kubernetes cluster should have at least one node and one GPU of Volta architecture or newer.
+    You've chosen [Kubernetes](https://kubernetes.io/) with [Kubeflow](https://www.kubeflow.org/) for this tutorial. We will use a `PyTorchJob` resource to train our model with the `ghcr.io/servicenow/fast-llm:latest` Docker image and store our input data and output results in shared [persistent volume claims](https://kubernetes.io/docs/concepts/storage/persistent-volumes/) (PVCs).
 
     Let's now create two PVCs named `pvc-fast-llm-inputs` and `pvc-fast-llm-results` to store our input data and output results, respectively.
-    
+
     Create a file named `pvc-fast-llm-inputs.yaml` with the following content:
 
     ```yaml
@@ -175,13 +175,13 @@ First, choose your environment. You can use Docker, your local environment, Slur
             claimName: pvc-fast-llm-results
     ```
 
-    Save this configuration to a file named `pod-fast-llm-data-management.yaml`. Next, apply this configuration to your Kubernetes cluster:
+    Save this configuration to a file named `pod-fast-llm-data-management.yaml`. Next, apply this configuration to your Kubernetes cluster to create the pod:
 
     ```bash
     kubectl apply -f pod-fast-llm-data-management.yaml
     ```
 
-    This pod will allow you to copy files to and from the inputs and results PVCs. You can access it by running:
+    The pod will allow you to copy files to and from the inputs and results PVCs. You can access it by running:
 
     ```bash
     kubectl exec -it fast-llm-data-management -- /bin/bash
@@ -195,11 +195,11 @@ First, choose your environment. You can use Docker, your local environment, Slur
         kubectl delete pod fast-llm-data-management
         ```
 
-        Don't run this just yet, though. You'll need this pod throughout the guide.
+        Don't run this just yet, though. You'll need the pod throughout the guide.
 
 ## Step 2: Choose Your Model 🤖
 
-Fast-LLM supports many GPT variants, including (but not limited to) Llama, Mistral, and Mixtral. For this tutorial, let's train a Llama model with data parallelism. You can choose from two models:
+Fast-LLM supports many GPT variants, including (but not limited to) Llama, Mistral, and Mixtral. For this tutorial, you can choose from two models:
 
 === "SmolLM2-135M"
 
@@ -237,7 +237,7 @@ Fast-LLM supports many GPT variants, including (but not limited to) Llama, Mistr
 === "Llama-3.2-1B"
 
     Llama is a larger model with 1B parameters. It's more powerful but requires more resources to train. We'll grab the model from the Huggingface Hub and save it to our inputs folder.
-    
+
     !!! note "Access Required"
     
         Meta gates access to their Llama models. You need to request access to the model from Meta before you can download it at https://huggingface.co/meta-llama/Llama-3.2-1B.
@@ -257,7 +257,7 @@ Fast-LLM supports many GPT variants, including (but not limited to) Llama, Mistr
         git lfs install
         git clone https://huggingface.co/meta-llama/Llama-3.2-1B ~/inputs/Llama-3.2-1B
         ```
-    
+
     === "Local Environment"
 
         First, sign in to your Hugging Face account:
@@ -384,7 +384,7 @@ Fast-LLM ships with a `prepare` command that'll download and preprocess the data
 
     srun \
         --container-image="ghcr.io/servicenow/fast-llm:latest" \
-        --container-mounts="${HOME}/inputs:/mnt/inputs" \
+        --container-mounts="${HOME}/inputs:/mnt/inputs,${HOME}/results:/mnt/results" \
         --ntasks-per-node=$SLURM_NTASKS_PER_NODE \
         bash -c "fast-llm prepare --config /mnt/inputs/prepare-config.yaml"
     EOF
@@ -429,13 +429,9 @@ Fast-LLM ships with a `prepare` command that'll download and preprocess the data
 
     You can follow the job's progress by running `kubectl get pods` and checking the logs with `kubectl logs fast-llm-prepare`.
 
-!!! info "What's Happening Here?"
-
-    The `prepare` command will grab the OpenWebText data from the Huggingface Hub, tokenize it, and save it in 91 shards of 100M tokens each to the input folder. Expect around 2 hours for the whole thing to finish, mainly due to tokenization. If you've got more CPU cores, try upping the number of workers to speed things up.
-
 !!! tip "Use a Smaller Dataset for Testing"
 
-    If you're just testing things out, you can also use a smaller dataset. Replace `openwebtext` with `stas/openwebtext-10k` to use a small subset representing the first 10K records from the original dataset. This will speed up the process and let you see how things work without waiting for hours.
+    The full OpenWebText dataset is large and takes around 2 hours to process. If you're just testing things out, use a smaller dataset instead: replace `openwebtext` with `stas/openwebtext-10k`, a small subset containing the first 10K records of the original dataset. This speeds up the process and lets you see how things work without waiting for hours.
 
 ## Step 4: Configure Fast-LLM ⚙️
 
@@ -470,11 +466,11 @@ Next, we'll create a configuration file for Fast-LLM. Save the following as `tra
       format: file
       path: /mnt/inputs/openwebtext-SmolLM2/fast_llm_dataset.json  # (6)!
       split: [99, 1, 0]  # (7)!
-    optimizer: # (8)!
+    optimizer:  # (8)!
       weight_decay: 0.1
       beta_1: 0.9
       beta_2: 0.95
-      learning_rate: # (9)!
+      learning_rate:  # (9)!
         base: 6.0e-04
         minimum: 6.0e-05
         decay_style: cosine
@@ -603,9 +599,11 @@ Alright, the big moment! Let's launch the training run.
         fast-llm train gpt --config /mnt/inputs/fast-llm-config.yaml
     ```
 
-    Adjust `--nproc_per_node` based on the number of GPUs you have available.
-    Replace `--gpus all` with `--gpus '"device=0,1,2,3,4,5,6,7"'` if you want to use specific GPUs.
-    Remove `-e WANDB_API_KEY_PATH=/mnt/inputs/.wandb_api_key` if you're not using W&B.
+    !!! tip "Customize Your Docker Command"
+
+        * Adjust `--nproc_per_node` based on the number of GPUs you have available.
+        * Replace `--gpus all` with `--gpus '"device=0,1,2,3,4,5,6,7"'` if you want to use specific GPUs.
+        * Remove `-e WANDB_API_KEY_PATH=/mnt/inputs/.wandb_api_key` if you're not using W&B.
 
 === "Local Environment"
 
@@ -618,8 +616,10 @@ Alright, the big moment! Let's launch the training run.
         fast-llm train gpt --config /mnt/inputs/fast-llm-config.yaml
     ```
 
-    Adjust `--nproc_per_node` based on the number of GPUs you have available.
-    Remove `export WANDB_API_KEY_PATH=/mnt/inputs/.wandb_api_key` if you're not using W&B.
+    !!! tip "Customize Your Command"
+
+        * Adjust `--nproc_per_node` based on the number of GPUs you have available.
+        * Remove `export WANDB_API_KEY_PATH=/mnt/inputs/.wandb_api_key` if you're not using W&B.
 
 === "Slurm"
 
@@ -654,8 +654,10 @@ Alright, the big moment! Let's launch the training run.
             --config /mnt/inputs/fast-llm-config.yaml"
     ```
 
-    Change the `--gpus-per-node` value to match the number of GPUs on your node.
-    If you're not using W&B, remove the references to `WANDB_API_KEY_PATH`.
+    !!! tip "Customize Your Slurm Script"
+
+        * Change the `--gpus-per-node` value to match the number of GPUs on your node.
+        * If you're not using W&B, remove the references to `WANDB_API_KEY_PATH`.
 
     Submit the job to the Slurm cluster:
 
@@ -738,7 +740,10 @@ Alright, the big moment! Let's launch the training run.
                     sizeLimit: "1024Gi"
     ```
 
-    Change the `nprocPerNode` value to match the number of GPUs on your node. If you're not using W&B, remove the references to `WARDB_API_KEY_PATH`.
+    !!! tip "Customize Your PyTorchJob"
+
+        * Change the `nprocPerNode` value to match the number of GPUs on your node.
+        * If you're not using W&B, remove the references to `WANDB_API_KEY_PATH`.
 
     Submit the job to the Kubernetes cluster:
 

From cc6ae8b91602cf30ecd4e8170389f2e15366153d Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Thu, 14 Nov 2024 13:28:59 -0500
Subject: [PATCH 81/87] revert changes

---
 .github/ISSUE_TEMPLATE/bug_report.md |  1 +
 .github/PULL_REQUEST_TEMPLATE.md     | 54 ++++++++++++++--------------
 2 files changed, 28 insertions(+), 27 deletions(-)

diff --git a/.github/ISSUE_TEMPLATE/bug_report.md b/.github/ISSUE_TEMPLATE/bug_report.md
index 6f40cfda7..72e68d695 100644
--- a/.github/ISSUE_TEMPLATE/bug_report.md
+++ b/.github/ISSUE_TEMPLATE/bug_report.md
@@ -108,6 +108,7 @@ echo "=== END OF ENVIRONMENT INFORMATION ==="
 # 📝 Additional Context
 
 Include any other information that may help us understand the issue, such as:
+
 - Recent changes to the configuration or code.
 - Whether the issue occurs consistently or intermittently.
 - Any troubleshooting steps you have already tried.
diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md
index 77cb0630a..43cca0b59 100644
--- a/.github/PULL_REQUEST_TEMPLATE.md
+++ b/.github/PULL_REQUEST_TEMPLATE.md
@@ -9,21 +9,21 @@ Closes # <!-- Insert issue number here, if applicable -->
 
 Select all that apply:
 
--   [ ] 🐛 **Bug fix** (non-breaking change that addresses a specific issue)
--   [ ] 🚀 **New feature** (non-breaking change that adds functionality)
--   [ ] ⚠️ **Breaking change** (a change that could affect existing functionality)
--   [ ] 📈 **Performance improvement/optimization** (improves speed, memory usage, or efficiency)
--   [ ] 🛠️ **Code refactor** (non-functional changes that improve code readability, structure, etc.)
--   [ ] 📦 **Dependency bump** (updates dependencies, including Dockerfile or package changes)
--   [ ] 📝 **Documentation change** (updates documentation, including new content or typo fixes)
--   [ ] 🔧 **Infrastructure/Build change** (affects build process, CI/CD, or dependencies)
+- [ ] 🐛 **Bug fix** (non-breaking change that addresses a specific issue)
+- [ ] 🚀 **New feature** (non-breaking change that adds functionality)
+- [ ] ⚠️ **Breaking change** (a change that could affect existing functionality)
+- [ ] 📈 **Performance improvement/optimization** (improves speed, memory usage, or efficiency)
+- [ ] 🛠️ **Code refactor** (non-functional changes that improve code readability, structure, etc.)
+- [ ] 📦 **Dependency bump** (updates dependencies, including Dockerfile or package changes)
+- [ ] 📝 **Documentation change** (updates documentation, including new content or typo fixes)
+- [ ] 🔧 **Infrastructure/Build change** (affects build process, CI/CD, or dependencies)
 
 ## 📝 Changes
 
 List the key changes introduced in this PR:
 
-1.   Change A
-2.   Change B
+1. Change A
+2. Change B
 
 ## ✅ Checklist
 
@@ -31,32 +31,32 @@ Make sure the following tasks are completed before submitting the PR:
 
 ### General
 
--   [ ] 📜 I have read and followed the [contributing guidelines](https://github.com/ServiceNow/Fast-LLM/blob/main/CONTRIBUTING.md).
--   [ ] 🏷️ I am using a clear and descriptive title that follows the [PR title guidelines](https://servicenow.github.io/Fast-LLM/developers/pr-title-guidelines).
--   [ ] 🎉 The functionality is complete, and I have tested the changes.
--   [ ] 📝 I have updated the documentation if needed.
--   [ ] ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
--   [ ] 🧩 I have commented my code, especially in hard-to-understand areas.
+- [ ] 📜 I have read and followed the [contributing guidelines](https://github.com/ServiceNow/Fast-LLM/blob/main/CONTRIBUTING.md).
+- [ ] 🏷️ I am using a clear and descriptive title that follows the [PR title guidelines](https://servicenow.github.io/Fast-LLM/developers/pr-title-guidelines).
+- [ ] 🎉 The functionality is complete, and I have tested the changes.
+- [ ] 📝 I have updated the documentation if needed.
+- [ ] ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
+- [ ] 🧩 I have commented my code, especially in hard-to-understand areas.
 
 ### Dependencies and Configuration
 
--   [ ] 🐋 I have updated the Docker configuration or dependencies, if applicable.
--   [ ] 🔄 I have ensured compatibility with the existing setup after dependency changes.
+- [ ] 🐋 I have updated the Docker configuration or dependencies, if applicable.
+- [ ] 🔄 I have ensured compatibility with the existing setup after dependency changes.
 
 ### Testing
 
--   [ ] 🧪 I have added or updated tests to cover my changes.
--   [ ] ✔️ New and existing tests pass locally with my changes.
--   [ ] 🚦 I have tested these changes on GPUs and verified training stability.
--   [ ] 🏋️ I have tested the changes on realistic training workloads, if applicable.
+- [ ] 🧪 I have added or updated tests to cover my changes.
+- [ ] ✔️ New and existing tests pass locally with my changes.
+- [ ] 🚦 I have tested these changes on GPUs and verified training stability.
+- [ ] 🏋️ I have tested the changes on realistic training workloads, if applicable.
 
 ### Performance Impact
 
--   [ ] 📊 I have run benchmarks where applicable to evaluate the performance impact.
--   [ ] ✅ The benchmarks show no performance regression.
--   [ ] 🚀 The benchmarks indicate a potential performance improvement.
--   [ ] ⚠️ The benchmarks indicate a potential performance degradation.
--   [ ] 📈 I have provided benchmark results and detailed any performance impact below, if applicable.
+- [ ] 📊 I have run benchmarks where applicable to evaluate the performance impact.
+- [ ] ✅ The benchmarks show no performance regression.
+- [ ] 🚀 The benchmarks indicate a potential performance improvement.
+- [ ] ⚠️ The benchmarks indicate a potential performance degradation.
+- [ ] 📈 I have provided benchmark results and detailed any performance impact below, if applicable.
 
 ## 📊 Performance Impact Details
 

From e96c41145f66d6f0e6dd936ba6f22d7fecd6d6d2 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Thu, 14 Nov 2024 14:39:36 -0500
Subject: [PATCH 82/87] wip

---
 .github/PULL_REQUEST_TEMPLATE.md           |   4 +-
 CODE_OF_CONDUCT.md                         |   4 +-
 CONTRIBUTING.md                            |  63 +-----
 docs/about-us.md                           |   8 +-
 docs/developers/best-practices.md          |   0
 docs/developers/contributing.md            |  70 ++++--
 docs/developers/dev-practices.md           |  15 ++
 docs/developers/style-guide.md             |   7 +
 docs/join-us.md                            |  30 +--
 docs/quick-start.md                        |  22 +-
 docs/recipes/continue-training-llama-8b.md | 155 +------------
 docs/recipes/data-preparation.md           | 101 +--------
 docs/recipes/train-llama-8b.md             | 250 +--------------------
 docs/recipes/upcycle-llama-3b-to-moe.md    |   7 +
 docs/reference/configuration.md            |   8 +-
 mkdocs.yaml                                |   3 +-
 16 files changed, 134 insertions(+), 613 deletions(-)
 delete mode 100644 docs/developers/best-practices.md
 create mode 100644 docs/developers/dev-practices.md
 create mode 100644 docs/developers/style-guide.md

diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md
index 43cca0b59..7eb522e11 100644
--- a/.github/PULL_REQUEST_TEMPLATE.md
+++ b/.github/PULL_REQUEST_TEMPLATE.md
@@ -31,8 +31,8 @@ Make sure the following tasks are completed before submitting the PR:
 
 ### General
 
-- [ ] 📜 I have read and followed the [contributing guidelines](https://github.com/ServiceNow/Fast-LLM/blob/main/CONTRIBUTING.md).
-- [ ] 🏷️ I am using a clear and descriptive title that follows the [PR title guidelines](https://servicenow.github.io/Fast-LLM/developers/pr-title-guidelines).
+- [ ] 📜 I have read and followed the [contributing guidelines](https://servicenow.github.io/Fast-LLM/developers/contributing).
+- [ ] 🏷️ I am using a clear and descriptive PR title that summarizes the key change or feature introduced.
 - [ ] 🎉 The functionality is complete, and I have tested the changes.
 - [ ] 📝 I have updated the documentation if needed.
 - [ ] ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md
index e04230bbd..4e623f9f4 100644
--- a/CODE_OF_CONDUCT.md
+++ b/CODE_OF_CONDUCT.md
@@ -25,8 +25,8 @@ ServiceNow suggests the following technical support pathways for open-source pro
 1. Clearly identify and document the issue or question you have.
 2. View the Documentation.
 3. Search the Discussions.
-4. Search the project knowledge base or Wiki for known errors, useful solutions, and troubleshooting tips.
-5. Check the project guidelines in the [`CONTRIBUTING.md`](CONTRIBUTING.md) file if you would like details on how you can submit a change. Community contributions are valued and appreciated!
+4. Search the project documentation for known errors, useful solutions, and troubleshooting tips.
+5. Check the project contribution guidelines if you would like details on how you can submit a change. Community contributions are valued and appreciated!
 6. Log an Issue if it hasn't already been logged. If the issue has already been logged by another user, vote it up, and add a comment with additional or missing information. Do your best to choose the correct category when logging a new issue. This will make it easier to differentiate bugs from new feature requests or ideas. If after logging an issue you find the solution, please close your issue and provide a comment with the solution. This will help the project owners and other users.
 7. Contact the project team contributors of the project to see if they can help as a last resort only.
 
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 9ef1c1856..16580f7d1 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -1,62 +1,3 @@
-# Contributing to Fast-LLM 🚀
+# Contributing to Fast-LLM
 
-Thank you for your interest in contributing to Fast-LLM! We're thrilled to have you here, and your support is invaluable in helping us accelerate LLM training to full speed. This guide will walk you through the steps to contribute, from reporting issues to submitting changes and setting up your development environment.
-
-If you have questions or want to start a discussion, feel free to [open a discussion](https://github.com/ServiceNow/Fast-LLM/discussions) on our GitHub page.
-
-## Getting Started
-
-To get started with contributing to Fast-LLM, follow these steps to set up your environment:
-
-1. **Set Up the Development Environment**: Fast-LLM is built on [PyTorch](https://pytorch.org/) and [Triton](https://triton-lang.org/). Check out our [setup guide](https://servicenow.github.io/Fast-LLM/developers/setup) for instructions on getting everything ready, including the development environment and dependencies.
-2. **Learn Our Best Practices**: Get familiar with our [development best practices](https://servicenow.github.io/Fast-LLM/developers/dev-practices/), which cover code style, pre-commit hooks, and testing strategies.
-3. **Launch Fast-LLM Locally or with Docker**: Need help getting started? Follow the instructions in the [launching section](https://servicenow.github.io/Fast-LLM/developers/launching) to get Fast-LLM up and running.
-
-## How to Report a Bug 🐞
-
-Found a bug? Let's squash it together! [Open an issue](https://github.com/ServiceNow/Fast-LLM/issues/new/choose) and select "Bug report." Please include as much information as possible:
-
-- Steps to reproduce the issue.
-- What you expected to happen versus what actually happened.
-- Logs, Fast-LLM configuration, and error messages.
-- Details about your environment setup (e.g., CUDA hardware, PyTorch version, CUDA version).
-
-If you're familiar with the codebase, consider adding a failing unit test to demonstrate the problem (optional, but helpful!).
-
-## Proposing Changes
-
-Before diving into code, [open an issue](https://github.com/ServiceNow/Fast-LLM/issues) to discuss your proposal. This is especially important if you're planning significant changes or adding new dependencies. Once your idea is approved, follow these steps:
-
-1. **Fork the Repository**: [Fork Fast-LLM](https://github.com/ServiceNow/Fast-LLM/fork) to your own GitHub account.
-2. **Clone Your Fork Locally**: Use `git clone` to bring the code to your local machine.
-3. **Create a New Branch**: Name your branch descriptively, such as `feature/awesome-feature` or `fix/nasty-bug`.
-4. **Make Your Changes**: Work your magic! Don't forget to add or update tests, benchmarks, or configurations as needed.
-5. **Create a Properly Titled Pull Request**: When you're ready to open a PR, make sure to use a clear and descriptive title that follows our [PR title guidelines](https://servicenow.github.io/Fast-LLM/developers/pr-title-guidelines). This title will become the commit message for the squashed merge.
-6. **Push to Your Fork**: Push the branch to your GitHub fork.
-7. **Open a Pull Request**: [Submit a pull request](https://github.com/ServiceNow/Fast-LLM/compare) to the `main` branch. Reference the original issue number and provide a brief summary of your changes.
-
-### Guidelines for a Successful Pull Request
-
-Here are some tips to ensure your pull request gets reviewed and merged promptly:
-
-- **Follow our coding standards**: Stick to our [development best practices](https://servicenow.github.io/Fast-LLM/developers/dev-practices/) to keep the code clean and consistent.
-- **Write tests**: Verify your changes with unit tests for new features or bug fixes.
-- **Test on GPUs and real-world workloads**: Since Fast-LLM is all about training large language models, make sure your changes work smoothly in GPU environments and on typical training setups.
-- **Run benchmarks and performance tests**: Make sure your changes don't slow things down. If there's any impact on performance, provide benchmark results to back it up.
-- **Avoid introducing new issues**: Check that there are no new runtime warnings, type checker errors, linting problems, or unhandled edge cases.
-- **Comment non-trivial code**: Make your code easy to understand for others.
-- **Keep sensitive data out**: Make sure your code or commit messages don't expose private or proprietary information.
-- **Use the [PR template](https://github.com/ServiceNow/Fast-LLM/blob/main/.github/PULL_REQUEST_TEMPLATE.md)**: Complete the checklist to make sure everything is in order before hitting submit.
-
-## Seeking Help or Clarification
-
-If you're unsure about something or need help, you've got options:
-
-- **GitHub Discussions**: [Start a discussion](https://github.com/ServiceNow/Fast-LLM/discussions) if you need advice or just want to chat.
-- **Project Maintainers**: Mention a maintainer in an issue or pull request if you need a review or guidance.
-
-## Contributors
-
-We're grateful for all the awesome contributors who help make Fast-LLM better. Join our contributors' list and make your first contribution!
-
-To learn more about the team and maintainers, visit our [About page](https://servicenow.github.io/Fast-LLM/about-us).
+Please refer to the [contributing guidelines](https://servicenow.github.io/Fast-LLM/developers/contributing) for more information on how to contribute to Fast-LLM.
diff --git a/docs/about-us.md b/docs/about-us.md
index 9b1bc2be0..eedd852a4 100644
--- a/docs/about-us.md
+++ b/docs/about-us.md
@@ -6,15 +6,15 @@ hide:
 
 Welcome to Fast-LLM! We are a global team of engineers, researchers, and AI professionals led by the Foundation Models Lab at [ServiceNow Research](https://www.servicenow.com/research/), dedicated to advancing large language models (LLMs) and providing the highest-performance tools for serious users. Designed with professionals, research institutions, and enterprises in mind, Fast-LLM offers the speed, scalability, and flexibility needed to train the biggest and most complex models. Our commitment to open-source ensures that you have full control over your workflows, without the limitations or compromises of commercial frameworks.
 
-## Our Mission
+## 🚀 Our Mission
 
 Our mission is to deliver a best-in-class library for training large-scale language models, combining cutting-edge performance with robust, customizable features. Fast-LLM is built to meet the needs of researchers and organizations who push the boundaries of generative AI, enabling them to train state-of-the-art models more efficiently. By optimizing training workflows and scaling to massive compute clusters, we help professionals unlock the full potential of LLMs, reducing costs and time-to-deployment for ambitious AI projects.
 
-## Our Vision
+## 🌍 Our Vision
 
 We envision Fast-LLM as the go-to solution for serious AI practitioners who require more than what typical frameworks can offer. Our goal is to empower research institutions, corporate AI teams, and universities to train sophisticated models that exceed the capabilities of standard tools. By creating a highly performant and customizable library, we aim to be the backbone of cutting-edge AI research and development, equipping experts with the tools they need to tackle the toughest training challenges.
 
-## Our Values
+## 🎯 Our Values
 
 At Fast-LLM, we adhere to a set of guiding principles that define our approach:
 
@@ -23,7 +23,7 @@ At Fast-LLM, we adhere to a set of guiding principles that define our approach:
 -   **Open Innovation:** While we cater to advanced users, our commitment to open-source ensures that innovation remains accessible. We believe in building a community where professionals can collaborate and contribute to shaping the future of AI.
 -   **Reliability at Scale:** Fast-LLM is built with rigorous standards to support production-level workloads. We prioritize stability, reproducibility, and robustness, ensuring that your models can scale from research to real-world applications seamlessly.
 
-## Meet the Team
+## 👥 Meet the Team
 
 Fast-LLM is led by the Foundation Models Lab at [ServiceNow Research](https://www.servicenow.com/research/), with development driven by a dedicated group of professionals who bring extensive expertise in AI, machine learning, and distributed systems. While the project direction is guided by the Foundation Models Lab, contributions come from a growing network of researchers, developers, and industry experts worldwide. Here are some of the key members leading the project:
 
diff --git a/docs/developers/best-practices.md b/docs/developers/best-practices.md
deleted file mode 100644
index e69de29bb..000000000
diff --git a/docs/developers/contributing.md b/docs/developers/contributing.md
index 406d80154..38b018680 100644
--- a/docs/developers/contributing.md
+++ b/docs/developers/contributing.md
@@ -1,19 +1,63 @@
-# Contributing to Fast-LLM
+---
+title: Contributing
+---
 
-Coming soon...
+Thank you for your interest in contributing to Fast-LLM! We're thrilled to have you here, and your support is invaluable in helping us accelerate LLM training to full speed. This guide will walk you through the steps to contribute, from reporting issues to submitting changes and setting up your development environment.
 
-## PR Title Guidelines ✏️
+If you have questions or want to start a discussion, feel free to [open a discussion](https://github.com/ServiceNow/Fast-LLM/discussions) on our GitHub page.
 
-Since we squash commits when merging pull requests, the PR title will become the commit message for the squashed commit. To ensure a clear and consistent project history, follow these guidelines for naming your PR:
+## 🚀 Getting Started
 
-1.  **Use a concise yet descriptive title**: The title should summarize the key change or feature introduced. Avoid vague titles like "Fix bug" or "Update code."
-2.  **Start with a keyword**: Use keywords to categorize the type of change. For example:
+To get started with contributing to Fast-LLM, follow these steps to set up your environment:
 
-    -   **feat:** for new features (e.g., `[feat] add support for mixed-precision training`)
-    -   **fix:** for bug fixes (e.g., `[fix] resolve memory leak during backpropagation`)
-    -   **perf:** for performance improvements (e.g., `[perf] optimize gradient accumulation step`)
-    -   **refactor:** for code refactoring (e.g., `[refactor] clean up data loader module`)
-    -   **docs:** for documentation changes (e.g., `[docs] update contributing guidelines`)
-    -   **build:** for changes to the build process or dependencies (e.g., `[build] bump PyTorch version`)
+1.  **Learn Our Development Practices**: Get familiar with our [development best practices](https://servicenow.github.io/Fast-LLM/developers/dev-practices), which cover development setup, testing, and benchmarking.
+2.  **Read the Style Guide**: Follow our [style guide](https://servicenow.github.io/Fast-LLM/developers/style-guide) to maintain consistency in code style, documentation, and commit messages.
 
-3.  **Reference the issue number (if applicable)**: If the PR is related to a specific issue, include the issue number in the title (e.g., `[fix] resolve #123 memory leak in training loop`).
+## 🐞 How to Report a Bug
+
+Found a bug? Let's squash it together! [Open an issue](https://github.com/ServiceNow/Fast-LLM/issues/new/choose) and select "Bug report." Please include as much information as possible:
+
+-   Steps to reproduce the issue.
+-   What you expected to happen versus what actually happened.
+-   Logs, Fast-LLM configuration, and error messages.
+-   Details about your environment setup (e.g., CUDA hardware, PyTorch version, CUDA version).
+
+If you're familiar with the codebase, consider adding a failing unit test to demonstrate the problem (optional, but helpful!).
+
+## 🛠️ Proposing Changes
+
+Before diving into code, [open an issue](https://github.com/ServiceNow/Fast-LLM/issues) to discuss your proposal. This is especially important if you're planning significant changes or adding new dependencies. Once your idea is approved, follow these steps:
+
+1.  **Fork the Repository**: [Fork Fast-LLM](https://github.com/ServiceNow/Fast-LLM/fork) to your own GitHub account.
+2.  **Clone Your Fork Locally**: Use `git clone` to bring the code to your local machine.
+3.  **Create a New Branch**: Name your branch descriptively, such as `fix/training-memory-leak` or `feature/rope-scaling`.
+4.  **Make Your Changes**: Work your magic! Don't forget to add or update tests, benchmarks, or configurations as needed.
+5.  **Push to Your Fork**: Push the branch to your GitHub fork.
+6.  **Open a Pull Request**: [Submit a pull request](https://github.com/ServiceNow/Fast-LLM/compare) to the `main` branch. Reference the original issue number and provide a brief summary of your changes.
+
+## 🏆 Guidelines for a Successful Pull Request
+
+Here are some tips to ensure your pull request gets reviewed and merged promptly:
+
+-   **Follow our coding standards**: Stick to our [style guide and conventions](https://servicenow.github.io/Fast-LLM/developers/style-guide) to keep the code clean and consistent.
+-   **Write tests**: Verify your changes with unit tests for new features or bug fixes.
+-   **Test on GPUs and real-world workloads**: Since Fast-LLM is all about training large language models, make sure your changes work smoothly in GPU environments and on typical training setups.
+-   **Run benchmarks and performance tests**: Make sure your changes don't slow things down. If there's any impact on performance, provide benchmark results to back it up.
+-   **Avoid introducing new issues**: Check that there are no new runtime warnings, type checker errors, linting problems, or unhandled edge cases.
+-   **Comment non-trivial code**: Make your code easy to understand for others.
+-   **Keep sensitive data out**: Make sure your code or commit messages don't expose private or proprietary information.
+-   **Use a clear and descriptive title**: The PR title should summarize the key change or feature introduced. Avoid vague titles like "Fix bug" or "Update code." Start with a keyword like `[feat]`, `[fix]`, `[docs]`, etc. to categorize the change. Reference the issue number if applicable (e.g., `[fix] resolve #123 memory leak in training loop`). This title will become the commit message for the squashed merge.
+-   **Use the [PR template](https://github.com/ServiceNow/Fast-LLM/blob/main/.github/PULL_REQUEST_TEMPLATE.md)**: Complete the checklist to make sure everything is in order before hitting submit.
+
+## 🆘 Seeking Help or Clarification
+
+If you're unsure about something or need help, you've got options:
+
+-   **GitHub Discussions**: [Start a discussion](https://github.com/ServiceNow/Fast-LLM/discussions) if you need advice or just want to chat.
+-   **Project Maintainers**: Mention a maintainer in an issue or pull request if you need a review or guidance.
+
+## 🌟 Contributors
+
+We're grateful for all the awesome contributors who help make Fast-LLM better. Join our contributors' list and make your first contribution!
+
+To learn more about the team and maintainers, visit our [About page](https://servicenow.github.io/Fast-LLM/about-us).
diff --git a/docs/developers/dev-practices.md b/docs/developers/dev-practices.md
new file mode 100644
index 000000000..3a21845c8
--- /dev/null
+++ b/docs/developers/dev-practices.md
@@ -0,0 +1,15 @@
+---
+title: Development Practices
+---
+
+!!! warning
+
+    Work in progress! We are updating our development practices to reflect the latest changes in our codebase. Check back soon for the updated content.
+
+## Recommended Development Setup
+
+Stay tuned...
+
+## Testing and Benchmarking
+
+Stay tuned...
diff --git a/docs/developers/style-guide.md b/docs/developers/style-guide.md
new file mode 100644
index 000000000..959f0a2dd
--- /dev/null
+++ b/docs/developers/style-guide.md
@@ -0,0 +1,7 @@
+---
+title: Style Guide
+---
+
+!!! warning
+
+    This section is a work in progress. We are updating our style guide to reflect the latest changes in the codebase. Please check back soon for the updated content.
diff --git a/docs/join-us.md b/docs/join-us.md
index d3214f6b3..3b09c7787 100644
--- a/docs/join-us.md
+++ b/docs/join-us.md
@@ -6,17 +6,13 @@ hide:
 
 Fast-LLM is an open-source project driven by a community of passionate contributors. Whether you're a researcher, developer, or AI enthusiast, there's a place for you to make a real impact on the future of large-scale AI training. Join us, dive in, and help shape the tools that push the boundaries of language model training. Here's how you can get involved:
 
----
-
-## Stay in the Loop 📬
+## 📬 Stay in the Loop
 
 Want to keep up with the latest Fast-LLM updates and new opportunities to get involved? **Star** the Fast-LLM repository on GitHub and **watch** the project for notifications on new releases, discussions, and updates. This way, you'll always know what's happening, from new features to community initiatives.
 
 [Star](https://github.com/ServiceNow/Fast-LLM/stargazers) and [Watch](https://github.com/ServiceNow/Fast-LLM/subscription) the Fast-LLM repo on GitHub to stay updated on new releases, discussions, and upcoming features.
 
----
-
-## Code Contributions 🛠
+## 🛠 Code Contributions
 
 Fast-LLM thrives on collaboration, and we're excited to welcome new contributors! From fixing bugs to adding new features, every code contribution makes a difference. If you're just getting started, our [Good First Issues](https://github.com/ServiceNow/Fast-LLM/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22) on GitHub are labeled to help newcomers find approachable tasks. To set up your development environment and get oriented with Fast-LLM, check out our **Developer's Corner** for everything you need:
 
@@ -32,42 +28,30 @@ Here's a quick overview of the process:
 
 Explore our [Developer's Corner](developers/contributing.md) for everything you need to get started!
 
----
-
-## Feature Requests & Ideas 💡
+## 💡 Feature Requests & Ideas
 
 Got a great idea? We want to hear it! Whether it's a new feature, an enhancement, or even a moonshot idea, head over to **GitHub Discussions** to share your thoughts. Community feedback drives Fast-LLM's evolution, and your ideas can help shape the future of the project.
 
 Share your thoughts on [GitHub Discussions](https://github.com/ServiceNow/Fast-LLM/discussions).
 
----
-
-## Testing & Feedback 🔍
+## 🔍 Testing & Feedback
 
 Your experience with Fast-LLM is invaluable, whether you're running it in production or experimenting at home. We rely on user feedback to find bugs, optimize performance, and improve documentation. Please share any bugs, performance quirks, or gaps you spot with us on GitHub Issues. This kind of feedback strengthens the entire project.
 
 Report issues and share feedback on [GitHub Issues](https://github.com/ServiceNow/Fast-LLM/issues).
 
----
-
-## Help & Support 🤝
+## 🤝 Help & Support
 
 Love helping others? Join our [**GitHub Discussions**](https://github.com/ServiceNow/Fast-LLM/discussions) to answer questions, help troubleshoot, or share tips. Fast-LLM is a community, and the more we support each other, the stronger we become. Helping out is a great way to get involved and learn from others too.
 
----
-
-## Spread the Word 📣
+## 📣 Spread the Word
 
 If you're excited about Fast-LLM, let the world know! Share on social media, write a blog post, or give a talk at your next tech meetup. Spreading the word helps grow our community and brings new talent into the project.
 
----
-
-## Join Our Team 🌟
+## 🌟 Join Our Team
 
 Excited about contributing on a deeper level? The Foundation Models Lab at ServiceNow is at the forefront of large-scale AI training. We're looking for passionate individuals to push the boundaries of AI development with us. From research developers focusing on GPU optimization to visiting researchers refining our training frameworks, there's a role for everyone. Explore current opportunities and become a key player in shaping the future of AI at ServiceNow.
 
 Check out our [Careers page](https://www.servicenow.com/research/careers.html) for more information.
 
----
-
 Let's push the boundaries of large-scale AI training together. We're thrilled to have you here. Welcome to the Fast-LLM community!
diff --git a/docs/quick-start.md b/docs/quick-start.md
index 0a6158b35..fa4c66e46 100644
--- a/docs/quick-start.md
+++ b/docs/quick-start.md
@@ -1,5 +1,5 @@
 ---
-title: "Quick Start 🚀"
+title: "Quick Start"
 ---
 
 This guide will get you up and running with Fast-LLM on a single machine. Let's train a model and see some results!
@@ -15,7 +15,7 @@ To follow this guide, you'll need:
     -   **Cluster Setup**: Access to a Kubernetes or Docker-enabled Slurm cluster.
 -   **Time**: The initial setup and training process requires a little patience. 😊
 
-## Step 1: Initial Setup 🏗 ️
+## 🏗 Step 1: Initial Setup
 
 First, choose your environment. You can use Docker, your local environment, Slurm, or Kubernetes.
 
@@ -197,7 +197,7 @@ First, choose your environment. You can use Docker, your local environment, Slur
 
         Don't run this just yet, though. You'll need the pod throughout the guide.
 
-## Step 2: Choose Your Model 🤖
+## 🤖 Step 2: Choose Your Model
 
 Fast-LLM supports many GPT variants, including (but not limited to) Llama, Mistral, and Mixtral. For this tutorial, you can choose from two models:
 
@@ -311,7 +311,7 @@ Fast-LLM supports many GPT variants, including (but not limited to) Llama, Mistr
 
     Smaller models like SmolLM2-135M will train relatively quickly, especially if you've only got a few GPUs. But if you're feeling adventurous (and patient), give the larger Llama-3.2-1B a shot!
 
-## Step 3: Prepare the Training Data 📚
+## 📚 Step 3: Prepare the Training Data
 
 For this tutorial, we'll use 9B tokens of text from the [OpenWebText](https://skylion007.github.io/OpenWebTextCorpus/) dataset. This dataset is a free approximation of the WebText data OpenAI used for GPT-2, and it's perfect for our test run!
 
@@ -433,7 +433,7 @@ Fast-LLM ships with a `prepare` command that'll download and preprocess the data
 
     The full OpenWebText dataset is quite large and will take a while to process, around 2 hours. If you're just testing things out, you can also use a smaller dataset. Replace `openwebtext` with `stas/openwebtext-10k` to use a small subset representing the first 10K records from the original dataset. This will speed up the process and let you see how things work without waiting for hours.
 
-## Step 4: Configure Fast-LLM ⚙️
+## ⚙️ Step 4: Configure Fast-LLM
 
 Next, we'll create a configuration file for Fast-LLM. Save the following as `train-config.yaml` in your inputs folder:
 
@@ -577,11 +577,11 @@ Next, we'll create a configuration file for Fast-LLM. Save the following as `tra
     13.  We're not using ZeRO for this tutorial, so we set `zero_stage` to `null`. You can set this to `1`, `2`, or `3` for ZeRO-1, ZeRO-2, or ZeRO-3, respectively.
     14.  `bf16` (bfloat16, or Brain Floating Point 16) is supported on Ampere GPUs and higher. On Volta GPUs, you can use `fp16` (half-precision floating point) for training instead of `bf16`.
 
-## (Optional) Step 6: Add Your Weights & Biases API Key 🔑
+## 🔑 (Optional) Step 6: Add Your Weights & Biases API Key
 
 If you included the W&B section in your configuration, you'll need to add your API key. Save your W&B API key to `.wandb_api_key` in your inputs folder so Fast-LLM can track your training progress there. You can create a free W&B account if you don't already have one.
 
-## Step 7: Launch Training 🚀
+## 🚀 Step 7: Launch Training
 
 Alright, the big moment! Let's launch the training run.
 
@@ -755,7 +755,7 @@ Alright, the big moment! Let's launch the training run.
 
     Setting the Python hash seed to 0 ensures consistent, reproducible ordering in hash-dependent operations across processes. Training will fail if this isn't set.
 
-## Step 8. Track Training Progress 📊
+## 📊 Step 8: Track Training Progress
 
 === "Docker"
 
@@ -825,7 +825,7 @@ You can expect to see the following performance metrics in Fast-LLM's output:
 
 If you included the W&B section in your configuration, you can also track your training progress on the Weights & Biases dashboard as well. Follow the link in the console output to view your training run.
 
-## Troubleshooting Basics 🛠️
+## 🛠️ Troubleshooting Basics
 
 Here are some common issues you might encounter and how to address them:
 
@@ -833,6 +833,6 @@ Here are some common issues you might encounter and how to address them:
 
 -   **Underutilized GPU or Low Memory Usage**: If memory usage is low or GPU utilization isn't maxed out, try increasing `micro_batch_size` (to 4, 8, or 16 if memory allows) or extending `sequence_length` (up to 2048, 3072, or 4096, as memory permits). Larger batches and longer sequences help keep GPUs engaged and reduce idle time.
 
-## Final Thoughts
+## 🎉 Final Thoughts
 
-And that's it! You've set up, prepped data, chosen a model, configured training, and launched a full training run with Fast-LLM. From here, feel free to tweak the model, try out larger datasets, or scale things up to a multi-node setup if you're on a cluster. Happy training! 🚀
+And that's it! You've set up, prepped data, chosen a model, configured training, and launched a full training run with Fast-LLM. From here, feel free to tweak the model, try out larger datasets, or scale things up to a multi-node setup if you're on a cluster. Happy training!
diff --git a/docs/recipes/continue-training-llama-8b.md b/docs/recipes/continue-training-llama-8b.md
index 1024f7332..159be53e0 100644
--- a/docs/recipes/continue-training-llama-8b.md
+++ b/docs/recipes/continue-training-llama-8b.md
@@ -1,154 +1,7 @@
-# Load Mistral-7B
-
-## Download pretrained weights
-
-Since we are interested in extending the pretraining of Mistral-7B, the first step is to obtain the pretrained weights.
-We do so by downloading them from the [Huggingface Hub](https://huggingface.co/mistralai/Mistral-7B-v0.1).
-This requires:
-
-- Git lfs (`git lfs install`).
-- An account for the Huggingface Hub, together with an [access token](https://huggingface.co/docs/hub/security-tokens).
-- Permission to use [Mistral-7B](https://huggingface.co/mistralai/Mistral-7B-v0.1), obtained by accepting the terms and conditions.
-
-Then, clone the repository to download the weights (use the access token as password).
-```bash
-git clone https://huggingface.co/mistralai/Mistral-7B-v0.1 $PRETRAINED_CHECKPOINT_PATH
-```
-
-
-## Load the model in Fast-LLM
-
-Fast-LLM may load the model architecture and pretrained weights of supported Huggingface models directly at the beginning of training.
-To do so, we simply specify the pretrained checkpoint format and location,
-which overrides the model architecture with Mistral-7B.
-```bash
-export ARCHITECTURE_ARGS_MISTRAL_PRETRAINED="\
---pretrained_checkpoint_type=huggingface \
---pretrained_checkpoint_path=$PRETRAINED_MISTRAL_PATH \
-"
-```
-
-To obtain the full model configuration, we also need to set the non-architecture parameters,
-which are not imported during conversion.
-
-```bash
-export MODEL_ARGS_MISTRAL_PRETRAINED="\
-$ARCHITECTURE_ARGS_MISTRAL_PRETRAINED \
---window_size=4096 \
-"
-```
+---
+title: Continual Pretraining of Llama 3.1 8B
+---
 
 !!! warning
 
-    Make sure to check which model parameters are part of the architecture and which ones are not,
-    and set all required non-architecture parameters explicitly.
-
-!!! warning
-
-    Make sure the downloaded checkpoint is accessible to every worker, and adjust the path as needed.
-
-
-## (Optional) Train from scratch
-
-If we want to train a Mistral-7B model from scratch, we may still load the architecture from the Huggingface repo:
-```bash
-export ARCHITECTURE_ARGS_MISTRAL_FROM_SCRATCH="\
---pretrained_checkpoint_type=huggingface \
---pretrained_checkpoint_path=$PRETRAINED_CHECKPOINT_PATH \
---load_pretrained_weights=0 \
-"
-```
-
-Alternatively, we may specify the architecture explicitly, which makes it easier to adjust the parameters.
-```bash
-export ARCHITECTURE_ARGS_MISTRAL="\
---num_layers=32 \
---hidden_size=4096 \
---vocab_size=32000 \
---num_attention_heads=32 \
---head_groups=8 \
---add_linear_biases=0 \
---ffn_hidden_size=14336 \
---kv_channels=128 \
---use_rotary_embeddings=1 \
---rotary_embedding_scale=-9.210340371976184 \
---gated=1 \
---activation_type=silu \
---normalization_type=rms_norm \
---tie_word_embeddings=0 \
-"
-```
-
-Please refer to the trainer config for additional extended pretraining options.
-
-
-## (Optional) Train Mixtral-8x7B
-
-<!--- TODO: Move to separate file? --->
-
-We may train Mixtral-8x7B instead, which simply requires pointing to a different checkpoint:
-
-```bash
-git clone https://huggingface.co/mistralai/Mistral-7B-v0.1Mixtral-8x7B-v0.1 $PRETRAINED_CHECKPOINT_PATH
-```
-Other than a small memory optimization, this tutorial can be run as-is with Mixtral-8x7B.
-The architecture is a slight vatiation of Mistral-7B:
-```bash
-export ARCHITECTURE_ARGS_MIXTRAL="\
-$ARCHITECTURE_ARGS_MISTRAL \
---num_experts=8 \
---num_experts_per_token=2 \
-"
-```
-
-
-# Converting Fast-LLM Models to Hugging Face Format
-
-Now that we have trained a Mistral model, the natural next step is to try it for inference or benchmarks.
-Fast-LLM does not support such task (at least for the time being),
-but instead supports conversion to [Huggingface transformers](https://github.com/huggingface/transformers) models,
-which are themselves compatible with a large variety of tools.
-
-This article guides you through the conversion process for a Mistral-7B checkpoint (export)
-generated during training as described in [the previous tutorial](launch_training.md).
-This checkpoint may be found at `$EXP_BASE_DIR/export/$ITERATION/`.
-Allow some time for the first checkpoint to be generated.
-
-
-## Convert a Mistral-7B checkpoint
-
-We convert the checkpoint with Fast-LLM's
-[conversion script](https://github.com/ServiceNow/Fast-LLM/blob/main/tools/convert_model.py),
-and we specify the input and output locations and formats:
-
-```bash
-python3 -m tools.convert_model \
-    --input_type distributed \
-    --output_type huggingface \
-    --input_path $EXP_BASE_DIR/export/$ITERATION/ \
-    --output_path $CONVERTED_DIR \
-    --model_type mistral
-```
-
-<!--- TODO: What Tokenizer? --->
-
-!!! warning "Don't Forget the Tokenizer"
-
-    Make sure to add a tokenizer file and its configuration to the output directory, since `convert_model.py` does not include these files in the conversion.
-
-
-<!--- TODO: What Tokenizer? --->
-
-You can then load and use the converted model
-[as you would with any Transformers model](https://huggingface.co/docs/transformers/index).
-For example:
-```python
-import torch
-from transformers import AutoModelForCausalLM
-
-import transformers
-
-model = AutoModelForCausalLM.from_pretrained(converted_dir).to(device="cuda")
-x = torch.randint(0, 32000, (1, 1024))
-y = model(x)
-```
+    This recipe’s still in the oven. Check back soon for the full details!
diff --git a/docs/recipes/data-preparation.md b/docs/recipes/data-preparation.md
index 28833e831..bd47d602b 100644
--- a/docs/recipes/data-preparation.md
+++ b/docs/recipes/data-preparation.md
@@ -1,100 +1,7 @@
-# Training Data Preparation
-
-<!--- TODO: Provide an actual example dataset --->
-
-## Prepare datasets
-
-<!--- TODO: Tokenizer? --->
-
-The data processing of Fast-LLM is designed to closely match that of [Megatron-LM](https://github.com/NVIDIA/Megatron-LM).
-In particular, it requires datasets to be converted to the Megatron-LM binary format.
-Please refer to [this guide](https://github.com/NVIDIA/Megatron-LM?tab=readme-ov-file#data-preprocessing)
-for details on how to prepare the dataset(s).
-
-At the end of this process, each dataset should have a consist of a binary file `$DATA_PREFIX_[i].bin` and an index file `$DATA_PREFIX_[i].idx`
-
-## List configuration
-
-Datasets may be configured via a simple string in the `--data_path` argument.
-(Again, in the exact same format as with Megatron-LM).
-For a single dataset, we only need to specify its prefix:
-```bash
-export DATA_ARGS_SINGLE="\
---split=9998,2,0 \
---dataset_source=list \
---data_path=$DATA_PREFIX_0 \
-"
-```
-Note that we also specify a train/validation/test split for the dataset.
-Fow multiple datasets, we specify the prefixes together with relative dataset sampling probabilities.
-For examples
-```bash
-export DATA_ARGS_MULTIPLE="\
---split=9998,2,0 \
---dataset_source=list \
---data_path=\"0.3 $DATA_PREFIX_0 0.5 $DATA_PREFIX_1 0.2 $DATA_PREFIX_2\" \
-"
-```
-
-!!! warning
-
-    The same dataset split is used for every dataset.
-    This may cause problems for extremely small datasets, which we recommend avoiding.
-    (If needed, we suggest concatenating small datasets into larger ones.)
+---
+title: Preparing Data for Training
+---
 
 !!! warning
 
-    Make sure to dedicate enough data for validation and/or testing, and adjust the split according to you dataset.
-    Our setup assumes a dataset of 500 billion tokens, and requires 26 million tokens for each validation,
-    so allocating 0.02% of the total data (100 million tokens)
-    ensures sufficient data without excessively reducing the training set size.
-
-
-## Json configuration
-
-While the list configuration is sufficient for a small number of datasets,
-it becomes impractical when there are many of them.
-For that purpose, Fast-LLM allows configuring a dataset from an external json file.
-
-A common use case concerns large datasets with hundreds of billions of tokens,
-which need to be split into multiple ones to keep the file size reasonable.
-We want to sample each dataset as if it was not split, i.e. with probability proportional to its document count.
-In that case, the json configuration file can be generated automatically using the `concatenate_dataset.py` script:
-```bash
-python3 tools/concatenate_dataset.py --directory=$DATASET_DIR --output_name=$JSON_DATA_PATH
-"
-```
-This script will recursively scan `$DATASET_DIR` for datasets (`.idx` files),
-and create a json dataset configuration at `$JSON_DATA_PATH` with the appropriate dataset prefixes and probabilities.
-The resulting json file can be used to configure the datasets:
-```bash
-export DATA_ARGS="\
---split=9998,2,0 \
---dataset_source=file \
---data_path=$JSON_DATA_PATH \
-"
-```
-
-??? question "More on the json dataset file"
-
-    The json dataset file is a simple structure for holding the data prefixes and probabilities,
-    to avoid writing them explicitly in the Fast-LLM configuration.
-    It may be created manually or through a script such as `concatenate_dataset.py`
-    It may also contain metadata about the dataset contents, for example the total number of tokens and documents.
-    The file should be structured as:
-    ```json
-    {
-        "datasets": [
-            {
-                "prefix": $RELATIVE_DATA_PREFIX_0"
-                "weight": 0.3
-                "num_documents": 12345,
-                "num_tokens": 987654321,
-                ...
-            },
-            ...
-        ]
-    }
-    ```
-    Note that in the json format, paths are relative to the directory containing the json file
-    instead of the current working directory.
+    This guide’s still in the works. Stay tuned—full instructions coming soon!
diff --git a/docs/recipes/train-llama-8b.md b/docs/recipes/train-llama-8b.md
index 43a32eca2..d76f28223 100644
--- a/docs/recipes/train-llama-8b.md
+++ b/docs/recipes/train-llama-8b.md
@@ -1,249 +1,7 @@
-# Getting Started
-
-<!--- TODO: Remove the ServiceNow-specific content. --->
-
-## Build the image
+---
+title: Training Llama 3.1 8B
+---
 
 !!! warning
 
-    This guide is not yet working.
-
-The preferred way to run [Fast-LLM](https://github.com/ServiceNow/Fast-LLM) is through a docker image built with the provided Dockerfile.
-For example, from a terminal running on a GPU node:
-
-```bash
-git clone git@github.com:ServiceNow/Fast-LLM.git
-cd Fast-LLM
-docker build -t my_fast_llm_image .
-docker run --rm -it --gpus all --net=host --ipc=host my_fast_llm_image bash
-```
-
-## First examples
-
-All training runs are launched throught the entry point [pretrain_fast_llm.py](https://github.com/ServiceNow/Fast-LLM/blob/main/pretrain_fast_llm.py).
-We can run a minimalistic training example with:
-```bash
-python3 pretrain_fast_llm.py --train_iters=100 --batch_size=32 --dataset_source=random
-```
-This will launch a short single-GPU training from scratch of a 180 M parameter model on a randomly generated dataset.
-
-To run distributed training, we run our training script through [torchrun](https://pytorch.org/docs/stable/elastic/run.html),
-the PyTorch distributed launcher. For example, on 8 GPUs:
-```bash
-torchrun --nproc-per-node=8 pretrain_fast_llm.py --train_iters=100 --batch_size=32 --dataset_source=random
-```
-Note that by default, Fast-LLM parallelizes over samples (data-parallel), so the number of GPUs should divide the batch size.
-
-Multi-node training also uses torchrun, and requires the same command to be run on each node,
-with the additional specification of a rendez-vous endpoint, i.e., the address of one of the nodes.
-For example, on four nodes:
-```bash
-torchrun --nproc-per-node=8 --nnodes=4 --rdzv-backend=c10d --rdzv-endpoint=$HOST_NODE_ADDR pretrain_fast_llm.py --train_iters=100 --batch_size=32 --dataset_source=random
-```
-
-See the [torchrun documentation](https://pytorch.org/docs/stable/elastic/run.html) for more details.
-Note that if you are using cloud or managed hardware, there Now tutorial](servicenow.md)
-may be a simpler automated method to launch multi-node jobs.
-Please refer to your provider for more details.
-The ServiceNow-specific method may be found in the [Service
-
-## More on training arguments
-
-<!--- TODO: Document arguments --->
-
-The training script supports hundreds of arguments, though most of them are optional and/or have sensible defaults.
-We already saw three arguments above, and we will see many important ones in this tutorial.
-
-At the beginning of training, Fast-LLM displays a list of arguments and their values:
-```
------------------------- arguments ------------------------
-  activation_type ................................. gelu
-  adam_beta1 ...................................... 0.9
-  adam_beta2 ...................................... 0.999
-  adam_eps ........................................ 1e-08
-  add_linear_biases ............................... True
-  attention_dropout ............................... 0.0
-  batch_size ...................................... 1
-  [...]
--------------------- end of arguments ---------------------
-```
-All of these arguments can be set as arguments of `pretrain_fast_llm.py`, in the form `--[name]=[value]`,
-provided the values have the expected data type, and in some case satisfy extra constraints.
-For example, we may enable attention dropout with `--attention_dropout=0.1`.
-Note that booleans are set as integers (ex. `--add_linear_biases=0` to disable biases),
-and that `None` cannot be represented.
-Please refer to each parameter's definition for more details.
-
-
-# Prepare the training configuration
-
-# Training parameters
-
-Our example training scheme is as follows:
-1. We train over 500 K iteration, each made of 128 samples of 8192 tokens, for a total of 524 B training tokens.
-2. We use the Adam optimizer with weight decay (Adamw), and gradient clipping.
-3. We warm up the learning rate for the first 1000 steps, then use cosine decay from 1e-4 to 3e-6.
-
-This translates into the following Fast-LLM configuration:
-```bash
-export TRAINING_ARGS="\
---batch_size=128 \
---sequence_length=8192 \
---train_iters=500000 \
---weight_decay=0.1 \
---adam_beta1=0.9 \
---adam_beta2=0.95 \
---clip_grad=1.0 \
---lr=0.0001 \
---lr_warmup_iters=1000 \
---lr_decay_style=cosine \
---lr_decay_iters=500000 \
---min_lr=0.000003 \
-"
-```
-
-# Performance parameters
-
-Our training setup is simple enough that the default distributed configuration
-(data parallel with [ZeRO stage 1](https://www.deepspeed.ai/tutorials/zero/))
-is sufficient for a near-optimal training throughput of around 9000 tokens/s/GPU on H100 GPUs (440 tflops/GPU).
-We only need to specify the training dtype and the number of data loader workers.
-```bash
-export PERFORMANCE_ARGS="\
---training_dtype=bf16 \
---num_workers=8 \
-"
-```
-
-Note that this configuration requires exactly 16 nodes.
-It may be adjusted run on fewer than 16 nodes,
-by using gradient accumulation to keep the micro-batch size constant and adding some memory optimizations.
-We suggest the following configuration for 4 to 64 GPUs (seet details the in next section):
-```bash
-export PERFORMANCE_ARGS_SMALL_CLUSTER="\
-$PERFORMANCE_ARGS \
---micro_batch_size=1 \
---zero_stage=2 \
-"
-```
-
-# (Optional) More on Mistral performance optimization
-
-The performance optimization of Mistral at the configuration level
-is mainly determined through the following guidelines:
-
-- **Use larger micro-batches**: The GPU runs more efficiently with larger kernels,
-so we want the micro-batches to be as large as allowed by memory and other constraints.
-Our configuration requires 36 GiB of activation memory,
-so a micro-batch or 8192 tokens per GPU is a reasonable choice.
-A value of 16384 tokens per GPU is technically feasible,
-but would require aggressive state memory optimizations and a higher batch size.
-2 - **Reduce model parallelism**: Model parallelism (tensor or pipeline) comes with a large overhead,
-so we should avoid or limit it whenever possible.
-For Mistral, no model parallelism is needed.
-3 - **Optimize the memory usage**: Additional memory optimizations are available to enable configurations that would
-otherwise not be possible. We already saw the most important one, the ZeRO stage (`--zero_stage` see note below).
-An additional one is the recomputation of the MLP activations `--mlp_recompute_level` ,
-which significantly lower the activation memory usage, for a small (`activation`) or moderate (`full`) overhead.
-Note that Fast-LLM does not implement activation recomputation for the entire transformer layer,
-as it comes with a large overhead (~33%) and it can be avoided in (almost) all practical scenario.
-
-
-??? note "More on ZeRO stages"
-
-    Fast-LLM provides a custom implementation of the training state partitioning
-    first described in the [ZeRO (Zero Redundancy Optimizer) paper](https://arxiv.org/abs/1910.02054).
-    The method comes in three "stages", which progressively reduce the memory footprint from the training state:
-
-    - **Stage 1**: Partition the optimizer state and its update across the data-parallel GPUs.
-      This stage reduces the state memory by around 3x (for mixed precision training with full-precision gradients),
-      while simultanuously speeding up training through a faster weight update.
-
-    - **Stage 2**: Extend partitioning to the (reduced) gradients.
-      This stage reduces the state memory by a further 3x,
-      but may come with a minor overhead (depending on the implementation),
-      and may require multiple reductions with gradient accumulation.
-
-    - **Stage 3**: Extend partitioning to the weights.
-      This stage drops the vast majority of the remaining state memory,
-      but requires extra network communication.
-
-    Fast-LLM implements all three of these stages, selected through the `--zero_stage` argument.
-    There is no option to disable ZeRO entirely, as it would be strictly worse in terms of performance.
-    In general, training configurations should use the lowest value allowed by other memory constraints.
-
-??? note "Recompute Level for MLPs"
-
-    The MLP is the largest contributor to a transformer's activation memory (with Flash Attention),
-    so recomputing its activations is a natural way to save memory.
-    Fast-LLM offers three MLP recomputaton modes, set throught the `--mlp_recompute_level` argument:
-
-    - **`none`** (default): All MLP activations are kept,
-    allowing for the highest throughput at the highest memory cost.
-
-    - **`activation`**: The MLP activation layer output (gelu, silu, etc.) is dropped and recomputed in the backward pass.
-    This saves on activation memory (~20% for Mistral) with minimal impact on throughput.
-
-    - **`full`**: Both the first dense layer and activation layer outputs are dropped and recomputed.
-    This saves more activation memory (~60% for Mistral), but has a noticeable impact on throughput .
-
-    For quantitative comparison, here are benchmarks for Mistral (using 4x A100 GPUs):
-
-    | Recompute Level | Act. Memory (MiB) | Tokens/s/GPU | Model TFLOP/s/GPU |
-    |-----------------|-------------------|--------------|---------------|
-    | `none`          | 36515             | 4234.09      | 202.88        |
-    | `activation`    | 29346             | 4218.63      | 202.14        |
-    | `full`          | 15010             | 3804.49      | 182.29        |
-
-
-# Monitoring and persistence parameters
-
-Finally, we set up experiment monitoring and persistence
-```bash
-export MONITORING_ARGS="\
---experiment_dir=$EXP_BASE_DIR \
---validation_iters=25 \
---validation_interval=1000 \
---max_checkpoints=5 \
---export_interval=25000 \
---log_interval=10 \
---log_offset=0 \
---checkpoint_interval=500 \
-"
-```
-This setup includes:
-- Creation of an experiment directory at `$EXP_BASE_DIR` to store checkpoints, logs, data cache and other artifacts.
-- Validation for 25 steps every 1000 steps
-- Logging of losses, metrics and other relevant quantities every 10 steps (from rank 0),
-  both to stdout and the log file.
-- Saving of a temporary checkpoint every 500 steps, and of a permanent checkpoint every 25000 steps.
-
-
-??? note "More on Fast-LLM checkpointing"
-
-    Fast-LLM provides two types of checkpoints:
-
-    - `checkpoint`: temporary checkpoint saved at `[--experiment_dir]/checkpoints/[iter]`,
-      to reload the experiment in case of a planned or unexpected shutdown.
-      Only the `--max_checkpoints` most recent ones are kept to limit disk usage.
-      Note that saving a checkpoint with Fast-LLM is relatively fast so can (and should) be done frequently.
-    - `export`: permanent checkpoint saved at `[--experiment_dir]/export/[iter]`.
-      This checkpoint type is typically intended for long-term storage, benchmarking, inference, etc.
-      It should be saved less often to limit disk usage.
-
-
-# (Optional) Set up wandb
-
-Fast-LLM also support monitoring through [Weights and Biases](https://wandb.ai/).
-This requires a valid API key,
-passed through an environment variable rather than an explicit argument for security reasons.
-It can be either contained in `$WANDB_API_KEY` or in a plain text file found at `$WANDB_API_KEY_PATH`.
-Then, we set the Wandb username, project and version (Wandb group).
-```bash
-export WANDB_ARGS="\
---wandb_entity_name=$WANDB_ENTITY_NAME \
---wandb_project_name=$PROJECT_NAME \
---wandb_group_name=$PROJECT_VERSION \
-"
-```
-The Wandb run will be set as the directory name of `$EXP_BASE_DIR`, or can be overriden through `--experiment_name`.
+    Heads up! This guide isn't ready yet. Check back soon.
diff --git a/docs/recipes/upcycle-llama-3b-to-moe.md b/docs/recipes/upcycle-llama-3b-to-moe.md
index e69de29bb..28d5e2648 100644
--- a/docs/recipes/upcycle-llama-3b-to-moe.md
+++ b/docs/recipes/upcycle-llama-3b-to-moe.md
@@ -0,0 +1,7 @@
+---
+title: Upcycling Llama 3B to MoE
+---
+
+!!! warning
+
+    This guide is under construction. Check back soon to see how to give your Llama 3B a new life as an MoE!
diff --git a/docs/reference/configuration.md b/docs/reference/configuration.md
index f4941c6c0..7d05293b5 100644
--- a/docs/reference/configuration.md
+++ b/docs/reference/configuration.md
@@ -1,3 +1,7 @@
-# Reference
+---
+title: Configuration Reference
+---
 
-Coming soon...
+!!! warning
+
+    Looking for the full config details? This reference is on the way. Stay tuned!
diff --git a/mkdocs.yaml b/mkdocs.yaml
index 93a9592a4..4a137fcf1 100644
--- a/mkdocs.yaml
+++ b/mkdocs.yaml
@@ -175,6 +175,7 @@ nav:
     - Configuration: reference/configuration.md
   - Developers:
     - Contributing: developers/contributing.md
-    - Best Practices: developers/best-practices.md
+    - Style Guide: developers/style-guide.md
+    - Development Practices: developers/dev-practices.md
   - About Us: about-us.md
   - Join Us: join-us.md

From 77c7416ba586ade7abea74d8b42a843917f0f172 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Thu, 14 Nov 2024 14:51:44 -0500
Subject: [PATCH 83/87] wip

---
 docs/developers/dev-practices.md | 2 +-
 docs/developers/style-guide.md   | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/developers/dev-practices.md b/docs/developers/dev-practices.md
index 3a21845c8..da71eca62 100644
--- a/docs/developers/dev-practices.md
+++ b/docs/developers/dev-practices.md
@@ -4,7 +4,7 @@ title: Development Practices
 
 !!! warning
 
-    Work in progress! We are updating our development practices to reflect the latest changes in our codebase. Check back soon for the updated content.
+    Work in progress! Check back soon for the updated content.
 
 ## Recommended Development Setup
 
diff --git a/docs/developers/style-guide.md b/docs/developers/style-guide.md
index 959f0a2dd..1d0a45f20 100644
--- a/docs/developers/style-guide.md
+++ b/docs/developers/style-guide.md
@@ -4,4 +4,4 @@ title: Style Guide
 
 !!! warning
 
-    This section is work in progress. We are in the process of updating our style guide to reflect the latest changes in codebase. Please check back soon for the updated content.
+    This section is a work in progress. Please check back soon for the updated content.

From e8e9aea39a2d73244bb17c40a10e44610c2a78c5 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Thu, 14 Nov 2024 19:07:05 -0500
Subject: [PATCH 84/87] wip

---
 docs/join-us.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/join-us.md b/docs/join-us.md
index 3b09c7787..31ff49abb 100644
--- a/docs/join-us.md
+++ b/docs/join-us.md
@@ -17,7 +17,7 @@ Want to keep up with the latest Fast-LLM updates and new opportunities to get in
 Fast-LLM thrives on collaboration, and we're excited to welcome new contributors! From fixing bugs to adding new features, every code contribution makes a difference. If you're just getting started, our [Good First Issues](https://github.com/ServiceNow/Fast-LLM/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22) on GitHub are labeled to help newcomers find approachable tasks. To set up your development environment and get oriented with Fast-LLM, check out our **Developer's Corner** for everything you need:
 
 -   [**Contributing**](developers/contributing.md) – for setup instructions and contributing guidelines
--   [**Best Practices**](developers/best-practices.md) – for tips on writing clean, maintainable code
+-   [**Development Practices**](developers/dev-practices.md) – for tips on writing clean, maintainable code
 
 Here's a quick overview of the process:
 

From e697eeb9216db067b5c04ce13979724ba659ab24 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Thu, 14 Nov 2024 19:23:46 -0500
Subject: [PATCH 85/87] wip

---
 docs/quick-start.md | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/docs/quick-start.md b/docs/quick-start.md
index fa4c66e46..a5eb7b248 100644
--- a/docs/quick-start.md
+++ b/docs/quick-start.md
@@ -328,6 +328,7 @@ Create a configuration file for the dataset preparation. Copy the following cont
 
     dataset:
       path: openwebtext
+      trust_remote_code: true
 
     tokenizer:
       path: /mnt/inputs/SmolLM2-135M/tokenizer.json
@@ -346,6 +347,7 @@ Create a configuration file for the dataset preparation. Copy the following cont
 
     dataset:
       path: openwebtext
+      trust_remote_code: true
     
     tokenizer:
       path: /mnt/inputs/Llama-3.2-1B/tokenizer.json
@@ -362,13 +364,13 @@ Fast-LLM ships with a `prepare` command that'll download and preprocess the data
     ```bash
     docker run -it --rm ghcr.io/servicenow/fast-llm:latest \
         -v ~/inputs:/mnt/inputs \
-        fast-llm prepare --config /mnt/inputs/prepare-config.yaml
+        fast-llm prepare gpt_memmap --config /mnt/inputs/prepare-config.yaml
     ```
 
 === "Local Environment"
 
     ```bash
-    fast-llm prepare --config /mnt/inputs/prepare-config.yaml
+    fast-llm prepare gpt_memmap --config /mnt/inputs/prepare-config.yaml
     ```
 
 === "Slurm"
@@ -386,7 +388,7 @@ Fast-LLM ships with a `prepare` command that'll download and preprocess the data
         --container-image="ghcr.io/servicenow/fast-llm:latest" \
         --container-mounts="${HOME}/inputs:/mnt/inputs,${HOME}/results:/mnt/results" \
         --ntasks-per-node=$SLURM_NTASKS_PER_NODE \
-        bash -c "fast-llm prepare --config /mnt/inputs/prepare-config.yaml"
+        bash -c "fast-llm prepare gpt_memmap --config /mnt/inputs/prepare-config.yaml"
     EOF
     ```
 
@@ -411,7 +413,7 @@ Fast-LLM ships with a `prepare` command that'll download and preprocess the data
           containers:
             - name: fast-llm-prepare-container
               image: ghcr.io/servicenow/fast-llm:latest
-              command: ["fast-llm", "prepare"]
+              command: ["fast-llm", "prepare", "gpt_memmap"]
               args:
                 - "--config"
                 - "/mnt/inputs/prepare-config.yaml"

From 8ec8ade436a4685636175852c47a0fd22dcf39ec Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Thu, 14 Nov 2024 22:22:27 -0500
Subject: [PATCH 86/87] update configs

---
 docs/quick-start.md | 46 +++++++++++++++++++++++----------------------
 1 file changed, 24 insertions(+), 22 deletions(-)

diff --git a/docs/quick-start.md b/docs/quick-start.md
index a5eb7b248..644399b93 100644
--- a/docs/quick-start.md
+++ b/docs/quick-start.md
@@ -461,7 +461,7 @@ Next, we'll create a configuration file for Fast-LLM. Save the following as `tra
         group_name: SmolLM2-135M
         entity_name: servicenow
     batch:
-      micro_batch_size: 20  # (4)!
+      micro_batch_size: 60  # (4)!
       sequence_length: 1024
       batch_size: 480  # (5)!
     data:
@@ -497,7 +497,7 @@ Next, we'll create a configuration file for Fast-LLM. Save the following as `tra
     1.  Total number of training tokens will be approximately 300B.
     2.  A Llama model will be saved in Hugging Face format to `~/results` directory every 20,000 iterations.
     3.  Entirely optional, but it's a good idea to track your training progress with Weights & Biases. Replace `servicenow` with your own W&B entity name. If you don't want to use W&B, just remove this section.
-    4.  Adjust the number of sequences per GPU based on GPU memory. For SmolLM2-135M and an A100-80GB, a `micro_batch_size` of 1 should work well.
+    4.  Adjust the number of sequences per GPU based on GPU memory. For SmolLM2-135M and an A100-80GB, a `micro_batch_size` of 60 should work well.
     5.  Must be divisible by the number of GPUs and the `micro_batch_size`. At 1024 tokens per sequence, 480 corresponds to about 500,000 tokens per batch.
     6.  Location of the dataset metadata file generated in Step 4.
     7.  99% train, 1% validation, 0% test. These settings need to be adjusted based on the size of your dataset. If you're using a smaller dataset, you need to increase the validation split.
@@ -528,10 +528,10 @@ Next, we'll create a configuration file for Fast-LLM. Save the following as `tra
         interval: 20_000
       wandb:  # (3)!
         project_name: fast-llm-quickstart
-        group_name: llama-3.2-1B
+        group_name: Llama-3.2-1B
         entity_name: servicenow
     batch:
-      micro_batch_size: 1  # (4)!
+      micro_batch_size: 20  # (4)!
       sequence_length: 1024
       batch_size: 480  # (5)!
     data:
@@ -546,7 +546,7 @@ Next, we'll create a configuration file for Fast-LLM. Save the following as `tra
         base: 6.0e-04
         minimum: 6.0e-05
         decay_style: cosine
-        decay_iterations: 600_000
+        decay_iterations: 100_000
         warmup_iterations: 2000
     pretrained:
       format: llama  # (10)!
@@ -556,10 +556,11 @@ Next, we'll create a configuration file for Fast-LLM. Save the following as `tra
       base_model:
         transformer:
           use_flash_attention: yes  # (12)!
+        cross_entropy_impl: fused  # (13)!
       multi_stage:
-        zero_stage: null  # (13)!
+        zero_stage: null  # (14)!
       distributed:
-        training_dtype: bf16  # (14)!
+        training_dtype: bf16  # (15)!
     run:
       experiment_dir: /mnt/results/Llama-3.2-1B
     ```
@@ -567,7 +568,7 @@ Next, we'll create a configuration file for Fast-LLM. Save the following as `tra
     1.  Total number of training tokens will be approximately 300B.
     2.  A Llama model will be saved in Hugging Face format to `~/results` directory every 20,000 iterations.
     3.  Entirely optional, but it's a good idea to track your training progress with Weights & Biases. Replace `servicenow` with your own W&B entity name. If you don't want to use W&B, just remove this section.
-    4.  Adjust the number of sequences per GPU based on GPU memory. For Llama-3.2-1B and an A100-80GB, a `micro_batch_size` of 1 should work well.
+    4.  Adjust the number of sequences per GPU based on GPU memory. For Llama-3.2-1B and an A100-80GB, a `micro_batch_size` of 20 should work well.
     5.  Must be divisible by the number of GPUs and the `micro_batch_size`. At 1024 tokens per sequence, 480 corresponds to about 500,000 tokens per batch.
     6.  Location of the dataset metadata file generated in Step 4.
     7.  99% train, 1% validation, 0% test. These settings need to be adjusted based on the size of your dataset. If you're using a smaller dataset, you need to increase the validation split.
@@ -576,8 +577,9 @@ Next, we'll create a configuration file for Fast-LLM. Save the following as `tra
     10.  Format of the pretrained model. Since it's a Llama model, we set this to `llama`.
     11.  We want to continue training Llama-3.2-1B from a checkpoint. If you're training from scratch, set this to `no`.
     12.  If you're using Ampere GPUs or higher, you can enable FlashAttention for faster training. Otherwise, set this to `no`. The default is `yes`.
-    13.  We're not using ZeRO for this tutorial, so we set `zero_stage` to `null`. You can set this to `1`, `2`, or `3` for ZeRO-1, ZeRO-2, or ZeRO-3, respectively.
-    14.  `bf16` (bfloat16, or Brain Floating Point 16) is supported on Ampere GPUs and higher. On Volta GPUs, you can use `fp16` (half-precision floating point) for training instead of `bf16`.
+    13.  Configure Fast-LLM to use the fused cross-entropy loss implementation rather than the default Triton implementation for Llama models. This avoids issues with block size limitations in our current Triton code, which can cause training failures.
+    14.  We're not using ZeRO for this tutorial, so we set `zero_stage` to `null`. You can set this to `1`, `2`, or `3` for ZeRO-1, ZeRO-2, or ZeRO-3, respectively.
+    15.  `bf16` (bfloat16, or Brain Floating Point 16) is supported on Ampere GPUs and higher. On Volta GPUs, you can use `fp16` (half-precision floating point) for training instead of `bf16`.
 
 ## 🔑 (Optional) Step 6: Add Your Weights & Biases API Key
 
@@ -785,10 +787,10 @@ You can expect to see the following performance metrics in Fast-LLM's output:
 
     | Performance Metric  | 8x V100-SXM2-32GB[^SmolLM2-V100] | 8x A100-SXM4-80GB[^SmolLM2-A100] | 8x H100-SXM5-80GB[^SmolLM2-H100] |
     |---------------------|---------------------------------:|---------------------------------:|---------------------------------:|
-    | tokens/s/GPU        | 18300                            |                                  |                                  |
-    | tflop/s (model)     | 16.7                             |                                  |                                  |
-    | tflop/s (hardware)  | 17.0                             |                                  |                                  |
-    | total training time | 23.3 days                        |                                  |                                  |
+    | tokens/s/GPU        | 18,300                           |                                  | 294,000                          |
+    | tflop/s (model)     | 16.7                             |                                  | 268                              |
+    | tflop/s (hardware)  | 17.0                             |                                  | 274                              |
+    | total training time | 23.3 days                        |                                  | 1.45 days                        |
 
     [^SmolLM2-V100]:
         `bf16` is not supported on V100 GPUs. Precision was set to `fp16`.
@@ -797,20 +799,20 @@ You can expect to see the following performance metrics in Fast-LLM's output:
     [^SmolLM2-A100]:
         Precision was set to `bf16`.
         FlashAttention was enabled.
-        Micro-batch size was set to ???.
+        Micro-batch size was set to 60.
     [^SmolLM2-H100]:
         Precision was set to `bf16`.
         FlashAttention was enabled.
-        Micro-batch size was set to ???.
+        Micro-batch size was set to 60.
 
 === "Llama-3.2-1B"
 
     | Performance Metric  | 8x V100-SXM2-32GB[^Llama-V100] | 8x A100-SXM4-80GB[^Llama-A100] | 8x H100-SXM5-80GB[^Llama-H100] |
     |---------------------|-------------------------------:|-------------------------------:|-------------------------------:|
-    | tokens/s/GPU        | 5680                           |                                |                                |
-    | tflop/s (model)     | 43.3                           |                                |                                |
-    | tflop/s (hardware)  | 43.4                           |                                |                                |
-    | total training time | 12.5                           |                                |                                |
+    | tokens/s/GPU        | 5,680                          |                                | 66,600                         |
+    | tflop/s (model)     | 43.3                           |                                | 508                            |
+    | tflop/s (hardware)  | 43.4                           |                                | 510                            |
+    | total training time | 12.5 days                      |                                | 1.07 days                      |
 
     [^Llama-V100]:
         `bf16` is not supported on V100 GPUs. Precision was set to `fp16`.
@@ -819,11 +821,11 @@ You can expect to see the following performance metrics in Fast-LLM's output:
     [^Llama-A100]:
         Precision was set to `bf16`.
         FlashAttention was enabled.
-        Micro-batch size was set to ???.
+        Micro-batch size was set to 20.
     [^Llama-H100]:
         Precision was set to `bf16`.
         FlashAttention was enabled.
-        Micro-batch size was set to ???.
+        Micro-batch size was set to 20.
 
 If you included the W&B section in your configuration, you can also track your training progress on the Weights & Biases dashboard as well. Follow the link in the console output to view your training run.
 

From 7d0cf492efa7ed952948cc0d9095bfec1df2f140 Mon Sep 17 00:00:00 2001
From: Torsten Scholak <torsten.scholak@googlemail.com>
Date: Thu, 14 Nov 2024 22:22:51 -0500
Subject: [PATCH 87/87] remove downloads correctly

---
 fast_llm/data/preparator/gpt_memmap/prepare.py | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/fast_llm/data/preparator/gpt_memmap/prepare.py b/fast_llm/data/preparator/gpt_memmap/prepare.py
index c51bd4a70..fccb7945f 100644
--- a/fast_llm/data/preparator/gpt_memmap/prepare.py
+++ b/fast_llm/data/preparator/gpt_memmap/prepare.py
@@ -34,7 +34,6 @@ def _tokenize_batch(self, batch):
         }
 
     def _save_shard(self, args) -> dict:
-
         shard_idx, shard_dataset = args
         prefix = f"shard_{self._config.distributed.rank}_{shard_idx}"
         shard_output_path = self._config.output_path / prefix
@@ -51,7 +50,6 @@ def _save_shard(self, args) -> dict:
         return dataset_dict
 
     def run(self):
-
         # Set transformers logging verbosity
         transformers.logging.set_verbosity_error()
 
@@ -159,4 +157,4 @@ def run(self):
 
         # Clean up downloaded dataset
         if self._config.remove_downloads and self._config.distributed.rank == 0:
-            shutil.rmtree(download_path)
+            shutil.rmtree(download_path, ignore_errors=True)