From ec310ede85128c9fdb533cb54b05e710132c06ca Mon Sep 17 00:00:00 2001
From: Toolkit User <raymond.li@servicenow.com>
Date: Mon, 2 Dec 2024 21:32:20 +0000
Subject: [PATCH 01/10] initial recipe

---
 docs/recipes/train-llama-8b.md | 85 +++++++++++++++++++++++++++++++++-
 1 file changed, 83 insertions(+), 2 deletions(-)
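
Reviewer note: the `batch` and `train_iters` values in this recipe imply a token budget that is easy to work out by hand. The sketch below is only an illustration. It assumes `batch_size` counts sequences per optimizer step, and the helper names and the 8-GPU data-parallel setup are hypothetical, not taken from Fast-LLM.

```python
def tokens_per_step(batch_size: int, sequence_length: int) -> int:
    """Tokens consumed by one optimizer step (assumes batch_size is in sequences)."""
    return batch_size * sequence_length


def total_tokens(batch_size: int, sequence_length: int, train_iters: int) -> int:
    """Tokens consumed over the whole run."""
    return tokens_per_step(batch_size, sequence_length) * train_iters


def accumulation_steps(batch_size: int, micro_batch_size: int, data_parallel_ranks: int) -> int:
    """Micro-batches each rank processes per step (assumes pure data parallelism)."""
    sequences_per_pass = micro_batch_size * data_parallel_ranks
    if batch_size % sequences_per_pass != 0:
        raise ValueError("batch_size must be divisible by micro_batch_size * ranks")
    return batch_size // sequences_per_pass


# Values from the recipe: batch_size 256, sequence_length 4096, train_iters 100_000.
print(tokens_per_step(256, 4096))
print(total_tokens(256, 4096, 100_000))
# Hypothetical hardware: 8 data-parallel GPUs with micro_batch_size 2.
print(accumulation_steps(256, 2, 8))
```

Under these assumptions each step sees about one million tokens, so the full run covers roughly 105 billion tokens.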

diff --git a/docs/recipes/train-llama-8b.md b/docs/recipes/train-llama-8b.md
index d76f28223..c7fe25096 100644
--- a/docs/recipes/train-llama-8b.md
+++ b/docs/recipes/train-llama-8b.md
@@ -2,6 +2,87 @@
 title: Training Llama 3.1 8B
 ---
 
-!!! warning
+In this guide, we provide step-by-step instructions to do continued pretraining on The Stack with Llama 3.1-8B 🦙.
 
-    Heads up! This guide isn't ready yet. Check back soon.
+# Preliminary steps
+- [Quick Start](quick-start.md)
+- [Data preparation](data-preparation.md)
+
+# Download the Pretrained Model
+Let's download Llama-3.1-8B:
+```bash
+git lfs install
+git clone https://huggingface.co/meta-llama/Llama-3.1-8B ./fast-llm-tutorial/pretrained-model
+```
+
+# Training
+This is not much different from a pretraining config. We will:
+- specify the Llama3.1 checkpoint to load. Fast-LLm will automatically infer the corresponding model architecture.
+- adapt some of the training parameters for our needs.
+- and that's it!
+
+  ```yaml
+  training:
+    train_iters: 100_000
+    logs:
+      interval: 10
+    validation:
+      iterations: 25
+      interval: 1000
+    checkpoint:
+      interval: 1000
+      keep: 5
+    test_iters: 0
+    export:  # (1)!
+      format: llama
+      interval: 20_000
+  batch:
+    micro_batch_size: 2
+    sequence_length: 4096
+    batch_size: 256
+  data:
+    format: file
+    path: fast-llm-tutorial/dataset.json  # (2)!
+    split: [99, 1, 0]  
+  optimizer:  
+    weight_decay: 0.1
+    beta_1: 0.9
+    beta_2: 0.95
+    learning_rate:
+      base: 6.0e-04
+      minimum: 6.0e-05
+      decay_style: cosine
+      decay_iterations: 100_000
+      warmup_iterations: 2000
+  pretrained:  # (3)!
+    format: llama
+    path: fast-llm-tutorial/pretrained-model
+    model_weights: yes
+  model:
+    base_model:
+      transformer:
+        use_flash_attention: yes
+      cross_entropy_impl: fused
+    multi_stage:
+      zero_stage: 2
+    distributed:
+      training_dtype: bf16  
+  run:
+    experiment_dir: fast-llm-tutorial/Llama-3.1-8B-cpt
+    ```
+
+    1.  A Llama model will be saved in Hugging Face format to `~/results` directory every 20,000 iterations.
+    2.  Location of the dataset metadata file generated in Step 4.
+    3.  Config of the pretrained model. We load a `llama` model from the repository downloaded earlier.
+
+# Checkpoint usage
+Checkpoints will be saved regularly, and every 20k steps a checkpoint will be exported in the HF format.
+You can use it in `transformers` as you would use the pretrained Llama 3.1 model, except this one will be much stronger at {lang}!
+
+```python
+from transformers import pipeline, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("fast-llm-tutorial/pretrained-model")
+pipe = pipeline("text-generation", model="fast-llm-tutorial/Llama-3.1-8B-cpt/export/llama/20000/", tokenizer=tokenizer)
+
+```

From c6768e6133d78bc2ba1cb47f586ef91acff2abf3 Mon Sep 17 00:00:00 2001
From: Toolkit User <raymond.li@servicenow.com>
Date: Tue, 3 Dec 2024 16:33:58 +0000
Subject: [PATCH 02/10] more guidance

---
 docs/recipes/train-llama-8b.md | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)
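
Reviewer note: this patch lowers `base` and `minimum` for continued pretraining. To see how the schedule behaves over the run, here is a minimal sketch of a textbook linear-warmup plus cosine-decay schedule. It is an assumption that Fast-LLM's `cosine` decay style matches this form exactly (for instance, whether decay is measured from step 0 or from the end of warmup); the function name is hypothetical.

```python
import math


def learning_rate(
    step: int,
    base: float = 1.0e-4,
    minimum: float = 1.0e-5,
    warmup_iterations: int = 2000,
    decay_iterations: int = 100_000,
) -> float:
    """Linear warmup to `base`, then cosine decay down to `minimum`."""
    if step < warmup_iterations:
        return base * step / warmup_iterations
    # Fraction of the decay phase completed, clamped to 1 after decay ends.
    progress = min(1.0, (step - warmup_iterations) / (decay_iterations - warmup_iterations))
    return minimum + 0.5 * (base - minimum) * (1.0 + math.cos(math.pi * progress))


for step in (0, 1000, 2000, 51_000, 100_000):
    print(step, learning_rate(step))
```

With the values from this patch, the rate peaks at `1.0e-04` after warmup and ends at `1.0e-05`, one tenth of the peak.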

diff --git a/docs/recipes/train-llama-8b.md b/docs/recipes/train-llama-8b.md
index c7fe25096..7a49a3529 100644
--- a/docs/recipes/train-llama-8b.md
+++ b/docs/recipes/train-llama-8b.md
@@ -49,15 +49,15 @@ This is not much different from a pretraining config. We will:
     beta_1: 0.9
     beta_2: 0.95
     learning_rate:
-      base: 6.0e-04
-      minimum: 6.0e-05
+      base: 1.0e-04  # (3)!
+      minimum: 1.0e-05
       decay_style: cosine
       decay_iterations: 100_000
       warmup_iterations: 2000
-  pretrained:  # (3)!
+  pretrained:  # (4)!
     format: llama
     path: fast-llm-tutorial/pretrained-model
-    model_weights: yes
+    model_weights: yes  # (5)!
   model:
     base_model:
       transformer:
@@ -73,7 +73,9 @@ This is not much different from a pretraining config. We will:
 
     1.  A Llama model will be saved in Hugging Face format to `~/results` directory every 20,000 iterations.
     2.  Location of the dataset metadata file generated in Step 4.
-    3.  Config of the pretrained model. We load a `llama` model from the repository downloaded earlier.
+    3.  The learning-rate can be used to trade-off between learning and forgetting. A higher learning-rate will learn quickly on our new dataset but will cause forgetting. A lower learning-rate will instead retain more of the pretrained model's knowledge, but will slow down adapting to the new domain.
+    4.  Config of the pretrained model. We load a `llama` model from the repository downloaded earlier.
+    5.  This tells Fast-LLM to load the weights of the pretrained model. If we wanted to use Llama-3.1's configuration, but train from scratch, we could use the same config but set this to `no`.
 
 # Checkpoint usage
 Checkpoints will be saved regularly, and every 20k steps a checkpoint will be exported in the HF format.

From 963e2b2a0bb1ac414f3dc4e6848c3c7a025570dd Mon Sep 17 00:00:00 2001
From: Toolkit User <raymond.li@servicenow.com>
Date: Tue, 3 Dec 2024 21:32:57 +0000
Subject: [PATCH 03/10] swap

---
 docs/recipes/continue-training-llama-8b.md | 87 +++++++++++++++++++++-
 docs/recipes/train-llama-8b.md             | 85 +--------------------
 2 files changed, 87 insertions(+), 85 deletions(-)

diff --git a/docs/recipes/continue-training-llama-8b.md b/docs/recipes/continue-training-llama-8b.md
index 159be53e0..3e31c04f8 100644
--- a/docs/recipes/continue-training-llama-8b.md
+++ b/docs/recipes/continue-training-llama-8b.md
@@ -2,6 +2,89 @@
 title: Continual Pretraining of Llama 3.1 8B
 ---
 
-!!! warning
 
-    This recipe’s still in the oven. Check back soon for the full details!
+In this guide, we provide step-by-step instructions to do continued pretraining on The Stack with Llama 3.1-8B 🦙.
+
+# Preliminary steps
+- [Quick Start](quick-start.md)
+- [Data preparation](data-preparation.md)
+
+# Download the Pretrained Model
+Let's download Llama-3.1-8B:
+```bash
+git lfs install
+git clone https://huggingface.co/meta-llama/Llama-3.1-8B ./fast-llm-tutorial/pretrained-model
+```
+
+# Training
+The configuration is not much different from a regular pretraining config. We will:
+- specify the Llama 3.1 checkpoint to load. Fast-LLM will automatically infer the corresponding model architecture.
+- adapt some of the training parameters for our needs.
+- and that's it!
+
+  ```yaml
+  training:
+    train_iters: 100_000
+    logs:
+      interval: 10
+    validation:
+      iterations: 25
+      interval: 1000
+    checkpoint:
+      interval: 1000
+      keep: 5
+    test_iters: 0
+    export:  # (1)!
+      format: llama
+      interval: 20_000
+  batch:
+    micro_batch_size: 2
+    sequence_length: 4096
+    batch_size: 256
+  data:
+    format: file
+    path: fast-llm-tutorial/dataset.json  # (2)!
+    split: [99, 1, 0]  
+  optimizer:  
+    weight_decay: 0.1
+    beta_1: 0.9
+    beta_2: 0.95
+    learning_rate:
+      base: 1.0e-04  # (3)!
+      minimum: 1.0e-05
+      decay_style: cosine
+      decay_iterations: 100_000
+      warmup_iterations: 2000
+  pretrained:  # (4)!
+    format: llama
+    path: fast-llm-tutorial/pretrained-model
+    model_weights: yes  # (5)!
+  model:
+    base_model:
+      transformer:
+        use_flash_attention: yes
+      cross_entropy_impl: fused
+    multi_stage:
+      zero_stage: 2
+    distributed:
+      training_dtype: bf16  
+  run:
+    experiment_dir: fast-llm-tutorial/Llama-3.1-8B-cpt
+    ```
+
+    1.  A Llama model is exported in Hugging Face format every 20,000 iterations, under `export/llama/` in the experiment directory (for example `fast-llm-tutorial/Llama-3.1-8B-cpt/export/llama/20000/`).
+    2.  Location of the dataset metadata file generated in Step 4.
+    3.  The learning rate trades off learning against forgetting. A higher learning rate adapts faster to the new dataset but causes more forgetting. A lower learning rate retains more of the pretrained model's knowledge but adapts to the new domain more slowly.
+    4.  Config of the pretrained model. We load a `llama` model from the repository downloaded earlier.
+    5.  This tells Fast-LLM to load the weights of the pretrained model. If we wanted to use Llama-3.1's configuration, but train from scratch, we could use the same config but set this to `no`.
+
+# Checkpoint usage
+Checkpoints are saved every 1,000 iterations, and every 20,000 iterations a checkpoint is also exported in Hugging Face format.
+You can load an exported checkpoint in `transformers` just as you would the pretrained Llama 3.1 model, except that it should be stronger on data resembling the continued-pretraining corpus.
+
+```python
+from transformers import pipeline, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("fast-llm-tutorial/pretrained-model")
+pipe = pipeline("text-generation", model="fast-llm-tutorial/Llama-3.1-8B-cpt/export/llama/20000/", tokenizer=tokenizer)
+```
diff --git a/docs/recipes/train-llama-8b.md b/docs/recipes/train-llama-8b.md
index 7a49a3529..8bc182758 100644
--- a/docs/recipes/train-llama-8b.md
+++ b/docs/recipes/train-llama-8b.md
@@ -2,89 +2,8 @@
 title: Training Llama 3.1 8B
 ---
 
-In this guide, we provide step-by-step instructions to do continued pretraining on The Stack with Llama 3.1-8B 🦙.
 
-# Preliminary steps
-- [Quick Start](quick-start.md)
-- [Data preparation](data-preparation.md)
+!!! warning
 
-# Download the Pretrained Model
-Let's download Llama-3.1-8B:
-```bash
-git lfs install
-git clone https://huggingface.co/meta-llama/Llama-3.1-8B ./fast-llm-tutorial/pretrained-model
-```
+    Coming soon!
 
-# Training
-This is not much different from a pretraining config. We will:
-- specify the Llama3.1 checkpoint to load. Fast-LLm will automatically infer the corresponding model architecture.
-- adapt some of the training parameters for our needs.
-- and that's it!
-
-  ```yaml
-  training:
-    train_iters: 100_000
-    logs:
-      interval: 10
-    validation:
-      iterations: 25
-      interval: 1000
-    checkpoint:
-      interval: 1000
-      keep: 5
-    test_iters: 0
-    export:  # (1)!
-      format: llama
-      interval: 20_000
-  batch:
-    micro_batch_size: 2
-    sequence_length: 4096
-    batch_size: 256
-  data:
-    format: file
-    path: fast-llm-tutorial/dataset.json  # (2)!
-    split: [99, 1, 0]  
-  optimizer:  
-    weight_decay: 0.1
-    beta_1: 0.9
-    beta_2: 0.95
-    learning_rate:
-      base: 1.0e-04  # (3)!
-      minimum: 1.0e-05
-      decay_style: cosine
-      decay_iterations: 100_000
-      warmup_iterations: 2000
-  pretrained:  # (4)!
-    format: llama
-    path: fast-llm-tutorial/pretrained-model
-    model_weights: yes  # (5)!
-  model:
-    base_model:
-      transformer:
-        use_flash_attention: yes
-      cross_entropy_impl: fused
-    multi_stage:
-      zero_stage: 2
-    distributed:
-      training_dtype: bf16  
-  run:
-    experiment_dir: fast-llm-tutorial/Llama-3.1-8B-cpt
-    ```
-
-    1.  A Llama model will be saved in Hugging Face format to `~/results` directory every 20,000 iterations.
-    2.  Location of the dataset metadata file generated in Step 4.
-    3.  The learning-rate can be used to trade-off between learning and forgetting. A higher learning-rate will learn quickly on our new dataset but will cause forgetting. A lower learning-rate will instead retain more of the pretrained model's knowledge, but will slow down adapting to the new domain.
-    4.  Config of the pretrained model. We load a `llama` model from the repository downloaded earlier.
-    5.  This tells Fast-LLM to load the weights of the pretrained model. If we wanted to use Llama-3.1's configuration, but train from scratch, we could use the same config but set this to `no`.
-
-# Checkpoint usage
-Checkpoints will be saved regularly, and every 20k steps a checkpoint will be exported in the HF format.
-You can use it in `transformers` as you would use the pretrained Llama 3.1 model, except this one will be much stronger at {lang}!
-
-```python
-from transformers import pipeline, AutoTokenizer
-
-tokenizer = AutoTokenizer.from_pretrained("fast-llm-tutorial/pretrained-model")
-pipe = pipeline("text-generation", model="fast-llm-tutorial/Llama-3.1-8B-cpt/export/llama/20000/", tokenizer=tokenizer)
-
-```

From e5e886ebf585494852763b13a359cabde589b333 Mon Sep 17 00:00:00 2001
From: Toolkit User <raymond.li@servicenow.com>
Date: Tue, 3 Dec 2024 22:20:23 +0000
Subject: [PATCH 04/10] add training from scratch

---
 docs/recipes/train-llama-8b.md | 106 ++++++++++++++++++++++++++++++++-
 1 file changed, 104 insertions(+), 2 deletions(-)
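
Reviewer note: a quick way to sanity-check the from-scratch architecture in this patch is to count its parameters. The sketch below is a back-of-the-envelope estimate, not Fast-LLM code. It assumes a gated MLP with three weight matrices, no linear biases, weight-only RMSNorm (two per layer plus a final one), and untied input and output embeddings; the function name is hypothetical.

```python
def count_parameters(
    vocab_size: int,
    hidden_size: int,
    ffn_hidden_size: int,
    num_layers: int,
    num_attention_heads: int,
    head_groups: int,
    kv_channels: int,
    tie_word_embeddings: bool = False,
) -> int:
    """Estimate the parameter count of a Llama-style decoder without biases."""
    query_width = num_attention_heads * kv_channels
    key_value_width = head_groups * kv_channels  # grouped-query attention
    attention = (
        hidden_size * query_width  # query projection
        + 2 * hidden_size * key_value_width  # key and value projections
        + query_width * hidden_size  # output projection
    )
    mlp = 3 * hidden_size * ffn_hidden_size  # gate, up and down projections
    norms = 2 * hidden_size  # two RMSNorm weights per layer
    embeddings = vocab_size * hidden_size * (1 if tie_word_embeddings else 2)
    return embeddings + num_layers * (attention + mlp + norms) + hidden_size


# Values from the from-scratch configuration in this patch.
llama_like = count_parameters(
    vocab_size=128256,
    hidden_size=4096,
    ffn_hidden_size=14336,
    num_layers=32,
    num_attention_heads=32,
    head_groups=8,
    kv_channels=128,
)
print(f"{llama_like:,}")
```

Under these assumptions the total comes to about 8.03 billion parameters, consistent with the "8B" in the model name.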

diff --git a/docs/recipes/train-llama-8b.md b/docs/recipes/train-llama-8b.md
index 8bc182758..60022e1dc 100644
--- a/docs/recipes/train-llama-8b.md
+++ b/docs/recipes/train-llama-8b.md
@@ -2,8 +2,110 @@
 title: Training Llama 3.1 8B
 ---
 
+Follow this guide to train a Llama-3.1 like model from scratch!
 
-!!! warning
 
-    Coming soon!
+# Preliminary steps
+- [Quick Start](quick-start.md)
+- [Data preparation](data-preparation.md)
+
+
+# Training configuration
+In this guide, we show you how to configure a model architecture and train a model from scratch.
+Let's start from the following training configuration:
+
+    ```yaml
+    training:
+      train_iters: 100_000
+      logs:
+        interval: 10
+      validation:
+        iterations: 25
+        interval: 1000
+      checkpoint:
+        interval: 1000
+        keep: 5
+      test_iters: 0
+      export:
+        format: llama
+        interval: 20_000
+    batch:
+      micro_batch_size: 4
+      sequence_length: 4096
+      batch_size: 480
+    data:
+      format: file
+      path: fast-llm-tutorial/dataset/fast_llm_dataset.json
+      split: [99, 1, 0]
+    optimizer:
+      weight_decay: 0.1
+      beta_1: 0.9
+      beta_2: 0.95
+      learning_rate:
+        base: 6.0e-04
+        minimum: 6.0e-05
+        decay_style: cosine
+        decay_iterations: 100_000
+        warmup_iterations: 2000
+    model:
+      base_model:
+        cross_entropy_impl: fused
+      multi_stage:
+        zero_stage: 2
+      distributed:
+        training_dtype: bf16
+    run:
+      experiment_dir: fast-llm-tutorial/experiment
+    ```
+This configuration will not work because it misses important arguments to define model architecture.
+There are 2 ways of instantiating our Llama-3.1-8B model. We could use a pretrained model config, or define the model architecture ourselves.
+
+=== "Pretrained configuration"
+    This step is similar to what is done in the [Quick Start guide](quick-start.md).
+    First download the model configuration:
+    ```bash
+    git lfs install
+    GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/meta-llama/Llama-3.1-8B ./fast-llm-tutorial/pretrained-model
+    ```
+    By specifying a pretrained model from the HuggingFace hub, Fast-LLM automatically converts the config to load the model.
+    **Only the configuration is loaded, not the weights**, because of `model_weights: no`.
+
+    ```yaml
+    pretrained:
+      format: llama  
+      path: fast-llm-tutorial/pretrained_model
+      model_weights: no 
+    ```
+
+=== "From-scratch configuration"
+      In this step, we specify the model architecture as follows:
+      
+      ```yaml
+      model:
+        base_model:
+          tie_word_embeddings: false
+          transformer:
+            activation_type: silu
+            add_linear_biases: false
+            ffn_hidden_size: 14336
+            gated: true
+            head_groups: 8
+            hidden_size: 4096  # (1)!
+            kv_channels: 128
+            normalization:
+              type: rms_norm
+            num_attention_heads: 32
+            num_layers: 32
+            rotary:
+              scaling_type: llama3
+            rotary_embedding_scale: -13.122363377404328  # (2)!
+            use_rotary_embeddings: true
+          use_position_embeddings: false
+          vocab_size: 128256
+      ```
+
+      1.  Hidden-size/num-layers will be used to provide good defaults for weight initialization std.
+      2.  -ln(500_000)
+
+      Configuring the model this way is a bit more verbose than using the pretrained configuration, but gives you an idea of how you configure a Llama-3.1-8B-like model with Fast-LLM.
 

From 303fcb55bfc46590cf32fc56f81904f95cf0f465 Mon Sep 17 00:00:00 2001
From: Toolkit User <raymond.li@servicenow.com>
Date: Tue, 3 Dec 2024 22:22:04 +0000
Subject: [PATCH 05/10] reorder

---
 docs/recipes/train-llama-8b.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/recipes/train-llama-8b.md b/docs/recipes/train-llama-8b.md
index 60022e1dc..77015b945 100644
--- a/docs/recipes/train-llama-8b.md
+++ b/docs/recipes/train-llama-8b.md
@@ -84,6 +84,8 @@ There are 2 ways of instantiating our Llama-3.1-8B model. We could use a pretrai
       model:
         base_model:
           tie_word_embeddings: false
+          use_position_embeddings: false
+          vocab_size: 128256
           transformer:
             activation_type: silu
             add_linear_biases: false
@@ -100,8 +102,6 @@ There are 2 ways of instantiating our Llama-3.1-8B model. We could use a pretrai
               scaling_type: llama3
             rotary_embedding_scale: -13.122363377404328  # (2)!
             use_rotary_embeddings: true
-          use_position_embeddings: false
-          vocab_size: 128256
       ```
 
       1.  Hidden-size/num-layers will be used to provide good defaults for weight initialization std.

From ae3de628a16a622107ccb5a8320899f64858320e Mon Sep 17 00:00:00 2001
From: Toolkit User <raymond.li@servicenow.com>
Date: Tue, 3 Dec 2024 22:34:01 +0000
Subject: [PATCH 06/10] adjust

---
 docs/recipes/train-llama-8b.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/recipes/train-llama-8b.md b/docs/recipes/train-llama-8b.md
index 77015b945..ec697a9d7 100644
--- a/docs/recipes/train-llama-8b.md
+++ b/docs/recipes/train-llama-8b.md
@@ -30,9 +30,9 @@ Let's start from the following training configuration:
         format: llama
         interval: 20_000
     batch:
-      micro_batch_size: 4
+      micro_batch_size: 2
       sequence_length: 4096
-      batch_size: 480
+      batch_size: 256
     data:
       format: file
       path: fast-llm-tutorial/dataset/fast_llm_dataset.json
@@ -107,5 +107,5 @@ There are 2 ways of instantiating our Llama-3.1-8B model. We could use a pretrai
       1.  Hidden-size/num-layers will be used to provide good defaults for weight initialization std.
       2.  -ln(500_000)
 
-      Configuring the model this way is a bit more verbose than using the pretrained configuration, but gives you an idea of how you configure a Llama-3.1-8B-like model with Fast-LLM.
+      Configuring the model this way is a bit more verbose than using the pretrained configuration, but gives an idea of how to configure a Llama-3.1-8B-like model with Fast-LLM.
 

From f19ff8bcd1247eb2730d4f3f62aa4173c807a84c Mon Sep 17 00:00:00 2001
From: Toolkit User <raymond.li@servicenow.com>
Date: Wed, 11 Dec 2024 00:57:23 +0000
Subject: [PATCH 07/10] adjust

---
 docs/recipes/train-llama-8b.md | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)
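
Reviewer note: this patch replaces `rotary_embedding_scale: -13.122363377404328` (annotated as `-ln(500_000)`) with `theta: 500_000`. The sketch below checks that the two describe the same base rotary frequencies. It only covers the standard RoPE inverse-frequency formula; the extra frequency adjustment implied by the `llama3` rotary type is not modelled, and the function names are hypothetical.

```python
import math

THETA = 500_000
OLD_SCALE = -13.122363377404328  # value removed by this patch
KV_CHANNELS = 128  # per-head dimension from the config


def inv_freq_from_theta(theta: float, dim: int) -> list[float]:
    """Standard RoPE inverse frequencies: theta ** (-2i / dim)."""
    return [theta ** (-2 * i / dim) for i in range(dim // 2)]


def inv_freq_from_scale(scale: float, dim: int) -> list[float]:
    """Same frequencies written with a log-space scale: exp(scale * 2i / dim)."""
    return [math.exp(scale * 2 * i / dim) for i in range(dim // 2)]


from_theta = inv_freq_from_theta(THETA, KV_CHANNELS)
from_scale = inv_freq_from_scale(OLD_SCALE, KV_CHANNELS)
same = all(math.isclose(a, b, rel_tol=1e-9) for a, b in zip(from_theta, from_scale))
print(math.isclose(-math.log(THETA), OLD_SCALE, rel_tol=1e-12), same)
```

The `theta` form is easier to read and to compare against the Hugging Face `rope_theta` value.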

diff --git a/docs/recipes/train-llama-8b.md b/docs/recipes/train-llama-8b.md
index ec697a9d7..c33db394c 100644
--- a/docs/recipes/train-llama-8b.md
+++ b/docs/recipes/train-llama-8b.md
@@ -99,13 +99,11 @@ There are 2 ways of instantiating our Llama-3.1-8B model. We could use a pretrai
             num_attention_heads: 32
             num_layers: 32
             rotary:
-              scaling_type: llama3
-            rotary_embedding_scale: -13.122363377404328  # (2)!
-            use_rotary_embeddings: true
+              type: llama3
+              theta: 500_000
       ```
 
       1.  Hidden-size/num-layers will be used to provide good defaults for weight initialization std.
-      2.  -ln(500_000)
 
       Configuring the model this way is a bit more verbose than using the pretrained configuration, but gives an idea of how to configure a Llama-3.1-8B-like model with Fast-LLM.
 

From 1ba9982803a74507acceee2c1f5ea7e8f1e29d8f Mon Sep 17 00:00:00 2001
From: Denis Kocetkov <denis.kocetkov@servicenow.com>
Date: Tue, 4 Mar 2025 16:33:16 +0200
Subject: [PATCH 08/10] qwen recipes

---
 docs/recipes/continue-training-qwen-7b.md |  90 ++++++++++++++++++
 docs/recipes/train-qwen-7b.md             | 109 ++++++++++++++++++++++
 2 files changed, 199 insertions(+)
 create mode 100644 docs/recipes/continue-training-qwen-7b.md
 create mode 100644 docs/recipes/train-qwen-7b.md
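
Reviewer note: unlike the Llama recipe, the Qwen from-scratch configuration in this patch does not set `kv_channels`. The sketch below shows the attention widths that follow if the per-head dimension defaults to `hidden_size // num_attention_heads`. That default is an assumption about Fast-LLM, and the helper name is hypothetical.

```python
from typing import Optional


def attention_shapes(
    hidden_size: int,
    num_attention_heads: int,
    head_groups: int,
    kv_channels: Optional[int] = None,
) -> dict[str, int]:
    """Projection widths for grouped-query attention."""
    if kv_channels is None:
        # Assumed default: split the hidden size evenly across heads.
        kv_channels = hidden_size // num_attention_heads
    return {
        "kv_channels": kv_channels,
        "query_width": num_attention_heads * kv_channels,
        "key_value_width": head_groups * kv_channels,
        "queries_per_group": num_attention_heads // head_groups,
    }


# Values from the Qwen 2.5 7B from-scratch configuration.
qwen = attention_shapes(hidden_size=3584, num_attention_heads=28, head_groups=4)
# Values from the Llama 3.1 8B configuration, where kv_channels is explicit.
llama = attention_shapes(hidden_size=4096, num_attention_heads=32, head_groups=8, kv_channels=128)
print(qwen)
print(llama)
```

Both models end up with 128 channels per head; Qwen shares each key/value group across seven query heads, Llama across four.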

diff --git a/docs/recipes/continue-training-qwen-7b.md b/docs/recipes/continue-training-qwen-7b.md
new file mode 100644
index 000000000..7cc91ceaf
--- /dev/null
+++ b/docs/recipes/continue-training-qwen-7b.md
@@ -0,0 +1,90 @@
+---
+title: Continual Pretraining of Qwen 2.5 7B
+---
+
+
+In this guide, we provide step-by-step instructions to do continued pretraining on The Stack with Qwen 2.5 7B.
+
+# Preliminary steps
+- [Quick Start](quick-start.md)
+- [Data preparation](data-preparation.md)
+
+# Download the Pretrained Model
+Let's download Qwen 2.5 7B:
+```bash
+git lfs install
+git clone https://huggingface.co/Qwen/Qwen2.5-7B ./fast-llm-tutorial/pretrained-model
+```
+
+# Training
+The configuration is not much different from a regular pretraining config. We will:
+- specify the Qwen 2.5 checkpoint to load and the `qwen2` checkpoint format. Fast-LLM will automatically infer the corresponding model architecture.
+- adapt some of the training parameters for our needs.
+- and that's it!
+
+  ```yaml
+  training:
+    train_iters: 100_000
+    logs:
+      interval: 10
+    validation:
+      iterations: 25
+      interval: 1000
+    checkpoint:
+      interval: 1000
+      keep: 5
+    test_iters: 0
+    export:  # (1)!
+      format: qwen2
+      interval: 20_000
+  batch:
+    micro_batch_size: 1
+    sequence_length: 8192
+    batch_size: 256
+  data:
+    format: file
+    path: fast-llm-tutorial/dataset.json  # (2)!
+    split: [99, 1, 0]  
+  optimizer:  
+    weight_decay: 0.1
+    beta_1: 0.9
+    beta_2: 0.95
+    learning_rate:
+      base: 1.0e-04  # (3)!
+      minimum: 1.0e-05
+      decay_style: cosine
+      decay_iterations: 100_000
+      warmup_iterations: 2000
+  pretrained:  # (4)!
+    format: qwen2
+    path: fast-llm-tutorial/pretrained-model
+    model_weights: yes  # (5)!
+  model:
+    base_model:
+      transformer:
+        use_flash_attention: yes
+      cross_entropy_impl: fused
+    multi_stage:
+      zero_stage: 2
+    distributed:
+      training_dtype: bf16  
+  run:
+    experiment_dir: fast-llm-tutorial/qwen-2.5-7B-cpt
+    ```
+
+    1.  A Qwen model is exported in Hugging Face format every 20,000 iterations, under `export/qwen2/` in the experiment directory (for example `fast-llm-tutorial/qwen-2.5-7B-cpt/export/qwen2/20000/`).
+    2.  Location of the dataset metadata file generated in Step 4.
+    3.  The learning rate trades off learning against forgetting. A higher learning rate adapts faster to the new dataset but causes more forgetting. A lower learning rate retains more of the pretrained model's knowledge but adapts to the new domain more slowly.
+    4.  Config of the pretrained model. We load a `qwen2` model from the repository downloaded earlier.
+    5.  This tells Fast-LLM to load the weights of the pretrained model. If we wanted to use Qwen-2.5's configuration, but train from scratch, we could use the same config but set this to `no`.
+
+# Checkpoint usage
+Checkpoints are saved every 1,000 iterations, and every 20,000 iterations a checkpoint is also exported in Hugging Face format.
+You can load an exported checkpoint in `transformers` just as you would the pretrained Qwen 2.5 model, except that it should be stronger on data resembling the continued-pretraining corpus.
+
+```python
+from transformers import pipeline, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("fast-llm-tutorial/pretrained-model")
+pipe = pipeline("text-generation", model="fast-llm-tutorial/qwen-2.5-7B-cpt/export/qwen2/20000/", tokenizer=tokenizer)
+```
diff --git a/docs/recipes/train-qwen-7b.md b/docs/recipes/train-qwen-7b.md
new file mode 100644
index 000000000..cacaab52a
--- /dev/null
+++ b/docs/recipes/train-qwen-7b.md
@@ -0,0 +1,109 @@
+---
+title: Training Qwen 2.5 7B
+---
+
+Follow this guide to train a Qwen-2.5 like model from scratch!
+
+
+# Preliminary steps
+- [Quick Start](quick-start.md)
+- [Data preparation](data-preparation.md)
+
+
+# Training configuration
+In this guide, we show you how to configure a model architecture and train a model from scratch.
+Let's start from the following training configuration:
+
+    ```yaml
+    training:
+      train_iters: 100_000
+      logs:
+        interval: 10
+      validation:
+        iterations: 25
+        interval: 1000
+      checkpoint:
+        interval: 1000
+        keep: 5
+      test_iters: 0
+      export:
+        format: qwen2
+        interval: 20_000
+    batch:
+      micro_batch_size: 1
+      sequence_length: 8192
+      batch_size: 256
+    data:
+      format: file
+      path: fast-llm-tutorial/dataset/fast_llm_dataset.json
+      split: [99, 1, 0]
+    optimizer:
+      weight_decay: 0.1
+      beta_1: 0.9
+      beta_2: 0.95
+      learning_rate:
+        base: 6.0e-04
+        minimum: 6.0e-05
+        decay_style: cosine
+        decay_iterations: 100_000
+        warmup_iterations: 2000
+    model:
+      base_model:
+        cross_entropy_impl: fused
+      multi_stage:
+        zero_stage: 2
+      distributed:
+        training_dtype: bf16
+    run:
+      experiment_dir: fast-llm-tutorial/experiment
+    ```
+This configuration will not work because it misses important arguments to define model architecture.
+There are 2 ways of instantiating our Qwen-2.5-7B model. We could use a pretrained model config, or define the model architecture ourselves.
+
+=== "Pretrained configuration"
+    This step is similar to what is done in the [Quick Start guide](quick-start.md).
+    First download the model configuration:
+    ```bash
+    git lfs install
+    GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Qwen/Qwen2.5-7B ./fast-llm-tutorial/pretrained-model
+    ```
+    By specifying a pretrained model from the HuggingFace hub, Fast-LLM automatically converts the config to load the model.
+    **Only the configuration is loaded, not the weights**, because of `model_weights: no`.
+
+    ```yaml
+    pretrained:
+      format: qwen2  
+      path: fast-llm-tutorial/pretrained_model
+      model_weights: no 
+    ```
+
+=== "From-scratch configuration"
+      In this step, we specify the model architecture as follows:
+      
+      ```yaml
+      model:
+        base_model:
+          tie_word_embeddings: false
+          use_position_embeddings: false
+          vocab_size: 152064
+          transformer:
+            activation_type: silu
+            add_linear_biases: only_attn_qkv
+            ffn_hidden_size: 18944
+            gated: true
+            head_groups: 4
+            hidden_size: 3584  # (1)!
+            normalization:
+              type: rms_norm
+              epsilon: 1e-06
+            num_attention_heads: 28
+            num_layers: 28
+            rotary:
+              type: default
+              theta: 1_000_000
+      ```
+
+      1.  Hidden-size/num-layers will be used to provide good defaults for weight initialization std.
+
+      Configuring the model this way is a bit more verbose than using the pretrained configuration, but gives an idea of how to configure a Qwen-2.5-7B-like model with Fast-LLM.
+

From 8110495233d85ed5ef052875bd41a773d4508e74 Mon Sep 17 00:00:00 2001
From: Denis Kocetkov <denis.kocetkov@servicenow.com>
Date: Wed, 5 Mar 2025 16:27:43 +0200
Subject: [PATCH 09/10] common train md

---
 docs/recipes/train-llama-8b.md | 109 -------------------
 docs/recipes/train-qwen-7b.md  | 109 -------------------
 docs/recipes/train.md          | 189 +++++++++++++++++++++++++++++++++
 3 files changed, 189 insertions(+), 218 deletions(-)
 delete mode 100644 docs/recipes/train-llama-8b.md
 delete mode 100644 docs/recipes/train-qwen-7b.md
 create mode 100644 docs/recipes/train.md

diff --git a/docs/recipes/train-llama-8b.md b/docs/recipes/train-llama-8b.md
deleted file mode 100644
index c33db394c..000000000
--- a/docs/recipes/train-llama-8b.md
+++ /dev/null
@@ -1,109 +0,0 @@
----
-title: Training Llama 3.1 8B
----
-
-Follow this guide to train a Llama-3.1 like model from scratch!
-
-
-# Preliminary steps
-- [Quick Start](quick-start.md)
-- [Data preparation](data-preparation.md)
-
-
-# Training configuration
-In this guide, we show you how to configure a model architecture and train a model from scratch.
-Let's start from the following training configuration:
-
-    ```yaml
-    training:
-      train_iters: 100_000
-      logs:
-        interval: 10
-      validation:
-        iterations: 25
-        interval: 1000
-      checkpoint:
-        interval: 1000
-        keep: 5
-      test_iters: 0
-      export:
-        format: llama
-        interval: 20_000
-    batch:
-      micro_batch_size: 2
-      sequence_length: 4096
-      batch_size: 256
-    data:
-      format: file
-      path: fast-llm-tutorial/dataset/fast_llm_dataset.json
-      split: [99, 1, 0]
-    optimizer:
-      weight_decay: 0.1
-      beta_1: 0.9
-      beta_2: 0.95
-      learning_rate:
-        base: 6.0e-04
-        minimum: 6.0e-05
-        decay_style: cosine
-        decay_iterations: 100_000
-        warmup_iterations: 2000
-    model:
-      base_model:
-        cross_entropy_impl: fused
-      multi_stage:
-        zero_stage: 2
-      distributed:
-        training_dtype: bf16
-    run:
-      experiment_dir: fast-llm-tutorial/experiment
-    ```
-This configuration will not work because it misses important arguments to define model architecture.
-There are 2 ways of instantiating our Llama-3.1-8B model. We could use a pretrained model config, or define the model architecture ourselves.
-
-=== "Pretrained configuration"
-    This step is similar to what is done in the [Quick Start guide](quick-start.md).
-    First download the model configuration:
-    ```bash
-    git lfs install
-    GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/meta-llama/Llama-3.1-8B ./fast-llm-tutorial/pretrained-model
-    ```
-    By specifying a pretrained model from the HuggingFace hub, Fast-LLM automatically converts the config to load the model.
-    **Only the configuration is loaded, not the weights**, because of `model_weights: no`.
-
-    ```yaml
-    pretrained:
-      format: llama  
-      path: fast-llm-tutorial/pretrained_model
-      model_weights: no 
-    ```
-
-=== "From-scratch configuration"
-      In this step, we specify the model architecture as follows:
-      
-      ```yaml
-      model:
-        base_model:
-          tie_word_embeddings: false
-          use_position_embeddings: false
-          vocab_size: 128256
-          transformer:
-            activation_type: silu
-            add_linear_biases: false
-            ffn_hidden_size: 14336
-            gated: true
-            head_groups: 8
-            hidden_size: 4096  # (1)!
-            kv_channels: 128
-            normalization:
-              type: rms_norm
-            num_attention_heads: 32
-            num_layers: 32
-            rotary:
-              type: llama3
-              theta: 500_000
-      ```
-
-      1.  Hidden-size/num-layers will be used to provide good defaults for weight initialization std.
-
-      Configuring the model this way is a bit more verbose than using the pretrained configuration, but gives an idea of how to configure a Llama-3.1-8B-like model with Fast-LLM.
-
diff --git a/docs/recipes/train-qwen-7b.md b/docs/recipes/train-qwen-7b.md
deleted file mode 100644
index cacaab52a..000000000
--- a/docs/recipes/train-qwen-7b.md
+++ /dev/null
@@ -1,109 +0,0 @@
----
-title: Training Qwen 2.5 7B
----
-
-Follow this guide to train a Qwen-2.5 like model from scratch!
-
-
-# Preliminary steps
-- [Quick Start](quick-start.md)
-- [Data preparation](data-preparation.md)
-
-
-# Training configuration
-In this guide, we show you how to configure a model architecture and train a model from scratch.
-Let's start from the following training configuration:
-
-    ```yaml
-    training:
-      train_iters: 100_000
-      logs:
-        interval: 10
-      validation:
-        iterations: 25
-        interval: 1000
-      checkpoint:
-        interval: 1000
-        keep: 5
-      test_iters: 0
-      export:
-        format: qwen2
-        interval: 20_000
-    batch:
-      micro_batch_size: 1
-      sequence_length: 8192
-      batch_size: 256
-    data:
-      format: file
-      path: fast-llm-tutorial/dataset/fast_llm_dataset.json
-      split: [99, 1, 0]
-    optimizer:
-      weight_decay: 0.1
-      beta_1: 0.9
-      beta_2: 0.95
-      learning_rate:
-        base: 6.0e-04
-        minimum: 6.0e-05
-        decay_style: cosine
-        decay_iterations: 100_000
-        warmup_iterations: 2000
-    model:
-      base_model:
-        cross_entropy_impl: fused
-      multi_stage:
-        zero_stage: 2
-      distributed:
-        training_dtype: bf16
-    run:
-      experiment_dir: fast-llm-tutorial/experiment
-    ```
-This configuration will not work because it misses important arguments to define model architecture.
-There are 2 ways of instantiating our Qwen-2.5-7B model. We could use a pretrained model config, or define the model architecture ourselves.
-
-=== "Pretrained configuration"
-    This step is similar to what is done in the [Quick Start guide](quick-start.md).
-    First download the model configuration:
-    ```bash
-    git lfs install
-    GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Qwen/Qwen2.5-7B ./fast-llm-tutorial/pretrained-model
-    ```
-    By specifying a pretrained model from the HuggingFace hub, Fast-LLM automatically converts the config to load the model.
-    **Only the configuration is loaded, not the weights**, because of `model_weights: no`.
-
-    ```yaml
-    pretrained:
-      format: qwen2  
-      path: fast-llm-tutorial/pretrained_model
-      model_weights: no 
-    ```
-
-=== "From-scratch configuration"
-      In this step, we specify the model architecture as follows:
-      
-      ```yaml
-      model:
-        base_model:
-          tie_word_embeddings: false
-          use_position_embeddings: false
-          vocab_size: 152064
-          transformer:
-            activation_type: silu
-            add_linear_biases: only_attn_qkv
-            ffn_hidden_size: 18944
-            gated: true
-            head_groups: 4
-            hidden_size: 3584  # (1)!
-            normalization:
-              type: rms_norm
-              epsilon: 1e-06
-            num_attention_heads: 28
-            num_layers: 28
-            rotary:
-              type: default
-              theta: 1_000_000
-      ```
-
-      1.  Hidden-size/num-layers will be used to provide good defaults for weight initialization std.
-
-      Configuring the model this way is a bit more verbose than using the pretrained configuration, but gives an idea of how to configure a Qwen-2.5-7B-like model with Fast-LLM.
-
diff --git a/docs/recipes/train.md b/docs/recipes/train.md
new file mode 100644
index 000000000..c731be4cd
--- /dev/null
+++ b/docs/recipes/train.md
@@ -0,0 +1,189 @@
+---
+title: Training Llama 3.1 8B
+---
+
+Follow this guide to train a Llama 3.1 8B-like or Qwen 2.5 7B-like model from scratch!
+
+
+# Preliminary steps
+- [Quick Start](quick-start.md)
+- [Data preparation](data-preparation.md)
+
+
+# Training configuration
+In this guide, we show you how to configure a model architecture and train a model from scratch.
+Let's start from the following training configuration:
+=== "Llama 3.1 8B"
+    ```yaml
+    training:
+      train_iters: 100_000
+      logs:
+        interval: 10
+      validation:
+        iterations: 25
+        interval: 1000
+      checkpoint:
+        interval: 1000
+        keep: 5
+      test_iters: 0
+      export:
+        format: llama
+        interval: 20_000
+    batch:
+      micro_batch_size: 2
+      sequence_length: 4096
+      batch_size: 256
+    data:
+      format: file
+      path: fast-llm-tutorial/dataset/fast_llm_dataset.json
+      split: [99, 1, 0]
+    optimizer:
+      weight_decay: 0.1
+      beta_1: 0.9
+      beta_2: 0.95
+      learning_rate:
+        base: 6.0e-04
+        minimum: 6.0e-05
+        decay_style: cosine
+        decay_iterations: 100_000
+        warmup_iterations: 2000
+    model:
+      base_model:
+        cross_entropy_impl: fused
+      multi_stage:
+        zero_stage: 2
+      distributed:
+        training_dtype: bf16
+    run:
+      experiment_dir: fast-llm-tutorial/experiment
+    ```
+=== "Qwen 2.5 7B"
+    ```yaml
+    training:
+      train_iters: 100_000
+      logs:
+        interval: 10
+      validation:
+        iterations: 25
+        interval: 1000
+      checkpoint:
+        interval: 1000
+        keep: 5
+      test_iters: 0
+      export:
+        format: qwen2
+        interval: 20_000
+    batch:
+      micro_batch_size: 1
+      sequence_length: 8192
+      batch_size: 256
+    data:
+      format: file
+      path: fast-llm-tutorial/dataset/fast_llm_dataset.json
+      split: [99, 1, 0]
+    optimizer:
+      weight_decay: 0.1
+      beta_1: 0.9
+      beta_2: 0.95
+      learning_rate:
+        base: 6.0e-04
+        minimum: 6.0e-05
+        decay_style: cosine
+        decay_iterations: 100_000
+        warmup_iterations: 2000
+    model:
+      base_model:
+        cross_entropy_impl: fused
+      multi_stage:
+        zero_stage: 2
+      distributed:
+        training_dtype: bf16
+    run:
+      experiment_dir: fast-llm-tutorial/experiment
+    ```
+
+This configuration will not work because it misses important arguments to define model architecture.
+There are 2 ways of instantiating our a model.
+
+We could use a pretrained model config. This step is similar to what is done in the [Quick Start guide](quick-start.md).
+First download the model configuration:
+=== "Llama 3.1 8B"
+    ```bash
+    git lfs install
+    GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/meta-llama/Llama-3.1-8B ./fast-llm-tutorial/pretrained-model
+    ```
+=== "Qwen 2.5 7B"
+    ```bash
+    git lfs install
+    GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Qwen/Qwen2.5-7B ./fast-llm-tutorial/pretrained-model
+    ```
+
+By specifying a pretrained model from the HuggingFace hub, Fast-LLM automatically converts the config to load the model.
+**Only the configuration is loaded, not the weights**, because of `model_weights: no`.
+=== "Llama 3.1 8B"
+    ```yaml
+    pretrained:
+      format: llama  
+      path: fast-llm-tutorial/pretrained_model
+      model_weights: no 
+    ```
+=== "Qwen 2.5 7B"
+    ```yaml
+    pretrained:
+      format: qwen2  
+      path: fast-llm-tutorial/pretrained_model
+      model_weights: no 
+    ```
+
+Alternatively, we define the model architecture ourselves as follows:
+=== "Llama 3.1 8B"
+      ```yaml
+      model:
+        base_model:
+          tie_word_embeddings: false
+          use_position_embeddings: false
+          vocab_size: 128256
+          transformer:
+            activation_type: silu
+            add_linear_biases: false
+            ffn_hidden_size: 14336
+            gated: true
+            head_groups: 8
+            hidden_size: 4096  # (1)!
+            kv_channels: 128
+            normalization:
+              type: rms_norm
+            num_attention_heads: 32
+            num_layers: 32
+            rotary:
+              type: llama3
+              theta: 500_000
+      ```
+=== "Qwen 2.5 7B"
+      ```yaml
+      model:
+        base_model:
+          tie_word_embeddings: false
+          use_position_embeddings: false
+          vocab_size: 152064
+          transformer:
+            activation_type: silu
+            add_linear_biases: only_attn_qkv
+            ffn_hidden_size: 18944
+            gated: true
+            head_groups: 4
+            hidden_size: 3584  # (1)!
+            normalization:
+              type: rms_norm
+              epsilon: 1e-06
+            num_attention_heads: 28
+            num_layers: 28
+            rotary:
+              type: default
+              theta: 1_000_000
+      ```
+
+1.  Hidden-size/num-layers will be used to provide good defaults for weight initialization std.
+
+Configuring the model this way is a bit more verbose than using the pretrained configuration, but it gives an idea of how to configure the model with Fast-LLM.
+

From 970db40f4e2856c99f087d56971b062c3e174510 Mon Sep 17 00:00:00 2001
From: Denis Kocetkov <denis.kocetkov@servicenow.com>
Date: Wed, 5 Mar 2025 16:56:59 +0200
Subject: [PATCH 10/10] common continues training md, text and path fixes

---
 docs/help.md                               |   2 +-
 docs/index.md                              |   2 +-
 docs/recipes/continue-training-llama-8b.md |  90 ------------
 docs/recipes/continue-training-qwen-7b.md  |  90 ------------
 docs/recipes/continue-training.md          | 153 +++++++++++++++++++++
 docs/recipes/train.md                      |   4 +-
 mkdocs.yaml                                |   4 +-
 7 files changed, 159 insertions(+), 186 deletions(-)
 delete mode 100644 docs/recipes/continue-training-llama-8b.md
 delete mode 100644 docs/recipes/continue-training-qwen-7b.md
 create mode 100644 docs/recipes/continue-training.md

diff --git a/docs/help.md b/docs/help.md
index e41354603..ed59dffa7 100644
--- a/docs/help.md
+++ b/docs/help.md
@@ -47,7 +47,7 @@ We've got some excellent tutorials to help you get the most out of Fast-LLM:
 
 -   [**Quick-Start Guide**](quick-start.md): Perfect for launching Fast-LLM on a single GPU machine. We walk you through running your first training job (either locally or on a cluster), and handling common issues.
 
--   [**Cookbook**](recipes/train-llama-8b.md): Ready to go big? These recipes cover real-world scenarios like training big models from scratch, continuing training from a checkpoint, and more. This is where Fast-LLM really shows its power.
+-   [**Cookbook**](recipes/train.md): Ready to go big? These recipes cover real-world scenarios like training big models from scratch, continuing training from a checkpoint, and more. This is where Fast-LLM really shows its power.
 
 ---
 
diff --git a/docs/index.md b/docs/index.md
index e4dcafde9..80277ffd2 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -6,7 +6,7 @@ hide:
 
 Introducing **Fast-LLM**, the cutting-edge open-source library built for training large language models (LLMs) with **unmatched speed, scalability, and cost-efficiency**. Developed by [ServiceNow Research](https://www.servicenow.com/research/)'s Foundation Models Lab, Fast-LLM is engineered to meet the rigorous demands of professional AI researchers, AI/ML engineers, academic and industrial research institutions, and enterprise product development teams pushing the limits of generative AI. **Achieve groundbreaking research and high-stakes production goals faster with Fast-LLM.**
 
-[Start your journey with Fast-LLM](quick-start.md) and explore the future of LLM training. Dive into [real-world use cases](recipes/train-llama-8b.md) to see how Fast-LLM can elevate your training workflows.
+[Start your journey with Fast-LLM](quick-start.md) and explore the future of LLM training. Dive into [real-world use cases](recipes/train.md) to see how Fast-LLM can elevate your training workflows.
 
 ## Why Fast-LLM?
 
diff --git a/docs/recipes/continue-training-llama-8b.md b/docs/recipes/continue-training-llama-8b.md
deleted file mode 100644
index 3e31c04f8..000000000
--- a/docs/recipes/continue-training-llama-8b.md
+++ /dev/null
@@ -1,90 +0,0 @@
----
-title: Continual Pretraining of Llama 3.1 8B
----
-
-
-In this guide, we provide step-by-step instructions to do continued pretraining on The Stack with Llama 3.1-8B 🦙.
-
-# Preliminary steps
-- [Quick Start](quick-start.md)
-- [Data preparation](data-preparation.md)
-
-# Download the Pretrained Model
-Let's download Llama-3.1-8B:
-```bash
-git lfs install
-git clone https://huggingface.co/meta-llama/Llama-3.1-8B ./fast-llm-tutorial/pretrained-model
-```
-
-# Training
-This is not much different from a pretraining config. We will:
-- specify the Llama3.1 checkpoint to load. Fast-LLm will automatically infer the corresponding model architecture.
-- adapt some of the training parameters for our needs.
-- and that's it!
-
-  ```yaml
-  training:
-    train_iters: 100_000
-    logs:
-      interval: 10
-    validation:
-      iterations: 25
-      interval: 1000
-    checkpoint:
-      interval: 1000
-      keep: 5
-    test_iters: 0
-    export:  # (1)!
-      format: llama
-      interval: 20_000
-  batch:
-    micro_batch_size: 2
-    sequence_length: 4096
-    batch_size: 256
-  data:
-    format: file
-    path: fast-llm-tutorial/dataset.json  # (2)!
-    split: [99, 1, 0]  
-  optimizer:  
-    weight_decay: 0.1
-    beta_1: 0.9
-    beta_2: 0.95
-    learning_rate:
-      base: 1.0e-04  # (3)!
-      minimum: 1.0e-05
-      decay_style: cosine
-      decay_iterations: 100_000
-      warmup_iterations: 2000
-  pretrained:  # (4)!
-    format: llama
-    path: fast-llm-tutorial/pretrained-model
-    model_weights: yes  # (5)!
-  model:
-    base_model:
-      transformer:
-        use_flash_attention: yes
-      cross_entropy_impl: fused
-    multi_stage:
-      zero_stage: 2
-    distributed:
-      training_dtype: bf16  
-  run:
-    experiment_dir: fast-llm-tutorial/Llama-3.1-8B-cpt
-    ```
-
-    1.  A Llama model will be saved in Hugging Face format to `~/results` directory every 20,000 iterations.
-    2.  Location of the dataset metadata file generated in Step 4.
-    3.  The learning-rate can be used to trade-off between learning and forgetting. A higher learning-rate will learn quickly on our new dataset but will cause forgetting. A lower learning-rate will instead retain more of the pretrained model's knowledge, but will slow down adapting to the new domain.
-    4.  Config of the pretrained model. We load a `llama` model from the repository downloaded earlier.
-    5.  This tells Fast-LLM to load the weights of the pretrained model. If we wanted to use Llama-3.1's configuration, but train from scratch, we could use the same config but set this to `no`.
-
-# Checkpoint usage
-Checkpoints will be saved regularly, and every 20k steps a checkpoint will be exported in the HF format.
-You can use it in `transformers` as you would use the pretrained Llama 3.1 model, except this one will be much stronger at {lang}!
-
-```python
-from transformers import pipeline, AutoTokenizer
-
-tokenizer = AutoTokenizer.from_pretrained("fast-llm-tutorial/pretrained-model")
-pipe = pipeline("text-generation", model="fast-llm-tutorial/Llama-3.1-8B-cpt/export/llama/20000/", tokenizer=tokenizer)
-```
diff --git a/docs/recipes/continue-training-qwen-7b.md b/docs/recipes/continue-training-qwen-7b.md
deleted file mode 100644
index 7cc91ceaf..000000000
--- a/docs/recipes/continue-training-qwen-7b.md
+++ /dev/null
@@ -1,90 +0,0 @@
----
-title: Continual Pretraining of Qwen 2.5 7B
----
-
-
-In this guide, we provide step-by-step instructions to do continued pretraining on The Stack with Qwen 2.5 7B.
-
-# Preliminary steps
-- [Quick Start](quick-start.md)
-- [Data preparation](data-preparation.md)
-
-# Download the Pretrained Model
-Let's download Qwen 2.5 7B:
-```bash
-git lfs install
-git clone https://huggingface.co/Qwen/Qwen2.5-7B ./fast-llm-tutorial/pretrained-model
-```
-
-# Training
-This is not much different from a pretraining config. We will:
-- specify the Qwen 2.5 checkpoint to load and `qwen2` checkpoint format. Fast-LLm will automatically infer the corresponding model architecture.
-- adapt some of the training parameters for our needs.
-- and that's it!
-
-  ```yaml
-  training:
-    train_iters: 100_000
-    logs:
-      interval: 10
-    validation:
-      iterations: 25
-      interval: 1000
-    checkpoint:
-      interval: 1000
-      keep: 5
-    test_iters: 0
-    export:  # (1)!
-      format: qwen2
-      interval: 20_000
-  batch:
-    micro_batch_size: 1
-    sequence_length: 8192
-    batch_size: 256
-  data:
-    format: file
-    path: fast-llm-tutorial/dataset.json  # (2)!
-    split: [99, 1, 0]  
-  optimizer:  
-    weight_decay: 0.1
-    beta_1: 0.9
-    beta_2: 0.95
-    learning_rate:
-      base: 1.0e-04  # (3)!
-      minimum: 1.0e-05
-      decay_style: cosine
-      decay_iterations: 100_000
-      warmup_iterations: 2000
-  pretrained:  # (4)!
-    format: qwen2
-    path: fast-llm-tutorial/pretrained-model
-    model_weights: yes  # (5)!
-  model:
-    base_model:
-      transformer:
-        use_flash_attention: yes
-      cross_entropy_impl: fused
-    multi_stage:
-      zero_stage: 2
-    distributed:
-      training_dtype: bf16  
-  run:
-    experiment_dir: fast-llm-tutorial/qwen-2.5-7B-cpt
-    ```
-
-    1.  A Qwen model will be saved in Hugging Face format to `~/results` directory every 20,000 iterations.
-    2.  Location of the dataset metadata file generated in Step 4.
-    3.  The learning-rate can be used to trade-off between learning and forgetting. A higher learning-rate will learn quickly on our new dataset but will cause forgetting. A lower learning-rate will instead retain more of the pretrained model's knowledge, but will slow down adapting to the new domain.
-    4.  Config of the pretrained model. We load a `qwen2` model from the repository downloaded earlier.
-    5.  This tells Fast-LLM to load the weights of the pretrained model. If we wanted to use Qwen-2.5's configuration, but train from scratch, we could use the same config but set this to `no`.
-
-# Checkpoint usage
-Checkpoints will be saved regularly, and every 20k steps a checkpoint will be exported in the HF format.
-You can use it in `transformers` as you would use the pretrained Qwen 2.5 model, except this one will be much stronger at {lang}!
-
-```python
-from transformers import pipeline, AutoTokenizer
-
-tokenizer = AutoTokenizer.from_pretrained("fast-llm-tutorial/pretrained-model")
-pipe = pipeline("text-generation", model="fast-llm-tutorial/qwen-2.5-7B-cpt/export/qwen2/20000/", tokenizer=tokenizer)
-```
diff --git a/docs/recipes/continue-training.md b/docs/recipes/continue-training.md
new file mode 100644
index 000000000..8ea36ebdc
--- /dev/null
+++ b/docs/recipes/continue-training.md
@@ -0,0 +1,153 @@
+---
+title: Continual Pretraining of Llama 3.1 8B or Qwen 2.5 7B
+---
+
+
+In this guide, we provide step-by-step instructions to do continued pretraining on The Stack with Llama 3.1 8B or Qwen 2.5 7B models.
+
+# Preliminary steps
+- [Quick Start](../quick-start.md)
+- [Data preparation](data-preparation.md)
+
+# Download the Pretrained Model
+Let's download the model first:
+=== "Llama 3.1 8B"
+    ```bash
+    git lfs install
+    git clone https://huggingface.co/meta-llama/Llama-3.1-8B ./fast-llm-tutorial/pretrained-model
+    ```
+=== "Qwen 2.5 7B"
+    ```bash
+    git lfs install
+    git clone https://huggingface.co/Qwen/Qwen2.5-7B ./fast-llm-tutorial/pretrained-model
+    ```
+
+# Training
+This is not much different from a pretraining config. We will:
+- specify the model checkpoint to load and its format. Fast-LLM will automatically infer the corresponding model architecture.
+- adapt some of the training parameters for our needs.
+- and that's it!
+=== "Llama 3.1 8B"
+    ```yaml
+    training:
+      train_iters: 100_000
+      logs:
+        interval: 10
+      validation:
+        iterations: 25
+        interval: 1000
+      checkpoint:
+        interval: 1000
+        keep: 5
+      test_iters: 0
+      export:  # (1)!
+        format: llama
+        interval: 20_000
+    batch:
+      micro_batch_size: 2
+      sequence_length: 4096
+      batch_size: 256
+    data:
+      format: file
+      path: fast-llm-tutorial/dataset.json  # (2)!
+      split: [99, 1, 0]  
+    optimizer:  
+      weight_decay: 0.1
+      beta_1: 0.9
+      beta_2: 0.95
+      learning_rate:
+        base: 1.0e-04  # (3)!
+        minimum: 1.0e-05
+        decay_style: cosine
+        decay_iterations: 100_000
+        warmup_iterations: 2000
+    pretrained:  # (4)!
+      format: llama
+      path: fast-llm-tutorial/pretrained-model
+      model_weights: yes  # (5)!
+    model:
+      base_model:
+        transformer:
+          use_flash_attention: yes
+        cross_entropy_impl: fused
+      multi_stage:
+        zero_stage: 2
+      distributed:
+        training_dtype: bf16  
+    run:
+      experiment_dir: fast-llm-tutorial/Llama-3.1-8B-cpt
+    ```
+=== "Qwen 2.5 7B"
+    ```yaml
+    training:
+      train_iters: 100_000
+      logs:
+        interval: 10
+      validation:
+        iterations: 25
+        interval: 1000
+      checkpoint:
+        interval: 1000
+        keep: 5
+      test_iters: 0
+      export:  # (1)!
+        format: qwen2
+        interval: 20_000
+    batch:
+      micro_batch_size: 1
+      sequence_length: 8192
+      batch_size: 256
+    data:
+      format: file
+      path: fast-llm-tutorial/dataset.json  # (2)!
+      split: [99, 1, 0]  
+    optimizer:  
+      weight_decay: 0.1
+      beta_1: 0.9
+      beta_2: 0.95
+      learning_rate:
+        base: 1.0e-04  # (3)!
+        minimum: 1.0e-05
+        decay_style: cosine
+        decay_iterations: 100_000
+        warmup_iterations: 2000
+    pretrained:  # (4)!
+      format: qwen2
+      path: fast-llm-tutorial/pretrained-model
+      model_weights: yes  # (5)!
+    model:
+      base_model:
+        transformer:
+          use_flash_attention: yes
+        cross_entropy_impl: fused
+      multi_stage:
+        zero_stage: 2
+      distributed:
+        training_dtype: bf16  
+    run:
+      experiment_dir: fast-llm-tutorial/qwen-2.5-7B-cpt
+    ```
+
+1.  The model will be saved in Hugging Face format to the `~/results` directory every 20,000 iterations.
+2.  Location of the dataset metadata file generated in Step 4.
+3.  The learning rate can be used to trade off learning against forgetting. A higher learning rate adapts quickly to the new dataset but causes more forgetting. A lower learning rate retains more of the pretrained model's knowledge but slows adaptation to the new domain.
+4.  Config of the pretrained model. We load the model downloaded from the repository earlier.
+5.  This tells Fast-LLM to load the weights of the pretrained model. If we wanted to use the model's configuration, but train from scratch, we could use the same config but set this to `no`.
+
+# Checkpoint usage
+Checkpoints will be saved regularly, and every 20k steps a checkpoint will be exported in the HF format.
+You can use it in `transformers` as you would use the pretrained model, except this one should be stronger on programming languages!
+=== "Llama 3.1 8B"
+    ```python
+    from transformers import pipeline, AutoTokenizer
+
+    tokenizer = AutoTokenizer.from_pretrained("fast-llm-tutorial/pretrained-model")
+    pipe = pipeline("text-generation", model="fast-llm-tutorial/Llama-3.1-8B-cpt/export/llama/20000/", tokenizer=tokenizer)
+    ```
+=== "Qwen 2.5 7B"
+    ```python
+    from transformers import pipeline, AutoTokenizer
+
+    tokenizer = AutoTokenizer.from_pretrained("fast-llm-tutorial/pretrained-model")
+    pipe = pipeline("text-generation", model="fast-llm-tutorial/qwen-2.5-7B-cpt/export/qwen2/20000/", tokenizer=tokenizer)
+    ```
\ No newline at end of file
diff --git a/docs/recipes/train.md b/docs/recipes/train.md
index c731be4cd..4b59ab6ca 100644
--- a/docs/recipes/train.md
+++ b/docs/recipes/train.md
@@ -6,7 +6,7 @@ Follow this guide to train a Llama-3.1 or Qwen 2.5 7B like model from scratch!
 
 
 # Preliminary steps
-- [Quick Start](quick-start.md)
+- [Quick Start](../quick-start.md)
 - [Data preparation](data-preparation.md)
 
 
@@ -105,7 +105,7 @@ Let's start from the following training configuration:
 This configuration will not work because it misses important arguments to define model architecture.
 There are 2 ways of instantiating our a model.
 
-We could use a pretrained model config. This step is similar to what is done in the [Quick Start guide](quick-start.md).
+We could use a pretrained model config. This step is similar to what is done in the [Quick Start guide](../quick-start.md).
 First download the model configuration:
 === "Llama 3.1 8B"
     ```bash
diff --git a/mkdocs.yaml b/mkdocs.yaml
index 47ac8cd65..3979acda0 100644
--- a/mkdocs.yaml
+++ b/mkdocs.yaml
@@ -169,8 +169,8 @@ nav:
   - Recipes:
     - Prepare a dataset: recipes/data-preparation.md
     - Configure a dataset: recipes/data-configuration.md
-    - Train Llama 8B from scratch: recipes/train-llama-8b.md
-    - Continue training Llama 8B: recipes/continue-training-llama-8b.md
+    - Train a model from scratch: recipes/train.md
+    - Continue training a model: recipes/continue-training.md
     - Upcycle Llama 3B to MoE: recipes/upcycle-llama-3b-to-moe.md
   - Reference:
     - User Guide: