ServiceNow · tscholak · Mar 5, 2025 · Dec 2, 2024 · Dec 3, 2024 · Dec 3, 2024
diff --git a/docs/help.md b/docs/help.md
@@ -47,7 +47,7 @@ We've got some excellent tutorials to help you get the most out of Fast-LLM:
 
 -   [**Quick-Start Guide**](quick-start.md): Perfect for launching Fast-LLM on a single GPU machine. We walk you through running your first training job (either locally or on a cluster), and handling common issues.
 
--   [**Cookbook**](recipes/train-llama-8b.md): Ready to go big? These recipes cover real-world scenarios like training big models from scratch, continuing training from a checkpoint, and more. This is where Fast-LLM really shows its power.
+-   [**Cookbook**](recipes/train.md): Ready to go big? These recipes cover real-world scenarios like training big models from scratch, continuing training from a checkpoint, and more. This is where Fast-LLM really shows its power.
 
 ---
 

diff --git a/docs/index.md b/docs/index.md
@@ -6,7 +6,7 @@ hide:
 
 Introducing **Fast-LLM**, the cutting-edge open-source library built for training large language models (LLMs) with **unmatched speed, scalability, and cost-efficiency**. Developed by [ServiceNow Research](https://www.servicenow.com/research/)'s Foundation Models Lab, Fast-LLM is engineered to meet the rigorous demands of professional AI researchers, AI/ML engineers, academic and industrial research institutions, and enterprise product development teams pushing the limits of generative AI. **Achieve groundbreaking research and high-stakes production goals faster with Fast-LLM.**
 
-[Start your journey with Fast-LLM](quick-start.md) and explore the future of LLM training. Dive into [real-world use cases](recipes/train-llama-8b.md) to see how Fast-LLM can elevate your training workflows.
+[Start your journey with Fast-LLM](quick-start.md) and explore the future of LLM training. Dive into [real-world use cases](recipes/train.md) to see how Fast-LLM can elevate your training workflows.
 
 ## Why Fast-LLM?
 

diff --git a/docs/recipes/continue-training-llama-8b.md b/docs/recipes/continue-training-llama-8b.md
diff --git a/docs/recipes/continue-training.md b/docs/recipes/continue-training.md
@@ -0,0 +1,153 @@
+---
+title: Continual Pretraining of Llama 3.1 8B or Qwen 2.5 7B
+---
+
+
+In this guide, we provide step-by-step instructions for continued pretraining on The Stack with Llama 3.1 8B or Qwen 2.5 7B models.
+
+# Preliminary steps
+- [Quick Start](../quick-start.md)
+- [Data preparation](data-preparation.md)
+
+# Download the Pretrained Model
+Let's download the model first:
+=== "Llama 3.1 8B"
+    ```bash
+    git lfs install
+    git clone https://huggingface.co/meta-llama/Llama-3.1-8B ./fast-llm-tutorial/pretrained-model
+    ```
+=== "Qwen 2.5 7B"
+    ```bash
+    git lfs install
+    git clone https://huggingface.co/Qwen/Qwen2.5-7B ./fast-llm-tutorial/pretrained-model
+    ```
+
+# Training
+This is not much different from a pretraining config. We will:
+- specify the model checkpoint to load and its format. Fast-LLM will automatically infer the corresponding model architecture.
+- adapt some of the training parameters for our needs.
+- and that's it!
+=== "Llama 3.1 8B"
+    ```yaml
+    training:
+      train_iters: 100_000
+      logs:
+        interval: 10
+      validation:
+        iterations: 25
+        interval: 1000
+      checkpoint:
+        interval: 1000
+        keep: 5
+      test_iters: 0
+      export:  # (1)!
+        format: llama
+        interval: 20_000
+    batch:
+      micro_batch_size: 2
+      sequence_length: 4096
+      batch_size: 256
+    data:
+      format: file
+      path: fast-llm-tutorial/dataset.json  # (2)!
+      split: [99, 1, 0]  
+    optimizer:  
+      weight_decay: 0.1
+      beta_1: 0.9
+      beta_2: 0.95
+      learning_rate:
+        base: 1.0e-04  # (3)!
+        minimum: 1.0e-05
+        decay_style: cosine
+        decay_iterations: 100_000
+        warmup_iterations: 2000
+    pretrained:  # (4)!
+      format: llama
+      path: fast-llm-tutorial/pretrained-model
+      model_weights: yes  # (5)!
+    model:
+      base_model:
+        transformer:
+          use_flash_attention: yes
+        cross_entropy_impl: fused
+      multi_stage:
+        zero_stage: 2
+      distributed:
+        training_dtype: bf16  
+    run:
+      experiment_dir: fast-llm-tutorial/Llama-3.1-8B-cpt
+    ```
+=== "Qwen 2.5 7B"
+    ```yaml
+    training:
+      train_iters: 100_000
+      logs:
+        interval: 10
+      validation:
+        iterations: 25
+        interval: 1000
+      checkpoint:
+        interval: 1000
+        keep: 5
+      test_iters: 0
+      export:  # (1)!
+        format: qwen2
+        interval: 20_000
+    batch:
+      micro_batch_size: 1
+      sequence_length: 8192
+      batch_size: 256
+    data:
+      format: file
+      path: fast-llm-tutorial/dataset.json  # (2)!
+      split: [99, 1, 0]  
+    optimizer:  
+      weight_decay: 0.1
+      beta_1: 0.9
+      beta_2: 0.95
+      learning_rate:
+        base: 1.0e-04  # (3)!
+        minimum: 1.0e-05
+        decay_style: cosine
+        decay_iterations: 100_000
+        warmup_iterations: 2000
+    pretrained:  # (4)!
+      format: qwen2
+      path: fast-llm-tutorial/pretrained-model
+      model_weights: yes  # (5)!
+    model:
+      base_model:
+        transformer:
+          use_flash_attention: yes
+        cross_entropy_impl: fused
+      multi_stage:
+        zero_stage: 2
+      distributed:
+        training_dtype: bf16  
+    run:
+      experiment_dir: fast-llm-tutorial/qwen-2.5-7B-cpt
+    ```
+
+1.  The model will be exported in Hugging Face format to the `export` subdirectory of the experiment directory every 20,000 iterations.
+2.  Location of the dataset metadata file generated in Step 4.
+3.  The learning rate controls the trade-off between learning and forgetting. A higher learning rate adapts quickly to the new dataset but causes more forgetting. A lower learning rate retains more of the pretrained model's knowledge but adapts more slowly to the new domain.
+4.  Config of the pretrained model. We load the model downloaded from the repository earlier.
+5.  This tells Fast-LLM to load the weights of the pretrained model. If we wanted to use the model's configuration, but train from scratch, we could use the same config but set this to `no`.
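To make the learning-rate settings in note 3 concrete, here is a minimal Python sketch of a linear warmup followed by a cosine decay, using the values from the config above. The exact formula Fast-LLM applies is not shown in this guide, so the function below (and its name, `learning_rate`) is an assumption based on the textbook schedule, not the library's implementation.

```python
import math


def learning_rate(step, base=1.0e-4, minimum=1.0e-5,
                  warmup_iterations=2000, decay_iterations=100_000):
    """Linear warmup then cosine decay (assumed textbook form, not Fast-LLM's code)."""
    if step < warmup_iterations:
        # Ramp linearly from 0 up to the base learning rate.
        return base * step / warmup_iterations
    # Fraction of the decay phase completed, clamped so the rate
    # stays at `minimum` once the decay is over.
    progress = min(1.0, (step - warmup_iterations) / (decay_iterations - warmup_iterations))
    return minimum + 0.5 * (base - minimum) * (1.0 + math.cos(math.pi * progress))


for step in (0, 1000, 2000, 51_000, 100_000):
    print(f"step {step:>7}: lr = {learning_rate(step):.2e}")
```

Under these assumptions the rate peaks at `base` when warmup ends and reaches `minimum` at `decay_iterations`. Lowering `base` shifts the whole curve down, which is the knob note 3 describes for limiting forgetting.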
+
+# Checkpoint usage
+Checkpoints will be saved regularly, and every 20k steps a checkpoint will be exported in the HF format.
+You can use it in `transformers` as you would use the pretrained model, except this one should be stronger on programming languages!
+=== "Llama 3.1 8B"
+    ```python
+    from transformers import pipeline, AutoTokenizer
+
+    tokenizer = AutoTokenizer.from_pretrained("fast-llm-tutorial/pretrained-model")
+    pipe = pipeline("text-generation", model="fast-llm-tutorial/Llama-3.1-8B-cpt/export/llama/20000/", tokenizer=tokenizer)
+    ```
+=== "Qwen 2.5 7B"
+    ```python
+    from transformers import pipeline, AutoTokenizer
+
+    tokenizer = AutoTokenizer.from_pretrained("fast-llm-tutorial/pretrained-model")
+    pipe = pipeline("text-generation", model="fast-llm-tutorial/qwen-2.5-7B-cpt/export/qwen2/20000/", tokenizer=tokenizer)
+    ```
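The export paths used above follow the pattern `<experiment_dir>/export/<format>/<iteration>`. The sketch below lists the directories one would expect for a full run. It assumes an export happens at every multiple of the export interval, including the final iteration, which this guide does not state explicitly. The helper name `export_dirs` is hypothetical.

```python
from pathlib import PurePosixPath


def export_dirs(experiment_dir, export_format, train_iters=100_000, interval=20_000):
    """List expected export directories (layout inferred from the paths in this guide)."""
    root = PurePosixPath(experiment_dir) / "export" / export_format
    # Assumption: one export per multiple of `interval`, up to `train_iters`.
    return [str(root / str(i)) for i in range(interval, train_iters + 1, interval)]


for path in export_dirs("fast-llm-tutorial/Llama-3.1-8B-cpt", "llama"):
    print(path)
```

Any of these directories can be passed as the `model` argument of `pipeline`, as in the snippets above, once the corresponding export exists on disk.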
diff --git a/docs/recipes/train-llama-8b.md b/docs/recipes/train-llama-8b.md
diff --git a/docs/recipes/train.md b/docs/recipes/train.md
@@ -0,0 +1,189 @@
+---
+title: Training Llama 3.1 8B or Qwen 2.5 7B from Scratch
+---
+
+Follow this guide to train a Llama 3.1 8B-like or Qwen 2.5 7B-like model from scratch!
+
+
+# Preliminary steps
+- [Quick Start](../quick-start.md)
+- [Data preparation](data-preparation.md)
+
+
+# Training configuration
+In this guide, we show you how to configure a model architecture and train a model from scratch.
+Let's start from the following training configuration:
+=== "Llama 3.1 8B"
+    ```yaml
+    training:
+      train_iters: 100_000
+      logs:
+        interval: 10
+      validation:
+        iterations: 25
+        interval: 1000
+      checkpoint:
+        interval: 1000
+        keep: 5
+      test_iters: 0
+      export:
+        format: llama
+        interval: 20_000
+    batch:
+      micro_batch_size: 2
+      sequence_length: 4096
+      batch_size: 256
+    data:
+      format: file
+      path: fast-llm-tutorial/dataset/fast_llm_dataset.json
+      split: [99, 1, 0]
+    optimizer:
+      weight_decay: 0.1
+      beta_1: 0.9
+      beta_2: 0.95
+      learning_rate:
+        base: 6.0e-04
+        minimum: 6.0e-05
+        decay_style: cosine
+        decay_iterations: 100_000
+        warmup_iterations: 2000
+    model:
+      base_model:
+        cross_entropy_impl: fused
+      multi_stage:
+        zero_stage: 2
+      distributed:
+        training_dtype: bf16
+    run:
+      experiment_dir: fast-llm-tutorial/experiment
+    ```
+=== "Qwen 2.5 7B"
+    ```yaml
+    training:
+      train_iters: 100_000
+      logs:
+        interval: 10
+      validation:
+        iterations: 25
+        interval: 1000
+      checkpoint:
+        interval: 1000
+        keep: 5
+      test_iters: 0
+      export:
+        format: qwen2
+        interval: 20_000
+    batch:
+      micro_batch_size: 1
+      sequence_length: 8192
+      batch_size: 256
+    data:
+      format: file
+      path: fast-llm-tutorial/dataset/fast_llm_dataset.json
+      split: [99, 1, 0]
+    optimizer:
+      weight_decay: 0.1
+      beta_1: 0.9
+      beta_2: 0.95
+      learning_rate:
+        base: 6.0e-04
+        minimum: 6.0e-05
+        decay_style: cosine
+        decay_iterations: 100_000
+        warmup_iterations: 2000
+    model:
+      base_model:
+        cross_entropy_impl: fused
+      multi_stage:
+        zero_stage: 2
+      distributed:
+        training_dtype: bf16
+    run:
+      experiment_dir: fast-llm-tutorial/experiment
+    ```
+
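As a quick sanity check on the `batch` and `training` settings, the Python sketch below works out the token budget they imply. It assumes that `batch_size` counts sequences per optimizer step. The 8-GPU data-parallel setup in the second helper is a hypothetical example, not something this guide prescribes.

```python
def token_budget(batch_size, sequence_length, train_iters):
    """Return (tokens per optimizer step, tokens over the whole run)."""
    tokens_per_step = batch_size * sequence_length
    return tokens_per_step, tokens_per_step * train_iters


def gradient_accumulation_steps(batch_size, micro_batch_size, data_parallel_gpus):
    """Micro-batches each GPU processes per step, assuming pure data parallelism."""
    sequences_per_pass = micro_batch_size * data_parallel_gpus
    if batch_size % sequences_per_pass:
        raise ValueError("batch_size must be divisible by micro_batch_size * GPUs")
    return batch_size // sequences_per_pass


for name, micro, seq_len in (("Llama 3.1 8B", 2, 4096), ("Qwen 2.5 7B", 1, 8192)):
    per_step, total = token_budget(256, seq_len, 100_000)
    accumulation = gradient_accumulation_steps(256, micro, data_parallel_gpus=8)
    print(f"{name}: {per_step:,} tokens/step, {total:,} tokens total, "
          f"{accumulation} accumulation steps on 8 GPUs")
```

With these numbers the Llama config processes about 1M tokens per step (roughly 105B tokens over the run), and the Qwen config about 2M tokens per step (roughly 210B tokens), because its sequences are twice as long.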
+This configuration will not work as is because it lacks the arguments that define the model architecture.
+There are two ways of instantiating a model.
+
+We could use a pretrained model config. This step is similar to what is done in the [Quick Start guide](../quick-start.md).
+First download the model configuration:
+=== "Llama 3.1 8B"
+    ```bash
+    git lfs install
+    GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/meta-llama/Llama-3.1-8B ./fast-llm-tutorial/pretrained-model
+    ```
+=== "Qwen 2.5 7B"
+    ```bash
+    git lfs install
+    GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Qwen/Qwen2.5-7B ./fast-llm-tutorial/pretrained-model
+    ```
+
+When we point Fast-LLM at a pretrained model from the Hugging Face Hub, it automatically converts the model's configuration.
+**Only the configuration is loaded, not the weights**, because of `model_weights: no`.
+=== "Llama 3.1 8B"
+    ```yaml
+    pretrained:
+      format: llama
+      path: fast-llm-tutorial/pretrained-model
+      model_weights: no
+    ```
+=== "Qwen 2.5 7B"
+    ```yaml
+    pretrained:
+      format: qwen2
+      path: fast-llm-tutorial/pretrained-model
+      model_weights: no
+    ```
+
+Alternatively, we define the model architecture ourselves as follows:
+=== "Llama 3.1 8B"
+      ```yaml
+      model:
+        base_model:
+          tie_word_embeddings: false
+          use_position_embeddings: false
+          vocab_size: 128256
+          transformer:
+            activation_type: silu
+            add_linear_biases: false
+            ffn_hidden_size: 14336
+            gated: true
+            head_groups: 8
+            hidden_size: 4096  # (1)!
+            kv_channels: 128
+            normalization:
+              type: rms_norm
+            num_attention_heads: 32
+            num_layers: 32
+            rotary:
+              type: llama3
+              theta: 500_000
+      ```
+=== "Qwen 2.5 7B"
+      ```yaml
+      model:
+        base_model:
+          tie_word_embeddings: false
+          use_position_embeddings: false
+          vocab_size: 152064
+          transformer:
+            activation_type: silu
+            add_linear_biases: only_attn_qkv
+            ffn_hidden_size: 18944
+            gated: true
+            head_groups: 4
+            hidden_size: 3584  # (1)!
+            normalization:
+              type: rms_norm
+              epsilon: 1e-06
+            num_attention_heads: 28
+            num_layers: 28
+            rotary:
+              type: default
+              theta: 1_000_000
+      ```
+
+1.  `hidden_size` and `num_layers` are used to derive good defaults for the standard deviation of the weight initialization.
+
+Configuring the model this way is a bit more verbose than using the pretrained configuration, but it gives an idea of how to configure a model with Fast-LLM.
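One way to check a hand-written architecture is to count the parameters it implies. The Python sketch below does this for the two configs above. It assumes that `head_groups` is the number of key/value heads, that `kv_channels` is the per-head dimension (defaulting to `hidden_size / num_attention_heads` when omitted), and that `add_linear_biases: only_attn_qkv` adds biases to the query, key and value projections only. These readings are inferred from the field names rather than stated in this guide, and `count_parameters` is a hypothetical helper, not a Fast-LLM function.

```python
def count_parameters(vocab_size, hidden_size, ffn_hidden_size, num_layers,
                     num_attention_heads, head_groups, kv_channels=None,
                     qkv_bias=False, tie_word_embeddings=False):
    """Count weights of a gated-MLP, grouped-query transformer with RMS norms."""
    if kv_channels is None:
        # Assumed default: per-head dimension derived from the hidden size.
        kv_channels = hidden_size // num_attention_heads
    q_dim = num_attention_heads * kv_channels
    kv_dim = head_groups * kv_channels
    # Query, key and value projections plus the output projection.
    attention = hidden_size * (q_dim + 2 * kv_dim) + q_dim * hidden_size
    if qkv_bias:
        attention += q_dim + 2 * kv_dim
    # Gated MLP: gate, up and down projections.
    mlp = 3 * hidden_size * ffn_hidden_size
    # Two RMS norms per layer, one weight vector each.
    layer = attention + mlp + 2 * hidden_size
    # Input embeddings, plus a separate output head when weights are untied.
    embeddings = vocab_size * hidden_size * (1 if tie_word_embeddings else 2)
    # The last term is the final norm before the output head.
    return embeddings + num_layers * layer + hidden_size


llama = count_parameters(vocab_size=128256, hidden_size=4096, ffn_hidden_size=14336,
                         num_layers=32, num_attention_heads=32, head_groups=8,
                         kv_channels=128)
qwen = count_parameters(vocab_size=152064, hidden_size=3584, ffn_hidden_size=18944,
                        num_layers=28, num_attention_heads=28, head_groups=4,
                        qkv_bias=True)
print(f"Llama 3.1 8B: {llama:,}")  # 8,030,261,248
print(f"Qwen 2.5 7B:  {qwen:,}")   # 7,615,616,512
```

If the totals land near 8B and 7.6B, the hand-written config is at least dimensionally consistent with the model it is meant to reproduce.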
+
diff --git a/mkdocs.yaml b/mkdocs.yaml
@@ -169,8 +169,8 @@ nav:
   - Recipes:
     - Prepare a dataset: recipes/data-preparation.md
     - Configure a dataset: recipes/data-configuration.md
-    - Train Llama 8B from scratch: recipes/train-llama-8b.md
-    - Continue training Llama 8B: recipes/continue-training-llama-8b.md
+    - Train a model from scratch: recipes/train.md
+    - Continue training a model: recipes/continue-training.md
     - Upcycle Llama 3B to MoE: recipes/upcycle-llama-3b-to-moe.md
   - Reference:
     - User Guide: