huyiwen/awesome-llm-pretraining

Awesome LLM Pre-training

Pre-training is the first and most compute-intensive stage in developing large language models. As the open-source community advances model architectures, training strategies, open datasets, and data methods, we continuously track the resources available for LLM pre-training and share them back with the community's model developers.

Rather than a comprehensive survey, our coverage focuses on commonly used resources and cutting-edge work related to pre-training, so that readers can get started with LLM pre-training quickly. Updates from the open-source community are welcome, so that we can jointly advance the development of large models.

Technical Reports

Behind each technical report often lie hundreds or thousands of GPUs' worth of compute, so reading open-source technical reports is highly recommended.

Dense Models

  1. The Llama 3 Herd of Models. [paper]
  2. Qwen2.5 Technical Report. [paper]
  3. Gemma 3 Technical Report. [paper]
  4. Nemotron-4 340B Technical Report. [paper]
  5. Pangu Ultra: Pushing the Limits of Dense Large Language Models on Ascend NPUs. [paper]
  6. Baichuan 2: Open Large-scale Language Models. [paper]

MoE Models

  1. DeepSeek-V3 Technical Report. [paper]
  2. DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models. [paper]
  3. Mixtral of Experts. [paper]
  4. Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models. [paper]
  5. Every FLOP Counts: Scaling a 300B Mixture-of-Experts LING LLM without Premium GPUs. [paper]
  6. OLMoE: Open Mixture-of-Experts Language Models. [paper]
  7. Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent. [paper]

Models with Open-Source Datasets

  1. YuLan-Mini: An Open Data-efficient Language Model. [paper]
  2. MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series. [paper]
  3. LLM360: Towards Fully Transparent Open-Source LLMs. [paper]
  4. Nemotron-4 15B Technical Report. [paper]

Training/Data Strategies

  1. Phi-4 Technical Report. [paper]
  2. OLMo: Accelerating the Science of Language Models. [paper]
  3. 2 OLMo 2 Furious. [paper]
  4. Yi: Open Foundation Models by 01.AI. [paper]
  5. MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies. [paper]

Hybrid/Linear Models

  1. Falcon Mamba: The First Competitive Attention-free 7B Language Model. [paper]
  2. MiniMax-01: Scaling Foundation Models with Lightning Attention. [paper]
  3. Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models. [paper]

All Technical Reports

LLaMA Series

  1. LLaMA: Open and Efficient Foundation Language Models. [paper]
  2. Llama 2: Open Foundation and Fine-Tuned Chat Models. [paper]
  3. The Llama 3 Herd of Models. [paper]

Qwen Series

  1. Qwen Technical Report. [paper]
  2. Qwen2 Technical Report. [paper]
  3. Qwen2.5 Technical Report. [paper]

DeepSeek Series

  1. DeepSeek LLM: Scaling Open-Source Language Models with Longtermism. [paper]
  2. DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models. [paper]
  3. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. [paper]
  4. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. [paper]
  5. DeepSeek-V3 Technical Report. [paper]

Gemma Series

  1. Gemma: Open Models Based on Gemini Research and Technology. [paper]
  2. Gemma 2: Improving Open Language Models at a Practical Size. [paper]
  3. Gemma 3 Technical Report. [paper]

Gemini Series

  1. Gemini: A Family of Highly Capable Multimodal Models. [paper]
  2. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. [paper]

Mistral Series

  1. Mistral 7B. [paper]
  2. Mixtral of Experts. [paper]

Phi Series

  1. Textbooks Are All You Need. [paper]
  2. Textbooks Are All You Need II: phi-1.5 technical report. [paper]
  3. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. [paper]
  4. Phi-4 Technical Report. [paper]

GLM Series

  1. GLM: General Language Model Pretraining with Autoregressive Blank Infilling. [paper]
  2. GLM-130B: An Open Bilingual Pre-trained Model. [paper]
  3. ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools. [paper]

Baichuan Series

  1. Baichuan 2: Open Large-scale Language Models. [paper]
  2. Baichuan-M1: Pushing the Medical Capability of Large Language Models. [paper]

Falcon Series

  1. The Falcon Series of Open Language Models. [paper]
  2. Falcon2-11B Technical Report. [paper]
  3. Falcon Mamba: The First Competitive Attention-free 7B Language Model. [paper]

InternLM Series

  1. InternLM: A Multilingual Language Model with Progressively Enhanced Capabilities. [paper]
  2. InternLM2 Technical Report. [paper]

MiniCPM

  1. MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies. [paper]

Yi Series

  1. Yi: Open Foundation Models by 01.AI. [paper]
  2. Yi-Lightning Technical Report. [paper]

MiniMax Series

  1. MiniMax-01: Scaling Foundation Models with Lightning Attention. [paper]

Reka Series

  1. Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models. [paper]

Skywork Series

  1. Skywork: A More Open Bilingual Foundation Model. [paper]
  2. Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models. [paper]

Hunyuan Series

  1. Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent. [paper]

Nemotron Series

  1. Nemotron-4 15B Technical Report. [paper]
  2. Nemotron-4 340B Technical Report. [paper]
  3. Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models. [paper]

Ling Series

  1. Every FLOP Counts: Scaling a 300B Mixture-of-Experts LING LLM without Premium GPUs. [paper]

OLMo Series

  1. OLMo: Accelerating the Science of Language Models. [paper]
  2. 2 OLMo 2 Furious. [paper]
  3. OLMoE: Open Mixture-of-Experts Language Models. [paper]

Yulan Series

  1. YuLan: An Open-source Large Language Model. [paper]
  2. YuLan-Mini: An Open Data-efficient Language Model. [paper]

MAP-Neo Series

  1. MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series. [paper]

LLM360 Project

  1. LLM360: Towards Fully Transparent Open-Source LLMs. [paper]

Training Strategies

Training Frameworks

The most commonly used training framework is Megatron-LM, which provides an efficient, out-of-the-box baseline.

  1. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    The most commonly used pre-training framework

  2. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters

    Zero-redundancy data parallelism

  3. Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts.

    MoE computation-communication overlapping

  4. DeepEP: an efficient expert-parallel communication library

    Expert parallel acceleration

  5. DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

    Accelerating FP8 matrix multiplication using the asynchronous features of Hopper

  6. Liger Kernel: Efficient Triton Kernels for LLM Training

    Triton acceleration operator library

Training Strategies

  1. Predictable Scale: Part I -- Optimal Hyperparameter Scaling Law in Large Language Model Pretraining

A scaling law for optimal hyperparameters

  2. The Ultra-Scale Playbook: Training LLMs on GPU Clusters

    Visualizing the memory usage of parallel strategies

  3. A Spectral Condition for Feature Learning

A refinement of μP (Maximal Update Parametrization)

  4. Muon is Scalable for LLM Training

    An efficient optimizer

  5. COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training

Keeps optimizer states and activations in FP8 as well

  6. Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models

A scaling law for MoE sparsity
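Scaling-law papers like those above typically fit power laws of the form L(N) ≈ a·N^(−b) to losses observed in small training runs, then extrapolate. A minimal sketch of such a fit via log-log linear regression (the data and constants here are illustrative, not taken from any of the papers):

```python
import numpy as np

def fit_power_law(ns, losses):
    """Fit loss = a * n**(-b) by linear regression in log-log space."""
    log_n = np.log(np.asarray(ns, dtype=float))
    log_l = np.log(np.asarray(losses, dtype=float))
    # least-squares line: log_l = log_a - b * log_n
    slope, intercept = np.polyfit(log_n, log_l, 1)
    return np.exp(intercept), -slope  # (a, b)

# synthetic observations from hypothetical small runs
ns = [1e8, 3e8, 1e9, 3e9]
losses = [5.0 * n ** -0.05 for n in ns]
a, b = fit_power_law(ns, losses)
print(f"a={a:.2f}, b={b:.3f}")  # recovers a≈5.00, b≈0.050
```

Real papers fit richer forms (e.g. an irreducible-loss term, or joint laws in parameters and data), which require nonlinear optimization rather than a log-log line.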

Interpretability

A non-exhaustive list of interpretability works that are informative for pre-training.

  1. On the Biology of a Large Language Model
  2. Physics of Language Models
  3. In-context Learning and Induction Heads
  4. Rethinking Reflection in Pre-Training

Model Architecture Improvement

A non-exhaustive list of recent model-architecture improvements.

  1. Gated Delta Networks: Improving Mamba2 with Delta Rule
  2. RWKV-7 "Goose" with Expressive Dynamic State Evolution
  3. Mixture of Hidden-Dimensions Transformer
  4. Titans: Learning to Memorize at Test Time
  5. Ultra-Sparse Memory Network
  6. Large Language Diffusion Models
  7. Better & Faster Large Language Models via Multi-token Prediction
  8. Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing
  9. Stick-breaking Attention
  10. Forgetting Transformer: Softmax Attention with a Forget Gate
  11. Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
  12. MoBA: Mixture of Block Attention for Long-Context LLMs
  13. KV Shifting Attention Enhances Language Modeling
  14. Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models
  15. Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts
  16. ReLU2 Wins: Discovering Efficient Activation Functions for Sparse LLMs
  17. μnit Scaling: Simple and Scalable FP8 LLM Training
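The auxiliary-loss-free load-balancing idea (item 15) maintains a per-expert bias that is added to routing scores only when selecting the top-k experts, never in the gating weights; after each batch, the bias of underloaded experts is nudged up and that of overloaded experts down. A toy sketch of this mechanism (expert counts, step size, and the skewed scores are all illustrative):

```python
import numpy as np

def route_topk(scores, bias, k):
    """Pick top-k experts per token from bias-adjusted scores; the bias
    steers selection only and never enters the gating weights."""
    return np.argsort(scores + bias, axis=-1)[:, -k:]

def update_bias(bias, counts, gamma=0.05):
    """Raise the bias of underloaded experts, lower overloaded ones."""
    return bias + gamma * np.sign(counts.mean() - counts)

rng = np.random.default_rng(0)
n_experts, k = 8, 2
skew = np.linspace(-1.0, 1.0, n_experts)  # some experts unevenly preferred
bias = np.zeros(n_experts)
for _ in range(200):
    scores = rng.normal(size=(256, n_experts)) + skew
    counts = np.bincount(route_topk(scores, bias, k).ravel(), minlength=n_experts)
    bias = update_bias(bias, counts)
print(bias.round(2))  # roughly mirrors -skew, balancing expert load
```

Because the bias never multiplies the expert outputs, balance is achieved without the gradient interference an auxiliary balancing loss introduces.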

Learning Rate Annealing

Learning rate annealing is often paired with data-quality screening: higher-quality data is up-weighted during the decay phase.

  1. MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
  2. Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations
  3. Scaling Law with Learning Rate Annealing
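The MiniCPM-style warmup-stable-decay (WSD) schedule keeps the learning rate constant for most of training and anneals it sharply at the end, which is exactly the window where higher-quality data is up-weighted. A minimal sketch (the step fractions and peak learning rate are illustrative defaults, not values from the paper):

```python
def wsd_lr(step, total_steps, peak_lr=1e-3, warmup_frac=0.01,
           decay_frac=0.1, min_lr=1e-4):
    """Warmup-stable-decay: linear warmup, long plateau, linear anneal."""
    warmup_steps = int(total_steps * warmup_frac)
    decay_start = int(total_steps * (1 - decay_frac))
    if step < warmup_steps:
        return peak_lr * step / max(warmup_steps, 1)
    if step < decay_start:
        return peak_lr
    frac = (step - decay_start) / max(total_steps - decay_start, 1)
    return peak_lr + (min_lr - peak_lr) * frac

for s in (0, 5, 500, 950, 1000):
    print(s, wsd_lr(s, total_steps=1000))
```

Unlike cosine decay, the plateau lets a run be stopped (or branched for annealing experiments) at any point without committing to a total step count in advance.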

Open-Source Datasets

We mainly discuss existing open-source datasets from four aspects: web pages, mathematics, code, and general (books, encyclopedias, instructions, long contexts, etc.).

Web Pages

Web data forms the core of most pre-training corpora.

  1. DCLM. [paper] [resource]

An open-source web dataset of 3.8T tokens, obtained after filtering with fastText classifiers and other heuristics.

  2. FineWeb-Edu

An educational-quality-scored corpus filtered from FineWeb; it helps noticeably on knowledge-intensive benchmarks.

  3. Nemotron-CC-HQ. [paper] [resource]

NVIDIA's high-quality Common Crawl corpus.

  4. Chinese-FineWeb-Edu. [resource]

    An open-source Chinese educational quality scoring corpus by OpenCSG, screened and scored from Map-CC, SkyPile, WuDao, Wanjuan, etc.

  5. FineWeb2: A sparkling update with 1000s of languages

    A multilingual dataset.
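Datasets like DCLM and FineWeb-Edu are built by scoring every document with a quality classifier (a fastText model or a small regression head) and keeping documents above a threshold. A sketch of that filtering loop, where `toy_edu_score` is a hypothetical stand-in for the real trained classifier:

```python
def toy_edu_score(text: str) -> float:
    """Stand-in for a real quality classifier (e.g. a fastText model)."""
    markers = ("theorem", "explain", "because", "definition")
    hits = sum(text.lower().count(m) for m in markers)
    return min(1.0, hits / 3)

def filter_corpus(docs, threshold=0.5):
    """Keep documents whose quality score clears the threshold."""
    return [d for d in docs if toy_edu_score(d) >= threshold]

docs = [
    "Click here to win a free prize now!!!",
    "Definition: a prime is divisible only by 1 and itself, because ...",
]
print(filter_corpus(docs))  # keeps only the educational document
```

In practice the classifier is trained on labeled seed data (e.g. instruction-like or教科书-style positives), and the threshold trades corpus size against quality.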

Mathematics

  1. MegaMath: Pushing the Limits of Open Math Corpora

    The largest open-source high-quality mathematical CC corpus.

  2. JiuZhang3.0: Efficiently Improving Mathematical Reasoning by Training Small Data Synthesis Models

    Synthetic mathematical instruction data.

  3. mlfoundations-dev/stackoverflow_math

    Math-related questions.

  4. DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning

    A high-difficulty mathematical dataset.

  5. YuLan-Mini: An Open Data-efficient Language Model

    Collecting open-source Lean theorem proving datasets.

Code

  1. OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models

    Cleaned from The-Stack-V2.

  2. SmolLM-corpus. [resource]

    Python educational quality scoring.

  3. The-Stack-V2

    The largest-scale uncleaned code data.

  4. YuLan-Mini: An Open Data-efficient Language Model

Jupyter Notebook and Python data cleaned and filtered by educational quality.

  5. HuggingFaceTB/issues-kaggle-notebooks

    GitHub Issues and Kaggle Notebooks data.

  6. mlfoundations-dev/stackoverflow

    A programming Q&A forum.

  7. Magicoder: Empowering Code Generation with OSS-Instruct

    Training with synthetic instruction data generated from open-source code.

General (Books, Encyclopedias, Instructions, Long Contexts, etc.)

  1. YuLan: An Open-source Large Language Model

    Enhancing long-tail knowledge and cleaning various general data sources.

  2. MinerU: An Open-Source Solution for Precise Document Content Extraction

Converts PDFs to Markdown with broad format compatibility.

  3. The Pile: An 800GB Dataset of Diverse Text for Language Modeling.

    arXiv, conversations, DM Math, etc.

  4. Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research.

    Encyclopedias, books, papers, Reddit, etc.

  5. WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese Large Models

    Laws, exams, news, patents, encyclopedias, etc.

  6. MAmmoTH2: Scaling Instructions from the Web

Q&A pairs extracted from web pages.

  7. togethercomputer/Long-Data-Collections

    Books, papers, web pages, and instructions filtered from datasets such as RedPajama, Pile, and P3.

  8. Longattn: Selecting long-context training data via token-level attention

Q&A data selected for long-range dependencies.

Data Methods

Tokenizers

  1. SuperBPE: Space Travel for Language Models

A method for training tokenizers whose tokens can span multiple words ("superwords").

  2. Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies

Predicting the optimal vocabulary size as models scale.

  3. Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs

Compares how different tokenization schemes handle numbers.
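One design choice the "Tokenization counts" paper examines is the direction of digit grouping: right-to-left chunking aligns token boundaries with thousands separators and tends to help arithmetic, while left-to-right chunking does not. A sketch of the two groupings (three-digit chunks, for illustration):

```python
def chunk_digits(s: str, right_to_left: bool = True, size: int = 3):
    """Split a digit string into fixed-size chunks from either end."""
    if right_to_left:
        rem = len(s) % size
        head = [s[:rem]] if rem else []
        return head + [s[i:i + size] for i in range(rem, len(s), size)]
    return [s[i:i + size] for i in range(0, len(s), size)]

print(chunk_digits("1234567"))                       # ['1', '234', '567']
print(chunk_digits("1234567", right_to_left=False))  # ['123', '456', '7']
```

With R2L chunking the same place values (ones, thousands, millions) always land in the same token positions, which is the intuition behind its advantage on arithmetic.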

Data Ratio and Curriculum

  1. Nemotron-4 15B Technical Report

Splits training into an 8T-token pre-training phase and a smaller continued-pre-training phase.

  2. YuLan-Mini: An Open Data-efficient Language Model

    Using educational scores for curriculum data.

  3. DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining

    Optimizing the mixing ratio of pre-training data.

  4. Efficient Online Data Mixing For Language Model Pre-Training

    Online data mixing.

  5. Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance

    Laws of data mixing.

  6. Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?

Inferring the training-data mixture of commercial models such as GPT from the merge rules of their BPE tokenizers.

  7. CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training

    A clustering-based iterative data mixture bootstrapping framework.

  8. Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens

    Building an index for large-scale pre-training datasets to check data quality.
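Methods like DoReMi output a set of per-domain mixture weights; at training time, each sequence in a batch is drawn from a domain according to those weights. A minimal sketch of that sampling step (the weights below are illustrative, not DoReMi's outputs):

```python
import random

def sample_batch_domains(weights, batch_size, seed=0):
    """Draw one source domain per sequence according to mixture weights."""
    rng = random.Random(seed)  # seeded for reproducible batches
    domains = list(weights)
    probs = [weights[d] for d in domains]
    return rng.choices(domains, weights=probs, k=batch_size)

mixture = {"web": 0.6, "code": 0.2, "math": 0.1, "books": 0.1}
batch = sample_batch_domains(mixture, batch_size=8)
print(batch)
```

Online data-mixing methods (item 4) update `mixture` during training based on per-domain losses, rather than fixing it in advance.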

Data Synthesis

  1. Knowledge-Instruct: Effective Continual Pre-training from Limited Data using Instructions

    Generating information-intensive synthetic instruction data and learning knowledge from a limited corpus.

  2. LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs

Constructing long-form creative-writing data.

  3. Synthetic Data Generation & Multi-Step RL for Reasoning & Tool Use

    Multi-step reasoning data synthesis, decomposing complex tasks into sub-trajectories and optimizing data generation in combination with reinforcement learning.

  4. WildChat: 1M ChatGPT Interaction Logs in the Wild

    An open-source dataset of real user conversations.

  5. Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing

    Alignment data synthesis.

  6. Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-thinking Reasoning Systems

Imitation learning on synthetic long chain-of-thought data.
