huyiwen/awesome-llm-pretraining

Awesome LLM Pre-training

Pre-training is the first and most compute-intensive stage in developing large language models. As the open-source community advances model architectures, training strategies, open datasets, and data methods, we continuously track the resources available for LLM pre-training and share them back with the community's model developers.

Rather than a comprehensive survey, our coverage focuses on commonly used resources and cutting-edge work related to pre-training, so that readers can get started with LLM pre-training quickly. Updates from the open-source community are welcome, so that we can jointly advance the development of large models.

Technical Reports

Behind each technical report often lie hundreds or thousands of GPUs' worth of compute, so reading open-source technical reports is highly recommended.

Dense Models

  1. The Llama 3 Herd of Models. [paper]
  2. Qwen2.5 Technical Report. [paper]
  3. Gemma 3 Technical Report. [paper]
  4. Nemotron-4 340B Technical Report. [paper]
  5. Pangu Ultra: Pushing the Limits of Dense Large Language Models on Ascend NPUs. [paper]
  6. Baichuan 2: Open Large-scale Language Models. [paper]

MoE Models

  1. DeepSeek-V3 Technical Report. [paper]
  2. DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models. [paper]
  3. Mixtral of Experts. [paper]
  4. Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models. [paper]
  5. Every FLOP Counts: Scaling a 300B Mixture-of-Experts LING LLM without Premium GPUs. [paper]
  6. OLMoE: Open Mixture-of-Experts Language Models. [paper]
  7. Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent. [paper]

Models with Open-Source Datasets

  1. YuLan-Mini: An Open Data-efficient Language Model. [paper]
  2. MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series. [paper]
  3. LLM360: Towards Fully Transparent Open-Source LLMs. [paper]
  4. Nemotron-4 15B Technical Report. [paper]

Training/Data Strategies

  1. Phi-4 Technical Report. [paper]
  2. OLMo: Accelerating the Science of Language Models. [paper]
  3. 2 OLMo 2 Furious. [paper]
  4. Yi: Open Foundation Models by 01.AI. [paper]
  5. MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies. [paper]

Hybrid/Linear Models

  1. Falcon Mamba: The First Competitive Attention-free 7B Language Model. [paper]
  2. MiniMax-01: Scaling Foundation Models with Lightning Attention. [paper]
  3. Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models. [paper]

All Technical Reports

LLaMA Series

  1. LLaMA: Open and Efficient Foundation Language Models. [paper]
  2. Llama 2: Open Foundation and Fine-Tuned Chat Models. [paper]
  3. The Llama 3 Herd of Models. [paper]

Qwen Series

  1. Qwen Technical Report. [paper]
  2. Qwen2 Technical Report. [paper]
  3. Qwen2.5 Technical Report. [paper]

DeepSeek Series

  1. DeepSeek LLM: Scaling Open-Source Language Models with Longtermism. [paper]
  2. DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models. [paper]
  3. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. [paper]
  4. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. [paper]
  5. DeepSeek-V3 Technical Report. [paper]

Gemma Series

  1. Gemma: Open Models Based on Gemini Research and Technology. [paper]
  2. Gemma 2: Improving Open Language Models at a Practical Size. [paper]
  3. Gemma 3 Technical Report. [paper]

Gemini Series

  1. Gemini: A Family of Highly Capable Multimodal Models. [paper]
  2. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. [paper]

Mistral Series

  1. Mistral 7B. [paper]
  2. Mixtral of Experts. [paper]

Phi Series

  1. Textbooks Are All You Need. [paper]
  2. Textbooks Are All You Need II: phi-1.5 technical report. [paper]
  3. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. [paper]
  4. Phi-4 Technical Report. [paper]

GLM Series

  1. GLM: General Language Model Pretraining with Autoregressive Blank Infilling. [paper]
  2. GLM-130B: An Open Bilingual Pre-trained Model. [paper]
  3. ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools. [paper]

Baichuan Series

  1. Baichuan 2: Open Large-scale Language Models. [paper]
  2. Baichuan-M1: Pushing the Medical Capability of Large Language Models. [paper]

Falcon Series

  1. The Falcon Series of Open Language Models. [paper]
  2. Falcon2-11B Technical Report. [paper]
  3. Falcon Mamba: The First Competitive Attention-free 7B Language Model. [paper]

InternLM Series

  1. InternLM: A Multilingual Language Model with Progressively Enhanced Capabilities. [paper]
  2. InternLM2 Technical Report. [paper]

MiniCPM

  1. MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies. [paper]

Yi Series

  1. Yi: Open Foundation Models by 01.AI. [paper]
  2. Yi-Lightning Technical Report. [paper]

MiniMax Series

  1. MiniMax-01: Scaling Foundation Models with Lightning Attention. [paper]

Reka Series

  1. Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models. [paper]

Skywork Series

  1. Skywork: A More Open Bilingual Foundation Model. [paper]
  2. Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models. [paper]

Hunyuan Series

  1. Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent. [paper]

Nemotron Series

  1. Nemotron-4 15B Technical Report. [paper]
  2. Nemotron-4 340B Technical Report. [paper]
  3. Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models. [paper]

Ling Series

  1. Every FLOP Counts: Scaling a 300B Mixture-of-Experts LING LLM without Premium GPUs. [paper]

OLMo Series

  1. OLMo: Accelerating the Science of Language Models. [paper]
  2. 2 OLMo 2 Furious. [paper]
  3. OLMoE: Open Mixture-of-Experts Language Models. [paper]

Yulan Series

  1. YuLan: An Open-source Large Language Model. [paper]
  2. YuLan-Mini: An Open Data-efficient Language Model. [paper]

MAP-Neo Series

  1. MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series. [paper]

LLM360 Project

  1. LLM360: Towards Fully Transparent Open-Source LLMs. [paper]

Training Strategies

Training Frameworks

The most commonly used training framework is Megatron-LM, which provides an efficient, out-of-the-box baseline.

  1. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    The most commonly used pre-training framework

  2. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters

    Zero-redundancy data parallelism

  3. Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts.

    MoE computation-communication overlapping

  4. DeepEP: an efficient expert-parallel communication library

    Expert parallel acceleration

  5. DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

    Accelerating FP8 matrix multiplication using the asynchronous features of Hopper

  6. Liger Kernel: Efficient Triton Kernels for LLM Training

    Triton acceleration operator library

Training Strategies

  1. Predictable Scale: Part I -- Optimal Hyperparameter Scaling Law in Large Language Model Pretraining

A scaling law for optimal hyperparameters

  2. The Ultra-Scale Playbook: Training LLMs on GPU Clusters

    Visualizing the memory usage of parallel strategies

  3. A Spectral Condition for Feature Learning

A refinement of μP (Maximal Update Parametrization)

  4. Muon is Scalable for LLM Training

    An efficient optimizer

  5. COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training

Keeps optimizer states and activations in FP8 as well

  6. Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models

A scaling law for MoE sparsity
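Scaling-law papers like those above typically fit power laws of the form L(N) ≈ a·N^(−b) to losses observed in small training runs, then extrapolate. A minimal sketch of such a fit via log-log linear regression (the data and constants here are illustrative, not taken from any of the papers):

```python
import numpy as np

def fit_power_law(ns, losses):
    """Fit loss = a * n**(-b) by linear regression in log-log space."""
    log_n = np.log(np.asarray(ns, dtype=float))
    log_l = np.log(np.asarray(losses, dtype=float))
    # least-squares line: log_l = log_a - b * log_n
    slope, intercept = np.polyfit(log_n, log_l, 1)
    return np.exp(intercept), -slope  # (a, b)

# synthetic observations from hypothetical small runs
ns = [1e8, 3e8, 1e9, 3e9]
losses = [5.0 * n ** -0.05 for n in ns]
a, b = fit_power_law(ns, losses)
print(f"a={a:.2f}, b={b:.3f}")  # recovers a≈5.00, b≈0.050
```

Real papers fit richer forms (e.g. an irreducible-loss term, or joint laws in parameters and data), which require nonlinear optimization rather than a log-log line.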

Interpretability

A non-exhaustive list of interpretability works that are informative for pre-training.

  1. On the Biology of a Large Language Model
  2. Physics of Language Models
  3. In-context Learning and Induction Heads
  4. Rethinking Reflection in Pre-Training

Model Architecture Improvement

A non-exhaustive list of recent model-architecture improvements.

  1. Gated Delta Networks: Improving Mamba2 with Delta Rule
  2. RWKV-7 "Goose" with Expressive Dynamic State Evolution
  3. Mixture of Hidden-Dimensions Transformer
  4. Titans: Learning to Memorize at Test Time
  5. Ultra-Sparse Memory Network
  6. Large Language Diffusion Models
  7. Better & Faster Large Language Models via Multi-token Prediction
  8. Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing
  9. Stick-breaking Attention
  10. Forgetting Transformer: Softmax Attention with a Forget Gate
  11. Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
  12. MoBA: Mixture of Block Attention for Long-Context LLMs
  13. KV Shifting Attention Enhances Language Modeling
  14. Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models
  15. Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts
  16. ReLU2 Wins: Discovering Efficient Activation Functions for Sparse LLMs
  17. μnit Scaling: Simple and Scalable FP8 LLM Training
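The auxiliary-loss-free load-balancing idea (item 15) maintains a per-expert bias that is added to routing scores only when selecting the top-k experts, never in the gating weights; after each batch, the bias of underloaded experts is nudged up and that of overloaded experts down. A toy sketch of this mechanism (expert counts, step size, and the skewed scores are all illustrative):

```python
import numpy as np

def route_topk(scores, bias, k):
    """Pick top-k experts per token from bias-adjusted scores; the bias
    steers selection only and never enters the gating weights."""
    return np.argsort(scores + bias, axis=-1)[:, -k:]

def update_bias(bias, counts, gamma=0.05):
    """Raise the bias of underloaded experts, lower overloaded ones."""
    return bias + gamma * np.sign(counts.mean() - counts)

rng = np.random.default_rng(0)
n_experts, k = 8, 2
skew = np.linspace(-1.0, 1.0, n_experts)  # some experts unevenly preferred
bias = np.zeros(n_experts)
for _ in range(200):
    scores = rng.normal(size=(256, n_experts)) + skew
    counts = np.bincount(route_topk(scores, bias, k).ravel(), minlength=n_experts)
    bias = update_bias(bias, counts)
print(bias.round(2))  # roughly mirrors -skew, balancing expert load
```

Because the bias never multiplies the expert outputs, balance is achieved without the gradient interference an auxiliary balancing loss introduces.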

Learning Rate Annealing

Learning rate annealing is often paired with data-quality screening: higher-quality data is up-weighted during the decay phase.

  1. MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
  2. Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations
  3. Scaling Law with Learning Rate Annealing
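The MiniCPM-style warmup-stable-decay (WSD) schedule keeps the learning rate constant for most of training and anneals it sharply at the end, which is exactly the window where higher-quality data is up-weighted. A minimal sketch (the step fractions and peak learning rate are illustrative defaults, not values from the paper):

```python
def wsd_lr(step, total_steps, peak_lr=1e-3, warmup_frac=0.01,
           decay_frac=0.1, min_lr=1e-4):
    """Warmup-stable-decay: linear warmup, long plateau, linear anneal."""
    warmup_steps = int(total_steps * warmup_frac)
    decay_start = int(total_steps * (1 - decay_frac))
    if step < warmup_steps:
        return peak_lr * step / max(warmup_steps, 1)
    if step < decay_start:
        return peak_lr
    frac = (step - decay_start) / max(total_steps - decay_start, 1)
    return peak_lr + (min_lr - peak_lr) * frac

for s in (0, 5, 500, 950, 1000):
    print(s, wsd_lr(s, total_steps=1000))
```

Unlike cosine decay, the plateau lets a run be stopped (or branched for annealing experiments) at any point without committing to a total step count in advance.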

Open-Source Datasets

We mainly discuss existing open-source datasets from four aspects: web pages, mathematics, code, and general (books, encyclopedias, instructions, long contexts, etc.).

Web Pages

Web data forms the core of most pre-training corpora.

  1. DCLM. [paper] [resource]

An open-source web dataset of 3.8T tokens, obtained after filtering with fastText classifiers and other heuristics.

  2. FineWeb-Edu

An educational-quality-scored corpus filtered from FineWeb; it helps noticeably on knowledge-intensive benchmarks.

  3. Nemotron-CC-HQ. [paper] [resource]

NVIDIA's high-quality Common Crawl corpus.

  4. Chinese-FineWeb-Edu. [resource]

    An open-source Chinese educational quality scoring corpus by OpenCSG, screened and scored from Map-CC, SkyPile, WuDao, Wanjuan, etc.

  5. FineWeb2: A sparkling update with 1000s of languages

    A multilingual dataset.
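Datasets like DCLM and FineWeb-Edu are built by scoring every document with a quality classifier (a fastText model or a small regression head) and keeping documents above a threshold. A sketch of that filtering loop, where `toy_edu_score` is a hypothetical stand-in for the real trained classifier:

```python
def toy_edu_score(text: str) -> float:
    """Stand-in for a real quality classifier (e.g. a fastText model)."""
    markers = ("theorem", "explain", "because", "definition")
    hits = sum(text.lower().count(m) for m in markers)
    return min(1.0, hits / 3)

def filter_corpus(docs, threshold=0.5):
    """Keep documents whose quality score clears the threshold."""
    return [d for d in docs if toy_edu_score(d) >= threshold]

docs = [
    "Click here to win a free prize now!!!",
    "Definition: a prime is divisible only by 1 and itself, because ...",
]
print(filter_corpus(docs))  # keeps only the educational document
```

In practice the classifier is trained on labeled seed data (e.g. instruction-like or教科书-style positives), and the threshold trades corpus size against quality.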

Mathematics

  1. MegaMath: Pushing the Limits of Open Math Corpora

    The largest open-source high-quality mathematical CC corpus.

  2. JiuZhang3.0: Efficiently Improving Mathematical Reasoning by Training Small Data Synthesis Models

    Synthetic mathematical instruction data.

  3. mlfoundations-dev/stackoverflow_math

    Math-related questions.

  4. DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning

    A high-difficulty mathematical dataset.

  5. YuLan-Mini: An Open Data-efficient Language Model

    Collecting open-source Lean theorem proving datasets.

Code

  1. OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models

    Cleaned from The-Stack-V2.

  2. SmolLM-corpus. [resource]

    Python educational quality scoring.

  3. The-Stack-V2

    The largest-scale uncleaned code data.

  4. YuLan-Mini: An Open Data-efficient Language Model

Jupyter Notebook and Python data cleaned and filtered by educational quality.

  5. HuggingFaceTB/issues-kaggle-notebooks

    GitHub Issues and Kaggle Notebooks data.

  6. mlfoundations-dev/stackoverflow

    A programming Q&A forum.

  7. Magicoder: Empowering Code Generation with OSS-Instruct

    Training with synthetic instruction data generated from open-source code.

General (Books, Encyclopedias, Instructions, Long Contexts, etc.)

  1. YuLan: An Open-source Large Language Model

    Enhancing long-tail knowledge and cleaning various general data sources.

  2. MinerU: An Open-Source Solution for Precise Document Content Extraction

Converts PDFs to Markdown with broad format compatibility.

  3. The Pile: An 800GB Dataset of Diverse Text for Language Modeling.

    arXiv, conversations, DM Math, etc.

  4. Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research.

    Encyclopedias, books, papers, Reddit, etc.

  5. WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese Large Models

    Laws, exams, news, patents, encyclopedias, etc.

  6. MAmmoTH2: Scaling Instructions from the Web

Q&A pairs extracted from web pages.

  7. togethercomputer/Long-Data-Collections

    Books, papers, web pages, and instructions filtered from datasets such as RedPajama, Pile, and P3.

  8. Longattn: Selecting long-context training data via token-level attention

Q&A data selected for long-range dependencies.

Data Methods

Tokenizers

  1. SuperBPE: Space Travel for Language Models

A method for training tokenizers whose tokens can span multiple words ("superwords").

  2. Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies

Predicting the optimal vocabulary size as models scale.

  3. Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs

Compares how different tokenization schemes handle numbers.
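One design choice the "Tokenization counts" paper examines is the direction of digit grouping: right-to-left chunking aligns token boundaries with thousands separators and tends to help arithmetic, while left-to-right chunking does not. A sketch of the two groupings (three-digit chunks, for illustration):

```python
def chunk_digits(s: str, right_to_left: bool = True, size: int = 3):
    """Split a digit string into fixed-size chunks from either end."""
    if right_to_left:
        rem = len(s) % size
        head = [s[:rem]] if rem else []
        return head + [s[i:i + size] for i in range(rem, len(s), size)]
    return [s[i:i + size] for i in range(0, len(s), size)]

print(chunk_digits("1234567"))                       # ['1', '234', '567']
print(chunk_digits("1234567", right_to_left=False))  # ['123', '456', '7']
```

With R2L chunking the same place values (ones, thousands, millions) always land in the same token positions, which is the intuition behind its advantage on arithmetic.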

Data Ratio and Curriculum

  1. Nemotron-4 15B Technical Report

Splits training into an 8T-token pre-training phase and a smaller continued-pre-training phase.

  2. YuLan-Mini: An Open Data-efficient Language Model

    Using educational scores for curriculum data.

  3. DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining

    Optimizing the mixing ratio of pre-training data.

  4. Efficient Online Data Mixing For Language Model Pre-Training

    Online data mixing.

  5. Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance

    Laws of data mixing.

  6. Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?

Inferring the training-data mixture of commercial models such as GPT from the merge rules of their BPE tokenizers.

  7. CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training

    A clustering-based iterative data mixture bootstrapping framework.

  8. Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens

    Building an index for large-scale pre-training datasets to check data quality.
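Methods like DoReMi output a set of per-domain mixture weights; at training time, each sequence in a batch is drawn from a domain according to those weights. A minimal sketch of that sampling step (the weights below are illustrative, not DoReMi's outputs):

```python
import random

def sample_batch_domains(weights, batch_size, seed=0):
    """Draw one source domain per sequence according to mixture weights."""
    rng = random.Random(seed)  # seeded for reproducible batches
    domains = list(weights)
    probs = [weights[d] for d in domains]
    return rng.choices(domains, weights=probs, k=batch_size)

mixture = {"web": 0.6, "code": 0.2, "math": 0.1, "books": 0.1}
batch = sample_batch_domains(mixture, batch_size=8)
print(batch)
```

Online data-mixing methods (item 4) update `mixture` during training based on per-domain losses, rather than fixing it in advance.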

Data Synthesis

  1. Knowledge-Instruct: Effective Continual Pre-training from Limited Data using Instructions

    Generating information-intensive synthetic instruction data and learning knowledge from a limited corpus.

  2. LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs

Constructing long-form creative-writing data.

  3. Synthetic Data Generation & Multi-Step RL for Reasoning & Tool Use

    Multi-step reasoning data synthesis, decomposing complex tasks into sub-trajectories and optimizing data generation in combination with reinforcement learning.

  4. WildChat: 1M ChatGPT Interaction Logs in the Wild

    An open-source dataset of real user conversations.

  5. Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing

    Alignment data synthesis.

  6. Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-thinking Reasoning Systems

Imitation learning on synthetic long chain-of-thought data.
