Pre-training is the first and most critical stage in developing large language models. As the open-source community continues to advance model architectures, training strategies, open datasets, and data methods, we keep track of the resources available for large-model pre-training and share them back with open-source LLM developers.
Rather than a comprehensive survey, our coverage is limited to commonly used resources and cutting-edge work related to pre-training, so that readers can get started with LLM pre-training quickly. We also welcome updates from the open-source community to jointly advance the development of large models.
Behind each technical report often lie hundreds or thousands of accelerators' worth of compute, so reading a few open-source technical reports is highly recommended.
- The Llama 3 Herd of Models. [paper]
- Qwen2.5 Technical Report. [paper]
- Gemma 3 Technical Report. [paper]
- Nemotron-4 340B Technical Report. [paper]
- Pangu Ultra: Pushing the Limits of Dense Large Language Models on Ascend NPUs. [paper]
- Baichuan 2: Open Large-scale Language Models. [paper]
- DeepSeek-V3 Technical Report. [paper]
- DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models. [paper]
- Mixtral of Experts. [paper]
- Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models. [paper]
- Every FLOP Counts: Scaling a 300B Mixture-of-Experts LING LLM without Premium GPUs. [paper]
- OLMoE: Open Mixture-of-Experts Language Models. [paper]
- Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent. [paper]
- YuLan-Mini: An Open Data-efficient Language Model. [paper]
- MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series. [paper]
- LLM360: Towards Fully Transparent Open-Source LLMs. [paper]
- Nemotron-4 15B Technical Report. [paper]
- Phi-4 Technical Report. [paper]
- OLMo: Accelerating the Science of Language Models. [paper]
- 2 OLMo 2 Furious. [paper]
- Yi: Open Foundation Models by 01.AI. [paper]
- MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies. [paper]
- Falcon Mamba: The First Competitive Attention-free 7B Language Model. [paper]
- MiniMax-01: Scaling Foundation Models with Lightning Attention. [paper]
- Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models. [paper]
All Technical Reports
- LLaMA: Open and Efficient Foundation Language Models. [paper]
- Llama 2: Open Foundation and Fine-Tuned Chat Models. [paper]
- The Llama 3 Herd of Models. [paper]
- DeepSeek LLM: Scaling Open-Source Language Models with Longtermism. [paper]
- DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models. [paper]
- DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. [paper]
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. [paper]
- DeepSeek-V3 Technical Report. [paper]
- Gemma: Open Models Based on Gemini Research and Technology. [paper]
- Gemma 2: Improving Open Language Models at a Practical Size. [paper]
- Gemma 3 Technical Report. [paper]
- Gemini: A Family of Highly Capable Multimodal Models. [paper]
- Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. [paper]
- Textbooks Are All You Need. [paper]
- Textbooks Are All You Need II: phi-1.5 technical report. [paper]
- Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. [paper]
- Phi-4 Technical Report. [paper]
- GLM: General Language Model Pretraining with Autoregressive Blank Infilling. [paper]
- GLM-130B: An Open Bilingual Pre-trained Model. [paper]
- ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools. [paper]
- Baichuan 2: Open Large-scale Language Models. [paper]
- Baichuan-M1: Pushing the Medical Capability of Large Language Models. [paper]
- The Falcon Series of Open Language Models. [paper]
- Falcon2-11B Technical Report. [paper]
- Falcon Mamba: The First Competitive Attention-free 7B Language Model. [paper]
- InternLM: A Multilingual Language Model with Progressively Enhanced Capabilities. [paper]
- InternLM2 Technical Report. [paper]
- MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies. [paper]
- MiniMax-01: Scaling Foundation Models with Lightning Attention. [paper]
- Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models. [paper]
- Skywork: A More Open Bilingual Foundation Model. [paper]
- Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models. [paper]
- Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent. [paper]
- Nemotron-4 15B Technical Report. [paper]
- Nemotron-4 340B Technical Report. [paper]
- Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models. [paper]
- Every FLOP Counts: Scaling a 300B Mixture-of-Experts LING LLM without Premium GPUs. [paper]
- OLMo: Accelerating the Science of Language Models. [paper]
- 2 OLMo 2 Furious. [paper]
- OLMoE: Open Mixture-of-Experts Language Models. [paper]
- YuLan: An Open-source Large Language Model. [paper]
- YuLan-Mini: An Open Data-efficient Language Model. [paper]
- MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series. [paper]
- LLM360: Towards Fully Transparent Open-Source LLMs. [paper]
Megatron-LM is the most commonly used training framework, providing an efficient out-of-the-box baseline.
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
The most commonly used pre-training framework
- Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters
Zero-redundancy data parallelism
- Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts.
MoE computation-communication overlapping
- DeepEP: an efficient expert-parallel communication library
Expert parallel acceleration
- DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
Accelerating FP8 matrix multiplication using the asynchronous features of Hopper
- Liger Kernel: Efficient Triton Kernels for LLM Training
A library of Triton-accelerated kernels.
- Predictable Scale: Part I -- Optimal Hyperparameter Scaling Law in Large Language Model Pretraining
On scaling laws for hyperparameters.
- The Ultra-Scale Playbook: Training LLMs on GPU Clusters
Visualizing the memory usage of parallel strategies
- A Spectral Condition for Feature Learning
A refinement of μP (Maximal Update Parametrization).
- Muon is Scalable for LLM Training
An efficient optimizer
- COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training
Keeps optimizer states and activations in FP8 as well during training.
- Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models
On scaling laws for MoE.
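To make one of the optimizer entries above concrete, here is a minimal NumPy sketch of the Muon update: momentum on the raw gradient followed by approximate orthogonalization via a quintic Newton-Schulz iteration. The coefficients follow the public reference implementation; the `lr` and `beta` values are illustrative, and this is a single-device sketch, not the production distributed version.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Approximately orthogonalize G with a quintic Newton-Schulz iteration.

    After Frobenius-norm scaling, all singular values lie in (0, 1]; five
    quintic steps push them toward 1 (roughly the [0.7, 1.2] band).
    """
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the reference impl
    X = G / (np.linalg.norm(G) + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:  # iterate on the short side for a smaller Gram matrix
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(W, grad, momentum, lr=0.02, beta=0.95):
    """One Muon update: accumulate momentum, orthogonalize, then step."""
    momentum = beta * momentum + grad
    update = newton_schulz_orthogonalize(momentum)
    return W - lr * update, momentum
```

The key design choice is replacing the element-wise Adam-style normalization with a matrix-level orthogonalization of the momentum, which equalizes the update's singular values across directions.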
A non-exhaustive list of interpretability works that offer insights for pre-training.
- On the Biology of a Large Language Model
- Physics of Language Models
- In-context Learning and Induction Heads
- Rethinking Reflection in Pre-Training
A non-exhaustive list of recent improvements to model architectures.
- Gated Delta Networks: Improving Mamba2 with Delta Rule
- RWKV-7 "Goose" with Expressive Dynamic State Evolution
- Mixture of Hidden-Dimensions Transformer
- Titans: Learning to Memorize at Test Time
- Ultra-Sparse Memory Network
- Large Language Diffusion Models
- Better & Faster Large Language Models via Multi-token Prediction
- Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing
- Stick-breaking Attention
- Forgetting Transformer: Softmax Attention with a Forget Gate
- Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
- MoBA: Mixture of Block Attention for Long-Context LLMs
- KV Shifting Attention Enhances Language Modeling
- Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models
- Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts
- ReLU2 Wins: Discovering Efficient Activation Functions for Sparse LLMs
- μnit Scaling: Simple and Scalable FP8 LLM Training
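To make the MoE load-balancing entries above concrete, here is a minimal NumPy sketch of the Switch-Transformer-style auxiliary loss that work such as "Demons in the Detail" analyzes. Top-1 routing is assumed, and shapes are illustrative; the loss equals 1 when routing probabilities are balanced and grows as routing collapses onto few experts.

```python
import numpy as np

def load_balancing_loss(router_logits):
    """Switch-style auxiliary loss: num_experts * sum_i f_i * P_i, where
    f_i is the fraction of tokens routed (top-1) to expert i and P_i is
    the mean router probability mass on expert i."""
    logits = np.asarray(router_logits, dtype=float)   # (tokens, experts)
    num_experts = logits.shape[1]
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)         # per-token softmax
    assigned = probs.argmax(axis=1)                   # top-1 routing
    f = np.bincount(assigned, minlength=num_experts) / len(assigned)
    P = probs.mean(axis=0)
    return num_experts * float(f @ P)
```

The auxiliary-loss-free strategy listed above instead removes this term and steers routing with per-expert bias adjustments.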
Learning-rate annealing is often combined with data-quality screening, e.g. upweighting higher-quality data during the decay phase.
- MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
- Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations
- Scaling Law with Learning Rate Annealing
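The schedule discussed in these works is the warmup-stable-decay (WSD) schedule popularized by MiniCPM. A minimal sketch, with linear decay shown for simplicity (MiniCPM itself uses an exponential-style decay) and illustrative step counts:

```python
def wsd_lr(step, max_lr, warmup_steps, stable_steps, decay_steps,
           min_lr_ratio=0.1):
    """Warmup-stable-decay schedule: linear warmup, a long constant
    plateau, then a short decay to min_lr_ratio * max_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step < warmup_steps + stable_steps:
        return max_lr
    t = min((step - warmup_steps - stable_steps) / decay_steps, 1.0)
    return max_lr * (1 - t * (1 - min_lr_ratio))
```

Unlike cosine schedules, the plateau lets training continue indefinitely; the short decay (where high-quality data is often mixed in) can be launched from any checkpoint.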
We discuss existing open-source datasets mainly from four angles: web pages, mathematics, code, and general data (books, encyclopedias, instructions, long contexts, etc.).
Web data forms the core of the pre-training corpus.
- DCLM. [paper] [resource]
An open-source web dataset of 3.8T tokens, obtained after fastText-based filtering and other cleaning steps.
- FineWeb-Edu
An educational-quality-scored corpus, filtered and scored from FineWeb; it helps noticeably on knowledge-intensive tasks.
- Nemotron-CC-HQ. [paper] [resource]
NVIDIA's high-quality Common Crawl corpus.
- Chinese-FineWeb-Edu. [resource]
An open-source Chinese educational-quality-scored corpus from OpenCSG, filtered and scored from Map-CC, SkyPile, WuDao, WanJuan, etc.
- FineWeb2: A sparkling update with 1000s of languages
A multilingual dataset.
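The classifier-based filtering behind DCLM and the Edu-style datasets above roughly follows one pattern: score every document with a trained quality classifier (fastText in DCLM's case) and keep only the top slice. A sketch, where `score_fn` stands in for the trained classifier (hypothetical here) and the keep fraction is illustrative (DCLM keeps roughly the top tenth by score):

```python
def filter_by_quality(docs, score_fn, keep_fraction=0.1):
    """Keep the highest-scoring `keep_fraction` of documents,
    mimicking classifier-based web quality filtering (sketch only)."""
    scored = sorted(docs, key=score_fn, reverse=True)
    k = max(1, int(len(scored) * keep_fraction))
    return scored[:k]
```

In practice the classifier is trained to separate a small seed of high-quality text (e.g. instruction data or educational pages) from random web text, and scoring is run in a streaming fashion rather than a global sort.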
- MegaMath: Pushing the Limits of Open Math Corpora
The largest open-source high-quality mathematical CC corpus.
- JiuZhang3.0: Efficiently Improving Mathematical Reasoning by Training Small Data Synthesis Models
Synthetic mathematical instruction data.
- mlfoundations-dev/stackoverflow_math
Math-related questions from Stack Overflow.
- DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning
A high-difficulty mathematical dataset.
- YuLan-Mini: An Open Data-efficient Language Model
A collection of open-source Lean theorem-proving datasets.
- OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models
Cleaned from The-Stack-V2.
- SmolLM-corpus. [resource]
Python data scored for educational quality.
- The-Stack-V2
The largest-scale uncleaned code data.
- YuLan-Mini: An Open Data-efficient Language Model
Jupyter Notebook and Python data cleaned with educational-quality scoring.
- HuggingFaceTB/issues-kaggle-notebooks
GitHub Issues and Kaggle Notebooks data.
- mlfoundations-dev/stackoverflow
A programming Q&A forum.
- Magicoder: Empowering Code Generation with OSS-Instruct
Training with synthetic instruction data generated from open-source code.
- YuLan: An Open-source Large Language Model
Enhancing long-tail knowledge and cleaning various general data sources.
- MinerU: An Open-Source Solution for Precise Document Content Extraction
Converts PDFs to Markdown with broad format compatibility.
- The Pile: An 800GB Dataset of Diverse Text for Language Modeling.
arXiv, conversations, DM Math, etc.
- Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research.
Encyclopedias, books, papers, Reddit, etc.
- WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese Large Models
Laws, exams, news, patents, encyclopedias, etc.
- MAmmoTH2: Scaling Instructions from the Web
Q&A pairs extracted from web pages.
- togethercomputer/Long-Data-Collections
Books, papers, web pages, and instructions filtered from datasets such as RedPajama, Pile, and P3.
- Longattn: Selecting long-context training data via token-level attention
Q&A data selected for long-range dependencies via token-level attention.
- SuperBPE: Space Travel for Language Models
A training method for multi-word tokenizers.
- Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies
Predicting the optimal vocabulary size as models scale.
- Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs
Comparing the tokenization methods of numbers.
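The number-tokenization comparison above largely comes down to how digit strings are chunked. A sketch of the left-to-right vs right-to-left three-digit chunking the paper contrasts (the chunk size matches common frontier tokenizers; the helper name is ours):

```python
def chunk_number(digits, direction="l2r", size=3):
    """Split a digit string into fixed-size chunks, either left-to-right
    or right-to-left; r2l aligns chunks with place value (thousands)."""
    if direction == "l2r":
        return [digits[i:i + size] for i in range(0, len(digits), size)]
    rev = [digits[max(0, i - size):i] for i in range(len(digits), 0, -size)]
    return rev[::-1]
```

Right-to-left chunking keeps "ones, thousands, millions" boundaries consistent across numbers of different lengths, which is the property the paper links to better arithmetic.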
- Nemotron-4 15B Technical Report
Training is split into 8T-token pre-training and continued pre-training on a smaller amount of data.
- YuLan-Mini: An Open Data-efficient Language Model
Uses educational-quality scores to build a data curriculum.
- DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining
Optimizing the mixing ratio of pre-training data.
- Efficient Online Data Mixing For Language Model Pre-Training
Online data mixing.
- Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance
Laws of data mixing.
- Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?
Inferring the data mixtures of commercial models such as GPT from the merge rules of their BPE tokenizers.
- CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training
A clustering-based iterative data mixture bootstrapping framework.
- Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens
Builds an index over large-scale pre-training datasets to inspect data quality.
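A common baseline that the mixture-optimization work above (DoReMi, data mixing laws, CLIMB) improves on is temperature-smoothed sampling over per-domain token counts. A minimal sketch, with an illustrative temperature value:

```python
import numpy as np

def temperature_mix(token_counts, temperature=0.7):
    """Turn raw per-domain token counts into sampling weights,
    smoothed by a temperature in (0, 1] that upweights small domains."""
    p = np.asarray(token_counts, dtype=float)
    p = p / p.sum()                 # natural proportions
    p = p ** temperature            # flatten toward uniform for tau < 1
    return p / p.sum()
```

Temperature 1.0 reproduces natural proportions; lower values boost rare domains. The listed methods replace this single knob with weights learned from proxy models or predicted by fitted mixing laws.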
- Knowledge-Instruct: Effective Continual Pre-training from Limited Data using Instructions
Generates information-dense synthetic instruction data so models can learn knowledge from a limited corpus.
- LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs
Constructs long-form creative-writing training data.
- Synthetic Data Generation & Multi-Step RL for Reasoning & Tool Use
Synthesizes multi-step reasoning data by decomposing complex tasks into sub-trajectories and optimizing data generation with reinforcement learning.
- WildChat: 1M ChatGPT Interaction Logs in the Wild
An open-source dataset of real user conversations.
- Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing
Alignment data synthesis.
- Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-thinking Reasoning Systems
Imitation learning on synthetic long chain-of-thought data.