1. Test-Time Scaling Method: Parallel Sampling
2. Test-Time Scaling Method: Tree Search
3. Test-Time Scaling Method: Multi-turn Correction
4. Test-Time Scaling Method: (long) Chain-of-Thought
5. Scaling Reinforcement Learning for Long CoT 🔥
6. Supervised Learning for Long CoT 🔥
7. Self-improvement with Test-Time Scaling
8. Ensemble of Test-Time Scaling Methods
9. Inference Time Scaling Laws
10. Improving Scaling Efficiency
11. Latent Thoughts
1. Test-Time Scaling Method: Parallel Sampling
1.1 Application I: (Mathematical) Reasoning
Training Verifiers to Solve Math Word Problems [Paper]
Self-Consistency Improves Chain of Thought Reasoning in Language Models [Paper]
Let's Verify Step by Step [Paper]
Improving Large Language Model Fine-tuning for Solving Math Problems [Paper]
Common 7B Language Models Already Possess Strong Math Capabilities [Paper]
Getting 50% (SoTA) on ARC-AGI with GPT-4o [Paper]
1.2 Application II: Code
Natural Language to Code Translation with Execution [Paper]
Competition-Level Code Generation with AlphaCode [Paper]
CodeT: Code Generation with Generated Tests [Paper]
AlphaCode 2 Technical Report [Paper]
Training Software Engineering Agents and Verifiers with SWE-Gym [Paper]
S*: Test Time Scaling for Code Generation [Paper]
1.3 Application III: Multimodal
URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics [Paper]
Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step [Paper]
1.4 Application IV: Safety
SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models [Paper]
Leveraging Reasoning with Guidelines to Elicit and Utilize Knowledge for Enhancing Safety Alignment [Paper]
1.5 Application V: Retrieval-Augmented Generation
Chain-of-Retrieval Augmented Generation [Paper]
1.6 Application VI: Evaluation
Crowd Comparative Reasoning: Unlocking Comprehensive Evaluations for LLM-as-a-Judge [Paper]
Inference-Time Scaling for Generalist Reward Modeling [Paper]
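The papers in this section share one recipe: sample many candidate solutions in parallel, then aggregate. A minimal sketch of the aggregation step, self-consistency majority voting, with the LLM sampler stubbed out as a list of pre-drawn reasoning paths:

```python
from collections import Counter

def self_consistency(samples):
    """Return the most frequent final answer among sampled reasoning paths."""
    votes = Counter(answer for _, answer in samples)
    answer, _count = votes.most_common(1)[0]
    return answer

# Five pre-drawn (chain_of_thought, final_answer) pairs; 3 of 5 agree on "18".
paths = [
    ("reason A", "18"),
    ("reason B", "18"),
    ("reason C", "24"),
    ("reason D", "18"),
    ("reason E", "6"),
]
print(self_consistency(paths))  # -> 18
```

Verifier-based variants (best-of-N) replace the vote with a learned reward model's argmax over the same pool of samples.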
2. Test-Time Scaling Method: Tree Search
2.1 Application I: (Mathematical) Reasoning
Generative Language Modeling for Automated Theorem Proving [Paper]
HyperTree Proof Search for Neural Theorem Proving [Paper]
Large Language Model Guided Tree-of-Thought [Paper]
Tree of Thoughts: Deliberate Problem Solving with Large Language Models [Paper]
Alphazero-like Tree-Search Can Guide Large Language Model Decoding and Training [Paper]
Hypothesis Search: Inductive Reasoning with Language Models [Paper]
Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models [Paper]
ToolChain*: Efficient Action Space Navigation in Large Language Models with A* Search [Paper]
Reasoning with Language Model is Planning with World Model [Paper]
Self-Evaluation Guided Beam Search for Reasoning [Paper]
AlphaMath Almost Zero: Process Supervision without Process [Paper]
Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [Paper]
MindStar: Enhancing Math Reasoning in Pre-Trained LLMs at Inference Time [Paper]
Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing [Paper]
Plan of Thoughts: Heuristic-Guided Problem Solving with Large Language Models [Paper]
ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search [Paper]
Step-Level Value Preference Optimization for Mathematical Reasoning [Paper]
Accessing GPT-4 Level Mathematical Olympiad Solutions via Monte Carlo Tree Self-Refine with LLaMa-3 8B [Paper]
Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs [Paper]
Q*: Improving Multi-Step Reasoning for LLMs with Deliberative Planning [Paper]
DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search [Paper]
Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers [Paper]
Planning with MCTS: Enhancing Problem-Solving in Large Language Models [Paper]
CPL: Critical Plan Step Learning Boosts LLM Generalization in Reasoning Tasks [Paper]
Interpretable Contrastive Monte Carlo Tree Search Reasoning [Paper]
rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking [Paper]
BFS-Prover: Scalable Best-First Tree Search for LLM-based Automatic Theorem Proving [Paper]
2.2 Application II: Code
Planning with Large Language Models for Code Generation [Paper]
Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models [Paper]
Cruxeval: A benchmark for code reasoning, understanding and execution [Paper]
Planning In Natural Language Improves LLM Search For Code Generation [Paper]
RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation [Paper]
SRA-MCTS: Self-driven Reasoning Augmentation with Monte Carlo Tree Search for Code Generation [Paper]
O1-Coder: An O1 Replication for Coding [Paper]
2.3 Application III: Multimodal
Llava-CoT: Let vision language models reason step-by-step [Paper]
Scaling inference-time search with vision value model for improved visual comprehension [Paper]
Mulberry: Empowering mllm with o1-like reasoning and reflection via collective monte carlo tree search [Paper]
Llamav-o1: Rethinking step-by-step visual reasoning in llms [Paper]
Video-T1: Test-Time Scaling for Video Generation [Paper]
2.4 Application IV: Agent
Tree of Thoughts: Deliberate Problem Solving with Large Language Models [Paper]
Don't Generate, Discriminate: A Proposal for Grounding Language Models to Real-World Environments [Paper]
Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models [Paper]
ToolChain*: Efficient Action Space Navigation in Large Language Models with A* Search [Paper]
Tree Search for Language Model Agents [Paper]
Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents [Paper]
2.5 Application V: Safety
C-MCTS: Safe Planning with Monte Carlo Tree Search [Paper]
Don't Throw Away Your Value Model! Generating More Preferable Text with Value-Guided Monte-Carlo Tree Search Decoding [Paper]
ARGS: Alignment as Reward-Guided Search [Paper]
Think More, Hallucinate Less: Mitigating Hallucinations via Dual Process of Fast and Slow Thinking [Paper]
Almost Surely Safe Alignment of Large Language Models at Inference-Time [Paper]
STAIR: Improving Safety Alignment with Introspective Reasoning [Paper]
2.6 Application VI: Retrieval-Augmented Generation
AirRAG: Activating Intrinsic Reasoning for Retrieval Augmented Generation using Tree-based Search [Paper]
Chain-of-Retrieval Augmented Generation [Paper]
2.7 Application VII: Evaluation
MCTS-Judge: Test-Time Scaling in LLM-as-a-Judge for Code Correctness Evaluation [Paper]
2.8 Special Topic: Process Reward Model 🔥
Solving math word problems with process- and outcome-based feedback [Paper]
Math-Shepherd: Verify and Reinforce LLMs Step-by-Step without Human Annotations [Paper]
GLoRe: When, Where, and How to Improve LLM Reasoning via Global and Local Refinements [Paper]
Multi-step Problem Solving Through a Verifier: An Empirical Analysis on Model-induced Process Supervision [Paper]
Evaluating Mathematical Reasoning Beyond Accuracy [Paper]
AutoPSV: Automated Process-Supervised Verifier [Paper]
Improve Mathematical Reasoning in Language Models by Automated Process Supervision [Paper]
Token-Supervised Value Models for Enhancing Mathematical Reasoning Capabilities of Large Language Models [Paper]
Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning [Paper]
Free Process Rewards without Process Labels [Paper]
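As a rough illustration of how tree search and process reward models fit together, here is a hedged sketch: beam search over partial reasoning chains, ranked at every depth by a step-level scorer. `expand` and `prm_score` are toy stand-ins for an LLM proposal step and a learned PRM; the toy task just builds a target sum.

```python
def prm_beam_search(expand, prm_score, beam_width, depth):
    """Beam search over partial reasoning chains ranked by a step scorer."""
    beam = [[]]
    for _ in range(depth):
        candidates = [chain + [step] for chain in beam for step in expand(chain)]
        candidates.sort(key=prm_score, reverse=True)   # PRM ranks every prefix
        beam = candidates[:beam_width]
    return beam[0]

# Toy task: reach a sum of 7 in three steps of size 1, 2, or 3.
expand = lambda chain: [1, 2, 3]                # stand-in for LLM step proposals
prm_score = lambda chain: -abs(7 - sum(chain))  # stand-in for a learned PRM

best = prm_beam_search(expand, prm_score, beam_width=3, depth=3)
print(best, sum(best))  # a 3-step chain whose sum is 7
```

MCTS-based methods in this section replace the fixed-width beam with selection/expansion/backup statistics, but the PRM plays the same role of scoring intermediate steps.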
3. Test-Time Scaling Method: Multi-turn Correction
3.1 Application I: (Mathematical) Reasoning
Generating Sequences by Learning to Self-Correct [Paper]
Baldur: Whole-Proof Generation and Repair with Large Language Models [Paper]
CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing [Paper]
Improving Factuality and Reasoning in Language Models through Multiagent Debate [Paper]
Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate [Paper]
Verify-and-Edit: A Knowledge-Enhanced Chain-of-Thought Framework [Paper]
Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-Based Self-Verification [Paper]
Language Models can Solve Computer Tasks [Paper]
Self-Refine: Iterative Refinement with Self-Feedback [Paper]
Reflexion: language agents with verbal reinforcement learning [Paper]
REFINER: Reasoning Feedback on Intermediate Representations [Paper]
Debating with More Persuasive LLMs Leads to More Truthful Answers [Paper]
Recursive Introspection: Teaching Language Model Agents How to Self-Improve [Paper]
Training Language Models to Self-Correct via Reinforcement Learning [Paper]
Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision [Paper]
3.2 Application II: Code
Teaching Large Language Models to Self-Debug [Paper]
Is Self-Repair a Silver Bullet for Code Generation? [Paper]
Reflexion: language agents with verbal reinforcement learning [Paper]
Phenomenal Yet Puzzling: Testing Inductive Reasoning Capabilities of Language Models with Hypothesis Refinement [Paper]
Self-taught optimizer (stop): Recursively self-improving code generation [Paper]
LLM Critics Help Catch LLM Bugs [Paper]
3.3 Application III: Multimodal
Vision-Language Models Can Self-Improve Reasoning via Reflection [Paper]
Insight-v: Exploring long-chain visual reasoning with multimodal large language models [Paper]
Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step [Paper]
MINT: Multi-modal Chain of Thought in Unified Generative Models for Enhanced Image Generation [Paper]
GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing [Paper]
3.4 Application IV: Agent
Language Models can Solve Computer Tasks [Paper]
Reflexion: language agents with verbal reinforcement learning [Paper]
Autonomous Evaluation and Refinement of Digital Agents [Paper]
3.5 Application V: Embodied AI
Inner Monologue: Embodied Reasoning through Planning with Language Models [Paper]
REFLECT: Summarizing Robot Experiences for Failure Explanation and Correction [Paper]
Robots That Ask For Help: Uncertainty Alignment for Large Language Model Planners [Paper]
3.6 Application VI: Safety
Generating Sequences by Learning to Self-Correct [Paper]
CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing [Paper]
Improving Factuality and Reasoning in Language Models through Multiagent Debate [Paper]
MART: Improving LLM Safety with Multi-round Automatic Red-Teaming [Paper]
Combating Adversarial Attacks with Multi-Agent Debate [Paper]
Debategpt: Fine-tuning large language models with multi-agent debate supervision [Paper]
3.7 Application VII: Evaluation
ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate [Paper]
Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate [Paper]
3.8 Special Topic: Critical Perspective 🔥
Large Language Models Cannot Self-Correct Reasoning Yet [Paper]
Can Large Language Models Really Improve by Self-Critiquing Their Own Plans? [Paper]
GPT-4 Doesn't Know It's Wrong: An Analysis of Iterative Prompting for Reasoning Problems [Paper]
LLMs cannot find reasoning errors, but can correct them given the error location [Paper]
Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement [Paper]
When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLMs [Paper]
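The multi-turn correction loop common to Self-Refine, Reflexion, and the critique papers above can be sketched as follows. `critique` and `revise` are stand-ins for prompted model calls; the toy task is numeric so the loop is runnable:

```python
def self_refine(draft, critique, revise, max_turns=3):
    """Alternate critic feedback and revision until the critic is satisfied."""
    answer = draft
    for _ in range(max_turns):
        feedback = critique(answer)
        if feedback is None:        # no issue found: stop refining
            break
        answer = revise(answer, feedback)
    return answer

# Toy task: reach the value 10; the critic reports the signed error.
critique = lambda a: None if a == 10 else 10 - a
revise = lambda a, feedback: a + feedback

print(self_refine(4, critique, revise))  # -> 10
```

Note the caveat raised by the critical-perspective papers above: with a self-generated (rather than external) critic, the loop can fail to detect errors and may not improve at all.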
4. Test-Time Scaling Method: (long) Chain-of-Thought
4.1 Application I: (Mathematical) Reasoning
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models [Paper]
OpenAI O1 System Card [Blog]
QwQ: Reflect Deeply on the Boundaries of the Unknown [Blog]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning [Paper]
Kimi k1.5: Scaling reinforcement learning with llms [Paper]
Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling [Paper]
There May Not be Aha Moment in R1-Zero-like Training — A Pilot Study [Blog]
7B Model and 8K Examples: Emerging Reasoning with Reinforcement Learning is Both Effective and Efficient [Blog]
s1: Simple test-time scaling [Paper]
LIMO: Less is More for Reasoning [Paper]
Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning [Paper]
Demystifying Long Chain-of-Thought Reasoning in LLMs [Paper]
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution [Paper]
LIMR: Less is More for RL Scaling [Paper]
DAPO: An Open-Source LLM Reinforcement Learning System at Scale [Paper]
An Empirical Study on Eliciting and Improving R1-like Reasoning Models [Paper]
Open-Reasoner-Zero: An Open Source Approach to Scaling Reinforcement Learning on the Base Model [Github]
DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL [Blog]
Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs [Paper]
Understanding R1-Zero-Like Training: A Critical Perspective [Paper]
Can Better Cold-Start Strategies Improve RL Training for LLMs? [Blog]
What's Behind PPO's Collapse in Long-CoT? Value Optimization Holds the Secret [Paper]
Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't [Paper]
GPG: A Simple and Strong Reinforcement Learning Baseline for Model Reasoning [Paper]
VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks [Paper]
Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining [Paper]
Rethinking Reflection in Pre-Training [Paper]
4.2 Application II: Code
OpenAI O1 System Card [Blog]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning [Paper]
Kimi k1.5: Scaling reinforcement learning with llms [Paper]
s1: Simple test-time scaling [Paper]
Competitive Programming with Large Reasoning Models [Paper]
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution [Paper]
Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [Paper]
ToRL: Scaling Tool-Integrated RL [Paper]
OpenCodeReasoning: Advancing Data Distillation for Competitive Coding [Paper]
DeepCoder: A Fully Open-Source 14B Coder at O3-mini Level [Paper]
Seed-Thinking-v1.5: Advancing Superb Reasoning Models with Reinforcement Learning [Paper]
4.3 Application III: Multimodal
QVQ: To See the World with Wisdom [Blog]
Llava-onevision: Easy visual task transfer [Paper]
Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution [Paper]
Reducing hallucinations in vision-language models via latent space steering [Paper]
Mammoth-vl: Eliciting multimodal reasoning with instruction tuning at scale [Paper]
Virgo: A Preliminary Exploration on Reproducing o1-like MLLM [Paper]
Imagine while Reasoning in Space: Multimodal Visualization-of-Thought [Paper]
Kimi k1.5: Scaling reinforcement learning with llms [Paper]
Qwen2.5-VL Technical Report [Paper]
EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework [Github]
LMM-R1 [Github]
VLM-R1: A stable and generalizable R1-style Large Vision-Language Model [Github]
R1-V: Reinforcing Super Generalization Ability in Vision-Language Models with Less Than $3 [Github]
open-r1-multimodal: A fork to add multimodal model training to open-r1 [Github]
R1-Multimodal-Journey: A journey to real multimodal R1 [Github]
Open-R1-Video [Github]
R1-Onevision: Open-Source Multimodal Large Language Model with Reasoning Ability [Notion]
Video-R1: Reinforcing Video Reasoning in MLLMs [Paper]
R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model [Paper]
MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning [Paper]
Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement [Paper]
Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme [Paper]
Improved Visual-Spatial Reasoning via R1-Zero-Like Training [Paper]
Kimi-VL [Github]
Introducing OpenAI o3 and o4-mini [Blog]
4.4 Application IV: Agent
ReAct: Synergizing Reasoning and Acting in Language Models [Paper]
Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku [Blog]
PC Agent: While You Sleep, AI Works -- A Cognitive Journey into Digital World [Paper]
UI-TARS: Pioneering Automated GUI Interaction with Native Agents [Paper]
Introducing deep research [Blog]
Computer-Using Agent [Blog]
The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks [Paper]
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution [Paper]
DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments [Paper]
4.5 Application V: Embodied AI
Agent Planning with World Knowledge Model [Paper]
Robotic control via embodied chain-of-thought reasoning [Paper]
Improving Vision-Language-Action Models via Chain-of-Affordance [Paper]
SpatialCoT: Advancing Spatial Reasoning through Coordinate Alignment and Chain-of-Thought for Embodied Task Planning [Paper]
Action-Free Reasoning for Policy Generalization [Paper]
Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning [Paper]
Gemini Robotics: Bringing AI into the Physical World [Paper]
CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models [Paper]
4.6 Application VI: Safety
Chain-of-Verification Reduces Hallucination in Large Language Models [Paper]
Mixture of insighTful Experts (MoTE): The Synergy of Thought Chains and Expert Mixtures in Self-Alignment [Paper]
Deliberative Alignment: Reasoning Enables Safer Language Models [Paper]
SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities [Paper]
4.7 Application VII: Retrieval-Augmented Generation
Inference Scaling for Long-Context Retrieval Augmented Generation [Paper]
Plan*RAG: Efficient Test-Time Planning for Retrieval Augmented Generation [Paper]
Auto-RAG: Autonomous Retrieval-Augmented Generation for Large Language Models [Paper]
Search-o1: Agentic Search-Enhanced Large Reasoning Models [Paper]
AirRAG: Activating Intrinsic Reasoning for Retrieval Augmented Generation using Tree-based Search [Paper]
Chain-of-Retrieval Augmented Generation [Paper]
DeepRAG: Thinking to Retrieval Step by Step for Large Language Models [Paper]
DeepRetrieval: Hacking Real Search Engines and Retrievers with Large Language Models via Reinforcement Learning [Paper]
R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning [Paper]
Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning [Paper]
ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning [Paper]
4.8 Application VIII: Evaluation
FactScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation [Paper]
FacTool: Factuality Detection in Generative AI -- A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios [Paper]
Knowledge-Centric Hallucination Detection [Paper]
RAGChecker: A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation [Paper]
Agent-as-a-Judge: Evaluate Agents with Agents [Paper]
Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge [Paper]
Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators [Paper]
4.9 Special Topic: Representational Complexity of Transformers with CoT 🔥
The Expressive Power of Transformers with Chain of Thought [Paper]
On the Representational Capacity of Neural Language Models with Chain-of-Thought Reasoning [Paper]
Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought [Paper]
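One concrete long-CoT control knob from this section is s1-style budget forcing: suppress the end-of-thinking marker until a minimum token budget is spent, appending a nudge token such as "Wait" so the model keeps reasoning. A hedged sketch with the decoding step stubbed out:

```python
def budget_forced_decode(step, prompt, min_tokens, max_tokens,
                         end_think="</think>", nudge="Wait"):
    """If the model tries to end thinking before `min_tokens`, replace the
    end marker with a nudge token; always truncate at `max_tokens`.
    `step(tokens)` stands in for one LLM decoding step returning a token."""
    tokens = list(prompt)
    while len(tokens) < max_tokens:
        nxt = step(tokens)
        if nxt == end_think:
            if len(tokens) >= min_tokens:
                tokens.append(nxt)
                break
            tokens.append(nudge)      # too early: force more thinking
        else:
            tokens.append(nxt)
    return tokens

# Stub model: emits "step" tokens and tries to stop on every 3rd call.
calls = {"n": 0}
def step(tokens):
    calls["n"] += 1
    return "</think>" if calls["n"] % 3 == 0 else "step"

out = budget_forced_decode(step, prompt=["<think>"], min_tokens=6, max_tokens=12)
print(out)  # first stop attempt is overridden with "Wait"; second is accepted
```

This is a sketch under stated assumptions, not the s1 implementation; the real method operates on tokenizer IDs inside the serving stack.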
5. Scaling Reinforcement Learning for Long CoT 🔥
5.1 Application I: Math & Code
OpenAI O1 System Card [Blog]
QwQ: Reflect Deeply on the Boundaries of the Unknown [Blog]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning [Paper]
Kimi k1.5: Scaling reinforcement learning with llms [Paper]
Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling [Paper]
There May Not be Aha Moment in R1-Zero-like Training — A Pilot Study [Blog]
7B Model and 8K Examples: Emerging Reasoning with Reinforcement Learning is Both Effective and Efficient [Blog]
Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning [Paper]
Demystifying Long Chain-of-Thought Reasoning in LLMs [Paper]
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution [Paper]
LIMR: Less is More for RL Scaling [Paper]
DAPO: An Open-Source LLM Reinforcement Learning System at Scale [Paper]
An Empirical Study on Eliciting and Improving R1-like Reasoning Models [Paper]
Open-Reasoner-Zero: An Open Source Approach to Scaling Reinforcement Learning on the Base Model [Github]
DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL [Blog]
Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs [Paper]
Understanding R1-Zero-Like Training: A Critical Perspective [Paper]
Can Better Cold-Start Strategies Improve RL Training for LLMs? [Blog]
What's Behind PPO's Collapse in Long-CoT? Value Optimization Holds the Secret [Paper]
Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't [Paper]
GPG: A Simple and Strong Reinforcement Learning Baseline for Model Reasoning [Paper]
VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks [Paper]
Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining [Paper]
Rethinking Reflection in Pre-Training [Paper]
5.2 Application II: Retrieval-Augmented Generation
DeepRetrieval: Hacking Real Search Engines and Retrievers with Large Language Models via Reinforcement Learning [Paper]
Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning [Paper]
R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning [Paper]
ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning [Paper]
DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments [Paper]
5.3 Application III: Multimodal
QVQ: To See the World with Wisdom [Blog]
Llava-onevision: Easy visual task transfer [Paper]
Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution [Paper]
Reducing hallucinations in vision-language models via latent space steering [Paper]
Imagine while Reasoning in Space: Multimodal Visualization-of-Thought [Paper]
Kimi k1.5: Scaling reinforcement learning with llms [Paper]
Qwen2.5-VL Technical Report [Paper]
EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework [Github]
LMM-R1 [Github]
VLM-R1: A stable and generalizable R1-style Large Vision-Language Model [Github]
R1-V: Reinforcing Super Generalization Ability in Vision-Language Models with Less Than $3 [Github]
open-r1-multimodal: A fork to add multimodal model training to open-r1 [Github]
R1-Multimodal-Journey: A journey to real multimodal R1 [Github]
Open-R1-Video [Github]
Video-R1: Reinforcing Video Reasoning in MLLMs [Paper]
R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model [Paper]
MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning [Paper]
Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement [Paper]
Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme [Paper]
Improved Visual-Spatial Reasoning via R1-Zero-Like Training [Paper]
Kimi-VL [Blog]
Introducing OpenAI o3 and o4-mini [Blog]
5.4 Papers Sorted by RL Components
5.4.1 RL Algorithm
Proximal Policy Optimization Algorithms [Paper]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models [Paper]
REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models [Paper]
Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning [Paper]
DAPO: An Open-Source LLM Reinforcement Learning System at Scale [Paper]
Understanding R1-Zero-Like Training: A Critical Perspective [Paper]
What's Behind PPO's Collapse in Long-CoT? Value Optimization Holds the Secret [Paper]
GPG: A Simple and Strong Reinforcement Learning Baseline for Model Reasoning [Paper]
VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks [Paper]
5.4.2 Reward
Self-Explore: Enhancing Mathematical Reasoning in Language Models with Fine-grained Rewards [Paper]
Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms [Paper]
On Designing Effective RL Reward at Training Time for LLM Reasoning [Paper]
VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment [Paper]
Does RLHF Scale? Exploring the Impacts From Data, Model, and Method [Paper]
Process Reinforcement through Implicit Rewards [Paper]
5.4.3 Base Model
Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs [Paper]
Understanding R1-Zero-Like Training: A Critical Perspective [Paper]
Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining [Paper]
Rethinking Reflection in Pre-Training [Paper]
5.4.4 Data
Kimi k1.5: Scaling reinforcement learning with llms [Paper]
LIMR: Less is More for RL Scaling [Paper]
DAPO: An Open-Source LLM Reinforcement Learning System at Scale [Paper]
5.4.5 Multi-stage Training
Kimi k1.5: Scaling reinforcement learning with llms [Paper]
Demystifying Long Chain-of-Thought Reasoning in LLMs [Paper]
Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning [Paper]
DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL [Blog]
Can Better Cold-Start Strategies Improve RL Training for LLMs? [Blog]
Position: Benchmarking is Limited in Reinforcement Learning Research [Paper]
A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility [Paper]
Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data [Paper]
RL on Incorrect Synthetic Data Scales the Efficiency of LLM Math Reasoning by Eight-Fold [Paper]
5.5 Open-Source Frameworks
OpenRLHF [Github]
verl [Github]
NeMo-Aligner [Github]
DeepSpeed-Chat [Github]
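Most of the R1-style systems in this section use GRPO-family objectives. A minimal sketch of the group-relative advantage computation introduced in DeepSeekMath's GRPO (sample a group of responses per prompt, normalize each scalar reward by the group mean and standard deviation, no learned value network); real trainers add clipping, KL regularization, and token-level credit assignment:

```python
def grpo_advantages(rewards):
    """Group-relative advantages: (r - mean) / std over one sampled group."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5 or 1.0   # guard: all-equal rewards -> zero advantages
    return [(r - mean) / std for r in rewards]

# One prompt, four sampled answers, rule-based 0/1 correctness rewards.
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 2) for a in adv])  # -> [1.0, -1.0, -1.0, 1.0]
```

Several of the papers above (e.g. Dr. GRPO in "Understanding R1-Zero-Like Training") argue for dropping the std normalization; the sketch shows the baseline form.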
6. Supervised Learning for Long CoT 🔥
| Work | Application | Type | Source | Quantity | Modality | Link |
| --- | --- | --- | --- | --- | --- | --- |
| O1 Journey–Part 1 | Math | Synthesize | GPT-4o | 0.3K | Text | GitHub, HuggingFace |
| Marco-o1 | Reasoning | Synthesize | Qwen2-7B-Instruct | 10K | Text | GitHub |
| STILL-2 | Math, Code, Science, Puzzle | Distillation | DeepSeek-R1-Lite-Preview, QwQ-32B-preview | 5K | Text | GitHub, HuggingFace |
| RedStar-math | Math | Distillation | QwQ-32B-preview | 4K | Text | HuggingFace |
| RedStar-code | Code | Distillation | QwQ-32B-preview | 16K | Text | HuggingFace |
| RedStar-multimodal | Math | Distillation | QwQ-32B-preview | 12K | Vision + Text | HuggingFace |
| S1K | Math, Science, Code | Distillation | Gemini Flash Thinking | 1K | Text | GitHub, HuggingFace |
| S1K-1.1 | Math, Science, Code | Distillation | DeepSeek R1 | 1K | Text | GitHub, HuggingFace |
| LIMO | Math | Distillation | DeepSeek R1, DeepSeek-R1-Distill-Qwen-32B | 0.8K | Text | GitHub, HuggingFace |
| OpenThoughts-114k | Math, Code, Science, Puzzle | Distillation | DeepSeek R1 | 114K | Text | GitHub, HuggingFace |
| OpenR1-Math-220k | Math | Distillation | DeepSeek R1 | 220K | Text | GitHub, HuggingFace |
| OpenThoughts2-1M | Math, Code, Science, Puzzle | Distillation | DeepSeek R1 | 1M | Text | GitHub, HuggingFace |
| CodeForces-CoTs | Code | Distillation | DeepSeek R1 | 47K | Text | GitHub, HuggingFace |
| Sky-T1-17k | Math, Code, Science, Puzzle | Distillation | QwQ-32B-Preview | 17K | Text | GitHub, HuggingFace |
| S²R | Math | Synthesize | Qwen2.5-Math-7B | 3K | Text | GitHub, HuggingFace |
| R1-Onevision | Science, Math, General | Distillation | DeepSeek R1 | 155K | Vision + Text | GitHub, HuggingFace |
| OpenO1-SFT | Math, Code | Synthesize | - | 77K | Text | GitHub, HuggingFace |
| Medical-o1 | Medical | Distillation | DeepSeek R1 | 25K | Text | GitHub, HuggingFace |
| O1 Journey–Part 3 | Medical | Distillation | o1-preview | 0.5K | Text | GitHub, HuggingFace |
| SCP-116K | Math, Science | Distillation | DeepSeek R1 | 116K | Text | GitHub, HuggingFace |
| open-r1-multimodal | Math | Distillation | GPT-4o | 8K | Vision + Text | GitHub, HuggingFace |
| Vision-R1-cold | Science, Math, General | Distillation | DeepSeek R1 | 200K | Vision + Text | GitHub, HuggingFace |
| MMMU-Reasoning-Distill-Validation | Science, Math, General | Distillation | DeepSeek R1 | 0.8K | Vision + Text | ModelScope |
| Clevr-CoGenT | Vision Counting | Distillation | DeepSeek R1 | 37.8K | Vision + Text | GitHub, HuggingFace |
| VL-Thinking | Science, Math, General | Distillation | DeepSeek R1 | 158K | Vision + Text | GitHub, HuggingFace |
| Video-R1 | Video | Distillation | Qwen2.5-VL-72B | 158K | Vision + Text | GitHub, HuggingFace |
| Embodied-Reasoner | Embodied AI | Synthesize | GPT-4o | 9K | Vision + Text | GitHub, HuggingFace |
| OpenCodeReasoning | Code | Distillation | DeepSeek R1 | 736K | Text | HuggingFace |
| SafeChain | Safety | Distillation | DeepSeek R1 | 40K | Text | GitHub, HuggingFace |
| KodCode | Code | Distillation | DeepSeek R1 | 2.8K | Text | GitHub, HuggingFace |
Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping [Paper]
Stream of Search (SoS): Learning to Search in Language [Paper]
Physics of Language Models: Part 2.2, How to Learn From Mistakes on Grade-School Math Problems [Paper]
Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-thinking Reasoning Systems [Paper]
s1: Simple test-time scaling [Paper]
RedStar: Does Scaling Long-CoT Data Unlock Better Slow-Reasoning Systems? [Paper]
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training [Paper]
LIMO: Less is More for Reasoning [Paper]
Small Models Struggle to Learn from Strong Reasoners [Paper]
LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters! [Paper]
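Most "Distillation"-type datasets in the table above are built by rejection filtering: keep only teacher reasoning traces whose extracted final answer matches a reference. A hedged sketch of that filtering step (`answer_of` is an assumed extraction helper, and the `####` answer delimiter is illustrative):

```python
def filter_distillation_traces(traces, answer_of, references):
    """Keep only (question_id, trace) pairs whose final answer matches the
    reference answer for that question; the kept pairs become SFT data."""
    kept = []
    for qid, trace in traces:
        if answer_of(trace) == references.get(qid):
            kept.append((qid, trace))
    return kept

# Toy traces: the final answer is whatever follows "####".
answer_of = lambda t: t.split("####")[-1].strip()
traces = [
    ("q1", "think... #### 42"),
    ("q1", "think... #### 41"),
    ("q2", "think... #### 7"),
]
refs = {"q1": "42", "q2": "7"}
print(filter_distillation_traces(traces, answer_of, refs))  # keeps 2 of 3
```

Curation-focused datasets above (s1K, LIMO, LIMR) then subsample the filtered pool aggressively for difficulty and diversity rather than keeping everything.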
7. Self-improvement with Test-Time Scaling
7.1 Parallel Sampling
STaR: Bootstrapping Reasoning With Reasoning [Paper]
RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment [Paper]
Language Models Can Teach Themselves to Program Better [Paper]
Scaling relationship on learning mathematical reasoning with large language models [Paper]
Reinforced self-training (rest) for language modeling [Paper]
Large Language Models Can Self-Improve [Paper]
Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models [Paper]
Self-Rewarding Language Models [Paper]
V-STaR: Training Verifiers for Self-Taught Reasoners [Paper]
Iterative Reasoning Preference Optimization [Paper]
Progress or Regress? Self-Improvement Reversal in Post-training [Paper]
Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement [Paper]
Process-based Self-Rewarding Language Models [Paper]
7.2 Tree Search
Alphazero-like Tree-Search Can Guide Large Language Model Decoding and Training [Paper]
AlphaMath Almost Zero: Process Supervision without Process [Paper]
Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [Paper]
ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search [Paper]
Step-Level Value Preference Optimization for Mathematical Reasoning [Paper]
Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs [Paper]
rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking [Paper]
7.3 Multi-turn Correction
Multi-Turn Code Generation Through Single-Step Rewards [Paper]
S²R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning [Paper]
Self-rewarding correction for mathematical reasoning [Paper]
Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [Paper]
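The common skeleton behind STaR, ReST, and their descendants is: sample rationales, keep the ones that verify, train on the survivors, repeat. A minimal sketch with stand-in functions for sampling, verification, and fine-tuning:

```python
def star_iteration(problems, sample_rationale, is_correct, finetune):
    """One self-improvement round: sample a rationale per problem, keep the
    verified ones, and hand them to a training step. Returns the kept count."""
    kept = []
    for p in problems:
        r = sample_rationale(p)
        if is_correct(p, r):
            kept.append((p, r))
    finetune(kept)
    return len(kept)

# Toy setting: problems are (a, b) sums; "rationales" are guessed totals,
# and "fine-tuning" just accumulates the kept examples in a list.
data = [(1, 2), (2, 2), (3, 4)]
guess = lambda p: p[0] + p[1] if p != (2, 2) else 5   # one wrong guess
check = lambda p, r: r == p[0] + p[1]
train_set = []
n = star_iteration(data, guess, check, train_set.extend)
print(n)  # -> 2
```

The full STaR recipe additionally "rationalizes" failures by re-sampling with the answer given as a hint; that step is omitted here.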
8. Ensemble of Test-Time Scaling Methods
Accessing GPT-4 Level Mathematical Olympiad Solutions via Monte Carlo Tree Self-Refine with LLaMa-3 8B [Paper]
Scaling test-time compute with open models [Blog]
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters [Paper]
RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation [Paper]
TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling [Paper]
LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning [Paper]
MC-NEST -- Enhancing Mathematical Reasoning in Large Language Models with a Monte Carlo Nash Equilibrium Self-Refine Tree [Paper]
SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models [Paper]
Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning [Paper]
RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques [Paper]
Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling [Paper]
Scaling Test-Time Compute Without Verification or RL is Suboptimal [Paper]
S*: Test Time Scaling for Code Generation [Paper]
Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [Paper]
9. Inference Time Scaling Laws
The Impact of Reasoning Step Length on Large Language Models [Paper]
More Agents Is All You Need [Paper]
Are More LLM Calls All You Need? Towards Scaling Laws of Compound Inference Systems [Paper]
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling [Paper]
Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models [Paper]
Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling [Paper]
Inference Scaling fLaws: The Limits of LLM Resampling with Imperfect Verifiers [Paper]
s1: Simple test-time scaling [Paper]
Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling [Paper]
Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities? [Paper]
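Several entries here (e.g. Large Language Monkeys) measure how coverage grows with the sampling budget via pass@k. The standard unbiased estimator, pass@k = 1 − C(n−c, k)/C(n, k) for n samples of which c are correct, averaged over problems, can be sketched as:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n generations is correct (c of n correct)."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 100 samples on one problem, exactly 1 correct, budget k=10.
print(round(pass_at_k(100, 1, 10), 3))  # → 0.1
```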
10. Improving Scaling Efficiency
10.1 Parallel Sampling
Let's Sample Step by Step: Adaptive-Consistency for Efficient Reasoning and Coding with LLMs [Paper]
Universal Self-Consistency for Large Language Model Generation [Paper]
Escape Sky-High Cost: Early-Stopping Self-Consistency for Multi-Step Reasoning [Paper]
Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision [Paper]
Dynamic Self-Consistency: Leveraging Reasoning Paths for Efficient LLM Sampling [Paper]
Make Every Penny Count: Difficulty-Adaptive Self-Consistency for Cost-Efficient Reasoning [Paper]
Fast Best-of-N Decoding via Speculative Rejection [Paper]
Inference-Aware Fine-Tuning for Best-of-N Sampling in Large Language Models [Paper]
Multi-Agent Verification: Scaling Test-Time Compute with Multiple Verifiers [Paper]
Scalable Best-of-N Selection for Large Language Models via Self-Certainty [Paper]
Sampling-Efficient Test-Time Scaling: Self-Estimating the Best-of-N Sampling in Early Decoding [Paper]
When To Solve, When To Verify: Compute-Optimal Problem Solving and Generative Verification for LLM Reasoning [Paper]
10.2 Tree Search
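Many of the self-consistency entries above (Adaptive-Consistency, Early-Stopping Self-Consistency, Difficulty-Adaptive Self-Consistency) share one idea: draw samples one at a time and stop as soon as the running majority is confident enough, rather than always paying for a fixed budget. A minimal sketch with the sampler stubbed out and a simple vote-share threshold in place of the papers' statistical stopping tests:

```python
from collections import Counter

def early_stop_majority(sample_fn, max_samples=16, threshold=0.7,
                        min_samples=3):
    """Draw answers one at a time; stop once the leading answer holds at
    least `threshold` of the votes (a simplified confidence criterion)."""
    votes = Counter()
    for i in range(1, max_samples + 1):
        votes[sample_fn()] += 1
        answer, count = votes.most_common(1)[0]
        if i >= min_samples and count / i >= threshold:
            return answer, i            # early exit: confident majority
    return votes.most_common(1)[0][0], max_samples

# Stub sampler standing in for repeated LLM calls on one question.
stream = iter(["17", "17", "17", "19", "17"])
print(early_stop_majority(lambda: next(stream)))  # → ('17', 3)
```

Here only 3 of the 16 allotted samples are consumed; harder questions with split votes run closer to the full budget.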
LiteSearch: Efficacious Tree Search for LLM [Paper]
ETS: Efficient Tree Search for Inference-Time Scaling [Paper]
Dynamic Parallel Tree Search for Efficient LLM Reasoning [Paper]
Don't Get Lost in the Trees: Streamlining LLM Reasoning by Overcoming Tree Search Exploration Pitfalls [Paper]
10.3 Multi-turn Correction
Can Large Language Models Be an Alternative to Human Evaluations? [Paper]
G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment [Paper]

A Stitch in Time Saves Nine: Detecting and Mitigating Hallucinations of LLMs by Validating Low-Confidence Generation [Paper]
Confidence Matters: Revisiting Intrinsic Self-Correction Capabilities of Large Language Models [Paper]
Branch-Solve-Merge Improves Large Language Model Evaluation and Generation [Paper]
Recursive Introspection: Teaching Language Model Agents How to Self-Improve [Paper]
Training Language Models to Self-Correct via Reinforcement Learning [Paper]
Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision [Paper]
10.4 (long) Chain-of-Thought
Implicit Chain of Thought Reasoning via Knowledge Distillation [Paper]
Anchor-based Large Language Models [Paper]
From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step [Paper]
Break the Chain: Large Language Models Can be Shortcut Reasoners [Paper]
Distilling System 2 into System 1 [Paper]
System-1.x: Learning to Balance Fast and Slow Planning with Language Models [Paper]
Concise Thoughts: Impact of Output Length on LLM Reasoning and Cost [Paper]
To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning [Paper]
Dualformer: Controllable Fast and Slow Thinking by Learning with Randomized Reasoning Traces [Paper]
Markov Chain of Thought for Efficient Mathematical Reasoning [Paper]
Can Language Models Learn to Skip Steps? [Paper]
Training Large Language Models to Reason in a Continuous Latent Space [Paper]
Compressed Chain of Thought: Efficient Reasoning Through Dense Representations [Paper]
C3oT: Generating Shorter Chain-of-Thought without Compromising Effectiveness [Paper]
Token-Budget-Aware LLM Reasoning [Paper]
Efficiently Serving LLM Reasoning Programs with Certaindex [Paper]
Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs [Paper]
Generating with Confidence: Uncertainty Quantification for Black-box Large Language Models [Paper]
O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning [Paper]
Think Smarter not Harder: Adaptive Reasoning with Inference Aware Optimization [Paper]
Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs [Paper]
Training Language Models to Reason Efficiently [Paper]
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach [Paper]
CoT-Valve: Length-Compressible Chain-of-Thought Tuning [Paper]
TokenSkip: Controllable Chain-of-Thought Compression in LLMs [Paper]
SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs [Paper]
Stepwise Perplexity-Guided Refinement for Efficient Chain-of-Thought Reasoning in Large Language Models [Paper]
The Relationship Between Reasoning and Performance in Large Language Models -- o3 (mini) Thinks Harder, Not Longer [Paper]
LightThinker: Thinking Step-by-Step Compression [Paper]
Chain of Draft: Thinking Faster by Writing Less [Paper]
Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning [Paper]
Self-Training Elicits Concise Reasoning in Large Language Models [Paper]
CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation [Paper]
Efficient Test-Time Scaling via Self-Calibration [Paper]
How Well do LLMs Compress Their Own Chain-of-Thought? A Token Complexity Approach [Paper]
DAST: Difficulty-Adaptive Slow-Thinking for Large Reasoning Models [Paper]
L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning [Paper]
Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching [Paper]
InftyThink: Breaking the Length Limits of Long-Context Reasoning in Large Language Models [Paper]
Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning [Paper]
Unlocking Efficient Long-to-Short LLM Reasoning with Model Merging [Paper]
Think Less, Achieve More: Cut Reasoning Costs by 50% Without Sacrificing Accuracy [Blog]
11. Latent Thoughts
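Several of the long-to-short entries above (e.g. Self-Training Elicits Concise Reasoning, O1-Pruner) sample multiple reasoning traces, keep the correct ones, and prefer the shortest as a training target or final output. A minimal filter capturing that selection step (the trace format and `is_correct` checker are hypothetical stand-ins for verifier or gold-answer checks):

```python
def shortest_correct(traces, is_correct):
    """From sampled (reasoning, answer) pairs, keep those whose answer
    passes the check and return the one with the fewest reasoning tokens;
    None if every trace fails."""
    kept = [(r, a) for r, a in traces if is_correct(a)]
    if not kept:
        return None
    return min(kept, key=lambda t: len(t[0].split()))

traces = [
    ("step one step two step three", "8"),
    ("quick check", "8"),
    ("wrong path", "9"),
]
print(shortest_correct(traces, lambda a: a == "8"))  # → ('quick check', '8')
```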
Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking [Paper]
Lean-STaR: Learning to Interleave Thinking and Proving [Paper]
RATIONALYST: Pre-training Process-Supervision for Improving Reasoning [Paper]
Reasoning to Learn from Latent Thoughts [Paper]