Well-Organized Papers

1. Test-Time Scaling Method: Parallel Sampling
2. Test-Time Scaling Method: Tree Search
3. Test-Time Scaling Method: Multi-turn Correction
4. Test-Time Scaling Method: (long) Chain-of-Thought
5. Scaling Reinforcement Learning for Long CoT 🔥
6. Supervised Learning for Long CoT 🔥
7. Self-improvement with Test-Time Scaling
8. Ensemble of Test-Time Scaling Method
9. Inference Time Scaling Laws
10. Improving Scaling Efficiency
11. Latent Thoughts

1. Test-Time Scaling Method: Parallel Sampling

1.1 Application I: (Mathematical) Reasoning

  • Training Verifiers to Solve Math Word Problems [Paper]
  • Self-Consistency Improves Chain of Thought Reasoning in Language Models [Paper]
  • Let's Verify Step by Step [Paper]
  • Improving Large Language Model Fine-tuning for Solving Math Problems [Paper]
  • Common 7B Language Models Already Possess Strong Math Capabilities [Paper]
  • Getting 50% (SoTA) on ARC-AGI with GPT-4o [Paper]

1.2 Application II: Code

  • Natural Language to Code Translation with Execution [Paper]
  • Competition-Level Code Generation with AlphaCode [Paper]
  • CodeT: Code Generation with Generated Tests [Paper]
  • AlphaCode 2 Technical Report [Paper]
  • Training Software Engineering Agents and Verifiers with SWE-Gym [Paper]
  • S*: Test Time Scaling for Code Generation [Paper]

1.3 Application III: Multimodal

  • URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics [Paper]
  • Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step [Paper]

1.4 Application IV: Safety

  • SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models [Paper]
  • Leveraging Reasoning with Guidelines to Elicit and Utilize Knowledge for Enhancing Safety Alignment [Paper]

1.5 Application V: RAG

  • Chain-of-Retrieval Augmented Generation [Paper]

1.6 Application VI: Evaluation

  • Crowd Comparative Reasoning: Unlocking Comprehensive Evaluations for LLM-as-a-Judge [Paper]
  • Inference-Time Scaling for Generalist Reward Modeling [Paper]
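The parallel-sampling methods above share one skeleton: draw N independent chains of thought for the same problem and aggregate their final answers, e.g. by majority vote as in Self-Consistency, or by a verifier-ranked best-of-N. A minimal majority-vote sketch; `fake_llm` is a hypothetical stand-in for a temperature > 0 LLM call whose final answer has been parsed out of the sampled reasoning:

```python
from collections import Counter
from typing import Callable

def self_consistency(sample_fn: Callable[[str], str],
                     question: str, n_samples: int = 32) -> str:
    """Draw n independent answers and return the most common one."""
    answers = [sample_fn(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Hypothetical sampler: 3 of every 4 sampled chains reach the same answer.
def fake_llm(question: str) -> str:
    fake_llm.calls += 1
    return "42" if fake_llm.calls % 4 else "41"

fake_llm.calls = 0
print(self_consistency(fake_llm, "What is 6 * 7?"))  # prints "42"
```

Verifier-based variants (e.g. "Training Verifiers to Solve Math Word Problems") replace the vote with `max(answers, key=verifier_score)`.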

2. Test-Time Scaling Method: Tree Search

2.1 Application I: (Mathematical) Reasoning

  • Generative Language Modeling for Automated Theorem Proving [Paper]
  • HyperTree Proof Search for Neural Theorem Proving [Paper]
  • Large Language Model Guided Tree-of-Thought [Paper]
  • Tree of Thoughts: Deliberate Problem Solving with Large Language Models [Paper]
  • Alphazero-like Tree-Search Can Guide Large Language Model Decoding and Training [Paper]
  • Hypothesis Search: Inductive Reasoning with Language Models [Paper]
  • Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models [Paper]
  • ToolChain*: Efficient Action Space Navigation in Large Language Models with A* Search [Paper]
  • Reasoning with Language Model is Planning with World Model [Paper]
  • Self-Evaluation Guided Beam Search for Reasoning [Paper]
  • AlphaMath Almost Zero: Process Supervision without Process [Paper]
  • Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [Paper]
  • MindStar: Enhancing Math Reasoning in Pre-Trained LLMs at Inference Time [Paper]
  • Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing [Paper]
  • Plan of Thoughts: Heuristic-Guided Problem Solving with Large Language Models [Paper]
  • ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search [Paper]
  • Step-Level Value Preference Optimization for Mathematical Reasoning [Paper]
  • Accessing GPT-4 Level Mathematical Olympiad Solutions via Monte Carlo Tree Self-Refine with LLaMa-3 8B [Paper]
  • Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs [Paper]
  • Q*: Improving Multi-Step Reasoning for LLMs with Deliberative Planning [Paper]
  • DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search [Paper]
  • Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers [Paper]
  • Planning with MCTS: Enhancing Problem-Solving in Large Language Models [Paper]
  • CPL: Critical Plan Step Learning Boosts LLM Generalization in Reasoning Tasks [Paper]
  • Interpretable Contrastive Monte Carlo Tree Search Reasoning [Paper]
  • rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking [Paper]
  • BFS-Prover: Scalable Best-First Tree Search for LLM-based Automatic Theorem Proving [Paper]

2.2 Application II: Code

  • Planning with Large Language Models for Code Generation [Paper]
  • Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models [Paper]
  • Cruxeval: A benchmark for code reasoning, understanding and execution [Paper]
  • Planning In Natural Language Improves LLM Search For Code Generation [Paper]
  • RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation [Paper]
  • SRA-MCTS: Self-driven Reasoning Augmentation with Monte Carlo Tree Search for Code Generation [Paper]
  • O1-Coder: An O1 Replication for Coding [Paper]

2.3 Application III: Multimodal

  • Llava-CoT: Let vision language models reason step-by-step [Paper]
  • Scaling inference-time search with vision value model for improved visual comprehension [Paper]
  • Mulberry: Empowering mllm with o1-like reasoning and reflection via collective monte carlo tree search [Paper]
  • Llamav-o1: Rethinking step-by-step visual reasoning in llms [Paper]
  • Video-T1: Test-Time Scaling for Video Generation [Paper]

2.4 Application IV: Agent

  • Tree of Thoughts: Deliberate Problem Solving with Large Language Models [Paper]
  • Don't Generate, Discriminate: A Proposal for Grounding Language Models to Real-World Environments [Paper]
  • Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models [Paper]
  • ToolChain*: Efficient Action Space Navigation in Large Language Models with A* Search [Paper]
  • Tree Search for Language Model Agents [Paper]
  • Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents [Paper]

2.5 Application V: Safety

  • C-MCTS: Safe Planning with Monte Carlo Tree Search [Paper]
  • Don't Throw Away Your Value Model! Generating More Preferable Text with Value-Guided Monte-Carlo Tree Search Decoding [Paper]
  • ARGS: Alignment as Reward-Guided Search [Paper]
  • Think More, Hallucinate Less: Mitigating Hallucinations via Dual Process of Fast and Slow Thinking [Paper]
  • Almost Surely Safe Alignment of Large Language Models at Inference-Time [Paper]
  • STAIR: Improving Safety Alignment with Introspective Reasoning [Paper]

2.6 Application VI: RAG

  • AirRAG: Activating Intrinsic Reasoning for Retrieval Augmented Generation using Tree-based Search [Paper]
  • Chain-of-Retrieval Augmented Generation [Paper]

2.7 Application VII: Evaluation

  • MCTS-Judge: Test-Time Scaling in LLM-as-a-Judge for Code Correctness Evaluation [Paper]

2.8 Special Topic: Process Reward Model 🔥

  • Solving math word problems with process- and outcome-based feedback [Paper]
  • Math-Shepherd: Verify and Reinforce LLMs Step-by-Step without Human Annotations [Paper]
  • GLoRe: When, Where, and How to Improve LLM Reasoning via Global and Local Refinements [Paper]
  • Multi-step Problem Solving Through a Verifier: An Empirical Analysis on Model-induced Process Supervision [Paper]
  • Evaluating Mathematical Reasoning Beyond Accuracy [Paper]
  • AutoPSV: Automated Process-Supervised Verifier [Paper]
  • Improve Mathematical Reasoning in Language Models by Automated Process Supervision [Paper]
  • Token-Supervised Value Models for Enhancing Mathematical Reasoning Capabilities of Large Language Models [Paper]
  • Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning [Paper]
  • Free Process Rewards without Process Labels [Paper]
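A pattern that ties the tree-search papers to the process reward model (PRM) papers above is step-level beam search: expand candidate next reasoning steps, score each partial chain with a PRM, and keep only the top-k chains at each depth. A minimal sketch under stated assumptions; `expand`, `prm_score`, and `is_done` are hypothetical stubs for the step proposer, the PRM, and the termination check:

```python
import heapq
from typing import Callable, List

def prm_beam_search(expand: Callable[[List[str]], List[str]],
                    prm_score: Callable[[List[str]], float],
                    is_done: Callable[[List[str]], bool],
                    beam_width: int = 2, max_depth: int = 8) -> List[str]:
    """Step-level beam search: keep the beam_width partial chains
    with the highest process-reward score at each depth."""
    beams: List[List[str]] = [[]]
    for _ in range(max_depth):
        candidates = [b + [step] for b in beams if not is_done(b)
                      for step in expand(b)]
        candidates += [b for b in beams if is_done(b)]  # carry finished chains
        if not candidates:
            break
        beams = heapq.nlargest(beam_width, candidates, key=prm_score)
        if all(is_done(b) for b in beams):
            break
    return max(beams, key=prm_score)

# Toy run: steps are tokens, the "PRM" counts how many are "a".
best = prm_beam_search(
    expand=lambda chain: ["a", "b"],
    prm_score=lambda chain: float(chain.count("a")),
    is_done=lambda chain: len(chain) == 3,
)
print(best)  # ['a', 'a', 'a']
```

MCTS-based methods (e.g. rStar-Math, ReST-MCTS*) replace this greedy beam with rollouts and backed-up value estimates, but the PRM plays the same scoring role.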

3. Test-Time Scaling Method: Multi-turn Correction

3.1 Application I: (Mathematical) Reasoning

  • Generating Sequences by Learning to Self-Correct [Paper]
  • Baldur: Whole-Proof Generation and Repair with Large Language Models [Paper]
  • CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing [Paper]
  • Improving Factuality and Reasoning in Language Models through Multiagent Debate [Paper]
  • Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate [Paper]
  • Verify-and-Edit: A Knowledge-Enhanced Chain-of-Thought Framework [Paper]
  • Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-Based Self-Verification [Paper]
  • Language Models can Solve Computer Tasks [Paper]
  • Self-Refine: Iterative Refinement with Self-Feedback [Paper]
  • Reflexion: language agents with verbal reinforcement learning [Paper]
  • REFINER: Reasoning Feedback on Intermediate Representations [Paper]
  • Debating with More Persuasive LLMs Leads to More Truthful Answers [Paper]
  • Recursive Introspection: Teaching Language Model Agents How to Self-Improve [Paper]
  • Training Language Models to Self-Correct via Reinforcement Learning [Paper]
  • Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision [Paper]

3.2 Application II: Code

  • Teaching Large Language Models to Self-Debug [Paper]
  • Is Self-Repair a Silver Bullet for Code Generation? [Paper]
  • Reflexion: language agents with verbal reinforcement learning [Paper]
  • Phenomenal Yet Puzzling: Testing Inductive Reasoning Capabilities of Language Models with Hypothesis Refinement [Paper]
  • Self-taught optimizer (stop): Recursively self-improving code generation [Paper]
  • LLM Critics Help Catch LLM Bugs [Paper]

3.3 Application III: Multimodal

  • Vision-Language Models Can Self-Improve Reasoning via Reflection [Paper]
  • Insight-v: Exploring long-chain visual reasoning with multimodal large language models [Paper]
  • Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step [Paper]
  • MINT: Multi-modal Chain of Thought in Unified Generative Models for Enhanced Image Generation [Paper]
  • GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing [Paper]

3.4 Application IV: Agent

  • Language Models can Solve Computer Tasks [Paper]
  • Reflexion: language agents with verbal reinforcement learning [Paper]
  • Autonomous Evaluation and Refinement of Digital Agents [Paper]

3.5 Application V: Embodied AI

  • Inner Monologue: Embodied Reasoning through Planning with Language Models [Paper]
  • REFLECT: Summarizing Robot Experiences for Failure Explanation and Correction [Paper]
  • Robots That Ask For Help: Uncertainty Alignment for Large Language Model Planners [Paper]

3.6 Application VI: Safety

  • Generating Sequences by Learning to Self-Correct [Paper]
  • CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing [Paper]
  • Improving Factuality and Reasoning in Language Models through Multiagent Debate [Paper]
  • MART: Improving LLM Safety with Multi-round Automatic Red-Teaming [Paper]
  • Combating Adversarial Attacks with Multi-Agent Debate [Paper]
  • Debategpt: Fine-tuning large language models with multi-agent debate supervision [Paper]

3.7 Application VII: Evaluation

  • ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate [Paper]
  • Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate [Paper]

3.8 Special Topic: Critical Perspective 🔥

  • Large Language Models Cannot Self-Correct Reasoning Yet [Paper]
  • Can Large Language Models Really Improve by Self-Critiquing Their Own Plans? [Paper]
  • GPT-4 Doesn't Know It's Wrong: An Analysis of Iterative Prompting for Reasoning Problems [Paper]
  • LLMs cannot find reasoning errors, but can correct them given the error location [Paper]
  • Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement [Paper]
  • When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLMs [Paper]
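Most multi-turn correction methods above instantiate one loop: generate, obtain feedback (from a critic model, a tool, or the model itself), and revise until the feedback passes or a turn budget runs out. A minimal sketch with hypothetical `generate`, `critique`, and `revise` stubs; note that the critical-perspective papers above caution that purely self-generated feedback is often unreliable, so the critic here is best backed by an external signal such as code execution:

```python
from typing import Callable, Optional

def refine_loop(generate: Callable[[str], str],
                critique: Callable[[str, str], Optional[str]],
                revise: Callable[[str, str, str], str],
                task: str, max_turns: int = 3) -> str:
    """Generate -> critique -> revise until the critic returns None
    (no issues found) or the turn budget is exhausted."""
    answer = generate(task)
    for _ in range(max_turns):
        feedback = critique(task, answer)
        if feedback is None:          # critic is satisfied
            break
        answer = revise(task, answer, feedback)
    return answer

# Toy instantiation: the "critic" is a real checker, the role played by
# unit tests or interpreters in the self-debugging papers above.
target = 42
ans = refine_loop(
    generate=lambda t: "40",
    critique=lambda t, a: None if int(a) == target else "too low",
    revise=lambda t, a, fb: str(int(a) + 1),
    task="find the number",
    max_turns=5,
)
print(ans)  # prints "42"
```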

4. Test-Time Scaling Method: (long) Chain-of-Thought

4.1 Application I: (Mathematical) Reasoning

  • Chain-of-Thought Prompting Elicits Reasoning in Large Language Models [Paper]
  • OpenAI O1 System Card [Blog]
  • QwQ: Reflect Deeply on the Boundaries of the Unknown [Blog]
  • DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning [Paper]
  • Kimi k1.5: Scaling reinforcement learning with llms [Paper]
  • Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling [Paper]
  • There May Not be Aha Moment in R1-Zero-like Training — A Pilot Study [Blog]
  • 7B Model and 8K Examples: Emerging Reasoning with Reinforcement Learning is Both Effective and Efficient [Blog]
  • s1: Simple test-time scaling [Paper]
  • LIMO: Less is More for Reasoning [Paper]
  • Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning [Paper]
  • Demystifying Long Chain-of-Thought Reasoning in LLMs [Paper]
  • SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution [Paper]
  • LIMR: Less is More for RL Scaling [Paper]
  • DAPO: An Open-Source LLM Reinforcement Learning System at Scale [Paper]
  • An Empirical Study on Eliciting and Improving R1-like Reasoning Models [Paper]
  • Open-Reasoner-Zero: An Open Source Approach to Scaling Reinforcement Learning on the Base Model [Github]
  • DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL [Blog]
  • Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs [Paper]
  • Understanding R1-Zero-Like Training: A Critical Perspective [Paper]
  • Can Better Cold-Start Strategies Improve RL Training for LLMs? [Blog]
  • What's Behind PPO's Collapse in Long-CoT? Value Optimization Holds the Secret [Paper]
  • Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't [Paper]
  • GPG: A Simple and Strong Reinforcement Learning Baseline for Model Reasoning [Paper]
  • VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks [Paper]
  • Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining [Paper]
  • Rethinking Reflection in Pre-Training [Paper]

4.2 Application II: Code

  • OpenAI O1 System Card [Blog]
  • DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning [Paper]
  • Kimi k1.5: Scaling reinforcement learning with llms [Paper]
  • s1: Simple test-time scaling [Paper]
  • Competitive Programming with Large Reasoning Models [Paper]
  • SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution [Paper]
  • Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [Paper]
  • ToRL: Scaling Tool-Integrated RL [Paper]
  • OpenCodeReasoning: Advancing Data Distillation for Competitive Coding [Paper]
  • DeepCoder: A Fully Open-Source 14B Coder at O3-mini Level [Paper]
  • Seed-Thinking-v1.5: Advancing Superb Reasoning Models with Reinforcement Learning [Paper]

4.3 Application III: Multimodal

  • QVQ: To See the World with Wisdom [Blog]
  • Llava-onevision: Easy visual task transfer [Paper]
  • Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution [Paper]
  • Reducing hallucinations in vision-language models via latent space steering [Paper]
  • Mammoth-vl: Eliciting multimodal reasoning with instruction tuning at scale [Paper]
  • Virgo: A Preliminary Exploration on Reproducing o1-like MLLM [Paper]
  • Imagine while Reasoning in Space: Multimodal Visualization-of-Thought [Paper]
  • Kimi k1.5: Scaling reinforcement learning with llms [Paper]
  • Qwen2.5-VL Technical Report [Paper]
  • EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework [Github]
  • LMM-R1 [Github]
  • VLM-R1: A stable and generalizable R1-style Large Vision-Language Model [Github]
  • R1-V: Reinforcing Super Generalization Ability in Vision-Language Models with Less Than $3 [Github]
  • open-r1-multimodal: A fork to add multimodal model training to open-r1 [Github]
  • R1-Multimodal-Journey: A journey to real multimodal R1 [Github]
  • Open-R1-Video [Github]
  • R1-Onevision: Open-Source Multimodal Large Language Model with Reasoning Ability [Notion]
  • Video-R1: Reinforcing Video Reasoning in MLLMs [Paper]
  • R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model [Paper]
  • MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning [Paper]
  • Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement [Paper]
  • Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme [Paper]
  • Improved Visual-Spatial Reasoning via R1-Zero-Like Training [Paper]
  • Kimi-VL [Github]
  • Introducing OpenAI o3 and o4-mini [Blog]

4.4 Application IV: Agent

  • ReAct: Synergizing Reasoning and Acting in Language Models [Paper]
  • Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku [Blog]
  • PC Agent: While You Sleep, AI Works -- A Cognitive Journey into Digital World [Paper]
  • UI-TARS: Pioneering Automated GUI Interaction with Native Agents [Paper]
  • Introducing deep research [Blog]
  • Computer-Using Agent [Blog]
  • The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks [Paper]
  • SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution [Paper]
  • DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments [Paper]

4.5 Application V: Embodied AI

  • Agent Planning with World Knowledge Model [Paper]
  • Robotic control via embodied chain-of-thought reasoning [Paper]
  • Improving Vision-Language-Action Models via Chain-of-Affordance [Paper]
  • SpatialCoT: Advancing Spatial Reasoning through Coordinate Alignment and Chain-of-Thought for Embodied Task Planning [Paper]
  • Action-Free Reasoning for Policy Generalization [Paper]
  • Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning [Paper]
  • Gemini Robotics: Bringing AI into the Physical World [Paper]
  • CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models [Paper]

4.6 Application VI: Safety

  • Chain-of-Verification Reduces Hallucination in Large Language Models [Paper]
  • Mixture of insighTful Experts (MoTE): The Synergy of Thought Chains and Expert Mixtures in Self-Alignment [Paper]
  • Deliberative Alignment: Reasoning Enables Safer Language Models [Paper]
  • SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities [Paper]

4.7 Application VII: RAG

  • Inference Scaling for Long-Context Retrieval Augmented Generation [Paper]
  • Plan*RAG: Efficient Test-Time Planning for Retrieval Augmented Generation [Paper]
  • Auto-RAG: Autonomous Retrieval-Augmented Generation for Large Language Models [Paper]
  • Search-o1: Agentic Search-Enhanced Large Reasoning Models [Paper]
  • AirRAG: Activating Intrinsic Reasoning for Retrieval Augmented Generation using Tree-based Search [Paper]
  • Chain-of-Retrieval Augmented Generation [Paper]
  • DeepRAG: Thinking to Retrieval Step by Step for Large Language Models [Paper]
  • DeepRetrieval: Hacking Real Search Engines and Retrievers with Large Language Models via Reinforcement Learning [Paper]
  • R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning [Paper]
  • Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning [Paper]
  • ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning [Paper]

4.8 Application VIII: Evaluation

  • FactScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation [Paper]
  • FacTool: Factuality Detection in Generative AI -- A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios [Paper]
  • Knowledge-Centric Hallucination Detection [Paper]
  • RAGChecker: A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation [Paper]
  • Agent-as-a-Judge: Evaluate Agents with Agents [Paper]
  • Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge [Paper]
  • Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators [Paper]

4.9 Special Topic: Representational Complexity of Transformers with CoT 🔥

  • The Expressive Power of Transformers with Chain of Thought [Paper]
  • On the Representational Capacity of Neural Language Models with Chain-of-Thought Reasoning [Paper]
  • Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought [Paper]

5. Scaling Reinforcement Learning for Long CoT 🔥

5.1 Application I: Math & Code

  • OpenAI O1 System Card [Blog]
  • QwQ: Reflect Deeply on the Boundaries of the Unknown [Blog]
  • DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning [Paper]
  • Kimi k1.5: Scaling reinforcement learning with llms [Paper]
  • Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling [Paper]
  • There May Not be Aha Moment in R1-Zero-like Training — A Pilot Study [Blog]
  • 7B Model and 8K Examples: Emerging Reasoning with Reinforcement Learning is Both Effective and Efficient [Blog]
  • Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning [Paper]
  • Demystifying Long Chain-of-Thought Reasoning in LLMs [Paper]
  • SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution [Paper]
  • LIMR: Less is More for RL Scaling [Paper]
  • DAPO: An Open-Source LLM Reinforcement Learning System at Scale [Paper]
  • An Empirical Study on Eliciting and Improving R1-like Reasoning Models [Paper]
  • Open-Reasoner-Zero: An Open Source Approach to Scaling Reinforcement Learning on the Base Model [Github]
  • DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL [Blog]
  • Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs [Paper]
  • Understanding R1-Zero-Like Training: A Critical Perspective [Paper]
  • Can Better Cold-Start Strategies Improve RL Training for LLMs? [Blog]
  • What's Behind PPO's Collapse in Long-CoT? Value Optimization Holds the Secret [Paper]
  • Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't [Paper]
  • GPG: A Simple and Strong Reinforcement Learning Baseline for Model Reasoning [Paper]
  • VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks [Paper]
  • Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining [Paper]
  • Rethinking Reflection in Pre-Training [Paper]

5.2 Application II: Search

  • DeepRetrieval: Hacking Real Search Engines and Retrievers with Large Language Models via Reinforcement Learning [Paper]
  • Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning [Paper]
  • R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning [Paper]
  • ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning [Paper]
  • DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments [Paper]

5.3 Application III: Multimodal

  • QVQ: To See the World with Wisdom [Blog]
  • Llava-onevision: Easy visual task transfer [Paper]
  • Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution [Paper]
  • Reducing hallucinations in vision-language models via latent space steering [Paper]
  • Imagine while Reasoning in Space: Multimodal Visualization-of-Thought [Paper]
  • Kimi k1.5: Scaling reinforcement learning with llms [Paper]
  • Qwen2.5-VL Technical Report [Paper]
  • EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework [Github]
  • LMM-R1 [Github]
  • VLM-R1: A stable and generalizable R1-style Large Vision-Language Model [Github]
  • R1-V: Reinforcing Super Generalization Ability in Vision-Language Models with Less Than $3 [Github]
  • open-r1-multimodal: A fork to add multimodal model training to open-r1 [Github]
  • R1-Multimodal-Journey: A journey to real multimodal R1 [Github]
  • Open-R1-Video [Github]
  • Video-R1: Reinforcing Video Reasoning in MLLMs [Paper]
  • R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model [Paper]
  • MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning [Paper]
  • Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement [Paper]
  • Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme [Paper]
  • Improved Visual-Spatial Reasoning via R1-Zero-Like Training [Paper]
  • Kimi-VL [Blog]
  • Introducing OpenAI o3 and o4-mini [Blog]

5.4 Papers Sorted by RL Components

5.4.1 Training Algorithm

  • Proximal Policy Optimization Algorithms [Paper]
  • DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models [Paper]
  • REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models [Paper]
  • Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning [Paper]
  • DAPO: An Open-Source LLM Reinforcement Learning System at Scale [Paper]
  • Understanding R1-Zero-Like Training: A Critical Perspective [Paper]
  • What's Behind PPO's Collapse in Long-CoT? Value Optimization Holds the Secret [Paper]
  • GPG: A Simple and Strong Reinforcement Learning Baseline for Model Reasoning [Paper]
  • VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks [Paper]
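Several of the algorithms above (GRPO in DeepSeekMath, and variants analyzed in DAPO and "Understanding R1-Zero-Like Training") replace PPO's learned value baseline with a group baseline: sample a group of responses per prompt, then center each response's reward on the group mean; GRPO additionally divides by the group standard deviation, a normalization the Dr. GRPO analysis argues against. A sketch of just that advantage computation:

```python
from statistics import mean, pstdev
from typing import List

def group_advantages(rewards: List[float],
                     normalize_std: bool = True) -> List[float]:
    """GRPO-style advantages for one group of sampled responses:
    subtract the group-mean reward; optionally divide by group std."""
    mu = mean(rewards)
    if normalize_std:
        sigma = pstdev(rewards)
        if sigma == 0:                 # all rewards equal -> no learning signal
            return [0.0 for _ in rewards]
        return [(r - mu) / sigma for r in rewards]
    return [r - mu for r in rewards]

# Binary correctness rewards for 4 sampled responses to one prompt.
print(group_advantages([1.0, 0.0, 1.0, 0.0]))  # [1.0, -1.0, 1.0, -1.0]
```

These per-response advantages then weight the usual clipped policy-gradient objective over the response tokens.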

5.4.2 Reward Model

  • Self-Explore: Enhancing Mathematical Reasoning in Language Models with Fine-grained Rewards [Paper]
  • Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms [Paper]
  • On Designing Effective RL Reward at Training Time for LLM Reasoning [Paper]
  • VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment [Paper]
  • Does RLHF Scale? Exploring the Impacts From Data, Model, and Method [Paper]
  • Process Reinforcement through Implicit Rewards [Paper]

5.4.3 Base Model

  • Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs [Paper]
  • Understanding R1-Zero-Like Training: A Critical Perspective [Paper]
  • Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining [Paper]
  • Rethinking Reflection in Pre-Training [Paper]

5.4.4 Training Data

  • Kimi k1.5: Scaling reinforcement learning with llms [Paper]
  • LIMR: Less is More for RL Scaling [Paper]
  • DAPO: An Open-Source LLM Reinforcement Learning System at Scale [Paper]

5.4.5 Multi-stage Training

  • Kimi k1.5: Scaling reinforcement learning with llms [Paper]
  • Demystifying Long Chain-of-Thought Reasoning in LLMs [Paper]
  • Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning [Paper]
  • DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL [Blog]
  • Can Better Cold-Start Strategies Improve RL Training for LLMs? [Blog]

5.4.6 Evaluation

  • Position: Benchmarking is Limited in Reinforcement Learning Research [Paper]
  • A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility [Paper]

5.4.7 Analysis

  • Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data [Paper]
  • RL on Incorrect Synthetic Data Scales the Efficiency of LLM Math Reasoning by Eight-Fold [Paper]

5.5 Infra

  • OpenRLHF [Github]
  • verl [Github]
  • NeMo-Aligner [Github]
  • DeepSpeed-Chat [Github]

6. Supervised Learning for Long CoT 🔥

6.1 Long CoT Resource

| Work | Application | Type | Source | Quantity | Modality | Link |
| --- | --- | --- | --- | --- | --- | --- |
| O1 Journey–Part 1 | Math | Synthesize | GPT-4o | 0.3K | Text | GitHub, HuggingFace |
| Marco-o1 | Reasoning | Synthesize | Qwen2-7B-Instruct | 10K | Text | GitHub |
| STILL-2 | Math, Code, Science, Puzzle | Distillation | DeepSeek-R1-Lite-Preview, QwQ-32B-preview | 5K | Text | GitHub, HuggingFace |
| RedStar-math | Math | Distillation | QwQ-32B-preview | 4K | Text | HuggingFace |
| RedStar-code | Code | Distillation | QwQ-32B-preview | 16K | Text | HuggingFace |
| RedStar-multimodal | Math | Distillation | QwQ-32B-preview | 12K | Vision + Text | HuggingFace |
| S1K | Math, Science, Code | Distillation | Gemini Flash Thinking | 1K | Text | GitHub, HuggingFace |
| S1K-1.1 | Math, Science, Code | Distillation | DeepSeek R1 | 1K | Text | GitHub, HuggingFace |
| LIMO | Math | Distillation | DeepSeek R1, DeepSeek-R1-Distill-Qwen-32B | 0.8K | Text | GitHub, HuggingFace |
| OpenThoughts-114k | Math, Code, Science, Puzzle | Distillation | DeepSeek R1 | 114K | Text | GitHub, HuggingFace |
| OpenR1-Math-220k | Math | Distillation | DeepSeek R1 | 220K | Text | GitHub, HuggingFace |
| OpenThoughts2-1M | Math, Code, Science, Puzzle | Distillation | DeepSeek R1 | 1M | Text | GitHub, HuggingFace |
| CodeForces-CoTs | Code | Distillation | DeepSeek R1 | 47K | Text | GitHub, HuggingFace |
| Sky-T1-17k | Math, Code, Science, Puzzle | Distillation | QwQ-32B-Preview | 17K | Text | GitHub, HuggingFace |
| S²R | Math | Synthesize | Qwen2.5-Math-7B | 3K | Text | GitHub, HuggingFace |
| R1-Onevision | Science, Math, General | Distillation | DeepSeek R1 | 155K | Vision + Text | GitHub, HuggingFace |
| OpenO1-SFT | Math, Code | Synthesize | - | 77K | Text | GitHub, HuggingFace |
| Medical-o1 | Medical | Distillation | DeepSeek R1 | 25K | Text | GitHub, HuggingFace |
| O1 Journey–Part 3 | Medical | Distillation | o1-preview | 0.5K | Text | GitHub, HuggingFace |
| SCP-116K | Math, Science | Distillation | DeepSeek R1 | 116K | Text | GitHub, HuggingFace |
| open-r1-multimodal | Math | Distillation | GPT-4o | 8K | Vision + Text | GitHub, HuggingFace |
| Vision-R1-cold | Science, Math, General | Distillation | DeepSeek R1 | 200K | Vision + Text | GitHub, HuggingFace |
| MMMU-Reasoning-Distill-Validation | Science, Math, General | Distillation | DeepSeek R1 | 0.8K | Vision + Text | ModelScope |
| Clevr-CoGenT | Vision Counting | Distillation | DeepSeek R1 | 37.8K | Vision + Text | GitHub, HuggingFace |
| VL-Thinking | Science, Math, General | Distillation | DeepSeek R1 | 158K | Vision + Text | GitHub, HuggingFace |
| Video-R1 | Video | Distillation | Qwen2.5-VL-72B | 158K | Vision + Text | GitHub, HuggingFace |
| Embodied-Reasoner | Embodied AI | Synthesize | GPT-4o | 9K | Vision + Text | GitHub, HuggingFace |
| OpenCodeReasoning | Code | Distillation | DeepSeek R1 | 736K | Text | HuggingFace |
| SafeChain | Safety | Distillation | DeepSeek R1 | 40K | Text | GitHub, HuggingFace |
| KodCode | Code | Distillation | DeepSeek R1 | 2.8K | Text | GitHub, HuggingFace |

6.2 Analysis

  • Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping [Paper]
  • Stream of Search (SoS): Learning to Search in Language [Paper]
  • Physics of Language Models: Part 2.2, How to Learn From Mistakes on Grade-School Math Problems [Paper]
  • Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-thinking Reasoning Systems [Paper]
  • s1: Simple test-time scaling [Paper]
  • RedStar: Does Scaling Long-CoT Data Unlock Better Slow-Reasoning Systems? [Paper]
  • SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training [Paper]
  • LIMO: Less is More for Reasoning [Paper]
  • Small Models Struggle to Learn from Strong Reasoners [Paper]
  • LLMs Can Easily Learn to Reason from Demonstrations: Structure, not content, is what matters! [Paper]

7. Self-improvement with Test-Time Scaling

7.1 Parallel Sampling

  • STaR: Bootstrapping Reasoning With Reasoning [Paper]
  • RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment [Paper]
  • Language Models Can Teach Themselves to Program Better [Paper]
  • Scaling relationship on learning mathematical reasoning with large language models [Paper]
  • Reinforced self-training (rest) for language modeling [Paper]
  • Large Language Models Can Self-Improve [Paper]
  • Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models [Paper]
  • Self-Rewarding Language Models [Paper]
  • V-STaR: Training Verifiers for Self-Taught Reasoners [Paper]
  • Iterative Reasoning Preference Optimization [Paper]
  • Progress or Regress? Self-Improvement Reversal in Post-training [Paper]
  • Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement [Paper]
  • Process-based Self-Rewarding Language Models [Paper]

7.2 Tree Search

  • Alphazero-like Tree-Search Can Guide Large Language Model Decoding and Training [Paper]
  • AlphaMath Almost Zero: Process Supervision without Process [Paper]
  • Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [Paper]
  • ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search [Paper]
  • Step-Level Value Preference Optimization for Mathematical Reasoning [Paper]
  • Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs [Paper]
  • rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking [Paper]

7.3 Multi-turn Correction

  • Multi-Turn Code Generation Through Single-Step Rewards [Paper]
  • S²R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning [Paper]
  • Self-Rewarding Correction for Mathematical Reasoning [Paper]

7.4 Long CoT

  • Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [Paper]

8. Ensemble of Test-Time Scaling Method

  • Accessing GPT-4 Level Mathematical Olympiad Solutions via Monte Carlo Tree Self-Refine with LLaMa-3 8B [Paper]
  • Scaling test-time compute with open models [Blog]
  • Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters [Paper]
  • RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation [Paper]
  • TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling [Paper]
  • LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning [Paper]
  • MC-NEST -- Enhancing Mathematical Reasoning in Large Language Models with a Monte Carlo Nash Equilibrium Self-Refine Tree [Paper]
  • SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models [Paper]
  • Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning [Paper]
  • RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques [Paper]
  • Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling [Paper]
  • Scaling Test-Time Compute Without Verification or RL is Suboptimal [Paper]
  • S*: Test Time Scaling for Code Generation [Paper]
  • Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [Paper]

9. Inference Time Scaling Laws

  • The Impact of Reasoning Step Length on Large Language Models [Paper]
  • More Agents Is All You Need [Paper]
  • Are More LLM Calls All You Need? Towards Scaling Laws of Compound Inference Systems [Paper]
  • Large Language Monkeys: Scaling Inference Compute with Repeated Sampling [Paper]
  • Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models [Paper]
  • Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling [Paper]
  • Inference Scaling fLaws: The Limits of LLM Resampling with Imperfect Verifiers [Paper]
  • s1: Simple test-time scaling [Paper]
  • Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling [Paper]
  • Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities? [Paper]

10. Improving Scaling Efficiency

10.1 Parallel Sampling

  • Let's Sample Step by Step: Adaptive-Consistency for Efficient Reasoning and Coding with LLMs [Paper]
  • Universal Self-Consistency for Large Language Model Generation [Paper]
  • Escape Sky-High Cost: Early-Stopping Self-Consistency for Multi-Step Reasoning [Paper]
  • Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision [Paper]
  • Dynamic Self-Consistency: Leveraging Reasoning Paths for Efficient LLM Sampling [Paper]
  • Make Every Penny Count: Difficulty-Adaptive Self-Consistency for Cost-Efficient Reasoning [Paper]
  • Fast Best-of-N Decoding via Speculative Rejection [Paper]
  • Inference-Aware Fine-Tuning for Best-of-N Sampling in Large Language Models [Paper]
  • Multi-Agent Verification: Scaling Test-Time Compute with Multiple Verifiers [Paper]
  • Scalable Best-of-N Selection for Large Language Models via Self-Certainty [Paper]
  • Sampling-Efficient Test-Time Scaling: Self-Estimating the Best-of-N Sampling in Early Decoding [Paper]
  • When To Solve, When To Verify: Compute-Optimal Problem Solving and Generative Verification for LLM Reasoning [Paper]

10.2 Tree Search

  • LiteSearch: Efficacious Tree Search for LLM [Paper]
  • ETS: Efficient Tree Search for Inference-Time Scaling [Paper]
  • Dynamic Parallel Tree Search for Efficient LLM Reasoning [Paper]
  • Don't Get Lost in the Trees: Streamlining LLM Reasoning by Overcoming Tree Search Exploration Pitfalls [Paper]

10.3 Multi-turn Correction

  • Can Large Language Models Be an Alternative to Human Evaluations? [Paper]
  • G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment [Paper]
  • A Stitch in Time Saves Nine: Detecting and Mitigating Hallucinations of LLMs by Validating Low-Confidence Generation [Paper]
  • Confidence Matters: Revisiting Intrinsic Self-Correction Capabilities of Large Language Models [Paper]
  • Branch-Solve-Merge Improves Large Language Model Evaluation and Generation [Paper]
  • Recursive Introspection: Teaching Language Model Agents How to Self-Improve [Paper]
  • Training Language Models to Self-Correct via Reinforcement Learning [Paper]
  • Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision [Paper]

10.4 Long CoT

  • Implicit Chain of Thought Reasoning via Knowledge Distillation [Paper]
  • Anchor-based Large Language Models [Paper]
  • From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step [Paper]
  • Break the Chain: Large Language Models Can be Shortcut Reasoners [Paper]
  • Distilling System 2 into System 1 [Paper]
  • System-1.x: Learning to Balance Fast and Slow Planning with Language Models [Paper]
  • Concise Thoughts: Impact of Output Length on LLM Reasoning and Cost [Paper]
  • To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning [Paper]
  • Dualformer: Controllable Fast and Slow Thinking by Learning with Randomized Reasoning Traces [Paper]
  • Markov Chain of Thought for Efficient Mathematical Reasoning [Paper]
  • Can Language Models Learn to Skip Steps? [Paper]
  • Training Large Language Models to Reason in a Continuous Latent Space [Paper]
  • Compressed Chain of Thought: Efficient Reasoning Through Dense Representations [Paper]
  • C3oT: Generating Shorter Chain-of-Thought without Compromising Effectiveness [Paper]
  • Token-Budget-Aware LLM Reasoning [Paper]
  • Efficiently Serving LLM Reasoning Programs with Certaindex [Paper]
  • Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs [Paper]
  • Generating with Confidence: Uncertainty Quantification for Black-box Large Language Models [Paper]
  • O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning [Paper]
  • Think Smarter not Harder: Adaptive Reasoning with Inference Aware Optimization [Paper]
  • Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs [Paper]
  • Training Language Models to Reason Efficiently [Paper]
  • Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach [Paper]
  • CoT-Valve: Length-Compressible Chain-of-Thought Tuning [Paper]
  • TokenSkip: Controllable Chain-of-Thought Compression in LLMs [Paper]
  • SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs [Paper]
  • Stepwise Perplexity-Guided Refinement for Efficient Chain-of-Thought Reasoning in Large Language Models [Paper]
  • The Relationship Between Reasoning and Performance in Large Language Models -- o3 (mini) Thinks Harder, Not Longer [Paper]
  • LightThinker: Thinking Step-by-Step Compression [Paper]
  • Chain of Draft: Thinking Faster by Writing Less [Paper]
  • Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning [Paper]
  • Self-Training Elicits Concise Reasoning in Large Language Models [Paper]
  • CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation [Paper]
  • Efficient Test-Time Scaling via Self-Calibration [Paper]
  • How Well do LLMs Compress Their Own Chain-of-Thought? A Token Complexity Approach [Paper]
  • DAST: Difficulty-Adaptive Slow-Thinking for Large Reasoning Models [Paper]
  • L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning [Paper]
  • Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching [Paper]
  • InftyThink: Breaking the Length Limits of Long-Context Reasoning in Large Language Models [Paper]
  • Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning [Paper]
  • Unlocking Efficient Long-to-Short LLM Reasoning with Model Merging [Paper]
  • Think Less, Achieve More: Cut Reasoning Costs by 50% Without Sacrificing Accuracy [Blog]

11. Latent Thoughts

  • Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking [Paper]
  • Lean-STaR: Learning to Interleave Thinking and Proving [Paper]
  • RATIONALYST: Pre-training Process-Supervision for Improving Reasoning [Paper]
  • Reasoning to Learn from Latent Thoughts [Paper]