Well-Organized Papers

1. Test-Time Scaling Method: Parallel Sampling
2. Test-Time Scaling Method: Tree Search
3. Test-Time Scaling Method: Multi-turn Correction
4. Test-Time Scaling Method: (long) Chain-of-Thought
5. Scaling Reinforcement Learning for Long CoT 🔥
6. Supervised Learning for Long CoT 🔥
7. Self-improvement with Test-Time Scaling
8. Ensemble of Test-Time Scaling Method
9. Inference Time Scaling Laws
10. Improving Scaling Efficiency
11. Latent Thoughts

1. Test-Time Scaling Method: Parallel Sampling

1.1 Application I: (Mathematical) Reasoning

  • Training Verifiers to Solve Math Word Problems [Paper]
  • Self-Consistency Improves Chain of Thought Reasoning in Language Models [Paper]
  • Let's Verify Step by Step [Paper]
  • Improving Large Language Model Fine-tuning for Solving Math Problems [Paper]
  • Common 7B Language Models Already Possess Strong Math Capabilities [Paper]
  • Getting 50% (SoTA) on ARC-AGI with GPT-4o [Paper]

1.2 Application II: Code

  • Natural Language to Code Translation with Execution [Paper]
  • Competition-Level Code Generation with AlphaCode [Paper]
  • CodeT: Code Generation with Generated Tests [Paper]
  • AlphaCode 2 Technical Report [Paper]
  • Training Software Engineering Agents and Verifiers with SWE-Gym [Paper]
  • S*: Test Time Scaling for Code Generation [Paper]

1.3 Application III: Multimodal

  • URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics [Paper]
  • Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step [Paper]

1.4 Application IV: Safety

  • SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models [Paper]
  • Leveraging Reasoning with Guidelines to Elicit and Utilize Knowledge for Enhancing Safety Alignment [Paper]

1.5 Application V: RAG

  • Chain-of-Retrieval Augmented Generation [Paper]

1.6 Application VI: Evaluation

  • Crowd Comparative Reasoning: Unlocking Comprehensive Evaluations for LLM-as-a-Judge [Paper]
  • Inference-Time Scaling for Generalist Reward Modeling [Paper]
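The parallel-sampling methods above share one skeleton: draw N independent chains of thought for the same problem and aggregate their final answers, e.g. by majority vote as in Self-Consistency, or by a verifier-ranked best-of-N. A minimal majority-vote sketch; `fake_llm` is a hypothetical stand-in for a temperature > 0 LLM call whose final answer has been parsed out of the sampled reasoning:

```python
from collections import Counter
from typing import Callable

def self_consistency(sample_fn: Callable[[str], str],
                     question: str, n_samples: int = 32) -> str:
    """Draw n independent answers and return the most common one."""
    answers = [sample_fn(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Hypothetical sampler: 3 of every 4 sampled chains reach the same answer.
def fake_llm(question: str) -> str:
    fake_llm.calls += 1
    return "42" if fake_llm.calls % 4 else "41"

fake_llm.calls = 0
print(self_consistency(fake_llm, "What is 6 * 7?"))  # prints "42"
```

Verifier-based variants (e.g. "Training Verifiers to Solve Math Word Problems") replace the vote with `max(answers, key=verifier_score)`.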

2. Test-Time Scaling Method: Tree Search

2.1 Application I: (Mathematical) Reasoning

  • Generative Language Modeling for Automated Theorem Proving [Paper]
  • HyperTree Proof Search for Neural Theorem Proving [Paper]
  • Large Language Model Guided Tree-of-Thought [Paper]
  • Tree of Thoughts: Deliberate Problem Solving with Large Language Models [Paper]
  • Alphazero-like Tree-Search Can Guide Large Language Model Decoding and Training [Paper]
  • Hypothesis Search: Inductive Reasoning with Language Models [Paper]
  • Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models [Paper]
  • ToolChain*: Efficient Action Space Navigation in Large Language Models with A* Search [Paper]
  • Reasoning with Language Model is Planning with World Model [Paper]
  • Self-Evaluation Guided Beam Search for Reasoning [Paper]
  • AlphaMath Almost Zero: Process Supervision without Process [Paper]
  • Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [Paper]
  • MindStar: Enhancing Math Reasoning in Pre-Trained LLMs at Inference Time [Paper]
  • Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing [Paper]
  • Plan of Thoughts: Heuristic-Guided Problem Solving with Large Language Models [Paper]
  • ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search [Paper]
  • Step-Level Value Preference Optimization for Mathematical Reasoning [Paper]
  • Accessing GPT-4 Level Mathematical Olympiad Solutions via Monte Carlo Tree Self-Refine with LLaMa-3 8B [Paper]
  • Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs [Paper]
  • Q*: Improving Multi-Step Reasoning for LLMs with Deliberative Planning [Paper]
  • DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search [Paper]
  • Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers [Paper]
  • Planning with MCTS: Enhancing Problem-Solving in Large Language Models [Paper]
  • CPL: Critical Plan Step Learning Boosts LLM Generalization in Reasoning Tasks [Paper]
  • Interpretable Contrastive Monte Carlo Tree Search Reasoning [Paper]
  • rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking [Paper]
  • BFS-Prover: Scalable Best-First Tree Search for LLM-based Automatic Theorem Proving [Paper]

2.2 Application II: Code

  • Planning with Large Language Models for Code Generation [Paper]
  • Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models [Paper]
  • Cruxeval: A benchmark for code reasoning, understanding and execution [Paper]
  • Planning In Natural Language Improves LLM Search For Code Generation [Paper]
  • RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation [Paper]
  • SRA-MCTS: Self-driven Reasoning Augmentation with Monte Carlo Tree Search for Code Generation [Paper]
  • O1-Coder: An O1 Replication for Coding [Paper]

2.3 Application III: Multimodal

  • Llava-CoT: Let vision language models reason step-by-step [Paper]
  • Scaling inference-time search with vision value model for improved visual comprehension [Paper]
  • Mulberry: Empowering mllm with o1-like reasoning and reflection via collective monte carlo tree search [Paper]
  • Llamav-o1: Rethinking step-by-step visual reasoning in llms [Paper]
  • Video-T1: Test-Time Scaling for Video Generation [Paper]

2.4 Application IV: Agent

  • Tree of Thoughts: Deliberate Problem Solving with Large Language Models [Paper]
  • Don't Generate, Discriminate: A Proposal for Grounding Language Models to Real-World Environments [Paper]
  • Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models [Paper]
  • ToolChain*: Efficient Action Space Navigation in Large Language Models with A* Search [Paper]
  • Tree Search for Language Model Agents [Paper]
  • Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents [Paper]

2.5 Application V: Safety

  • C-MCTS: Safe Planning with Monte Carlo Tree Search [Paper]
  • Don't Throw Away Your Value Model! Generating More Preferable Text with Value-Guided Monte-Carlo Tree Search Decoding [Paper]
  • ARGS: Alignment as Reward-Guided Search [Paper]
  • Think More, Hallucinate Less: Mitigating Hallucinations via Dual Process of Fast and Slow Thinking [Paper]
  • Almost Surely Safe Alignment of Large Language Models at Inference-Time [Paper]
  • STAIR: Improving Safety Alignment with Introspective Reasoning [Paper]

2.6 Application VI: RAG

  • AirRAG: Activating Intrinsic Reasoning for Retrieval Augmented Generation using Tree-based Search [Paper]
  • Chain-of-Retrieval Augmented Generation [Paper]

2.7 Application VII: Evaluation

  • MCTS-Judge: Test-Time Scaling in LLM-as-a-Judge for Code Correctness Evaluation [Paper]

2.8 Special Topic: Process Reward Model 🔥

  • Solving math word problems with process- and outcome-based feedback [Paper]
  • Math-Shepherd: Verify and Reinforce LLMs Step-by-Step without Human Annotations [Paper]
  • GLoRe: When, Where, and How to Improve LLM Reasoning via Global and Local Refinements [Paper]
  • Multi-step Problem Solving Through a Verifier: An Empirical Analysis on Model-induced Process Supervision [Paper]
  • Evaluating Mathematical Reasoning Beyond Accuracy [Paper]
  • AutoPSV: Automated Process-Supervised Verifier [Paper]
  • Improve Mathematical Reasoning in Language Models by Automated Process Supervision [Paper]
  • Token-Supervised Value Models for Enhancing Mathematical Reasoning Capabilities of Large Language Models [Paper]
  • Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning [Paper]
  • Free Process Rewards without Process Labels [Paper]
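A pattern that ties the tree-search papers to the process reward model (PRM) papers above is step-level beam search: expand candidate next reasoning steps, score each partial chain with a PRM, and keep only the top-k chains at each depth. A minimal sketch under stated assumptions; `expand`, `prm_score`, and `is_done` are hypothetical stubs for the step proposer, the PRM, and the termination check:

```python
import heapq
from typing import Callable, List

def prm_beam_search(expand: Callable[[List[str]], List[str]],
                    prm_score: Callable[[List[str]], float],
                    is_done: Callable[[List[str]], bool],
                    beam_width: int = 2, max_depth: int = 8) -> List[str]:
    """Step-level beam search: keep the beam_width partial chains
    with the highest process-reward score at each depth."""
    beams: List[List[str]] = [[]]
    for _ in range(max_depth):
        candidates = [b + [step] for b in beams if not is_done(b)
                      for step in expand(b)]
        candidates += [b for b in beams if is_done(b)]  # carry finished chains
        if not candidates:
            break
        beams = heapq.nlargest(beam_width, candidates, key=prm_score)
        if all(is_done(b) for b in beams):
            break
    return max(beams, key=prm_score)

# Toy run: steps are tokens, the "PRM" counts how many are "a".
best = prm_beam_search(
    expand=lambda chain: ["a", "b"],
    prm_score=lambda chain: float(chain.count("a")),
    is_done=lambda chain: len(chain) == 3,
)
print(best)  # ['a', 'a', 'a']
```

MCTS-based methods (e.g. rStar-Math, ReST-MCTS*) replace this greedy beam with rollouts and backed-up value estimates, but the PRM plays the same scoring role.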

3. Test-Time Scaling Method: Multi-turn Correction

3.1 Application I: (Mathematical) Reasoning

  • Generating Sequences by Learning to Self-Correct [Paper]
  • Baldur: Whole-Proof Generation and Repair with Large Language Models [Paper]
  • CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing [Paper]
  • Improving Factuality and Reasoning in Language Models through Multiagent Debate [Paper]
  • Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate [Paper]
  • Verify-and-Edit: A Knowledge-Enhanced Chain-of-Thought Framework [Paper]
  • Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-Based Self-Verification [Paper]
  • Language Models can Solve Computer Tasks [Paper]
  • Self-Refine: Iterative Refinement with Self-Feedback [Paper]
  • Reflexion: language agents with verbal reinforcement learning [Paper]
  • REFINER: Reasoning Feedback on Intermediate Representations [Paper]
  • Debating with More Persuasive LLMs Leads to More Truthful Answers [Paper]
  • Recursive Introspection: Teaching Language Model Agents How to Self-Improve [Paper]
  • Training Language Models to Self-Correct via Reinforcement Learning [Paper]
  • Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision [Paper]

3.2 Application II: Code

  • Teaching Large Language Models to Self-Debug [Paper]
  • Is Self-Repair a Silver Bullet for Code Generation? [Paper]
  • Reflexion: language agents with verbal reinforcement learning [Paper]
  • Phenomenal Yet Puzzling: Testing Inductive Reasoning Capabilities of Language Models with Hypothesis Refinement [Paper]
  • Self-taught optimizer (stop): Recursively self-improving code generation [Paper]
  • LLM Critics Help Catch LLM Bugs [Paper]

3.3 Application III: Multimodal

  • Vision-Language Models Can Self-Improve Reasoning via Reflection [Paper]
  • Insight-v: Exploring long-chain visual reasoning with multimodal large language models [Paper]
  • Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step [Paper]
  • MINT: Multi-modal Chain of Thought in Unified Generative Models for Enhanced Image Generation [Paper]
  • GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing [Paper]

3.4 Application IV: Agent

  • Language Models can Solve Computer Tasks [Paper]
  • Reflexion: language agents with verbal reinforcement learning [Paper]
  • Autonomous Evaluation and Refinement of Digital Agents [Paper]

3.5 Application V: Embodied AI

  • Inner Monologue: Embodied Reasoning through Planning with Language Models [Paper]
  • REFLECT: Summarizing Robot Experiences for Failure Explanation and Correction [Paper]
  • Robots That Ask For Help: Uncertainty Alignment for Large Language Model Planners [Paper]

3.6 Application VI: Safety

  • Generating Sequences by Learning to Self-Correct [Paper]
  • CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing [Paper]
  • Improving Factuality and Reasoning in Language Models through Multiagent Debate [Paper]
  • MART: Improving LLM Safety with Multi-round Automatic Red-Teaming [Paper]
  • Combating Adversarial Attacks with Multi-Agent Debate [Paper]
  • Debategpt: Fine-tuning large language models with multi-agent debate supervision [Paper]

3.7 Application VII: Evaluation

  • ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate [Paper]
  • Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate [Paper]

3.8 Special Topic: Critical Perspective 🔥

  • Large Language Models Cannot Self-Correct Reasoning Yet [Paper]
  • Can Large Language Models Really Improve by Self-Critiquing Their Own Plans? [Paper]
  • GPT-4 Doesn't Know It's Wrong: An Analysis of Iterative Prompting for Reasoning Problems [Paper]
  • LLMs cannot find reasoning errors, but can correct them given the error location [Paper]
  • Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement [Paper]
  • When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLMs [Paper]
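Most multi-turn correction methods above instantiate one loop: generate, obtain feedback (from a critic model, a tool, or the model itself), and revise until the feedback passes or a turn budget runs out. A minimal sketch with hypothetical `generate`, `critique`, and `revise` stubs; note that the critical-perspective papers above caution that purely self-generated feedback is often unreliable, so the critic here is best backed by an external signal such as code execution:

```python
from typing import Callable, Optional

def refine_loop(generate: Callable[[str], str],
                critique: Callable[[str, str], Optional[str]],
                revise: Callable[[str, str, str], str],
                task: str, max_turns: int = 3) -> str:
    """Generate -> critique -> revise until the critic returns None
    (no issues found) or the turn budget is exhausted."""
    answer = generate(task)
    for _ in range(max_turns):
        feedback = critique(task, answer)
        if feedback is None:          # critic is satisfied
            break
        answer = revise(task, answer, feedback)
    return answer

# Toy instantiation: the "critic" is a real checker, the role played by
# unit tests or interpreters in the self-debugging papers above.
target = 42
ans = refine_loop(
    generate=lambda t: "40",
    critique=lambda t, a: None if int(a) == target else "too low",
    revise=lambda t, a, fb: str(int(a) + 1),
    task="find the number",
    max_turns=5,
)
print(ans)  # prints "42"
```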

4. Test-Time Scaling Method: (long) Chain-of-Thought

4.1 Application I: (Mathematical) Reasoning

  • Chain-of-Thought Prompting Elicits Reasoning in Large Language Models [Paper]
  • OpenAI O1 System Card [Blog]
  • QwQ: Reflect Deeply on the Boundaries of the Unknown [Blog]
  • DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning [Paper]
  • Kimi k1.5: Scaling reinforcement learning with llms [Paper]
  • Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling [Paper]
  • There May Not be Aha Moment in R1-Zero-like Training — A Pilot Study [Blog]
  • 7B Model and 8K Examples: Emerging Reasoning with Reinforcement Learning is Both Effective and Efficient [Blog]
  • s1: Simple test-time scaling [Paper]
  • LIMO: Less is More for Reasoning [Paper]
  • Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning [Paper]
  • Demystifying Long Chain-of-Thought Reasoning in LLMs [Paper]
  • SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution [Paper]
  • LIMR: Less is More for RL Scaling [Paper]
  • DAPO: An Open-Source LLM Reinforcement Learning System at Scale [Paper]
  • An Empirical Study on Eliciting and Improving R1-like Reasoning Models [Paper]
  • Open-Reasoner-Zero: An Open Source Approach to Scaling Reinforcement Learning on the Base Model [Github]
  • DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL [Blog]
  • Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs [Paper]
  • Understanding R1-Zero-Like Training: A Critical Perspective [Paper]
  • Can Better Cold-Start Strategies Improve RL Training for LLMs? [Blog]
  • What's Behind PPO's Collapse in Long-CoT? Value Optimization Holds the Secret [Paper]
  • Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't [Paper]
  • GPG: A Simple and Strong Reinforcement Learning Baseline for Model Reasoning [Paper]
  • VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks [Paper]
  • Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining [Paper]
  • Rethinking Reflection in Pre-Training [Paper]

4.2 Application II: Code

  • OpenAI O1 System Card [Blog]
  • DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning [Paper]
  • Kimi k1.5: Scaling reinforcement learning with llms [Paper]
  • s1: Simple test-time scaling [Paper]
  • Competitive Programming with Large Reasoning Models [Paper]
  • SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution [Paper]
  • Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [Paper]
  • ToRL: Scaling Tool-Integrated RL [Paper]
  • OpenCodeReasoning: Advancing Data Distillation for Competitive Coding [Paper]
  • DeepCoder: A Fully Open-Source 14B Coder at O3-mini Level [Paper]
  • Seed-Thinking-v1.5: Advancing Superb Reasoning Models with Reinforcement Learning [Paper]

4.3 Application III: Multimodal

  • QVQ: To See the World with Wisdom [Blog]
  • Llava-onevision: Easy visual task transfer [Paper]
  • Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution [Paper]
  • Reducing hallucinations in vision-language models via latent space steering [Paper]
  • Mammoth-vl: Eliciting multimodal reasoning with instruction tuning at scale [Paper]
  • Virgo: A Preliminary Exploration on Reproducing o1-like MLLM [Paper]
  • Imagine while Reasoning in Space: Multimodal Visualization-of-Thought [Paper]
  • Kimi k1.5: Scaling reinforcement learning with llms [Paper]
  • Qwen2.5-VL Technical Report [Paper]
  • EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework [Github]
  • LMM-R1 [Github]
  • VLM-R1: A stable and generalizable R1-style Large Vision-Language Model [Github]
  • R1-V: Reinforcing Super Generalization Ability in Vision-Language Models with Less Than $3 [Github]
  • open-r1-multimodal: A fork to add multimodal model training to open-r1 [Github]
  • R1-Multimodal-Journey: A journey to real multimodal R1 [Github]
  • Open-R1-Video [Github]
  • R1-Onevision: Open-Source Multimodal Large Language Model with Reasoning Ability [Notion]
  • Video-R1: Reinforcing Video Reasoning in MLLMs [Paper]
  • R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model [Paper]
  • MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning [Paper]
  • Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement [Paper]
  • Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme [Paper]
  • Improved Visual-Spatial Reasoning via R1-Zero-Like Training [Paper]
  • Kimi-VL [Github]
  • Introducing OpenAI o3 and o4-mini [Blog]

4.4 Application IV: Agent

  • ReAct: Synergizing Reasoning and Acting in Language Models [Paper]
  • Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku [Blog]
  • PC Agent: While You Sleep, AI Works -- A Cognitive Journey into Digital World [Paper]
  • UI-TARS: Pioneering Automated GUI Interaction with Native Agents [Paper]
  • Introducing deep research [Blog]
  • Computer-Using Agent [Blog]
  • The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks [Paper]
  • SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution [Paper]
  • DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments [Paper]

4.5 Application V: Embodied AI

  • Agent Planning with World Knowledge Model [Paper]
  • Robotic control via embodied chain-of-thought reasoning [Paper]
  • Improving Vision-Language-Action Models via Chain-of-Affordance [Paper]
  • SpatialCoT: Advancing Spatial Reasoning through Coordinate Alignment and Chain-of-Thought for Embodied Task Planning [Paper]
  • Action-Free Reasoning for Policy Generalization [Paper]
  • Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning [Paper]
  • Gemini Robotics: Bringing AI into the Physical World [Paper]
  • CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models [Paper]

4.6 Application VI: Safety

  • Chain-of-Verification Reduces Hallucination in Large Language Models [Paper]
  • Mixture of insighTful Experts (MoTE): The Synergy of Thought Chains and Expert Mixtures in Self-Alignment [Paper]
  • Deliberative Alignment: Reasoning Enables Safer Language Models [Paper]
  • SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities [Paper]

4.7 Application VII: RAG

  • Inference Scaling for Long-Context Retrieval Augmented Generation [Paper]
  • Plan*RAG: Efficient Test-Time Planning for Retrieval Augmented Generation [Paper]
  • Auto-RAG: Autonomous Retrieval-Augmented Generation for Large Language Models [Paper]
  • Search-o1: Agentic Search-Enhanced Large Reasoning Models [Paper]
  • AirRAG: Activating Intrinsic Reasoning for Retrieval Augmented Generation using Tree-based Search [Paper]
  • Chain-of-Retrieval Augmented Generation [Paper]
  • DeepRAG: Thinking to Retrieval Step by Step for Large Language Models [Paper]
  • DeepRetrieval: Hacking Real Search Engines and Retrievers with Large Language Models via Reinforcement Learning [Paper]
  • R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning [Paper]
  • Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning [Paper]
  • ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning [Paper]

4.8 Application VIII: Evaluation

  • FactScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation [Paper]
  • FacTool: Factuality Detection in Generative AI -- A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios [Paper]
  • Knowledge-Centric Hallucination Detection [Paper]
  • RAGChecker: A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation [Paper]
  • Agent-as-a-Judge: Evaluate Agents with Agents [Paper]
  • Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge [Paper]
  • Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators [Paper]

4.9 Special Topic: Representational Complexity of Transformers with CoT 🔥

  • The Expressive Power of Transformers with Chain of Thought [Paper]
  • On the Representational Capacity of Neural Language Models with Chain-of-Thought Reasoning [Paper]
  • Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought [Paper]

5. Scaling Reinforcement Learning for Long CoT 🔥

5.1 Application I: Math & Code

  • OpenAI O1 System Card [Blog]
  • QwQ: Reflect Deeply on the Boundaries of the Unknown [Blog]
  • DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning [Paper]
  • Kimi k1.5: Scaling reinforcement learning with llms [Paper]
  • Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling [Paper]
  • There May Not be Aha Moment in R1-Zero-like Training — A Pilot Study [Blog]
  • 7B Model and 8K Examples: Emerging Reasoning with Reinforcement Learning is Both Effective and Efficient [Blog]
  • Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning [Paper]
  • Demystifying Long Chain-of-Thought Reasoning in LLMs [Paper]
  • SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution [Paper]
  • LIMR: Less is More for RL Scaling [Paper]
  • DAPO: An Open-Source LLM Reinforcement Learning System at Scale [Paper]
  • An Empirical Study on Eliciting and Improving R1-like Reasoning Models [Paper]
  • Open-Reasoner-Zero: An Open Source Approach to Scaling Reinforcement Learning on the Base Model [Github]
  • DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL [Blog]
  • Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs [Paper]
  • Understanding R1-Zero-Like Training: A Critical Perspective [Paper]
  • Can Better Cold-Start Strategies Improve RL Training for LLMs? [Blog]
  • What's Behind PPO's Collapse in Long-CoT? Value Optimization Holds the Secret [Paper]
  • Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't [Paper]
  • GPG: A Simple and Strong Reinforcement Learning Baseline for Model Reasoning [Paper]
  • VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks [Paper]
  • Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining [Paper]
  • Rethinking Reflection in Pre-Training [Paper]

5.2 Application II: Search

  • DeepRetrieval: Hacking Real Search Engines and Retrievers with Large Language Models via Reinforcement Learning [Paper]
  • Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning [Paper]
  • R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning [Paper]
  • ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning [Paper]
  • DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments [Paper]

5.3 Application III: Multimodal

  • QVQ: To See the World with Wisdom [Blog]
  • Llava-onevision: Easy visual task transfer [Paper]
  • Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution [Paper]
  • Reducing hallucinations in vision-language models via latent space steering [Paper]
  • Imagine while Reasoning in Space: Multimodal Visualization-of-Thought [Paper]
  • Kimi k1.5: Scaling reinforcement learning with llms [Paper]
  • Qwen2.5-VL Technical Report [Paper]
  • EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework [Github]
  • LMM-R1 [Github]
  • VLM-R1: A stable and generalizable R1-style Large Vision-Language Model [Github]
  • R1-V: Reinforcing Super Generalization Ability in Vision-Language Models with Less Than $3 [Github]
  • open-r1-multimodal: A fork to add multimodal model training to open-r1 [Github]
  • R1-Multimodal-Journey: A journey to real multimodal R1 [Github]
  • Open-R1-Video [Github]
  • Video-R1: Reinforcing Video Reasoning in MLLMs [Paper]
  • R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model [Paper]
  • MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning [Paper]
  • Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement [Paper]
  • Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme [Paper]
  • Improved Visual-Spatial Reasoning via R1-Zero-Like Training [Paper]
  • Kimi-VL [Blog]
  • Introducing OpenAI o3 and o4-mini [Blog]

5.4 Papers Sorted by RL Components

5.4.1 Training Algorithm

  • Proximal Policy Optimization Algorithms [Paper]
  • DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models [Paper]
  • REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models [Paper]
  • Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning [Paper]
  • DAPO: An Open-Source LLM Reinforcement Learning System at Scale [Paper]
  • Understanding R1-Zero-Like Training: A Critical Perspective [Paper]
  • What's Behind PPO's Collapse in Long-CoT? Value Optimization Holds the Secret [Paper]
  • GPG: A Simple and Strong Reinforcement Learning Baseline for Model Reasoning [Paper]
  • VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks [Paper]
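Several of the algorithms above (GRPO in DeepSeekMath, and variants analyzed in DAPO and "Understanding R1-Zero-Like Training") replace PPO's learned value baseline with a group baseline: sample a group of responses per prompt, then center each response's reward on the group mean; GRPO additionally divides by the group standard deviation, a normalization the Dr. GRPO analysis argues against. A sketch of just that advantage computation:

```python
from statistics import mean, pstdev
from typing import List

def group_advantages(rewards: List[float],
                     normalize_std: bool = True) -> List[float]:
    """GRPO-style advantages for one group of sampled responses:
    subtract the group-mean reward; optionally divide by group std."""
    mu = mean(rewards)
    if normalize_std:
        sigma = pstdev(rewards)
        if sigma == 0:                 # all rewards equal -> no learning signal
            return [0.0 for _ in rewards]
        return [(r - mu) / sigma for r in rewards]
    return [r - mu for r in rewards]

# Binary correctness rewards for 4 sampled responses to one prompt.
print(group_advantages([1.0, 0.0, 1.0, 0.0]))  # [1.0, -1.0, 1.0, -1.0]
```

These per-response advantages then weight the usual clipped policy-gradient objective over the response tokens.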

5.4.2 Reward Model

  • Self-Explore: Enhancing Mathematical Reasoning in Language Models with Fine-grained Rewards [Paper]
  • Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms [Paper]
  • On Designing Effective RL Reward at Training Time for LLM Reasoning [Paper]
  • VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment [Paper]
  • Does RLHF Scale? Exploring the Impacts From Data, Model, and Method [Paper]
  • Process Reinforcement through Implicit Rewards [Paper]

5.4.3 Base Model

  • Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs [Paper]
  • Understanding R1-Zero-Like Training: A Critical Perspective [Paper]
  • Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining [Paper]
  • Rethinking Reflection in Pre-Training [Paper]

5.4.4 Training Data

  • Kimi k1.5: Scaling reinforcement learning with llms [Paper]
  • LIMR: Less is More for RL Scaling [Paper]
  • DAPO: An Open-Source LLM Reinforcement Learning System at Scale [Paper]

5.4.5 Multi-stage Training

  • Kimi k1.5: Scaling reinforcement learning with llms [Paper]
  • Demystifying Long Chain-of-Thought Reasoning in LLMs [Paper]
  • Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning [Paper]
  • DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL [Blog]
  • Can Better Cold-Start Strategies Improve RL Training for LLMs? [Blog]

5.4.6 Evaluation

  • Position: Benchmarking is Limited in Reinforcement Learning Research [Paper]
  • A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility [Paper]

5.4.7 Analysis

  • Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data [Paper]
  • RL on Incorrect Synthetic Data Scales the Efficiency of LLM Math Reasoning by Eight-Fold [Paper]

5.5 Infra

  • OpenRLHF [Github]
  • verl [Github]
  • NeMo-Aligner [Github]
  • DeepSpeed-Chat [Github]

6. Supervised Learning for Long CoT 🔥

6.1 Long CoT Resource

| Work | Application | Type | Source | Quantity | Modality | Link |
| --- | --- | --- | --- | --- | --- | --- |
| O1 Journey–Part 1 | Math | Synthesize | GPT-4o | 0.3K | Text | GitHub, HuggingFace |
| Marco-o1 | Reasoning | Synthesize | Qwen2-7B-Instruct | 10K | Text | GitHub |
| STILL-2 | Math, Code, Science, Puzzle | Distillation | DeepSeek-R1-Lite-Preview, QwQ-32B-preview | 5K | Text | GitHub, HuggingFace |
| RedStar-math | Math | Distillation | QwQ-32B-preview | 4K | Text | HuggingFace |
| RedStar-code | Code | Distillation | QwQ-32B-preview | 16K | Text | HuggingFace |
| RedStar-multimodal | Math | Distillation | QwQ-32B-preview | 12K | Vision + Text | HuggingFace |
| S1K | Math, Science, Code | Distillation | Gemini Flash Thinking | 1K | Text | GitHub, HuggingFace |
| S1K-1.1 | Math, Science, Code | Distillation | DeepSeek R1 | 1K | Text | GitHub, HuggingFace |
| LIMO | Math | Distillation | DeepSeek R1, DeepSeek-R1-Distill-Qwen-32B | 0.8K | Text | GitHub, HuggingFace |
| OpenThoughts-114k | Math, Code, Science, Puzzle | Distillation | DeepSeek R1 | 114K | Text | GitHub, HuggingFace |
| OpenR1-Math-220k | Math | Distillation | DeepSeek R1 | 220K | Text | GitHub, HuggingFace |
| OpenThoughts2-1M | Math, Code, Science, Puzzle | Distillation | DeepSeek R1 | 1M | Text | GitHub, HuggingFace |
| CodeForces-CoTs | Code | Distillation | DeepSeek R1 | 47K | Text | GitHub, HuggingFace |
| Sky-T1-17k | Math, Code, Science, Puzzle | Distillation | QwQ-32B-Preview | 17K | Text | GitHub, HuggingFace |
| S²R | Math | Synthesize | Qwen2.5-Math-7B | 3K | Text | GitHub, HuggingFace |
| R1-Onevision | Science, Math, General | Distillation | DeepSeek R1 | 155K | Vision + Text | GitHub, HuggingFace |
| OpenO1-SFT | Math, Code | Synthesize | - | 77K | Text | GitHub, HuggingFace |
| Medical-o1 | Medical | Distillation | DeepSeek R1 | 25K | Text | GitHub, HuggingFace |
| O1 Journey–Part 3 | Medical | Distillation | o1-preview | 0.5K | Text | GitHub, HuggingFace |
| SCP-116K | Math, Science | Distillation | DeepSeek R1 | 116K | Text | GitHub, HuggingFace |
| open-r1-multimodal | Math | Distillation | GPT-4o | 8K | Vision + Text | GitHub, HuggingFace |
| Vision-R1-cold | Science, Math, General | Distillation | DeepSeek R1 | 200K | Vision + Text | GitHub, HuggingFace |
| MMMU-Reasoning-Distill-Validation | Science, Math, General | Distillation | DeepSeek R1 | 0.8K | Vision + Text | ModelScope |
| Clevr-CoGenT | Vision Counting | Distillation | DeepSeek R1 | 37.8K | Vision + Text | GitHub, HuggingFace |
| VL-Thinking | Science, Math, General | Distillation | DeepSeek R1 | 158K | Vision + Text | GitHub, HuggingFace |
| Video-R1 | Video | Distillation | Qwen2.5-VL-72B | 158K | Vision + Text | GitHub, HuggingFace |
| Embodied-Reasoner | Embodied AI | Synthesize | GPT-4o | 9K | Vision + Text | GitHub, HuggingFace |
| OpenCodeReasoning | Code | Distillation | DeepSeek R1 | 736K | Text | HuggingFace |
| SafeChain | Safety | Distillation | DeepSeek R1 | 40K | Text | GitHub, HuggingFace |
| KodCode | Code | Distillation | DeepSeek R1 | 2.8K | Text | GitHub, HuggingFace |

6.2 Analysis

  • Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping [Paper]
  • Stream of Search (SoS): Learning to Search in Language [Paper]
  • Physics of Language Models: Part 2.2, How to Learn From Mistakes on Grade-School Math Problems [Paper]
  • Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-thinking Reasoning Systems [Paper]
  • s1: Simple test-time scaling [Paper]
  • RedStar: Does Scaling Long-CoT Data Unlock Better Slow-Reasoning Systems? [Paper]
  • SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training [Paper]
  • LIMO: Less is More for Reasoning [Paper]
  • Small Models Struggle to Learn from Strong Reasoners [Paper]
  • LLMs Can Easily Learn to Reason from Demonstrations: Structure, not content, is what matters! [Paper]

7. Self-improvement with Test-Time Scaling

7.1 Parallel Sampling

  • STaR: Bootstrapping Reasoning With Reasoning [Paper]
  • RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment [Paper]
  • Language Models Can Teach Themselves to Program Better [Paper]
  • Scaling relationship on learning mathematical reasoning with large language models [Paper]
  • Reinforced self-training (rest) for language modeling [Paper]
  • Large Language Models Can Self-Improve [Paper]
  • Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models [Paper]
  • Self-Rewarding Language Models [Paper]
  • V-STaR: Training Verifiers for Self-Taught Reasoners [Paper]
  • Iterative Reasoning Preference Optimization [Paper]
  • Progress or Regress? Self-Improvement Reversal in Post-training [Paper]
  • Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement [Paper]
  • Process-based Self-Rewarding Language Models [Paper]

7.2 Tree Search

  • Alphazero-like Tree-Search Can Guide Large Language Model Decoding and Training [Paper]
  • AlphaMath Almost Zero: Process Supervision without Process [Paper]
  • Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [Paper]
  • ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search [Paper]
  • Step-Level Value Preference Optimization for Mathematical Reasoning [Paper]
  • Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs [Paper]
  • rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking [Paper]

7.3 Multi-turn Correction

  • Multi-Turn Code Generation Through Single-Step Rewards [Paper]
  • S²R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning [Paper]
  • Self-Rewarding Correction for Mathematical Reasoning [Paper]

7.4 Long CoT

  • Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [Paper]

8. Ensemble of Test-Time Scaling Method

  • Accessing GPT-4 Level Mathematical Olympiad Solutions via Monte Carlo Tree Self-Refine with LLaMa-3 8B [Paper]
  • Scaling test-time compute with open models [Blog]
  • Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters [Paper]
  • RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation [Paper]
  • TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling [Paper]
  • LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning [Paper]
  • MC-NEST -- Enhancing Mathematical Reasoning in Large Language Models with a Monte Carlo Nash Equilibrium Self-Refine Tree [Paper]
  • SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models [Paper]
  • Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning [Paper]
  • RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques [Paper]
  • Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling [Paper]
  • Scaling Test-Time Compute Without Verification or RL is Suboptimal [Paper]
  • S*: Test Time Scaling for Code Generation [Paper]
  • Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [Paper]

9. Inference Time Scaling Laws

  • The Impact of Reasoning Step Length on Large Language Models [Paper]
  • More Agents Is All You Need [Paper]
  • Are More LLM Calls All You Need? Towards Scaling Laws of Compound Inference Systems [Paper]
  • Large Language Monkeys: Scaling Inference Compute with Repeated Sampling [Paper]
  • Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models [Paper]
  • Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling [Paper]
  • Inference Scaling fLaws: The Limits of LLM Resampling with Imperfect Verifiers [Paper]
  • s1: Simple test-time scaling [Paper]
  • Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling [Paper]
  • Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities? [Paper]

10. Improving Scaling Efficiency

10.1 Parallel Sampling

  • Let's Sample Step by Step: Adaptive-Consistency for Efficient Reasoning and Coding with LLMs [Paper]
  • Universal Self-Consistency for Large Language Model Generation [Paper]
  • Escape Sky-High Cost: Early-Stopping Self-Consistency for Multi-Step Reasoning [Paper]
  • Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision [Paper]
  • Dynamic Self-Consistency: Leveraging Reasoning Paths for Efficient LLM Sampling [Paper]
  • Make Every Penny Count: Difficulty-Adaptive Self-Consistency for Cost-Efficient Reasoning [Paper]
  • Fast Best-of-N Decoding via Speculative Rejection [Paper]
  • Inference-Aware Fine-Tuning for Best-of-N Sampling in Large Language Models [Paper]
  • Multi-Agent Verification: Scaling Test-Time Compute with Multiple Verifiers [Paper]
  • Scalable Best-of-N Selection for Large Language Models via Self-Certainty [Paper]
  • Sampling-Efficient Test-Time Scaling: Self-Estimating the Best-of-N Sampling in Early Decoding [Paper]
  • When To Solve, When To Verify: Compute-Optimal Problem Solving and Generative Verification for LLM Reasoning [Paper]

10.2 Tree Search

  • LiteSearch: Efficacious Tree Search for LLM [Paper]
  • ETS: Efficient Tree Search for Inference-Time Scaling [Paper]
  • Dynamic Parallel Tree Search for Efficient LLM Reasoning [Paper]
  • Don't Get Lost in the Trees: Streamlining LLM Reasoning by Overcoming Tree Search Exploration Pitfalls [Paper]

10.3 Multi-turn Correction

  • Can Large Language Models Be an Alternative to Human Evaluations? [Paper]
  • G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment [Paper]
  • A Stitch in Time Saves Nine: Detecting and Mitigating Hallucinations of LLMs by Validating Low-Confidence Generation [Paper]
  • Confidence Matters: Revisiting Intrinsic Self-Correction Capabilities of Large Language Models [Paper]
  • Branch-Solve-Merge Improves Large Language Model Evaluation and Generation [Paper]
  • Recursive Introspection: Teaching Language Model Agents How to Self-Improve [Paper]
  • Training Language Models to Self-Correct via Reinforcement Learning [Paper]
  • Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision [Paper]

10.4 Long CoT

  • Implicit Chain of Thought Reasoning via Knowledge Distillation [Paper]
  • Anchor-based Large Language Models [Paper]
  • From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step [Paper]
  • Break the Chain: Large Language Models Can be Shortcut Reasoners [Paper]
  • Distilling System 2 into System 1 [Paper]
  • System-1.x: Learning to Balance Fast and Slow Planning with Language Models [Paper]
  • Concise Thoughts: Impact of Output Length on LLM Reasoning and Cost [Paper]
  • To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning [Paper]
  • Dualformer: Controllable Fast and Slow Thinking by Learning with Randomized Reasoning Traces [Paper]
  • Markov Chain of Thought for Efficient Mathematical Reasoning [Paper]
  • Can Language Models Learn to Skip Steps? [Paper]
  • Training Large Language Models to Reason in a Continuous Latent Space [Paper]
  • Compressed Chain of Thought: Efficient Reasoning Through Dense Representations [Paper]
  • C3oT: Generating Shorter Chain-of-Thought without Compromising Effectiveness [Paper]
  • Token-Budget-Aware LLM Reasoning [Paper]
  • Efficiently Serving LLM Reasoning Programs with Certaindex [Paper]
  • Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs [Paper]
  • Generating with Confidence: Uncertainty Quantification for Black-box Large Language Models [Paper]
  • O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning [Paper]
  • Think Smarter not Harder: Adaptive Reasoning with Inference Aware Optimization [Paper]
  • Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs [Paper]
  • Training Language Models to Reason Efficiently [Paper]
  • Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach [Paper]
  • CoT-Valve: Length-Compressible Chain-of-Thought Tuning [Paper]
  • TokenSkip: Controllable Chain-of-Thought Compression in LLMs [Paper]
  • SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs [Paper]
  • Stepwise Perplexity-Guided Refinement for Efficient Chain-of-Thought Reasoning in Large Language Models [Paper]
  • The Relationship Between Reasoning and Performance in Large Language Models -- o3 (mini) Thinks Harder, Not Longer [Paper]
  • LightThinker: Thinking Step-by-Step Compression [Paper]
  • Chain of Draft: Thinking Faster by Writing Less [Paper]
  • Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning [Paper]
  • Self-Training Elicits Concise Reasoning in Large Language Models [Paper]
  • CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation [Paper]
  • Efficient Test-Time Scaling via Self-Calibration [Paper]
  • How Well do LLMs Compress Their Own Chain-of-Thought? A Token Complexity Approach [Paper]
  • DAST: Difficulty-Adaptive Slow-Thinking for Large Reasoning Models [Paper]
  • L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning [Paper]
  • Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching [Paper]
  • InftyThink: Breaking the Length Limits of Long-Context Reasoning in Large Language Models [Paper]
  • Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning [Paper]
  • Unlocking Efficient Long-to-Short LLM Reasoning with Model Merging [Paper]
  • Think Less, Achieve More: Cut Reasoning Costs by 50% Without Sacrificing Accuracy [Blog]

11. Latent Thoughts

  • Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking [Paper]
  • Lean-STaR: Learning to Interleave Thinking and Proving [Paper]
  • RATIONALYST: Pre-training Process-Supervision for Improving Reasoning [Paper]
  • Reasoning to Learn from Latent Thoughts [Paper]