

Self-Aligned Reward: Towards Effective and Efficient Reasoners

Peixuan Han, Adit Krishnan, Gerald Friedland, Jiaxuan You, Chris Kong

About SAR

Self-aligned reward (SAR) is a generic, universally applicable self-guided signal that complements verifiable rewards to improve both reasoning accuracy and efficiency in RL. By leveraging the model's own perplexity signals, SAR encourages more compact, efficient reasoning paths while maintaining strong reasoning capacity.

Specifically, SAR compares the perplexity of a model rollout with and without the question as context. As a result, answers that are closely tailored to the question receive a higher SAR score. Please refer to the paper for additional details.
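The comparison above can be sketched in a few lines. This is a minimal illustration, not the repository's implementation: the `perplexity` helper and the normalized-difference combination of the two perplexities are assumptions made for clarity; see the paper for the actual formulation.

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token log-probabilities: exp(-mean log p)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def self_aligned_reward(logprobs_with_q, logprobs_without_q):
    """Hypothetical SAR sketch: the reward is higher when conditioning on
    the question lowers the rollout's perplexity, i.e. when the answer is
    tailored to the question. The exact normalization in the paper may differ."""
    ppl_with = perplexity(logprobs_with_q)        # PPL(rollout | question)
    ppl_without = perplexity(logprobs_without_q)  # PPL(rollout alone)
    return (ppl_without - ppl_with) / ppl_without
```

In practice, both sets of log-probabilities would come from scoring the same rollout tokens twice, once with the question in the prompt and once without.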

Repo Usage

Our repository is based on verl 0.3.1.dev0.

Setup

Run each step specified in preconfig.sh to prepare the environment. Note that conda must be installed and available.

Data Processing

preconfig.sh also contains the commands to prepare math datasets used in our paper.

examples/data_preprocess contains scripts for preparing datasets. You can adapt these scripts to prepare your own datasets.

Training

We provide two example scripts in scripts/ppo.sh and scripts/grpo.sh. Key hyperparameters are reward_types and reward_factors. We denote self-aligned reward as "ppl_qa" in the codebase.
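A hypothetical invocation sketch combining a verifiable reward with SAR is shown below. Only `reward_types`, `reward_factors`, and the `"ppl_qa"` name come from this README; the `"acc"` reward name, the weighting values, and the override syntax are assumptions, so check the example scripts for the real flags.

```shell
# Hypothetical sketch: launch GRPO training with a verifiable reward plus SAR.
# "ppl_qa" denotes the self-aligned reward in the codebase; the "acc" reward
# name and the factor values here are illustrative assumptions.
bash scripts/grpo.sh \
    reward_types="['acc','ppl_qa']" \
    reward_factors="[1.0,0.5]"
```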

See verl/trainer/config/ppo_trainer.yaml for details on all hyperparameters.

Self-aligned reward can be seamlessly adapted to different RL algorithms.

Evaluation

Use scripts/batched_validate.sh and scripts/auto_validate.sh for inference.

As the figure below shows, self-aligned reward achieves notable gains in both accuracy and efficiency.

Cite this paper

If you find this repo or the paper useful, please cite:

@article{han2025self,
  title={Self-Aligned Reward: Towards Effective and Efficient Reasoners},
  author={Han, Peixuan and Krishnan, Adit and Friedland, Gerald and You, Jiaxuan and Kong, Chris},
  journal={arXiv preprint arXiv:2509.05489},
  year={2025}
}

Reach out to Peixuan Han for any questions.
