Self-aligned reward (SAR) is a general self-guided signal that complements verifiable rewards to improve both reasoning accuracy and efficiency in RL. By exploiting the model's own perplexity signals, SAR encourages compact, efficient reasoning paths while preserving strong reasoning capacity.
Specifically, SAR compares the perplexity of a model rollout with and without the question as context. Answers that are closely tailored to the question therefore receive a higher SAR score. Please refer to the paper for additional details.
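The comparison above can be sketched in a few lines. This is an illustrative toy version, not the paper's exact formula: it scores an answer by how much conditioning on the question lowers the answer's perplexity, using made-up per-token log-probabilities in place of real model outputs.

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token log-probabilities: exp(-mean log p)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def sar_score(logprobs_with_q, logprobs_without_q):
    """Toy self-aligned score: relative perplexity reduction from
    conditioning on the question. Higher when the answer is much more
    predictable given the question (illustrative formula only)."""
    ppl_with = perplexity(logprobs_with_q)
    ppl_without = perplexity(logprobs_without_q)
    return (ppl_without - ppl_with) / ppl_without

# A question-tailored answer: seeing the question helps a lot.
tailored = sar_score([-0.2, -0.3, -0.1], [-2.0, -1.5, -1.8])
# A generic answer: the question barely changes its perplexity.
generic = sar_score([-1.0, -1.1, -0.9], [-1.1, -1.0, -1.0])
assert tailored > generic
```

In practice the log-probabilities would come from the policy model itself, which is what makes the signal self-guided.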
Our repository is based on verl 0.3.1.dev0.
Run each step in preconfig.sh to prepare the environment. Note that a working conda installation is required.
preconfig.sh also contains the commands to prepare math datasets used in our paper.
examples/data_preprocess contains scripts for preparing datasets; you can adapt this code to prepare your own datasets.
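For a custom dataset, the core of such a script is mapping each raw (question, answer) pair into a flat record. A minimal sketch is below; the field names follow common verl conventions, but check them against the provided scripts in examples/data_preprocess before relying on this exact layout.

```python
def make_record(question, answer, data_source="custom_math", split="train", idx=0):
    """Map a raw question/answer pair into a flat record dict.

    Field names here (prompt, reward_model, extra_info, ...) mirror
    typical verl preprocessing output but are an assumption; verify
    against the scripts in examples/data_preprocess.
    """
    return {
        "data_source": data_source,
        # Chat-style prompt: a list of role/content messages.
        "prompt": [{"role": "user", "content": question}],
        "ability": "math",
        # Ground truth consumed by the rule-based (verifiable) reward.
        "reward_model": {"style": "rule", "ground_truth": answer},
        "extra_info": {"split": split, "index": idx},
    }

record = make_record("What is 2 + 3?", "5")
```

The resulting records are then typically written out in the file format the training scripts expect.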
We provide two example scripts, scripts/ppo.sh and scripts/grpo.sh. The key hyperparameters are reward_types and reward_factors; the self-aligned reward is denoted "ppl_qa" in the codebase.
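Conceptually, reward_types and reward_factors pair up element-wise to form a weighted sum of reward components. The helper below is a hypothetical sketch of that combination, not the actual code path in verl's reward manager:

```python
def combine_rewards(component_rewards, reward_types, reward_factors):
    """Weighted sum of reward components for one sample.

    component_rewards: dict mapping reward name -> scalar value.
    reward_types / reward_factors: parallel lists, as in the scripts.
    (Hypothetical helper illustrating the pairing; the real combination
    lives inside the training framework.)
    """
    assert len(reward_types) == len(reward_factors)
    return sum(factor * component_rewards[name]
               for name, factor in zip(reward_types, reward_factors))

# e.g. a verifiable correctness reward plus the self-aligned "ppl_qa" term
sample = {"acc": 1.0, "ppl_qa": 0.4}
total = combine_rewards(sample, ["acc", "ppl_qa"], [1.0, 0.5])  # 1.0 + 0.2 = 1.2
```

Scaling the "ppl_qa" factor trades off how strongly the self-aligned term shapes training relative to the verifiable reward.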
See verl/trainer/config/ppo_trainer.yaml for details on all hyperparameters.
Self-aligned reward can be seamlessly adapted to different RL algorithms.
scripts/batched_validate.sh and scripts/auto_validate.sh run inference.
As the figure below shows, self-aligned reward yields notable gains in both accuracy and efficiency.
If you find this repo or the paper useful, please cite:
@article{han2025self,
  title={Self-Aligned Reward: Towards Effective and Efficient Reasoners},
  author={Han, Peixuan and Krishnan, Adit and Friedland, Gerald and You, Jiaxuan and Kong, Chris},
  journal={arXiv preprint arXiv:2509.05489},
  year={2025}
}
Reach out to Peixuan Han for any questions.

