Self-aligned reward (SAR) is a general self-guided signal that complements verifiable rewards to improve both reasoning accuracy and efficiency in RL. By exploiting the model's own perplexity signals, SAR encourages compact, efficient reasoning paths while preserving strong reasoning capacity.
Specifically, SAR compares the perplexity of a model rollout with and without the question as context. Answers that are closely tailored to the question therefore receive a higher SAR score. Please refer to the paper for additional details.
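The comparison above can be sketched in a few lines. This is an illustrative toy version, not the paper's exact formula: it scores an answer by how much conditioning on the question lowers the answer's perplexity, using made-up per-token log-probabilities in place of real model outputs.

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token log-probabilities: exp(-mean log p)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def sar_score(logprobs_with_q, logprobs_without_q):
    """Toy self-aligned score: relative perplexity reduction from
    conditioning on the question. Higher when the answer is much more
    predictable given the question (illustrative formula only)."""
    ppl_with = perplexity(logprobs_with_q)
    ppl_without = perplexity(logprobs_without_q)
    return (ppl_without - ppl_with) / ppl_without

# A question-tailored answer: seeing the question helps a lot.
tailored = sar_score([-0.2, -0.3, -0.1], [-2.0, -1.5, -1.8])
# A generic answer: the question barely changes its perplexity.
generic = sar_score([-1.0, -1.1, -0.9], [-1.1, -1.0, -1.0])
assert tailored > generic
```

In practice the log-probabilities would come from the policy model itself, which is what makes the signal self-guided.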
Our repository is based on verl 0.3.1.dev0.
Run each step in preconfig.sh to prepare the environment. Note that a working conda installation is required.
preconfig.sh also contains the commands to prepare math datasets used in our paper.
examples/data_preprocess contains scripts for preparing datasets; you can adapt this code to prepare your own datasets.
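For a custom dataset, the core of such a script is mapping each raw (question, answer) pair into a flat record. A minimal sketch is below; the field names follow common verl conventions, but check them against the provided scripts in examples/data_preprocess before relying on this exact layout.

```python
def make_record(question, answer, data_source="custom_math", split="train", idx=0):
    """Map a raw question/answer pair into a flat record dict.

    Field names here (prompt, reward_model, extra_info, ...) mirror
    typical verl preprocessing output but are an assumption; verify
    against the scripts in examples/data_preprocess.
    """
    return {
        "data_source": data_source,
        # Chat-style prompt: a list of role/content messages.
        "prompt": [{"role": "user", "content": question}],
        "ability": "math",
        # Ground truth consumed by the rule-based (verifiable) reward.
        "reward_model": {"style": "rule", "ground_truth": answer},
        "extra_info": {"split": split, "index": idx},
    }

record = make_record("What is 2 + 3?", "5")
```

The resulting records are then typically written out in the file format the training scripts expect.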
We provide two example scripts, scripts/ppo.sh and scripts/grpo.sh. The key hyperparameters are reward_types and reward_factors; the self-aligned reward is denoted "ppl_qa" in the codebase.
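Conceptually, reward_types and reward_factors pair up element-wise to form a weighted sum of reward components. The helper below is a hypothetical sketch of that combination, not the actual code path in verl's reward manager:

```python
def combine_rewards(component_rewards, reward_types, reward_factors):
    """Weighted sum of reward components for one sample.

    component_rewards: dict mapping reward name -> scalar value.
    reward_types / reward_factors: parallel lists, as in the scripts.
    (Hypothetical helper illustrating the pairing; the real combination
    lives inside the training framework.)
    """
    assert len(reward_types) == len(reward_factors)
    return sum(factor * component_rewards[name]
               for name, factor in zip(reward_types, reward_factors))

# e.g. a verifiable correctness reward plus the self-aligned "ppl_qa" term
sample = {"acc": 1.0, "ppl_qa": 0.4}
total = combine_rewards(sample, ["acc", "ppl_qa"], [1.0, 0.5])  # 1.0 + 0.2 = 1.2
```

Scaling the "ppl_qa" factor trades off how strongly the self-aligned term shapes training relative to the verifiable reward.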
See verl/trainer/config/ppo_trainer.yaml for details on all hyperparameters.
Self-aligned reward can be seamlessly adapted to different RL algorithms.
scripts/batched_validate.sh and scripts/auto_validate.sh run inference.
As the figure below shows, self-aligned reward yields notable gains in both accuracy and efficiency.
If you find this repo or the paper useful, please cite:
@article{han2025self,
  title={Self-Aligned Reward: Towards Effective and Efficient Reasoners},
  author={Han, Peixuan and Krishnan, Adit and Friedland, Gerald and You, Jiaxuan and Kong, Chris},
  journal={arXiv preprint arXiv:2509.05489},
  year={2025}
}
Reach out to Peixuan Han for any questions.

