Implementation of AnnoDPO: Protein Functional Annotation Learning with Direct Preference Optimization by Zixuan Jiang and Renjing Xu. Please feel free to reach out to us at zjiang597@connect.hkust-gz.edu.cn with any questions.
Deciphering protein function remains a fundamental challenge in protein representation learning. The task presents significant difficulties for protein language models due to the sheer volume of functional annotation categories and the highly imbalanced distribution of annotated instances across biological ontologies. Inspired by the remarkable success of reinforcement learning from human feedback (RLHF) in large language model alignment, we propose AnnoDPO, a novel multi-modal framework for protein function prediction that leverages Direct Preference Optimization (DPO) to enhance annotation learning.
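For readers less familiar with DPO, the sketch below illustrates the general form of the DPO objective applied to preferred ("chosen") versus dispreferred ("rejected") annotations. It is a minimal, self-contained illustration with hypothetical function and argument names, not the exact loss implemented in this repository.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Generic DPO objective: push the policy to prefer 'chosen' annotations
    over 'rejected' ones relative to a frozen reference model.
    All inputs are per-example log-probabilities (1-D tensors)."""
    policy_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps
    # -log sigmoid(beta * (policy margin - reference margin)), averaged over the batch
    return -F.logsigmoid(beta * (policy_logratio - ref_logratio)).mean()
```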
To set up the environment and run our code, use the commands below in a terminal.
First, clone this repo:

```bash
git clone https://github.com/AzusaXuan/AnnoDPO.git
```

Then enter the repository:

```bash
cd AnnoDPO
```

Use the following commands to set up the environment:

```bash
conda create -n annodpo_env python=3.9
conda activate annodpo_env
pip3 install -r requirements.txt  # for reference
```
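After installation, a quick sanity check (assuming requirements.txt pulls in PyTorch, which torchrun requires) can confirm that your GPUs are visible before launching any jobs:

```python
# Minimal sanity check that PyTorch and CUDA are set up before using torchrun.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("Visible GPUs:", torch.cuda.device_count())
```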
We provide checkpoints for the following model versions:

| Model version | Trainable parameters | Checkpoints |
|---|---|---|
| SFT | 52M | ckpt, lora |
| DPO | 390M | ckpt, lora |
The general command for training and evaluation is:

```bash
torchrun --nproc_per_node=<number_of_gpus> main.py --mode <task_type: e.g. finetune, caption_rlhf, eval> --checkpoint <path/to/ckpt> --actual_epoch <epoch_num> --version <model_version: e.g. proclean_itc_alm>
```

For evaluation:

```bash
torchrun --nproc_per_node=<number_of_gpus> main.py --mode eval
```

For GO classification evaluation:

```bash
torchrun --nproc_per_node=<number_of_gpus> main.py --mode eval_go_classification
```

For evaluation stratified by annotation frequency, append `high`, `medium`, or `low` to `eval_freq_`, and add the optional `_rlhf` suffix when evaluating the DPO model (e.g. `--mode eval_freq_high` or `--mode eval_freq_low_rlhf`):

```bash
torchrun --nproc_per_node=<number_of_gpus> main.py --mode eval_freq_<high/medium/low>[_rlhf]
```

For supervised fine-tuning (SFT):

```bash
torchrun --nproc_per_node=<number_of_gpus> main.py --mode finetune --actual_epoch <round_num>
```

For DPO training:

```bash
torchrun --nproc_per_node=<number_of_gpus> main.py --mode caption_rlhf --actual_epoch <round_num>
```
We recommend running the DPO process with `run_dpo.sh`.
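We have not reproduced the contents of `run_dpo.sh` here, but multi-round DPO training of this kind is typically driven by a loop that launches one `caption_rlhf` round per epoch. The hypothetical Python driver below sketches that shape; `NUM_GPUS` and `NUM_ROUNDS` are placeholder values, not settings taken from the repository.

```python
# Hypothetical driver sketch: launch one caption_rlhf round per epoch,
# analogous to what a wrapper script such as run_dpo.sh might do.
import subprocess

NUM_GPUS = 4    # placeholder, not a value from the repo
NUM_ROUNDS = 3  # placeholder, not a value from the repo

for round_idx in range(NUM_ROUNDS):
    subprocess.run(
        [
            "torchrun", f"--nproc_per_node={NUM_GPUS}", "main.py",
            "--mode", "caption_rlhf",
            "--actual_epoch", str(round_idx),
        ],
        check=True,  # stop if a round fails
    )
```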
All the datasets are stored here.
The code and model weights are released under the MIT license. See LICENSE for details.
If you find our work useful, please cite:

```bibtex
@article{jiang2025annodpo,
  title={AnnoDPO: Protein Functional Annotation Learning with Direct Preference Optimization},
  author={Jiang, Zixuan and Xu, Renjing},
  journal={arXiv preprint arXiv:2506.07035},
  year={2025}
}
```
