This folder contains the two-stage post-training pipeline used to build Video-R2:
- Stage 1: Timestamp-aware SFT starting from Qwen2.5-VL-Instruct
- Stage 2: GRPO guided by the Temporal Alignment Reward (TAR)
The main entrypoints are:
- `src/train/train_sft.py`
- `src/train/train_grpo.py`
Sample multi-GPU launch scripts are provided in scripts/:
- `scripts/train_sft.sh`
- `scripts/train_grpo.sh`
```shell
conda create -n video-r2 python=3.12 -y
conda activate video-r2
pip install -U pip

# We use torch v2.7.0, torchvision v0.22.0 and transformers v4.51.1 in the development of Video-R2.
# Please see requirements.txt and environment.yml for all requirements.
pip install -r requirements.txt

# Further, we recommend installing flash-attn v2.7.4.post1 or v2.8.3 for training.
```

We release the datasets used in Video-R2 development on Hugging Face:
Download:

```shell
hf download MBZUAI/Video-R2-Dataset --repo-type dataset
```

Arrange (or symlink) into a local layout that matches the paths you will pass to the scripts:
```
data/
  video-r2-sft-dataset.json
  video-r2-grpo-dataset.json
  videos/
    <video files referenced in the JSON; extract all the .tar files>
  subtitles/            # optional
    <subtitle files matched by video stem; extract subtitles.tar>
```
Important paths used by the scripts:
- `DATA_PATH`: one of the JSON files above
- `DATA_FOLDER`: folder containing videos (passed to the code as `--image_folder`)
- `SUBTITLES_FOLDER`: optional folder containing `.srt` files
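Before launching training, it can help to verify that every video referenced in the JSON actually exists under `DATA_FOLDER`. The sketch below assumes each JSON entry carries a `"video"` field holding a path relative to the videos folder — adjust the key if the released annotations use a different name:

```python
import json
from pathlib import Path


def find_missing_videos(data_path, data_folder):
    """Return video paths referenced in the JSON that are absent on disk.

    Assumes each entry has a "video" field relative to DATA_FOLDER;
    this key name is an assumption, not taken from the released schema.
    """
    with open(data_path) as f:
        entries = json.load(f)
    folder = Path(data_folder)
    missing = []
    for entry in entries:
        rel = entry.get("video")
        if rel is not None and not (folder / rel).exists():
            missing.append(rel)
    return missing
```

Running this once against `DATA_PATH` and `DATA_FOLDER` fails fast on a misarranged layout instead of mid-epoch.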
We start from the instruction-tuned Qwen2.5-VL model:
`Qwen/Qwen2.5-VL-7B-Instruct`
- Edit `scripts/train_sft.sh`:
  - `MODEL_NAME="Qwen/Qwen2.5-VL-7B-Instruct"`
  - `DATA_PATH` (SFT JSON)
  - `DATA_FOLDER` (videos folder)
  - `SUBTITLES_FOLDER` (optional)
  - `OUTPUT_DIR`
- Run:

  ```shell
  cd train
  bash scripts/train_sft.sh
  ```

Notes:
- The script uses LoRA by default.
- Video sampling is controlled by
--fpsandFPS_MAX_FRAMES. - The code supports overlaying timestamps and subtitles on frames via env vars.
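To illustrate how a target sampling rate and a frame cap typically interact, here is a minimal sketch of uniform frame selection. The parameter names mirror the script's `--fps` / `FPS_MAX_FRAMES` knobs, but this is not the repository's actual sampling code:

```python
def sample_frame_indices(total_frames, native_fps, target_fps, max_frames):
    """Pick frame indices at roughly `target_fps`, capped at `max_frames`.

    Illustrative sketch only -- the repository's sampling logic may differ.
    """
    duration = total_frames / native_fps
    # number of frames implied by the target fps, clamped to [1, max_frames]
    n = min(max_frames, max(1, int(duration * target_fps)))
    step = total_frames / n
    # take the centre frame of each of the n equal spans
    return [min(total_frames - 1, int(step * i + step / 2)) for i in range(n)]
```

For a 10-second clip at 30 fps, `--fps 2` would imply 20 frames, so a cap of 16 becomes the binding constraint.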
Merge the LoRA checkpoints after SFT:

```shell
python src/merge_lora_weights.py \
    --base_model <base_model_id_or_path> \
    --lora_model <lora_checkpoint_path> \
    --output_dir <merged_output_dir>
```

We perform GRPO starting from the SFT checkpoint. GRPO is configured with a mixture of reward functions, including the Temporal Alignment Reward (TAR).
- Edit `scripts/train_grpo.sh`:
  - `MODEL_NAME` should point to the merged SFT checkpoint
  - `DATA_PATH` (GRPO JSON)
  - `DATA_FOLDER` (videos folder)
  - `SUBTITLES_FOLDER` (optional)
  - `OUTPUT_DIR`
- Serve a judge / parsing model:

  TAR uses an LLM judge served behind an OpenAI-compatible API. An example vLLM launcher is provided:

  ```shell
  cd train
  bash serve_llm/serve_qwen3.sh
  ```

  You can test the server:

  ```shell
  python serve_llm/test_vllm_client.py
  ```

- Run GRPO:
  ```shell
  cd train
  bash scripts/train_grpo.sh
  ```

Key knobs:

- `--num_generations`: rollouts per prompt
- `--beta`: KL coefficient
- `--max_prompt_length`, `--max_completion_length`: token budgets
- TAR parameters: `--buffer_seconds`, `--similarity_threshold`
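The exact form of TAR is not spelled out here, so the following is a purely hypothetical sketch of how `--buffer_seconds` and `--similarity_threshold` could combine: a predicted (timestamp, text) pair counts as matched when it lands within the time buffer of some reference pair and the texts are similar enough, and the reward is the matched fraction. The actual TAR uses an LLM judge rather than the string similarity shown here:

```python
from difflib import SequenceMatcher


def temporal_alignment_reward(preds, refs, buffer_seconds=5.0,
                              similarity_threshold=0.5):
    """Fraction of predicted (seconds, text) pairs matched to a reference.

    Hypothetical sketch: the real TAR judges text with a served LLM, not
    difflib. Parameter names mirror the script's TAR flags.
    """
    if not preds:
        return 0.0
    matched = 0
    for t_pred, s_pred in preds:
        for t_ref, s_ref in refs:
            close = abs(t_pred - t_ref) <= buffer_seconds
            sim = SequenceMatcher(None, s_pred.lower(), s_ref.lower()).ratio()
            if close and sim >= similarity_threshold:
                matched += 1
                break  # each prediction matches at most one reference
    return matched / len(preds)
```

Widening `buffer_seconds` or lowering `similarity_threshold` makes the reward more permissive, which is the trade-off these two flags control.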
If you find Video-R2 helpful, please cite:

```bibtex
@article{maaz2025video-r2,
  title={Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models},
  author={Maaz, Muhammad and Rasheed, Hanoona and Khan, Fahad Shahbaz and Khan, Salman},
  journal={arXiv preprint arXiv:2511.23478},
  year={2025}
}
```

