Progress-Aware Video Frame Captioning

Zihui Xue, Joungbin An, Xitong Yang, Kristen Grauman
CVPR, 2025
project page | arxiv | bibtex

This codebase builds upon the excellent LLaVA-NeXT repository. We deeply appreciate their efforts in open-sourcing it!

Overview

We propose progress-aware video frame captioning, which aims to generate a sequence of captions that capture the temporal dynamics within a video.

Getting Started

Environment Setup

conda create -n progcap python=3.10 -y
conda activate progcap
cd LLaVA-NeXT
pip install -e ".[train]"
pip install flash-attn==2.6.3
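
To sanity-check the installation (a quick verification, assuming a CUDA-capable GPU, which flash-attn requires):

python -c "import torch, flash_attn; print(torch.__version__, flash_attn.__version__)"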

Data and Model Checkpoint

  1. We provide pre-processed frame sequences for the benchmarks used in our paper:
  • HowToChange: data/data_files/input/htc_valhs_seq.json
  • COIN: data/data_files/input/coin_valhs_seq.json
  • These frames are extracted at 1 fps, filtered, and manually verified for assessing fine-grained action progress. Download the frames here and place them under the data/ folder. Ensure the paths match those expected by the JSON files, or update the JSONs accordingly.
  2. To run inference on your own video data, run python prepare_data.py --video_file VIDEO_FILE (adjust arguments as needed) to extract frames from the video and create a JSON file like data/data_files/input/one_example.json; a rough sketch of this step follows the list below.
  3. The ProgressCaptioner model checkpoint is available here.
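
For reference, here is a minimal sketch of the frame-extraction step, assuming ffmpeg is installed. The output record written at the end is purely illustrative; mirror the actual schema of data/data_files/input/one_example.json rather than this one.

import json
import subprocess
from pathlib import Path

def extract_frames_1fps(video_file, out_dir="data/frames"):
    """Extract frames at 1 fps with ffmpeg (a stand-in for prepare_data.py)."""
    frame_dir = Path(out_dir) / Path(video_file).stem
    frame_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", video_file, "-vf", "fps=1", str(frame_dir / "%05d.jpg")],
        check=True,
    )
    return sorted(str(p) for p in frame_dir.glob("*.jpg"))

frames = extract_frames_1fps("my_video.mp4")
# Illustrative record only -- match the real schema in one_example.json.
with open("data/data_files/input/my_example.json", "w") as f:
    json.dump([{"video": "my_video", "frames": frames}], f, indent=2)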

Inference

Run the inference script using the prepared data file and the downloaded model checkpoint:

python infer.py --data_file DATA_FILE --model_path ProgressCaptionerCheckpoint
  • Input: DATA_FILE should be in the data/data_files/input/ directory; examples: data/data_files/input/one_example.json, data/data_files/input/htc_valhs_seq.json.

  • Output: captions are printed, and a response file is saved in the data/data_files/output/ directory.
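
To quickly inspect the most recent response file (a schema-agnostic preview, assuming infer.py writes JSON; the exact format is defined by the script):

import glob
import json
import os

out_dir = "data/data_files/output"
# Pick the newest response file produced by infer.py.
latest = max(glob.glob(os.path.join(out_dir, "*.json")), key=os.path.getmtime)
with open(latest) as f:
    responses = json.load(f)
print(json.dumps(responses, indent=2)[:2000])  # preview without assuming a schema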

We provide post_process.py to visualize the generated captions alongside the frames in an easy-to-browse HTML format.

  • The generated HTML files will be saved in the data/viz_html/ directory.

  • Start a simple web server from the repository's root directory: python -m http.server 8000 (or any available port), then navigate to http://localhost:8000/data/viz_html/ in your browser.

Training

ProgressCaptioner training data is available here. Videos are sourced from the COIN and HowToChange datasets. For HowToChange, you may retrieve videos directly from YouTube using their unique IDs; the start time and duration can be inferred from the filename (look for st and dur, as in the parsing sketch below).
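
For example, assuming a filename of the form VIDEOID_st<start>_dur<duration>.mp4 (a hypothetical pattern; verify against the released files):

import re

# Hypothetical filename pattern -- check the actual naming in the release.
name = "dQw4w9WgXcQ_st30.0_dur12.0.mp4"
m = re.match(r"(?P<vid>[^_]+)_st(?P<st>[\d.]+)_dur(?P<dur>[\d.]+)\.mp4$", name)
if m:
    video_id = m["vid"]
    start, duration = float(m["st"]), float(m["dur"])
    print(f"https://youtu.be/{video_id}  start={start}s  duration={duration}s")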

Next, to prepare the data, extract video frames at 1 FPS (the extract_frames function in prepare_data.py may be helpful; see also the extraction sketch in the Data and Model Checkpoint section above).

We follow the LLaVA-NeXT training pipeline for SFT and DPO with the LLaVA-OV-7B model. Modify the training scripts (finetune_ov.sh, dpo_ov7b.sh) as needed to launch ProgressCaptioner training.

Keyframe Selection

ProgressCaptioner can also be used to identify informative keyframes within a video sequence.

🚧 Code and Instructions Coming Soon

Citation

If you find our work inspiring or use our codebase in your research, please consider giving a star ⭐ and a citation.

@article{xue2024progress,
  title={Progress-Aware Video Frame Captioning},
  author={Xue, Zihui and An, Joungbin and Yang, Xitong and Grauman, Kristen},
  journal={arXiv preprint arXiv:2412.02071},
  year={2024}
}
