Progress-Aware Video Frame Captioning

Zihui Xue, Joungbin An, Xitong Yang, Kristen Grauman
CVPR, 2025
project page | arxiv | bibtex

This codebase builds upon the excellent LLaVA-NeXT repository. We deeply appreciate their efforts in open-sourcing!

Overview

We propose progress-aware video frame captioning (bottom of the overview figure), which aims to generate a sequence of captions that capture the temporal dynamics within a video.

Getting Started

Environment Setup

conda create -n progcap python=3.10 -y
conda activate progcap
cd LLaVA-NeXT
pip install -e ".[train]"
pip install flash-attn==2.6.3

Data and Model Checkpoint

We provide pre-processed frame sequences for the benchmarks used in our paper:

HowToChange: data/data_files/input/htc_valhs_seq.json
COIN: data/data_files/input/coin_valhs_seq.json

These frames are extracted at 1 FPS, filtered, and manually verified for assessing fine-grained action progress. Download the frames here and place them under the data/ folder. Ensure the frame paths match those expected by the JSON files, or update the JSON files accordingly.
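
Before running inference, it can help to confirm that the frame paths referenced by a JSON file actually resolve on disk. The following is a minimal sketch, not part of the released code. It assumes nothing about the JSON schema: it simply walks the parsed structure and treats any string ending in a common image extension as a frame path. The helper names (`iter_strings`, `find_missing_frames`), the extension set, and the idea that paths are relative to a single root are all assumptions made for illustration.

```python
from pathlib import Path

# Assumed set of frame file extensions; adjust to match the downloaded frames.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def iter_strings(node):
    """Yield every string value found anywhere inside a parsed JSON structure."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)


def find_missing_frames(parsed_json, root="."):
    """Return image-like paths referenced in the JSON that do not exist under root."""
    root = Path(root)
    missing = []
    for text in iter_strings(parsed_json):
        is_image = Path(text).suffix.lower() in IMAGE_SUFFIXES
        # Absolute paths are left untouched by the join; relative ones resolve under root.
        if is_image and not (root / text).exists():
            missing.append(text)
    return missing
```

To use it, load one of the JSON files with `json.load` and pass the result to `find_missing_frames`, with `root` set to the directory the paths are relative to. An empty list means every referenced frame was found.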

To run inference on your own video data:

Run python prepare_data.py --video_file VIDEO_FILE (adjust arguments as needed) to extract frames from the video and create a JSON file like data/data_files/input/one_example.json.

The ProgressCaptioner model checkpoint is available here.

Inference

Run the inference script using the prepared data file and the downloaded model checkpoint:

python infer.py --data_file DATA_FILE --model_path ProgressCaptionerCheckpoint

Input: DATA_FILE should be a file in the data/data_files/input/ directory, for example data/data_files/input/one_example.json or data/data_files/input/htc_valhs_seq.json.
Output: captions are printed, and a response file is saved in the data/data_files/output directory.
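
To caption several data files in one go, the command above can be wrapped in a small driver. This is a sketch under stated assumptions: it uses only the two flags shown above (`--data_file` and `--model_path`), it assumes it is launched from the repository root so that `infer.py` resolves, and the function names `build_infer_command` and `run_inference` are hypothetical helpers, not part of the codebase.

```python
import subprocess
import sys


def build_infer_command(data_file, model_path):
    """Assemble the infer.py invocation using only the two documented flags."""
    return [sys.executable, "infer.py", "--data_file", data_file, "--model_path", model_path]


def run_inference(data_files, model_path, dry_run=True):
    """Run infer.py once per data file; with dry_run=True, only return the commands."""
    commands = [build_infer_command(f, model_path) for f in data_files]
    if not dry_run:
        for cmd in commands:
            # check=True raises CalledProcessError and stops at the first failing run.
            subprocess.run(cmd, check=True)
    return commands
```

Calling `run_inference([...], "ProgressCaptionerCheckpoint")` with the default `dry_run=True` returns the commands without launching anything, which is a cheap way to check the arguments before committing GPU time.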

We provide post_process.py to visualize the generated captions alongside the frames in an easy-to-browse HTML format.

The generated HTML files are saved in the data/viz_html/ directory.
To view them, start a simple web server from the repository's root directory with python -m http.server 8000 (or any available port), then navigate to http://localhost:8000/data/viz_html/ in your browser.

Training

ProgressCaptioner training data is available here. Videos are sourced from the COIN and HowToChange datasets. For HowToChange, you may retrieve videos directly from YouTube using their unique IDs. The start time and duration can be inferred from the filename (look for st and dur).
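
Recovering the start time and duration from a filename might look like the sketch below. The exact filename layout is not specified here, so this assumes a hypothetical pattern in which `st` and `dur` each follow an underscore or hyphen and are immediately followed by a number, as in `VIDEOID_st12.5_dur8.0.mp4`. The function name `parse_clip_times` and the list of video extensions are also assumptions. Check a few real filenames and adjust the patterns before relying on it.

```python
import re
from pathlib import Path

# Hypothetical layout: "..._st<seconds>_dur<seconds>"; adapt to the real filenames.
ST_PATTERN = re.compile(r"(?:^|[_-])st[_-]?(\d+(?:\.\d+)?)")
DUR_PATTERN = re.compile(r"(?:^|[_-])dur[_-]?(\d+(?:\.\d+)?)")
VIDEO_SUFFIXES = {".mp4", ".mkv", ".webm", ".avi", ".mov"}


def parse_clip_times(filename):
    """Return (start_seconds, duration_seconds), or None if either field is absent."""
    path = Path(filename)
    # Strip only known video extensions so a decimal such as "dur8.0" is not truncated.
    name = path.stem if path.suffix.lower() in VIDEO_SUFFIXES else path.name
    st = ST_PATTERN.search(name)
    dur = DUR_PATTERN.search(name)
    if st is None or dur is None:
        return None
    return float(st.group(1)), float(dur.group(1))
```

Because YouTube IDs may themselves contain underscores and hyphens, an ID can in rare cases produce a false match. Anchoring the patterns to the end of the name is a reasonable tightening once the real layout is known.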

Next, to prepare the data, extract video frames at 1 FPS (the extract_frames function in prepare_data.py may be helpful).
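
As an illustration of what sampling at 1 FPS means in terms of frame indices, here is a small dependency-free sketch. It is not the logic of extract_frames in prepare_data.py, which may differ. It picks, for each whole-second timestamp, the nearest frame index given the video's native frame rate. The function name `sample_frame_indices` and its parameters are hypothetical.

```python
def sample_frame_indices(num_frames, native_fps, target_fps=1.0):
    """Return indices of the frames nearest to each sampling timestamp."""
    if num_frames <= 0 or native_fps <= 0 or target_fps <= 0:
        return []
    indices = []
    k = 0
    while True:
        # The k-th sample sits at time k / target_fps, i.e. frame k * native_fps / target_fps.
        idx = int(round(k * native_fps / target_fps))
        if idx >= num_frames:
            break
        # Skip repeats, which occur when target_fps exceeds native_fps.
        if not indices or idx != indices[-1]:
            indices.append(idx)
        k += 1
    return indices
```

The returned indices can then be passed to whatever decoder is in use, for example to seek to and save each selected frame.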

We follow the LLaVA-NeXT training pipeline for SFT and DPO with the LLaVA-OV-7B model. Modify the training scripts (finetune_ov.sh, dpo_ov7b.sh) as needed to launch ProgressCaptioner training.

Keyframe Selection

ProgressCaptioner can also be used to identify informative keyframes within a video sequence.

🚧 Code and Instructions Coming Soon

Citation

If you find our work inspiring or use our codebase in your research, please consider giving a star ⭐ and a citation.

@article{xue2024progress,
  title={Progress-Aware Video Frame Captioning},
  author={Xue, Zihui and An, Joungbin and Yang, Xitong and Grauman, Kristen},
  journal={arXiv preprint arXiv:2412.02071},
  year={2024}
}
