🏡 Project Page | 📄 Paper | 🤗 Dataset | 🤗 Checkpoints
This paper considers the problem of Multi-Hop Video Question Answering (MH-VidQA) in long-form egocentric videos. This task requires not only answering visual questions, but also localizing multiple relevant time intervals within the video as visual evidence.
We develop an automated pipeline to mine multi-hop question-answer pairs with associated temporal evidence, enabling the construction of a large-scale dataset for instruction tuning. We then propose a novel architecture, termed GeLM, that leverages the world-knowledge reasoning capabilities of multi-modal large language models (LLMs) while incorporating a grounding module to retrieve temporal evidence in the video with flexible grounding tokens.
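As a rough illustration of the grounding-token idea (a hypothetical sketch, not GeLM's actual implementation): the hidden state at each special grounding token emitted in the answer can be fed to a small regression head that predicts a normalized (start, end) interval, scaled to the video duration. All names, dimensions, and weights below are illustrative.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def regress_interval(hidden_state, weights, duration):
    """Map one grounding token's hidden state to a time interval in seconds.

    `weights` holds two toy linear projections (for start and end); the
    sigmoid squashes each projection into [0, 1] before scaling by the
    video duration. Purely illustrative, not the GeLM grounding head.
    """
    s = sigmoid(sum(w * h for w, h in zip(weights[0], hidden_state)))
    e = sigmoid(sum(w * h for w, h in zip(weights[1], hidden_state)))
    start, end = sorted((s * duration, e * duration))  # enforce start <= end
    return start, end

# One toy grounding token: 4-dim hidden state, toy projection weights.
hidden = [0.2, -1.0, 0.5, 0.3]
weights = [[0.1, 0.4, -0.2, 0.3], [-0.5, 0.2, 0.7, 0.1]]
start, end = regress_interval(hidden, weights, duration=120.0)
assert 0.0 <= start <= end <= 120.0
```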
MultiHop-EgoQA/
├── baseline/ # Our Baseline Method
│ ├── checkpoints/ # Checkpoints of LLMs
│ │ ├── vicuna-v1-3-7b/
│ ├── datasets/ # Save path of datasets
│ │ ├── multihop_qa/
│ │ │ ├── features/
│ │ │ ├── train_annotations.json
│ │ ├── activitynet-captions/
│ │ │ ├── intern_feature/
│ │ │ ├── val_1.json
│ │ ├── temporal_reasoning/
│ ├── gelm/ # Implementation of the GeLM model
│ ├── llava/ # LLaVa code base
│ ├── scripts/ # Scripts for evaluating the baseline method
│ │ ├── eval_multihop_qa.sh # Evaluate GeLM on MultiHop-EgoQA
│ │ └── eval_rtl.sh # Evaluate GeLM on ActivityNet-RTL
│ └── pyproject.toml # Configuration file
│
├── benchmark/ # Benchmarking tools and metrics
│ ├── metrics/ # Metrics calculation
│ └── zero-shot-inference/ # Zero-shot inference code

See Dataset Preparation.
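To sanity-check that the downloaded data matches the layout in the tree above, a small helper like the following could be used (a hypothetical convenience, not part of the released codebase; the relative paths are taken from the directory tree):

```python
import os

# Expected dataset paths, relative to the baseline/ directory,
# as listed in the repository tree above.
EXPECTED = [
    "datasets/multihop_qa/features",
    "datasets/multihop_qa/train_annotations.json",
    "datasets/activitynet-captions/intern_feature",
    "datasets/activitynet-captions/val_1.json",
]

def missing_paths(root: str) -> list:
    """Return the expected dataset paths that do not exist under `root`."""
    return [p for p in EXPECTED if not os.path.exists(os.path.join(root, p))]
```

Running `missing_paths("baseline")` before training quickly surfaces a misplaced feature directory or annotation file.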
Training setup: Ubuntu 18.04, CUDA 12.1, 4x Nvidia H800 (80GB)
- Installing the environment.
cd baseline
conda create -n gelm python=3.10 -y
conda activate gelm
pip install --upgrade pip # enable PEP 660 support
pip install -e .
pip install ninja
pip install flash-attn --no-build-isolation
- Downloading LLM checkpoints and saving under checkpoints.
git clone https://huggingface.co/lmsys/vicuna-13b-v1.3
- Training.
# Training on MultiHop-EgoQA
bash scripts/finetune_multihop_qa.sh
# Training on ActivityNet-RTL
bash scripts/finetune_rtl.sh
# Training on both MultiHop-EgoQA and ActivityNet-RTL
bash scripts/finetune_mixed.sh

We provide checkpoints of GeLM-7B trained on MultiHop-EgoQA and ActivityNet-RTL on Hugging Face, respectively.
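Grounding quality in benchmarks like these is typically measured with temporal IoU between predicted and ground-truth intervals. A generic sketch (not the repo's exact metric code; `mean_best_iou` is one common aggregation for multi-interval grounding, named here for illustration):

```python
def temporal_iou(pred, gt):
    """IoU between two time intervals given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def mean_best_iou(preds, gts):
    """For each ground-truth interval, take the best-matching prediction's
    IoU, then average over ground-truth intervals."""
    return sum(max(temporal_iou(p, g) for p in preds) for g in gts) / len(gts)

# Example: prediction [10, 30] vs ground truth [20, 40]:
# intersection = 10, union = 30, IoU = 1/3.
```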
- Evaluation on MultiHop-EgoQA
cd benchmark/metrics
bash evaluate.sh
- Evaluation on ActivityNet-RTL
cd baseline
bash eval_rtl.sh

If you find this paper or repo helpful, please use the following format to cite:
@inproceedings{chen2025grounded,
title={Grounded multi-hop videoqa in long-form egocentric videos},
author={Chen, Qirui and Di, Shangzhe and Xie, Weidi},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
volume={39},
number={2},
pages={2159--2167},
year={2025}
}

- Our baseline method implementation is adapted from LITA.
- The implementation of the zero-shot evaluation code references the official repositories of TimeChat and VTimeLLM, as well as the Hugging Face documentation of InternVL2, LLaVa-NeXT-Video, LLaVa-v1.6, and Meta-Llama-3.1, and the OpenAI GPT-4o documentation for video understanding.

