MLDLProject: From Query-Based Segment Retrieval to Answer Generation in Egocentric Videos

This repository contains the full implementation for "From Query-Based Segment Retrieval to Answer Generation in Egocentric Videos". The codebase includes model training notebooks, architecture experiments, video segment processing scripts, answer generation pipelines, and evaluation metrics, corresponding to sections of the final project report.

📁 Repository Structure

3_3_Temporal_Localization_Models/:
Temporal localization with VSLBase and VSLNet using EgoVLP and Omnivore features. Includes Colab-compatible notebooks for Section 3.3.
3_4_Architectural_Variants_VSLNet/:
Experiments with the GloVe and BERT non-shared-encoder variants of VSLNet. Also runnable on free-tier Colab. Corresponds to Section 3.4.
3_5_Temporal_Localization_To_Answer_Generation/:
Local-only scripts for downloading, trimming, and annotating clips using Selenium and ffmpeg. Corresponds to Section 3.5.
3_6_Answer_Generation_VideoQA_Models/:
Answer generation with LLaVA-NeXT, CogVLM2, and InternVideo2.5. Requires a high-end GPU (e.g., A100). Corresponds to Section 3.6.
4_5_NLP_Metrics.ipynb:
Notebook for computing BLEU and ROUGE metrics on generated answers. Linked to Section 4.5.
Report/:
Contains the LaTeX source files, assets, and the compiled report PDF.

⚙️ Environment Setup

General Dependencies

All dependencies are installed by each notebook when it is run.

⚙️ Section-Specific Dependencies

`3_5_Temporal_Localization_To_Answer_Generation/`

pip install selenium
sudo apt install ffmpeg  # Linux; on Windows, download the installer from the official ffmpeg website
# ChromeDriver should match your Chrome version
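
Before running the scripts in this section, it can help to confirm that the prerequisites are actually visible to Python. The following is a minimal sketch, not part of the repository: the function name `check_prerequisites` is hypothetical, and it only checks that `ffmpeg` is on the PATH and that the `selenium` package is importable. It does not verify that ChromeDriver matches your Chrome version.

```python
import importlib.util
import shutil


def check_prerequisites(executables=("ffmpeg",), modules=("selenium",)):
    """Return a dict mapping each prerequisite name to True if it was found.

    Hypothetical helper for illustration; the defaults mirror the
    dependencies listed above.
    """
    status = {}
    for exe in executables:
        # shutil.which returns None when the executable is not on PATH
        status[exe] = shutil.which(exe) is not None
    for mod in modules:
        # find_spec returns None when a top-level module is not installed
        status[mod] = importlib.util.find_spec(mod) is not None
    return status


if __name__ == "__main__":
    for name, found in check_prerequisites().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

To also look for a ChromeDriver binary on the PATH, pass `executables=("ffmpeg", "chromedriver")`.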

`3_6_Answer_Generation_VideoQA_Models/`

⚠️ Requires: A100 GPU or Colab Pro+ instance
⏳ Estimated cost: ~100 Google Colab compute units

💻 Hardware Requirements

| Section | Platform | Notes |
| --- | --- | --- |
| `3_3_Temporal_Localization_Models` | Google Colab | Free-tier compatible |
| `3_4_Architectural_Variants_VSLNet` | Google Colab | Free-tier compatible |
| `3_5_Temporal_Localization_To_Answer_Generation` | Local machine | Requires ffmpeg, Selenium, and manual labeling |
| `3_6_Answer_Generation_VideoQA_Models` | A100 / Colab Pro | Very high memory/compute requirements |
| `4_5_NLP_Metrics.ipynb` | Any | Lightweight metric evaluation |

🚀 How to Run

`3_3_Temporal_Localization_Models/`

Open notebooks in Google Colab
Select either the EgoVLP or the Omnivore variant
Execute all cells
Outputs will be saved in the runs/ directory

`3_4_Architectural_Variants_VSLNet/`

Navigate to either the GLoVE/ or the NonSharedEncoder/ subdirectory
Run the notebooks in Google Colab
The notebooks use default hyperparameters
Choose either EgoVLP or Omnivore features as input

`3_5_Temporal_Localization_To_Answer_Generation/`

Run the Python scripts locally (they are not compatible with Colab)
Video clips are downloaded from the Ego4D Visualizer using Selenium
Requirements:
- Install ffmpeg and add it to system PATH
- Install selenium Python package
- Manually watch and label [clip_uid]_[query_idx].mp4 files inside the clips/ directory
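
The trimming step can be sketched as follows. This is an illustration under stated assumptions, not the repository's script: the helper names `build_trim_command` and `trim_clip`, the `source` path, and the start/end times in seconds are all hypothetical. Only the `[clip_uid]_[query_idx].mp4` naming and the `clips/` directory come from the list above. The sketch uses ffmpeg's standard `-ss` (start) and `-t` (duration) options and re-encodes the segment rather than stream-copying it.

```python
import subprocess
from pathlib import Path


def build_trim_command(source, clip_uid, query_idx, start_s, end_s, out_dir="clips"):
    """Build an ffmpeg command that cuts [start_s, end_s] out of `source`.

    The output follows the [clip_uid]_[query_idx].mp4 naming used for labeling.
    """
    if end_s <= start_s:
        raise ValueError("end_s must be greater than start_s")
    out_path = Path(out_dir) / f"{clip_uid}_{query_idx}.mp4"
    return [
        "ffmpeg",
        "-y",                               # overwrite an existing output file
        "-i", str(source),
        "-ss", f"{start_s:.3f}",            # start position in seconds
        "-t", f"{end_s - start_s:.3f}",     # duration of the segment in seconds
        # no "-c copy": ffmpeg re-encodes, so cuts need not fall on keyframes
        str(out_path),
    ]


def trim_clip(source, clip_uid, query_idx, start_s, end_s, out_dir="clips"):
    """Create the output directory and run ffmpeg (must be on PATH)."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    cmd = build_trim_command(source, clip_uid, query_idx, start_s, end_s, out_dir)
    subprocess.run(cmd, check=True)
```

For example, `trim_clip("full_video.mp4", "abc", 3, 12.5, 20.0)` would write `clips/abc_3.mp4`, assuming `full_video.mp4` exists locally.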

`3_6_Answer_Generation_VideoQA_Models/`

Use the notebooks located in the NoteBook/ directory
Place the prepared input files in the input/ directory
Generated outputs are saved in output/
Supported models:
- LLaVA-NeXT
- CogVLM2
- InternVideo2.5
⚠️ This section is highly resource-intensive and requires an A100 GPU or equivalent (e.g., Colab Pro with high memory runtime)

`4_5_NLP_Metrics.ipynb`

Run this notebook after generating answer predictions from VLMs
Calculates evaluation metrics:
- BLEU-1, BLEU-2, BLEU-4
- ROUGE-1, ROUGE-L
Outputs average scores across all evaluated queries
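
To make the metrics concrete, here is a minimal, dependency-free sketch of sentence-level BLEU and ROUGE-L. It is not the notebook's implementation. It assumes lowercase whitespace tokenization, a single reference per answer, cumulative BLEU with uniform weights and no smoothing, and ROUGE-L reported as an F1 score; the notebook may use a library with different defaults, so absolute numbers can differ.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization
    return text.lower().split()


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bleu(candidate, reference, max_n=1):
    """Cumulative BLEU-max_n with brevity penalty and no smoothing."""
    cand, ref = tokenize(candidate), tokenize(reference)
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clipped counts: a candidate n-gram is credited at most as often
        # as it occurs in the reference
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, any zero precision gives BLEU = 0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty for candidates shorter than the reference
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 based on the longest common subsequence."""
    cand, ref = tokenize(candidate), tokenize(reference)
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def average_scores(pairs):
    """Average BLEU-1 and ROUGE-L over (candidate, reference) pairs."""
    n = len(pairs)
    return {
        "BLEU-1": sum(bleu(c, r, 1) for c, r in pairs) / n,
        "ROUGE-L": sum(rouge_l_f1(c, r) for c, r in pairs) / n,
    }
```

Calling `average_scores([("the cup is on the table", "the cup is on the kitchen table")])` returns a dictionary with one averaged value per metric; BLEU-2 and BLEU-4 follow from `bleu(..., max_n=2)` and `bleu(..., max_n=4)`.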

📄 Report

The Report/ folder includes:

Full LaTeX source code
Tables, images, and bibliography
Code section references aligned with the report structure
