Aha! – Predicting What Matters Next: Online Highlight Detection Without Looking Ahead

Official implementation of paper: Aha! – Predicting What Matters Next: Online Highlight Detection Without Looking Ahead (in Proceedings of NeurIPS 2025)


Contents:

  • Introduction
  • Installation
  • Required Specs
  • Download Pretrained Model
  • Preparing the Metadata
  • Evaluation
  • Training
  • Usage Guidelines
  • Acknowledgements
  • License
  • Citation

Introduction

Unlike traditional models that analyze every frame or respond at fixed intervals, Aha! dynamically decides when to pause, reason, and act, capturing the essence of meaningful moments.

Built by fine-tuning Qwen-7B with a multimodal, importance-aware objective and incorporating uncertainty-based decision-making, Aha! can:

  • 🎯 Detect when enough context has accumulated to make informed predictions

  • 📊 Rank and extract key segments using task-aware importance scores

  • 📝 Generate structured highlight reels using Savitzky-Golay smoothing and time-domain peak detection to identify key moments (a rough sketch follows this list)
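
For intuition, here is a minimal sketch of that smoothing-and-peak-picking step using SciPy. The window, polynomial order, and minimum-gap values are illustrative assumptions, not the exact settings used by Aha!.

import numpy as np
from scipy.signal import savgol_filter, find_peaks

def extract_highlights(scores, fps=1.0, window=31, polyorder=3, min_gap_s=10.0):
    # scores: 1-D array of per-frame importance scores.
    scores = np.asarray(scores, dtype=float)
    # Savitzky-Golay windows must be odd and no longer than the signal.
    window = min(window, len(scores) if len(scores) % 2 else len(scores) - 1)
    smoothed = savgol_filter(scores, window_length=window,
                             polyorder=min(polyorder, window - 1))
    # Keep peaks above the mean score, at least min_gap_s seconds apart.
    peaks, _ = find_peaks(smoothed, height=smoothed.mean(),
                          distance=max(1, int(min_gap_s * fps)))
    return [(p / fps, smoothed[p]) for p in peaks]  # (timestamp in s, score)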

This approach enables more efficient video understanding, making Aha! applicable to autonomous agents, surveillance, video summarization, and decision-support systems. Below is an example of our model running live on the NASA stream of astronaut Jonny Kim's Soyuz MS-27 docking (55 minutes of video):

Aha! demo gif
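
As a rough illustration of the uncertainty-based decision-making mentioned above, the gate below lets the model respond only once the predictive entropy of its decision distribution drops under a threshold. This is a hypothetical sketch of the idea, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def should_respond(logits: torch.Tensor, entropy_threshold: float = 0.5) -> bool:
    # logits: unnormalized scores over the model's decision vocabulary.
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    # Low entropy means enough context has accumulated to act confidently.
    return bool(entropy < entropy_threshold)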

Installation

  1. Create a conda environment and install the base requirements with pip:
cd aha
conda create -n aha python=3.10
conda activate aha
pip install --upgrade pip
pip install -r requirements.txt
  2. Install LLaVA. If you run into any issues, check the download instructions in the official repository.
cd LLaVA_NeXT
pip install -e ".[train]"
cd ..
  3. Install torch and torchvision compiled with CUDA. Install them together using the instructions provided by pytorch.org.

  4. Install flash-attention following the instructions at https://github.com/Dao-AILab/flash-attention. If you have difficulty installing it, add --attn_implementation sdpa to every train or inference command to use the SDPA implementation of transformer attention.

MAX_JOBS=4 pip install flash-attn --no-build-isolation --no-cache-dir 
  5. Optional: download the pretrained model weights from our Google Drive.
Common Problems

Note 1: If you get a bitsandbytes error, try running:

pip uninstall bitsandbytes
pip install bitsandbytes

Note 2: If you get an undefined symbol error (e.g., ...cpython-310-x86_64-linux-gnu.so: undefined symbol: ...), try running:

pip uninstall flash-attn
pip install flash-attn --no-build-isolation --no-cache-dir

Note 3: If you get a c10 deprecation error, your PyTorch version might be too new. The authors used these versions:

  • Python == 3.10
  • torch == 2.5.1
  • torchvision == 0.20.1
  • CUDA == 12.4
pip3 install torch==2.5.1 torchvision==0.20.1 torchaudio --index-url https://download.pytorch.org/whl/cu124

Note 4: If you want to use the CPU adam optimizer with deepspeed, you need to install it with the correct flags:

DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 \
pip install deepspeed \
  --global-option="build_ext" \
  --global-option="-j8"

Required Specs

This model was trained for 1 epoch on 6x A6000 GPUs over 24 hours. You need at least 48GB of VRAM on each GPU to tune it.

Inference requires at least 24GB VRAM. Tested on a single RTX 4090 GPU.

Download Pretrained Model

  • Download the checkpoint weights from our Google Drive
  • Unzip the weights into aha_weights
    unzip aha_weights.zip -d aha_weights

Preparing the Metadata

  • Download the metadata for our dataset. You can download it from our Google Drive.
  • Unzip the metadata into datasets
    unzip datasets.zip -d datasets

This should give you a directory structure like this:

├── datasets
│   ├── charades
│   │   └── annotations
│   │       └── test-random_prompt.json
│   ├── coin
│   │   └── annotations
│   │       └── train-0.25_0.5_earlier-120s_240s.json
│   ├── download_tools
│   │   ├── coin_download.py
│   │   ├── coin_files.json
│   │   ├── hisum_download.py
│   │   ├── mr_hisum_crawler.py
│   │   ├── mr_hisum_metadata.csv
│   │   └── vocabulary.csv
│   ├── hisum
│   │   └── annotations
│   │       ├── mr_hisum_metadata.csv
│   │       └── split.json
│   ├── HIHD
│   │   └── annotations
│   │       ├── HIHD_metadata.csv
│   │       └── youtube_links.txt
│   ├── shot2story
│   │   └── annotations
│   │       ├── dvc_train-human_anno-0.25_0.5_earlier.json
│   │       ├── magqa_test.json
│   │       └── magqa_train-0.25_0.5-earlier.json
│   ├── tvsum
│   └── youcook2
│       └── annotations
│           └── val-random_prompt.json
├── assets
├── configs
├── data
├── demo
├── instructions
├── LICENSE
├── LLaVA_NeXT
├── models
├── README.md
├── requirements.txt
├── scripts
├── test
├── train.py
└── Utils

Evaluation

  • TVSum data preparation:

    • Follow the instructions from the official TVSum repository to download the videos, then move them into the datasets folder as datasets/tvsum
    • Run scripts/inference/tvsum.sh.
    • To evaluate TVSum with the quality dropout, run scripts/inference/tvsum_degraded.sh
  • Mr.HiSum data preparation

    • Prepare the mr_hisum.h5 file following the instructions of the official repository.
    • Place the mr_hisum.h5 file in the datasets/hisum/annotations folder (see the inspection sketch after this list).
    • Download the validation YouTube videos and place them in the datasets/hisum/videos folder.
    • Run scripts/inference/hisum.sh
  • Charades data preparation

    • Prepare the Charades videos following the official instructions. Place them in the datasets/charades/videos folder.
    • Run scripts/inference/charades.sh
  • YouCook2 data preparation

    • Prepare the YouCook2 videos following the official instructions. Place them in the datasets/youcook2/videos folder.
    • Run scripts/inference/youcook2.sh
  • Shot2Story data preparation

    • Prepare the Shot2Story videos following the official instructions. Place them in the datasets/shot2story/videos folder.
    • Go to scripts/inference/magqa.sh and update the GROQ_API_KEY (if using online inference for llama-3.3 70B) and OPENAI_API_KEY (required).
    • Note: you need at least 140GB of VRAM locally to run a quantized version of a 70B Llama model.
    • Run scripts/inference/magqa.sh
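
Before running the Mr.HiSum evaluation, a quick h5py inspection can confirm the mr_hisum.h5 file is in place. The group/key layout is defined by the official Mr.HiSum repository, so treat the printed paths, not this sketch, as authoritative.

import h5py

with h5py.File("datasets/hisum/annotations/mr_hisum.h5", "r") as f:
    print(len(f.keys()), "top-level entries")
    # Collect the first few entry paths to confirm the layout.
    names = []
    f.visit(names.append)
    print(names[:10])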

Training

Data preparation

  • HIHD data preparation
  • Shot2Story data preparation
    • Prepare the Shot2Story videos following the official instructions.
    • Place them in the datasets/shot2story/videos folder.
  • COIN data preparation
    • Prepare the COIN videos following the official instructions.
    • Place the videos in the datasets/coin/videos folder.

Since some of these datasets (especially HIHD) are very large, you can point the video paths in configs/datasets/aha_config.json to wherever the datasets live on your local machine, as in the sketch below.
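
A minimal way to do that programmatically is sketched here; the "hihd"/"video_path" keys are hypothetical, so check the shipped config for the actual schema.

import json

cfg_path = "configs/datasets/aha_config.json"
with open(cfg_path) as f:
    cfg = json.load(f)

# Hypothetical keys: point the HIHD videos at an external drive.
cfg["hihd"]["video_path"] = "/mnt/storage/HIHD/videos"

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)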

Note: I've left scripts to help with the download process at datasets/download_tools.

When running training code for the first time, the dataset code will traverse all videos of the training dataset and measure the frame rate, duration, number of frames, and the corruption status of the videos. It will store this information in datasets/${dataset_name}/videos_metadata.json. This can take some time, since some of these datasets are very large.
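
The cached entries resemble what a probe like the following would produce (a minimal OpenCV sketch, not the repo's actual traversal code):

import glob, json, os
import cv2

def probe_video(path):
    cap = cv2.VideoCapture(path)
    ok = cap.isOpened()
    fps = cap.get(cv2.CAP_PROP_FPS) if ok else 0.0
    frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT)) if ok else 0
    cap.release()
    return {
        "fps": fps,
        "num_frames": frames,
        "duration": frames / fps if fps else 0.0,
        "corrupted": (not ok) or frames == 0,
    }

def build_metadata(dataset_dir):
    # Probe every video and cache the results next to the dataset.
    meta = {os.path.basename(p): probe_video(p)
            for p in glob.glob(os.path.join(dataset_dir, "videos", "*.mp4"))}
    with open(os.path.join(dataset_dir, "videos_metadata.json"), "w") as f:
        json.dump(meta, f, indent=2)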

  • Following MMDuet's labeling process, download paraphrase-en.gz (59MB), which is used for dense video captioning evaluation. Put this file at test/dvc/metrics/data/paraphrase-en.gz

Run the training script

Log into wandb to monitor your training progress:

wandb login [YOUR_API_KEY]

Start the training process

bash ./scripts/train.sh

Distributed Training

This model is very large and was trained on 6x A6000 GPUs, so you will probably need distributed training. I've included instructions on how to train on the cloud using Paperspace.

Fine-tuning

If you wish to further tune the trained weights, download the pretrained model from our Google Drive and put the files under the aha_weights folder.

In the scripts/train.sh file, add this line: --lora_pretrained aha_weights \

Usage Guidelines

This model should not be deployed in contexts that may infringe on personal privacy or be used to reinforce harmful societal biases. If used in sensitive domains (e.g., surveillance, defense), additional safeguards such as privacy-preserving filters, access controls, and domain-specific audits are strongly recommended. By using this model, you agree to adhere to responsible AI deployment practices.

Acknowledgements

This work was conducted as part of the author's AEOP Fellowship, with compute resources and mentorship provided by the Army Research Laboratory, West Coast (ARL-W).

The codebase uses components from LLaVA-NeXT and MMDuet. We thank the original authors for their contributions.

License

This project is licensed under the Apache License 2.0.

Citation

If you find this work useful in your research, please consider citing:

@inproceedings{chang2025aha,
  title     = {Aha! - Predicting What Matters Next: Online Highlight Detection Without Looking Ahead},
  author    = {Chang, Aiden and De Melo, Celso and Lukin, Stephanie},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  note      = {NeurIPS},
  url       = {https://neurips.cc/virtual/2025/poster/119707}
}
