The code has not been cleaned up yet; we plan to clean it in the future. If you need any clarification, please contact js-hyun@yonsei.ac.kr.
We use packages from ActionFormer and ViFi-CLIP. Our experiments were conducted with cuda==11.3, torch==1.11.0, torchvision==0.12.0, and numpy==1.24.4.
Alternatively, you can use the following Docker image:
docker pull jshyunaa/vificlip_tal:v2
Below are Google Drive links to the datasets, including features and annotations used in our experiments. For each dataset, we provide CLIP-B (clip) and ViFi-CLIP-B (ep10) features.
[FineAction] [THUMOS14] [ANET13]
Due to their large volume, we do not release the other features (ViCLIP-B/L features, and ViFi-CLIP-B features for the untrimmed YouTube videos used for scaling up). If you need these features, please contact us.
After downloading the datasets, please organize them and the checkpoints according to the following project structure:
{project_root}/
│
├── data/
│ ├── anet13/
│ │ ├── annotations/
│ │ ├── vificlip_feats/
│ │ │ └── F16_w16F_s4F_ep10_trainval/
│ │ └── vinfo.csv
│ ├── thumos14/
│ │ ├── annotations/
│ │ ├── vificlip_feats/
│ │ └── vinfo.csv
│ ├── fineaction/
│ │ ├── annotations/
│ │ ├── vificlip_feats/
│ │ └── vinfo.csv
│ └── joint/
│ ├── annotations/
│ ├── vificlip_feats/
│ └── vinfo.csv
│
├── checkpoints/ (Stores pretrained backbones)
│ ├── vifi_clip/
│ │ └── vifi_clip_10_epochs_k400_full_finetuned.pth
│ └── viclip/
│ ├── ViCLIP-B_InternVid-FLT-10M.pth
│ └── ViCLIP-L_InternVid-FLT-10M.pth
├── ckpt/ (Stores training outputs)
│ ├── {config of training data}/
│ └── TH_agn_K400/
│ └── {config of model}/
│ └── thumos14_vifi_prop_K400_0/
│ ├── epoch_xxx.pth.tar
│ └── logs/
│ └── ...
...
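Before training, it can help to verify that the downloaded data and checkpoints match the layout above. The following is a minimal sketch (not part of the released code — `check_layout` and the directory list are illustrative, mirroring the tree shown here):

```python
# Sketch: check that the expected dataset/checkpoint directories exist
# under the project root. Names mirror the project structure above.
from pathlib import Path

EXPECTED_DIRS = [
    "data/anet13/annotations",
    "data/anet13/vificlip_feats",
    "data/thumos14/annotations",
    "data/thumos14/vificlip_feats",
    "data/fineaction/annotations",
    "data/fineaction/vificlip_feats",
    "data/joint/annotations",
    "data/joint/vificlip_feats",
    "checkpoints/vifi_clip",
    "checkpoints/viclip",
]

def check_layout(project_root):
    """Return the expected directories that are missing under project_root."""
    root = Path(project_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]

if __name__ == "__main__":
    for d in check_layout("."):
        print(f"missing: {d}")
```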
Please refer to this repository for feature extraction: https://github.com/HYUNJS/vifi-clip-tal. Note: before extracting features, resize the shortest side of each video to 256 pixels and set the frame rate to 30 FPS.
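The resizing and frame-rate steps can be done with ffmpeg. Below is a sketch that builds the command for one video (assuming ffmpeg is installed; the helper name and paths are illustrative). The scale filter keeps the shorter side at 256 px and derives the other side, rounded to an even value, and `fps=30` fixes the frame rate:

```python
# Sketch: build an ffmpeg command that resizes the shorter side to 256 px
# and resamples to 30 FPS, matching the preprocessing described above.
import subprocess

# For landscape videos the height is the shorter side; for portrait
# videos it is the width. -2 lets ffmpeg pick an even matching dimension.
SCALE = "scale='if(gt(iw,ih),-2,256)':'if(gt(iw,ih),256,-2)'"

def build_ffmpeg_cmd(src, dst):
    """Return the ffmpeg argv; run it with subprocess.run(cmd, check=True)."""
    return ["ffmpeg", "-y", "-i", src, "-vf", f"{SCALE},fps=30", dst]

if __name__ == "__main__":
    cmd = build_ffmpeg_cmd("raw/video.mp4", "resized/video.mp4")
    print(" ".join(cmd))
```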
Please refer to this repository for OV-TAL with Gemini: https://github.com/HYUNJS/LMM-TAL
@inproceedings{hyun2025exploring,
title={Exploring Scalability of Self-Training for Open-Vocabulary Temporal Action Localization},
author={Hyun, Jeongseok and Han, Su Ho and Kang, Hyolim and Lee, Joon-Young and Kim, Seon Joo},
booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
year={2025},
pages={9388-9397}
}

Our code is based on ActionFormer, ViFi-CLIP, and InternVideo. We sincerely thank the authors for releasing their code.