Video is a hard modality to work with. You're dealing with more data, temporal complexity, and annotation workflows that don't scale. This workshop tackles a practical question: given a large video dataset, how do you understand what's in it without manually watching thousands of clips?
We work with a preview subset of Action100M: 1,144 YouTube videos, each clipped to 90 seconds and annotated with a hierarchical Tree-of-Captions structure produced by a fully automated AI pipeline. Every label in this dataset was written by a model; none of it was reviewed by a human annotator.
As AI-generated datasets become the norm, the skill of interrogating machine-generated annotations is increasingly important. This workshop shows you how to do that systematically.
| Section | Question | Tools |
|---|---|---|
| 1. What We Were Given | What does this dataset claim to contain? | FiftyOne App |
| 2. Three Lenses | What does the raw data actually look like? | Qwen3-VL-Embedding, Molmo2, Sentence Transformers |
| 3. The Second Opinion | Does a second model agree with the first? | Qwen3-VL |
| 4. Measuring Agreement | How much do they agree, per sample? | Text Evaluation Plugin |
By the end, you'll have a confidence map of the dataset's annotations and a reusable workflow for understanding any video dataset with AI-generated labels.
Install all dependencies with:

```bash
pip install -r requirements.txt
```

Running on Google Colab? Uninstall `torchcodec` after installing requirements, since it conflicts with Colab's video decoding stack:

```bash
pip uninstall torchcodec -y
```

`flash-attn` is not included in `requirements.txt` because it requires a compatible CUDA environment and can take a while to build. It is not required, but will significantly speed up inference with the transformer-based models in this workshop. If your environment supports it:

```bash
pip install flash-attn --no-build-isolation
```

Download the base Action100M subset from the Voxel51 Hugging Face org:
```python
from fiftyone.utils.huggingface import load_from_hub

dataset = load_from_hub(
    "Voxel51/action100m_tiny_subset",
    dataset_name="action100m",
    overwrite=True,
    persistent=True,
)
```

If your compute is limited, the fully enriched dataset (with all embeddings, model outputs, and evaluation scores from the notebook already computed) is available here:
```python
from fiftyone.utils.huggingface import load_from_hub

dataset = load_from_hub(
    "harpreetsahota/fo_video_workshop_enriched",
    dataset_name="action100m_enriched",
    overwrite=True,
    persistent=True,
)
```

The enriched dataset includes Qwen3-VL-Embedding vectors. To use the natural language search feature in the FiftyOne App, you need to register and download the model locally so FiftyOne can use it for query encoding:
```python
import fiftyone.zoo as foz

foz.register_zoo_model_source(
    "https://github.com/harpreetsahota204/qwen3vl_embeddings",
    overwrite=True,
)

foz.download_zoo_model(
    "https://github.com/harpreetsahota204/qwen3vl_embeddings",
    model_name="Qwen/Qwen3-VL-Embedding-2B",
)
```

This workshop uses three FiftyOne plugins. Install them before running the notebook.
Used in Section 4 to compute per-sample agreement scores between model outputs:

```bash
fiftyone plugins download https://github.com/harpreetsahota204/text_evaluation_metrics --overwrite
```

Some captions in this dataset are long. This plugin renders any `StringField` in a formatted panel inside the FiftyOne App, making them much easier to read:
```bash
fiftyone plugins download https://github.com/harpreetsahota204/caption_viewer --overwrite
```

An experimental panel used to demonstrate a few additional workflows in the notebook:
```bash
fiftyone plugins download https://github.com/harpreetsahota204/FiftyComfy --overwrite
```

If you use the datasets or models from this workshop, please cite them:

```bibtex
@article{chen2026action100m,
  title={Action100M: A Large-scale Video Action Dataset},
  author={Chen, Delong and Kasarla, Tejaswi and Bang, Yejin and Shukor, Mustafa and Chung, Willy and Yu, Jade and Bolourchi, Allen and Moutakanni, Théo and Fung, Pascale},
  journal={arXiv preprint arXiv:2601.10592},
  year={2026}
}

@article{qwen3vlembedding,
  title={Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking},
  author={Li, Mingxin and Zhang, Yanzhao and Long, Dingkun and Chen, Keqin and Song, Sibo and Bai, Shuai and Yang, Zhibo and Xie, Pengjun and Yang, An and Liu, Dayiheng and Zhou, Jingren and Lin, Junyang},
  journal={arXiv},
  year={2026}
}

@article{Qwen3-VL,
  title={Qwen3-VL Technical Report},
  author={Shuai Bai and Yuxuan Cai and Ruizhe Chen and Keqin Chen and Xionghui Chen and Zesen Cheng and Lianghao Deng and Wei Ding and Chang Gao and Chunjiang Ge and Wenbin Ge and Zhifang Guo and Qidong Huang and Jie Huang and Fei Huang and Binyuan Hui and Shutong Jiang and Zhaohai Li and Mingsheng Li and Mei Li and Kaixin Li and Zicheng Lin and Junyang Lin and Xuejing Liu and Jiawei Liu and Chenglong Liu and Yang Liu and Dayiheng Liu and Shixuan Liu and Dunjie Lu and Ruilin Luo and Chenxu Lv and Rui Men and Lingchen Meng and Xuancheng Ren and Xingzhang Ren and Sibo Song and Yuchong Sun and Jun Tang and Jianhong Tu and Jianqiang Wan and Peng Wang and Pengfei Wang and Qiuyue Wang and Yuxuan Wang and Tianbao Xie and Yiheng Xu and Haiyang Xu and Jin Xu and Zhibo Yang and Mingkun Yang and Jianxin Yang and An Yang and Bowen Yu and Fei Zhang and Hang Zhang and Xi Zhang and Bo Zheng and Humen Zhong and Jingren Zhou and Fan Zhou and Jing Zhou and Yuanzhi Zhu and Ke Zhu},
  journal={arXiv preprint arXiv:2511.21631},
  year={2025}
}

@misc{clark2026molmo2,
  title={Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding},
  author={Christopher Clark and Jieyu Zhang and Zixian Ma and Jae Sung Park and Mohammadreza Salehi and Rohun Tripathi and Sangho Lee and Zhongzheng Ren and Chris Dongjoo Kim and Yinuo Yang and Vincent Shao and Yue Yang and Weikai Huang and Ziqi Gao and Taira Anderson and Jianrui Zhang and Jitesh Jain and George Stoica and Winson Han and Ali Farhadi and Ranjay Krishna},
  year={2026},
  eprint={2601.10611},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2601.10611},
}

@misc{akram2026jinaembeddingsv5texttasktargetedembeddingdistillation,
  title={jina-embeddings-v5-text: Task-Targeted Embedding Distillation},
  author={Mohammad Kalim Akram and Saba Sturua and Nastia Havriushenko and Quentin Herreros and Michael Günther and Maximilian Werk and Han Xiao},
  year={2026},
  eprint={2602.15547},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2602.15547},
}
```

The code and workshop content in this repository are licensed under Apache 2.0.
Models and datasets referenced in this workshop are subject to their own respective licenses — please consult each project directly before use.
Found a bug or have a question? Open an issue.
Contributions are not being accepted.
