A production-ready framework for evaluating the video understanding performance of multimodal foundation models, built on the real data and tasks that technical leaders and practitioners in the media industry are building and deploying in production.
- 17 vision-language models: Benchmarking across AWS Bedrock (Nova, Pegasus, NVIDIA), Google Vertex AI (Gemini), OpenAI (GPT), Anthropic (Claude), and self-hosted OpenAI-compatible models (Qwen, NVIDIA vLLM). See the Model Reference Guide for the complete list and details.
- 4 task types:
- Standard tagging
- Tagging and refinement workload
- Summarization
- Summary evaluation
- Config Validation: Pydantic-based validation catches errors before expensive operations
- Multi-Cloud Support: AWS Bedrock, OpenAI, Google Vertex AI, Anthropic (Claude), Self-Hosted OpenAI-compatible models
- Config-Driven: Zero-code model swapping and experimentation
- Smart Caching: Frame reuse across runs with S3/GCS/Local storage backends
- Comprehensive Tracking: Token usage, API costs, timing metrics
- LLM-as-a-Judge: Automated summary quality evaluation
- Extensible: Plugin architecture for adding models, tasks, and components
- Video-level tagging
- Video-level summarization
- Video-level tagging and refinement workload
- Performance
- Video-level tagging: precision, recall, F1
- Video-level summarization: Rubric-based score (using LLM-as-judge evaluation)
- Video-level tagging and refinement workload: N/A
- Cost
- Latency/throughput
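The tagging metrics above can be sketched as simple set arithmetic over predicted and ground-truth tags. This is an illustrative sketch only; the framework's actual metric code lives in src/benchmarks/metrics/ and the function name and tags here are hypothetical.

```python
def tag_metrics(predicted: set[str], actual: set[str]) -> dict[str, float]:
    """Per-video precision, recall, and F1 for multi-label tagging."""
    tp = len(predicted & actual)  # tags the model got right
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall) > 0
        else 0.0
    )
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical example: model predicts 3 tags, ground truth has 4.
print(tag_metrics({"comedy", "food", "family"}, {"comedy", "food", "drinks", "celebration"}))
```

Per-video scores like these can then be averaged across the dataset to produce model-level numbers.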
As our core dataset, we use the video data and annotations from "Automatic Understanding of Image and Video Advertisements"[1]. In brief:
- Video data: 2,003 ad videos ranging in length from 30s to 2m 30s, for a total duration of 1,749 minutes.
- Annotations: 68 video-level tags focused on topics and sentiment.
We augment this dataset with additional summaries for the same video data from human annotators. Briefly:
- Video data: Same as above
- Augmented annotations:
- Video-level summaries focused on long-form editorial descriptions, including storyline, intent, message, tone and target audience.
- 100 video-level tags, including genre, format, subject, mood and themes.
- A number of tags from the original list were omitted from analysis due to limited coverage or inconsistent application (e.g., `funny`, `effective`, `exciting`).
- The dataset YouTube video IDs are available at `data/inputs/youtube_video_ids.txt`.
- Videos should be named `vid_<youtube_id>.mp4` (e.g., `vid_8iXdsvgpwc8.mp4`) when stored in S3, GCS, or locally.
- Video-level summaries are available at `data/inputs/summarization_ground_truth.jsonl`.
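The naming convention above can be applied with a small helper. This is a hypothetical sketch; neither function exists in the repo, and only the `vid_<youtube_id>.mp4` pattern and the IDs file path come from the documentation.

```python
from pathlib import Path


def video_filename(youtube_id: str) -> str:
    """Expected storage name for one dataset video (vid_<youtube_id>.mp4)."""
    return f"vid_{youtube_id}.mp4"


def expected_video_files(ids_path: Path) -> list[str]:
    """Read the dataset ID list (one ID per line) and return expected filenames."""
    return [video_filename(v) for v in ids_path.read_text().split()]


print(video_filename("8iXdsvgpwc8"))  # vid_8iXdsvgpwc8.mp4
```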
- Python 3.12
- UV for dependency/environment management
- AWS Bedrock (Nova Pro v1.0, Nova Lite v2.0, Pegasus 1.2, NVIDIA Nemotron Nano 12B v2 VL)
- OpenAI API (GPT 5.1, GPT 5.4, GPT 5 Mini, GPT 5 Nano)
- Google Vertex AI (Gemini 2.5 Pro, Gemini 3.0 Pro, Gemini 3.1 Pro, Gemini 3.1 Flash-Lite)
- Anthropic API (Claude Opus 4.6, Claude Sonnet 4.6, Claude Haiku 4.5)
- Self-Hosted OpenAI-compatible models (Qwen3-VL-30B-A3B-Instruct-FP8, NVIDIA Nemotron 3 Nano Omni via vLLM)
- Pydantic v2 for config validation
- python-dotenv for environment loading
- OpenCV for video processing
- Pytest for testing
The project uses a plugin architecture with Registry + Factory + Builder patterns for zero-code extensibility. Components are config-driven and dynamically loaded at runtime.
Key design patterns:
- Registry Pattern: Dynamic component registration and discovery
- Factory Pattern: Component construction from YAML configuration
- Strategy Pattern: Interchangeable storage backends (S3/GCS/Local)
- Validation Pattern: Pydantic-based config validation at load time
All models inherit from base classes providing consistent prompt management, response parsing, and token tracking.
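The Registry + Factory combination can be sketched as below. The names (`MODEL_REGISTRY`, `register`, `build_component`, `GptTagger`) are illustrative, not the framework's actual identifiers; see registry.py and builders.py for the real implementation.

```python
from typing import Callable

# Registry pattern: components self-register under a string key.
MODEL_REGISTRY: dict[str, type] = {}


def register(key: str) -> Callable[[type], type]:
    """Class decorator that adds a component class to the registry."""
    def wrap(cls: type) -> type:
        MODEL_REGISTRY[key] = cls
        return cls
    return wrap


@register("gpt_tagger")
class GptTagger:
    def __init__(self, model_id: str):
        self.model_id = model_id


# Factory pattern: construct a registered component from a YAML-style dict,
# so swapping models is a config change, not a code change.
def build_component(config: dict):
    cls = MODEL_REGISTRY[config["type"]]
    return cls(**config.get("params", {}))


tagger = build_component({"type": "gpt_tagger", "params": {"model_id": "gpt-5-mini"}})
print(type(tagger).__name__)  # GptTagger
```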
Config Validation: All configuration files are validated using Pydantic v2 schemas before execution. This catches structural errors, missing files, invalid registry keys, and environment variable issues before any expensive operations (video downloads, API calls) occur.
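A minimal sketch of this fail-fast idea with Pydantic v2, assuming hypothetical field names (`task`, `model_id`, `videos`); the real schemas live in src/benchmarks/schemas/config_schemas/ and are more extensive.

```python
from pydantic import BaseModel, ValidationError, field_validator


class TaskConfig(BaseModel):
    task: str
    model_id: str
    videos: list[str]

    @field_validator("videos")
    @classmethod
    def non_empty(cls, v: list[str]) -> list[str]:
        # Reject an empty video list before any download or API call happens.
        if not v:
            raise ValueError("at least one video is required")
        return v


try:
    TaskConfig(task="standard_tagging", model_id="gpt-5-mini", videos=[])
except ValidationError as e:
    print("config rejected before any expensive operation:", e.error_count(), "error(s)")
```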
benchmarks-project/
├── src/benchmarks/ # Main package
│ ├── models/ # Taggers, Summarizers, Judges
│ │ └── prompts/ # Model-specific prompts organized by task
│ ├── tasks/ # Task orchestration
│ ├── metrics/ # Evaluation logic
│ ├── preprocessing/ # Video → frames transformation
│ ├── storage/ # S3/GCS/Local backends
│ ├── schemas/ # Pydantic validation schemas
│ │ ├── api_schemas.py # API response schemas
│ │ └── config_schemas/ # Config validation (modular)
│ │ ├── __init__.py # Main validate_config() entry
│ │ ├── base.py # Base configs (logging, tracking, pricing)
│ │ ├── video.py # Video source/selection validation
│ │ ├── storage.py # Storage backend validation
│ │ ├── preprocessor.py # Preprocessor + cache validation
│ │ ├── components.py # Component validation (taggers, etc.)
│ │ ├── paths.py # Path configs (task-specific)
│ │ ├── tasks.py # Task-specific settings
│ │ └── task_configs.py # Full task configurations
│ ├── utils/ # AWS, GCP, timing, config validation
│ ├── datasets/ # Ground truth loaders
│ ├── writers/ # Output formatting
│ ├── registry.py # Component registries
│ ├── builders.py # Factory functions
│ └── enums.py # Type-safe enums
├── configs/ # Task configurations
│ ├── standard_tagging/
│ ├── workload/
│ ├── summarization/
│ ├── summary_evaluation/
│ └── preprocessing/ # Preprocessing & MediaConvert configs
├── tests/ # Test suite
│ ├── test_config_validation.py # Integration test for all configs
│ ├── test_*.py # Unit tests
│ └── fixtures/ # Test data and helpers
├── scripts/ # Utility scripts
├── data/ # Datasets and results
├── docs/ # Documentation
└── pyproject.toml # Project configuration
- Install UV

  ```bash
  curl -LsSf https://astral.sh/uv/install.sh | sh
  ```

- Clone the repository

  ```bash
  git clone <repository-url>
  cd benchmarks-project
  ```

- Install dependencies

  ```bash
  uv sync --all-groups
  ```

- Install pre-commit hooks

  ```bash
  uv run pre-commit install
  ```

- Configure environment variables

  ```bash
  cp .env.example .env
  ```

  Edit the `.env` file and add your credentials:

  ```bash
  # OpenAI
  OPENAI_API_KEY=your_openai_key_here

  # AWS (or use ~/.aws/credentials)
  AWS_ACCESS_KEY_ID=your_aws_key_id
  AWS_SECRET_ACCESS_KEY=your_aws_secret_key

  # GCP (or use service account JSON)
  GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json

  # Anthropic
  ANTHROPIC_API_KEY=your_anthropic_key_here
  ```
main.py is the entrypoint. Pass the config you want to run as an argument:

```bash
uv run main.py path/to/config.yaml
```

Example:

```bash
uv run main.py configs/standard_tagging/testing/config_openai_test.yaml
```

Config files specify all parameters needed to set up and run a pipeline: task, model, videos, preprocessor config if needed, etc.
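As a hypothetical sketch of what such a config might contain (field names here are illustrative, not the real schema; see the Configuration Guide and the Pydantic schemas for the actual structure):

```yaml
# Illustrative config sketch - see configs/ for real examples.
task: standard_tagging
model:
  type: gpt_tagger
  model_id: gpt-5-mini
videos:
  source: s3
  bucket: benchmarks-project-dev
  ids_file: data/inputs/youtube_video_ids.txt
preprocessor:
  fps: 1
  max_height: 720
```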
That's it!
To test the pipeline, use configs from the testing/ subdirectories. These are configured to run on a small number of videos with debug logging enabled:
```bash
# Example: Test standard tagging
uv run python main.py configs/standard_tagging/testing/config_gpt_test.yaml

# Example: Test summarization
uv run python main.py configs/summarization/testing/config_gpt_test.yaml
```

The framework supports four task types, all executed through main.py:
```bash
# Standard Tagging - Multi-label classification with metrics
uv run python main.py configs/standard_tagging/config_bedrock.yaml

# Workload Benchmark - Timing and cost analysis with iterative prompts
uv run python main.py configs/workload/config_workload_gpt.yaml

# Summarization - Generate video summaries
uv run python main.py configs/summarization/config_gpt.yaml

# Summary Evaluation - LLM-as-a-judge quality assessment
uv run python main.py configs/summary_evaluation/config_openai_judge_gpt.yaml
```

Note: All configs are validated at load time using Pydantic schemas. Validation errors are reported with clear messages before any execution begins.
Extract duration, resolution, codec, and file size from videos in S3 using ffprobe. Used for duration-based video selection (e.g., selecting videos to reach a target number of minutes):
```bash
uv run python scripts/collect_video_metadata.py \
    --bucket benchmarks-project-dev \
    --prefix youtube_ads_dataset_h264/ \
    --output data/inputs/youtube_ads_video_metadata.json
```

Extract and cache frames without running model inference:

```bash
uv run python scripts/preprocess_frames_only.py \
    --config configs/preprocessing/config_preprocess_duration_based.yaml
```

Convert videos with AWS MediaConvert:

- Always converts to H.264 codec
- Optionally downscales resolution (set `max_height` in config)
- For 720p downscaling (NVIDIA models): use `config_mediaconvert_720p.yaml`
- For codec conversion only: use `config_mediaconvert_duration_based.yaml`

```bash
uv run python scripts/convert_videos_mediaconvert.py \
    --config configs/preprocessing/config_mediaconvert_720p.yaml
```

Run tests using pytest:
```bash
# Run all tests
uv run pytest

# Run only fast tests (<5 seconds)
uv run pytest -m fast

# Run specific test file
uv run pytest tests/test_response_parsing.py
```

The test suite includes test cases covering:
- Config validation: Integration test validates all configs
- Model API response parsing: JSON parsing, tag filtering, confidence handling
- Multi-label classification metrics: Precision, recall, F1 calculations
- LLM-as-a-judge evaluation: Sum/average aggregation, criterion statistics
- Cost and timing calculations: Per-video, per-iteration aggregation
- Label loading and validation: JSONL/JSON formats, threshold validation
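The sum/average aggregation exercised by the LLM-as-a-judge tests can be sketched as follows. The criterion names and scores are hypothetical; the real rubric lives in the summary evaluation task.

```python
from statistics import mean


def aggregate(scores: list[dict[str, int]]) -> dict[str, float]:
    """Average each rubric criterion across videos, then sum into a total."""
    criteria = scores[0].keys()
    per_criterion = {c: mean(s[c] for s in scores) for c in criteria}
    per_criterion["total"] = sum(per_criterion.values())
    return per_criterion


# Hypothetical judge output for two videos.
judge_scores = [
    {"coverage": 4, "accuracy": 5, "tone": 3},
    {"coverage": 5, "accuracy": 4, "tone": 4},
]
print(aggregate(judge_scores))
# {'coverage': 4.5, 'accuracy': 4.5, 'tone': 3.5, 'total': 12.5}
```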
Format code using Ruff:
```bash
# Format all Python files
uv run ruff format
```

Pre-commit hooks automatically run Ruff on staged files before each commit.
Additional information can be found in the docs/ directory:
- Model Reference Guide - Model-specific requirements, limitations, and best practices
- Configuration Guide - Complete walkthrough of config file structure and options
- Cost Calculation Guide - Tracking and calculating costs for AWS, GCP, and model inference
- Contributing Guide - How to contribute new models, bug fixes, and other improvements
MediaPerf delivers a production-ready benchmark for real media tasks today. The following are areas where future iterations can extend its coverage and value. Contributions are welcome, whether that's code to this repo, licensed data for benchmarking, or joining our working group.
- No model size segmentation. Results compare models of different sizes on the same leaderboard without size-class groupings. Future iterations will introduce further models and parameter-based tiers for fairer comparison.
- Generative VLMs only. The benchmark currently evaluates vision-language models (e.g., Gemini, GPT, Qwen) on generative tasks (tagging, summarization). Encoder-based models used for embedding and search/retrieval workflows are not yet covered but are planned for future iterations.
- Short-form video only. The dataset includes short-form video only. Longer-form content (episodes, films, sports, news) is on the roadmap but not yet included. Additional licensed data is needed to expand coverage.
- Pipeline steps not decoupled. The pipeline does not explicitly isolate certain steps in pre-processing (e.g., frame sampling, resolution scaling), making it harder to attribute performance differences. Future iterations could introduce measurement at pipeline stages to enable comparative analysis of key optimizations.
- Limited output standardization. The benchmark does not enforce standardized output formats (e.g., timecodes, structured metadata) required for downstream media workflows.
- Hardware- and platform-dependent results. Cost and latency numbers are tied to specific cloud providers and instance types (e.g., GCP vs. AWS). Current results represent our best attempt to provide practical comparative measurements despite differences across environments.
- Pipelines reflect typical engineering effort. Inference pipelines were built using publicly available documentation and best practices such that they are representative of what a typical engineering team could stand up in a reasonable timeframe (not provider-specific optimizations inaccessible to most teams). Future iterations may include a provider-optimized task track, contingent on involvement from model providers and platforms.
The source code in this repository is licensed under the Apache License 2.0. See LICENSE for details.
Our human-annotated summaries and tags are licensed under the Creative Commons Attribution 4.0 International License (CC-BY 4.0). See LICENSE-DATA for details.
- Juan Aguilar (juan-co@coactive.ai) — Design & Implementation
- Seby Jacob (seby@coactive.ai) — Technical & Research Advisory
- Ali Harakeh (ali@coactive.ai) — Technical & Research Advisory
[1] Zaeem Hussain, Mingda Zhang, Xiaozhong Zhang, Keren Ye, Christopher Thomas, Zuha Agha, Nathan Ong, Adriana Kovashka. "Automatic Understanding of Image and Video Advertisements." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1705-1715. Link
