
[Benchmark] Add DA-2K, ERQA, and RefSpatialBench support#1461

Merged
FangXinyu-0913 merged 5 commits into open-compass:main from Qi-Zhangyang:add-da2k
Mar 4, 2026

Conversation

@Qi-Zhangyang (Contributor) commented Feb 28, 2026

Description

This PR adds support for three new benchmarks to VLMEvalKit:

1. DA-2K

  • Paper: DA-2K: A Challenging Benchmark for Relative Depth Estimation (NeurIPS 2024)
  • Dataset: https://huggingface.co/datasets/depth-anything/DA-2K
  • Features:
    • Relative depth estimation in VQA format
    • 1K images with 2K annotation pairs across 8 scene categories
    • Ground truth: which of two annotated points is closer to the camera
  • Eval Metric: Accuracy

2. ERQA (Embodied Reasoning QA)

  • Paper: Gemini Robotics: Bringing AI into the Physical World (2025)
  • Dataset: https://huggingface.co/datasets/RunsenXu/ERQA
  • Features:
    • VQA format with <image> placeholders
    • Supports multi-image questions
    • 400 samples covering spatial reasoning and world knowledge
    • 8 categories: Trajectory Reasoning, Action Reasoning, Pointing, State Estimation, Spatial Reasoning, Multi-view Reasoning, Task Reasoning, Other
  • Eval Metric: Accuracy (overall + per-category)
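The overall + per-category accuracy metric can be sketched as follows. This is an illustrative helper, not the PR's actual code; the record schema (`category`, `pred`, `answer` keys) is an assumption.

```python
from collections import defaultdict

def accuracy_report(records):
    """Overall and per-category accuracy, as used for ERQA-style scoring.

    `records` is a hypothetical list of dicts with 'category', 'pred',
    and 'answer' keys; the PR's actual result schema may differ.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        correct = r["pred"] == r["answer"]
        # Count each sample once overall and once under its category.
        for key in ("Overall", r["category"]):
            hits[key] += correct
            totals[key] += 1
    return {k: hits[k] / totals[k] for k in totals}
```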

3. RefSpatialBench

  • Paper: RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics (NeurIPS 2025)
  • Dataset: https://huggingface.co/datasets/BAAI/RefSpatial-Bench
  • Features:
    • Point coordinate prediction tasks
    • Mask-based evaluation (Success Rate)
    • 3 splits: Location (100), Placement (100), Unseen (77)
    • Supports multiple output formats: JSON [(x, y)], Qwen3-VL [{"point_2d": [x, y]}], Gemini [{"point": [y, x]}], Molmo XML <points x1="50" y1="50"/>
  • Eval Metric: Success Rate (predicted point falls within ground-truth mask)
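Parsing the four output formats listed above can be sketched like this. This is a hypothetical helper for illustration, not the PR's `_parse_prediction`; note the y-before-x ordering in the Gemini format.

```python
import json
import re

def parse_point(text):
    """Extract a predicted (x, y) point from common VLM output formats.

    Illustrative sketch only; the PR's actual parser may differ.
    Handles Qwen3-VL [{"point_2d": [x, y]}], Gemini [{"point": [y, x]}],
    Molmo XML <points x1="..." y1="..."/>, and plain (x, y) tuples.
    Returns (x, y) as floats, or None if no point is found.
    """
    # JSON list-of-dicts styles (Qwen3-VL / Gemini)
    m = re.search(r'\[\s*\{.*?\}\s*\]', text, re.DOTALL)
    if m:
        try:
            obj = json.loads(m.group(0))[0]
            if "point_2d" in obj:              # Qwen3-VL: [x, y]
                x, y = obj["point_2d"]
                return float(x), float(y)
            if "point" in obj:                 # Gemini: [y, x] -- swapped
                y, x = obj["point"]
                return float(x), float(y)
        except (json.JSONDecodeError, KeyError, ValueError):
            pass
    # Molmo XML style: <points x1="50" y1="50"/>
    m = re.search(r'x1="([\d.]+)"\s+y1="([\d.]+)"', text)
    if m:
        return float(m.group(1)), float(m.group(2))
    # Plain tuple style: [(x, y)] or (x, y)
    m = re.search(r'\(\s*([\d.]+)\s*,\s*([\d.]+)\s*\)', text)
    if m:
        return float(m.group(1)), float(m.group(2))
    return None
```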

Benchmark Results (Qwen3-VL-235B)

| Benchmark | Ours (Qwen3-VL-235B-Instruct) | Reference |
| --- | --- | --- |
| DA-2K | 68.78 | 70.3 (Claude Opus 4.5, Seed 2.0 Report) |
| ERQA | 48.5 | 43.0 (Qwen3-VL-235B-Instruct, Qwen3-VL Report) |
| RefSpatialBench | 66.0 | 65.5 (Qwen3-VL-235B-Instruct, Qwen3-VL Report) |

Note: RefSpatialBench results vary depending on the distance tolerance threshold used for evaluation. Our implementation supports both strict mask containment and centroid-distance-based evaluation. The table below shows the sensitivity to this threshold, where ≤ 0.07 aligns with the Qwen3-VL report.

| Threshold | Score |
| --- | --- |
| Strict mask containment | 50.5 |
| ≤ 0.05 (5% of image width) | 59.0 |
| ≤ 0.07 (7% of image width) | 66.0 |
| ≤ 0.10 (10% of image width) | 74.0 |
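The two evaluation modes can be sketched as below. This is a minimal illustration under assumed conventions (boolean mask indexed `[row, col]`, threshold normalized by image width), not the PR's exact implementation.

```python
import numpy as np

def point_success(pred_xy, mask, threshold=None):
    """Success test for a predicted point against a ground-truth mask.

    Sketch of the two modes described above: strict mask containment
    (threshold=None), or distance to the mask centroid within
    `threshold` * image width. `mask` is a boolean (H, W) array.
    """
    h, w = mask.shape
    x, y = int(round(pred_xy[0])), int(round(pred_xy[1]))
    if threshold is None:
        # Strict containment: the point must land on a mask pixel.
        return 0 <= x < w and 0 <= y < h and bool(mask[y, x])
    # Centroid mode: distance from mask centroid, normalized by width.
    ys, xs = np.nonzero(mask)
    cx, cy = xs.mean(), ys.mean()
    return bool(np.hypot(x - cx, y - cy) <= threshold * w)
```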

Usage

```bash
# DA-2K
python run.py --data DA-2K --model gpt-4o

# ERQA
python run.py --data ERQA --model gpt-4o

# RefSpatialBench
python run.py --data RefSpatial-Location --model gpt-4o
python run.py --data RefSpatial-Placement --model gpt-4o
python run.py --data RefSpatial-Unseen --model gpt-4o
```

moonshot and others added 3 commits February 28, 2026 17:30
Add support for DA-2K (NeurIPS 2024), a challenging benchmark for relative depth estimation.

Features:
- Auto-download from HuggingFace (DepthAnything/DA-2K)
- 1K images, 2K annotation pairs
- 8 scene categories
- VQA format for depth estimation tasks

Usage:
  python run.py --data DA-2K --model YOUR_MODEL
Add support for two new benchmarks:
- ERQA: Embodied Reasoning QA Evaluation Dataset (Gemini Robotics)
- RefSpatialBench: Multi-step Spatial Referring with Reasoning (NeurIPS 2025)

Combined with existing DA-2K support (commit 3130988), this branch now supports:
- DA-2K: Dense Annotation to 2K
- ERQA: 400 samples of embodied reasoning
- RefSpatialBench: 277 samples of spatial referring

Note: This PR should be merged together with DA-2K support
DA-2K:
- Fix HuggingFace repo ID (DepthAnything -> depth-anything)
- Switch from snapshot_download to hf_hub_download + ZIP extraction
- Rewrite load_data to read TSV directly with absolute path handling
- Filter inaccessible images (macOS Unicode NFD/NFC normalization issue)
- Simplify build_prompt to use file path directly with point1/point2 suffix
- Fix evaluate to use regex matching for point1/point2 extraction
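The NFD/NFC filtering mentioned above stems from macOS filesystems storing filenames in decomposed (NFD) form while annotation files often carry composed (NFC) strings, so a byte-for-byte path lookup can miss files that exist. A minimal sketch of the workaround (an illustrative helper, not the PR's code):

```python
import os
import unicodedata

def resolve_path(path):
    """Return an on-disk path, trying both Unicode normalization forms.

    Sketch of the macOS NFD/NFC issue: try the NFC and NFD forms of
    the path and return whichever exists; return None if neither does,
    so the caller can filter the image out.
    """
    for form in ("NFC", "NFD"):
        candidate = unicodedata.normalize(form, path)
        if os.path.exists(candidate):
            return candidate
    return None  # image is genuinely inaccessible; filter it out
```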

RefSpatialBench:
- Update prompt template to Qwen3-VL native point_2d format
- Always rebuild question from current template (ignore stale TSV column)
- Add failure_reason tracking in evaluate for better error analysis
- Add point_2d format parsing support in _parse_prediction

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@FangXinyu-0913 (Collaborator)

Hi @Qi-Zhangyang! Thanks for your contribution. Please help fix lint first.

@Qi-Zhangyang (Contributor, Author)

Finished

@FangXinyu-0913 FangXinyu-0913 merged commit 8a904f5 into open-compass:main Mar 4, 2026
