
[Benchmark] Add DA-2K, ERQA, and RefSpatialBench support#1461

Merged
FangXinyu-0913 merged 5 commits into open-compass:main from Qi-Zhangyang:add-da2k
Mar 4, 2026

Conversation

@Qi-Zhangyang (Contributor) commented Feb 28, 2026

Description

This PR adds support for three new benchmarks to VLMEvalKit:

1. DA-2K

  • Paper: DA-2K: A Challenging Benchmark for Relative Depth Estimation (NeurIPS 2024)
  • Dataset: https://huggingface.co/datasets/depth-anything/DA-2K
  • Features:
    • Relative depth estimation in VQA format
    • 1K images with 2K annotation pairs across 8 scene categories
    • Ground truth: which of two annotated points is closer to the camera
  • Eval Metric: Accuracy

2. ERQA (Embodied Reasoning QA)

  • Paper: Gemini Robotics: Bringing AI into the Physical World (2025)
  • Dataset: https://huggingface.co/datasets/RunsenXu/ERQA
  • Features:
    • VQA format with <image> placeholders
    • Supports multi-image questions
    • 400 samples covering spatial reasoning and world knowledge
    • 8 categories: Trajectory Reasoning, Action Reasoning, Pointing, State Estimation, Spatial Reasoning, Multi-view Reasoning, Task Reasoning, Other
  • Eval Metric: Accuracy (overall + per-category)
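The overall + per-category accuracy metric can be sketched as follows. This is an illustrative helper, not the PR's actual code; the record schema (`category`, `pred`, `answer` keys) is an assumption.

```python
from collections import defaultdict

def accuracy_report(records):
    """Overall and per-category accuracy, as used for ERQA-style scoring.

    `records` is a hypothetical list of dicts with 'category', 'pred',
    and 'answer' keys; the PR's actual result schema may differ.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        correct = r["pred"] == r["answer"]
        # Count each sample once overall and once under its category.
        for key in ("Overall", r["category"]):
            hits[key] += correct
            totals[key] += 1
    return {k: hits[k] / totals[k] for k in totals}
```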

3. RefSpatialBench

  • Paper: RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics (NeurIPS 2025)
  • Dataset: https://huggingface.co/datasets/BAAI/RefSpatial-Bench
  • Features:
    • Point coordinate prediction tasks
    • Mask-based evaluation (Success Rate)
    • 3 splits: Location (100), Placement (100), Unseen (77)
    • Supports multiple output formats: JSON [(x, y)], Qwen3-VL [{"point_2d": [x, y]}], Gemini [{"point": [y, x]}], Molmo XML <points x1="50" y1="50"/>
  • Eval Metric: Success Rate (predicted point falls within ground-truth mask)
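Parsing the four output formats listed above can be sketched like this. This is a hypothetical helper for illustration, not the PR's `_parse_prediction`; note the y-before-x ordering in the Gemini format.

```python
import json
import re

def parse_point(text):
    """Extract a predicted (x, y) point from common VLM output formats.

    Illustrative sketch only; the PR's actual parser may differ.
    Handles Qwen3-VL [{"point_2d": [x, y]}], Gemini [{"point": [y, x]}],
    Molmo XML <points x1="..." y1="..."/>, and plain (x, y) tuples.
    Returns (x, y) as floats, or None if no point is found.
    """
    # JSON list-of-dicts styles (Qwen3-VL / Gemini)
    m = re.search(r'\[\s*\{.*?\}\s*\]', text, re.DOTALL)
    if m:
        try:
            obj = json.loads(m.group(0))[0]
            if "point_2d" in obj:              # Qwen3-VL: [x, y]
                x, y = obj["point_2d"]
                return float(x), float(y)
            if "point" in obj:                 # Gemini: [y, x] -- swapped
                y, x = obj["point"]
                return float(x), float(y)
        except (json.JSONDecodeError, KeyError, ValueError):
            pass
    # Molmo XML style: <points x1="50" y1="50"/>
    m = re.search(r'x1="([\d.]+)"\s+y1="([\d.]+)"', text)
    if m:
        return float(m.group(1)), float(m.group(2))
    # Plain tuple style: [(x, y)] or (x, y)
    m = re.search(r'\(\s*([\d.]+)\s*,\s*([\d.]+)\s*\)', text)
    if m:
        return float(m.group(1)), float(m.group(2))
    return None
```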

Benchmark Results (Qwen3-VL-235B)

| Benchmark | Ours (Qwen3-VL-235B-Instruct) | Reference |
| --- | --- | --- |
| DA-2K | 68.78 | 70.3 (Claude Opus 4.5, Seed 2.0 Report) |
| ERQA | 48.5 | 43.0 (Qwen3-VL-235B-Instruct, Qwen3-VL Report) |
| RefSpatialBench | 66.0 | 65.5 (Qwen3-VL-235B-Instruct, Qwen3-VL Report) |

Note: RefSpatialBench results vary depending on the distance tolerance threshold used for evaluation. Our implementation supports both strict mask containment and centroid-distance-based evaluation. The table below shows the sensitivity to this threshold, where ≤ 0.07 aligns with the Qwen3-VL report.

| Threshold | Score |
| --- | --- |
| Strict mask containment | 50.5 |
| ≤ 0.05 (5% of image width) | 59.0 |
| ≤ 0.07 (7% of image width) | 66.0 |
| ≤ 0.10 (10% of image width) | 74.0 |
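The two evaluation modes can be sketched as below. This is a minimal illustration under assumed conventions (boolean mask indexed `[row, col]`, threshold normalized by image width), not the PR's exact implementation.

```python
import numpy as np

def point_success(pred_xy, mask, threshold=None):
    """Success test for a predicted point against a ground-truth mask.

    Sketch of the two modes described above: strict mask containment
    (threshold=None), or distance to the mask centroid within
    `threshold` * image width. `mask` is a boolean (H, W) array.
    """
    h, w = mask.shape
    x, y = int(round(pred_xy[0])), int(round(pred_xy[1]))
    if threshold is None:
        # Strict containment: the point must land on a mask pixel.
        return 0 <= x < w and 0 <= y < h and bool(mask[y, x])
    # Centroid mode: distance from mask centroid, normalized by width.
    ys, xs = np.nonzero(mask)
    cx, cy = xs.mean(), ys.mean()
    return bool(np.hypot(x - cx, y - cy) <= threshold * w)
```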

Usage

```bash
# DA-2K
python run.py --data DA-2K --model gpt-4o

# ERQA
python run.py --data ERQA --model gpt-4o

# RefSpatialBench
python run.py --data RefSpatial-Location --model gpt-4o
python run.py --data RefSpatial-Placement --model gpt-4o
python run.py --data RefSpatial-Unseen --model gpt-4o
```

moonshot and others added 3 commits February 28, 2026 17:30
Add support for DA-2K (NeurIPS 2024), a challenging benchmark for relative depth estimation.

Features:
- Auto-download from HuggingFace (DepthAnything/DA-2K)
- 1K images, 2K annotation pairs
- 8 scene categories
- VQA format for depth estimation tasks

Usage:
  python run.py --data DA-2K --model YOUR_MODEL
Add support for two new benchmarks:
- ERQA: Embodied Reasoning QA Evaluation Dataset (Gemini Robotics)
- RefSpatialBench: Multi-step Spatial Referring with Reasoning (NeurIPS 2025)

Combined with existing DA-2K support (commit 3130988), this branch now supports:
- DA-2K: Dense Annotation to 2K
- ERQA: 400 samples of embodied reasoning
- RefSpatialBench: 277 samples of spatial referring

Note: This PR should be merged together with DA-2K support
DA-2K:
- Fix HuggingFace repo ID (DepthAnything -> depth-anything)
- Switch from snapshot_download to hf_hub_download + ZIP extraction
- Rewrite load_data to read TSV directly with absolute path handling
- Filter inaccessible images (macOS Unicode NFD/NFC normalization issue)
- Simplify build_prompt to use file path directly with point1/point2 suffix
- Fix evaluate to use regex matching for point1/point2 extraction
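The NFD/NFC filtering mentioned above stems from macOS filesystems storing filenames in decomposed (NFD) form while annotation files often carry composed (NFC) strings, so a byte-for-byte path lookup can miss files that exist. A minimal sketch of the workaround (an illustrative helper, not the PR's code):

```python
import os
import unicodedata

def resolve_path(path):
    """Return an on-disk path, trying both Unicode normalization forms.

    Sketch of the macOS NFD/NFC issue: try the NFC and NFD forms of
    the path and return whichever exists; return None if neither does,
    so the caller can filter the image out.
    """
    for form in ("NFC", "NFD"):
        candidate = unicodedata.normalize(form, path)
        if os.path.exists(candidate):
            return candidate
    return None  # image is genuinely inaccessible; filter it out
```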

RefSpatialBench:
- Update prompt template to Qwen3-VL native point_2d format
- Always rebuild question from current template (ignore stale TSV column)
- Add failure_reason tracking in evaluate for better error analysis
- Add point_2d format parsing support in _parse_prediction

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@FangXinyu-0913 (Collaborator)

Hi @Qi-Zhangyang! Thanks for your contribution. Please help fix lint first.

@Qi-Zhangyang (Contributor, Author)

Finished

@FangXinyu-0913 FangXinyu-0913 merged commit 8a904f5 into open-compass:main Mar 4, 2026
