
Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs

ICLR 2026

🌐 Homepage | 🤗 Models and Datasets | 📖 ArXiv

This repository contains the evaluation code for the paper "Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs". The code is based on VLMEvalKit (https://github.com/open-compass/VLMEvalKit).


📄 Abstract

Fully open multimodal large language models (MLLMs) currently lag behind proprietary counterparts, primarily due to a significant gap in data quality for supervised fine-tuning (SFT). Existing open-source datasets are often plagued by widespread noise and a critical deficit in complex reasoning data, such as Chain-of-Thought (CoT), which hinders the development of advanced model capabilities.

Addressing these challenges, our work makes three primary contributions. First, we introduce Honey-Data-15M, a new SFT dataset comprising approximately 15 million QA pairs, processed through multiple cleaning techniques and enhanced with a novel dual-level (short and long) CoT enrichment strategy. Second, we introduce HoneyPipe, the data curation pipeline, and its underlying framework DataStudio, providing the community with a transparent and adaptable methodology for data curation that moves beyond static dataset releases. Finally, to validate our dataset and pipeline, we train Bee-8B, an 8B model on Honey-Data-15M.

Experiments show that Bee-8B establishes a new state-of-the-art (SOTA) for fully open MLLMs, achieving performance that is competitive with, and in some cases surpasses, recent semi-open models such as InternVL3.5-8B. Our work delivers to the community a suite of foundational resources, including: the Honey-Data-15M corpus; the full-stack suite comprising HoneyPipe and DataStudio; training recipes; an evaluation harness; and the model weights. This effort demonstrates that a principled focus on data quality is a key pathway to developing fully open MLLMs that are highly competitive with their semi-open counterparts.

🐝 Experimental Results


Evaluation of Bee-8B against other MLLMs. We distinguish between fully open (*) and semi-open (†) models. The top and second-best scores for each benchmark are highlighted.
  1. New State-of-the-Art: Bee-8B establishes a new performance standard for fully open MLLMs, proving highly competitive with recent semi-open models across a wide array of benchmarks.
  2. Excellence in Complex Reasoning: Thanks to the CoT-enriched Honey-Data-15M, Bee-8B shows its most significant advancements in complex math and reasoning. It achieves top scores on challenging benchmarks like MathVerse, LogicVista, and DynaMath.
  3. Superior Document and Chart Understanding: The model demonstrates powerful capabilities in analyzing structured visual data, securing the top rank on the CharXiv benchmark for both descriptive and reasoning questions.

⚙️ Installation

You need to prepare the following three environments (an illustrative end-to-end setup sketch follows the list):

  1. An environment for running VLMEvalKit

    Please follow the VLMEvalKit Official QuickStart.

  2. An environment for deploying Bee-8B with vLLM

    conda create -n [CONDA_ENV_FOR_TESTED_MODEL] python=3.12
    conda activate [CONDA_ENV_FOR_TESTED_MODEL]
    pip install "vllm>=0.11.1"
  3. An environment for deploying the judge model (Qwen3-32B) with SGLang

    conda create -n [CONDA_ENV_FOR_JUDGE_MODEL] python=3.12
    conda activate [CONDA_ENV_FOR_JUDGE_MODEL]
    pip install "sglang[all]>=0.4.6.post1"

    For more details about deploying Qwen3 using SGLang, please refer to the Qwen official documentation.

    Note: If you are using other judge model APIs, you may skip this step and refer to the VLMEvalKit Official QuickStart.
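
Deployment is driven by run_evaluation.sh using the Conda environments above, so manual serving is not required. If you want to sanity-check each environment first, here is a minimal sketch; the environment names, Python versions, ports, and parallelism flags are illustrative assumptions, not values prescribed by this repo:

    # 1. VLMEvalKit environment (name and Python version assumed);
    #    install this repo in editable mode per the VLMEvalKit QuickStart
    conda create -n vlmevalkit python=3.10
    conda activate vlmevalkit
    pip install -e .

    # 2. Manually serve Bee-8B with vLLM to verify the environment
    #    (port is illustrative; add flags such as --trust-remote-code if needed)
    conda activate bee_vllm
    vllm serve Open-Bee/Bee-8B-RL --port 8000

    # 3. Manually serve the judge model (Qwen3-32B) with SGLang
    #    (port and tensor-parallel size are illustrative)
    conda activate judge_sglang
    python -m sglang.launch_server --model-path Qwen/Qwen3-32B --port 30000 --tp 2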

You need to fill in the names of the three Conda environments in the config file (by default, batch_eval_config/example.sh): CONDA_ENV_FOR_VLMEVALKIT, CONDA_ENV_FOR_TESTED_MODEL, and CONDA_ENV_FOR_JUDGE_MODEL. For more details, see the "Customize Your Evaluation Using a Config File" section below.

🧠 Inference & Evaluation

TL;DR

Run the command below to launch evaluations for Bee-8B-RL and Bee-8B-SFT across all benchmarks:

bash run_evaluation.sh

Customize Your Evaluation Using a Config File

You can customize the evaluation by providing a configuration file (default: batch_eval_config/example.sh):

bash run_evaluation.sh /path/to/your/config_file.sh
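
A common workflow (the copied file name my_eval.sh is just an illustrative choice) is to start from the shipped example and edit it:

    cp batch_eval_config/example.sh batch_eval_config/my_eval.sh
    # edit the arrays and environment names described below, then:
    bash run_evaluation.sh batch_eval_config/my_eval.sh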

The config file defines the following key variables (a hypothetical example is sketched after the list):

  1. MODELS: An array of model names to evaluate. Each name must match a registered entry in vlmeval/config.py. This array must align one-to-one with MODEL_PATHS and TIMESTAMPS.

  2. MODEL_PATHS: An array of model deployment paths (either a Hugging Face repo like Open-Bee/Bee-8B-RL or a local path). These paths are used to deploy models via vLLM.

  3. TIMESTAMPS: An array used to determine the output directory for each model’s results. Set this to resume from a previous run.

  4. DATASETS: An array of benchmarks to run. For the full list of VLMEvalKit-supported benchmarks and their dataset names, see VLMEvalKit Features.

  5. CONDA_ENV_FOR_VLMEVALKIT: The Conda environment name used to run VLMEvalKit.

  6. CONDA_ENV_FOR_TESTED_MODEL: The Conda environment name used to deploy Bee-8B with vLLM.

  7. CONDA_ENV_FOR_JUDGE_MODEL: The Conda environment name used to deploy the judge model (Qwen3-32B) with SGLang.
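
For reference, here is a hypothetical sketch of such a config file. The timestamp values, dataset names, and environment names are illustrative assumptions; the model names come from this repo's TL;DR and must match entries registered in vlmeval/config.py:

    # Hypothetical batch_eval_config sketch; all concrete values are illustrative
    MODELS=("Bee-8B-RL" "Bee-8B-SFT")                          # registered in vlmeval/config.py
    MODEL_PATHS=("Open-Bee/Bee-8B-RL" "Open-Bee/Bee-8B-SFT")   # HF repos or local paths (served via vLLM)
    TIMESTAMPS=("20250101_120000" "20250101_120000")           # output dirs; reuse to resume a run
    DATASETS=("MathVerse_MINI" "ChartQA_TEST")                 # VLMEvalKit dataset names (assumed)
    CONDA_ENV_FOR_VLMEVALKIT="vlmevalkit"
    CONDA_ENV_FOR_TESTED_MODEL="bee_vllm"
    CONDA_ENV_FOR_JUDGE_MODEL="judge_sglang"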

📝 Notes on Changes to Evaluation Settings

Relative to the upstream VLMEvalKit defaults, we adjusted some evaluation details for several benchmarks, as described below:

  1. MathVerse

    1. Improved judge prompting: We added additional few-shot examples to the judge prompt and refined its instructions to improve judging accuracy.
    2. More robust answer extraction: During the judge stage, when extracting the final answer from a model response, we first try MathRuler, a rule-based extractor; only if it fails do we fall back to the original LLM-based extraction. This improves extraction reliability.
    3. More accurate correctness checking with fewer LLM calls: Correctness checking previously proceeded from (1) character-level exact match to (2) LLM-based judgment. We now insert a MathRuler check between the two, which reduces judge-LLM usage and improves correctness decisions.
  2. CountBenchQA, ChartQA, DocVQA, InfoVQA

    We switched these benchmarks to LLM-based judging. (In VLMEvalKit, these were originally judged using rule-based methods.)
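
To re-run only the benchmarks whose judging settings changed, a config's DATASETS array could name just these. The identifiers below follow common VLMEvalKit naming conventions but are assumptions; verify them against VLMEvalKit Features before use:

    # Dataset names assumed from VLMEvalKit conventions; verify before use
    DATASETS=("MathVerse_MINI" "CountBenchQA" "ChartQA_TEST" "DocVQA_VAL" "InfoVQA_VAL")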

📚 Citation

BibTeX:

@article{zhang2025bee,
  title={Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs},
  author={Zhang, Yi and Ni, Bolin and Chen, Xin-Sheng and Zhang, Heng-Rui and Rao, Yongming and Peng, Houwen and Lu, Qinglin and Hu, Han and Guo, Meng-Hao and Hu, Shi-Min},
  journal={arXiv preprint arXiv:2510.13795},
  year={2025}
}
