ICLR 2026
🌐 Homepage | 🤗 Models and Datasets | 📖 ArXiv
This repository contains the evaluation code for the paper "Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs". The code is based on VLMEvalKit.
Fully open multimodal large language models (MLLMs) currently lag behind proprietary counterparts, primarily due to a significant gap in data quality for supervised fine-tuning (SFT). Existing open-source datasets are often plagued by widespread noise and a critical deficit in complex reasoning data, such as Chain-of-Thought (CoT), which hinders the development of advanced model capabilities.
Addressing these challenges, our work makes three primary contributions. First, we introduce Honey-Data-15M, a new SFT dataset comprising approximately 15 million QA pairs, processed through multiple cleaning techniques and enhanced with a novel dual-level (short and long) CoT enrichment strategy. Second, we introduce HoneyPipe, the data curation pipeline, and its underlying framework DataStudio, providing the community with a transparent and adaptable methodology for data curation that moves beyond static dataset releases. Finally, to validate our dataset and pipeline, we train Bee-8B, an 8B model, on Honey-Data-15M.
Experiments show that Bee-8B establishes a new state-of-the-art (SOTA) for fully open MLLMs, achieving performance that is competitive with, and in some cases surpasses, recent semi-open models such as InternVL3.5-8B. Our work delivers to the community a suite of foundational resources, including: the Honey-Data-15M corpus; the full-stack suite comprising HoneyPipe and DataStudio; training recipes; an evaluation harness; and the model weights. This effort demonstrates that a principled focus on data quality is a key pathway to developing fully open MLLMs that are highly competitive with their semi-open counterparts.
Evaluation of Bee-8B against other MLLMs. We distinguish between fully open (*) and semi-open (†) models. The top and second-best scores for each benchmark are highlighted.
- New State-of-the-Art: Bee-8B establishes a new performance standard for fully open MLLMs, proving highly competitive with recent semi-open models across a wide array of benchmarks.
- Excellence in Complex Reasoning: Thanks to the CoT-enriched Honey-Data-15M, Bee-8B shows its most significant advancements in complex math and reasoning. It achieves top scores on challenging benchmarks like MathVerse, LogicVista, and DynaMath.
- Superior Document and Chart Understanding: The model demonstrates powerful capabilities in analyzing structured visual data, securing the top rank on the CharXiv benchmark for both descriptive and reasoning questions.
You need to prepare the following three environments:
- An environment for running VLMEvalKit

  Please follow the VLMEvalKit Official QuickStart.

- An environment for deploying Bee-8B with vLLM

  ```shell
  conda create -n [CONDA_ENV_FOR_TESTED_MODEL] python=3.12
  conda activate [CONDA_ENV_FOR_TESTED_MODEL]
  pip install "vllm>=0.11.1"
  ```

- An environment for deploying the judge model (Qwen3-32B) with SGLang

  ```shell
  conda create -n [CONDA_ENV_FOR_JUDGE_MODEL] python=3.12
  conda activate [CONDA_ENV_FOR_JUDGE_MODEL]
  pip install "sglang[all]>=0.4.6.post1"
  ```

  For more details about deploying Qwen3 with SGLang, please refer to the Qwen official documentation.
Note: If you are using other judge model APIs, you may skip this step and refer to the VLMEvalKit Official QuickStart.
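With the environments in place, the two model servers can be launched roughly as sketched below. This is an illustration only: the served model, port numbers, and tensor-parallel size are assumptions, not values prescribed by this repository.

```shell
# Sketch: serve Bee-8B via vLLM's OpenAI-compatible server
# (run inside CONDA_ENV_FOR_TESTED_MODEL; port is illustrative)
vllm serve Open-Bee/Bee-8B-RL --port 8000

# Sketch: serve the Qwen3-32B judge with SGLang
# (run inside CONDA_ENV_FOR_JUDGE_MODEL; port and --tp are illustrative)
python -m sglang.launch_server --model-path Qwen/Qwen3-32B --port 30000 --tp 4
```

The batch-evaluation script below activates these environments itself, so manual launching like this is only needed if you want to test a deployment in isolation.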
You need to fill in the names of three Conda environments in the config file (by default, batch_eval_config/example.sh): CONDA_ENV_FOR_VLMEVALKIT, CONDA_ENV_FOR_TESTED_MODEL, and CONDA_ENV_FOR_JUDGE_MODEL. For more details, see this section.
Run the command below to launch evaluations for Bee-8B-RL and Bee-8B-SFT across all benchmarks:
```shell
bash run_evaluation.sh
```

You can customize the evaluation by providing a configuration file (default: batch_eval_config/example.sh):

```shell
bash run_evaluation.sh /path/to/your/config_file.sh
```

The config file defines the following key variables:
- `MODELS`: An array of model names to evaluate. Each name must match a registered entry in `vlmeval/config.py`. This array must align one-to-one with `MODEL_PATHS` and `TIMESTAMPS`.
- `MODEL_PATHS`: An array of model deployment paths (either a Hugging Face repo like `Open-Bee/Bee-8B-RL` or a local path). These paths are used to deploy models via vLLM.
- `TIMESTAMPS`: An array used to determine the output directory for each model's results. Set this to resume from a previous run.
- `DATASETS`: An array of benchmarks to run. For the full list of VLMEvalKit-supported benchmarks and their dataset names, see VLMEvalKit Features.
- `CONDA_ENV_FOR_VLMEVALKIT`: The Conda environment name used to run VLMEvalKit.
- `CONDA_ENV_FOR_TESTED_MODEL`: The Conda environment name used to deploy Bee-8B with vLLM.
- `CONDA_ENV_FOR_JUDGE_MODEL`: The Conda environment name used to deploy the judge model (Qwen3-32B) with SGLang.
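As a concrete illustration, a minimal config file might look like the sketch below. The timestamps, dataset names, and environment names are placeholders of our own choosing; check VLMEvalKit's dataset registry and your local Conda setup for the real values.

```shell
# Hypothetical batch_eval_config sketch -- all values are illustrative.
# The three arrays must have the same length (one entry per model).
MODELS=("Bee-8B-RL" "Bee-8B-SFT")
MODEL_PATHS=("Open-Bee/Bee-8B-RL" "Open-Bee/Bee-8B-SFT")
TIMESTAMPS=("20250101_120000" "20250101_120000")

# Benchmarks to run; names must match VLMEvalKit's dataset registry.
DATASETS=("MathVista_MINI" "MMStar")

# Conda environment names prepared in the setup steps above.
CONDA_ENV_FOR_VLMEVALKIT="vlmevalkit"
CONDA_ENV_FOR_TESTED_MODEL="bee8b"
CONDA_ENV_FOR_JUDGE_MODEL="judge"
```

Keeping `TIMESTAMPS` fixed across runs lets the script reuse the same output directory and resume partially completed evaluations.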
Based on VLMEvalKit, we adjusted some evaluation details for several benchmarks as described below:
- MathVerse
  - Improved judge prompting: We added more few-shot examples for the judge model and refined the instructions to improve judging accuracy.
  - More robust answer extraction: During the judge stage, when extracting the final answer from a model response, we first try MathRuler (a rule-based extractor) and fall back to the original LLM-based extraction only if it fails. This improves extraction reliability.
  - More accurate correctness checking with fewer LLM calls: Previously, correctness checking proceeded as (1) character-level exact match → (2) LLM-based judgment. We now insert a MathRuler check between (1) and (2), which reduces judge-LLM usage and improves correctness decisions.
- CountBenchQA, ChartQA, DocVQA, InfoVQA

  We switched these benchmarks to LLM-based judging. (In VLMEvalKit, they were originally judged with rule-based methods.)
BibTeX:
@article{zhang2025bee,
title={Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs},
author={Zhang, Yi and Ni, Bolin and Chen, Xin-Sheng and Zhang, Heng-Rui and Rao, Yongming and Peng, Houwen and Lu, Qinglin and Hu, Han and Guo, Meng-Hao and Hu, Shi-Min},
journal={arXiv preprint arXiv:2510.13795},
year={2025}
}