What constitutes a good science benchmark?
Junying Wang*,
Zicheng Zhang*#,
Yijin Guo*,
Farong Wen*,
Ye Shen,
Yingji Liang,
Shanghai Artificial Intelligence Laboratory
*Equal contribution. #Corresponding author.
Paper | Github | Team Work | Huggingface
- [2026/01/16] 🔥 Our EESE-V3 is now available online: EESE-Dataset.
- [2025/10/15] 🔥 Our EESE-V2 is now available online: EESE-Dataset.
- [2025/07/30] 🔥 Evaluation is available on OpenCompass.
- [2025/07/24] 🔥 Our quick-start guide is available online: EESE-Quick-Start.
- [2025/07/23] 🔥 Our EESE-V1 is now available online: EESE-Dataset.
- [2025/07/22] 🔥 Our paper is available online: EESE-Paper.
- A large-scale, high-quality science benchmark pool: We construct EESE-Pool, a pool of 100K+ science question-answer pairs spanning 5 disciplines and 500+ subfields, with diverse formats and rigorous quality control. We design a three-stage Data Engine (Transcription, Expansion, and Categorization) and a Data Refinement stage (a Parallel Three-Branch Refinement Framework) to ensure range, reach, and rigor.
- A dynamic, leakage-resilient evaluation set: We propose EESE, a 500-instance subset that is periodically refreshed by resampling 500 instances from the EESE-Pool, maintaining representativeness while reducing leakage risk and evaluation overhead (see the resampling sketch after this list).
- Comprehensive evaluation of LLMs: We evaluate 32 leading models (open- and closed-source) on EESE-Pool and EESE, revealing significant performance gaps across disciplines, the effectiveness of refinement in improving quality, and the trade-offs between inference cost and scientific ability. The findings offer insights for future science benchmarks.
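The resampling step above can be pictured with a minimal sketch. This is not the released pipeline: the JSONL pool file, the `discipline` field name, and the discipline-balancing rule are all assumptions made for illustration.

```python
# Hypothetical sketch of the periodic EESE resampling step; file names and
# the "discipline" field are assumptions, not the released implementation.
import json
import random
from collections import defaultdict

def resample_eese(pool_path: str, out_path: str, n_total: int = 500, seed: int = 0) -> None:
    """Draw a fresh n_total-instance EESE subset from the EESE-Pool,
    roughly balanced across disciplines."""
    with open(pool_path, encoding="utf-8") as f:
        pool = [json.loads(line) for line in f]

    # Group the pool by discipline so every discipline is represented.
    by_discipline = defaultdict(list)
    for item in pool:
        by_discipline[item["discipline"]].append(item)

    rng = random.Random(seed)
    quota = n_total // len(by_discipline)
    subset = []
    for items in by_discipline.values():
        subset.extend(rng.sample(items, min(quota, len(items))))

    # Top up with random leftovers if integer division left the subset short.
    if len(subset) < n_total:
        chosen_ids = {id(x) for x in subset}
        leftovers = [x for x in pool if id(x) not in chosen_ids]
        subset.extend(rng.sample(leftovers, n_total - len(subset)))

    with open(out_path, "w", encoding="utf-8") as f:
        for item in subset[:n_total]:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")

# Example: resample_eese("eese_pool.jsonl", "EESE.jsonl", seed=2025)
```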
| Model | Org. | Params | Open. | Overall |
|---|---|---|---|---|
| Doubao-1-5-Pro-32K | ByteDance | N/A | ❌ | 0.3606 |
| GPT-5 | OpenAI | N/A | ❌ | 0.3520 |
| Kimi-K2-0905-preview | Moonshot | N/A | ❌ | 0.3424 |
| Gemini-2.5-flash | Google | N/A | ❌ | 0.3250 |
| Qwen3-235B-A22B-Instruct | Alibaba | 235B | ❌ | 0.3076 |
| GLM-4.5v | Zhipu AI | N/A | ❌ | 0.3012 |
| Deepseek-R1 | Deepseek | 671B | ✅ | 0.2932 |
| Deepseek-V3 | Deepseek | N/A | ❌ | 0.2756 |
| O3 | OpenAI | N/A | ❌ | 0.2594 |
| Gemini-2.5-pro | Google | N/A | ❌ | 0.2424 |
| Kimi-K2-0711 | Moonshot | 1.01T | ✅ | 0.2230 |
| Grok-4 | xAI | N/A | ❌ | 0.1920 |
| Qwen2.5-72B-Instruct | Alibaba | 72B | ❌ | 0.1906 |
| Claude-3-7-sonnet | Anthropic | N/A | ❌ | 0.1452 |

| Model | Org. | Params | Open. | Overall |
|---|---|---|---|---|
| Claude-3-7-sonnet | Anthropic | N/A | ❌ | 0.1452 |
| Gemini-2.5-pro | Google | N/A | ❌ | 0.2424 |
| GPT-5 | OpenAI | N/A | ❌ | 0.2620 |
| Kimi-K2-0711 | Moonshot | 1.01T | ✅ | 0.2230 |
| O3 | OpenAI | N/A | ❌ | 0.2594 |
| Deepseek-R1 | Deepseek | 671B | ✅ | 0.1916 |
| Grok-4 | xAI | N/A | ❌ | 0.1920 |
Note: GPT-5 and Kimi-K2-0711 are models released after the V1 leaderboard.

| Model | Org. | Params | Open. | Overall |
|---|---|---|---|---|
| Claude-3-7-sonnet | Anthropic | N/A | ❌ | 0.2648 |
| Gemini-2.5-pro | Google | N/A | ❌ | 0.3813 |
| O3 | OpenAI | N/A | ❌ | 0.4025 |
| Deepseek-R1 | Deepseek | 671B | ✅ | 0.3251 |
| Grok-4 | xAI | N/A | ❌ | 0.3442 |
```bash
pip install -r requirements.txt
```

Edit `config.py` and replace the API keys:

```python
LLM_CONFIG = {
"base_url": "https://api.openai.com/v1",
"api_key": "your-actual-api-key-here", # Replace with your API key
"model": "model-name",
"temperature": 0.0,
"max_retries": 3
}
JUDGE_LLM_CONFIG = {
"base_url": "https://api.openai.com/v1",
"api_key": "your-actual-api-key-here", # Replace with your API key
"model": "model-name",
"temperature": 0.0,
"max_retries": 3
}
```

Ensure your `EESE.jsonl` file is in the project directory. You can download it from Huggingface.
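As a quick sanity check that the data file is in place and readable, something like the following can be used; the exact JSONL field names are an assumption, so inspect the first record yourself.

```python
# Load the benchmark file and peek at one record; field names in the real
# EESE.jsonl may differ, so print the first item to see the actual schema.
import json

with open("EESE.jsonl", encoding="utf-8") as f:
    questions = [json.loads(line) for line in f]

print(f"Loaded {len(questions)} questions")
print(questions[0])  # e.g. question text, reference answer, discipline
```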
Run the evaluation:

```bash
python main.py
```

Project structure:

```
├── code                  # Code folder
├── main.py               # Main evaluation script
├── inference.py          # Core inference functions
├── config.py             # Configuration settings
├── call.py               # LLM API calling functions
├── llm_information.py    # LLM client setup
├── utils.py              # Utility functions
├── requirements.txt      # Python dependencies
├── EESE.jsonl            # Input data file
├── log/                  # Log files directory
└── results/              # Output results directory
```
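For orientation only, here is a minimal sketch of the kind of OpenAI-compatible call the inference code makes with `LLM_CONFIG`; the actual logic lives in `call.py` and `inference.py`, and the helper name below is made up.

```python
# Illustrative use of LLM_CONFIG with the OpenAI-compatible client; this is
# not the repository's code, just a sketch of the call pattern.
from openai import OpenAI

from config import LLM_CONFIG

client = OpenAI(
    base_url=LLM_CONFIG["base_url"],
    api_key=LLM_CONFIG["api_key"],
    max_retries=LLM_CONFIG["max_retries"],
)

def ask_model(question: str) -> str:
    """Send one EESE question to the model under test and return its answer."""
    response = client.chat.completions.create(
        model=LLM_CONFIG["model"],
        temperature=LLM_CONFIG["temperature"],
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content
```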
After running the evaluation, you'll get:
- Log File (`log/evaluation.log`): Detailed processing logs
- Detailed Results (`results/detailed_results.json`): Complete evaluation data
- Summary Results (`results/summary_results.json`): Performance summary by discipline (see the loading example below)
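A quick way to inspect the per-discipline summary is sketched below; the exact schema of `summary_results.json` is an assumption, so print the raw structure first if in doubt.

```python
# Print the per-discipline summary; the JSON layout iterated here is assumed.
import json

with open("results/summary_results.json", encoding="utf-8") as f:
    summary = json.load(f)

for discipline, stats in summary.items():
    print(discipline, stats)
```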
- Closed-ended questions: 0 or 10 points (correct/incorrect)
- Open-ended questions: 0-10 points (integer scores)
- Scores are automatically generated by the judging LLM (a rough aggregation example follows below)
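To make the scale concrete, here is one plausible way the 0-10 judge scores could be averaged into an overall score in [0, 1]; the aggregation actually used for the leaderboard numbers is not specified here, so treat this as an assumption.

```python
# Assumed aggregation: average the 0-10 judge scores and rescale to [0, 1].
def overall_score(scores: list[int]) -> float:
    if not scores:
        return 0.0
    return sum(scores) / (10 * len(scores))

# Example: three closed-ended answers (10, 0, 10) and one open-ended answer (7)
print(overall_score([10, 0, 10, 7]))  # 0.675
```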
Please contact any of the first authors of this paper for queries.

- Zicheng Zhang, zhangzicheng@pjlab.org.cn, @zzc-1998
- Junying Wang, wangjunying@pjlab.org.cn, @junyingwang959
If you find our work interesting, please feel free to cite our paper:
```bibtex
@misc{wang2025everevolvingscienceexam,
title={The Ever-Evolving Science Exam},
author={Junying Wang and Zicheng Zhang and Yijin Guo and Farong Wen and Ye Shen and Yingji Liang and Yalun Wu and Wenzhe Li and Chunyi Li and Zijian Chen and Qi Jia and Guangtao Zhai},
year={2025},
eprint={2507.16514},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2507.16514},
}
```

