
The Ever-Evolving Science Exam

What constitutes a good science benchmark?

Junying Wang*, Zicheng Zhang*#, Yijin Guo*, Farong Wen*, Ye Shen, Yingji Liang,
Yalun Wu, Wenzhe Li, Chunyi Li, Zijian Chen, Qi Jia, Guangtao Zhai#
Shanghai Artificial Intelligence Laboratory
*Equal contribution. #Corresponding author.

Paper | Github | Team Work | Huggingface

Release

  • [2026/01/16] 🔥 EESE-V3 is now available online: EESE-Dataset.
  • [2025/10/15] 🔥 EESE-V2 is now available online: EESE-Dataset.
  • [2025/07/30] 🔥 Evaluation is available on OpenCompass.
  • [2025/07/24] 🔥 Our quick-start guide is available online: EESE-Quick-Start.
  • [2025/07/23] 🔥 EESE-V1 is now available online: EESE-Dataset.
  • [2025/07/22] 🔥 Our paper is available online: EESE-Paper.

Key Contributions

  • A large-scale, high-quality science benchmark pool: We construct EESE-Pool, a pool of 100K+ science question-answer pairs spanning 5 disciplines and 500+ subfields, with diverse formats and rigorous quality control. We design a three-stage Data Engine (Transcription, Expansion, and Categorization) and a Data Refinement stage (a Parallel Three-Branch Refinement Framework) to ensure range, reach, and rigor.
  • A dynamic, leakage-resilient evaluation set: We propose EESE, a 500-instance subset that is periodically refreshed by resampling 500 instances from EESE-Pool, maintaining representativeness while reducing leakage risk and evaluation overhead (see the sampling sketch below).
  • Comprehensive evaluation of LLMs: We evaluate 32 leading models (open- and closed-source) on EESE-Pool and EESE, revealing significant performance gaps across disciplines, the effectiveness of refinement in improving quality, and the trade-offs between inference cost and science ability. These findings offer insights for future science benchmarks.
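The periodic refresh is straightforward to emulate. Below is a minimal sketch, not the authors' exact procedure: it assumes the pool is stored as JSONL with one instance per line and a discipline field, and draws a discipline-stratified sample of 500 instances.

import json
import random
from collections import defaultdict

def resample_eese(pool_path: str, k: int = 500, seed: int | None = None) -> list[dict]:
    """Draw a discipline-stratified sample of k instances from EESE-Pool.

    Illustrative sketch only: assumes each JSONL line carries a
    'discipline' field; the official resampling procedure may differ.
    """
    rng = random.Random(seed)
    by_discipline = defaultdict(list)
    with open(pool_path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            by_discipline[item["discipline"]].append(item)

    # Allocate the k-instance budget proportionally to each discipline's size.
    total = sum(len(items) for items in by_discipline.values())
    sample = []
    for items in by_discipline.values():
        quota = max(1, round(k * len(items) / total))
        sample.extend(rng.sample(items, min(quota, len(items))))
    rng.shuffle(sample)
    return sample[:k]  # trim any rounding overshoot back to exactly k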

V3-Version-2026-01-16

| Model | Org. | Params | Overall |
| --- | --- | --- | --- |
| Doubao-1-5-Pro-32K | ByteDance | N/A | 0.3606 |
| GPT-5 | OpenAI | N/A | 0.3520 |
| Kimi-K2-0905-preview | Moonshot | N/A | 0.3424 |
| Gemini-2.5-flash | Google | N/A | 0.3250 |
| Qwen3-235B-A22B-Instruct | Alibaba | 235B | 0.3076 |
| GLM-4.5v | Zhipu AI | N/A | 0.3012 |
| Deepseek-R1 | Deepseek | 671B | 0.2932 |
| Deepseek-V3 | Deepseek | N/A | 0.2756 |
| O3 | OpenAI | N/A | 0.2594 |
| Gemini-2.5-pro | Google | N/A | 0.2424 |
| Kimi-K2-0711 | Moonshot | 1.01T | 0.2230 |
| Grok-4 | xAI | N/A | 0.1920 |
| Qwen2.5-72B-Instruct | Alibaba | 72B | 0.1906 |
| Claude-3-7-sonnet | Anthropic | N/A | 0.1452 |

V2-Version-2025-10-15

| Model | Org. | Params | Overall |
| --- | --- | --- | --- |
| Claude-3-7-sonnet | Anthropic | N/A | 0.1452 |
| Gemini-2.5-pro | Google | N/A | 0.2424 |
| GPT-5 | OpenAI | N/A | 0.2620 |
| Kimi-K2-0711 | Moonshot | 1.01T | 0.2230 |
| O3 | OpenAI | N/A | 0.2594 |
| Deepseek-R1 | Deepseek | 671B | 0.1916 |
| Grok-4 | xAI | N/A | 0.1920 |

Note: GPT-5 and Kimi-K2-0711 are models released after the V1 leaderboard.

V1-Version-2025-07-30

| Model | Org. | Params | Overall |
| --- | --- | --- | --- |
| Claude-3-7-sonnet | Anthropic | N/A | 0.2648 |
| Gemini-2.5-pro | Google | N/A | 0.3813 |
| O3 | OpenAI | N/A | 0.4025 |
| Deepseek-R1 | Deepseek | 671B | 0.3251 |
| Grok-4 | xAI | N/A | 0.3442 |

Quick Start

1. Install Dependencies

pip install -r requirements.txt

2. Configure API Keys

Edit config.py and replace the API keys:

LLM_CONFIG = {
    "base_url": "https://api.openai.com/v1",
    "api_key": "your-actual-api-key-here",  # Replace with your API key
    "model": "model-name",
    "temperature": 0.0,
    "max_retries": 3
}

JUDGE_LLM_CONFIG = {
    "base_url": "https://api.openai.com/v1", 
    "api_key": "your-actual-api-key-here",  # Replace with your API key
    "model": "model-name",
    "temperature": 0.0,
    "max_retries": 3
}
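The keys above are presumably consumed by call.py through an OpenAI-compatible client. The snippet below is a minimal sketch of such a call, assuming the openai Python SDK; the function name ask_llm and the backoff loop are illustrative, not the repository's exact code.

import time
from openai import OpenAI
from config import LLM_CONFIG

def ask_llm(prompt: str) -> str:
    """Send one prompt using the settings in config.py (illustrative sketch).

    Retries transient failures up to max_retries with exponential backoff.
    """
    client = OpenAI(base_url=LLM_CONFIG["base_url"], api_key=LLM_CONFIG["api_key"])
    for attempt in range(LLM_CONFIG["max_retries"]):
        try:
            response = client.chat.completions.create(
                model=LLM_CONFIG["model"],
                temperature=LLM_CONFIG["temperature"],
                messages=[{"role": "user", "content": prompt}],
            )
            return response.choices[0].message.content
        except Exception:
            if attempt == LLM_CONFIG["max_retries"] - 1:
                raise
            time.sleep(2 ** attempt)  # back off before retrying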

3. Prepare Data

Ensure the EESE.jsonl file is in the project directory. You can download it from Huggingface.
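If you prefer to fetch the file programmatically, the huggingface_hub package can do it in one call. The dataset repo id below is a placeholder; substitute the actual id from the EESE-Dataset link above.

from huggingface_hub import hf_hub_download

# Placeholder repo id -- use the actual id from the EESE-Dataset page.
path = hf_hub_download(
    repo_id="<org>/<eese-dataset>",
    filename="EESE.jsonl",
    repo_type="dataset",
    local_dir=".",  # place the file in the project directory
)
print(path)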

4. Run Evaluation

python main.py

File Structure

├── code                     # Code folder
│   ├── main.py              # Main evaluation script
│   ├── inference.py         # Core inference functions
│   ├── config.py            # Configuration settings
│   ├── call.py              # LLM API calling functions
│   ├── llm_information.py   # LLM client setup
│   ├── utils.py             # Utility functions
│   ├── requirements.txt     # Python dependencies
│   ├── EESE.jsonl           # Input data file
│   ├── log/                 # Log files directory
│   └── results/             # Output results directory

Output Files

After running the evaluation, you'll get:

  1. Log File (log/evaluation.log): Detailed processing logs
  2. Detailed Results (results/detailed_results.json): Complete evaluation data
  3. Summary Results (results/summary_results.json): Performance summary by discipline
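To inspect the summary programmatically, load the JSON directly. The per-discipline mapping assumed below is illustrative; check your own summary_results.json for the actual schema.

import json

with open("results/summary_results.json", encoding="utf-8") as f:
    summary = json.load(f)

# Assumed schema: {"Physics": 0.31, "Chemistry": 0.28, ...} -- adjust
# the loop if your summary nests scores differently.
for discipline, score in sorted(summary.items(), key=lambda kv: -kv[1]):
    print(f"{discipline:<20} {score:.4f}")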

Scoring System

  • Closed-ended questions: 0 or 10 points (correct/incorrect)
  • Open-ended questions: 0–10 points (integer scores)
  • Scores are generated automatically by the judging LLM (see the aggregation sketch below)
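The leaderboard's Overall values fall in [0, 1], which is consistent with averaging per-question judge scores and dividing by 10. The helper below computes such an aggregate; treat it as an illustration under that assumption, not the repository's exact formula.

def overall_score(scores: list[int]) -> float:
    """Aggregate 0-10 per-question judge scores into a [0, 1] overall value.

    Assumption for illustration: Overall = mean(score) / 10; the actual
    aggregation may weight disciplines or question types differently.
    """
    return sum(scores) / (10 * len(scores))

# Three closed-ended (0 or 10) and two open-ended (0-10) questions:
print(overall_score([10, 0, 10, 7, 4]))  # 0.62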

Contact

For any queries, please contact one of the first authors of this paper:

  • Zicheng Zhang, zhangzicheng@pjlab.org.cn, @zzc-1998
  • Junying Wang, wangjunying@pjlab.org.cn, @junyingwang959

Citation

If you find our work interesting, please feel free to cite our paper:

@misc{wang2025everevolvingscienceexam,
      title={The Ever-Evolving Science Exam}, 
      author={Junying Wang and Zicheng Zhang and Yijin Guo and Farong Wen and Ye Shen and Yingji Liang and Yalun Wu and Wenzhe Li and Chunyi Li and Zijian Chen and Qi Jia and Guangtao Zhai},
      year={2025},
      eprint={2507.16514},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.16514}, 
}
