- Overview
- Project Structure
- Installation
- Download Datasets
- Quick Start
- Usage
- Advanced Usage
- Evaluation
specverdict/
├── main.py # Main entry point for all pipeline stages
├── draft.py # Draft stage
├── verdict.py # Verdict stage
├── consensus_scoring.py # Consensus-based expert ranking
├── prompts.py # Dataset-specific prompts
├── model.py # Model wrappers
├── utils/ # Post-processing utilities
├── eval/ # Evaluation framework
├── layout_annotation/ # Optional: OCR-based image annotation for information-intensive benchmarks
└── requirements.txt
git clone https://github.com/Tinaliu0123/speculative-verdict.git
cd specverdict
conda create -n specverdict python=3.10
conda activate specverdict
export OPENAI_API_KEY="your-key"
pip install -r requirements.txt
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
Note: Use transformers>=4.57.1 for GLM-4.1V-Thinking and transformers==4.53.0 for LLaVA-OneVision.
- InfographicVQA: InfographicVQA Dataset
- ChartMuseum: 🤗ChartMuseum Dataset
- ChartQAPro: 🤗ChartQAPro Dataset
- HR-Bench: 🤗HR-Bench Dataset
Note: The following datasets require light preprocessing of the question text to satisfy task-specific formatting.
- ChartQAPro: Follow the official prompt in its paper and prepend the required paragraph metadata when needed.
- HR-Bench 4K: For multiple-choice questions, append explicit A/B/C/D options immediately after the question.
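The HR-Bench 4K preprocessing can be sketched as below. The `question` and `A`..`D` field names are assumptions for illustration, not the actual keys in the dataset files; adapt them to the real record layout.

```python
def format_hrbench_question(entry):
    """Append explicit A/B/C/D options immediately after the question text.

    The "question" and "A".."D" keys are assumed field names for this
    sketch; adapt them to the actual HR-Bench record layout.
    """
    options = "\n".join(f"{k}. {entry[k]}" for k in ("A", "B", "C", "D"))
    return f"{entry['question']}\n{options}"


sample = {
    "question": "What color is the traffic sign?",
    "A": "Red", "B": "Blue", "C": "Green", "D": "Yellow",
}
print(format_hrbench_question(sample))
```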
# 1. Initial inference with 5 candidate models (QA mode)
python main.py \
--mode inference \
--inference_mode qa \
--models path/to/model1 path/to/model2 path/to/model3 path/to/model4 path/to/model5 \
--dataset infovqa \
--in_json data/infovqa/test.jsonl \
--out_json results/qa_inference.json
# 2. Compute consensus scores
python main.py \
--mode prefill_cross \
--models path/to/model1 path/to/model2 path/to/model3 path/to/model4 path/to/model5 \
--in_json results/qa_inference.json \
--out_json results/prefill.json
# 3. Select top-3 draft experts
python consensus_scoring.py \
--input results/prefill.json \
--output results/consensus.json \
--top_k 3
# 4. Draft experts generate detailed reasoning (reason mode)
python main.py \
--mode inference_from_topk \
--inference_mode reason \
--model_mapping '{"model1":"path/to/model1", "model2":"path/to/model2", ...}' \
--consensus_file results/consensus.json \
--in_json data/infovqa/test.jsonl \
--out_json results/draft_reason.json
# 5. Final verdict synthesis
python main.py \
--mode verdict \
--verdict_backend gpt4o \
--in_json results/draft_reason.json \
--out_json results/verdict.json \
--dataset infovqa \
--annotated_folder data/annotations/infovqa/ # optional
# Notes:
# - set --verdict_backend to qwen or gpt4o
# - for qwen backend, add: --models path/to/verdict_model
# - for gpt4o, you can optionally set --verdict_openai_model (default: gpt-4o)
Pipeline Modes:
- inference: Generate reasoning/answers from multiple models
- prefill_cross: Compute consensus scores
- inference_from_topk: Inference only for selected draft experts
- verdict: Synthesize reasoning into final answer
Inference Modes:
- qa: Direct question answering
- reason: Step-by-step reasoning
| Parameter | Description | Example |
|---|---|---|
| --dataset | Dataset type | infovqa, museum, pro, hrbench |
| --inference_mode | Inference type | qa, reason |
| --models | Model paths (space-separated) | path/to/model1 path/to/model2... |
| --in_json | Input file path | data/test.jsonl |
| --out_json | Output file path | results/output.json |
| --start_idx | Start from a specific sample index | 0 |
| --max_entries | Process only the first N samples | 100 |
| --seed | Random seed | 42 |
| --merge_output | Update and merge into an existing output file | flag |
Verdict-specific:
- --annotated_folder: Folder containing layout-annotated images for information-intensive tasks
- --verdict_backend: Verdict backend (qwen or gpt4o, default: gpt4o)
- --verdict_openai_model: OpenAI model name when using the gpt4o backend
- --verdict_api_key: Optional OpenAI API key (otherwise uses OPENAI_API_KEY)
Consensus-specific:
--top_k: Number of models to select (default: 3)
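Conceptually, the selection step in consensus_scoring.py reduces to ranking the candidate models by an aggregate consensus score and keeping the top k. A minimal sketch of that ranking, assuming a simple model-to-scores mapping (the dict layout here is illustrative, not the real consensus.json schema):

```python
def select_top_k(scores, top_k=3):
    """Rank models by mean consensus score (higher is better), keep top_k.

    `scores` maps model name -> list of per-sample consensus scores.
    The layout is illustrative; the real consensus.json schema may differ.
    """
    ranked = sorted(scores, key=lambda m: sum(scores[m]) / len(scores[m]),
                    reverse=True)
    return ranked[:top_k]


scores = {
    "model1": [0.90, 0.80],
    "model2": [0.50, 0.60],
    "model3": [0.70, 0.75],
    "model4": [0.40, 0.50],
    "model5": [0.85, 0.90],
}
print(select_top_k(scores))  # -> ['model5', 'model1', 'model3']
```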
Generate and use layout-annotated images for information-intensive tasks:
# 1. Generate layout-annotated images
python layout_annotation/pipeline.py \
--input data/infovqa/images/ \
--output data/infovqa/annotated/
# 2. Use in verdict stage
python main.py \
--mode verdict \
--annotated_folder data/infovqa/annotated/ \
  ...
See layout_annotation/README.md for details.
We currently support: Qwen2.5-VL, GLM-4V, MiMO, InternVL3/3.5, Ovis2.5, LLaVA-OneVision, Eagle2.5, Gemma3.
To add your own model, modify two files:
- draft.py: Add model loading logic in load_vlm()
- model.py: Create a wrapper class with:
  - answer(img_path, question, prompt_tpl) → Generate a response
  - prefill_nll(img_path, question, answer) → Conduct masking and compute a perplexity score
See existing implementations for reference.
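A skeleton of what such a wrapper might look like. The class name and method bodies are placeholders; the real loading and scoring logic lives in the existing classes in model.py:

```python
class MyVLMWrapper:
    """Illustrative skeleton for a new model wrapper (not a real implementation)."""

    def __init__(self, model_path):
        self.model_path = model_path
        # Load the model and processor here (e.g. via transformers Auto* classes).

    def answer(self, img_path, question, prompt_tpl):
        """Generate a response for (image, question) using the prompt template."""
        prompt = prompt_tpl.format(question=question)
        # Run generation with the loaded model on (img_path, prompt).
        raise NotImplementedError

    def prefill_nll(self, img_path, question, answer):
        """Teacher-force `answer`, mask it, and return a perplexity-style score."""
        # Compute the average negative log-likelihood over the masked answer tokens.
        raise NotImplementedError
```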
Evaluate final results against ground truth, following each dataset's evaluation metrics:
# InfographicVQA (ANLS metric)
python eval/eval.py infovqa \
--input results/verdict.json \
--output eval_results.json
# ChartQAPro (Relaxed accuracy)
python eval/eval.py chartqapro \
--input results/verdict.json \
--meta data/chartqapro/metadata.jsonl
# ChartMuseum (GPT-based scoring)
python eval/eval.py chartmuseum \
--input results/verdict.json
# HR-Bench (Accuracy)
python eval/eval.py hrbench \
--input results/verdict.json \
  --bench data/hr_bench.jsonl
See eval/README.md for detailed evaluation documentation.
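As a reference point, InfographicVQA's ANLS metric scores each prediction by its normalized Levenshtein similarity to the closest ground-truth answer, zeroing out anything below a 0.5 threshold. eval/eval.py is the authoritative implementation; the core computation looks roughly like this:

```python
def levenshtein(a, b):
    """Classic edit distance via dynamic programming."""
    if not a:
        return len(b)
    if not b:
        return len(a)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            # deletion, insertion, substitution (free if chars match)
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def anls(prediction, ground_truths, threshold=0.5):
    """Best normalized similarity over ground truths, zeroed below threshold."""
    best = 0.0
    for gt in ground_truths:
        p, g = prediction.strip().lower(), gt.strip().lower()
        nl = levenshtein(p, g) / max(len(p), len(g), 1)
        best = max(best, 1.0 - nl)
    return best if best >= threshold else 0.0


print(anls("paris", ["Paris"]))  # 1.0
```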
If you find this work useful, please cite our paper:
@misc{liu2025smalldraftsbigverdict,
title={Small Drafts, Big Verdict: Information-Intensive Visual Reasoning via Speculation},
author={Yuhan Liu and Lianhui Qin and Shengjie Wang},
year={2025},
eprint={2510.20812},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2510.20812},
}