- Overview
- Project Structure
- Installation
- Download Datasets
- Quick Start
- Usage
- Advanced Usage
- Evaluation
specverdict/
├── main.py # Main entry point for all pipeline stages
├── draft.py # Draft stage
├── verdict.py # Verdict stage
├── consensus_scoring.py # Consensus-based expert ranking
├── prompts.py # Dataset-specific prompts
├── model.py # Model wrappers
├── utils/ # Post-processing utilities
├── eval/ # Evaluation framework
├── layout_annotation/ # Optional: OCR-based image annotation for information-intensive benchmarks
└── requirements.txt
git clone https://github.com/Tinaliu0123/speculative-verdict.git
cd specverdict
conda create -n specverdict python=3.10
conda activate specverdict
export OPENAI_API_KEY="your-key"
pip install -r requirements.txt
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
Note: Use transformers>=4.57.1 for GLM-4.1V-Thinking and transformers==4.53.0 for LLaVA-OneVision.
- InfographicVQA: InfographicVQA Dataset
- ChartMuseum: 🤗ChartMuseum Dataset
- ChartQAPro: 🤗ChartQAPro Dataset
- HR-Bench: 🤗HR-Bench Dataset
Note: The following datasets require light preprocessing of the question text to satisfy task-specific formatting.
- ChartQAPro: Follow the official prompt in its paper and prepend the required paragraph metadata when needed.
- HR-Bench 4K: For multiple-choice questions, append explicit A/B/C/D options immediately after the question.
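The HR-Bench 4K preprocessing can be sketched as below. The `question` and `A`..`D` field names are assumptions for illustration, not the actual keys in the dataset files; adapt them to the real record layout.

```python
def format_hrbench_question(entry):
    """Append explicit A/B/C/D options immediately after the question text.

    The "question" and "A".."D" keys are assumed field names for this
    sketch; adapt them to the actual HR-Bench record layout.
    """
    options = "\n".join(f"{k}. {entry[k]}" for k in ("A", "B", "C", "D"))
    return f"{entry['question']}\n{options}"


sample = {
    "question": "What color is the traffic sign?",
    "A": "Red", "B": "Blue", "C": "Green", "D": "Yellow",
}
print(format_hrbench_question(sample))
```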
# 1. Initial inference with 5 candidate models (QA mode)
python main.py \
--mode inference \
--inference_mode qa \
--models path/to/model1 path/to/model2 path/to/model3 path/to/model4 path/to/model5 \
--dataset infovqa \
--in_json data/infovqa/test.jsonl \
--out_json results/qa_inference.json
# 2. Compute consensus scores
python main.py \
--mode prefill_cross \
--models path/to/model1 path/to/model2 path/to/model3 path/to/model4 path/to/model5 \
--in_json results/qa_inference.json \
--out_json results/prefill.json
# 3. Select top-3 draft experts
python consensus_scoring.py \
--input results/prefill.json \
--output results/consensus.json \
--top_k 3
# 4. Draft experts generate detailed reasoning (reason mode)
python main.py \
--mode inference_from_topk \
--inference_mode reason \
--model_mapping '{"model1":"path/to/model1", "model2":"path/to/model2", ...}' \
--consensus_file results/consensus.json \
--in_json data/infovqa/test.jsonl \
--out_json results/draft_reason.json
# 5. Final verdict synthesis
python main.py \
--mode verdict \
--verdict_backend gpt4o \
--in_json results/draft_reason.json \
--out_json results/verdict.json \
--dataset infovqa \
--annotated_folder data/annotations/infovqa/ # optional
# Notes:
# - set --verdict_backend to qwen or gpt4o
# - for qwen backend, add: --models path/to/verdict_model
# - for gpt4o, you can optionally set --verdict_openai_model (default: gpt-4o)
Pipeline Modes:
- inference: Generate reasoning/answers from multiple models
- prefill_cross: Compute consensus scores
- inference_from_topk: Inference only for selected draft experts
- verdict: Synthesize reasoning into final answer
Inference Modes:
- qa: Direct question answering
- reason: Step-by-step reasoning
| Parameter | Description | Example |
|---|---|---|
| --dataset | Dataset type | infovqa, museum, pro, hrbench |
| --inference_mode | Inference type | qa, reason |
| --models | Model paths (space-separated) | path/to/model1 path/to/model2... |
| --in_json | Input file path | data/test.jsonl |
| --out_json | Output file path | results/output.json |
| --start_idx | Start from a specific sample index | 0 |
| --max_entries | Process only the first N samples | 100 |
| --seed | Random seed | 42 |
| --merge_output | Update and merge into an existing output file | flag |
Verdict-specific:
- --annotated_folder: Folder containing layout-annotated images for information-intensive tasks
- --verdict_backend: Verdict backend (qwen or gpt4o, default: gpt4o)
- --verdict_openai_model: OpenAI model name when using the gpt4o backend
- --verdict_api_key: Optional OpenAI API key (otherwise uses OPENAI_API_KEY)
Consensus-specific:
--top_k: Number of models to select (default: 3)
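Conceptually, the selection step in consensus_scoring.py reduces to ranking the candidate models by an aggregate consensus score and keeping the top k. A minimal sketch of that ranking, assuming a simple model-to-scores mapping (the dict layout here is illustrative, not the real consensus.json schema):

```python
def select_top_k(scores, top_k=3):
    """Rank models by mean consensus score (higher is better), keep top_k.

    `scores` maps model name -> list of per-sample consensus scores.
    The layout is illustrative; the real consensus.json schema may differ.
    """
    ranked = sorted(scores, key=lambda m: sum(scores[m]) / len(scores[m]),
                    reverse=True)
    return ranked[:top_k]


scores = {
    "model1": [0.90, 0.80],
    "model2": [0.50, 0.60],
    "model3": [0.70, 0.75],
    "model4": [0.40, 0.50],
    "model5": [0.85, 0.90],
}
print(select_top_k(scores))  # -> ['model5', 'model1', 'model3']
```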
Generate and use layout-annotated images for information-intensive tasks:
# 1. Generate layout-annotated images
python layout_annotation/pipeline.py \
--input data/infovqa/images/ \
--output data/infovqa/annotated/
# 2. Use in verdict stage
python main.py \
--mode verdict \
--annotated_folder data/infovqa/annotated/ \
  ...
See layout_annotation/README.md for details.
We currently support: Qwen2.5-VL, GLM-4V, MiMO, InternVL3/3.5, Ovis2.5, LLaVA-OneVision, Eagle2.5, Gemma3.
To add your own model, modify two files:
- draft.py: Add model loading logic in load_vlm()
- model.py: Create a wrapper class with:
  - answer(img_path, question, prompt_tpl) → Generate a response
  - prefill_nll(img_path, question, answer) → Conduct masking and compute a perplexity score
See existing implementations for reference.
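A skeleton of what such a wrapper might look like. The class name and method bodies are placeholders; the real loading and scoring logic lives in the existing classes in model.py:

```python
class MyVLMWrapper:
    """Illustrative skeleton for a new model wrapper (not a real implementation)."""

    def __init__(self, model_path):
        self.model_path = model_path
        # Load the model and processor here (e.g. via transformers Auto* classes).

    def answer(self, img_path, question, prompt_tpl):
        """Generate a response for (image, question) using the prompt template."""
        prompt = prompt_tpl.format(question=question)
        # Run generation with the loaded model on (img_path, prompt).
        raise NotImplementedError

    def prefill_nll(self, img_path, question, answer):
        """Teacher-force `answer`, mask it, and return a perplexity-style score."""
        # Compute the average negative log-likelihood over the masked answer tokens.
        raise NotImplementedError
```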
Evaluate final results against ground truth, following each dataset's evaluation metrics:
# InfographicVQA (ANLS metric)
python eval/eval.py infovqa \
--input results/verdict.json \
--output eval_results.json
# ChartQAPro (Relaxed accuracy)
python eval/eval.py chartqapro \
--input results/verdict.json \
--meta data/chartqapro/metadata.jsonl
# ChartMuseum (GPT-based scoring)
python eval/eval.py chartmuseum \
--input results/verdict.json
# HR-Bench (Accuracy)
python eval/eval.py hrbench \
--input results/verdict.json \
  --bench data/hr_bench.jsonl
See eval/README.md for detailed evaluation documentation.
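As a reference point, InfographicVQA's ANLS metric scores each prediction by its normalized Levenshtein similarity to the closest ground-truth answer, zeroing out anything below a 0.5 threshold. eval/eval.py is the authoritative implementation; the core computation looks roughly like this:

```python
def levenshtein(a, b):
    """Classic edit distance via dynamic programming."""
    if not a:
        return len(b)
    if not b:
        return len(a)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            # deletion, insertion, substitution (free if chars match)
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def anls(prediction, ground_truths, threshold=0.5):
    """Best normalized similarity over ground truths, zeroed below threshold."""
    best = 0.0
    for gt in ground_truths:
        p, g = prediction.strip().lower(), gt.strip().lower()
        nl = levenshtein(p, g) / max(len(p), len(g), 1)
        best = max(best, 1.0 - nl)
    return best if best >= threshold else 0.0


print(anls("paris", ["Paris"]))  # 1.0
```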
If you find this work useful, please cite our paper:
@misc{liu2025smalldraftsbigverdict,
title={Small Drafts, Big Verdict: Information-Intensive Visual Reasoning via Speculation},
author={Yuhan Liu and Lianhui Qin and Shengjie Wang},
year={2025},
eprint={2510.20812},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2510.20812},
}