Generate synthetic queries from a PDF document corpus for evaluating image retrieval models.
```bash
uv venv --python 3.10
uv sync
```

Set up your API keys by copying `.env.dist` to `.env` and filling in the relevant keys:

```bash
cp .env.dist .env
```

The full pipeline takes a folder of PDFs and produces a `final_queries.json` file ready for use. It runs in 4 steps:
```mermaid
graph LR
    A[/PDFs/] --> B[extract text]
    B --> C[generate summaries] --> D[generate queries] --> E[postprocess queries]
    E --> F[\final_queries.json\]
```
Create a folder anywhere on disk. Inside it, create a pdfs/ subfolder and put your PDF files there:
```
my_dataset/
└── pdfs/
    ├── document_1.pdf
    ├── document_2.pdf
    └── ...
```
Create a YAML file (e.g. configs/my_dataset.yaml). Here is a minimal working example:
```yaml
# Unique name for your dataset — used to name all output folders and files
dataset_name: "my_dataset"

# Path to the folder that CONTAINS your dataset folder (i.e. the parent of my_dataset/)
documents_dir: "."

# LLM provider settings
llm_provider:
  # Any litellm-compatible model string, e.g.:
  #   "openai/gpt-5-nano"
  #   "fireworks_ai/kimi-k2p5"
  #   "anthropic/claude-3-5-haiku-20241022"
  lm_model_name: "openai/gpt-5-nano"

  # Optional: use a different model specifically for query generation and judging.
  # Defaults to lm_model_name if not set.
  # query_generation_model_name: "openai/gpt-5-nano"
  # judge_model_name: "openai/gpt-5-nano"

  # Extra parameters forwarded to the LLM (provider-specific, all optional)
  lm_extra_kwargs:
    temperature: 0.7
    top_p: 0.8

# Describe the target user who will search this document corpus.
# The more specific, the better the generated queries.
persona: "A student looking for information about physics."

# Language of the generated queries ("english", "french", "spanish", etc.)
language: "english"

# Target number of summaries to keep after filtering.
# A good starting point: 5–10× the number of PDFs.
filtered_summaries_nb: 50

# Number of multi-document summary combination iterations.
# Higher = more cross-document queries. Good default: 10–20.
combination_iteration_nb: 15

# Fraction of summaries that span multiple documents (0.0–1.0).
sampling_multi_doc_ratio: 0.5

# Print verbose LLM outputs during generation
debug: false
```

All available config fields with their defaults are documented in `configs/example.yaml`.
**Warning:** It is strongly preferable to run each step individually (see below). Each step produces intermediate outputs worth inspecting before proceeding; mistakes caught early save significant API costs.
After setting up your documents, you can run everything at once using this convenience script:

```bash
bash vidore-generation.sh my_dataset
```

**Warning:** with large documents, this can take a while.
```bash
vidore-generation extract-text-from-pdfs my_dataset/pdfs
```

This creates `my_dataset/markdowns/` with one `.md` file per PDF.

Note that markdown extraction uses Fireworks with kimi-k2.5 by default. To change the provider or model, edit the `parse_pdf` function in `vidore_generation/pdf_parsing/extract_text_from_pdfs`.
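Before proceeding, a quick sanity check can confirm that every PDF actually produced a markdown file. This is a minimal sketch, assuming the markdown files keep the same stem names as the PDFs (which the pipeline does not guarantee):

```python
from pathlib import Path

dataset = Path("my_dataset")
pdf_stems = {p.stem for p in (dataset / "pdfs").glob("*.pdf")}
md_stems = {p.stem for p in (dataset / "markdowns").glob("*.md")}

# One .md file is expected per PDF.
missing = sorted(pdf_stems - md_stems)
print(f"{len(md_stems)}/{len(pdf_stems)} PDFs have a markdown extraction")
if missing:
    print("Missing:", missing)
```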
Optionally verify the extraction succeeded (page counts match):
```bash
vidore-generation check-extractions my_dataset
```

Next, run the main LLM step:

```bash
vidore-generation llm --config configs/my_dataset.yaml
```

This step reads the markdowns, generates summaries per document section, combines them across documents, judges their quality, and writes the best ones to `my_dataset/filtered_summaries/filtered_summaries.json`.
Output folders created under my_dataset/:
| Folder | Contents |
|---|---|
| `descriptions/` | One-paragraph description of each document |
| `sections/` | Extracted sections per document |
| `summaries/` | Per-section summaries |
| `combined_summaries/` | Cross-document summaries |
| `judgments/` | Quality scores for each summary |
| `filtered_summaries/` | The final selection used for query generation |
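The filtered summaries drive everything downstream, so it is worth skimming them before spending more API credits on query generation. A minimal sketch for doing so, assuming the file is a JSON array of summary objects (its exact schema is not documented here):

```python
import json
from pathlib import Path

path = Path("my_dataset/filtered_summaries/filtered_summaries.json")
summaries = json.loads(path.read_text())

print(f"{len(summaries)} filtered summaries")
for entry in summaries[:3]:
    # Preview the first few entries; adapt the keys once you know the schema.
    print(json.dumps(entry, ensure_ascii=False, indent=2)[:500])
```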
```bash
vidore-generation generate-queries-vidore-juicer \
    my_dataset/filtered_summaries/filtered_summaries.json \
    configs/my_dataset.yaml
```

This generates queries from each filtered summary, judges their quality, and writes the survivors to `my_dataset/queries/vidore_juicer_my_dataset_queries.json`.
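It can also pay to eyeball a few of these queries before paying for the postprocessing pass. Another minimal sketch, again assuming the intermediate file is a JSON array of query objects:

```python
import json
from pathlib import Path

path = Path("my_dataset/queries/vidore_juicer_my_dataset_queries.json")
queries = json.loads(path.read_text())

print(f"{len(queries)} queries survived judging")
for q in queries[:5]:
    # Compact one-line preview of each entry.
    print(json.dumps(q, ensure_ascii=False)[:300])
```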
```bash
vidore-generation postprocess-queries --config configs/my_dataset.yaml
```

This filters and rephrases the queries, then writes the final output to `my_dataset/queries/final_my_dataset_queries.json`.
Each entry in the file looks like:
```json
{
  "query": "What is the relationship between energy and mass?",
  "generation_process": "vidore_juicer_rephrased",
  "original_query": "How does E=mc² relate energy and mass?",
  "document_ids": ["..."],
  "filenames": ["document_1"],
  "page_numbers": [[12, 13]]
}
```
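As a final check, the sketch below loads the output and reports the number of queries per generation process and how many span more than one document. It relies only on the fields shown above and assumes the file is a JSON array of such entries:

```python
import json
from collections import Counter
from pathlib import Path

path = Path("my_dataset/queries/final_my_dataset_queries.json")
queries = json.loads(path.read_text())

# Breakdown by generation process (e.g. "vidore_juicer_rephrased").
print(Counter(q["generation_process"] for q in queries))

# Queries whose supporting evidence spans more than one document.
multi_doc = [q for q in queries if len(q["filenames"]) > 1]
print(f"{len(multi_doc)}/{len(queries)} queries are multi-document")
```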
If your PDF filenames contain spaces, accents, or special characters, normalize them first (run this before step 3):

```bash
vidore-generation normalize-docs configs/my_dataset.yaml
```

Any model supported by litellm works. Common examples:
| Provider | Model string |
|---|---|
| OpenAI | `openai/gpt-5-nano`, `openai/gpt-4o` |
| Fireworks | `fireworks_ai/kimi-k2p5`, `fireworks_ai/qwen3-235b-a22b-instruct-2507` |
| Anthropic | `anthropic/claude-3-5-haiku-20241022` |
Provider-specific parameters (e.g. `top_k` for Fireworks) can be set in `lm_extra_kwargs`; parameters a given provider does not support are dropped automatically.
Set the corresponding API key in your .env file (see .env.dist for the full list).
The core code for this repo was contributed by António Loison during his work at Illuin Technology. We thank him for his contributions and for helping shape the ViDoRe v3 benchmark.