UniTor at BioASQ 2025: Modular Biomedical QA with Synthetic Snippets and Multi-Task Answer Generation
This repository accompanies the paper "UniTor at BioASQ 2025: Modular Biomedical QA with Synthetic Snippets and Multi-Task Answer Generation" accepted at the BioASQ 13b Challenge, held at CLEF 2025.
The goal of this project is to develop LLM-based methods that enhance retrieval-augmented pipelines for biomedical question answering, with a focus on factual reliability, evidence traceability, and robust performance across diverse BioASQ question types.
The main contribution of this repository is to provide both the original curated datasets, designed to adapt base LLMs for downstream tasks such as Snippet Extraction and Answer Generation, and the complete code required to train models using those datasets.
UniTor@BioASQ is a modular pipeline designed to couple high-recall retrieval with LLM-guided evidence selection and multi-task answer generation.
- Full-Text Retrieval. We submit the question to a Solr-based index of PubMed abstracts (BM25). To maximize recall, we typically collect the top-1000 abstracts per question (high lexical overlap, efficient).
- Synthetic Snippet Generation (Full Data and Code in this repository). A decoder-only LLM, fine-tuned on historical BioASQ data (question ↔ gold-snippet pairs), produces plausible answer snippets. Rather than rewriting the query, these snippets are encoded as dense vectors and used as soft semantic anchors during reranking. This injects answer-like signals without contaminating retrieval with hallucinations.
- Two-Stage Reranking.
  - Unsupervised dense reranking: all candidates are rescored via precomputed document embeddings (e.g., Sentence-BERT/PubMedBERT) by similarity to both the question and the synthetic snippet.
  - Supervised reranker: the top-100 candidates from the dense stage are re-ranked with a transformer classifier fine-tuned on question–document pairs; the top-10 abstracts proceed downstream.
- Snippet Extraction with Abstention (Full Data and Code in this repository). A decoder-only LLM performs sequence labeling over the selected abstracts, outputting only the relevant spans marked with special tags, e.g., [BS] ... [ES]. Training includes positive, negative, and borderline-negative examples to teach abstention (emitting an empty [BS][ES] pair) when no evidence exists, improving precision and calibration.
- Snippet-Based Pseudo-Relevance Feedback. High-confidence extracted snippets (not the synthetic ones) are concatenated with the question to form an expanded query, which is re-submitted to sparse retrieval to surface additional relevant documents missed due to terminology mismatch.
- Supervised Multi-Task Answer Generation (Full Data and Code in this repository). A decoder-only LLM is fine-tuned to produce yes/no, factoid, list, and ideal (summary) answers. Inputs include the question, its type, and the curated evidence (abstracts and/or extracted snippets). Multi-task training improves format control, faithfulness, and robustness to noise.
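For illustration, the unsupervised dense stage can be sketched as follows. In the actual system the vectors come from a sentence encoder (e.g., Sentence-BERT/PubMedBERT); here plain Python lists stand in for embeddings, and the interpolation weight `alpha` is an illustrative assumption, not a value from the paper:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def dense_rerank(candidates, q_vec, snippet_vec, alpha=0.5, top_k=100):
    """Rescore candidates by similarity to both the question and the
    synthetic snippet; keep the top_k for the supervised reranker.

    candidates: list of (doc_id, doc_vec) with precomputed embeddings.
    alpha: question/snippet interpolation weight (illustrative).
    """
    scored = [
        (doc_id,
         alpha * cosine(q_vec, vec) + (1 - alpha) * cosine(snippet_vec, vec))
        for doc_id, vec in candidates
    ]
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[:top_k]
```

The supervised cross-encoder would then re-rank the surviving top-100 before the top-10 abstracts move downstream.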
We release code and datasets that power the UniTor@BioASQ system across three core processes:
- Synthetic Snippet Generation (generation of answer-like snippets for soft guidance)
- Snippet Extraction (supervised sequence labeling with abstention)
- Answer Generation (multi-task LLM trained on heterogeneous QA types)
data/ # Datasets and prepared splits
└── dataset/
    ├── answer_generation/
    ├── snippet_extraction/
    └── synth_query_expansion/
fine-tuning/ # Training scripts and configs
└── fine_tuning/
    ├── answer_generation/
    ├── snippet_extraction/
    └── synth_query_expansion/
inference/ # Inference scripts
├── answer_generation/
├── snippet_extraction/
└── synth_query_expansion/
To reproduce our experiments:
# Create conda environment
conda create -n unitor_env python=3.10
conda activate unitor_env
# Install requirements
pip install -r requirements.txt

Hardware. A CUDA-enabled GPU is recommended for training and inference with LLMs. CPU-only execution is possible for some preprocessing steps (e.g., indexing, embedding lookup), but will be slow for the model stages.
External resources.
- PubMed abstracts must be obtained and indexed separately (Solr + BM25).
- Sentence encoders (for dense reranking) should be pre-computed for the corpus to enable fast dot-product similarity at inference time.
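As a minimal sketch of the sparse retrieval call, the following builds a Solr select URL. The core name, default search field, and host are placeholders for your local index, not values prescribed by this repository:

```python
from urllib.parse import urlencode

def build_bm25_request(question,
                       solr_base="http://localhost:8983/solr/pubmed",
                       rows=1000):
    """Build a Solr select URL for high-recall BM25 retrieval.

    The core name ("pubmed") and searched field ("abstract") are
    placeholders; adapt them to your local index schema.
    """
    params = {
        "q": question,
        "df": "abstract",   # default search field (assumed schema)
        "rows": rows,       # high recall: collect many candidates
        "fl": "id,score",   # return ids and BM25 scores only
        "wt": "json",
    }
    return f"{solr_base}/select?{urlencode(params)}"
```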
We provide module-specific folders under data/dataset/. All datasets use JSON unless stated otherwise.
All datasets are created from the BioASQ task 13b training dataset, which includes 5,389 manually constructed biomedical questions from the twelve previous editions of the challenge. Each question is categorized (yes/no, factoid, list, or summary) and annotated with gold-standard relevant documents, evidence snippets, exact answers, and ideal answers. This dataset not only reflects the diversity and complexity of biomedical information needs but also provides ground-truth labels for training and evaluating each system module.
To fully exploit BioASQ annotations, we construct a series of task-specific datasets, each aligned to a core subproblem:
For document reranking, we generate positive pairs from the gold-standard question and highly relevant text snippets, and hard negatives from the same question paired with less relevant or low-relevance text snippets.
To ensure diverse supervision and to encourage the model to generate a spectrum of plausible, realistic answers, we start from the original BioASQ 13b training data, sort all gold snippets for a question by similarity, and partition them into three groups of three: the top-3 (most similar), the middle-3, and the bottom-3. Compared to randomly sampling a single snippet, this grouping strategy exposes the model during training to the various ways in which relevant information might be formulated, from very direct to more nuanced or tangential expressions. Moreover, this "three-snippet grouping" forces the model to generalise across varying degrees of relevance rather than memorising a single gold phrase.
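A minimal sketch of this grouping, assuming similarity scores have already been computed for each gold snippet (the encoder used for scoring and the handling of questions with fewer than nine snippets are not shown):

```python
def three_snippet_groups(snippets, scores):
    """Partition a question's gold snippets into top-3 / middle-3 / bottom-3
    by similarity score. Assumes at least nine snippets; fewer would need a
    fallback not shown here.
    """
    ranked = [s for s, _ in sorted(zip(snippets, scores),
                                   key=lambda p: p[1], reverse=True)]
    mid = len(ranked) // 2
    return {
        "top": ranked[:3],               # most similar formulations
        "middle": ranked[mid - 1: mid + 2],  # moderately similar
        "bottom": ranked[-3:],           # most tangential formulations
    }
```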
For snippet extraction, we reformat abstracts to highlight gold evidence spans and supplement training with negative/borderline examples, teaching models both to extract relevant text and abstain when no answer is present.
To ensure both high recall and high precision, we curate a training set from BioASQ and additional PubMed abstracts as follows:
- Positive instances: For each question, every gold-standard relevant abstract forms a training pair. The ground-truth snippets are annotated, and the model learns to mark the corresponding spans.
- Negative and borderline instances: To combat overgeneration, we include hard negative examples. For each question, among the top-k BM25 candidates (excluding gold documents and documents from 2024 onward), we randomly select one abstract ranked 11–30 that is not annotated as relevant. These "borderline negatives" are semantically close but lack a true answer. The model is expected to output only [BS][ES] (an empty snippet) in such cases, learning to abstain.
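The borderline-negative sampling described above can be sketched as follows; the helper name and seed handling are illustrative, not taken from the repository code:

```python
import random

def sample_borderline_negative(bm25_ranking, gold_ids, excluded_ids, seed=0):
    """Pick one 'borderline negative' abstract: ranked 11-30 by BM25,
    not gold-relevant, and not in the excluded set (e.g., 2024+ documents).

    Returns (doc_id, target) where the target output teaches abstention:
    an empty [BS][ES] snippet. Returns None if no candidate qualifies.
    """
    window = bm25_ranking[10:30]  # ranks 11-30 (0-indexed slice)
    pool = [d for d in window if d not in gold_ids and d not in excluded_ids]
    if not pool:
        return None
    rng = random.Random(seed)
    return rng.choice(pool), "[BS][ES]"
```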
For answer generation, we aggregate all available context (questions, evidence snippets, and abstracts) and train models to produce type-specific answers in the required BioASQ format.
To train our multitask model, we construct a dataset from BioASQ gold-standard annotations, following a structured process. Each training instance includes:
- The biomedical question.
- Explicit indication of the question type (yes/no, factoid, list, summary).
- A controlled context comprising either three or five gold-standard relevant abstracts, truncated to meet prompt-length constraints and focused explicitly on the gold-standard snippets, which are tagged with [BS] and [ES] markers.
This approach encourages the model to focus solely on informative passages, significantly reducing the risk of hallucinations or irrelevant outputs. The output labels for each instance are tailored specifically to the answer type:
- Yes/no: concise answers (yes, no).
- Factoid: up to five named entities.
- List: comprehensive enumeration of relevant items.
- Summary: a concise and informative ideal-answer paragraph.
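A minimal sketch of how such a training instance might be assembled; the prompt template and helper name are illustrative assumptions, and only the [BS]/[ES] tagging convention comes from the paper:

```python
def build_answer_gen_instance(question, qtype, abstracts, gold_spans,
                              max_abstracts=5):
    """Assemble one multi-task training prompt: the question, its explicit
    type, and up to max_abstracts abstracts with gold snippets tagged
    [BS]...[ES].

    gold_spans maps abstract index -> list of gold snippet strings.
    """
    context_parts = []
    for i, abstract in enumerate(abstracts[:max_abstracts]):
        text = abstract
        for span in gold_spans.get(i, []):
            text = text.replace(span, f"[BS]{span}[ES]")  # tag gold evidence
        context_parts.append(text)
    context = "\n\n".join(context_parts)
    return f"Question ({qtype}): {question}\n\nContext:\n{context}\n\nAnswer:"
```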
Purpose. Learn to generate answer-like snippets given a question. These snippets are not used to rewrite queries; they are encoded and used as semantic anchors during dense reranking.
python fine-tuning/fine_tuning/synth_query_expansion/synth-query_exp_ft.py
python inference/synth_query_expansion/synth-query_exp_inf.py

Purpose. Mark only the evidence spans relevant to the question, and abstain when none exist, yielding higher precision and better downstream faithfulness.

python fine-tuning/fine_tuning/snippet_extraction/snippet_ext_ft.py
python inference/snippet_extraction/snippet_ext_inf.py

Purpose. Produce yes/no, factoid, list, and ideal answers using curated evidence.
python fine-tuning/fine_tuning/answer_generation/answer_gen_ft.py
or
python fine-tuning/fine_tuning/answer_generation/answer_gen_ft_Phi4.py

python inference/answer_generation/answer_gen_inf.py
or
python inference/answer_generation/answer_gen_inf_Phi4.py

See official evaluation:
- Phase A (Document Retrieval and Snippet Retrieval)
- Phase A+ (Answer Generation with Retrieved Documents and Snippets)
- Phase B (Answer Generation with Gold Documents and Snippets)
If you use this repository, please cite:
@inproceedings{UniTor,
author = {Federico Borazio and Andriy Shcherbakov and Danilo Croce and Roberto Basili},
title = {UniTor at BioASQ 2025: Modular Biomedical QA with Synthetic Snippets and Multiple Task Answer Generation},
booktitle = {Information Access Evaluation Meets Multilinguality, Multimodality, and Visualization:
16th International Conference of the CLEF Association, CLEF 2025, Madrid, Spain,
September 9–12, 2025, Proceedings},
year = {2025},
pages = {165--197},
location = {Madrid, Spain}
}
Unless otherwise stated, the code is released under MIT License.
Check third-party dependencies and datasets for their respective licenses.
For questions or issues, please open a GitHub issue or contact the authors.



