Skip to content

Latest commit

 

History

History

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 
 
 
 
 

README.md

DEMON Benchmark

To comprehensively benchmark the demonstrative instruction following ability, we extensively gather a wide variety of multi-modal datasets from different fields and scenarios.

example Download Link:

DEMON has three important properties:

  • Demonstrative vision-language context: all the instructions contain sequences of inter-related images and texts, such as storyboards with scripts, and textbooks with diagrams.
  • Diverse forms of complex instructions: the instructions range from designing panels for comics, to discovering differences between surveillance images, and to conversational embodied tasks.
  • Vast range of instruction-following scenarios: the benchmark covers multiple practical scenarios, including cartoons, indus- trial visuals, driving recordings, recipes, etc
Tasks Scenarios Images Instructions Avg. Images/Instructions Avg. Words/Instruction
DEMON-Core 29 19 62,813 18,176 3.46 92.69
DEMON-Full 31 20 1,769,744 477,716 3.70 97.58

Benchmark Construction

Data Format

All task instances are given to the models in a unified instruction-response form to easily achieve zero-shot generalization on various tasks. Formally, each instance in DEMON is composed of the following components:

  • Task_Instruction: provides a complete natural language definition of a given task, including the input/output format and the task objective.
  • Task_Instance: is a concrete sample of a given task that consists of interleaved image-text sequential context (e.g., visually-rich textbooks and webpages, specific questions about the context).
  • Response: represents the target output in natural language for a given task instruction and task instance. For classification tasks, we convert the class labels as options into the instruction and ask the model to output the option index in natural language as repsonse.

Without any specific emphasis, we use the term instruction to refer to the combination of Task_Instruction and Task_Instance. For each task, we manually design 10 instruction templates in natural language to increase the diversity.

Task Collection and Categorization

To comprehensively benchmark the interleaved vision-language instruction-following ability, we extensively gather a wide variety of multi-modal datasets from different fields and scenarios. Our DEMON benchmark covers 31 tasks of 7 categories across various scenarios (i.e., surveillance, webpage, industrial, cartoon, etc.). Note that some datasets (i.e., ALFRED, VISION, OCR-VQA) are not originally proposed for the task that involves interleaved image-text sequences. To further increase task diversity, we meticulously design certain rules to transform them to desired tasks, strictly following the original annotation information.

Evaluation Protocols

Thanks to the unified task format of DEMON, all tasks can be evaluated in a zero-shot manner. For the open-ended generation tasks, we adopt ROUGE-L for evaluation. For the tasks that require the models to output option indexes, we take the Accuracy as the evaluation metric. While well-formated options are provided, we empirically observe that many MLLMs struggle to strictly follow instructions to output the option indexes but generate free-form text. Thus, when models do not exactly output the required options, we match their outputs to one of the given options based on the TF-IDF distance, which we find is more robust than model-based methods (ChatGPT and SentenceBERT). Since we explore quantities of tasks, we take maximally 500 instances per task for evaluation efficiency and exclude several datasets that are difficult to obtain and are subject to strict copyright restrictions (referred as DEMON-Core). Meanwhile, we report the full version of the benchmark to facilitate future research on large-scale multi-modal instruction tuning (referred as DEMON-Full).

Evaluation Result

Model Version Multi Modal Dialogue Visual Story Telling List Visual Relation Inference Multi Modal Cloze Knowledge Grounded QA Text Rich Images QA Multi Image Reasoning
BLIP-2 vicuna-7b 11.96 20.10 3.67 18.25 39.73 30.53 39.53
InstructBlip vicuna-7b 33.58 24.41 11.49 21.20 47.40 44.40 48.55
LLaMA-Adapter V2 llama-7b 14.22 17.57 13.51 18.00 44.80 32.00 44.03
LLaVA vicuna-7b 7.79 10.70 8.27 15.85 36.20 28.33 41.53
MiniGPT-4 vicuna-7b 13.70 17.07 7.95 16.60 30.27 26.40 43.50
mPLUG-Owl llama-7b 12.67 19.33 5.40 16.25 33.27 32.47 42.50
OpenFlamingo llama-7b 16.88 24.22 13.85 21.65 32.00 30.60 41.63
Otter llama-7b 15.37 15.57 11.39 16.00 41.67 27.73 43.85
Cheetah llama-2-7b-chat 42.70 24.76 25.50 22.95 51.00 44.93 48.68
Cheetah vicuna-7b 37.50 25.20 25.90 22.15 48.60 44.93 50.28

Contact

If you want to update your model in I4-Benchmark, feel free to contact us via email zhiqige2000@gmail.com.

License

The I4-Benchmark is released under the CC-BY-NC 4.0 license.