Comparison of Embodied and Physical AI benchmarks. We summarize key features across benchmarks, including input modalities, question formats, the presence of step-by-step reasoning trails, the number of annotated questions, annotation methods, the diversity of tasks and embodiments, and the types of robots involved. Our benchmark (last row) is distinguished by explicitly incorporating reasoning traces, supporting a variety of question types, and covering a broader set of tasks and robotic platforms than prior work.
Performance of state-of-the-art open-source and closed-source models, reporting both reasoning accuracy and final-answer accuracy. Reasoning steps are evaluated in detail using our proposed evaluation criteria.
BibTeX:
@misc{dissanayake2025goodfoundationmodelsstepbystep,
      title={How Good are Foundation Models in Step-by-Step Embodied Reasoning?},
      author={Dinura Dissanayake and Ahmed Heakl and Omkar Thawakar and Noor Ahsan and Ritesh Thawkar and Ketan More and Jean Lahoud and Rao Anwer and Hisham Cholakkal and Ivan Laptev and Fahad Shahbaz Khan and Salman Khan},
      year={2025},
      eprint={2509.15293},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.15293},
}
