Comparison of Embodied and Physical AI benchmarks. We summarize key features across benchmarks, including input modalities, question formats, the presence of step-by-step reasoning trails, the number of annotated questions, annotation methods, the diversity of tasks and embodiments, and the types of robots involved. Our benchmark (last row) is distinguished by explicitly incorporating reasoning traces, supporting a variety of question types, and covering a broader set of tasks and robotic platforms than prior work.
Performance of state-of-the-art open-source and closed-source models, reporting both reasoning accuracy and final-answer accuracy. Reasoning steps are evaluated in detail using our proposed evaluation criteria.
BibTeX:
@misc{dissanayake2025goodfoundationmodelsstepbystep,
      title={How Good are Foundation Models in Step-by-Step Embodied Reasoning?},
      author={Dinura Dissanayake and Ahmed Heakl and Omkar Thawakar and Noor Ahsan and Ritesh Thawkar and Ketan More and Jean Lahoud and Rao Anwer and Hisham Cholakkal and Ivan Laptev and Fahad Shahbaz Khan and Salman Khan},
      year={2025},
      eprint={2509.15293},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.15293},
}
