VLAExplain is an interpretability toolkit designed to help users visually understand the inner workings of Vision-Language-Action (VLA) models.
Currently in its early stage, it supports two types of attention analysis for the Pi05 model:
- ActionCode Attribution: Understand how a predicted action attends to inputs from vision, language, robot state, and future actions.
- Cross-Modal Attention in Language Model: Analyze attention relationships between modalities—e.g., text → image/state, image → text/state, state → text/image.
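To make the first analysis concrete, here is a minimal sketch of the kind of aggregation ActionCode Attribution performs: summing one attention row into per-modality scores. The token spans, function name, and sequence layout below are illustrative assumptions, not Pi05's actual tokenization.

```python
import numpy as np

def modality_attention(attn_row, spans):
    """Aggregate one query token's attention weights into per-modality mass.

    attn_row: (seq_len,) softmax-normalized attention weights.
    spans: dict mapping modality name -> (start, end) token indices.
    """
    return {name: float(attn_row[s:e].sum()) for name, (s, e) in spans.items()}

# Illustrative layout: 300 image tokens, 20 text, 1 state, 10 action tokens.
spans = {"image": (0, 300), "text": (300, 320), "state": (320, 321), "action": (321, 331)}
rng = np.random.default_rng(0)
attn = rng.random(331)
attn /= attn.sum()  # normalize so the row sums to 1, like a softmax output
scores = modality_attention(attn, spans)
```

Because the spans partition the sequence, the per-modality scores always sum to the row's total attention mass.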
✨ Many features and code-quality improvements are still in progress and will be added gradually.
Check out our demo to see VLAExplain in action:
Follow these steps to set up the environment:
```bash
git clone https://github.com/huggingface/lerobot.git
cd lerobot
git checkout v0.4.4  # Recommended version
```

Install in editable mode as per LeRobot’s installation guide:

```bash
pip install -e .
```

Then install the Pi05 and Libero dependencies:

- Follow the Libero setup and install the Libero dependencies:

  ```bash
  pip install -e ".[libero]"
  ```

- Follow the Pi05 policy setup and install the Pi05-specific dependencies:

  ```bash
  pip install -e ".[pi]"
  ```

Finally, install the visualization app requirements:

```bash
cd ..
pip install -r requirements.txt
```

Copy and overwrite the following files:
```bash
# Model files
cp src/policies/pi05/model/* lerobot/src/lerobot/policies/pi05/

# Evaluation script
cp src/policies/pi05/infer/lerobot_eval.py lerobot/src/lerobot/scripts/lerobot_eval.py
```

Then run the Libero evaluation:

```bash
bash libero_eval.sh
```

Data is saved under the path specified by the environment variable `LEROBOT_DATA_DIR`, containing:
- `expert_attention/` — Action-to-modality attention weights (image, text, state)
- `language_attention/` — Cross-modal attention within the language model
- `language_info/` — `input_ids` and state values at each action step
- `raw_images/` — Original RGB frames from robot execution
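A small helper can locate saved outputs under `LEROBOT_DATA_DIR`. The per-step file name (`step_000.npy`) is an assumption about the on-disk layout for illustration, not something the tool documents:

```python
import os

def step_path(data_dir, subdir, step):
    """Build the path to one saved step, e.g. expert_attention/step_003.npy.
    The zero-padded naming scheme is a hypothetical convention."""
    return os.path.join(data_dir, subdir, f"step_{step:03d}.npy")

data_dir = os.environ.get("LEROBOT_DATA_DIR", "./outputs")
# e.g. attn = np.load(step_path(data_dir, "expert_attention", 0))
```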
- Set `eval.batch_size = 1` and `eval.n_episodes = 1`
- Choose one `env.task_ids` for cleaner analysis (multi-episode support will be added later)
Each inference step consumes ~420 MB of disk space due to high-resolution attention maps.
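Given the ~420 MB per-step figure above, a quick back-of-the-envelope helper for planning disk space (the constant comes from the note above; the function itself is illustrative):

```python
MB_PER_STEP = 420  # approximate per-step footprint quoted above

def episode_disk_gb(n_steps):
    """Estimated disk usage in GB for n_steps inference steps."""
    return n_steps * MB_PER_STEP / 1024
```

For example, a 100-step episode needs roughly 41 GB of free space.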
Launch the attention visualization tool with:
```bash
bash run_app.sh
```

| Feature | Action Attribution | Cross-Modal Attention |
|---|---|---|
| Step Selector | ✅ | ✅ |
| Time Step Filter | ✅ | ❌ |
| Head Number | ✅ | ✅ |
| Layer Index | ✅ | ✅ |
| Normalization Method | ✅ | ✅ |
| Interpolation Method | ✅ | ✅ |
| Color Map | ✅ | ✅ |
| Opacity Control | ✅ | ✅ |
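The normalization, interpolation, and opacity controls listed above can be sketched as a simple overlay pipeline. This is a minimal illustration using min-max normalization, nearest-neighbor upsampling, and a single-channel heatmap; the app's actual methods and colormaps may differ:

```python
import numpy as np

def minmax(a):
    """Min-max normalize an array to [0, 1] (one possible normalization method)."""
    lo, hi = a.min(), a.max()
    return (a - lo) / (hi - lo) if hi > lo else np.zeros_like(a)

def upsample_nearest(a, out_h, out_w):
    """Nearest-neighbor upsampling of a 2-D map (one interpolation choice)."""
    h, w = a.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return a[rows][:, cols]

def overlay(image, attn, opacity=0.5):
    """Alpha-blend an attention heatmap onto an RGB frame.
    image: (H, W, 3) float in [0, 1]; attn: small 2-D attention map."""
    heat = minmax(upsample_nearest(attn, *image.shape[:2]))
    # Red-channel heatmap; a real colormap would map heat -> RGB instead.
    heatmap = np.stack([heat, np.zeros_like(heat), np.zeros_like(heat)], -1)
    return (1 - opacity) * image + opacity * heatmap
```

Swapping `minmax`, `upsample_nearest`, or the heatmap coloring corresponds to changing the Normalization Method, Interpolation Method, and Color Map controls.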
| Modality | Action Attribution | Cross-Modal Attention |
|---|---|---|
| Image | ✅ | ✅ |
| Text | ✅ | ✅ |
| State | ✅ | ✅ |
| Action | ✅ | ❌ |
If you find VLAExplain useful, please give us a ⭐ on GitHub!
We warmly welcome everyone to contribute to the development of VLAExplain!
If you use this code in your research or project, please cite:
```bibtex
@misc{shi2026vlaexplain,
  title = {{VLAExplain}: Interpreting Vision-Language-Action ({VLA}) Models},
  author = {Shi, Yafei},
  year = {2026},
  publisher = {GitHub},
  howpublished = {\url{https://github.com/bjrobotnewbie/VLAExplain}}
}
```

This project is licensed under the AGPLv3 License.
We sincerely thank the developers of LeRobot and Gradio. This project builds directly on their excellent open-source frameworks and contributions to the robotics community.
