ZsiBot VLN is a framework for developing, testing, and deploying Vision-Language Navigation (VLN) algorithms, unifying the MATRiX simulation platform and VLN algorithms into a single, extensible pipeline. It also includes a zero-shot VLN baseline model that serves as both a reference implementation and a practical starting point for research or product development.
```
zsibot_vln/
├── agents/
│   ├── zeroshot/
│   │   └── unigoal/      # baseline VLN model
│   └── finetune/         # TODO
├── assets/
├── bridge/
│   └── src/              # ROS2 ↔ ZMQ bidirectional bridging module
├── configs/              # configuration files
├── docs/
├── envs/                 # MATRiX environment, adaptable to real-world robots
├── goals/                # example image goals
├── llms/                 # prepared LLM/VLM HuggingFace models
├── outputs/
├── third_party/
├── main.py
├── requirements.txt
└── README.md
```
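The `bridge/` module relays messages between ROS 2 topics and ZMQ sockets in both directions. A toy, stdlib-only sketch of that relay pattern (queues stand in for the ROS 2 subscription and the ZMQ socket — the actual implementation lives in `bridge/src/`):

```python
import queue
import threading

# Toy stand-ins: in the real bridge these are a ROS 2 subscription and a ZMQ socket.
ros2_topic = queue.Queue()
zmq_socket = queue.Queue()

def relay(src: queue.Queue, dst: queue.Queue, stop: threading.Event) -> None:
    """Forward every message from src to dst until asked to stop."""
    while not stop.is_set():
        try:
            msg = src.get(timeout=0.1)
        except queue.Empty:
            continue
        dst.put(msg)  # real bridge: serialize the ROS 2 message and send it over ZMQ

stop = threading.Event()
worker = threading.Thread(target=relay, args=(ros2_topic, zmq_socket, stop), daemon=True)
worker.start()

ros2_topic.put({"topic": "/camera/image_raw", "data": b"\x00\x01"})
forwarded = zmq_socket.get(timeout=1.0)
stop.set()
worker.join()
```

A second relay thread running in the opposite direction gives the bidirectional behavior; the real module additionally handles ROS 2 message (de)serialization.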
- CUDA-capable GPU (NVIDIA RTX 4090 recommended when running local LLMs)
- Recommended: use a VPN for Git and model-weight downloads if direct access is slow or blocked
```shell
git clone git@github.com:zsibot/zsibot_vln.git
sudo apt install libzmq3-dev
```

Follow the MATRiX installation instructions and then:

```shell
# Update matrix/config.json
cp zsibot_vln/configs/config.json matrix/config/
```

**Option 1: HuggingFace**
```shell
conda create -n smol python=3.9 -y
conda activate smol
conda install --freeze-installed -c nvidia cuda-toolkit=12.4 -y
conda install --freeze-installed pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia -y
conda install -c conda-forge libstdcxx-ng
pip install -U transformers datasets evaluate accelerate timm
pip install num2words fastapi uvicorn hf_xet
pip install --no-cache-dir --no-build-isolation --verbose flash-attn
# download weights:
python zsibot_vln/llms/huggingface_models/smolvlm2_256m_video_instruct/smolvlm2_256m_video_instruct.py
```

**Option 2: Cloud LLM/VLM API** (fetch an API key from e.g. Aliyun Bailian)
```shell
export DASHSCOPE_API_KEY='YOUR_DASHSCOPE_API_KEY'
```

**Option 3: Ollama**

```shell
curl -fsSL https://ollama.com/install.sh | sh
ollama pull gemma3:4b  # or any other model available from Ollama
```

Install and run the baseline following the instructions:
➡️ VLN Baseline Installation Guide
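Options 2 and 3 above both expose OpenAI-compatible chat endpoints. A minimal stdlib-only sketch of building such a request — the endpoint URLs and model names here are assumptions for illustration, not values taken from this repo:

```python
import json
import os
import urllib.request

# Assumed OpenAI-compatible endpoints (verify against each provider's docs).
DASHSCOPE_URL = "https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions"
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def build_chat_request(url: str, model: str, user_text: str,
                       api_key: str = "") -> urllib.request.Request:
    """Build an HTTP request for an OpenAI-compatible /chat/completions endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": user_text}]}
    headers = {"Content-Type": "application/json"}
    if api_key:  # local Ollama needs no key; DashScope uses DASHSCOPE_API_KEY
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(url, data=json.dumps(payload).encode(), headers=headers)

# Local Ollama (Option 3) with the model pulled above:
req = build_chat_request(OLLAMA_URL, "gemma3:4b", "Describe a green plant in one sentence.")

# Cloud API (Option 2); "qwen-plus" is an example model name:
req_cloud = build_chat_request(DASHSCOPE_URL, "qwen-plus", "Hello",
                               os.environ.get("DASHSCOPE_API_KEY", ""))

# To actually send one (requires a reachable server):
# with urllib.request.urlopen(req) as resp:
#     reply = json.loads(resp.read())["choices"][0]["message"]["content"]
```

The same request shape works for both backends, which is what lets the pipeline swap between a local server, Ollama, and a cloud API.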
```shell
# run local server on shell 0 (can be skipped when using Ollama or a cloud API)
conda activate smol
python zsibot_vln/llms/huggingface_models/smolvlm2_256m_video_instruct/server.py
```

```shell
# run MATRiX on shell 1 (no conda)
cd matrix && export ROS_DOMAIN_ID=0 && source /opt/ros/humble/setup.bash && ./run_sim.sh 1 6
```
```shell
# ==== NOTE ====
# Remember to stand the robot up using LB+Y (controller mode) or "u" (keyboard control mode).
```

```shell
# run env_bridge on shell 2 (no conda)
cd zsibot_vln/bridge && export ROS_DOMAIN_ID=0 && source /opt/ros/humble/setup.bash && colcon build && source install/setup.bash && ros2 run env_bridge env_bridge
```

```shell
# run mc_sdk_bridge on shell 3 (no conda)
cd zsibot_vln/bridge && export ROS_DOMAIN_ID=0 && source /opt/ros/humble/setup.bash && colcon build && source install/setup.bash && ros2 run mc_sdk_bridge mc_sdk_bridge
```

```shell
# run the baseline model on shell 4
cd zsibot_vln && conda activate zsibot_vln
```
```shell
# search using an open-vocabulary text goal
python main.py --goal_type text --text_goal "green plant"

# or search using an image goal
python main.py --goal_type ins_image --image_goal_path ./goals/bed.jpg
```

This project builds upon and acknowledges the following works:
- MATRiX – a robotic simulation framework featuring realistic scene rendering and physical dynamics.
- UniGoal – a zero-shot VLN method leveraging LLMs.
This project is licensed under the BSD 3-Clause License. See the LICENSE file for details.
