Our MoIIE progressively improves while other architectures, especially Dense and Modality MoE, encounter performance limitations. Moreover, with larger training datasets, our MoIIE consistently outperforms all alternatives, with the performance gap widening between MoIIE and both Dense and Modality MoE. This suggests that the MoIIE framework offers superior scaling properties for multi-modal learning, effectively leveraging larger datasets to enhance representation power without the parameter inefficiency of dense models or the limited cross-modal reasoning of strictly modality-separated experts.
- [2025/08/13] 🔥 We have released MoIIE. Check out the paper for details.
- [2025/08/01] 🔥 We have released the training and evaluation code.
CUDA and cuDNN

We use CUDA 11.8 and cuDNN 8.7.0, via NVIDIA's CUDA Docker image:

```shell
docker pull nvcr.io/nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04
```

CUDA 12 works as well.
Create a conda virtual environment and activate it:

```shell
conda create -n bunny python=3.10
conda activate bunny
pip install --upgrade pip  # enable PEP 660 support
```
Install apex

```shell
# https://github.com/NVIDIA/apex#from-source
pip install ninja
git clone https://github.com/NVIDIA/apex
cd apex
# if pip >= 23.1 (ref: https://pip.pypa.io/en/stable/news/#v23-1), which supports multiple `--config-settings` with the same key
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
# otherwise
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```
Install flash-attention

```shell
# https://github.com/Dao-AILab/flash-attention?tab=readme-ov-file#installation-and-features
pip install packaging
pip install flash-attn --no-build-isolation
```
Install Bunny and other requirements

```shell
cd Bunny
pip install -e .
```
MoIIE is trained on 8 A100 GPUs. With fewer GPUs or less memory, reduce `per_device_train_batch_size` and increase `gradient_accumulation_steps` accordingly, always keeping the global batch size the same: global_batch_size = per_device_train_batch_size × gradient_accumulation_steps × num_gpus.
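As a sanity check, the batch-size bookkeeping above can be sketched in a few lines (the numbers below are illustrative, not the exact training settings):

```python
def global_batch_size(per_device: int, grad_accum: int, num_gpus: int) -> int:
    """Effective batch size seen by the optimizer per update step."""
    return per_device * grad_accum * num_gpus

# Reference setup: 8 GPUs (illustrative per-device batch size of 16).
ref = global_batch_size(per_device=16, grad_accum=1, num_gpus=8)

# Fewer GPUs / less memory: halve the per-device size, raise accumulation.
adjusted = global_batch_size(per_device=8, grad_accum=4, num_gpus=4)

assert ref == adjusted == 128  # same effective batch size either way
```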
- Experiment model components
| Vision Encoders | Download Link |
|---|---|
| siglip-so400m-patch14-384 | google/siglip-so400m-patch14-384 |

| MODEL_TYPE | LLM | Download Link |
|---|---|---|
| phi-3 | Phi-3-mini-4k-instruct | microsoft/Phi-3-mini-4k-instruct |
| llama3-8b | Meta-Llama-3-8B-Instruct | meta-llama/Meta-Llama-3-8B-Instruct |
MoIIE training consists of two stages:
Pretrain stage: data is used to connect a frozen pretrained vision encoder to a frozen LLM, and only the connector is trained.
The training script with DeepSpeed ZeRO-2 can be found in scripts/train/pretrain.sh. The global batch size is 256.
We utilize Bunny-pretrain-LAION-2M. The dataset is available here.
Visual instruction tuning stage (sparse training of all model parameters): data is used to teach the model to follow multi-modal instructions, where the connector, the learnable LLM parameters, the vision encoder, and the MoE modules are updated.
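The per-stage trainability described above can be sketched as a simple flag table (the module names are illustrative placeholders, not the repo's actual attribute names):

```python
# Which parameter groups receive gradients in each MoIIE training stage.
# Module names are illustrative; in stage 2, vision-encoder tuning is
# optional in practice (controlled by a flag at launch time).
STAGES = {
    "pretrain": {             # stage 1: align vision features to the LLM
        "vision_encoder": False,
        "connector": True,    # only the connector is trained
        "llm": False,
        "moe_experts": False,
    },
    "instruction_tuning": {   # stage 2: sparse training of all components
        "vision_encoder": True,
        "connector": True,
        "llm": True,
        "moe_experts": True,
    },
}

def trainable_modules(stage: str) -> list:
    """List the parameter groups updated in a given stage."""
    return [name for name, on in STAGES[stage].items() if on]

assert trainable_modules("pretrain") == ["connector"]
```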
First, execute the following command to initialize the dense LLM backbone as its corresponding sparse MoE LLM backbone:
```shell
python convert_moe.py \
    --language-model-path path/to/base_llm_model \
    --num_local_experts 4 \
    --num_experts_per_tok 2 \
    --vis_router_aux_loss_coef 0.001 \
    --lan_router_aux_loss_coef 0.001 \
    --output_vis_router_logits True \
    --output_lan_router_logits True \
    --save-model-path /path/to/base_llm_moe_model \
    --moe_architecture bunny-mm-phi3-moe-s
```

Then, the training script with DeepSpeed ZeRO-3 can be found in scripts/train/finetune_all_moe.sh. The global batch size is 128.
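The converter's key hyper-parameters map onto standard sparse-MoE routing: each token's router scores `num_local_experts` (4) experts and dispatches to the top `num_experts_per_tok` (2), while an auxiliary load-balancing loss (weighted by the `*_router_aux_loss_coef` values) discourages expert collapse. A minimal pure-Python sketch of that mechanism, not the repo's implementation:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route(logits, k=2):
    """Return the top-k expert indices and their renormalized weights."""
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    z = sum(probs[i] for i in top)
    return top, [probs[i] / z for i in top]

def load_balancing_loss(all_logits, k=2, num_experts=4):
    """Switch-style aux loss: num_experts * sum_e(frac_tokens_e * mean_prob_e).

    ~1.0 when routing is perfectly balanced; larger when experts collapse.
    """
    counts = [0.0] * num_experts   # fraction of routed slots per expert
    mean_p = [0.0] * num_experts   # mean router probability per expert
    n = len(all_logits)
    for logits in all_logits:
        probs = softmax(logits)
        top, _ = route(logits, k)
        for e in top:
            counts[e] += 1.0 / (n * k)
        for e in range(num_experts):
            mean_p[e] += probs[e] / n
    return num_experts * sum(c * p for c, p in zip(counts, mean_p))

experts, weights = route([2.0, 0.5, 1.0, -1.0], k=2)
assert experts == [0, 2] and abs(sum(weights) - 1.0) < 1e-9
```

The aux loss would be scaled by the 0.001 coefficients above and added (separately for vision and language tokens, per the two `--*_router_aux_loss_coef` flags) to the task loss.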
We utilize MGM-Instruct, Bunny-v1_1, LLaVA-Next and LLaVA-OV data for experiments.
Run

Update `--model_name_or_path` and `--vision_tower` to the paths of the base_llm_moe_model and the vision encoder, respectively. Update `MODEL_TYPE`, `PRETRAIN_DIR` and `OUTPUT_DIR` accordingly. The global batch size is 128. For `MODEL_TYPE = mms-phi-3-moe` (MoIIE) / `phi-3-moe` (Vanilla MoE) / `m-phi-3` (Modality MoE) / `phi-3` (Dense), change `--version` to `minicpm`/`phi3`/`llama`, too. The S$^2$-Wrapper is enabled if `--use_s2 True` is added. The vision encoder is tuned if `--unfreeze_vision_tower True` is added. To tune only the MoE layers, add `--moe_enable True`.
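To make the optional-flag bookkeeping explicit, here is a hypothetical helper (not part of the repo) that assembles the optional flags described above into an argument list for the launch script:

```python
def build_extra_args(use_s2=False, unfreeze_vision_tower=False, moe_enable=False):
    """Collect the optional finetuning flags described in this README."""
    args = []
    if use_s2:
        args += ["--use_s2", "True"]                 # enable the S^2-Wrapper
    if unfreeze_vision_tower:
        args += ["--unfreeze_vision_tower", "True"]  # also tune the vision encoder
    if moe_enable:
        args += ["--moe_enable", "True"]             # tune only the MoE layers
    return args

assert build_extra_args(use_s2=True) == ["--use_s2", "True"]
```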
See evaluation_full.md.
- LLaVA: the dataset we utilized.
- Bunny: the codebase we built upon and the dataset we utilized.
- LLaVA-Next: the dataset we utilized.
- MGM: the dataset we utilized.
- Cambrian-1: the evaluation codebase we utilized.



