Our MoIIE progressively improves while other architectures, especially Dense and Modality MoE, encounter performance limitations. Moreover, with larger training datasets, our MoIIE consistently outperforms all alternatives, with the performance gap widening between MoIIE and both Dense and Modality MoE. This suggests that the MoIIE framework offers superior scaling properties for multi-modal learning, effectively leveraging larger datasets to enhance representation power without the parameter inefficiency of dense models or the limited cross-modal reasoning of strictly modality-separated experts.
- [2025/08/13] 🔥 We have released MoIIE. Check out the paper for details.
- [2025/08/01] 🔥 We have released the training and evaluation code.
CUDA and cuDNN

We use CUDA 11.8 and cuDNN 8.7.0, via NVIDIA's CUDA Docker image:

```shell
docker pull nvcr.io/nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04
```

CUDA 12 works as well.
Create a conda virtual environment and activate it:

```shell
conda create -n bunny python=3.10
conda activate bunny
pip install --upgrade pip  # enable PEP 660 support
```
Install apex

```shell
# https://github.com/NVIDIA/apex#from-source
pip install ninja
git clone https://github.com/NVIDIA/apex
cd apex
# if pip >= 23.1 (ref: https://pip.pypa.io/en/stable/news/#v23-1), which supports multiple `--config-settings` with the same key
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
# otherwise
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```
Install flash-attention

```shell
# https://github.com/Dao-AILab/flash-attention?tab=readme-ov-file#installation-and-features
pip install packaging
pip install flash-attn --no-build-isolation
```
Install Bunny and other requirements

```shell
cd Bunny
pip install -e .
```
MoIIE is trained on 8 A100 GPUs. With fewer GPUs or less memory, reduce `per_device_train_batch_size` and increase `gradient_accumulation_steps` accordingly, always keeping the global batch size the same: global_batch_size = per_device_train_batch_size × gradient_accumulation_steps × num_gpus.
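As a sanity check, the batch-size bookkeeping above can be sketched in a few lines (the numbers below are illustrative, not the exact training settings):

```python
def global_batch_size(per_device: int, grad_accum: int, num_gpus: int) -> int:
    """Effective batch size seen by the optimizer per update step."""
    return per_device * grad_accum * num_gpus

# Reference setup: 8 GPUs (illustrative per-device batch size of 16).
ref = global_batch_size(per_device=16, grad_accum=1, num_gpus=8)

# Fewer GPUs / less memory: halve the per-device size, raise accumulation.
adjusted = global_batch_size(per_device=8, grad_accum=4, num_gpus=4)

assert ref == adjusted == 128  # same effective batch size either way
```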
- Experiment model components
| Vision Encoders | Download Link |
|---|---|
| siglip-so400m-patch14-384 | google/siglip-so400m-patch14-384 |

| MODEL_TYPE | LLM | Download Link |
|---|---|---|
| phi-3 | Phi-3-mini-4k-instruct | microsoft/Phi-3-mini-4k-instruct |
| llama3-8b | Meta-Llama-3-8B-Instruct | meta-llama/Meta-Llama-3-8B-Instruct |
MoIIE training consists of two stages:
Pretrain stage: data is used to connect a frozen pretrained vision encoder to a frozen LLM, and only the connector is trained.
The training script with DeepSpeed ZeRO-2 can be found in scripts/train/pretrain.sh. The global batch size is 256.
We utilize Bunny-pretrain-LAION-2M. The dataset is available here.
Visual instruction tuning stage (sparse training of all model parameters): data is used to teach the model to follow multi-modal instructions, where the connector, the learnable LLM parameters, the vision encoder, and the MoE modules are updated.
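The per-stage trainability described above can be sketched as a simple flag table (the module names are illustrative placeholders, not the repo's actual attribute names):

```python
# Which parameter groups receive gradients in each MoIIE training stage.
# Module names are illustrative; in stage 2, vision-encoder tuning is
# optional in practice (controlled by a flag at launch time).
STAGES = {
    "pretrain": {             # stage 1: align vision features to the LLM
        "vision_encoder": False,
        "connector": True,    # only the connector is trained
        "llm": False,
        "moe_experts": False,
    },
    "instruction_tuning": {   # stage 2: sparse training of all components
        "vision_encoder": True,
        "connector": True,
        "llm": True,
        "moe_experts": True,
    },
}

def trainable_modules(stage: str) -> list:
    """List the parameter groups updated in a given stage."""
    return [name for name, on in STAGES[stage].items() if on]

assert trainable_modules("pretrain") == ["connector"]
```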
First, execute the following command to initialize the dense LLM backbone as its corresponding sparse MoE LLM backbone:
```shell
python convert_moe.py \
    --language-model-path path/to/base_llm_model \
    --num_local_experts 4 \
    --num_experts_per_tok 2 \
    --vis_router_aux_loss_coef 0.001 \
    --lan_router_aux_loss_coef 0.001 \
    --output_vis_router_logits True \
    --output_lan_router_logits True \
    --save-model-path /path/to/base_llm_moe_model \
    --moe_architecture bunny-mm-phi3-moe-s
```

Then, the training script with DeepSpeed ZeRO-3 can be found in scripts/train/finetune_all_moe.sh. The global batch size is 128.
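The converter's key hyper-parameters map onto standard sparse-MoE routing: each token's router scores `num_local_experts` (4) experts and dispatches to the top `num_experts_per_tok` (2), while an auxiliary load-balancing loss (weighted by the `*_router_aux_loss_coef` values) discourages expert collapse. A minimal pure-Python sketch of that mechanism, not the repo's implementation:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route(logits, k=2):
    """Return the top-k expert indices and their renormalized weights."""
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    z = sum(probs[i] for i in top)
    return top, [probs[i] / z for i in top]

def load_balancing_loss(all_logits, k=2, num_experts=4):
    """Switch-style aux loss: num_experts * sum_e(frac_tokens_e * mean_prob_e).

    ~1.0 when routing is perfectly balanced; larger when experts collapse.
    """
    counts = [0.0] * num_experts   # fraction of routed slots per expert
    mean_p = [0.0] * num_experts   # mean router probability per expert
    n = len(all_logits)
    for logits in all_logits:
        probs = softmax(logits)
        top, _ = route(logits, k)
        for e in top:
            counts[e] += 1.0 / (n * k)
        for e in range(num_experts):
            mean_p[e] += probs[e] / n
    return num_experts * sum(c * p for c, p in zip(counts, mean_p))

experts, weights = route([2.0, 0.5, 1.0, -1.0], k=2)
assert experts == [0, 2] and abs(sum(weights) - 1.0) < 1e-9
```

The aux loss would be scaled by the 0.001 coefficients above and added (separately for vision and language tokens, per the two `--*_router_aux_loss_coef` flags) to the task loss.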
We utilize MGM-Instruct, Bunny-v1_1, LLaVA-Next and LLaVA-OV data for experiments.
Run

Update `--model_name_or_path` and `--vision_tower` to the paths of the base_llm_moe_model and the vision encoder, respectively. Update `MODEL_TYPE`, `PRETRAIN_DIR` and `OUTPUT_DIR` accordingly. The global batch size is 128. For `MODEL_TYPE = mms-phi-3-moe` (MoIIE) / `phi-3-moe` (Vanilla MoE) / `m-phi-3` (Modality MoE) / `phi-3` (Dense), change `--version` to `minicpm`/`phi3`/`llama`, too. The S$^2$-Wrapper is enabled if `--use_s2 True` is added. The vision encoder is tuned if `--unfreeze_vision_tower True` is added. To tune only the MoE layers, add `--moe_enable True`.
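To make the optional-flag bookkeeping explicit, here is a hypothetical helper (not part of the repo) that assembles the optional flags described above into an argument list for the launch script:

```python
def build_extra_args(use_s2=False, unfreeze_vision_tower=False, moe_enable=False):
    """Collect the optional finetuning flags described in this README."""
    args = []
    if use_s2:
        args += ["--use_s2", "True"]                 # enable the S^2-Wrapper
    if unfreeze_vision_tower:
        args += ["--unfreeze_vision_tower", "True"]  # also tune the vision encoder
    if moe_enable:
        args += ["--moe_enable", "True"]             # tune only the MoE layers
    return args

assert build_extra_args(use_s2=True) == ["--use_s2", "True"]
```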
See evaluation_full.md.
- LLaVA: the dataset we utilized.
- Bunny: the codebase we built upon and the dataset we utilized.
- LLaVA-Next: the dataset we utilized.
- MGM: the dataset we utilized.
- Cambrian-1: the evaluation codebase we utilized.



