
MoIIE: Mixture of Intra- and Inter-Modality Experts for Large Vision Language Models

(Results figure: performance of MoIIE vs. Dense, Vanilla MoE, and Modality MoE as training data scales)

As training data scales, our MoIIE improves steadily while other architectures, especially Dense and Modality MoE, hit performance limitations. Moreover, with larger training datasets, MoIIE consistently outperforms all alternatives, and the performance gap between MoIIE and both Dense and Modality MoE widens. This suggests that the MoIIE framework offers superior scaling properties for multi-modal learning, effectively leveraging larger datasets to enhance representational power without the parameter inefficiency of dense models or the limited cross-modal reasoning of strictly modality-separated experts.

Release

  • [2025/08/01] 🔥 We have released training and evaluation codes.
  • [2025/08/13] 🔥 We have released MoIIE. Checkout the paper for details.


Install

  • CUDA and cuDNN

    We use CUDA 11.8 and cuDNN 8.7.0, via the NVIDIA CUDA Docker image: docker pull nvcr.io/nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04. CUDA 12 is fine, too.

  • Create a conda virtual environment and activate it:

    conda create -n bunny python=3.10
    conda activate bunny
    pip install --upgrade pip  # enable PEP 660 support
  • Install apex

    # https://github.com/NVIDIA/apex#from-source
    pip install ninja
    git clone https://github.com/NVIDIA/apex
    cd apex
    # if pip >= 23.1 (ref: https://pip.pypa.io/en/stable/news/#v23-1) which supports multiple `--config-settings` with the same key...
    pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
    # otherwise
    pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" ./
  • Install flash-attention

    # https://github.com/Dao-AILab/flash-attention?tab=readme-ov-file#installation-and-features
    pip install packaging
    pip install flash-attn --no-build-isolation
  • Install bunny and other requirements

    cd Bunny
    pip install -e .

Training

MoIIE is trained on 8 A100 GPUs. With fewer GPUs or less memory, you can reduce per_device_train_batch_size and increase gradient_accumulation_steps accordingly. Always keep the global batch size the same: global_batch_size = per_device_train_batch_size $\times$ gradient_accumulation_steps $\times$ num_gpus.
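The relation above can be sketched as a small helper that solves for the accumulation steps when the GPU count changes (the numbers below are hypothetical examples; only the formula comes from the text):

```python
def grad_accum_steps(global_batch, per_device_batch, num_gpus):
    """Gradient accumulation steps needed to keep the global batch size fixed:
    global_batch = per_device_batch * grad_accum_steps * num_gpus."""
    assert global_batch % (per_device_batch * num_gpus) == 0, \
        "global batch must be divisible by per-device batch * GPU count"
    return global_batch // (per_device_batch * num_gpus)

# e.g. a global batch of 128 on 8 GPUs with per-device batch 4
print(grad_accum_steps(128, 4, 8))  # 4
```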

  • Experiment model components

| Vision Encoder | Download Link |
| --- | --- |
| siglip-so400m-patch14-384 | google/siglip-so400m-patch14-384 |

| MODEL_TYPE | LLM | Download Link |
| --- | --- | --- |
| phi-3 | Phi-3-mini-4k-instruct | microsoft/Phi-3-mini-4k-instruct |
| llama3-8b | Meta-Llama-3-8B-Instruct | meta-llama/Meta-Llama-3-8B-Instruct |

MoIIE training consists of two stages:

Pretrain stage

The pretrain stage uses data to connect a frozen pretrained vision encoder to a frozen LLM; only the connector is trained.

The training script with DeepSpeed ZeRO-2 can be found in scripts/train/pretrain.sh. The global batch size is 256.

We utilize Bunny-pretrain-LAION-2M. The dataset is available here.
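The connector idea above (frozen vision encoder, frozen LLM, trainable projection in between) can be illustrated with a toy linear projection. All shapes and values here are hypothetical; this is a conceptual sketch, not the repository's connector implementation:

```python
def project(feature, weight, bias):
    """Map a vision-encoder feature vector (length d_vision) into the LLM's
    embedding space (length d_llm) via a linear layer: out = W @ feature + b.
    In pretraining, only W and b would be trainable."""
    return [sum(w * f for w, f in zip(row, feature)) + b
            for row, b in zip(weight, bias)]

vision_feat = [1.0, 2.0]                   # toy d_vision = 2
W = [[0.5, 0.0], [0.0, 0.5], [1.0, 1.0]]   # toy d_llm = 3
b = [0.0, 0.1, -0.1]
print(project(vision_feat, W, b))          # approximately [0.5, 1.1, 2.9]
```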

Visual instruction tuning & sparse training stage

The visual instruction tuning & sparse training stage trains all model parameters: data is used to teach the model to follow multimodal instructions, and the connector, learnable LLM parameters, vision encoder, and MoE modules are updated.

First, execute the following command to initialize the dense LLM backbone as its corresponding sparse MoE LLM backbone.

python convert_moe.py \
      --language-model-path path/to/base_llm_model  \
      --num_local_experts 4 \
      --num_experts_per_tok 2 \
      --vis_router_aux_loss_coef 0.001 \
      --lan_router_aux_loss_coef 0.001 \
      --output_vis_router_logits True \
      --output_lan_router_logits True \
      --save-model-path  /path/to/base_llm_moe_model \
      --moe_architecture bunny-mm-phi3-moe-s
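The flags above configure 4 local experts with top-2 routing (num_local_experts=4, num_experts_per_tok=2). A minimal sketch of that gating step is below; this is a generic top-k softmax router for illustration, not the repository's actual MoE implementation, which additionally distinguishes vision and language routers (the vis/lan aux-loss coefficients above):

```python
import math

def topk_route(logits, k=2):
    """Softmax over per-expert router logits, keep the top-k experts,
    and renormalize their gates so the mixing weights sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]       # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(logits)), key=lambda i: -probs[i])[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}       # expert index -> gate weight

# 4 experts, top-2: experts 0 and 2 win for this token's logits
gates = topk_route([2.0, 0.5, 1.0, -1.0])
print(gates)
```

The token's output is then the gate-weighted sum of the selected experts' outputs; the auxiliary loss (weighted by the router_aux_loss_coef flags) encourages balanced expert usage.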

Then the training script with DeepSpeed ZeRO-3 can be found in scripts/train/finetune_all_moe.sh. The global batch size is 128.

We utilize MGM-Instruct, Bunny-v1_1, LLaVA-Next and LLaVA-OV data for experiments.

  • Run

    Update --model_name_or_path and --vision_tower to the paths of the base_llm_moe_model and the vision encoder, respectively. Update MODEL_TYPE, PRETRAIN_DIR and OUTPUT_DIR accordingly. The global batch size is 128. For MODEL_TYPE = mms-phi-3-moe (MoIIE), phi-3-moe (Vanilla MoE), m-phi-3 (Modality MoE) or phi-3 (Dense), also set --version to the matching template (minicpm/phi3/llama). S$^2$-Wrapper is enabled if --use_s2 True is added. The vision encoder is tuned if --unfreeze_vision_tower True is added. To tune only the MoE layers, add --moe_enable True.

Evaluation

See evaluation_full.md.

Acknowledgement

  • LLaVA: the dataset we utilized.
  • Bunny: the codebase we built upon and the dataset we utilized.
  • LLaVA-Next: the dataset we utilized.
  • MGM: the dataset we utilized.
  • Cambrian-1: the evaluation codebase we utilized.
