From 7534e5e21351665bd33d85eb6301e5d9c8f0972e Mon Sep 17 00:00:00 2001
From: RissyRan
Date: Thu, 4 Jul 2024 00:40:54 +0000
Subject: [PATCH] Add moe perf number

---
 end_to_end/tpu/mixtral/Run_Mixtral.md | 36 +++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)
 create mode 100644 end_to_end/tpu/mixtral/Run_Mixtral.md

diff --git a/end_to_end/tpu/mixtral/Run_Mixtral.md b/end_to_end/tpu/mixtral/Run_Mixtral.md
new file mode 100644
index 0000000000..78266b0305
--- /dev/null
+++ b/end_to_end/tpu/mixtral/Run_Mixtral.md
@@ -0,0 +1,36 @@
+
+
+# Mixtral
+
+[Mixtral](https://mistral.ai/news/mixtral-of-experts/) is a state-of-the-art AI model developed by Mistral AI that uses a sparse mixture-of-experts (MoE) architecture.
+
+
+To get started, follow the instructions at [mistral-inference](https://github.com/mistralai/mistral-inference) to download the model. Once it is downloaded, run [llama_or_mistral_ckpt.py](../../../MaxText/llama_or_mistral_ckpt.py) to convert the checkpoint into a MaxText-compatible format. You can then proceed with decoding, pretraining, and finetuning. You can find a Mixtral 8x7B example in the [end_to_end/tpu/mixtral/8x7b](../mixtral/8x7b) test scripts.
+
+
+Additionally, Mixtral integrates with [MegaBlocks](https://arxiv.org/abs/2211.15841), an efficient dropless MoE strategy, which is enabled by setting the `megablox` flag to `True` (the default).
+
+
+## MaxText supports pretraining and finetuning with high performance
+
+Model FLOPs utilization (MFU) for training on v5p TPUs.
+
+| Model size | Accelerator type | TFLOP/chip/sec | Model FLOPs utilization (MFU) |
+| ------------ | -------------- | -------------- | -------------- |
+| Mixtral 8x7B | v5p-128 | 251.94 | 54.89% |
+
+
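
Note on the table: the MFU column is the achieved per-chip throughput divided by the chip's peak throughput. The following is a minimal sketch of that arithmetic, not code from MaxText. It assumes a peak of 459 bf16 TFLOP/s per v5p chip, a figure taken from public TPU v5p specifications and not stated in this patch. The names `V5P_PEAK_TFLOPS` and `mfu` are hypothetical.

```python
# Assumed peak bf16 throughput of a single TPU v5p chip, in TFLOP/s.
# This value is an assumption from public specs; the patch does not state it.
V5P_PEAK_TFLOPS = 459.0


def mfu(achieved_tflops_per_chip: float,
        peak_tflops_per_chip: float = V5P_PEAK_TFLOPS) -> float:
    """Return model FLOPs utilization as a fraction of peak throughput."""
    if peak_tflops_per_chip <= 0:
        raise ValueError("peak throughput must be positive")
    return achieved_tflops_per_chip / peak_tflops_per_chip


if __name__ == "__main__":
    # 251.94 TFLOP/chip/sec is the Mixtral 8x7B figure reported in the table.
    print(f"MFU: {mfu(251.94):.2%}")  # prints "MFU: 54.89%"
```

Under that assumed peak, the reported 251.94 TFLOP/chip/sec reproduces the 54.89% in the table, which suggests the two columns are consistent with each other.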