From 7534e5e21351665bd33d85eb6301e5d9c8f0972e Mon Sep 17 00:00:00 2001
From: RissyRan <ran.rissy@gmail.com>
Date: Thu, 4 Jul 2024 00:40:54 +0000
Subject: [PATCH] Add moe perf number

---
 end_to_end/tpu/mixtral/Run_Mixtral.md | 36 +++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)
 create mode 100644 end_to_end/tpu/mixtral/Run_Mixtral.md

diff --git a/end_to_end/tpu/mixtral/Run_Mixtral.md b/end_to_end/tpu/mixtral/Run_Mixtral.md
new file mode 100644
index 0000000000..78266b0305
--- /dev/null
+++ b/end_to_end/tpu/mixtral/Run_Mixtral.md
@@ -0,0 +1,36 @@
+<!--
+ Copyright 2024 Google LLC
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+      https://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+ -->
+
+# Mixtral
+
+[Mixtral](https://mistral.ai/news/mixtral-of-experts/) is a state-of-the-art AI model developed by Mistral AI that uses a sparse mixture-of-experts (MoE) architecture.
+
+
+To get started, follow the instructions at [mistral-inference](https://github.com/mistralai/mistral-inference) to download the model. Once downloaded, run [llama_or_mistral_ckpt.py](../../../MaxText/llama_or_mistral_ckpt.py) to convert the checkpoint into a MaxText-compatible format. You can then proceed with decoding, pretraining, and finetuning. A Mixtral 8x7B example is available in the [end_to_end/tpu/mixtral/8x7b](../mixtral/8x7b) test scripts.
+
+
+Additionally, Mixtral integrates with [MegaBlocks](https://arxiv.org/abs/2211.15841), an efficient dropless MoE strategy. It is enabled when the `megablox` flag is set to True, which is the default.
+
+
+## MaxText supports pretraining and finetuning with high performance
+
+Model FLOPs utilization (MFU) for training on v5p TPUs:
+
+| Model size   | Accelerator type | TFLOP/s per chip | Model FLOPs utilization (MFU) |
+| ------------ | ---------------- | ---------------- | ----------------------------- |
+| Mixtral 8x7B | v5p-128          | 251.94           | 54.89%                        |
+
+