From 696faf3d624805b2d9fecc3843c3dd06b33fda4f Mon Sep 17 00:00:00 2001
From: Chen Lai
Date: Thu, 18 Apr 2024 21:52:20 -0700
Subject: [PATCH 1/2] Docs for lowering smaller models to MPS/CoreML/QNN

Differential Revision: [D56340028](https://our.internmc.facebook.com/intern/diff/D56340028/)

[ghstack-poisoned]
---
 examples/models/llama2/README.md | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/examples/models/llama2/README.md b/examples/models/llama2/README.md
index 2244917d1e4..7780dafe5a5 100644
--- a/examples/models/llama2/README.md
+++ b/examples/models/llama2/README.md
@@ -17,9 +17,9 @@ Please note that the models are subject to the [acceptable use policy](https://g
 
 # Results
 
-Since 7B Llama2 model needs at least 4-bit quantization to fit even within some of the highend phones, results presented here correspond to 4-bit groupwise post-training quantized model. 
+Since the 7B Llama2 model needs at least 4-bit quantization to fit on even some of the high-end phones, the results presented here correspond to the 4-bit groupwise post-training quantized model.
 
-For Llama3, we can use the same process. Note that it's only supported in the ExecuTorch main branch. 
+For Llama3, we can use the same process. Note that it's only supported in the ExecuTorch main branch.
 
 ## Quantization:
 We employed 4-bit groupwise per-token dynamic quantization of all the linear layers of the model. Dynamic quantization refers to quantizing activations dynamically, such that the quantization parameters for activations are calculated, from the min/max range, at runtime. Here we quantized activations with 8 bits (signed integer). Furthermore, weights are statically quantized. In our case weights were per-channel groupwise quantized with a 4-bit signed integer. For more information refer to this [page](https://github.com/pytorch-labs/ao/).
@@ -243,6 +243,16 @@ Please refer to [this tutorial](https://pytorch.org/executorch/main/llm/llama-de
 ### Android
 Please refer to [this tutorial](https://pytorch.org/executorch/main/llm/llama-demo-android.html) for full instructions on building the Android LLAMA Demo App.
 
+## Optional: Smaller models delegated to other backends
+Currently we support lowering the stories model to other backends, including CoreML, MPS and QNN. Please refer to the instructions
+for each backend ([CoreML](https://pytorch.org/executorch/main/build-run-coreml.html), [MPS](https://pytorch.org/executorch/main/build-run-mps.html), [QNN](https://pytorch.org/executorch/main/build-run-qualcomm.html)) before trying to lower them. After the backend library is installed, the commands to export a lowered model are:
+
+- Lower to CoreML: `python -m examples.models.llama2.export_llama -kv --coreml -c stories110M.pt -p params.json`
+- MPS: `python -m examples.models.llama2.export_llama -kv --MPS -c stories110M.pt -p params.json`
+- QNN: `python -m examples.models.llama2.export_llama -kv --qnn -c stories110M.pt -p params.json`
+
+The iOS LLAMA app supports the CoreML and MPS models, and the Android LLAMA app supports the QNN model. On Android, you can also cross compile the llama runner binary, push it to the device, and run it.
+
 # What is coming next?
 ## Quantization
 - Enabling FP16 model to leverage smaller groupsize for 4-bit quantization.
From 691bc590f363e4889ba9c7b823d65a76784c5676 Mon Sep 17 00:00:00 2001
From: Chen Lai
Date: Thu, 18 Apr 2024 22:09:19 -0700
Subject: [PATCH 2/2] Update on "Docs for lowering smaller models to MPS/CoreML/QNN"

Differential Revision: [D56340028](https://our.internmc.facebook.com/intern/diff/D56340028/)

[ghstack-poisoned]
---
 examples/models/llama2/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/examples/models/llama2/README.md b/examples/models/llama2/README.md
index 7780dafe5a5..45fe8d47674 100644
--- a/examples/models/llama2/README.md
+++ b/examples/models/llama2/README.md
@@ -248,7 +248,7 @@ Currently we support lowering the stories model to other backends, including
 for each backend ([CoreML](https://pytorch.org/executorch/main/build-run-coreml.html), [MPS](https://pytorch.org/executorch/main/build-run-mps.html), [QNN](https://pytorch.org/executorch/main/build-run-qualcomm.html)) before trying to lower them. After the backend library is installed, the commands to export a lowered model are:
 
 - Lower to CoreML: `python -m examples.models.llama2.export_llama -kv --coreml -c stories110M.pt -p params.json`
-- MPS: `python -m examples.models.llama2.export_llama -kv --MPS -c stories110M.pt -p params.json`
+- MPS: `python -m examples.models.llama2.export_llama -kv --mps -c stories110M.pt -p params.json`
 - QNN: `python -m examples.models.llama2.export_llama -kv --qnn -c stories110M.pt -p params.json`
 
 The iOS LLAMA app supports the CoreML and MPS models, and the Android LLAMA app supports the QNN model. On Android, you can also cross compile the llama runner binary, push it to the device, and run it.
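The 4-bit groupwise weight quantization that the README's Quantization section describes can be sketched in a few lines. This is a minimal illustration only: the function names and the symmetric round-to-nearest mapping into the signed-int4 range [-8, 7] are assumptions for the sketch, and the production implementation lives in [pytorch-labs/ao](https://github.com/pytorch-labs/ao/).

```python
# Minimal sketch of 4-bit groupwise symmetric weight quantization.
# Assumptions (not the production scheme): one scale per group derived
# from the group's max-abs value, symmetric mapping into [-8, 7].

def quantize_groupwise_4bit(weights, group_size=32):
    """Quantize a flat list of floats, one scale per group of `group_size`."""
    assert len(weights) % group_size == 0
    quantized, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # Symmetric scale: largest magnitude in the group maps to +/-7.
        scale = max(abs(w) for w in group) / 7.0 or 1.0  # guard all-zero group
        q = [max(-8, min(7, round(w / scale))) for w in group]
        quantized.append(q)
        scales.append(scale)
    return quantized, scales

def dequantize(quantized, scales):
    """Recover approximate floats by rescaling each group."""
    out = []
    for q, scale in zip(quantized, scales):
        out.extend(v * scale for v in q)
    return out

weights = [i / 32.0 - 1.0 for i in range(64)]  # 64 floats in [-1, 1)
q, s = quantize_groupwise_4bit(weights)
restored = dequantize(q, s)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

Per-token dynamic activation quantization works analogously, except the scale is computed at runtime from each token's min/max activation range rather than ahead of time from the weights; smaller groups give finer-grained scales at the cost of more metadata, which is why the FP16 work item above targets smaller group sizes.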