[Draft] Qualcomm AI Engine Direct - [WIP] llama2... #3656
Closed
chiwwang wants to merge 1 commit into pytorch:main from
Conversation
example/qualcomm/llama2/llama.py can be used like:

```
python examples/qualcomm/llama2/llama.py -a llama_only_quant -b build_android -m SM8650 --ptq 16a4w --tokenizer_model tokenizer.model --checkpoint stories110M.pt --params params.json --tokenizer_bin tokenizer.bin --prompt Once
```

Note that we don't have a runner for llama2 without the split.

It's still FAR AWAY from a workable statically quantized llama2-7b. stories110M might work with 16a4w on HTP, but please note that calibration() has not been done well. The command above is only a reference and can change anytime.

What we did to optimize performance on HTP:

1. One multi-head attention is transformed into multiple single-head attentions (see the first sketch after this list).
2. The KV cache is changed to graph I/O; the update is performed in qnn_llama_runner.cpp on CPU (see the second sketch after this list).
3. llama2 is partitioned into 6 pte files in examples/qualcomm/llama2/composite_llama.py.
4. The embedding is quantized. This might need further investigation, e.g., whether we can move it out of the model and run it on CPU, etc.
5. u16 and u8 mixed-precision quantization is supported.
6. The KV cache is left in quantized format in graph I/O.
7. RMSNorm is tweaked a bit to reduce its quantization sensitivity.
8. The HTP spill-fill buffer feature is used among the pte files.
9. All Linear layers are converted to Conv2d (see the third sketch after this list).
10. quant_min and quant_max are properly set in the Observers so that offset=128 in symmetric quantization (see the fourth sketch after this list).
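To illustrate item 1, here is a minimal sketch of decomposing one multi-head attention into per-head single-head attentions. The module, weight layout, and shapes are illustrative assumptions, not the exact transformation pass in this PR:

```python
import torch
import torch.nn as nn

class SingleHeadDecomposedAttention(nn.Module):
    """Multi-head attention rewritten as n_heads independent single-head ones."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        assert dim % n_heads == 0
        self.head_dim = dim // n_heads
        # One small projection per head instead of one fused dim x dim matmul,
        # so each head lowers to its own small subgraph on HTP.
        self.wq = nn.ModuleList(nn.Linear(dim, self.head_dim) for _ in range(n_heads))
        self.wk = nn.ModuleList(nn.Linear(dim, self.head_dim) for _ in range(n_heads))
        self.wv = nn.ModuleList(nn.Linear(dim, self.head_dim) for _ in range(n_heads))
        self.wo = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); run every head as an independent attention.
        # (Causal masking is omitted to keep the sketch short.)
        heads = []
        for wq, wk, wv in zip(self.wq, self.wk, self.wv):
            q, k, v = wq(x), wk(x), wv(x)
            scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
            heads.append(torch.softmax(scores, dim=-1) @ v)
        return self.wo(torch.cat(heads, dim=-1))
```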
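For item 2, a hedged sketch of the "KV cache as graph I/O" idea: the cache enters the graph as plain inputs and the new k/v rows come back as outputs, so nothing is mutated inside the graph; the host runner (qnn_llama_runner.cpp in this PR, written in C++) performs the cache update on CPU between invocations. Shapes and names below are toy assumptions:

```python
import torch

class DecodeStep(torch.nn.Module):
    """Toy single-head decode step; the cache crosses the graph boundary."""

    def __init__(self, dim: int):
        super().__init__()
        self.wk = torch.nn.Linear(dim, dim)
        self.wv = torch.nn.Linear(dim, dim)

    def forward(self, x, k_cache, v_cache):
        # x: (1, dim) current-token embedding.
        # k_cache, v_cache: (past_len, dim), passed in as plain graph inputs.
        k_new, v_new = self.wk(x), self.wv(x)
        k = torch.cat([k_cache, k_new], dim=0)
        v = torch.cat([v_cache, v_new], dim=0)
        attn = torch.softmax(x @ k.t() / x.shape[-1] ** 0.5, dim=-1)
        # The new cache rows are returned as graph outputs; the host runner
        # appends them to its CPU-side buffers before the next invocation,
        # e.g. k_cache = torch.cat([k_cache, k_new], dim=0).
        return attn @ v, k_new, v_new
```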
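For item 9, a minimal sketch of the Linear-to-Conv2d rewrite, assuming it works along these lines (the helper below is hypothetical, not the PR's actual pass): an nn.Linear is numerically a 1x1 nn.Conv2d once the activation is laid out as NCHW, and HTP generally performs better on Conv2d:

```python
import torch
import torch.nn as nn

def linear_to_conv2d(linear: nn.Linear) -> nn.Conv2d:
    # Reuse the Linear weights as a 1x1 convolution kernel.
    conv = nn.Conv2d(linear.in_features, linear.out_features,
                     kernel_size=1, bias=linear.bias is not None)
    with torch.no_grad():
        conv.weight.copy_(linear.weight.view(*linear.weight.shape, 1, 1))
        if linear.bias is not None:
            conv.bias.copy_(linear.bias)
    return conv

# Quick equivalence check on (batch, seq, dim) activations:
lin = nn.Linear(64, 128)
conv = linear_to_conv2d(lin)
x = torch.randn(2, 10, 64)
y_lin = lin(x)
y_conv = conv(x.transpose(1, 2).unsqueeze(-1))   # -> (batch, dim, seq, 1)
y_conv = y_conv.squeeze(-1).transpose(1, 2)      # back to (batch, seq, dim)
assert torch.allclose(y_lin, y_conv, atol=1e-5)
```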
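For item 10, a small sketch of where offset=128 comes from, using PyTorch's stock MinMaxObserver for illustration (the PR's exact observer setup may differ): with the symmetric qscheme on quint8, the zero point lands at 128.

```python
import torch
from torch.ao.quantization.observer import MinMaxObserver

# Default quint8 range with the symmetric qscheme: zero point is 128.
# If quant_min/quant_max are customized, PyTorch instead uses the
# down-rounded midpoint of the range, so the bounds must be chosen
# with that in mind.
obs = MinMaxObserver(dtype=torch.quint8, qscheme=torch.per_tensor_symmetric)
obs(torch.randn(16))                      # record min/max from sample data
scale, zero_point = obs.calculate_qparams()
print(scale.item(), zero_point.item())    # zero_point == 128
```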
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/3656
❌ 3 New Failures as of commit aaada7f with merge base 4008600.
chiwwang (Contributor, Author)
Please see #4142 instead.