[Draft] Qualcomm AI Engine Direct - [WIP] llama2... #3656
Closed
chiwwang wants to merge 1 commit into pytorch:main from
Conversation
example/qualcomm/llama2/llama.py can be used like:

```
python examples/qualcomm/llama2/llama.py -a llama_only_quant -b build_android -m SM8650 --ptq 16a4w --tokenizer_model tokenizer.model --checkpoint stories110M.pt --params params.json --tokenizer_bin tokenizer.bin --prompt Once
```

Note that we don't have a runner for llama2 without the split.

It's still FAR AWAY from a workable statically quantized llama2-7b. stories110M might work with 16a4w on HTP, but please note that calibration() has not been done well. The command above is only a reference and can change anytime.

What we did to optimize performance on HTP:

1. One multi-head attention is transformed into multiple single-head attentions (see the first sketch after this list).
2. The KV cache is changed to graph I/O; the update is performed in qnn_llama_runner.cpp on CPU (see the second sketch after this list).
3. llama2 is partitioned into 6 pte files in examples/qualcomm/llama2/composite_llama.py.
4. The embedding is quantized. This might need further investigation, e.g., whether we can move it out of the model and run it on CPU, etc.
5. u16 and u8 mixed-precision quantization is supported.
6. The KV cache is left in quantized format in graph I/O.
7. RMSNorm is tweaked a bit to reduce its quantization sensitivity.
8. The HTP spill-fill buffer feature is used among the pte files.
9. All Linear layers are converted to Conv2d (see the third sketch after this list).
10. quant_min and quant_max are properly set in the Observers so that offset=128 in symmetric quantization (see the fourth sketch after this list).
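To illustrate item 1, here is a minimal sketch of decomposing one multi-head attention into per-head single-head attentions. The module, weight layout, and shapes are illustrative assumptions, not the exact transformation pass in this PR:

```python
import torch
import torch.nn as nn

class SingleHeadDecomposedAttention(nn.Module):
    """Multi-head attention rewritten as n_heads independent single-head ones."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        assert dim % n_heads == 0
        self.head_dim = dim // n_heads
        # One small projection per head instead of one fused dim x dim matmul,
        # so each head lowers to its own small subgraph on HTP.
        self.wq = nn.ModuleList(nn.Linear(dim, self.head_dim) for _ in range(n_heads))
        self.wk = nn.ModuleList(nn.Linear(dim, self.head_dim) for _ in range(n_heads))
        self.wv = nn.ModuleList(nn.Linear(dim, self.head_dim) for _ in range(n_heads))
        self.wo = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); run every head as an independent attention.
        # (Causal masking is omitted to keep the sketch short.)
        heads = []
        for wq, wk, wv in zip(self.wq, self.wk, self.wv):
            q, k, v = wq(x), wk(x), wv(x)
            scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
            heads.append(torch.softmax(scores, dim=-1) @ v)
        return self.wo(torch.cat(heads, dim=-1))
```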
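For item 2, a hedged sketch of the "KV cache as graph I/O" idea: the cache enters the graph as plain inputs and the new k/v rows come back as outputs, so nothing is mutated inside the graph; the host runner (qnn_llama_runner.cpp in this PR, written in C++) performs the cache update on CPU between invocations. Shapes and names below are toy assumptions:

```python
import torch

class DecodeStep(torch.nn.Module):
    """Toy single-head decode step; the cache crosses the graph boundary."""

    def __init__(self, dim: int):
        super().__init__()
        self.wk = torch.nn.Linear(dim, dim)
        self.wv = torch.nn.Linear(dim, dim)

    def forward(self, x, k_cache, v_cache):
        # x: (1, dim) current-token embedding.
        # k_cache, v_cache: (past_len, dim), passed in as plain graph inputs.
        k_new, v_new = self.wk(x), self.wv(x)
        k = torch.cat([k_cache, k_new], dim=0)
        v = torch.cat([v_cache, v_new], dim=0)
        attn = torch.softmax(x @ k.t() / x.shape[-1] ** 0.5, dim=-1)
        # The new cache rows are returned as graph outputs; the host runner
        # appends them to its CPU-side buffers before the next invocation,
        # e.g. k_cache = torch.cat([k_cache, k_new], dim=0).
        return attn @ v, k_new, v_new
```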
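For item 9, a minimal sketch of the Linear-to-Conv2d rewrite, assuming it works along these lines (the helper below is hypothetical, not the PR's actual pass): an nn.Linear is numerically a 1x1 nn.Conv2d once the activation is laid out as NCHW, and HTP generally performs better on Conv2d:

```python
import torch
import torch.nn as nn

def linear_to_conv2d(linear: nn.Linear) -> nn.Conv2d:
    # Reuse the Linear weights as a 1x1 convolution kernel.
    conv = nn.Conv2d(linear.in_features, linear.out_features,
                     kernel_size=1, bias=linear.bias is not None)
    with torch.no_grad():
        conv.weight.copy_(linear.weight.view(*linear.weight.shape, 1, 1))
        if linear.bias is not None:
            conv.bias.copy_(linear.bias)
    return conv

# Quick equivalence check on (batch, seq, dim) activations:
lin = nn.Linear(64, 128)
conv = linear_to_conv2d(lin)
x = torch.randn(2, 10, 64)
y_lin = lin(x)
y_conv = conv(x.transpose(1, 2).unsqueeze(-1))   # -> (batch, dim, seq, 1)
y_conv = y_conv.squeeze(-1).transpose(1, 2)      # back to (batch, seq, dim)
assert torch.allclose(y_lin, y_conv, atol=1e-5)
```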
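For item 10, a small sketch of where offset=128 comes from, using PyTorch's stock MinMaxObserver for illustration (the PR's exact observer setup may differ): with the symmetric qscheme on quint8, the zero point lands at 128.

```python
import torch
from torch.ao.quantization.observer import MinMaxObserver

# Default quint8 range with the symmetric qscheme: zero point is 128.
# If quant_min/quant_max are customized, PyTorch instead uses the
# down-rounded midpoint of the range, so the bounds must be chosen
# with that in mind.
obs = MinMaxObserver(dtype=torch.quint8, qscheme=torch.per_tensor_symmetric)
obs(torch.randn(16))                      # record min/max from sample data
scale, zero_point = obs.calculate_qparams()
print(scale.item(), zero_point.item())    # zero_point == 128
```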
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/3656
❌ 3 New Failures as of commit aaada7f with merge base 4008600.
chiwwang (Contributor, Author)
Please see #4142 instead.