Qualcomm AI Engine Direct - Optimization and fix mutable buffer issue #5072
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/5072
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 9b98827 with merge base 99fbca3.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Hi @cccclai, please have a look, thank you.
Result:
This PR with spin quant R1+R2 and calibration on one sentence with 128 seq_len
This PR with calibration on one sentence with 128 seq_len
Force-pushed from 59f4b20 to a7e6164 (Compare)
This looks awesome, seems like it generates results with good quality, do I understand it correctly? Regarding this:
How do I repro it, and how did you still manage to generate results despite the OOM issue?
Looks like you fall back the embedding op?
Yes, I currently fall back it to CPU, and it works on a 16GB device. But I think we could try quantizing the embedding op to reduce memory usage.
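For reference, a minimal sketch of what quantizing the embedding table could look like (per-row int8 with fp32 scales; the scheme and names here are assumptions for illustration, not the actual export flow):

```python
import torch

def quantize_embedding_weight(weight: torch.Tensor):
    # Per-row scales: each fp32 row becomes an int8 row plus one fp32 scale,
    # roughly a 4x memory reduction for the embedding table.
    max_abs = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8)
    scale = max_abs / 127.0
    q = torch.clamp(torch.round(weight / scale), -128, 127).to(torch.int8)
    return q, scale

def dequant_lookup(q: torch.Tensor, scale: torch.Tensor, token_ids: torch.Tensor):
    # Dequantize only the rows that are actually looked up at runtime.
    return q[token_ids].to(torch.float32) * scale[token_ids]

emb = torch.nn.Embedding(32000, 4096)  # llama-sized table, ~0.5 GB in fp32
q, scale = quantize_embedding_weight(emb.weight.detach())
out = dequant_lookup(q, scale, torch.tensor([1, 42, 7]))  # shape (3, 4096)
```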
Yes, I think the good news is that it generates good results.
I think you could reproduce it with this PR.
Hmm, I'm a bit confused. Isn't BPETokenizer necessary for prompt encoding and decoding? How do you generate results without the tokenizer?
Because for llama3 we should use the tiktoken tokenizer to do encoding and decoding, right?
Hmm, if we use llama3, how do we end up using the bpe tokenizer..?
#4942 was just approved. Will merge it asap.
Thanks for your effort. There are some failures for the llama runner test in the PR checks. Is this expected?
Ohh, could we use the bpe tokenizer for llama3?
That is unrelated. It's resolved after rebase.
Hmm, I think we need to use the tiktoken tokenizer for llama3; the bpe tokenizer is only for llama2. I thought it just falls back to the bpe tokenizer if loading fails. Are you using the llama3 tokenizer?
Yes, I am using the llama3 tokenizer. And I agree with your point: it should fail to load with the bpe tokenizer and then switch to loading the tiktoken tokenizer.
Got it. I thought it was my problem, so I added a transform to replace rms norm instead of changing llama_transformer.py.
Force-pushed from d58fb95 to 8ab0e79 (Compare)
Hmm, I think the default is the tiktoken tokenizer, and if it fails, it will be reset to load the bpe tokenizer.
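For readers following along, the fallback pattern being described is roughly the sketch below. This is illustrative only — the actual runner implements it in C++, and the tokenizer classes here are stand-ins, not the real API:

```python
class TiktokenTokenizer:          # placeholder stand-in for the runner's
    def load(self, path: str):    # real tiktoken tokenizer
        raise RuntimeError("not a tiktoken model")

class BPETokenizer:               # placeholder stand-in for the bpe tokenizer
    def load(self, path: str):
        print(f"loaded bpe tokenizer from {path}")

def load_tokenizer(path: str):
    # Default to tiktoken (llama3); if loading fails, reset and
    # fall back to the bpe tokenizer (llama2).
    try:
        tok = TiktokenTokenizer()
        tok.load(path)
        return tok
    except Exception:
        tok = BPETokenizer()
        tok.load(path)
        return tok

tok = load_tokenizer("tokenizer.bin")  # falls back to bpe in this sketch
```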
Whoops, it changed yesterday... :)
Actually, for this one, do you remember which job is failing? Is it the [test-llama-runner-qnn job](https://hud.pytorch.org/hud/pytorch/executorch/main/1?per_page=50&name_filter=test-llama-runner-qnn)? Then it's the qnn llama end-to-end job...
Oh, for this test, it needs QNN 2.25 or above due to RMS Norm support.
I see, can we update these lines then? https://github.com/pytorch/executorch/blob/main/.ci/scripts/setup-qnn-deps.sh#L15-L18
Hi, |
I see, let me get some help from internal teams on this. In the worst case, we disable the CI.
The backup plan is to delete the llama qnn CI job temporarily and resume it later.
#4942 is merged, could you rebase and remove the llama test?
Force-pushed from 3fbfc6c to 0e26422 (Compare)
Thanks, I have rebased and removed the test.
Force-pushed from 0e26422 to 10d5344 (Compare)
@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Hi, can you add this change, as the PR breaks some internal tests (we use buck internally)?
Force-pushed from 10d5344 to be55d6e (Compare)
@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Looks like the patch is not part of the new commit. Did I miss anything? Edit: actually saw them. Thanks!
Do I need to rebase again?
Hey, sorry, could you rebase again? I'm having issues merging.
Hey, could you rebase? I'm having issues landing...
Summary:
- Add a pass to convert linear to conv2d: we found an accuracy drop caused by the QNN Linear op in llama3, and it is fixed by the linear-to-conv2d pass.
- Work around the mutable buffer issue for the index_put op: we add a pass that replaces the input of the index_put op. This workaround results in a performance regression.
- Insert a copy op for int64 inputs to convert int64 to int32 in the i64toi32 pass.
- Support QNN RMS Norm and use the native rms norm in llama_transformer.
- Add a pass to compose rms norm.
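For context on the first item, the linear-to-conv2d pass relies on the fact that a Linear layer is equivalent to a 1x1 Conv2d applied to a reshaped input. A minimal sketch of that equivalence (illustrative only — the actual ExecuTorch pass rewrites the exported graph rather than eager modules):

```python
import torch

def linear_as_conv2d(x: torch.Tensor, linear: torch.nn.Linear) -> torch.Tensor:
    # A Linear(in, out) equals a Conv2d(in, out, kernel_size=1) over NCHW input.
    w = linear.weight.unsqueeze(-1).unsqueeze(-1)   # (out, in, 1, 1)
    y = torch.nn.functional.conv2d(
        x.transpose(1, 2).unsqueeze(-1),            # (batch, in, seq_len, 1)
        w, linear.bias)
    return y.squeeze(-1).transpose(1, 2)            # back to (batch, seq_len, out)

lin = torch.nn.Linear(64, 128)
x = torch.randn(2, 16, 64)                          # (batch, seq_len, in_features)
assert torch.allclose(lin(x), linear_as_conv2d(x, lin), atol=1e-5)
```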
Force-pushed from be55d6e to 9b98827 (Compare)
@cccclai
@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.




Summary:
Note that QNN supports RMS Norm starting from version 2.25.
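For reference, the RMS Norm computation that QNN 2.25+ supports natively is the standard llama-style formulation below (a sketch; the eps value is an assumption):

```python
import torch

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # y = x / sqrt(mean(x^2) + eps) * weight, normalized over the last dim.
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x / rms * weight

x = torch.randn(2, 8, 64)
w = torch.ones(64)
y = rms_norm(x, w)  # same shape as x
```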