Dynamic Scale Factor Calculations for Key/Value Scales With FP8 KV Caching#317
Merged
Conversation
gshtras
reviewed
Dec 10, 2024
Collaborator
gshtras
left a comment
Overall looks good, aside from a few minor questions and comments.
Also pending conflict resolution
vllm/envs.py
Outdated
```diff
 VLLM_USE_ROCM_SKINNY_GEMM: bool = True
 VLLM_USE_ROCM_CUSTOM_PAGED_ATTN: bool = True
-VLLM_USE_ROCM_CUSTOM_PAGED_ATTN_FP8_OUT: bool = True
+VLLM_USE_ROCM_CUSTOM_PAGED_ATTN_FP8_OUT: bool = False
 VLLM_MOE_PADDING: bool = False
 VLLM_FP8_PADDING: bool = True
 VLLM_ENABLE_V1_MULTIPROCESSING: bool = False
+K_SCALE_CONSTANT: int = 200
```
Collaborator
Do we want different values?
```diff
 for field in dataclasses.fields(attn_backend.get_metadata_cls()):
-    if field.name in tensor_dict:
+    if field.name in tensor_dict and field.name != \
+            'enable_kv_scales_calculation':
```
Collaborator
I'm not sure why we filter it out here?
benchmarks/P3L.py
Outdated
```diff
 engine_args = EngineArgs.from_cli_args(args)
 llm = LLM(**dataclasses.asdict(engine_args))

 llm = LLM(
```
Collaborator
This is not needed now with `**dataclasses.asdict(engine_args)`.
```diff
-self._k_scale = 1.0
-self._v_scale = 1.0
+self.calculate_kv_scales = calculate_kv_scales
+self._k_scale = torch.tensor(1.0, dtype=torch.float32)
```
Collaborator
Possibly `torch.ones` is better.
shajrawi
approved these changes
Dec 17, 2024
gshtras
added a commit
that referenced
this pull request
Jan 7, 2025
Dynamic Scale Factor Calculations for Key/Value Scales With FP8 KV Caching (#317)

* Changed _k_scale and _v_scale to tensors
* Fixed ROCm paged attention with tensor kv scales
* Added on the fly scale factor calculation
* Trying to fix attn metadata
* Fixed AttentionMetadata issue, updated description for calculate-kv-scales flag in arg_utils.py
* Changed K and V scale constants
* Removed unneeded comment
* Changes to pass format.sh, also fixed lingering k_scale/v_scale : float
* Fix for TP > 1
* Ran format.sh
* Removed legacy kv_scale loading from the json file
* Removed the outdated kv cache docs
* Revert some unwanted changes

Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
This PR implements a simple method for calculating `k_scale` and `v_scale` on the fly in the attention layer. This is especially useful when a model checkpoint ships without scale factors, where the previous behavior was to default both to 1.0.
This feature required changing `k_scale` and `v_scale` from floats to tensors, which should also be useful for exploring other kinds of key and value scaling in the future (e.g. per-channel).
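For context, the on-the-fly idea can be sketched roughly as follows. This is a hypothetical illustration, not the PR's actual code: the function name and the exact formula are assumptions, and only the `K_SCALE_CONSTANT = 200` value comes from the `vllm/envs.py` diff above.

```python
import torch

# Headroom constant; the value 200 is the default added to vllm/envs.py in
# this PR. The formula below is an illustrative sketch of deriving a scale
# from observed key/value magnitudes rather than defaulting it to 1.0.
K_SCALE_CONSTANT = 200

def dynamic_kv_scale(x: torch.Tensor) -> torch.Tensor:
    """Derive a per-tensor scale from the observed max magnitude (sketch)."""
    amax = x.detach().abs().max().to(torch.float32)
    # Divide by a headroom constant so occasional outliers do not
    # saturate the quantized FP8 range; clamp to keep the scale nonzero.
    return torch.clamp(amax / K_SCALE_CONSTANT, min=1e-12)
```

Returning the scale as a float32 tensor (rather than a Python float) matches the PR's change of `_k_scale`/`_v_scale` to tensors, which keeps the door open for per-channel scales later.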
Here are a few PPL measurements taken with Llama 3.1 70B, showing better accuracy than using a scale factor of 1.0 for both `k_scale` and `v_scale`.