
[Question]: Question about KV-cache storage #20

@DerrickYLJ

Describe the issue

Thank you for the amazing work!

  1. Does the model keep the entire KV cache from both prefill and generation on the device? If so, how does the device have enough memory to hold the KV cache for a 1M-token context? If not, how did you reduce the overhead of transferring KV values between host and device?

  2. What exactly does "(1) FlashAttention-2 (2) Triton == 2.1.0 are requirements" mean? I ran `pip install Minference` without FlashAttention-2 or Triton == 2.1.0 installed, and the install failed with `ERROR: Failed building wheel for pycuda`.
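
For scale on question 1, here is a rough sketch of the arithmetic behind the memory concern. The model shape is an assumption chosen for illustration (32 layers, 8 KV heads with grouped-query attention, head dimension 128, fp16 values); it is not taken from the MInference code, and the helper name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to keep keys and values for every token in every layer."""
    # Factor 2: one key tensor plus one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


# Assumed configuration (illustrative only): 32 layers, 8 KV heads, head_dim 128, fp16.
total = kv_cache_bytes(num_tokens=1_000_000, num_layers=32, num_kv_heads=8, head_dim=128)
print(f"{total / 2**30:.1f} GiB")  # prints "122.1 GiB"
```

Under these assumptions a full, uncompressed cache for 1M tokens would exceed the memory of a single 80 GB accelerator, which is why the question asks whether the cache stays on device or is offloaded.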

Labels: feature request, question
