[Question]: Question about KV-cache storage

### Describe the issue

Thank you for the amazing work!

1. Does the model store the whole kv-cache of prefilling and generation on device? If so, how can the device hold the memory of 1M kv values; if not, how did you reduce the overhead of loading kv-values from host to device, and vice versa? 

2. What exactly does it mean by "(1) FlashAttention-2 (2) Triton == 2.1.0 are requirements"? I tried to use `pip install Minference` w/t having `FlashAttention-2` and `Triton == 2.1.0` installed, and then it outputted `ERROR: Failed building wheel for pycuda`.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Question]: Question about KV-cache storage #20

Describe the issue

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[Question]: Question about KV-cache storage #20

Description

Describe the issue

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions