Describe the issue
Thank you for the amazing work!
-
Does the model store the whole kv-cache of prefilling and generation on device? If so, how can the device hold the memory of 1M kv values; if not, how did you reduce the overhead of loading kv-values from host to device, and vice versa?
-
What exactly does it mean by "(1) FlashAttention-2 (2) Triton == 2.1.0 are requirements"? I tried to use pip install Minference w/t having FlashAttention-2 and Triton == 2.1.0 installed, and then it outputted ERROR: Failed building wheel for pycuda.
Describe the issue
Thank you for the amazing work!
Does the model store the whole kv-cache of prefilling and generation on device? If so, how can the device hold the memory of 1M kv values; if not, how did you reduce the overhead of loading kv-values from host to device, and vice versa?
What exactly does it mean by "(1) FlashAttention-2 (2) Triton == 2.1.0 are requirements"? I tried to use
pip install Minferencew/t havingFlashAttention-2andTriton == 2.1.0installed, and then it outputtedERROR: Failed building wheel for pycuda.