kv-cache : pad the cache size to 256 for performance #17046
Conversation
```diff
  llama_context_params cparams = llama_context_default_params();

- cparams.n_ctx = n_prompt + n_gen + n_depth;
+ cparams.n_ctx = GGML_PAD(n_prompt + n_gen + n_depth, 256);
```
I don't think this is good, applications shouldn't be responsible for this padding.
Move the padding to llama_context constructor?
If there is a significant performance advantage from doing that then yes, that seems like a good thing to do. Otherwise, every single application would need to pad n_ctx manually.
To clarify, the padding in llama-bench here is important because we do the TG test at full context. So we hit the bounds of the unpadded cache buffer when generating tokens (given that the resulting cparams.n_ctx number is not multiple of 256):
llama.cpp/src/llama-kv-cache.cpp
Lines 975 to 989 in 2fd7b77
Normally, applications will rarely experience this problem, because it shows up only when you are close to the full context.
I think it's ok to be in the constructor. The drawback is that you could have your requested n_ctx mutated, which is not ideal. But that is probably less of a problem than the current one.
> The drawback is that you could have your requested `n_ctx` mutated which is not ideal.
As long as the value is never below what the application requested, it should be fine. The application can choose to use llama_n_ctx to take advantage of every slot, but they don't have to. I think this was already the situation before the padding was removed.
Force-pushed from 2fd7b77 to d2c30c6
```cpp
// note: the SWA cache is always padded to 256 for performance
// https://github.com/ggml-org/llama.cpp/issues/17037
uint32_t size_swa = std::min(size_base, GGML_PAD(hparams.n_swa*(unified ? n_seq_max : 1) + n_ubatch, 256));
```
Nit, I think this relies on `kv_size` being padded to 256. The `GGML_PAD` could be moved outside the `std::min` to ensure that it is still padded if this changes in the future.
Force-pushed from 5108c7a to 7b2551a
* kv-cache : pad the size of the small SWA cache for performance
* context : pad the total context to 256
* cont : future-proof the swa pad
* server : adjust test params to new logic
fix #17037 #17058
cont #16812
The small SWA caches can be padded to 256 without concerns about memory usage. Pad the cache size to 256. This is friendly for the CUDA backend since the FA implementation benefits from round sizes of the K/V tensors. It can also help other backends. This is essentially a partial revert of #16812.
Comparing with before the regression in #16812:
```
GGML_CUDA=ON CUDA_VISIBLE_DEVICES=0 ./scripts/compare-commits.sh a8ca18b4b d2c30c61a llama-bench -m ~/.cache/llama.cpp/ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf -m /home/ggerganov/.cache/llama.cpp/ggml-org_Qwen2.5-Coder-3B-Q8_0-GGUF_qwen2.5-coder-3b-q8_0.gguf -ngl 99 -d 4096,8192,16384,32768 -ub 512,4096 -b 4096 -fa 1 -n 32 -mmp 0
```

```
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
```