Prerequisites
Feature Description
Llama.cpp is currently one of the best backends for running large MoE models thanks to its ability to intelligently split the load between devices, keeping the routed experts on CPU and the always-active-on-every-token params on GPU. Additionally, it can stream the CPU-allocated weights over to the main GPU to fully delegate prompt processing to the GPU, which tends to have much more compute power.
However, depending on the specific model and hardware - most notably the size of the CPU-allocated weights and the PCIe bandwidth - the break-even point differs: for small enough prompts it's not beneficial to fully offload prompt processing to the GPU, because combined CPU+GPU prompt processing will finish before all of the CPU-allocated weights can be copied over PCIe. From my understanding, the cutoff is currently hardcoded at 32 tokens.
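For reference, my understanding is that this cutoff lives in the CUDA backend's per-op offload heuristic, roughly along these lines (paraphrased from my reading of the source; the exact function and file may differ between versions):

```cpp
// Paraphrased sketch of the heuristic in ggml-cuda (not a verbatim copy):
// an op is only offloaded to the GPU - which is what triggers streaming the
// CPU-allocated weights over PCIe - when the batch holds at least 32 tokens.
static bool ggml_backend_cuda_device_offload_op(ggml_backend_dev_t dev, const ggml_tensor * op) {
    const int min_batch_size = 32; // the hardcoded cutoff this issue is about
    return get_op_batch_size(op) >= min_batch_size;
    GGML_UNUSED(dev);
}
```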
I believe @jukofyork mentioned having a custom hack to make the threshold configurable via env var. Perhaps it's worth exposing the threshold param officially?
Motivation
To give a concrete example of when this would make a difference:
I have a 4.93 bpw DeepSeek quant split between CPU and GPU, with over 300 GB of CPU-allocated weights and about 94 GB of GPU-allocated weights + compute buffers + KV cache.
If I send a request that requires processing 100 new prompt tokens, llama.cpp will still copy the 300+ GB of weights across PCIe to the GPU, because 100 exceeds the hardcoded threshold of 32 tokens - which takes a minimum of ~5 seconds at the maximum theoretical PCIe 5.0 x16 speed. Meanwhile, combined CPU+GPU prompt processing would finish ingesting 100 new tokens in under 5 seconds and lead to better TTFT in this scenario.
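Rough numbers behind that estimate, assuming the theoretical ~63 GB/s ceiling of PCIe 5.0 x16 (real-world bandwidth will be lower, so the copy takes even longer):

$$
t_{\text{copy}} \gtrsim \frac{300\ \text{GB}}{63\ \text{GB/s}} \approx 4.8\ \text{s}
$$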
Possible Implementation
The simplest solution would be to read the threshold from an env var or a CLI arg instead of using the hardcoded value.
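A minimal sketch of the env-var variant, assuming the check stays where it is today; GGML_CUDA_MIN_BATCH_OFFLOAD is a made-up name, and a real patch would presumably want a matching CLI flag as well:

```cpp
#include <cstdlib>

// Hypothetical helper: read the offload threshold from an env var once,
// falling back to the current hardcoded default of 32 tokens.
static int cuda_min_batch_offload() {
    static const int value = [] {
        if (const char * env = std::getenv("GGML_CUDA_MIN_BATCH_OFFLOAD")) { // hypothetical name
            const int parsed = std::atoi(env);
            if (parsed > 0) {
                return parsed;
            }
        }
        return 32; // current behaviour
    }();
    return value;
}
```

The offload heuristic would then compare the batch size against `cuda_min_batch_offload()` instead of the literal 32.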
It would be interesting to try to determine the break-even point automatically, but this is likely too complex and would probably need to be empirically measured on each model+quant+hardware setup.
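If someone did want to attempt it, the condition being estimated would presumably look something like the following (to first order, ignoring any overlap between the weight copies and compute), with the copy bandwidth and the two prompt-processing rates measured per setup:

$$
\frac{\text{bytes}_{\text{CPU weights}}}{\text{BW}_{\text{PCIe}}} + \frac{n_{\text{prompt}}}{r_{\text{GPU pp}}} \;<\; \frac{n_{\text{prompt}}}{r_{\text{hybrid pp}}}
$$

Full offload only wins when the left side (copy time plus GPU-only prompt processing) beats the right side (combined CPU+GPU prompt processing); the $n_{\text{prompt}}$ where the two sides cross is the break-even point that the hardcoded 32 is standing in for.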