Prerequisites
Feature Description
Llama.cpp is currently one of the best backends for running large MoE models thanks to its ability to intelligently split the load between devices, keeping the routed experts on CPU and the always-active-on-every-token params on GPU. Additionally, it can stream the CPU-allocated weights over to the main GPU to fully delegate prompt processing to the GPU, which tends to have much more compute power.
However, depending on the specific model and hardware - most notably the size of the CPU-allocated weights and the PCIe bandwidth - the break-even point differs: for small enough prompts it's not beneficial to fully offload prompt processing to the GPU, because combined CPU+GPU prompt processing will finish before all of the CPU-allocated weights can be copied over PCIe. From my understanding, the cutoff is currently hardcoded at 32 tokens.
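For reference, my understanding is that this cutoff lives in the CUDA backend's per-op offload heuristic, roughly along these lines (paraphrased from my reading of the source; the exact function and file may differ between versions):

```cpp
// Paraphrased sketch of the heuristic in ggml-cuda (not a verbatim copy):
// an op is only offloaded to the GPU - which is what triggers streaming the
// CPU-allocated weights over PCIe - when the batch holds at least 32 tokens.
static bool ggml_backend_cuda_device_offload_op(ggml_backend_dev_t dev, const ggml_tensor * op) {
    const int min_batch_size = 32; // the hardcoded cutoff this issue is about
    return get_op_batch_size(op) >= min_batch_size;
    GGML_UNUSED(dev);
}
```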
I believe @jukofyork mentioned having a custom hack to make the threshold configurable via env var. Perhaps it's worth exposing the threshold param officially?
Motivation
To give a concrete example of when this would make a difference:
I have a 4.93 bpw DeepSeek quant split between CPU and GPU, with over 300 GB of CPU-allocated weights and about 94 GB of GPU-allocated weights + compute buffers + KV cache.
If I send a request that requires processing 100 new prompt tokens, llama.cpp will still copy the 300+ GB of weights across PCIe to the GPU, because 100 exceeds the hardcoded threshold of 32 tokens - which takes a minimum of ~5 seconds at the maximum theoretical PCIe 5.0 x16 speed. Meanwhile, combined CPU+GPU prompt processing would finish ingesting 100 new tokens in under 5 seconds and lead to better TTFT in this scenario.
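Rough numbers behind that estimate, assuming the theoretical ~63 GB/s ceiling of PCIe 5.0 x16 (real-world bandwidth will be lower, so the copy takes even longer):

$$
t_{\text{copy}} \gtrsim \frac{300\ \text{GB}}{63\ \text{GB/s}} \approx 4.8\ \text{s}
$$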
Possible Implementation
The simplest solution would be to read the threshold from an env var or a CLI arg instead of using the hardcoded value.
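A minimal sketch of the env-var variant, assuming the check stays where it is today; GGML_CUDA_MIN_BATCH_OFFLOAD is a made-up name, and a real patch would presumably want a matching CLI flag as well:

```cpp
#include <cstdlib>

// Hypothetical helper: read the offload threshold from an env var once,
// falling back to the current hardcoded default of 32 tokens.
static int cuda_min_batch_offload() {
    static const int value = [] {
        if (const char * env = std::getenv("GGML_CUDA_MIN_BATCH_OFFLOAD")) { // hypothetical name
            const int parsed = std::atoi(env);
            if (parsed > 0) {
                return parsed;
            }
        }
        return 32; // current behaviour
    }();
    return value;
}
```

The offload heuristic would then compare the batch size against `cuda_min_batch_offload()` instead of the literal 32.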
It would be interesting to try to determine the break-even point automatically, but this is likely too complex and would probably need to be empirically measured on each model+quant+hardware setup.
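If someone did want to attempt it, the condition being estimated would presumably look something like the following (to first order, ignoring any overlap between the weight copies and compute), with the copy bandwidth and the two prompt-processing rates measured per setup:

$$
\frac{\text{bytes}_{\text{CPU weights}}}{\text{BW}_{\text{PCIe}}} + \frac{n_{\text{prompt}}}{r_{\text{GPU pp}}} \;<\; \frac{n_{\text{prompt}}}{r_{\text{hybrid pp}}}
$$

Full offload only wins when the left side (copy time plus GPU-only prompt processing) beats the right side (combined CPU+GPU prompt processing); the $n_{\text{prompt}}$ where the two sides cross is the break-even point that the hardcoded 32 is standing in for.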