llama : add option to save memory in device buffers #22679

Merged: ggerganov merged 2 commits into master from gg/llama-state-buf on May 5, 2026
ggerganov commented May 4, 2026

Overview

cont #22558

Keep the speculative checkpoints in device memory. This eliminates the overhead of device-to-host (D2H) copies.

Additional information

Extend the llama_state_seq_*_ext API with a new flag: LLAMA_STATE_SEQ_FLAGS_ON_DEVICE. When provided, the tensor data is not copied to/from the host buffer. Instead, it is stored in device buffers owned by the llama_context. The device buffers are created per sequence id.
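As a rough illustration of the behaviour described above, here is a minimal, self-contained sketch of a save/restore path that either copies state through a host buffer or keeps it in a context-owned buffer keyed by sequence id. It does not use the real llama.cpp API: every `toy_*` and `TOY_*` name is hypothetical, and plain `std::vector` storage stands in for device memory. Only the role of the flag mirrors LLAMA_STATE_SEQ_FLAGS_ON_DEVICE.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical flag set; TOY_STATE_FLAGS_ON_DEVICE plays the role that
// LLAMA_STATE_SEQ_FLAGS_ON_DEVICE plays in the PR description.
enum toy_state_flags : uint32_t {
    TOY_STATE_FLAGS_NONE      = 0,
    TOY_STATE_FLAGS_ON_DEVICE = 1u << 0,
};

// Hypothetical context. The vectors only pretend to be device memory.
struct toy_context {
    std::map<int, std::vector<uint8_t>> live;        // current state per sequence id
    std::map<int, std::vector<uint8_t>> device_bufs; // context-owned checkpoint per sequence id
    std::size_t host_bytes = 0;                      // bytes that crossed the host boundary
};

// Save the state of one sequence. Returns the number of bytes written to `host`.
std::size_t toy_state_seq_get(toy_context & ctx, std::vector<uint8_t> & host,
                              int seq_id, uint32_t flags) {
    const std::vector<uint8_t> & src = ctx.live[seq_id];
    if (flags & TOY_STATE_FLAGS_ON_DEVICE) {
        ctx.device_bufs[seq_id] = src; // stays "on device", host buffer untouched
        return 0;
    }
    host = src;                        // the device-to-host copy the flag avoids
    ctx.host_bytes += host.size();
    return host.size();
}

// Restore the state of one sequence. Returns false if no checkpoint exists.
bool toy_state_seq_set(toy_context & ctx, const std::vector<uint8_t> & host,
                       int seq_id, uint32_t flags) {
    if (flags & TOY_STATE_FLAGS_ON_DEVICE) {
        auto it = ctx.device_bufs.find(seq_id);
        if (it == ctx.device_bufs.end()) {
            return false;
        }
        ctx.live[seq_id] = it->second;
        return true;
    }
    ctx.live[seq_id] = host;
    ctx.host_bytes += host.size();
    return true;
}

// Checkpoint, overwrite the live state, then roll back without touching the host.
bool toy_roundtrip_demo() {
    toy_context ctx;
    ctx.live[0] = {1, 2, 3, 4};
    std::vector<uint8_t> host;

    toy_state_seq_get(ctx, host, 0, TOY_STATE_FLAGS_ON_DEVICE);
    ctx.live[0] = {9, 9, 9, 9}; // speculative work changes the state
    const bool ok = toy_state_seq_set(ctx, host, 0, TOY_STATE_FLAGS_ON_DEVICE);

    assert(host.empty());
    return ok && ctx.host_bytes == 0 && ctx.live[0] == std::vector<uint8_t>{1, 2, 3, 4};
}
```

In this toy model the on-device path never writes to the caller's host buffer, which is the property the flag is meant to provide. The real implementation additionally has to allocate backend buffers and issue backend copies, which this sketch does not attempt to show.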

Also implement the cpy_tensor backend API in the Metal backend. The new functionality requires this API.

github-actions bot added the examples, server, ggml, and Apple Metal labels May 4, 2026
ggerganov merged commit d6e7b03 into master May 5, 2026
52 of 53 checks passed
ggerganov deleted the gg/llama-state-buf branch May 5, 2026 03:35
samuraieng pushed a commit to samuraieng/llama.cpp that referenced this pull request May 6, 2026
* llama : add option to save memory in device buffers

* tests : extend llama-save-load-state
Yoshi4470 added a commit to Yoshi4470/llama.cpp that referenced this pull request May 24, 2026
ggml_backend_tensor_copy checked tensor->buffer directly, but GGML views
keep buffer NULL and store the backing allocation on view_src. tensor_get
already resolved views; tensor_copy did not. The mismatch was latent since
tensor_copy was added (ggml-org#9707, 2024-10).

It surfaced after LLAMA_STATE_SEQ_FLAGS_ON_DEVICE I/O (ggml-org#22679, 2026-05):
read/write device destructors stage copies via ggml_view_1d and
tensor_copy, and server context checkpoints adopted ON_DEVICE device IO
(41d6949). Parallel MTP workloads then hit GGML_ASSERT(buffer) in
get_type during checkpoint save/load.

Co-authored-by: Cursor <cursoragent@cursor.com>
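The view mismatch described in the commit message above can be sketched with toy types. This is not the ggml source: `toy_buffer`, `toy_tensor` and the helper functions are hypothetical, and only the field names `buffer` and `view_src` follow the commit message. The sketch contrasts a check that reads `buffer` directly with one that first resolves a view to the tensor owning the allocation.

```cpp
#include <cassert>

// Hypothetical stand-in for a backend buffer.
struct toy_buffer {
    int id;
};

// Hypothetical stand-in for a tensor. Per the commit message, a view keeps
// `buffer` NULL and points at the tensor that owns the allocation via `view_src`.
struct toy_tensor {
    toy_buffer * buffer   = nullptr;
    toy_tensor * view_src = nullptr;
};

// Resolve the buffer that actually backs a tensor. The loop tolerates a chain
// of views as well as the single-hop case.
toy_buffer * toy_resolve_buffer(const toy_tensor * t) {
    while (t->view_src != nullptr) {
        t = t->view_src;
    }
    return t->buffer;
}

// Direct check: reports "no buffer" for any view.
bool toy_can_copy_direct(const toy_tensor * t) {
    return t->buffer != nullptr;
}

// Resolved check: accepts views whose source tensor is allocated.
bool toy_can_copy_resolved(const toy_tensor * t) {
    return toy_resolve_buffer(t) != nullptr;
}

// Build one allocated tensor and one view of it, then compare both checks.
bool toy_view_demo() {
    toy_buffer buf{7};
    toy_tensor base;
    base.buffer = &buf;
    toy_tensor view;
    view.view_src = &base;

    assert(toy_resolve_buffer(&view) == &buf);
    return !toy_can_copy_direct(&view) && toy_can_copy_resolved(&view);
}
```

Under these assumptions, a copy routine that uses the direct check rejects every view, while one that resolves through `view_src` treats a view the same way as its source tensor.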