llama : add option to save memory in device buffers #22679
Merged
ngxson approved these changes on May 4, 2026
ServeurpersoCom approved these changes on May 4, 2026
samuraieng pushed a commit to samuraieng/llama.cpp that referenced this pull request on May 6, 2026:
* llama : add option to save memory in device buffers
* tests : extend llama-save-load-state
This was referenced May 7, 2026
cetarthoriphros pushed a commit to cetarthoriphros/llama.cpp that referenced this pull request on May 9, 2026
meh pushed a commit to meh/llama.cpp that referenced this pull request on May 10, 2026
baramofme pushed a commit to baramofme/llama-cpp-turboquant that referenced this pull request on May 23, 2026
Yoshi4470 added a commit to Yoshi4470/llama.cpp that referenced this pull request on May 24, 2026:
ggml_backend_tensor_copy checked tensor->buffer directly, but GGML views keep buffer NULL and store the backing allocation on view_src. tensor_get already resolved views; tensor_copy did not.
The mismatch was latent since tensor_copy was added (ggml-org#9707, 2024-10). It surfaced after LLAMA_STATE_SEQ_FLAGS_ON_DEVICE I/O (ggml-org#22679, 2026-05): read/write device destructors stage copies via ggml_view_1d and tensor_copy, and server context checkpoints adopted ON_DEVICE device IO (41d6949). Parallel MTP workloads then hit GGML_ASSERT(buffer) in get_type during checkpoint save/load.
Co-authored-by: Cursor <cursoragent@cursor.com>
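The view-resolution mismatch described in this commit message can be illustrated with a small standalone sketch. The types and the function name below (mock_buffer, mock_tensor, resolve_buffer) are hypothetical stand-ins, not the real ggml definitions; the sketch only assumes what the message states, namely that a view may have a null buffer and points at its source through view_src.

```cpp
#include <cassert>

// Hypothetical stand-ins for the ggml types; not the real definitions.
struct mock_buffer {
    const char * name;
};

struct mock_tensor {
    mock_buffer * buf;       // backing allocation; null for a view in this model
    mock_tensor * view_src;  // source tensor when this tensor is a view, else null
};

// Resolve the buffer that actually backs a tensor.
// For a view, follow view_src instead of reading the view's own (null) buffer.
static mock_buffer * resolve_buffer(const mock_tensor * t) {
    mock_buffer * buf = t->view_src ? t->view_src->buf : t->buf;
    assert(buf != nullptr && "tensor has no backing buffer");
    return buf;
}
```

In this model, a copy routine that reads t->buf directly would see a null pointer for any view, which is the kind of failure the commit message attributes to tensor_copy; routing every access through one resolver keeps the get and copy paths consistent.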
winstonma pushed a commit to winstonma/llama.cpp that referenced this pull request on May 27, 2026
fewtarius pushed a commit to fewtarius/llama.cpp that referenced this pull request on May 30, 2026
Overview
cont #22558
Keep the speculative checkpoints in device memory. This eliminates the overhead of device-to-host (D2H) copies.
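As a rough illustration of the idea, here is a minimal sketch that models the two checkpoint paths. It does not use the llama.cpp API: mock_context, save_checkpoint, restore_checkpoint and the on_device parameter are all hypothetical, and "device memory" is simulated with ordinary vectors. It assumes only what the description states, that an on-device checkpoint stays in buffers owned by the context and is keyed by sequence id.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

using seq_id = int32_t;

// Hypothetical model of a context; vectors stand in for device allocations.
struct mock_context {
    std::vector<uint8_t> device_state;                          // live state, "on device"
    std::map<seq_id, std::vector<uint8_t>> device_checkpoints;  // one buffer per sequence id
    size_t d2h_bytes = 0;                                       // bytes copied device-to-host
};

// Save a checkpoint. The host path copies into the caller's buffer and counts
// the transfer; the on-device path keeps the copy inside the context.
static void save_checkpoint(mock_context & ctx, seq_id id,
                            std::vector<uint8_t> * host_dst, bool on_device) {
    if (on_device) {
        ctx.device_checkpoints[id] = ctx.device_state;  // device-to-device copy
    } else {
        assert(host_dst != nullptr);
        *host_dst = ctx.device_state;                   // device-to-host copy
        ctx.d2h_bytes += ctx.device_state.size();
    }
}

// Restore a checkpoint from the matching location.
static void restore_checkpoint(mock_context & ctx, seq_id id,
                               const std::vector<uint8_t> * host_src, bool on_device) {
    if (on_device) {
        ctx.device_state = ctx.device_checkpoints.at(id);  // throws if never saved
    } else {
        assert(host_src != nullptr);
        ctx.device_state = *host_src;
    }
}
```

With on_device set, a save followed by a restore leaves d2h_bytes at zero in this model, which is the saving the overview refers to; the trade-off is that the checkpoint occupies device memory for as long as the context keeps it.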
Additional information
Extend the llama_state_seq_*_ext API with a new flag: LLAMA_STATE_SEQ_FLAGS_ON_DEVICE. When provided, the tensor data is not copied to/from the host buffer. Instead, it is stored in device buffers owned by the llama_context. The device buffers are created per sequence id.
Also implement the cpy_tensor backend API in the Metal backend. This API is necessary for the new functionality.
Requirements