llama : add option to save memory in device buffers by ggerganov · Pull Request #22679 · ggml-org/llama.cpp

ggerganov · 2026-05-04T12:39:59Z

Overview

Keep the speculative checkpoints in device memory. This eliminates the overhead of D2H copies.

Additional information

Extend the llama_state_seq_*_ext API with a new flag: LLAMA_STATE_SEQ_FLAGS_ON_DEVICE. When provided, the tensor data is not copied to/from the host buffer. Instead, it is stored in device buffers owned by the llama_context. The device buffers are created per sequence id.
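To illustrate the idea, here is a minimal, self-contained sketch of the flag's semantics. It does not use llama.cpp: every name in it (mock_context, mock_state_seq_get_data, mock_state_seq_set_data, MOCK_FLAGS_ON_DEVICE) is hypothetical, the "device" memory is modeled with plain vectors, and any metadata handling of the real API is ignored. It only shows the behavior described above: with the flag set, the state is kept in a per-sequence buffer owned by the context and no bytes pass through the host buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins for illustration only; none of these names exist in llama.cpp.
enum mock_state_flags : uint32_t {
    MOCK_FLAGS_NONE      = 0,
    MOCK_FLAGS_ON_DEVICE = 1u << 0, // models the role of LLAMA_STATE_SEQ_FLAGS_ON_DEVICE
};

struct mock_context {
    std::vector<uint8_t> live;                                     // pretend: sequence state in device memory
    std::unordered_map<int32_t, std::vector<uint8_t>> checkpoints; // pretend: device buffers, one per seq id
    size_t host_bytes_copied = 0;                                  // simulated D2H/H2D traffic
};

// Save the state of seq_id. With ON_DEVICE, dst is not touched and the data
// stays in a buffer owned by the context.
size_t mock_state_seq_get_data(mock_context & ctx, uint8_t * dst, size_t size,
                               int32_t seq_id, uint32_t flags) {
    if (flags & MOCK_FLAGS_ON_DEVICE) {
        ctx.checkpoints[seq_id] = ctx.live; // models a device-to-device copy
        return ctx.live.size();
    }
    if (size < ctx.live.size()) {
        return 0; // host buffer too small
    }
    std::memcpy(dst, ctx.live.data(), ctx.live.size());
    ctx.host_bytes_copied += ctx.live.size(); // models the D2H copy
    return ctx.live.size();
}

// Restore the state of seq_id. With ON_DEVICE, src is ignored and the data
// comes from the per-sequence buffer written by the matching save.
size_t mock_state_seq_set_data(mock_context & ctx, const uint8_t * src, size_t size,
                               int32_t seq_id, uint32_t flags) {
    if (flags & MOCK_FLAGS_ON_DEVICE) {
        auto it = ctx.checkpoints.find(seq_id);
        if (it == ctx.checkpoints.end()) {
            return 0; // nothing was saved for this sequence
        }
        ctx.live = it->second;
        return ctx.live.size();
    }
    ctx.live.assign(src, src + size);
    ctx.host_bytes_copied += size; // models the H2D copy
    return size;
}
```

In this sketch, a save followed by a restore with MOCK_FLAGS_ON_DEVICE leaves host_bytes_copied at zero, which is the overhead the PR aims to remove for speculative checkpoints.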

Also implement the cpy_tensor backend API in the Metal backend. This API is necessary for the new functionality.

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: NO

* llama : add option to save memory in device buffers * tests : extend llama-save-load-state

ggml_backend_tensor_copy checked tensor->buffer directly, but GGML views keep buffer NULL and store the backing allocation on view_src. tensor_get already resolved views; tensor_copy did not.

The mismatch was latent since tensor_copy was added (ggml-org#9707, 2024-10). It surfaced after LLAMA_STATE_SEQ_FLAGS_ON_DEVICE I/O (ggml-org#22679, 2026-05): read/write device destructors stage copies via ggml_view_1d and tensor_copy, and server context checkpoints adopted ON_DEVICE device IO (41d6949). Parallel MTP workloads then hit GGML_ASSERT(buffer) in get_type during checkpoint save/load.

Co-authored-by: Cursor <cursoragent@cursor.com>
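The view-resolution mismatch described in this commit message can be sketched with mock types. This is an illustration only: mock_buffer, mock_tensor, direct_buffer and resolved_buffer are hypothetical names, not ggml's own, and the model simply follows the commit message's description that a view has a NULL buffer and points at its backing tensor through view_src.

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration; these are not ggml types.
struct mock_buffer {
    int id;
};

struct mock_tensor {
    mock_buffer * buffer   = nullptr; // NULL for views, per the description above
    mock_tensor * view_src = nullptr; // backing tensor when this tensor is a view
};

// Lookup that reads the buffer field directly: returns NULL for a view,
// which is the condition that would trip an assertion on the buffer.
inline mock_buffer * direct_buffer(const mock_tensor & t) {
    return t.buffer;
}

// Lookup that resolves views first: falls back to the backing tensor's buffer.
inline mock_buffer * resolved_buffer(const mock_tensor & t) {
    return t.view_src ? t.view_src->buffer : t.buffer;
}
```

For a non-view tensor both lookups agree; they differ only for views, which is why the problem could stay hidden until a code path started copying through views.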


llama : add option to save memory in device buffers

4640060

ggerganov requested review from a team as code owners May 4, 2026 12:40

ggerganov mentioned this pull request May 4, 2026

llama: allow partial seq_rm for GDN models for speculative decoding #22400

Closed

ngxson approved these changes May 4, 2026


ServeurpersoCom approved these changes May 4, 2026


github-actions bot added the examples, server, ggml, and Apple Metal labels May 4, 2026

tests : extend llama-save-load-state

9c6ca18

ggerganov merged commit d6e7b03 into master May 5, 2026
52 of 53 checks passed

ggerganov deleted the gg/llama-state-buf branch May 5, 2026 03:35

samuraieng pushed a commit to samuraieng/llama.cpp that referenced this pull request May 6, 2026

llama : add option to save memory in device buffers (ggml-org#22679)

7742ba7

* llama : add option to save memory in device buffers * tests : extend llama-save-load-state

This was referenced May 7, 2026

llama : remove unnecessary seq_id check during state restore #22797

Merged

llama : fix device state save/load #22805

Merged

cetarthoriphros pushed a commit to cetarthoriphros/llama.cpp that referenced this pull request May 9, 2026

llama : add option to save memory in device buffers (ggml-org#22679)

4857c94

* llama : add option to save memory in device buffers * tests : extend llama-save-load-state

meh pushed a commit to meh/llama.cpp that referenced this pull request May 10, 2026

llama : add option to save memory in device buffers (ggml-org#22679)

1d8a563

* llama : add option to save memory in device buffers * tests : extend llama-save-load-state

a-ghorbani mentioned this pull request May 12, 2026

chore(deps): upgrade llama.rn to 0.12.0 a-ghorbani/pocketpal-ai#722

Merged

dacorvo mentioned this pull request May 22, 2026

Prompt Cache Improvements: Persistent and Automatic dacorvo/llama.cpp#1

Open

baramofme pushed a commit to baramofme/llama-cpp-turboquant that referenced this pull request May 23, 2026

llama : add option to save memory in device buffers (ggml-org#22679)

55e16ae

* llama : add option to save memory in device buffers * tests : extend llama-save-load-state

winstonma pushed a commit to winstonma/llama.cpp that referenced this pull request May 27, 2026

llama : add option to save memory in device buffers (ggml-org#22679)

b03acd3

* llama : add option to save memory in device buffers * tests : extend llama-save-load-state

fewtarius pushed a commit to fewtarius/llama.cpp that referenced this pull request May 30, 2026

llama : add option to save memory in device buffers (ggml-org#22679)

dee3697

* llama : add option to save memory in device buffers * tests : extend llama-save-load-state

ggerganov merged 2 commits into master from gg/llama-state-buf


3 participants
