Create a C-style API similar to whisper.cpp#77

Closed
thomasantony wants to merge 18 commits into ggml-org:master from thomasantony:refactor_for_library

@thomasantony thomasantony commented Mar 13, 2023

This change makes it easier to use this code as a library, for example to build Python bindings on top of it. It extracts the following functions into llama.cpp:

  • llama_model_load
  • llama_eval
  • llama_model_quantize

It also moves the relevant struct definitions to llama.h. This helps, for example, to avoid redefining llama_hparams in quantize.cpp. Please let me know if you have any suggestions to improve this.

See here for an example of this library structure in use.

@thomasantony thomasantony force-pushed the refactor_for_library branch from fb6a512 to bb0600c Compare March 13, 2023 04:58
@j-f1
Contributor

j-f1 commented Mar 13, 2023

In my fork I added this struct to bundle up all the relevant data:

struct llama_state {
    gpt_vocab vocab;
    llama_model model;
    struct {
        int64_t t_load_us = -1;
        int64_t t_sample_us = -1;
        int64_t t_predict_us = -1;
    } timing;
};

Member

@ggerganov ggerganov left a comment

Yes, this is a step in the right direction, but it exposes the wrong things.

llama_layer and llama_model should not be publicly visible.
You have to wrap them in llama_context or llama_state, which is only forward-declared in llama.h and defined in llama.cpp.

See the whisper.cpp C-style API for doing it the correct way:
https://github.com/ggerganov/whisper.cpp/blob/master/whisper.h

If you give it another try, make sure to start from the latest master, since things are changing there.
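
A minimal sketch of the opaque-handle pattern this review describes, shown in a single file so it compiles on its own. All names here (demo_context, demo_model, demo_init, demo_free, demo_n_ctx) are hypothetical and are not taken from llama.cpp or whisper.cpp; the first part stands in for what a public header might declare and the second for what the implementation file might define.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// ---- Part 1: what a public header might contain (hypothetical names) ----
extern "C" {
    struct demo_context;                        // forward declaration only: layout stays hidden

    demo_context * demo_init(int32_t n_ctx);    // returns nullptr on failure
    void           demo_free(demo_context * ctx);
    int32_t        demo_n_ctx(const demo_context * ctx);
}

// ---- Part 2: what the implementation file might contain ----
struct demo_model {                             // internal type, never visible to callers
    std::vector<float> weights;
};

struct demo_context {                           // full definition lives with the implementation
    demo_model model;
    int32_t    n_ctx = 0;
};

demo_context * demo_init(int32_t n_ctx) {
    if (n_ctx <= 0) {
        return nullptr;
    }
    try {
        // unique_ptr releases the context if a later step throws
        std::unique_ptr<demo_context> ctx(new demo_context());
        ctx->n_ctx = n_ctx;
        ctx->model.weights.assign(static_cast<std::size_t>(n_ctx), 0.0f);
        return ctx.release();
    } catch (...) {
        return nullptr;                         // never let an exception cross the C boundary
    }
}

void demo_free(demo_context * ctx) {
    delete ctx;                                 // deleting nullptr is a no-op
}

int32_t demo_n_ctx(const demo_context * ctx) {
    return ctx != nullptr ? ctx->n_ctx : 0;
}
```

Because callers only ever see the forward declaration, the fields of demo_model and demo_context can change without touching the header or breaking code that was compiled against it.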

@bakkot bakkot mentioned this pull request Mar 15, 2023
@thomasantony thomasantony force-pushed the refactor_for_library branch 3 times, most recently from e463b4f to 3a561bb Compare March 16, 2023 03:41
@thomasantony
Author

@ggerganov I have made the changes. Please let me know what you think.

@thomasantony thomasantony force-pushed the refactor_for_library branch 3 times, most recently from c9904e5 to 6ff3e64 Compare March 16, 2023 03:51
Comment thread CMakeLists.txt Outdated
Collaborator

llamalib already contains llama.cpp, utils.cpp and utils.h

Author

Updated.

Comment thread CMakeLists.txt Outdated
Collaborator

missing llama.h, utils.cpp and utils.h

Author

Updated CMakeLists.txt.

Contributor

@j-f1 j-f1 left a comment

Some feedback on the API:

Comment thread llama.h Outdated
Contributor

Why does llama_init_context_with_prompt take a llama_context& while llama_init_from_params returns a llama_context*? Can you give these a consistent API, or rename them to clarify how they differ?
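
To make the mismatch concrete, here is a small sketch that contrasts the two calling styles and shows one consistent alternative. The toy_* names and the toy_context struct are hypothetical stand-ins; the real signatures of llama_init_context_with_prompt and llama_init_from_params are not shown in this thread and are not reproduced here.

```cpp
#include <cassert>
#include <string>

struct toy_context {                 // hypothetical stand-in, not llama_context
    std::string prompt;
    int         n_ctx = 0;
};

// Style A: factory. The library allocates; the caller releases with toy_free.
toy_context * toy_init(int n_ctx) {
    if (n_ctx <= 0) {
        return nullptr;
    }
    toy_context * ctx = new toy_context();
    ctx->n_ctx = n_ctx;
    return ctx;
}

void toy_free(toy_context * ctx) {
    delete ctx;                      // deleting nullptr is a no-op
}

// Style B (the mismatch): mutates the object through a C++ reference,
// so callers holding a pointer from toy_init must dereference it first.
void toy_set_prompt_ref(toy_context & ctx, const std::string & prompt) {
    ctx.prompt = prompt;
}

// Consistent alternative: same handle type as the factory, plain C argument
// types, and a return value that reports failure instead of assuming success.
bool toy_set_prompt(toy_context * ctx, const char * prompt) {
    if (ctx == nullptr || prompt == nullptr) {
        return false;
    }
    ctx->prompt = prompt;
    return true;
}
```

With the pointer-based variant, every function in the API takes the same handle that the factory returned, which also keeps references out of the public signatures.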

Author

Removed this confusion in my second pass of refactoring. I feel it is a lot cleaner now. Please take a look.

@thomasantony
Author

thomasantony commented Mar 18, 2023

@j-f1 @Green-Sky @ggerganov I have done another pass at refactoring and also fixed a few logical bugs that left interactive mode broken in my original version (among other things). I have verified that interactive mode now works as intended and inference remains just as fast as before.

I have also rebased on to the latest master branch. Please take another look. Thanks!

@thomasantony thomasantony force-pushed the refactor_for_library branch 2 times, most recently from 41b6af6 to 71f75c1 Compare March 18, 2023 03:52
@ggerganov
Member

@thomasantony
We want to have a C-style API in llama.h. We cannot expose C++ constructs.

For now, leave it like this and let me apply the necessary changes on top of yours to demonstrate what I have in mind - probably tomorrow or the day after.
Thanks for contributing!
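
As an illustration of what "no C++ constructs" can mean for a function signature, the sketch below keeps std::string and std::vector inside an internal helper and exposes only a wrapper that takes a C string and a caller-provided buffer. The names toy_tokenize and toy_tokenize_impl, the one-id-per-byte scheme, and the -1 error convention are all assumptions made for this example, not the API that llama.cpp ended up with.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Internal helper: free to use C++ types because it never appears in the header.
static std::vector<int32_t> toy_tokenize_impl(const std::string & text) {
    std::vector<int32_t> ids;
    ids.reserve(text.size());
    for (unsigned char ch : text) {
        ids.push_back(static_cast<int32_t>(ch));   // toy scheme: one id per byte
    }
    return ids;
}

// C-style wrapper: only plain C types in the signature.
// Returns the number of ids written, or -1 on invalid arguments,
// an output buffer that is too small, or an internal failure.
extern "C" int32_t toy_tokenize(const char * text, int32_t * tokens, int32_t n_max_tokens) {
    if (text == nullptr || tokens == nullptr || n_max_tokens < 0) {
        return -1;
    }
    try {
        const std::vector<int32_t> ids = toy_tokenize_impl(text);
        if (ids.size() > static_cast<std::size_t>(n_max_tokens)) {
            return -1;                             // caller's buffer cannot hold the result
        }
        std::copy(ids.begin(), ids.end(), tokens);
        return static_cast<int32_t>(ids.size());
    } catch (...) {
        return -1;                                 // exceptions must not cross the C boundary
    }
}
```

A signature like this can be called from C, or from Python through a foreign-function interface, without either side needing to know how std::vector is laid out.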

@ggerganov ggerganov changed the title Refactors out some of the functions into llama.cpp Create a C-style API similar to whisper.cpp Mar 18, 2023
@thomasantony
Author

@thomasantony We want to have a C-style API in llama.h. We cannot expose C++ constructs.

For now, leave it like this and let me apply the necessary changes on top of yours to demonstrate what I have in mind - probably tomorrow or the day after. Thanks for contributing!

Okay, thanks. In the meantime, I will rebase this branch onto the latest changes on master.

@thomasantony thomasantony force-pushed the refactor_for_library branch 4 times, most recently from f609ff4 to 5a5d552 Compare March 19, 2023 18:59
@thomasantony thomasantony force-pushed the refactor_for_library branch from f0aea33 to 5195fed Compare March 19, 2023 20:39
@ggerganov
Member

Superseded by #370

@ggerganov ggerganov closed this Mar 21, 2023
rooprob pushed a commit to rooprob/llama.cpp that referenced this pull request Aug 2, 2023
Update README.md: formate output samples
phuongncn pushed a commit to phuongncn/llama.cpp-gx10-dgx-sparks-deepseekv4 that referenced this pull request Apr 28, 2026
* Adding q6_0 - basics + AVX2/Zen4 working

* Adding q6_0: CUDA dequantize works, but not mmvq

* Adding q6_0: CUDA mmvq works

* Adding q6_0: CUDA cpy, so Q6_0 can be used for KV-cache

* Add q6_0 to CPU flash attention

Disappointing result: for LlaMA-3.2-1B, q6_0 K- and V-cache
gives about the same PPL as q8_0 K-cache and q4_0 V-cache,
while needing the exact same RAM.
I.e., what was the point?

* q6_0: slightly better kv-cache result

Better than q8_0+q4_0, but not as good as q8_0+iq4_nl

* q6_0: works on ARM_NEON

* q6_0: dequantize works on Metal, but not vector dot product

* q6_0: it now works on Metal

Outperforms q5_0 by a significant margin. E.g.
| model                          |       size |     params | backend    | ngl | threads |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------------: | ---------------: |
| llama 8B Q6_0                  |   6.08 GiB |     8.03 B | Metal      | 100 |       4 |         tg128 |     44.02 ± 0.08 |
| llama 8B Q5_0                  |   5.21 GiB |     8.03 B | Metal      | 100 |       4 |         tg128 |     40.13 ± 0.12 |
| llama 8B Q6_0                  |   6.08 GiB |     8.03 B | Metal      | 100 |       4 |         pp512 |    500.55 ± 0.32 |
| llama 8B Q5_0                  |   5.21 GiB |     8.03 B | Metal      | 100 |       4 |         pp512 |    448.02 ± 0.27 |

* q6_0: can now be used for kv-cache on Metal

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
ParmesanParty added a commit to ParmesanParty/llama.cpp that referenced this pull request May 6, 2026
Below the assistant's text content, render a horizontal thumbnail row
for image attachments using ChatAttachmentsList with limitToSingleRow
(so long turns overflow to a gallery dialog rather than pushing chat
scroll). Built on top of resolveInlineImageSrcs (re-exported from
utils barrel for ergonomics) which already resolves both name-based
markdown image refs and literal <img src> URLs against message.extra.

Curation policy: if the model embedded ANY image artifact inline via
markdown image refs, treat that as authorial selection — hide the
entire footer gallery so rejected attempts don't appear next to the
chosen one. Unreferenced artifacts remain accessible via per-tool-call
ToolStatusChip. If the model embedded no inline image refs, the
gallery falls back to showing every image (default behavior for batch
tasks or non-curating models). Non-image attachments are unaffected.

Folds parmesan-pre-rebase commits ggml-org#48 (artifact row) and ggml-org#77 (model
curation supersedes per-attachment dedup); skips ggml-org#14 (ChatMessage-
StreamContent.svelte) since the parmesan-specific stream-content
renderer depends on streamedEvents accumulator that R5's minimal
plumbing didn't include — that's deferred to T28.

Labels

enhancement (New feature or request), high priority (Very important issue)

5 participants