llama : llama_perf + option to disable timings during decode #9355
Conversation
```diff
  bool offload_kqv; // whether to offload the KQV ops (including the KV cache) to GPU
  bool flash_attn;  // whether to use flash attention [EXPERIMENTAL]
- //bool no_perf;   // whether to measure performance timings, TODO: implement
+ bool no_perf;     // whether to measure performance timings

  // Abort callback
```
This is a minor libllama API breaking change, due to the addition of the `no_perf` parameter.
I don't think this will be a breaking change, since `struct llama_context_params` is expected to be created by `llama_context_default_params()`, right?
AFAIK such changes still break external bindings, such as: https://github.com/abetlen/llama-cpp-python/blob/c032fc65b0873337ed39e5d63e15468a5d797646/llama_cpp/llama_cpp.py#L841
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
```c
    LLAMA_PERF_TYPE_SAMPLER_CHAIN = 1,
};

LLAMA_API struct llama_perf_data llama_perf_get(const void * ctx, enum llama_perf_type type);
```
I think it would be preferable to have two separate functions, just to remove the possibility of calling it with the wrong type of pointer.
```c
    return data;
}

const auto * p = (const struct llama_sampler_chain *) chain->ctx;
```
These casts are very error-prone and should always be checked. To do so, I would suggest moving these functions to llama-sampling.cpp and checking the interface pointer. The `llama_sampler_chain` struct could also be moved to llama-sampling.cpp.
Additionally, since this only works with the chain sampler, it should be documented somewhere, either in the function/struct names, or with an explicit comment, otherwise the natural assumption is that it should work with any sampler.
Upon passing a non-chain sampler, should it return empty data or call GGML_ABORT()?
I think an abort would be better here until we can return status codes from functions, since it is most definitely not intended and the important part is that the programmer notices.
…g#9355)

* llama : llama_perf + option to disable timings during decode (ggml-ci)
* common : add llama_arg
* Update src/llama.cpp (Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>)
* perf : separate functions in the API (ggml-ci)
* perf : safer pointer handling + naming update (ggml-ci)
* minor : better local var name
* perf : abort on invalid sampler pointer (ggml-ci)

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
`llama_context_params.no_perf`). Performance measurements are disabled by default for libllama, but for the examples in llama.cpp they are enabled by default.

`llama_perf_get`

TODO:

* `llama_arg` after common : refactor arg parser #9308