Note: This issue was copied from ggml-org#4218
Original Author: @ggerganov
Original Issue Number: ggml-org#4218
Created: 2023-11-25T17:04:06Z
There have been a few reports that grammar sampling can significantly degrade performance.
It would be nice to profile and optimize the implementation - there should be room for improvement.
Already on-going efforts:
- reserve space in `decode_utf8` ggml-org/llama.cpp#4210
- `llama_token_to_piece` when sampling grammars ggml-org/llama.cpp#4213

Probably worth looking into multi-threading the implementation as well.
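To illustrate the kind of optimization #4210 names, here is a minimal, hypothetical sketch of a `decode_utf8`-style helper. It is not the actual llama.cpp implementation; it only demonstrates the technique of pre-reserving the output vector so a hot sampling loop avoids repeated reallocations (worst case, one code point per input byte):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical sketch: decode a UTF-8 string into code points,
// reserving output capacity up front to avoid reallocations
// (the optimization idea behind ggml-org/llama.cpp#4210).
static std::vector<uint32_t> decode_utf8(const std::string & src) {
    std::vector<uint32_t> result;
    result.reserve(src.size() + 1); // worst case: 1 code point per byte, plus terminator

    size_t pos = 0;
    while (pos < src.size()) {
        const uint8_t first = src[pos];
        int      len = 1;
        uint32_t cp  = first;
        // Determine sequence length and initial code-point bits from the lead byte.
        if      ((first & 0x80) == 0x00) { len = 1; cp = first;        }
        else if ((first & 0xE0) == 0xC0) { len = 2; cp = first & 0x1F; }
        else if ((first & 0xF0) == 0xE0) { len = 3; cp = first & 0x0F; }
        else if ((first & 0xF8) == 0xF0) { len = 4; cp = first & 0x07; }
        // Fold in the continuation bytes (6 payload bits each).
        for (int i = 1; i < len && pos + i < src.size(); ++i) {
            cp = (cp << 6) | (static_cast<uint8_t>(src[pos + i]) & 0x3F);
        }
        result.push_back(cp);
        pos += len;
    }
    result.push_back(0); // sentinel terminator
    return result;
}
```

Since the grammar code calls this per candidate token, skipping the incremental growth of the vector is a cheap, low-risk win; profiling would show whether it is a significant one.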