Below are some general recommendations to further improve performance.
- Use int8 quantization
- Use an Intel CPU supporting AVX512
- If you are processing a large volume of data, prefer increasing `inter_threads` over `intra_threads` and use stream methods (methods whose name ends with `_file` or `_iterable`)
- Avoid making the total number of threads `inter_threads * intra_threads` larger than the number of physical cores
- For single-core execution on Intel CPUs, consider enabling packed GEMM (set the environment variable `CT2_USE_EXPERIMENTAL_PACKED_GEMM=1`)
- Use a larger batch size whenever possible
- Use an NVIDIA GPU with Tensor Cores (Compute Capability >= 7.0)
- Pass multiple GPU IDs to `device_index` to execute on multiple GPUs
- The default beam size for translation is 2, but consider setting `beam_size=1` to improve performance
- When using a beam size of 1, keep `return_scores` disabled if you are not using prediction scores: the final softmax layer can be skipped
- Set `max_batch_size` and pass a larger batch to `*_batch` methods: the input sentences will be sorted by length and split into chunks of `max_batch_size` elements for improved efficiency
- Prefer the "tokens" `batch_type` to make the total number of elements in a batch more constant
- Consider using {ref}`translation:dynamic vocabulary reduction` for translation
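The length-sorting and `max_batch_size` behavior with the "tokens" `batch_type` can be illustrated with a small stand-alone sketch (plain Python for illustration only, not the CTranslate2 implementation): sort sentences by length, then split them into chunks whose total token count stays under a budget, which keeps padding low and the work per batch roughly constant:

```python
def make_token_batches(sentences, max_batch_tokens):
    """Group tokenized sentences into batches holding at most
    max_batch_tokens tokens in total (a sketch of batch_type="tokens")."""
    # Sort by length so each batch contains similarly sized sentences,
    # minimizing padding inside a batch.
    order = sorted(range(len(sentences)), key=lambda i: len(sentences[i]))
    batches, current, current_tokens = [], [], 0
    for i in order:
        n = len(sentences[i])
        if current and current_tokens + n > max_batch_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(sentences[i])
        current_tokens += n
    if current:
        batches.append(current)
    return batches

# Four tokenized sentences of lengths 1, 2, 3, and 2.
sentences = [["a"], ["b", "c"], ["d", "e", "f"], ["g", "h"]]
batches = make_token_batches(sentences, max_batch_tokens=4)
# Each batch now holds at most 4 tokens in total.
```

The real `*_batch` methods do this internally when `max_batch_size` is set, so passing one large batch is generally more efficient than many small ones.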
The [WNGT 2020 efficiency task submission](https://github.com/OpenNMT/CTranslate2/tree/master/examples/wngt2020) applies many of these recommendations to optimize machine translation models.
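The threading advice above can be condensed into a small helper. This is an illustrative heuristic, not part of the CTranslate2 API; the caller must supply the physical core count, since Python's `os.cpu_count()` reports logical cores:

```python
def choose_threads(physical_cores, streaming):
    """Pick (inter_threads, intra_threads) so that their product never
    exceeds the number of physical cores (illustrative heuristic)."""
    if streaming:
        # Large volumes of data: favor inter_threads to run
        # multiple batches in parallel.
        inter, intra = physical_cores, 1
    else:
        # Latency-sensitive single requests: favor intra_threads
        # to parallelize each matrix operation.
        inter, intra = 1, physical_cores
    assert inter * intra <= physical_cores
    return inter, intra

inter, intra = choose_threads(physical_cores=8, streaming=True)  # (8, 1)
```

The returned values would then be passed as the `inter_threads` and `intra_threads` constructor arguments.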
- Set `include_prompt_in_result=False` so that the input prompt can be forwarded in the decoder at once
- If the model uses a system prompt, consider passing it to the argument `static_prompt` for it to be cached
- When using a beam size of 1, keep `return_scores` disabled if you are not using prediction scores: the final softmax layer can be skipped
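The `static_prompt` tip boils down to caching the work done for a prompt that never changes. The sketch below mimics that idea in plain Python as an analogy only, not the CTranslate2 internals: the hypothetical `encode_prompt` stands in for the expensive decoder forward pass over the system prompt tokens:

```python
from functools import lru_cache

@lru_cache(maxsize=4)
def encode_prompt(system_prompt):
    """Stand-in for running the decoder over the system prompt tokens."""
    encode_prompt.calls += 1  # count how often real work happens
    return tuple(system_prompt.split())

encode_prompt.calls = 0

def generate(system_prompt, user_input):
    state = encode_prompt(system_prompt)  # cached after the first call
    return len(state) + len(user_input.split())  # fake "generation"

generate("You are a helpful assistant.", "Hello")
generate("You are a helpful assistant.", "How are you?")
# The system prompt was encoded only once: encode_prompt.calls == 1
```

With `static_prompt`, the library performs this caching of the model state for you, so repeated calls skip recomputing the prompt.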