A variety of speculative models of this type are available on the HF Hub:
- [granite-7b-instruct-accelerator](https://huggingface.co/ibm-granite/granite-7b-instruct-accelerator)
- [granite-20b-code-instruct-accelerator](https://huggingface.co/ibm-granite/granite-20b-code-instruct-accelerator)

## Speculating using EAGLE-based draft models

The following code configures vLLM to use speculative decoding where proposals are generated by
an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077)-based draft model.
```python
from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=4,
    # Path to an EAGLE checkpoint converted for vLLM (see note 1 below).
    speculative_model="path/to/modified/eagle/model",
    # The draft model must run without tensor parallelism (see note 2 below).
    speculative_draft_tensor_parallel_size=1,
)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

A few important things to consider when using EAGLE-based draft models:

1. The EAGLE draft models available in the [HF repository for EAGLE models](https://huggingface.co/yuhuili) cannot be
   used directly with vLLM due to differences in the expected layer names and model definition.
   To use these models with vLLM, use the [following script](https://gist.github.com/abhigoyal1997/1e7a4109ccb7704fbc67f625e86b2d6d)
   to convert them. Note that this script does not modify the model's weights.

   For the example above, first convert
   the [yuhuili/EAGLE-LLaMA3-Instruct-8B](https://huggingface.co/yuhuili/EAGLE-LLaMA3-Instruct-8B) model with this script,
   and then use the converted checkpoint as the draft model in vLLM (a workflow sketch follows this list).

2. EAGLE-based draft models need to be run without tensor parallelism
   (i.e. `speculative_draft_tensor_parallel_size` must be set to 1), although
   it is possible to run the main model with tensor parallelism (see the example above).

3. When using EAGLE-based speculators with vLLM, the observed speedup is lower than what is
   reported in the [reference implementation](https://github.com/SafeAILab/EAGLE). This issue is under
   investigation and is tracked in [vllm-project/vllm#9565](https://github.com/vllm-project/vllm/issues/9565).
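
Point 1 describes a two-step workflow: convert the published EAGLE checkpoint, then pass the
converted directory to vLLM. The snippet below is a minimal sketch of that workflow; the
conversion itself is only indicated in a comment, and `converted_dir` is a hypothetical path,
since the gist script's exact interface is not reproduced here.

```python
from huggingface_hub import snapshot_download

from vllm import LLM

# Download the original EAGLE draft checkpoint from the HF hub.
original_dir = snapshot_download(repo_id="yuhuili/EAGLE-LLaMA3-Instruct-8B")

# Conversion step (not shown): run the gist script linked above on
# `original_dir` so the layer names match what vLLM expects, writing the
# result to `converted_dir`. This path is a hypothetical example.
converted_dir = "./eagle-llama3-instruct-8b-vllm"

# Use the converted checkpoint as the draft model; the draft model itself
# runs without tensor parallelism.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    speculative_model=converted_dir,
    speculative_draft_tensor_parallel_size=1,
)
```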

A variety of EAGLE draft models are available on the Hugging Face Hub:

| Base Model                 | EAGLE on Hugging Face               | # EAGLE Parameters |
|----------------------------|-------------------------------------|--------------------|
| Vicuna-7B-v1.3             | yuhuili/EAGLE-Vicuna-7B-v1.3        | 0.24B              |
| Vicuna-13B-v1.3            | yuhuili/EAGLE-Vicuna-13B-v1.3       | 0.37B              |
| Vicuna-33B-v1.3            | yuhuili/EAGLE-Vicuna-33B-v1.3       | 0.56B              |
| LLaMA2-Chat 7B             | yuhuili/EAGLE-llama2-chat-7B        | 0.24B              |
| LLaMA2-Chat 13B            | yuhuili/EAGLE-llama2-chat-13B       | 0.37B              |
| LLaMA2-Chat 70B            | yuhuili/EAGLE-llama2-chat-70B       | 0.99B              |
| Mixtral-8x7B-Instruct-v0.1 | yuhuili/EAGLE-mixtral-instruct-8x7B | 0.28B              |
| LLaMA3-Instruct 8B         | yuhuili/EAGLE-LLaMA3-Instruct-8B    | 0.25B              |
| LLaMA3-Instruct 70B        | yuhuili/EAGLE-LLaMA3-Instruct-70B   | 0.99B              |
| Qwen2-7B-Instruct          | yuhuili/EAGLE-Qwen2-7B-Instruct     | 0.26B              |
| Qwen2-72B-Instruct         | yuhuili/EAGLE-Qwen2-72B-Instruct    | 1.05B              |
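
Any base/draft pair from this table can be substituted into the earlier example. The sketch below
(with a hypothetical converted-checkpoint path; the EAGLE checkpoint must first be converted with
the script mentioned above) shows what a Qwen2 configuration might look like:

```python
from vllm import LLM, SamplingParams

# Swap in another base model from the table together with its converted
# EAGLE draft checkpoint; the draft still runs without tensor parallelism.
llm = LLM(
    model="Qwen/Qwen2-7B-Instruct",
    speculative_model="path/to/converted/EAGLE-Qwen2-7B-Instruct",
    speculative_draft_tensor_parallel_size=1,
)

outputs = llm.generate(
    ["The future of AI is"],
    SamplingParams(temperature=0.8, top_p=0.95),
)

for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")
```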

## Lossless guarantees of Speculative Decoding

In vLLM, speculative decoding aims to enhance inference efficiency while maintaining accuracy. This section addresses the lossless guarantees of