[Feat] add deepseek eagle3 #5194

Conversation
Signed-off-by: zhaozx-cn <zhaozx2116@163.com>
Co-authored-by: drslark <slarksblood@qq.com>
Co-authored-by: hwhaokun <haokun0405@163.com>
Code Review
This pull request adds support for deepseek eagle3 speculative decoding. The changes involve patching several parts of the codebase to handle the eagle3 method, including model verification, attention metadata building, and model-specific logic for deepseek_v2. My review identifies several areas where the code could be made more robust and maintainable. Specifically, there are multiple instances of complex, hardcoded conditions that are brittle and difficult to understand. There is also a critical issue with monkey-patching a model's __init__ method, which poses a significant maintenance risk.
```python
def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
    super().__init__()

    config = vllm_config.model_config.hf_config
    quant_config = vllm_config.quant_config
    self.config = config
    self.device = current_platform.device_type

    self.vocab_size = config.vocab_size
    self.is_v32 = hasattr(config, "index_topk")
    if self.is_v32:
        topk_tokens = config.index_topk
        topk_indices_buffer = torch.empty(
            vllm_config.scheduler_config.max_num_batched_tokens,
            topk_tokens,
            dtype=torch.int32,
            device=self.device,
        )
    else:
        topk_indices_buffer = None

    if get_pp_group().is_first_rank:
        self.embed_tokens = VocabParallelEmbedding(
            config.vocab_size,
            config.hidden_size,
            quant_config=quant_config,
            prefix=f"{prefix}.embed_tokens",
        )
    else:
        self.embed_tokens = PPMissingLayer()
    self.start_layer, self.end_layer, self.layers = make_layers(
        config.num_hidden_layers,
        lambda prefix: DeepseekV2DecoderLayer(
            vllm_config, prefix, topk_indices_buffer=topk_indices_buffer
        ),
        prefix=f"{prefix}.layers",
    )

    if get_pp_group().is_last_rank:
        self.norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
    else:
        self.norm = PPMissingLayer()
    self.make_empty_intermediate_tensors = make_empty_intermediate_tensors_factory(
        ["hidden_states", "residual"], config.hidden_size
    )
    self.aux_hidden_state_layers: tuple[int, ...] = ()


DeepseekV2Model.__init__ = __init__
```
This patch completely overwrites the `DeepseekV2Model.__init__` method just to add `self.aux_hidden_state_layers`. This monkey-patching approach is extremely fragile and poses a significant maintenance risk: if the original `__init__` in vllm changes in a future update, this patch will likely break or introduce subtle bugs. A less invasive approach, such as wrapping the original `__init__`, should be used. If wrapping is not feasible, please add a prominent comment warning about the fragility of this patch and the need to keep it synchronized with the upstream vllm implementation.
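For reference, a minimal sketch of the wrapping approach — assuming the patch module runs after vllm is importable; `_orig_init` and `_patched_init` are illustrative names, not part of either codebase:

```python
from vllm.model_executor.models.deepseek_v2 import DeepseekV2Model

# Keep a reference to the upstream constructor instead of re-implementing it.
_orig_init = DeepseekV2Model.__init__


def _patched_init(self, *, vllm_config, prefix: str = ""):
    # Delegate all existing setup to upstream vllm, then add only the
    # eagle3-specific attribute; upstream changes to __init__ are then
    # picked up automatically.
    _orig_init(self, vllm_config=vllm_config, prefix=prefix)
    self.aux_hidden_state_layers: tuple[int, ...] = ()


DeepseekV2Model.__init__ = _patched_init
```

This keeps the patch to a few lines and removes the need to track upstream changes to the constructor body.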
```python
if self.vllm_config.model_config.hf_config.model_type == "deepseek_v3":
    builder = self.runner.attn_groups[0][1].get_metadata_builder()
```
The logic for selecting the metadata builder for `deepseek_v3` uses a hardcoded index `[1]`. This is brittle and not self-documenting: if the structure of `attn_groups` changes, this could lead to silent failures. Please add a comment explaining why index `[1]` is the right choice for `deepseek_v3`. For better maintainability, consider refactoring this logic into a helper function that finds the correct builder based on its properties rather than a fixed index.
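As a sketch of the property-based lookup — the helper name and the exact group layout are assumptions to be adapted to the runner's actual structure:

```python
def _find_metadata_builder(self, predicate):
    # Hypothetical helper: scan all attention groups and return the first
    # metadata builder matching `predicate`, instead of relying on a
    # hardcoded attn_groups[0][1] index that may silently go stale.
    for groups in self.runner.attn_groups:
        for group in groups:
            builder = group.get_metadata_builder()
            if predicate(builder):
                return builder
    raise ValueError("No attention metadata builder matched the predicate")
```

The call site then becomes e.g. `builder = self._find_metadata_builder(lambda b: isinstance(b, AscendAttentionMetadataBuilder))`, which documents the intent directly.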
```python
if isinstance(builder, AscendAttentionMetadataBuilder):
    attn_state = AscendAttentionState.DecodeOnly
```
The branch `if isinstance(builder, AscendAttentionMetadataBuilder): attn_state = AscendAttentionState.DecodeOnly` overrides whatever `attn_state` was computed earlier whenever the builder is an Ascend one. This lacks context and could hide underlying issues or make debugging difficult. Please add a comment explaining why this override is necessary for `AscendAttentionMetadataBuilder` during dummy metadata creation.
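For example, something along these lines — the comment text is illustrative and should state the real constraint, which only the authors know:

```python
if isinstance(builder, AscendAttentionMetadataBuilder):
    # NOTE: during dummy metadata creation the Ascend builder only handles
    # the decode path (<replace with the actual reason>), so the previously
    # derived attn_state is deliberately overridden here.
    attn_state = AscendAttentionState.DecodeOnly
```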
```python
if not self.model_config.use_mla or (self.speculative_config and
        self.speculative_config.method == "eagle3" and
        self.vllm_config.model_config.hf_config.model_type == "deepseek_v3" and
        str(eagle_layer) in kv_cache_tensor.shared_by[0]):
```
This conditional logic is very complex and hard to read, making it difficult to maintain. It combines multiple checks, including string comparisons on the model type and layer name, which are brittle. Please refactor it into a separate helper function with a descriptive name (e.g., `_should_use_full_attention_for_eagle3(...)`) to improve code clarity and maintainability.
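A sketch of such a helper, reusing only names from the snippet above — the method name is a suggestion, not existing API:

```python
def _should_use_full_attention_for_eagle3(self, eagle_layer, kv_cache_tensor) -> bool:
    # True when eagle3 speculative decoding targets a deepseek_v3 model
    # and this kv-cache tensor is shared by the eagle draft layer.
    return (self.speculative_config is not None
            and self.speculative_config.method == "eagle3"
            and self.vllm_config.model_config.hf_config.model_type == "deepseek_v3"
            and str(eagle_layer) in kv_cache_tensor.shared_by[0])
```

The condition then reads `if not self.model_config.use_mla or self._should_use_full_attention_for_eagle3(eagle_layer, kv_cache_tensor):`, which is both shorter and self-documenting.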
```python
if self.speculative_config and self.speculative_config.method == 'eagle3' and str(eagle_layer) in layer_name:
    k_shape = kv_cache_shape
    v_shape = k_shape
```
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
This pull request has conflicts; please resolve them before we can evaluate the pull request.
What this PR does / why we need it?
Does this PR introduce any user-facing change?
How was this patch tested?