Conversation

@zhaozx-cn (Contributor) commented Dec 19, 2025

What this PR does / why we need it?

Does this PR introduce any user-facing change?

How was this patch tested?

Signed-off-by: zhaozx-cn <zhaozx2116@163.com>

Co-authored-by: drslark <slarksblood@qq.com>
Co-authored-by: hwhaokun <haokun0405@163.com>
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request adds support for deepseek eagle3 speculative decoding. The changes involve patching several parts of the codebase to handle the eagle3 method, including model verification, attention metadata building, and model-specific logic for deepseek_v2. My review identifies several areas where the code could be made more robust and maintainable. Specifically, there are multiple instances of complex, hardcoded conditions that are brittle and difficult to understand. There is also a critical issue with monkey-patching a model's __init__ method, which poses a significant maintenance risk.

Comment on lines +70 to +117
def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
    # Zero-argument super() would fail here: this function is defined at
    # module level and assigned onto the class afterwards, so no __class__
    # cell exists. The explicit two-argument form is required.
    super(DeepseekV2Model, self).__init__()

    config = vllm_config.model_config.hf_config
    quant_config = vllm_config.quant_config
    self.config = config
    self.device = current_platform.device_type

    self.vocab_size = config.vocab_size
    self.is_v32 = hasattr(config, "index_topk")
    if self.is_v32:
        topk_tokens = config.index_topk
        topk_indices_buffer = torch.empty(
            vllm_config.scheduler_config.max_num_batched_tokens,
            topk_tokens,
            dtype=torch.int32,
            device=self.device,
        )
    else:
        topk_indices_buffer = None

    if get_pp_group().is_first_rank:
        self.embed_tokens = VocabParallelEmbedding(
            config.vocab_size,
            config.hidden_size,
            quant_config=quant_config,
            prefix=f"{prefix}.embed_tokens",
        )
    else:
        self.embed_tokens = PPMissingLayer()
    self.start_layer, self.end_layer, self.layers = make_layers(
        config.num_hidden_layers,
        lambda prefix: DeepseekV2DecoderLayer(
            vllm_config, prefix, topk_indices_buffer=topk_indices_buffer
        ),
        prefix=f"{prefix}.layers",
    )

    if get_pp_group().is_last_rank:
        self.norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
    else:
        self.norm = PPMissingLayer()
    self.make_empty_intermediate_tensors = make_empty_intermediate_tensors_factory(
        ["hidden_states", "residual"], config.hidden_size
    )
    # Added for eagle3: decoder layers whose hidden states are captured
    # as auxiliary inputs for the draft model.
    self.aux_hidden_state_layers: tuple[int, ...] = ()

DeepseekV2Model.__init__ = __init__

critical

This patch completely overwrites the DeepseekV2Model.__init__ method just to add self.aux_hidden_state_layers. This monkey-patching approach is extremely fragile and poses a significant maintenance risk. If the original __init__ method in vllm changes in a future update, this patch will likely break or introduce subtle bugs. A less invasive approach should be used. If wrapping the original __init__ is not feasible, please add a prominent comment warning about the fragility of this patch and the necessity of keeping it synchronized with the upstream vllm implementation.
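A less invasive pattern keeps the upstream constructor intact and layers the new attribute on top of it. A minimal sketch, assuming the patch module can import both symbols (_orig_init and _patched_init are illustrative names, not part of the PR):

from vllm.config import VllmConfig
from vllm.model_executor.models.deepseek_v2 import DeepseekV2Model

# Keep a reference to the unmodified upstream constructor.
_orig_init = DeepseekV2Model.__init__

def _patched_init(self, *, vllm_config: VllmConfig, prefix: str = ""):
    # Run the upstream __init__ untouched, then add only the
    # eagle3-specific attribute; upstream changes to __init__ are then
    # picked up automatically instead of silently diverging.
    _orig_init(self, vllm_config=vllm_config, prefix=prefix)
    self.aux_hidden_state_layers: tuple[int, ...] = ()

DeepseekV2Model.__init__ = _patched_init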

Comment on lines +352 to +353
if self.vllm_config.model_config.hf_config.model_type == "deepseek_v3":
    builder = self.runner.attn_groups[0][1].get_metadata_builder()

high

The logic for selecting the metadata builder for deepseek_v3 uses a hardcoded index [1]. This is brittle and not self-documenting. If the structure of attn_groups changes, this could lead to silent failures. Please add a comment explaining why index [1] is chosen for deepseek_v3. For better maintainability, consider refactoring this logic into a helper function that finds the correct builder based on its properties rather than a fixed index.
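One possible shape for such a helper, selecting the builder by type instead of by position (a sketch only; it assumes each element of attn_groups[0] exposes get_metadata_builder() as in the snippet above, and that AscendAttentionMetadataBuilder is the right discriminator for the eagle3 draft layers):

def _get_eagle3_metadata_builder(self):
    # Hypothetical helper: locate the metadata builder for the eagle3
    # draft layers by its type rather than by the hardcoded [0][1] index,
    # so a reordering of attn_groups fails loudly instead of silently.
    for group in self.runner.attn_groups[0]:
        builder = group.get_metadata_builder()
        if isinstance(builder, AscendAttentionMetadataBuilder):
            return builder
    raise RuntimeError(
        "No AscendAttentionMetadataBuilder found for eagle3 draft layers")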

Comment on lines +1918 to +1919
if isinstance(builder, AscendAttentionMetadataBuilder):
    attn_state = AscendAttentionState.DecodeOnly

high

This branch overrides attn_state with AscendAttentionState.DecodeOnly whenever the builder is an AscendAttentionMetadataBuilder. The override lacks context and could hide underlying issues or make debugging difficult. Please add a comment explaining why it is necessary for AscendAttentionMetadataBuilder during dummy metadata creation.
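For example, something along these lines (the rationale in the comment is a guess at the author's intent, not confirmed by the PR):

if isinstance(builder, AscendAttentionMetadataBuilder):
    # NOTE(assumed rationale): dummy metadata for the eagle3 draft step is
    # only consumed on the decode path, so pin the decode-only state here.
    attn_state = AscendAttentionState.DecodeOnly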

Comment on lines +2382 to +2385
if not self.model_config.use_mla or (self.speculative_config and
        self.speculative_config.method == "eagle3" and
        self.vllm_config.model_config.hf_config.model_type == "deepseek_v3" and
        str(eagle_layer) in kv_cache_tensor.shared_by[0]):

high

This conditional logic is very complex and hard to read, making it difficult to maintain. It combines multiple checks, including string comparisons for model type and layer names, which are brittle. Please refactor this logic into a separate helper function with a descriptive name (e.g., _should_use_full_attention_for_eagle3(...)) to improve code clarity and maintainability.
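A sketch of the suggested extraction (the helper name comes from the suggestion above; the parameter list is an assumption, since eagle_layer is a free variable in the snippet):

def _should_use_full_attention_for_eagle3(self, eagle_layer, kv_cache_tensor) -> bool:
    # Hypothetical helper: the deepseek_v3 eagle3 draft layer keeps a
    # full-attention KV cache even when the target model itself uses MLA.
    return (
        self.speculative_config is not None
        and self.speculative_config.method == "eagle3"
        and self.vllm_config.model_config.hf_config.model_type == "deepseek_v3"
        and str(eagle_layer) in kv_cache_tensor.shared_by[0]
    )

The call site then reduces to:

if not self.model_config.use_mla or self._should_use_full_attention_for_eagle3(
        eagle_layer, kv_cache_tensor):
    ...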

Comment on lines +2534 to +2536
if self.speculative_config and self.speculative_config.method == 'eagle3' and str(eagle_layer) in layer_name:
    k_shape = kv_cache_shape
    v_shape = k_shape

high

Similar to a previous comment, this conditional logic for eagle3 is very specific and uses a brittle string check (str(eagle_layer) in layer_name). This makes the code hard to understand and maintain. Please encapsulate this logic in a helper function to improve readability and robustness.
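For instance (a minimal sketch; _is_eagle3_draft_layer is a made-up name, and eagle_layer is passed in explicitly because it is a free variable in the snippet):

def _is_eagle3_draft_layer(self, eagle_layer, layer_name: str) -> bool:
    # Hypothetical helper: keeps the brittle substring check in a single
    # place so it can later be hardened, e.g. into an exact-name match.
    return (
        self.speculative_config is not None
        and self.speculative_config.method == "eagle3"
        and str(eagle_layer) in layer_name
    )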

@zhaozx-cn changed the title from "add deepseek eagle3" to "[Feat] add deepseek eagle3" on Dec 19, 2025
@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message and fill in the PR description to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.
