Add chatbot for fastertransformer #2
Merged
lvhan028 merged 1 commit into InternLM:main from lvhan028:add-chatbot on Jun 18, 2023
Conversation
grimoire referenced this pull request in grimoire/lmdeploy on Jan 3, 2024: add internlm2-chat-7b chat template
lvhan028 pushed a commit that referenced this pull request on Aug 30, 2024:
* support ascend using infer_ext
* fix(ascend): make infer_ext using TND format q,k,v in paged_token_attention
* support ascend using infer_ext
* feat: support ascend moe_gating_topk_softmax
* feat: change infer_ext ops function param order (#2)
* ascend: align attention mask to 32 bytes (#7)
* fix attn args (#9)
* fix: expand shape of attn_mask (#10)
* feat: update infer_ext ops interface (#13)
* rename infer_ext to dlinfer
* format code
* Support internlm 2.5 (#14)
* refactor ascend pagedattention
* fix ascend apply_rotary_pos_emb
* fix import dlinfer (#16)
* fix: fix rms_norm params (#18)
* fix sync on ascend

Co-authored-by: chenchiyu <chenchiyu@pjlab.org.cn>
Co-authored-by: CyCle1024 <ccy_justin@163.com>
Co-authored-by: Wei Tao <1136862851@qq.com>
Co-authored-by: jinminxi104 <jinminxi104@hotmail.com>
Co-authored-by: pdx1989 <pdx1989@gmail.com>
roy-shih pushed a commit to roy-shih/lmdeploy that referenced this pull request on Nov 24, 2025:
This commit implements 4 high-priority kernels to bridge the gap between
TurboMind CUDA and PyTorch Triton, enabling cross-platform deployment:
1. GELU and Mul kernel (activation.py)
- Fused GELU activation + elementwise multiply
- Follows TurboMind's GELU formula: x * 0.5 * (1 + tanh(sqrt(2/π) * (x + 0.044715 * x^3)))
- Auto-tuned for different vocab sizes
- Estimated speedup: 1.2-1.5x vs unfused PyTorch
2. Top-K Sampling kernel (topk_sampling.py)
- High-performance top-k sampling with softmax normalization
- Iterative max-finding approach optimized for Triton
- Includes topk_filter for logits filtering
- Reference PyTorch implementation for testing
- Critical for inference quality
3. Top-P (Nucleus) Sampling kernel (topp_sampling.py)
- Nucleus sampling with cumulative probability threshold
- Greedy nucleus selection for Triton efficiency
- Fused softmax + cumsum + sampling
- topp_filter for pre-sampling logits filtering
- Reference implementation included
4. Embedding Lookup + Position Encoding kernel (embedding_lookup.py)
- Fused embedding lookup + position encoding
- Three variants:
* embedding_lookup: Basic lookup
* embedding_lookup_pos_encoding: Fused lookup + pos encoding + scaling
* add_position_encoding: Add pos encoding to existing embeddings
- Auto-tuned for different hidden dimensions
- Memory bandwidth optimized with vectorized loads
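
The actual Triton kernels are not shown on this page; as a minimal pure-Python reference for item 1, the tanh-approximation GELU formula quoted above, fused with the elementwise multiply, can be sketched as follows (function names `gelu_tanh` and `gelu_and_mul` are illustrative, not the names in activation.py):

```python
import math

def gelu_tanh(x):
    # Tanh-approximation GELU, per the formula quoted in the commit:
    # x * 0.5 * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    return x * 0.5 * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def gelu_and_mul(gate, up):
    # Fused pattern: GELU on the gate half, then elementwise multiply
    # by the up half (one pass, avoiding an intermediate tensor).
    return [gelu_tanh(g) * u for g, u in zip(gate, up)]
```

The fusion avoids materializing the activated tensor before the multiply, which is where the estimated 1.2-1.5x speedup over unfused PyTorch would come from.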
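
For item 2, the commit mentions a `topk_filter` for logits filtering plus a reference PyTorch implementation; a hedged pure-Python sketch of the same idea (filter to the k largest logits, softmax, then sample) might look like this, where the tie-handling and helper names are assumptions:

```python
import math, random

def topk_filter(logits, k):
    # Keep the k largest logits; mask the rest to -inf so softmax
    # assigns them zero probability. Ties at the k-th value may keep
    # slightly more than k entries in this simple sketch.
    kth = sorted(logits, reverse=True)[k - 1]
    return [x if x >= kth else float("-inf") for x in logits]

def softmax(logits):
    m = max(x for x in logits if x != float("-inf"))
    exps = [math.exp(x - m) if x != float("-inf") else 0.0 for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def topk_sample(logits, k, rng=random.random):
    # Sample one token index from the top-k renormalized distribution.
    probs = softmax(topk_filter(logits, k))
    r, acc = rng(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1
```

The Triton version replaces the sort with iterative max-finding, which parallelizes better on GPU than a full sort of the vocabulary.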
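
For item 3, nucleus (top-p) filtering keeps the smallest set of highest-probability tokens whose cumulative probability reaches the threshold p. A minimal reference sketch of the `topp_filter` idea described above (helper structure is an assumption, not the topp_sampling.py code):

```python
import math

def _softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def topp_filter(logits, p):
    # Nucleus filter: walk tokens in descending probability, keep them
    # until cumulative probability reaches p, mask the rest to -inf.
    probs = _softmax(logits)
    order = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)
    keep, acc = set(), 0.0
    for i in order:
        keep.add(i)
        acc += probs[i]
        if acc >= p:
            break
    return [logits[i] if i in keep else float("-inf") for i in range(len(logits))]
```

Fusing softmax, cumulative sum, and sampling into one kernel, as the commit describes, removes two full passes over the vocabulary-sized probability vector.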
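
For item 4, the fused `embedding_lookup_pos_encoding` variant combines table lookup, scaling, and position-encoding addition in one pass. A pure-Python sketch of that contract (the scale-then-add ordering is an assumption; the real kernel operates on flat GPU tensors with vectorized loads):

```python
def embedding_lookup_pos_encoding(table, ids, pos_table, scale=1.0):
    # Fused: out[t] = table[ids[t]] * scale + pos_table[t]
    # table:     vocab_size x hidden  (token embeddings)
    # ids:       token ids for one sequence
    # pos_table: seq_len x hidden     (position encodings)
    out = []
    for t, tok in enumerate(ids):
        out.append([scale * e + p for e, p in zip(table[tok], pos_table[t])])
    return out
```

Doing the lookup, scale, and add in one kernel keeps this memory-bound operation to a single read of each embedding row, which is why the target is ≥90% of the TurboMind CUDA kernel.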
Additionally:
- test_gelu_kernel.py: Comprehensive correctness and performance tests
These kernels address critical gaps identified in KERNEL_MIGRATION_CHECKLIST.md:
- Sampling: PyTorch backend had only multinomial, now has Top-K/Top-P
- Activation: Extended from SiLU to include GELU
- Embedding: Enables fused prefill operations
Performance targets (vs TurboMind CUDA):
- GELU and Mul: ≥95% (simple elementwise)
- Embedding Lookup: ≥90% (memory-bound)
- Top-K/Top-P Sampling: ≥85% (compute-bound)
All kernels support:
- FP16/BF16/FP32 precision
- Auto-tuning for optimal performance
- Cross-platform (CUDA/ROCm/Intel XPU via Triton)
Resolves tasks from KERNEL_TODO_QUICK_REF.md:
- Task InternLM#8: GELU and Mul ✅
- Task InternLM#1: Top-K Sampling ✅
- Task InternLM#2: Top-P Sampling ✅
- Task InternLM#10: Embedding + Pos Encoding ✅
Next steps:
- Performance benchmarking on GPU
- Integration tests with lmdeploy models
- KV Cache quantization kernels (INT4/INT8)
lvhan028 pushed a commit that referenced this pull request on Mar 24, 2026:
* fix: make ruff happy
* fix: autofix by ruff
* fix: manual fix for ruff
* fix: fix wrong modification of ruff
* fix: fix typo according to copilot suggestions
* Fix docstrings: replace old-style typing constructs with modern Python 3.10+ equivalents (#2)
* Fix docstrings to Google style format aligned with type hints
* Fix docformatter lint: remove extra blank line before closing triple-quote in api.py
* Replace old-style Dict/List/Optional type references in docstrings
* Fix docstrings to conform with Google Style and use modern Python types
* Fix remaining outdated type hints in docstrings (List/Dict/Tuple/Union/Optional)
* Fix invalid dict() examples in vl/engine.py docstrings - use dict literals with code blocks

Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: windreamer <572167+windreamer@users.noreply.github.com>
No description provided.