Add chatbot for fastertransformer #2
Merged
lvhan028 merged 1 commit into InternLM:main from lvhan028:add-chatbot on Jun 18, 2023
Conversation
grimoire referenced this pull request in grimoire/lmdeploy on Jan 3, 2024: add internlm2-chat-7b chat template
lvhan028 pushed a commit that referenced this pull request on Aug 30, 2024:
* support ascend using infer_ext
* fix(ascend): make infer_ext using TND format q,k,v in paged_token_attention
* support ascend using infer_ext
* feat: support ascend moe_gating_topk_softmax
* feat: change infer_ext ops function param order (#2)
* ascend: align attention mask to 32 bytes (#7)
* fix attn args (#9)
* fix: expand shape of attn_mask (#10)
* feat: update infer_ext ops interface (#13)
* rename infer_ext to dlinfer
* format code
* Support internlm 2.5 (#14)
* refactor ascend pagedattention
* fix ascend apply_rotary_pos_emb
* fix import dlinfer (#16)
* fix: fix rms_norm params (#18)
* fix sync on ascend

Co-authored-by: chenchiyu <chenchiyu@pjlab.org.cn>
Co-authored-by: CyCle1024 <ccy_justin@163.com>
Co-authored-by: Wei Tao <1136862851@qq.com>
Co-authored-by: jinminxi104 <jinminxi104@hotmail.com>
Co-authored-by: pdx1989 <pdx1989@gmail.com>
roy-shih pushed a commit to roy-shih/lmdeploy that referenced this pull request on Nov 24, 2025:
This commit implements 4 high-priority kernels to bridge the gap between
TurboMind CUDA and PyTorch Triton, enabling cross-platform deployment:
1. GELU and Mul kernel (activation.py)
- Fused GELU activation + elementwise multiply
- Follows TurboMind's GELU formula: x * 0.5 * (1 + tanh(sqrt(2/π) * (x + 0.044715 * x^3)))
- Auto-tuned for different vocab sizes
- Estimated speedup: 1.2-1.5x vs unfused PyTorch
2. Top-K Sampling kernel (topk_sampling.py)
- High-performance top-k sampling with softmax normalization
- Iterative max-finding approach optimized for Triton
- Includes topk_filter for logits filtering
- Reference PyTorch implementation for testing
- Critical for inference quality
3. Top-P (Nucleus) Sampling kernel (topp_sampling.py)
- Nucleus sampling with cumulative probability threshold
- Greedy nucleus selection for Triton efficiency
- Fused softmax + cumsum + sampling
- topp_filter for pre-sampling logits filtering
- Reference implementation included
4. Embedding Lookup + Position Encoding kernel (embedding_lookup.py)
- Fused embedding lookup + position encoding
- Three variants:
* embedding_lookup: Basic lookup
* embedding_lookup_pos_encoding: Fused lookup + pos encoding + scaling
* add_position_encoding: Add pos encoding to existing embeddings
- Auto-tuned for different hidden dimensions
- Memory bandwidth optimized with vectorized loads
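
The actual Triton kernels are not shown on this page; as a minimal pure-Python reference for item 1, the tanh-approximation GELU formula quoted above, fused with the elementwise multiply, can be sketched as follows (function names `gelu_tanh` and `gelu_and_mul` are illustrative, not the names in activation.py):

```python
import math

def gelu_tanh(x):
    # Tanh-approximation GELU, per the formula quoted in the commit:
    # x * 0.5 * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    return x * 0.5 * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def gelu_and_mul(gate, up):
    # Fused pattern: GELU on the gate half, then elementwise multiply
    # by the up half (one pass, avoiding an intermediate tensor).
    return [gelu_tanh(g) * u for g, u in zip(gate, up)]
```

The fusion avoids materializing the activated tensor before the multiply, which is where the estimated 1.2-1.5x speedup over unfused PyTorch would come from.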
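
For item 2, the commit mentions a `topk_filter` for logits filtering plus a reference PyTorch implementation; a hedged pure-Python sketch of the same idea (filter to the k largest logits, softmax, then sample) might look like this, where the tie-handling and helper names are assumptions:

```python
import math, random

def topk_filter(logits, k):
    # Keep the k largest logits; mask the rest to -inf so softmax
    # assigns them zero probability. Ties at the k-th value may keep
    # slightly more than k entries in this simple sketch.
    kth = sorted(logits, reverse=True)[k - 1]
    return [x if x >= kth else float("-inf") for x in logits]

def softmax(logits):
    m = max(x for x in logits if x != float("-inf"))
    exps = [math.exp(x - m) if x != float("-inf") else 0.0 for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def topk_sample(logits, k, rng=random.random):
    # Sample one token index from the top-k renormalized distribution.
    probs = softmax(topk_filter(logits, k))
    r, acc = rng(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1
```

The Triton version replaces the sort with iterative max-finding, which parallelizes better on GPU than a full sort of the vocabulary.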
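
For item 3, nucleus (top-p) filtering keeps the smallest set of highest-probability tokens whose cumulative probability reaches the threshold p. A minimal reference sketch of the `topp_filter` idea described above (helper structure is an assumption, not the topp_sampling.py code):

```python
import math

def _softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def topp_filter(logits, p):
    # Nucleus filter: walk tokens in descending probability, keep them
    # until cumulative probability reaches p, mask the rest to -inf.
    probs = _softmax(logits)
    order = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)
    keep, acc = set(), 0.0
    for i in order:
        keep.add(i)
        acc += probs[i]
        if acc >= p:
            break
    return [logits[i] if i in keep else float("-inf") for i in range(len(logits))]
```

Fusing softmax, cumulative sum, and sampling into one kernel, as the commit describes, removes two full passes over the vocabulary-sized probability vector.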
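
For item 4, the fused `embedding_lookup_pos_encoding` variant combines table lookup, scaling, and position-encoding addition in one pass. A pure-Python sketch of that contract (the scale-then-add ordering is an assumption; the real kernel operates on flat GPU tensors with vectorized loads):

```python
def embedding_lookup_pos_encoding(table, ids, pos_table, scale=1.0):
    # Fused: out[t] = table[ids[t]] * scale + pos_table[t]
    # table:     vocab_size x hidden  (token embeddings)
    # ids:       token ids for one sequence
    # pos_table: seq_len x hidden     (position encodings)
    out = []
    for t, tok in enumerate(ids):
        out.append([scale * e + p for e, p in zip(table[tok], pos_table[t])])
    return out
```

Doing the lookup, scale, and add in one kernel keeps this memory-bound operation to a single read of each embedding row, which is why the target is ≥90% of the TurboMind CUDA kernel.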
Additionally:
- test_gelu_kernel.py: Comprehensive correctness and performance tests
These kernels address critical gaps identified in KERNEL_MIGRATION_CHECKLIST.md:
- Sampling: PyTorch backend had only multinomial, now has Top-K/Top-P
- Activation: Extended from SiLU to include GELU
- Embedding: Enables fused prefill operations
Performance targets (vs TurboMind CUDA):
- GELU and Mul: ≥95% (simple elementwise)
- Embedding Lookup: ≥90% (memory-bound)
- Top-K/Top-P Sampling: ≥85% (compute-bound)
All kernels support:
- FP16/BF16/FP32 precision
- Auto-tuning for optimal performance
- Cross-platform (CUDA/ROCm/Intel XPU via Triton)
Resolves tasks from KERNEL_TODO_QUICK_REF.md:
- Task InternLM#8: GELU and Mul ✅
- Task InternLM#1: Top-K Sampling ✅
- Task InternLM#2: Top-P Sampling ✅
- Task InternLM#10: Embedding + Pos Encoding ✅
Next steps:
- Performance benchmarking on GPU
- Integration tests with lmdeploy models
- KV Cache quantization kernels (INT4/INT8)
lvhan028 pushed a commit that referenced this pull request on Mar 24, 2026:
* fix: make ruff happy
* fix: autofix by ruff
* fix: manual fix for ruff
* fix: fix wrong modification of ruff
* fix: fix typo according to copilot suggestions
* Fix docstrings: replace old-style typing constructs with modern Python 3.10+ equivalents (#2)
* Fix docstrings to Google style format aligned with type hints
* Fix docformatter lint: remove extra blank line before closing triple-quote in api.py
* Replace old-style Dict/List/Optional type references in docstrings
* Fix docstrings to conform with Google Style and use modern Python types
* Fix remaining outdated type hints in docstrings (List/Dict/Tuple/Union/Optional)
* Fix invalid dict() examples in vl/engine.py docstrings - use dict literals with code blocks

Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: windreamer <572167+windreamer@users.noreply.github.com>
No description provided.