ModelTC · sufubao · Jan 6, 2026 · Jan 9, 2026 · Jan 9, 2026 · Jan 10, 2026
diff --git a/.gitignore b/.gitignore
@@ -6,3 +6,4 @@ dist
 .idea
 .vscode
 tmp/
+requirements-musa.txt
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -10,4 +10,4 @@ repos:
     rev: 6.1.0 
     hooks:
       - id: flake8
-        args: ['--max-line-length=120', '--ignore=TYP001, E722, C901, E203, E266, E402, E302, E241, E902, E731, F403, E701, F405, F401, W292, W293, W503, W606, E231']
+        args: ['--max-line-length=120', '--ignore=TYP001, E722, C901, E203, E266, E402, E302, E241, E902, E731, F403, E701, F405, F401, W292, W293, W503, W606, E231, F541']
diff --git a/docs/CN/source/getting_started/installation.rst b/docs/CN/source/getting_started/installation.rst
@@ -27,7 +27,7 @@ Lightllm 是一个纯python开发的推理框架，其中的算子使用triton
     $ # 前请确保你的docker设置中已经分配了足够的共享内存，否则可能导致
     $ # 服务无法正常启动。
     $ # 1.如果是纯文本服务，建议分配2GB以上的共享内存, 如果你的内存充足，建议分配16GB以上的共享内存.
-    $ # 2.如果是多模态服务，建议分配16GB以上的共享内存，具体可以根据实际情况进行调整. 
+    $ # 2.如果是多模态服务，建议分配16GB以上的共享内存，具体可以根据实际情况进行调整.
     $ # 如果你没有足够的共享内存，可以尝试在启动服务的时候调低 --running_max_req_size 参数，这会降低
     $ # 服务的并发请求数量，但可以减少共享内存的占用。如果是多模态服务，也可以通过降低 --cache_capacity
     $ # 参数来减少共享内存的占用。
@@ -38,7 +38,7 @@ Lightllm 是一个纯python开发的推理框架，其中的算子使用triton
 你也可以使用源码手动构建镜像并运行,建议手动构建镜像,因为更新比较频繁：
 
 .. code-block:: console
-    
+
     $ # 进入代码仓库的根目录
     $ cd /lightllm
     $ # 手动构建镜像, docker 目录下有不同功能场景的镜像构建文件，按需构建。
@@ -52,7 +52,7 @@ Lightllm 是一个纯python开发的推理框架，其中的算子使用triton
 或者你也可以直接使用脚本一键启动镜像并且运行：
 
 .. code-block:: console
-    
+
     $ # 查看脚本参数
     $ python tools/quick_launch_docker.py --help
 
@@ -80,6 +80,10 @@ Lightllm 是一个纯python开发的推理框架，其中的算子使用triton
     $ # 安装lightllm的依赖 (cuda 12.4)
     $ pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu124
     $
+    $ # 安装lightllm的依赖 (摩尔线程 GPU)
+    $ ./generate_requirements_musa.sh
+    $ pip install -r requirements-musa.txt
+    $
     $ # 安装lightllm
     $ python setup.py install
 
@@ -97,6 +101,6 @@ Lightllm 是一个纯python开发的推理框架，其中的算子使用triton
     .. code-block:: console
 
         $ pip install -U --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/Triton-Nightly/pypi/simple/ triton-nightly --no-deps
-    
+
     具体原因可以参考：`issue <https://github.com/triton-lang/triton/issues/3619>`_ 和 `fix PR <https://github.com/triton-lang/triton/pull/3638>`_
 
diff --git a/docs/CN/source/tutorial/api_server_args_zh.rst b/docs/CN/source/tutorial/api_server_args_zh.rst
@@ -183,22 +183,6 @@ PD 分离模式参数
     设置为 True 时，--nccl_host 必须等于 config_server_host，--nccl_port 对于 config_server 必须是唯一的，
     不要为不同的推理节点使用相同的 nccl_port，这将是严重错误
 
-attention类型选择参数
----------------------
-
-.. option:: --mode
-
-    模型推理模式，可以指定多个值：
-
-    * ``triton_int8kv``: 使用 int8 存储 kv cache，可增加 token 容量，使用 triton kernel
-    * ``ppl_int8kv``: 使用 int8 存储 kv cache，使用 ppl 快速 kernel
-    * ``ppl_fp16``: 使用 ppl 快速 fp16 解码注意力 kernel
-    * ``triton_flashdecoding``: 用于长上下文的 flashdecoding 模式，当前支持 llama llama2 qwen
-    * ``triton_gqa_attention``: 使用 GQA 的模型的快速 kernel
-    * ``triton_gqa_flashdecoding``: 使用 GQA 的模型的快速 flashdecoding kernel
-    * ``triton_fp8kv``: 使用 float8 存储 kv cache，目前仅用于 deepseek2
-
-    需要阅读源代码以确认所有模型支持的具体模式
 
 调度参数
 --------
@@ -327,17 +311,9 @@ attention类型选择参数
 
     推理后端将为解码使用微批次重叠模式
 
-.. option:: --enable_flashinfer_prefill
-
-    推理后端将为预填充使用 flashinfer 的注意力 kernel
-
-.. option:: --enable_flashinfer_decode
-
-    推理后端将为解码使用 flashinfer 的注意力 kernel
-
-.. option:: --enable_fa3
+.. option:: --llm_kv_type
 
-    推理后端将为预填充和解码使用 fa3 注意力 kernel
+    推理后端使用什么类型的数据存储kv cache, 可选值为 "None", "int8kv", "int4kv", "fp8kv"
 
 .. option:: --disable_cudagraph
 

diff --git a/docs/CN/source/tutorial/deepseek_deployment.rst b/docs/CN/source/tutorial/deepseek_deployment.rst
@@ -33,12 +33,14 @@ LightLLM 支持以下几种部署模式：
     LOADWORKER=18 python -m lightllm.server.api_server --port 8088 \
     --model_dir /path/DeepSeek-R1 \
     --tp 8 \
-    --enable_fa3
+    --llm_prefill_att_backend fa3 \
+    --llm_decode_att_backend fa3
 
 **参数说明:**
 - `LOADWORKER=18`: 模型加载线程数，提高加载速度
 - `--tp 8`: 张量并行度，使用8个GPU
-- `--enable_fa3`: 启用 Flash Attention 3.0
+- `--llm_prefill_att_backend fa3`: 启用 Flash Attention 3.0
+- `--llm_decode_att_backend fa3`: 启用 Flash Attention 3.0
 - `--port 8088`: 服务端口
 
 1.2 单机 DP + EP 模式 (Data Parallel + Expert Parallel)
@@ -55,13 +57,15 @@ LightLLM 支持以下几种部署模式：
     --model_dir /path/DeepSeek-R1 \
     --tp 8 \
     --dp 8 \
-    --enable_fa3
+    --llm_prefill_att_backend fa3 \
+    --llm_decode_att_backend fa3
 
 **参数说明:**
 - `MOE_MODE=EP`: 设置专家并行模式
 - `--tp 8`: 张量并行度
 - `--dp 8`: 数据并行度，通常设置为与 tp 相同的值
-- `--enable_fa3`: 启用 Flash Attention 3.0
+- `--llm_prefill_att_backend fa3`: 启用 Flash Attention 3.0
+- `--llm_decode_att_backend fa3`: 启用 Flash Attention 3.0
 
 **可选优化参数:**
 - `--enable_prefill_microbatch_overlap`: 启用预填充微批次重叠
@@ -85,7 +89,8 @@ LightLLM 支持以下几种部署模式：
     LOADWORKER=18 python -m lightllm.server.api_server --port 8088 \
     --model_dir /path/DeepSeek-R1 \
     --tp 16 \
-    --enable_fa3 \
+    --llm_prefill_att_backend fa3 \
+    --llm_decode_att_backend fa3 \
     --nnodes 2 \
     --node_rank 0 \
     --nccl_host $nccl_host \
@@ -101,7 +106,8 @@ LightLLM 支持以下几种部署模式：
     LOADWORKER=18 python -m lightllm.server.api_server --port 8088 \
     --model_dir /path/DeepSeek-R1 \
     --tp 16 \
-    --enable_fa3 \
+    --llm_prefill_att_backend fa3 \
+    --llm_decode_att_backend fa3 \
     --nnodes 2 \
     --node_rank 1 \
     --nccl_host $nccl_host \
@@ -129,7 +135,8 @@ LightLLM 支持以下几种部署模式：
     --model_dir /path/DeepSeek-R1 \
     --tp 16 \
     --dp 16 \
-    --enable_fa3 \
+    --llm_prefill_att_backend fa3 \
+    --llm_decode_att_backend fa3 \
     --nnodes 2 \
     --node_rank 0 \
     --nccl_host $nccl_host \
@@ -146,7 +153,8 @@ LightLLM 支持以下几种部署模式：
     --model_dir /path/DeepSeek-R1 \
     --tp 16 \
     --dp 16 \
-    --enable_fa3 \
+    --llm_prefill_att_backend fa3 \
+    --llm_decode_att_backend fa3 \
     --nnodes 2 \
     --node_rank 1 \
     --nccl_host $nccl_host \
@@ -195,7 +203,8 @@ PD (Prefill-Decode) 分离模式将预填充和解码阶段分离部署，可以
     --host $host \
     --port 8019 \
     --nccl_port 2732 \
-    --enable_fa3 \
+    --llm_prefill_att_backend fa3 \
+    --llm_decode_att_backend fa3 \
     --disable_cudagraph \
     --pd_master_ip $pd_master_ip \
     --pd_master_port 60011
@@ -219,7 +228,8 @@ PD (Prefill-Decode) 分离模式将预填充和解码阶段分离部署，可以
     --host $host \
     --port 8121 \
     --nccl_port 12322 \
-    --enable_fa3 \
+    --llm_prefill_att_backend fa3 \
+    --llm_decode_att_backend fa3 \
     --disable_cudagraph \
     --pd_master_ip $pd_master_ip \
     --pd_master_port 60011
@@ -287,7 +297,8 @@ PD (Prefill-Decode) 分离模式将预填充和解码阶段分离部署，可以
     --tp 8 \
     --dp 8 \
     --nccl_port 2732 \
-    --enable_fa3 \
+    --llm_prefill_att_backend fa3 \
+    --llm_decode_att_backend fa3 \
     --disable_cudagraph \
     --config_server_host $config_server_host \
     --config_server_port 60088
@@ -306,7 +317,8 @@ PD (Prefill-Decode) 分离模式将预填充和解码阶段分离部署，可以
     --nccl_port 12322 \
     --tp 8 \
     --dp 8 \
-    --enable_fa3 \
+    --llm_prefill_att_backend fa3 \
+    --llm_decode_att_backend fa3 \
     --config_server_host $config_server_host \
     --config_server_port 60088
     # 如果需要启用微批次重叠，可以取消注释以下行

diff --git a/docs/CN/source/tutorial/multi_level_cache_deployment.rst b/docs/CN/source/tutorial/multi_level_cache_deployment.rst
@@ -66,7 +66,8 @@ LightLLM 的多级缓存系统采用分层设计:
         --model_dir /path/to/Qwen3-235B-A22B \
         --tp 8 \
         --graph_max_batch_size 500 \
-        --enable_fa3 \
+        --llm_prefill_att_backend fa3 \
+        --llm_decode_att_backend fa3 \
         --mem_fraction 0.88 \
         --enable_cpu_cache \
         --cpu_cache_storage_size 400 \
@@ -81,7 +82,7 @@ LightLLM 的多级缓存系统采用分层设计:
 - ``--model_dir``: 模型文件路径,支持本地路径或 HuggingFace 模型名称
 - ``--tp 8``: 张量并行度,使用 8 个 GPU 进行模型推理
 - ``--graph_max_batch_size 500``: CUDA Graph 最大批次大小,影响吞吐量和显存占用
-- ``--enable_fa3``: 启用 Flash Attention 3.0,提升注意力计算速度，也可以换成flashinfer后端性能更佳
+- ``--llm_prefill_att_backend fa3``: 启用 Flash Attention 3.0,提升注意力计算速度，也可以换成flashinfer后端性能更佳
 - ``--mem_fraction 0.88``: GPU 显存使用比例,建议设置为 0.88及以下
 
 CPU 缓存参数
@@ -130,7 +131,8 @@ CPU 缓存参数
         --model_dir /path/to/Qwen3-235B-A22B \
         --tp 8 \
         --graph_max_batch_size 500 \
-        --enable_fa3 \
+        --llm_prefill_att_backend fa3 \
+        --llm_decode_att_backend fa3 \
         --mem_fraction 0.88 \
         --enable_cpu_cache \
         --cpu_cache_storage_size 400 \

diff --git a/docs/CN/source/tutorial/reasoning_parser.rst b/docs/CN/source/tutorial/reasoning_parser.rst
@@ -32,7 +32,8 @@ DeepSeek-R1
         --model_dir /path/to/DeepSeek-R1 \
         --reasoning_parser deepseek-r1 \
         --tp 8 \
-        --enable_fa3
+        --llm_prefill_att_backend fa3 \
+        --llm_decode_att_backend fa3
 
 DeepSeek-V3
 ~~~~~~~~~~~

diff --git a/docs/EN/source/getting_started/installation.rst b/docs/EN/source/getting_started/installation.rst
@@ -24,16 +24,16 @@ The easiest way to install Lightllm is using the official image. You can directl
     $ docker pull ghcr.io/modeltc/lightllm:main
     $
     $ # Run，The current LightLLM service relies heavily on shared memory.
-    $ # Before starting, please make sure that you have allocated enough shared memory 
+    $ # Before starting, please make sure that you have allocated enough shared memory
     $ # in your Docker settings; otherwise, the service may fail to start properly.
     $ #
-    $ # 1. For text-only services, it is recommended to allocate more than 2GB of shared memory. 
+    $ # 1. For text-only services, it is recommended to allocate more than 2GB of shared memory.
     $ # If your system has sufficient RAM, allocating 16GB or more is recommended.
-    $ # 2.For multimodal services, it is recommended to allocate 16GB or more of shared memory. 
+    $ # 2.For multimodal services, it is recommended to allocate 16GB or more of shared memory.
     $ # You can adjust this value according to your specific requirements.
     $ #
-    $ # If you do not have enough shared memory available, you can try lowering 
-    $ # the --running_max_req_size parameter when starting the service. 
+    $ # If you do not have enough shared memory available, you can try lowering
+    $ # the --running_max_req_size parameter when starting the service.
     $ # This will reduce the number of concurrent requests, but also decrease shared memory usage.
     $ docker run -it --gpus all -p 8080:8080            \
     $   --shm-size 2g -v your_local_path:/data/         \
@@ -42,21 +42,21 @@ The easiest way to install Lightllm is using the official image. You can directl
 You can also manually build the image from source and run it:
 
 .. code-block:: console
-    
+
     $ # move into lightllm root dir
     $ cd /lightllm
     $ # Manually build the image
     $ docker build -t <image_name> -f ./docker/Dockerfile .
     $
-    $ # Run, 
+    $ # Run,
     $ docker run -it --gpus all -p 8080:8080            \
     $   --shm-size 2g -v your_local_path:/data/         \
     $   <image_name> /bin/bash
 
 Or you can directly use the script to launch the image and run it with one click:
 
 .. code-block:: console
-    
+
     $ # View script parameters
     $ python tools/quick_launch_docker.py --help
 
@@ -84,6 +84,10 @@ You can also install Lightllm from source:
     $ # Install Lightllm dependencies (cuda 12.4)
     $ pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu124
     $
+    $ # Install Lightllm dependencies (Moore Threads GPU)
+    $ ./generate_requirements_musa.sh
+    $ pip install -r requirements-musa.txt
+    $
     $ # Install Lightllm
     $ python setup.py install
 
@@ -101,5 +105,5 @@ You can also install Lightllm from source:
     .. code-block:: console
 
         $ pip install -U --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/Triton-Nightly/pypi/simple/ triton-nightly --no-deps
-    
+
     For specific reasons, please refer to: `issue <https://github.com/triton-lang/triton/issues/3619>`_ and `fix PR <https://github.com/triton-lang/triton/pull/3638>`_
diff --git a/docs/EN/source/tutorial/api_server_args_zh.rst b/docs/EN/source/tutorial/api_server_args_zh.rst
@@ -183,23 +183,6 @@ Different Parallel Mode Setting Parameters
     When set to True, --nccl_host must equal config_server_host, --nccl_port must be unique for config_server,
     do not use the same nccl_port for different inference nodes, this will be a serious error
 
-Attention Type Selection Parameters
-------------------------------------
-
-.. option:: --mode
-
-    Model inference mode, can specify multiple values:
-
-    * ``triton_int8kv``: Use int8 to store kv cache, can increase token capacity, uses triton kernel
-    * ``ppl_int8kv``: Use int8 to store kv cache, uses ppl fast kernel
-    * ``ppl_fp16``: Use ppl fast fp16 decode attention kernel
-    * ``triton_flashdecoding``: Flashdecoding mode for long context, currently supports llama llama2 qwen
-    * ``triton_gqa_attention``: Fast kernel for models using GQA
-    * ``triton_gqa_flashdecoding``: Fast flashdecoding kernel for models using GQA
-    * ``triton_fp8kv``: Use float8 to store kv cache, currently only used for deepseek2
-
-    Need to read source code to confirm specific modes supported by all models 
-
 Scheduling Parameters
 ---------------------
 
@@ -325,18 +308,6 @@ Performance Optimization Parameters
 .. option:: --enable_decode_microbatch_overlap
 
     The inference backend will use microbatch overlap mode for decoding
-
-.. option:: --enable_flashinfer_prefill
-
-    The inference backend will use flashinfer's attention kernel for prefill
-
-.. option:: --enable_flashinfer_decode
-
-    The inference backend will use flashinfer's attention kernel for decoding
-
-.. option:: --enable_fa3
-
-    The inference backend will use fa3 attention kernel for prefill and decoding
 
 .. option:: --disable_cudagraph
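
Migration note: every launch command touched by this diff replaces `--enable_fa3` with the pair `--llm_prefill_att_backend fa3` and `--llm_decode_att_backend fa3`. For existing launch scripts, the same substitution can be sketched as below. This is a minimal illustration, not part of LightLLM: `LEGACY_FLAGS`, `migrate_args`, and `migrate_command` are hypothetical names, and only the `--enable_fa3` mapping is included because it is the only replacement the diff shows explicitly. The removed `--enable_flashinfer_prefill`, `--enable_flashinfer_decode`, and `--mode` options are left out, since the diff does not state their replacement values.

```python
import shlex

# Mapping taken from the documentation changes in this diff: a removed flag
# expands to the flags that replace it. Only --enable_fa3 is covered here;
# anything else is passed through unchanged.
LEGACY_FLAGS = {
    "--enable_fa3": [
        "--llm_prefill_att_backend", "fa3",
        "--llm_decode_att_backend", "fa3",
    ],
}


def migrate_args(argv):
    """Return a copy of argv with each legacy flag replaced by its expansion."""
    migrated = []
    for arg in argv:
        # Unknown arguments (including flag values) are kept as they are.
        migrated.extend(LEGACY_FLAGS.get(arg, [arg]))
    return migrated


def migrate_command(command):
    """Rewrite a single-line shell command string."""
    return shlex.join(migrate_args(shlex.split(command)))


if __name__ == "__main__":
    old = (
        "python -m lightllm.server.api_server --port 8088 "
        "--model_dir /path/DeepSeek-R1 --tp 8 --enable_fa3"
    )
    print(migrate_command(old))
```

The helper expects an already-split argument list or a single-line command. Commands written with backslash line continuations, as in the documentation examples, should be joined into one line first.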