1 change: 1 addition & 0 deletions .gitignore
@@ -6,3 +6,4 @@ dist
.idea
.vscode
tmp/
requirements-musa.txt
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -10,4 +10,4 @@ repos:
rev: 6.1.0
hooks:
- id: flake8
args: ['--max-line-length=120', '--ignore=TYP001, E722, C901, E203, E266, E402, E302, E241, E902, E731, F403, E701, F405, F401, W292, W293, W503, W606, E231']
args: ['--max-line-length=120', '--ignore=TYP001, E722, C901, E203, E266, E402, E302, E241, E902, E731, F403, E701, F405, F401, W292, W293, W503, W606, E231, F541']
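The only change to the flake8 configuration above is the addition of ``F541`` to the ignore list. In pyflakes, F541 is the "f-string is missing placeholders" warning. A minimal sketch of the pattern this now lets through (both function names are made up for illustration and do not come from the repository):

```python
def flagged_by_f541() -> str:
    # The f prefix is used but the string has no {} placeholders,
    # which is what pyflakes reports as F541.
    return f"loading model weights"


def not_flagged(model_dir: str) -> str:
    # A real placeholder is present, so F541 does not apply.
    return f"loading model weights from {model_dir}"
```

Both functions run identically with or without the warning; ignoring F541 only silences the lint check.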
12 changes: 8 additions & 4 deletions docs/CN/source/getting_started/installation.rst
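@@ -27,7 +27,7 @@ Lightllm is an inference framework developed purely in python, in which the operators use triton
$ # beforehand, make sure enough shared memory has been allocated in your docker settings, otherwise
$ # the service may fail to start properly.
$ # 1. For a text-only service, allocating more than 2GB of shared memory is recommended; if you have plenty of RAM, 16GB or more is recommended.
$ # 2. For a multimodal service, allocating 16GB or more of shared memory is recommended; adjust according to your actual situation.
$ # 2. For a multimodal service, allocating 16GB or more of shared memory is recommended; adjust according to your actual situation.
$ # If you do not have enough shared memory, try lowering the --running_max_req_size parameter when starting the service; this reduces
$ # the number of concurrent requests but also lowers shared memory usage. For a multimodal service, you can also lower the --cache_capacity
$ # parameter to reduce shared memory usage.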
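@@ -38,7 +38,7 @@ Lightllm is an inference framework developed purely in python, in which the operators use triton
You can also build the image manually from source and run it. Building manually is recommended because updates are frequent: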

.. code-block:: console

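$ # Enter the root directory of the code repository
$ cd /lightllm
$ # Build the image manually; the docker directory contains image build files for different scenarios, build as needed.
@@ -52,7 +52,7 @@ Lightllm is an inference framework developed purely in python, in which the operators use triton
Or you can use the script to launch the image and run it with one click: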

.. code-block:: console

$ # View script parameters
$ python tools/quick_launch_docker.py --help

@@ -80,6 +80,10 @@ Lightllm is an inference framework developed purely in python, in which the operators use triton
$ # Install lightllm dependencies (cuda 12.4)
$ pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu124
$
$ # Install lightllm dependencies (Moore Threads GPU)
$ ./generate_requirements_musa.sh
$ pip install -r requirements-musa.txt
$
$ # Install lightllm
$ python setup.py install

@@ -97,6 +101,6 @@ Lightllm is an inference framework developed purely in python, in which the operators use triton
.. code-block:: console

$ pip install -U --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/Triton-Nightly/pypi/simple/ triton-nightly --no-deps

For the specific reasons, see: `issue <https://github.com/triton-lang/triton/issues/3619>`_ and `fix PR <https://github.com/triton-lang/triton/pull/3638>`_

28 changes: 2 additions & 26 deletions docs/CN/source/tutorial/api_server_args_zh.rst
@@ -183,22 +183,6 @@ PD disaggregation mode parameters
When set to True, --nccl_host must equal config_server_host and --nccl_port must be unique for config_server;
do not use the same nccl_port for different inference nodes, as this would be a serious error

Attention Type Selection Parameters
------------------------------------

.. option:: --mode

Model inference mode, can specify multiple values:

* ``triton_int8kv``: Use int8 to store kv cache, can increase token capacity, uses triton kernel
* ``ppl_int8kv``: Use int8 to store kv cache, uses ppl fast kernel
* ``ppl_fp16``: Use ppl fast fp16 decode attention kernel
* ``triton_flashdecoding``: Flashdecoding mode for long context, currently supports llama llama2 qwen
* ``triton_gqa_attention``: Fast kernel for models using GQA
* ``triton_gqa_flashdecoding``: Fast flashdecoding kernel for models using GQA
* ``triton_fp8kv``: Use float8 to store kv cache, currently only used for deepseek2

Read the source code to confirm which modes each model supports

Scheduling Parameters
---------------------
@@ -327,17 +311,9 @@ Attention Type Selection Parameters

The inference backend will use microbatch overlap mode for decoding

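.. option:: --enable_flashinfer_prefill

The inference backend will use flashinfer's attention kernel for prefill

.. option:: --enable_flashinfer_decode

The inference backend will use flashinfer's attention kernel for decoding

.. option:: --enable_fa3
.. option:: --llm_kv_type

The inference backend will use the fa3 attention kernel for prefill and decoding
Data type the inference backend uses to store the kv cache; valid values are "None", "int8kv", "int4kv", "fp8kv"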

.. option:: --disable_cudagraph

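The ``--llm_kv_type`` option added above accepts a fixed set of string values. As a rough sketch of how such an option could be validated, here is a standalone ``argparse`` parser; it is not LightLLM's actual argument parser, and the default of ``"None"`` and the help text are assumptions made for illustration:

```python
import argparse

# Values listed in the option description in the diff above.
KV_TYPES = ["None", "int8kv", "int4kv", "fp8kv"]


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser that only models the --llm_kv_type option."""
    parser = argparse.ArgumentParser(prog="kv_type_demo")
    parser.add_argument(
        "--llm_kv_type",
        type=str,
        choices=KV_TYPES,
        default="None",  # assumed default, not confirmed by the diff
        help="data type used to store the kv cache",
    )
    return parser
```

With ``choices`` set, an unsupported value such as ``fp16`` is rejected at parse time instead of surfacing later during model loading.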
36 changes: 24 additions & 12 deletions docs/CN/source/tutorial/deepseek_deployment.rst
@@ -33,12 +33,14 @@ LightLLM supports the following deployment modes:
LOADWORKER=18 python -m lightllm.server.api_server --port 8088 \
--model_dir /path/DeepSeek-R1 \
--tp 8 \
--enable_fa3
--llm_prefill_att_backend fa3 \
--llm_decode_att_backend fa3

**Parameter description:**
- `LOADWORKER=18`: Number of model-loading threads; speeds up loading
- `--tp 8`: Tensor parallelism degree, using 8 GPUs
- `--enable_fa3`: Enables Flash Attention 3.0
- `--llm_prefill_att_backend fa3`: Enables Flash Attention 3.0
- `--llm_decode_att_backend fa3`: Enables Flash Attention 3.0
- `--port 8088`: Service port

1.2 Single-node DP + EP mode (Data Parallel + Expert Parallel)
@@ -55,13 +57,15 @@ LightLLM supports the following deployment modes:
--model_dir /path/DeepSeek-R1 \
--tp 8 \
--dp 8 \
--enable_fa3
--llm_prefill_att_backend fa3 \
--llm_decode_att_backend fa3

**Parameter description:**
- `MOE_MODE=EP`: Sets expert parallel mode
- `--tp 8`: Tensor parallelism degree
- `--dp 8`: Data parallelism degree, usually set to the same value as tp
- `--enable_fa3`: Enables Flash Attention 3.0
- `--llm_prefill_att_backend fa3`: Enables Flash Attention 3.0
- `--llm_decode_att_backend fa3`: Enables Flash Attention 3.0

**Optional optimization parameters:**
- `--enable_prefill_microbatch_overlap`: Enables prefill microbatch overlap
@@ -85,7 +89,8 @@ LightLLM supports the following deployment modes:
LOADWORKER=18 python -m lightllm.server.api_server --port 8088 \
--model_dir /path/DeepSeek-R1 \
--tp 16 \
--enable_fa3 \
--llm_prefill_att_backend fa3 \
--llm_decode_att_backend fa3 \
--nnodes 2 \
--node_rank 0 \
--nccl_host $nccl_host \
@@ -101,7 +106,8 @@ LightLLM supports the following deployment modes:
LOADWORKER=18 python -m lightllm.server.api_server --port 8088 \
--model_dir /path/DeepSeek-R1 \
--tp 16 \
--enable_fa3 \
--llm_prefill_att_backend fa3 \
--llm_decode_att_backend fa3 \
--nnodes 2 \
--node_rank 1 \
--nccl_host $nccl_host \
@@ -129,7 +135,8 @@ LightLLM supports the following deployment modes:
--model_dir /path/DeepSeek-R1 \
--tp 16 \
--dp 16 \
--enable_fa3 \
--llm_prefill_att_backend fa3 \
--llm_decode_att_backend fa3 \
--nnodes 2 \
--node_rank 0 \
--nccl_host $nccl_host \
@@ -146,7 +153,8 @@ LightLLM supports the following deployment modes:
--model_dir /path/DeepSeek-R1 \
--tp 16 \
--dp 16 \
--enable_fa3 \
--llm_prefill_att_backend fa3 \
--llm_decode_att_backend fa3 \
--nnodes 2 \
--node_rank 1 \
--nccl_host $nccl_host \
@@ -195,7 +203,8 @@ PD (Prefill-Decode) disaggregation mode deploys the prefill and decode stages separately, which can
--host $host \
--port 8019 \
--nccl_port 2732 \
--enable_fa3 \
--llm_prefill_att_backend fa3 \
--llm_decode_att_backend fa3 \
--disable_cudagraph \
--pd_master_ip $pd_master_ip \
--pd_master_port 60011
@@ -219,7 +228,8 @@ PD (Prefill-Decode) disaggregation mode deploys the prefill and decode stages separately, which can
--host $host \
--port 8121 \
--nccl_port 12322 \
--enable_fa3 \
--llm_prefill_att_backend fa3 \
--llm_decode_att_backend fa3 \
--disable_cudagraph \
--pd_master_ip $pd_master_ip \
--pd_master_port 60011
@@ -287,7 +297,8 @@ PD (Prefill-Decode) disaggregation mode deploys the prefill and decode stages separately, which can
--tp 8 \
--dp 8 \
--nccl_port 2732 \
--enable_fa3 \
--llm_prefill_att_backend fa3 \
--llm_decode_att_backend fa3 \
--disable_cudagraph \
--config_server_host $config_server_host \
--config_server_port 60088
@@ -306,7 +317,8 @@ PD (Prefill-Decode) disaggregation mode deploys the prefill and decode stages separately, which can
--nccl_port 12322 \
--tp 8 \
--dp 8 \
--enable_fa3 \
--llm_prefill_att_backend fa3 \
--llm_decode_att_backend fa3 \
--config_server_host $config_server_host \
--config_server_port 60088
# To enable microbatch overlap, uncomment the following lines
8 changes: 5 additions & 3 deletions docs/CN/source/tutorial/multi_level_cache_deployment.rst
@@ -66,7 +66,8 @@ LightLLM's multi-level cache system uses a layered design:
--model_dir /path/to/Qwen3-235B-A22B \
--tp 8 \
--graph_max_batch_size 500 \
--enable_fa3 \
--llm_prefill_att_backend fa3 \
--llm_decode_att_backend fa3 \
--mem_fraction 0.88 \
--enable_cpu_cache \
--cpu_cache_storage_size 400 \
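@@ -81,7 +82,7 @@ LightLLM's multi-level cache system uses a layered design:
- ``--model_dir``: Path to the model files; supports a local path or a HuggingFace model name
- ``--tp 8``: Tensor parallelism degree, using 8 GPUs for model inference
- ``--graph_max_batch_size 500``: Maximum CUDA Graph batch size; affects throughput and GPU memory usage
- ``--enable_fa3``: Enables Flash Attention 3.0 to speed up attention computation; switching to the flashinfer backend can give better performance
- ``--llm_prefill_att_backend fa3``: Enables Flash Attention 3.0 to speed up attention computation; switching to the flashinfer backend can give better performance
- ``--mem_fraction 0.88``: Fraction of GPU memory to use; 0.88 or lower is recommended

CPU cache parameters
@@ -130,7 +131,8 @@ CPU cache parameters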
--model_dir /path/to/Qwen3-235B-A22B \
--tp 8 \
--graph_max_batch_size 500 \
--enable_fa3 \
--llm_prefill_att_backend fa3 \
--llm_decode_att_backend fa3 \
--mem_fraction 0.88 \
--enable_cpu_cache \
--cpu_cache_storage_size 400 \
3 changes: 2 additions & 1 deletion docs/CN/source/tutorial/reasoning_parser.rst
@@ -32,7 +32,8 @@ DeepSeek-R1
--model_dir /path/to/DeepSeek-R1 \
--reasoning_parser deepseek-r1 \
--tp 8 \
--enable_fa3
--llm_prefill_att_backend fa3 \
--llm_decode_att_backend fa3

DeepSeek-V3
~~~~~~~~~~~
22 changes: 13 additions & 9 deletions docs/EN/source/getting_started/installation.rst
@@ -24,16 +24,16 @@ The easiest way to install Lightllm is using the official image. You can directl
$ docker pull ghcr.io/modeltc/lightllm:main
$
$ # Run. The current LightLLM service relies heavily on shared memory.
$ # Before starting, please make sure that you have allocated enough shared memory
$ # Before starting, please make sure that you have allocated enough shared memory
$ # in your Docker settings; otherwise, the service may fail to start properly.
$ #
$ # 1. For text-only services, it is recommended to allocate more than 2GB of shared memory.
$ # 1. For text-only services, it is recommended to allocate more than 2GB of shared memory.
$ # If your system has sufficient RAM, allocating 16GB or more is recommended.
$ # 2. For multimodal services, it is recommended to allocate 16GB or more of shared memory.
$ # 2. For multimodal services, it is recommended to allocate 16GB or more of shared memory.
$ # You can adjust this value according to your specific requirements.
$ #
$ # If you do not have enough shared memory available, you can try lowering
$ # the --running_max_req_size parameter when starting the service.
$ # If you do not have enough shared memory available, you can try lowering
$ # the --running_max_req_size parameter when starting the service.
$ # This will reduce the number of concurrent requests, but also decrease shared memory usage.
$ docker run -it --gpus all -p 8080:8080 \
$ --shm-size 2g -v your_local_path:/data/ \
@@ -42,21 +42,21 @@ The easiest way to install Lightllm is using the official image. You can directl
You can also manually build the image from source and run it:

.. code-block:: console

$ # move into lightllm root dir
$ cd /lightllm
$ # Manually build the image
$ docker build -t <image_name> -f ./docker/Dockerfile .
$
$ # Run
$ # Run
$ docker run -it --gpus all -p 8080:8080 \
$ --shm-size 2g -v your_local_path:/data/ \
$ <image_name> /bin/bash

Or you can directly use the script to launch the image and run it with one click:

.. code-block:: console

$ # View script parameters
$ python tools/quick_launch_docker.py --help

@@ -84,6 +84,10 @@ You can also install Lightllm from source:
$ # Install Lightllm dependencies (cuda 12.4)
$ pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu124
$
$ # Install Lightllm dependencies (Moore Threads GPU)
$ ./generate_requirements_musa.sh
$ pip install -r requirements-musa.txt
$
$ # Install Lightllm
$ python setup.py install

@@ -101,5 +105,5 @@ You can also install Lightllm from source:
.. code-block:: console

$ pip install -U --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/Triton-Nightly/pypi/simple/ triton-nightly --no-deps

For specific reasons, please refer to: `issue <https://github.com/triton-lang/triton/issues/3619>`_ and `fix PR <https://github.com/triton-lang/triton/pull/3638>`_
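The installation notes above recommend more than 2GB of shared memory for text-only services and 16GB or more for multimodal services. The following is a small sketch of a pre-flight check against those numbers. The function names, the ``/dev/shm`` mount point, and the exact thresholds (2 GiB and 16 GiB, with exactly 2 GiB accepted to match the ``--shm-size 2g`` example) are assumptions for illustration, not part of LightLLM:

```python
import shutil

GIB = 1024 ** 3


def shm_recommendation(total_bytes: int, multimodal: bool = False) -> str:
    """Compare a shared-memory size with the sizes suggested in the install notes."""
    minimum = 16 * GIB if multimodal else 2 * GIB
    if total_bytes < minimum:
        return "too small: raise --shm-size or lower --running_max_req_size"
    if total_bytes < 16 * GIB:
        return "ok for text-only; 16GB or more is suggested when RAM allows"
    return "ok"


def check_dev_shm(path: str = "/dev/shm", multimodal: bool = False) -> str:
    # shutil.disk_usage reports the total size of the filesystem mounted at `path`.
    return shm_recommendation(shutil.disk_usage(path).total, multimodal)
```

Inside a container started with ``--shm-size 2g``, calling ``check_dev_shm()`` would be one way to confirm that the setting took effect before launching the server.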
29 changes: 0 additions & 29 deletions docs/EN/source/tutorial/api_server_args_zh.rst
@@ -183,23 +183,6 @@ Different Parallel Mode Setting Parameters
When set to True, --nccl_host must equal config_server_host, --nccl_port must be unique for config_server,
do not use the same nccl_port for different inference nodes, this will be a serious error

Attention Type Selection Parameters
------------------------------------

.. option:: --mode

Model inference mode, can specify multiple values:

* ``triton_int8kv``: Use int8 to store kv cache, can increase token capacity, uses triton kernel
* ``ppl_int8kv``: Use int8 to store kv cache, uses ppl fast kernel
* ``ppl_fp16``: Use ppl fast fp16 decode attention kernel
* ``triton_flashdecoding``: Flashdecoding mode for long context, currently supports llama llama2 qwen
* ``triton_gqa_attention``: Fast kernel for models using GQA
* ``triton_gqa_flashdecoding``: Fast flashdecoding kernel for models using GQA
* ``triton_fp8kv``: Use float8 to store kv cache, currently only used for deepseek2

Need to read source code to confirm specific modes supported by all models

Scheduling Parameters
---------------------

@@ -325,18 +308,6 @@ Performance Optimization Parameters
.. option:: --enable_decode_microbatch_overlap

The inference backend will use microbatch overlap mode for decoding

.. option:: --enable_flashinfer_prefill

The inference backend will use flashinfer's attention kernel for prefill

.. option:: --enable_flashinfer_decode

The inference backend will use flashinfer's attention kernel for decoding

.. option:: --enable_fa3

The inference backend will use fa3 attention kernel for prefill and decoding

.. option:: --disable_cudagraph

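The documentation changes above drop ``--enable_fa3``, ``--enable_flashinfer_prefill`` and ``--enable_flashinfer_decode``, and the updated launch commands use ``--llm_prefill_att_backend fa3`` and ``--llm_decode_att_backend fa3`` instead. Below is a rough sketch of a helper for rewriting old launch arguments. The helper itself is hypothetical, and the ``flashinfer`` backend value is an assumption: the diff only shows ``fa3`` being passed to the new options.

```python
from typing import Dict, List

# Old boolean flags -> replacement options.
# The fa3 mapping mirrors the updated commands in the diff; the
# "flashinfer" value is assumed and not confirmed by the diff.
LEGACY_FLAGS: Dict[str, List[str]] = {
    "--enable_fa3": [
        "--llm_prefill_att_backend", "fa3",
        "--llm_decode_att_backend", "fa3",
    ],
    "--enable_flashinfer_prefill": ["--llm_prefill_att_backend", "flashinfer"],
    "--enable_flashinfer_decode": ["--llm_decode_att_backend", "flashinfer"],
}


def migrate_args(argv: List[str]) -> List[str]:
    """Return a copy of argv with legacy attention flags replaced."""
    migrated: List[str] = []
    for arg in argv:
        # Unknown arguments pass through unchanged.
        migrated.extend(LEGACY_FLAGS.get(arg, [arg]))
    return migrated
```

For example, ``migrate_args(["--tp", "8", "--enable_fa3"])`` yields the ``--tp 8`` pair followed by the two backend options set to ``fa3``.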