Commit e422bca

[Mamba] Add float16 support for SSM cache dtype (sgl-project#18444)
1 parent 7e262b6 commit e422bca

2 files changed: 2 additions, 2 deletions

docs/advanced_features/server_arguments.md

Lines changed: 1 addition & 1 deletion
@@ -334,7 +334,7 @@ Please consult the documentation below and [server_args.py](https://github.com/s
 | Argument | Description | Defaults | Options |
 | --- | --- | --- | --- |
 | `--max-mamba-cache-size` | The maximum size of the mamba cache. | `None` | Type: int |
-| `--mamba-ssm-dtype` | The data type of the SSM states in mamba cache. | `float32` | `float32`, `bfloat16` |
+| `--mamba-ssm-dtype` | The data type of the SSM states in mamba cache. | `float32` | `float32`, `bfloat16`, `float16` |
 | `--mamba-full-memory-ratio` | The ratio of mamba state memory to full kv cache memory. | `0.9` | Type: float |
 | `--mamba-scheduler-strategy` | The strategy to use for mamba scheduler. `auto` currently defaults to `no_buffer`. 1. `no_buffer` does not support overlap scheduler due to not allocating extra mamba state buffers. Branching point caching support is feasible but not implemented. 2. `extra_buffer` supports overlap schedule by allocating extra mamba state buffers to track mamba state for caching (mamba state usage per running req becomes `2x` for non-spec; `1+(1/(2+speculative_num_draft_tokens))x` for spec dec (e.g. 1.16x if speculative_num_draft_tokens==4)). 2a. `extra_buffer` is strictly better for non-KV-cache-bound cases; for KV-cache-bound cases, the tradeoff depends on whether enabling overlap outweighs reduced max running requests. 2b. mamba caching at radix cache branching point is strictly better than non-branch but requires kernel support (currently only FLA backend), currently only extra_buffer supports branching. | `auto` | `auto`, `no_buffer`, `extra_buffer` |
 | `--mamba-track-interval` | The interval (in tokens) to track the mamba state during decode. Only used when `--mamba-scheduler-strategy` is `extra_buffer`. Must be divisible by page_size if set, and must be >= speculative_num_draft_tokens when using speculative decoding. | `256` | Type: int |
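
To illustrate what the new `float16` option means for cache size, here is a minimal sketch that compares the three dtypes by element width and bit layout. The table `DTYPE_INFO` and the helper `ssm_cache_bytes` are hypothetical names introduced for this example and are not part of sglang; the state counts in the usage are made-up numbers, not measurements.

```python
# Illustrative only: DTYPE_INFO and ssm_cache_bytes are hypothetical helpers,
# not sglang APIs. The widths and bit layouts are the standard ones for
# IEEE 754 binary32/binary16 and for bfloat16.
DTYPE_INFO = {
    # name: (bytes per element, exponent bits, mantissa bits)
    "float32": (4, 8, 23),
    "bfloat16": (2, 8, 7),
    "float16": (2, 5, 10),
}


def ssm_cache_bytes(num_states: int, elements_per_state: int, dtype: str) -> int:
    """Return the raw storage needed for `num_states` SSM states of a given dtype."""
    bytes_per_element, _, _ = DTYPE_INFO[dtype]
    return num_states * elements_per_state * bytes_per_element


if __name__ == "__main__":
    # Assumed sizes, chosen only to make the comparison concrete.
    for name in DTYPE_INFO:
        print(name, ssm_cache_bytes(num_states=8, elements_per_state=1024, dtype=name))
```

Under these assumptions `float16` and `bfloat16` both take half the storage of `float32`. They differ in how the 16 bits are split: `float16` keeps more mantissa bits (finer precision) and `bfloat16` keeps more exponent bits (wider range), so which one suits a given model is an empirical question.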

python/sglang/srt/server_args.py

Lines changed: 1 addition & 1 deletion
@@ -211,7 +211,7 @@
     "flashinfer_trtllm",
 ]
 
-MAMBA_SSM_DTYPE_CHOICES = ["float32", "bfloat16"]
+MAMBA_SSM_DTYPE_CHOICES = ["float32", "bfloat16", "float16"]
 
 MAMBA_SCHEDULER_STRATEGY_CHOICES = ["auto", "no_buffer", "extra_buffer"]
 
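
As a rough sketch of why a one-line change to the choices list is enough to expose the new value on the command line, the snippet below wires the constant into a stand-alone `argparse` parser. It assumes the flag is validated through `choices=`; `build_parser` is a hypothetical helper written for this example, and the real registration in `server_args.py` may differ in its details.

```python
import argparse

# Copied from the changed line in the diff.
MAMBA_SSM_DTYPE_CHOICES = ["float32", "bfloat16", "float16"]


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-alone parser; not the actual sglang argument setup."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--mamba-ssm-dtype",
        type=str,
        default="float32",  # default taken from the documentation table
        choices=MAMBA_SSM_DTYPE_CHOICES,  # argparse rejects any other value
        help="The data type of the SSM states in mamba cache.",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--mamba-ssm-dtype", "float16"])
    print(args.mamba_ssm_dtype)
```

With a parser built this way, a value outside the list makes `argparse` exit with a usage error, so no further validation code is needed for the new option.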
