Description
Checklist
- If this is not a feature request but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
- Please use English. Otherwise, it will be closed.
Motivation
In LLM RL, sleeping and waking up an SGLang server is widely used and well optimized for co-located placement, as detailed in Biao's (@hebiao064) blog: https://hebiao064.github.io/rl-memory-management
In LLM RL, we use torch_memory_saver to preserve the virtual addresses of the SGLang LLM server so that CUDA Graph stays alive. SGLang Diffusion does not support CUDA Graph yet (@zyksir is working on it), so we may be able to use a more brute-force method to sleep and wake up. In the extreme case, we can even kill and relaunch the SGLang Diffusion server; the relaunch time is profiled in #19087.
Given this, we need a way to sleep and wake up SGLang Diffusion. The ideal API should be similar to https://docs.sglang.io/advanced_features/sglang_for_rl.html#fine-grained-engine-sleep-and-wake-up , but the starting point can be more brute force.
If I were handling this issue myself, I would break it down into the following steps:
- Try out a brute-force way to sleep and wake up the SGLang Diffusion server (for example, offloading some crucial parts to CPU; I am not sure which), and compare that with directly killing and relaunching. If brute force turns out to be the best, then we are so cooked. 🤣
- If sleeping and waking up do help, then design the sleep and wake-up APIs, following what we did for LLMs: https://docs.sglang.io/advanced_features/sglang_for_rl.html#fine-grained-engine-sleep-and-wake-up . Having this API would be great.
- Once the APIs from the previous step are done, please provide an end-to-end time for "sleep, wake up + refit" vs. "kill and relaunch". Hopefully this gives us a further speedup.
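The end-to-end comparison in the last step could be organized with a small timing harness like the one below. This is only a sketch: `time_strategy` and `compare_strategies` are hypothetical helper names, and the step callables are placeholders that would need to be replaced with whatever actually sleeps, wakes, refits, kills, or relaunches the server. None of these names are part of SGLang.

```python
import time
from typing import Callable, Dict

Step = Callable[[], None]


def time_strategy(steps: Dict[str, Step]) -> Dict[str, float]:
    """Run the named steps in order and return seconds per step plus a 'total'."""
    timings: Dict[str, float] = {}
    for name, step in steps.items():
        start = time.perf_counter()
        # For GPU work, the step should synchronize before returning,
        # otherwise asynchronous kernels and copies are not counted.
        step()
        timings[name] = time.perf_counter() - start
    timings["total"] = sum(timings.values())
    return timings


def compare_strategies(
    strategies: Dict[str, Dict[str, Step]]
) -> Dict[str, Dict[str, float]]:
    """Time every strategy and return {strategy_label: per-step timings}."""
    return {label: time_strategy(steps) for label, steps in strategies.items()}


if __name__ == "__main__":
    # Placeholder steps (assumptions): swap in the real sleep/wake/refit
    # and kill/relaunch logic for the diffusion server.
    placeholder = lambda: time.sleep(0.01)
    results = compare_strategies(
        {
            "sleep_wake_refit": {
                "sleep": placeholder,
                "wake_up": placeholder,
                "refit": placeholder,
            },
            "kill_relaunch": {"kill": placeholder, "relaunch": placeholder},
        }
    )
    for label, timings in results.items():
        print(label, {k: round(v, 4) for k, v in timings.items()})
```

Reporting per-step times alongside the total makes it easier to see whether the cost is dominated by weight movement, refit, or process startup.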
Related resources
No response