
[Feature] Support offload and wake up of SGLang Diffusion #19090

@zhaochenyang20

Motivation

In LLM RL, sleeping and waking up an SGLang server is widely used and well optimized for co-located placement, as detailed in Biao's (@hebiao064) blog: https://hebiao064.github.io/rl-memory-management

In LLM RL, we use torch_memory_saver to preserve the virtual addresses of the SGLang LLM server so that CUDA Graph stays alive. CUDA Graph is not yet supported in SGLang Diffusion (@zyksir is working on it), so we can use a more brute-force method to sleep and wake up. In the extreme case, we can even kill and relaunch the SGLang Diffusion server; the relaunch time is profiled in #19087
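To make the brute-force idea concrete, here is a minimal sketch in plain PyTorch, not SGLang Diffusion code. The helpers `sleep` and `wake_up` and the toy `model` are hypothetical stand-ins; a real diffusion server holds several components (and possibly other GPU state) that this does not cover. Moving weights this way does not preserve GPU virtual addresses, which is presumably only acceptable while CUDA Graph is not in use.

```python
import torch
from torch import nn


def sleep(module: nn.Module) -> None:
    """Hypothetical brute-force sleep: move weights to CPU, then free cached GPU memory."""
    module.to("cpu")
    if torch.cuda.is_available():
        # Return cached, now-unused blocks to the driver so a co-located
        # training process can actually use the freed memory.
        torch.cuda.empty_cache()


def wake_up(module: nn.Module, device: torch.device) -> None:
    """Hypothetical wake-up: move weights back to the target device."""
    module.to(device)
    if device.type == "cuda":
        # Wait for the host-to-device copies to finish before serving.
        torch.cuda.synchronize(device)


# Toy stand-in for one diffusion component; falls back to CPU without a GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Sequential(nn.Linear(256, 256), nn.GELU(), nn.Linear(256, 256)).to(device)

sleep(model)
devices_while_asleep = {p.device.type for p in model.parameters()}

wake_up(model, device)
devices_after_wake = {p.device.type for p in model.parameters()}
```

Wrapping `sleep` and `wake_up` in timers gives the number to compare against the relaunch time.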

So we need a way to sleep and wake up SGLang Diffusion. The ideal API should be similar to https://docs.sglang.io/advanced_features/sglang_for_rl.html#fine-grained-engine-sleep-and-wake-up , but the starting point can be more brute-force.

If I were handling this issue myself, I would break it down into the following steps:

  1. Try out the brute-force way to sleep and wake up the SGLang Diffusion server (like offloading some crucial parts to CPU, I don't know), and compare that with directly killing and relaunching. If brute force is the best, then we are so cooked. 🤣
  2. If sleeping and waking up do help, then design sleep and wake-up APIs, following what we did for LLMs: https://docs.sglang.io/advanced_features/sglang_for_rl.html#fine-grained-engine-sleep-and-wake-up . An API like this would be great.
  3. Once step 2 is done, please provide the end-to-end time of "sleep, wake up + refit" vs "kill and relaunch". Hopefully we can get a further speedup this time.
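The comparison asked for in steps 1 and 3 could be set up with a small timing harness like the sketch below. Everything here is an assumption for illustration: `time_pipeline`, `compare`, and the `time.sleep` placeholders are hypothetical and only stand in for the real sleep, wake-up, refit, kill, and relaunch operations.

```python
import time
from typing import Callable, Dict, Sequence

Step = Callable[[], None]


def time_pipeline(steps: Sequence[Step]) -> float:
    """Run the steps in order and return the total wall-clock time in seconds."""
    start = time.perf_counter()
    for step in steps:
        step()
    return time.perf_counter() - start


def compare(strategies: Dict[str, Sequence[Step]], repeats: int = 3) -> Dict[str, float]:
    """Return the best-of-`repeats` end-to-end time for each strategy."""
    return {
        name: min(time_pipeline(steps) for _ in range(repeats))
        for name, steps in strategies.items()
    }


# Placeholder steps: replace each lambda with the real operation being measured.
strategies = {
    "sleep_wake_refit": [
        lambda: time.sleep(0.005),  # placeholder for sleep (offload)
        lambda: time.sleep(0.005),  # placeholder for wake up
        lambda: time.sleep(0.005),  # placeholder for refit (weight update)
    ],
    "kill_and_relaunch": [
        lambda: time.sleep(0.05),   # placeholder for killing the server
        lambda: time.sleep(0.05),   # placeholder for relaunching with new weights
    ],
}

results = compare(strategies)
for name, seconds in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name}: {seconds:.3f}s")
```

Taking the best of several runs reduces noise from other processes. With GPU work, each real step should synchronize the device before returning, otherwise the timer only measures kernel launch time.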

