Description
We are looking to evaluate the current inference performance of zai-org/GLM-Image when running on the sglang-diffusion engine compared to the baseline Diffusers implementation.
Preliminary observations suggest that the current implementation for GLM-Image within our stack may be under-optimized. Specifically, it appears to lack support for Sequence Parallelism (SP), which is crucial for handling high-resolution image generation efficiently. Improving this will not only boost GLM-Image performance but also provide architectural insights for the broader SGLang-D project.
Goals
- Benchmarking: Establish a performance baseline (latency, throughput, and VRAM usage) for GLM-Image using both `sglang-diffusion` and `diffusers`.
- Profiling: Identify bottlenecks in the current `sglang-diffusion` path for this model (e.g., attention kernels, memory overhead).
- Optimization (Optional/Bonus): Propose or implement initial optimizations, such as enabling Sequence Parallelism or improving memory management.
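The benchmarking goal above can be sketched with a minimal latency harness. This is a stdlib-only sketch: the `run_once` callable, warmup count, and repeat count are illustrative assumptions, not an existing API — in practice you would wrap one `diffusers` or `sglang-diffusion` pipeline call in it, and read peak VRAM separately (e.g. via `torch.cuda.max_memory_allocated()`).

```python
import statistics
import time

def benchmark(run_once, warmup=2, repeats=5):
    """Time a zero-argument callable; return (mean_s, stdev_s, runs_per_s).

    `run_once` stands in for one pipeline invocation, e.g.
    `lambda: pipe(prompt, height=1024, width=1024)` -- an assumed interface.
    VRAM is not measured here; record it separately per run.
    """
    for _ in range(warmup):  # discard compilation / cache warmup runs
        run_once()
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        run_once()
        times.append(time.perf_counter() - t0)
    mean = statistics.mean(times)
    stdev = statistics.stdev(times) if len(times) > 1 else 0.0
    return mean, stdev, 1.0 / mean
```

Running the same harness against both engines with identical prompts, seeds, and step counts keeps the comparison apples-to-apples.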
Technical Tasks
- Set up a reproducible benchmarking script for GLM-Image.
- Compare inference latency across different batch sizes and resolutions.
- Analyze if and where Sequence Parallelism can be integrated into the current GLM-Image wrapper.
- Document the findings in a detailed report or table within this issue.
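For the batch-size/resolution sweep and the reporting task, a small helper that renders results as a markdown table (pasteable into this issue) could look like the sketch below. The `measure` callable is a hypothetical stand-in for whichever engine is under test; names and columns are assumptions.

```python
def sweep_to_markdown(measure, batch_sizes, resolutions):
    """Run `measure(batch, res) -> latency_s` over a grid and render a
    markdown table of latency and images/sec for this issue's report.
    """
    rows = [
        "| batch | resolution | latency (s) | imgs/s |",
        "|-------|------------|-------------|--------|",
    ]
    for b in batch_sizes:
        for res in resolutions:
            lat = measure(b, res)  # one timed generation at this config
            rows.append(f"| {b} | {res}x{res} | {lat:.3f} | {b / lat:.2f} |")
    return "\n".join(rows)
```

Producing one such table per engine makes the `sglang-diffusion` vs. `diffusers` gap visible at a glance across configurations.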
You can read this as a reference:
Calling SGLang-D community members! If you are interested in high-performance computing, kernel optimization, or the latest diffusion models, we would love your help on this.