[diffusion] feat: allow T5's TP Group to reuse the transformer's SP Group#17818

Merged
mickqian merged 10 commits into sgl-project:main from nono-Sang:t5_parallel_folding
Feb 5, 2026

Conversation

@nono-Sang
Contributor

Motivation

Currently, the TP group of the text_encoder matches that of the transformer. This PR introduces an optional feature that allows the text_encoder to adopt the transformer's SP, Ulysses, or Ring group as its own TP group (referred to as "parallel folding").

Use case: when running inference on eight GPUs with ulysses=8 and tp=1, each GPU stores a complete copy of the text_encoder. With parallel folding enabled, the text_encoder can run with tp=8 instead, sharding its weights across the Ulysses group.
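To make the memory argument concrete, here is a small illustrative sketch (not sglang code; the 4.7B parameter count is a hypothetical T5-XXL-like figure) of the per-GPU encoder footprint with and without folding:

```python
# Illustrative sketch: memory effect of parallel folding on an 8-GPU run
# with ulysses=8, tp=1. Parameter count is a hypothetical assumption.

def encoder_params_per_gpu(total_params: int, tp_size: int) -> int:
    """Parameters each GPU holds when the encoder is sharded over tp_size ranks."""
    return total_params // tp_size

total = 4_700_000_000  # hypothetical T5-XXL-like encoder size

without_folding = encoder_params_per_gpu(total, 1)  # full replica on every GPU
with_folding = encoder_params_per_gpu(total, 8)     # Ulysses group reused as TP

print(without_folding // with_folding)  # folding cuts per-GPU encoder weights 8x
```

The same group of 8 ranks does the work either way; folding just changes whether the encoder is replicated across it or sharded over it.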

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@github-actions github-actions bot added the diffusion SGLang Diffusion label Jan 27, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello @nono-Sang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant optimization for T5 text encoders in distributed environments. It enables the text encoder's tensor parallelism to leverage existing sequence, Ulysses, or ring parallel groups from the main transformer model. This "parallel folding" can lead to more efficient resource utilization, particularly memory, by preventing redundant model copies on GPUs when other parallelism strategies are already in use. The changes involve modifying T5 configuration, refactoring distributed communication primitives, and updating parallel linear and embedding layers to support dynamic process group assignment.

Highlights

  • T5 Parallel Folding Feature: Introduced an optional parallel_folding mechanism for T5 text encoders, allowing them to reuse existing transformer parallel groups (Sequence Parallel, Ulysses, or Ring) as their own Tensor Parallel (TP) group for improved resource utilization.
  • Flexible Distributed Operations: Refactored core distributed communication operations (tensor_model_parallel_all_reduce, tensor_model_parallel_all_gather) and parallel linear/embedding layers to accept an explicit tp_group argument, enhancing flexibility in distributed setups.
  • Unified Group Utilities: Added get_group_size and get_group_rank utility functions to abstract away the underlying ProcessGroup or GroupCoordinator types, simplifying group-related operations across the codebase.




@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new 'parallel folding' feature for T5 text encoders, allowing them to utilize existing sequence parallelism groups (SP, Ulysses, or Ring Group) as their tensor parallelism (TP) group. This is achieved by adding parallel_folding and parallel_folding_mode configurations to T5Config. The changes involve refactoring distributed communication operations to accept an optional tp_group argument, which defaults to the global TP group if not specified. This refactoring is consistently applied across various linear and embedding layers to enhance flexibility in group assignment. Additionally, new utility functions get_group_size and get_group_rank are introduced for abstracting process group properties. A notable change in wanvideo.py updates proj_out to use ColumnParallelLinear, enabling its parallelization.

self.n_heads = config.num_heads // tp_world_size
self.tp_group = _get_folding_tp_group(config)
self.tp_world_size = get_group_size(self.tp_group)
assert config.num_heads % self.tp_world_size == 0

critical

The assertion assert config.num_heads % self.tp_world_size == 0 is crucial for ensuring the correct distribution of attention heads in a tensor parallel setup. If config.num_heads is not divisible by self.tp_world_size, it would lead to an uneven distribution or errors in parallel processing.
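A minimal sketch of the failure mode the assertion guards against (hypothetical values, not the sglang layer itself): if the head count does not divide evenly, some rank would get a different number of heads than the others.

```python
# Sketch: attention heads must split evenly across the TP group.
# num_heads=64 below is an assumption (T5-XXL-like), used only for illustration.

def heads_per_rank(num_heads: int, tp_world_size: int) -> int:
    if num_heads % tp_world_size != 0:
        raise ValueError(
            f"num_heads={num_heads} is not divisible by tp_world_size={tp_world_size}"
        )
    return num_heads // tp_world_size

print(heads_per_rank(64, 8))  # 8 heads per rank on an 8-way folded group
```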

Comment on lines +750 to +754
self.proj_out = ColumnParallelLinear(
inner_dim,
config.out_channels * math.prod(config.patch_size),
bias=True,
gather_output=True,

high

Changing self.proj_out from nn.Linear to ColumnParallelLinear implies that this projection layer will now be parallelized. This is a significant functional change that should be thoroughly tested to ensure correctness and to verify any performance implications.
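As a sanity check of the semantics, here is a toy pure-Python sketch (hypothetical sizes, not the sglang layer) of what a column-parallel projection with gather_output=True computes: each rank owns a shard of the output features, and gathering the shards reproduces the full projection.

```python
# Toy sketch of ColumnParallelLinear semantics with gather_output=True.
# Sizes and values are hypothetical; real code shards torch weights.

def matvec(rows, x):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in rows]

# Full weight: 4 output features x 2 input features (one row per output feature).
W = [[1, 0], [0, 1], [2, 3], [4, 5]]
x = [10, 1]

full = matvec(W, x)

# Two "ranks" each own half of the output features.
shard0, shard1 = W[:2], W[2:]
gathered = matvec(shard0, x) + matvec(shard1, x)  # all-gather along output dim

print(full == gathered)  # True: gather_output reassembles the full projection
```

So the change should be numerically equivalent to the old nn.Linear, which is exactly what the reviewer asks to verify with tests.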

Comment on lines +16 to +17
def tensor_model_parallel_all_reduce(
input_: torch.Tensor, tp_group: dist.ProcessGroup = None

medium

The addition of tp_group: dist.ProcessGroup = None as an optional argument and defaulting to get_tp_group() significantly enhances the flexibility of these communication operations. This is crucial for supporting the new parallel folding feature, allowing specific process groups to be used for tensor parallelism.
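The defaulting pattern can be sketched with mocked groups (the strings below stand in for real process groups; the real function calls dist.all_reduce with group=...):

```python
# Minimal sketch of the tp_group defaulting pattern, with mocked groups.
# "_GLOBAL_TP_GROUP" and the returned tuple are illustrative assumptions.

_GLOBAL_TP_GROUP = "tp"  # stand-in for the real global TP GroupCoordinator

def get_tp_group():
    return _GLOBAL_TP_GROUP

def tensor_model_parallel_all_reduce(x, tp_group=None):
    group = tp_group if tp_group is not None else get_tp_group()
    # real code would call dist.all_reduce(x, group=group);
    # here we just report which group was selected
    return x, group

print(tensor_model_parallel_all_reduce(1.0)[1])             # global "tp" group
print(tensor_model_parallel_all_reduce(1.0, "ulysses")[1])  # the folded group
```

Callers that never pass tp_group keep the old behavior, so the refactor is backward compatible.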

Comment on lines +325 to +330
tp_group: dist.ProcessGroup = None,
):
# Divide the weight matrix along the last dimension.
self.tp_group = tp_group or get_tp_group()
self.tp_size = get_group_size(self.tp_group)
self.tp_rank = get_group_rank(self.tp_group)

medium

Accepting tp_group in ColumnParallelLinear.__init__ and deriving tp_size and tp_rank from it makes the layer more modular and independent of global state. This is a good architectural improvement for flexibility and testability.

Comment on lines +480 to +481
self.output_sizes = output_sizes
assert all(output_size % self.tp_size == 0 for output_size in output_sizes)

medium

Moving the assertion assert all(output_size % self.tp_size == 0 for output_size in output_sizes) after the super().__init__ call is correct. This ensures that self.tp_size has been properly initialized by the superclass before being used in the assertion.

Comment on lines +15 to +30
def get_group_size(group) -> int:
if hasattr(group, "world_size"):
return group.world_size # GroupCoordinator
elif hasattr(group, "size") and callable(getattr(group, "size", None)):
return group.size() # ProcessGroup
else:
raise ValueError(f"Unsupported group type: {type(group)}")


def get_group_rank(group) -> int:
if hasattr(group, "rank_in_group"):
return group.rank_in_group # GroupCoordinator
elif hasattr(group, "rank") and callable(getattr(group, "rank", None)):
return group.rank() # ProcessGroup
else:
raise ValueError(f"Unsupported group type: {type(group)}")

medium

The get_group_size and get_group_rank functions provide a clean and robust abstraction for querying properties of different process group types (GroupCoordinator and ProcessGroup). This enhances code readability and maintainability by centralizing this logic.
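The duck typing can be exercised with stand-in classes (the fakes below are hypothetical; the real types are sglang's GroupCoordinator and torch.distributed's ProcessGroup):

```python
# Fakes mimicking the two group interfaces the helpers must accept.

class FakeGroupCoordinator:          # GroupCoordinator-like: attributes
    world_size = 4
    rank_in_group = 2

class FakeProcessGroup:              # ProcessGroup-like: methods
    def size(self):
        return 8
    def rank(self):
        return 3

# Same helpers as in the PR:
def get_group_size(group) -> int:
    if hasattr(group, "world_size"):
        return group.world_size      # GroupCoordinator
    elif hasattr(group, "size") and callable(getattr(group, "size", None)):
        return group.size()          # ProcessGroup
    raise ValueError(f"Unsupported group type: {type(group)}")

def get_group_rank(group) -> int:
    if hasattr(group, "rank_in_group"):
        return group.rank_in_group   # GroupCoordinator
    elif hasattr(group, "rank") and callable(getattr(group, "rank", None)):
        return group.rank()          # ProcessGroup
    raise ValueError(f"Unsupported group type: {type(group)}")

print(get_group_size(FakeGroupCoordinator()), get_group_rank(FakeGroupCoordinator()))
print(get_group_size(FakeProcessGroup()), get_group_rank(FakeProcessGroup()))
```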

Comment on lines +76 to +84
def _get_folding_tp_group(config: T5Config) -> dist.ProcessGroup | None:
if config.parallel_folding:
if config.parallel_folding_mode == "sp":
return get_sp_group()
elif config.parallel_folding_mode == "ulysses":
return get_sp_group().ulysses_group
elif config.parallel_folding_mode == "ring":
return get_sp_group().ring_group
return get_tp_group()

medium

The _get_folding_tp_group function effectively centralizes the logic for determining the appropriate tp_group based on the parallel_folding configuration. This is a good design pattern for managing conditional logic related to distributed groups, improving clarity and reducing redundancy.
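The dispatch logic can be sketched in isolation with mocked groups (the real function calls get_sp_group()/get_tp_group() from sglang's distributed module; the strings here are placeholders):

```python
# Sketch of the folding dispatch with mocked groups (assumptions, not sglang code).

MOCK_GROUPS = {  # hypothetical placeholders for the real process groups
    "sp": "sp_group",
    "ulysses": "ulysses_subgroup",
    "ring": "ring_subgroup",
}

def pick_folding_tp_group(parallel_folding, mode):
    if parallel_folding and mode in MOCK_GROUPS:
        return MOCK_GROUPS[mode]
    return "global_tp_group"  # default: encoder keeps the global TP group

print(pick_folding_tp_group(True, "ulysses"))  # the folded Ulysses subgroup
print(pick_folding_tp_group(False, None))      # unchanged behavior
```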

@BBuf
Collaborator

BBuf commented Feb 2, 2026

@mickqian Any advice?

attn_bias: torch.Tensor


def _get_folding_tp_group(config: T5Config) -> dist.ProcessGroup | None:
Collaborator

could we have a @lru_cache(maxsize=1) here?
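For illustration, a minimal sketch of what the suggested memoization would do. One caveat (an assumption, since T5Config's hashability is not shown in this PR): lru_cache requires hashable arguments, so the cached function would need a hashable key, or would have to close over the config rather than take it as a parameter.

```python
from functools import lru_cache

# Sketch of @lru_cache(maxsize=1) memoizing a group lookup; the counter
# and string return value are illustrative stand-ins.

calls = {"n": 0}

@lru_cache(maxsize=1)
def cached_lookup(mode: str):
    calls["n"] += 1                 # count how often the real lookup runs
    return f"group_for_{mode}"      # stand-in for the resolved process group

cached_lookup("ulysses")
cached_lookup("ulysses")
print(calls["n"])  # the underlying lookup ran only once
```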

attn_bias: torch.Tensor


def _get_folding_tp_group(config: T5Config) -> dist.ProcessGroup | None:
Collaborator

consider moving to somewhere like distributed/util.py


@mickqian mickqian left a comment


brilliant. we should document this change

@BBuf
Collaborator

BBuf commented Feb 3, 2026

cc @nono-Sang

@nono-Sang nono-Sang force-pushed the t5_parallel_folding branch from 6db6152 to 429f0f9 Compare February 3, 2026 16:13


_seen_keys = set()  # use a set to record keys that have already appeared
_seen_keys = set()
Collaborator


could you also clean this? seems redundant 😂

@mickqian mickqian force-pushed the t5_parallel_folding branch from e34df17 to 2af63e2 Compare February 5, 2026 01:37
@mickqian
Collaborator

mickqian commented Feb 5, 2026

/tag-and-rerun-ci

@github-actions github-actions bot added the run-ci label Feb 5, 2026
@mickqian mickqian merged commit b639779 into sgl-project:main Feb 5, 2026
80 of 82 checks passed
charlesHsuGG pushed a commit to charlesHsuGG/sglang that referenced this pull request Feb 9, 2026
Johnsonms pushed a commit to Johnsonms/sglang that referenced this pull request Feb 14, 2026

Labels

diffusion SGLang Diffusion run-ci
