[WIP] Merging AutoSP into DeepSpeed #7860
neeldani wants to merge 9 commits into deepspeedai:master
Conversation
Hi @neeldani, Since this is a large PR, let’s proceed step by step. Here are my suggestions:
@tohtana thank you for the feedback:
Please let me know if there are any hiccups running AutoSP or any questions related to the design - happy to discuss them on this PR.
@@ -0,0 +1,64 @@
import torch
(Not about this file) Don't we need `__init__.py` in `custom_ops`?
#########################################
# AUTOSP
#########################################
INPUT_ID_KEY = "input_id"
Can we make these AutoSP-specific, e.g. `AUTOSP_*`?
  if enable_deepcompile and self.zero_optimization_stage() != ZeroStageEnum.optimizer_states \
          and self.zero_optimization_stage() != ZeroStageEnum.weights \
-         and self.zero_optimization_stage() != ZeroStageEnum.gradients:
+         and self.zero_optimization_stage() != ZeroStageEnum.gradients \
Can you elaborate on the intention of this change? (The preexisting condition also seems odd.)
Assuming `zero_optimization_stage()` returns 0, 1, 2, or 3, we never enter this block?
Maybe we need the fallback only when SP is disabled and the ZeRO stage is 0?
deepspeed/runtime/engine.py
    "DeepCompile with ZeRO stage 3 is not currently supported on PyTorch >= 2.9. "
    "Please use ZeRO stage 1 or 2 with DeepCompile, or disable DeepCompile for ZeRO stage 3.")
backend = init_z3(self, backend, compile_config, compile_kwargs, schedule)
elif self.zero_optimization_stage() == ZeroStageEnum.disabled:
Currently, do we enable AutoSP by these?

- set zero stage to 0
- enable DeepCompile

If so, I think we should make it more explicit.
For AutoTP (see example), we do

    "tensor_parallel": {
        "autotp_size": args.tp_size,
        ...
    }

For the AutoEP proposal,

    "expert_parallel": {
        "autoep_size": args.ep_size,
        ...
    }

AutoEP is currently just a proposal, but how about making the config

    "sequence_parallel": {
        "autosp_size": args.sp_size,
        ...
    }
You may want to require DeepCompile to be enabled too. As we don't have eager AutoSP now, it might be good to enable DeepCompile automatically when sequence_parallel is enabled.
Here is the flow for enabling AutoSP (and its interoperability with ZeRO-1 DP):

The user specifies in a config the pertinent compiler passes and their parameters, including what the SP and DP sizes should be. This looks like the following:

    "compile": {
        "deepcompile": true,
        "passes": ["autosp"],
        "sp_size": 2,
        "dp_size": 1
    }

Next, if the DP size is larger than one, the user can opt to turn on ZeRO-1, using the legacy config style to specify the zero_optimization stage. The entire configuration would look like this:

    "zero_optimization": {
        "stage": 1
    },
    "compile": {
        "deepcompile": true,
        "passes": ["autosp"],
        "sp_size": 2,
        "dp_size": 1
    }

This would then compose SP with ZeRO-1 DP accordingly. Note, however, that this is not the ZeRO-1 DP from DeepCompile, but rather the ZeRO-1 DP implemented originally in DeepSpeed.

Here, I have currently opted to make both the sp_size and dp_size explicitly controllable by the user. Another option is to automatically infer the DP size from the SP size by computing dp_size = num_devices / sp_size.
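The automatic inference mentioned here is just an exact integer division; a tiny sketch (the function name is hypothetical, not part of the PR):

```python
def infer_dp_size(num_devices: int, sp_size: int) -> int:
    """Infer dp_size as num_devices / sp_size, requiring an exact split."""
    if sp_size <= 0 or num_devices % sp_size != 0:
        # an uneven split would leave some ranks without a full sequence shard
        raise ValueError(
            f"num_devices={num_devices} must be a positive multiple of sp_size={sp_size}")
    return num_devices // sp_size
```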
For a sample config, here is the deepspeed config in DeepSpeedExamples (link here).
Thank you, @spikerheado1234!
Can you clarify a few points?

- Can we automatically determine dp_size based on the world size and sp_size?
- It feels a bit odd to have "sp_size" as part of the compile config. Shouldn't we make a new item for setting arguments to compiler passes? (I also want to get thoughts from @sfc-gh-truwase)
- In the current SP (Ulysses), the size of ZeRO sharding is dp_size * sp_size. Is this the same for AutoSP?
- Can you add tests that run the matrix of sp_size * dp_size * zero_stage? That will clarify what we support and guarantee it works.
Thank you @neeldani for the update! Since we don't have many changes in existing code, I don't think we have much risk. We should also have clear assertions to terminate early when we hit these limitations: Attention Pattern Matching and the No Graph Break Requirement.
Patch Zero-1 interoperability when using AutoSP.
Hi @tohtana, I just merged in the code that correctly enables ZeRO-1 and AutoSP interoperability.
AutoSP: Unlocking Long-Context LLM Training Via Compiler-Based Sequence Parallelism
Overview
AutoSP is a compiler optimization pass that shards inputs along the sequence dimension and enables Ulysses-style sequence parallelism while preventing graph breaks during `torch.compile()`. All passes operate at the Torch IR level on the forward graph.

API Design
User-Facing Entry Point: `prepare_autosp_inputs()`

Users must explicitly call this function to prepare inputs for AutoSP compilation.

Purpose: Symbolize the sequence dimension and annotate tensors for identification.

Operations:

- `torch._dynamo.decorators.mark_dynamic()`
- `input_id.tag = constants.INPUT_ID_KEY`
- `label_id.tag = constants.LABEL_ID_KEY`
- `position_id.tag = constants.POSITION_ID_KEY` (if provided)

Rationale: PyTorch's FX graph tracer requires explicit annotation of data-dependent dimensions. Marking the sequence dimension as dynamic prevents symbolic shape propagation from losing dimension information through reshape/view operations.
Compilation Passes

Pass 1: `pass_shard_seq_dim()`

Objective: Propagate the sharded sequence dimension to all consumers.

Algorithm:

- Read the `input_id` shape metadata
- Replace the sequence dimension with `seq_dim / world_size`

Rationale: Reshapes and views that consume the sequence dimension as an argument do not get updated during propagation of symbolic shapes. This pass explicitly rewires the computation graph to use sharded dimensions, enabling proper shape inference downstream.
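The rewiring can be illustrated on a toy FX graph. This is a sketch under assumed conventions (the helper name, the toy model, and the restriction to `view`/`reshape` call-method nodes are all illustrative; the real pass works on Torch IR shape metadata):

```python
import torch
from torch.fx import symbolic_trace


def shard_view_args(gm, seq_len, world_size):
    """Replace hard-coded sequence lengths in view/reshape arguments with the
    sharded length, mimicking the rewiring Pass 1 performs."""
    shard = seq_len // world_size
    for node in gm.graph.nodes:
        if node.op == "call_method" and node.target in ("view", "reshape"):
            # FX Node objects compare by identity, so only int literals match
            node.args = tuple(shard if a == seq_len else a for a in node.args)
    gm.recompile()
    return gm


def toy_forward(x):
    # restores a (batch, seq, hidden) layout with a hard-coded seq length of 8
    return x.reshape(2, 8, 4)
```

After the rewrite, the recompiled module reshapes to the per-rank sequence length instead of the full one.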
Pass 2: `pass_shard_input_ids()` / `pass_shard_label_ids()` / `pass_shard_position_ids()`

Objective: Insert slicing operations after input tensors.

Implementation: Call the `shard_tensor_node()` utility, which inserts slice operations. Each rank retains only the portion of the tensor corresponding to its sequence partition and drops the remaining buffer.

Note on `attention_mask`: Not sharded, because it applies to the full sequence length, not the partitioned dimension.

Pass 3: `pass_insert_attention_all_to_all()`

Objective: Insert all-to-all collectives around attention (Ulysses style) to avoid graph breaks during compilation.
Algorithm:
Graph Rewrite Example:
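The rewrite can be pictured with a single-process simulation of the Ulysses resharding that this pass inserts before attention. This is purely illustrative: the real pass emits `torch.distributed` all-to-all collectives into the graph, and the shapes and function name here are assumptions:

```python
import torch


def ulysses_all_to_all(shards, sp_size, gather_dim=1, scatter_dim=2):
    """Simulate the pre-attention all-to-all on one process.

    shards[r] is "rank" r's local tensor of shape [B, S/sp, H, D]; the
    collective gathers the full sequence and scatters attention heads, so
    each rank ends up with [B, S, H/sp, D] and can run full-sequence
    attention on its subset of heads.
    """
    full = torch.cat(shards, dim=gather_dim)           # [B, S, H, D]
    return list(full.chunk(sp_size, dim=scatter_dim))  # sp_size x [B, S, H/sp, D]
```

After attention, a second all-to-all with `gather_dim` and `scatter_dim` swapped restores the sequence-sharded layout.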
Current support: Only `torch.nn.functional.scaled_dot_product_attention()` is currently supported. Composite attention patterns require additional pattern-matching logic.

Pass 4: `pass_propagate_shapes()`

Objective: Compute static shapes for all nodes using fake tensor execution.

Implementation:

- Create a `ShapeEnv` for symbolic dimension tracking
- Create a `FakeTensorMode` with the shape environment
- Run `FakeTensorProp.propagate()` to compute shape metadata

Pass 5: `pass_canonicalize()`

Objective: Finalize the graph representation.

Operations:

- `eliminate_dead_code()`: Remove unused operations
- `lint()`: Validate graph structure
- `recompile()`: Regenerate the compiled representation

Execution Order
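The passes run in the order listed above. A hypothetical end-to-end driver sketch: the first three passes are identity stubs here, while `pass_propagate_shapes` and `pass_canonicalize` are minimal working versions of the behavior described; none of this is the actual DeepSpeed implementation.

```python
import torch
from torch.fx import symbolic_trace
from torch.fx.passes.fake_tensor_prop import FakeTensorProp
from torch._subclasses.fake_tensor import FakeTensorMode
from torch.fx.experimental.symbolic_shapes import ShapeEnv


# Stubs standing in for the graph-rewriting passes described above
def pass_shard_seq_dim(gm, example):
    return gm

def pass_shard_input_ids(gm, example):
    return gm

def pass_insert_attention_all_to_all(gm, example):
    return gm

def pass_propagate_shapes(gm, example):
    # Pass 4: fake-tensor execution fills node.meta["val"] with shape info
    mode = FakeTensorMode(shape_env=ShapeEnv())
    FakeTensorProp(gm, mode).propagate(example)
    return gm

def pass_canonicalize(gm, example):
    # Pass 5: drop dead nodes, validate the graph, regenerate forward code
    gm.graph.eliminate_dead_code()
    gm.graph.lint()
    gm.recompile()
    return gm


def run_autosp_passes(fn, example):
    gm = symbolic_trace(fn)
    for p in (pass_shard_seq_dim, pass_shard_input_ids,
              pass_insert_attention_all_to_all,
              pass_propagate_shapes, pass_canonicalize):
        gm = p(gm, example)
    return gm
```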
Reducing gradients across ranks

AutoSP requires an all-reduce to reduce gradients across ranks. This is invoked automatically by DeepSpeed's engine here.
Known Limitations

- Only `torch.nn.functional.scaled_dot_product_attention()` is supported. Fused attention implementations require pattern-specific handling.

Example

DeepSpeedExamples PR: deepspeedai/DeepSpeedExamples#999