[Refactor][Pipeline] Run pipeline rewriting before layout inference and stabilize tiled WS #2002
Merged
LeiWang1999 merged 89 commits into tile-ai:main on Apr 7, 2026
Conversation
…ling

- Consolidated the handling of shared barriers and pipeline planning by removing redundant conditional checks.
- Ensured that `LowerSharedBarrier`, `PipelinePlanning`, and `InjectSoftwarePipeline` are applied consistently, improving the clarity and efficiency of the optimization process.

These changes improve the maintainability of the code while preserving existing functionality.
…tor gemm.h

- Added a new `InstructionAnnotation` pass to annotate tile operations with their instruction kind before layout inference, improving the optimization pipeline's ability to reason about instruction mixes.
- Refactored `gemm.h` to move the `allowTcgen5Mma` and `allowWgmma` methods into the private section, improving code organization and encapsulation.

These changes improve the clarity and maintainability of the code while preserving existing functionality.
…pecialization

- Updated `multi_version_buffer_rewriter.cc` to improve read/write access detection for tile operations by analyzing `tl.tileop.region` calls, ensuring accurate buffer access tracking.
- Modified `phase.py` to run `ProducerConsumerWarpSpecializedTiled` before layout inference, allowing high-level tile-op IR transformations that improve producer/consumer splits.
- Added a new `ProducerConsumerWarpSpecializedTiled` function in `__init__.py` to enable tile-level warp specialization.

These changes improve the handling of multi-version buffers and the transformation process for tiled operations.
… plan

- inject_pipeline.cc: guard the reads/writes recalculation with the subtree_modified_ flag to prevent local.var buffer promotion to kernel parameters
- phase.py: add temporary debug prints after PipelinePlanning and InjectSoftwarePipeline (to be removed during CI fix work)
- example_group_per_split_token_cast_to_fp8.py: add disable_cache for debugging
- docs/plan.md: implementation plan for fixing CI test failures
- draft.md: original design draft

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove print("After PipelinePlanning"), print(mod),
print("After InjectSoftwarePipeline"), print(mod) from
LowerAndLegalize in tilelang/engine/phase.py
- Remove tilelang.disable_cache() from
examples/cast/example_group_per_split_token_cast_to_fp8.py
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
After InjectSoftwarePipeline, multi-versioned buffers share the same data Var as the original but carry an extra leading dimension (num_stages). LayoutInference's alias propagation and annotation handling tried to Reshape layouts between these buffers, which failed because the total element counts differ.

Guard the three Reshape call sites in layout_inference.cc to skip sibling buffers whose total storage size is incompatible with the source layout. This lets multi-versioned buffers get their own layout inference instead of inheriting an incompatible layout from the original buffer.

Fixes compilation failures in dequantize_gemm, GDN, and other kernels that use software pipelining with shared memory buffers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
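The size-compatibility guard can be pictured with a minimal sketch (hypothetical names; the real check lives in C++ in layout_inference.cc): a reshape between a source layout and a sibling buffer is only attempted when total element counts agree, which a multi-versioned buffer's leading stage dimension breaks.

```python
from math import prod

def can_reshape(src_shape, dst_shape):
    # A reshape is only valid when the total element counts match.
    return prod(src_shape) == prod(dst_shape)

# Original shared buffer vs. its multi-versioned sibling with a
# leading num_stages dimension: the element counts differ, so the
# sibling must get its own inferred layout rather than a reshaped
# copy of the original's layout.
orig = (64, 64)
versioned = (3, 64, 64)  # num_stages = 3
assert not can_reshape(orig, versioned)

# A genuine reshape of the versioned buffer itself is still fine.
assert can_reshape((3, 64, 64), (3, 4096))
```

In this sketch the guard simply declines the incompatible pair, mirroring how the pass skips the sibling and lets layout inference run on it independently.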
When ProducerConsumerWarpSpecializedTiled identifies a TMA kernel as a warp-specialization candidate but the tiled rewriter cannot handle it (e.g., conditional loop bodies such as sparse block masks), the fallback previously returned the original function with its num_stages annotations intact. PipelinePlanning and InjectSoftwarePipeline would then generate non-WS TMA pipeline code with broken barrier phase tracking for conditional pipeline bodies: barrier waits outside the conditionals deadlock when the condition is false.

Fix: on WS fallback, strip num_stages annotations from pipeline loops so that the pipeline passes skip the function. The kernel runs unpipelined but correctly.

Fixes CUDA_ERROR_LAUNCH_FAILED in blocksparse_gemm and related TMA kernels with conditional loop bodies.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Port PhaseCounter and StageExprReplacer from the legacy ProducerConsumerWarpSpecialized pass into the tiled WS pass to handle conditional loop bodies (e.g., sparse block masks). When the pipeline loop body is wrapped in an IfThenElse without an else branch:

1. Unwrap the condition before classifying statements.
2. Create separate producer/consumer PhaseCounters (local int32 buffers).
3. Use counter-based stage/parity expressions instead of the loop variable.
4. Wrap the producer and consumer bodies in the original condition.
5. Increment the counters at the end of each guarded iteration.
6. Rewrite shared-buffer stage indices via StageExprReplacer.

This ensures barrier parity stays correct when iterations are conditionally skipped, fixing CUDA_ERROR_LAUNCH_FAILED in blocksparse_gemm and related TMA kernels with conditional execution.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
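The steps above can be sketched with a small simulation (a hypothetical simplification, not the actual pass): when the stage and barrier parity are derived from a counter that is incremented only on executed iterations, skipped iterations never desynchronize producer and consumer phases, whereas deriving them from the loop variable would.

```python
def run_pipeline(masks, num_stages):
    """Simulate counter-based stage/parity tracking when loop
    iterations are conditionally skipped (sketch of the
    PhaseCounter scheme described above)."""
    cnt = 0
    trace = []
    for m in masks:
        if m:  # original loop condition (e.g. a sparse block mask)
            stage = cnt % num_stages
            parity = (cnt // num_stages) % 2
            trace.append((stage, parity))
            cnt += 1  # incremented only inside the guard
    return trace

# With iterations 1 and 3 skipped, stages stay dense (0, 1, 0) and
# parity flips exactly once per num_stages executed iterations, so
# producer and consumer barrier phases stay in agreement.
trace = run_pipeline([True, False, True, False, True], num_stages=2)
assert trace == [(0, 0), (1, 0), (0, 1)]
```

Had `stage`/`parity` been computed from the loop index instead of `cnt`, the skipped iterations would leave holes in the stage sequence, which is exactly the broken-parity failure mode the commit fixes.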
Add GemmSPNode handling to:

- inject_pipeline.cc AddReadsWritesForTileOp: model A, E, and B as reads and C as a write (E is the sparse metadata buffer).
- pipeline_planning.cc: use the same access model for dependency analysis.

This makes sparse GEMM visible to the pipeline machinery for correct stage assignment and buffer multi-versioning. However, the tile-op's consumer-side buffer accesses still don't get stage-indexed, because the pipeline body rewriter can't rewrite high-level tile-op Call arguments (an architectural limitation of running InjectSoftwarePipeline before LowerTileOp).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace whole-buffer access_ptr(1) calls with MakeAccessPtrFromRegion for the A, B, C, and E buffers in sparse GEMM lowering. This preserves stage-specific region offsets from pipeline multi-versioning, matching the dense GemmNode::Lower pattern.

CUDA output now shows correct stage-indexed consumer accesses:

    gemm_sp_ss(..., (k%3)*8192, (k%3)*8192+27648, C_local, (k%3)*2048+49152)

instead of always using stage-0 offsets.

Note: gemm_sp still produces incorrect results because the kernel needs warp specialization, but TiledWSCandidate::Check does not recognize it as a TMA candidate. The non-WS pipeline path generates structurally different code from the reference WS path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
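Where those `(k%3)*8192`-style offsets come from can be shown with a tiny sketch (hypothetical helper; the real logic is region-based pointer construction in C++): each pipeline stage owns a contiguous slab of the multi-versioned buffer, so the consumer's offset for iteration k is the stage index times the per-stage element count.

```python
def stage_offset(k, num_stages, elems_per_stage, base=0):
    # Region offset for iteration k of a multi-versioned buffer:
    # stage (k % num_stages) owns a contiguous slab of
    # elems_per_stage elements, optionally after a base offset.
    return base + (k % num_stages) * elems_per_stage

# Mirrors the shape of the offsets quoted above, e.g. (k%3)*8192
# for A and (k%3)*2048+49152 for the E metadata buffer.
assert stage_offset(0, 3, 8192) == 0
assert stage_offset(4, 3, 8192) == 8192
assert stage_offset(5, 3, 2048, base=49152) == 53248
```

A whole-buffer `access_ptr` discards this per-stage offset, which is why every iteration previously read the stage-0 slab.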
Three changes in producer_consumer_ws_tiled.cc:

1. Require num_stages >= 2 for WS candidacy (was >= 1). Single-stage kernels like seer_attention don't need WS, and the transformation produces incorrect results for them.
2. Add a HasTmaPipeline() check to detect TMA kernels with pipeline annotations that are rejected by the full WS candidate check (e.g., kernels with manual layout annotations like gemm_sp).
3. Strip num_stages annotations for rejected TMA pipeline kernels to prevent InjectSoftwarePipeline from generating broken non-WS TMA pipeline code.

One change in gemm_sp.cc:

- Use MakeAccessPtrFromRegion for the A, B, C, and E buffer access pointers instead of whole-buffer access_ptr. This preserves stage-specific region offsets from pipeline multi-versioning.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Only unwrap IfThenElse wrapper when the then-branch is a simple flat sequence of tile-op Evaluate calls. Skip unwrapping for complex bodies with LetStmt, For, or other control flow that could break variable scoping when split into producer/consumer for WS. Fixes variable-used-before-definition error in blocksparse_attention sparse_gqa_decode_varlen_indice, which has a conditional loop body containing LetStmt bindings. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The >= 2 threshold broke test_num_stages_one_pure_tma_keeps_auto_warp_specialize. Pure TMA kernels with num_stages=1 should still be WS candidates. The seer_attention issue (num_stages=1 with manual layout) is handled by the has_manual_layout_ check, not the num_stages threshold. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When LayoutInference encounters an MVB-expanded buffer (one with leading stage dimensions) whose trailing dimensions match the original layout, use Layout::Expand to propagate the manual layout instead of rejecting or skipping the buffer. This applies to all three layout propagation paths: the annotated layout map, alias propagation, and finalization.

Also remove the blanket !has_manual_layout_ WS candidate rejection, since manual layouts now survive onto versioned shared buffers via Layout::Expand.

Fixes test_sparse_ws_regular_metadata_copy_stays_in_producer.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
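The trailing-dimension match behind the Layout::Expand propagation can be sketched as follows (a shape-level simplification with hypothetical names; the real Layout::Expand also rewrites the index map): a manual layout carries over to an MVB-expanded buffer only when the buffer's trailing dims equal the layout's shape, in which case the leading stage dims are prepended.

```python
def expand_layout(layout_shape, buffer_shape):
    """If buffer_shape's trailing dims match layout_shape, return
    the expanded shape with the leading stage dims prepended;
    otherwise signal a mismatch with None."""
    n_lead = len(buffer_shape) - len(layout_shape)
    if n_lead < 0 or tuple(buffer_shape[n_lead:]) != tuple(layout_shape):
        return None
    return tuple(buffer_shape[:n_lead]) + tuple(layout_shape)

# A (64, 16) metadata layout survives onto the 3-stage versioned buffer.
assert expand_layout((64, 16), (3, 64, 16)) == (3, 64, 16)
# Mismatched trailing dims are rejected, so the buffer falls back
# to ordinary layout inference.
assert expand_layout((64, 16), (3, 64, 32)) is None
```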
has_manual_layout_ guard

1. LetStmt chain peeling: when the IfThenElse then_case starts with LetStmt bindings, peel them and append them to let_bindings before checking the simple-body guard. This allows WS for conditional bodies with variable definitions (e.g., sparse attention patterns).
2. Restore !has_manual_layout_ in the WS candidacy check: removing it caused dequant_groupedgemm_bf16_mxfp4_hopper to fail because MXFP4 layouts don't survive MVB expansion. The Layout::Expand fix handles sparse metadata layouts but not all manual layout types.
3. The Layout::Expand propagation (from Round 6) remains in place for future use, once MVB learns to handle all manual layout types.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
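The LetStmt peeling in point 1 amounts to walking off the leading chain of let-bindings before inspecting the body. A minimal sketch, using a made-up tuple encoding of IR nodes rather than the real TIR classes:

```python
def peel_lets(stmt, bindings):
    """Peel a chain of ('let', var, value, body) nodes off the front
    of a then-branch, accumulating the bindings, and return the
    remaining body (hypothetical IR encoding)."""
    while isinstance(stmt, tuple) and stmt and stmt[0] == "let":
        _, var, value, body = stmt
        bindings.append((var, value))
        stmt = body
    return stmt

bindings = []
then_case = ("let", "x", 1, ("let", "y", 2, ("seq", ["copy", "gemm"])))
body = peel_lets(then_case, bindings)
assert bindings == [("x", 1), ("y", 2)]
assert body == ("seq", ["copy", "gemm"])
```

The accumulated bindings are re-applied around both the producer and consumer branches later, which is how variable scoping survives the WS split.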
The simple-body guard only accepted flat Evaluate sequences inside IfThenElse, blocking legitimate WS for complex conditional bodies like sparse flash attention (T.clear, T.reduce_max, T.gemm inside the guard). The LetStmt peeling already handles variable scoping.

Fixes test_pure_tma_consumer_local_init_does_not_leak_into_producer and test_sparse_ws_regular_metadata_copy_stays_in_producer.

The remaining test_mixed_tma_cp_async_shared_stage_barriers failure is a pre-existing issue on the original branch: the tiled WS pass produces SIMT copies instead of cp.async because LowerPTXAsyncCopy is not in the pass pipeline (the comment at phase.py:308 says it runs earlier, but no actual call exists on either branch or the reference).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wrap the WS producer loop body in a kPipelineContextNumStages AttrStmt so that LowerTileOp's pipelined_depth_ is > 0 when processing SIMT producer copies. This enables InjectPTXAsyncCopy to generate cp.async for global-to-shared copies in the WS producer branch. Without this, the WS rewriter strips all pipeline annotations from the rewritten loops, causing LowerTileOp to skip cp.async injection for SIMT producers. The consumer loop stays annotation-free since it doesn't need cp.async. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…out guard

1. Mixed TMA+cp.async barrier: use ptx_cp_async_barrier_noinc for forward barrier arrival in mixed producer groups, matching the reference producer_consumer_ws.cc protocol.
2. Consumer-only pre-loop init sinking: in ReplacePipelineLoopInStmt, treat pre-loop siblings as consumer-only when they are not classified as producer (TMA/SIMT/cp.async). Fragment init (T.fill, T.clear) and local buffer init are placed in the consumer branch instead of the shared prelude.
3. Restored the blanket has_manual_layout_ guard: the dtype-based heuristic to distinguish sparse metadata from MXFP4 layouts doesn't work because both use uint8. dequant_groupedgemm_bf16_mxfp4_hopper requires the guard to prevent broken WS.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… barrier

1. Consumer-init sinking: only sink pre-loop statements that are FillNode writes to fragment/local buffers. Keep block_mask setup and shared state in the shared prelude.
2. Manual layout: attempt a targeted check that only rejects when TMA copy destinations match layout_map entries. Collect layout_map vars and compare them against TMA copy destinations.
3. Per-group cp.async barrier: use a group-level cp.async flag (a single group for now) instead of a function-wide boolean.

The three WS issue tests still fail because the layout_map annotation parsing falls through to the conservative rejection path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The layout_map annotation uses Map<Var, Layout> before LayoutInference (not Map<Buffer, Layout>). Parse it as Map<ObjectRef, ObjectRef> and handle both key types: Buffer (post-inference) and Var (pre-inference). For Var keys, look up the corresponding alloc_buffer by matching the data Var.

Compare the collected manual-layout buffers against pipeline copy destinations: reject only when a manually-laid-out buffer is also a producer copy target (TMA/SIMT/cp.async) inside the pipeline. This allows sparse metadata (E_shared, SIMT-copied) onto the WS path while rejecting dequant MXFP4 (B_shared, SIMT-copied with swizzle).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
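The two-key-kind resolution can be sketched like this (a deliberately simplified model where buffers are plain dicts with hypothetical `name`/`data` fields, standing in for Buffer objects and their data Vars):

```python
def resolve_layout_buffers(layout_map_keys, alloc_buffers):
    """Resolve layout_map keys that may be either buffer handles
    (post-inference) or data Vars (pre-inference) to buffer names,
    matching Var keys against each alloc_buffer's data Var."""
    by_name = {b["name"]: b for b in alloc_buffers}
    by_data = {b["data"]: b for b in alloc_buffers}
    resolved = []
    for key in layout_map_keys:
        buf = by_name.get(key) or by_data.get(key)
        if buf is not None:
            resolved.append(buf["name"])
    return resolved

bufs = [{"name": "E_shared", "data": "E_var"},
        {"name": "B_shared", "data": "B_var"}]
# Mixed key kinds: one buffer name, one data Var.
assert resolve_layout_buffers(["E_shared", "B_var"], bufs) == ["E_shared", "B_shared"]
```

The resolved set is then intersected with the pipeline's producer copy destinations to decide WS candidacy, as described above.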
1. Fixed a SEGFAULT: removed the dangling layout_map_layouts_ vector that was never populated, causing out-of-bounds access. Buffer+Layout pairs are now stored in layout_map_entries_.
2. Use DetectSwizzleMode to distinguish swizzled layouts (MXFP4, incompatible with MVB) from non-swizzled ones (sparse metadata, safe). Swizzled layouts reject WS candidacy; non-swizzled layouts allow it.
3. Removed a debug LOG(WARNING) from the hot path.
4. Parse layout_map annotation keys as both Buffer and Var (via Map<ObjectRef, ObjectRef>), resolving Var keys to alloc_buffers.

Results: the sparse metadata WS test passes, and dequant passes (protected). Down to 2 failures: the mixed barrier pattern and consumer init sinking.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
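The swizzled/non-swizzled distinction in point 2 can be illustrated with a toy detector (a hypothetical stand-in for DetectSwizzleMode, which actually pattern-matches the layout's index map): a layout counts as swizzled if any element's mapped position differs from its logical position.

```python
def is_swizzled(index_map, rows, cols):
    """Toy heuristic: the layout is 'swizzled' if the index map is
    not the identity anywhere on the tile."""
    return any(index_map(i, j) != (i, j)
               for i in range(rows) for j in range(cols))

identity = lambda i, j: (i, j)
xor_swizzle = lambda i, j: (i, j ^ (i % 8))  # classic XOR bank swizzle

# Non-swizzled layouts (e.g. plain sparse metadata) pass through to
# the WS path; swizzled ones (e.g. MXFP4) reject candidacy, since
# the swizzle does not survive MVB expansion.
assert not is_swizzled(identity, 8, 8)
assert is_swizzled(xor_swizzle, 8, 8)
```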
Sink pre-loop Evaluate nodes classified as kConsumer into consumer branch. Keep For loops (block_mask setup), producer copies, and other control flow in the shared prelude. This is simpler and safer than the FillNode scope check approach. The test_pure_tma_consumer_local_init test still fails because the T.fill statements are at a different structural level than the SeqStmt where the pipeline loop lives. Fixing this requires deeper IR structure analysis. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extract consumer-only pre-loop Evaluate statements (T.fill on fragments) from the shared prelude and prepend them to the consumer branch inside the WS if/else structure. This ensures fragment init like acc_o, logsum, scores_max appears only in the consumer branch, not in the shared prelude or producer branch. Uses a two-pass approach: first ReplacePipelineLoopInStmt extracts consumer-only stmts into extracted_consumer_init_, then the WS body is rebuilt with the extracted stmts prepended to the consumer branch. Fixes test_pure_tma_consumer_local_init_does_not_leak_into_producer. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
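The two-pass extraction above can be sketched with plain strings standing in for IR statements (hypothetical model; the real pass walks TIR nodes): first partition the pre-loop prelude by classification, then rebuild the WS branches with the extracted init prepended to the consumer side.

```python
def split_ws_body(prelude, loop, classify):
    """Partition pre-loop statements into shared vs. consumer-only,
    then build producer/consumer branches with the consumer-only
    init moved inside the consumer branch."""
    shared, consumer_init = [], []
    for stmt in prelude:
        (consumer_init if classify(stmt) == "consumer" else shared).append(stmt)
    producer = [loop + "@producer"]
    consumer = consumer_init + [loop + "@consumer"]
    return shared, producer, consumer

# Fragment fills are consumer-only; block_mask setup stays shared.
classify = lambda s: "consumer" if s.startswith("fill") else "shared"
shared, prod, cons = split_ws_body(
    ["block_mask_setup", "fill acc_o", "fill logsum"],
    "pipeline_loop", classify)
assert shared == ["block_mask_setup"]
assert prod == ["pipeline_loop@producer"]
assert cons == ["fill acc_o", "fill logsum", "pipeline_loop@consumer"]
```

This mirrors the intent of the fix: `acc_o`/`logsum`-style fragment init lands only in the consumer branch, never in the shared prelude or producer branch.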
Member (Author)
@regression-perf
Performance Regression Test Report (triggered by @LeiWang1999)
Results
Artifacts
LeiWang1999 added a commit to LeiWang1999/tilelang that referenced this pull request on Apr 12, 2026
Two unrelated ROCm bugs were latent on main and surfaced on this PR's ROCm CI run. Both predate the GEMM v1 removal work: the previous "green" ROCm run (job 70894846938 on commit 0a34a6a) hit the exact same nine failures but happened to receive exit code 0 from pytest-xdist after maxfail kicked in, masking the failure. Fix both so the ROCm job is actually green, not accidentally green.

Bug 1: HIP threadblock swizzle codegen emits a runtime call instead of a template instantiation
----------------------------------------------------------------------
Introduced in 3ee0988 (tile-ai#2002). codegen_hip.cc rewrote the threadblock_swizzle_pattern attribute handler to take a (func_name, panel_size) tuple but emits panel_size as a runtime argument:

    const dim3 blockIdx = tl::rasterization2DRow(10);

while the template in tl_templates/hip/threadblock_swizzle.h declares it as a template parameter: template <int panel_width> dim3 rasterization2DRow(). hipcc rightly rejects the call with "no matching function". The CUDA counterpart in codegen_cuda.cc already emits tl::func_name<panel_size>(); mirror that in HIP.

Affects test_tilelang_gemm_mfma_intrinsic.py::test_assert_tl_matmul.

Bug 2: MFMA macro generator hardcodes 2D buffer indexing and breaks under pipelined shared buffers
----------------------------------------------------------------------
MatrixCoreIntrinEmitter.ldmatrix_a / ldmatrix_b in intrinsics/mfma_macro_generator.py extract A_base0/A_base1 from the last two region dims but then index A_buf with exactly two indices: A_buf[A_base0 + l + row, A_base1 + r + col]. This works when the user-declared shared buffer is 2D, but pipeline multi-versioning (T.Pipelined(..., num_stages >= 2)) rewrites the shared buffer to carry a leading stage dimension, making A_buf 3D. The access then fails layout inference with "Buffer A_shared is 3-dimensional, cannot be indexed with the 2-dimensional indices provided".

The CUDA counterpart in intrinsics/mma_macro_generator.py handles this correctly by collecting leading base offsets into A_other and indexing as A_buf[tuple(A_other) + (A_base0 + ..., A_base1 + ...)]. Mirror that pattern in the MFMA generator for both ldmatrix_a and ldmatrix_b.

Affects test_block_sparse_matmul_{global,shared,local} and test_tilelang_jit_{callback,gemm_cython}::test_gemm_jit_kernel plus test_cython_kernel_multi_stream.

Why these were masked on main
----------------------------------------------------------------------
.github/workflows/ci.yml runs the ROCm test step with --maxfail=3 --numprocesses=4. With xdist parallelism, four workers can accumulate failures concurrently; the controller eventually raises xdist.dsession.Interrupted: stopping after 3 failures, but pytest's exit code on that interrupt path is non-deterministic. Previous runs got exit 0 (success) while this PR happened to get exit 2 (failure). Fixing the underlying bugs is the only way to make the job reliably green.
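The Bug 2 fix's indexing pattern can be shown in isolation (a sketch of the index-tuple construction the commit describes, not the actual emitter code): all region bases except the last two pass through unchanged, so a leading stage dimension is tolerated automatically.

```python
def make_indices(region_base, row, col):
    """Build buffer indices that tolerate leading stage dims, in the
    style the commit attributes to the CUDA MMA generator: leading
    bases pass through, and the last two get the row/col offsets."""
    *lead, base0, base1 = region_base
    return tuple(lead) + (base0 + row, base1 + col)

# 2D shared buffer: behaviour is unchanged.
assert make_indices((0, 0), 3, 5) == (3, 5)
# 3D pipelined buffer with a leading stage index (k % num_stages):
# the stage index is preserved instead of being dropped.
assert make_indices((2, 0, 0), 3, 5) == (2, 3, 5)
```

Hardcoding exactly two indices, as the broken MFMA path did, would fail on the second case with the dimensionality error quoted above.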
This was referenced Apr 15, 2026
Summary
Testing
./format.sh
cmake --build build -j$(nproc)
python -m pytest -q testing/python/transform/test_tilelang_transform_Inject_software_pipeline.py testing/python/transform/test_tilelang_transform_pipeline_planning.py testing/python/transform/test_tilelang_transform_producer_consumer_ws_tiled.py
python -m pytest -q testing/python/issue/test_tilelang_issue_tma_no_ws.py -k mixed_tma_cp_async_shared_stage_barriers
python -m pytest -q testing/python/issue/test_tilelang_issue_ws_simt_copy_full_producer_extent.py

Summary by CodeRabbit
New Features
Performance Improvements
Bug Fixes
Tests