
support more than 2 warp groups#9

Open
AutumnKite wants to merge 144 commits into silentCoder-dev:auto-schedule from AutumnKite:auto-schedule

Conversation


@AutumnKite AutumnKite commented Apr 8, 2026

Summary by CodeRabbit

  • New Features

    • Added NVIDIA Blackwell support alongside Hopper in warp-specialized auto-scheduling.
    • Example GEMM now relies on default workgroup coordination semantics.
  • Improvements

    • Generalized warpgroup partitioning to support arbitrary warp counts.
    • Improved thread-count, shared-memory, barrier/synchronization handling for more robust coordination.
    • Better detection and propagation of multi-buffering eligibility for improved buffering performance.
  • Bug Fixes

    • Benchmark script no longer runs the manual flash attention variant (removed from default runs).

LeiWang1999 and others added 22 commits March 30, 2026 15:56
* Fix tcgen05 barrier allocation planning regression

* Add explicit TMEM deallocation and shared transpose copy

* Add transpose operation to documentation and update implementation in copy_op.py

* LINT FIX

* Refactor: extract transpose from CopyNode into standalone TransposeNode op

Remove the transpose annotation logic from CopyNode (GetTranspose,
MakeIndices transpose branch, MakePredicate transpose branch, and
GetCopyInst early return). Transpose is now handled by the independent
TransposeNode registered as tl.tileop.transpose.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Refactor CopyNode to remove transpose handling from index generation and predicate creation. Simplify MakeIndices and MakePredicate methods by eliminating unnecessary transpose checks and related logic. Update associated checks to ensure index consistency. Clean up unused GetTranspose method in copy.h.

* lint fix

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…st/ir.py` (tile-ai#1996)

* Initial plan

* Add annotations parameter support to alloc_buffer in tilelang/language/ast/ir.py

Agent-Logs-Url: https://github.com/tile-ai/tilelang/sessions/17577985-06fa-4b35-b714-185004d91524

Co-authored-by: LeiWang1999 <34334180+LeiWang1999@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: LeiWang1999 <34334180+LeiWang1999@users.noreply.github.com>
…ile-ai#1994)

* [Bugfix] Raise error on zero grid dimension instead of silent clamp (tile-ai#1993)

Fix ThreadWorkLoad::Extract() silently clamping zero grid dims to 1,
which caused either CUDA_ERROR_ILLEGAL_ADDRESS crashes (dynamic case)
or silent wrong results (static case).

Closes tile-ai#1993

* lint fix
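The zero-grid-dimension fix above can be sketched as a simple guard; `extract_grid_dims` is a hypothetical stand-in for the C++ `ThreadWorkLoad::Extract()` path, showing the behavior change (raise instead of clamp):

```python
def extract_grid_dims(dims):
    """Validate launch grid dimensions (illustrative sketch of the
    tile-ai#1993 fix): raise on a zero dimension instead of silently
    clamping it to 1, which previously hid CUDA_ERROR_ILLEGAL_ADDRESS
    crashes (dynamic case) or silent wrong results (static case)."""
    checked = []
    for i, d in enumerate(dims):
        if d == 0:
            raise ValueError(f"grid dimension {i} is zero; refusing to clamp to 1")
        checked.append(d)
    return checked
```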
…#1995)

* add check for kRows when allocating tmem

* Add FirstOwningBlockCollector to map allocated buffers to their declaring blocks

This change introduces the FirstOwningBlockCollector class, which collects the first Block that declares each allocated buffer variable. The BufferAllocationLocator is updated to utilize this new collector, ensuring that shared barriers remain associated with their declaring blocks during buffer allocation. This enhancement improves the management of buffer allocations and addresses potential issues with opaque child blocks in pipelined For loops.

* replace `T.gemm` with `T.tcgen05_gemm` in fa sm100 examples

* lint

---------

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
…he order of blockIdx (tile-ai#1980)

* Add TCGEN05 thread synchronization fences in GEMM examples and builtins

- Introduced `tcgen05_before_thread_sync` and `tcgen05_after_thread_sync` builtins to manage thread synchronization in TCGEN05 operations.
- Updated `gemm` and `gemm_2cta` examples to include synchronization calls before and after thread barriers, ensuring correct execution order.
- Adjusted kernel launch parameters in GEMM examples for consistency in block dimensions.
- Enhanced memory copy operations to align with the new synchronization logic, improving performance and correctness.

* lint
* [Refactor] Refactor CUDA atomic helpers

* update
…tile-ai#1972)

* [Bugfix] Fix CuTeDSL autotune cache saving .py as .so (tile-ai#1967)

The autotune cache had no CuTeDSL-specific branch, causing it to save
the Python source file (kernel.py) as kernel_lib.so. On reload,
importlib treated the .so extension as a native extension module and
failed with "invalid ELF header".

Fix: add cutedsl branches in _save_kernel_to_disk and
_load_kernel_from_disk to use KERNEL_PY_PATH ("kernel.py") instead of
KERNEL_LIB_PATH ("kernel_lib.so"). Also saves launcher .so and cubin
artifacts when present.

Closes tile-ai#1967

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
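The cache-path bug above comes down to picking the artifact name per backend. A minimal sketch, assuming a helper like the following (the cache-directory layout and function name are illustrative; only `KERNEL_PY_PATH` and `KERNEL_LIB_PATH` come from the commit message):

```python
import os

KERNEL_PY_PATH = "kernel.py"
KERNEL_LIB_PATH = "kernel_lib.so"

def kernel_artifact_path(cache_dir, backend):
    """Pick the on-disk kernel artifact name by backend. CuTeDSL kernels
    are Python source, so saving them under a .so name makes importlib
    treat them as native extension modules on reload ("invalid ELF
    header"); the cutedsl branch must use the .py path instead."""
    name = KERNEL_PY_PATH if backend == "cutedsl" else KERNEL_LIB_PATH
    return os.path.join(cache_dir, name)
```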

* [BugFix] Change _all_dtypes from set to list for deterministic order

set has non-deterministic iteration order across processes, causing
pytest-xdist workers to collect test parameters in different orders
and fail with "Different tests were collected between gw3 and gw2".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…ruction (tile-ai#2004)

* fix copy+cast vectorize loop to use wider vector load/store instruction

* clean test

* fix test

* fix format

* test fix

---------

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
…and out_idx as PrimFunc attrs (tile-ai#2006)

Allow configuring pass configs, compile flags, and out_idx directly
inside function bodies using T.annotate_compile_flags(),
T.annotate_pass_configs(), and T.empty()+return. These are stored as
proper PrimFunc attrs (tilelang_compile_flags, tilelang_pass_configs,
tilelang_out_idx) instead of monkey-patching, and merged at compile
time. Annotations can be placed before or after tensor type annotations.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
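The attr-based configuration above implies a merge of function-level and call-site settings at compile time. A minimal sketch of that merge; the commit only says the two sources are merged, so the policy here (call site wins) is an assumption, and `merge_pass_configs` is a hypothetical helper:

```python
def merge_pass_configs(func_attrs, call_site_configs):
    """Merge pass configs recorded as PrimFunc attrs
    (tilelang_pass_configs) with configs supplied at compile time.
    Assumed policy: start from the function's stored attrs, then let
    call-site overrides take precedence."""
    merged = dict(func_attrs.get("tilelang_pass_configs", {}))
    merged.update(call_site_configs or {})
    return merged
```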
…CuTeDSL alloc_global tests (tile-ai#2009)

1. Add CI step to clean stale JIT temp files (/tmp/*.so, *.cu, *.cubin,
   tvm-debug-mode-tempdirs) before tests on self-hosted runners. These
   files accumulate across CI runs and can fill the disk, causing g++
   to be killed (SIGTERM) during JIT compilation.

2. Skip CuTeDSL-incompatible example tests that use alloc_global
   (flash_decoding, deepseek_mla), since the CuTeDSL wrapper does not
   yet support alloc_global buffers.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Add a regression test covering 1D single-dimension tensor TMA copy
(global -> shared -> global) with warp specialization disabled.

The underlying bug was fixed in tile-ai#1840, but the test suite only covered
2D descriptor-based TMA paths. This test ensures the 1D bulk copy path
(cp.async.bulk) also works correctly with proper mbarrier allocation.

Closes tile-ai#1842

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…y instructions (tile-ai#1986)

* fix f32x2 vectorize for wider shape

* clean code

* fix: use string::operator+= to satisfy clang-tidy performance check

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Freebase6912 <amid-gauze-racing@duck.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…nd (tile-ai#2000)

Add CUDA get-last-error handling in the TVM FFI backend to align with the Cython backend
Remove example_dequant_groupedgemm_bf16_mxfp4_hopper references from regression and test files after its deletion. This cleanup ensures that the codebase remains consistent and free of unused imports.
updates:
- [github.com/pre-commit/mirrors-clang-format: v22.1.0 → v22.1.2](pre-commit/mirrors-clang-format@v22.1.0...v22.1.2)
- [github.com/astral-sh/ruff-pre-commit: v0.15.4 → v0.15.9](astral-sh/ruff-pre-commit@v0.15.4...v0.15.9)
- [github.com/codespell-project/codespell: v2.4.1 → v2.4.2](codespell-project/codespell@v2.4.1...v2.4.2)
- [github.com/jackdewinter/pymarkdown: v0.9.35 → v0.9.36](jackdewinter/pymarkdown@v0.9.35...v0.9.36)

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Enhance CUDA vectorization for binary operations by supporting wider tensor dimensions. Update kernel generation to accommodate variable width, improving auto-vectorization for float32 types. Add tests for width-8 scenarios to ensure correct emission of packed operations.
* fuse packed x2 mul-add into fma2

* document packed x2 fma2 fusion rationale

---------

Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>
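The width-8 handling above can be modeled as lane pairing: a width-N float32 vector op is emitted as N/2 packed x2 operations, each a candidate for the fma2 fusion. This is an illustrative model of the lane grouping, not the actual codegen:

```python
def pack_x2(lanes):
    """Group a width-N float32 vector op into f32x2 packed lanes.
    A width-8 op becomes four packed x2 operations, each of which the
    fma2 fusion can then turn into a single packed mul-add."""
    assert len(lanes) % 2 == 0, "packed x2 lowering needs an even lane count"
    return [(lanes[i], lanes[i + 1]) for i in range(0, len(lanes), 2)]
```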
…nd stabilize tiled WS (tile-ai#2002)

* Refactor optimization pipeline in phase.py to streamline barrier handling

- Consolidated the handling of shared barriers and pipeline planning by removing redundant conditional checks.
- Ensured that `LowerSharedBarrier`, `PipelinePlanning`, and `InjectSoftwarePipeline` are consistently applied, enhancing the clarity and efficiency of the optimization process.

This change improves the maintainability of the code while preserving existing functionality.

* Enhance tile operation handling with instruction annotation and refactor gemm.h

- Added a new `InstructionAnnotation` pass to annotate tile operations with their instruction kind before layout inference, improving the optimization pipeline's ability to reason about instruction mixes.
- Refactored `gemm.h` to move the `allowTcgen5Mma` and `allowWgmma` methods under the private section, enhancing code organization and encapsulation.

These changes improve the clarity and maintainability of the code while preserving existing functionality.

* Enhance multi-version buffer handling and introduce tile-level warp specialization

- Updated `multi_version_buffer_rewriter.cc` to improve read/write access detection for tile operations by analyzing `tl.tileop.region` calls, ensuring accurate buffer access tracking.
- Modified `phase.py` to integrate `ProducerConsumerWarpSpecializedTiled` before layout inference, allowing for high-level tile-op IR transformations that enhance producer/consumer splits.
- Added a new `ProducerConsumerWarpSpecializedTiled` function in `__init__.py` to facilitate tile-level warp specialization, improving the optimization pipeline's efficiency.

These changes enhance the handling of multi-version buffers and optimize the transformation process for tiled operations.

* Enable tiled WS for stage-1 pipelines

* Move Hopper pipeline planning before layout inference

* Run pipeline rewriting before layout inference

* Add pipeline refactor WIP: subtree_modified_ guard, debug prints, and plan

- inject_pipeline.cc: guard reads/writes recalculation with subtree_modified_
  flag to prevent local.var buffer promotion to kernel parameters
- phase.py: add temporary debug prints after PipelinePlanning and
  InjectSoftwarePipeline (to be removed during CI fix work)
- example_group_per_split_token_cast_to_fp8.py: add disable_cache for debugging
- docs/plan.md: implementation plan for fixing CI test failures
- draft.md: original design draft

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Remove debug artifacts: prints in phase.py, disable_cache in example

- Remove print("After PipelinePlanning"), print(mod),
  print("After InjectSoftwarePipeline"), print(mod) from
  LowerAndLegalize in tilelang/engine/phase.py
- Remove tilelang.disable_cache() from
  examples/cast/example_group_per_split_token_cast_to_fp8.py

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix LayoutInference crash on multi-versioned pipeline buffers

After InjectSoftwarePipeline, multi-versioned buffers share the same
data Var as the original but have an extra leading dimension (num_stages).
LayoutInference's alias propagation and annotation handling tried to
Reshape layouts between these buffers, which failed because the total
element counts differ.

Guard three Reshape call sites in layout_inference.cc to skip sibling
buffers whose total storage size is incompatible with the source layout.
This lets multi-versioned buffers get their own layout inference instead
of inheriting an incompatible layout from the original buffer.

Fixes compilation failures in dequantize_gemm, GDN, and other kernels
that use software pipelining with shared memory buffers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Strip pipeline annotations on WS fallback for TMA kernels

When ProducerConsumerWarpSpecializedTiled identifies a TMA kernel as a
warp-specialization candidate but the tiled rewriter cannot handle it
(e.g., conditional loop bodies like sparse block masks), the fallback
previously returned the original function with num_stages annotations
intact.  PipelinePlanning and InjectSoftwarePipeline would then generate
non-WS TMA pipeline code with broken barrier phase tracking for
conditional pipeline bodies (barrier waits outside conditionals cause
deadlocks when the condition is false).

Fix: on WS fallback, strip num_stages annotations from pipeline loops
so that the pipeline passes skip the function.  The kernel runs
unpipelined but correctly.

Fixes CUDA_ERROR_LAUNCH_FAILED in blocksparse_gemm and related TMA
kernels with conditional loop bodies.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add guarded phase-counter support to tiled WS pass

Port PhaseCounter and StageExprReplacer from the legacy
ProducerConsumerWarpSpecialized pass into the tiled WS pass to handle
conditional loop bodies (e.g., sparse block masks).

When the pipeline loop body is wrapped in an IfThenElse without else:
1. Unwrap the condition before classifying statements
2. Create separate producer/consumer PhaseCounters (local int32 buffers)
3. Use counter-based stage/parity expressions instead of loop-variable
4. Wrap producer and consumer bodies in the original condition
5. Increment counters at end of each guarded iteration
6. Rewrite shared-buffer stage indices via StageExprReplacer

This ensures barrier parity stays correct when iterations are
conditionally skipped, fixing CUDA_ERROR_LAUNCH_FAILED in
blocksparse_gemm and related TMA kernels with conditional execution.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
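The six steps above can be reduced to one idea: derive barrier phase from a counter that only advances on iterations that actually execute. A small simulation, assuming the parity convention `(counter // num_stages) % 2` (the exact expression in the pass may differ):

```python
def barrier_phases(iteration_mask, num_stages=2):
    """Model why counter-based parity survives conditionally skipped
    iterations. A loop-variable-based phase advances even when the guard
    is false, so producer/consumer parity drifts; a PhaseCounter only
    increments at the end of each guarded iteration that runs."""
    counter = 0
    phases = []
    for executes in iteration_mask:
        if executes:  # body wrapped in the original IfThenElse condition
            phases.append((counter // num_stages) % 2)  # parity from counter
            counter += 1  # incremented at end of each guarded iteration
    return phases
```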

* Add GemmSPNode support to pipeline planning and injection

Add GemmSPNode handling to:
- inject_pipeline.cc AddReadsWritesForTileOp: model A, E, B as reads
  and C as write (E is the sparse metadata buffer)
- pipeline_planning.cc: same access model for dependency analysis

This makes sparse GEMM visible to the pipeline machinery for correct
stage assignment and buffer multi-versioning. However, the tile-op's
consumer-side buffer accesses still don't get stage-indexed because
the pipeline body rewriter can't rewrite high-level tile-op Call
arguments (an architectural limitation of running InjectSoftwarePipeline
before LowerTileOp).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix GemmSPNode::Lower to use MakeAccessPtrFromRegion

Replace whole-buffer access_ptr(1) calls with MakeAccessPtrFromRegion
for A, B, C, and E buffers in sparse GEMM lowering. This preserves
stage-specific region offsets from pipeline multi-versioning, matching
the dense GemmNode::Lower pattern.

CUDA output now shows correct stage-indexed consumer accesses:
  gemm_sp_ss(..., (k%3)*8192, (k%3)*8192+27648, C_local, (k%3)*2048+49152)
instead of always using stage-0 offsets.

Note: gemm_sp still produces incorrect results because the kernel needs
warp specialization but TiledWSCandidate::Check doesn't recognize it
as a TMA candidate. The non-WS pipeline path generates structurally
different code from the reference WS path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
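The stage-indexed addressing in the CUDA output above follows one pattern: each shared operand is offset by `(k % num_stages) * per_stage_size` instead of the fixed stage-0 offset. A sketch reproducing the offsets from the example line (the byte sizes are taken from that line, not from the real kernel):

```python
def stage_offsets(k, num_stages=3, a_bytes=8192, e_off=27648, c_base=49152, e_bytes=2048):
    """Compute the stage-indexed operand offsets shown in the commit's
    gemm_sp_ss example. Multi-versioned shared buffers hold num_stages
    copies, so iteration k addresses copy k % num_stages."""
    s = k % num_stages
    a = s * a_bytes            # A operand, stage-indexed
    b = s * a_bytes + e_off    # B operand, offset past A within the stage
    e = s * e_bytes + c_base   # sparse metadata E, after the C region
    return a, b, e
```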

* Fix gemm_sp and seer_attention regressions

Three changes in producer_consumer_ws_tiled.cc:
1. Require num_stages >= 2 for WS candidacy (was >= 1). Single-stage
   kernels like seer_attention don't need WS and the transformation
   produces incorrect results for them.
2. Add HasTmaPipeline() check to detect TMA kernels with pipeline
   annotations that are rejected by the full WS candidate check (e.g.,
   kernels with manual layout annotations like gemm_sp).
3. Strip num_stages annotations for rejected TMA pipeline kernels to
   prevent InjectSoftwarePipeline from generating broken non-WS TMA
   pipeline code.

One change in gemm_sp.cc:
- Use MakeAccessPtrFromRegion for A, B, C, E buffer access pointers
  instead of whole-buffer access_ptr. This preserves stage-specific
  region offsets from pipeline multi-versioning.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix WS conditional body unwrapping for complex loop bodies

Only unwrap IfThenElse wrapper when the then-branch is a simple flat
sequence of tile-op Evaluate calls. Skip unwrapping for complex bodies
with LetStmt, For, or other control flow that could break variable
scoping when split into producer/consumer for WS.

Fixes variable-used-before-definition error in
blocksparse_attention sparse_gqa_decode_varlen_indice, which has a
conditional loop body containing LetStmt bindings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Revert num_stages >= 2 to >= 1 for WS candidacy

The >= 2 threshold broke test_num_stages_one_pure_tma_keeps_auto_warp_specialize.
Pure TMA kernels with num_stages=1 should still be WS candidates.
The seer_attention issue (num_stages=1 with manual layout) is handled
by the has_manual_layout_ check, not the num_stages threshold.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Propagate manual layouts onto MVB-expanded buffers via Layout::Expand

When LayoutInference encounters an MVB-expanded buffer (with leading
stage dimensions) whose trailing dimensions match the original layout,
use Layout::Expand to propagate the manual layout instead of rejecting
or skipping the buffer. Applied to all 3 layout propagation paths:
annotated layout map, alias propagation, and finalization.

Also remove the blanket !has_manual_layout_ WS candidate rejection
since manual layouts now survive onto versioned shared buffers via
Layout::Expand.

Fixes test_sparse_ws_regular_metadata_copy_stays_in_producer.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
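The shape condition behind the Layout::Expand propagation above can be sketched on plain shape lists (the real pass operates on Layout objects; this only models the compatibility check and the prepended stage dimensions):

```python
def expand_layout(layout_shape, buffer_shape):
    """Sketch of the propagation rule: if an MVB-expanded buffer's
    trailing dimensions match the original layout's shape, lift the
    manual layout by prepending the leading stage dimensions; otherwise
    skip propagation (None) and let the buffer get its own inference."""
    n = len(layout_shape)
    if len(buffer_shape) < n or list(buffer_shape[-n:]) != list(layout_shape):
        return None  # incompatible total size: do not Reshape/Expand
    return list(buffer_shape[:-n]) + list(layout_shape)
```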

* Extend conditional body unwrapping for LetStmt chains and restore
has_manual_layout_ guard

1. LetStmt chain peeling: when IfThenElse then_case starts with
   LetStmt bindings, peel them and append to let_bindings before
   checking the simple-body guard. This allows WS for conditional
   bodies with variable definitions (e.g., sparse attention patterns).

2. Restore !has_manual_layout_ in WS candidacy check: removing it
   caused dequant_groupedgemm_bf16_mxfp4_hopper to fail because
   MXFP4 layouts don't survive MVB expansion. The Layout::Expand fix
   handles sparse metadata layouts but not all manual layout types.

3. Layout::Expand propagation (from Round 6) remains in place for
   future use when MVB learns to handle all manual layout types.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Remove is_simple_body guard from conditional WS unwrapping

The simple-body guard only accepted flat Evaluate sequences inside
IfThenElse, blocking legitimate WS for complex conditional bodies
like sparse flash attention (T.clear, T.reduce_max, T.gemm inside
the guard). The LetStmt peeling already handles variable scoping.

Fixes test_pure_tma_consumer_local_init_does_not_leak_into_producer
and test_sparse_ws_regular_metadata_copy_stays_in_producer.

The remaining test_mixed_tma_cp_async_shared_stage_barriers failure
is a pre-existing issue on the original branch: the tiled WS pass
produces SIMT copies instead of cp.async because LowerPTXAsyncCopy
is not in the pass pipeline (the comment at phase.py:308 says it
runs earlier but no actual call exists on either branch or reference).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Preserve pipeline context for WS producer cp.async injection

Wrap the WS producer loop body in a kPipelineContextNumStages AttrStmt
so that LowerTileOp's pipelined_depth_ is > 0 when processing SIMT
producer copies. This enables InjectPTXAsyncCopy to generate cp.async
for global-to-shared copies in the WS producer branch.

Without this, the WS rewriter strips all pipeline annotations from
the rewritten loops, causing LowerTileOp to skip cp.async injection
for SIMT producers. The consumer loop stays annotation-free since it
doesn't need cp.async.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix mixed barrier protocol, consumer-init sinking, restore manual-layout guard

1. Mixed TMA+cp.async barrier: use ptx_cp_async_barrier_noinc for
   forward barrier arrival in mixed producer groups, matching the
   reference producer_consumer_ws.cc protocol.

2. Consumer-only pre-loop init sinking: in ReplacePipelineLoopInStmt,
   guard pre-loop siblings as consumer-only when they're not classified
   as producer (TMA/SIMT/cp.async). Fragment init (T.fill, T.clear)
   and local buffer init are placed in the consumer branch instead of
   the shared prelude.

3. Restored blanket has_manual_layout_ guard: the dtype-based heuristic
   to distinguish sparse metadata from MXFP4 layouts doesn't work
   because both use uint8. Dequant_groupedgemm_bf16_mxfp4_hopper
   requires the guard to prevent broken WS.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Narrow consumer-init sinking, targeted manual-layout check, per-group barrier

1. Consumer-init sinking: only sink pre-loop stmts that are FillNode
   writing to fragment/local buffers. Keep block_mask setup and shared
   state in the shared prelude.

2. Manual-layout: attempt targeted check that only rejects when TMA
   copy destinations match layout_map entries. Collect layout_map vars
   and compare against TMA copy destinations.

3. Per-group cp.async barrier: use group-level cp.async flag (single
   group for now) instead of function-wide boolean.

The 3 WS issue tests still fail because the layout_map annotation
parsing falls through to the conservative rejection path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix layout_map parsing with Var→Buffer mapping for manual-layout check

The layout_map annotation uses Map<Var, Layout> before LayoutInference
(not Map<Buffer, Layout>). Parse as Map<ObjectRef, ObjectRef> and
handle both key types: Buffer (post-inference) and Var (pre-inference).
For Var keys, look up the corresponding alloc_buffer by data Var match.

Compare collected manual-layout buffers against pipeline copy
destinations: reject only when a manually-laid-out buffer is also
a producer copy target (TMA/SIMT/cp.async) inside the pipeline.
This allows sparse metadata (E_shared, SIMT-copied) onto the WS path
while rejecting dequant MXFP4 (B_shared, SIMT-copied with swizzle).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix SEGFAULT in manual-layout check, use DetectSwizzleMode

1. Fixed SEGFAULT: removed dangling layout_map_layouts_ vector that
   was never populated, causing OOB access. Now stores Buffer+Layout
   pairs in layout_map_entries_.

2. Use DetectSwizzleMode to distinguish swizzled layouts (MXFP4,
   incompatible with MVB) from non-swizzled (sparse metadata, safe).
   Swizzled layouts reject WS candidacy; non-swizzled layouts allow it.

3. Removed debug LOG(WARNING) from hot path.

4. Parse layout_map annotation keys as both Buffer and Var (via
   Map<ObjectRef, ObjectRef>), resolving Var keys to alloc_buffers.

Results: sparse metadata WS test PASSES, dequant PASSES (protected).
Down to 2 failures: mixed barrier pattern + consumer init sinking.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Simplify consumer-init sinking to Evaluate-only classification

Sink pre-loop Evaluate nodes classified as kConsumer into consumer
branch. Keep For loops (block_mask setup), producer copies, and other
control flow in the shared prelude. This is simpler and safer than
the FillNode scope check approach.

The test_pure_tma_consumer_local_init test still fails because the
T.fill statements are at a different structural level than the
SeqStmt where the pipeline loop lives. Fixing this requires deeper
IR structure analysis.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Sink consumer-only pre-loop init into WS consumer branch

Extract consumer-only pre-loop Evaluate statements (T.fill on
fragments) from the shared prelude and prepend them to the consumer
branch inside the WS if/else structure. This ensures fragment init
like acc_o, logsum, scores_max appears only in the consumer branch,
not in the shared prelude or producer branch.

Uses a two-pass approach: first ReplacePipelineLoopInStmt extracts
consumer-only stmts into extracted_consumer_init_, then the WS body
is rebuilt with the extracted stmts prepended to the consumer branch.

Fixes test_pure_tma_consumer_local_init_does_not_leak_into_producer.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Finalize pipeline refactor fixes

* Refactor CopyNode stride checks for TMA bulk load/store

- Extracted the global stride validation logic into a new static method `CheckGlobalStrides` for better readability and reusability.
- Updated `CheckBulkLoad` and `CheckBulkStore` methods to utilize the new stride checking function, improving code clarity and maintainability.
- Enhanced documentation for the new method to clarify its purpose and requirements.

* Remove redundant copy direction notes from documentation in `copy.h` for stride checks. This simplifies the comments while maintaining clarity on TMA requirements.

* refactor

* layout related fix

* refactor mbarrier with software pipeline.

* remove legacy pass

* Enhance TMA barrier handling and merging logic

- Updated `CopyToTmaCopyRewriter` to conditionally emit arrive barriers based on the last TMA copy.
- Refactored barrier creation to allow merging of TMA barriers when conditions are met, reducing the number of barriers needed.
- Adjusted consumer and producer logic to accommodate merged barriers, ensuring correct synchronization across TMA copies.
- Improved documentation and comments for clarity on the new barrier merging behavior.

* Update example_mhc_post.py to disable main execution and add test function call

- Commented out the main function call in `example_mhc_post.py`.
- Added a call to `tilelang.disable_cache()` and a new `test(n=4096, h=2560)` function for testing purposes.

* Enhance pipeline barrier handling and TMA copy detection

- Updated `RewritePipelineTmaBarriers` to accept additional parameters for loop variable and stage count, improving barrier synchronization.
- Modified `HandleTileOp` to ensure only user-defined T.copy patterns are marked for pipeline barriers, preventing interference with T.tma_copy.
- Adjusted test cases to enforce specific CUDA compute version requirements for better compatibility and consistency.

* lint fix

* Enhance convolution example and TMA handling

- Added kernel source printing in `example_convolution.py` for debugging.
- Disabled the main function call and invoked `run_regression_perf()` to streamline execution.
- Introduced `MakeTmaLeaderCondition` function in `copy.cc` to improve TMA leader-thread condition handling.
- Updated `LowerBulkCopy` and `LowerBulkCopy1D` methods to utilize the new TMA leader condition for better thread management.
- Enhanced `Conv2DIm2ColOpNode` to support barrier annotations, improving synchronization in TMA operations.

* enhance

* Remove example_dequant_groupedgemm_bf16_mxfp4_hopper.py and clean up phase.py by removing unnecessary whitespace. Enhance TMA handling in producer_consumer_ws_tiled.cc to improve barrier synchronization and streamline TMA copy operations.

* fix

* fix

* Eliminate MVB(barrier_only=true) late fixup from OptimizeForTarget

Move pipeline barrier ownership into InjectSoftwarePipeline: create
pipeline_mbar[num_stages] at final expanded size instead of
pipeline_mbar[1] that required late MVB expansion.

Key changes:
- RewritePipelineTmaBarriers creates barriers at num_stages size
- Barrier indices use FloorMod(loop_var - loop_min, num_stages)
- barrier_init has num_stages entries (one per slot, arrive_count=1)
- CopyToTmaCopyRewriter accepts PrimExpr barrier_id (was int)
- phase.py: remove MVB(barrier_only=True), unify both paths under
  PlanAndUpdateBufferAllocationLocation
- Fix pre-existing test expectation for tma_copies annotation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
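The barrier indexing scheme above is a one-liner once the buffer is created at its final size; a sketch, with the FloorMod expression modeled as Python `%`:

```python
def pipeline_barrier_slot(loop_var, loop_min, num_stages):
    """Barrier slot selection from the commit: with pipeline_mbar
    allocated at num_stages slots up front, each pipeline iteration
    addresses slot FloorMod(loop_var - loop_min, num_stages), so no
    late multi-version (MVB) expansion of the barrier is needed."""
    return (loop_var - loop_min) % num_stages
```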

* Fix Codex review issues: num_stages=1 regression, im2col pipeline barriers, test fixes

1. Fix num_stages=1 regression: use tl_pipelined_num_stages annotation
   for barrier sizing instead of max_stage+1. Gate barrier creation on
   pipeline_depth > 1 so num_stages=1 kernels don't get multi-versioned
   pipeline barriers.

2. Extend RewritePipelineTmaBarriers to handle c2d_im2col: annotate
   im2col calls with pipeline barrier in CopyToTmaCopyRewriter. Add
   im2col TMA recognition to PipelinePlanning. Fix im2col Lower() to
   respect emit_arrive annotation from pipeline barriers.

3. Fix test_simple_pipeline: use annotation-checking approach instead of
   structural equality with hardcoded tma_copies annotation that fails
   on non-Hopper targets.

4. Remove dead kPipelineMVBStageExpr/kPipelineMVBParityExpr/
   kPipelineMVBContextNumStages emission from inject_pipeline.cc since
   pipeline barriers are now created at final size.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix depth-1 barrier ownership, add regression tests, remove dead kPipelineMVB*

1. Fix depth-1 ownership gap: always create shared pipeline barrier for
   TMA copies even when pipeline_depth=1. This prevents LowerTileOp from
   allocating separate per-copy internal barriers, keeping num_stages=1
   kernels at pipeline_mbar[1] (single slot).

2. Add checked-in lowering regression tests:
   - non-WS num_stages=3 TMA GEMM → asserts pipeline_mbar[3], no fallback
   - non-WS num_stages=1 TMA GEMM → asserts pipeline_mbar[1], no multi-slot
   - non-WS num_stages=3 im2col → asserts pipeline_mbar[3] feeds tma_load_im2col

3. Remove dead kPipelineMVB* definitions from pipeline_utils.h and
   consumption code from multi_version_buffer_rewriter.cc (stacks,
   explicit parity/version index logic, attr stripping).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix GemmWMMA.lower() signature and autotuner cache backwards compatibility

[P1] Add missing mbar_phase_expr parameter to GemmWMMA.lower() to match
the interface expected by GemmPy.lower() dispatch. Without this, any
RDNA kernel using the WMMA path would fail with TypeError.

[P2] Make out_idx.json loading optional in autotuner load_from_disk().
Older cache directories don't have this file; fallback to
compile_args.out_idx when the file is absent.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add .humanize to gitignore

* Fix HIP swizzle codegen and CPU scalar GEMM region handling

[P2] Update codegen_hip.cc to parse tvm_tuple(device_func, panel_size)
format for threadblock_swizzle_pattern, matching the annotation format
now emitted by T.use_swizzle(). Previously expected StringImmNode which
would ICHECK-fail.

[P2] Fix gemm_scalar.py: clear only the output tile region (not the
whole buffer), and use the last two dimensions from regions to handle
rank>2 buffers with leading singleton dimensions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix HIP codegen to parse tvm_tuple swizzle annotation format

Update codegen_hip.cc threadblock_swizzle_pattern handler to parse
tvm_tuple(device_func, panel_size) format, matching CUDA and cutedsl
codegens. Previously expected StringImmNode which ICHECK-fails with
the current annotation format.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Restore kPipelineMVB* annotations — still needed by WS tiled full MVB

The kPipelineMVBStageExpr/ParityExpr/ContextNumStages annotations and
their MVB consumption code cannot be removed: they are emitted by
EmitImpl for ALL pipelines and consumed by the full MVB call inside
ProducerConsumerWarpSpecializedTiled. Removing them broke non-TMA
pipeline kernels like mhc_post that go through the WS path.

Restores: annotation emission in inject_pipeline.cc, constant
definitions in pipeline_utils.h, and consumption code in
multi_version_buffer_rewriter.cc.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Unify barrier multi-versioning in InjectSoftwarePipeline (Direction A)

Move all pipeline barrier multi-versioning into InjectSoftwarePipeline
via ExpandPipelineBarriers, eliminating the need for late
MVB(barrier_only=true) in OptimizeForTarget.

ExpandPipelineBarriers runs before BuildPipeline and handles:
- ISP-created pipeline_mbar (for non-WS TMA pipelines)
- User-written T.alloc_barrier (for manual WS pipelines like softpipe)

Key design: only barriers with explicit ptx_arrive_barrier calls OR
ISP-created local barriers are expanded. Barriers whose arrival is
managed internally by tile-ops (e.g., tcgen05 MMA arrive) are left
unchanged. This distinguishes pipeline sync barriers from hardware-
managed barriers.

Buffer expansion: barrier[N] -> barrier[N * num_stages]
Index rewriting: barrier[idx] -> barrier[stage_expr * N + idx]
Parity rewriting: user_parity -> (iteration_block + offset) % 2
barrier_init replication: [c0,c1] -> [c0,c1,c0,c1] for num_stages=2

Expanded buffers propagate to outer blocks via pending_buffer_remap_.
barrier_init annotations are replicated in VisitStmt_(BlockNode).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
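The expansion, index-rewriting, parity, and init-replication rules listed above amount to simple index arithmetic. A minimal Python sketch of those rules (illustrative only; `expand_barrier_index`, `rewrite_parity`, and `replicate_barrier_init` are hypothetical names, not the pass's actual identifiers):

```python
def expand_barrier_index(idx, stage_expr, num_barriers):
    """barrier[idx] -> barrier[stage_expr * N + idx], after the buffer has
    been expanded from N slots to N * num_stages slots."""
    return stage_expr * num_barriers + idx

def rewrite_parity(iteration_block, offset):
    """user_parity -> (iteration_block + offset) % 2"""
    return (iteration_block + offset) % 2

def replicate_barrier_init(init_counts, num_stages):
    """barrier_init [c0, c1] -> [c0, c1, c0, c1] for num_stages = 2."""
    return init_counts * num_stages

# With 2 barriers and num_stages = 2, stage 1's copy of barrier 0 lands at
# slot 1 * 2 + 0 = 2.
print(expand_barrier_index(0, 1, 2))          # 2
print(rewrite_parity(3, 1))                   # 0
print(replicate_barrier_init([128, 256], 2))  # [128, 256, 128, 256]
```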

* Fix Python 3.9 compat: replace star expression in index with tuple concat

* fix

* fix

* revert changes

* fix

* enhance pipeline

* enhance pipeline

* implement descriptor reuse pass

* Refactor pipeline management and enhance async copy handling

- Updated `example_mhc_pre.py` to replace the main function call with a test function for better testing flexibility.
- Modified `example_gqa_decode.py` to disable argument parsing and added latency measurement for regression performance.
- Enhanced `copy.cc` and `copy.h` with new checks for pipeline-managed cp.async synchronization and improved async copy handling.
- Updated `inject_pipeline.cc` and `pipeline_planning.cc` to refine the handling of tile operations and global/shared buffer checks.
- Added tests to ensure correct behavior of async pipeline and descriptor allocation reuse.

This commit improves the overall pipeline management and async copy capabilities, ensuring better performance and flexibility in the codebase.

* enhance pipeline

* enhance pipeline

* enhance pipeline

* enhance pipeline

* enhance pipeline

* Revert "enhance pipeline"

This reverts commit c056ded.

* preloop tma handling

* lint fix

* Clean up debug changes in examples

* Refactor PTX async copy injection result

* refactor

* Refactor tile op access region collection

* Clean up pipeline access and async analysis

* Refactor head async wait analysis

* refactor layout expand

* auto tma fix

* Allow auto TMA store for plain copy

* slash sparse test fix

* Unify warp specialization and internalize buffer versioning

* Revert formatting-only changes

* Keep deprecated TL_DISABLE_TMA_LOWER compat

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai Bot commented Apr 8, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Generalizes warpgroup partitioning from a fixed 2-way model to N-way via vectorized thread counts, threads a WarpSpecializeConfig through scheduling APIs, updates barrier and warp-partition signatures/logic, changes ScheduleUnit before/after storage to maps, and removes an explicit wg_wait=-1 from a GEMM example.
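The N-way generalization can be pictured concretely: given a per-warpgroup thread-count vector, each warpgroup owns a contiguous half-open thread range, and its statements are guarded by a predicate on the thread index. The Python below is a hypothetical model of that arithmetic, not the actual C++ pass:

```python
from itertools import accumulate

def warpgroup_ranges(thread_count):
    """Each warpgroup i owns [prefix[i], prefix[i] + thread_count[i])."""
    starts = [0] + list(accumulate(thread_count))[:-1]
    return [(s, s + n) for s, n in zip(starts, thread_count)]

def warpgroup_of(tid, thread_count):
    """Which warpgroup's predicate a given thread index satisfies."""
    for wg, (lo, hi) in enumerate(warpgroup_ranges(thread_count)):
        if lo <= tid < hi:
            return wg
    raise ValueError("tid outside total thread extent")

# Three warpgroups of 128 threads each (vs. the old fixed producer/consumer
# pair): total extent 384, and thread 200 falls in warpgroup 1.
print(warpgroup_ranges([128, 128, 128]))   # [(0, 128), (128, 256), (256, 384)]
print(warpgroup_of(200, [128, 128, 128]))  # 1
```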

Changes

Cohort / File(s) Summary
GEMM Example
examples/gemm_sm100/gemm_auto_tcgen5mma.py
Removed wg_wait=-1 argument from the pipelined T.gemm() call; GEMM now relies on T.gemm default wg-wait behavior.
Public API / Config
src/transform/auto_schedule.h, src/transform/auto_schedule/schedule_builder.h
Introduced WarpSpecializeArch / WarpSpecializeConfig types and GetSharedMemoryLimit(Target); replaced boolean warp-partition flag with config object; updated APIs to accept config + explicit thread-count expressions.
Schedule Builder / Core
src/transform/auto_schedule.cc, src/transform/auto_schedule/schedule_builder.cc
ScheduleUnitBuilder::Build()/NaiveBuild() changed to return std::vector<PrimExpr> (thread counts) instead of bool; builder uses SetWarpSpecializeConfig(config) and accumulates dynamic thread counts for barrier/thread-extent logic.
Warpgroup Partitioning
src/transform/auto_schedule/warpgroup_partition.h, src/transform/auto_schedule/warpgroup_partition.cc
ApplyWarpgroupPartitionToIRStructure() now accepts const std::vector<PrimExpr>& thread_count and removes producer_consumer; partitioning generalized to N warpgroups, per-wg cloned IR structures, per-wg predicates, nested Ifs, and accumulated neutral-sync extents.
Barrier Analysis
src/transform/auto_schedule/barrier.h
Barrier APIs updated to const std::vector<PrimExpr>& thread_count; internal bookkeeping moved from fixed 2-way arrays to vectorized per-wg structures; parity/version indexing and barrier buffer loads adjusted for N-way warpgroups.
IRStructure / ScheduleUnit
src/transform/auto_schedule/ir_structure.h
ScheduleUnit::before/after changed from std::vector<std::vector<Stmt>> to std::map<int,std::vector<Stmt>>; ControlNode adds multi_buffering_buffers; iteration/substitution/logging updated to map-based traversal.
Auxiliary / Hygiene
src/transform/auto_schedule/warpgroup_partition.cc, src/transform/auto_schedule/warpgroup_partition.*
Added <numeric>; inserted return nullptr after LOG(FATAL) paths; tightened constness/structured bindings; added CloneIRStructureChildrenWithWarpgroupFilter helper and updated call sites to new signatures.
Benchmarks
examples/auto_schedule/flashmla_benchmark.py
Removed/commented out the flashattn_manual entry from the benchmark list so it is no longer compiled/benchmarked by default.
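The ScheduleUnit storage change in the table (before/after moving from nested vectors to `std::map<int, std::vector<Stmt>>`) can be illustrated with a small hypothetical sketch: a sparse map keyed by warpgroup id replaces a dense per-warpgroup vector, so only warpgroups that actually carry prologue/epilogue statements get an entry:

```python
# Old shape: before[wg] exists for every warpgroup, mostly empty lists.
before_dense = [["copy_a", "copy_b"], [], [], []]

# New shape: sparse map from warpgroup id -> statements.
before_map = {0: ["copy_a", "copy_b"]}

def stmts_for(wg, before):
    """Map-based lookup: an absent key means 'no statements for this wg'."""
    return before.get(wg, [])

print(stmts_for(0, before_map))  # ['copy_a', 'copy_b']
print(stmts_for(2, before_map))  # []
```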

Sequence Diagram(s)

sequenceDiagram
    participant Scheduler as Scheduler
    participant Builder as ScheduleUnitBuilder
    participant Assigner as AssignWarpgroupIdsGlobal
    participant Partitioner as ApplyWarpgroupPartition
    participant Barrier as AnalyzeAndInsertBarriers
    participant Converter as ConvertIRStructureToStmt

    Scheduler->>Builder: SetWarpSpecializeConfig(config)\nBuild(ir_structure) -> thread_count (vector)
    Builder->>Assigner: AssignWarpgroupIdsGlobal(root, config, thread_count)
    Assigner->>Partitioner: ApplyWarpgroupPartition(root,..., thread_count, config, neutral_barrier)
    Partitioner->>Barrier: AnalyzeAndInsertBarriers(..., thread_count, ...)
    Partitioner->>Converter: ConvertIRStructureToStmt(cloned_subtrees)
    Converter-->>Scheduler: Final Stmt

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Poem

🐇 I hopped from two to many threads tonight,

I stitched a config through loops of gentle light.
I summed each count and nested every gate,
warpgroups bloom and barriers await.
A tiny hop — the schedule feels just right.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 44.44%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check (✅ Passed): check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): the title clearly describes the main objective of the changeset: extending warp group support from a fixed 2-way model to a dynamic N-way, vector-based design across the core scheduling files.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
src/transform/auto_schedule/warpgroup_partition.cc (1)

1225-1230: Tensor core warpgroup range is hardcoded to warpgroup 0.

The code at lines 1229-1230 passes 0 and thread_count[0] as the tensor core warpgroup range, assuming tensor cores are always in warpgroup 0. The TODO comment acknowledges this limitation.

For correctness when supporting more than 2 warpgroups, consider tracking which warpgroup(s) contain tensor core operations and passing the appropriate range(s) dynamically.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/transform/auto_schedule/warpgroup_partition.cc` around lines 1225 - 1230,
The call to InsertBarriersForNeutralSyncWithDependency hardcodes the tensor-core
warpgroup range as 0..thread_count[0] which fails when tensor core ops are not
confined to warpgroup 0; identify which warpgroup indices contain tensor-core
operations (e.g. scan warpgroup metadata/annotations or examine
pro_and_warpgroup_stmt/thread mappings to compute tensor_wg_start and
tensor_wg_end from thread_count[]), then pass those computed start/end indices
instead of literal 0 and thread_count[0] when calling
InsertBarriersForNeutralSyncWithDependency (update the call site using the same
symbols pro_and_warpgroup_stmt, epi_neutral_body, barrier_buffers, barrier_map,
updated_thread_extent, need_shared_barrier_for_epi, need_tmem_barrier_for_epi,
Buffer(), thread_var->var, thread_count).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/transform/auto_schedule/schedule_builder.h`:
- Around line 614-616: Rename the misspelled setter SetWarpSpeicializeConfig to
SetWarpSpecializeConfig in schedule_builder.h (and update its
declaration/definition name reference config_ = config) and update all call
sites to the new name (e.g., replace calls in auto_schedule.cc that use
SetWarpSpeicializeConfig with SetWarpSpecializeConfig); ensure any forward
declarations, headers, or tests referencing the old name are updated as well to
keep the API consistent.

---

Nitpick comments:
In `@src/transform/auto_schedule/warpgroup_partition.cc`:
- Around line 1225-1230: The call to InsertBarriersForNeutralSyncWithDependency
hardcodes the tensor-core warpgroup range as 0..thread_count[0] which fails when
tensor core ops are not confined to warpgroup 0; identify which warpgroup
indices contain tensor-core operations (e.g. scan warpgroup metadata/annotations
or examine pro_and_warpgroup_stmt/thread mappings to compute tensor_wg_start and
tensor_wg_end from thread_count[]), then pass those computed start/end indices
instead of literal 0 and thread_count[0] when calling
InsertBarriersForNeutralSyncWithDependency (update the call site using the same
symbols pro_and_warpgroup_stmt, epi_neutral_body, barrier_buffers, barrier_map,
updated_thread_extent, need_shared_barrier_for_epi, need_tmem_barrier_for_epi,
Buffer(), thread_var->var, thread_count).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 737445ab-24fd-4716-9e9b-c6c498e3c7eb

📥 Commits

Reviewing files that changed from the base of the PR and between 0f8e055 and 0402120.

📒 Files selected for processing (8)
  • examples/gemm_sm100/gemm_auto_tcgen5mma.py
  • src/transform/auto_schedule.cc
  • src/transform/auto_schedule.h
  • src/transform/auto_schedule/barrier.h
  • src/transform/auto_schedule/schedule_builder.cc
  • src/transform/auto_schedule/schedule_builder.h
  • src/transform/auto_schedule/warpgroup_partition.cc
  • src/transform/auto_schedule/warpgroup_partition.h

Comment thread src/transform/auto_schedule/schedule_builder.h Outdated
AutumnKite and others added 5 commits April 8, 2026 17:18
…ile-ai#2021)

Bumps [transformers](https://github.com/huggingface/transformers) from 4.53.0 to 5.0.0rc3.
- [Release notes](https://github.com/huggingface/transformers/releases)
- [Commits](huggingface/transformers@v4.53.0...v5.0.0rc3)

---
updated-dependencies:
- dependency-name: transformers
  dependency-version: 5.0.0rc3
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Fix int64 loop phase dtype handling in LowerTileOp
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/transform/auto_schedule/schedule_builder.h`:
- Around line 32-37: The header declares enum class WarpSpecializeArch : uint8_t
but doesn't directly include <cstdint>, so add a direct include of <cstdint> at
the top of the file to ensure uint8_t is defined (i.e., insert
`#include <cstdint>` in src/transform/auto_schedule/schedule_builder.h before
the WarpSpecializeArch declaration).
- Around line 78-83: Add an explicit precondition check at the start of
ScheduleBuilder::Build to ensure SetThreadVar() has been called (i.e.,
thread_var_ and its domain/extent are initialized) before dereferencing
thread_var_->dom->extent; fail fast with a clear message mentioning
SetThreadVar, then proceed to call ScheduleRecursive and
AssignWarpgroupIdsGlobal as before so AssignWarpgroupIdsGlobal(root.get(),
config_, thread_var_->dom->extent) is safe to use.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 536ceb00-67b7-407f-b724-44111658bb24

📥 Commits

Reviewing files that changed from the base of the PR and between 0402120 and 9a15696.

📒 Files selected for processing (1)
  • src/transform/auto_schedule/schedule_builder.h

Comment on lines +32 to +37
// Warp specialization architecture enum
enum class WarpSpecializeArch : uint8_t {
kHopper = 0,
kBlackwell = 1,
kUnsupported = 2,
};

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cat -n src/transform/auto_schedule/schedule_builder.h | head -60

Repository: silentCoder-dev/tilelang

Length of output: 2072


🏁 Script executed:

# Check for existing cstdint include
rg "#include <cstdint>" src/transform/auto_schedule/schedule_builder.h

Repository: silentCoder-dev/tilelang

Length of output: 50


🏁 Script executed:

# Check all includes at the top of the file
rg "^#include" src/transform/auto_schedule/schedule_builder.h | head -20

Repository: silentCoder-dev/tilelang

Length of output: 511


Add the direct include for uint8_t.

WarpSpecializeArch uses uint8_t as its underlying type, but the header does not directly include <cstdint>. Relying on transitive includes risks breakage if include chains change in unrelated headers.

✏️ Proposed fix
 #include <tvm/tir/stmt_functor.h>

+#include <cstdint>
 #include <map>
 #include <memory>
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/transform/auto_schedule/schedule_builder.h` around lines 32 - 37, The
header declares enum class WarpSpecializeArch : uint8_t but doesn't directly
include <cstdint>, so add a direct include of <cstdint> at the top of the file
to ensure uint8_t is defined (i.e., insert `#include <cstdint>` in
src/transform/auto_schedule/schedule_builder.h before the WarpSpecializeArch
declaration).

Comment thread src/transform/auto_schedule/schedule_builder.h Outdated
AutumnKite and others added 30 commits April 22, 2026 17:56
Fix let statement clone bug and update layout map
Co-authored-by: Copilot <copilot@github.com>
Fix local variable copy and naive IR structure bug
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
Remove redundant let statements and fix barrier issues
Fix buffer analysis and check kernel with barrier formatting
