support more than 2 warp groups #9

AutumnKite wants to merge 144 commits into silentCoder-dev:auto-schedule from
Conversation
* Fix tcgen05 barrier allocation planning regression
* Add explicit TMEM deallocation and shared transpose copy
* Add transpose operation to documentation and update implementation in copy_op.py
* LINT FIX
* Refactor: extract transpose from CopyNode into standalone TransposeNode op

  Remove the transpose annotation logic from CopyNode (GetTranspose, MakeIndices transpose branch, MakePredicate transpose branch, and GetCopyInst early return). Transpose is now handled by the independent TransposeNode registered as tl.tileop.transpose.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Refactor CopyNode to remove transpose handling from index generation and predicate creation. Simplify MakeIndices and MakePredicate methods by eliminating unnecessary transpose checks and related logic. Update associated checks to ensure index consistency. Clean up unused GetTranspose method in copy.h.
* lint fix

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…st/ir.py` (tile-ai#1996)

* Initial plan
* Add annotations parameter support to alloc_buffer in tilelang/language/ast/ir.py

Agent-Logs-Url: https://github.com/tile-ai/tilelang/sessions/17577985-06fa-4b35-b714-185004d91524

Co-authored-by: LeiWang1999 <34334180+LeiWang1999@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: LeiWang1999 <34334180+LeiWang1999@users.noreply.github.com>
…ile-ai#1994)

* [Bugfix] Raise error on zero grid dimension instead of silent clamp (tile-ai#1993)

  Fix ThreadWorkLoad::Extract() silently clamping zero grid dims to 1, which caused either CUDA_ERROR_ILLEGAL_ADDRESS crashes (dynamic case) or silent wrong results (static case).

  Closes tile-ai#1993

* lint fix
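A minimal sketch of the guard described above, in Python with a hypothetical function name (the actual fix lives in the C++ ThreadWorkLoad::Extract()): a zero grid extent is rejected up front instead of being silently clamped to 1, which previously produced illegal-address crashes in the dynamic case or silently wrong results in the static case.

```python
def extract_grid_dim(extent: int) -> int:
    # Hypothetical stand-in for the fixed behavior: fail loudly on a
    # zero (or negative) grid dimension rather than clamping it to 1.
    if extent <= 0:
        raise ValueError(f"grid dimension must be positive, got {extent}")
    return extent
```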
…#1995)

* add check for kRows when allocating tmem
* Add FirstOwningBlockCollector to map allocated buffers to their declaring blocks

  This change introduces the FirstOwningBlockCollector class, which collects the first Block that declares each allocated buffer variable. The BufferAllocationLocator is updated to utilize this new collector, ensuring that shared barriers remain associated with their declaring blocks during buffer allocation. This enhancement improves the management of buffer allocations and addresses potential issues with opaque child blocks in pipelined For loops.

* replace `T.gemm` with `T.tcgen05_gemm` in fa sm100 examples
* lint

---------

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
…he order of blockIdx (tile-ai#1980)

* Add TCGEN05 thread synchronization fences in GEMM examples and builtins

  - Introduced `tcgen05_before_thread_sync` and `tcgen05_after_thread_sync` builtins to manage thread synchronization in TCGEN05 operations.
  - Updated `gemm` and `gemm_2cta` examples to include synchronization calls before and after thread barriers, ensuring correct execution order.
  - Adjusted kernel launch parameters in GEMM examples for consistency in block dimensions.
  - Enhanced memory copy operations to align with the new synchronization logic, improving performance and correctness.

* lint
* [Refactor] Refactor CUDA atomic helpers
* update
…tile-ai#1972)

* [Bugfix] Fix CuTeDSL autotune cache saving .py as .so (tile-ai#1967)

  The autotune cache had no CuTeDSL-specific branch, causing it to save the Python source file (kernel.py) as kernel_lib.so. On reload, importlib treated the .so extension as a native extension module and failed with "invalid ELF header".

  Fix: add cutedsl branches in _save_kernel_to_disk and _load_kernel_from_disk to use KERNEL_PY_PATH ("kernel.py") instead of KERNEL_LIB_PATH ("kernel_lib.so"). Also saves launcher .so and cubin artifacts when present.

  Closes tile-ai#1967

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [BugFix] Change _all_dtypes from set to list for deterministic order

  set has non-deterministic iteration order across processes, causing pytest-xdist workers to collect test parameters in different orders and fail with "Different tests were collected between gw3 and gw2".

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…rcution (tile-ai#2004)

* fix copy+cast vectorize loop to use wider vector load/store instruction
* clean test
* fix test
* fix format
* test fix

---------

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
…and out_idx as PrimFunc attrs (tile-ai#2006) Allow configuring pass configs, compile flags, and out_idx directly inside function bodies using T.annotate_compile_flags(), T.annotate_pass_configs(), and T.empty()+return. These are stored as proper PrimFunc attrs (tilelang_compile_flags, tilelang_pass_configs, tilelang_out_idx) instead of monkey-patching, and merged at compile time. Annotations can be placed before or after tensor type annotations. Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
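An illustrative sketch (not tilelang's actual implementation) of the merge semantics described above: options stored as PrimFunc attrs such as `tilelang_pass_configs` are layered on top of whatever the caller passed at compile time, so per-function annotations take precedence over caller-supplied defaults.

```python
def merge_compile_options(caller_pass_configs: dict, func_attrs: dict) -> dict:
    # Hypothetical helper: start from the caller's configs, then let the
    # function's own annotated pass configs (stored as a PrimFunc attr)
    # override on key collisions.
    merged = dict(caller_pass_configs)
    merged.update(func_attrs.get("tilelang_pass_configs", {}))
    return merged
```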
…CuTeDSL alloc_global tests (tile-ai#2009)

1. Add CI step to clean stale JIT temp files (/tmp/*.so, *.cu, *.cubin, tvm-debug-mode-tempdirs) before tests on self-hosted runners. These files accumulate across CI runs and can fill the disk, causing g++ to be killed (SIGTERM) during JIT compilation.
2. Skip CuTeDSL-incompatible example tests that use alloc_global (flash_decoding, deepseek_mla), since the CuTeDSL wrapper does not yet support alloc_global buffers.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Add a regression test covering 1D single-dimension tensor TMA copy (global -> shared -> global) with warp specialization disabled. The underlying bug was fixed in tile-ai#1840, but the test suite only covered 2D descriptor-based TMA paths. This test ensures the 1D bulk copy path (cp.async.bulk) also works correctly with proper mbarrier allocation. Closes tile-ai#1842 Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…y instructions (tile-ai#1986)

* fix f32x2 vectorize for wider shape
* clean code
* fix: use string::operator+= to satisfy clang-tidy performance check

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Freebase6912 <amid-gauze-racing@duck.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…nd (tile-ai#2000)

add cuda get last error in tvm ffi to align with the Cython backend
Remove example_dequant_groupedgemm_bf16_mxfp4_hopper references from regression and test files after its deletion. This cleanup ensures that the codebase remains consistent and free of unused imports.
updates:
- [github.com/pre-commit/mirrors-clang-format: v22.1.0 → v22.1.2](pre-commit/mirrors-clang-format@v22.1.0...v22.1.2)
- [github.com/astral-sh/ruff-pre-commit: v0.15.4 → v0.15.9](astral-sh/ruff-pre-commit@v0.15.4...v0.15.9)
- [github.com/codespell-project/codespell: v2.4.1 → v2.4.2](codespell-project/codespell@v2.4.1...v2.4.2)
- [github.com/jackdewinter/pymarkdown: v0.9.35 → v0.9.36](jackdewinter/pymarkdown@v0.9.35...v0.9.36)

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Enhance CUDA vectorization for binary operations by supporting wider tensor dimensions. Update kernel generation to accommodate variable width, improving auto-vectorization for float32 types. Add tests for width-8 scenarios to ensure correct emission of packed operations.
* fuse packed x2 mul-add into fma2
* document packed x2 fma2 fusion rationale

---------

Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>
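A conceptual sketch of the fusion above, in Python rather than generated CUDA: two lane-wise multiplies followed by adds on a packed pair collapse into a single fused multiply-add per packed operand, which is what a packed fma2 instruction provides in hardware. The function name is illustrative.

```python
def fma2(a, b, c):
    # One packed fused multiply-add over both lanes of an x2 value:
    # equivalent to separate mul + add per lane, emitted as one instruction.
    return (a[0] * b[0] + c[0], a[1] * b[1] + c[1])
```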
Optimize bitwise reduce test runtime
…nd stabilize tiled WS (tile-ai#2002)

* Refactor optimization pipeline in phase.py to streamline barrier handling

  - Consolidated the handling of shared barriers and pipeline planning by removing redundant conditional checks.
  - Ensured that `LowerSharedBarrier`, `PipelinePlanning`, and `InjectSoftwarePipeline` are consistently applied, enhancing the clarity and efficiency of the optimization process.

  This change improves the maintainability of the code while preserving existing functionality.

* Enhance tile operation handling with instruction annotation and refactor gemm.h

  - Added a new `InstructionAnnotation` pass to annotate tile operations with their instruction kind before layout inference, improving the optimization pipeline's ability to reason about instruction mixes.
  - Refactored `gemm.h` to move the `allowTcgen5Mma` and `allowWgmma` methods under the private section, enhancing code organization and encapsulation.

  These changes improve the clarity and maintainability of the code while preserving existing functionality.

* Enhance multi-version buffer handling and introduce tile-level warp specialization

  - Updated `multi_version_buffer_rewriter.cc` to improve read/write access detection for tile operations by analyzing `tl.tileop.region` calls, ensuring accurate buffer access tracking.
  - Modified `phase.py` to integrate `ProducerConsumerWarpSpecializedTiled` before layout inference, allowing for high-level tile-op IR transformations that enhance producer/consumer splits.
  - Added a new `ProducerConsumerWarpSpecializedTiled` function in `__init__.py` to facilitate tile-level warp specialization, improving the optimization pipeline's efficiency.

  These changes enhance the handling of multi-version buffers and optimize the transformation process for tiled operations.

* Enable tiled WS for stage-1 pipelines
* Move Hopper pipeline planning before layout inference
* Run pipeline rewriting before layout inference
* Add pipeline refactor WIP: subtree_modified_ guard, debug prints, and plan

  - inject_pipeline.cc: guard reads/writes recalculation with subtree_modified_ flag to prevent local.var buffer promotion to kernel parameters
  - phase.py: add temporary debug prints after PipelinePlanning and InjectSoftwarePipeline (to be removed during CI fix work)
  - example_group_per_split_token_cast_to_fp8.py: add disable_cache for debugging
  - docs/plan.md: implementation plan for fixing CI test failures
  - draft.md: original design draft

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Remove debug artifacts: prints in phase.py, disable_cache in example

  - Remove print("After PipelinePlanning"), print(mod), print("After InjectSoftwarePipeline"), print(mod) from LowerAndLegalize in tilelang/engine/phase.py
  - Remove tilelang.disable_cache() from examples/cast/example_group_per_split_token_cast_to_fp8.py

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix LayoutInference crash on multi-versioned pipeline buffers

  After InjectSoftwarePipeline, multi-versioned buffers share the same data Var as the original but have an extra leading dimension (num_stages). LayoutInference's alias propagation and annotation handling tried to Reshape layouts between these buffers, which failed because the total element counts differ. Guard three Reshape call sites in layout_inference.cc to skip sibling buffers whose total storage size is incompatible with the source layout. This lets multi-versioned buffers get their own layout inference instead of inheriting an incompatible layout from the original buffer.

  Fixes compilation failures in dequantize_gemm, GDN, and other kernels that use software pipelining with shared memory buffers.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Strip pipeline annotations on WS fallback for TMA kernels

  When ProducerConsumerWarpSpecializedTiled identifies a TMA kernel as a warp-specialization candidate but the tiled rewriter cannot handle it (e.g., conditional loop bodies like sparse block masks), the fallback previously returned the original function with num_stages annotations intact. PipelinePlanning and InjectSoftwarePipeline would then generate non-WS TMA pipeline code with broken barrier phase tracking for conditional pipeline bodies (barrier waits outside conditionals cause deadlocks when the condition is false).

  Fix: on WS fallback, strip num_stages annotations from pipeline loops so that the pipeline passes skip the function. The kernel runs unpipelined but correctly.

  Fixes CUDA_ERROR_LAUNCH_FAILED in blocksparse_gemm and related TMA kernels with conditional loop bodies.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add guarded phase-counter support to tiled WS pass

  Port PhaseCounter and StageExprReplacer from the legacy ProducerConsumerWarpSpecialized pass into the tiled WS pass to handle conditional loop bodies (e.g., sparse block masks). When the pipeline loop body is wrapped in an IfThenElse without else:

  1. Unwrap the condition before classifying statements
  2. Create separate producer/consumer PhaseCounters (local int32 buffers)
  3. Use counter-based stage/parity expressions instead of loop-variable ones
  4. Wrap producer and consumer bodies in the original condition
  5. Increment counters at end of each guarded iteration
  6. Rewrite shared-buffer stage indices via StageExprReplacer

  This ensures barrier parity stays correct when iterations are conditionally skipped, fixing CUDA_ERROR_LAUNCH_FAILED in blocksparse_gemm and related TMA kernels with conditional execution.
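The guarded phase-counter scheme can be illustrated with a small simulation (plain Python, hypothetical names): barrier parity must flip once per *executed* iteration, so deriving it from the loop variable goes wrong as soon as an iteration is conditionally skipped, while a counter incremented only inside the guard stays in sync.

```python
def parities(mask, use_counter):
    # mask[i] is True when iteration i actually executes its guarded body.
    # Returns the barrier parity observed by each executed iteration.
    out, counter = [], 0
    for i, active in enumerate(mask):
        if active:
            out.append((counter if use_counter else i) % 2)
            counter += 1
    return out

# With iteration 1 skipped, counter-based parity still alternates 0,1,0,
# but loop-variable parity repeats 0,0 for two consecutive executed
# iterations -- a wait on the wrong phase, i.e. a deadlock in hardware.
mask = [True, False, True, True]
```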
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add GemmSPNode support to pipeline planning and injection

  Add GemmSPNode handling to:
  - inject_pipeline.cc AddReadsWritesForTileOp: model A, E, B as reads and C as write (E is the sparse metadata buffer)
  - pipeline_planning.cc: same access model for dependency analysis

  This makes sparse GEMM visible to the pipeline machinery for correct stage assignment and buffer multi-versioning. However, the tile-op's consumer-side buffer accesses still don't get stage-indexed because the pipeline body rewriter can't rewrite high-level tile-op Call arguments (an architectural limitation of running InjectSoftwarePipeline before LowerTileOp).

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix GemmSPNode::Lower to use MakeAccessPtrFromRegion

  Replace whole-buffer access_ptr(1) calls with MakeAccessPtrFromRegion for A, B, C, and E buffers in sparse GEMM lowering. This preserves stage-specific region offsets from pipeline multi-versioning, matching the dense GemmNode::Lower pattern.

  CUDA output now shows correct stage-indexed consumer accesses: gemm_sp_ss(..., (k%3)*8192, (k%3)*8192+27648, C_local, (k%3)*2048+49152) instead of always using stage-0 offsets.

  Note: gemm_sp still produces incorrect results because the kernel needs warp specialization but TiledWSCandidate::Check doesn't recognize it as a TMA candidate. The non-WS pipeline path generates structurally different code from the reference WS path.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix gemm_sp and seer_attention regressions

  Three changes in producer_consumer_ws_tiled.cc:

  1. Require num_stages >= 2 for WS candidacy (was >= 1). Single-stage kernels like seer_attention don't need WS and the transformation produces incorrect results for them.
  2. Add HasTmaPipeline() check to detect TMA kernels with pipeline annotations that are rejected by the full WS candidate check (e.g., kernels with manual layout annotations like gemm_sp).
  3. Strip num_stages annotations for rejected TMA pipeline kernels to prevent InjectSoftwarePipeline from generating broken non-WS TMA pipeline code.

  One change in gemm_sp.cc:
  - Use MakeAccessPtrFromRegion for A, B, C, E buffer access pointers instead of whole-buffer access_ptr. This preserves stage-specific region offsets from pipeline multi-versioning.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix WS conditional body unwrapping for complex loop bodies

  Only unwrap IfThenElse wrapper when the then-branch is a simple flat sequence of tile-op Evaluate calls. Skip unwrapping for complex bodies with LetStmt, For, or other control flow that could break variable scoping when split into producer/consumer for WS.

  Fixes variable-used-before-definition error in blocksparse_attention sparse_gqa_decode_varlen_indice, which has a conditional loop body containing LetStmt bindings.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Revert num_stages >= 2 to >= 1 for WS candidacy

  The >= 2 threshold broke test_num_stages_one_pure_tma_keeps_auto_warp_specialize. Pure TMA kernels with num_stages=1 should still be WS candidates. The seer_attention issue (num_stages=1 with manual layout) is handled by the has_manual_layout_ check, not the num_stages threshold.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Propagate manual layouts onto MVB-expanded buffers via Layout::Expand

  When LayoutInference encounters an MVB-expanded buffer (with leading stage dimensions) whose trailing dimensions match the original layout, use Layout::Expand to propagate the manual layout instead of rejecting or skipping the buffer. Applied to all 3 layout propagation paths: annotated layout map, alias propagation, and finalization.

  Also remove the blanket !has_manual_layout_ WS candidate rejection since manual layouts now survive onto versioned shared buffers via Layout::Expand.

  Fixes test_sparse_ws_regular_metadata_copy_stays_in_producer.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Extend conditional body unwrapping for LetStmt chains and restore has_manual_layout_ guard

  1. LetStmt chain peeling: when IfThenElse then_case starts with LetStmt bindings, peel them and append to let_bindings before checking the simple-body guard. This allows WS for conditional bodies with variable definitions (e.g., sparse attention patterns).
  2. Restore !has_manual_layout_ in WS candidacy check: removing it caused dequant_groupedgemm_bf16_mxfp4_hopper to fail because MXFP4 layouts don't survive MVB expansion. The Layout::Expand fix handles sparse metadata layouts but not all manual layout types.
  3. Layout::Expand propagation (from Round 6) remains in place for future use when MVB learns to handle all manual layout types.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Remove is_simple_body guard from conditional WS unwrapping

  The simple-body guard only accepted flat Evaluate sequences inside IfThenElse, blocking legitimate WS for complex conditional bodies like sparse flash attention (T.clear, T.reduce_max, T.gemm inside the guard). The LetStmt peeling already handles variable scoping.

  Fixes test_pure_tma_consumer_local_init_does_not_leak_into_producer and test_sparse_ws_regular_metadata_copy_stays_in_producer.

  The remaining test_mixed_tma_cp_async_shared_stage_barriers failure is a pre-existing issue on the original branch: the tiled WS pass produces SIMT copies instead of cp.async because LowerPTXAsyncCopy is not in the pass pipeline (the comment at phase.py:308 says it runs earlier but no actual call exists on either branch or reference).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Preserve pipeline context for WS producer cp.async injection

  Wrap the WS producer loop body in a kPipelineContextNumStages AttrStmt so that LowerTileOp's pipelined_depth_ is > 0 when processing SIMT producer copies. This enables InjectPTXAsyncCopy to generate cp.async for global-to-shared copies in the WS producer branch.

  Without this, the WS rewriter strips all pipeline annotations from the rewritten loops, causing LowerTileOp to skip cp.async injection for SIMT producers. The consumer loop stays annotation-free since it doesn't need cp.async.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix mixed barrier protocol, consumer-init sinking, restore manual-layout guard

  1. Mixed TMA+cp.async barrier: use ptx_cp_async_barrier_noinc for forward barrier arrival in mixed producer groups, matching the reference producer_consumer_ws.cc protocol.
  2. Consumer-only pre-loop init sinking: in ReplacePipelineLoopInStmt, guard pre-loop siblings as consumer-only when they're not classified as producer (TMA/SIMT/cp.async). Fragment init (T.fill, T.clear) and local buffer init are placed in the consumer branch instead of the shared prelude.
  3. Restored blanket has_manual_layout_ guard: the dtype-based heuristic to distinguish sparse metadata from MXFP4 layouts doesn't work because both use uint8. Dequant_groupedgemm_bf16_mxfp4_hopper requires the guard to prevent broken WS.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Narrow consumer-init sinking, targeted manual-layout check, per-group barrier

  1. Consumer-init sinking: only sink pre-loop stmts that are FillNode writing to fragment/local buffers. Keep block_mask setup and shared state in the shared prelude.
  2. Manual-layout: attempt targeted check that only rejects when TMA copy destinations match layout_map entries. Collect layout_map vars and compare against TMA copy destinations.
  3. Per-group cp.async barrier: use group-level cp.async flag (single group for now) instead of function-wide boolean.

  The 3 WS issue tests still fail because the layout_map annotation parsing falls through to the conservative rejection path.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix layout_map parsing with Var→Buffer mapping for manual-layout check

  The layout_map annotation uses Map<Var, Layout> before LayoutInference (not Map<Buffer, Layout>). Parse as Map<ObjectRef, ObjectRef> and handle both key types: Buffer (post-inference) and Var (pre-inference). For Var keys, look up the corresponding alloc_buffer by data Var match.

  Compare collected manual-layout buffers against pipeline copy destinations: reject only when a manually-laid-out buffer is also a producer copy target (TMA/SIMT/cp.async) inside the pipeline. This allows sparse metadata (E_shared, SIMT-copied) onto the WS path while rejecting dequant MXFP4 (B_shared, SIMT-copied with swizzle).

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix SEGFAULT in manual-layout check, use DetectSwizzleMode

  1. Fixed SEGFAULT: removed dangling layout_map_layouts_ vector that was never populated, causing OOB access. Now stores Buffer+Layout pairs in layout_map_entries_.
  2. Use DetectSwizzleMode to distinguish swizzled layouts (MXFP4, incompatible with MVB) from non-swizzled (sparse metadata, safe). Swizzled layouts reject WS candidacy; non-swizzled layouts allow it.
  3. Removed debug LOG(WARNING) from hot path.
  4. Parse layout_map annotation keys as both Buffer and Var (via Map<ObjectRef, ObjectRef>), resolving Var keys to alloc_buffers.

  Results: sparse metadata WS test PASSES, dequant PASSES (protected). Down to 2 failures: mixed barrier pattern + consumer init sinking.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Simplify consumer-init sinking to Evaluate-only classification

  Sink pre-loop Evaluate nodes classified as kConsumer into consumer branch. Keep For loops (block_mask setup), producer copies, and other control flow in the shared prelude. This is simpler and safer than the FillNode scope check approach.

  The test_pure_tma_consumer_local_init test still fails because the T.fill statements are at a different structural level than the SeqStmt where the pipeline loop lives. Fixing this requires deeper IR structure analysis.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Sink consumer-only pre-loop init into WS consumer branch

  Extract consumer-only pre-loop Evaluate statements (T.fill on fragments) from the shared prelude and prepend them to the consumer branch inside the WS if/else structure. This ensures fragment init like acc_o, logsum, scores_max appears only in the consumer branch, not in the shared prelude or producer branch.

  Uses a two-pass approach: first ReplacePipelineLoopInStmt extracts consumer-only stmts into extracted_consumer_init_, then the WS body is rebuilt with the extracted stmts prepended to the consumer branch.

  Fixes test_pure_tma_consumer_local_init_does_not_leak_into_producer.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Finalize pipeline refactor fixes
* Refactor CopyNode stride checks for TMA bulk load/store

  - Extracted the global stride validation logic into a new static method `CheckGlobalStrides` for better readability and reusability.
  - Updated `CheckBulkLoad` and `CheckBulkStore` methods to utilize the new stride checking function, improving code clarity and maintainability.
  - Enhanced documentation for the new method to clarify its purpose and requirements.

* Remove redundant copy direction notes from documentation in `copy.h` for stride checks. This simplifies the comments while maintaining clarity on TMA requirements.
* refactor
* layout related fix
* refactor mbarrier with software pipeline.
* remove legacy pass
* Enhance TMA barrier handling and merging logic

  - Updated `CopyToTmaCopyRewriter` to conditionally emit arrive barriers based on the last TMA copy.
  - Refactored barrier creation to allow merging of TMA barriers when conditions are met, reducing the number of barriers needed.
  - Adjusted consumer and producer logic to accommodate merged barriers, ensuring correct synchronization across TMA copies.
  - Improved documentation and comments for clarity on the new barrier merging behavior.

* Update example_mhc_post.py to disable main execution and add test function call

  - Commented out the main function call in `example_mhc_post.py`.
  - Added a call to `tilelang.disable_cache()` and a new `test(n=4096, h=2560)` function for testing purposes.

* Enhance pipeline barrier handling and TMA copy detection

  - Updated `RewritePipelineTmaBarriers` to accept additional parameters for loop variable and stage count, improving barrier synchronization.
  - Modified `HandleTileOp` to ensure only user-defined T.copy patterns are marked for pipeline barriers, preventing interference with T.tma_copy.
  - Adjusted test cases to enforce specific CUDA compute version requirements for better compatibility and consistency.

* lint fix
* Enhance convolution example and TMA handling

  - Added kernel source printing in `example_convolution.py` for debugging.
  - Disabled the main function call and invoked `run_regression_perf()` to streamline execution.
  - Introduced `MakeTmaLeaderCondition` function in `copy.cc` to improve TMA leader-thread condition handling.
  - Updated `LowerBulkCopy` and `LowerBulkCopy1D` methods to utilize the new TMA leader condition for better thread management.
  - Enhanced `Conv2DIm2ColOpNode` to support barrier annotations, improving synchronization in TMA operations.

* enhance
* Remove example_dequant_groupedgemm_bf16_mxfp4_hopper.py and clean up phase.py by removing unnecessary whitespace. Enhance TMA handling in producer_consumer_ws_tiled.cc to improve barrier synchronization and streamline TMA copy operations.
* fix
* fix
* Eliminate MVB(barrier_only=true) late fixup from OptimizeForTarget

  Move pipeline barrier ownership into InjectSoftwarePipeline: create pipeline_mbar[num_stages] at final expanded size instead of pipeline_mbar[1] that required late MVB expansion.

  Key changes:
  - RewritePipelineTmaBarriers creates barriers at num_stages size
  - Barrier indices use FloorMod(loop_var - loop_min, num_stages)
  - barrier_init has num_stages entries (one per slot, arrive_count=1)
  - CopyToTmaCopyRewriter accepts PrimExpr barrier_id (was int)
  - phase.py: remove MVB(barrier_only=True), unify both paths under PlanAndUpdateBufferAllocationLocation
  - Fix pre-existing test expectation for tma_copies annotation

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix Codex review issues: num_stages=1 regression, im2col pipeline barriers, test fixes

  1. Fix num_stages=1 regression: use tl_pipelined_num_stages annotation for barrier sizing instead of max_stage+1. Gate barrier creation on pipeline_depth > 1 so num_stages=1 kernels don't get multi-versioned pipeline barriers.
  2. Extend RewritePipelineTmaBarriers to handle c2d_im2col: annotate im2col calls with pipeline barrier in CopyToTmaCopyRewriter. Add im2col TMA recognition to PipelinePlanning. Fix im2col Lower() to respect emit_arrive annotation from pipeline barriers.
  3. Fix test_simple_pipeline: use annotation-checking approach instead of structural equality with hardcoded tma_copies annotation that fails on non-Hopper targets.
  4. Remove dead kPipelineMVBStageExpr/kPipelineMVBParityExpr/kPipelineMVBContextNumStages emission from inject_pipeline.cc since pipeline barriers are now created at final size.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix depth-1 barrier ownership, add regression tests, remove dead kPipelineMVB*

  1. Fix depth-1 ownership gap: always create shared pipeline barrier for TMA copies even when pipeline_depth=1. This prevents LowerTileOp from allocating separate per-copy internal barriers, keeping num_stages=1 kernels at pipeline_mbar[1] (single slot).
  2. Add checked-in lowering regression tests:
     - non-WS num_stages=3 TMA GEMM → asserts pipeline_mbar[3], no fallback
     - non-WS num_stages=1 TMA GEMM → asserts pipeline_mbar[1], no multi-slot
     - non-WS num_stages=3 im2col → asserts pipeline_mbar[3] feeds tma_load_im2col
  3. Remove dead kPipelineMVB* definitions from pipeline_utils.h and consumption code from multi_version_buffer_rewriter.cc (stacks, explicit parity/version index logic, attr stripping).

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix GemmWMMA.lower() signature and autotuner cache backwards compatibility

  [P1] Add missing mbar_phase_expr parameter to GemmWMMA.lower() to match the interface expected by GemmPy.lower() dispatch. Without this, any RDNA kernel using the WMMA path would fail with TypeError.

  [P2] Make out_idx.json loading optional in autotuner load_from_disk(). Older cache directories don't have this file; fallback to compile_args.out_idx when the file is absent.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add .humanize to gitignore
* Fix HIP swizzle codegen and CPU scalar GEMM region handling

  [P2] Update codegen_hip.cc to parse tvm_tuple(device_func, panel_size) format for threadblock_swizzle_pattern, matching the annotation format now emitted by T.use_swizzle(). Previously expected StringImmNode which would ICHECK-fail.

  [P2] Fix gemm_scalar.py: clear only the output tile region (not the whole buffer), and use the last two dimensions from regions to handle rank>2 buffers with leading singleton dimensions.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix HIP codegen to parse tvm_tuple swizzle annotation format

  Update codegen_hip.cc threadblock_swizzle_pattern handler to parse tvm_tuple(device_func, panel_size) format, matching CUDA and cutedsl codegens. Previously expected StringImmNode which ICHECK-fails with the current annotation format.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Restore kPipelineMVB* annotations — still needed by WS tiled full MVB

  The kPipelineMVBStageExpr/ParityExpr/ContextNumStages annotations and their MVB consumption code cannot be removed: they are emitted by EmitImpl for ALL pipelines and consumed by the full MVB call inside ProducerConsumerWarpSpecializedTiled. Removing them broke non-TMA pipeline kernels like mhc_post that go through the WS path.

  Restores: annotation emission in inject_pipeline.cc, constant definitions in pipeline_utils.h, and consumption code in multi_version_buffer_rewriter.cc.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Unify barrier multi-versioning in InjectSoftwarePipeline (Direction A)

  Move all pipeline barrier multi-versioning into InjectSoftwarePipeline via ExpandPipelineBarriers, eliminating the need for late MVB(barrier_only=true) in OptimizeForTarget.

  ExpandPipelineBarriers runs before BuildPipeline and handles:
  - ISP-created pipeline_mbar (for non-WS TMA pipelines)
  - User-written T.alloc_barrier (for manual WS pipelines like softpipe)

  Key design: only barriers with explicit ptx_arrive_barrier calls OR ISP-created local barriers are expanded. Barriers whose arrival is managed internally by tile-ops (e.g., tcgen05 MMA arrive) are left unchanged. This distinguishes pipeline sync barriers from hardware-managed barriers.

  Buffer expansion: barrier[N] -> barrier[N * num_stages]
  Index rewriting: barrier[idx] -> barrier[stage_expr * N + idx]
  Parity rewriting: user_parity -> (iteration_block + offset) % 2
  barrier_init replication: [c0,c1] -> [c0,c1,c0,c1] for num_stages=2

  Expanded buffers propagate to outer blocks via pending_buffer_remap_. barrier_init annotations are replicated in VisitStmt_(BlockNode).

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix Python 3.9 compat: replace star expression in index with tuple concat
* fix
* fix
* revert changes
* fix
* enhance pipeline
* enhance pipeline
* implement descriptor reuse pass
* Refactor pipeline management and enhance async copy handling

  - Updated `example_mhc_pre.py` to replace the main function call with a test function for better testing flexibility.
  - Modified `example_gqa_decode.py` to disable argument parsing and added latency measurement for regression performance.
  - Enhanced `copy.cc` and `copy.h` with new checks for pipeline-managed cp.async synchronization and improved async copy handling.
  - Updated `inject_pipeline.cc` and `pipeline_planning.cc` to refine the handling of tile operations and global/shared buffer checks.
  - Added tests to ensure correct behavior of async pipeline and descriptor allocation reuse.

  This commit improves the overall pipeline management and async copy capabilities, ensuring better performance and flexibility in the codebase.

* enhance pipeline
* enhance pipeline
* enhance pipeline
* enhance pipeline
* enhance pipeline
* Revert "enhance pipeline"

  This reverts commit c056ded.
* preloop tma handling * lint fix * Clean up debug changes in examples * Refactor PTX async copy injection result * refactor * Refactor tile op access region collection * Clean up pipeline access and async analysis * Refactor head async wait analysis * refactor layout expand * auto tma fix * Allow auto TMA store for plain copy * slash sparse test fix * Unify warp specialization and internalize buffer versioning * Revert formatting-only changes * Keep deprecated TL_DISABLE_TMA_LOWER compat --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough

Generalizes warpgroup partitioning from a fixed 2-way model to N-way via vectorized thread counts, threads a WarpSpecializeConfig through scheduling APIs, updates barrier and warp-partition signatures/logic, changes ScheduleUnit before/after storage to maps, and removes an explicit

Changes
Sequence Diagram(s)

sequenceDiagram
participant Scheduler as Scheduler
participant Builder as ScheduleUnitBuilder
participant Assigner as AssignWarpgroupIdsGlobal
participant Partitioner as ApplyWarpgroupPartition
participant Barrier as AnalyzeAndInsertBarriers
participant Converter as ConvertIRStructureToStmt
    Scheduler->>Builder: SetWarpSpecializeConfig(config)
    Scheduler->>Builder: Build(ir_structure) -> thread_count (vector)
Builder->>Assigner: AssignWarpgroupIdsGlobal(root, config, thread_count)
Assigner->>Partitioner: ApplyWarpgroupPartition(root,..., thread_count, config, neutral_barrier)
Partitioner->>Barrier: AnalyzeAndInsertBarriers(..., thread_count, ...)
Partitioner->>Converter: ConvertIRStructureToStmt(cloned_subtrees)
Converter-->>Scheduler: Final Stmt
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/transform/auto_schedule/warpgroup_partition.cc (1)
1225-1230: Tensor core warpgroup range is hardcoded to warpgroup 0. The code at lines 1229-1230 passes `0` and `thread_count[0]` as the tensor core warpgroup range, assuming tensor cores are always in warpgroup 0. The TODO comment acknowledges this limitation. For correctness when supporting more than 2 warpgroups, consider tracking which warpgroup(s) contain tensor core operations and passing the appropriate range(s) dynamically.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/transform/auto_schedule/warpgroup_partition.cc` around lines 1225 - 1230, The call to InsertBarriersForNeutralSyncWithDependency hardcodes the tensor-core warpgroup range as 0..thread_count[0] which fails when tensor core ops are not confined to warpgroup 0; identify which warpgroup indices contain tensor-core operations (e.g. scan warpgroup metadata/annotations or examine pro_and_warpgroup_stmt/thread mappings to compute tensor_wg_start and tensor_wg_end from thread_count[]), then pass those computed start/end indices instead of literal 0 and thread_count[0] when calling InsertBarriersForNeutralSyncWithDependency (update the call site using the same symbols pro_and_warpgroup_stmt, epi_neutral_body, barrier_buffers, barrier_map, updated_thread_extent, need_shared_barrier_for_epi, need_tmem_barrier_for_epi, Buffer(), thread_var->var, thread_count).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/transform/auto_schedule/schedule_builder.h`:
- Around line 614-616: Rename the misspelled setter SetWarpSpeicializeConfig to
SetWarpSpecializeConfig in schedule_builder.h (and update its
declaration/definition name reference config_ = config) and update all call
sites to the new name (e.g., replace calls in auto_schedule.cc that use
SetWarpSpeicializeConfig with SetWarpSpecializeConfig); ensure any forward
declarations, headers, or tests referencing the old name are updated as well to
keep the API consistent.
---
Nitpick comments:
In `@src/transform/auto_schedule/warpgroup_partition.cc`:
- Around line 1225-1230: The call to InsertBarriersForNeutralSyncWithDependency
hardcodes the tensor-core warpgroup range as 0..thread_count[0] which fails when
tensor core ops are not confined to warpgroup 0; identify which warpgroup
indices contain tensor-core operations (e.g. scan warpgroup metadata/annotations
or examine pro_and_warpgroup_stmt/thread mappings to compute tensor_wg_start and
tensor_wg_end from thread_count[]), then pass those computed start/end indices
instead of literal 0 and thread_count[0] when calling
InsertBarriersForNeutralSyncWithDependency (update the call site using the same
symbols pro_and_warpgroup_stmt, epi_neutral_body, barrier_buffers, barrier_map,
updated_thread_extent, need_shared_barrier_for_epi, need_tmem_barrier_for_epi,
Buffer(), thread_var->var, thread_count).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 737445ab-24fd-4716-9e9b-c6c498e3c7eb
📒 Files selected for processing (8)
- examples/gemm_sm100/gemm_auto_tcgen5mma.py
- src/transform/auto_schedule.cc
- src/transform/auto_schedule.h
- src/transform/auto_schedule/barrier.h
- src/transform/auto_schedule/schedule_builder.cc
- src/transform/auto_schedule/schedule_builder.h
- src/transform/auto_schedule/warpgroup_partition.cc
- src/transform/auto_schedule/warpgroup_partition.h
…ile-ai#2021)

Bumps [transformers](https://github.com/huggingface/transformers) from 4.53.0 to 5.0.0rc3.
- [Release notes](https://github.com/huggingface/transformers/releases)
- [Commits](huggingface/transformers@v4.53.0...v5.0.0rc3)

updated-dependencies:
- dependency-name: transformers
  dependency-version: 5.0.0rc3
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Fix int64 loop phase dtype handling in LowerTileOp
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/transform/auto_schedule/schedule_builder.h`:
- Around line 32-37: The header declares enum class WarpSpecializeArch : uint8_t but doesn't directly include <cstdint>, so add a direct include of <cstdint> at the top of the file to ensure uint8_t is defined (i.e., insert `#include <cstdint>` in src/transform/auto_schedule/schedule_builder.h before the WarpSpecializeArch declaration).
- Around line 78-83: Add an explicit precondition check at the start of
ScheduleBuilder::Build to ensure SetThreadVar() has been called (i.e.,
thread_var_ and its domain/extent are initialized) before dereferencing
thread_var_->dom->extent; fail fast with a clear message mentioning
SetThreadVar, then proceed to call ScheduleRecursive and
AssignWarpgroupIdsGlobal as before so AssignWarpgroupIdsGlobal(root.get(),
config_, thread_var_->dom->extent) is safe to use.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 536ceb00-67b7-407f-b724-44111658bb24
📒 Files selected for processing (1)
- src/transform/auto_schedule/schedule_builder.h
    // Warp specialization architecture enum
    enum class WarpSpecializeArch : uint8_t {
      kHopper = 0,
      kBlackwell = 1,
      kUnsupported = 2,
    };
🧩 Analysis chain
🏁 Script executed:
cat -n src/transform/auto_schedule/schedule_builder.h | head -60

Repository: silentCoder-dev/tilelang
Length of output: 2072
🏁 Script executed:
# Check for existing cstdint include
rg "#include <cstdint>" src/transform/auto_schedule/schedule_builder.h

Repository: silentCoder-dev/tilelang
Length of output: 50
🏁 Script executed:
# Check all includes at the top of the file
rg "^#include" src/transform/auto_schedule/schedule_builder.h | head -20

Repository: silentCoder-dev/tilelang
Length of output: 511
Add the direct include for uint8_t.
WarpSpecializeArch uses uint8_t as its underlying type, but the header does not directly include <cstdint>. Relying on transitive includes risks breakage if include chains change in unrelated headers.
✏️ Proposed fix
     #include <tvm/tir/stmt_functor.h>
    +#include <cstdint>
     #include <map>
     #include <memory>

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/transform/auto_schedule/schedule_builder.h` around lines 32-37: The header declares enum class WarpSpecializeArch : uint8_t but doesn't directly include <cstdint>, so add a direct include of <cstdint> at the top of the file to ensure uint8_t is defined (i.e., insert `#include <cstdint>` in src/transform/auto_schedule/schedule_builder.h before the WarpSpecializeArch declaration).
Fix let statement clone bug and update layout map
… the type of dependency
…into auto-schedule
fix attr warp partition
Co-authored-by: Copilot <copilot@github.com>
…into auto-schedule
Fix local variable copy and naive IR structure bug
…into auto-schedule-myh
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
…into auto-schedule-myh
Remove redundant let statements and fix barrier issues
Fix buffer analysis and check kernel with barrier formatting