[TIR] More flexible buffer compaction by wrongtest-intellif · Pull Request #14021 · apache/tvm

wrongtest-intellif · 2023-02-17T11:01:53Z

Hi there, this change aims to enhance the power and flexibility of the CompactBufferAllocation pass in two aspects:

1. Free of pass order
- The pass can work on both s-tir (with opaque blocks) and lowered tir. For example, one can now invoke LoopPartition first and then perform buffer compaction that is aware of the partitioned tiles.
- We test the existing cases to ensure that
  (LowerOpaqueBlock . CompactBufferAllocation) (mod) == (CompactBufferAllocation . LowerOpaqueBlock) (mod)
2. Allow "non-strict" compaction
- Add an option is_strict, which defaults to True, to denote that compaction should respect the original buffer shape bound. Thus the compacted buffer region never exceeds the original.
- If set to False, the "compacted" shape is determined solely by what is needed to cover the accessed buffer regions, so it may become larger than the original shape. This changes the original semantics for out-of-bound accesses but may be helpful in certain use cases.
- If the loop domain changes (e.g., aligning the loop dim or removing an extra predicate), the accessed buffer region may grow; the pass can then provide a fallback implementation to adapt the buffer regions.
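
To illustrate the difference between strict and non-strict compaction, here is a minimal pure-Python sketch. It does not use TVM; `compact_region` and the interval representation are hypothetical names chosen for illustration, and the real pass works on `arith::IntSet` rather than integer tuples.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [min, max) along one dimension


def compact_region(
    accesses: List[List[Interval]],
    original_shape: List[int],
    is_strict: bool = True,
) -> List[Interval]:
    """Return the per-dimension bounding region that covers all accesses."""
    region = []
    for dim in range(len(original_shape)):
        lo = min(acc[dim][0] for acc in accesses)
        hi = max(acc[dim][1] for acc in accesses)
        if is_strict:
            # Strict mode: clamp the covering region to the original bounds.
            lo = max(lo, 0)
            hi = min(hi, original_shape[dim])
        region.append((lo, hi))
    return region


# A buffer declared with shape [10] whose loop was padded from 10 to 12 iterations.
accesses = [[(i, i + 1)] for i in range(12)]
strict = compact_region(accesses, [10], is_strict=True)
relaxed = compact_region(accesses, [10], is_strict=False)
print(strict)   # [(0, 10)]
print(relaxed)  # [(0, 12)]
```

In the non-strict case the result grows past the declared shape, which matches the scenario where a loop dimension is aligned or a guarding predicate is removed.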

About implementation issues:

To achieve [1]
- Buffer decl point:
  - (s-tir) T.alloc_buffer in block form is handled without any change.
  - (lowered) T.decl_buffer is newly handled. We assume it is at the proper position to dominate all accesses.
- Predicates
  - Predicates in T.where, IfThenElse, and T.if_then_else are now handled uniformly. As before, we first try to simply resolve the predicate to update the loop domain; on failure, we now push it onto a pending stack.
  - The visiting logic closely resembles that of IRVisitorWithAnalyzer, but since arith::Analyzer and arith::IntSet are currently independent components, the change does not introduce IRVisitorWithAnalyzer.
- Buffer accesses
  - There is no difference between s-tir and lowered form. If there are pending predicates at the access point, we always try affine iter analysis to get a tighter relaxed region.
- Thread binding
  - To work on lowered form, attr::thread_extent and attr::virtual_thread are handled to record the necessary info for thread relaxation.
- Buffer aliasing
  - No change to T.match_buffer handling, and we explicitly disable compaction of aliased buffers in lowered form.
- Dim alignment
  - We utilize the annotation field in T.allocate and preserve attr::buffer_dim_align in the LowerOpaqueBlock pass when it converts T.alloc_buffer to T.allocate. Thus the compaction can collect the dim alignment information in lowered form.
To achieve [2]
- It is much more direct. SimplifyAndNarrowBufferRegionFromNDIntSet does not intersect the accessed region with the original shape if the is_strict option is overridden to false.

tvm-bot · 2023-02-17T11:01:56Z

Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.

cc @Hzfengsy, @junrushao, @quic-sanirudh, @shingjan (see #10317 for details)

Generated by tvm-bot

wrongtest-intellif · 2023-02-28T10:40:17Z

cc @Hzfengsy @spectrometerHBH @vinx13 @Lunderberg

Hzfengsy · 2023-03-01T07:26:09Z

I have no objections. But it would be great if we could double-check with other TIR maintainers, as this PR changes the fundamental flow.

Lunderberg

I like the changes, especially the flexibility in order of passes. I don't have any worries on the lowering flow, especially since this PR doesn't change the order itself, and only prepares a way for doing so in the future. I did have a couple of questions regarding exactly how much of TIR should be supported by this pass:

Since this pass can now be applied to the lowered TIR, would this allow for de-duplication between the S-TIR lowering path and TE lowering path?
Are the DeclBuffer nodes sufficient to find all buffer declaration locations? @vinx13 can correct me, but last I heard they weren't fully used across the lowering flow.

Lunderberg · 2023-03-03T15:35:42Z

+    StmtExprVisitor::VisitStmt_(op);
+  }
+
+  void VisitStmt_(const DeclBufferNode* op) final {


If there is no DeclBuffer present for a buffer, would the resizing work correctly if it pre-visits the stmt to collect all buffers used within the body? That may be a way to avoid depending on DeclBuffer, which isn't used at all points in the lowering flow.

wrongtest-intellif · 2023-03-21

After some thought, we decided to switch to Allocate as the buffer def point, so the pass no longer depends on DeclBuffer.

That is, if we have

    data = T.allocate(...)
    for i in range(10):
        ...
        X = T.decl_buffer(..., data=data)
        for j in range(10):
            # access X[i, j]

The region is relaxed with respect to the scope of i instead of j, since what we actually want to mutate is the allocation.

junrushao · 2023-03-19T15:00:15Z

@wrongtest-intellif @Hzfengsy @Lunderberg any updates?

wrongtest-intellif · 2023-03-20T12:15:08Z

> @wrongtest-intellif @Hzfengsy @Lunderberg any updates?

Sorry for the late reply. I am making modifications based on the review suggestions.

wrongtest-intellif · 2023-03-21T03:51:52Z

> Since this pass can now be applied to the lowered TIR, would this allow for de-duplication between the S-TIR lowering path and TE lowering path?

I removed the legacy check; let's see if any issues show up in testing. Generally, since the pass runs after storage flatten on the TE codepath, I expect there to be no special IR form that causes problems.

> Are the DeclBuffer nodes sufficient to find all buffer declaration locations? @vinx13 can correct me, but last I heard they weren't fully used across the lowering flow.

The change has been refactored to not depend on DeclBuffer nodes.

wrongtest-intellif · 2023-03-25T02:01:41Z

A sea of failures when using it with "legacy" schedules (T.T). We are trying to fix all of them to improve generality as much as possible, but in the end I think we had better skip it by default. @Lunderberg

Lunderberg · 2023-03-28T13:10:15Z

@wrongtest-intellif That makes sense, and I agree that it isn't worth the effort to fix the issues when using the new pass on TE-derived schedules.

wrongtest-intellif · 2023-04-10T15:15:05Z

Here are the lessons learned from TE workloads:

- There may be opaque uses of a var before we meet any buffers, so a pre-collected var2buffer_ map is added to ensure we can treat an opaque access as a full-region access to the corresponding buffer. If the aliased buffer count is > 1, we skip the region collection and the buffers are not mutated.
- The itervar of a thread binding from TE workloads may have an empty dom field.
- There may be conditions of non-bool type. Thus we add a normalization from if (x: int32) to if (x != 0) for non-bool conditions.
- There may be a mismatch between the allocation dtype and the buffer dtype, even if the buffer is the only one that refers to the allocated var.
- The inferred region extent can be data dependent, for example when it depends on a buffer value as in Y = T.allocate([X[i], ...). In such cases, we should not compact the buffer to the dynamic extent, just as in the loop-dependent allocation cases.

Now it seems OK on the CI cases. We can now disable the processing of workloads from TE again and just keep some additional test cases covering them.
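
The non-bool condition normalization mentioned in the list above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `Var`, `NotEqual`, and `normalize_condition` are hypothetical stand-ins, not TVM classes or the pass's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    name: str
    dtype: str


@dataclass(frozen=True)
class NotEqual:
    lhs: object
    rhs: object
    dtype: str = "bool"  # a comparison always yields a bool


def normalize_condition(cond):
    """Rewrite a non-bool condition `x` into `x != 0`; keep bool conditions."""
    if cond.dtype == "bool":
        return cond
    return NotEqual(cond, 0)


x = Var("x", "int32")
flag = Var("flag", "bool")
print(normalize_condition(x))     # wrapped as x != 0
print(normalize_condition(flag))  # returned unchanged
```

After normalization every predicate has bool type, so later predicate handling does not need a special case for integer conditions.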

wrongtest-intellif · 2023-04-12T04:38:41Z

Hi~ Could you help take another look? cc @Hzfengsy @Lunderberg

wrongtest-intellif · 2023-04-12T07:01:16Z

Better to wait until #14596 is merged and then rebase.

Lunderberg

I think the changes make sense. The main question I have is whether the improved domain analysis should be merged into a more general utility.

Lunderberg · 2023-04-17T14:24:16Z

-    BufferAccessRegionCollector collector;
-    collector(f->body);
-    return std::move(collector.buffer_access_region_);
+      const PrimFunc& f, bool collect_inbound) {


The functionality of this class looks very similar to that of the DomainTouched utility. Should the two implementations be merged?

wrongtest-intellif · 2023-04-18

Hi~ @Lunderberg I totally agree that the DomainTouched utility serves a similar purpose. Here are some differences, I think:

BufferAccessRegionCollector collects regions with respect to the allocation point. That is,

    for i in range(10):
        a = T.allocate([10])
        A = T.decl_buffer([10, 10], data=a)
        for j in range(10):
            # use A[i, j]

would give the region A[i, 0:10], which is the domain "touched" under each allocation scope.

BufferAccessRegionCollector takes some special considerations for thread bindings.

BufferAccessRegionCollector tries to use more arith utilities to improve the IntSet analysis.

It would be great to share the same implementation (maybe at the cost of adding overhead where we use DomainTouched?). I'd like to try it in a standalone MR.
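
To make the first difference concrete, the following pure-Python sketch relaxes only the loop variables bound inside the allocation scope and keeps outer loop variables symbolic. The helper `relax_access` and its data layout are hypothetical and only model the idea; the actual collector operates on TIR nodes and `arith::IntSet`.

```python
from typing import Dict, List, Tuple, Union

Range = Tuple[int, int]     # half-open [min, max) of a loop variable
Region = Union[str, Range]  # a symbolic loop var (kept) or a relaxed range


def relax_access(
    indices: List[str],
    loop_ranges: Dict[str, Range],
    loops_inside_alloc: List[str],
) -> List[Region]:
    """Relax only the loop vars that are bound inside the allocation scope."""
    region: List[Region] = []
    for var in indices:
        if var in loops_inside_alloc:
            region.append(loop_ranges[var])  # covered over its whole range
        else:
            region.append(var)  # fixed for each allocation, stays symbolic
    return region


loop_ranges = {"i": (0, 10), "j": (0, 10)}

# Allocation inside loop i: only j is relaxed, giving A[i, 0:10].
per_iteration = relax_access(["i", "j"], loop_ranges, loops_inside_alloc=["j"])
# Allocation outside both loops: both i and j are relaxed.
whole = relax_access(["i", "j"], loop_ranges, loops_inside_alloc=["i", "j"])
print(per_iteration)  # ['i', (0, 10)]
print(whole)          # [(0, 10), (0, 10)]
```

The second call corresponds to the case where T.allocate sits outside loop i, so the region is relaxed over the scope of i as well as j.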

junrushao

I have no objections either, given that all the issues are resolved. It seems a positive change, as it makes the existing logic much clearer.

junrushao · 2023-04-27T03:02:54Z

CC @Hzfengsy @Lunderberg for a second look

wrongtest-intellif marked this pull request as draft February 17, 2023 11:17

wrongtest-intellif force-pushed the more_flexible_buffer_compaction branch 6 times, most recently from 2c8f46a to a92fb05 Compare February 28, 2023 10:36

wrongtest-intellif marked this pull request as ready for review February 28, 2023 10:37

wrongtest-intellif changed the title ~~[Draft][TIR] More flexible buffer compaction~~ [TIR] More flexible buffer compaction Feb 28, 2023

wrongtest-intellif requested a review from Lunderberg February 28, 2023 10:38

wrongtest-intellif force-pushed the more_flexible_buffer_compaction branch from 517374c to a8acc7b Compare March 1, 2023 01:40

Hzfengsy reviewed Mar 1, 2023

Comment thread python/tvm/tir/transform/transform.py Outdated

Comment thread src/tir/transforms/ir_utils.cc Outdated

Lunderberg reviewed Mar 3, 2023

Lunderberg reviewed Mar 23, 2023

Comment thread src/tir/transforms/compact_buffer_region.cc Outdated

Comment thread src/tir/transforms/compact_buffer_region.cc

Comment thread include/tvm/tir/transform.h

wrongtest-intellif force-pushed the more_flexible_buffer_compaction branch 3 times, most recently from e7f6cad to f1e937f Compare April 10, 2023 10:00

wrongtest-intellif force-pushed the more_flexible_buffer_compaction branch from f1e937f to 4be12af Compare April 11, 2023 10:00

Allow more flexible buffer compaction

9fbb95a

wrongtest-intellif added 9 commits April 13, 2023 18:31

fix lint

8998863

fix missing include header

3365356

use allocation as buffer def point

9e546c2

fix lint issues

ed79b3c

cancel from legacy check

2df0f52

prepare all new buffer before mutation, to decouple with declbuffer a…

ee2b8b1

…nd allocate order

fit thread binding var of unknown domain

ed86fd4

fix failures on te schedule flow

93ec6a7

add skip legacy-te back

db25959

wrongtest-intellif force-pushed the more_flexible_buffer_compaction branch from 4be12af to db25959 Compare April 13, 2023 11:19

fix lint

0358e77

Lunderberg reviewed Apr 17, 2023

junrushao approved these changes Apr 27, 2023

junrushao merged commit 30b34d2 into apache:main Apr 28, 2023

MasterJH5574 mentioned this pull request May 12, 2023

[Bug] Wrong lowered TIR with symbolic input buffer #14834

Closed

wrongtest-intellif mentioned this pull request May 17, 2023

[TIR] Avoid too complex predicate in compaction #14866

Merged

tqchen pushed a commit to tqchen/tvm that referenced this pull request May 17, 2023

Revert "[TIR] More flexible buffer compaction (apache#14021)" (apache…

5124eea

…#215)

ysh329 mentioned this pull request Jul 12, 2023

[Release] v0.13.0 Release Candidate Notes #15295

Closed
