[TIR] More flexible buffer compaction#14021
Conversation
|
Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.
Generated by tvm-bot |
2c8f46a to
a92fb05
Compare
517374c to
a8acc7b
Compare
|
I have no objections. But it would be great if we can have double-checked with other TIR maintainers, as this PR changes the fundamental flow. |
Lunderberg
left a comment
There was a problem hiding this comment.
I like the changes, especially the flexibility in order of passes. I don't have any worries on the lowering flow, especially since this PR doesn't change the order itself, and only prepares a way for doing so in the future. I did have a couple of questions regarding exactly how much of TIR should be supported by this pass:
-
Since this pass can now be applied to the lowered TIR, would this allow for de-duplication between the S-TIR lowering path and TE lowering path?
-
Are the
DeclBuffernodes sufficient to find all buffer declaration locations? @vinx13 can correct me, but last I heard they weren't fully used across the lowering flow.
| StmtExprVisitor::VisitStmt_(op); | ||
| } | ||
|
|
||
| void VisitStmt_(const DeclBufferNode* op) final { |
There was a problem hiding this comment.
If there is no DeclBuffer present for a buffer, would the resizing work correctly if it pre-visits the stmt to collect all buffers used within the body? That may be a way to avoid depending on DeclBuffer, which isn't used at all points in the lowering flow.
There was a problem hiding this comment.
After some thinking we decide to switch to Allocate as the buffer def point, thus not depend on DeclBuffer.
That is, if we have
data = T.allocate(...)
for i in range(10):
...
X = T.decl_buffer(..., data=data)
for j in range(10):
# access X[i, j]The region is relaxed wrt scope of i instead of j, since we actually want to mutate allocation regions.
|
@wrongtest-intellif @Hzfengsy @Lunderberg any updates? |
sorry for too late.. I am making modifications for review suggestions. |
I remove the legacy check, let's see if there are issues in testing. Generally, since the pass is after storage flatten for te codepath, I expect there would be no special IR form to cause problems.
The change is refactored to not depend on |
|
Sea of failures when use it with "legacy" schedule (T.T).. We are trying to fix all of them to best improve the generality, but at the last I think we'd better to skip it by default. @Lunderberg |
|
@wrongtest-intellif That makes sense, and I agree that it isn't worth the effort to fix the issues when using the new pass on TE-derived schedule. |
e7f6cad to
f1e937f
Compare
|
Here lists lessons from te workloads:
Now it seems ok to the CI cases. Now we could disable the process for workloads from TE again and just leave some more covering cases for them. |
f1e937f to
4be12af
Compare
|
Hi~ Could you help to take another round of look? cc @Hzfengsy @Lunderberg |
|
Better to wait #14596 merged and rebase then. |
…nd allocate order
4be12af to
db25959
Compare
Lunderberg
left a comment
There was a problem hiding this comment.
I think the changes make sense. The main question I have is whether the improved domain analysis should be merged into a more general utility.
| BufferAccessRegionCollector collector; | ||
| collector(f->body); | ||
| return std::move(collector.buffer_access_region_); | ||
| const PrimFunc& f, bool collect_inbound) { |
There was a problem hiding this comment.
The functionality of this class looks very similar to that of the DomainTouched utility. Should the two implementations be merged?
There was a problem hiding this comment.
Hi~ @Lunderberg I totally agree the DomainTouched utility is of the similar purpose . Here are some differences I think:
BufferAccessRegionCollectorcollect region wrt the allocation point, that is,
for i in range(10):
a = T.allocate([10])
A = T.decl_buffer( [10, 10], data=a)
for j in range(10):
# use A[i, j]would give region A[i, 0:10], which is the domain "touched" under each allocation scope.
-
BufferAccessRegionCollectortake some special considerations on thread bindings. -
BufferAccessRegionCollectortry use more arith utilities to improve intset analysis.
It would be great to share the same implementation (maybe at the cost of adding cost to where we use DomainTouched?). I'd like to try it in a standalone MR.
junrushao
left a comment
There was a problem hiding this comment.
I have no objections either given all the issues are resolved. Seems a positive change as it makes the existing logic much more clear
|
CC @Hzfengsy @Lunderberg for a second look |
Hi there, the change want to enforce the power and flexiblity of
CompactBufferAllocationpass in two aspects:Free of pass order
LoopPartitionand then make partitioned tile aware buffer compactions.(LowerOpaqueBlock . CompactBufferAllocation) (mod) == (CompactBufferAllocation . LowerOpaqueBlock) (mod)Allow "non-strict" compaction
is_strictdefaults toTrue, to denote that during compaction we should respect the original buffer shape bound. Thus the compacted buffer region never exceed the original.False, the "compacted" shape is totally determined to cover buffer region accesses. Thus it may become larger than the original shape. This change the original semantic for out-of-bound accesses but may be helpful in certain usages.About implementation issues:
To achieve [1]
T.alloc_bufferin block form is handled without any change.T.decl_bufferis newly handled. We assume it is at the proper position to dominate all accesses.T.whereandIfThenElse,T.if_then_elseare handled uniformlly now. We would try simply resolve the predicate to update loop domain as before. But on failures, now we keep them into pending stack.IRVisitorWithAnalyzer's, but sincearith::Analyzerandarith::IntSetare independent components now, the change do not introduceIRVisitorWithAnalyzer'.attr::thread_extentandattr::virtual_threadare handled to record neccesary info for thread relaxion.T.match_bufferhandling, and we explicitly disable compaction to alised buffers in lowered form.T.allocateand preserveattr::buffer_dim_aligninLowerOpaqueBlockpass when it convertT.alloc_buffertoT.allocate. Thus the compaction could collect the dim alignment information in lowered form.To achieve [2]
SimplifyAndNarrowBufferRegionFromNDIntSetwould not intersect accessed region with original shape if theis_strictoption is overrided to false.