
[TIR] More flexible buffer compaction #14021

Merged
junrushao merged 11 commits into apache:main from wrongtest-intellif:more_flexible_buffer_compaction
Apr 28, 2023

Conversation

@wrongtest-intellif
Contributor

@wrongtest-intellif wrongtest-intellif commented Feb 17, 2023

Hi there, this change aims to improve the power and flexibility of the CompactBufferAllocation pass in two respects:

  1. Independence from pass order

    • The pass now works on both S-TIR (with opaque blocks) and lowered TIR. For example, one can now invoke LoopPartition first and then perform buffer compaction that is aware of the partitioned tiles.
    • We test existing cases to ensure that
      (LowerOpaqueBlock . CompactBufferAllocation) (mod) == (CompactBufferAllocation . LowerOpaqueBlock) (mod)
  2. Allow "non-strict" compaction

    • Add an option is_strict, defaulting to True, which denotes that compaction should respect the original buffer shape bound. The compacted buffer region then never exceeds the original.
    • If set to False, the "compacted" shape is determined solely by the need to cover the accessed buffer regions, so it may become larger than the original shape. This changes the original semantics for out-of-bound accesses but may be helpful in certain use cases.
    • If the loop domain changes (e.g., the loop dimension is aligned or an extra predicate is removed), the accessed buffer region may grow; the pass can then act as a fallback that adapts the buffer regions.
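
To illustrate the difference between strict and non-strict compaction, here is a minimal pure-Python sketch. It is a toy model rather than the pass itself: the function `compact_region` and its interval representation are hypothetical, and only the `is_strict` option name comes from this PR.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open range [min, max) along one dimension


def compact_region(accessed: List[Interval], original_shape: List[int],
                   is_strict: bool = True) -> List[Interval]:
    """Toy model: per-dimension range the compacted buffer has to cover."""
    region = []
    for (lo, hi), extent in zip(accessed, original_shape):
        if is_strict:
            # Strict mode: intersect the accessed range with the original bound [0, extent).
            lo, hi = max(lo, 0), min(hi, extent)
        region.append((lo, hi))
    return region


# Accesses span [8, 12) on a buffer of original shape [10], e.g. after a loop
# extent was padded for alignment.
strict_region = compact_region([(8, 12)], [10], is_strict=True)
relaxed_region = compact_region([(8, 12)], [10], is_strict=False)
print(strict_region)   # [(8, 10)]
print(relaxed_region)  # [(8, 12)]
```

In this model the two modes agree whenever all accesses stay inside the original shape; they differ only for out-of-bound accesses.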

About implementation issues:

  • To achieve [1]

    • Buffer decl point:
      • (s-tir) T.alloc_buffer in block form is handled without any change.
      • (lowered) T.decl_buffer is newly handled. We assume it is at the proper position to dominate all accesses.
    • Predicates
      • Predicates in T.where, IfThenElse, and T.if_then_else are now handled uniformly. As before, we first try to resolve the predicate directly to update the loop domain; on failure, we now push it onto a pending stack.
      • The visiting logic is very similar to that of IRVisitorWithAnalyzer, but since arith::Analyzer and arith::IntSet are currently independent components, this change does not use IRVisitorWithAnalyzer.
    • Buffer accesses
      • There is no difference between S-TIR and the lowered form. If there are pending predicates at the access point, we always try affine iterator analysis to obtain a tighter relaxed region.
    • Thread binding
      • To work on the lowered form, attr::thread_extent and attr::virtual_thread are handled to record the necessary information for thread relaxation.
    • Buffer aliasing
      • T.match_buffer handling is unchanged, and we explicitly disable compaction of aliased buffers in the lowered form.
    • Dim alignment
      • We use the annotation field of T.allocate and preserve attr::buffer_dim_align in the LowerOpaqueBlock pass when it converts T.alloc_buffer to T.allocate. The compaction can thus collect the dimension alignment information in the lowered form.
  • To achieve [2]

    • This is much more direct: SimplifyAndNarrowBufferRegionFromNDIntSet does not intersect the accessed region with the original shape if the is_strict option is overridden to False.
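
The predicate handling described above (resolve directly if possible, otherwise push onto a pending stack) can be sketched as follows. This is a simplified pure-Python model under stated assumptions: predicates are represented as hypothetical `(op, lhs, rhs)` tuples and only `var < const` counts as directly resolvable; the actual pass operates on TIR expressions with arith utilities.

```python
from typing import Dict, List, Tuple

Domain = Dict[str, Tuple[int, int]]  # loop var -> half-open range [min, max)
Predicate = Tuple[str, str, int]     # (op, lhs expression as text, constant rhs)


def enter_predicate(pred: Predicate, dom: Domain, pending: List[Predicate]) -> bool:
    """Fold `pred` into the loop domain if it is a simple bound, else defer it."""
    op, lhs, rhs = pred
    if op == "<" and lhs in dom:
        lo, hi = dom[lhs]
        # Simple `var < const`: tighten the upper bound of that loop var.
        dom[lhs] = (lo, max(lo, min(hi, rhs)))
        return True
    # Not directly resolvable: keep it for a later, more expensive analysis.
    pending.append(pred)
    return False


dom: Domain = {"i": (0, 8), "j": (0, 8)}
pending: List[Predicate] = []
resolved_simple = enter_predicate(("<", "i", 6), dom, pending)
resolved_fused = enter_predicate(("<", "i * 8 + j", 50), dom, pending)
print(dom)      # {'i': (0, 6), 'j': (0, 8)}
print(pending)  # [('<', 'i * 8 + j', 50)]
```

The deferred predicate on the fused index is the kind of case where, per the description, affine iterator analysis would be tried at the access point.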

@tvm-bot
Collaborator

tvm-bot commented Feb 17, 2023

Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.

Generated by tvm-bot

@wrongtest-intellif wrongtest-intellif marked this pull request as draft February 17, 2023 11:17
@wrongtest-intellif wrongtest-intellif force-pushed the more_flexible_buffer_compaction branch 6 times, most recently from 2c8f46a to a92fb05 Compare February 28, 2023 10:36
@wrongtest-intellif wrongtest-intellif marked this pull request as ready for review February 28, 2023 10:37
@wrongtest-intellif wrongtest-intellif changed the title [Draft][TIR] More flexible buffer compaction [TIR] More flexible buffer compaction Feb 28, 2023

@wrongtest-intellif wrongtest-intellif force-pushed the more_flexible_buffer_compaction branch from 517374c to a8acc7b Compare March 1, 2023 01:40
Comment thread python/tvm/tir/transform/transform.py Outdated
Comment thread src/tir/transforms/ir_utils.cc Outdated
@Hzfengsy
Member

Hzfengsy commented Mar 1, 2023

I have no objections. But it would be great to double-check with other TIR maintainers, since this PR changes the fundamental flow.

Contributor

@Lunderberg Lunderberg left a comment

I like the changes, especially the flexibility in order of passes. I don't have any worries on the lowering flow, especially since this PR doesn't change the order itself, and only prepares a way for doing so in the future. I did have a couple of questions regarding exactly how much of TIR should be supported by this pass:

  1. Since this pass can now be applied to the lowered TIR, would this allow for de-duplication between the S-TIR lowering path and TE lowering path?

  2. Are the DeclBuffer nodes sufficient to find all buffer declaration locations? @vinx13 can correct me, but last I heard they weren't fully used across the lowering flow.

Comment thread src/tir/analysis/block_access_region_detector.cc
StmtExprVisitor::VisitStmt_(op);
}

void VisitStmt_(const DeclBufferNode* op) final {
Contributor

If there is no DeclBuffer present for a buffer, would the resizing work correctly if it pre-visits the stmt to collect all buffers used within the body? That may be a way to avoid depending on DeclBuffer, which isn't used at all points in the lowering flow.

Contributor Author

After some thought, we decided to switch to Allocate as the buffer definition point, so the pass no longer depends on DeclBuffer.

That is, if we have

data = T.allocate(...)
for i in range(10):
    ...
    X = T.decl_buffer(..., data=data)
    for j in range(10):
        # access X[i, j]

The region is relaxed with respect to the scope of i rather than j, since what we actually want to mutate is the allocation region.
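
As a rough illustration of relaxing relative to the allocation point, the sketch below models a loop nest in plain Python. It assumes one reading of the comment above: an access is relaxed over every loop nested inside the allocation, while loop variables that enclose the allocation stay fixed. The names `relax_region` and `alloc_depth` are hypothetical and not part of the pass.

```python
from typing import List, Tuple, Union

Loop = Tuple[str, int]             # (loop var, extent), outermost first
Dim = Union[str, Tuple[int, int]]  # a fixed loop var, or a half-open range


def relax_region(indices: List[str], loops: List[Loop], alloc_depth: int) -> List[Dim]:
    """Relax an access over every loop nested inside the allocation point.

    `alloc_depth` is the number of loops that enclose the allocation.
    """
    enclosing = {name for name, _ in loops[:alloc_depth]}
    extents = dict(loops)
    region: List[Dim] = []
    for var in indices:
        if var in enclosing:
            # The var is constant within one allocation instance: keep it as a point.
            region.append(var)
        else:
            # The loop runs inside the allocation scope: cover its full range.
            region.append((0, extents[var]))
    return region


loops = [("i", 10), ("j", 10)]
outside_i = relax_region(["i", "j"], loops, alloc_depth=0)  # allocation before loop i
inside_i = relax_region(["i", "j"], loops, alloc_depth=1)   # allocation inside loop i
print(outside_i)  # [(0, 10), (0, 10)]
print(inside_i)   # ['i', (0, 10)]
```

With the allocation placed before loop i, as in the snippet above, both dimensions are relaxed; if the allocation sat inside loop i, only the j dimension would be.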

Comment thread src/tir/transforms/compact_buffer_region.cc
@junrushao
Member

@wrongtest-intellif @Hzfengsy @Lunderberg any updates?

@wrongtest-intellif
Contributor Author

wrongtest-intellif commented Mar 20, 2023

@wrongtest-intellif @Hzfengsy @Lunderberg any updates?

Sorry for the delay. I am making modifications to address the review suggestions.

@wrongtest-intellif
Contributor Author

  1. Since this pass can now be applied to the lowered TIR, would this allow for de-duplication between the S-TIR lowering path and TE lowering path?

I removed the legacy check; let's see whether any issues show up in testing. In general, since the pass runs after storage flattening on the TE code path, I expect no special IR forms that would cause problems.

  2. Are the DeclBuffer nodes sufficient to find all buffer declaration locations? @vinx13 can correct me, but last I heard they weren't fully used across the lowering flow.

The change is refactored to not depend on DeclBuffer nodes.

Comment thread src/tir/transforms/compact_buffer_region.cc Outdated
Comment thread src/tir/transforms/compact_buffer_region.cc
Comment thread include/tvm/tir/transform.h
@wrongtest-intellif
Contributor Author

There is a sea of failures when using it with the "legacy" schedule (T.T). We are trying to fix all of them to improve generality as much as possible, but in the end I think we had better skip it by default. @Lunderberg

@Lunderberg
Contributor

@wrongtest-intellif That makes sense, and I agree that it isn't worth the effort to fix the issues when using the new pass on TE-derived schedules.

@wrongtest-intellif wrongtest-intellif force-pushed the more_flexible_buffer_compaction branch 3 times, most recently from e7f6cad to f1e937f Compare April 10, 2023 10:00
@wrongtest-intellif
Contributor Author

Here are the lessons learned from TE workloads:

  • There may be opaque uses of the data var before we meet any buffer, so a pre-collected map var2buffer_ is added to ensure we can treat such an opaque access as a full-region access to the corresponding buffer. If more than one buffer aliases the same var, we skip the region collection and the buffers are not mutated.

  • The IterVar of a thread binding in TE workloads may have an empty dom field.

  • Conditions may have a non-bool type. We therefore normalize if (x: int32) to if (x != 0) for non-bool conditions.

  • The allocation dtype and the buffer dtype may mismatch, even if the buffer is the only one that refers to the allocated var.

  • The inferred region extent may be data dependent, for example dependent on a buffer value as in Y = T.allocate([X[i], ...). In such cases, we should not compact the buffer to the dynamic extent, just as in the loop-dependent allocation cases.

The CI cases now seem to pass. We can disable the pass for workloads from TE again and just keep some additional test cases covering them.
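
Two of these lessons, the pre-collected var-to-buffer map with alias skipping and the normalization of non-bool conditions, can be sketched in plain Python. This is only an illustrative model: the helper names and the string-based representation of vars, buffers, and conditions are hypothetical, and only the name `var2buffer_` and the `x != 0` rewrite come from the comment above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def collect_var2buffer(buffers: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each data var to the names of all buffers backed by it."""
    var2buffer: Dict[str, List[str]] = defaultdict(list)
    for buffer_name, data_var in buffers:
        var2buffer[data_var].append(buffer_name)
    return dict(var2buffer)


def compaction_candidates(var2buffer: Dict[str, List[str]]) -> Dict[str, str]:
    """Keep only data vars with exactly one buffer; aliased ones are skipped."""
    return {var: names[0] for var, names in var2buffer.items() if len(names) == 1}


def normalize_condition(cond: str, dtype: str) -> str:
    """Rewrite a non-bool condition `x` as `x != 0`; leave bool conditions alone."""
    return cond if dtype == "bool" else f"({cond} != 0)"


var2buffer = collect_var2buffer([("A", "a_data"), ("B", "b_data"), ("B_alias", "b_data")])
candidates = compaction_candidates(var2buffer)
print(candidates)                         # {'a_data': 'A'}
print(normalize_condition("x", "int32"))  # (x != 0)
print(normalize_condition("p", "bool"))   # p
```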

@wrongtest-intellif wrongtest-intellif force-pushed the more_flexible_buffer_compaction branch from f1e937f to 4be12af Compare April 11, 2023 10:00
@wrongtest-intellif
Contributor Author

Hi~ Could you take another look? cc @Hzfengsy @Lunderberg

@wrongtest-intellif
Contributor Author

It is better to wait until #14596 is merged and then rebase.

@wrongtest-intellif wrongtest-intellif force-pushed the more_flexible_buffer_compaction branch from 4be12af to db25959 Compare April 13, 2023 11:19
Contributor

@Lunderberg Lunderberg left a comment

I think the changes make sense. The main question I have is whether the improved domain analysis should be merged into a more general utility.

BufferAccessRegionCollector collector;
collector(f->body);
return std::move(collector.buffer_access_region_);
const PrimFunc& f, bool collect_inbound) {
Contributor

The functionality of this class looks very similar to that of the DomainTouched utility. Should the two implementations be merged?

Contributor Author

Hi~ @Lunderberg I totally agree that the DomainTouched utility serves a similar purpose. Here are some differences as I see them:

  • BufferAccessRegionCollector collects regions with respect to the allocation point; that is,
for i in range(10):
    a = T.allocate([10])
    A = T.decl_buffer([10, 10], data=a)
    for j in range(10):
        # use A[i, j]

would give the region A[i, 0:10], which is the domain "touched" under each allocation scope.

  • BufferAccessRegionCollector gives special consideration to thread bindings.

  • BufferAccessRegionCollector tries to use more arith utilities to improve the IntSet analysis.

It would be great to share the same implementation (perhaps at the price of some added overhead wherever DomainTouched is used?). I'd like to try that in a standalone MR.

Member

@junrushao junrushao left a comment

I have no objections either, given that all the issues are resolved. This seems like a positive change, as it makes the existing logic much clearer.

@junrushao
Member

CC @Hzfengsy @Lunderberg for a second look


5 participants