[TIR] Utility function to decide loop mapping for auto tensorization #11050
vinx13 merged 19 commits into apache:main
Conversation
```cpp
      next_block_ind = i_block - 1;
      break;
    }
  }
```
The logic here is very different from the one in the original code: https://github.com/spectrometerHBH/tvm/blob/auto-tensorization/src/tir/schedule/analysis/analysis.cc#L1246. I was not able to understand why the original code was written that way, and it didn't work for the case where the matching loops in the target block are not in the innermost positions (conv2d NCHWc on CPU, a test included in this PR).

I think my change is simple and obvious: the condition for a match is (1) divisibility of the loop extent and (2) matching iterator types (reduction vs. spatial). The mapping is determined starting from the innermost axis.

Please have a look at this change carefully, and let me know if I need to bring back some logic from the original code. @spectrometerHBH @vinx13
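For illustration, here is a minimal Python sketch of that matching rule. The names (`Loop`, `match_loops`) and the flat-list inputs are hypothetical simplifications; the actual analysis operates on TIR loops in C++.

```python
from typing import List, NamedTuple, Optional

class Loop(NamedTuple):
    extent: int
    iter_type: str  # "spatial" or "reduction"

def match_loops(block_loops: List[Loop], desc_loops: List[Loop]) -> Optional[List[int]]:
    """For each intrinsic (desc) loop, innermost first, find the innermost
    unused block loop whose extent is divisible by the desc loop's extent
    and whose iterator type matches."""
    mapping: List[Optional[int]] = [None] * len(desc_loops)
    i_block = len(block_loops) - 1
    for i_desc in range(len(desc_loops) - 1, -1, -1):
        desc = desc_loops[i_desc]
        while i_block >= 0:
            blk = block_loops[i_block]
            i_block -= 1
            if blk.extent % desc.extent == 0 and blk.iter_type == desc.iter_type:
                mapping[i_desc] = i_block + 1
                break
        if mapping[i_desc] is None:
            return None  # no feasible mapping
    return mapping  # mapping[i] = index of the block loop matched to desc loop i

# e.g. a 1024x1024x1024 matmul block vs. a 16x16x4 intrinsic description
assert match_loops(
    [Loop(1024, "spatial"), Loop(1024, "spatial"), Loop(1024, "reduction")],
    [Loop(16, "spatial"), Loop(16, "spatial"), Loop(4, "reduction")],
) == [0, 1, 2]
```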
I would love to have @spectrometerHBH review this change before merging
The goal of the original mapping is to support

```python
for k:
    for i:
        for j:
            C[i, j] += A[i, k] * B[k, j]
```

where the loops are not in the same order as in the tensor intrinsic description function.
But it also makes sense not to support such cases in this PR, so I approve it.
Thanks @spectrometerHBH, I now understand the original code and was able to integrate its logic to support loop permutations. Please have a look at the current diff. Also cc @vinx13 @Hzfengsy @MasterJH5574

The key difference between the original code and the code I submitted yesterday is that my code looked only at the loop nest (ForNode) to determine the mapping, while @spectrometerHBH's mapping logic is based on the iter_var/value bindings of the block (and is therefore invariant to the order of the loop nest).
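To make that invariance concrete, here is an illustrative TVMScript fragment (not taken from this PR): even with the reduction loop hoisted outermost, the block still binds its iter_vars in the canonical (i, j, k) order, which is what the binding-based mapping relies on.

```python
import tvm
from tvm.script import tir as T

@T.prim_func
def permuted_gemm(a: T.handle, b: T.handle, c: T.handle) -> None:
    A = T.match_buffer(a, (128, 128), "float32")
    B = T.match_buffer(b, (128, 128), "float32")
    C = T.match_buffer(c, (128, 128), "float32")
    # The reduction loop (i2) is outermost, unlike the usual i, j, k order.
    for i2, i0, i1 in T.grid(128, 128, 128):
        with T.block("C"):
            # Block bindings remain canonical: spatial i, j, then reduction k.
            i, j, k = T.axis.remap("SSR", [i0, i1, i2])
            with T.init():
                C[i, j] = T.float32(0)
            C[i, j] = C[i, j] + A[i, k] * B[k, j]
```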
MasterJH5574 left a comment
Thanks Masa! I just caught a minor point.
This reverts commit eb147f3.
Co-authored-by: Ruihang Lai <lairuihangdongdong@qq.com>
```cpp
ICHECK(desc_loops.size() == static_cast<size_t>(n_desc_vars));
ICHECK(block_loops.size() == iter_types_block.size());

// We assume that the orders of iter_vars in the target and the desc block are consistent.
```
i.e., no matter what the loop permutation is, we should always have

```python
i, j, k = T.axis.remap("SSR", [i0, i1, i2])
```

for GEMM.

I think this is a reasonable assumption. Correct me if I'm wrong @spectrometerHBH @junrushao1994 @vinx13
I agree this is a reasonable assumption. Though there might be corner cases, it covers all of the current use cases.
Add a `TensorizeInfo` structure and a `GetTensorizeLoopMapping` function, which are used to determine the correspondence of loops between a target block and an intrinsic description.

Matching is based on a heuristic: it works in all cases I tested (CPU dot product for dense / conv2d, CPU / GPU matmul), but there is no guarantee that it always finds the "right" mapping. If the mapping is not correct, tensorize will fail.

The original code is https://github.com/spectrometerHBH/tvm/blob/auto-tensorization/src/tir/schedule/analysis/analysis.cc#L1175; I modified it to support more cases and added tests. I'm sending this PR on behalf of the team, but most of the work was done by others earlier.
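A hedged usage sketch of the new analysis from the Python side. The binding name `get_tensorize_loop_mapping`, its module path, and the `loop_map` attribute are assumptions about how the C++ `GetTensorizeLoopMapping` is exposed; the exact entry point may differ.

```python
import tvm
from tvm import tir
from tvm.tir.schedule.analysis import get_tensorize_loop_mapping  # assumed binding

sch = tir.Schedule(matmul_mod)   # matmul_mod: an IRModule containing a "C" block
block = sch.get_block("C")
# desc_func: the PrimFunc describing the tensor intrinsic's computation
info = get_tensorize_loop_mapping(sch, block, desc_func)
if info is None:
    print("no feasible loop mapping; tensorize would not apply")
else:
    # Assumed layout of TensorizeInfo: loop_map relates target-block loops
    # to the corresponding loops of the intrinsic description.
    for block_loop, desc_loop in info.loop_map.items():
        print(block_loop, "->", desc_loop)
```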
Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
Co-authored-by: Bohan Hou <32121147+spectrometerHBH@users.noreply.github.com>
Co-authored-by: Hongyi Jin <3231950289@qq.com>
Co-authored-by: Ruihang Lai <lairuihangdongdong@qq.com>
Co-authored-by: Wuwei Lin <wuwei@apache.org>
@vinx13 @junrushao1994 @spectrometerHBH @Hzfengsy @MasterJH5574 @jinhongyii