[TOPI] Vectorize depthwise conv2d output operator#14519
Conversation
Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.

Generated by tvm-bot
echuraev
left a comment
Thank you for your PR. Should we add a new test case?
```python
                s[out].unroll(ci_outer)
            else:
                s[out].vectorize(ci)
```
Shouldn't the value of `ci` be limited somehow? What if the value of `ci` is huge? You can use the `filter` argument in `define_split` to prevent this situation.
This is a good point. I'm not sure what would happen if `ci` was huge; I'd guess LLVM would legalize the vectors and loop over the hardware vector-register width when storing 32-bit floats.

I mostly copied this from above, though, using the same scheduling that we use for the actual convolution.

Since the fallback scheduling uses a loop split where `ci = 8`, and for the hardware in question that is the number of 32-bit floats in a vector, I think we should be okay. When we aren't using the fallback, i.e. auto-scheduling, I would hope the auto-scheduler is able to find the optimal choice for `ci`, which should be at least as good as 8 or better, but perhaps there are scenarios where it could choose a bad value?

We could do something similar to how we handle regular convolutions and define a tunable parameter for vectorization, but that would potentially change the scheduling of the depthwise convolution itself rather than its output, which is beyond the initial scope of this PR.

What are your thoughts?
Thank you for your answer. I think that to avoid guesses about the size of `ci` (are you sure that it will be equal to 8?) you can just add a `filter` parameter to `define_split`, e.g.:

```python
cfg.define_split("tile_c", c, num_outputs=2, filter=lambda entry: entry.size[1] <= 32)
```

In this case I believe we can guarantee that `ci` won't be bigger than 32. What do you think?
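As a rough illustration of what such a `filter` does conceptually — this is a hypothetical pure-Python sketch, not TVM's actual implementation — the predicate simply prunes split candidates whose inner factor exceeds the bound:

```python
# Hypothetical sketch of define_split's filter behaviour (not TVM internals):
# enumerate all two-way splits of a channel extent and keep only those whose
# inner factor is within the given bound.

def two_way_splits(extent):
    """All (outer, inner) factor pairs whose product equals `extent`."""
    return [(extent // i, i) for i in range(1, extent + 1) if extent % i == 0]

def filtered_splits(extent, max_inner):
    # Mirrors filter=lambda entry: entry.size[1] <= max_inner
    return [s for s in two_way_splits(extent) if s[1] <= max_inner]

print(filtered_splits(64, 32))
# → [(64, 1), (32, 2), (16, 4), (8, 8), (4, 16), (2, 32)]
```

This shows why the filter shrinks the tuning search space: candidates like `(1, 64)` are never even generated as configurations to measure.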
I think we are already defining a split here, though; this is where I got the value of 8 from.
This split will only apply when running without tuning statistics (in default mode). I suggest restricting the search space in tuning by adding a `filter` value in `define_split`. It will help to improve tuning time and drop useless configurations.

By the way, I reread your message and noticed that I was probably wrong to suggest restricting the vectorization size to 32. You wrote that it is storing 32-bit floats, which means that the maximum possible capacity for vectorization is 8, am I right?
Cool, thanks for the suggestion; I've added a filter on the split.

Yeah, I think you are almost correct: for NEON, where vectors are 128 bits and a float is 32 bits, there will be 128 / 32 = 4 elements processed in each vector operation. I think the backend will split the 8-element LLVM vectors into two 4-element vectors during legalization, so this isn't a problem.
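The lane arithmetic in this exchange can be sanity-checked with a tiny sketch (assuming, as discussed, 128-bit NEON registers and 32-bit floats):

```python
# Back-of-envelope check of the vector-width reasoning above.
# Assumptions: 128-bit NEON vector registers, 32-bit floats.
VECTOR_BITS = 128
FLOAT_BITS = 32

# Elements processed per hardware vector operation.
lanes = VECTOR_BITS // FLOAT_BITS
print(lanes)  # → 4

# An 8-element LLVM vector is legalized into ceil(8 / lanes) native vectors.
native_vectors = -(-8 // lanes)  # ceiling division
print(native_vectors)  # → 2
```

So a `ci = 8` split is safe on this hardware: legalization just emits two native 4-lane operations per 8-element vector.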
Thanks for the speedy review :D Yeah, although I'm not sure where a test for something like this would live. We would need to test that vectorization happens as expected, so some kind of lit test based on TIR?
I believe that a test in test_conv2d.py might be enough. You could probably also add a TIR or codegen test to check that vectorization was applied, but in that case the codegen test would need to call an existing schedule instead of writing a new one.
Sorry, I've been on annual leave and just got back to reviewing this. I took a look at test_conv2d.py, although wouldn't a more appropriate place for this be test_depthwise_conv2d.py? I think at present, though, we don't do any testing for Arm Cortex-A targets (see here), so perhaps this is outside the scope of this PR?
I thought that this schedule is a common depthwise conv2d schedule for Arm, and it doesn't matter whether the target is Cortex-A or anything else, am I right? But anyway, I'm OK if we merge this PR without a test, because it is pretty simple and writing a good test that checks the generated TIR would probably require too much effort.
Sorry, yes, you are correct, other Arm CPU targets will also go through this path. For the test, though, we currently only test the Cortex-M target with the AOT Corstone-300 runner, which I think means we will always take the CMSIS-NN path. I think you are probably right and we can merge this without a test, although you've identified that we are missing test coverage for these target schedules, so after this is merged I'll follow up with another PR that fixes that.
Depthwise Conv2D operations may consist of a convolution + an output operator, e.g. ReLU. This commit will:

* Apply vectorization across the inner channel loop when there is an output operator.
* Remove some unused variables in `schedule_depthwise_conv2d_nhwc`.
* Limit the loop splitting to 8 elements in the inner loop.
lhutton1
left a comment
Thanks for the fix @FranklandJack! I agree with the comments about testing, perhaps we can create a separate issue to look at improving the testing coverage?
Cool, I've created #14759 to track this.
echuraev
left a comment
I'm sorry, I forgot about this PR. I'm OK with a separate issue. Let's merge this PR.