feat(bb): Introduce chunks for univariate computation for the AVM#12707
The gains look amazing! But... I think we should wait until we have the capacity to do a full trace in VM2 for this. Also I think
|
871a40e to f9f88dd
|
@fcarreiro I addressed (1), as it was indeed ugly.
Pushed some changes to fix (1).
|
Having fixed (1), I'm not against merging this as long as crypto is OK with it. We should, however, revisit it later.
8aba26d to f006b50
|
Actually now it looks good to merge!
lucasxia01
left a comment
Looks fine, and the results are great! I'm not sure how well tested this code is; some of the logic is a little tricky, so I would hope for some better testing. I also feel like it needs to be more readable.

    // When the trace is shrunk to a point where the chunk portion size per thread is lower than 2,
    // we fall back to a single chunk, i.e., we keep the "non-AVM" values.
    if (thread_portion_size_candidate >= 2) {
Why is there an if here? It seems unnecessary.
I added more explanations and was actually able to simplify the logic a bit. See the new version.
    static_assert(Flavor::MAX_CHUNK_THREAD_PORTION_SIZE >= 2);
    static_assert((Flavor::MAX_CHUNK_THREAD_PORTION_SIZE & (Flavor::MAX_CHUNK_THREAD_PORTION_SIZE - 1)) == 0);

    const auto thread_portion_size_candidate =
The naming is not great. I can't think of a proper name at the moment, but this at least requires a comment explaining what it is.
Yeah, it's not so easy to come up with great terminology. I did not change it but added plenty of explanations.
    size_t num_threads = bb::calculate_num_threads_pow2(round_size, min_iterations_per_thread);
    size_t iterations_per_thread = round_size / num_threads; // actual iterations per thread

    // In the AVM, the trace is more dense at the top and therefore it is worth to split the work over the threads
This section just has a lot of logic that's hard to follow. All of the divisions and unclear names make it hard to parse.
    size_t iterations_per_thread = round_size / num_threads; // actual iterations per thread

    // In the AVM, the trace is more dense at the top and therefore it is worth to split the work over the threads
    // a bit more evenly on the vertical axis. To achieve this, we split the trace into chunks and each thread
This comment could give more detail on what you mean by "more evenly on the vertical axis".
4dc4d59 to 56df29c
|
@lucasxia01 I added a significant number of explanations, spelled out the properties that must be satisfied, and added a short proof of why the code satisfies them. I hope this gives enough confidence that the code is correct. It is a bit hard to unit test this. In any case, I think there are enough safeguards that the new code does not affect non-AVM parts in any way.
|
@jeanmon This is more or less exactly what I had in mind with this issue. @lucasxia01 do you see any reason why this same mechanism isn't applicable for us in the PG context?
|
Personally, I think this approach is noticeably better if you don't have uniform density (in our case, we get a many-fold improvement), and even if you do, it should work at least as well (assuming poor cache locality, which in any case you could probably improve with a bigger thread chunk; but that needs hard data).
lucasxia01
left a comment
Thanks for the fantastic comments!
|
@ledwards2225 It's not clear exactly what we do in PG, but I could see it applying similarly. It might not be as effective or as easy to implement because of how we structure the trace: the nonzero blocks are all over the place.
This PR distributes the processing of trace chunks among threads more evenly along the vertical axis in the univariate computation that is part of sumcheck. This leads to a substantial speed-up of sumcheck.
Let t be the number of threads. Previously, the rows of a circuit were processed in contiguous blocks: going from the start to the end of the circuit, each thread processed one block of round_size/t rows. This PR introduces the possibility of interleaving the processing, so that the load is balanced more uniformly across threads: the threads take turns, each processing chunk_thread_portion_size rows at a time.
This PR improves the #12703 measurement from 8.5 seconds to 2.4 seconds.
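The two row-to-thread assignments described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the code from this PR: the helper names (contiguous_owner, interleaved_owner, max_load_contiguous, max_load_interleaved) are hypothetical, num_threads is assumed to divide round_size, and "only the first active_rows rows carry work" is a simplified stand-in for a trace that is denser at the top.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: owner of `row` when each thread gets one contiguous
// block of round_size / num_threads rows (assumes num_threads divides round_size).
std::size_t contiguous_owner(std::size_t row, std::size_t round_size, std::size_t num_threads)
{
    return row / (round_size / num_threads);
}

// Hypothetical helper: owner of `row` when threads take turns, each handling
// chunk_thread_portion_size consecutive rows before the next thread takes over.
std::size_t interleaved_owner(std::size_t row, std::size_t num_threads, std::size_t chunk_thread_portion_size)
{
    return (row / chunk_thread_portion_size) % num_threads;
}

// Largest number of "active" rows assigned to a single thread, assuming only
// the first `active_rows` rows carry work (a simplified top-dense trace).
std::size_t max_load_contiguous(std::size_t round_size, std::size_t num_threads, std::size_t active_rows)
{
    std::vector<std::size_t> load(num_threads, 0);
    for (std::size_t row = 0; row < active_rows; ++row) {
        ++load[contiguous_owner(row, round_size, num_threads)];
    }
    return *std::max_element(load.begin(), load.end());
}

std::size_t max_load_interleaved(std::size_t num_threads,
                                 std::size_t chunk_thread_portion_size,
                                 std::size_t active_rows)
{
    std::vector<std::size_t> load(num_threads, 0);
    for (std::size_t row = 0; row < active_rows; ++row) {
        ++load[interleaved_owner(row, num_threads, chunk_thread_portion_size)];
    }
    return *std::max_element(load.begin(), load.end());
}
```

For example, with 32 rows, 4 threads, and work only in the first 8 rows, max_load_contiguous(32, 4, 8) puts all 8 active rows on thread 0, while max_load_interleaved(4, 2, 8) gives each thread 2 of them. The slowest thread determines the wall-clock time, which is why the interleaved layout can help when the density is not uniform.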