Skip to content

feat(bb): Introduce chunks for univariate computation for the AVM#12707

Merged
jeanmon merged 8 commits into
masterfrom
jm/sumcheck-chunks
Mar 16, 2025
Merged

feat(bb): Introduce chunks for univariate computation for the AVM#12707
jeanmon merged 8 commits into
masterfrom
jm/sumcheck-chunks

Conversation

@jeanmon

@jeanmon jeanmon commented Mar 13, 2025

Copy link
Copy Markdown
Contributor

This PR introduces a more even vertically distributed trace chunks processing among threads in the univariate computation as part of sumcheck. This leads to a substantial speed up of sumcheck.

Let t be the number of threads. The processing of the rows of a circuit used to be

  • thread_1 (round_size/t rows)
  • ...
  • thread_t (round_size/t rows) [end of circuit]

while this PR introduces the possibility of having the processing be interleaved, and therefore the load is more uniformly balanced across threads:

  • thread_1 (chunk_thread_portion_size rows)
  • ...
  • thread_t (chunk_thread_portion_size rows)
  • thread_1 (chunk_thread_portion_size rows)
  • ...
  • thread_t (chunk_thread_portion_size rows)
  • ...
  • [end of circuit]

This PR improves #12703 measurement from 8.5 seconds to 2.4 seconds.

@jeanmon jeanmon marked this pull request as ready for review March 13, 2025 09:27
@jeanmon jeanmon requested review from ledwards2225 and lucasxia01 and removed request for IlyasRidhuan and Maddiaa0 March 13, 2025 09:27
@fcarreiro

Copy link
Copy Markdown
Contributor

The gains look amazing! but... I think we should wait until we have the capacity to do a full trace in VM2 for this. Also I think think

  1. the current approach couples the sumcheck_round with the concept of avm flavor, and requires ifdefs, which kind of pollutes the code a lot. I think we can do better.
  2. we should have a more detailed PR explanation with tracy graphs (before/after) on the threads, explaining why this approach works both for short and fuller traces

@jeanmon jeanmon force-pushed the jm/sumcheck-chunks branch from 871a40e to f9f88dd Compare March 13, 2025 10:15
@jeanmon

jeanmon commented Mar 13, 2025

Copy link
Copy Markdown
Contributor Author

@fcarreiro I addressed 1) as it was indeed ugly.

@jeanmon jeanmon marked this pull request as draft March 13, 2025 10:44
@fcarreiro

Copy link
Copy Markdown
Contributor

The gains look amazing! but... I think we should wait until we have the capacity to do a full trace in VM2 for this. Also I think think

  1. the current approach couples the sumcheck_round with the concept of avm flavor, and requires ifdefs, which kind of pollutes the code a lot. I think we can do better.
  2. we should have a more detailed PR explanation with tracy graphs (before/after) on the threads, explaining why this approach works both for short and fuller traces

Pushed some changes to fix 1

Comment thread barretenberg/cpp/src/barretenberg/sumcheck/sumcheck_round.hpp
@fcarreiro

Copy link
Copy Markdown
Contributor

Having fixed (1) I'm not against merging this as long as crypto is ok with it. We should however revisit it later.

@jeanmon jeanmon force-pushed the jm/sumcheck-chunks branch from 8aba26d to f006b50 Compare March 13, 2025 16:38
Comment thread barretenberg/cpp/src/barretenberg/vm2/generated/flavor.hpp Outdated
Comment thread barretenberg/cpp/src/barretenberg/vm2/generated/flavor.hpp Outdated
@fcarreiro

Copy link
Copy Markdown
Contributor

Actually now it looks good to merge!

@jeanmon jeanmon marked this pull request as ready for review March 13, 2025 17:15
@fcarreiro fcarreiro changed the title chore: Introduce chunks for univariate computation for the AVM feat(bb): Introduce chunks for univariate computation for the AVM Mar 13, 2025

@lucasxia01 lucasxia01 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine, and the results are great! I'm not sure how well tested this code is - some of the logic is a little tricky so I would hope for some better testing. Also feel like it needs to be more readable.

Comment thread barretenberg/cpp/src/barretenberg/vm2/generated/flavor.hpp

// When the trace is shrunk to a point where the chunk portion size per thread is lower than 2,
// we fall back to a single chunk, i.e., we keep the "non-AVM" values.
if (thread_portion_size_candidate >= 2) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is there an if here? Seems unnecessary?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added more explanations and actually could simplify a bit the logic. See new version.

static_assert(Flavor::MAX_CHUNK_THREAD_PORTION_SIZE >= 2);
static_assert((Flavor::MAX_CHUNK_THREAD_PORTION_SIZE & (Flavor::MAX_CHUNK_THREAD_PORTION_SIZE - 1)) == 0);

const auto thread_portion_size_candidate =

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

naming is not great.. can't think of proper naming at the moment but at least requires a comment to what this is

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, not so easy to have great terminology. I did no change it but added plenty of explanations.

size_t num_threads = bb::calculate_num_threads_pow2(round_size, min_iterations_per_thread);
size_t iterations_per_thread = round_size / num_threads; // actual iterations per thread

// In the AVM, the trace is more dense at the top and therefore it is worth to split the work over the threads

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this section just has a lot of logic thats hard to follow. All of the divisions and unclear names make it hard to parse.

size_t iterations_per_thread = round_size / num_threads; // actual iterations per thread

// In the AVM, the trace is more dense at the top and therefore it is worth to split the work over the threads
// a bit more evenly on the vertical axis. To achieve this, we split the trace into chunks and each thread

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comment could have more detail in terms what you mean by "more evenly on the vertical axis"

@jeanmon jeanmon force-pushed the jm/sumcheck-chunks branch from 4dc4d59 to 56df29c Compare March 14, 2025 09:11
@jeanmon

jeanmon commented Mar 14, 2025

Copy link
Copy Markdown
Contributor Author

@lucasxia01 I added a significant number of explanations and explained the required properties to be satisfied and added a little proof of why the code satisfies this. I hope this gives enough confidence that the code is correct. it is a bit hard to unit test this. In any case, I think there is enough safeguard that the new code does not affect non-AVM parts in any way.

@jeanmon jeanmon requested a review from lucasxia01 March 14, 2025 09:16
@ledwards2225

Copy link
Copy Markdown
Contributor

@jeanmon This is more or less exactly what I had in mind with this issue. @lucasxia01 do you see any reason why this same mechanism isn't applicable for us in the PG context?

Copy link
Copy Markdown
Contributor

Personally I think this approach is noticeably better if you don't have uniform density (like in our case, we get many times improvement), and even if you do, it should work at least as good (assuming poor cache locality, which in any case you could probably improve with a bigger thread chunk; but that needs hard data).

@lucasxia01 lucasxia01 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fantastic comments!

@lucasxia01

Copy link
Copy Markdown
Contributor

@ledwards2225 It's not clear exactly what we do in PG, but I could see it applying similarly. It might not be as effective or easy to implement because of how we structure the trace so the nonzero blocks are all over the place.

@jeanmon jeanmon merged commit c912bd6 into master Mar 16, 2025
@jeanmon jeanmon deleted the jm/sumcheck-chunks branch March 16, 2025 16:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants