
Use tree reduction in dense gather reduce#4296

Open
BirdsOfAFthr wants to merge 1 commit into main from dense-gr


@BirdsOfAFthr

Collaborator

Description

Previously, we emitted a sequential dependency chain of reduce_group_size - 1 vector additions, where reduce_group_size equals topk: 3 dependent additions for topk = 4 and 7 for topk = 8. Each addition had to wait for the previous one to finish on the vector ALU before it could start, which caused pipeline stalls.

This PR replaces the chain with a tree-structured reduction that generates the additions in a binary tree shape, such as (a + b) + (c + d). The total number of additions is still topk - 1, but the critical path of dependent additions shrinks to log2(topk) when topk is a power of two:

  • For topk = 4: critical path reduced from 3 to 2 additions (~1.2% speedup).
  • For topk = 8: critical path reduced from 7 to 3 additions (~4.2% speedup).

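This optimization reduces stalls, and its benefit grows with topk: the length of the sequential chain grows linearly with topk, while the depth of the tree grows only logarithmically.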
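
The difference in critical path can be illustrated with a small standalone sketch. This is not the kernel code from this PR. It is plain Python, and the names `Tracked`, `chain_reduce`, `tree_reduce`, and `critical_path` are hypothetical helpers introduced only for illustration. Each `Tracked` operand records how many dependent additions were needed to produce it, so the depth of the final result is the critical path length.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tracked:
    """Hypothetical helper: a value plus the number of dependent additions behind it."""
    value: float
    depth: int = 0

    def __add__(self, other):
        # An addition can only start once both inputs are ready,
        # so its depth is one more than the deeper input.
        return Tracked(self.value + other.value, max(self.depth, other.depth) + 1)


def chain_reduce(operands):
    """Sequential reduction: every addition depends on the previous one."""
    acc = operands[0]
    for x in operands[1:]:
        acc = acc + x
    return acc


def tree_reduce(operands):
    """Pairwise (binary tree) reduction over a non-empty sequence."""
    level = list(operands)
    while len(level) > 1:
        # Add neighbours pairwise; these additions are independent of each other.
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            # An unpaired last element is carried to the next level unchanged.
            nxt.append(level[-1])
        level = nxt
    return level[0]


def critical_path(reduce_fn, topk):
    """Length of the longest chain of dependent additions for `topk` operands."""
    return reduce_fn([Tracked(float(i)) for i in range(topk)]).depth


if __name__ == "__main__":
    for topk in (4, 8):
        print(topk, critical_path(chain_reduce, topk), critical_path(tree_reduce, topk))
```

Both functions perform the same topk - 1 additions and differ only in how those additions depend on one another. Because the tree changes the order of summation, floating-point results can differ from the chain in the last bits for general inputs.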

Tests

To check the correctness of the change, we ran the dense gather-reduce kernel unit tests on the TPU VM, inside the node Docker container:

pytest tests/kernels/dense_gather_reduce_test.py

All 54 test cases passed, confirming that the tree-structured reduction produces correct results.

Checklist

Before submitting this PR, please make sure:

  • I have performed a self-review of my code.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have made or will make corresponding changes to any relevant documentation.

@codecov

codecov Bot commented Jun 29, 2026


Codecov Report

❌ Patch coverage is 0% with 10 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
src/maxtext/kernels/gather_reduce_pallas.py | 0.00% | 10 Missing ⚠️

