Add TPU7x specific CI test workflows by darisoy · Pull Request #4294 · AI-Hypercomputer/maxtext

darisoy · 2026-06-29T17:33:46Z

Description

This PR integrates automated GitHub Actions CI test suites for Google's newly available TPU v7x runner fleet (linux-x86-tpu7x-224-4tpu), establishing parity with our existing TPU (v6e-4), GPU, and CPU continuous integration pipelines.

Motivation & Problem Solved

Hardware Onboarding: Previously, continuous integration ran exclusively against 4-chip TPU v6e runners (device_count == 4). Onboarding TPU7X introduces 4-chip VMs exposing 2 TensorCores per chip (device_count == 8), enabling continuous testing of 8-way sharding meshes and preventing regressions on v7x hardware.
Sharding Indivisibility Fix: In tests/integration/train_tests.py, the simulated small-model tests hardcoded base_emb_dim = 28 when non-decoupled. On TPU7X (8 devices), 28 is not divisible by 8, which triggers sharding verification errors. This PR instead rounds the embedding dimension up to the nearest multiple of the device count (((28 + device_count - 1) // device_count) * device_count), which yields 28 on 4 devices and 32 on 8 devices, so the dimension divides cleanly on any device topology. It also removes unused decoupled parameters.
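The rounding expression above can be checked in isolation. This is a minimal sketch, not the code from train_tests.py: the helper name round_up_to_multiple is hypothetical, and device_count is passed in as a plain integer rather than queried from the runtime.

```python
def round_up_to_multiple(value: int, multiple: int) -> int:
    """Round `value` up to the nearest multiple of `multiple`.

    Hypothetical helper; it mirrors the expression quoted in the PR text:
    ((28 + device_count - 1) // device_count) * device_count
    """
    if multiple <= 0:
        raise ValueError("multiple must be a positive integer")
    return ((value + multiple - 1) // multiple) * multiple


# 4 devices matches the v6e-4 runners, 8 devices matches the tpu7x-8 runners.
for device_count in (4, 8):
    emb_dim = round_up_to_multiple(28, device_count)
    # The rounded dimension is always evenly divisible by the device count.
    assert emb_dim % device_count == 0
    print(device_count, emb_dim)
```

On 4 devices the value stays at 28, so the existing v6e-4 behavior should be unchanged; only topologies where 28 does not divide evenly get a larger dimension.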

Implementation Details

.github/workflows/ci_pipeline.yml: Added the tpu7x-tests job matrix alongside existing test suites, triggering tpu7x-unit, tpu7x-integration, and tpu7x-post-training-unit runs. Updated status aggregation (all_tests_passed / notify_failure / investigate_failure) to enforce TPU7X pass requirements before merge.
.github/workflows/run_tests_coordinator.yml: Configured job dispatch routing for TPU7X matrix flavors, mapping them to self-hosted runner linux-x86-tpu7x-224-4tpu targeting device tpu7x-8 using container maxtext-unit-test-tpu:py312.
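The dispatch routing described above lives in workflow YAML, which is not reproduced on this page. As a rough illustration only, the mapping can be sketched as a Python lookup. The runner, device, and container values and the three flavor names come from the description; the dictionary layout and the route function are assumptions made for this example.

```python
# Values taken from the PR description; the structure is hypothetical.
TPU7X_TARGET = {
    "runner": "linux-x86-tpu7x-224-4tpu",
    "device": "tpu7x-8",
    "container": "maxtext-unit-test-tpu:py312",
}

TPU7X_FLAVORS = (
    "tpu7x-unit",
    "tpu7x-integration",
    "tpu7x-post-training-unit",
)


def route(flavor: str) -> dict:
    """Return the runner settings for a TPU7X flavor (illustrative only)."""
    if flavor in TPU7X_FLAVORS:
        # Copy so callers cannot mutate the shared mapping.
        return dict(TPU7X_TARGET)
    raise KeyError(f"no route sketched for flavor {flavor!r}")


print(route("tpu7x-integration")["runner"])
```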

Shortcomings & Future Improvements

Temporarily Ignored Test: In run_tests_coordinator.yml, tests/inference/kvcache_test.py is temporarily ignored for tpu7x-unit. On 8-device meshes, compiling the Multi-Head Latent Attention (MLA) KV cache update kernel (test_update_kv_cache) triggers an XLA compiler loop/hang on the current libtpu backend, pushing execution past pod timeouts. Re-enabling this test via a compiler bump or layout adaptation is tracked in our follow-up child bug.
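How the coordinator workflow excludes the file is not shown on this page. One common way to do it is pytest's --ignore option, and the sketch below assumes that approach. The IGNORED_BY_FLAVOR table and the build_pytest_args function are hypothetical names used only for illustration.

```python
# Hypothetical per-flavor ignore list; only the tpu7x-unit entry is
# grounded in the PR description.
IGNORED_BY_FLAVOR = {
    "tpu7x-unit": ["tests/inference/kvcache_test.py"],
}


def build_pytest_args(flavor: str) -> list:
    """Build a pytest argument list, adding --ignore for excluded files."""
    args = ["pytest"]
    for path in IGNORED_BY_FLAVOR.get(flavor, []):
        # --ignore=<path> tells pytest not to collect that file at all.
        args.append(f"--ignore={path}")
    return args


print(build_pytest_args("tpu7x-unit"))
print(build_pytest_args("tpu-unit"))
```

Keeping the exclusion keyed by flavor means the v6e-4 unit suite would continue to run kvcache_test.py, and re-enabling the test on TPU7X would be a one-line removal.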

BUGS: b/527927834, b/529360553, b/527504273, b/529379676

Tests

Tested extensively on remote TPU VM (darisoy-gvnic-test) by triggering workflow dispatches against self-hosted runner pools via GitHub Actions CLI:

gh workflow run ci_pipeline.yml --ref darisoy-tpu7x-ci-workflows

Test Execution Comparison

All required test suites completed with 0 failures. Below is the exact execution breakdown comparing standard TPU (v6e-4) against TPU7X (tpu7x-8):

| Workflow Suite | Accelerator Hardware | Passed | Skipped | Deselected | Execution Notes |
| --- | --- | --- | --- | --- | --- |
| `tpu-unit` | TPU v6e-4 (4 chips) | 939 | 1,208 | 335 | Standard baseline unit test execution (~25m). |
| `tpu7x-unit` | TPU v7x-8 (4 chips / 8 cores) | 925 | 1,216 | 341 | Similar result to `tpu-unit`. |
| `tpu-integration` | TPU v6e-4 (4 chips) | 68 | 4 | 2,380 | Full integration test suite (~23m). |
| `tpu7x-integration` | TPU v7x-8 (4 chips / 8 cores) | 62 | 10 | 2,380 | 6 additional tests skipped due to platform-specific AOT/hardware markers (~22m). |
| `tpu-post-training-unit` | TPU v6e-4 (4 chips) | 91 | 1 | 2,488 | Post-training lightweight RL/alignment unit suite (~2.5m). |
| `tpu7x-post-training-unit` | TPU v7x-8 (4 chips / 8 cores) | 91 | 1 | 2,488 | Identical 1-to-1 execution to standard TPU (~3m). |
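As a quick consistency check on the numbers above, the counts can be summed per suite. This sketch assumes that passed, skipped, and deselected together account for every collected test, which the table does not state explicitly.

```python
# Counts copied from the comparison table: (passed, skipped, deselected).
RESULTS = {
    "tpu-unit": (939, 1208, 335),
    "tpu7x-unit": (925, 1216, 341),
    "tpu-integration": (68, 4, 2380),
    "tpu7x-integration": (62, 10, 2380),
    "tpu-post-training-unit": (91, 1, 2488),
    "tpu7x-post-training-unit": (91, 1, 2488),
}


def collected(suite: str) -> int:
    """Total tests for a suite under the stated assumption."""
    return sum(RESULTS[suite])


def deltas(baseline: str, candidate: str) -> tuple:
    """Per-column difference (candidate minus baseline)."""
    return tuple(c - b for b, c in zip(RESULTS[baseline], RESULTS[candidate]))


for name in ("unit", "integration", "post-training-unit"):
    base, cand = f"tpu-{name}", f"tpu7x-{name}"
    print(name, collected(base), collected(cand), deltas(base, cand))
```

Under that assumption the totals match within each pair (2,482, 2,452, and 2,580), so the lower pass counts on TPU7X correspond exactly to extra skipped or deselected tests: 14 in the unit suite (8 skipped, 6 deselected) and 6 in the integration suite (all skipped).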

Follow-up Tracking

b/529360553: Follow-up task to investigate the XLA HLO compilation trace on 8 devices and re-enable tests/inference/kvcache_test.py in tpu7x-unit.
b/527504273: Follow-up task to support large DeepSeek MoE/Engram scanned decoder tests on TPU7X (removing @pytest.mark.skip_on_tpu7x from moe_test.py).

Checklist

Before submitting this PR, please make sure (put X in square brackets):

I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
I have necessary comments in my code, particularly in hard-to-understand areas.
I have run end-to-end tests and provided workload links above if applicable.
I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

codecov · 2026-06-29T17:38:11Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

darisoy marked this pull request as ready for review June 29, 2026 17:58

darisoy force-pushed the darisoy-tpu7x-ci-workflows branch from 1d26c33 to 9910c25 Compare June 29, 2026 18:56

igorts-git approved these changes Jun 29, 2026

Add TPU7x specific CI test workflows

aea293a

darisoy force-pushed the darisoy-tpu7x-ci-workflows branch from 9910c25 to aea293a Compare June 30, 2026 00:44

darisoy wants to merge 1 commit into main from darisoy-tpu7x-ci-workflows
