Add TPU7x specific CI test workflows #4294

Open
darisoy wants to merge 1 commit into main from darisoy-tpu7x-ci-workflows
@darisoy darisoy commented Jun 29, 2026

Collaborator

Description

This PR integrates automated GitHub Actions CI test suites for Google's newly available TPU v7x runner fleet (linux-x86-tpu7x-224-4tpu), establishing parity with our existing TPU (v6e-4), GPU, and CPU continuous integration pipelines.

Motivation & Problem Solved

  • Hardware Onboarding: Previously, continuous integration ran exclusively against 4-chip TPU v6e runners (device_count == 4). Onboarding TPU7X introduces 4-chip VMs exposing 2 TensorCores per chip (device_count == 8), enabling continuous testing of 8-way sharding meshes and preventing regressions on v7x hardware.
  • Sharding Indivisibility Fix: In tests/integration/train_tests.py, the simulated small-model tests hardcoded base_emb_dim = 28 in the non-decoupled case. On TPU7X (8 devices), 28 is not divisible by 8, which triggered sharding verification errors. This PR instead rounds the embedding dimension up to the nearest multiple of the device count (((28 + device_count - 1) // device_count) * device_count), giving 28 on 4 devices and 32 on 8, so the dimension divides evenly on any device topology. It also removes unused decoupled parameters.
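
For reviewers who want to sanity-check the rounding expression, here is a minimal standalone sketch. The helper name round_up_to_multiple is hypothetical and does not appear in the PR; plain integers stand in for the device count that the test reads at runtime.

```python
def round_up_to_multiple(value: int, multiple: int) -> int:
    """Round `value` up to the nearest multiple of `multiple`.

    Same arithmetic as the expression in the PR description:
    ceiling-divide by `multiple`, then scale back up.
    """
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    return ((value + multiple - 1) // multiple) * multiple


# Hypothetical stand-ins for the device counts discussed above
# (4 devices on v6e-4, 8 devices on tpu7x-8).
for device_count in (4, 8):
    base_emb_dim = round_up_to_multiple(28, device_count)
    # The result is always divisible by the device count.
    assert base_emb_dim % device_count == 0
    print(device_count, base_emb_dim)
```

On 4 devices the value stays at 28, so the existing v6e-4 behavior is unchanged; on 8 devices it becomes 32.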

Implementation Details

  • .github/workflows/ci_pipeline.yml: Added the tpu7x-tests job matrix alongside existing test suites, triggering tpu7x-unit, tpu7x-integration, and tpu7x-post-training-unit runs. Updated status aggregation (all_tests_passed / notify_failure / investigate_failure) to enforce TPU7X pass requirements before merge.
  • .github/workflows/run_tests_coordinator.yml: Configured job dispatch routing for TPU7X matrix flavors, mapping them to self-hosted runner linux-x86-tpu7x-224-4tpu targeting device tpu7x-8 using container maxtext-unit-test-tpu:py312.
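
The routing described above can be pictured as a lookup from matrix flavor to runner target. The sketch below is only an illustration in Python, not the contents of run_tests_coordinator.yml: the RunnerTarget class, the ROUTES table, and the resolve function are hypothetical names, while the flavor, runner, device, and container strings are taken from this description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunnerTarget:
    """Hypothetical record of where a test flavor is dispatched."""
    runner: str
    device: str
    container: str


# Values copied from the PR description; the structure itself is an assumption.
TPU7X_TARGET = RunnerTarget(
    runner="linux-x86-tpu7x-224-4tpu",
    device="tpu7x-8",
    container="maxtext-unit-test-tpu:py312",
)

# All three TPU7X flavors share the same runner target.
ROUTES = {
    flavor: TPU7X_TARGET
    for flavor in ("tpu7x-unit", "tpu7x-integration", "tpu7x-post-training-unit")
}


def resolve(flavor: str) -> RunnerTarget:
    """Return the runner target for a flavor, failing loudly on unknown names."""
    try:
        return ROUTES[flavor]
    except KeyError:
        raise ValueError(f"no route configured for flavor {flavor!r}") from None


print(resolve("tpu7x-integration"))
```

Failing on an unknown flavor, rather than falling back to a default runner, mirrors the intent that each matrix entry is routed explicitly.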

Shortcomings & Future Improvements

  • Temporarily Ignored Test: In run_tests_coordinator.yml, tests/inference/kvcache_test.py is temporarily ignored for tpu7x-unit. On 8-device meshes, compiling the Multi-Head Latent Attention (MLA) KV cache update kernel (test_update_kv_cache) causes the XLA compiler to loop or hang on the current libtpu backend, pushing execution past the pod timeout. Re-enabling this test, via a compiler bump or a layout adaptation, is tracked in b/529360553.
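
One way to picture a per-flavor ignore is a small table that is expanded into pytest's --ignore option. This is a sketch under assumptions, not the workflow's actual mechanism: IGNORED_TESTS, pytest_args, and the default test root are hypothetical, and only the flavor name and the ignored path come from this description.

```python
# Hypothetical per-flavor ignore list; the single entry mirrors the PR description.
IGNORED_TESTS = {
    "tpu7x-unit": ["tests/inference/kvcache_test.py"],
}


def pytest_args(flavor: str, test_root: str = "tests") -> list[str]:
    """Build a pytest argument list, adding --ignore=<path> for each ignored file."""
    args = [test_root]
    args += [f"--ignore={path}" for path in IGNORED_TESTS.get(flavor, [])]
    return args


print(pytest_args("tpu7x-unit"))
print(pytest_args("tpu-unit"))
```

Keeping the ignore scoped to tpu7x-unit means the test continues to run on the v6e-4 flavor, so re-enabling it later is a one-line removal.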

BUGS: b/527927834, b/529360553, b/527504273, b/529379676

Tests

Tested on a remote TPU VM (darisoy-gvnic-test) by dispatching the workflow against the self-hosted runner pools with the GitHub CLI:

gh workflow run ci_pipeline.yml --ref darisoy-tpu7x-ci-workflows

Test Execution Comparison

All required test suites completed with 0 failures. Below is the exact execution breakdown comparing standard TPU (v6e-4) against TPU7X (tpu7x-8):

| Workflow Suite | Accelerator Hardware | Passed | Skipped | Deselected | Failed | Execution Notes |
| --- | --- | --- | --- | --- | --- | --- |
| tpu-unit | TPU v6e-4 (4 chips) | 939 | 1,208 | 335 | 0 | Standard baseline unit test execution (~25m). |
| tpu7x-unit | TPU v7x-8 (4 chips / 8 cores) | 925 | 1,216 | 341 | 0 | 14 fewer tests passed than tpu-unit (8 more skipped, 6 more deselected). |
| tpu-integration | TPU v6e-4 (4 chips) | 68 | 4 | 2,380 | 0 | Full integration test suite (~23m). |
| tpu7x-integration | TPU v7x-8 (4 chips / 8 cores) | 62 | 10 | 2,380 | 0 | 6 additional tests skipped due to platform-specific AOT/hardware markers (~22m). |
| tpu-post-training-unit | TPU v6e-4 (4 chips) | 91 | 1 | 2,488 | 0 | Post-training lightweight RL/alignment unit suite (~2.5m). |
| tpu7x-post-training-unit | TPU v7x-8 (4 chips / 8 cores) | 91 | 1 | 2,488 | 0 | Same counts as standard TPU (~3m). |
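
The counts above can be cross-checked with a few lines of arithmetic. The sketch below copies the reported numbers into a dictionary and compares each TPU7X suite against its v6e-4 baseline; the RESULTS table and the compare helper are illustrative names, not part of the PR.

```python
# (passed, skipped, deselected, failed) per suite, copied from the comparison above.
RESULTS = {
    "tpu-unit": (939, 1208, 335, 0),
    "tpu7x-unit": (925, 1216, 341, 0),
    "tpu-integration": (68, 4, 2380, 0),
    "tpu7x-integration": (62, 10, 2380, 0),
    "tpu-post-training-unit": (91, 1, 2488, 0),
    "tpu7x-post-training-unit": (91, 1, 2488, 0),
}

FIELDS = ("passed", "skipped", "deselected", "failed")


def compare(baseline: str, candidate: str) -> dict:
    """Report whether both suites account for the same number of tests, plus per-field deltas."""
    base, cand = RESULTS[baseline], RESULTS[candidate]
    return {
        "same_total": sum(base) == sum(cand),
        "deltas": {name: c - b for name, b, c in zip(FIELDS, base, cand)},
    }


for suite in ("unit", "integration", "post-training-unit"):
    print(suite, compare(f"tpu-{suite}", f"tpu7x-{suite}"))
```

Each pair accounts for the same total number of tests, so every test that no longer passes on TPU7X is explained by a skip or a deselection rather than by a failure.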

Follow-up Tracking

  • b/529360553: Follow-up task to investigate the XLA HLO compilation trace on 8 devices and re-enable tests/inference/kvcache_test.py in tpu7x-unit.
  • b/527504273: Follow-up task to support large DeepSeek MoE/Engram scanned decoder tests on TPU7X (removing @pytest.mark.skip_on_tpu7x from moe_test.py).

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

codecov Bot commented Jun 29, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.

darisoy force-pushed the darisoy-tpu7x-ci-workflows branch from 9910c25 to aea293a (June 30, 2026 00:44)