
[AMD] Add kimi mi35x nightly test, folder organization and several stability fixes#17895

Merged
HaiShaw merged 10 commits into sgl-project:main from michaelzhang-ai:add_kimi_mi35x_nightly_test
Feb 4, 2026

Conversation

michaelzhang-ai (Collaborator) commented on Jan 28, 2026

Motivation

Add a Kimi-K2 MI35x nightly test, restructure the test suite, and fix a gpt-oss accuracy issue.
Please help review: @yctseng0211 @bingxche.
Nightly: https://github.com/sgl-project/sglang/actions/runs/21611463422?pr=17895

Modifications

Add MI35x Kimi-K2 nightly accuracy test and fix several MI35x test stability issues.

Changes

New Test:

  • Add test_kimi_k2_eval_mi35x.py for Kimi-K2 accuracy evaluation on MI35x

Test Organization:

  • Move non-MI35x accuracy/perf tests into mi30x/ subdirectories

Stability Fixes:

  • Fix a gpt-oss accuracy issue on MI35x
  • Lower the Grok-2 MI35x accuracy threshold to 0.90 for test stability
  • Increase the Kimi-K2 timeout to 180 minutes and add a 20-minute watchdog timeout
  • Increase the Grok1-INT4 MI35x timeout from 60 to 90 minutes
  • Disable profiling for AMD perf tests (keep it enabled for NVIDIA)
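A watchdog timeout of the kind described above can be sketched as follows. This is a minimal illustration only; the helper name `run_with_watchdog` and its signature are hypothetical and not the PR's actual implementation:

```python
import subprocess


def run_with_watchdog(cmd, watchdog_timeout_s):
    """Run a benchmark command, aborting if it does not finish within the
    watchdog window (hypothetical helper, not the PR's actual code)."""
    try:
        # subprocess.run kills the child process on timeout before raising
        return subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=watchdog_timeout_s,  # e.g. 20 * 60 for a 20-minute watchdog
        )
    except subprocess.TimeoutExpired:
        raise RuntimeError(f"watchdog timeout after {watchdog_timeout_s}s")
```

A hung benchmark then fails fast with a clear error instead of consuming the whole 180-minute job timeout.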


gemini-code-assist (Contributor) commented:
Summary of Changes

Hello @michaelzhang-ai, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request significantly enhances the continuous integration and testing framework for AMD GPUs by introducing a structured approach to categorize tests by hardware generation (MI30x and MI35x). It expands test coverage with new accuracy and performance benchmarks for the DeepSeek-V3.2 and Kimi-K2 models across various configurations, ensuring that new models and optimizations are thoroughly validated. The changes also refine existing tests with improved logging and adjusted thresholds, contributing to a more reliable and informative testing pipeline.

Highlights

  • Test Suite Restructuring: AMD-specific accuracy and performance tests have been reorganized into dedicated mi30x and mi35x subdirectories. This change improves the logical grouping and management of tests based on the target GPU architecture.
  • New DeepSeek-V3.2 Tests: Comprehensive nightly accuracy and performance benchmarks for the DeepSeek-V3.2 model have been introduced. These tests cover various configurations including Data Parallel (DP), Multi-Token Prediction (MTP) with EAGLE speculative decoding, and Torch Compile, ensuring broad validation across both MI30x and MI35x platforms.
  • New Kimi-K2 Tests: Nightly accuracy evaluation tests for the Kimi-K2 model have been added for both MI30x and MI35x platforms, expanding the coverage of supported models.
  • Enhanced Test Visibility: Existing accuracy tests for several models (DeepSeek-R1, DeepSeek-V3.1, GPT-OSS, Grok1, Grok2) now include explicit print statements for accuracy and threshold values, providing clearer feedback during CI runs.
  • Accuracy Threshold Adjustments: Accuracy thresholds have been updated for several Qwen2 and Mixtral models in the test_gsm8k_eval_amd.py test, and the gpt-oss-20b-bf16 model's threshold was slightly lowered to reflect expected performance.
  • VLM Model Exclusion: The zai-org/GLM-4.1V-9B-Thinking model has been added to the list of known failing VLM models for AMD, indicating a current incompatibility or issue.
  • Robust Performance Benchmarking: Performance tests for MI35x now include watchdog timeouts and utilize a new utility function (_run_benchmark_with_timeout) to ensure more robust handling of server launch and execution, preventing indefinite hangs.


Ignored Files
  • Ignored by pattern: .github/workflows/** (1)
    • .github/workflows/nightly-test-amd.yml

gemini-code-assist bot left a comment:

Code Review

This pull request adds a significant number of new nightly tests for Kimi and DeepSeek-V3.2 models on AMD MI30x and MI35x platforms. It also refactors the test directory structure to separate tests by hardware (mi30x, mi35x). The changes are extensive and improve test coverage for new models and hardware. My review focuses on improving code consistency and reducing duplication in the newly added test files. I've pointed out opportunities to refactor duplicated benchmark logic and to align server process management with existing patterns in the codebase.

Individual review comments could not be created, so the feedback is listed below.

test/registered/amd/accuracy/mi30x/test_deepseek_v32_eval_amd.py (91-163)

Severity: medium

This file defines local helper functions (get_one_example, get_few_shot_examples, get_answer_value) and a benchmark runner (run_gsm8k_benchmark) for the GSM8K evaluation. This introduces significant code duplication, as similar logic exists in other test files and shared utilities. Other new tests for deepseek-v32 in this PR (e.g., test_deepseek_v32_dp_eval_amd.py) use the shared run_eval_few_shot_gsm8k function from sglang.test.few_shot_gsm8k. To improve maintainability and consistency, please refactor this test to use the shared run_eval_few_shot_gsm8k function and remove the duplicated helper functions.

test/registered/amd/accuracy/mi35x/test_kimi_k2_eval_mi35x.py (35-101)

Severity: medium

The server process management is inconsistent with other similar tests in this PR (e.g., test_kimi_k2_eval_amd.py). The server is launched and torn down within the test method test_kimi_k2_gsm8k_accuracy. It's better practice to use setUpClass to launch the server and tearDownClass to terminate it. This ensures the server is managed at the class level, started once for all tests in the class, and reliably cleaned up.

class TestKimiK2EvalMI35x(CustomTestCase):
    """Kimi-K2 GSM8K Completion Evaluation Test for AMD MI35x."""

    @classmethod
    def setUpClass(cls):
        cls.base_url = DEFAULT_URL_FOR_TEST
        other_args = [
            "--tp",
            "8",
            "--decode-attention-backend",
            "triton",
            "--prefill-attention-backend",
            "aiter",
            "--trust-remote-code",
            "--model-loader-extra-config",
            '{"enable_multithread_load": true}',
        ]
        env = os.environ.copy()
        env["SGLANG_USE_AITER"] = "1"
        env["SGLANG_ROCM_FUSED_DECODE_MLA"] = "0"

        cls.process = popen_launch_server(
            KIMI_K2_MODEL_PATH,
            cls.base_url,
            timeout=SERVER_LAUNCH_TIMEOUT,
            other_args=other_args,
            env=env,
        )

    @classmethod
    def tearDownClass(cls):
        kill_process_tree(cls.process.pid)

    def test_kimi_k2_gsm8k_accuracy(self):
        """Test Kimi-K2 with GSM8K few-shot completion benchmark."""
        requests.get(self.base_url + "/flush_cache")

        args = SimpleNamespace(
            num_shots=8,
            data_path=None,
            num_questions=1319,
            parallel=1319,
            max_new_tokens=512,
            host="http://127.0.0.1",
            port=int(self.base_url.split(":")[-1]),
        )
        metrics = run_eval_few_shot_gsm8k(args)
        acc = metrics["accuracy"]

        passed = acc >= ACCURACY_THRESHOLD
        status = "✅ PASS" if passed else "❌ FAIL"
        print(f"  accuracy={acc:.3f} threshold={ACCURACY_THRESHOLD} {status}")

        if is_in_ci():
            summary = "### Kimi-K2 Model (MI35x)\n\n"
            summary += "| Model | TP | Accuracy | Threshold | Status |\n"
            summary += "| ----- | -- | -------- | --------- | ------ |\n"
            summary += f"| {KIMI_K2_MODEL_PATH} | 8 | {acc:.3f} | {ACCURACY_THRESHOLD} | {status} |\n"
            write_github_step_summary(summary)

        self.assertGreaterEqual(
            acc,
            ACCURACY_THRESHOLD,
            f"Kimi-K2 accuracy {acc:.3f} below threshold {ACCURACY_THRESHOLD}",
        )

test/registered/amd/perf/mi35x/test_deepseek_v32_mtp_perf_mi35x.py (65-108)

Severity: medium

The new helper function _run_benchmark_with_timeout appears to be a copy of NightlyBenchmarkRunner.run_benchmark_for_model with the main difference being the addition of a custom server launch timeout. This introduces significant code duplication. A more maintainable approach would be to extend NightlyBenchmarkRunner.run_benchmark_for_model to accept an optional timeout parameter. If modifying NightlyBenchmarkRunner is not feasible in this PR, please add a code comment explaining why this duplication is necessary as a temporary measure.
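The suggested extension could look roughly like this. This is a sketch under the assumption that the runner wraps a server-launch step; `NightlyBenchmarkRunner`'s real signature, default timeout, and return value may differ:

```python
# Hypothetical sketch: extend the runner with an optional launch timeout
# instead of duplicating the whole method (names and defaults are illustrative).
DEFAULT_SERVER_LAUNCH_TIMEOUT_S = 600


class NightlyBenchmarkRunner:
    def run_benchmark_for_model(self, model, launch_timeout_s=None):
        """Launch the server and run the benchmark, honoring an optional
        per-call server-launch timeout."""
        timeout = launch_timeout_s or DEFAULT_SERVER_LAUNCH_TIMEOUT_S
        # ... launch server with `timeout`, run benchmark, collect metrics ...
        return {"model": model, "launch_timeout_s": timeout}
```

Callers that need a longer launch window (as the MI35x MTP test does) would pass `launch_timeout_s` explicitly, and all other call sites keep their current behavior.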

michaelzhang-ai force-pushed the add_kimi_mi35x_nightly_test branch from 78e4ba4 to 9b6a279 on January 28, 2026 at 19:48
michaelzhang-ai changed the title from "Add kimi mi35x nightly test" to "[AMD] Add kimi mi35x nightly test" on Jan 30, 2026
michaelzhang-ai changed the title from "[AMD] Add kimi mi35x nightly test" to "[AMD] Add kimi mi35x nightly test and fix gpt-oss" on Feb 2, 2026
- Move MI30x accuracy tests (14 files) to test/registered/amd/accuracy/mi30x/
- Move MI30x perf tests (9 files) to test/registered/amd/perf/mi30x/
- Add new MI35x Kimi-K2 accuracy test
- Update nightly-test-amd.yml workflow to include nightly-8-gpu-mi35x-kimi-k2 job
… from 120 to 180 minutes and increase timeout per file from 3600 to 7200 seconds.
michaelzhang-ai force-pushed the add_kimi_mi35x_nightly_test branch from bdb81ea to 80f4089 on February 2, 2026 at 21:48
michaelzhang-ai force-pushed the add_kimi_mi35x_nightly_test branch 2 times, most recently from b986c6a to f6dbcf5 on February 2, 2026 at 22:09
michaelzhang-ai changed the title from "[AMD] Add kimi mi35x nightly test and fix gpt-oss" to "[AMD] Add kimi mi35x nightly test, folder organization and fix gpt-oss" on Feb 3, 2026
michaelzhang-ai changed the title from "[AMD] Add kimi mi35x nightly test, folder organization and fix gpt-oss" to "[AMD] Add kimi mi35x nightly test, folder organization and several stability fixes" on Feb 3, 2026

yctseng0211 (Collaborator) left a comment:

LGTM

@HaiShaw HaiShaw merged commit 6fd878b into sgl-project:main Feb 4, 2026
174 of 192 checks passed
charlesHsuGG pushed a commit to charlesHsuGG/sglang that referenced this pull request Feb 5, 2026
@michaelzhang-ai michaelzhang-ai deleted the add_kimi_mi35x_nightly_test branch February 6, 2026 05:45
RubiaCx pushed a commit to RubiaCx/sglang that referenced this pull request Feb 8, 2026
Johnsonms pushed a commit to Johnsonms/sglang that referenced this pull request Feb 14, 2026

Labels

amd, deepseek, Multi-modal, multi-modal language model, run-ci


4 participants