
[AMD] Add kimi mi35x nightly test, folder organization and several stability fixes#17895

Merged
HaiShaw merged 10 commits into sgl-project:main from michaelzhang-ai:add_kimi_mi35x_nightly_test
Feb 4, 2026

Conversation

michaelzhang-ai (Collaborator) commented on Jan 28, 2026

Motivation

Add a Kimi-K2 MI35x nightly test, restructure the test suite, and fix a gpt-oss accuracy issue.
Please help review: @yctseng0211 @bingxche.
Nightly: https://github.com/sgl-project/sglang/actions/runs/21611463422?pr=17895

Modifications

Add MI35x Kimi-K2 nightly accuracy test and fix several MI35x test stability issues.

Changes

New Test:

  • Add test_kimi_k2_eval_mi35x.py for Kimi-K2 accuracy evaluation on MI35x

Test Organization:

  • Move non-MI35x accuracy/perf tests into mi30x/ subdirectories

Stability Fixes:

  • Fix a gpt-oss accuracy issue on MI35x
  • Lower the Grok-2 MI35x accuracy threshold to 0.90 for test stability
  • Increase the Kimi-K2 timeout to 180 minutes and add a 20-minute watchdog timeout
  • Increase the Grok1-INT4 MI35x timeout from 60 to 90 minutes
  • Disable profiling for AMD perf tests (keep it enabled for NVIDIA)
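A watchdog timeout of the kind described above can be sketched as follows. This is a minimal illustration only; the helper name `run_with_watchdog` and its signature are hypothetical and not the PR's actual implementation:

```python
import subprocess


def run_with_watchdog(cmd, watchdog_timeout_s):
    """Run a benchmark command, aborting if it does not finish within the
    watchdog window (hypothetical helper, not the PR's actual code)."""
    try:
        # subprocess.run kills the child process on timeout before raising
        return subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=watchdog_timeout_s,  # e.g. 20 * 60 for a 20-minute watchdog
        )
    except subprocess.TimeoutExpired:
        raise RuntimeError(f"watchdog timeout after {watchdog_timeout_s}s")
```

A hung benchmark then fails fast with a clear error instead of consuming the whole 180-minute job timeout.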


gemini-code-assist (Contributor) commented:
Summary of Changes

Hello @michaelzhang-ai, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request significantly enhances the continuous integration and testing framework for AMD GPUs by introducing a structured approach to categorize tests by hardware generation (MI30x and MI35x). It expands test coverage with new accuracy and performance benchmarks for the DeepSeek-V3.2 and Kimi-K2 models across various configurations, ensuring that new models and optimizations are thoroughly validated. The changes also refine existing tests with improved logging and adjusted thresholds, contributing to a more reliable and informative testing pipeline.

Highlights

  • Test Suite Restructuring: AMD-specific accuracy and performance tests have been reorganized into dedicated mi30x and mi35x subdirectories. This change improves the logical grouping and management of tests based on the target GPU architecture.
  • New DeepSeek-V3.2 Tests: Comprehensive nightly accuracy and performance benchmarks for the DeepSeek-V3.2 model have been introduced. These tests cover various configurations including Data Parallel (DP), Multi-Token Prediction (MTP) with EAGLE speculative decoding, and Torch Compile, ensuring broad validation across both MI30x and MI35x platforms.
  • New Kimi-K2 Tests: Nightly accuracy evaluation tests for the Kimi-K2 model have been added for both MI30x and MI35x platforms, expanding the coverage of supported models.
  • Enhanced Test Visibility: Existing accuracy tests for several models (DeepSeek-R1, DeepSeek-V3.1, GPT-OSS, Grok1, Grok2) now include explicit print statements for accuracy and threshold values, providing clearer feedback during CI runs.
  • Accuracy Threshold Adjustments: Accuracy thresholds have been updated for several Qwen2 and Mixtral models in the test_gsm8k_eval_amd.py test, and the gpt-oss-20b-bf16 model's threshold was slightly lowered to reflect expected performance.
  • VLM Model Exclusion: The zai-org/GLM-4.1V-9B-Thinking model has been added to the list of known failing VLM models for AMD, indicating a current incompatibility or issue.
  • Robust Performance Benchmarking: Performance tests for MI35x now include watchdog timeouts and utilize a new utility function (_run_benchmark_with_timeout) to ensure more robust handling of server launch and execution, preventing indefinite hangs.


Ignored Files
  • Ignored by pattern: .github/workflows/** (1)
    • .github/workflows/nightly-test-amd.yml

gemini-code-assist bot left a comment:

Code Review

This pull request adds a significant number of new nightly tests for Kimi and DeepSeek-V3.2 models on AMD MI30x and MI35x platforms. It also refactors the test directory structure to separate tests by hardware (mi30x, mi35x). The changes are extensive and improve test coverage for new models and hardware. My review focuses on improving code consistency and reducing duplication in the newly added test files. I've pointed out opportunities to refactor duplicated benchmark logic and to align server process management with existing patterns in the codebase.

Individual review comments could not be created, so the feedback is listed below.

test/registered/amd/accuracy/mi30x/test_deepseek_v32_eval_amd.py (91-163)

Severity: medium

This file defines local helper functions (get_one_example, get_few_shot_examples, get_answer_value) and a benchmark runner (run_gsm8k_benchmark) for the GSM8K evaluation. This introduces significant code duplication, as similar logic exists in other test files and shared utilities. Other new tests for deepseek-v32 in this PR (e.g., test_deepseek_v32_dp_eval_amd.py) use the shared run_eval_few_shot_gsm8k function from sglang.test.few_shot_gsm8k. To improve maintainability and consistency, please refactor this test to use the shared run_eval_few_shot_gsm8k function and remove the duplicated helper functions.

test/registered/amd/accuracy/mi35x/test_kimi_k2_eval_mi35x.py (35-101)

Severity: medium

The server process management is inconsistent with other similar tests in this PR (e.g., test_kimi_k2_eval_amd.py). The server is launched and torn down within the test method test_kimi_k2_gsm8k_accuracy. It's better practice to use setUpClass to launch the server and tearDownClass to terminate it. This ensures the server is managed at the class level, started once for all tests in the class, and reliably cleaned up.

class TestKimiK2EvalMI35x(CustomTestCase):
    """Kimi-K2 GSM8K Completion Evaluation Test for AMD MI35x."""

    @classmethod
    def setUpClass(cls):
        cls.base_url = DEFAULT_URL_FOR_TEST
        other_args = [
            "--tp",
            "8",
            "--decode-attention-backend",
            "triton",
            "--prefill-attention-backend",
            "aiter",
            "--trust-remote-code",
            "--model-loader-extra-config",
            '{"enable_multithread_load": true}',
        ]
        env = os.environ.copy()
        env["SGLANG_USE_AITER"] = "1"
        env["SGLANG_ROCM_FUSED_DECODE_MLA"] = "0"

        cls.process = popen_launch_server(
            KIMI_K2_MODEL_PATH,
            cls.base_url,
            timeout=SERVER_LAUNCH_TIMEOUT,
            other_args=other_args,
            env=env,
        )

    @classmethod
    def tearDownClass(cls):
        kill_process_tree(cls.process.pid)

    def test_kimi_k2_gsm8k_accuracy(self):
        """Test Kimi-K2 with GSM8K few-shot completion benchmark."""
        requests.get(self.base_url + "/flush_cache")

        args = SimpleNamespace(
            num_shots=8,
            data_path=None,
            num_questions=1319,
            parallel=1319,
            max_new_tokens=512,
            host="http://127.0.0.1",
            port=int(self.base_url.split(":")[-1]),
        )
        metrics = run_eval_few_shot_gsm8k(args)
        acc = metrics["accuracy"]

        passed = acc >= ACCURACY_THRESHOLD
        status = "✅ PASS" if passed else "❌ FAIL"
        print(f"  accuracy={acc:.3f} threshold={ACCURACY_THRESHOLD} {status}")

        if is_in_ci():
            summary = "### Kimi-K2 Model (MI35x)\n\n"
            summary += "| Model | TP | Accuracy | Threshold | Status |\n"
            summary += "| ----- | -- | -------- | --------- | ------ |\n"
            summary += f"| {KIMI_K2_MODEL_PATH} | 8 | {acc:.3f} | {ACCURACY_THRESHOLD} | {status} |\n"
            write_github_step_summary(summary)

        self.assertGreaterEqual(
            acc,
            ACCURACY_THRESHOLD,
            f"Kimi-K2 accuracy {acc:.3f} below threshold {ACCURACY_THRESHOLD}",
        )

test/registered/amd/perf/mi35x/test_deepseek_v32_mtp_perf_mi35x.py (65-108)

Severity: medium

The new helper function _run_benchmark_with_timeout appears to be a copy of NightlyBenchmarkRunner.run_benchmark_for_model with the main difference being the addition of a custom server launch timeout. This introduces significant code duplication. A more maintainable approach would be to extend NightlyBenchmarkRunner.run_benchmark_for_model to accept an optional timeout parameter. If modifying NightlyBenchmarkRunner is not feasible in this PR, please add a code comment explaining why this duplication is necessary as a temporary measure.
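The suggested extension could look roughly like this. This is a sketch under the assumption that the runner wraps a server-launch step; `NightlyBenchmarkRunner`'s real signature, default timeout, and return value may differ:

```python
# Hypothetical sketch: extend the runner with an optional launch timeout
# instead of duplicating the whole method (names and defaults are illustrative).
DEFAULT_SERVER_LAUNCH_TIMEOUT_S = 600


class NightlyBenchmarkRunner:
    def run_benchmark_for_model(self, model, launch_timeout_s=None):
        """Launch the server and run the benchmark, honoring an optional
        per-call server-launch timeout."""
        timeout = launch_timeout_s or DEFAULT_SERVER_LAUNCH_TIMEOUT_S
        # ... launch server with `timeout`, run benchmark, collect metrics ...
        return {"model": model, "launch_timeout_s": timeout}
```

Callers that need a longer launch window (as the MI35x MTP test does) would pass `launch_timeout_s` explicitly, and all other call sites keep their current behavior.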

michaelzhang-ai force-pushed the add_kimi_mi35x_nightly_test branch from 78e4ba4 to 9b6a279 on January 28, 2026 at 19:48
michaelzhang-ai changed the title from "Add kimi mi35x nightly test" to "[AMD] Add kimi mi35x nightly test" on Jan 30, 2026
michaelzhang-ai changed the title from "[AMD] Add kimi mi35x nightly test" to "[AMD] Add kimi mi35x nightly test and fix gpt-oss" on Feb 2, 2026
- Move MI30x accuracy tests (14 files) to test/registered/amd/accuracy/mi30x/
- Move MI30x perf tests (9 files) to test/registered/amd/perf/mi30x/
- Add new MI35x Kimi-K2 accuracy test
- Update nightly-test-amd.yml workflow to include nightly-8-gpu-mi35x-kimi-k2 job
… from 120 to 180 minutes and increase timeout per file from 3600 to 7200 seconds.
michaelzhang-ai force-pushed the add_kimi_mi35x_nightly_test branch from bdb81ea to 80f4089 on February 2, 2026 at 21:48
michaelzhang-ai force-pushed the add_kimi_mi35x_nightly_test branch 2 times, most recently from b986c6a to f6dbcf5 on February 2, 2026 at 22:09
michaelzhang-ai changed the title from "[AMD] Add kimi mi35x nightly test and fix gpt-oss" to "[AMD] Add kimi mi35x nightly test, folder organization and fix gpt-oss" on Feb 3, 2026
michaelzhang-ai changed the title from "[AMD] Add kimi mi35x nightly test, folder organization and fix gpt-oss" to "[AMD] Add kimi mi35x nightly test, folder organization and several stability fixes" on Feb 3, 2026

yctseng0211 (Collaborator) left a comment:

LGTM

@HaiShaw HaiShaw merged commit 6fd878b into sgl-project:main Feb 4, 2026
174 of 192 checks passed
charlesHsuGG pushed a commit to charlesHsuGG/sglang that referenced this pull request Feb 5, 2026
@michaelzhang-ai michaelzhang-ai deleted the add_kimi_mi35x_nightly_test branch February 6, 2026 05:45
RubiaCx pushed a commit to RubiaCx/sglang that referenced this pull request Feb 8, 2026
Johnsonms pushed a commit to Johnsonms/sglang that referenced this pull request Feb 14, 2026

Labels

amd, deepseek, Multi-modal, multi-modal language model, run-ci


4 participants