[AMD] Add kimi mi35x nightly test, folder organization and several stability fixes (#17895)
Conversation
Summary of Changes

Hello @michaelzhang-ai, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances the continuous integration and testing framework for AMD GPUs by introducing a structured approach to categorizing tests by hardware generation (MI30x and MI35x). It expands test coverage with new accuracy and performance benchmarks for the DeepSeek-V3.2 and Kimi-K2 models across various configurations, ensuring that new models and optimizations are thoroughly validated. The changes also refine existing tests with improved logging and adjusted thresholds, contributing to a more reliable and informative testing pipeline.
Code Review
This pull request adds a significant number of new nightly tests for Kimi and DeepSeek-V3.2 models on AMD MI30x and MI35x platforms. It also refactors the test directory structure to separate tests by hardware (mi30x, mi35x). The changes are extensive and improve test coverage for new models and hardware. My review focuses on improving code consistency and reducing duplication in the newly added test files. I've pointed out opportunities to refactor duplicated benchmark logic and to align server process management with existing patterns in the codebase.
test/registered/amd/accuracy/mi30x/test_deepseek_v32_eval_amd.py (91-163)
This file defines local helper functions (get_one_example, get_few_shot_examples, get_answer_value) and a benchmark runner (run_gsm8k_benchmark) for the GSM8K evaluation. This introduces significant code duplication, as similar logic exists in other test files and shared utilities. Other new tests for deepseek-v32 in this PR (e.g., test_deepseek_v32_dp_eval_amd.py) use the shared run_eval_few_shot_gsm8k function from sglang.test.few_shot_gsm8k. To improve maintainability and consistency, please refactor this test to use the shared run_eval_few_shot_gsm8k function and remove the duplicated helper functions.
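To illustrate the suggested refactor, here is a minimal sketch of how the duplicated GSM8K helpers could collapse into a call to the shared `run_eval_few_shot_gsm8k` function. The `build_gsm8k_args` helper is hypothetical (not part of the PR); the argument names mirror the `SimpleNamespace` already used by the other deepseek-v32 tests in this PR.

```python
# Sketch only: build_gsm8k_args is a hypothetical helper; the argument
# names follow the SimpleNamespace used by test_deepseek_v32_dp_eval_amd.py.
from types import SimpleNamespace


def build_gsm8k_args(base_url: str, num_questions: int = 1319) -> SimpleNamespace:
    """Build the argument namespace expected by run_eval_few_shot_gsm8k."""
    host, port = base_url.rsplit(":", 1)  # split "http://host:port" at the last colon
    return SimpleNamespace(
        num_shots=8,
        data_path=None,
        num_questions=num_questions,
        parallel=num_questions,
        max_new_tokens=512,
        host=host,
        port=int(port),
    )


# In the test body, the duplicated local helpers then reduce to:
#   from sglang.test.few_shot_gsm8k import run_eval as run_eval_few_shot_gsm8k
#   metrics = run_eval_few_shot_gsm8k(build_gsm8k_args(self.base_url))
#   acc = metrics["accuracy"]
```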
test/registered/amd/accuracy/mi35x/test_kimi_k2_eval_mi35x.py (35-101)
The server process management is inconsistent with other similar tests in this PR (e.g., test_kimi_k2_eval_amd.py). The server is launched and torn down within the test method test_kimi_k2_gsm8k_accuracy. It's better practice to use setUpClass to launch the server and tearDownClass to terminate it. This ensures the server is managed at the class level, started once for all tests in the class, and reliably cleaned up.
```python
# NOTE: the sglang import paths below are assumed from the shared test
# utilities used elsewhere in this PR. KIMI_K2_MODEL_PATH,
# SERVER_LAUNCH_TIMEOUT, and ACCURACY_THRESHOLD are module-level
# constants defined elsewhere in this test file.
import os
from types import SimpleNamespace

import requests

from sglang.srt.utils import kill_process_tree
from sglang.test.few_shot_gsm8k import run_eval as run_eval_few_shot_gsm8k
from sglang.test.test_utils import (
    DEFAULT_URL_FOR_TEST,
    CustomTestCase,
    is_in_ci,
    popen_launch_server,
    write_github_step_summary,
)


class TestKimiK2EvalMI35x(CustomTestCase):
    """Kimi-K2 GSM8K Completion Evaluation Test for AMD MI35x."""

    @classmethod
    def setUpClass(cls):
        cls.base_url = DEFAULT_URL_FOR_TEST
        other_args = [
            "--tp",
            "8",
            "--decode-attention-backend",
            "triton",
            "--prefill-attention-backend",
            "aiter",
            "--trust-remote-code",
            "--model-loader-extra-config",
            '{"enable_multithread_load": true}',
        ]
        env = os.environ.copy()
        env["SGLANG_USE_AITER"] = "1"
        env["SGLANG_ROCM_FUSED_DECODE_MLA"] = "0"
        cls.process = popen_launch_server(
            KIMI_K2_MODEL_PATH,
            cls.base_url,
            timeout=SERVER_LAUNCH_TIMEOUT,
            other_args=other_args,
            env=env,
        )

    @classmethod
    def tearDownClass(cls):
        kill_process_tree(cls.process.pid)

    def test_kimi_k2_gsm8k_accuracy(self):
        """Test Kimi-K2 with GSM8K few-shot completion benchmark."""
        requests.get(self.base_url + "/flush_cache")
        args = SimpleNamespace(
            num_shots=8,
            data_path=None,
            num_questions=1319,
            parallel=1319,
            max_new_tokens=512,
            host="http://127.0.0.1",
            port=int(self.base_url.split(":")[-1]),
        )
        metrics = run_eval_few_shot_gsm8k(args)
        acc = metrics["accuracy"]
        passed = acc >= ACCURACY_THRESHOLD
        status = "✅ PASS" if passed else "❌ FAIL"
        print(f" accuracy={acc:.3f} threshold={ACCURACY_THRESHOLD} {status}")
        if is_in_ci():
            summary = "### Kimi-K2 Model (MI35x)\n\n"
            summary += "| Model | TP | Accuracy | Threshold | Status |\n"
            summary += "| ----- | -- | -------- | --------- | ------ |\n"
            summary += f"| {KIMI_K2_MODEL_PATH} | 8 | {acc:.3f} | {ACCURACY_THRESHOLD} | {status} |\n"
            write_github_step_summary(summary)
        self.assertGreaterEqual(
            acc,
            ACCURACY_THRESHOLD,
            f"Kimi-K2 accuracy {acc:.3f} below threshold {ACCURACY_THRESHOLD}",
        )
```

test/registered/amd/perf/mi35x/test_deepseek_v32_mtp_perf_mi35x.py (65-108)
The new helper function _run_benchmark_with_timeout appears to be a copy of NightlyBenchmarkRunner.run_benchmark_for_model with the main difference being the addition of a custom server launch timeout. This introduces significant code duplication. A more maintainable approach would be to extend NightlyBenchmarkRunner.run_benchmark_for_model to accept an optional timeout parameter. If modifying NightlyBenchmarkRunner is not feasible in this PR, please add a code comment explaining why this duplication is necessary as a temporary measure.
- Move MI30x accuracy tests (14 files) to test/registered/amd/accuracy/mi30x/
- Move MI30x perf tests (9 files) to test/registered/amd/perf/mi30x/
- Add new MI35x Kimi-K2 accuracy test
- Update nightly-test-amd.yml workflow to include nightly-8-gpu-mi35x-kimi-k2 job
… from 120 to 180 minutes and increase timeout per file from 3600 to 7200 seconds.
Motivation
Add a Kimi MI35x nightly test, restructure the test suite, and fix a gpt-oss accuracy issue.
Please help review: @yctseng0211 @bingxche.
Nightly: https://github.com/sgl-project/sglang/actions/runs/21611463422?pr=17895
Modifications
Add MI35x Kimi-K2 nightly accuracy test and fix several MI35x test stability issues.
Changes
New Test:
- `test_kimi_k2_eval_mi35x.py` for Kimi-K2 accuracy evaluation on MI35x

Test Organization:
- MI30x accuracy and perf tests moved into `mi30x/` subdirectories

Stability Fixes:
Accuracy Tests
Benchmarking and Profiling
Checklist
Review Process
`/tag-run-ci-label`, `/rerun-failed-ci`, `/tag-and-rerun-ci`