fix(ci): recover from corrupted MMMU parquet cache#17256

Merged
Kangyan-Zhou merged 2 commits into sgl-project:main from harvenstar:fix/mmmu-cache-corruption
Jan 18, 2026

@harvenstar
Collaborator

When MMMU dataset parquet files in the HuggingFace cache are corrupted (ArrowInvalid: Parquet magic bytes not found), the lmms_eval process fails without producing JSON results, which causes CI test failures. Example failing runs:
https://github.com/sgl-project/sglang/actions/runs/21076101101/job/60658633322?pr=16388
https://github.com/sgl-project/sglang/actions/runs/21075841576/job/60653165682?pr=16561

This fix adds automatic recovery in mmmu_vlm_kit.py:

  • Detects parquet corruption errors in lmms_eval subprocess output
  • Cleans up the corrupted MMMU dataset cache directory (tries the CI path /hf_home first, then falls back to HF_HOME)
  • Retries once with HF_HUB_OFFLINE=0 and force_redownload mode
  • Only affects MMMU parquet corruption cases (minimal impact)

Fixes the intermittent CI failures on MMMU evaluation tests.
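
The detection step can be sketched as a simple match on the captured subprocess output. This is a minimal illustration, not the code in mmmu_vlm_kit.py: the helper name `looks_like_parquet_corruption` and the exact pattern are assumptions, based only on the error text quoted above.

```python
import re

# Assumed marker: the error text quoted in this PR description.
# The real helper in mmmu_vlm_kit.py may match different or additional strings.
_PARQUET_CORRUPTION = re.compile(r"Parquet magic bytes not found", re.IGNORECASE)


def looks_like_parquet_corruption(output: str) -> bool:
    """Return True if captured stdout/stderr contains the parquet corruption marker."""
    return bool(_PARQUET_CORRUPTION.search(output or ""))
```

Matching on the message rather than on the exit code keeps the retry limited to this one failure mode, which is what "minimal impact" refers to.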

@github-actions github-actions bot added the Multi-modal multi-modal language model label Jan 17, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello @harvenstar, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly improves the robustness of the continuous integration pipeline by introducing an automatic recovery mechanism for corrupted HuggingFace dataset caches, specifically for MMMU evaluations. By detecting data corruption, clearing the problematic cache, and retrying the evaluation with a forced redownload, the system can now self-heal from transient data issues, leading to more stable and reliable CI test results.

Highlights

  • Automatic Error Detection: Implemented a mechanism to automatically detect ArrowInvalid: Parquet magic bytes not found errors, specifically targeting corrupted MMMU dataset parquet files during lmms_eval subprocess execution.
  • Corrupted Cache Cleanup: Added functionality to clean up the corrupted MMMU dataset cache directory. This prioritizes the CI path (/hf_home) and falls back to the standard HuggingFace cache location (HF_HOME or ~/.cache/huggingface).
  • Retry Mechanism: Introduced a retry logic for lmms_eval commands. Upon detecting and cleaning a corrupted cache, the evaluation is retried once with HF_HUB_OFFLINE=0 and HF_DATASETS_DOWNLOAD_MODE=force_redownload to ensure a fresh download of the dataset.
  • CI Stability Improvement: This fix directly addresses intermittent CI failures caused by corrupted MMMU evaluation datasets, enhancing the reliability of continuous integration tests.
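
The cleanup and retry-environment steps described in these highlights could look roughly like the sketch below. It is an illustration under stated assumptions, not the merged implementation: the function names are hypothetical, and the cache layout (dataset directories under `datasets/` and `datasets--*` directories under `hub/` whose names contain "mmmu") is an assumption about how the HuggingFace cache is organized on the CI machines.

```python
import shutil
from pathlib import Path
from typing import Dict, List, Mapping


def candidate_hf_homes(env: Mapping[str, str], ci_root: str = "/hf_home") -> List[Path]:
    """Cache roots in priority order: CI path, then HF_HOME, then the default location."""
    roots = [Path(ci_root)]
    if env.get("HF_HOME"):
        roots.append(Path(env["HF_HOME"]))
    roots.append(Path.home() / ".cache" / "huggingface")
    return roots


def _is_mmmu_dataset_dir(path: Path) -> bool:
    name = path.name.lower()
    # Under hub/, only dataset repos are candidates (models are left alone).
    if path.parent.name == "hub" and not name.startswith("datasets--"):
        return False
    return "mmmu" in name


def remove_mmmu_cache(env: Mapping[str, str], ci_root: str = "/hf_home") -> List[Path]:
    """Delete MMMU dataset directories under the first cache root that exists."""
    for root in candidate_hf_homes(env, ci_root):
        if not root.is_dir():
            continue
        removed = []
        for sub in ("datasets", "hub"):  # assumed cache subdirectories
            parent = root / sub
            if not parent.is_dir():
                continue
            for child in sorted(parent.iterdir()):
                if child.is_dir() and _is_mmmu_dataset_dir(child):
                    shutil.rmtree(child)
                    removed.append(child)
        return removed
    return []


def build_retry_env(env: Mapping[str, str]) -> Dict[str, str]:
    """Copy env and set the two variables named in the summary above."""
    retry_env = dict(env)
    retry_env["HF_HUB_OFFLINE"] = "0"
    retry_env["HF_DATASETS_DOWNLOAD_MODE"] = "force_redownload"
    return retry_env
```

Stopping at the first existing root mirrors the "prioritizes the CI path" wording; whether the real helper also cleans the fallback locations is not stated in this conversation.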


@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a robust retry mechanism to handle intermittent CI failures caused by corrupted MMMU dataset parquet files. The changes are well-targeted and introduce helper functions to detect the specific error, clean up the cache, and retry the evaluation. My review focuses on improving the logging and error reporting within the new retry logic to ensure that subprocess outputs are not lost, which is crucial for debugging.

@harvenstar harvenstar force-pushed the fix/mmmu-cache-corruption branch from 3e0c0ea to 9c65b48 on January 17, 2026 09:19
Address code review feedback:
- Print captured stdout/stderr on successful runs for visibility
- Log error output before re-raising on cleanup failure
- Log error output before re-raising on non-parquet errors

This ensures subprocess outputs are preserved for debugging.
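
A retry wrapper that follows this feedback might be structured as below. This is a sketch only: `run_with_parquet_retry`, its parameters, and the injected `cleanup` and `run` callables are hypothetical names chosen for illustration, and the actual `_run_lmms_eval_with_retry` in mmmu_vlm_kit.py may have a different signature. The captured output is echoed on every path, so nothing is lost before an exception is raised.

```python
import subprocess
import sys
from typing import Callable, Mapping, Sequence


def _echo(result: subprocess.CompletedProcess) -> None:
    """Print captured output so it stays visible in CI logs."""
    if result.stdout:
        print(result.stdout)
    if result.stderr:
        print(result.stderr, file=sys.stderr)


def run_with_parquet_retry(
    cmd: Sequence[str],
    env: Mapping[str, str],
    cleanup: Callable[[], None],
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> subprocess.CompletedProcess:
    """Run cmd; on a parquet corruption failure, clean the cache and retry once."""
    result = run(cmd, env=dict(env), capture_output=True, text=True)
    _echo(result)  # logged on success and before any re-raise
    if result.returncode == 0:
        return result

    combined = (result.stdout or "") + (result.stderr or "")
    if "Parquet magic bytes not found" not in combined:
        # Unrelated failure: surface it unchanged, with output already logged.
        raise subprocess.CalledProcessError(
            result.returncode, cmd, result.stdout, result.stderr
        )

    cleanup()  # if this raises, the original error output was already printed

    retry_env = dict(env)
    retry_env["HF_HUB_OFFLINE"] = "0"
    retry_env["HF_DATASETS_DOWNLOAD_MODE"] = "force_redownload"
    retry = run(cmd, env=retry_env, capture_output=True, text=True)
    _echo(retry)
    retry.check_returncode()  # a second failure is not retried again
    return retry
```

Passing `run` as a parameter is only a convenience for exercising the logic without launching lmms_eval; in a test file the default `subprocess.run` would be used.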
@harvenstar harvenstar force-pushed the fix/mmmu-cache-corruption branch from 9c65b48 to 21b763c on January 17, 2026 09:23
@b8zhong
Collaborator

b8zhong commented Jan 17, 2026

/tag-and-rerun-ci

@Kangyan-Zhou Kangyan-Zhou merged commit 90399cb into sgl-project:main Jan 18, 2026
139 of 154 checks passed
harvenstar added a commit to harvenstar/sglang that referenced this pull request Jan 19, 2026
The MMMU parquet corruption retry logic added in PR sgl-project#17256 was not
being used by test_epd_disaggregation.py, which had its own run_mmmu_eval
method that directly called subprocess.run(). This caused EPD tests to
fail when encountering corrupted parquet files.

Changes:
- Import _run_lmms_eval_with_retry from mmmu_vlm_kit
- Replace subprocess.run() with _run_lmms_eval_with_retry() in both
  TestEPDDisaggregationOneEncoder and TestEPDDisaggregationMultiEncoders

This ensures EPD tests will automatically clean up corrupted MMMU cache
and retry with fresh downloads, matching the behavior of other MMMU tests.
DotSlash-A pushed a commit to DotSlash-A/sglang that referenced this pull request Jan 19, 2026
* fix(ci): recover from corrupted MMMU parquet cache (sgl-project#17256)

Labels

Multi-modal multi-modal language model run-ci

3 participants