[CI] fix: notebook ci may not be working #18417
Conversation
Thanks so much for this PR. I had been thinking of it like this: "if there is no tag, add the run-ci tag and run all the CI; if there is already a tag, just rerun the failed jobs". Is that different from what you intend?
Not quite. As I mentioned, any non-markdown change should trigger this CI. Even a single-line code change, such as adding a blank line to a Python file, should trigger it.
Sorry, I didn’t make it clear earlier. Any non-markdown changes should trigger this CI, as shown in the code: |
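(The referenced snippet did not survive extraction. A representative sketch of how such a filter is typically expressed in a GitHub Actions trigger, assuming a `paths-ignore` filter on markdown files — not the PR's verbatim code:)

```yaml
# Hypothetical trigger block: run the notebook CI for any change
# that is not purely a markdown/documentation edit.
on:
  pull_request:
    types: [opened, synchronize, labeled]
    paths-ignore:
      - "**/*.md"
```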
Motivation
Trying to fix #18430.
In the previous code, the `Execute Notebooks` workflow could be effectively skipped due to a combination of factors:

- When a PR is first created (`opened`), the notebook workflow is not triggered, because `pull_request.types` did not include `opened`.
- If the PR has no subsequent push (`synchronize`) and no other triggering events occur (such as adding a label), no workflow run is created for that head SHA.
- `/tag-and-rerun-ci` can only rerun jobs if there is already a workflow run record (either `skipped` or `failure`) for that head SHA.
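The fix implied by these factors can be sketched as follows. The job names (`run-all-notebooks`, `finish`) mirror the description, but the exact YAML is an assumption, not the PR's actual workflow file:

```yaml
# Sketch: ensure a run record exists on `opened`, and make the finish
# job end in `failure` (rather than `success`) when the notebook job
# did not actually run, so /tag-and-rerun-ci has a target to rerun.
on:
  pull_request:
    types: [opened, synchronize, labeled]

jobs:
  run-all-notebooks:
    # Only runs once the run-ci label has been applied.
    if: contains(github.event.pull_request.labels.*.name, 'run-ci')
    runs-on: ubuntu-latest
    steps:
      - run: echo "execute notebooks here"

  finish:
    needs: [run-all-notebooks]
    if: always()  # run even when the notebook job was skipped
    runs-on: ubuntu-latest
    steps:
      - name: Fail unless the notebook job actually succeeded
        run: |
          if [ "${{ needs.run-all-notebooks.result }}" != "success" ]; then
            exit 1
          fi
```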
Adding `opened` alone only solves part of the problem: the workflow run will at least exist. But even if `run-all-notebooks` is skipped, the overall workflow often still ends in `success` instead of `skipped`. In that case, `handle_rerun_failed_ci` will not pick it up as a failed/skipped target to rerun. To fully address this, we also need to adjust both `call-gate` and the finish job (following the pattern used in the `PR Test` workflow).

How to reproduce
For example:
[Do NOT MERGE] notebook ci test #18418
Benchmarking and Profiling
See #18416.
After this fix, any non-markdown changes will correctly trigger the notebook CI once it is tagged with `/tag-and-rerun-ci`.

Checklist
Review Process
`/tag-run-ci-label`, `/rerun-failed-ci`, `/tag-and-rerun-ci`