Support filtering labels in dumper #19018
Merged
Changes from 1 commit
25 commits
0cc42a7 feat(dumper): add SGLANG_DUMPER_CLEANUP env var to auto-remove old dumps (fzyzcjy)
3726326 more (fzyzcjy)
1d85c3a more (fzyzcjy)
bc0a29c feat(dumper): add watchdog timeout for collective communication ops (fzyzcjy)
c2ec469 test(dumper): add tests for collective communication timeout watchdog (fzyzcjy)
c45a0d6 more (fzyzcjy)
ff36758 refactor(test): extract _capture_stdout context manager for stdout re… (fzyzcjy)
ef77c2d refactor(test): use nonlocal instead of list, add debug prints for ca… (fzyzcjy)
0c8dd0c fmt (fzyzcjy)
294b10b fix(test): use correct kwarg cleanup_previous instead of needs_cleanup (fzyzcjy)
3c5b578 Merge branch 'ac8398/0' into ac8398/1 (fzyzcjy)
3079303 refactor(dumper): split SGLANG_DUMPER_WRITE_FILE into OUTPUT_FILE and… (fzyzcjy)
73e98bc refactor(test): merge output file and console tests into symmetric Te… (fzyzcjy)
279749a remove redundant distributed test_write_disabled, covered by TestOutp… (fzyzcjy)
681c86e feat(dumper): add capture_output() context manager for in-memory dump… (fzyzcjy)
e829bfe test(dumper): verify capture_output respects filter (fzyzcjy)
ff729d1 refactor(dumper): extract _DumperConfig frozen dataclass (fzyzcjy)
a0a6f1c Revert "refactor(dumper): extract _DumperConfig frozen dataclass" (fzyzcjy)
96493cb feat(dumper): support KV filtering on name, extra_kwargs, and global_ctx (fzyzcjy)
bb4dfaf style(dumper): collapse double-if filter check into single condition (fzyzcjy)
9002a4a refactor(dumper): pre-compute user_kwargs at grad hook registration time (fzyzcjy)
6a742c9 rename(dumper): user_kwargs -> tags (fzyzcjy)
9e2592b rename(dumper): _format_kwargs -> _format_tags (fzyzcjy)
8dc3443 fmt (fzyzcjy)
e7cf6c3 Merge branch 'main' into ac8398/3 (fzyzcjy)
feat(dumper): add watchdog timeout for collective communication ops
Collective operations (broadcast_object_list, all_gather_object) hang silently when not all ranks participate. Add a configurable timeout (default 60s) that prints a warning if a collective op doesn't complete, helping users diagnose missing rank participation.
commit bc0a29cba2c5a17ff1fd3cefda50389481a52e51
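For readers without the diff at hand, the watchdog idea in the commit message can be sketched roughly as follows. This is not the code from this commit: the name `collective_watchdog` and its `warn` parameter are hypothetical, and the sketch only assumes the behaviour the message describes (warn after a timeout, do not abort the collective).

```python
import threading
import time
from contextlib import contextmanager


@contextmanager
def collective_watchdog(op_name, timeout_s=60.0, warn=print):
    """Warn, without aborting, if the wrapped block runs longer than timeout_s.

    Hypothetical helper for illustration; not the dumper's actual API.
    """
    message = (
        f"WARNING: {op_name} has not completed after {timeout_s}s; "
        "are all ranks participating?"
    )
    # threading.Timer calls warn(message) once after timeout_s unless cancelled.
    timer = threading.Timer(timeout_s, warn, args=(message,))
    timer.daemon = True  # never keep the process alive just for the warning
    timer.start()
    try:
        yield
    finally:
        # The op finished (or raised): suppress the warning if it has not fired.
        timer.cancel()


if __name__ == "__main__":
    # Stand-in for a slow collective such as all_gather_object.
    with collective_watchdog("all_gather_object", timeout_s=0.1):
        time.sleep(0.3)
```

The timer only reports; the blocked collective itself keeps waiting, which matches the stated goal of helping users diagnose missing rank participation rather than recovering from it.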
The `collective_timeout` is hardcoded to 60 seconds. While this is a reasonable default, it would be more flexible to allow users to configure this value through an environment variable, similar to other settings. This is particularly useful in environments with slower networks or for debugging complex distributed scenarios where collectives might take longer than expected.
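A minimal sketch of what this suggestion could look like. The variable name `SGLANG_DUMPER_COLLECTIVE_TIMEOUT` is an assumption that merely follows the `SGLANG_DUMPER_*` prefix used by other settings in this PR, and `get_collective_timeout` is a hypothetical helper, not an existing function in the dumper.

```python
import os

DEFAULT_COLLECTIVE_TIMEOUT_S = 60.0
# Hypothetical variable name; only the SGLANG_DUMPER_ prefix comes from this PR.
ENV_VAR = "SGLANG_DUMPER_COLLECTIVE_TIMEOUT"


def get_collective_timeout(environ=None):
    """Return the collective timeout in seconds, falling back to 60s."""
    environ = os.environ if environ is None else environ
    raw = environ.get(ENV_VAR)
    if raw is None or raw.strip() == "":
        return DEFAULT_COLLECTIVE_TIMEOUT_S
    try:
        value = float(raw)
    except ValueError:
        raise ValueError(
            f"{ENV_VAR} must be a number of seconds, got {raw!r}"
        ) from None
    if not value > 0:  # also rejects NaN
        raise ValueError(f"{ENV_VAR} must be positive, got {raw!r}")
    return value
```

Failing loudly on a malformed value is a design choice here; silently falling back to the default would also be defensible for a debugging tool.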