Enhance reset, states, http in dumper#19095
Merged
fzyzcjy merged 164 commits intosgl-project:mainfrom Feb 22, 2026
Merged
Conversation
…r_step step() is now called at the end of each iteration instead of at the beginning. _curr_step starts at 0 and increments on each step() call. All dump file tags, get_state output, dump_loader/comparator column names updated accordingly.
Port the hook-based tensor dumping from tensor_dump_forward_hook.py into the dumper subsystem as a standalone _HookDumper class. This enables automatic tensor capture via PyTorch forward hooks while reusing dumper's filtering, HTTP control, and metadata infrastructure. Also fix test_disabled_dumper_no_output to use configure() instead of passing enable=False as constructor override.
- Remove dump_layers param from _HookDumper (use dumper filter instead) - Add top_level_module_name and layers_module_name to _DumperConfig (auto env var support: DUMPER_TOP_LEVEL_MODULE_NAME, etc.) - Remove duplicate _HookDumper class - Remove test_layer_filtering test
Store top_level_module_name as instance var for clean access, fix recursive call passing removed keyword argument.
…rdBatch input capture - _convert_hook_output returns dict[str, Tensor] for any hook value - _dump_converted loops over the dict and calls dumper.dump() - Top-level model hook now also processes input args to capture ForwardBatch (input_ids, seq_lens, positions)
Skip creating _DumperHttpManager entirely when server_port_parsed is None, instead of constructing one that only wraps a _LocalOnlyBroadcast that is never used.
ZMQ RPC target is now the manager itself instead of the raw _Dumper, so all HTTP control logic is encapsulated in _DumperHttpManager.
Previously cleanup_previous was triggered lazily on the first dump() call inside _dump_inner. This is unreliable in distributed settings because different ranks may reach the first dump at different times (or not at all in pipeline parallelism), causing the barrier inside _cleanup_old_dumps to hang. Now cleanup is triggered eagerly in configure() right when the config is set, where all ranks call it together. The _cleanup_old_dumps function already has rank-0-only rmtree + dist.barrier().
…e()" This reverts commit 13dd918.
Use _collective_with_timeout for the dist.barrier() in _cleanup_old_dumps so a warning is printed if not all ranks participate within the timeout window.
…nches Cover the 4 branches of _DumperConfig.server_port_parsed: negative -> None, zero -> None, positive -> int, "reuse" -> str. Also add missing imports for upcoming tests.
Verify prefix is "dump_" and suffix has expected timestamp format (YYYYMMDD_HHMMSS_MMMRRR = 22 chars with underscore at position 8).
Verify that nesting capture_output() raises AssertionError to prevent accidental data overwrites from concurrent captures.
Verify that POSTing to an unknown dumper control method returns 400.
When both enable_value and enable_grad are False, _dump_inner should skip without producing any output files.
Cover unset (default), valid int, invalid string (default fallback), and whitespace-only string (default fallback).
Verify _register_forward_hook_or_replace_fn raises ValueError when given an unrecognized mode string.
Cover both the tensor clone path and the dict deepcopy path, verifying mutations to the original don't affect the copy.
Verify that dump_model saves Parameter.data (plain Tensor), not the Parameter wrapper itself.
Verify that _parse_env_value treats whitespace-only strings as unset and returns the field default.
… error Previously _dump_single unconditionally stripped .data from Parameters, making _torch_save's Parameter fallback dead code for the dict format. Now Parameters are saved as-is. The fallback in _torch_save is extended to handle dicts containing unpicklable Parameter subclasses.
The function was copied from utils/common.py but never called within dumper.py itself. Remove it and its tests.
Replace two ad-hoc isinstance branches with a single helper that handles both bare Parameter and dict-wrapped Parameter forms.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist
Review Process
/tag-run-ci-label,/rerun-failed-ci,/tag-and-rerun-ci