Skip to content

Enhance reset, states, http in dumper#19095

Merged
fzyzcjy merged 164 commits intosgl-project:mainfrom
fzyzcjy:ac8402/8
Feb 22, 2026
Merged

Enhance reset, states, http in dumper#19095
fzyzcjy merged 164 commits intosgl-project:mainfrom
fzyzcjy:ac8402/8

Conversation

@fzyzcjy
Copy link
Collaborator

@fzyzcjy fzyzcjy commented Feb 21, 2026

Motivation

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

…r_step

step() is now called at the end of each iteration instead of at the
beginning. _curr_step starts at 0 and increments on each step() call.
All dump file tags, get_state output, dump_loader/comparator column
names updated accordingly.
Port the hook-based tensor dumping from tensor_dump_forward_hook.py
into the dumper subsystem as a standalone _HookDumper class. This
enables automatic tensor capture via PyTorch forward hooks while
reusing dumper's filtering, HTTP control, and metadata infrastructure.

Also fix test_disabled_dumper_no_output to use configure() instead
of passing enable=False as constructor override.
- Remove dump_layers param from _HookDumper (use dumper filter instead)
- Add top_level_module_name and layers_module_name to _DumperConfig
  (auto env var support: DUMPER_TOP_LEVEL_MODULE_NAME, etc.)
- Remove duplicate _HookDumper class
- Remove test_layer_filtering test
Store top_level_module_name as instance var for clean access, fix
recursive call passing removed keyword argument.
…rdBatch input capture

- _convert_hook_output returns dict[str, Tensor] for any hook value
- _dump_converted loops over the dict and calls dumper.dump()
- Top-level model hook now also processes input args to capture
  ForwardBatch (input_ids, seq_lens, positions)
Skip creating _DumperHttpManager entirely when server_port_parsed is
None, instead of constructing one that only wraps a _LocalOnlyBroadcast
that is never used.
ZMQ RPC target is now the manager itself instead of the raw _Dumper,
so all HTTP control logic is encapsulated in _DumperHttpManager.
Previously cleanup_previous was triggered lazily on the first dump()
call inside _dump_inner. This is unreliable in distributed settings
because different ranks may reach the first dump at different times
(or not at all in pipeline parallelism), causing the barrier inside
_cleanup_old_dumps to hang.

Now cleanup is triggered eagerly in configure() right when the config
is set, where all ranks call it together. The _cleanup_old_dumps
function already has rank-0-only rmtree + dist.barrier().
Use _collective_with_timeout for the dist.barrier() in
_cleanup_old_dumps so a warning is printed if not all ranks
participate within the timeout window.
…nches

Cover the 4 branches of _DumperConfig.server_port_parsed:
negative -> None, zero -> None, positive -> int, "reuse" -> str.
Also add missing imports for upcoming tests.
Verify prefix is "dump_" and suffix has expected timestamp format
(YYYYMMDD_HHMMSS_MMMRRR = 22 chars with underscore at position 8).
Verify that nesting capture_output() raises AssertionError to prevent
accidental data overwrites from concurrent captures.
Verify that POSTing to an unknown dumper control method returns 400.
When both enable_value and enable_grad are False, _dump_inner should
skip without producing any output files.
Cover unset (default), valid int, invalid string (default fallback),
and whitespace-only string (default fallback).
Verify _register_forward_hook_or_replace_fn raises ValueError
when given an unrecognized mode string.
Cover both the tensor clone path and the dict deepcopy path,
verifying mutations to the original don't affect the copy.
Verify that dump_model saves Parameter.data (plain Tensor),
not the Parameter wrapper itself.
Verify that _parse_env_value treats whitespace-only strings as
unset and returns the field default.
… error

Previously _dump_single unconditionally stripped .data from Parameters,
making _torch_save's Parameter fallback dead code for the dict format.

Now Parameters are saved as-is. The fallback in _torch_save is extended
to handle dicts containing unpicklable Parameter subclasses.
The function was copied from utils/common.py but never called within
dumper.py itself. Remove it and its tests.
Replace two ad-hoc isinstance branches with a single helper that
handles both bare Parameter and dict-wrapped Parameter forms.
@fzyzcjy fzyzcjy merged commit c1f497e into sgl-project:main Feb 22, 2026
24 of 28 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant