Enhance reset, states, http in dumper by fzyzcjy · Pull Request #19095 · sgl-project/sglang

fzyzcjy · 2026-02-21T02:08:22Z

Motivation

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

Review Process

Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
After green CI and required approvals, ask Merge Oncalls to merge.

…r_step step() is now called at the end of each iteration instead of at the beginning. _curr_step starts at 0 and increments on each step() call. All dump file tags, get_state output, dump_loader/comparator column names updated accordingly.
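The end-of-iteration convention described above can be sketched as follows. This is a minimal illustration, not the dumper's actual code: `StepCounter`, `tag()`, and the tag layout are hypothetical stand-ins, and only the "start at 0, increment on each `step()` call" behaviour comes from the commit message.

```python
class StepCounter:
    """Hypothetical stand-in for the dumper's step bookkeeping."""

    def __init__(self) -> None:
        self._curr_step = 0  # starts at 0, per the commit message

    def tag(self, name: str) -> str:
        # Illustrative tag layout only; the real dump file naming is not shown in this PR page.
        return f"step={self._curr_step}___name={name}"

    def step(self) -> None:
        # Advance the counter by one on every call.
        self._curr_step += 1


counter = StepCounter()
tags = []
for _ in range(3):
    tags.append(counter.tag("hidden_states"))  # dumps made during the iteration
    counter.step()  # called at the END of the iteration, not the beginning
```

With this ordering, everything dumped during the first iteration carries step 0, and the counter equals the number of completed iterations.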

Port the hook-based tensor dumping from tensor_dump_forward_hook.py into the dumper subsystem as a standalone _HookDumper class. This enables automatic tensor capture via PyTorch forward hooks while reusing dumper's filtering, HTTP control, and metadata infrastructure. Also fix test_disabled_dumper_no_output to use configure() instead of passing enable=False as constructor override.

- Remove dump_layers param from _HookDumper (use dumper filter instead)
- Add top_level_module_name and layers_module_name to _DumperConfig (auto env var support: DUMPER_TOP_LEVEL_MODULE_NAME, etc.)
- Remove duplicate _HookDumper class
- Remove test_layer_filtering test

…rdBatch input capture

- _convert_hook_output returns dict[str, Tensor] for any hook value
- _dump_converted loops over the dict and calls dumper.dump()
- Top-level model hook now also processes input args to capture ForwardBatch (input_ids, seq_lens, positions)

Previously cleanup_previous was triggered lazily on the first dump() call inside _dump_inner. This is unreliable in distributed settings because different ranks may reach the first dump at different times (or not at all in pipeline parallelism), causing the barrier inside _cleanup_old_dumps to hang. Now cleanup is triggered eagerly in configure() right when the config is set, where all ranks call it together. The _cleanup_old_dumps function already has rank-0-only rmtree + dist.barrier().

… error

Previously _dump_single unconditionally stripped .data from Parameters, making _torch_save's Parameter fallback dead code for the dict format. Now Parameters are saved as-is. The fallback in _torch_save is extended to handle dicts containing unpicklable Parameter subclasses.

fzyzcjy added 30 commits February 20, 2026 19:39

more

ec4a5ee

more

6602e67

more

e60f7ce

more

ab2ac39

more

079b9f2

Shorten curr_step to step in dump tags, state keys, and internals

89d3ac8

more

38da9e0

Rename env var prefix SGLANG_DUMPER_ to DUMPER_

14fe78f

more

90fe262

more

dd9014a

Remove allow_sglang=True from temp_set_env calls for DUMPER_ env vars

97474fa

more

fe39601

more

a2f0603

more

989cd19

more

db3e88e

fmt

5b4edc4

Merge branch 'ac8402/0' into ac8402/1

c0373f4

more

4ec4d94

more

54e76e5

more

2ed766a

more

c67e67d

more

9072082

Fix _HookDumper bugs: undefined vars and stale kwargs in recursive call

19c2705

Store top_level_module_name as instance var for clean access, fix recursive call passing removed keyword argument.

more

ee938e3

more

e26543b

more

dc1f04f

fzyzcjy requested a review from xiezhq-hermann as a code owner February 21, 2026 02:35

fzyzcjy added 26 commits February 21, 2026 10:38

more

a7025d7

refactor: _http_manager returns None when no server port configured

3a9979c

Skip creating _DumperHttpManager entirely when server_port_parsed is None, instead of constructing one that only wraps a _LocalOnlyBroadcast that is never used.

more

7066616

refactor: move _handle_http_control_request into _DumperHttpManager

45142c6

ZMQ RPC target is now the manager itself instead of the raw _Dumper, so all HTTP control logic is encapsulated in _DumperHttpManager.

more

ae974be

more

017c419

fmt

113ce1e

more

61e7a84

Revert "Move cleanup_previous from lazy _dump_inner to eager configur…

251a4e0

…e()" This reverts commit 13dd918.

Wrap cleanup_previous barrier with timeout warning

f2746f8

Use _collective_with_timeout for the dist.barrier() in _cleanup_old_dumps so a warning is printed if not all ranks participate within the timeout window.
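A rough sketch of the "warn if the collective does not finish in time" idea, using only the standard library. The name `collective_with_timeout`, its signature, and the warning text are assumptions made for illustration; the real `_collective_with_timeout` and the `dist.barrier()` call it wraps are not shown on this page. Here `fn` stands in for the blocking barrier.

```python
import threading
import time
from typing import Callable, List


def collective_with_timeout(
    fn: Callable[[], None],
    timeout_s: float,
    warn: Callable[[str], None] = print,
) -> None:
    """Run a blocking call and emit a warning if it is still running after timeout_s.

    Hypothetical helper: it only warns, it does not abort the blocked call.
    """
    done = threading.Event()

    def watchdog() -> None:
        # Event.wait returns False when the timeout elapses before done is set.
        if not done.wait(timeout_s):
            warn(f"collective still waiting after {timeout_s}s; some ranks may not have joined")

    watcher = threading.Thread(target=watchdog, daemon=True)
    watcher.start()
    try:
        fn()  # stand-in for a blocking barrier
    finally:
        done.set()
        watcher.join()


slow_warnings: List[str] = []
collective_with_timeout(lambda: time.sleep(0.3), timeout_s=0.05, warn=slow_warnings.append)

fast_warnings: List[str] = []
collective_with_timeout(lambda: None, timeout_s=1.0, warn=fast_warnings.append)
```

A watchdog of this kind does not fix a hang, but it turns a silent stall into a visible message that points at a rank that never reached the barrier.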

test(dumper): add TestServerPortParsed for all server_port_parsed bra…

d4c0d30

…nches Cover the 4 branches of _DumperConfig.server_port_parsed: negative -> None, zero -> None, positive -> int, "reuse" -> str. Also add missing imports for upcoming tests.
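The four branches listed above could look roughly like the function below. This is a sketch under assumptions: the function name `parse_server_port` and the choice of a string input are hypothetical, and only the mapping (negative -> None, zero -> None, positive -> int, "reuse" -> str) comes from the commit message.

```python
from typing import Optional, Union


def parse_server_port(raw: str) -> Optional[Union[int, str]]:
    """Hypothetical parser mirroring the four branches named in the commit message."""
    if raw == "reuse":
        return "reuse"  # special string value is passed through unchanged
    port = int(raw)
    # Zero and negative values mean "no HTTP server".
    return port if port > 0 else None


results = {raw: parse_server_port(raw) for raw in ["-1", "0", "40000", "reuse"]}
```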

test(dumper): add TestDefaultExpName to verify exp_name format

635bff3

Verify prefix is "dump_" and suffix has expected timestamp format (YYYYMMDD_HHMMSS_MMMRRR = 22 chars with underscore at position 8).
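One way to produce a name with that shape is sketched below. The function name is hypothetical, and reading `MMM` as milliseconds and `RRR` as three random digits is an assumption; the source only states the `dump_` prefix, the 22-character suffix, and the underscore at position 8.

```python
import random
from datetime import datetime
from typing import Optional


def default_exp_name(now: Optional[datetime] = None) -> str:
    """Hypothetical generator for a name shaped like dump_YYYYMMDD_HHMMSS_MMMRRR."""
    now = now or datetime.now()
    millis = now.microsecond // 1000  # assumed meaning of MMM
    rand = random.randint(0, 999)     # assumed meaning of RRR
    return f"dump_{now:%Y%m%d_%H%M%S}_{millis:03d}{rand:03d}"


name = default_exp_name(datetime(2026, 2, 21, 2, 8, 22, 123456))
suffix = name[len("dump_"):]
```

The suffix length works out as 8 + 1 + 6 + 1 + 6 = 22 characters, with the first underscore at index 8.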

test(dumper): add test_capture_output_nested_raises

80455a6

Verify that nesting capture_output() raises AssertionError to prevent accidental data overwrites from concurrent captures.
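The guard being tested can be illustrated with a small context manager. `MiniDumper` and its internals are hypothetical; only the behaviour "nesting `capture_output()` raises `AssertionError`" is taken from the commit message.

```python
from contextlib import contextmanager
from typing import Dict, Iterator, Optional


class MiniDumper:
    """Hypothetical stand-in showing a non-reentrant capture_output()."""

    def __init__(self) -> None:
        self._captured: Optional[Dict[str, object]] = None

    @contextmanager
    def capture_output(self) -> Iterator[Dict[str, object]]:
        # Refuse to start a second capture while one is active, so the
        # outer capture's data cannot be overwritten.
        assert self._captured is None, "capture_output() cannot be nested"
        self._captured = {}
        try:
            yield self._captured
        finally:
            self._captured = None

    def dump(self, name: str, value: object) -> None:
        if self._captured is not None:
            self._captured[name] = value


d = MiniDumper()
with d.capture_output() as out:
    d.dump("x", 1)
    try:
        with d.capture_output():
            pass
        nested_allowed = True
    except AssertionError:
        nested_allowed = False
```

Because the assertion fires before the inner capture replaces the buffer, the outer capture keeps its data.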

test(dumper): add test_error_unknown_method to TestDumperHttp

761c93c

Verify that POSTing to an unknown dumper control method returns 400.
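The behaviour under test can be sketched as a plain dispatcher, independent of any HTTP framework. The function name, the handler table, and the response bodies are assumptions for illustration; only "unknown method returns 400" comes from the commit message.

```python
from typing import Any, Callable, Dict, Tuple


def handle_control_request(
    method: str,
    body: Dict[str, Any],
    handlers: Dict[str, Callable[..., Any]],
) -> Tuple[int, Dict[str, Any]]:
    """Hypothetical dispatcher: look up a control method by name, 400 if unknown."""
    handler = handlers.get(method)
    if handler is None:
        return 400, {"error": f"unknown method: {method}"}
    return 200, {"result": handler(**body)}


# Illustrative handler table; these method names are not taken from the PR.
handlers = {"get_state": lambda: {"step": 0}}

ok_status, ok_body = handle_control_request("get_state", {}, handlers)
bad_status, bad_body = handle_control_request("no_such_method", {}, handlers)
```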

test(dumper): add test for all-enables-false early return path

e374bd3

When both enable_value and enable_grad are False, _dump_inner should skip without producing any output files.

test(dumper): add TestGetIntEnvVar for all get_int_env_var branches

184d618

Cover unset (default), valid int, invalid string (default fallback), and whitespace-only string (default fallback).

test(dumper): add test for unknown hook mode ValueError

67fc4a1

Verify _register_forward_hook_or_replace_fn raises ValueError when given an unrecognized mode string.

test(dumper): add _deepcopy_or_clone tests for tensor and non-tensor

70a27b7

Cover both the tensor clone path and the dict deepcopy path, verifying mutations to the original don't affect the copy.

test(dumper): add test_parameter_saved_as_data to TestDumpModel

9c2229a

Verify that dump_model saves Parameter.data (plain Tensor), not the Parameter wrapper itself.

test(dumper): add test for whitespace env var treated as unset

cc04c35

Verify that _parse_env_value treats whitespace-only strings as unset and returns the field default.
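A minimal sketch of the "whitespace-only counts as unset" rule for an integer field. The helper name `parse_env_int` and its signature are hypothetical; the actual `_parse_env_value` is not shown on this page and presumably handles more field types.

```python
import os
from typing import Mapping


def parse_env_int(name: str, default: int, env: Mapping[str, str] = os.environ) -> int:
    """Hypothetical helper: return default when the variable is missing or blank."""
    raw = env.get(name)
    if raw is None or not raw.strip():
        # Missing, empty, or whitespace-only values all fall back to the default.
        return default
    return int(raw)


unset_value = parse_env_int("DUMPER_EXAMPLE", 7, env={})
blank_value = parse_env_int("DUMPER_EXAMPLE", 7, env={"DUMPER_EXAMPLE": "   "})
set_value = parse_env_int("DUMPER_EXAMPLE", 7, env={"DUMPER_EXAMPLE": " 42 "})
```

Passing the environment as a mapping keeps the sketch testable without mutating `os.environ`; `DUMPER_EXAMPLE` is a made-up variable name.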

style: apply black formatting to test_dumper.py

d8b9cfa

Rename _DumperConfig to DumperConfig

4e8f225

refactor(dumper): remove unused get_int_env_var from dumper.py

5e6c3ef

The function was copied from utils/common.py but never called within dumper.py itself. Remove it and its tests.

refactor(dumper): extract _strip_parameter from _torch_save

175e493

Replace two ad-hoc isinstance branches with a single helper that handles both bare Parameter and dict-wrapped Parameter forms.

fzyzcjy force-pushed the ac8402/8 branch from 4bc0e56 to 175e493 Compare February 21, 2026 14:29

Merge remote-tracking branch 'upstream/main' into ac8402/8

8552c2a

fzyzcjy merged commit c1f497e into sgl-project:main Feb 22, 2026
24 of 28 checks passed

fzyzcjy merged 164 commits into sgl-project:main from fzyzcjy:ac8402/8
