Skip to content

Refactor: drop WorkerPayload; IWorker::run takes TaskArgsView#564

Merged
ChaoWao merged 1 commit intohw-native-sys:mainfrom
ChaoWao:refactor/drop-workerpayload-iworker-run-view
Apr 15, 2026
Merged

Refactor: drop WorkerPayload; IWorker::run takes TaskArgsView#564
ChaoWao merged 1 commit intohw-native-sys:mainfrom
ChaoWao:refactor/drop-workerpayload-iworker-run-view

Conversation

@ChaoWao
Copy link
Copy Markdown
Collaborator

@ChaoWao ChaoWao commented Apr 15, 2026

Summary

  • IWorker::run(uint64_t callable, TaskArgsView args, ChipCallConfig) replaces the WorkerPayload dispatch carrier. Each worker interprets callable per its semantics (ChipCallable ptr / registry id / orch-fn handle).
  • DistTaskSlotState drops chip_storage_list in favour of TaskArgs (single) / task_args_list (N-SPMD group) + is_group_ flag. slot.args_view(i) is a zero-copy TaskArgsView. callable_ptr renames to callable.
  • view_to_chip_storage moves from submit into ChipWorker::run — the L2 ABI POD is assembled lazily at the worker.
  • PROCESS-mode mailbox writes a length-prefixed blob via write_blob; the Python child decodes it through a new _ChipWorker.run_from_blob binding that rebuilds the ChipStorageTaskArgs POD on the child heap.
  • WorkerThread queue now carries WorkerDispatch { slot, group_index }; the thread resolves slot state via ring->slot_state(id). Scheduler pushes N dispatches per group member.
  • Strict-1 per-scope rings were split out into a separate PR-Scope (plan revision 2026-04-15) that also exposes user-facing scope_begin / scope_end. PR-C intentionally does not touch ring partitioning.

Docs

docs/roadmap.md, docs/scheduler.md, docs/worker-manager.md, docs/task-flow.md, and the PR-C entry in the plan are updated to match the new dispatch surface, slot shape, and PR-Scope split.

Test plan

  • Python unit tests pass: pytest tests/ut/py/test_dist_worker/ (18/18)
  • C++ unit tests pass: ctest in build/ut_cpp (7/7 — ring, scope, tensormap, orchestrator, scheduler, pto2_fatal × 2)
  • pip install --no-build-isolation -e . builds cleanly
  • grep -r WorkerPayload src/ python/ returns only historical doc references (all source clean)
  • grep -r chip_storage_list src/ python/ returns 0 matches
  • Hardware / sim scene tests (not run locally — need device)

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the task dispatch system by removing the WorkerPayload carrier and updating the IWorker::run interface to use TaskArgsView. Task arguments are now stored directly in the DistTaskSlotState, with the conversion to the L2 ABI ChipStorageTaskArgs POD deferred to the worker's execution phase. The WorkerThread now utilizes a WorkerDispatch handle to resolve task details from the DistRing at dispatch time. Documentation and tests have been updated to reflect these architectural changes. Feedback was provided regarding the efficiency of buffer allocation within the Python worker loop, suggesting that moving the allocation outside the loop or avoiding the copy could reduce garbage collector pressure.

Comment thread python/simpler/worker.py Outdated
- IWorker::run(uint64_t callable, TaskArgsView args, ChipCallConfig)
  replaces the WorkerPayload carrier. Each worker interprets callable
  per its semantics: ChipWorker -> ChipCallable buffer ptr cast;
  DistSubWorker -> registry id (low 32 bits); DistChipProcess -> forwarded
  through the shm mailbox to its forked child.
- DistTaskSlotState drops chip_storage_list in favour of TaskArgs
  (single) and task_args_list (N-SPMD group) with an is_group_ flag;
  slot.args_view(i) returns a zero-copy TaskArgsView for dispatch.
  callable_ptr renames to callable.
- view_to_chip_storage moves into ChipWorker::run so the L2 ABI POD
  is assembled lazily at the worker instead of at submit time.
- PROCESS-mode mailbox writes a length-prefixed blob via write_blob;
  the Python child decodes it through the new _ChipWorker.run_from_blob
  binding, which rebuilds the ChipStorageTaskArgs POD on the child heap
  before calling pto2_run_runtime. The child reads the blob directly
  from the mailbox — no per-task heap buffer — since the parent is
  blocked on TASK_DONE and view_to_chip_storage already memcpys the
  tensor / scalar bytes into the POD.
- WorkerThread queue carries WorkerDispatch { slot, group_index } — the
  thread reads callable / args / config via ring->slot_state(id).
  Scheduler pushes N dispatches (one per group member) for group slots;
  the per-worker payload carrier is gone.
- Strict-1 per-scope rings were split out into a separate PR-Scope
  (plan revision 2026-04-15) that also exposes user-facing scope_begin /
  scope_end; PR-C no longer touches ring partitioning.
- Fix conftest.py --pto-isa-commit import path: the symbol is
  ensure_pto_isa_root in simpler_setup.pto_isa, not
  _ensure_pto_isa_root in simpler_setup.code_runner. Unblocks the
  st-onboard-a2a3 CI job, which invokes pytest with --pto-isa-commit.

Docs: roadmap / scheduler / worker-manager / task-flow updated to match
the new dispatch surface, slot shape, and PR-Scope split.

Tests: 18 Python dist_worker UTs + 7 C++ UTs (ring, scope, tensormap,
orchestrator, scheduler, pto2_fatal) pass.
@ChaoWao ChaoWao force-pushed the refactor/drop-workerpayload-iworker-run-view branch from 49c36a1 to e8ceb4c Compare April 15, 2026 07:47
@ChaoWao ChaoWao merged commit e3b07d9 into hw-native-sys:main Apr 15, 2026
15 checks passed
@ChaoWao ChaoWao deleted the refactor/drop-workerpayload-iworker-run-view branch April 16, 2026 03:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant