Refactor: drop WorkerPayload; IWorker::run takes TaskArgsView#564
Conversation
There was a problem hiding this comment.
Code Review
This pull request refactors the task dispatch system by removing the WorkerPayload carrier and updating the IWorker::run interface to use TaskArgsView. Task arguments are now stored directly in the DistTaskSlotState, with the conversion to the L2 ABI ChipStorageTaskArgs POD deferred to the worker's execution phase. The WorkerThread now utilizes a WorkerDispatch handle to resolve task details from the DistRing at dispatch time. Documentation and tests have been updated to reflect these architectural changes. Feedback was provided regarding the efficiency of buffer allocation within the Python worker loop, suggesting that moving the allocation outside the loop or avoiding the copy could reduce garbage collector pressure.
- IWorker::run(uint64_t callable, TaskArgsView args, ChipCallConfig)
replaces the WorkerPayload carrier. Each worker interprets callable
per its semantics: ChipWorker -> ChipCallable buffer ptr cast;
DistSubWorker -> registry id (low 32 bits); DistChipProcess -> forwarded
through the shm mailbox to its forked child.
- DistTaskSlotState drops chip_storage_list in favour of TaskArgs
(single) and task_args_list (N-SPMD group) with an is_group_ flag;
slot.args_view(i) returns a zero-copy TaskArgsView for dispatch.
callable_ptr renames to callable.
- view_to_chip_storage moves into ChipWorker::run so the L2 ABI POD
is assembled lazily at the worker instead of at submit time.
- PROCESS-mode mailbox writes a length-prefixed blob via write_blob;
the Python child decodes it through the new _ChipWorker.run_from_blob
binding, which rebuilds the ChipStorageTaskArgs POD on the child heap
before calling pto2_run_runtime. The child reads the blob directly
from the mailbox — no per-task heap buffer — since the parent is
blocked on TASK_DONE and view_to_chip_storage already memcpys the
tensor / scalar bytes into the POD.
- WorkerThread queue carries WorkerDispatch { slot, group_index } — the
thread reads callable / args / config via ring->slot_state(id).
Scheduler pushes N dispatches (one per group member) for group slots;
the per-worker payload carrier is gone.
- Strict-1 per-scope rings were split out into a separate PR-Scope
(plan revision 2026-04-15) that also exposes user-facing scope_begin /
scope_end; PR-C no longer touches ring partitioning.
- Fix conftest.py --pto-isa-commit import path: the symbol is
ensure_pto_isa_root in simpler_setup.pto_isa, not
_ensure_pto_isa_root in simpler_setup.code_runner. Unblocks the
st-onboard-a2a3 CI job, which invokes pytest with --pto-isa-commit.
Docs: roadmap / scheduler / worker-manager / task-flow updated to match
the new dispatch surface, slot shape, and PR-Scope split.
Tests: 18 Python dist_worker UTs + 7 C++ UTs (ring, scope, tensormap,
orchestrator, scheduler, pto2_fatal) pass.
49c36a1 to
e8ceb4c
Compare
Summary
IWorker::run(uint64_t callable, TaskArgsView args, ChipCallConfig)replaces theWorkerPayloaddispatch carrier. Each worker interpretscallableper its semantics (ChipCallable ptr / registry id / orch-fn handle).DistTaskSlotStatedropschip_storage_listin favour ofTaskArgs(single) /task_args_list(N-SPMD group) +is_group_flag.slot.args_view(i)is a zero-copyTaskArgsView.callable_ptrrenames tocallable.view_to_chip_storagemoves from submit intoChipWorker::run— the L2 ABI POD is assembled lazily at the worker.write_blob; the Python child decodes it through a new_ChipWorker.run_from_blobbinding that rebuilds theChipStorageTaskArgsPOD on the child heap.WorkerThreadqueue now carriesWorkerDispatch { slot, group_index }; the thread resolves slot state viaring->slot_state(id). Scheduler pushes N dispatches per group member.scope_begin/scope_end. PR-C intentionally does not touch ring partitioning.Docs
docs/roadmap.md,docs/scheduler.md,docs/worker-manager.md,docs/task-flow.md, and the PR-C entry in the plan are updated to match the new dispatch surface, slot shape, and PR-Scope split.Test plan
pytest tests/ut/py/test_dist_worker/(18/18)ctestinbuild/ut_cpp(7/7 — ring, scope, tensormap, orchestrator, scheduler, pto2_fatal × 2)pip install --no-build-isolation -e .builds cleanlygrep -r WorkerPayload src/ python/returns only historical doc references (all source clean)grep -r chip_storage_list src/ python/returns 0 matches