Add manual-scope v0 to tensormap runtime #568
uv-xiao wants to merge 21 commits into hw-native-sys:main from
Conversation
- add PTO2ScopeMode and Arg.add_dep(task_id) on the existing submit path
- carry pending scope mode through PTO2Runtime so manual scope works in sim/device orchestration without changing the ops table
- reject nested manual scopes, add_dep outside manual scope, foreign manual deps, and explicit deps on alloc_tensors at submit time
- add C++ API coverage and a2a3sim scene tests for the invalid-usage cases

- turn Arg.add_dep(task_id) into ordinary submit-time fanins in the orchestrator
- stamp runtime-created tensors with producer scope and manual-scope depth metadata for conservative manual-local classification
- skip TensorMap lookup and insert only for current-manual-scope-local tensors when explicit deps already provide the ordering
- add a runtime-level C++ unit test that links the real PTO2 runtime sources and verifies explicit dep fanins plus output metadata

- switch the per-q scope to PTO2ScopeMode::MANUAL in the example and profiling scene variant
- replay explicit task-id deps for qk->softmax, softmax->pv, and alloc/softmax/pv->online_update at submit time
- keep the scene-test orchestration in one manual scope so current-scope validation accepts the explicit dep ids

- reject unsupported paged-attention shape tuples in the example and production ST orchestrations before kernel submission
- remove the unsupported head_dim=256 positive case from the production paged-attention golden list
- add a hardware negative test that asserts the runtime reports PTO2 invalid-args status for the invalid head_dim case

- summarize the supported-scope correctness checks that currently pass
- record fresh paged-attention benchmarks for merge-base TMR, current TMR, and current ABG on real hardware
- state clearly that the branch is functionally correct for the supported scope but still not performance-ready

- replace explicit-dep scope scans with exact per-scope epoch checks
- skip redundant creator-retention work when an owner is already explicit
- cache tensor start_offset eagerly so payload init stops recomputing it
- add runtime tests for scope isolation and eager view offset caching

- split the pre-tuning benchmark table from later tuning work
- add a per-optimization ledger with files, functions, and measured effect
- record reverted experiments that failed correctness or did not hold up
- summarize the current tuned snapshot and remaining hot buckets

- keep default paged_attention in auto mode and restore its nested auto scope
- add _manual_scope siblings for non-unroll and unroll under tests/ and examples/
- add the missing example-side paged_attention_unroll auto variant
- extend benchmark_rounds.sh to enumerate the auto/manual paged attention rows

- restore upstream allocator and tensor start-offset behavior
- keep manual-scope metadata and validation paths intact
- remove the runtime unit test that only covered eager offset caching

- restore runtime Arg to the current TaskArgsTpl contract after rebase
- update rebased paged-attention goldens and orchestration symbols to match main
- drop the benchmark_rounds workaround so the branch stays non-intrusive

- Preserve TensorMap publication and lookup for manual-scope modifier tensors in pto2_submit_mixed_task() so repeated updater chains stay correct even when producer-flow deps are explicit
- Refresh docs/manual-scope-v0-design.md with the rebased runtime rule, fresh device validation, and a new 30-round benchmark batch
Code Review
This pull request introduces a lighter 'manual-scope' mode for the a2a3/tensormap_and_ringbuffer service, allowing for explicit task dependency management while maintaining the existing AUTO-mode submit API. The design avoids complex delayed wiring, opting for explicit Arg.add_dep(task_id) annotations within a PTO2_SCOPE(PTO2ScopeMode::MANUAL) block. I have identified a critical issue where PTO2_SCOPE_GUARD(); is used incorrectly, resulting in an immediate destruction of the guard and leaving the loop body unprotected. Additionally, the design document contains a future date that should be corrected.
```cpp
const Tensor &mi_update = alloc_outs.get_ref(2);

for (uint64_t bn = 0; bn < bn_this_batch; bn++) {
  PTO2_SCOPE_GUARD();
```
The statement PTO2_SCOPE_GUARD(); creates a temporary PTO2ScopeGuard object that is immediately destroyed at the semicolon. This means the scope it's intended to guard is empty, and the rest of the loop body executes outside of any PTO2 scope. This is likely not the intended behavior and can lead to subtle bugs.
To ensure the scope is active for the entire for loop block, you should declare a named PTO2ScopeGuard variable.
```cpp
PTO2_SCOPE_GUARD guard;
```

References
- Use RAII guards to manage resources like thread-specific data and device contexts, ensuring cleanup is automatically handled on all function exit paths.
```
@@ -0,0 +1,423 @@
# Manual Scope V0 Design

Date: 2026-04-15
```
- drop benchmark-only and duplicate paged-attention payload trees that are outside the intended manual-scope v0 review surface
- restore the auto paged_attention orchestration file to the upstream baseline so the PR only carries manual-scope-related example changes
- remove the paged-attention validation test added by the heavier branch line so follow-up benchmarking and review focus on the v0 runtime

- rewrite the validation and benchmark sections for the cleaned manual_scope_v0 branch state
- record the fresh 2026-04-15 sim checks, device golden reruns, and 30-round trimmed benchmark table gathered on device 9
- remove stale rebase-specific narrative and old performance claims so the design doc matches the current PR scope and measured results

- explain that raw paged_attention vs paged_attention_unroll numbers are not cross-workload comparable and derive the task-count difference from the current device logs
- restore a historical optimization section for the earlier heavy branch, clearly marked as context rather than current PR scope
- keep the current-branch benchmark table separate from historical notes
Latest update on top of the current PR branch

What changed
Fresh benchmark results: 30 rounds, trimmed average, device
Fresh TensorMap profiling: Non-unroll
What these numbers show:
The design note in
Summary
- Adds manual_scope v0 to a2a3/tensormap_and_ringbuffer without introducing a separate manual submit API family
- Explicit dependencies via Arg.add_dep(task_id)
- Submit results carry a task_id, so zero-output updater chains can be expressed explicitly
- Design note: docs/manual-scope-v0-design.md

Design Reference
Primary design note:
docs/manual-scope-v0-design.md

Constraints Of This Version
This PR intentionally stays narrow.
- PTO2_SCOPE() remains AUTO by default
- PTO2_SCOPE(PTO2ScopeMode::MANUAL) enables manual scope
- The existing submit entry points are kept: pto2_rt_submit_aic_task(...), pto2_rt_submit_aiv_task(...), pto2_rt_submit_task(...)
- Explicit deps are declared with Arg.add_dep(task_id)
- Arg.add_dep(...) is rejected outside manual scope

Core Runtime Semantics
Inside manual scope, tasks are still published immediately at submit time, like AUTO mode.
The additional v0 behavior is:
- explicit deps attached to the Arg become ordinary submit-time fanins

This keeps manual scope as a lightweight extension on top of normal PTO2 submit rather than a separate graph-construction pipeline.
Manual-local vs Boundary Tensors
Current-manual-scope-local tensors are identified by producer manual-scope depth.
For manual-local tensors:
For boundary tensors:
- published with manual_dep=true

Boundary tensors include:
Why alloc_tensors(...) Matters

alloc_tensors(...) stays output-only, but v0 treats allocation as a task and returns its task id together with the output tensors.

That allows later manual tasks to depend on allocation explicitly:
```cpp
auto alloc = alloc_tensors(ci0, ci1);
Arg args;
args.add_dep(alloc.task_id());
```

Zero-output Updater Chaining
Submit results now carry a standalone
task_id, independent of whether any new OUTPUT tensor was materialized.

That lets manual orchestration express repeated updater chains explicitly:
Fresh Hardware Benchmark
30 rounds, trimmed average, device 9, PTO-ISA d96c8784, covering paged_attention Case1/Case2 and paged_attention_unroll Case1/Case2.

Interpretation:
Fresh TensorMap Profiling
Non-unroll paged_attention, 30 rounds, device-log parsing from a profiling-only rebuild: lookup+dep (us) and tensormap_ins (us) for Case1 and Case2.

Interpretation:
Testing
Verified on this branch tip:
- ctest --test-dir tests/ut/cpp/build -R 'test_a2a3_pto2_manual_scope_(api|runtime)' --output-on-failure
- python -m pytest tests/st/a2a3/tensormap_and_ringbuffer/test_manual_scope_validation.py --platform a2a3sim --device 0 -q
- On a2a3, device 9:
  - paged_attention Case1/Case2
  - paged_attention_manual_scope Case1/Case2
  - paged_attention_unroll Case1/Case2
  - paged_attention_unroll_manual_scope Case1/Case2

Notes
This PR is intended to replace the earlier draft line of work around manual dependency experiments with a cleaner v0 proposal.