Add manual-scope v0 to tensormap runtime by uv-xiao · Pull Request #568 · hw-native-sys/simpler

uv-xiao · 2026-04-15T08:01:32Z

Summary

add manual_scope v0 to a2a3/tensormap_and_ringbuffer without introducing a separate manual submit API family
keep the same submit calls as AUTO mode and express same-scope ordering through Arg.add_dep(task_id)
publish tasks at submit time; do not add delayed wiring, delayed linking, or a manual scope-end replay barrier
make submit/alloc results carry a standalone task_id, so zero-output updater chains can be expressed explicitly
bypass TensorMap lookup and insert for current-manual-scope-local tensors; keep TensorMap only for boundary tensors
add manual-scope validation coverage, paged-attention manual-scope examples, and the design note in docs/manual-scope-v0-design.md

Design Reference

Primary design note:

docs/manual-scope-v0-design.md

Constraints Of This Version

This PR intentionally stays narrow.

PTO2_SCOPE() remains AUTO by default
PTO2_SCOPE(PTO2ScopeMode::MANUAL) enables manual scope
submit APIs stay unchanged:
- pto2_rt_submit_aic_task(...)
- pto2_rt_submit_aiv_task(...)
- pto2_rt_submit_task(...)
explicit ordering is expressed only through Arg.add_dep(task_id)
Arg.add_dep(...) is rejected outside manual scope
nested manual scopes are rejected in v0
no post-submit dependency API
no delayed wiring / delayed linking / delayed publish path
no manual-specific scope-end dependency replay
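
The scope rules above can be sketched with a small self-contained mock. Everything here (`MockScopes`, `MockArg`, `Status`) is hypothetical and only mirrors the constraints listed in this section; it is not the PTO2 runtime. The sketch assumes that "nested" means opening a MANUAL scope while any enclosing scope is already MANUAL, and that the `add_dep` check happens at submit time against the innermost scope.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for illustration only; not the real PTO2 types.
enum class ScopeMode { AUTO, MANUAL };
enum class Status { OK, NESTED_MANUAL_REJECTED, DEP_OUTSIDE_MANUAL_REJECTED };

struct MockArg {
    std::vector<uint32_t> deps;  // explicit task-id deps
    void add_dep(uint32_t task_id) { deps.push_back(task_id); }
};

class MockScopes {
public:
    // Open a scope. AUTO is the default, mirroring PTO2_SCOPE().
    Status begin(ScopeMode mode = ScopeMode::AUTO) {
        if (mode == ScopeMode::MANUAL && manual_depth_ > 0) {
            return Status::NESTED_MANUAL_REJECTED;  // v0: no nested manual scopes
        }
        stack_.push_back(mode);
        if (mode == ScopeMode::MANUAL) ++manual_depth_;
        return Status::OK;
    }

    // Close the innermost scope.
    void end() {
        assert(!stack_.empty());
        if (stack_.back() == ScopeMode::MANUAL) --manual_depth_;
        stack_.pop_back();
    }

    // Submit-time check: explicit deps are legal only if the innermost scope is MANUAL.
    Status check_submit(const MockArg &arg) const {
        const bool manual = !stack_.empty() && stack_.back() == ScopeMode::MANUAL;
        if (!arg.deps.empty() && !manual) return Status::DEP_OUTSIDE_MANUAL_REJECTED;
        return Status::OK;
    }

private:
    std::vector<ScopeMode> stack_;
    int manual_depth_ = 0;
};
```

In this mock, a submit that carries explicit deps is accepted only while a MANUAL scope is innermost, and a second `begin(ScopeMode::MANUAL)` inside it returns `NESTED_MANUAL_REJECTED` without changing the scope stack.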

Core Runtime Semantics

Inside manual scope, tasks are still published immediately at submit time, like AUTO mode.

The additional v0 behavior is:

read explicit deps from Arg
validate that those deps belong to the current manual scope
materialize them as ordinary fanins before publish
classify tensor args by scope metadata to decide whether TensorMap is still required

This keeps manual scope as a lightweight extension on top of normal PTO2 submit rather than a separate graph-construction pipeline.

Manual-local vs Boundary Tensors

Current-manual-scope-local tensors are identified by producer manual-scope depth.

For manual-local tensors:

explicit task ids are the only ordering source
no TensorMap lookup
no TensorMap insert

For boundary tensors:

keep creator retention
keep normal TensorMap lookup / insert behavior unless manual_dep=true

Boundary tensors include:

external tensors
AUTO-scope tensors
outer-scope tensors
outer-manual-scope tensors
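
The classification can be sketched as a pure function over tensor metadata. The field names below (`runtime_created`, `producer_manual_depth`) and the placement of `manual_dep` on the tensor are assumptions for illustration; the PR only states that runtime-created tensors are stamped with producer scope and manual-scope depth. The sketch also compares depth alone, which is enough here because v0 rejects nested manual scopes.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical metadata; the real tensor fields may be named and laid out differently.
struct MockTensor {
    bool runtime_created;           // false for external tensors
    int32_t producer_manual_depth;  // manual-scope depth at creation, 0 if none
    bool manual_dep;                // caller opted out of TensorMap tracking
};

enum class TensorClass { MANUAL_LOCAL, BOUNDARY };

// A tensor is manual-local only if it was created by the runtime inside the
// manual scope that is currently active.
inline TensorClass classify(const MockTensor &t, int32_t current_manual_depth) {
    assert(current_manual_depth >= 0);
    const bool in_manual = current_manual_depth > 0;
    if (in_manual && t.runtime_created &&
        t.producer_manual_depth == current_manual_depth) {
        return TensorClass::MANUAL_LOCAL;
    }
    // External, AUTO-scope, outer-scope, and outer-manual-scope tensors.
    return TensorClass::BOUNDARY;
}

// Manual-local tensors skip TensorMap; boundary tensors use it unless manual_dep is set.
inline bool uses_tensormap(const MockTensor &t, int32_t current_manual_depth) {
    if (classify(t, current_manual_depth) == TensorClass::MANUAL_LOCAL) return false;
    return !t.manual_dep;
}
```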

Why `alloc_tensors(...)` Matters

alloc_tensors(...) stays output-only, but v0 treats allocation as a task and returns its task id together with the output tensors.

That allows later manual tasks to depend on allocation explicitly:

auto alloc = alloc_tensors(ci0, ci1);
Arg args;
args.add_dep(alloc.task_id());

Zero-output Updater Chaining

Submit results now carry a standalone task_id, independent of whether any new OUTPUT tensor was materialized.

That lets manual orchestration express repeated updater chains explicitly:

PTO2TaskId prev_update = PTO2TaskId::invalid();

for (...) {
    Arg up = make_update_args(...);
    if (prev_update.is_valid()) {
        up.add_dep(prev_update);
    }
    TaskOutputTensors update_out = pto2_rt_submit_aiv_task(FUNC_ONLINE_UPDATE, up);
    prev_update = update_out.task_id();
}

Fresh Hardware Benchmark

30 rounds, trimmed average, device 9, PTO-ISA d96c8784.

Example	Case	Auto Elapsed (us)	Auto Orch (us)	Manual Elapsed (us)	Manual Orch (us)	Elapsed Delta	Orch Delta
`paged_attention`	`Case1`	77.5	62.4	124.0	108.6	+46.5	+46.2
`paged_attention`	`Case2`	96.2	74.1	146.7	124.6	+50.5	+50.5
`paged_attention_unroll`	`Case1`	1138.4	800.7	1131.3	694.9	-7.1	-105.8
`paged_attention_unroll`	`Case2`	519.7	319.1	511.3	282.6	-8.4	-36.5

Interpretation:

non-unroll manual scope is still slower than AUTO, and the gap is mostly orch time
unroll manual scope is now slightly faster than AUTO on both kept cases

Fresh TensorMap Profiling

Non-unroll paged_attention, 30 rounds, device-log parsing from a profiling-only rebuild:

Case	Mode	`lookup+dep` (us)	`tensormap_ins` (us)	Lookups	Inserts	Full Orch (us)
`Case1`	AUTO	4.132	1.842	40.0	12.0	194.508
`Case1`	MANUAL	1.944	1.414	16.0	3.0	259.318
`Case2`	AUTO	6.320	2.638	105.0	32.0	210.274
`Case2`	MANUAL	2.598	1.728	41.0	8.0	285.182

Interpretation:

the manual-local TensorMap bypass is working
the remaining non-unroll gap is no longer explained by TensorMap lookup / insert
the next optimization target is explicit-dep construction / validation overhead

Testing

Verified on this branch tip:

ctest --test-dir tests/ut/cpp/build -R 'test_a2a3_pto2_manual_scope_(api|runtime)' --output-on-failure
python -m pytest tests/st/a2a3/tensormap_and_ringbuffer/test_manual_scope_validation.py --platform a2a3sim --device 0 -q
real-device golden reruns on a2a3, device 9:
- paged_attention Case1 / Case2
- paged_attention_manual_scope Case1 / Case2
- paged_attention_unroll Case1 / Case2
- paged_attention_unroll_manual_scope Case1 / Case2

Notes

This PR is intended to replace the earlier draft line of work around manual dependency experiments with a cleaner v0 proposal.

- add PTO2ScopeMode and Arg.add_dep(task_id) on the existing submit path
- carry pending scope mode through PTO2Runtime so manual scope works in sim/device orchestration without changing the ops table
- reject nested manual scopes, add_dep outside manual scope, foreign manual deps, and explicit deps on alloc_tensors at submit time
- add C++ API coverage and a2a3sim scene tests for the invalid-usage cases

- turn Arg.add_dep(task_id) into ordinary submit-time fanins in the orchestrator
- stamp runtime-created tensors with producer scope and manual-scope depth metadata for conservative manual-local classification
- skip TensorMap lookup and insert only for current-manual-scope-local tensors when explicit deps already provide the ordering
- add a runtime-level C++ unit test that links the real PTO2 runtime sources and verifies explicit dep fanins plus output metadata

- switch the per-q scope to PTO2ScopeMode::MANUAL in the example and profiling scene variant
- replay explicit task-id deps for qk->softmax, softmax->pv, and alloc/softmax/pv->online_update at submit time
- keep the scene-test orchestration in one manual scope so current-scope validation accepts the explicit dep ids

- reject unsupported paged-attention shape tuples in the example and production ST orchestrations before kernel submission
- remove the unsupported head_dim=256 positive case from the production paged-attention golden list
- add a hardware negative test that asserts the runtime reports PTO2 invalid-args status for the invalid head_dim case

- summarize the supported-scope correctness checks that currently pass
- record fresh paged-attention benchmarks for merge-base TMR, current TMR, and current ABG on real hardware
- state clearly that the branch is functionally correct for the supported scope but still not performance-ready

- replace explicit-dep scope scans with exact per-scope epoch checks
- skip redundant creator-retention work when an owner is already explicit
- cache tensor start_offset eagerly so payload init stops recomputing it
- add runtime tests for scope isolation and eager view offset caching

- split the pre-tuning benchmark table from later tuning work
- add a per-optimization ledger with files, functions, and measured effect
- record reverted experiments that failed correctness or did not hold up
- summarize the current tuned snapshot and remaining hot buckets

- keep default paged_attention in auto mode and restore its nested auto scope
- add _manual_scope siblings for non-unroll and unroll under tests/ and examples/
- add the missing example-side paged_attention_unroll auto variant
- extend benchmark_rounds.sh to enumerate the auto/manual paged attention rows

- restore upstream allocator and tensor start-offset behavior
- keep manual-scope metadata and validation paths intact
- remove the runtime unit test that only covered eager offset caching

- restore runtime Arg to the current TaskArgsTpl contract after rebase
- update rebased paged-attention goldens and orchestration symbols to match main
- drop the benchmark_rounds workaround so the branch stays non-intrusive

- Preserve TensorMap publication and lookup for manual-scope modifier tensors in pto2_submit_mixed_task() so repeated updater chains stay correct even when producer-flow deps are explicit
- Refresh docs/manual-scope-v0-design.md with the rebased runtime rule, fresh device validation, and a new 30-round benchmark batch

gemini-code-assist

Code Review

This pull request introduces a lighter 'manual-scope' mode for the a2a3/tensormap_and_ringbuffer service, allowing for explicit task dependency management while maintaining the existing AUTO-mode submit API. The design avoids complex delayed wiring, opting for explicit Arg.add_dep(task_id) annotations within a PTO2_SCOPE(PTO2ScopeMode::MANUAL) block. I have identified a critical issue where PTO2_SCOPE_GUARD(); is used incorrectly, resulting in an immediate destruction of the guard and leaving the loop body unprotected. Additionally, the design document contains a future date that should be corrected.

gemini-code-assist · 2026-04-15T08:06:02Z

                const Tensor &mi_update = alloc_outs.get_ref(2);
-
                for (uint64_t bn = 0; bn < bn_this_batch; bn++) {
+                    PTO2_SCOPE_GUARD();


The statement PTO2_SCOPE_GUARD(); creates a temporary PTO2ScopeGuard object that is immediately destroyed at the semicolon. This means the scope it's intended to guard is empty, and the rest of the loop body executes outside of any PTO2 scope. This is likely not the intended behavior and can lead to subtle bugs.

To ensure the scope is active for the entire for loop block, you should declare a named PTO2ScopeGuard variable.

PTO2_SCOPE_GUARD guard;
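
The lifetime difference the comment describes can be shown with a generic RAII guard. `DemoGuard` below is a hypothetical type written for this illustration and is not `PTO2ScopeGuard`; whether the issue applies to `PTO2_SCOPE_GUARD()` depends on how that macro expands, which is not shown on this page.

```cpp
#include <cassert>

// Hypothetical guard used only to show C++ object lifetimes.
struct DemoGuard {
    explicit DemoGuard(int &depth) : depth_(depth) { ++depth_; }
    ~DemoGuard() {
        assert(depth_ > 0);
        --depth_;
    }
    DemoGuard(const DemoGuard &) = delete;
    DemoGuard &operator=(const DemoGuard &) = delete;

private:
    int &depth_;
};

// Returns the scope depth seen by the statement that follows the guard.
inline int depth_after_temporary() {
    int depth = 0;
    DemoGuard{depth};        // unnamed temporary: destroyed at the end of this statement
    return depth;            // the guard is already gone here
}

inline int depth_after_named() {
    int depth = 0;
    DemoGuard guard{depth};  // named object: lives until the closing brace
    return depth;            // the guard is still active here
}
```

With the unnamed temporary, the code after the guard statement runs at depth 0; with the named variable it runs at depth 1, which is the behavior a loop-body guard is normally meant to provide.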

References

Use RAII guards to manage resources like thread-specific data and device contexts, ensuring cleanup is automatically handled on all function exit paths.

gemini-code-assist · 2026-04-15T08:06:02Z

@@ -0,0 +1,423 @@
+# Manual Scope V0 Design
+
+Date: 2026-04-15


The date 2026-04-15 appears to be a typo, as it is in the future. Please consider correcting it.

- drop benchmark-only and duplicate paged-attention payload trees that are outside the intended manual-scope v0 review surface
- restore the auto paged_attention orchestration file to the upstream baseline so the PR only carries manual-scope-related example changes
- remove the paged-attention validation test added by the heavier branch line so follow-up benchmarking and review focus on the v0 runtime

- rewrite the validation and benchmark sections for the cleaned manual_scope_v0 branch state
- record the fresh 2026-04-15 sim checks, device golden reruns, and 30-round trimmed benchmark table gathered on device 9
- remove stale rebase-specific narrative and old performance claims so the design doc matches the current PR scope and measured results

- explain that raw paged_attention vs paged_attention_unroll numbers are not cross-workload comparable and derive the task-count difference from the current device logs
- restore a historical optimization section for the earlier heavy branch, clearly marked as context rather than current PR scope
- keep the current-branch benchmark table separate from historical notes

uv-xiao · 2026-04-15T13:09:28Z

Latest update on top of the current PR branch:

What changed

TaskOutputTensors now carries a standalone task_id, so submit/alloc results expose a usable task-id handle even for zero-output updater tasks.
The runtime now fully bypasses TensorMap lookup and insert for current-manual-scope-local tensors; boundary tensors still keep the conservative TensorMap path.
The paged-attention manual-scope examples now explicitly chain repeated updater tasks with prev_update_task instead of relying on manual-local TensorMap fallback.
The design note was refreshed to match the implemented semantics, fresh validation, benchmark results, and the new TensorMap profiling breakdown.

Fresh benchmark results

30 rounds, trimmed average, device 9, PTO-ISA d96c8784.

Example	Case	Auto Elapsed (us)	Auto Orch (us)	Manual Elapsed (us)	Manual Orch (us)	Elapsed Delta	Orch Delta
`paged_attention`	`Case1`	77.5	62.4	124.0	108.6	+46.5	+46.2
`paged_attention`	`Case2`	96.2	74.1	146.7	124.6	+50.5	+50.5
`paged_attention_unroll`	`Case1`	1138.4	800.7	1131.3	694.9	-7.1	-105.8
`paged_attention_unroll`	`Case2`	519.7	319.1	511.3	282.6	-8.4	-36.5

Fresh TensorMap profiling

Non-unroll paged_attention, 30 rounds, device-log parsing from a profiling-only rebuild.

Case	Mode	`lookup+dep` (us)	`tensormap_ins` (us)	Lookups	Inserts	Full Orch (us)
`Case1`	AUTO	4.132	1.842	40.0	12.0	194.508
`Case1`	MANUAL	1.944	1.414	16.0	3.0	259.318
`Case2`	AUTO	6.320	2.638	105.0	32.0	210.274
`Case2`	MANUAL	2.598	1.728	41.0	8.0	285.182

What these numbers show:

the manual-local TensorMap bypass is working: lookups dropped about 60%, inserts dropped about 75%, and lookup+dep time dropped about 53% to 59%
the remaining non-unroll regression is not coming from TensorMap lookup / insert anymore
the next optimization target should be explicit-dep construction in orchestration and explicit-dep validation / dedupe in the runtime submit path
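
The percentages quoted above follow directly from the profiling table. A minimal helper to reproduce them (the function name is an assumption for this sketch, not part of the PR):

```cpp
#include <cassert>
#include <cmath>

// Percentage reduction going from the AUTO value to the MANUAL value.
inline double reduction_pct(double auto_value, double manual_value) {
    assert(auto_value > 0.0);
    return (auto_value - manual_value) / auto_value * 100.0;
}
```

Applied to the table: lookups give `reduction_pct(40, 16)` = 60% for Case1 and `reduction_pct(105, 41)`, roughly 61%, for Case2; inserts give 75% in both cases (12 to 3 and 32 to 8); `lookup+dep` time gives roughly 53% for Case1 (4.132 to 1.944) and roughly 59% for Case2 (6.320 to 2.598).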

The design note in docs/manual-scope-v0-design.md now records these results and the corresponding interpretation.

uv-xiao added 15 commits April 15, 2026 10:39

docs: add manual scope v0 design

abf00f1

Update: cache allocator state on fast path

728563b

Update: refresh manual-scope benchmark ledger

acaa94c

Update: record post-tuning hotspot profile

cc979cf

Refactor: drop unrelated manual-scope optimizations

bd35c79

- restore upstream allocator and tensor start-offset behavior
- keep manual-scope metadata and validation paths intact
- remove the runtime unit test that only covered eager offset caching

Fix: adapt rebased manual-scope branch to main

b9deae0

- restore runtime Arg to the current TaskArgsTpl contract after rebase
- update rebased paged-attention goldens and orchestration symbols to match main
- drop the benchmark_rounds workaround so the branch stays non-intrusive

uv-xiao requested a review from poursoul April 15, 2026 08:03

uv-xiao marked this pull request as ready for review April 15, 2026 08:03

gemini-code-assist bot reviewed Apr 15, 2026

uv-xiao added 6 commits April 15, 2026 16:18

Align manual-scope v0 task-id and TensorMap bypass

3d36370

Fix manual paged-attention updater chaining

9072bbc

Update: record manual-scope TensorMap profiling

a6ddebb

uv-xiao wants to merge 21 commits into hw-native-sys:main from uv-xiao:manual_scope_v0
