Add manual-scope v0 to tensormap runtime #568

Open
uv-xiao wants to merge 21 commits into hw-native-sys:main from uv-xiao:manual_scope_v0

@uv-xiao uv-xiao commented Apr 15, 2026

Summary

  • add manual_scope v0 to a2a3/tensormap_and_ringbuffer without introducing a separate manual submit API family
  • keep the same submit calls as AUTO mode and express same-scope ordering through Arg.add_dep(task_id)
  • publish tasks at submit time; do not add delayed wiring, delayed linking, or a manual scope-end replay barrier
  • make submit/alloc results carry a standalone task_id, so zero-output updater chains can be expressed explicitly
  • bypass TensorMap lookup and insert for current-manual-scope-local tensors; keep TensorMap only for boundary tensors
  • add manual-scope validation coverage, paged-attention manual-scope examples, and the design note in docs/manual-scope-v0-design.md

Design Reference

Primary design note:

  • docs/manual-scope-v0-design.md

Constraints Of This Version

This PR intentionally stays narrow.

  • PTO2_SCOPE() remains AUTO by default
  • PTO2_SCOPE(PTO2ScopeMode::MANUAL) enables manual scope
  • submit APIs stay unchanged:
    • pto2_rt_submit_aic_task(...)
    • pto2_rt_submit_aiv_task(...)
    • pto2_rt_submit_task(...)
  • explicit ordering is expressed only through Arg.add_dep(task_id)
  • Arg.add_dep(...) is rejected outside manual scope
  • nested manual scopes are rejected in v0
  • no post-submit dependency API
  • no delayed wiring / delayed linking / delayed publish path
  • no manual-specific scope-end dependency replay
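
The scope rules listed above can be sketched as a toy model. This is not the PTO2 runtime: `ToyRuntime`, `ScopeMode`, `begin_scope`, `end_scope`, and `can_add_dep` are hypothetical names, and the treatment of an AUTO scope opened inside a manual scope is an assumption, since the description does not cover that case.

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Toy model of the v0 scope rules. All names are hypothetical.
enum class ScopeMode { AUTO, MANUAL };

class ToyRuntime {
public:
    // A scope is AUTO unless MANUAL is requested, mirroring PTO2_SCOPE().
    void begin_scope(ScopeMode mode = ScopeMode::AUTO) {
        if (mode == ScopeMode::MANUAL && inside_manual()) {
            throw std::logic_error("nested manual scopes are rejected in v0");
        }
        scopes_.push_back(mode);
    }

    void end_scope() {
        if (scopes_.empty()) throw std::logic_error("no open scope");
        scopes_.pop_back();
    }

    // Assumption: explicit deps are legal only when the innermost scope is MANUAL.
    bool can_add_dep() const {
        return !scopes_.empty() && scopes_.back() == ScopeMode::MANUAL;
    }

private:
    // True if any enclosing scope is MANUAL.
    bool inside_manual() const {
        for (ScopeMode m : scopes_) {
            if (m == ScopeMode::MANUAL) return true;
        }
        return false;
    }

    std::vector<ScopeMode> scopes_;
};
```

In this model, `can_add_dep()` is false at top level and inside an AUTO scope, true inside a MANUAL scope, and a second `begin_scope(ScopeMode::MANUAL)` throws.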

Core Runtime Semantics

Inside manual scope, tasks are still published immediately at submit time, like AUTO mode.

The additional v0 behavior is:

  • read explicit deps from Arg
  • validate that those deps belong to the current manual scope
  • materialize them as ordinary fanins before publish
  • classify tensor args by scope metadata to decide whether TensorMap is still required

This keeps manual scope as a lightweight extension on top of normal PTO2 submit rather than a separate graph-construction pipeline.
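
A minimal sketch of that submit-time flow, under stated assumptions: `ToyOrchestrator`, `ToyTask`, and the per-scope `scope_epoch` tag are illustrative names, not the runtime's types, and the duplicate-dep filtering is an assumption about how fanins might be kept unique. The point it shows is that explicit deps are validated and turned into ordinary fanins before the task is published, with no deferred step.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical task record for illustration only.
struct ToyTask {
    uint64_t id;
    uint64_t scope_epoch;          // 0 means "not produced in a manual scope"
    std::vector<uint64_t> fanins;  // ordinary predecessor list
};

class ToyOrchestrator {
public:
    void enter_manual_scope() { current_epoch_ = ++next_epoch_; }
    void leave_manual_scope() { current_epoch_ = 0; }

    // Validate explicit deps, materialize them as fanins, then publish.
    uint64_t submit(const std::vector<uint64_t>& explicit_deps) {
        ToyTask t{static_cast<uint64_t>(tasks_.size()), current_epoch_, {}};
        for (uint64_t dep : explicit_deps) {
            if (current_epoch_ == 0) {
                throw std::invalid_argument("add_dep outside manual scope");
            }
            if (dep >= tasks_.size() || tasks_[dep].scope_epoch != current_epoch_) {
                throw std::invalid_argument("dep is not from the current manual scope");
            }
            // Assumed dedupe: keep each predecessor once.
            if (std::find(t.fanins.begin(), t.fanins.end(), dep) == t.fanins.end()) {
                t.fanins.push_back(dep);
            }
        }
        tasks_.push_back(t);  // published immediately at submit time
        return t.id;
    }

    const ToyTask& task(uint64_t id) const { return tasks_.at(id); }

private:
    std::vector<ToyTask> tasks_;
    uint64_t current_epoch_ = 0;
    uint64_t next_epoch_ = 0;
};
```

A rejected submit leaves no task behind in this sketch, because validation finishes before the task is appended.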

Manual-local vs Boundary Tensors

Current-manual-scope-local tensors are identified by producer manual-scope depth.

For manual-local tensors:

  • explicit task ids are the only ordering source
  • no TensorMap lookup
  • no TensorMap insert

For boundary tensors:

  • keep creator retention
  • keep normal TensorMap lookup / insert behavior unless manual_dep=true

Boundary tensors include:

  • external tensors
  • AUTO-scope tensors
  • outer-scope tensors
  • outer-manual-scope tensors
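
The classification can be pictured roughly as follows. The struct fields and function names are hypothetical; the PR text only states that producer scope and manual-scope depth metadata are stamped on runtime-created tensors, so the exact comparison below is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-tensor metadata; the real layout is not shown in this PR text.
struct ToyTensorMeta {
    bool runtime_created;            // false for external tensors
    uint32_t producer_manual_depth;  // 0 if produced outside any manual scope
    uint64_t producer_scope_id;      // identifies the producing scope instance
    bool manual_dep;                 // caller opted out of TensorMap tracking
};

// Hypothetical view of the scope that is current at submit time.
struct ToyScopeState {
    uint32_t manual_depth;  // 0 when the current scope is not manual
    uint64_t scope_id;
};

// Manual-local: produced by a task in the manual scope that is current right now.
inline bool is_manual_local(const ToyTensorMeta& t, const ToyScopeState& s) {
    return t.runtime_created && s.manual_depth > 0 &&
           t.producer_manual_depth == s.manual_depth &&
           t.producer_scope_id == s.scope_id;
}

// Boundary tensors keep TensorMap lookup / insert unless manual_dep is set.
inline bool uses_tensormap(const ToyTensorMeta& t, const ToyScopeState& s) {
    if (is_manual_local(t, s)) return false;
    return !t.manual_dep;
}
```

External, AUTO-scope, and outer-scope tensors all fail the `is_manual_local` test in this sketch, so they stay on the TensorMap path.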

Why alloc_tensors(...) Matters

alloc_tensors(...) stays output-only, but v0 treats allocation as a task and returns its task id together with the output tensors.

That allows later manual tasks to depend on allocation explicitly:

auto alloc = alloc_tensors(ci0, ci1);
Arg args;
args.add_dep(alloc.task_id());

Zero-output Updater Chaining

Submit results now carry a standalone task_id, independent of whether any new OUTPUT tensor was materialized.

That lets manual orchestration express repeated updater chains explicitly:

PTO2TaskId prev_update = PTO2TaskId::invalid();

for (...) {
    Arg up = make_update_args(...);
    if (prev_update.is_valid()) {
        up.add_dep(prev_update);
    }
    TaskOutputTensors update_out = pto2_rt_submit_aiv_task(FUNC_ONLINE_UPDATE, up);
    prev_update = update_out.task_id();
}

Fresh Hardware Benchmark

30 rounds, trimmed average, device 9, PTO-ISA d96c8784.

| Example | Case | Auto Elapsed (us) | Auto Orch (us) | Manual Elapsed (us) | Manual Orch (us) | Elapsed Delta (us) | Orch Delta (us) |
|---|---|---|---|---|---|---|---|
| paged_attention | Case1 | 77.5 | 62.4 | 124.0 | 108.6 | +46.5 | +46.2 |
| paged_attention | Case2 | 96.2 | 74.1 | 146.7 | 124.6 | +50.5 | +50.5 |
| paged_attention_unroll | Case1 | 1138.4 | 800.7 | 1131.3 | 694.9 | -7.1 | -105.8 |
| paged_attention_unroll | Case2 | 519.7 | 319.1 | 511.3 | 282.6 | -8.4 | -36.5 |

Interpretation:

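  • non-unroll manual scope is still slower than AUTO by about 46 to 51 us elapsed, and the gap is almost entirely orch time (+46.2 us and +50.5 us)
  • unroll manual scope is now slightly faster than AUTO on both kept cases (7.1 us and 8.4 us elapsed), with a larger orch-time reduction (105.8 us and 36.5 us)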
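
For readers unfamiliar with the "trimmed average" used for these rows, here is one common way to compute it. The number of samples dropped per side is not stated in this PR, so `trim_each_side` is a free parameter, and `trimmed_mean` is a hypothetical helper rather than the benchmark script's actual code.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <stdexcept>
#include <vector>

// Mean of the samples after dropping the `trim_each_side` smallest and largest values.
inline double trimmed_mean(std::vector<double> samples, std::size_t trim_each_side) {
    if (samples.size() <= 2 * trim_each_side) {
        throw std::invalid_argument("not enough samples to trim");
    }
    std::sort(samples.begin(), samples.end());
    auto first = samples.begin() + static_cast<std::ptrdiff_t>(trim_each_side);
    auto last = samples.end() - static_cast<std::ptrdiff_t>(trim_each_side);
    return std::accumulate(first, last, 0.0) / static_cast<double>(last - first);
}
```

For example, `trimmed_mean({100.0, 1.0, 2.0, 3.0, -50.0}, 1)` drops -50 and 100 and averages the remaining three values, giving 2.0.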

Fresh TensorMap Profiling

Non-unroll paged_attention, 30 rounds, device-log parsing from a profiling-only rebuild:

| Case | Mode | lookup+dep (us) | tensormap_ins (us) | Lookups | Inserts | Full Orch (us) |
|---|---|---|---|---|---|---|
| Case1 | AUTO | 4.132 | 1.842 | 40.0 | 12.0 | 194.508 |
| Case1 | MANUAL | 1.944 | 1.414 | 16.0 | 3.0 | 259.318 |
| Case2 | AUTO | 6.320 | 2.638 | 105.0 | 32.0 | 210.274 |
| Case2 | MANUAL | 2.598 | 1.728 | 41.0 | 8.0 | 285.182 |

Interpretation:

  • the manual-local TensorMap bypass is working
  • the remaining non-unroll gap is no longer explained by TensorMap lookup / insert
  • the next optimization target is explicit-dep construction / validation overhead

Testing

Verified on this branch tip:

  • ctest --test-dir tests/ut/cpp/build -R 'test_a2a3_pto2_manual_scope_(api|runtime)' --output-on-failure
  • python -m pytest tests/st/a2a3/tensormap_and_ringbuffer/test_manual_scope_validation.py --platform a2a3sim --device 0 -q
  • real-device golden reruns on a2a3, device 9:
    • paged_attention Case1 / Case2
    • paged_attention_manual_scope Case1 / Case2
    • paged_attention_unroll Case1 / Case2
    • paged_attention_unroll_manual_scope Case1 / Case2

Notes

This PR is intended to replace the earlier draft line of work around manual dependency experiments with a cleaner v0 proposal.

uv-xiao added 15 commits April 15, 2026 10:39
- add PTO2ScopeMode and Arg.add_dep(task_id) on the existing submit path
- carry pending scope mode through PTO2Runtime so manual scope works in sim/device orchestration without changing the ops table
- reject nested manual scopes, add_dep outside manual scope, foreign manual deps, and explicit deps on alloc_tensors at submit time
- add C++ API coverage and a2a3sim scene tests for the invalid-usage cases
- turn Arg.add_dep(task_id) into ordinary submit-time fanins in the orchestrator
- stamp runtime-created tensors with producer scope and manual-scope depth metadata for conservative manual-local classification
- skip TensorMap lookup and insert only for current-manual-scope-local tensors when explicit deps already provide the ordering
- add a runtime-level C++ unit test that links the real PTO2 runtime sources and verifies explicit dep fanins plus output metadata
- switch the per-q scope to PTO2ScopeMode::MANUAL in the example and
  profiling scene variant
- replay explicit task-id deps for qk->softmax, softmax->pv, and
  alloc/softmax/pv->online_update at submit time
- keep the scene-test orchestration in one manual scope so current-scope
  validation accepts the explicit dep ids
- reject unsupported paged-attention shape tuples in the example and
  production ST orchestrations before kernel submission
- remove the unsupported head_dim=256 positive case from the
  production paged-attention golden list
- add a hardware negative test that asserts the runtime reports
  PTO2 invalid-args status for the invalid head_dim case
- summarize the supported-scope correctness checks that currently pass
- record fresh paged-attention benchmarks for merge-base TMR,
  current TMR, and current ABG on real hardware
- state clearly that the branch is functionally correct for the
  supported scope but still not performance-ready
- replace explicit-dep scope scans with exact per-scope epoch checks
- skip redundant creator-retention work when an owner is already explicit
- cache tensor start_offset eagerly so payload init stops recomputing it
- add runtime tests for scope isolation and eager view offset caching
- split the pre-tuning benchmark table from later tuning work
- add a per-optimization ledger with files, functions, and measured effect
- record reverted experiments that failed correctness or did not hold up
- summarize the current tuned snapshot and remaining hot buckets
- keep default paged_attention in auto mode and restore its nested auto scope
- add _manual_scope siblings for non-unroll and unroll under tests/ and examples/
- add the missing example-side paged_attention_unroll auto variant
- extend benchmark_rounds.sh to enumerate the auto/manual paged attention rows
- restore upstream allocator and tensor start-offset behavior
- keep manual-scope metadata and validation paths intact
- remove the runtime unit test that only covered eager offset caching
- restore runtime Arg to the current TaskArgsTpl contract after rebase
- update rebased paged-attention goldens and orchestration symbols to match main
- drop the benchmark_rounds workaround so the branch stays non-intrusive
- Preserve TensorMap publication and lookup for manual-scope modifier
  tensors in pto2_submit_mixed_task() so repeated updater chains stay
  correct even when producer-flow deps are explicit
- Refresh docs/manual-scope-v0-design.md with the rebased runtime
  rule, fresh device validation, and a new 30-round benchmark batch
@uv-xiao uv-xiao requested a review from poursoul April 15, 2026 08:03
@uv-xiao uv-xiao marked this pull request as ready for review April 15, 2026 08:03

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a lighter 'manual-scope' mode for the a2a3/tensormap_and_ringbuffer service, allowing for explicit task dependency management while maintaining the existing AUTO-mode submit API. The design avoids complex delayed wiring, opting for explicit Arg.add_dep(task_id) annotations within a PTO2_SCOPE(PTO2ScopeMode::MANUAL) block. I have identified a critical issue where PTO2_SCOPE_GUARD(); is used incorrectly, resulting in an immediate destruction of the guard and leaving the loop body unprotected. Additionally, the design document contains a future date that should be corrected.

const Tensor &mi_update = alloc_outs.get_ref(2);

for (uint64_t bn = 0; bn < bn_this_batch; bn++) {
PTO2_SCOPE_GUARD();

critical

The statement PTO2_SCOPE_GUARD(); creates a temporary PTO2ScopeGuard object that is immediately destroyed at the semicolon. This means the scope it's intended to guard is empty, and the rest of the loop body executes outside of any PTO2 scope. This is likely not the intended behavior and can lead to subtle bugs.

To ensure the scope is active for the entire for loop block, you should declare a named PTO2ScopeGuard variable.

                    PTO2_SCOPE_GUARD guard;
References
  1. Use RAII guards to manage resources like thread-specific data and device contexts, ensuring cleanup is automatically handled on all function exit paths.
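
The C++ behavior behind this comment can be reproduced with a stand-in type. `ToyGuard` below is hypothetical and logs its own lifetime; whether `PTO2_SCOPE_GUARD` is a plain type, as the comment assumes, or a macro that already declares a named guard variable is not visible from this page, and in the macro case the concern would not apply.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical RAII guard that records when it is entered and exited.
struct ToyGuard {
    explicit ToyGuard(std::vector<std::string>& log) : log_(log) { log_.push_back("enter"); }
    ~ToyGuard() { log_.push_back("exit"); }
    ToyGuard(const ToyGuard&) = delete;
    ToyGuard& operator=(const ToyGuard&) = delete;

private:
    std::vector<std::string>& log_;
};

// Unnamed temporary: destroyed at the end of the statement, before the body runs.
inline std::vector<std::string> run_with_temporary() {
    std::vector<std::string> log;
    {
        ToyGuard{log};
        log.push_back("body");
    }
    return log;
}

// Named guard: lives until the closing brace, so the body runs inside the scope.
inline std::vector<std::string> run_with_named() {
    std::vector<std::string> log;
    {
        ToyGuard guard{log};
        log.push_back("body");
    }
    return log;
}
```

`run_with_temporary()` yields `enter, exit, body`, while `run_with_named()` yields `enter, body, exit`. Brace initialization is used for the temporary because `ToyGuard(log);` would instead be parsed as a declaration of a variable named `log`.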

@@ -0,0 +1,423 @@
# Manual Scope V0 Design

Date: 2026-04-15

medium

The date 2026-04-15 appears to be a typo, as it is in the future. Please consider correcting it.

uv-xiao added 6 commits April 15, 2026 16:18
- drop benchmark-only and duplicate paged-attention payload trees that are
  outside the intended manual-scope v0 review surface
- restore the auto paged_attention orchestration file to the upstream
  baseline so the PR only carries manual-scope-related example changes
- remove the paged-attention validation test added by the heavier branch
  line so follow-up benchmarking and review focus on the v0 runtime
- rewrite the validation and benchmark sections for the cleaned
  manual_scope_v0 branch state
- record the fresh 2026-04-15 sim checks, device golden reruns, and
  30-round trimmed benchmark table gathered on device 9
- remove stale rebase-specific narrative and old performance claims so
  the design doc matches the current PR scope and measured results
- explain that raw paged_attention vs paged_attention_unroll numbers are
  not cross-workload comparable and derive the task-count difference from
  the current device logs
- restore a historical optimization section for the earlier heavy branch,
  clearly marked as context rather than current PR scope
- keep the current-branch benchmark table separate from historical notes

uv-xiao commented Apr 15, 2026

Latest update on top of the current PR branch:

What changed

  • TaskOutputTensors now carries a standalone task_id, so submit/alloc results expose a usable task-id handle even for zero-output updater tasks.
  • The runtime now fully bypasses TensorMap lookup and insert for current-manual-scope-local tensors; boundary tensors still keep the conservative TensorMap path.
  • The paged-attention manual-scope examples now explicitly chain repeated updater tasks with prev_update_task instead of relying on manual-local TensorMap fallback.
  • The design note was refreshed to match the implemented semantics, fresh validation, benchmark results, and the new TensorMap profiling breakdown.

Fresh benchmark results

30 rounds, trimmed average, device 9, PTO-ISA d96c8784.

| Example | Case | Auto Elapsed (us) | Auto Orch (us) | Manual Elapsed (us) | Manual Orch (us) | Elapsed Delta (us) | Orch Delta (us) |
|---|---|---|---|---|---|---|---|
| paged_attention | Case1 | 77.5 | 62.4 | 124.0 | 108.6 | +46.5 | +46.2 |
| paged_attention | Case2 | 96.2 | 74.1 | 146.7 | 124.6 | +50.5 | +50.5 |
| paged_attention_unroll | Case1 | 1138.4 | 800.7 | 1131.3 | 694.9 | -7.1 | -105.8 |
| paged_attention_unroll | Case2 | 519.7 | 319.1 | 511.3 | 282.6 | -8.4 | -36.5 |

Fresh TensorMap profiling

Non-unroll paged_attention, 30 rounds, device-log parsing from a profiling-only rebuild.

| Case | Mode | lookup+dep (us) | tensormap_ins (us) | Lookups | Inserts | Full Orch (us) |
|---|---|---|---|---|---|---|
| Case1 | AUTO | 4.132 | 1.842 | 40.0 | 12.0 | 194.508 |
| Case1 | MANUAL | 1.944 | 1.414 | 16.0 | 3.0 | 259.318 |
| Case2 | AUTO | 6.320 | 2.638 | 105.0 | 32.0 | 210.274 |
| Case2 | MANUAL | 2.598 | 1.728 | 41.0 | 8.0 | 285.182 |

What these numbers show:

  • the manual-local TensorMap bypass is working: lookups dropped about 60%, inserts dropped about 75%, and lookup+dep time dropped about 53% to 59%
  • the remaining non-unroll regression is not coming from TensorMap lookup / insert anymore
  • the next optimization target should be explicit-dep construction in orchestration and explicit-dep validation / dedupe in the runtime submit path
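
The percentages quoted in the first bullet follow directly from the profiling table. A small helper makes the arithmetic explicit; `percent_drop` is a name introduced here for illustration.

```cpp
#include <cassert>
#include <cmath>

// Relative reduction from `before` to `after`, expressed in percent.
inline double percent_drop(double before, double after) {
    return (before - after) / before * 100.0;
}
```

With the table values, `percent_drop(40.0, 16.0)` is 60 (Case1 lookups), `percent_drop(12.0, 3.0)` and `percent_drop(32.0, 8.0)` are both 75 (inserts), and the lookup+dep times give roughly 53 for Case1 (4.132 to 1.944) and roughly 59 for Case2 (6.320 to 2.598).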

The design note in docs/manual-scope-v0-design.md now records these results and the corresponding interpretation.
