
Feat/per ring runtime env #1029 #1099

Merged
ChaoZheng109 merged 1 commit into hw-native-sys:main from TaoZQY:feat/per-ring-runtime-env
Jun 23, 2026

Conversation

@TaoZQY TaoZQY commented Jun 22, 2026

Contributor

Summary

Add per-ring runtime sizing for the tensormap_and_ringbuffer runtime.

This keeps the scalar runtime_env fields introduced by #1042 and adds per-ring array fields so a single task can size ring0..ring3 independently:

  • ring_task_windows[4]
  • ring_heaps[4]
  • ring_dep_pools[4]

This addresses #1029, where deep kernels have very different resource pressure across scope-depth rings. A single scalar value can either be too small for deeper rings or too large for shallow rings, causing unnecessary shared-memory growth.

Behavior

Effective sizing is resolved per resource and per ring:

per-ring CallConfig value
  > scalar CallConfig value
  > per-ring PTO2_RING_* env value
  > scalar PTO2_RING_* env value
  > compile-time default
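
The precedence chain above can be sketched in Python. This is an illustration only, not the runtime's resolve_ring_config: the function name, the argument layout, and the convention that 0 (or None) means "not set" are assumptions made for the example.

```python
RING_COUNT = 4  # ring0..ring3


def resolve_ring_values(call_per_ring, call_scalar, env_per_ring, env_scalar, default):
    """Return the effective value for each ring.

    Assumption: 0 (or None) means "not set" at every level.
    """
    effective = []
    for r in range(RING_COUNT):
        candidates = (
            call_per_ring[r] if call_per_ring else 0,  # per-ring CallConfig value
            call_scalar,                               # scalar CallConfig value
            env_per_ring[r] if env_per_ring else 0,    # per-ring env value
            env_scalar,                                # scalar env value
        )
        # The first value that is set wins; otherwise use the compile-time default.
        effective.append(next((c for c in candidates if c), default))
    return effective


# Only ring3 is overridden per call; the environment supplies a scalar baseline.
result = resolve_ring_values([0, 0, 0, 524288], 0, None, 131072, 65536)
print(result)
```

Resolution runs once per resource (task window, heap, dep pool), so a per-ring override on one resource does not affect the others.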

Environment variables now support either scalar values or four comma-separated per-ring values:

PTO2_RING_TASK_WINDOW=8192,16384,131072,524288
PTO2_RING_HEAP=134217728,268435456,402653184,536870912
PTO2_RING_DEP_POOL=4096,8192,16384,32768

PTO2_RING_HEAP values are integer byte counts; size suffixes such as K/M/G/T are not supported.
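
A minimal sketch of the accepted environment-variable grammar, assuming a hypothetical helper named parse_ring_env. It is not the C++ parser from runtime_maker.cpp; returning None for invalid input is a choice made for this example only.

```python
RING_COUNT = 4


def parse_ring_env(text):
    """Parse a PTO2_RING_* style value into a 4-entry list, or None if invalid."""
    tokens = [t.strip() for t in text.split(",")]
    if len(tokens) not in (1, RING_COUNT):
        return None  # only one value or exactly four values are accepted
    values = []
    for tok in tokens:
        # Plain ASCII digits only: rejects signs, size suffixes such as "128M",
        # and empty tokens.
        if not (tok.isascii() and tok.isdigit()):
            return None
        values.append(int(tok))
    # A single value is broadcast to every ring.
    return values * RING_COUNT if len(values) == 1 else values


print(parse_ring_env("131072"))
print(parse_ring_env("8192,16384,131072,524288"))
print(parse_ring_env("128M"))
```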

Usage

Users can configure per-ring sizing through environment variables or through CallConfig.runtime_env.

Environment variables

Scalar values are still supported and are broadcast to all rings:

PTO2_RING_TASK_WINDOW=131072
PTO2_RING_HEAP=4294967296
PTO2_RING_DEP_POOL=262144

Per-ring values use exactly four comma-separated entries, indexed by ring0..ring3:

PTO2_RING_TASK_WINDOW=8192,16384,131072,524288
PTO2_RING_HEAP=134217728,268435456,402653184,536870912
PTO2_RING_DEP_POOL=4096,8192,16384,32768

CallConfig

cfg.runtime_env.ring_task_windows = [8192, 16384, 131072, 524288]
cfg.runtime_env.ring_heaps = [
    128 * 1024 * 1024,
    256 * 1024 * 1024,
    384 * 1024 * 1024,
    512 * 1024 * 1024,
]
cfg.runtime_env.ring_dep_pools = [4096, 8192, 16384, 32768]

Ring entries map to scope depth as follows:

ring0 -> scope depth 0
ring1 -> scope depth 1
ring2 -> scope depth 2
ring3 -> scope depth >= 3
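
The mapping above amounts to clamping the scope depth to the last ring. A small sketch, with a hypothetical function name:

```python
RING_COUNT = 4


def ring_for_scope_depth(depth):
    """Map a scope depth to its ring index; depths >= 3 share ring3."""
    if depth < 0:
        raise ValueError("scope depth must be non-negative")
    return min(depth, RING_COUNT - 1)


print([ring_for_scope_depth(d) for d in range(6)])
```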

Verification

Run a T&R scene test with --enable-scope-stats:

PTO2_RING_TASK_WINDOW=8192,16384,131072,524288 \
PTO2_RING_HEAP=134217728,268435456,402653184,536870912 \
PTO2_RING_DEP_POOL=4096,8192,16384,32768 \
python tests/st/<case>/test_<name>.py -p a2a3 -d 0 --enable-scope-stats

Then inspect the first line of scope_stats.jsonl. The metadata contains the effective capacities:

task_window_max = [...]
heap_max        = [...]
dep_pool_max    = [...]

These arrays are indexed by ring, so they can confirm the per-ring configuration took effect.
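
The check can be scripted. The sketch below assumes the three arrays are top-level keys of the JSON object on the first line; the exact metadata layout is not shown in this PR, so adjust the lookup if the arrays are nested. The helper name and the sample line are made up for the example.

```python
import json


def check_scope_stats_header(first_line, expected):
    """Return {key: (expected, actual)} for every capacity array that differs.

    Assumption: the arrays are top-level keys of the first JSON object.
    """
    meta = json.loads(first_line)
    return {
        key: (want, meta.get(key))
        for key, want in expected.items()
        if meta.get(key) != want
    }


expected = {
    "task_window_max": [8192, 16384, 131072, 524288],
    "heap_max": [134217728, 268435456, 402653184, 536870912],
    "dep_pool_max": [4096, 8192, 16384, 32768],
}
# Synthetic stand-in for the first line of scope_stats.jsonl.
sample_line = json.dumps(expected)
mismatches = check_scope_stats_header(sample_line, expected)
print(mismatches or "per-ring configuration took effect")
```

In a real run, the first line would come from `open("scope_stats.jsonl").readline()` instead of the synthetic sample.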

Validated with qwen3_14b_decode and --enable-scope-stats:

task_window_max = [8192, 16384, 131072, 524288]
heap_max        = [134217728, 268435456, 402653184, 536870912]
dep_pool_max    = [4096, 8192, 16384, 32768]
fatal           = false
dropped         = 0

Documentation

The new usage is documented in:

  • src/a2a3/runtime/tensormap_and_ringbuffer/docs/MULTI_RING.md
  • src/a5/runtime/tensormap_and_ringbuffer/docs/MULTI_RING.md
  • docs/dfx/scope-stats.md

coderabbitai Bot commented Jun 22, 2026

Walkthrough

This PR replaces the three scalar RuntimeEnv ring-sizing overrides with per-ring arrays (ring_task_windows, ring_heaps, ring_dep_pools of length 4), propagating the change through the C ABI, runtime layout structs, host orchestration initialization, shared-memory handle, AicpuExecutor, Python bindings, and documentation for both a2a3 and a5 targets. Environment variables now accept a single scalar (broadcast to all rings) or four comma-separated per-ring values, and PTO2_RING_HEAP gains K/M/G/T suffix support. The scalar fields are retained for backward compatibility.

Changes

Per-ring ring-buffer configuration

Layer: RuntimeEnv per-ring data contract and wire layout
File(s): src/common/task_interface/call_config.h
Summary: Adds RUNTIME_ENV_RING_COUNT=4, three per-ring array fields, per_ring_any(), updates any() and validate() with revised constraint rules, and updates static_asserts for the enlarged 15×uint64_t wire layout.

Layer: C ABI and worker function-pointer propagation
File(s): src/common/worker/pto_runtime_c_api.h, src/common/worker/chip_worker.h, src/common/worker/chip_worker.cpp, src/common/platform/onboard/host/c_api_shared.cpp, src/common/platform/sim/host/c_api_shared.cpp
Summary: Adds const uint64_t* ring array parameters to run_prepared and bind_callable_to_runtime_impl, updates the RunPreparedFn typedef, and makes ChipWorker::run memcpy the per-ring arrays from RuntimeEnv before forwarding.

Layer: Runtime layout struct: scalar→per-ring array fields
File(s): src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.h, src/a2a3/.../pto_orchestrator.h, src/a2a3/.../pto_scheduler.h, src/a2a3/.../pto_shared_memory.h, src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.h, src/a5/.../pto_orchestrator.h, src/a5/.../pto_scheduler.h, src/a5/.../pto_shared_memory.h
Summary: Changes all cached sizing fields in PTO2RuntimeArenaLayout, PTO2OrchestratorLayout, and PTO2SchedulerLayout from scalars to [PTO2_MAX_RING_DEPTH] arrays, updates reserve_layout and init_data_from_layout signatures, and declares PTO2SharedMemoryHandle::init_per_ring.

Layer: Host runtime_maker: env parsing and bind_callable ABI
File(s): src/a2a3/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp, src/a5/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp, src/a2a3/runtime/host_build_graph/host/runtime_maker.cpp, src/a5/runtime/host_build_graph/host/runtime_maker.cpp
Summary: Adds is_power_of_2_u64, parse_uint_token (with size-suffix), apply_env_ring_values (scalar or CSV), and resolve_ring_config that merges defaults, env, and call-site overrides into effective per-ring arrays; extends bind_callable_to_runtime_impl C ABI with ring pointer parameters and rewrites the arena reservation call.

Layer: Runtime arena reservation and init overloads
File(s): src/a2a3/runtime/tensormap_and_ringbuffer/runtime/shared/pto_runtime2_init.cpp, src/a5/runtime/tensormap_and_ringbuffer/runtime/shared/pto_runtime2_init.cpp
Summary: Adds per-ring overloads for scheduler/orchestrator reserve_layout and init_data_from_layout (using per-ring dep_pool_capacities, summing heap_sizes across rings), and top-level runtime_reserve_layout/runtime_init_data_from_layout with scalar-to-array wrappers.

Layer: PTO2SharedMemoryHandle::init_per_ring implementation
File(s): src/a2a3/runtime/tensormap_and_ringbuffer/runtime/shared/pto_shared_memory.cpp, src/a5/runtime/tensormap_and_ringbuffer/runtime/shared/pto_shared_memory.cpp
Summary: init() now builds per-ring arrays and delegates to new init_per_ring(), which validates size against calculate_size_per_ring() and calls setup_pointers_per_ring/init_header_per_ring.

Layer: AicpuExecutor prebuilt-arena init switched to init_per_ring
File(s): src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp, src/a5/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp
Summary: Replaces single-ring SM sizing and sm_handle->init() with calculate_size_per_ring from rt->prebuilt_layout, per-ring logging, and sm_handle->init_per_ring(); updates profiling to use dep_pool_capacities[r].

Layer: Python bindings: per-ring properties, mailbox format, scene-test config
File(s): python/bindings/task_interface.cpp, python/simpler/worker.py, simpler_setup/scene_test.py
Summary: Exports RUNTIME_ENV_RING_COUNT; adds ring_task_windows/ring_heaps/ring_dep_pools as validated read/write properties on RuntimeEnv; updates __repr__; expands _CFG_FMT from 3 to 15 uint64 fields; updates _read_config_from_mailbox to unpack 4-entry arrays; adds plural key support in SceneTestCase._build_config.

Layer: C++ and Python unit tests
File(s): tests/ut/cpp/a2a3/test_shared_memory.cpp, tests/ut/cpp/a5/test_shared_memory.cpp, tests/ut/cpp/types/test_call_config.cpp, tests/ut/py/test_chip_worker.py
Summary: Adds InitPerRingWritesHeaderValues and PerRingConfigInitializesRuntimeComponents C++ tests; updates wire-layout asserts, adds RejectsPerRingRuntimeEnvValues; updates Python tests to per-ring list fields and adds mailbox round-trip and validation rejection tests.

Layer: Documentation, diagnostic strings, and example updates
File(s): src/a2a3/.../docs/MULTI_RING.md, src/a5/.../docs/MULTI_RING.md, docs/dfx/scope-stats.md, src/a2a3/.../pto_ring_buffer.h, src/a5/.../pto_ring_buffer.h, examples/workers/l2/per_task_runtime_env/*, examples/a2a3/...
Summary: Updates MULTI_RING.md with per-ring precedence table, CSV env format, K/M/G/T suffix docs, and scope-stats JSONL verification pointer; removes "power-of-2" from ring_heap documentation and deadlock error messages throughout.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

  • hw-native-sys/simpler#846: This PR modifies the same AicpuExecutor::run prebuilt-arena fast path (switching from single-ring sm_handle->init() to init_per_ring()) that #846 originally refactored for the TRB prebuilt-runtime-arena boot flow.
  • hw-native-sys/simpler#911: This PR extends bind_callable_to_runtime_impl's ABI with per-ring array parameters, directly building on #911's rename of that binding entry point.
  • hw-native-sys/simpler#1042: This PR extends the same per-task CallConfig.runtime_env ring-sizing plumbing that #1042 established for scalar ring_task_window/ring_heap/ring_dep_pool into full per-ring array overrides with updated mailbox ABI and bindings.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 26.80% which is insufficient. The required threshold is 80.00%. Resolution: Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Title check (✅ Passed): The title 'Feat/per ring runtime env #1029' clearly indicates the main feature being added (per-ring runtime environment configuration) and references the issue number.
  • Linked Issues check (✅ Passed): Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): Check skipped because no linked issues were found for this pull request.
  • Description check (✅ Passed): The pull request description clearly relates to and documents the changeset, providing comprehensive information about per-ring runtime sizing capabilities, behavior, usage examples, and verification methods.

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces support for independent, per-ring runtime environment overrides (task window, heap size, and dependency pool capacity) across the four scope-depth rings, updating the C++ runtime, Python bindings, test infrastructure, and documentation, while also relaxing the power-of-two constraint on heap sizes. The review feedback highlights a security vulnerability in both a2a3 and a5 runtime_maker.cpp where negative inputs to std::strtoull can wrap around and bypass validation, and suggests replacing angle-bracket placeholders in markdown bash snippets with standard shell variables to prevent syntax errors.

Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp Outdated
Comment thread src/a5/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp Outdated
Comment thread docs/dfx/scope-stats.md Outdated

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp`:
- Around line 495-500: When init_per_ring() fails in the AicpuExecutor, the
runtime_init_ready_ flag is released while rt still points to a partially
initialized runtime object with a zeroed sm_handle. This bypasses the
scheduler-side rt == nullptr guard and allows dispatch to run against invalid
state. Before storing true to runtime_init_ready_ via the store() call with
memory_order_release, clear the rt pointer to nullptr so that scheduler threads
will still see rt as null and avoid dispatching against the broken runtime.

In `@src/a2a3/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp`:
- Around line 455-462: The resolve_ring_config() call happens after tensors have
been allocated and appended to runtime->tensor_pairs_, but if it fails, those
allocated device buffers are not cleaned up before returning -1. Move the
resolve_ring_config() call to execute before the tensor allocation loop that
populates runtime->tensor_pairs_, so that if configuration resolution fails, no
device buffers will have been allocated yet. This ensures the function exits
cleanly without leaving orphaned resources when resolve_ring_config() returns
false.
- Around line 478-481: The accumulation of eff_heap_sizes into total_heap_size
in the loop (iterating from 0 to PTO2_MAX_RING_DEPTH) can overflow since
eff_heap_sizes values can be user-provided up to uint64_t maximum. Add overflow
detection before each addition in the loop to guard against this, either by
checking if adding the next eff_heap_sizes[r] value would exceed the maximum
uint64_t value before performing the addition, or by using a safe addition
function that detects and handles overflow. This ensures total_heap_size remains
accurate when passed to setup_static_arena() and matches the heap partitioning
performed by later code.

In
`@src/a2a3/runtime/tensormap_and_ringbuffer/runtime/shared/pto_runtime2_init.cpp`:
- Around line 273-291: The heap size accumulation in the loop over
PTO2_MAX_RING_DEPTH does not check for integer overflow when summing heap_sizes
values into gm_heap_size and heap_offset. If the cumulative heap sizes exceed
the maximum value of uint64_t, gm_heap_size will wrap around and heap_offset
calculations will create overlapping or invalid ring heap bases. Create a
checked-sum helper function that detects overflow conditions and returns false
or nullptr when overflow would occur. Apply this checked-sum validation before
any heap_offset calculations and before calling
orch->rings[r].task_allocator.init(), ensuring the function exits early if
overflow is detected. This fix should be applied to all locations where similar
heap size accumulation occurs, including the additional location referenced in
the comment.

In `@src/a5/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp`:
- Around line 455-462: The call to resolve_ring_config() happens after device
buffers have already been allocated and appended to runtime->tensor_pairs_ in an
earlier loop, but if resolve_ring_config() fails and returns false, the function
exits without cleaning up those allocated resources. Move the
resolve_ring_config() call to execute before the tensor staging loop that
allocates device buffers and appends to runtime->tensor_pairs_, so that
configuration validation occurs first and prevents resource leaks when
resolution fails.
- Around line 478-481: The accumulation loop where total_heap_size is built by
summing eff_heap_sizes[r] values can overflow since eff_heap_sizes[r] is
user-provided and can reach uint64_t::max. Add overflow protection before each
addition operation in the loop (where r iterates from 0 to PTO2_MAX_RING_DEPTH)
to ensure that adding eff_heap_sizes[r] to total_heap_size will not wrap around.
If overflow would occur, handle it by either capping the total_heap_size to
uint64_t::max or rejecting the configuration to prevent passing an undersized
heap size to setup_static_arena() that would mismatch the per-ring partitioning
done later.

In
`@src/a5/runtime/tensormap_and_ringbuffer/runtime/shared/pto_runtime2_init.cpp`:
- Around line 265-283: The code accumulates heap sizes into gm_heap_size and
heap_offset without checking for integer overflow, which can cause uint64_t
wraps and produce overlapping or invalid ring heap bases. Create a checked-sum
helper function that detects overflow when adding values to a running total,
then use this helper to validate that the sum of all heap_sizes values doesn't
overflow before the loop that accumulates gm_heap_size and before initializing
the task allocators via orch->rings[r].task_allocator.init. Return false or
nullptr on overflow detection instead of proceeding with initialization. Apply
the same overflow-checking pattern to the similar code section mentioned at
lines 407-415.
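
The checked-sum helper requested in these comments can be sketched as follows. Python integers do not wrap, so the 64-bit limit is emulated explicitly; the function name and the None return on overflow are assumptions for illustration, not the helper that landed in the PR.

```python
U64_MAX = 2**64 - 1


def checked_heap_sum(heap_sizes):
    """Sum per-ring heap sizes, returning None if a uint64_t sum would wrap."""
    total = 0
    for size in heap_sizes:
        if not 0 <= size <= U64_MAX:
            return None  # not representable as uint64_t
        # Test before adding: total + size > U64_MAX  <=>  size > U64_MAX - total
        if size > U64_MAX - total:
            return None
        total += size
    return total


print(checked_heap_sum([134217728, 268435456, 402653184, 536870912]))
print(checked_heap_sum([U64_MAX, 1, 0, 0]))
```

The comparison is written as `size > U64_MAX - total` so that the same expression carries over to C++ without ever computing a wrapped intermediate.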
ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between 26b7b15 and 7a09062.

📒 Files selected for processing (39)
  • docs/dfx/scope-stats.md
  • examples/a2a3/tensormap_and_ringbuffer/paged_attention_ringbuffer/test_paged_attention_ringbuffer.py
  • examples/workers/l2/per_task_runtime_env/README.md
  • examples/workers/l2/per_task_runtime_env/main.py
  • python/bindings/task_interface.cpp
  • python/simpler/worker.py
  • simpler_setup/scene_test.py
  • src/a2a3/runtime/host_build_graph/host/runtime_maker.cpp
  • src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp
  • src/a2a3/runtime/tensormap_and_ringbuffer/docs/MULTI_RING.md
  • src/a2a3/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.h
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_ring_buffer.h
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.h
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_shared_memory.h
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/shared/pto_runtime2_init.cpp
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/shared/pto_shared_memory.cpp
  • src/a5/runtime/host_build_graph/host/runtime_maker.cpp
  • src/a5/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp
  • src/a5/runtime/tensormap_and_ringbuffer/docs/MULTI_RING.md
  • src/a5/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.h
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_ring_buffer.h
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.h
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_shared_memory.h
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/shared/pto_runtime2_init.cpp
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/shared/pto_shared_memory.cpp
  • src/common/platform/onboard/host/c_api_shared.cpp
  • src/common/platform/sim/host/c_api_shared.cpp
  • src/common/task_interface/call_config.h
  • src/common/worker/chip_worker.cpp
  • src/common/worker/chip_worker.h
  • src/common/worker/pto_runtime_c_api.h
  • tests/ut/cpp/a2a3/test_shared_memory.cpp
  • tests/ut/cpp/a5/test_shared_memory.cpp
  • tests/ut/cpp/types/test_call_config.cpp
  • tests/ut/py/test_chip_worker.py

Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp
Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp Outdated
Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp
Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/runtime/shared/pto_runtime2_init.cpp Outdated
Comment thread src/a5/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp Outdated
Comment thread src/a5/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp
Comment thread src/a5/runtime/tensormap_and_ringbuffer/runtime/shared/pto_runtime2_init.cpp Outdated
@TaoZQY TaoZQY force-pushed the feat/per-ring-runtime-env branch 2 times, most recently from 46c0d07 to 864d8f4 Compare June 22, 2026 07:53

TaoZQY commented Jun 22, 2026

Contributor Author

Addressed the AI review feedback in 864d8f42 and kept the PR as a single squashed commit.

Changes made:

  • Reject negative integer env values before strtoull in both a2a3 and a5.
  • Use copy-paste-safe shell variables in the scope-stats docs snippet.
  • Move resolve_ring_config() before tensor staging in both host runtime makers.
  • Add uint64_t overflow guards for host-side per-ring heap sums.
  • Add checked heap-sum validation in shared runtime init before deriving ring heap bases.
  • Clear rt before releasing scheduler threads on init_per_ring() failure.
  • Add C++ unit coverage for overflowing per-ring heap sums.

Validation:

  • pre-commit run --from-ref upstream/main --to-ref HEAD: passed
  • git diff --check: passed
  • cmake --build /tmp/simpler-ut-cpp-build -j$(nproc): passed
  • ctest --test-dir /tmp/simpler-ut-cpp-build --output-on-failure: 44/44 passed

I did not address CodeRabbit's generic docstring coverage warning because it is not part of this repository's enforced checks and the touched implementation is primarily C++ runtime code; adding unrelated Python docstrings would be noise for this PR.

@TaoZQY TaoZQY force-pushed the feat/per-ring-runtime-env branch from 864d8f4 to 4744287 Compare June 22, 2026 08:32

TaoZQY commented Jun 22, 2026

Contributor Author

Updated the PR branch to 47442870 after investigating the failing st-onboard-a2a3 job.

The previous failure was the first examples/workers/l2/per_task_runtime_env run on dev=11 failing before runtime setup at halMemCtl rc=13 (EACCES) while retrieving AICore register addresses. I hardened the existing a2a3 onboard retry path so this transient driver-side serialization window now waits up to ~2s for EACCES only; other HAL errors still fail immediately.

Local validation:

  • pre-commit run --from-ref upstream/main --to-ref HEAD: passed
  • cmake --build /tmp/simpler-ut-cpp-build -j$(nproc): passed
  • ctest --test-dir /tmp/simpler-ut-cpp-build --output-on-failure: 44/44 passed
  • python simpler_setup/build_runtimes.py --platforms a2a3 --pto-isa-commit ddafa8da9c760ecd13fe9fe2833d6ee55fb20bd8: passed

The new workflow run is currently waiting for maintainer approval before jobs are allowed to start: https://github.com/hw-native-sys/simpler/actions/runs/27939901712

@TaoZQY TaoZQY force-pushed the feat/per-ring-runtime-env branch from 4744287 to 3a02e7c Compare June 22, 2026 08:50
@ChaoZheng109

Collaborator

Review — per-ring runtime sizing (#1029)

Solid PR. The device/SM layer was already per-ring; this correctly widens the host path and threads eff_*[4] end-to-end. a2a3/a5 parity is clean, the wire ABI is updated in lockstep (static_assert + Python _CFG_FMT + round-trip UT), and dropping the pow2 constraint on heap is correct — the heap is a comparison/subtraction ring allocator (try_bump_heap), no masking, so non-pow2 sizes are safe (only task_window needs pow2, and that's kept). Good test coverage.
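
The power-of-two point can be illustrated with a short sketch. It assumes the task window maps a task index to a slot with a bit mask, which the review implies but which is not shown in this thread; the helper names are hypothetical.

```python
def is_power_of_two(n):
    """True when n has exactly one bit set."""
    return n > 0 and (n & (n - 1)) == 0


def mask_matches_modulo(size):
    """Check whether `i & (size - 1)` agrees with `i % size` over two full wraps."""
    return all((i & (size - 1)) == (i % size) for i in range(2 * size))


print(is_power_of_two(8192), mask_matches_modulo(8192))
print(is_power_of_two(12288), mask_matches_modulo(12288))
print(is_power_of_two(384 * 1024 * 1024))
```

Mask indexing is only equivalent to modulo for power-of-two sizes, which is why task_window keeps the constraint. A heap size such as 384 MiB is not a power of two, and that is acceptable for an allocator that only compares and subtracts offsets.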

Should fix

  • Dead Runtime fields. Runtime::{task_window_size, heap_size, dep_pool_size} (tensormap_and_ringbuffer/runtime/runtime.h) are no longer written (assignment dropped from runtime_maker.cpp) nor read (reads dropped from aicpu_executor.cpp) — only zeroed in runtime.cpp. Please remove the fields + their resets (a2a3 + a5).

Consider — depth extensibility (so PTO2_MAX_RING_DEPTH + 1 doesn't ripple)

Most of the C++ already loops over the ring count, but a few spots hardcode 4 and would all need editing if the depth grows:

  • worker.py: _CFG_FMT hardcodes "Q" * 15 and _read_config_from_mailbox unpacks 12 named ring_*_0..3. RUNTIME_ENV_RING_COUNT is already exported to Python (m.attr(...)) but unused here. Suggest deriving the format from the constant and unpacking via 3 slices:

    from _task_interface import RUNTIME_ENV_RING_COUNT as _N
    _CFG_FMT = struct.Struct("=iiiiiii" + "Q" * (3 + 3 * _N) + "1024s")
    ...
    v = _CFG_FMT.unpack_from(buf, _OFF_CONFIG)
    ring_task_windows = list(v[10        : 10 +   _N])
    ring_heaps        = list(v[10 +   _N : 10 + 2*_N])
    ring_dep_pools    = list(v[10 + 2*_N : 10 + 3*_N])

    This also removes the manual "15" drift risk.
  • call_config.h static_assert: 15 * sizeof(uint64_t) → (3 + 3 * RUNTIME_ENV_RING_COUNT) * sizeof(uint64_t) so it tracks the constant automatically.
  • Cross-layer coupling: chip_worker.cpp (common) builds RUNTIME_ENV_RING_COUNT-sized arrays and passes bare pointers the arch runtime indexes with PTO2_MAX_RING_DEPTH. Both are 4 today; a static_assert(RUNTIME_ENV_RING_COUNT == PTO2_MAX_RING_DEPTH, ...) in each arch runtime_maker.cpp (both headers visible) locks them together.
  • (minor) The [%PRIu64, x4] logs in runtime_maker.cpp / aicpu_executor.cpp enumerate [0..3]; a small format_ring_array() helper would make them depth-agnostic too.
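
A self-contained round trip of the slice-based unpacking suggested in the worker.py item. The format string and slice offsets come from that suggestion; everything else is a stand-in: N replaces the imported RUNTIME_ENV_RING_COUNT, the leading seven ints and three scalar slots are filled with zeros, and the helper name is hypothetical.

```python
import struct

N = 4  # stands in for RUNTIME_ENV_RING_COUNT
CFG_FMT = struct.Struct("=iiiiiii" + "Q" * (3 + 3 * N) + "1024s")
BASE = 7 + 3  # seven ints, then three scalar uint64 slots


def unpack_ring_arrays(buf, offset=0):
    """Return (ring_task_windows, ring_heaps, ring_dep_pools) from a packed config."""
    v = CFG_FMT.unpack_from(buf, offset)
    return (
        list(v[BASE : BASE + N]),
        list(v[BASE + N : BASE + 2 * N]),
        list(v[BASE + 2 * N : BASE + 3 * N]),
    )


windows = [8192, 16384, 131072, 524288]
heaps = [134217728, 268435456, 402653184, 536870912]
pools = [4096, 8192, 16384, 32768]

# Placeholder zeros for the fields this sketch does not model; b"" is
# null-padded to 1024 bytes by the "1024s" code.
buf = CFG_FMT.pack(*([0] * 7), 0, 0, 0, *windows, *heaps, *pools, b"")
print(unpack_ring_arrays(buf))
```

Because both the format string and the slices derive from N, changing the ring count changes the layout in one place.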

Consider — docs

  • MULTI_RING.md §7.2 says invalid env values are "silently ignored", but parse_uint_token/apply_env_ring_values LOG_WARN on every failure → suggest "logged and ignored".
  • K/M/G/T suffix parsing is env-only; CallConfig.runtime_env.ring_heap(s) are uint64 (bytes). Worth a one-line note in the CallConfig example so nobody tries ring_heaps=["128M", ...].

Design discussion — collapse scalar + array into one 1-or-4 list field

The scalar ring_task_window + array ring_task_windows pair duplicates plumbing across binding / wire / C ABI (run_prepared + 4 bind_callable_to_runtime_impl) / resolve_ring_config / scene_test. Since the env side already uses a "1-or-4" rule (one value broadcasts, four = per-ring, 2/3 rejected), the CallConfig side could match it: a single list field accepting length 1 (broadcast) or 4 (per-ring), getter always returning the full ring count — no getter ambiguity, and the wire drops the 3 scalar slots.

Framed as replace (not add), this makes the PR smaller, not larger: it deletes the duplicate scalar plumbing. Migration is bounded to ~5 in-repo call sites ("ring_task_window": X → "ring_task_windows": [X]); there are no out-of-repo consumers, and #1042 (the scalar's origin) is recent, so reshaping it before the per-ring API ossifies is cheap. Trade-off: loses the "custom uniform baseline + per-ring override" layering (marginal, since [0,0,0,X] still overrides one ring over defaults). Not blocking, but worth a deliberate decision now while unmerged.
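
The proposed "1-or-4" rule can be sketched as a small normalizer. This illustrates the design under discussion, not code from the PR; the function name is hypothetical.

```python
RING_COUNT = 4


def normalize_ring_list(values):
    """Accept a length-1 list (broadcast) or a length-4 list (per-ring)."""
    values = list(values)
    if len(values) == 1:
        return values * RING_COUNT
    if len(values) == RING_COUNT:
        return values
    raise ValueError(f"expected 1 or {RING_COUNT} entries, got {len(values)}")


print(normalize_ring_list([131072]))
print(normalize_ring_list([8192, 16384, 131072, 524288]))
```

With this shape a getter can always return the full ring count, and lengths 2 and 3 fail loudly instead of being partially applied.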

Minor

#include <thread>
#include <unistd.h>

class HalMemCtlFileLock {

Collaborator

This HalMemCtlFileLock block — plus the EACCES retry-window change (kHalMemCtlMaxRetries 3->60, delay 50ms->500ms) and the new PTO2_HALMEMCTL_LOCK_PATH env var — is unrelated to the #1029 per-ring runtime sizing this PR is about. The underlying halMemCtl EACCES / card-usage-overlap contention is a separate infra issue already being handled elsewhere, so it shouldn't ride in this PR. Also note PTO2_HALMEMCTL_LOCK_PATH is a new behavior gate (see .claude/rules/env-macro-gating.md — needs its own justification). Please drop these host_regs.cpp changes from this PR and land them in the dedicated card-lock fix, keeping #1099 scoped to per-ring sizing.

Contributor Author

Agreed, thanks for catching this. The host_regs.cpp card-lock / halMemCtl retry changes are unrelated to #1029, so I dropped them from this PR and amended the branch.

The current head (f46d99c8) keeps #1099 scoped to per-ring runtime sizing; src/a2a3/platform/onboard/host/host_regs.cpp is back to the upstream-main behavior. Any dedicated card-lock fix can carry the HalMemCtlFileLock / retry-window / env-gate changes separately with its own justification.

@TaoZQY TaoZQY force-pushed the feat/per-ring-runtime-env branch from 3a02e7c to f46d99c Compare June 22, 2026 12:55
}

uint64_t val = 0;
if (allow_size_suffix) {

Collaborator

Suggest dropping this K/M/G/T size-suffix path (the whole allow_size_suffix branch). It's the only consumer of the long double / strtold / isfinite / floor machinery here, and adds ~70 lines of new code across a2a3+a5 just to accept 4G / 384M / 1.5G. Removing it collapses parse_uint_token to a single integer (strtoull) path — the same behavior the env parser had before #141 — and lets you drop the allow_size_suffix parameter (2 signatures + 5 call sites) and the <cmath> / <cctype> includes. Heap would then take raw byte counts (e.g. 402653184), consistent with task_window / dep_pool.

Trade-off to call out: issue #1029 requested the PTO2_RING_HEAP=10M,64M,1.5G,4G syntax, so this diverges from that ask — worth a deliberate decision. (a5 mirrors this branch.)

Contributor Author

Agreed, I took this simplification. The latest head (3351e59c) drops the K/M/G/T suffix path entirely: parse_uint_token now only uses the integer strtoull path, allow_size_suffix is gone from the helper signatures/call sites, and the docs now show PTO2_RING_HEAP as raw byte counts.

This keeps the #1029 per-ring behavior while making env parsing consistent across task_window, heap, and dep_pool.

@ChaoZheng109

Copy link
Copy Markdown
Collaborator

Review — per-ring runtime sizing (#1029)

Solid PR. The device/SM layer was already per-ring; this correctly widens the host path and threads eff_*[4] end-to-end. a2a3/a5 parity is clean, the wire ABI is updated in lockstep (static_assert + Python _CFG_FMT + round-trip UT), and dropping the pow2 constraint on heap is correct — the heap is a comparison/subtraction ring allocator (try_bump_heap), no masking, so non-pow2 sizes are safe (only task_window needs pow2, and that's kept). Good test coverage.

Should fix

  • Dead Runtime fields. Runtime::{task_window_size, heap_size, dep_pool_size} (tensormap_and_ringbuffer/runtime/runtime.h) are no longer written (assignment dropped from runtime_maker.cpp) nor read (reads dropped from aicpu_executor.cpp) — only zeroed in runtime.cpp. Please remove the fields + their resets (a2a3 + a5).

Consider — depth extensibility (so PTO2_MAX_RING_DEPTH + 1 doesn't ripple)

Most of the C++ already loops over the ring count, but a few spots hardcode 4 and would all need editing if the depth grows:

  • worker.py: _CFG_FMT hardcodes "Q" * 15 and _read_config_from_mailbox unpacks 12 named ring_*_0..3. RUNTIME_ENV_RING_COUNT is already exported to Python (m.attr(...)) but unused here. Suggest deriving the format from the constant and unpacking via 3 slices:

    from _task_interface import RUNTIME_ENV_RING_COUNT as _N
    _CFG_FMT = struct.Struct("=iiiiiii" + "Q" * (3 + 3 * _N) + "1024s")
    ...
    v = _CFG_FMT.unpack_from(buf, _OFF_CONFIG)
    ring_task_windows = list(v[10        : 10 +   _N])
    ring_heaps        = list(v[10 +   _N : 10 + 2*_N])
    ring_dep_pools    = list(v[10 + 2*_N : 10 + 3*_N])

    This also removes the manual "15" drift risk.

  • call_config.h static_assert: 15 * sizeof(uint64_t) → (3 + 3 * RUNTIME_ENV_RING_COUNT) * sizeof(uint64_t), so it tracks the constant automatically.

  • Cross-layer coupling: chip_worker.cpp (common) builds RUNTIME_ENV_RING_COUNT-sized arrays and passes bare pointers the arch runtime indexes with PTO2_MAX_RING_DEPTH. Both are 4 today; a static_assert(RUNTIME_ENV_RING_COUNT == PTO2_MAX_RING_DEPTH, ...) in each arch runtime_maker.cpp (both headers visible) locks them together.

  • (minor) The [%PRIu64, x4] logs in runtime_maker.cpp / aicpu_executor.cpp enumerate [0..3]; a small format_ring_array() helper would make them depth-agnostic too.
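
The worker.py suggestion in the first bullet can be tried in isolation with a round trip like the one below. The format string is the one quoted above. RING_COUNT is a local stand-in for RUNTIME_ENV_RING_COUNT (the real constant comes from _task_interface), and the names N_INTS, N_SCALARS, and unpack_rings are hypothetical.

```python
import struct

RING_COUNT = 4   # local stand-in for RUNTIME_ENV_RING_COUNT
N_INTS = 7       # leading int32 fields in the quoted format
N_SCALARS = 3    # scalar uint64 fields that precede the per-ring arrays

# Same layout as the quoted format: "=iiiiiii" + "Q" * (3 + 3 * N) + "1024s"
CFG_FMT = struct.Struct(
    "=" + "i" * N_INTS + "Q" * (N_SCALARS + 3 * RING_COUNT) + "1024s"
)


def unpack_rings(buf, offset=0):
    """Return (ring_task_windows, ring_heaps, ring_dep_pools) from a packed config."""
    v = CFG_FMT.unpack_from(buf, offset)
    base = N_INTS + N_SCALARS  # index 10 when there are 7 ints and 3 scalars
    n = RING_COUNT
    return (
        list(v[base:base + n]),
        list(v[base + n:base + 2 * n]),
        list(v[base + 2 * n:base + 3 * n]),
    )
```

Because the field count and the slice bounds both derive from RING_COUNT, changing the depth in one place keeps the format and the unpacking consistent.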

Consider — docs

  • MULTI_RING.md §7.2 says invalid env values are "silently ignored", but parse_uint_token/apply_env_ring_values LOG_WARN on every failure → suggest "logged and ignored".
  • K/M/G/T suffix parsing is env-only; CallConfig.runtime_env.ring_heap(s) are uint64 (bytes). Worth a one-line note in the CallConfig example so nobody tries ring_heaps=["128M", ...].

Design discussion — collapse scalar + array into one 1-or-4 list field

The scalar ring_task_window + array ring_task_windows pair duplicates plumbing across binding / wire / C ABI (run_prepared + 4 bind_callable_to_runtime_impl) / resolve_ring_config / scene_test. Since the env side already uses a "1-or-4" rule (one value broadcasts, four = per-ring, 2/3 rejected), the CallConfig side could match it: a single list field accepting length 1 (broadcast) or 4 (per-ring), getter always returning the full ring count — no getter ambiguity, and the wire drops the 3 scalar slots.

Framed as replace (not add), this makes the PR smaller, not larger: it deletes the duplicate scalar plumbing. Migration is bounded to ~5 in-repo call sites ("ring_task_window": X → "ring_task_windows": [X]); there are no out-of-repo consumers, and #1042 (the scalar's origin) is recent, so reshaping it before the per-ring API ossifies is cheap. Trade-off: loses the "custom uniform baseline + per-ring override" layering (marginal: [0,0,0,X] still overrides one ring over defaults). Not blocking, but worth a deliberate decision now while unmerged.

Minor

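No need to make the "ring_task_window": X → "ring_task_windows": [X] change. As it stands, "ring_task_window": X broadcasts to every ring, while "ring_task_windows": [X, Y, Z, H] must be given exactly 4 entries, otherwise an error is raised.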

Add per-ring sizing for the tensormap_and_ringbuffer runtime so each scope-depth ring can use independent task-window, heap, and dependency-pool capacities.

This keeps the scalar runtime_env fields introduced by hw-native-sys#1042 and adds per-ring array fields: ring_task_windows[4], ring_heaps[4], and ring_dep_pools[4]. Effective sizing is resolved per resource and per ring with CallConfig values taking precedence over environment variables, followed by compile-time defaults.

Environment variables now support either scalar values or exactly four comma-separated per-ring integer values. The change wires the effective per-ring capacities through both a2a3 and a5 tensormap_and_ringbuffer runtimes, including host runtime creation, AICPU runtime setup, shared-memory layout, scheduler/orchestrator initialization, and scope-stats reporting.

Also reject negative integer env values before unsigned parsing, guard per-ring heap accumulation against overflow, remove obsolete Runtime sizing fields, and keep the mailbox/runtime ring count checks derived from RUNTIME_ENV_RING_COUNT.

Fixes hw-native-sys#1029.
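
The precedence chain from the PR description (per-ring CallConfig, then scalar CallConfig, then per-ring env, then scalar env, then compile-time default) could be resolved per ring roughly as in the Python sketch below. It assumes that 0 (or a missing array) means "unset" and that a single default applies to every ring. resolve_ring is a hypothetical name, not the repository's resolve_ring_config.

```python
RING_COUNT = 4  # assumed ring depth (ring0..ring3)


def resolve_ring(cfg_array, cfg_scalar, env_array, env_scalar, default):
    """Return the effective size for each ring; 0 or None is treated as unset."""
    effective = []
    for ring in range(RING_COUNT):
        # Candidates in descending priority; the compile-time default is the fallback.
        candidates = (
            cfg_array[ring] if cfg_array else 0,
            cfg_scalar,
            env_array[ring] if env_array else 0,
            env_scalar,
        )
        effective.append(next((c for c in candidates if c), default))
    return effective
```

For example, a per-ring CallConfig of `[0, 0, 0, 512]` with a scalar env value of 128 yields `[128, 128, 128, 512]` under these assumptions: ring3 takes the CallConfig entry and the other rings fall through to the env value.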
@TaoZQY TaoZQY force-pushed the feat/per-ring-runtime-env branch from f46d99c to 3351e59 Compare June 23, 2026 01:21
@ChaoZheng109 ChaoZheng109 merged commit c68d9bb into hw-native-sys:main Jun 23, 2026
16 checks passed
ChaoZheng109 added a commit to ChaoZheng109/simpler that referenced this pull request Jun 23, 2026
hw-native-sys#1099 added per-ring array fields (ring_task_windows / ring_heaps /
ring_dep_pools) alongside the scalar runtime_env knobs, but neither
per_task_runtime_env example exercised them.

Extend both the L2 and L3 examples to also cover the per-ring form:
each scope-depth ring (0..3) sized independently. The config helpers
now iterate a RING_FIELDS tuple so a spec dict can carry either the
scalar or the array keys, and the READMEs document the full precedence
chain and the --enable-scope-stats verification path.
ChaoZheng109 added a commit that referenced this pull request Jun 24, 2026
ChaoWao pushed a commit that referenced this pull request Jun 24, 2026
…ld (#1128)

#1099 exposed ring sizing through two near-identical CallConfig.runtime_env
names per resource that differ only by a trailing `s` — `ring_task_window`
(scalar broadcast) vs `ring_task_windows` (per-ring array), etc. The one-letter
difference is an ergonomics footgun and the layered "scalar baseline + per-ring
override" semantics it bought are not worth the confusing twin names.

Collapse each pair into a single field that accepts EITHER a scalar (broadcast
to every ring) OR a 4-entry list (per-ring):

    cfg.runtime_env.ring_task_window = 128             # broadcast
    cfg.runtime_env.ring_task_window = [128, 0, 0, 0]  # per-ring; 0 falls through

Broadcast happens in the Python binding (int -> [v, v, v, v]); the wire format
now carries only the three 4-element arrays (12 uint64, down from 15) and the
getter always returns a 4-list. A 0 entry falls through to PTO2_RING_* env ->
compile-time default; the separate scalar-CallConfig precedence tier is dropped
(accepted trade-off — a 0 in a list no longer falls back to a sibling scalar).

The internal C-API (run_prepared) and wire layout are internal-only and rebuild
together via pip install, so this is a clean break with no back-compat shim.
Mirrored across a2a3/a5, both runtimes, bindings, scene-test parsing, docs,
unit tests, and the per_task_runtime_env examples.

Closes #1126.
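
A minimal Python sketch of the scalar-or-list normalization this commit message describes (int broadcast to four entries, a list required to have exactly four). normalize_ring_field is a hypothetical name, and the exception types and range checks are assumptions; the actual binding's validation is not shown in this thread.

```python
RING_COUNT = 4
UINT64_MAX = 2**64 - 1


def normalize_ring_field(value):
    """Accept an int (broadcast) or a RING_COUNT-entry sequence; always return a list."""
    if isinstance(value, int):
        values = [value] * RING_COUNT          # int -> [v, v, v, v]
    else:
        values = list(value)
        if len(values) != RING_COUNT:
            raise ValueError(f"expected {RING_COUNT} entries, got {len(values)}")
    for v in values:
        # Entries travel as uint64 on the wire, so reject non-integers and out-of-range values.
        if not isinstance(v, int) or not 0 <= v <= UINT64_MAX:
            raise ValueError(f"invalid ring size: {v!r}")
    return values
```

Under this sketch `normalize_ring_field(128)` and `normalize_ring_field([128, 0, 0, 0])` both return four-entry lists, with 0 entries left in place so that a later resolution step can fall through to PTO2_RING_* env values and then the compile-time default.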