Feat/per ring runtime env #1029 by TaoZQY · Pull Request #1099 · hw-native-sys/simpler

TaoZQY · 2026-06-22T03:35:50Z

Summary

Add per-ring runtime sizing for the tensormap_and_ringbuffer runtime.

This keeps the scalar runtime_env fields introduced by #1042 and adds per-ring array fields so a single task can size ring0..ring3 independently:

ring_task_windows[4]
ring_heaps[4]
ring_dep_pools[4]

This addresses #1029: deep kernels put very different resource pressure on each scope-depth ring, so a single scalar value is either too small for the deeper rings or too large for the shallow ones, and the oversized case grows shared memory unnecessarily.

Behavior

Effective sizing is resolved per resource and per ring:

per-ring CallConfig value
  > scalar CallConfig value
  > per-ring PTO2_RING_* env value
  > scalar PTO2_RING_* env value
  > compile-time default
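
The precedence chain above can be sketched in Python as follows. This is an illustration only, not the runtime's `resolve_ring_config`: the function name `resolve_ring_values`, the use of 0 to mean "unset" at every level, and the default value are assumptions made for the example.

```python
RING_COUNT = 4  # ring0..ring3, as in the PR description


def resolve_ring_values(per_ring_cfg, scalar_cfg, per_ring_env, scalar_env, default):
    """Pick the effective value for each ring, highest-priority source first.

    Assumption for this sketch: 0 means "not set" at every level.
    """
    effective = []
    for r in range(RING_COUNT):
        candidates = (
            per_ring_cfg[r],  # per-ring CallConfig value
            scalar_cfg,       # scalar CallConfig value
            per_ring_env[r],  # per-ring PTO2_RING_* env value
            scalar_env,       # scalar PTO2_RING_* env value
        )
        # First non-zero candidate wins; otherwise fall back to the default.
        effective.append(next((v for v in candidates if v != 0), default))
    return effective


# Only ring3 is overridden through CallConfig; the env supplies a scalar.
task_windows = resolve_ring_values(
    per_ring_cfg=[0, 0, 0, 524288],
    scalar_cfg=0,
    per_ring_env=[0, 0, 0, 0],
    scalar_env=131072,
    default=65536,  # hypothetical compile-time default
)
print(task_windows)  # [131072, 131072, 131072, 524288]
```

Resolution runs independently for each of the three resources, so the same helper would be called once per resource in this sketch.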

Environment variables now support either scalar values or four comma-separated per-ring values:

PTO2_RING_TASK_WINDOW=8192,16384,131072,524288
PTO2_RING_HEAP=134217728,268435456,402653184,536870912
PTO2_RING_DEP_POOL=4096,8192,16384,32768

PTO2_RING_HEAP values are integer byte counts; size suffixes such as K/M/G/T are not supported.

Usage

Users can configure per-ring sizing through environment variables or through CallConfig.runtime_env.

Environment variables

Scalar values are still supported and are broadcast to all rings:

PTO2_RING_TASK_WINDOW=131072
PTO2_RING_HEAP=4294967296
PTO2_RING_DEP_POOL=262144

Per-ring values use exactly four comma-separated entries, indexed by ring0..ring3:

PTO2_RING_TASK_WINDOW=8192,16384,131072,524288
PTO2_RING_HEAP=134217728,268435456,402653184,536870912
PTO2_RING_DEP_POOL=4096,8192,16384,32768
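
A minimal Python sketch of the "one value or exactly four" rule described above. The helper name `parse_ring_env` is hypothetical and this is not the host parser itself; it only mirrors the stated behavior (scalar broadcast, four comma-separated integers, no negative values, no size suffixes). Resource-specific checks such as the power-of-two requirement on task windows are left out.

```python
RING_COUNT = 4
UINT64_MAX = (1 << 64) - 1


def parse_ring_env(text):
    """Parse a PTO2_RING_* style value into a list of four integers.

    Returns None when the value is invalid, so the caller can ignore it.
    """
    tokens = [t.strip() for t in text.split(",")]
    if len(tokens) not in (1, RING_COUNT):
        return None  # only one value or exactly four are accepted
    values = []
    for tok in tokens:
        # isdecimal() rejects signs, size suffixes (K/M/G/T) and empty tokens.
        if not (tok.isascii() and tok.isdecimal()):
            return None
        value = int(tok)
        if value > UINT64_MAX:
            return None  # would not fit in a uint64_t
        values.append(value)
    # A single value is broadcast to every ring.
    return values * RING_COUNT if len(values) == 1 else values


print(parse_ring_env("131072"))                    # [131072, 131072, 131072, 131072]
print(parse_ring_env("8192,16384,131072,524288"))  # [8192, 16384, 131072, 524288]
print(parse_ring_env("8192,16384"))                # None
```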

CallConfig

cfg.runtime_env.ring_task_windows = [8192, 16384, 131072, 524288]
cfg.runtime_env.ring_heaps = [
    128 * 1024 * 1024,
    256 * 1024 * 1024,
    384 * 1024 * 1024,
    512 * 1024 * 1024,
]
cfg.runtime_env.ring_dep_pools = [4096, 8192, 16384, 32768]

Ring entries map to scope depth as follows:

ring0 -> scope depth 0
ring1 -> scope depth 1
ring2 -> scope depth 2
ring3 -> scope depth >= 3
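
The mapping above amounts to clamping the scope depth to the last ring. A short sketch, with `ring_for_scope_depth` as a hypothetical helper name:

```python
RING_COUNT = 4


def ring_for_scope_depth(depth):
    """Map a scope depth to its ring index; depths >= 3 all share ring3."""
    if depth < 0:
        raise ValueError("scope depth must be non-negative")
    return min(depth, RING_COUNT - 1)


print([ring_for_scope_depth(d) for d in range(6)])  # [0, 1, 2, 3, 3, 3]
```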

Verification

Run a tensormap_and_ringbuffer (T&R) scene test with --enable-scope-stats:

PTO2_RING_TASK_WINDOW=8192,16384,131072,524288 \
PTO2_RING_HEAP=134217728,268435456,402653184,536870912 \
PTO2_RING_DEP_POOL=4096,8192,16384,32768 \
python tests/st/<case>/test_<name>.py -p a2a3 -d 0 --enable-scope-stats

Then inspect the first line of scope_stats.jsonl. The metadata contains the effective capacities:

task_window_max = [...]
heap_max        = [...]
dep_pool_max    = [...]

These arrays are indexed by ring, so they can confirm the per-ring configuration took effect.
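
The check can be scripted. The sketch below assumes the first line of scope_stats.jsonl is a JSON object that carries the three arrays as top-level keys; that layout is an assumption, so adjust the lookups if the metadata nests them. Both helper names are hypothetical.

```python
import json

CAPACITY_KEYS = ("task_window_max", "heap_max", "dep_pool_max")


def read_effective_capacities(path):
    """Return the three per-ring capacity arrays from the first JSONL line.

    Assumption: the first line is a JSON object holding the arrays as
    top-level keys.
    """
    with open(path, encoding="utf-8") as f:
        meta = json.loads(f.readline())
    return {key: meta[key] for key in CAPACITY_KEYS}


def mismatches(actual, expected):
    """List the keys whose per-ring arrays differ from what was configured."""
    return [key for key, want in expected.items() if actual.get(key) != want]
```

Usage: call `read_effective_capacities("scope_stats.jsonl")`, pass the result to `mismatches` together with the configured arrays, and treat an empty list as confirmation that the per-ring values took effect.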

Validated with qwen3_14b_decode and --enable-scope-stats:

task_window_max = [8192, 16384, 131072, 524288]
heap_max        = [134217728, 268435456, 402653184, 536870912]
dep_pool_max    = [4096, 8192, 16384, 32768]
fatal           = false
dropped         = 0

Documentation

The new usage is documented in:

src/a2a3/runtime/tensormap_and_ringbuffer/docs/MULTI_RING.md
src/a5/runtime/tensormap_and_ringbuffer/docs/MULTI_RING.md
docs/dfx/scope-stats.md

coderabbitai · 2026-06-22T03:36:04Z

📝 Walkthrough

This PR replaces the three scalar RuntimeEnv ring-sizing overrides with per-ring arrays (ring_task_windows, ring_heaps, ring_dep_pools of length 4), propagating the change through the C ABI, runtime layout structs, host orchestration initialization, shared-memory handle, AicpuExecutor, Python bindings, and documentation for both a2a3 and a5 targets. Environment variables now accept a single scalar (broadcast to all rings) or four comma-separated per-ring values, and PTO2_RING_HEAP gains K/M/G/T suffix support. The scalar fields are retained for backward compatibility.

Changes

Per-ring ring-buffer configuration

Layer / File(s)	Summary
RuntimeEnv per-ring data contract and wire layout `src/common/task_interface/call_config.h`	Adds `RUNTIME_ENV_RING_COUNT=4`, three per-ring array fields, `per_ring_any()`, updates `any()` and `validate()` with revised constraint rules, and updates `static_assert`s for the enlarged 15×uint64_t wire layout.
C ABI and worker function-pointer propagation `src/common/worker/pto_runtime_c_api.h`, `src/common/worker/chip_worker.h`, `src/common/worker/chip_worker.cpp`, `src/common/platform/onboard/host/c_api_shared.cpp`, `src/common/platform/sim/host/c_api_shared.cpp`	Adds `const uint64_t*` ring array parameters to `run_prepared` and `bind_callable_to_runtime_impl`, updates the `RunPreparedFn` typedef, and makes `ChipWorker::run` memcpy the per-ring arrays from `RuntimeEnv` before forwarding.
Runtime layout struct: scalar→per-ring array fields `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.h`, `src/a2a3/.../pto_orchestrator.h`, `src/a2a3/.../pto_scheduler.h`, `src/a2a3/.../pto_shared_memory.h`, `src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.h`, `src/a5/.../pto_orchestrator.h`, `src/a5/.../pto_scheduler.h`, `src/a5/.../pto_shared_memory.h`	Changes all cached sizing fields in `PTO2RuntimeArenaLayout`, `PTO2OrchestratorLayout`, and `PTO2SchedulerLayout` from scalars to `[PTO2_MAX_RING_DEPTH]` arrays, updates `reserve_layout` and `init_data_from_layout` signatures, and declares `PTO2SharedMemoryHandle::init_per_ring`.
Host runtime_maker: env parsing and bind_callable ABI `src/a2a3/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp`, `src/a5/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp`, `src/a2a3/runtime/host_build_graph/host/runtime_maker.cpp`, `src/a5/runtime/host_build_graph/host/runtime_maker.cpp`	Adds `is_power_of_2_u64`, `parse_uint_token` (with size-suffix), `apply_env_ring_values` (scalar or CSV), and `resolve_ring_config` that merges defaults, env, and call-site overrides into effective per-ring arrays; extends `bind_callable_to_runtime_impl` C ABI with ring pointer parameters and rewrites the arena reservation call.
Runtime arena reservation and init overloads `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/shared/pto_runtime2_init.cpp`, `src/a5/runtime/tensormap_and_ringbuffer/runtime/shared/pto_runtime2_init.cpp`	Adds per-ring overloads for scheduler/orchestrator `reserve_layout` and `init_data_from_layout` (using per-ring dep_pool_capacities, summing heap_sizes across rings), and top-level `runtime_reserve_layout`/`runtime_init_data_from_layout` with scalar-to-array wrappers.
PTO2SharedMemoryHandle::init_per_ring implementation `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/shared/pto_shared_memory.cpp`, `src/a5/runtime/tensormap_and_ringbuffer/runtime/shared/pto_shared_memory.cpp`	`init()` now builds per-ring arrays and delegates to new `init_per_ring()`, which validates size against `calculate_size_per_ring()` and calls `setup_pointers_per_ring`/`init_header_per_ring`.
AicpuExecutor prebuilt-arena init switched to init_per_ring `src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp`, `src/a5/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp`	Replaces single-ring SM sizing and `sm_handle->init()` with `calculate_size_per_ring` from `rt->prebuilt_layout`, per-ring logging, and `sm_handle->init_per_ring()`; updates profiling to use `dep_pool_capacities[r]`.
Python bindings: per-ring properties, mailbox format, scene-test config `python/bindings/task_interface.cpp`, `python/simpler/worker.py`, `simpler_setup/scene_test.py`	Exports `RUNTIME_ENV_RING_COUNT`; adds `ring_task_windows`/`ring_heaps`/`ring_dep_pools` as validated read/write properties on `RuntimeEnv`; updates `__repr__`; expands `_CFG_FMT` from 3 to 15 uint64 fields; updates `_read_config_from_mailbox` to unpack 4-entry arrays; adds plural key support in `SceneTestCase._build_config`.
C++ and Python unit tests `tests/ut/cpp/a2a3/test_shared_memory.cpp`, `tests/ut/cpp/a5/test_shared_memory.cpp`, `tests/ut/cpp/types/test_call_config.cpp`, `tests/ut/py/test_chip_worker.py`	Adds `InitPerRingWritesHeaderValues` and `PerRingConfigInitializesRuntimeComponents` C++ tests; updates wire-layout asserts, adds `RejectsPerRingRuntimeEnvValues`; updates Python tests to per-ring list fields and adds mailbox round-trip and validation rejection tests.
Documentation, diagnostic strings, and example updates `src/a2a3/.../docs/MULTI_RING.md`, `src/a5/.../docs/MULTI_RING.md`, `docs/dfx/scope-stats.md`, `src/a2a3/.../pto_ring_buffer.h`, `src/a5/.../pto_ring_buffer.h`, `examples/workers/l2/per_task_runtime_env/*`, `examples/a2a3/...`	Updates MULTI_RING.md with per-ring precedence table, CSV env format, K/M/G/T suffix docs, and scope-stats JSONL verification pointer; removes "power-of-2" from ring_heap documentation and deadlock error messages throughout.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related issues

Per-ring (per-scope-level) heap / task-window / dep-pool config — uniform sizing has no working value for deep kernels #1029: This PR implements exactly the feature described in that issue — per-ring configuration of heap, task-window, and dependency-pool sizes via new array fields in RuntimeEnv, comma-separated env-var parsing, and corresponding runtime initialization changes.

Possibly related PRs

hw-native-sys/simpler#846: This PR modifies the same AicpuExecutor::run prebuilt-arena fast path (switching from single-ring sm_handle->init() to init_per_ring()) that #846 originally refactored for the TRB prebuilt-runtime-arena boot flow.
hw-native-sys/simpler#911: This PR extends bind_callable_to_runtime_impl's ABI with per-ring array parameters, directly building on #911's rename of that binding entry point.
hw-native-sys/simpler#1042: This PR extends the same per-task CallConfig.runtime_env ring-sizing plumbing that #1042 established for scalar ring_task_window/ring_heap/ring_dep_pool into full per-ring array overrides with updated mailbox ABI and bindings.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 26.80% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'Feat/per ring runtime env `#1029`' clearly indicates the main feature being added (per-ring runtime environment configuration) and references the issue number.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	The pull request description clearly relates to and documents the changeset, providing comprehensive information about per-ring runtime sizing capabilities, behavior, usage examples, and verification methods.

gemini-code-assist

Code Review

This pull request introduces support for independent, per-ring runtime environment overrides (task window, heap size, and dependency pool capacity) across the four scope-depth rings, updating the C++ runtime, Python bindings, test infrastructure, and documentation, while also relaxing the power-of-two constraint on heap sizes. The review feedback highlights a security vulnerability in both a2a3 and a5 runtime_maker.cpp where negative inputs to std::strtoull can wrap around and bypass validation, and suggests replacing angle-bracket placeholders in markdown bash snippets with standard shell variables to prevent syntax errors.

coderabbitai

Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp`:
- Around line 495-500: When init_per_ring() fails in the AiCpuExecutor, the
runtime_init_ready_ flag is released while rt still points to a partially
initialized runtime object with a zeroed sm_handle. This bypasses the
scheduler-side rt == nullptr guard and allows dispatch to run against invalid
state. Before storing true to runtime_init_ready_ via the store() call with
memory_order_release, clear the rt pointer to nullptr so that scheduler threads
will still see rt as null and avoid dispatching against the broken runtime.

In `@src/a2a3/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp`:
- Around line 455-462: The resolve_ring_config() call happens after tensors have
been allocated and appended to runtime->tensor_pairs_, but if it fails, those
allocated device buffers are not cleaned up before returning -1. Move the
resolve_ring_config() call to execute before the tensor allocation loop that
populates runtime->tensor_pairs_, so that if configuration resolution fails, no
device buffers will have been allocated yet. This ensures the function exits
cleanly without leaving orphaned resources when resolve_ring_config() returns
false.
- Around line 478-481: The accumulation of eff_heap_sizes into total_heap_size
in the loop (iterating from 0 to PTO2_MAX_RING_DEPTH) can overflow since
eff_heap_sizes values can be user-provided up to uint64_t maximum. Add overflow
detection before each addition in the loop to guard against this, either by
checking if adding the next eff_heap_sizes[r] value would exceed the maximum
uint64_t value before performing the addition, or by using a safe addition
function that detects and handles overflow. This ensures total_heap_size remains
accurate when passed to setup_static_arena() and matches the heap partitioning
performed by later code.

In
`@src/a2a3/runtime/tensormap_and_ringbuffer/runtime/shared/pto_runtime2_init.cpp`:
- Around line 273-291: The heap size accumulation in the loop over
PTO2_MAX_RING_DEPTH does not check for integer overflow when summing heap_sizes
values into gm_heap_size and heap_offset. If the cumulative heap sizes exceed
the maximum value of uint64_t, gm_heap_size will wrap around and heap_offset
calculations will create overlapping or invalid ring heap bases. Create a
checked-sum helper function that detects overflow conditions and returns false
or nullptr when overflow would occur. Apply this checked-sum validation before
any heap_offset calculations and before calling
orch->rings[r].task_allocator.init(), ensuring the function exits early if
overflow is detected. This fix should be applied to all locations where similar
heap size accumulation occurs, including the additional location referenced in
the comment.

In `@src/a5/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp`:
- Around line 455-462: The call to resolve_ring_config() happens after device
buffers have already been allocated and appended to runtime->tensor_pairs_ in an
earlier loop, but if resolve_ring_config() fails and returns false, the function
exits without cleaning up those allocated resources. Move the
resolve_ring_config() call to execute before the tensor staging loop that
allocates device buffers and appends to runtime->tensor_pairs_, so that
configuration validation occurs first and prevents resource leaks when
resolution fails.
- Around line 478-481: The accumulation loop where total_heap_size is built by
summing eff_heap_sizes[r] values can overflow since eff_heap_sizes[r] is
user-provided and can reach uint64_t::max. Add overflow protection before each
addition operation in the loop (where r iterates from 0 to PTO2_MAX_RING_DEPTH)
to ensure that adding eff_heap_sizes[r] to total_heap_size will not wrap around.
If overflow would occur, handle it by either capping the total_heap_size to
uint64_t::max or rejecting the configuration to prevent passing an undersized
heap size to setup_static_arena() that would mismatch the per-ring partitioning
done later.

In
`@src/a5/runtime/tensormap_and_ringbuffer/runtime/shared/pto_runtime2_init.cpp`:
- Around line 265-283: The code accumulates heap sizes into gm_heap_size and
heap_offset without checking for integer overflow, which can cause uint64_t
wraps and produce overlapping or invalid ring heap bases. Create a checked-sum
helper function that detects overflow when adding values to a running total,
then use this helper to validate that the sum of all heap_sizes values doesn't
overflow before the loop that accumulates gm_heap_size and before initializing
the task allocators via orch->rings[r].task_allocator.init. Return false or
nullptr on overflow detection instead of proceeding with initialization. Apply
the same overflow-checking pattern to the similar code section mentioned at
lines 407-415.

ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between 26b7b15 and 7a09062.

📒 Files selected for processing (39)

docs/dfx/scope-stats.md
examples/a2a3/tensormap_and_ringbuffer/paged_attention_ringbuffer/test_paged_attention_ringbuffer.py
examples/workers/l2/per_task_runtime_env/README.md
examples/workers/l2/per_task_runtime_env/main.py
python/bindings/task_interface.cpp
python/simpler/worker.py
simpler_setup/scene_test.py
src/a2a3/runtime/host_build_graph/host/runtime_maker.cpp
src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp
src/a2a3/runtime/tensormap_and_ringbuffer/docs/MULTI_RING.md
src/a2a3/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.h
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_ring_buffer.h
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.h
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_shared_memory.h
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/shared/pto_runtime2_init.cpp
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/shared/pto_shared_memory.cpp
src/a5/runtime/host_build_graph/host/runtime_maker.cpp
src/a5/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp
src/a5/runtime/tensormap_and_ringbuffer/docs/MULTI_RING.md
src/a5/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp
src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.h
src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_ring_buffer.h
src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.h
src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_shared_memory.h
src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h
src/a5/runtime/tensormap_and_ringbuffer/runtime/shared/pto_runtime2_init.cpp
src/a5/runtime/tensormap_and_ringbuffer/runtime/shared/pto_shared_memory.cpp
src/common/platform/onboard/host/c_api_shared.cpp
src/common/platform/sim/host/c_api_shared.cpp
src/common/task_interface/call_config.h
src/common/worker/chip_worker.cpp
src/common/worker/chip_worker.h
src/common/worker/pto_runtime_c_api.h
tests/ut/cpp/a2a3/test_shared_memory.cpp
tests/ut/cpp/a5/test_shared_memory.cpp
tests/ut/cpp/types/test_call_config.cpp
tests/ut/py/test_chip_worker.py

TaoZQY · 2026-06-22T07:58:17Z

Addressed the AI review feedback in 864d8f42 and kept the PR as a single squashed commit.

Changes made:

Reject negative integer env values before strtoull in both a2a3 and a5.
Use copy-paste-safe shell variables in the scope-stats docs snippet.
Move resolve_ring_config() before tensor staging in both host runtime makers.
Add uint64_t overflow guards for host-side per-ring heap sums.
Add checked heap-sum validation in shared runtime init before deriving ring heap bases.
Clear rt before releasing scheduler threads on init_per_ring() failure.
Add C++ unit coverage for overflowing per-ring heap sums.
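
For readers following along, the overflow guard on the per-ring heap sum can be illustrated in Python. This is a sketch of the general checked-addition idiom, not the code in 864d8f42; `checked_heap_sum` is a hypothetical name.

```python
UINT64_MAX = (1 << 64) - 1


def checked_heap_sum(heap_sizes):
    """Sum per-ring heap sizes, or return None if the total exceeds uint64.

    Python integers do not wrap, so the guard spells out the C++ idiom
    `if (value > UINT64_MAX - total)` explicitly.
    """
    total = 0
    for size in heap_sizes:
        if size < 0 or size > UINT64_MAX:
            return None  # not representable as uint64_t
        if size > UINT64_MAX - total:
            return None  # adding this ring would wrap around
        total += size
    return total


print(checked_heap_sum([134217728, 268435456, 402653184, 536870912]))  # 1342177280
print(checked_heap_sum([UINT64_MAX, 1, 0, 0]))                         # None
```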

Validation:

pre-commit run --from-ref upstream/main --to-ref HEAD: passed
git diff --check: passed
cmake --build /tmp/simpler-ut-cpp-build -j$(nproc): passed
ctest --test-dir /tmp/simpler-ut-cpp-build --output-on-failure: 44/44 passed

I did not address CodeRabbit's generic docstring coverage warning because it is not part of this repository's enforced checks and the touched implementation is primarily C++ runtime code; adding unrelated Python docstrings would be noise for this PR.

TaoZQY · 2026-06-22T08:33:28Z

Updated the PR branch to 47442870 after investigating the failing st-onboard-a2a3 job.

The previous failure was the first examples/workers/l2/per_task_runtime_env run on dev=11 failing before runtime setup at halMemCtl rc=13 (EACCES) while retrieving AICore register addresses. I hardened the existing a2a3 onboard retry path so this transient driver-side serialization window now waits up to ~2s for EACCES only; other HAL errors still fail immediately.

Local validation:

pre-commit run --from-ref upstream/main --to-ref HEAD: passed
cmake --build /tmp/simpler-ut-cpp-build -j$(nproc): passed
ctest --test-dir /tmp/simpler-ut-cpp-build --output-on-failure: 44/44 passed
python simpler_setup/build_runtimes.py --platforms a2a3 --pto-isa-commit ddafa8da9c760ecd13fe9fe2833d6ee55fb20bd8: passed

The new workflow run is currently waiting for maintainer approval before jobs are allowed to start: https://github.com/hw-native-sys/simpler/actions/runs/27939901712

ChaoZheng109 · 2026-06-22T11:17:21Z

Review — per-ring runtime sizing (#1029)

Solid PR. The device/SM layer was already per-ring; this correctly widens the host path and threads eff_*[4] end-to-end. a2a3/a5 parity is clean, the wire ABI is updated in lockstep (static_assert + Python _CFG_FMT + round-trip UT), and dropping the pow2 constraint on heap is correct — the heap is a comparison/subtraction ring allocator (try_bump_heap), no masking, so non-pow2 sizes are safe (only task_window needs pow2, and that's kept). Good test coverage.
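
To illustrate the pow2 point: indexing by mask only matches modulo when the size is a power of two, while wrapping by comparison and subtraction works for any size. The sketch below is generic Python, not the runtime's `try_bump_heap`; all three helper names are hypothetical.

```python
def is_power_of_2(n):
    """True for 1, 2, 4, 8, ... (the usual n & (n - 1) test)."""
    return n > 0 and (n & (n - 1)) == 0


def masked_slot(task_id, window):
    """Slot lookup by masking; equals task_id % window only for pow2 windows."""
    return task_id & (window - 1)


def wrapped_offset(offset, size, heap_size):
    """Advance a ring offset using comparison and subtraction only.

    Hypothetical helper: valid for any heap_size because no mask is involved.
    Requires 0 <= offset < heap_size and 0 <= size <= heap_size.
    """
    new_offset = offset + size
    if new_offset >= heap_size:
        new_offset -= heap_size
    return new_offset


print(masked_slot(8200, 8192), 8200 % 8192)  # 8 8  (pow2: mask and modulo agree)
print(masked_slot(6, 6), 6 % 6)              # 4 0  (non-pow2: mask gives a wrong slot)
print(wrapped_offset(300, 200, 384))         # 116  (non-pow2 size wraps correctly)
```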

Should fix

Dead Runtime fields. Runtime::{task_window_size, heap_size, dep_pool_size} (tensormap_and_ringbuffer/runtime/runtime.h) are no longer written (assignment dropped from runtime_maker.cpp) nor read (reads dropped from aicpu_executor.cpp) — only zeroed in runtime.cpp. Please remove the fields + their resets (a2a3 + a5).

Consider — depth extensibility (so `PTO2_MAX_RING_DEPTH + 1` doesn't ripple)

Most of the C++ already loops over the ring count, but a few spots hardcode 4 and would all need editing if the depth grows:

worker.py: _CFG_FMT hardcodes "Q" * 15 and _read_config_from_mailbox unpacks 12 named ring_*_0..3. RUNTIME_ENV_RING_COUNT is already exported to Python (m.attr(...)) but unused here. Suggest deriving the format from the constant and unpacking via 3 slices:

from _task_interface import RUNTIME_ENV_RING_COUNT as _N
_CFG_FMT = struct.Struct("=iiiiiii" + "Q" * (3 + 3 * _N) + "1024s")
...
v = _CFG_FMT.unpack_from(buf, _OFF_CONFIG)
ring_task_windows = list(v[10        : 10 +   _N])
ring_heaps        = list(v[10 +   _N : 10 + 2*_N])
ring_dep_pools    = list(v[10 + 2*_N : 10 + 3*_N])

This also removes the manual "15" drift risk.
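
A self-contained version of the same idea, for anyone who wants to try the slicing outside the repo. It does not import `_task_interface`; `RING_COUNT` stands in for `RUNTIME_ENV_RING_COUNT`, the seven leading int fields and the trailing 1024-byte field are filled with placeholders, and the field order follows the snippet above.

```python
import struct

RING_COUNT = 4  # stands in for RUNTIME_ENV_RING_COUNT in this sketch
N_INTS = 7      # leading int32 fields in the format string quoted above
N_SCALARS = 3   # scalar ring_task_window / ring_heap / ring_dep_pool slots

CFG_FMT = struct.Struct(
    "=" + "i" * N_INTS + "Q" * (N_SCALARS + 3 * RING_COUNT) + "1024s"
)


def unpack_ring_arrays(buf, offset=0):
    """Return the three per-ring lists using slices instead of named fields."""
    v = CFG_FMT.unpack_from(buf, offset)
    base = N_INTS + N_SCALARS  # index of the first per-ring value (10 here)
    return tuple(
        list(v[base + k * RING_COUNT : base + (k + 1) * RING_COUNT])
        for k in range(3)
    )


# Round trip with placeholder values for the fields this sketch does not model.
windows = [8192, 16384, 131072, 524288]
heaps = [128 << 20, 256 << 20, 384 << 20, 512 << 20]
pools = [4096, 8192, 16384, 32768]
buf = CFG_FMT.pack(*([0] * N_INTS), 0, 0, 0, *windows, *heaps, *pools, b"")
assert unpack_ring_arrays(buf) == (windows, heaps, pools)
```

Changing `RING_COUNT` resizes both the format and the slices, which is the drift protection the suggestion is after.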

call_config.h static_assert: 15 * sizeof(uint64_t) → (3 + 3 * RUNTIME_ENV_RING_COUNT) * sizeof(uint64_t) so it tracks the constant automatically.
Cross-layer coupling: chip_worker.cpp (common) builds RUNTIME_ENV_RING_COUNT-sized arrays and passes bare pointers the arch runtime indexes with PTO2_MAX_RING_DEPTH. Both are 4 today; a static_assert(RUNTIME_ENV_RING_COUNT == PTO2_MAX_RING_DEPTH, ...) in each arch runtime_maker.cpp (both headers visible) locks them together.
(minor) The [%PRIu64, x4] logs in runtime_maker.cpp / aicpu_executor.cpp enumerate [0..3]; a small format_ring_array() helper would make them depth-agnostic too.

Consider — docs

MULTI_RING.md §7.2 says invalid env values are "silently ignored", but parse_uint_token/apply_env_ring_values LOG_WARN on every failure → suggest "logged and ignored".
K/M/G/T suffix parsing is env-only; CallConfig.runtime_env.ring_heap(s) are uint64 (bytes). Worth a one-line note in the CallConfig example so nobody tries ring_heaps=["128M", ...].

Design discussion — collapse scalar + array into one `1-or-4` list field

The scalar ring_task_window + array ring_task_windows pair duplicates plumbing across binding / wire / C ABI (run_prepared + 4 bind_callable_to_runtime_impl) / resolve_ring_config / scene_test. Since the env side already uses a "1-or-4" rule (one value broadcasts, four = per-ring, 2/3 rejected), the CallConfig side could match it: a single list field accepting length 1 (broadcast) or 4 (per-ring), getter always returning the full ring count — no getter ambiguity, and the wire drops the 3 scalar slots.

Framed as replace (not add), this makes the PR smaller, not larger: it deletes the duplicate scalar plumbing. Migration is bounded to ~5 in-repo call sites ("ring_task_window": X → "ring_task_windows": [X]); there are no out-of-repo consumers, and #1042 (the scalar's origin) is recent, so reshaping it before the per-ring API ossifies is cheap. Trade-off: loses the "custom uniform baseline + per-ring override" layering (marginal — [0,0,0,X] still overrides one ring over defaults). Not blocking, but worth a deliberate decision now while unmerged.

Minor

heap_used_bytes() holds the only % in the ring buffer and has no callers (dead since Add: scope_stats collector for per-scope queue-fill peaks #858) — optional cleanup, out of scope here.

ChaoZheng109 · 2026-06-22T11:54:42Z

 #include <thread>
+#include <unistd.h>
+
+class HalMemCtlFileLock {


This HalMemCtlFileLock block — plus the EACCES retry-window change (kHalMemCtlMaxRetries 3->60, delay 50ms->500ms) and the new PTO2_HALMEMCTL_LOCK_PATH env var — is unrelated to the #1029 per-ring runtime sizing this PR is about. The underlying halMemCtl EACCES / card-usage-overlap contention is a separate infra issue already being handled elsewhere, so it shouldn't ride in this PR. Also note PTO2_HALMEMCTL_LOCK_PATH is a new behavior gate (see .claude/rules/env-macro-gating.md — needs its own justification). Please drop these host_regs.cpp changes from this PR and land them in the dedicated card-lock fix, keeping #1099 scoped to per-ring sizing.

Agreed, thanks for catching this. The host_regs.cpp card-lock / halMemCtl retry changes are unrelated to #1029, so I dropped them from this PR and amended the branch.

The current head (f46d99c8) keeps #1099 scoped to per-ring runtime sizing; src/a2a3/platform/onboard/host/host_regs.cpp is back to the upstream-main behavior. Any dedicated card-lock fix can carry the HalMemCtlFileLock / retry-window / env-gate changes separately with its own justification.

ChaoZheng109 · 2026-06-23T01:08:01Z

+    }
+
+    uint64_t val = 0;
+    if (allow_size_suffix) {


Suggest dropping this K/M/G/T size-suffix path (the whole allow_size_suffix branch). It's the only consumer of the long double / strtold / isfinite / floor machinery here, and adds ~70 lines of new code across a2a3+a5 just to accept 4G / 384M / 1.5G. Removing it collapses parse_uint_token to a single integer (strtoull) path — the same behavior the env parser had before #141 — and lets you drop the allow_size_suffix parameter (2 signatures + 5 call sites) and the <cmath> / <cctype> includes. Heap would then take raw byte counts (e.g. 402653184), consistent with task_window / dep_pool.

Trade-off to call out: issue #1029 requested the PTO2_RING_HEAP=10M,64M,1.5G,4G syntax, so this diverges from that ask — worth a deliberate decision. (a5 mirrors this branch.)

Agreed, I took this simplification. The latest head (3351e59c) drops the K/M/G/T suffix path entirely: parse_uint_token now only uses the integer strtoull path, allow_size_suffix is gone from the helper signatures/call sites, and the docs now show PTO2_RING_HEAP as raw byte counts.

This keeps the #1029 per-ring behavior while making env parsing consistent across task_window, heap, and dep_pool.

ChaoZheng109 · 2026-06-23T01:13:07Z

Review — per-ring runtime sizing (#1029)

Solid PR. The device/SM layer was already per-ring; this correctly widens the host path and threads eff_*[4] end-to-end. a2a3/a5 parity is clean, the wire ABI is updated in lockstep (static_assert + Python _CFG_FMT + round-trip UT), and dropping the pow2 constraint on heap is correct — the heap is a comparison/subtraction ring allocator (try_bump_heap), no masking, so non-pow2 sizes are safe (only task_window needs pow2, and that's kept). Good test coverage.

Should fix

Dead Runtime fields. Runtime::{task_window_size, heap_size, dep_pool_size} (tensormap_and_ringbuffer/runtime/runtime.h) are no longer written (assignment dropped from runtime_maker.cpp) nor read (reads dropped from aicpu_executor.cpp) — only zeroed in runtime.cpp. Please remove the fields + their resets (a2a3 + a5).

Consider — depth extensibility (so PTO2_MAX_RING_DEPTH + 1 doesn't ripple)

Most of the C++ already loops over the ring count, but a few spots hardcode 4 and would all need editing if the depth grows:
worker.py: _CFG_FMT hardcodes "Q" * 15 and _read_config_from_mailbox unpacks 12 named ring_*_0..3. RUNTIME_ENV_RING_COUNT is already exported to Python (m.attr(...)) but unused here. Suggest deriving the format from the constant and unpacking via 3 slices:
from _task_interface import RUNTIME_ENV_RING_COUNT as _N
_CFG_FMT = struct.Struct("=iiiiiii" + "Q" * (3 + 3 * _N) + "1024s")
...
v = _CFG_FMT.unpack_from(buf, _OFF_CONFIG)
ring_task_windows = list(v[10        : 10 +   _N])
ring_heaps        = list(v[10 +   _N : 10 + 2*_N])
ring_dep_pools    = list(v[10 + 2*_N : 10 + 3*_N])
This also removes the manual "15" drift risk.
call_config.h static_assert: 15 * sizeof(uint64_t) → (3 + 3 * RUNTIME_ENV_RING_COUNT) * sizeof(uint64_t) so it tracks the constant automatically.

Cross-layer coupling: chip_worker.cpp (common) builds RUNTIME_ENV_RING_COUNT-sized arrays and passes bare pointers the arch runtime indexes with PTO2_MAX_RING_DEPTH. Both are 4 today; a static_assert(RUNTIME_ENV_RING_COUNT == PTO2_MAX_RING_DEPTH, ...) in each arch runtime_maker.cpp (both headers visible) locks them together.

(minor) The [%PRIu64, x4] logs in runtime_maker.cpp / aicpu_executor.cpp enumerate [0..3]; a small format_ring_array() helper would make them depth-agnostic too.
Consider — docs

MULTI_RING.md §7.2 says invalid env values are "silently ignored", but parse_uint_token/apply_env_ring_values LOG_WARN on every failure → suggest "logged and ignored".

K/M/G/T suffix parsing is env-only; CallConfig.runtime_env.ring_heap(s) are uint64 (bytes). Worth a one-line note in the CallConfig example so nobody tries ring_heaps=["128M", ...].

Design discussion — collapse scalar + array into one 1-or-4 list field

The scalar ring_task_window + array ring_task_windows pair duplicates plumbing across binding / wire / C ABI (run_prepared + 4 bind_callable_to_runtime_impl) / resolve_ring_config / scene_test. Since the env side already uses a "1-or-4" rule (one value broadcasts, four = per-ring, 2/3 rejected), the CallConfig side could match it: a single list field accepting length 1 (broadcast) or 4 (per-ring), getter always returning the full ring count — no getter ambiguity, and the wire drops the 3 scalar slots.

Framed as replace (not add), this makes the PR smaller, not larger: it deletes the duplicate scalar plumbing. Migration is bounded to ~5 in-repo call sites ("ring_task_window": X → "ring_task_windows": [X]); there are no out-of-repo consumers, and #1042 (the scalar's origin) is recent, so reshaping it before the per-ring API ossifies is cheap. Trade-off: loses the "custom uniform baseline + per-ring override" layering (marginal — [0,0,0,X] still overrides one ring over defaults). Not blocking, but worth a deliberate decision now while unmerged.
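
To make the proposed "1-or-4" rule concrete, a minimal sketch of the normalization step follows. `normalize_ring_values` and `RING_COUNT` are hypothetical names used only for illustration; the real binding may validate differently.

```python
RING_COUNT = 4  # hypothetical stand-in for RUNTIME_ENV_RING_COUNT


def normalize_ring_values(values, ring_count=RING_COUNT):
    """Expand a 1-entry list to every ring; accept a full list; reject the rest."""
    values = list(values)
    if len(values) == 1:
        return values * ring_count  # one value broadcasts to all rings
    if len(values) == ring_count:
        return values  # already per-ring
    raise ValueError(
        f"expected 1 or {ring_count} entries, got {len(values)}"
    )
```

With this shape the getter can always return the full ring count, and lengths 2 or 3 fail loudly instead of being padded silently.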

Minor

heap_used_bytes() holds the only % in the ring buffer and has no callers (dead since Add: scope_stats collector for per-scope queue-fill peaks #858) — optional cleanup, out of scope here.

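There is no need to convert "ring_task_window": X → "ring_task_windows": [X]. In the current state, "ring_task_window": X is broadcast to every ring, while "ring_task_windows": [X, Y, Z, H] must supply exactly 4 values; otherwise an error is raised.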

Add per-ring sizing for the tensormap_and_ringbuffer runtime so each scope-depth ring can use independent task-window, heap, and dependency-pool capacities.

This keeps the scalar runtime_env fields introduced by hw-native-sys#1042 and adds per-ring array fields: ring_task_windows[4], ring_heaps[4], and ring_dep_pools[4].

Effective sizing is resolved per resource and per ring with CallConfig values taking precedence over environment variables, followed by compile-time defaults. Environment variables now support either scalar values or exactly four comma-separated per-ring integer values.

The change wires the effective per-ring capacities through both a2a3 and a5 tensormap_and_ringbuffer runtimes, including host runtime creation, AICPU runtime setup, shared-memory layout, scheduler/orchestrator initialization, and scope-stats reporting.

Also reject negative integer env values before unsigned parsing, guard per-ring heap accumulation against overflow, remove obsolete Runtime sizing fields, and keep the mailbox/runtime ring count checks derived from RUNTIME_ENV_RING_COUNT.

Fixes hw-native-sys#1029.
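
The environment-variable rule described in this commit message (one scalar or exactly four comma-separated integers, negatives rejected) can be sketched as below. This is an illustration only: `parse_env_ring_values` is a hypothetical name, and the real parser is C++ and, per the review above, logs a warning on failure where this sketch just returns None.

```python
RING_COUNT = 4  # hypothetical stand-in for RUNTIME_ENV_RING_COUNT


def parse_env_ring_values(text, ring_count=RING_COUNT):
    """Return ring_count ints, or None when the value is unset or invalid."""
    if not text:
        return None
    tokens = [t.strip() for t in text.split(",")]
    if len(tokens) not in (1, ring_count):
        return None  # e.g. 2 or 3 entries are rejected
    # Plain ASCII digits only: rejects "", "-1", and suffixed values like "128M".
    if not all(t.isascii() and t.isdigit() for t in tokens):
        return None
    values = [int(t) for t in tokens]
    # A single scalar is broadcast to every ring.
    return values * ring_count if len(values) == 1 else values
```

Checking for a sign before converting mirrors the "reject negative values before unsigned parsing" point: an unsigned parse of "-1" could otherwise wrap to a huge capacity.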

hw-native-sys#1099 added per-ring array fields (ring_task_windows / ring_heaps / ring_dep_pools) alongside the scalar runtime_env knobs, but neither per_task_runtime_env example exercised them. Extend both the L2 and L3 examples to also cover the per-ring form: each scope-depth ring (0..3) sized independently. The config helpers now iterate a RING_FIELDS tuple so a spec dict can carry either the scalar or the array keys, and the READMEs document the full precedence chain and the --enable-scope-stats verification path.
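
A minimal sketch of what such a helper might look like, assuming the runtime_env object accepts plain attribute assignment. The scalar name `ring_dep_pool` is inferred from the naming pattern, and `apply_ring_spec` and the exact contents of `RING_FIELDS` are hypothetical rather than copied from the examples.

```python
from types import SimpleNamespace

# Assumed field list: three scalar knobs plus their per-ring array twins.
RING_FIELDS = (
    "ring_task_window", "ring_heap", "ring_dep_pool",
    "ring_task_windows", "ring_heaps", "ring_dep_pools",
)


def apply_ring_spec(runtime_env, spec):
    """Copy whichever ring sizing keys the spec dict carries onto runtime_env."""
    for field in RING_FIELDS:
        if field in spec:
            setattr(runtime_env, field, spec[field])
    return runtime_env


# SimpleNamespace stands in for cfg.runtime_env in this sketch.
env = apply_ring_spec(
    SimpleNamespace(),
    {"ring_heap": 128 * 1024 * 1024, "ring_dep_pools": [4096, 8192, 16384, 32768]},
)
```

Iterating one tuple keeps the scalar and array forms on the same code path, so a spec can mix them per resource.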

…ld (#1128)

#1099 exposed ring sizing through two near-identical CallConfig.runtime_env names per resource that differ only by a trailing `s`: `ring_task_window` (scalar broadcast) vs `ring_task_windows` (per-ring array), etc. The one-letter difference is an ergonomics footgun, and the layered "scalar baseline + per-ring override" semantics it bought are not worth the confusing twin names.

Collapse each pair into a single field that accepts EITHER a scalar (broadcast to every ring) OR a 4-entry list (per-ring):

cfg.runtime_env.ring_task_window = 128             # broadcast
cfg.runtime_env.ring_task_window = [128, 0, 0, 0]  # per-ring; 0 falls through

Broadcast happens in the Python binding (int -> [v, v, v, v]); the wire format now carries only the three 4-element arrays (12 uint64, down from 15) and the getter always returns a 4-list. A 0 entry falls through to PTO2_RING_* env -> compile-time default; the separate scalar-CallConfig precedence tier is dropped (accepted trade-off: a 0 in a list no longer falls back to a sibling scalar).

The internal C-API (run_prepared) and wire layout are internal-only and rebuild together via pip install, so this is a clean break with no back-compat shim. Mirrored across a2a3/a5, both runtimes, bindings, scene-test parsing, docs, unit tests, and the per_task_runtime_env examples.

Closes #1126.
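
The fall-through behavior after the collapse can be sketched roughly as follows. The real resolution lives in the C++ runtime; `resolve_ring_sizes` and its parameters are hypothetical, and the sketch assumes the env value has already been parsed into a 4-entry list (or None when unset).

```python
RING_COUNT = 4  # hypothetical stand-in for RUNTIME_ENV_RING_COUNT


def resolve_ring_sizes(cfg_value, env_values, default, ring_count=RING_COUNT):
    """Resolve each ring: CallConfig entry, else env entry, else default."""
    if isinstance(cfg_value, int):
        cfg_value = [cfg_value] * ring_count  # scalar broadcasts to every ring
    if len(cfg_value) != ring_count:
        raise ValueError(f"expected {ring_count} entries, got {len(cfg_value)}")
    if env_values is None:
        env_values = [0] * ring_count  # env unset: nothing to fall through to
    # A 0 entry means "not set here" and falls through to the next tier.
    return [c or e or default for c, e in zip(cfg_value, env_values)]
```

For example, a CallConfig value of [128, 0, 0, 0] pins ring0 only, and the remaining rings take the env value when present or the compile-time default otherwise.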

gemini-code-assist Bot reviewed Jun 22, 2026

Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp Outdated

Comment thread src/a5/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp Outdated

Comment thread docs/dfx/scope-stats.md Outdated

coderabbitai Bot reviewed Jun 22, 2026

TaoZQY force-pushed the feat/per-ring-runtime-env branch 2 times, most recently from 46c0d07 to 864d8f4 Compare June 22, 2026 07:53

TaoZQY force-pushed the feat/per-ring-runtime-env branch from 864d8f4 to 4744287 Compare June 22, 2026 08:32

TaoZQY force-pushed the feat/per-ring-runtime-env branch from 4744287 to 3a02e7c Compare June 22, 2026 08:50

ChaoZheng109 reviewed Jun 22, 2026

TaoZQY force-pushed the feat/per-ring-runtime-env branch from 3a02e7c to f46d99c Compare June 22, 2026 12:55

ChaoZheng109 reviewed Jun 23, 2026

TaoZQY force-pushed the feat/per-ring-runtime-env branch from f46d99c to 3351e59 Compare June 23, 2026 01:21

ChaoZheng109 approved these changes Jun 23, 2026

ChaoZheng109 merged commit c68d9bb into hw-native-sys:main Jun 23, 2026
16 checks passed

ChaoZheng109 mentioned this pull request Jun 23, 2026

docs(examples): demonstrate per-ring runtime_env sizing #1122

Merged

coderabbitai Bot mentioned this pull request Jun 24, 2026

[Optimization] Replace wiring with polling-based task readiness test (~17% median device speedup) #1137

Open

7 tasks

Conversation

TaoZQY commented Jun 22, 2026 • edited

coderabbitai Bot commented Jun 22, 2026 • edited

❌ Failed checks (1 warning)

gemini-code-assist Bot left a comment

Code Review

coderabbitai Bot left a comment

TaoZQY commented Jun 22, 2026

TaoZQY commented Jun 22, 2026

ChaoZheng109 commented Jun 22, 2026

Review — per-ring runtime sizing (#1029)

ChaoZheng109 Jun 22, 2026

TaoZQY Jun 22, 2026

ChaoZheng109 Jun 23, 2026

TaoZQY Jun 23, 2026

ChaoZheng109 commented Jun 23, 2026

Review — per-ring runtime sizing (#1029)
