Feat/per ring runtime env #1029#1099
Conversation
📝 WalkthroughWalkthroughThis PR replaces the three scalar ChangesPer-ring ring-buffer configuration
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Possibly related issues
Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces support for independent, per-ring runtime environment overrides (task window, heap size, and dependency pool capacity) across the four scope-depth rings, updating the C++ runtime, Python bindings, test infrastructure, and documentation, while also relaxing the power-of-two constraint on heap sizes. The review feedback highlights a security vulnerability in both a2a3 and a5 runtime_maker.cpp where negative inputs to std::strtoull can wrap around and bypass validation, and suggests replacing angle-bracket placeholders in markdown bash snippets with standard shell variables to prevent syntax errors.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
There was a problem hiding this comment.
Actionable comments posted: 7
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp`:
- Around line 495-500: When init_per_ring() fails in the AiCpuExecutor, the
runtime_init_ready_ flag is released while rt still points to a partially
initialized runtime object with a zeroed sm_handle. This bypasses the
scheduler-side rt == nullptr guard and allows dispatch to run against invalid
state. Before storing true to runtime_init_ready_ via the store() call with
memory_order_release, clear the rt pointer to nullptr so that scheduler threads
will still see rt as null and avoid dispatching against the broken runtime.
In `@src/a2a3/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp`:
- Around line 455-462: The resolve_ring_config() call happens after tensors have
been allocated and appended to runtime->tensor_pairs_, but if it fails, those
allocated device buffers are not cleaned up before returning -1. Move the
resolve_ring_config() call to execute before the tensor allocation loop that
populates runtime->tensor_pairs_, so that if configuration resolution fails, no
device buffers will have been allocated yet. This ensures the function exits
cleanly without leaving orphaned resources when resolve_ring_config() returns
false.
- Around line 478-481: The accumulation of eff_heap_sizes into total_heap_size
in the loop (iterating from 0 to PTO2_MAX_RING_DEPTH) can overflow since
eff_heap_sizes values can be user-provided up to uint64_t maximum. Add overflow
detection before each addition in the loop to guard against this, either by
checking if adding the next eff_heap_sizes[r] value would exceed the maximum
uint64_t value before performing the addition, or by using a safe addition
function that detects and handles overflow. This ensures total_heap_size remains
accurate when passed to setup_static_arena() and matches the heap partitioning
performed by later code.
In
`@src/a2a3/runtime/tensormap_and_ringbuffer/runtime/shared/pto_runtime2_init.cpp`:
- Around line 273-291: The heap size accumulation in the loop over
PTO2_MAX_RING_DEPTH does not check for integer overflow when summing heap_sizes
values into gm_heap_size and heap_offset. If the cumulative heap sizes exceed
the maximum value of uint64_t, gm_heap_size will wrap around and heap_offset
calculations will create overlapping or invalid ring heap bases. Create a
checked-sum helper function that detects overflow conditions and returns false
or nullptr when overflow would occur. Apply this checked-sum validation before
any heap_offset calculations and before calling
orch->rings[r].task_allocator.init(), ensuring the function exits early if
overflow is detected. This fix should be applied to all locations where similar
heap size accumulation occurs, including the additional location referenced in
the comment.
In `@src/a5/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp`:
- Around line 455-462: The call to resolve_ring_config() happens after device
buffers have already been allocated and appended to runtime->tensor_pairs_ in an
earlier loop, but if resolve_ring_config() fails and returns false, the function
exits without cleaning up those allocated resources. Move the
resolve_ring_config() call to execute before the tensor staging loop that
allocates device buffers and appends to runtime->tensor_pairs_, so that
configuration validation occurs first and prevents resource leaks when
resolution fails.
- Around line 478-481: The accumulation loop where total_heap_size is built by
summing eff_heap_sizes[r] values can overflow since eff_heap_sizes[r] is
user-provided and can reach uint64_t::max. Add overflow protection before each
addition operation in the loop (where r iterates from 0 to PTO2_MAX_RING_DEPTH)
to ensure that adding eff_heap_sizes[r] to total_heap_size will not wrap around.
If overflow would occur, handle it by either capping the total_heap_size to
uint64_t::max or rejecting the configuration to prevent passing an undersized
heap size to setup_static_arena() that would mismatch the per-ring partitioning
done later.
In
`@src/a5/runtime/tensormap_and_ringbuffer/runtime/shared/pto_runtime2_init.cpp`:
- Around line 265-283: The code accumulates heap sizes into gm_heap_size and
heap_offset without checking for integer overflow, which can cause uint64_t
wraps and produce overlapping or invalid ring heap bases. Create a checked-sum
helper function that detects overflow when adding values to a running total,
then use this helper to validate that the sum of all heap_sizes values doesn't
overflow before the loop that accumulates gm_heap_size and before initializing
the task allocators via orch->rings[r].task_allocator.init. Return false or
nullptr on overflow detection instead of proceeding with initialization. Apply
the same overflow-checking pattern to the similar code section mentioned at
lines 407-415.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: d1798107-5a99-489a-99ec-af841467246d
📒 Files selected for processing (39)
docs/dfx/scope-stats.mdexamples/a2a3/tensormap_and_ringbuffer/paged_attention_ringbuffer/test_paged_attention_ringbuffer.pyexamples/workers/l2/per_task_runtime_env/README.mdexamples/workers/l2/per_task_runtime_env/main.pypython/bindings/task_interface.cpppython/simpler/worker.pysimpler_setup/scene_test.pysrc/a2a3/runtime/host_build_graph/host/runtime_maker.cppsrc/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cppsrc/a2a3/runtime/tensormap_and_ringbuffer/docs/MULTI_RING.mdsrc/a2a3/runtime/tensormap_and_ringbuffer/host/runtime_maker.cppsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.hsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_ring_buffer.hsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.hsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_shared_memory.hsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.hsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/shared/pto_runtime2_init.cppsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/shared/pto_shared_memory.cppsrc/a5/runtime/host_build_graph/host/runtime_maker.cppsrc/a5/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cppsrc/a5/runtime/tensormap_and_ringbuffer/docs/MULTI_RING.mdsrc/a5/runtime/tensormap_and_ringbuffer/host/runtime_maker.cppsrc/a5/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.hsrc/a5/runtime/tensormap_and_ringbuffer/runtime/pto_ring_buffer.hsrc/a5/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.hsrc/a5/runtime/tensormap_and_ringbuffer/runtime/pto_shared_memory.hsrc/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.hsrc/a5/runtime/tensormap_and_ringbuffer/runtime/shared/pto_runtime2_init.cppsrc/a5/runtime/tensormap_and_ringbuffer/runtime/shared/pto_shared_memory.cppsrc/common/platform/onboard/host/c_api_shared.cppsrc/common/platform/sim/host/c_api_shared.cppsrc/common/task_interface/call_config.hsrc/common/worker/chip_worker.cppsrc/common/worker/chip_worker.hsrc/common/worker/pto_runtime_c_api.htests/ut/cpp/a2a3/test_shared_memory.cpptests/ut/cpp/a5/test_shared_memory.cpptests/ut/cpp/types/test_call_config.cpptests/ut/py/test_chip_worker.py
46c0d07 to
864d8f4
Compare
|
Addressed the AI review feedback in Changes made:
Validation:
I did not address CodeRabbit's generic docstring coverage warning because it is not part of this repository's enforced checks and the touched implementation is primarily C++ runtime code; adding unrelated Python docstrings would be noise for this PR. |
864d8f4 to
4744287
Compare
|
Updated the PR branch to The previous failure was the first Local validation:
The new workflow run is currently waiting for maintainer approval before jobs are allowed to start: https://github.com/hw-native-sys/simpler/actions/runs/27939901712 |
4744287 to
3a02e7c
Compare
Review — per-ring runtime sizing (#1029)Solid PR. The device/SM layer was already per-ring; this correctly widens the host path and threads Should fix
Consider — depth extensibility (so
|
| #include <thread> | ||
| #include <unistd.h> | ||
|
|
||
| class HalMemCtlFileLock { |
There was a problem hiding this comment.
This HalMemCtlFileLock block — plus the EACCES retry-window change (kHalMemCtlMaxRetries 3->60, delay 50ms->500ms) and the new PTO2_HALMEMCTL_LOCK_PATH env var — is unrelated to the #1029 per-ring runtime sizing this PR is about. The underlying halMemCtl EACCES / card-usage-overlap contention is a separate infra issue already being handled elsewhere, so it shouldn't ride in this PR. Also note PTO2_HALMEMCTL_LOCK_PATH is a new behavior gate (see .claude/rules/env-macro-gating.md — needs its own justification). Please drop these host_regs.cpp changes from this PR and land them in the dedicated card-lock fix, keeping #1099 scoped to per-ring sizing.
There was a problem hiding this comment.
Agreed, thanks for catching this. The host_regs.cpp card-lock / halMemCtl retry changes are unrelated to #1029, so I dropped them from this PR and amended the branch.
The current head (f46d99c8) keeps #1099 scoped to per-ring runtime sizing; src/a2a3/platform/onboard/host/host_regs.cpp is back to the upstream-main behavior. Any dedicated card-lock fix can carry the HalMemCtlFileLock / retry-window / env-gate changes separately with its own justification.
3a02e7c to
f46d99c
Compare
| } | ||
|
|
||
| uint64_t val = 0; | ||
| if (allow_size_suffix) { |
There was a problem hiding this comment.
Suggest dropping this K/M/G/T size-suffix path (the whole allow_size_suffix branch). It's the only consumer of the long double / strtold / isfinite / floor machinery here, and adds ~70 lines of new code across a2a3+a5 just to accept 4G / 384M / 1.5G. Removing it collapses parse_uint_token to a single integer (strtoull) path — the same behavior the env parser had before #141 — and lets you drop the allow_size_suffix parameter (2 signatures + 5 call sites) and the <cmath> / <cctype> includes. Heap would then take raw byte counts (e.g. 402653184), consistent with task_window / dep_pool.
Trade-off to call out: issue #1029 requested the PTO2_RING_HEAP=10M,64M,1.5G,4G syntax, so this diverges from that ask — worth a deliberate decision. (a5 mirrors this branch.)
There was a problem hiding this comment.
Agreed, I took this simplification. The latest head (3351e59c) drops the K/M/G/T suffix path entirely: parse_uint_token now only uses the integer strtoull path, allow_size_suffix is gone from the helper signatures/call sites, and the docs now show PTO2_RING_HEAP as raw byte counts.
This keeps the #1029 per-ring behavior while making env parsing consistent across task_window, heap, and dep_pool.
不用做"ring_task_window": X → "ring_task_windows": [X]的转变,现状是"ring_task_window": X广播到每个ring,"ring_task_windows": [X, Y, Z, H]必须输入4个,否则报错 |
Add per-ring sizing for the tensormap_and_ringbuffer runtime so each scope-depth ring can use independent task-window, heap, and dependency-pool capacities. This keeps the scalar runtime_env fields introduced by hw-native-sys#1042 and adds per-ring array fields: ring_task_windows[4], ring_heaps[4], and ring_dep_pools[4]. Effective sizing is resolved per resource and per ring with CallConfig values taking precedence over environment variables, followed by compile-time defaults. Environment variables now support either scalar values or exactly four comma-separated per-ring integer values. The change wires the effective per-ring capacities through both a2a3 and a5 tensormap_and_ringbuffer runtimes, including host runtime creation, AICPU runtime setup, shared-memory layout, scheduler/orchestrator initialization, and scope-stats reporting. Also reject negative integer env values before unsigned parsing, guard per-ring heap accumulation against overflow, remove obsolete Runtime sizing fields, and keep the mailbox/runtime ring count checks derived from RUNTIME_ENV_RING_COUNT. Fixes hw-native-sys#1029.
f46d99c to
3351e59
Compare
hw-native-sys#1099 added per-ring array fields (ring_task_windows / ring_heaps / ring_dep_pools) alongside the scalar runtime_env knobs, but neither per_task_runtime_env example exercised them. Extend both the L2 and L3 examples to also cover the per-ring form: each scope-depth ring (0..3) sized independently. The config helpers now iterate a RING_FIELDS tuple so a spec dict can carry either the scalar or the array keys, and the READMEs document the full precedence chain and the --enable-scope-stats verification path.
#1099 added per-ring array fields (ring_task_windows / ring_heaps / ring_dep_pools) alongside the scalar runtime_env knobs, but neither per_task_runtime_env example exercised them. Extend both the L2 and L3 examples to also cover the per-ring form: each scope-depth ring (0..3) sized independently. The config helpers now iterate a RING_FIELDS tuple so a spec dict can carry either the scalar or the array keys, and the READMEs document the full precedence chain and the --enable-scope-stats verification path.
…ld (#1128) #1099 exposed ring sizing through two near-identical CallConfig.runtime_env names per resource that differ only by a trailing `s` — `ring_task_window` (scalar broadcast) vs `ring_task_windows` (per-ring array), etc. The one-letter difference is an ergonomics footgun and the layered "scalar baseline + per-ring override" semantics it bought are not worth the confusing twin names. Collapse each pair into a single field that accepts EITHER a scalar (broadcast to every ring) OR a 4-entry list (per-ring): cfg.runtime_env.ring_task_window = 128 # broadcast cfg.runtime_env.ring_task_window = [128, 0, 0, 0] # per-ring; 0 falls through Broadcast happens in the Python binding (int -> [v, v, v, v]); the wire format now carries only the three 4-element arrays (12 uint64, down from 15) and the getter always returns a 4-list. A 0 entry falls through to PTO2_RING_* env -> compile-time default; the separate scalar-CallConfig precedence tier is dropped (accepted trade-off — a 0 in a list no longer falls back to a sibling scalar). The internal C-API (run_prepared) and wire layout are internal-only and rebuild together via pip install, so this is a clean break with no back-compat shim. Mirrored across a2a3/a5, both runtimes, bindings, scene-test parsing, docs, unit tests, and the per_task_runtime_env examples. Closes #1126.
Summary
Add per-ring runtime sizing for the
tensormap_and_ringbufferruntime.This keeps the scalar
runtime_envfields introduced by #1042 and adds per-ring array fields so a single task can sizering0..ring3independently:ring_task_windows[4]ring_heaps[4]ring_dep_pools[4]This addresses #1029, where deep kernels have very different resource pressure across scope-depth rings. A single scalar value can either be too small for deeper rings or too large for shallow rings, causing unnecessary shared-memory growth.
Behavior
Effective sizing is resolved per resource and per ring:
Environment variables now support either scalar values or four comma-separated per-ring values:
PTO2_RING_HEAPvalues are integer byte counts; size suffixes such asK/M/G/Tare not supported.Usage
Users can configure per-ring sizing through environment variables or through
CallConfig.runtime_env.Environment variables
Scalar values are still supported and are broadcast to all rings:
Per-ring values use exactly four comma-separated entries, indexed by
ring0..ring3:CallConfig
Ring entries map to scope depth as follows:
Verification
Run a T&R scene test with
--enable-scope-stats:Then inspect the first line of
scope_stats.jsonl. The metadata contains the effective capacities:These arrays are indexed by
ring, so they can confirm the per-ring configuration took effect.Validated with
qwen3_14b_decodeand--enable-scope-stats:Documentation
The new usage is documented in:
src/a2a3/runtime/tensormap_and_ringbuffer/docs/MULTI_RING.mdsrc/a5/runtime/tensormap_and_ringbuffer/docs/MULTI_RING.mddocs/dfx/scope-stats.md