Add: runtime timeout env overrides #1127

Merged
ChaoZheng109 merged 1 commit into hw-native-sys:main from sunkaixuan2018:fix-issue-1112 on Jun 26, 2026

Conversation

@sunkaixuan2018 sunkaixuan2018 commented Jun 24, 2026

Contributor

Summary

  • Add host-side parsing for PTO2_OP_EXECUTE_TIMEOUT_US, PTO2_STREAM_SYNC_TIMEOUT_MS, and PTO2_SCHEDULER_TIMEOUT_MS
  • Propagate scheduler timeout overrides into a2a3/a5 runtime layouts for AICPU scheduler use
  • Keep invalid env values and broken onboard ordering on safe defaults with warnings
  • Use the resolved stream-sync timeout in normal stream/device sync and force-reset drain paths
  • Document the runtime env overrides and add unit coverage for parsing/order validation
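
The "invalid env values fall back to safe defaults with warnings" behavior in the summary can be sketched as below. This is a minimal illustration, not the code from `runtime_timeout_config.h`: the helper names `parse_positive_u32` and `timeout_from_env` are hypothetical, and rejecting zero is an assumption.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <limits>
#include <string>

// Hypothetical helper: strictly parse a positive decimal value.
// Rejects empty, signed, non-numeric, zero (assumption), and out-of-range text.
inline bool parse_positive_u32(const char *text, uint32_t *out) {
    if (text == nullptr || out == nullptr) return false;
    std::string token(text);
    const char *ws = " \t\r\n";
    const std::string::size_type begin = token.find_first_not_of(ws);
    if (begin == std::string::npos) return false;  // empty or all whitespace
    const std::string::size_type end = token.find_last_not_of(ws);
    token = token.substr(begin, end - begin + 1);
    for (char c : token) {
        if (c < '0' || c > '9') return false;  // no sign, no unit suffix
    }
    errno = 0;
    const unsigned long long value = std::strtoull(token.c_str(), nullptr, 10);
    if (errno == ERANGE) return false;
    if (value == 0 || value > std::numeric_limits<uint32_t>::max()) return false;
    *out = static_cast<uint32_t>(value);
    return true;
}

// Hypothetical helper: read an env var, keep the default on any bad value.
inline uint32_t timeout_from_env(const char *name, uint32_t fallback) {
    const char *raw = std::getenv(name);
    if (raw == nullptr) return fallback;  // variable not set: silent default
    uint32_t parsed = 0;
    if (!parse_positive_u32(raw, &parsed)) {
        std::fprintf(stderr, "warning: ignoring invalid %s='%s', using %u\n",
                     name, raw, static_cast<unsigned>(fallback));
        return fallback;
    }
    return parsed;
}
```

A caller would resolve each variable once, for example `timeout_from_env("PTO2_SCHEDULER_TIMEOUT_MS", 2000)`, so a typo in the environment degrades to the compiled-in default instead of a zero or wrapped-around timeout.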

Timeout ordering policy

  • On onboard builds, env overrides are accepted only when scheduler < op-execute < stream-sync
  • The host also requires stream-sync > scheduler + 1.5s to leave room for cold init before the AICPU scheduler no-progress timer is armed
  • The 1.5s guard covers fixed/cold costs such as kernel registration, orchestration SO dlopen, runtime init, and AICore handshake
  • The max orchestration producer wall time is graph-specific and is not knowable during env parsing, so users that raise scheduler/op timeouts must size PTO2_STREAM_SYNC_TIMEOUT_MS for their worst-case orchestration window
  • Sim builds skip onboard ordering because there is no STARS or ACL stream-sync timeout; only the scheduler env changes the sim scheduler budget
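
The onboard ordering policy above mixes units (microseconds for op-execute, milliseconds for the other two), so a check has to normalize before comparing. The following is a minimal sketch under stated assumptions: the struct, enum, and function names are hypothetical, and only the two rules listed above (strict ordering and the 1.5 s guard) are modeled.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical mirror of the three timeouts; field names are illustrative.
struct TimeoutsSketch {
    uint32_t scheduler_ms;
    uint32_t op_execute_us;
    uint32_t stream_sync_ms;
};

enum class OrderCheck {
    Ok,
    SchedulerNotBelowOpExecute,
    OpExecuteNotBelowStreamSync,
    StreamSyncGuardTooSmall,
};

// 1.5 s of headroom for cold init, taken from the policy text above.
constexpr uint64_t kColdInitGuardUs = 1500000;

inline OrderCheck check_onboard_order(const TimeoutsSketch &t) {
    // Compare everything in microseconds; 64-bit avoids overflow on ms * 1000.
    const uint64_t sched_us = static_cast<uint64_t>(t.scheduler_ms) * 1000;
    const uint64_t op_us = t.op_execute_us;
    const uint64_t sync_us = static_cast<uint64_t>(t.stream_sync_ms) * 1000;

    if (sched_us >= op_us) return OrderCheck::SchedulerNotBelowOpExecute;
    if (op_us >= sync_us) return OrderCheck::OpExecuteNotBelowStreamSync;
    if (sync_us <= sched_us + kColdInitGuardUs) return OrderCheck::StreamSyncGuardTooSmall;
    return OrderCheck::Ok;
}
```

With the defaults mentioned in this thread (scheduler 2000 ms, op-execute 3 s, stream-sync 4 s) the check passes: 2 s < 3 s < 4 s, and 4 s exceeds 2 s + 1.5 s. Raising only the scheduler timeout to 3000 ms fails the first rule.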

Testing

  • git diff --check
  • myserver: python tests/lint/check_headers.py $(git diff --name-only)
  • myserver: python tests/lint/check_english_only.py $(git diff --name-only)
  • myserver: ./tests/ut/cpp/build/test_runtime_timeout_config (5/5 passed)
  • myserver: PYTHONNOUSERSITE=1 PYTHONPATH=$PWD:$PWD/python python -S simpler_setup/build_runtimes.py --platforms a2a3sim a5sim

Fixes #1112

coderabbitai Bot commented Jun 24, 2026


Walkthrough

Introduces three env-var overrides (PTO2_OP_EXECUTE_TIMEOUT_US, PTO2_STREAM_SYNC_TIMEOUT_MS, PTO2_SCHEDULER_TIMEOUT_MS) for hang-detection timeouts without rebuilding. A new shared header provides parsing, validation, and ordering-constraint utilities. Host-side timeouts are resolved in DeviceRunnerBase; the onboard scheduler timeout is propagated through PTO2RuntimeArenaLayout to the AICPU scheduler dispatch loop. Applied symmetrically to both a2a3 and a5 platforms.

Changes

Env-var configurable runtime timeouts

| Layer | File(s) | Summary |
| --- | --- | --- |
| RuntimeTimeoutConfig parsing and validation infrastructure | src/common/platform/include/host/runtime_timeout_config.h | New header defines env-var name constants, RuntimeTimeoutConfig/RuntimeTimeoutParseStatus structs, RuntimeTimeoutOrderStatus enum, and inline helpers for token trimming, unsigned-int parsing, apply_runtime_timeout_override overloads, resolve_runtime_timeout_config, and validate_runtime_timeout_order. |
| Platform timeout constants and spin_hint wiring | src/a2a3/platform/include/common/platform_config.h, src/a5/platform/include/common/platform_config.h, src/common/platform/onboard/aicpu/spin_hint.h, src/common/platform/sim/aicpu/spin_hint.h | Both platform configs add PLATFORM_ONBOARD_SCHEDULER_TIMEOUT_MS (2000 ms) and expand timeout documentation. Onboard spin_hint.h replaces the hardcoded 2000 ms with a constexpr alias; sim spin_hint.h gains a comment about the env override. |
| DeviceRunnerBase: resolved timeout config member and ACL call wiring | src/common/platform/onboard/host/device_runner_base.h, src/common/platform/onboard/host/device_runner_base.cpp | DeviceRunnerBase gains a RuntimeTimeoutConfig timeout_config_ member. An anonymous helper resolves and validates env overrides at attach time. configure_aicore_op_timeout and both sync_run_streams ACL calls switch to timeout_config_ values. |
| Per-platform device runner stream-sync override | src/a2a3/platform/onboard/host/device_runner.cpp, src/a5/platform/onboard/host/device_runner.cpp | Both DeviceRunner::recover_device_or_mark_unusable change aclrtSynchronizeDeviceWithTimeout to use timeout_config_.stream_sync_timeout_ms instead of the platform constant. |
| Scheduler timeout: arena layout field, runtime_maker resolution, onboard dispatch use | src/a2a3/runtime/.../pto_runtime2.h, src/a5/runtime/.../pto_runtime2.h, src/a{2a3,5}/runtime/.../host/runtime_maker.cpp, src/a{2a3,5}/runtime/.../scheduler/scheduler_dispatch.cpp | PTO2RuntimeArenaLayout gains scheduler_timeout_ms (default 0). Both runtime_maker.cpp files add is_sim_platform and resolve_scheduler_timeout_ms and populate the layout field. Both scheduler_dispatch.cpp files compute scheduler_timeout_cycles from the layout field and use it in the idle-loop stall check. |
| Tests, build wiring, and documentation | tests/ut/cpp/CMakeLists.txt, tests/ut/cpp/common/test_runtime_timeout_config.cpp, docs/dfx/tensor-dump.md | New GoogleTest suite covers defaults, valid overrides, invalid tokens, and ordering violations. CMakeLists.txt registers the target. tensor-dump.md documents the three env vars, fallback/warning behavior, and sim-only applicability. |

Sequence Diagram(s)

sequenceDiagram
  participant User as User (env vars set)
  participant DeviceRunnerBase
  participant resolve_onboard_timeout_config
  participant ACL as ACL (aclrtSet / aclrtSynchronize)
  participant runtime_maker as runtime_maker.cpp
  participant resolve_scheduler_timeout_ms
  participant PTO2RuntimeArenaLayout
  participant SchedulerDispatch as SchedulerContext (AICPU)

  rect rgba(70, 130, 180, 0.5)
    Note over DeviceRunnerBase, ACL: Host-side timeout resolution (op-execute + stream-sync)
    DeviceRunnerBase->>resolve_onboard_timeout_config: attach_current_thread (first call)
    resolve_onboard_timeout_config->>resolve_onboard_timeout_config: getenv PTO2_OP_EXECUTE_TIMEOUT_US, PTO2_STREAM_SYNC_TIMEOUT_MS
    resolve_onboard_timeout_config->>resolve_onboard_timeout_config: validate_runtime_timeout_order
    resolve_onboard_timeout_config-->>DeviceRunnerBase: RuntimeTimeoutConfig timeout_config_
    DeviceRunnerBase->>ACL: aclrtSetOpExecuteTimeOutV2(timeout_config_.op_execute_timeout_us)
    DeviceRunnerBase->>ACL: aclrtSynchronizeStreamWithTimeout(timeout_config_.stream_sync_timeout_ms)
  end

  rect rgba(34, 139, 34, 0.5)
    Note over runtime_maker, SchedulerDispatch: Scheduler timeout propagation to AICPU device
    runtime_maker->>resolve_scheduler_timeout_ms: getenv PTO2_SCHEDULER_TIMEOUT_MS + ordering check
    resolve_scheduler_timeout_ms-->>runtime_maker: scheduler_timeout_ms (or 0)
    runtime_maker->>PTO2RuntimeArenaLayout: layout.scheduler_timeout_ms = value
    Note over PTO2RuntimeArenaLayout: struct DMA'd to device arena
    SchedulerDispatch->>PTO2RuntimeArenaLayout: read prebuilt_layout.scheduler_timeout_ms
    SchedulerDispatch->>SchedulerDispatch: scheduler_timeout_cycles = ms × (FREQ/1000)
    SchedulerDispatch->>SchedulerDispatch: stall check: elapsed > scheduler_timeout_cycles → emergency shutdown
  end
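
The last two steps of the diagram (milliseconds to cycles, then the stall comparison) can be sketched as follows. This is an illustration only: the function names and the counter frequency parameter are hypothetical, and treating a layout value of 0 as "no override, use the compiled-in default" is an assumption based on the field's documented default of 0.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: 0 in the layout field means "no env override was published".
inline uint32_t effective_timeout_ms(uint32_t layout_timeout_ms, uint32_t platform_default_ms) {
    return layout_timeout_ms != 0 ? layout_timeout_ms : platform_default_ms;
}

// ms x (FREQ / 1000), as in the diagram; 64-bit math avoids overflow.
// counter_freq_hz is a hypothetical parameter for the cycle counter frequency.
inline uint64_t timeout_ms_to_cycles(uint32_t timeout_ms, uint64_t counter_freq_hz) {
    assert(counter_freq_hz >= 1000);
    return static_cast<uint64_t>(timeout_ms) * (counter_freq_hz / 1000);
}

// Stall check: no progress for strictly more than the timeout budget.
inline bool scheduler_stalled(uint64_t now_cycles, uint64_t last_progress_cycles,
                              uint64_t timeout_cycles) {
    return now_cycles - last_progress_cycles > timeout_cycles;
}
```

Dividing the frequency by 1000 before multiplying matches the formula in the diagram and keeps the intermediate value small, at the cost of truncating any sub-kHz remainder of the frequency.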
Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • hw-native-sys/simpler#930: Modifies the same SchedulerContext::resolve_and_dispatch idle-loop timeout latch in scheduler_dispatch.cpp that this PR makes runtime-configurable via scheduler_timeout_ms.
  • hw-native-sys/simpler#1035: Touches the same scheduler hang/timeout exit path in scheduler_dispatch.cpp, changing tensor-dump flush behavior on timeout — directly adjacent to the scheduler_timeout_cycles comparison site modified here.
  • hw-native-sys/simpler#1063: Changes the platform-specific default SCHEDULER_TIMEOUT_MS constants that SCHEDULER_TIMEOUT_CYCLES derives from — the same defaults this PR wraps with env-var override logic.

Poem

🐇 Hopping through the timeout maze,
No more recompile haze!
Set your env var, watch it fly,
PTO2_SCHEDULER_TIMEOUT_MS — oh my!
Ordering checked, warnings in place,
The rabbit runs at configurable pace. 🕐

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 15.79% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Linked Issues check | ✅ Passed | The PR implements the requested env overrides for host op-execute/stream-sync and device scheduler timeouts, with validation, defaults, and a2a3/a5 propagation. |
| Out of Scope Changes check | ✅ Passed | The docs, tests, and platform-specific updates all support the timeout override feature and do not appear unrelated. |
| Description check | ✅ Passed | The description matches the timeout override, runtime propagation, documentation, and test additions shown in the changeset. |
| Title check | ✅ Passed | The title clearly matches the main change: adding runtime timeout environment variable overrides. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/a2a3/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp`:
- Around line 276-278: The sim-platform override in
runtime_maker::get_scheduler_timeout_ms is accepted too early, so an invalid
env-provided scheduler timeout can bypass the expected scheduler < op_execute <
stream_sync ordering. Update the timeout selection logic in runtime_maker.cpp to
validate the candidate scheduler timeout against the related op_execute and
stream_sync values before returning it, and if the ordering is broken, fall back
to the safe default instead of propagating the override.

In `@src/common/platform/include/host/runtime_timeout_config.h`:
- Around line 122-123: Reset the parse-status fields at the start of
resolve_runtime_timeout_config so a reused RuntimeTimeoutParseStatus cannot
carry stale state between calls. In resolve_runtime_timeout_config, before any
env-var parsing logic runs, explicitly initialize all *_env_set and *_valid
members on the optional status object to a clean default, then proceed with the
existing updates for each parsed field. This keeps downstream checks based on
RuntimeTimeoutParseStatus accurate even when the same status instance is passed
in multiple times.
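
The second finding (stale parse status on reuse) can be illustrated with a reduced sketch. The struct below models only one `*_env_set` / `*_valid` pair, and all names are hypothetical stand-ins, not the fields of the real RuntimeTimeoutParseStatus; the raw text is passed in directly instead of being read from the environment.

```cpp
#include <cassert>
#include <cctype>
#include <cstdint>
#include <cstdlib>

// Hypothetical, reduced stand-in for a parse-status struct.
struct ParseStatusSketch {
    bool scheduler_env_set = false;
    bool scheduler_valid = false;
};

// raw == nullptr models "env var not set". status may be nullptr.
inline uint32_t resolve_scheduler_ms_sketch(const char *raw, uint32_t fallback_ms,
                                            ParseStatusSketch *status) {
    // Reset first, so a reused status object cannot carry state from a prior call.
    if (status != nullptr) *status = ParseStatusSketch{};

    if (raw == nullptr) return fallback_ms;
    if (status != nullptr) status->scheduler_env_set = true;

    if (*raw == '\0') return fallback_ms;
    for (const char *p = raw; *p != '\0'; ++p) {
        if (!std::isdigit(static_cast<unsigned char>(*p))) return fallback_ms;
    }
    const unsigned long long value = std::strtoull(raw, nullptr, 10);
    if (value == 0 || value > UINT32_MAX) return fallback_ms;

    if (status != nullptr) status->scheduler_valid = true;
    return static_cast<uint32_t>(value);
}
```

Without the reset line, a call with the variable unset that follows a successful call would still report `scheduler_env_set == true`, which is the stale-state case the finding describes.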
ℹ️ Review info
📥 Commits

Reviewing files that changed from the base of the PR and between ae59a8e and 4b294e5.

📒 Files selected for processing (18)
  • docs/dfx/tensor-dump.md
  • src/a2a3/platform/include/common/platform_config.h
  • src/a2a3/platform/onboard/host/device_runner.cpp
  • src/a2a3/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.h
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp
  • src/a5/platform/include/common/platform_config.h
  • src/a5/platform/onboard/host/device_runner.cpp
  • src/a5/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.h
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp
  • src/common/platform/include/host/runtime_timeout_config.h
  • src/common/platform/onboard/aicpu/spin_hint.h
  • src/common/platform/onboard/host/device_runner_base.cpp
  • src/common/platform/onboard/host/device_runner_base.h
  • src/common/platform/sim/aicpu/spin_hint.h
  • tests/ut/cpp/CMakeLists.txt
  • tests/ut/cpp/common/test_runtime_timeout_config.cpp

Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp Outdated
Comment thread src/common/platform/include/host/runtime_timeout_config.h
@sunkaixuan2018 sunkaixuan2018 force-pushed the fix-issue-1112 branch 2 times, most recently from c672a45 to cfcc53a Compare June 24, 2026 07:06
@sunkaixuan2018 sunkaixuan2018 changed the title Add runtime timeout env overrides Add: runtime timeout env overrides Jun 24, 2026
@sunkaixuan2018 sunkaixuan2018 force-pushed the fix-issue-1112 branch 2 times, most recently from c8450d9 to d1e8d3a Compare June 25, 2026 08:24
@ChaoZheng109

Collaborator

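Review: the scheduler timeout cannot actually be raised on sim

Overall the onboard path is cleanly implemented, a2a3/a5 are symmetric, and the unit tests cover the header primitives 👍. But there is one must-fix, plus a few minor items.

🔴 Must fix: the sim scheduler override is capped by the onboard op/stream constants

resolve_scheduler_timeout_ms() (runtime_maker.cpp) is runtime code, so it is compiled into both sim and onboard builds. It unconditionally validates the scheduler env against PLATFORM_OP_EXECUTE_TIMEOUT_US (3 s) and PLATFORM_STREAM_SYNC_TIMEOUT_MS (4 s), yet neither constant takes effect on sim (sim has no STARS / ACL stream-sync timeout).

Consequences:

The PR description says "Sim builds skip onboard ordering … only the scheduler env changes the sim scheduler budget", but the code does not skip it. The args-dump.md documentation in turn says sim is "still validated", so the PR description and the docs contradict each other.

Suggestion: on sim, skip the op/stream ordering and only validate scheduler > 0. This revision already has #include "host/platform_compile_info.h" but never uses it; its get_platform() is enough to tell the two apart: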

const char *plat = get_platform();
bool is_sim = plat != nullptr && std::strstr(plat, "sim") != nullptr;
if (is_sim) {
    return cfg.scheduler_timeout_ms;  // sim has no STARS/ACL timeout, so scheduler is not coupled to op/stream
}
// Only onboard runs the full ordering validation
RuntimeTimeoutOrderStatus status = validate_runtime_timeout_order(cfg);
...

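🟡 Should fix

  1. Dead field + duplicate parsing: resolve_onboard_timeout_config() in device_runner_base.cpp parses and validates timeout_config_.scheduler_timeout_ms, but that member is never read (the value actually sent to the device comes from a separate resolve_scheduler_timeout_ms() in runtime_maker). The result is two sources of truth for the same environment variable, and an invalid value logs its warning twice. Suggest parsing it only once.
  2. Scrambled comment line breaks: the doc comments in platform_config.h (a2a3 + a5) break mid-sentence (e.g. STARS actively / * monitors / * AICore); please reflow them to natural line breaks.
  3. Unused include: runtime_maker.cpp includes host/platform_compile_info.h but never calls into it. Either use it as described above or remove it.
  4. Missing sim-path test: resolve_scheduler_timeout_ms() and the sim/onboard branch have zero coverage; add a case that pins down the sim behavior.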
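
A test that pins the sim/onboard split could exercise a pure function like the one sketched here. This is an assumption-laden illustration, not the merged code: the struct and function names are hypothetical, the platform string is passed in rather than read from get_platform(), returning 0 to mean "keep the compiled-in default" is an assumption, and the 1.5 s stream-sync guard is omitted for brevity.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical mirror of the resolved config; field names are illustrative.
struct TimeoutCfgSketch {
    uint32_t scheduler_timeout_ms;
    uint32_t op_execute_timeout_us;
    uint32_t stream_sync_timeout_ms;
};

// Platform names such as "a2a3sim" / "a5sim" contain "sim"; "a2a3" / "a5" do not.
inline bool is_sim_platform_name(const char *plat) {
    return plat != nullptr && std::strstr(plat, "sim") != nullptr;
}

// Onboard-only rule: scheduler < op-execute < stream-sync, compared in microseconds.
inline bool onboard_order_ok(const TimeoutCfgSketch &cfg) {
    const uint64_t sched_us = static_cast<uint64_t>(cfg.scheduler_timeout_ms) * 1000;
    const uint64_t sync_us = static_cast<uint64_t>(cfg.stream_sync_timeout_ms) * 1000;
    return sched_us < cfg.op_execute_timeout_us && cfg.op_execute_timeout_us < sync_us;
}

// Returns the override to publish, or 0 to keep the compiled-in default (assumption).
inline uint32_t pick_scheduler_override_ms(const char *plat, const TimeoutCfgSketch &cfg) {
    if (cfg.scheduler_timeout_ms == 0) return 0;  // scheduler must be > 0 everywhere
    if (is_sim_platform_name(plat)) {
        return cfg.scheduler_timeout_ms;  // sim: no op/stream coupling
    }
    return onboard_order_ok(cfg) ? cfg.scheduler_timeout_ms : 0;
}
```

Keeping the platform name as a parameter is what makes the branch testable from a host-side unit test: the same 10 s scheduler value can be asserted as accepted for "a2a3sim" and rejected for "a2a3" without building two binaries.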

Verdict: request changes (mainly because of the 🔴 item)

- Add runtime env overrides for op-execute, stream-sync, and scheduler timeouts
- Keep onboard timeout ordering checks so scheduler fires before host timeouts
- Let sim scheduler overrides skip onboard-only op/stream ordering limits
- Add timeout parsing and platform-ordering unit coverage
@ChaoZheng109 ChaoZheng109 merged commit a405b95 into hw-native-sys:main Jun 26, 2026
16 checks passed

Development

Successfully merging this pull request may close these issues.

[Feature] Make scheduler/op-execute/stream-sync timeouts env-var configurable (no rebuild)

2 participants