Add: runtime timeout env overrides#1127
Conversation
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughIntroduces three env-var overrides ( ChangesEnv-var configurable runtime timeouts
Sequence Diagram(s)sequenceDiagram
participant User as User (env vars set)
participant DeviceRunnerBase
participant resolve_onboard_timeout_config
participant ACL as ACL (aclrtSet / aclrtSynchronize)
participant runtime_maker as runtime_maker.cpp
participant resolve_scheduler_timeout_ms
participant PTO2RuntimeArenaLayout
participant SchedulerDispatch as SchedulerContext (AICPU)
rect rgba(70, 130, 180, 0.5)
Note over DeviceRunnerBase, ACL: Host-side timeout resolution (op-execute + stream-sync)
DeviceRunnerBase->>resolve_onboard_timeout_config: attach_current_thread (first call)
resolve_onboard_timeout_config->>resolve_onboard_timeout_config: getenv PTO2_OP_EXECUTE_TIMEOUT_US, PTO2_STREAM_SYNC_TIMEOUT_MS
resolve_onboard_timeout_config->>resolve_onboard_timeout_config: validate_runtime_timeout_order
resolve_onboard_timeout_config-->>DeviceRunnerBase: RuntimeTimeoutConfig timeout_config_
DeviceRunnerBase->>ACL: aclrtSetOpExecuteTimeOutV2(timeout_config_.op_execute_timeout_us)
DeviceRunnerBase->>ACL: aclrtSynchronizeStreamWithTimeout(timeout_config_.stream_sync_timeout_ms)
end
rect rgba(34, 139, 34, 0.5)
Note over runtime_maker, SchedulerDispatch: Scheduler timeout propagation to AICPU device
runtime_maker->>resolve_scheduler_timeout_ms: getenv PTO2_SCHEDULER_TIMEOUT_MS + ordering check
resolve_scheduler_timeout_ms-->>runtime_maker: scheduler_timeout_ms (or 0)
runtime_maker->>PTO2RuntimeArenaLayout: layout.scheduler_timeout_ms = value
Note over PTO2RuntimeArenaLayout: struct DMA'd to device arena
SchedulerDispatch->>PTO2RuntimeArenaLayout: read prebuilt_layout.scheduler_timeout_ms
SchedulerDispatch->>SchedulerDispatch: scheduler_timeout_cycles = ms × (FREQ/1000)
SchedulerDispatch->>SchedulerDispatch: stall check: elapsed > scheduler_timeout_cycles → emergency shutdown
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/a2a3/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp`:
- Around line 276-278: The sim-platform override in
runtime_maker::get_scheduler_timeout_ms is accepted too early, so an invalid
env-provided scheduler timeout can bypass the expected scheduler < op_execute <
stream_sync ordering. Update the timeout selection logic in runtime_maker.cpp to
validate the candidate scheduler timeout against the related op_execute and
stream_sync values before returning it, and if the ordering is broken, fall back
to the safe default instead of propagating the override.
In `@src/common/platform/include/host/runtime_timeout_config.h`:
- Around line 122-123: Reset the parse-status fields at the start of
resolve_runtime_timeout_config so a reused RuntimeTimeoutParseStatus cannot
carry stale state between calls. In resolve_runtime_timeout_config, before any
env-var parsing logic runs, explicitly initialize all *_env_set and *_valid
members on the optional status object to a clean default, then proceed with the
existing updates for each parsed field. This keeps downstream checks based on
RuntimeTimeoutParseStatus accurate even when the same status instance is passed
in multiple times.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 9ee4b20a-2dc0-4b90-9218-ce38d984ebc5
📒 Files selected for processing (18)
docs/dfx/tensor-dump.mdsrc/a2a3/platform/include/common/platform_config.hsrc/a2a3/platform/onboard/host/device_runner.cppsrc/a2a3/runtime/tensormap_and_ringbuffer/host/runtime_maker.cppsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.hsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cppsrc/a5/platform/include/common/platform_config.hsrc/a5/platform/onboard/host/device_runner.cppsrc/a5/runtime/tensormap_and_ringbuffer/host/runtime_maker.cppsrc/a5/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.hsrc/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cppsrc/common/platform/include/host/runtime_timeout_config.hsrc/common/platform/onboard/aicpu/spin_hint.hsrc/common/platform/onboard/host/device_runner_base.cppsrc/common/platform/onboard/host/device_runner_base.hsrc/common/platform/sim/aicpu/spin_hint.htests/ut/cpp/CMakeLists.txttests/ut/cpp/common/test_runtime_timeout_config.cpp
c672a45 to
cfcc53a
Compare
c8450d9 to
d1e8d3a
Compare
Review:sim 上 scheduler 超时实际无法调大整体 onboard 路径实现得很干净,a2a3/a5 对称、单测覆盖了 header 原语 👍。但有一个 must-fix,外加几个小项。 🔴 Must fix:sim 的 scheduler 覆盖被 onboard 的 op/stream 常量卡死
后果:
PR 描述写的是 "Sim builds skip onboard ordering … only the scheduler env changes the sim scheduler budget" —— 但代码并没有跳过。而且 建议:sim 上跳过 op/stream 排序,只校验 const char *plat = get_platform();
bool is_sim = plat != nullptr && std::strstr(plat, "sim") != nullptr;
if (is_sim) {
return cfg.scheduler_timeout_ms; // sim 无 STARS/ACL 超时,scheduler 不与 op/stream 耦合
}
// onboard 才做完整排序校验
RuntimeTimeoutOrderStatus status = validate_runtime_timeout_order(cfg);
...🟡 Should fix
Verdict:request changes(主要是 🔴 那条) |
87871ca to
49749d2
Compare
- Add runtime env overrides for op-execute, stream-sync, and scheduler timeouts - Keep onboard timeout ordering checks so scheduler fires before host timeouts - Let sim scheduler overrides skip onboard-only op/stream ordering limits - Add timeout parsing and platform-ordering unit coverage
49749d2 to
5131faa
Compare
Summary
PTO2_OP_EXECUTE_TIMEOUT_US,PTO2_STREAM_SYNC_TIMEOUT_MS, andPTO2_SCHEDULER_TIMEOUT_MSTimeout ordering policy
scheduler < op-execute < stream-syncstream-sync > scheduler + 1.5sto leave room for cold init before the AICPU scheduler no-progress timer is armedPTO2_STREAM_SYNC_TIMEOUT_MSfor their worst-case orchestration windowTesting
git diff --checkpython tests/lint/check_headers.py $(git diff --name-only)python tests/lint/check_english_only.py $(git diff --name-only)./tests/ut/cpp/build/test_runtime_timeout_config(5/5 passed)PYTHONNOUSERSITE=1 PYTHONPATH=$PWD:$PWD/python python -S simpler_setup/build_runtimes.py --platforms a2a3sim a5simFixes #1112