Refactor: Optimize the AICPU scheduler hot loop#589

Merged
jvjhfhg merged 11 commits into hw-native-sys:main from poursoul:refactor-sched-logical
Apr 17, 2026

Conversation


@poursoul poursoul commented Apr 17, 2026

Summary

Optimize the AICPU scheduler hot loop in resolve_and_dispatch_pto2 for reduced icache pressure, fewer atomic operations, and better cache locality across both a2a3 and a5 platforms.

Key changes:

  • Extract cold paths from the main dispatch loop — 6 noinline/cold helper functions (handle_orchestrator_exit, handle_core_transition, check_idle_fatal_error, log_stall_diagnostics, handle_timeout_exit, log_profiling_summary) remove ~328 lines / ~4KB of instructions from the hot loop body, allowing the steady-state completion+dispatch path to fit in L1 icache
  • Consolidate profiling counters into per-thread struct — Replace 20+ profiling local variables threaded through 10 function signatures with a single alignas(64) SchedProfilingCounters member array indexed by thread_idx, eliminating all #if PTO2_PROFILING / #if PTO2_SCHED_PROFILING parameter blocks (net -146 lines)
  • Merge AIC/AIV completion checks — De-template check_running_cores_for_completion to use get_all_running_cores() in a single traversal; remove CoreType ct parameter from complete_slot_task (reads hank[core_id].core_type directly)
  • Replace per-ring wiring queues with a single global queue — Consolidate 4 per-ring PTO2ReadyQueue instances into one global queue; wire_task derives ring_id from ws->ring_id for per-ring dep_pool access
  • Add wiring queue backoff — Skip pop_batch when fewer than WIRING_BATCH_SIZE tasks are queued, reducing contention with the orchestrator's push path; bypassed via force_drain when orchestrator is done
  • Replace MPMC queue with wait-free SPSC — New PTO2SpscQueue based on Rigtorp's cached-index technique: push is 1 relaxed load + 1 release store (zero CAS), pop_batch is 1 acquire load + N plain loads + 1 release store; 4 alignas(64) fields eliminate false sharing
  • Optimize cache layout: in RingSchedState, isolate dep_pool with alignas(64) from completion-path fields (advance_lock CAS traffic); in PTO2SchedulerState, pack count+index+batch[31] tightly and separate wiring_queue with alignas(64); wire_task accepts a pre-resolved RingSchedState& and wfanin to eliminate 3 redundant pointer dereferences per task

… icache pressure

Extract 6 rarely-executed code paths into separate noinline/cold member
functions to shrink the hot dispatch loop's instruction footprint:

- handle_orchestrator_exit: end-of-execution exit checks (~26 lines)
- handle_core_transition: orch-to-sched reassign, default disabled (~15 lines)
- check_idle_fatal_error: periodic error poll every 1024 idle iters (~12 lines)
- log_stall_diagnostics: full task/core state dump on stall (~90 lines)
- handle_timeout_exit: max-idle-iterations bailout (~15 lines)
- log_profiling_summary: post-loop profiling aggregation (~170 lines)

Total ~328 lines / ~4KB instructions removed from the hot loop body,
allowing the steady-state completion+dispatch path (~2KB) to fit
comfortably in L1 icache. Add LoopAction enum for control flow
signaling from extracted functions back to the main loop.
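The extraction pattern can be sketched as below. This is an illustration only: the helper bodies, the idle-poll condition, and the LoopAction variant names beyond the enum itself are assumptions, not the real aicpu_executor code.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the noinline/cold extraction plus the LoopAction enum used to
// signal control flow from extracted helpers back to the main loop.
enum class LoopAction : uint8_t {
    Continue,   // keep spinning in the hot loop
    Break,      // orderly exit
    Fatal       // unrecoverable error detected
};

struct SchedulerSketch {
    uint64_t idle_iters = 0;

    // noinline+cold (GCC/Clang attributes) keeps the rarely-executed body
    // out-of-line, so it never occupies the hot loop's L1 icache footprint.
    __attribute__((noinline, cold))
    LoopAction check_idle_fatal_error() {
        // ...poll device error state (elided in this sketch)...
        return LoopAction::Continue;
    }

    void run() {
        for (;;) {
            bool did_work = false;
            // ...hot completion+dispatch path (elided)...
            if (!did_work && (++idle_iters & 1023) == 0) {
                // Cold path: a single predicted-not-taken call site.
                switch (check_idle_fatal_error()) {
                    case LoopAction::Continue: break;
                    case LoopAction::Break:    return;
                    case LoopAction::Fatal:    return; // would abort
                }
            }
            if (!did_work) break; // sketch only: terminate when idle
        }
    }
};
```

The enum return value lets the caller keep a single `switch` in the hot loop instead of duplicating exit logic in each helper.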
…ruct

Replace 20+ profiling local variables passed through 10 functions with
a single per-thread SchedProfilingCounters struct stored in
AicpuExecutor. Each member function accesses its counters directly
via sched_perf_[thread_idx], eliminating all #if PTO2_PROFILING and
#if PTO2_SCHED_PROFILING parameter blocks from function signatures.

- Add alignas(64) SchedProfilingCounters struct to prevent false sharing
- Remove profiling params from: complete_slot_task,
  check_running_cores_for_completion, dispatch_shape,
  pop_ready_tasks_batch, dispatch_block, dispatch_subtask_to_core,
  dispatch_mix_block_to_cluster, handle_drain_mode,
  drain_worker_dispatch, log_profiling_summary
- Simplify resolve_and_dispatch_pto2 call sites (net -146 lines)
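A minimal sketch of the consolidation, with illustrative counter fields (the real SchedProfilingCounters members and the thread count are not shown in this PR text):

```cpp
#include <cassert>
#include <cstdint>

// Each element is padded to a full cache line, so thread i's increments
// never invalidate the line holding thread j's counters (no false sharing).
struct alignas(64) SchedProfilingCounters {
    uint64_t tasks_dispatched = 0;  // illustrative field
    uint64_t tasks_completed  = 0;  // illustrative field
    uint64_t pop_batches      = 0;  // illustrative field
    uint64_t idle_spins       = 0;  // illustrative field
};

constexpr int kMaxSchedThreads = 4; // assumption for the sketch

struct ExecutorSketch {
    SchedProfilingCounters sched_perf_[kMaxSchedThreads];

    // Before: counters were locals threaded through every signature inside
    // #if PTO2_PROFILING blocks. After: each member function indexes the
    // per-thread array directly, so the profiling params disappear.
    void complete_slot_task(int thread_idx) {
        sched_perf_[thread_idx].tasks_completed++;
    }
};
```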
Remove template parameter from check_running_cores_for_completion and
use get_all_running_cores() to iterate all running cores in one pass.
Drop CoreType ct parameter from complete_slot_task — profiling path
reads core_type directly from hank[core_id].core_type instead.
Skip pop_batch from wiring_queue when fewer than 16 tasks are queued,
reducing atomic contention between orchestrator push and scheduler
pop on the shared MPMC queue. Backoff is bypassed once the orchestrator
is done (force_drain=true) to ensure the tail is fully flushed.
…rection

- Consolidate per-ring wiring_queue into single global queue; wire_task
  derives ring_id from ws->ring_id for dep_pool access
- RingSchedState: isolate dep_pool with alignas(64) from completion
  path fields (advance_lock CAS traffic no longer invalidates wiring)
- PTO2SchedulerState: pack count+index+batch[15] into exactly 2 cache
  lines (128B); separate wiring_queue with alignas(64) for refill path
- wire_task(rss, ws, wfanin): accept pre-resolved RingSchedState ref
  and fanin count, eliminating 3 redundant pointer dereferences per task
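The dep_pool isolation can be sketched as follows; field names other than advance_lock and dep_pool are placeholders, and the real RingSchedState layout is richer:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Sketch: alignas(64) on dep_pool forces it onto its own cache line, so
// CAS traffic on advance_lock (completion path) no longer invalidates the
// line the wiring path reads from dep_pool.
struct RingSchedStateSketch {
    // Completion-path fields: contended CAS/atomic traffic lives here.
    std::atomic<uint32_t> advance_lock{0};
    uint32_t completed_count = 0;           // placeholder field

    // Wiring-path data, isolated on a separate cache line.
    alignas(64) uint64_t dep_pool[8] = {};  // placeholder size
};
```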
Replace the wiring_queue's MPMC Vyukov queue with a wait-free SPSC
ring buffer based on Rigtorp's cached-index technique. The wiring
queue has exactly one producer (orchestrator) and one consumer
(scheduler thread 0), so MPMC's per-slot CAS and sequence counters
are pure overhead.

- PTO2SpscQueue: 4 cache-line-aligned fields (head, tail_cached,
  tail, head_cached) eliminate false sharing; cached indices avoid
  cross-core loads on the hot path
- push: relaxed load + release store (zero CAS, wait-free)
- pop_batch: 1 acquire load + N plain loads + 1 release store
- Update pto_scheduler.cpp init/destroy for new queue type
- Add power-of-2 capacity validation in PTO2SpscQueue::init
- Add push full-condition comment explaining no-sentinel-slot design
- Change WIRING_BATCH_SIZE to uint64_t for sign-safe size() comparison
- Update batch size comment to match actual value (31 = 256B)
- Update MULTI_RING.md and RUNTIME_LOGIC.md to reflect global SPSC
  wiring queue replacing per-ring MPMC queues
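The synchronization structure of a Rigtorp-style cached-index SPSC ring can be sketched as below. The real PTO2SpscQueue differs (fixed storage, its own init/destroy, a pop_batch tuned for WIRING_BATCH_SIZE); this only demonstrates the cached indices, the memory orders named in the commit, and the one-slot-sentinel full condition.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

template <typename T>
class SpscQueueSketch {
public:
    explicit SpscQueueSketch(size_t capacity)
        : mask_(capacity - 1), buf_(capacity) {
        // capacity must be a power of two; one slot is sacrificed as a
        // sentinel, so capacity-1 elements are usable (the "full" comment
        // fixed later in this PR).
    }

    // Producer: 1 relaxed load + 1 release store on the fast path, zero
    // CAS. The cached consumer index is refreshed (acquire) only when the
    // queue looks full, avoiding a cross-core load per push.
    bool push(const T& v) {
        size_t head = head_.load(std::memory_order_relaxed);
        size_t next = (head + 1) & mask_;
        if (next == tail_cached_) {
            tail_cached_ = tail_.load(std::memory_order_acquire);
            if (next == tail_cached_) return false; // full
        }
        buf_[head] = v;
        head_.store(next, std::memory_order_release);
        return true;
    }

    // Consumer: 1 acquire load + N plain loads + 1 release store.
    size_t pop_batch(T* out, size_t max_n) {
        size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_cached_) {
            head_cached_ = head_.load(std::memory_order_acquire);
            if (tail == head_cached_) return 0; // empty
        }
        size_t n = 0;
        while (n < max_n && tail != head_cached_) {
            out[n++] = buf_[tail];
            tail = (tail + 1) & mask_;
        }
        tail_.store(tail, std::memory_order_release);
        return n;
    }

private:
    const size_t mask_;
    std::vector<T> buf_;
    // Four cache-line-aligned fields, mirroring the commit: each index and
    // its peer's cached copy sit on separate lines to kill false sharing.
    alignas(64) std::atomic<size_t> head_{0};  // producer-owned
    alignas(64) size_t tail_cached_ = 0;       // producer's view of tail
    alignas(64) std::atomic<size_t> tail_{0};  // consumer-owned
    alignas(64) size_t head_cached_ = 0;       // consumer's view of head
};
```

Because the wiring queue has exactly one producer (orchestrator) and one consumer (scheduler thread 0), this is all the synchronization required; the MPMC queue's per-slot sequence counters and CAS were pure overhead.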
Sync 7 a2a3 commits (90ffb39..7f819a2) to a5 tensormap_and_ringbuffer:

- Extract cold paths from resolve_and_dispatch_pto2 (noinline/cold)
- Consolidate profiling counters into per-thread SchedProfilingCounters
- Merge AIC/AIV completion checks into single traversal
- Add wiring queue backoff (batch-size threshold before pop)
- Consolidate per-ring wiring queues into single global queue
- Replace MPMC wiring queue with wait-free SPSC (Rigtorp cached-index)
- Optimize cache layout: RingSchedState dep_pool isolation, batch
  buffer packing, wiring_queue alignas(64) separation
- Update MULTI_RING.md and RUNTIME_LOGIC.md for global SPSC queue

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the AICPU executor by extracting cold-path logic into helper functions and consolidating scheduler profiling counters into a per-thread structure. It also replaces per-ring wiring queues with a global wait-free SPSC queue to reduce cache contention. Feedback includes correcting an off-by-one error in the SPSC queue's full condition that wastes a buffer slot, removing an unused profiling counter, and addressing a narrowing conversion for the wiring batch size. Additionally, improvements to the documentation regarding the backoff mechanism were suggested for better clarity.

Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_scheduler.h
Comment thread src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_scheduler.h
Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp
Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/docs/RUNTIME_LOGIC.md Outdated
Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_scheduler.h
Comment thread src/a5/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp
Comment thread src/a5/runtime/tensormap_and_ringbuffer/docs/RUNTIME_LOGIC.md Outdated
Comment thread src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_scheduler.h
- Fix misleading SPSC push() comment: capacity-1 usable slots (one
  wasted as sentinel), not full capacity as previously claimed
- Clarify wiring queue backoff description in RUNTIME_LOGIC.md:
  non-blocking deferred drain, not a blocking wait
The wiring queue backoff (skip pop when size < WIRING_BATCH_SIZE)
causes deadlock when the orchestrator submits a small number of tasks
then immediately spin-waits on get/set_tensor_data. The scheduler
never wires the pending tasks because the backoff threshold is not met,
while the orchestrator blocks forever waiting for task completion.

Add orch_needs_drain atomic flag to PTO2SchedulerState. The
orchestrator sets it before entering wait_for_tensor_ready and clears
it after the wait completes (including error paths). The scheduler's
drain_wiring_queue checks this flag alongside force_drain to bypass
the backoff when the orchestrator is actively blocking.
@poursoul poursoul changed the title from "Refactor sched logical" to "Refactor: Optimize the AICPU scheduler hot loop" Apr 17, 2026
The previous backoff (skip when queue.size() < WIRING_BATCH_SIZE)
could deadlock when the allocator spin-waits for last_task_alive
but fewer than batch-size tasks are queued for wiring.

Replace with a counter that defers pop on under-filled queue but
forces a pop after WIRING_BACKOFF_LIMIT (32) consecutive deferrals.
Retain orch_needs_drain for immediate bypass during tensor waits.
Reset wiring_backoff_counter in pto2_scheduler_init for multi-round.
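The revised decision logic can be sketched as below. It follows the names in the commit message (orch_needs_drain, wiring_backoff_counter, WIRING_BACKOFF_LIMIT); the surrounding scheduler state and the exact call site in drain_wiring_queue are simplified assumptions.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

constexpr uint64_t WIRING_BATCH_SIZE    = 16; // threshold from the commit
constexpr uint32_t WIRING_BACKOFF_LIMIT = 32; // max consecutive deferrals

struct BackoffSketch {
    std::atomic<bool> orch_needs_drain{false}; // set around tensor waits
    uint32_t wiring_backoff_counter = 0;       // reset in scheduler init

    // Returns true when the scheduler should call pop_batch now.
    bool should_pop(uint64_t queued, bool force_drain) {
        if (force_drain ||
            orch_needs_drain.load(std::memory_order_acquire)) {
            wiring_backoff_counter = 0;
            return true;  // orchestrator is blocking: flush immediately
        }
        if (queued >= WIRING_BATCH_SIZE) {
            wiring_backoff_counter = 0;
            return true;  // enough queued work for a full batch
        }
        if (queued > 0 &&
            ++wiring_backoff_counter >= WIRING_BACKOFF_LIMIT) {
            wiring_backoff_counter = 0;
            return true;  // bounded deferral: force a partial pop
        }
        return false;     // under-filled queue: defer to reduce contention
    }
};
```

Bounding the deferral count is what removes the deadlock: even if fewer than batch-size tasks ever arrive, the scheduler pops them after at most WIRING_BACKOFF_LIMIT idle checks.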
@jvjhfhg jvjhfhg merged commit 7740dd2 into hw-native-sys:main Apr 17, 2026
15 checks passed