Refactor: Optimize the AICPU scheduler hot loop #589
Merged
jvjhfhg merged 11 commits into hw-native-sys:main on Apr 17, 2026
Conversation
… icache pressure

Extract 6 rarely-executed code paths into separate noinline/cold member functions to shrink the hot dispatch loop's instruction footprint:
- handle_orchestrator_exit: end-of-execution exit checks (~26 lines)
- handle_core_transition: orch-to-sched reassign, default disabled (~15 lines)
- check_idle_fatal_error: periodic error poll every 1024 idle iters (~12 lines)
- log_stall_diagnostics: full task/core state dump on stall (~90 lines)
- handle_timeout_exit: max-idle-iterations bailout (~15 lines)
- log_profiling_summary: post-loop profiling aggregation (~170 lines)

In total, ~328 lines / ~4KB of instructions are removed from the hot loop body, allowing the steady-state completion+dispatch path (~2KB) to fit comfortably in L1 icache. Add a LoopAction enum for control-flow signaling from the extracted functions back to the main loop.
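A minimal sketch of the extraction pattern: the function and enum names follow the commit's naming, but the bodies, fields, and the `__attribute__` spelling (GCC/Clang) are illustrative, not the PR's actual code.

```cpp
// Signal from an extracted cold helper back to the dispatch loop.
// (Enum name from the commit; enumerator names are illustrative.)
enum class LoopAction { Continue, ExitLoop };

struct Sched {
    bool orchestrator_done = false;
    unsigned idle_iters = 0;

    // Rarely-taken path moved out of line: the hot loop keeps only a
    // call + branch, so the body no longer occupies L1 icache.
    __attribute__((noinline, cold))
    LoopAction handle_orchestrator_exit() {
        // ... end-of-execution exit checks would live here ...
        return orchestrator_done ? LoopAction::ExitLoop : LoopAction::Continue;
    }

    __attribute__((noinline, cold))
    LoopAction check_idle_fatal_error() {
        // ... periodic device-error poll would live here ...
        return LoopAction::Continue;
    }

    void run() {
        for (;;) {
            // hot path: completion + dispatch (kept small) ...
            ++idle_iters;
            if ((idle_iters & 1023) == 0 &&  // every 1024 idle iterations
                check_idle_fatal_error() == LoopAction::ExitLoop)
                break;
            if (handle_orchestrator_exit() == LoopAction::ExitLoop)
                break;
        }
    }
};
```

The key point is that the helpers return a LoopAction instead of manipulating loop state directly, so the hot loop's control flow stays a pair of compare-and-branch instructions.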
…ruct

Replace 20+ profiling local variables passed through 8 functions with a single per-thread SchedProfilingCounters struct stored in AicpuExecutor. Each member function accesses its counters directly via sched_perf_[thread_idx], eliminating all #if PTO2_PROFILING and #if PTO2_SCHED_PROFILING parameter blocks from function signatures.
- Add alignas(64) SchedProfilingCounters struct to prevent false sharing
- Remove profiling params from: complete_slot_task, check_running_cores_for_completion, dispatch_shape, pop_ready_tasks_batch, dispatch_block, dispatch_subtask_to_core, dispatch_mix_block_to_cluster, handle_drain_mode, drain_worker_dispatch, log_profiling_summary
- Simplify resolve_and_dispatch_pto2 call sites (net -146 lines)
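A sketch of the consolidation, assuming hypothetical counter fields (the PR's SchedProfilingCounters has its own set; only the struct name, the sched_perf_ array, and the alignas(64) idea come from the commit):

```cpp
#include <cstdint>

// alignas(64) pads each array element to a full cache line, so two
// scheduler threads never write the same line (no false sharing).
struct alignas(64) SchedProfilingCounters {
    uint64_t tasks_dispatched = 0;  // illustrative field
    uint64_t tasks_completed = 0;   // illustrative field
    uint64_t queue_pops = 0;        // illustrative field
};

static_assert(sizeof(SchedProfilingCounters) == 64,
              "exactly one cache line per thread's counters");

struct AicpuExecutor {
    static constexpr int kMaxThreads = 4;  // illustrative bound
    SchedProfilingCounters sched_perf_[kMaxThreads];

    // Before: profiling values were threaded through 8 functions as
    // #if-guarded parameters. After: each member function indexes the
    // per-thread struct directly.
    void complete_slot_task(int thread_idx) {
        ++sched_perf_[thread_idx].tasks_completed;
    }
};
```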
Remove the template parameter from check_running_cores_for_completion and use get_all_running_cores() to iterate all running cores in one pass. Drop the CoreType ct parameter from complete_slot_task; the profiling path reads core_type directly from hank[core_id].core_type instead.
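The single-traversal shape can be sketched as follows; the hank[] table is from the commit, but the CoreInfo fields, the Executor wrapper, and the iteration details are assumptions standing in for the real get_all_running_cores() helper.

```cpp
#include <cstddef>
#include <vector>

enum class CoreType { AIC, AIV };

// Illustrative per-core entry; the real hank[] layout may differ.
struct CoreInfo { CoreType core_type; bool running; bool done; };

struct Executor {
    std::vector<CoreInfo> hank;
    int completed = 0;

    // Before: two template instantiations, one pass per CoreType.
    // After: one pass over all running cores, regardless of type.
    void check_running_cores_for_completion() {
        for (size_t core_id = 0; core_id < hank.size(); ++core_id) {
            if (hank[core_id].running && hank[core_id].done)
                complete_slot_task(core_id);
        }
    }

    void complete_slot_task(size_t core_id) {
        // core_type is read from the table instead of being passed in.
        CoreType ct = hank[core_id].core_type;
        (void)ct;  // profiling path would use ct here
        hank[core_id].running = false;
        ++completed;
    }
};
```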
Skip pop_batch from the wiring_queue when fewer than 16 tasks are queued, reducing atomic contention between the orchestrator's push and the scheduler's pop on the shared MPSC queue. The backoff is bypassed once the orchestrator is done (force_drain=true) so the tail is fully flushed.
…rection

- Consolidate per-ring wiring_queue into a single global queue; wire_task derives ring_id from ws->ring_id for dep_pool access
- RingSchedState: isolate dep_pool with alignas(64) from completion-path fields (advance_lock CAS traffic no longer invalidates wiring)
- PTO2SchedulerState: pack count+index+batch[15] into exactly 2 cache lines (128B); separate wiring_queue with alignas(64) for the refill path
- wire_task(rss, ws, wfanin): accept a pre-resolved RingSchedState ref and fanin count, eliminating 3 redundant pointer dereferences per task
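The dep_pool isolation can be sketched like this; only the struct name, the advance_lock/dep_pool fields, and the alignas(64) split are from the commit, while the concrete field types and sizes are assumptions.

```cpp
#include <atomic>
#include <cstdint>

// Sketch of the RingSchedState cache-layout split.
struct RingSchedState {
    // Completion-path fields: advance_lock sees heavy CAS traffic as
    // the scheduler advances the ring.
    std::atomic<uint32_t> advance_lock{0};
    uint32_t completed_count = 0;  // illustrative neighbor field

    // dep_pool starts on its own 64-byte cache line, so the CAS traffic
    // above no longer invalidates the wiring path's cached line.
    alignas(64) uint32_t dep_pool[16] = {};
};
```

Without the alignas, dep_pool would share a line with advance_lock, and every CAS on the completion path would bounce the wiring path's working set between cores.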
Replace the wiring_queue's MPMC Vyukov queue with a wait-free SPSC ring buffer based on Rigtorp's cached-index technique. The wiring queue has exactly one producer (orchestrator) and one consumer (scheduler thread 0), so MPMC's per-slot CAS and sequence counters are pure overhead.
- PTO2SpscQueue: 4 cache-line-aligned fields (head, tail_cached, tail, head_cached) eliminate false sharing; cached indices avoid cross-core loads on the hot path
- push: relaxed load + release store (zero CAS, wait-free)
- pop_batch: 1 acquire load + N plain loads + 1 release store
- Update pto_scheduler.cpp init/destroy for the new queue type
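A self-contained sketch of the cached-index SPSC technique, with the same four aligned fields. One caveat: this sketch uses monotonically increasing indices, a common variant in which all capacity slots are usable, whereas the PR's PTO2SpscQueue (per the later review fix) wraps indices and reserves one sentinel slot. Everything else here is illustrative, not the PR's code.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

struct SpscQueue {
    // Each index on its own cache line; each side also keeps a cached
    // copy of the other side's index so the hot path rarely touches a
    // line the other core is writing.
    alignas(64) std::atomic<size_t> head{0};   // written by producer
    alignas(64) size_t tail_cached = 0;        // producer's copy of tail
    alignas(64) std::atomic<size_t> tail{0};   // written by consumer
    alignas(64) size_t head_cached = 0;        // consumer's copy of head
    std::vector<uint64_t> buf;
    size_t mask = 0;

    bool init(size_t capacity) {
        // Power-of-2 capacity so index wrap is a cheap mask.
        if (capacity == 0 || (capacity & (capacity - 1)) != 0) return false;
        buf.assign(capacity, 0);
        mask = capacity - 1;
        return true;
    }

    // Producer: one relaxed load + one release store, zero CAS.
    bool push(uint64_t v) {
        size_t h = head.load(std::memory_order_relaxed);
        if (h - tail_cached >= buf.size()) {   // looks full: refresh cache
            tail_cached = tail.load(std::memory_order_acquire);
            if (h - tail_cached >= buf.size()) return false;  // really full
        }
        buf[h & mask] = v;
        head.store(h + 1, std::memory_order_release);
        return true;
    }

    // Consumer: 1 acquire load + N plain loads + 1 release store.
    size_t pop_batch(uint64_t* out, size_t max) {
        size_t t = tail.load(std::memory_order_relaxed);
        if (head_cached == t)
            head_cached = head.load(std::memory_order_acquire);
        size_t n = 0;
        while (t + n != head_cached && n < max) {
            out[n] = buf[(t + n) & mask];
            ++n;
        }
        if (n) tail.store(t + n, std::memory_order_release);
        return n;
    }

    size_t size() const {
        return head.load(std::memory_order_relaxed) -
               tail.load(std::memory_order_relaxed);
    }
};
```

The cached indices are what make this fast: the producer only reads the consumer's tail when the queue looks full, and the consumer only reads the producer's head when it looks empty, so in steady state each side stays on its own cache lines.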
- Add power-of-2 capacity validation in PTO2SpscQueue::init
- Add a comment on push's full condition explaining the no-sentinel-slot design
- Change WIRING_BATCH_SIZE to uint64_t for sign-safe size() comparison
- Update the batch size comment to match the actual value (31 = 256B)
- Update MULTI_RING.md and RUNTIME_LOGIC.md to reflect the global SPSC wiring queue replacing per-ring MPMC queues
Sync 7 a2a3 commits (90ffb39..7f819a2) to a5 tensormap_and_ringbuffer:
- Extract cold paths from resolve_and_dispatch_pto2 (noinline/cold)
- Consolidate profiling counters into per-thread SchedProfilingCounters
- Merge AIC/AIV completion checks into a single traversal
- Add wiring queue backoff (batch-size threshold before pop)
- Consolidate per-ring wiring queues into a single global queue
- Replace MPMC wiring queue with wait-free SPSC (Rigtorp cached-index)
- Optimize cache layout: RingSchedState dep_pool isolation, batch buffer packing, wiring_queue alignas(64) separation
- Update MULTI_RING.md and RUNTIME_LOGIC.md for the global SPSC queue
Code Review
This pull request refactors the AICPU executor by extracting cold-path logic into helper functions and consolidating scheduler profiling counters into a per-thread structure. It also replaces per-ring wiring queues with a global wait-free SPSC queue to reduce cache contention. Feedback includes correcting an off-by-one error in the SPSC queue's full condition that wastes a buffer slot, removing an unused profiling counter, and addressing a narrowing conversion for the wiring batch size. Additionally, improvements to the documentation regarding the backoff mechanism were suggested for better clarity.
- Fix misleading SPSC push() comment: capacity-1 usable slots (one wasted as a sentinel), not full capacity as previously claimed
- Clarify the wiring queue backoff description in RUNTIME_LOGIC.md: non-blocking deferred drain, not a blocking wait
The wiring queue backoff (skip pop when size < WIRING_BATCH_SIZE) causes a deadlock when the orchestrator submits a small number of tasks and then immediately spin-waits in get/set_tensor_data: the scheduler never wires the pending tasks because the backoff threshold is not met, while the orchestrator blocks forever waiting for task completion.

Add an orch_needs_drain atomic flag to PTO2SchedulerState. The orchestrator sets it before entering wait_for_tensor_ready and clears it after the wait completes (including on error paths). The scheduler's drain_wiring_queue checks this flag alongside force_drain to bypass the backoff whenever the orchestrator is actively blocking.
The previous backoff (skip when queue.size() < WIRING_BATCH_SIZE) could still deadlock when the allocator spin-waits for last_task_alive while fewer than batch-size tasks are queued for wiring. Replace it with a counter that defers the pop on an under-filled queue but forces a pop after WIRING_BACKOFF_LIMIT (32) consecutive deferrals. Retain orch_needs_drain for immediate bypass during tensor waits, and reset wiring_backoff_counter in pto2_scheduler_init for multi-round runs.
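The final backoff decision, combining the threshold, the bounded deferral counter, and the two bypass flags, can be sketched as a single predicate. The constants and flag names are from the commits; the should_drain helper and SchedState wrapper are illustrative.

```cpp
#include <atomic>
#include <cstdint>

constexpr uint64_t WIRING_BATCH_SIZE   = 16;  // threshold from the commit
constexpr uint32_t WIRING_BACKOFF_LIMIT = 32; // max consecutive deferrals

struct SchedState {
    std::atomic<bool> orch_needs_drain{false};  // set around tensor waits
    uint32_t wiring_backoff_counter = 0;        // reset in scheduler init

    // Returns true if the scheduler should pop from the wiring queue now.
    bool should_drain(uint64_t queue_size, bool force_drain) {
        if (queue_size == 0) return false;
        // Orchestrator finished, or actively spin-waiting on a tensor:
        // drain immediately so small task sets are never stranded.
        if (force_drain || orch_needs_drain.load(std::memory_order_acquire)) {
            wiring_backoff_counter = 0;
            return true;
        }
        // Under-filled queue: defer the pop to cut atomic contention with
        // the orchestrator's push path, but only a bounded number of
        // consecutive times, which removes the earlier deadlock.
        if (queue_size < WIRING_BATCH_SIZE &&
            ++wiring_backoff_counter < WIRING_BACKOFF_LIMIT)
            return false;
        wiring_backoff_counter = 0;
        return true;
    }
};
```

Bounding the deferrals turns the backoff from a correctness hazard into a pure throughput optimization: in the worst case an under-filled queue waits a fixed number of loop iterations before being drained anyway.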
jvjhfhg approved these changes on Apr 17, 2026
Summary
Optimize the AICPU scheduler hot loop in resolve_and_dispatch_pto2 for reduced icache pressure, fewer atomic operations, and better cache locality across both a2a3 and a5 platforms.

Key changes:
- Extract noinline/cold helper functions (handle_orchestrator_exit, handle_core_transition, check_idle_fatal_error, log_stall_diagnostics, handle_timeout_exit, log_profiling_summary), removing ~328 lines / ~4KB of instructions from the hot loop body and allowing the steady-state completion+dispatch path to fit in L1 icache
- Consolidate profiling counters into an alignas(64) SchedProfilingCounters member array indexed by thread_idx, eliminating all #if PTO2_PROFILING / #if PTO2_SCHED_PROFILING parameter blocks (net -146 lines)
- Rework check_running_cores_for_completion to use get_all_running_cores() in a single traversal; remove the CoreType ct parameter from complete_slot_task (it reads hank[core_id].core_type directly)
- Consolidate per-ring PTO2ReadyQueue instances into one global queue; wire_task derives ring_id from ws->ring_id for per-ring dep_pool access
- Skip pop_batch when fewer than WIRING_BATCH_SIZE tasks are queued, reducing contention with the orchestrator's push path; bypassed via force_drain when the orchestrator is done
- Add a wait-free SPSC PTO2SpscQueue based on Rigtorp's cached-index technique: push is 1 relaxed load + 1 release store (zero CAS), pop_batch is 1 acquire load + N plain loads + 1 release store; 4 alignas(64) fields eliminate false sharing
- Optimize cache layout: RingSchedState isolates dep_pool with alignas(64) from completion-path fields (advance_lock CAS traffic); PTO2SchedulerState packs count+index+batch[31] tightly and separates wiring_queue with alignas(64); wire_task accepts a pre-resolved RingSchedState& and wfanin to eliminate 3 redundant pointer dereferences per task