Refactor: Optimize the AICPU scheduler hot loop#589

Merged
jvjhfhg merged 11 commits into hw-native-sys:main from poursoul:refactor-sched-logical
Apr 17, 2026

Conversation


@poursoul poursoul commented Apr 17, 2026

Summary

Optimize the AICPU scheduler hot loop in resolve_and_dispatch_pto2 for reduced icache pressure, fewer atomic operations, and better cache locality across both a2a3 and a5 platforms.

Key changes:

  • Extract cold paths from the main dispatch loop — 6 noinline/cold helper functions (handle_orchestrator_exit, handle_core_transition, check_idle_fatal_error, log_stall_diagnostics, handle_timeout_exit, log_profiling_summary) remove ~328 lines / ~4KB of instructions from the hot loop body, allowing the steady-state completion+dispatch path to fit in L1 icache
  • Consolidate profiling counters into per-thread struct — Replace 20+ profiling local variables threaded through 10 function signatures with a single alignas(64) SchedProfilingCounters member array indexed by thread_idx, eliminating all #if PTO2_PROFILING / #if PTO2_SCHED_PROFILING parameter blocks (net -146 lines)
  • Merge AIC/AIV completion checks — De-template check_running_cores_for_completion to use get_all_running_cores() in a single traversal; remove CoreType ct parameter from complete_slot_task (reads hank[core_id].core_type directly)
  • Replace per-ring wiring queues with a single global queue — Consolidate 4 per-ring PTO2ReadyQueue instances into one global queue; wire_task derives ring_id from ws->ring_id for per-ring dep_pool access
  • Add wiring queue backoff — Skip pop_batch when fewer than WIRING_BATCH_SIZE tasks are queued, reducing contention with the orchestrator's push path; bypassed via force_drain when orchestrator is done
  • Replace MPMC queue with wait-free SPSC — New PTO2SpscQueue based on Rigtorp's cached-index technique: push is 1 relaxed load + 1 release store (zero CAS), pop_batch is 1 acquire load + N plain loads + 1 release store; 4 alignas(64) fields eliminate false sharing
  • Optimize cache layout: in RingSchedState, isolate dep_pool with alignas(64) from completion-path fields (advance_lock CAS traffic); in PTO2SchedulerState, pack count+index+batch[31] tightly and separate wiring_queue with alignas(64); wire_task accepts a pre-resolved RingSchedState& and wfanin to eliminate 3 redundant pointer dereferences per task

… icache pressure

Extract 6 rarely-executed code paths into separate noinline/cold member
functions to shrink the hot dispatch loop's instruction footprint:

- handle_orchestrator_exit: end-of-execution exit checks (~26 lines)
- handle_core_transition: orch-to-sched reassign, default disabled (~15 lines)
- check_idle_fatal_error: periodic error poll every 1024 idle iters (~12 lines)
- log_stall_diagnostics: full task/core state dump on stall (~90 lines)
- handle_timeout_exit: max-idle-iterations bailout (~15 lines)
- log_profiling_summary: post-loop profiling aggregation (~170 lines)

Total ~328 lines / ~4KB instructions removed from the hot loop body,
allowing the steady-state completion+dispatch path (~2KB) to fit
comfortably in L1 icache. Add LoopAction enum for control flow
signaling from extracted functions back to the main loop.
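The extraction pattern can be sketched as below. This is an illustration only: the helper bodies, the idle-poll condition, and the LoopAction variant names beyond the enum itself are assumptions, not the real aicpu_executor code.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the noinline/cold extraction plus the LoopAction enum used to
// signal control flow from extracted helpers back to the main loop.
enum class LoopAction : uint8_t {
    Continue,   // keep spinning in the hot loop
    Break,      // orderly exit
    Fatal       // unrecoverable error detected
};

struct SchedulerSketch {
    uint64_t idle_iters = 0;

    // noinline+cold (GCC/Clang attributes) keeps the rarely-executed body
    // out-of-line, so it never occupies the hot loop's L1 icache footprint.
    __attribute__((noinline, cold))
    LoopAction check_idle_fatal_error() {
        // ...poll device error state (elided in this sketch)...
        return LoopAction::Continue;
    }

    void run() {
        for (;;) {
            bool did_work = false;
            // ...hot completion+dispatch path (elided)...
            if (!did_work && (++idle_iters & 1023) == 0) {
                // Cold path: a single predicted-not-taken call site.
                switch (check_idle_fatal_error()) {
                    case LoopAction::Continue: break;
                    case LoopAction::Break:    return;
                    case LoopAction::Fatal:    return; // would abort
                }
            }
            if (!did_work) break; // sketch only: terminate when idle
        }
    }
};
```

The enum return value lets the caller keep a single `switch` in the hot loop instead of duplicating exit logic in each helper.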
…ruct

Replace 20+ profiling local variables passed through 10 functions with
a single per-thread SchedProfilingCounters struct stored in
AicpuExecutor. Each member function accesses its counters directly
via sched_perf_[thread_idx], eliminating all #if PTO2_PROFILING and
#if PTO2_SCHED_PROFILING parameter blocks from function signatures.

- Add alignas(64) SchedProfilingCounters struct to prevent false sharing
- Remove profiling params from: complete_slot_task,
  check_running_cores_for_completion, dispatch_shape,
  pop_ready_tasks_batch, dispatch_block, dispatch_subtask_to_core,
  dispatch_mix_block_to_cluster, handle_drain_mode,
  drain_worker_dispatch, log_profiling_summary
- Simplify resolve_and_dispatch_pto2 call sites (net -146 lines)
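A minimal sketch of the consolidation, with illustrative counter fields (the real SchedProfilingCounters members and the thread count are not shown in this PR text):

```cpp
#include <cassert>
#include <cstdint>

// Each element is padded to a full cache line, so thread i's increments
// never invalidate the line holding thread j's counters (no false sharing).
struct alignas(64) SchedProfilingCounters {
    uint64_t tasks_dispatched = 0;  // illustrative field
    uint64_t tasks_completed  = 0;  // illustrative field
    uint64_t pop_batches      = 0;  // illustrative field
    uint64_t idle_spins       = 0;  // illustrative field
};

constexpr int kMaxSchedThreads = 4; // assumption for the sketch

struct ExecutorSketch {
    SchedProfilingCounters sched_perf_[kMaxSchedThreads];

    // Before: counters were locals threaded through every signature inside
    // #if PTO2_PROFILING blocks. After: each member function indexes the
    // per-thread array directly, so the profiling params disappear.
    void complete_slot_task(int thread_idx) {
        sched_perf_[thread_idx].tasks_completed++;
    }
};
```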
Remove template parameter from check_running_cores_for_completion and
use get_all_running_cores() to iterate all running cores in one pass.
Drop CoreType ct parameter from complete_slot_task — profiling path
reads core_type directly from hank[core_id].core_type instead.
Skip pop_batch from wiring_queue when fewer than 16 tasks are queued,
reducing atomic contention between orchestrator push and scheduler
pop on the shared MPMC queue. Backoff is bypassed once the orchestrator
is done (force_drain=true) to ensure the tail is fully flushed.
…rection

- Consolidate per-ring wiring_queue into single global queue; wire_task
  derives ring_id from ws->ring_id for dep_pool access
- RingSchedState: isolate dep_pool with alignas(64) from completion
  path fields (advance_lock CAS traffic no longer invalidates wiring)
- PTO2SchedulerState: pack count+index+batch[15] into exactly 2 cache
  lines (128B); separate wiring_queue with alignas(64) for refill path
- wire_task(rss, ws, wfanin): accept pre-resolved RingSchedState ref
  and fanin count, eliminating 3 redundant pointer dereferences per task
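The dep_pool isolation can be sketched as follows; field names other than advance_lock and dep_pool are placeholders, and the real RingSchedState layout is richer:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Sketch: alignas(64) on dep_pool forces it onto its own cache line, so
// CAS traffic on advance_lock (completion path) no longer invalidates the
// line the wiring path reads from dep_pool.
struct RingSchedStateSketch {
    // Completion-path fields: contended CAS/atomic traffic lives here.
    std::atomic<uint32_t> advance_lock{0};
    uint32_t completed_count = 0;           // placeholder field

    // Wiring-path data, isolated on a separate cache line.
    alignas(64) uint64_t dep_pool[8] = {};  // placeholder size
};
```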
Replace the wiring_queue's MPMC Vyukov queue with a wait-free SPSC
ring buffer based on Rigtorp's cached-index technique. The wiring
queue has exactly one producer (orchestrator) and one consumer
(scheduler thread 0), so MPMC's per-slot CAS and sequence counters
are pure overhead.

- PTO2SpscQueue: 4 cache-line-aligned fields (head, tail_cached,
  tail, head_cached) eliminate false sharing; cached indices avoid
  cross-core loads on the hot path
- push: relaxed load + release store (zero CAS, wait-free)
- pop_batch: 1 acquire load + N plain loads + 1 release store
- Update pto_scheduler.cpp init/destroy for new queue type
- Add power-of-2 capacity validation in PTO2SpscQueue::init
- Add push full-condition comment explaining no-sentinel-slot design
- Change WIRING_BATCH_SIZE to uint64_t for sign-safe size() comparison
- Update batch size comment to match actual value (31 = 256B)
- Update MULTI_RING.md and RUNTIME_LOGIC.md to reflect global SPSC
  wiring queue replacing per-ring MPMC queues
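The synchronization structure of a Rigtorp-style cached-index SPSC ring can be sketched as below. The real PTO2SpscQueue differs (fixed storage, its own init/destroy, a pop_batch tuned for WIRING_BATCH_SIZE); this only demonstrates the cached indices, the memory orders named in the commit, and the one-slot-sentinel full condition.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

template <typename T>
class SpscQueueSketch {
public:
    explicit SpscQueueSketch(size_t capacity)
        : mask_(capacity - 1), buf_(capacity) {
        // capacity must be a power of two; one slot is sacrificed as a
        // sentinel, so capacity-1 elements are usable (the "full" comment
        // fixed later in this PR).
    }

    // Producer: 1 relaxed load + 1 release store on the fast path, zero
    // CAS. The cached consumer index is refreshed (acquire) only when the
    // queue looks full, avoiding a cross-core load per push.
    bool push(const T& v) {
        size_t head = head_.load(std::memory_order_relaxed);
        size_t next = (head + 1) & mask_;
        if (next == tail_cached_) {
            tail_cached_ = tail_.load(std::memory_order_acquire);
            if (next == tail_cached_) return false; // full
        }
        buf_[head] = v;
        head_.store(next, std::memory_order_release);
        return true;
    }

    // Consumer: 1 acquire load + N plain loads + 1 release store.
    size_t pop_batch(T* out, size_t max_n) {
        size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_cached_) {
            head_cached_ = head_.load(std::memory_order_acquire);
            if (tail == head_cached_) return 0; // empty
        }
        size_t n = 0;
        while (n < max_n && tail != head_cached_) {
            out[n++] = buf_[tail];
            tail = (tail + 1) & mask_;
        }
        tail_.store(tail, std::memory_order_release);
        return n;
    }

private:
    const size_t mask_;
    std::vector<T> buf_;
    // Four cache-line-aligned fields, mirroring the commit: each index and
    // its peer's cached copy sit on separate lines to kill false sharing.
    alignas(64) std::atomic<size_t> head_{0};  // producer-owned
    alignas(64) size_t tail_cached_ = 0;       // producer's view of tail
    alignas(64) std::atomic<size_t> tail_{0};  // consumer-owned
    alignas(64) size_t head_cached_ = 0;       // consumer's view of head
};
```

Because the wiring queue has exactly one producer (orchestrator) and one consumer (scheduler thread 0), this is all the synchronization required; the MPMC queue's per-slot sequence counters and CAS were pure overhead.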
Sync 7 a2a3 commits (90ffb39..7f819a2) to a5 tensormap_and_ringbuffer:

- Extract cold paths from resolve_and_dispatch_pto2 (noinline/cold)
- Consolidate profiling counters into per-thread SchedProfilingCounters
- Merge AIC/AIV completion checks into single traversal
- Add wiring queue backoff (batch-size threshold before pop)
- Consolidate per-ring wiring queues into single global queue
- Replace MPMC wiring queue with wait-free SPSC (Rigtorp cached-index)
- Optimize cache layout: RingSchedState dep_pool isolation, batch
  buffer packing, wiring_queue alignas(64) separation
- Update MULTI_RING.md and RUNTIME_LOGIC.md for global SPSC queue

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the AICPU executor by extracting cold-path logic into helper functions and consolidating scheduler profiling counters into a per-thread structure. It also replaces per-ring wiring queues with a global wait-free SPSC queue to reduce cache contention. Feedback includes correcting an off-by-one error in the SPSC queue's full condition that wastes a buffer slot, removing an unused profiling counter, and addressing a narrowing conversion for the wiring batch size. Additionally, improvements to the documentation regarding the backoff mechanism were suggested for better clarity.

Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_scheduler.h
Comment thread src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_scheduler.h
Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp
Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/docs/RUNTIME_LOGIC.md Outdated
Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_scheduler.h
Comment thread src/a5/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp
Comment thread src/a5/runtime/tensormap_and_ringbuffer/docs/RUNTIME_LOGIC.md Outdated
Comment thread src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_scheduler.h
- Fix misleading SPSC push() comment: capacity-1 usable slots (one
  wasted as sentinel), not full capacity as previously claimed
- Clarify wiring queue backoff description in RUNTIME_LOGIC.md:
  non-blocking deferred drain, not a blocking wait
The wiring queue backoff (skip pop when size < WIRING_BATCH_SIZE)
causes deadlock when the orchestrator submits a small number of tasks
then immediately spin-waits on get/set_tensor_data. The scheduler
never wires the pending tasks because the backoff threshold is not met,
while the orchestrator blocks forever waiting for task completion.

Add orch_needs_drain atomic flag to PTO2SchedulerState. The
orchestrator sets it before entering wait_for_tensor_ready and clears
it after the wait completes (including error paths). The scheduler's
drain_wiring_queue checks this flag alongside force_drain to bypass
the backoff when the orchestrator is actively blocking.
@poursoul poursoul changed the title from "Refactor sched logical" to "Refactor: Optimize the AICPU scheduler hot loop" Apr 17, 2026
The previous backoff (skip when queue.size() < WIRING_BATCH_SIZE)
could deadlock when the allocator spin-waits for last_task_alive
but fewer than batch-size tasks are queued for wiring.

Replace with a counter that defers pop on under-filled queue but
forces a pop after WIRING_BACKOFF_LIMIT (32) consecutive deferrals.
Retain orch_needs_drain for immediate bypass during tensor waits.
Reset wiring_backoff_counter in pto2_scheduler_init for multi-round.
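The revised decision logic can be sketched as below. It follows the names in the commit message (orch_needs_drain, wiring_backoff_counter, WIRING_BACKOFF_LIMIT); the surrounding scheduler state and the exact call site in drain_wiring_queue are simplified assumptions.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

constexpr uint64_t WIRING_BATCH_SIZE    = 16; // threshold from the commit
constexpr uint32_t WIRING_BACKOFF_LIMIT = 32; // max consecutive deferrals

struct BackoffSketch {
    std::atomic<bool> orch_needs_drain{false}; // set around tensor waits
    uint32_t wiring_backoff_counter = 0;       // reset in scheduler init

    // Returns true when the scheduler should call pop_batch now.
    bool should_pop(uint64_t queued, bool force_drain) {
        if (force_drain ||
            orch_needs_drain.load(std::memory_order_acquire)) {
            wiring_backoff_counter = 0;
            return true;  // orchestrator is blocking: flush immediately
        }
        if (queued >= WIRING_BATCH_SIZE) {
            wiring_backoff_counter = 0;
            return true;  // enough queued work for a full batch
        }
        if (queued > 0 &&
            ++wiring_backoff_counter >= WIRING_BACKOFF_LIMIT) {
            wiring_backoff_counter = 0;
            return true;  // bounded deferral: force a partial pop
        }
        return false;     // under-filled queue: defer to reduce contention
    }
};
```

Bounding the deferral count is what removes the deadlock: even if fewer than batch-size tasks ever arrive, the scheduler pops them after at most WIRING_BACKOFF_LIMIT idle checks.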
@jvjhfhg jvjhfhg merged commit 7740dd2 into hw-native-sys:main Apr 17, 2026
15 checks passed