[Performance] Wireless (Polling-based) task readiness detection brings 15~40% speedup#1107
SergioMartin86 wants to merge 15 commits into
Squash of 12 commits (afb5c5a..wireless2-pre-rebase) carried forward over upstream/main (c4b0aac), resolving overlap with intervening upstream changes.
Preserves all optimizations and simplifications from this branch:
* 73e23bd Stripping all unnecessary stuff
* be89bbe Reformatting
* 0340ec8 Simplifying and moving cpp functions into their h files
* 6fba249 More simplifications
* 91f7157 more simplifications
* c569f34 Removing spill storage
* 7af17f9 Polling readiness: replace fanout-chain wiring with pending-list polling
* 1ab69fb0 Collapse multi-ring layout to a single ring
Conflict resolutions against upstream:
* pto_runtime2_types.h: drop the hard-coded 256B scalar-region static assert (upstream hw-native-sys#1056 lowered MAX_SCALAR_ARGS to 16, making it 128B). The assert is now an identity expressed in terms of MAX_SCALAR_ARGS.
* pto_orchestrator.h: drop the local extern decl of set_dump_tensor_task_mask; upstream's tensor_dump_aicpu.h now declares it with a different signature (TensorDumpArgMask).
* scheduler_types.h: PLATFORM_MAX_IDLE_ITERATIONS was removed upstream (a5 uses a fixed STALL_LOG_INTERVAL); match that approach. Also switch SCHEDULER_TIMEOUT_MS to use PLATFORM_SCHEDULER_TIMEOUT_MS.
* runtime.h: add device_memset hook to HostApi (upstream platform code now populates it; matches the a5 HostApi shape).
Validated post-rebase on a2a3 onboard:
* Case4 paged-attention: trimmed device avg ~1362 us (matches pre-rebase Step 1 baseline ~1365).
* Case1 paged-attention: device avg ~28801 us/round over 10 rounds (matches pre-rebase ~28172).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Drop the per-completion fanout_refcount notification from consumer tasks to their fanin producers. Each ring now carries a single monotonic completed_watermark — the highest local_id W such that every task 0..W has reached COMPLETED. On submit, the orchestrator stamps each producer's last_consumer_local_id with max(prev, self) (single-writer, plain int32_t). On completion, the scheduler CAS-advances the watermark forward through consecutive COMPLETED slots up to its own id, then retires tail slots whose last_consumer_local_id is at or below the watermark. Removes fanout_count/fanout_refcount, the CONSUMED state, on_task_release, release_producer, check_and_handle_consumed, on_scope_end's release loop, and the deferred_release_slot_states buffer threaded through complete_slot_task / check_running_cores_for_completion / poll_and_complete. Case4 trimmed device avg: 1360 us. Case1 trimmed device avg: 28286 us (vs rebased baseline ~28801 us).
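The watermark scheme above can be sketched roughly as follows. This is a minimal single-ring model, not the PR's code: `RingSketch`, `stamp_consumer`, `complete` and `reclaim` are hypothetical names, task ids index the arrays directly (no window wraparound), and the advance loop runs as far as the consecutive COMPLETED run reaches rather than stopping at the completer's own id.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical single-ring model of watermark-based reclamation.
struct RingSketch {
    explicit RingSketch(int32_t n) : completed(n), last_consumer(n, -1) {
        for (auto &c : completed) c.store(0, std::memory_order_relaxed);
    }

    // Orchestrator side (single writer): consumer `self` depends on `producer`.
    void stamp_consumer(int32_t producer, int32_t self) {
        if (last_consumer[producer] < self) last_consumer[producer] = self;
    }

    // Scheduler side: mark `id` COMPLETED, then CAS the watermark forward
    // through every consecutive completed task.
    void complete(int32_t id) {
        assert(id >= 0 && id < static_cast<int32_t>(completed.size()));
        completed[id].store(1, std::memory_order_release);
        int32_t w = completed_watermark.load(std::memory_order_acquire);
        while (w + 1 < static_cast<int32_t>(completed.size()) &&
               completed[w + 1].load(std::memory_order_acquire) != 0) {
            // On failure `w` is refreshed with the current watermark and the
            // loop condition is re-evaluated against it.
            if (completed_watermark.compare_exchange_weak(w, w + 1, std::memory_order_acq_rel)) ++w;
        }
    }

    // Retire tail slots that are complete and whose last consumer is at or
    // below the watermark. Returns how many slots were retired.
    int32_t reclaim() {
        const int32_t w = completed_watermark.load(std::memory_order_acquire);
        int32_t retired = 0;
        while (tail <= w && last_consumer[tail] <= w) {
            ++tail;
            ++retired;
        }
        return retired;
    }

    std::vector<std::atomic<uint8_t>> completed;   // 1 once task i is COMPLETED
    std::vector<int32_t> last_consumer;            // -1 means "no consumer"
    std::atomic<int32_t> completed_watermark{-1};  // highest W with 0..W all COMPLETED
    int32_t tail = 0;                              // oldest slot not yet retired
};
```

In this model a producer's slot stays live until its highest-numbered consumer has itself completed, which is what lets the per-consumer release notifications disappear.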
Replace the per-fanin pointer chase to producer slot_state.task_state with a byte read from a contiguous per-ring completion_flags array indexed by producer local_id & task_window_mask. Each task carries fanin_local_ids[] (4B per id) in place of fanin_slot_states[] (8B per pointer), and the completer writes a single byte instead of publishing through a 128B-aligned slot. For Case1's working set (16384 slots), the flag array is 16KB and fits L1. Thread 0's fanin_satisfied polling now condenses 16 fanin checks into 1-2 cache lines instead of one per producer slot. The orchestrator clears the new slot's byte in prepare_task before the wiring-queue push (release) makes it visible to thread 0; reset happens single-threaded so no atomic is needed. The completer's set uses release ordering to publish the producer's output writes to acquire-loading consumers. Case4 trimmed device avg: 1308 us (was 1360). Case1 trimmed device avg: 28047 us (was 28286); trimmed host avg: 292834 us (was 453591).
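A rough sketch of the flag array and the fanin check described above. `CompletionFlags` and `count_unmet` are hypothetical names, and the clear uses a relaxed atomic store here for simplicity, whereas the commit describes a plain single-threaded write. With one byte per slot, a 16384-slot window gives the 16KB array mentioned in the message.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-ring completion byte array indexed by local_id & mask.
struct CompletionFlags {
    explicit CompletionFlags(uint32_t window_size) : mask(window_size - 1), flags(window_size) {
        // The window must be a power of two for the mask indexing to work.
        assert(window_size != 0 && (window_size & (window_size - 1)) == 0);
        for (auto &f : flags) f.store(0, std::memory_order_relaxed);
    }

    // Orchestrator: clear the slot's byte before the task is published.
    void clear_for_reuse(int32_t local_id) {
        flags[static_cast<uint32_t>(local_id) & mask].store(0, std::memory_order_relaxed);
    }

    // Completer: release-store so the producer's outputs are visible to
    // any consumer that acquire-loads the flag.
    void mark_completed(int32_t local_id) {
        flags[static_cast<uint32_t>(local_id) & mask].store(1, std::memory_order_release);
    }

    bool is_completed(int32_t local_id) const {
        return flags[static_cast<uint32_t>(local_id) & mask].load(std::memory_order_acquire) != 0;
    }

    uint32_t mask;
    std::vector<std::atomic<uint8_t>> flags;
};

// Number of producers in fanin_local_ids[] that have not completed yet.
inline int32_t count_unmet(const CompletionFlags &cf, const int32_t *fanin_local_ids, int32_t fanin_count) {
    int32_t unmet = 0;
    for (int32_t i = 0; i < fanin_count; ++i) {
        if (!cf.is_completed(fanin_local_ids[i])) ++unmet;
    }
    return unmet;
}
```

Because producers submitted close together have neighbouring ids, their flag bytes tend to share a cache line, which is the locality the commit message relies on.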
Replace the intrusive next_pending pointer in PTO2TaskSlotState with a thread-0-private circular FIFO of slot pointers, sized to the per-ring task window (PTO2_TASK_WINDOW_SIZE) and allocated from the scheduler arena. Same memory budget (was 8B per slot × window_size; now one contiguous buffer of the same total size), but keeps scheduler-private linkage out of the task struct. Push/pop become array writes/reads at head_idx/tail_idx & mask. The buffer's cache lines amortize across 64 entries per line, matching the hit rate the old design got from co-locating next_pending with the slot_state cache line that fanin_satisfied already loaded. Case4 trimmed device avg: 1319 us (was 1308 us). Case1 trimmed device avg: 28080 us (was 28047 us). Differences are within shared-box noise.
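The thread-private FIFO could look roughly like the sketch below. `PendingFifo` is a hypothetical name, and the buffer is a `std::vector` here rather than an allocation from the scheduler arena. It is single-threaded by design, matching the thread-0-private use described above.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical single-threaded circular FIFO of slot pointers.
template <typename Slot>
class PendingFifo {
public:
    explicit PendingFifo(uint32_t capacity) : buf_(capacity, nullptr), mask_(capacity - 1) {
        // Power-of-two capacity so that index wrapping is a single AND.
        assert(capacity != 0 && (capacity & (capacity - 1)) == 0);
    }

    // Returns false when the FIFO is full.
    bool push(Slot *slot) {
        if (tail_ - head_ == buf_.size()) return false;
        buf_[tail_++ & mask_] = slot;
        return true;
    }

    // Returns nullptr when the FIFO is empty.
    Slot *pop() {
        if (head_ == tail_) return nullptr;
        return buf_[head_++ & mask_];
    }

    uint32_t size() const { return tail_ - head_; }

private:
    std::vector<Slot *> buf_;
    uint32_t mask_;
    uint32_t head_ = 0;  // next index to pop (monotonic, wraps via mask)
    uint32_t tail_ = 0;  // next index to push (monotonic, wraps via mask)
};
```

Sizing the buffer to the task window means a push can never fail in practice, since at most one entry per live slot can be pending.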
Add SchedulerThreadProfile (per-phase cumulative cycles + entry counts)
and instrument the main loop to attribute time to:
- completion check
- async wait poll
- drain_wiring_queue (split into SPSC drain vs pending FIFO poll)
- dummy ready-queue drain
- dispatch_ready_tasks
- idle spin
Dump via LOG_INFO_V9 once per resolve_and_dispatch exit so the hot path
only accumulates cycle counters. Output is tagged CLAUDE_PROFILING and
written to ${HOME}/ascend/log/debug/; pull it with
cat /root/ascend/log/debug/*/* | grep CLAUDE_PROFILING
Used to identify thread 0's pending FIFO fanin polling as the
dominant cost in Case1 (54% of round time) — the data-driven basis
for the wake-list optimization that follows.
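For readers who want to reproduce this kind of attribution, a minimal sketch of per-phase accumulation is shown here. It is not the PR's SchedulerThreadProfile: `ProfileSketch`, `PhaseScope` and the `Phase` enumerators are hypothetical, and `std::chrono::steady_clock` stands in for whatever cycle counter the device code reads.

```cpp
#include <array>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>

// Hypothetical phase list, following the bullets in the commit message.
enum class Phase : std::size_t {
    CompletionCheck,
    AsyncWaitPoll,
    WiringSpscDrain,
    PendingFifoPoll,
    DummyReadyDrain,
    DispatchReady,
    IdleSpin,
    Count
};
constexpr std::size_t kPhaseCount = static_cast<std::size_t>(Phase::Count);

// Cumulative time and entry count per phase; cheap enough for a hot loop.
struct ProfileSketch {
    std::array<uint64_t, kPhaseCount> ticks{};
    std::array<uint64_t, kPhaseCount> entries{};
};

// RAII scope: adds the elapsed time to one phase when it goes out of scope.
class PhaseScope {
    using Clock = std::chrono::steady_clock;

public:
    PhaseScope(ProfileSketch &p, Phase ph) : p_(p), idx_(static_cast<std::size_t>(ph)), start_(Clock::now()) {
        assert(idx_ < kPhaseCount);  // Phase::Count is not a real phase
    }
    ~PhaseScope() {
        const auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(Clock::now() - start_).count();
        p_.ticks[idx_] += static_cast<uint64_t>(ns);
        p_.entries[idx_] += 1;
    }
    PhaseScope(const PhaseScope &) = delete;
    PhaseScope &operator=(const PhaseScope &) = delete;

private:
    ProfileSketch &p_;
    std::size_t idx_;
    Clock::time_point start_;
};
```

Wrapping each phase of the main loop in a `PhaseScope` and dumping the arrays once at exit keeps formatting and logging off the hot path, as the commit message describes.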
Replace the pure-polling pending-FIFO loop with a hybrid:
- 0 unmet fanins → push to ready_queues (unchanged)
- exactly 1 unmet → register the consumer on that producer's wake list
and remove from FIFO (was: push back to FIFO)
- 2+ unmet → push back to FIFO for the next poll (unchanged)
Each producer slot gets a wake_list_head atomic pointer. Registration
is a CAS push onto the head. Completion does an atomic-exchange to a
SENTINEL (refusing further registrations) and pushes every waiter to
ready_queues. Slots reset wake_list_head to nullptr on reuse.
The intuition: most pending lifetime is spent waiting on the last
fanin to complete. The polling model re-walks every fanin on every
poll iteration even though only one byte changes. Wake-list registration
costs one CAS per task and zero further polls — the producer pushes the
waiter on completion. The submission-time variant of this idea ((f) in
the investigation) regressed because cross-thread cache traffic on the
orchestrator's hot path overwhelmed the savings; restricting wake-list
work to the scheduler-side keeps the writers on the same cache line.
Case1 (large workload, 65K tasks): -2.2% trimmed device time
(~28072 µs → ~27451 µs).
Case4 (small workload): +2.2% trimmed device time
(~1322 µs → ~1351 µs). The per-task atomic exchange overhead is not
amortized at this scale.
Profile shift on Case1 (thread 0):
drain_wiring_cycles 819K → 396K (-52%)
pending_poll_cycles 767K → 343K (-55%)
All threads run ~40% fewer main-loop iterations (denser per-iteration
work).
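The registration and completion protocol from this commit can be sketched as below. `Waiter`, `ProducerSlot`, `try_register` and `complete_and_wake` are hypothetical names and the memory orders are a conservative guess; only the CAS push, the exchange to a sentinel, and the reset to nullptr on reuse come from the message.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical waiter node; in the PR this role is played by the consumer slot.
struct Waiter {
    int id = 0;
    Waiter *next = nullptr;
};

// Unique address that marks "producer already completed".
inline Waiter *wake_sentinel() {
    static Waiter s;
    return &s;
}

struct ProducerSlot {
    std::atomic<Waiter *> wake_list_head{nullptr};
    void reset_for_reuse() { wake_list_head.store(nullptr, std::memory_order_relaxed); }
};

// CAS-push `w` onto the producer's wake list. Returns false if the producer
// has already completed, in which case the caller must treat the fanin as met.
inline bool try_register(ProducerSlot &p, Waiter *w) {
    assert(w != nullptr && w != wake_sentinel());
    Waiter *head = p.wake_list_head.load(std::memory_order_acquire);
    do {
        if (head == wake_sentinel()) return false;
        w->next = head;
    } while (!p.wake_list_head.compare_exchange_weak(head, w, std::memory_order_acq_rel,
                                                     std::memory_order_acquire));
    return true;
}

// Swap in the sentinel (refusing later registrations) and hand every
// registered waiter to `push_ready`. Returns the number of waiters woken.
template <typename PushReady>
int complete_and_wake(ProducerSlot &p, PushReady push_ready) {
    Waiter *w = p.wake_list_head.exchange(wake_sentinel(), std::memory_order_acq_rel);
    int woken = 0;
    while (w != nullptr && w != wake_sentinel()) {
        Waiter *next = w->next;  // read before push_ready may recycle the node
        push_ready(w);
        ++woken;
        w = next;
    }
    return woken;
}
```

A failed registration is the race the sentinel exists for: the producer finished between the fanin check and the CAS, so the consumer is ready rather than lost.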
Break down the completion phase further: separate complete_slot_task body time from the per-iter cond_ptr-read + transition-decide overhead, plus a count of cores scanned per iter. Lets future investigations see which sub-phase actually dominates compl_cyc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
(m) PTO2TaskSlotState::task_state was a redundant completion signal — completion_flags already records the same transition with the right memory ordering. Drop the atomic release store on the completion path, switch the watermark CAS-advance loop and the wait/stall-dump readers to consult completion_flags directly. Saves one atomic store per task. (q) In complete_slot_task, read deferred_slab->count before deferred_slab->error_code. Kernels that don't register async conditions leave count at 0 (the dispatch-time reset value), so checking count first lets the common path skip the error_code load + branch and the condition-forwarding loop. Each change is neutral on Case1 in isolation (within ±50 µs run-to-run variance over 80-round trimmed avgs), but both clean up redundant work on the completion hot path. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous commit dropped the producer-side .store(COMPLETED) — the field had no remaining writers on the hot path. Remove the field itself, the orchestrator's no-longer-needed PENDING-init at submit time, and the SCALAR_DATA_ACCESS / MULTI_RING doc snippets that still spelled the spin-wait and watermark-walk in terms of task_state. completion_flags is now the sole completion signal in a2a3. The a2a3 test_task_state.cpp UT was a leftover copy of the a5 version — it #includes "scheduler/pto_scheduler.h" (an a5-only path) and calls release_fanin_and_check_ready / release_producer methods that don't exist in the a2a3 scheduler. It never compiled against a2a3; remove it and the matching CMakeLists entry. Note: RUNTIME_LOGIC.md sections 6.2 / 7.3 / 8.2 / 8.4 still describe a much older fanout_lock + CONSUMED state architecture that no longer exists in the codebase. That cleanup is out of scope here — flagged for a follow-up doc pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Walk the recently-touched scheduler / orchestrator surface for unused
parameters and dead state, and drop what no caller or body actually
exercises:
- on_mixed_task_complete / complete_slot_task / check_running_cores_for_completion:
drop the threaded-through `local_bufs` argument (none of these bodies
read it anymore — it was a leftover from the (g)/(g') wake-list-via-
local-bufs variants that didn't ship). Also drops `local_bufs` from
AsyncWaitList::poll_and_complete and the DrainCompletionSink field.
- check_running_cores_for_completion / complete_slot_task: drop the
`Handshake *hank` argument (only forwarded, never read). The local
`hank` in resolve_and_dispatch's loop scope is dropped with it.
- dispatch_shape / dispatch_ready_tasks: drop the `bool &try_pushed`
out-param chain. Set deep inside dispatch_shape but the only
consumer in resolve_and_dispatch was a (void) suppression.
- pop_ready_tasks_batch: drop the unused `thread_idx` argument.
- log_stall_diagnostics: drop the [[maybe_unused]] `task_count`.
- log_shutdown_stall_snapshot + handle_timeout_exit: drop the
[[maybe_unused]] `trigger_idle_iterations` / `trigger_last_progress_count`
and the matching unused `idle_iterations` / `last_progress_count` on
the timeout-exit caller.
- handle_orchestrator_exit: drop the `int32_t &task_count` out-param —
the caller's only use was a `if (...task_count > 0) { if (...) {} }`
with an empty inner body. Read total_tasks_ directly instead.
- resolve_and_dispatch loop: drop the now-dead `task_count` and
`last_progress_count` locals (and the three write-only updates to
the latter); inline the `try_completed = ...; if (try_completed)`
pattern into a single `if`.
- PTO2SchedulerState::print_stats / print_queues: empty no-op stubs,
never called — remove (along with the cold-path API comment that
pointed at them).
- PTO2TensorMap::print_stats: 45-line stat-collection function whose
output goes nowhere (the per-ring loop body is also empty) — remove.
- orch_report_fatal_v: drop the dead vsnprintf-into-a-buffer-then-
discard block; just latch the error code via orch_mark_fatal. The
fmt + va_list params are kept (unnamed) since callers pass them and
the wider rt_report_fatal -> orchestrator.report_fatal -> v API
surface is symmetric for a future logging-sink hookup.
Build is clean, Case4 and Case1 pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
c3f74c7 (the foundational wireless2 collapse) dropped the log_info_v ops pointer and the LOG_INFO_V0..V9 macros from pto_orchestration_api.h as part of its general cleanup. That left any orchestration .cpp that called LOG_INFO_V<n> without a "#ifdef ENABLE_PROFILING" guard failing to compile: paged_attention_manual_scope and benchmark_bgemm both hit "'LOG_INFO_V9' was not declared in this scope" against current header state.
Restore the surface:
- Add log_info_v function pointer to both copies of PTO2RuntimeOps (the runtime-local one in pto_runtime2.h and the orchestration-facing mirror in pto_orchestration_api.h; keep them in sync).
- Add LOG_INFO_V0..V9 macros at the end of pto_orchestration_api.h that route through current_runtime()->ops->log_info_v.
- Implement rt_log_info_v in pto_runtime2.h: format the message with vsnprintf and forward to unified_log_info_v, which already owns the runtime verbosity gate.
- Wire rt_log_info_v into s_runtime_ops.
paged_attention_manual_scope Case1 and benchmark_bgemm Case0 now build and run; paged_attention Case4 still passes (no regression on runtime hot path).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
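The shape of that restored logging path might look like the sketch below. Every name and signature here is an assumption made for illustration (`RuntimeOpsSketch`, `rt_log_info_v_sketch`, `sink_log_info_v`, `LOG_INFO_V_SKETCH`, the verbosity value); the real PTO2RuntimeOps layout and unified_log_info_v signature are not shown in this PR text.

```cpp
#include <cstdarg>
#include <cstdio>
#include <cstring>

// Hypothetical stand-in for unified_log_info_v: it owns the verbosity gate
// and, for this sketch, just records the last message it accepted.
static char g_last_log[256] = "";
static int g_verbosity = 5;  // assumed gate value, not from the PR

static void sink_log_info_v(int level, const char *msg) {
    if (level > g_verbosity) return;  // gated out
    std::strncpy(g_last_log, msg, sizeof(g_last_log) - 1);
    g_last_log[sizeof(g_last_log) - 1] = '\0';
}

// Formats with vsnprintf, then forwards the finished string to the sink.
static void rt_log_info_v_sketch(int level, const char *fmt, ...) {
    char buf[256];
    va_list ap;
    va_start(ap, fmt);
    std::vsnprintf(buf, sizeof(buf), fmt, ap);
    va_end(ap);
    sink_log_info_v(level, buf);
}

// Hypothetical ops table: orchestration code only sees the function pointer.
struct RuntimeOpsSketch {
    void (*log_info_v)(int level, const char *fmt, ...);
};
static RuntimeOpsSketch s_ops_sketch{rt_log_info_v_sketch};
static RuntimeOpsSketch *current_ops_sketch() { return &s_ops_sketch; }

// Macro surface routed through the ops pointer, in the spirit of LOG_INFO_V<n>.
#define LOG_INFO_V_SKETCH(level, ...) current_ops_sketch()->log_info_v((level), __VA_ARGS__)
```

Routing through a pointer in the ops table is what lets a separately loaded orchestration library log without linking against the runtime's logging symbols directly.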
Code Review
This pull request introduces a significant refactoring of the runtime, orchestrator, and scheduler, consolidating several source files into headers, simplifying the ring buffer layout, and removing verbose logging. The review feedback identifies a critical deadlock bug in the CPU affinity gating due to a potential mask overflow on systems with 16 or more cores, and a high-severity data race on last_consumer_local_id that should be resolved using atomic operations. Additionally, the feedback highlights a potential null pointer dereference in the tensor map, wasted CPU cycles in statistics printing, a performance regression from removing the read-only tensor copy-back optimization, and the loss of critical troubleshooting logs for dynamic loading failures.
| static bool s_thread_survive[MAX_GATE_THREADS]; | ||
|
|
||
| static inline int32_t popcount64(uint64_t v) { return __builtin_popcountll(static_cast<unsigned long long>(v)); } | ||
| static std::atomic<uint16_t> g_cpumask{0}; |
Using a uint16_t for g_cpumask limits the maximum CPU ID to 15. On systems with 16 or more CPU cores, or if the assigned CPU IDs are >= 16, the expression 1 << cpu on line 47 will overflow, resulting in a 0 mask when cast back to uint16_t. This prevents g_cpumask from accumulating the correct number of active threads, causing the barrier loop on line 50 to spin infinitely (deadlock). Please use uint64_t or std::atomic<uint64_t> instead.
| static std::atomic<uint16_t> g_cpumask{0}; | |
| static std::atomic<uint64_t> g_cpumask{0}; |
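The truncation the reviewer describes can be checked at compile time. This is a small standalone illustration with hypothetical helper names, not code from the PR. Note that widening the atomic alone is not enough: the shifted literal must also be 64-bit, since `1 << cpu` on a 32-bit `int` is undefined for `cpu >= 31`.

```cpp
#include <cstdint>

// Bit contributed by `cpu` when the mask is narrowed to 16 bits.
constexpr uint16_t mask_bit_u16(int cpu) { return static_cast<uint16_t>(1u << cpu); }

// Same bit with a 64-bit mask and a 64-bit shifted literal.
constexpr uint64_t mask_bit_u64(int cpu) { return uint64_t{1} << cpu; }

static_assert(mask_bit_u16(15) == 0x8000, "CPU 15 is the last id that fits in 16 bits");
static_assert(mask_bit_u16(16) == 0, "CPU 16 contributes no bit, so the barrier never sees it");
static_assert(mask_bit_u64(16) == 0x10000, "the 64-bit mask keeps the bit");
static_assert(mask_bit_u64(63) == 0x8000000000000000ull, "ids up to 63 are representable");
```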
| // safe to reclaim when the per-ring completed_watermark reaches at least | ||
| // this id (i.e. every task up to and including the last consumer has | ||
| // transitioned to COMPLETED). Single-writer (orchestrator) at submit time. | ||
| int32_t last_consumer_local_id; |
The last_consumer_local_id field is written by the orchestrator thread in submit_task_common and read concurrently by the scheduler thread in advance_ring_pointers without any synchronization. This constitutes a data race, which is Undefined Behavior in C++ and can lead to compiler optimizations serving stale values or causing infinite loops. Please make this field atomic (e.g., std::atomic<int32_t>) and update its accesses using relaxed memory order (e.g., load(std::memory_order_relaxed) and store(..., std::memory_order_relaxed)) in pto_orchestrator.h (lines 484, 584) and pto_scheduler.h (line 394).
| int32_t last_consumer_local_id; | |
| std::atomic<int32_t> last_consumer_local_id; |
References
- In self-correcting re-polling loops, using std::memory_order_relaxed for atomic loads of a single variable is sufficient and avoids the overhead of std::memory_order_acquire.
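A sketch of what the suggested change amounts to at the access sites. `SlotSketch`, `stamp_last_consumer` and `can_retire` are hypothetical names. Whether relaxed ordering is sufficient depends on the surrounding synchronization in the real scheduler, which this fragment does not model.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical slot carrying only the field under discussion.
struct SlotSketch {
    std::atomic<int32_t> last_consumer_local_id{-1};
};

// Orchestrator (single writer): keep the maximum consumer id seen so far.
// A plain load followed by a store is enough because no other thread writes.
inline void stamp_last_consumer(SlotSketch &s, int32_t self) {
    assert(self >= 0);
    const int32_t prev = s.last_consumer_local_id.load(std::memory_order_relaxed);
    if (self > prev) s.last_consumer_local_id.store(self, std::memory_order_relaxed);
}

// Scheduler (reader): the slot may retire once the watermark has reached
// its last consumer. A stale read only delays retirement to a later poll.
inline bool can_retire(const SlotSketch &s, int32_t watermark) {
    return s.last_consumer_local_id.load(std::memory_order_relaxed) <= watermark;
}
```

On common targets a relaxed load or store of an aligned `int32_t` typically compiles to an ordinary memory access, so the change mainly removes the data race from the language's point of view.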
| PTO2TensorMapEntry *entry = new_entry(); | ||
| entry->copy_from_tensor(tensor); |
If the TensorMap entry pool is exhausted, new_entry() will return nullptr. Calling entry->copy_from_tensor(tensor) directly without a null check will result in a segmentation fault and crash the orchestrator. Please add a null check and report a fatal error or use always_assert to prevent a segmentation fault.
| PTO2TensorMapEntry *entry = new_entry(); | |
| entry->copy_from_tensor(tensor); | |
| PTO2TensorMapEntry *entry = new_entry(); | |
| always_assert(entry != nullptr); | |
| entry->copy_from_tensor(tensor); |
| void print_stats() | ||
| { | ||
| int32_t valid = 0; | ||
| int32_t stale = 0; | ||
| int32_t empty_buckets = 0; | ||
| int32_t max_chain = 0; | ||
| int64_t total_chain = 0; | ||
| int32_t non_empty_buckets = 0; | ||
|
|
||
| // Count entries | ||
| for (int32_t i = 0; i < pool_size; i++) | ||
| { | ||
| if (entry_pool[i].bucket_index != -1) | ||
| { | ||
| if (entry_valid(entry_pool[i])) valid++; | ||
| else stale++; | ||
| } | ||
| } | ||
|
|
||
| #if PTO2_TENSORMAP_PROFILING | ||
| struct PTO2TensorMapProfilingData { | ||
| uint64_t lookup_chain_total; | ||
| uint64_t lookup_count; | ||
| int32_t lookup_chain_max; | ||
| uint64_t overlap_checks; | ||
| uint64_t overlap_hits; | ||
| uint64_t insert_count; | ||
| }; | ||
| // Count bucket stats | ||
| for (int32_t b = 0; b < num_buckets; b++) | ||
| { | ||
| int32_t chain_len = 0; | ||
| auto cur_entry = buckets[b]; | ||
|
|
||
| while (cur_entry != nullptr) | ||
| { | ||
| chain_len++; | ||
| cur_entry = cur_entry->next_in_bucket; | ||
| } | ||
|
|
||
| if (chain_len == 0) | ||
| { | ||
| empty_buckets++; | ||
| } | ||
| else | ||
| { | ||
| non_empty_buckets++; | ||
| total_chain += chain_len; | ||
| if (chain_len > max_chain) max_chain = chain_len; | ||
| } | ||
| } | ||
|
|
||
| for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) | ||
| {} | ||
| } |
| // copying back. | ||
| bool needs_copy_back = !(signature != nullptr && i < sig_count && signature[i] == ArgDirection::IN); | ||
| runtime->tensor_pairs_.push_back({host_ptr, dev_ptr, size, needs_copy_back}); | ||
| runtime->tensor_pairs_.push_back({host_ptr, dev_ptr, size}); |
Removing the optimization to skip copy-back for read-only INPUT tensors (such as model weights) introduces significant and unnecessary D2H transfer overhead. Please ensure that INPUT tensors are marked as child_memory to explicitly skip the D2H copy-back, while ensuring OUTPUT tensors are not marked as child_memory since they require copy-back.
References
- By design, `child_memory` is only used for INPUT tensors. OUTPUT tensors must not be marked as `child_memory` because they require a D2H copy-back, which `child_memory` explicitly skips.
| if (handle == nullptr) | ||
| { |
Removing the error log when dlopen fails makes troubleshooting extremely difficult, as the executor will silently return -1 without any indication of why the orchestration SO could not be loaded. Please restore the LOG_ERROR call to print the dlerror() message.
| if (handle == nullptr) | |
| { | |
| if (handle == nullptr) | |
| { | |
| LOG_ERROR("Thread %d: dlopen failed: %s", thread_idx, dlerror()); |
| if (entry_dlsym_error != nullptr) | ||
| { |
Removing the error log when dlsym fails for the entry symbol makes troubleshooting extremely difficult, as the executor will silently return -1 without any indication of why the symbol could not be resolved. Please restore the LOG_ERROR call to print the entry_dlsym_error message.
| if (entry_dlsym_error != nullptr) | |
| { | |
| if (entry_dlsym_error != nullptr) | |
| { | |
| LOG_ERROR("Thread %d: dlsym failed for entry symbol '%s': %s", thread_idx, entry_symbol, entry_dlsym_error); |
Squash-merge of wireless2 (c4b0aac + 11 commits) onto current upstream/main (83728d2). Per-commit replay was not viable: upstream added speculative early-dispatch (hw-native-sys#1079) which touches the same data structures wireless2 redesigned, and refactored TaskArgs / Tensor along with several module collapses that fundamentally diverge from wireless2's earlier collapse-and-poll redesign.
Resolution strategy:
- Modify/delete (8 paths): accept wireless2's deletion. The `scheduler/*` and `shared/*` directories were collapsed into header-only modules in wireless2 (c3f74c7); upstream kept modifying them. We keep the collapse.
- Pure upstream additions (DumpArgSelection / strided TaskArgs / Tensor refactor, AICore receive_time / swimlane, NUMA gate, lookup profiling externs, MIX classification fix, prefetch helper, etc.): take upstream's version. Wireless2 wasn't redesigning these.
- Wireless architecture (completion_flags polling, fanin_local_ids[], wake-list, watermark reclamation, pending FIFO out-of-band): keep wireless2's design. fanin_local_ids[] is THE entry point for the polling loop.
- PTO2TaskPayload: keep wireless2's flat fanin_local_ids[] alongside upstream's fanin_inline_slot_states + spec-dispatch storage as a compatibility layer, so spec-dispatch code links. Both populated at submit; the wireless poller reads fanin_local_ids, spec dispatch reads its own fields. Long-term we'd dedupe, but the squash needs to compile first.
- pto_types.h and tensor.h: took upstream entire. The TaskArgs and Tensor refactor is large; wireless2 only had cosmetic conflicts here. Adapt wireless2 code paths to the new TaskArgs surface in a follow-up if any breakage surfaces.
The build is NOT yet verified by this commit; there will be follow-up fixes for code paths that referenced now-removed symbols (notably the orchestrator-side fanin builder, any direct fanin_refcount touch points, and the spec-dispatch release path that needs to consult completion_flags instead of fanin_refcount). This commit captures the merge resolution as a stable starting point; verification + adaptation commits land next.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two fixes after the rebase commit:
1. pto_runtime2_types.h: the PTO2TaskPayload compatibility layer for upstream spec-dispatch references PTO2FaninPool and PTO2_FANIN_INLINE_CAP. Upstream defines them in this same header but the merge dropped the lines. Restore: #define PTO2_FANIN_INLINE_CAP 64 and forward-declare struct PTO2FaninPool alongside PTO2_MAX_FANIN.
2. orchestration/common.cpp: assert_impl + AssertionError + the addr2line / backtrace machinery used to live inline in wireless2's runtime/common.h. Upstream moved the declarations to src/common/task_interface/assert_compat.h and expects the runtime target to provide the definitions in orchestration/common.cpp (a5 does so). Port a5's common.cpp into the a2a3 orchestration path. Sidestep the LOG_ERROR vs LOG_INFO_V macro conflict by not pulling common/unified_log.h (would re-#define LOG_INFO_V0..V9 already supplied by pto_orchestration_api.h) and using a local stderr-printing LOG_ERROR for the assert path.
paged_attention Case4 passes (1389 µs, 10 rounds). Case1 trimmed device avg = 30587 µs over 100 rounds: it works, but is ~11% slower than the same wireless2 stack on the c4b0aac baseline (27451 µs). The extra cost is likely overhead from coexisting with upstream's additions (spec-dispatch storage, profiling fields, etc.) that the wireless poller never reads but the orchestrator still populates. Investigation + tightening of the coexistence layer is a follow-up.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Root cause of the +11% Case1 / pa_manual_scope regression I measured on wireless3 yesterday.
When I merged wireless2 onto upstream/main I added a "compatibility layer" to PTO2TaskPayload: kept upstream's
    fanin_inline_slot_states[PTO2_FANIN_INLINE_CAP]   // 512 B
    fanin_actual_count, fanin_spill_start
    fanin_spill_pool*
    staged_core_mask[PTO2_SPEC_CORE_MASK_WORDS]       // 16 B
    dispatch_fanin, allow_early_resolve, spec_state,
    dispatch_propagated, spec_chain_active, spec_chain_depth
alongside the wireless model's flat fanin_local_ids[]. The intent was to give spec-dispatch's release path something to link against.
But the spec-dispatch implementation lived in scheduler/* and pto_orchestrator.cpp / pto_runtime2.cpp, files we deleted as part of the wireless directory collapse. After the merge nothing in the tree actually reads/writes any of those fields (verified by grep). So: ~560 bytes of dead per-payload storage. With 65K tasks per Case1 round that's ~36 MB of cache thrash per round even though the wireless poller never touches the bytes. Bench confirmed: the regression was workload-size-correlated and only hit the biggest workloads (Case1, pa_manual_scope Case1/2).
Remove:
- fanin_inline_slot_states, fanin_spill_pool, fanin_*_count|start
- staged_core_mask, dispatch_fanin, allow_early_resolve, spec_state, dispatch_propagated, spec_chain_active, spec_chain_depth
- PTO2SpecState enum and PTO2_SPEC_CORE_MASK_WORDS constant
- PTO2_FANIN_INLINE_CAP define and PTO2FaninPool fwd decl
- The init() block that zeroed those fields
- The +512 prefetch in prefetch() that targeted them
- A reset_for_reuse comment referring to them
Bench post-fix (wireless3 vs wireless2 on bench_baseline):
    paged_attention Case1               27919 vs 27692  (+0.8% wash)
    paged_attention Case4                1134 vs  1382  (−18%)
    paged_attention CaseSmall1            302 vs   650  (−54%)
    pa_unroll_manual_scope Case1         1626 vs  1883  (−14%)
    pa_unroll_manual_scope Case2         1016 vs  1272  (−20%)
    paged_attention_manual_scope Case1  25249 vs 24933  (+1.3% wash)
    paged_attention_manual_scope Case2  13382 vs 13109  (+2.1% wash)
    benchmark_bgemm Case0                1038 vs  1274  (−19%)
The three heavy cases are within run-to-run noise of wireless2; every other case is significantly faster (smaller workloads benefit from upstream's improvements between c4b0aac and current main).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
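The "~36 MB per round" figure in this commit follows from simple arithmetic, reproduced here as compile-time checks. The constant names are hypothetical, and "65K tasks" is taken to mean 64 * 1024, which is an assumption.

```cpp
#include <cstdint>

// Inline fanin array alone: PTO2_FANIN_INLINE_CAP (64) pointers of 8 bytes.
constexpr uint64_t kInlineCap = 64;
constexpr uint64_t kPointerBytes = 8;
static_assert(kInlineCap * kPointerBytes == 512, "matches the 512 B noted in the commit");

// Dead storage per payload as quoted in the commit ("~560 bytes").
constexpr uint64_t kDeadBytesPerPayload = 560;

// Assumption: "65K tasks" per Case1 round means 64 * 1024 tasks.
constexpr uint64_t kTasksPerRound = 64 * 1024;

constexpr uint64_t kDeadBytesPerRound = kDeadBytesPerPayload * kTasksPerRound;
static_assert(kDeadBytesPerRound == 36700160, "about 36.7 MB, i.e. exactly 35 MiB");
```

The footprint grows linearly with task count, which is consistent with the regression only showing up on the largest workloads.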
wireless3 was the post-rebase form of the wireless2 work — same set of optimizations re-applied on top of current upstream/main plus the spec-dispatch-coexistence cleanup that recovered the rebase's per-task overhead. Bringing it back into wireless2 so future work continues on a single perf branch riding current upstream. Resolution: take wireless3's tree exactly. The wireless2-side history predates wireless3 and is fully represented in wireless3's squashed "Rebase wireless2 stack onto upstream/main (squashed)" commit, so a content-level merge would just re-do the same conflict resolution we already settled. Recording the merge as a no-conflict two-parent commit preserves the history linkage without re-litigating it. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Closing, as this change has been refined into a much less invasive PR:
I am posting this as a draft PR because it has diverged so much from main that it would be too disruptive to propose as a normal PR.
Nevertheless, this is proof that there is high potential for speedups well above 10% if we replace the current "wiring" strategy for tracking dependencies with a polling-based approach. The gist:
Here are my experimental results across several tests, comparing this PR (branch: wireless2) with upstream/main @ c4b0aac (Fix: monotonic scope_stats heap accounting for multi-wrap scopes (#996) (#1031) Date: 2026-06-17)
Wireless2 is faster on every case, by 16% to 48%. Geo-mean speedup of 30%. These are all speedups on Device Time (host time is not considered, as it's outside the scope of our investigation)
Noah Baumann (@noabauma) is currently working on a minimal patch to apply these optimizations to main without too many disruptive changes. Nevertheless, I thought you'd like to have access to this PR to test it yourselves.
Adding an AI-generated summary of the optimizations, which may help in replicating them:
wireless-architecture.md