
[Performance] Wireless (Polling-based) task readiness detection brings 15~40% speedup#1107

Closed
SergioMartin86 wants to merge 15 commits into
hw-native-sys:mainfrom
huawei-csl:wireless2


@SergioMartin86 SergioMartin86 commented Jun 22, 2026


I am posting this as a draft PR because it has diverged so far from main that proposing it as a normal PR would be too disruptive.

Nevertheless, it shows there is high potential for speedups well above 10% if we replace the current "wiring" strategy for tracking dependencies with a polling-based approach. The gist:

  • On adding a task, only remember the indexes of the tasks upon which the task depends
  • To detect whether the task is ready to go, the scheduler "constantly" polls a common shared array (initially full of zeros). If all its dependencies are 1, then the task is ready to go
  • If the task only has one dependency left, assign it as a "notify" task to that producer, so polling is no longer needed
  • Consumers do not notify producers, whose memory is freed up regularly (not on notification).
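
The first two points of the gist can be sketched roughly as below. This is a minimal illustration, not the code in this branch: `kWindow`, `g_done`, `PendingTask`, `mark_done` and `count_unmet` are hypothetical names, and the fixed window size is an assumption made only to keep the example short.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical window size; the real runtime sizes this per ring.
constexpr std::size_t kWindow = 1024;

// Shared completion array: 0 = not finished, 1 = finished.
static std::array<std::atomic<uint8_t>, kWindow> g_done{};

// A task only remembers the indexes of the tasks it depends on.
struct PendingTask {
    std::vector<int32_t> deps;
};

// Producer side: publish completion with release ordering.
inline void mark_done(int32_t id) {
    assert(id >= 0 && static_cast<std::size_t>(id) < kWindow);
    g_done[static_cast<std::size_t>(id)].store(1, std::memory_order_release);
}

// Scheduler side: count the dependencies whose entry is still 0.
// 0 unmet means the task is ready; 1 unmet is the "notify" candidate.
inline int count_unmet(const PendingTask &t) {
    int unmet = 0;
    for (int32_t d : t.deps) {
        if (g_done[static_cast<std::size_t>(d)].load(std::memory_order_acquire) == 0) {
            ++unmet;
        }
    }
    return unmet;
}
```

A scheduler loop would call `count_unmet` on each pending task and dispatch those that return 0.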

Here are my experimental results across several tests, comparing this PR (branch: wireless2) with upstream/main @ c4b0aac (Fix: monotonic scope_stats heap accounting for multi-wrap scopes (#996) (#1031) Date: 2026-06-17)

[Image: benchmark results comparing wireless2 with upstream/main]

Wireless2 is faster in every case, by 16% to 48%, with a geometric-mean speedup of 30%. All figures are device-time speedups (host time is not considered, as it is outside the scope of our investigation).

Noah Baumann (@noabauma) is currently working on a minimal patch to apply these optimizations to main without too many disruptive changes. Nevertheless, I thought you'd like to have access to this PR to test it yourselves.

Adding an AI-generated summary of the optimizations, which may help in replicating them:

wireless-architecture.md

SergioMartin86 and others added 11 commits June 17, 2026 10:16
Squash of 12 commits (afb5c5a..wireless2-pre-rebase) carried forward over
upstream/main (c4b0aac), resolving overlap with intervening upstream
changes. Preserves all optimizations and simplifications from this branch:

  * 73e23bd Stripping all unnecessary stuff
  * be89bbe Reformatting
  * 0340ec8 Simplifying and moving cpp functions into their h files
  * 6fba249 More simplifications
  * 91f7157 more simplifications
  * c569f34 Removing spill storage
  * 7af17f9 Polling readiness: replace fanout-chain wiring with pending-list polling
  * 1ab69fb0 Collapse multi-ring layout to a single ring

Conflict resolutions against upstream:
  * pto_runtime2_types.h: drop the hard-coded 256B scalar-region static
    assert (upstream hw-native-sys#1056 lowered MAX_SCALAR_ARGS to 16, making it 128B).
    The assert is now an identity expressed in terms of MAX_SCALAR_ARGS.
  * pto_orchestrator.h: drop the local extern decl of
    set_dump_tensor_task_mask — upstream's tensor_dump_aicpu.h now
    declares it with a different signature (TensorDumpArgMask).
  * scheduler_types.h: PLATFORM_MAX_IDLE_ITERATIONS was removed upstream
    (a5 uses a fixed STALL_LOG_INTERVAL); match that approach. Also
    switch SCHEDULER_TIMEOUT_MS to use PLATFORM_SCHEDULER_TIMEOUT_MS.
  * runtime.h: add device_memset hook to HostApi (upstream platform code
    now populates it; matches the a5 HostApi shape).

Validated post-rebase on a2a3 onboard:
  * Case4 paged-attention: trimmed device avg ~1362 us (matches pre-rebase
    Step 1 baseline ~1365).
  * Case1 paged-attention: device avg ~28801 us/round over 10 rounds
    (matches pre-rebase ~28172).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Drop the per-completion fanout_refcount notification from consumer tasks
to their fanin producers. Each ring now carries a single monotonic
completed_watermark — the highest local_id W such that every task 0..W
has reached COMPLETED. On submit, the orchestrator stamps each producer's
last_consumer_local_id with max(prev, self) (single-writer, plain
int32_t). On completion, the scheduler CAS-advances the watermark forward
through consecutive COMPLETED slots up to its own id, then retires tail
slots whose last_consumer_local_id is at or below the watermark.
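
As a rough illustration of the watermark scheme described above, here is a single-threaded model. It is a sketch under stated assumptions: `Slot`, `Ring`, `stamp` and `complete` are hypothetical names, the CAS and the ring wrap-around are omitted, and the watermark is advanced as far as it can go rather than only up to the completer's own id.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-task state for this sketch.
struct Slot {
    bool completed = false;
    int32_t last_consumer_local_id = -1;  // -1: no consumer recorded
};

struct Ring {
    std::vector<Slot> slots;
    int32_t completed_watermark = -1;  // highest W such that tasks 0..W are all completed
    int32_t tail = 0;                  // oldest slot not yet retired

    explicit Ring(int32_t n) : slots(static_cast<std::size_t>(n)) {}

    // Submit: consumer `self` stamps the producer with max(prev, self).
    void stamp(int32_t producer, int32_t self) {
        Slot &p = slots[static_cast<std::size_t>(producer)];
        if (self > p.last_consumer_local_id) p.last_consumer_local_id = self;
    }

    // Completion: advance the watermark through consecutive completed slots,
    // then retire tail slots whose last consumer is at or below the watermark.
    void complete(int32_t id) {
        slots[static_cast<std::size_t>(id)].completed = true;
        const int32_t n = static_cast<int32_t>(slots.size());
        while (completed_watermark + 1 < n &&
               slots[static_cast<std::size_t>(completed_watermark + 1)].completed) {
            ++completed_watermark;
        }
        while (tail <= completed_watermark &&
               slots[static_cast<std::size_t>(tail)].last_consumer_local_id <= completed_watermark) {
            ++tail;
        }
    }
};
```

The point of the design is that a consumer never writes back to its producers on completion; reclamation is driven only by the watermark and the stamp written at submit time.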

Removes fanout_count/fanout_refcount, the CONSUMED state, on_task_release,
release_producer, check_and_handle_consumed, on_scope_end's release loop,
and the deferred_release_slot_states buffer threaded through
complete_slot_task / check_running_cores_for_completion /
poll_and_complete.

Case4 trimmed device avg: 1360 us. Case1 trimmed device avg:
28286 us (vs rebased baseline ~28801 us).
Replace the per-fanin pointer chase to producer slot_state.task_state
with a byte read from a contiguous per-ring completion_flags array
indexed by producer local_id & task_window_mask. Each task carries
fanin_local_ids[] (4B per id) in place of fanin_slot_states[] (8B
per pointer), and the completer writes a single byte instead of
publishing through a 128B-aligned slot.

For Case1's working set (16384 slots), the flag array is 16KB and
fits L1. Thread 0's fanin_satisfied polling now condenses 16 fanin
checks into 1-2 cache lines instead of one per producer slot.

The orchestrator clears the new slot's byte in prepare_task before
the wiring-queue push (release) makes it visible to thread 0; reset
happens single-threaded so no atomic is needed. The completer's set
uses release ordering to publish the producer's output writes to
acquire-loading consumers.
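
A possible shape for the flag array, as a sketch only. The struct and method names (`CompletionFlags`, `reset`, `set`, `done`) are assumptions; `fanin_satisfied`, `completion_flags` and the `local_id & task_window_mask` indexing come from the message above. Unlike the commit, which resets with a plain write, this sketch uses a relaxed atomic store for the reset to stay within portable C++.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <memory>

struct CompletionFlags {
    uint32_t mask;  // task window size - 1 (window must be a power of two)
    std::unique_ptr<std::atomic<uint8_t>[]> flags;

    explicit CompletionFlags(uint32_t window)
        : mask(window - 1), flags(new std::atomic<uint8_t>[window]) {
        assert(window != 0 && (window & (window - 1)) == 0);
        for (uint32_t i = 0; i < window; ++i) flags[i].store(0, std::memory_order_relaxed);
    }

    // Orchestrator: clear the byte before the slot becomes visible to the scheduler.
    void reset(int32_t local_id) {
        flags[static_cast<uint32_t>(local_id) & mask].store(0, std::memory_order_relaxed);
    }

    // Completer: the release store publishes the producer's output writes.
    void set(int32_t local_id) {
        flags[static_cast<uint32_t>(local_id) & mask].store(1, std::memory_order_release);
    }

    // Poller: the acquire load pairs with the release store in set().
    bool done(int32_t local_id) const {
        return flags[static_cast<uint32_t>(local_id) & mask].load(std::memory_order_acquire) != 0;
    }

    // True when every fanin id has its flag set.
    bool fanin_satisfied(const int32_t *ids, int n) const {
        for (int i = 0; i < n; ++i) {
            if (!done(ids[i])) return false;
        }
        return true;
    }
};
```

Because the flags are one byte each and contiguous, a task's fanin checks touch a few cache lines of this array instead of one line per producer slot.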

Case4 trimmed device avg: 1308 us (was 1360). Case1 trimmed device
avg: 28047 us (was 28286); trimmed host avg: 292834 us (was 453591).
Replace the intrusive next_pending pointer in PTO2TaskSlotState with a
thread-0-private circular FIFO of slot pointers, sized to the per-ring
task window (PTO2_TASK_WINDOW_SIZE) and allocated from the scheduler
arena. Same memory budget (was 8B per slot × window_size; now one
contiguous buffer of the same total size), but keeps scheduler-private
linkage out of the task struct.

Push/pop become array writes/reads at head_idx/tail_idx & mask. The
buffer's cache lines amortize across 64 entries per line, matching the
hit rate the old design got from co-locating next_pending with the
slot_state cache line that fanin_satisfied already loaded.
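
A minimal sketch of such a circular FIFO of slot pointers, assuming a power-of-two capacity. `PendingFifo` and its member names are hypothetical, and a `std::vector` stands in for the scheduler-arena allocation.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Single-threaded ring buffer of pointers, indexed with `& mask`.
template <typename T>
class PendingFifo {
public:
    explicit PendingFifo(uint32_t capacity) : buf_(capacity), mask_(capacity - 1) {
        assert(capacity != 0 && (capacity & (capacity - 1)) == 0);
    }

    bool empty() const { return head_ == tail_; }
    uint32_t size() const { return tail_ - head_; }  // unsigned wrap-around is intended

    // Push is an array write at tail & mask; fails when the buffer is full.
    bool push(T *p) {
        if (size() == buf_.size()) return false;
        buf_[tail_++ & mask_] = p;
        return true;
    }

    // Pop is an array read at head & mask; returns nullptr when empty.
    T *pop() {
        if (empty()) return nullptr;
        return buf_[head_++ & mask_];
    }

private:
    std::vector<T *> buf_;
    uint32_t mask_;
    uint32_t head_ = 0;
    uint32_t tail_ = 0;
};
```

Sizing the buffer to the task window means a push cannot fail in practice, since at most one entry per live slot can be pending.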

Case4 trimmed device avg: 1319 us (was 1308 us). Case1 trimmed device
avg: 28080 us (was 28047 us). Differences are within shared-box noise.
Add SchedulerThreadProfile (per-phase cumulative cycles + entry counts)
and instrument the main loop to attribute time to:
  - completion check
  - async wait poll
  - drain_wiring_queue (split into SPSC drain vs pending FIFO poll)
  - dummy ready-queue drain
  - dispatch_ready_tasks
  - idle spin

Dump via LOG_INFO_V9 once per resolve_and_dispatch exit so the hot path
only accumulates cycle counters. Output is tagged CLAUDE_PROFILING and
written to ${HOME}/ascend/log/debug/; pull it with
  cat /root/ascend/log/debug/*/* | grep CLAUDE_PROFILING

Used to identify thread 0's pending FIFO fanin polling as the
dominant cost in Case1 (54% of round time) — the data-driven basis
for the wake-list optimization that follows.
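
For readers who want to reproduce this kind of attribution, a per-phase accumulator could look roughly like the following. It is a sketch, not the actual SchedulerThreadProfile: the `Phase` enum and the `add`/`percent` helpers are hypothetical, and the cycle values are passed in by the caller because the platform's cycle counter is not shown here.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Phases follow the list in the commit message; the enum itself is hypothetical.
enum class Phase : std::size_t {
    CompletionCheck, AsyncWaitPoll, SpscDrain, PendingFifoPoll,
    DummyReadyDrain, DispatchReady, IdleSpin, Count
};

struct PhaseStat {
    uint64_t cycles = 0;
    uint64_t entries = 0;
};

struct SchedulerThreadProfileSketch {
    std::array<PhaseStat, static_cast<std::size_t>(Phase::Count)> stats{};

    // Hot path: only accumulate; formatting and logging happen once at exit.
    void add(Phase p, uint64_t begin_cycle, uint64_t end_cycle) {
        assert(end_cycle >= begin_cycle);
        PhaseStat &s = stats[static_cast<std::size_t>(p)];
        s.cycles += end_cycle - begin_cycle;
        s.entries += 1;
    }

    // Share of all recorded cycles spent in phase p, in percent (0 if nothing recorded).
    double percent(Phase p) const {
        uint64_t total = 0;
        for (const PhaseStat &s : stats) total += s.cycles;
        if (total == 0) return 0.0;
        return 100.0 * static_cast<double>(stats[static_cast<std::size_t>(p)].cycles) /
               static_cast<double>(total);
    }
};
```
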
Replace the pure-polling pending-FIFO loop with a hybrid:
  - 0 unmet fanins  → push to ready_queues (unchanged)
  - exactly 1 unmet → register the consumer on that producer's wake list
                      and remove from FIFO (was: push back to FIFO)
  - 2+ unmet        → push back to FIFO for the next poll (unchanged)

Each producer slot gets a wake_list_head atomic pointer. Registration
is a CAS push onto the head. Completion does an atomic-exchange to a
SENTINEL (refusing further registrations) and pushes every waiter to
ready_queues. Slots reset wake_list_head to nullptr on reuse.

The intuition: most pending lifetime is spent waiting on the last
fanin to complete. The polling model re-walks every fanin on every
poll iteration even though only one byte changes. Wake-list registration
costs one CAS per task and zero further polls — the producer pushes the
waiter on completion. The submission-time variant of this idea ((f) in
the investigation) regressed because cross-thread cache traffic on the
orchestrator's hot path overwhelmed the savings; restricting wake-list
work to the scheduler-side keeps the writers on the same cache line.
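
The registration and completion protocol described above can be sketched as follows. Only `wake_list_head` and the sentinel idea come from the message; `Waiter`, `ProducerSlot`, `register_waiter`, `complete_producer` and `reset_for_reuse` are hypothetical names, and a `std::vector` stands in for ready_queues.

```cpp
#include <atomic>
#include <cassert>
#include <vector>

struct Waiter {
    Waiter *next = nullptr;
    int id = 0;
};

struct ProducerSlot {
    std::atomic<Waiter *> wake_list_head{nullptr};
};

// Sentinel marking a completed producer; a static object gives it a unique address.
static Waiter g_sentinel_storage;
static Waiter *const kSentinel = &g_sentinel_storage;

// CAS-push the waiter onto the producer's list. Returns false if the producer
// has already completed, in which case the caller treats the fanin as met.
inline bool register_waiter(ProducerSlot &p, Waiter *w) {
    Waiter *head = p.wake_list_head.load(std::memory_order_acquire);
    do {
        if (head == kSentinel) return false;
        w->next = head;
    } while (!p.wake_list_head.compare_exchange_weak(
        head, w, std::memory_order_acq_rel, std::memory_order_acquire));
    return true;
}

// Completion: swap in the sentinel (refusing further registrations) and hand
// every registered waiter to `ready`.
inline void complete_producer(ProducerSlot &p, std::vector<Waiter *> &ready) {
    Waiter *w = p.wake_list_head.exchange(kSentinel, std::memory_order_acq_rel);
    while (w != nullptr && w != kSentinel) {
        Waiter *next = w->next;
        ready.push_back(w);
        w = next;
    }
}

// Slot reuse: clear the head so a new task can accept registrations.
inline void reset_for_reuse(ProducerSlot &p) {
    p.wake_list_head.store(nullptr, std::memory_order_relaxed);
}
```

The failed-registration path matters: a consumer that loses the race against completion must be pushed to the ready queue by the registering thread, otherwise nobody wakes it.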

Case1 (large workload, 65K tasks): -2.2% trimmed device time
  (~28072 µs → ~27451 µs).
Case4 (small workload): +2.2% trimmed device time
  (~1322 µs → ~1351 µs). The per-task atomic exchange overhead is not
  amortized at this scale.

Profile shift on Case1 (thread 0):
  drain_wiring_cycles  819K → 396K (-52%)
  pending_poll_cycles  767K → 343K (-55%)
  All threads run ~40% fewer main-loop iterations (denser per-iteration
  work).
Break down the completion phase further: separate complete_slot_task
body time from the per-iter cond_ptr-read + transition-decide overhead,
plus a count of cores scanned per iter. Lets future investigations see
which sub-phase actually dominates compl_cyc.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
(m) PTO2TaskSlotState::task_state was a redundant completion signal —
completion_flags already records the same transition with the right
memory ordering. Drop the atomic release store on the completion path,
switch the watermark CAS-advance loop and the wait/stall-dump readers
to consult completion_flags directly. Saves one atomic store per task.

(q) In complete_slot_task, read deferred_slab->count before
deferred_slab->error_code. Kernels that don't register async conditions
leave count at 0 (the dispatch-time reset value), so checking count
first lets the common path skip the error_code load + branch and the
condition-forwarding loop.

Each change is neutral on Case1 in isolation (within ±50 µs run-to-run
variance over 80-round trimmed avgs), but both clean up redundant
work on the completion hot path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous commit dropped the producer-side .store(COMPLETED) — the
field had no remaining writers on the hot path. Remove the field itself,
the orchestrator's no-longer-needed PENDING-init at submit time, and
the SCALAR_DATA_ACCESS / MULTI_RING doc snippets that still spelled the
spin-wait and watermark-walk in terms of task_state. completion_flags
is now the sole completion signal in a2a3.

The a2a3 test_task_state.cpp UT was a leftover copy of the a5 version —
it #includes "scheduler/pto_scheduler.h" (an a5-only path) and calls
release_fanin_and_check_ready / release_producer methods that don't
exist in the a2a3 scheduler. It never compiled against a2a3; remove it
and the matching CMakeLists entry.

Note: RUNTIME_LOGIC.md sections 6.2 / 7.3 / 8.2 / 8.4 still describe a
much older fanout_lock + CONSUMED state architecture that no longer
exists in the codebase. That cleanup is out of scope here — flagged
for a follow-up doc pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Walk the recently-touched scheduler / orchestrator surface for unused
parameters and dead state, and drop what no caller or body actually
exercises:

- on_mixed_task_complete / complete_slot_task / check_running_cores_for_completion:
  drop the threaded-through `local_bufs` argument (none of these bodies
  read it anymore — it was a leftover from the (g)/(g') wake-list-via-
  local-bufs variants that didn't ship). Also drops `local_bufs` from
  AsyncWaitList::poll_and_complete and the DrainCompletionSink field.
- check_running_cores_for_completion / complete_slot_task: drop the
  `Handshake *hank` argument (only forwarded, never read). The local
  `hank` in resolve_and_dispatch's loop scope is dropped with it.
- dispatch_shape / dispatch_ready_tasks: drop the `bool &try_pushed`
  out-param chain. Set deep inside dispatch_shape but the only
  consumer in resolve_and_dispatch was a (void) suppression.
- pop_ready_tasks_batch: drop the unused `thread_idx` argument.
- log_stall_diagnostics: drop the [[maybe_unused]] `task_count`.
- log_shutdown_stall_snapshot + handle_timeout_exit: drop the
  [[maybe_unused]] `trigger_idle_iterations` / `trigger_last_progress_count`
  and the matching unused `idle_iterations` / `last_progress_count` on
  the timeout-exit caller.
- handle_orchestrator_exit: drop the `int32_t &task_count` out-param —
  the caller's only use was a `if (...task_count > 0) { if (...) {} }`
  with an empty inner body. Read total_tasks_ directly instead.
- resolve_and_dispatch loop: drop the now-dead `task_count` and
  `last_progress_count` locals (and the three write-only updates to
  the latter); inline the `try_completed = ...; if (try_completed)`
  pattern into a single `if`.
- PTO2SchedulerState::print_stats / print_queues: empty no-op stubs,
  never called — remove (along with the cold-path API comment that
  pointed at them).
- PTO2TensorMap::print_stats: 45-line stat-collection function whose
  output goes nowhere (the per-ring loop body is also empty) — remove.
- orch_report_fatal_v: drop the dead vsnprintf-into-a-buffer-then-
  discard block; just latch the error code via orch_mark_fatal. The
  fmt + va_list params are kept (unnamed) since callers pass them and
  the wider rt_report_fatal -> orchestrator.report_fatal -> v API
  surface is symmetric for a future logging-sink hookup.

Build is clean, Case4 and Case1 pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
c3f74c7 (the foundational wireless2 collapse) dropped the
log_info_v ops pointer and the LOG_INFO_V0..V9 macros from
pto_orchestration_api.h as part of its general cleanup. That left
any orchestration .cpp that called LOG_INFO_V<n> without a
"#ifdef ENABLE_PROFILING" guard failing to compile — paged_attention_
manual_scope and benchmark_bgemm both hit "'LOG_INFO_V9' was not
declared in this scope" against current header state.

Restore the surface:
- Add log_info_v function pointer to both copies of PTO2RuntimeOps
  (the runtime-local one in pto_runtime2.h and the orchestration-
  facing mirror in pto_orchestration_api.h — keep them in sync).
- Add LOG_INFO_V0..V9 macros at the end of pto_orchestration_api.h
  that route through current_runtime()->ops->log_info_v.
- Implement rt_log_info_v in pto_runtime2.h: format the message
  with vsnprintf and forward to unified_log_info_v, which already
  owns the runtime verbosity gate.
- Wire rt_log_info_v into s_runtime_ops.

paged_attention_manual_scope Case1 and benchmark_bgemm Case0 now
build and run; paged_attention Case4 still passes (no regression on
runtime hot path).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

coderabbitai Bot commented Jun 22, 2026


Important

Review skipped

Draft detected.


@SergioMartin86 SergioMartin86 changed the title [Performance] Wireless (Polling-based) Task Readiness Implementation brings 15~20% speedup [Performance] Wireless (Polling-based) Task Readiness Implementation brings 15~40% speedup Jun 22, 2026
@SergioMartin86 SergioMartin86 changed the title [Performance] Wireless (Polling-based) Task Readiness Implementation brings 15~40% speedup [Performance] Wireless (Polling-based) task readiness detection brings 15~40% speedup Jun 22, 2026

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a significant refactoring of the runtime, orchestrator, and scheduler, consolidating several source files into headers, simplifying the ring buffer layout, and removing verbose logging. The review feedback identifies a critical deadlock bug in the CPU affinity gating due to a potential mask overflow on systems with 16 or more cores, and a high-severity data race on last_consumer_local_id that should be resolved using atomic operations. Additionally, the feedback highlights a potential null pointer dereference in the tensor map, wasted CPU cycles in statistics printing, a performance regression from removing the read-only tensor copy-back optimization, and the loss of critical troubleshooting logs for dynamic loading failures.


static bool s_thread_survive[MAX_GATE_THREADS];

static inline int32_t popcount64(uint64_t v) { return __builtin_popcountll(static_cast<unsigned long long>(v)); }
static std::atomic<uint16_t> g_cpumask{0};


critical

Using a uint16_t for g_cpumask limits the maximum CPU ID to 15. On systems with 16 or more CPU cores, or if the assigned CPU IDs are >= 16, the expression 1 << cpu on line 47 will overflow, resulting in a 0 mask when cast back to uint16_t. This prevents g_cpumask from accumulating the correct number of active threads, causing the barrier loop on line 50 to spin infinitely (deadlock). Please use uint64_t or std::atomic<uint64_t> instead.

Suggested change
- static std::atomic<uint16_t> g_cpumask{0};
+ static std::atomic<uint64_t> g_cpumask{0};
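
The truncation the comment describes can be checked at compile time. This is an illustrative sketch: `mask16` and `mask64` are hypothetical helpers, and the exact expression used in the reviewed code is not shown on this page.

```cpp
#include <cstdint>

// One-bit mask for `cpu` narrowed to 16 bits; bits at position 16 and above
// are lost in the narrowing cast.
constexpr uint16_t mask16(int cpu) { return static_cast<uint16_t>(1u << cpu); }

// 64-bit variant covering CPU ids 0..63.
constexpr uint64_t mask64(int cpu) { return uint64_t{1} << cpu; }

static_assert(mask16(15) == 0x8000, "CPU 15 still fits in 16 bits");
static_assert(mask16(16) == 0, "CPU 16 is truncated to an empty mask");
static_assert(mask64(16) == 0x10000, "the 64-bit mask keeps the bit");
```

An empty mask contributes nothing when OR-ed into the shared word, so a barrier that waits for a given number of set bits would never see that thread arrive.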

// safe to reclaim when the per-ring completed_watermark reaches at least
// this id (i.e. every task up to and including the last consumer has
// transitioned to COMPLETED). Single-writer (orchestrator) at submit time.
int32_t last_consumer_local_id;


high

The last_consumer_local_id field is written by the orchestrator thread in submit_task_common and read concurrently by the scheduler thread in advance_ring_pointers without any synchronization. This constitutes a data race, which is Undefined Behavior in C++ and can lead to compiler optimizations serving stale values or causing infinite loops. Please make this field atomic (e.g., std::atomic<int32_t>) and update its accesses using relaxed memory order (e.g., load(std::memory_order_relaxed) and store(..., std::memory_order_relaxed)) in pto_orchestrator.h (lines 484, 584) and pto_scheduler.h (line 394).

Suggested change
- int32_t last_consumer_local_id;
+ std::atomic<int32_t> last_consumer_local_id;
References
  1. In self-correcting re-polling loops, using std::memory_order_relaxed for atomic loads of a single variable is sufficient and avoids the overhead of std::memory_order_acquire.
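
The suggested accesses might look like the sketch below. `SlotState`, `stamp_consumer` and `can_retire` are hypothetical names, not functions from the PR. Whether relaxed ordering is sufficient depends on the other synchronization in the runtime, which this sketch does not model.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstdint>

struct SlotState {
    std::atomic<int32_t> last_consumer_local_id{-1};
};

// Orchestrator (single writer): stamp max(prev, self) at submit time.
// A plain load followed by a store is enough because there is only one writer.
inline void stamp_consumer(SlotState &s, int32_t self) {
    const int32_t prev = s.last_consumer_local_id.load(std::memory_order_relaxed);
    s.last_consumer_local_id.store(std::max(prev, self), std::memory_order_relaxed);
}

// Scheduler: decide whether the slot can be reclaimed at this watermark.
inline bool can_retire(const SlotState &s, int32_t completed_watermark) {
    return s.last_consumer_local_id.load(std::memory_order_relaxed) <= completed_watermark;
}
```

Making the field atomic removes the undefined behavior of the data race without adding fences on the submit path.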

Comment on lines 383 to 384
PTO2TensorMapEntry *entry = new_entry();
entry->copy_from_tensor(tensor);


high

If the TensorMap entry pool is exhausted, new_entry() will return nullptr. Calling entry->copy_from_tensor(tensor) directly without a null check will result in a segmentation fault and crash the orchestrator. Please add a null check and report a fatal error or use always_assert to prevent a segmentation fault.

Suggested change
- PTO2TensorMapEntry *entry = new_entry();
- entry->copy_from_tensor(tensor);
+ PTO2TensorMapEntry *entry = new_entry();
+ always_assert(entry != nullptr);
+ entry->copy_from_tensor(tensor);

Comment on lines +475 to +520
void print_stats()
{
int32_t valid = 0;
int32_t stale = 0;
int32_t empty_buckets = 0;
int32_t max_chain = 0;
int64_t total_chain = 0;
int32_t non_empty_buckets = 0;

// Count entries
for (int32_t i = 0; i < pool_size; i++)
{
if (entry_pool[i].bucket_index != -1)
{
if (entry_valid(entry_pool[i])) valid++;
else stale++;
}
}

#if PTO2_TENSORMAP_PROFILING
struct PTO2TensorMapProfilingData {
uint64_t lookup_chain_total;
uint64_t lookup_count;
int32_t lookup_chain_max;
uint64_t overlap_checks;
uint64_t overlap_hits;
uint64_t insert_count;
};
// Count bucket stats
for (int32_t b = 0; b < num_buckets; b++)
{
int32_t chain_len = 0;
auto cur_entry = buckets[b];

while (cur_entry != nullptr)
{
chain_len++;
cur_entry = cur_entry->next_in_bucket;
}

if (chain_len == 0)
{
empty_buckets++;
}
else
{
non_empty_buckets++;
total_chain += chain_len;
if (chain_len > max_chain) max_chain = chain_len;
}
}

for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++)
{}
}


medium

The print_stats() function computes various statistics but does not print them because all the LOG_INFO_V0 calls were removed. This makes the function a no-op that wastes CPU cycles if called. Please either restore the logging statements or remove the unused computation.

// copying back.
bool needs_copy_back = !(signature != nullptr && i < sig_count && signature[i] == ArgDirection::IN);
runtime->tensor_pairs_.push_back({host_ptr, dev_ptr, size, needs_copy_back});
runtime->tensor_pairs_.push_back({host_ptr, dev_ptr, size});


medium

Removing the optimization to skip copy-back for read-only INPUT tensors (such as model weights) introduces significant and unnecessary D2H transfer overhead. Please ensure that INPUT tensors are marked as child_memory to explicitly skip the D2H copy-back, while ensuring OUTPUT tensors are not marked as child_memory since they require copy-back.

References
  1. By design, child_memory is only used for INPUT tensors. OUTPUT tensors must not be marked as child_memory because they require a D2H copy-back, which child_memory explicitly skips.

Comment on lines +261 to +262
if (handle == nullptr)
{


medium

Removing the error log when dlopen fails makes troubleshooting extremely difficult, as the executor will silently return -1 without any indication of why the orchestration SO could not be loaded. Please restore the LOG_ERROR call to print the dlerror() message.

Suggested change
- if (handle == nullptr)
- {
+ if (handle == nullptr)
+ {
+ LOG_ERROR("Thread %d: dlopen failed: %s", thread_idx, dlerror());

Comment on lines +279 to +280
if (entry_dlsym_error != nullptr)
{


medium

Removing the error log when dlsym fails for the entry symbol makes troubleshooting extremely difficult, as the executor will silently return -1 without any indication of why the symbol could not be resolved. Please restore the LOG_ERROR call to print the entry_dlsym_error message.

Suggested change
- if (entry_dlsym_error != nullptr)
- {
+ if (entry_dlsym_error != nullptr)
+ {
+ LOG_ERROR("Thread %d: dlsym failed for entry symbol '%s': %s", thread_idx, entry_symbol, entry_dlsym_error);

SergioMartin86 and others added 4 commits June 22, 2026 13:28
Squash-merge of wireless2 (c4b0aac + 11 commits) onto current
upstream/main (83728d2). Per-commit replay was not viable: upstream
added speculative early-dispatch (hw-native-sys#1079) which touches the same data
structures wireless2 redesigned, and refactored TaskArgs / Tensor
along with several module collapses that fundamentally diverge from
wireless2's earlier collapse-and-poll redesign.

Resolution strategy:
- Modify/delete (8 paths): accept wireless2's deletion. The
  `scheduler/*` and `shared/*` directories were collapsed into
  header-only modules in wireless2 (c3f74c7); upstream kept
  modifying them. We keep the collapse.
- Pure upstream additions (DumpArgSelection / strided TaskArgs /
  Tensor refactor, AICore receive_time / swimlane, NUMA gate, lookup
  profiling externs, MIX classification fix, prefetch helper, etc.):
  take upstream's version. Wireless2 wasn't redesigning these.
- Wireless architecture (completion_flags polling, fanin_local_ids[],
  wake-list, watermark reclamation, pending FIFO out-of-band): keep
  wireless2's design. fanin_local_ids[] is THE entry point for the
  polling loop.
- PTO2TaskPayload: keep wireless2's flat fanin_local_ids[] alongside
  upstream's fanin_inline_slot_states + spec-dispatch storage as a
  compatibility layer, so spec-dispatch code links. Both populated at
  submit; the wireless poller reads fanin_local_ids, spec dispatch
  reads its own fields. Long-term we'd dedupe, but the squash needs
  to compile first.
- pto_types.h and tensor.h: took upstream entire. The TaskArgs and
  Tensor refactor is large; wireless2 only had cosmetic conflicts
  here. Adapt wireless2 code paths to the new TaskArgs surface in
  a follow-up if any breakage surfaces.

The build is NOT yet verified by this commit — there will be
follow-up fixes for code paths that referenced now-removed
symbols (notably the orchestrator-side fanin builder, any direct
fanin_refcount touch points, and the spec-dispatch release path
that needs to consult completion_flags instead of fanin_refcount).
This commit captures the merge resolution as a stable starting
point; verification + adaptation commits land next.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two fixes after the rebase commit:

1. pto_runtime2_types.h: the PTO2TaskPayload compatibility layer for
   upstream spec-dispatch references PTO2FaninPool and
   PTO2_FANIN_INLINE_CAP. Upstream defines them in this same header
   but the merge dropped the lines. Restore: #define
   PTO2_FANIN_INLINE_CAP 64 and forward-declare struct PTO2FaninPool
   alongside PTO2_MAX_FANIN.

2. orchestration/common.cpp: assert_impl + AssertionError + the
   addr2line / backtrace machinery used to live inline in
   wireless2's runtime/common.h. Upstream moved the declarations to
   src/common/task_interface/assert_compat.h and expects the runtime
   target to provide the definitions in orchestration/common.cpp
   (a5 does so). Port a5's common.cpp into the a2a3 orchestration
   path. Sidestep the LOG_ERROR vs LOG_INFO_V macro conflict by not
   pulling common/unified_log.h (would re-#define LOG_INFO_V0..V9
   already supplied by pto_orchestration_api.h) and using a local
   stderr-printing LOG_ERROR for the assert path.

paged_attention Case4 passes (1389 µs, 10 rounds). Case1 trimmed
device avg = 30587 µs over 100 rounds — works but ~11% slower than
the same wireless2 stack on the c4b0aac baseline (27451 µs). The
extra cost is likely overhead from coexisting with upstream's
additions (spec-dispatch storage, profiling fields, etc.) that the
wireless poller never reads but the orchestrator still populates.
Investigation + tightening of the coexistence layer is a follow-up.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Root cause of the +11% Case1 / pa_manual_scope regression I measured
on wireless3 yesterday.

When I merged wireless2 onto upstream/main I added a "compatibility
layer" to PTO2TaskPayload: kept upstream's
  fanin_inline_slot_states[PTO2_FANIN_INLINE_CAP]   // 512 B
  fanin_actual_count, fanin_spill_start
  fanin_spill_pool*
  staged_core_mask[PTO2_SPEC_CORE_MASK_WORDS]      // 16 B
  dispatch_fanin, allow_early_resolve, spec_state,
  dispatch_propagated, spec_chain_active, spec_chain_depth
alongside the wireless model's flat fanin_local_ids[]. The intent was
to give spec-dispatch's release path something to link against. But
the spec-dispatch implementation lived in scheduler/* and
pto_orchestrator.cpp / pto_runtime2.cpp — files we deleted as part
of the wireless directory collapse. After the merge nothing in the
tree actually reads/writes any of those fields (verified by grep).

So: ~560 bytes of dead per-payload storage. With 65K tasks per
Case1 round that's ~36 MB of cache thrash per round even though
the wireless poller never touches the bytes. Bench confirmed: the
regression was workload-size-correlated and only hit the
biggest workloads (Case1, pa_manual_scope Case1/2).
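
The ~36 MB figure follows directly from the two numbers above. A small compile-time check, assuming "65K" means 2^16 tasks and taking 560 bytes as the dead storage per payload (the constant names are made up for this example):

```cpp
#include <cstdint>

constexpr uint64_t kDeadBytesPerPayload = 560;  // "~560 bytes" from the message above
constexpr uint64_t kTasksPerRound = 65536;      // assumption: "65K" = 2^16
constexpr uint64_t kDeadBytesPerRound = kDeadBytesPerPayload * kTasksPerRound;

// 560 * 65536 = 36,700,160 bytes, i.e. about 36.7 MB (exactly 35 MiB).
static_assert(kDeadBytesPerRound == 36700160, "dead payload bytes per round");
static_assert(kDeadBytesPerRound / (1024 * 1024) == 35, "35 MiB per round");
```
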

Remove:
- fanin_inline_slot_states, fanin_spill_pool, fanin_*_count|start
- staged_core_mask, dispatch_fanin, allow_early_resolve, spec_state,
  dispatch_propagated, spec_chain_active, spec_chain_depth
- PTO2SpecState enum and PTO2_SPEC_CORE_MASK_WORDS constant
- PTO2_FANIN_INLINE_CAP define and PTO2FaninPool fwd decl
- The init() block that zeroed those fields
- The +512 prefetch in prefetch() that targeted them
- A reset_for_reuse comment referring to them

Bench post-fix (wireless3 vs wireless2 on bench_baseline):
  paged_attention Case1                 27919  vs 27692  (+0.8% wash)
  paged_attention Case4                  1134  vs  1382  (−18%)
  paged_attention CaseSmall1              302  vs   650  (−54%)
  pa_unroll_manual_scope Case1           1626  vs  1883  (−14%)
  pa_unroll_manual_scope Case2           1016  vs  1272  (−20%)
  paged_attention_manual_scope Case1    25249  vs 24933  (+1.3% wash)
  paged_attention_manual_scope Case2    13382  vs 13109  (+2.1% wash)
  benchmark_bgemm Case0                  1038  vs  1274  (−19%)

The three heavy cases are within run-to-run noise of wireless2;
every other case is significantly faster (smaller workloads benefit
from upstream's improvements between c4b0aac and current main).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
wireless3 was the post-rebase form of the wireless2 work — same set of
optimizations re-applied on top of current upstream/main plus the
spec-dispatch-coexistence cleanup that recovered the rebase's per-task
overhead. Bringing it back into wireless2 so future work continues on
a single perf branch riding current upstream.

Resolution: take wireless3's tree exactly. The wireless2-side history
predates wireless3 and is fully represented in wireless3's squashed
"Rebase wireless2 stack onto upstream/main (squashed)" commit, so a
content-level merge would just re-do the same conflict resolution we
already settled. Recording the merge as a no-conflict two-parent
commit preserves the history linkage without re-litigating it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

SergioMartin86 commented Jun 24, 2026


Closing, as this change has been refined into a much less invasive PR:
#1137
