From 7ecb56c769223a6936edfbd3669142b0dd5ef8e9 Mon Sep 17 00:00:00 2001
From: Hyunyul Cho
Date: Tue, 28 Apr 2026 18:41:19 +0900
Subject: [PATCH] perf(ioctl): cached-head fast path for ioctl_send_cmds

ioctl_send_cmds unconditionally fired a synchronous PCIe read of the
device-side head register (read_ctrl_from_device) before every batch
push. On a 24-sub PXL device this added ~18us per call, putting
SEND_CMDS at ~30us vs the ~12us achieved by the sibling
ioctl_send_cmd_with_data, which already uses a cached +
busy-poll-on-full pattern.

The cached head only ever lags the real device head (the device
monotonically advances head as it consumes commands), so
cached_pushable is always <= real_pushable. If the cached count already
covers the caller's nr_cmds, the actual count is at least as large --
safe to skip the read. Only fall back to the synchronous read when the
cached count is short.

Worst-case behaviour: when the real head moved further than cached
between calls, we may push fewer commands than physically possible. The
caller resubmits the remainder on the next call. No correctness impact
(we never overflow the queue), only a transient throughput
underutilization that self-corrects.

Measured impact (PXL echo bench, one-per-sub, batch=1, 24 subs,
1000 reps): closes the ~600us per-execute gap between PXL's clean
dispatch path (which fires 24 SEND_CMDS calls) and the dirty path
(which fires 24 SEND_CMD_WITH_DATA calls), eliminating the structural
penalty for 1pkt/sub workloads in PXL Phase 5.
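
The invariant above can be checked with a small user-space model. This
is only a sketch under stated assumptions: struct ring_model,
RING_DEPTH, pushable(), send_cmds() and device_consume() are
hypothetical names, the indices are free-running 32-bit counters, and
none of it is the driver's real layout or get_pushable_count itself.
The assignment from device_head stands in for the synchronous PCIe
read.

```c
#include <assert.h>
#include <stdint.h>

#define RING_DEPTH 64u /* hypothetical queue depth (power of two) */

/* Hypothetical model of the queue state; not the driver's real struct. */
struct ring_model {
	uint32_t tail;        /* host-side producer index */
	uint32_t device_head; /* consumer index as the device sees it */
	uint32_t cached_head; /* host's last-read copy of device_head */
	unsigned int reads;   /* simulated synchronous head reads */
};

/* Free slots for a given head; unsigned wrap-around keeps this correct. */
static uint32_t pushable(uint32_t tail, uint32_t head)
{
	return RING_DEPTH - (tail - head);
}

/* Push up to nr_cmds; refresh the cached head only when it looks short. */
static uint32_t send_cmds(struct ring_model *r, uint32_t nr_cmds)
{
	uint32_t count = pushable(r->tail, r->cached_head);

	if (count < nr_cmds) {
		r->cached_head = r->device_head; /* stands in for the PCIe read */
		r->reads++;
		count = pushable(r->tail, r->cached_head);
	}
	if (count > nr_cmds)
		count = nr_cmds;
	r->tail += count;
	return count;
}

/* The device consumes up to n queued commands; head only moves forward. */
static void device_consume(struct ring_model *r, uint32_t n)
{
	uint32_t used = r->tail - r->device_head;

	r->device_head += (n < used) ? n : used;
}

/* Walk through the fast path and the fallback; returns the read count. */
static unsigned int demo(void)
{
	struct ring_model r = {0};

	send_cmds(&r, 48);     /* cached count 64 covers 48: no read */
	device_consume(&r, 40);
	/* The stale head under-reports free space, never over-reports it. */
	assert(pushable(r.tail, r.cached_head) <=
	       pushable(r.tail, r.device_head));
	send_cmds(&r, 16);     /* cached count 16 covers 16: no read */
	send_cmds(&r, 8);      /* cached count 0 is short: one read */
	return r.reads;
}
```

In this model the first two pushes never touch the simulated register,
and only the third, where the cached count falls below the request,
pays for a read.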
---
 ioctl.c | 28 +++++++++++++++++++++++-----
 1 file changed, 23 insertions(+), 5 deletions(-)

diff --git a/ioctl.c b/ioctl.c
index 6a7d6b9..d1fd36a 100644
--- a/ioctl.c
+++ b/ioctl.c
@@ -218,13 +218,31 @@ static long ioctl_send_cmds(struct mx_pci_dev *mx_pdev, unsigned long arg)
 	sq_mbox = mx_pdev->sq_mbox_list[send_cmd.qid];
 	mutex_lock(&sq_mbox->lock);
-	if (read_ctrl_from_device(mx_pdev, (char __user *)&ctx.u64, sizeof(uint64_t), (loff_t *)&sq_mbox->r_ctx_addr, IO_OPCODE_SQ_READ) <= 0) {
-		mutex_unlock(&sq_mbox->lock);
-		return -EINTR;
-	}
-	sq_mbox->ctx.head = ctx.head;
+	/*
+	 * Cached-head fast path. The cached head only ever lags the real
+	 * device head (device monotonically advances head as it consumes),
+	 * so cached_pushable <= real_pushable. If the cached pushable count
+	 * already covers the requested batch we can skip the synchronous
+	 * PCIe read of the device-side head register entirely.
+	 *
+	 * Skipping the read shaves the per-call cost from ~30us down to
+	 * the same order as ioctl_send_cmd_with_data (~12us), since that
+	 * sibling already uses a cached + busy-poll-on-full pattern. Loss
+	 * case: when real head moved further than cached, we may push less
+	 * than physically possible -- caller resubmits the remainder, no
+	 * correctness impact.
+	 */
 	count = get_pushable_count(sq_mbox);
+	if (count < send_cmd.nr_cmds) {
+		if (read_ctrl_from_device(mx_pdev, (char __user *)&ctx.u64, sizeof(uint64_t), (loff_t *)&sq_mbox->r_ctx_addr, IO_OPCODE_SQ_READ) <= 0) {
+			mutex_unlock(&sq_mbox->lock);
+			return -EINTR;
+		}
+		sq_mbox->ctx.head = ctx.head;
+		count = get_pushable_count(sq_mbox);
+	}
+
 	if (count == 0)
 		goto out;