From 7ecb56c769223a6936edfbd3669142b0dd5ef8e9 Mon Sep 17 00:00:00 2001
From: Hyunyul Cho <hyunyul.cho@xcena.com>
Date: Tue, 28 Apr 2026 18:41:19 +0900
Subject: [PATCH] perf(ioctl): cached-head fast path for ioctl_send_cmds

ioctl_send_cmds unconditionally fired a synchronous PCIe read of the
device-side head register (read_ctrl_from_device) before every batch
push. On a 24-sub PXL device this added ~18us per call, putting
SEND_CMDS at ~30us vs the ~12us achieved by the sibling
ioctl_send_cmd_with_data, which already uses a cached + busy-poll-on-full
pattern.

The cached head only ever lags the real device head (the device
monotonically advances head as it consumes commands), so cached_pushable
is always <= real_pushable. If the cached count already covers the
caller's nr_cmds, the actual count is at least as large -- safe to skip
the read. Only fall back to the synchronous read when the cached count
is short.
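
The argument above can be sketched in isolation. This is a minimal
model, not the driver's code: the ring depth, the one-slot-empty
convention, and the names pushable(), can_skip_head_read() and
check_lag_invariant() are all assumptions made for illustration. The
real get_pushable_count() is not shown in this patch and may compute
the count differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical queue depth; one slot is kept empty to tell full from empty. */
#define RING_DEPTH 8u

/* Free slots for a producer tail and a consumer head, both in [0, RING_DEPTH). */
static uint32_t pushable(uint32_t tail, uint32_t head)
{
	uint32_t used = (tail + RING_DEPTH - head) % RING_DEPTH;

	return RING_DEPTH - 1u - used;
}

/* True when the cached head alone proves the batch fits, so no device read is needed. */
static bool can_skip_head_read(uint32_t tail, uint32_t cached_head, uint32_t nr_cmds)
{
	return pushable(tail, cached_head) >= nr_cmds;
}

/*
 * Checks the invariant the fast path relies on. Only meaningful when
 * cached_head is a value the real head held earlier, i.e. it lags real_head.
 */
static void check_lag_invariant(uint32_t tail, uint32_t cached_head, uint32_t real_head)
{
	assert(pushable(tail, cached_head) <= pushable(tail, real_head));
}
```

With tail = 5, a cached head of 0 and a real head of 3, the cached view
reports 2 free slots and the real view 5, so a batch of 2 can skip the
read while a batch of 3 has to refresh the head first.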

Worst-case behaviour: when the real head has advanced past the cached
value between calls, we may push fewer commands than physically
possible. The caller resubmits the remainder on the next call. There is
no correctness impact (we never overflow the queue), only a transient
throughput loss that self-corrects.
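
The resubmit behaviour on the caller side might look like the sketch
below. It assumes a submit call that reports how many commands it
accepted; how SEND_CMDS actually reports a partial push is not shown in
this patch. submit_fn, submit_all() and fake_submit() are hypothetical
names, and fake_submit() is only a stand-in for the ioctl.

```c
#include <stddef.h>

/*
 * Hypothetical submit callback: tries to push `nr` commands starting at
 * `offset` and returns how many were accepted, or a negative error.
 */
typedef long (*submit_fn)(void *ctx, size_t offset, size_t nr);

/* Resubmit the remainder until everything is pushed or max_tries is used up. */
static long submit_all(submit_fn submit, void *ctx, size_t nr_cmds, unsigned max_tries)
{
	size_t done = 0;

	for (unsigned i = 0; i < max_tries && done < nr_cmds; i++) {
		long n = submit(ctx, done, nr_cmds - done);

		if (n < 0)
			return n;	/* propagate the error to the caller */
		done += (size_t)n;
	}
	return (long)done;	/* may be short if max_tries ran out */
}

/* Stand-in for the ioctl: accepts at most *(size_t *)ctx commands per call. */
static long fake_submit(void *ctx, size_t offset, size_t nr)
{
	size_t cap = *(size_t *)ctx;

	(void)offset;
	return (long)(nr < cap ? nr : cap);
}
```

Bounding the retries keeps a caller from spinning forever if the queue
stays full; a real caller may prefer to sleep or poll between attempts.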

Measured impact (PXL echo bench, one-per-sub, batch=1, 24 subs, 1000
reps): closes the ~600us per-execute gap between PXL's clean dispatch
path (which fires 24 SEND_CMDS calls) and the dirty path (which fires
24 SEND_CMD_WITH_DATA calls), eliminating the structural penalty for
1pkt/sub workloads in PXL Phase 5.
---
 ioctl.c | 28 +++++++++++++++++++++++-----
 1 file changed, 23 insertions(+), 5 deletions(-)

diff --git a/ioctl.c b/ioctl.c
index 6a7d6b9..d1fd36a 100644
--- a/ioctl.c
+++ b/ioctl.c
@@ -218,13 +218,31 @@ static long ioctl_send_cmds(struct mx_pci_dev *mx_pdev, unsigned long arg)
 	sq_mbox = mx_pdev->sq_mbox_list[send_cmd.qid];
 
 	mutex_lock(&sq_mbox->lock);
-	if (read_ctrl_from_device(mx_pdev, (char __user *)&ctx.u64, sizeof(uint64_t), (loff_t *)&sq_mbox->r_ctx_addr, IO_OPCODE_SQ_READ) <= 0) {
-		mutex_unlock(&sq_mbox->lock);
-		return -EINTR;
-	}
-	sq_mbox->ctx.head = ctx.head;
 
+	/*
+	 * Cached-head fast path. The cached head only ever lags the real
+	 * device head (device monotonically advances head as it consumes),
+	 * so cached_pushable <= real_pushable. If the cached pushable count
+	 * already covers the requested batch we can skip the synchronous
+	 * PCIe read of the device-side head register entirely.
+	 *
+	 * Skipping the read cuts the per-call cost from ~30us to the same
+	 * order as ioctl_send_cmd_with_data (~12us), which already uses a
+	 * cached + busy-poll-on-full pattern. Loss case: when the real head
+	 * has advanced past the cached one, we may push fewer commands than
+	 * physically possible -- the caller resubmits the remainder, with
+	 * no correctness impact.
+	 */
 	count = get_pushable_count(sq_mbox);
+	if (count < send_cmd.nr_cmds) {
+		if (read_ctrl_from_device(mx_pdev, (char __user *)&ctx.u64, sizeof(uint64_t), (loff_t *)&sq_mbox->r_ctx_addr, IO_OPCODE_SQ_READ) <= 0) {
+			mutex_unlock(&sq_mbox->lock);
+			return -EINTR;
+		}
+		sq_mbox->ctx.head = ctx.head;
+		count = get_pushable_count(sq_mbox);
+	}
+
 	if (count == 0)
 		goto out;