perf(ioctl): cached-head fast path for ioctl_send_cmds #19
Closed
hyunyul-XCENA wants to merge 1 commit into
ioctl_send_cmds unconditionally fired a synchronous PCIe read of the device-side head register (read_ctrl_from_device) before every batch push. On a 24-sub PXL device this added ~18us per call, putting SEND_CMDS at ~30us vs the ~12us achieved by the sibling ioctl_send_cmd_with_data, which already uses a cached + busy-poll-on-full pattern.

The cached head only ever lags the real device head (the device monotonically advances head as it consumes commands), so cached_pushable is always <= real_pushable. If the cached count already covers the caller's nr_cmds, the actual count is at least as large, so it is safe to skip the read. Only fall back to the synchronous read when the cached count is short.

Worst-case behaviour: when the real head moved further than cached between calls, we may push fewer commands than physically possible. The caller resubmits the remainder on the next call. No correctness impact (we never overflow the queue), only a transient throughput underutilization that self-corrects.

Measured impact (PXL echo bench, one-per-sub, batch=1, 24 subs, 1000 reps): closes the ~600us per-execute gap between PXL's clean dispatch path (which fires 24 SEND_CMDS calls) and the dirty path (which fires 24 SEND_CMD_WITH_DATA calls), eliminating the structural penalty for 1pkt/sub workloads in PXL Phase 5.
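The fast path and its safety argument can be sketched in a small userspace model. This is only an illustration under stated assumptions, not the driver code: `struct sq_model`, `sq_pushable`, `sq_refresh_head` and `sq_send_cmds` are hypothetical names, the indices are assumed to be free-running 32-bit counters, and `device_head` stands in for the register that `read_ctrl_from_device` would fetch over PCIe.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace model of the submission queue; not the driver's types. */
struct sq_model {
    uint32_t depth;       /* number of command slots */
    uint32_t tail;        /* host producer index (assumed free-running) */
    uint32_t cached_head; /* host's last-seen copy of the device head */
    uint32_t device_head; /* real device head; stands in for the register */
    unsigned reads;       /* number of simulated synchronous PCIe reads */
};

/* Free slots as seen through a given head value. */
static uint32_t sq_pushable(uint32_t depth, uint32_t tail, uint32_t head)
{
    uint32_t used = tail - head; /* wrap-safe for free-running indices */

    assert(used <= depth);
    return depth - used;
}

/* Stand-in for read_ctrl_from_device: refresh the cache and count the cost. */
static void sq_refresh_head(struct sq_model *q)
{
    q->cached_head = q->device_head;
    q->reads++;
}

/* Returns how many of nr_cmds were pushed. */
static uint32_t sq_send_cmds(struct sq_model *q, uint32_t nr_cmds)
{
    /* A stale head can only under-report free slots, never over-report. */
    uint32_t n = sq_pushable(q->depth, q->tail, q->cached_head);

    if (n < nr_cmds) {
        /* Cached view is short: pay for the synchronous read. */
        sq_refresh_head(q);
        n = sq_pushable(q->depth, q->tail, q->cached_head);
    }
    if (n > nr_cmds)
        n = nr_cmds;
    q->tail += n; /* a real driver would copy the commands here */
    return n;
}
```

In this model, with `depth = 8` and all indices starting at zero, two consecutive 4-command pushes complete without any simulated read. A third push finds the cached view full and takes the fallback read exactly once.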
🤔 Background and motivation (Why)
ioctl_send_cmds fetches the device-side head register with a synchronous PCIe read (read_ctrl_from_device, IO_OPCODE_SQ_READ) on every call in order to compute pushable_count. Even in ordinary workloads where the queue is not saturated, every call pays a PCIe round trip, at ~30 μs per call.

ioctl_send_cmd_with_data, which handles the same SQ, already uses an optimistic pattern: it proceeds on the cached sq_mbox->ctx.head and falls back to a read only when is_full() is true, at ~12 μs per call. A structural gap of ~18 μs exists between the two siblings.

Impact measured at the PXL level (24-sub device, 1 packet/sub, 1000 reps/exec):
- SEND_CMDS (batch) mean: ~30 μs → ~10 μs (-67 %, light scenarios)
- SEND_CMDS (1pkt) admin path mean: ~30 μs → ~9 μs
- Map::execute wall: 1pkt/sub clean 2.98 ms → 2.68 ms (-9.9 %), 2pkt/sub change 3.62 ms → 3.19 ms (-11.8 %), heavy/same 8.13 ms → 6.67 ms (-17.9 %)

No change to functionality, ABI, device-file interface, or HW command layout. The meaning of the nr_cmds return value is unchanged.

🏗️ Design changes
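- Add a cached-head fast path in ioctl_send_cmds right after the lock is taken.
- If the pushable_count derived from the cached head covers the caller's nr_cmds, skip the PCIe read_ctrl_from_device.

Pros and cons:

📝 Implementation details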
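ioctl.c, +23 / -5, 1 hunk:

Key invariant: the cached head always lags the real device head, because the device advances head monotonically. Therefore cached_pushable ≤ real_pushable. If the cached count is sufficient, the real count is sufficient too, so skipping the read is safe.

What I would like reviewers to check:
- Whether any code path updates sq_mbox->ctx.head in a direction other than strictly stale. If one does, the assumption breaks.

✅ Tests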
Measured with pxl-echo-bench across 6 scenarios (one/two-per-sub × same/change, heavy × same/change).
- SEND_CMDS_b mean ~30 μs → ~10 μs, wall -9 to -12 %
- SEND_CMDS_b unchanged (~37 μs); the wall-time change comes from reduced DI/DF + Wait
- Not observed: -EINTR, missing send counts, etc.

Detailed measurement data: phase5/.local/driver-cached-head-comparison.md in the PXL repo (internal share).

🔗 Related issues (optional)
🌿 Related PR (optional)
- The 1pkt/sub clean path of hycho/feat/map-prime-task-persist turns into a clear win over baseline after this driver fix. Independent, but a sibling improvement.

🌿 Related Branch (optional)
📦 Release Note (for auto-generation / written in English)
NEW
CHANGED
FIXED
IMPORTANT NOTES