perf(driver): minimize driver-side DMA submission overhead #18
Closed
daeyeong-XCENA wants to merge 8 commits into
Lowest RT band keeps the I/O handlers ahead of CFS noise so userspace submissions don't pay CFS wake latency under CPU pressure. Handlers still yield via cond_resched() and sleep in swait_event when idle, so softlockup/RCU stalls remain bounded.
Wake sq_wait and cq_wait at the top of every data/context/ioctl path so the handler kthreads start running in parallel with page pinning, DMA mapping, and command construction. The wake is a cheap no-op when the handler is already running, and removes the cold-start component of wake latency when it wasn't.
Restrict mx_submit_thd and mx_complete_thd to the device's NUMA node via set_cpus_allowed_ptr at queue init. Keeps handler cache traffic (descriptor ring, sq/cq_wait, transfer structs) node-local instead of letting the scheduler place them on any CPU in the system. Uses set_cpus_allowed_ptr rather than kthread_bind so operators can still taskset to colocate handlers with a specific userspace CPU for tighter tuning. No-op on devices without NUMA affinity.
Register a cpu_latency_qos request with a 50us wake-up budget at device probe and release it at device remove. This blocks deep idle states whose exit latency stretches the frequency ramp-up window, which we observed adding ~12us to cold DMA submissions (the governor reaches boost frequency only after the CPU wakes from deep idle). The request is held for the device's lifetime; shallow idle remains allowed so we don't force a polling-idle CPU. Released on both the success and out_fail paths via destroy_mx_pdev.
Add pages_inline[1], sg_inline[1], and a 64 B cmd_inline area to struct mx_transfer so the single-page hot path skips kcalloc(pages), sg_alloc_table_from_pages(), and kzalloc(mx_command). Free paths detect inline use by pointer identity and skip the corresponding kfree / sg_free_table. BUILD_BUG_ON guards cmd_inline against future growth of struct mx_command (v1=32 B, v2=64 B today).
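The pointer-identity check described above can be modelled in plain userspace C. This is a minimal sketch, not the driver code: `struct xfer_model`, `struct cmd_model`, the two helper functions, and the allocation counters are hypothetical names introduced for illustration, `calloc`/`free` stand in for `kcalloc`/`kfree`, and `_Static_assert` stands in for `BUILD_BUG_ON`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CMD_INLINE_SIZE 64

/* Stand-in for the largest command layout (64 B per the commit message). */
struct cmd_model { uint64_t words[8]; };

/* Userspace analogue of BUILD_BUG_ON: the build fails if the command
 * ever outgrows the inline area. */
_Static_assert(sizeof(struct cmd_model) <= CMD_INLINE_SIZE,
               "cmd_inline too small for the command struct");

/* Hypothetical model of a transfer with inline single-page storage. */
struct xfer_model {
    void **pages;                 /* points at pages_inline or at heap memory */
    size_t nr_pages;
    void *pages_inline[1];        /* storage for the single-page hot path */
    unsigned char cmd_inline[CMD_INLINE_SIZE];
};

static int heap_allocs;           /* demo counters, not part of any real driver */
static int heap_frees;

static int xfer_init_pages(struct xfer_model *x, size_t nr_pages)
{
    x->nr_pages = nr_pages;
    if (nr_pages <= 1) {
        /* Hot path: no allocation, just point at the embedded slot. */
        x->pages_inline[0] = NULL;
        x->pages = x->pages_inline;
        return 0;
    }
    x->pages = calloc(nr_pages, sizeof(*x->pages));
    if (!x->pages)
        return -1;
    heap_allocs++;
    return 0;
}

static void xfer_free_pages(struct xfer_model *x)
{
    /* Pointer identity tells us whether the array was heap-allocated. */
    if (x->pages && x->pages != x->pages_inline) {
        free(x->pages);
        heap_frees++;
    }
    x->pages = NULL;
}
```

The free path needs no extra flag: comparing `pages` against the address of the embedded array is enough to tell the two cases apart, which is the property the commit relies on to skip `kfree` and `sg_free_table` safely.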
Replace the generic kmalloc bucket allocation with a SLAB_HWCACHE_ALIGN kmem_cache sized exactly to struct mx_transfer. The per-cpu slab magazine keeps freshly freed transfers hot for the next allocation, cutting slab-partial contention on repeated small-I/O loops. Cache lifetime is tied to module load: create after class_create() in mxdma_init(), destroy after the PCI / bus teardown that drains all in-flight transfers in mxdma_exit().
v1 profile showed memcpy_fromio(sizeof(struct mx_command)) at ~6.5 % of total cycles — four MMIO readq per pop — while the completion path only consumes the header (id / control) and host_addr (result). size and device_addr are producer-side fields that the host never reads on completion. Drop the full 32 B memcpy_fromio for two explicit readq covering just the required words, saving ~500–1000 ns per op. Zero the untouched words so dev_dbg doesn't print stack garbage for them.
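As a rough illustration of the two-word pop, the sketch below reads only the header and `host_addr` words from a simulated queue slot and zeroes the rest. The field order in `struct cmd_v1_model` is an assumption (the PR text does not show the real `struct mx_command` layout), and `fake_readq` is a hypothetical stand-in for the kernel's `readq` that simply counts accesses.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed 32 B layout; the actual field order of struct mx_command
 * is not shown in this PR. */
struct cmd_v1_model {
    uint64_t header;      /* id / control */
    uint64_t size;        /* producer-side only */
    uint64_t device_addr; /* producer-side only */
    uint64_t host_addr;   /* carries the result on completion */
};

static unsigned int mmio_reads; /* counts simulated MMIO reads */

/* Hypothetical stand-in for readq(): one 64-bit read from "device" memory. */
static uint64_t fake_readq(const volatile uint64_t *addr)
{
    mmio_reads++;
    return *addr;
}

/* Pop one completion: read only the words the completion path consumes. */
static void pop_cmd(const volatile struct cmd_v1_model *slot,
                    struct cmd_v1_model *out)
{
    /* Zero first so the unread fields never hold stack garbage. */
    memset(out, 0, sizeof(*out));
    out->header = fake_readq(&slot->header);
    out->host_addr = fake_readq(&slot->host_addr);
}
```

In this model each pop costs two reads instead of four, and `size` and `device_addr` always read back as zero on the host side.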
is_pushable() readq of the SQ mbox context showed up at ~2.8 % in the v1 profile because the submit_handler re-checks it on every command. Tail moves only from our own push_mx_command and head only grows as HW consumes, so the locally tracked free_space is a conservative lower bound — if we already see room for at least two commands there is no need to read HW for just this one. Keep the readq for the genuinely-full case so the HW refresh still drives forward progress when the queue fills up.
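The lower-bound argument can be checked with a small userspace model. This is a sketch under assumptions: `struct sq_model`, `RING_DEPTH`, the free-running index scheme, and the `hw_reads` counter are all hypothetical, and a plain volatile load stands in for the `readq` of the SQ mbox context.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_DEPTH 8u /* hypothetical queue depth */

struct sq_model {
    volatile uint32_t *hw_head; /* consumer index, advanced only by HW */
    uint32_t tail;              /* producer index, advanced only by us */
    uint32_t free_space;        /* cached lower bound on free slots */
    unsigned int hw_reads;      /* demo counter for simulated MMIO reads */
};

/* Re-read the HW head and recompute the cached free space. */
static uint32_t sq_refresh(struct sq_model *q)
{
    uint32_t head = *q->hw_head;

    q->hw_reads++;
    q->free_space = RING_DEPTH - (q->tail - head);
    return q->free_space;
}

static bool sq_is_pushable(struct sq_model *q)
{
    /* HW can only free more slots, so the cached value never
     * overestimates: with room for two, skip the HW read. */
    if (q->free_space >= 2)
        return true;
    /* Nearly full or full: refresh from HW to keep making progress. */
    return sq_refresh(q) >= 1;
}

static void sq_push(struct sq_model *q)
{
    q->tail++;
    q->free_space--;
}
```

Because only the producer moves `tail` and the consumer only ever advances `head`, a stale `free_space` can be too small but never too large, so skipping the read cannot overfill the ring in this model.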
🤔 Background and motivation (Why)
Exposes only the driver overhead.
Limited to structure-level changes. Each commit is independently revertable.
🏗️ Design changes
- Introduce inline storage (`pages_inline`, `sg_inline`, `cmd_inline`) inside `struct mx_transfer` to remove per-submit slab churn
- Split out a dedicated `kmem_cache` for `mx_transfer` (`SLAB_HWCACHE_ALIGN`)
- Reduce MMIO in `pop_mx_command` / `is_pushable`: fewer `readq` calls, plus a skip based on a local cache
- `SCHED_FIFO` low band, device-local NUMA `cpumask` binding, and sq/cq_wait pre-wake at DMA/ioctl entry
- Hold a `cpu_latency_qos_add_request` (50us) request

📝 Implementation details
Per-submit slab churn reduction
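- Prevents mistaken `sg_free_table` / `kfree` calls on inline storage.
- `cmd_inline` size: `BUILD_BUG_ON` watches at build time for growth of `mx_command`.
- `kmem_cache` lifetime: tied to module load/unload. Drained within the existing teardown order.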
v1 MMIO reduction
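- `pop_mx_command`: `readq` only the 2 words the completion path consumes (previously `readq` × 4).
- `is_pushable`: skips the HW read by treating the local `free_space` as a conservative lower bound. Falls back to re-reading HW only when the queue is about to fill.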
Handler scheduling & affinity
- `SCHED_FIFO` lowest RT band. Keeping `cond_resched` / `swait_event` preserves the softlockup / RCU stall bounds.
- Uses `set_cpus_allowed_ptr`. Because it sets an allowed mask, operators can still readjust placement with `taskset`.
CPU power state
📦 Release Note