perf(driver): minimize driver-side DMA submission overhead by daeyeong-XCENA · Pull Request #18 · xcena-dev/mxdriver

daeyeong-XCENA · 2026-04-24T05:12:23Z

🤔 Background and Motivation (Why)

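Target: the overhead the driver adds to the DMA submit/complete path.
Measurement basis: the minimum transfer size (8 B), where payload transfer cost converges to zero, so only pure driver overhead is exposed.
Cost axes repeatedly observed in profiles:
- Per-submit slab alloc/free churn
- Excessive MMIO accesses in the v1 submit_handler
- Wake/run latency of the handler kthreads relative to the user path
- Deep C-state exit latency overlapping with cold submissions
No change to functionality, ABI, device file interface, or HW command layout.
Limited to structural changes. Each commit is independently revertable.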

🏗️ Design Changes

- Introduce inline storage (pages_inline, sg_inline, cmd_inline) inside struct mx_transfer to remove per-submit slab churn
- Split out a dedicated kmem_cache for mx_transfer (SLAB_HWCACHE_ALIGN)
- Reduce MMIO in v1 pop_mx_command / is_pushable: fewer readq calls, plus skipping the HW read based on a local cache
- IO handler kthreads: SCHED_FIFO low band, cpumask binding to the device-local NUMA node, and sq/cq_wait pre-wake at DMA/ioctl entry
- Hold cpu_latency_qos_add_request(50us) for the device lifetime

📝 Implementation Details

Per-submit slab churn reduction

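- Inline branch: the free path tells inline storage from heap storage by pointer identity, which prevents sg_free_table / kfree from being called on the embedded static arrays by mistake.
- cmd_inline size: BUILD_BUG_ON watches at build time for growth of mx_command.
- kmem_cache lifetime: tied to module load/unload. Drained within the existing teardown order.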

v1 MMIO reduction

- pop_mx_command: readq only the 2 words the completion path consumes (previously readq × 4).
- is_pushable: skip the HW read by treating the local free_space as a conservative lower bound. Fall back to a HW re-read only when the queue is nearly full.

Handler scheduling & affinity

- SCHED_FIFO lowest RT band. cond_resched / swait_event are kept, which preserves the softlockup / RCU stall bounds.
- Uses set_cpus_allowed_ptr. Because it is an allowed mask, operators can still readjust it with taskset.
- Pre-wake: a no-op when the handler is already running.

CPU power state

- Acquire at probe, release at remove. Symmetric, including the out_fail path.
- Shallow idle is allowed. Avoids forcing polling-idle.

📦 Release Note

Lowest RT band keeps the I/O handlers ahead of CFS noise so userspace submissions don't pay CFS wake latency under CPU pressure. Handlers still yield via cond_resched() and sleep in swait_event when idle, so softlockup/RCU stalls remain bounded.

Wake sq_wait and cq_wait at the top of every data/context/ioctl path so the handler kthreads start running in parallel with page pinning, DMA mapping, and command construction. The wake is a cheap no-op when the handler is already running, and removes the cold-start component of wake latency when it wasn't.

Restrict mx_submit_thd and mx_complete_thd to the device's NUMA node via set_cpus_allowed_ptr at queue init. Keeps handler cache traffic (descriptor ring, sq/cq_wait, transfer structs) node-local instead of letting the scheduler place them on any CPU in the system. Uses set_cpus_allowed_ptr rather than kthread_bind so operators can still taskset to colocate handlers with a specific userspace CPU for tighter tuning. No-op on devices without NUMA affinity.

Register a cpu_latency_qos request with a 50 us wake-up budget at device probe and release it at device remove. This blocks deep idle states whose exit latency stretches the frequency ramp-up window (the governor reaching boost frequency after the CPU wakes from deep idle), which we observed adding ~12 us to cold DMA submissions. The request is held across the device's lifetime; shallow idle remains allowed, so we don't force a polling-idle CPU. Released on both the success and out_fail paths via destroy_mx_pdev.

Add pages_inline[1], sg_inline[1], and a 64 B cmd_inline area to struct mx_transfer so the single-page hot path skips kcalloc(pages), sg_alloc_table_from_pages(), and kzalloc(mx_command). Free paths detect inline use by pointer identity and skip the corresponding kfree / sg_free_table. BUILD_BUG_ON guards cmd_inline against future growth of struct mx_command (v1=32 B, v2=64 B today).
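As a rough illustration of the pointer-identity check described above, here is a minimal userspace sketch. It is not the driver's code: `struct transfer`, `struct cmd_v2`, `CMD_INLINE_SIZE`, `transfer_init`, and `transfer_release` are hypothetical stand-ins, `calloc`/`free` stand in for `kcalloc`/`kfree`, `static_assert` stands in for `BUILD_BUG_ON`, and the scatterlist part is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins: the real mx_transfer / mx_command layouts
 * are not shown in this PR. */
#define CMD_INLINE_SIZE 64

struct cmd_v2 { unsigned char raw[64]; };   /* stand-in for a 64 B command */

struct transfer {
    void **pages;                 /* points at pages_inline or a heap array */
    size_t nr_pages;
    void *cmd;                    /* points at cmd_inline or a heap buffer */
    void *pages_inline[1];        /* inline slot for the single-page case */
    unsigned char cmd_inline[CMD_INLINE_SIZE];
};

/* Userspace analogue of BUILD_BUG_ON: fail the build if the command
 * ever outgrows the inline area. */
static_assert(sizeof(struct cmd_v2) <= CMD_INLINE_SIZE,
              "cmd_inline is too small for the command struct");

int transfer_init(struct transfer *t, size_t nr_pages, size_t cmd_size)
{
    memset(t, 0, sizeof(*t));
    t->nr_pages = nr_pages;

    if (nr_pages <= 1) {
        t->pages = t->pages_inline;          /* hot path: no allocation */
    } else {
        t->pages = calloc(nr_pages, sizeof(*t->pages));
        if (!t->pages)
            return -1;
    }

    if (cmd_size <= CMD_INLINE_SIZE) {
        t->cmd = t->cmd_inline;              /* hot path: no allocation */
    } else {
        t->cmd = calloc(1, cmd_size);
        if (!t->cmd) {
            if (t->pages != t->pages_inline)
                free(t->pages);
            return -1;
        }
    }
    return 0;
}

/* Frees only what was heap-allocated; returns the number of heap frees. */
int transfer_release(struct transfer *t)
{
    int heap_frees = 0;

    if (t->pages != t->pages_inline) {       /* pointer identity check */
        free(t->pages);
        heap_frees++;
    }
    if (t->cmd != (void *)t->cmd_inline) {   /* pointer identity check */
        free(t->cmd);
        heap_frees++;
    }
    t->pages = NULL;
    t->cmd = NULL;
    return heap_frees;
}
```

In this sketch a single-page transfer with a command of at most 64 B performs no heap allocation at all, and the release path needs no extra flag because the pointer itself records which storage was used.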

Replace the generic kmalloc bucket allocation with a SLAB_HWCACHE_ALIGN kmem_cache sized exactly to struct mx_transfer. The per-cpu slab magazine keeps freshly freed transfers hot for the next allocation, cutting slab-partial contention on repeated small-I/O loops. Cache lifetime is tied to module load: create after class_create() in mxdma_init(), destroy after the PCI / bus teardown that drains all in-flight transfers in mxdma_exit().

v1 profile showed memcpy_fromio(sizeof(struct mx_command)) at ~6.5 % of total cycles — four MMIO readq per pop — while the completion path only consumes the header (id / control) and host_addr (result). size and device_addr are producer-side fields that the host never reads on completion. Drop the full 32 B memcpy_fromio for two explicit readq covering just the required words, saving ~500–1000 ns per op. Zero the untouched words so dev_dbg doesn't print stack garbage for them.
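The difference between the full copy and the two-word read can be sketched in userspace as follows. This is an illustration under assumptions, not the driver's implementation: `struct cmd`, its field order, `fake_readq`, `pop_full`, and `pop_partial` are hypothetical, and a counter on a plain memory read stands in for the cost of a real MMIO `readq`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 32 B command made of four 64-bit words. The field order
 * is an assumption; the PR only names the fields. */
struct cmd {
    uint64_t header;       /* id / control: read on completion */
    uint64_t size;         /* producer-side: not read on completion */
    uint64_t device_addr;  /* producer-side: not read on completion */
    uint64_t host_addr;    /* result: read on completion */
};

unsigned long mmio_reads;  /* counts simulated readq calls */

/* Stand-in for readq(): a volatile load plus a counter. */
uint64_t fake_readq(const volatile uint64_t *addr)
{
    mmio_reads++;
    return *addr;
}

/* Before: copy the whole command, one read per word (4 reads). */
void pop_full(struct cmd *out, const volatile uint64_t *slot)
{
    assert(out && slot);
    out->header      = fake_readq(&slot[0]);
    out->size        = fake_readq(&slot[1]);
    out->device_addr = fake_readq(&slot[2]);
    out->host_addr   = fake_readq(&slot[3]);
}

/* After: read only the two words the completion path consumes and
 * zero the rest so debug prints never show uninitialized data. */
void pop_partial(struct cmd *out, const volatile uint64_t *slot)
{
    assert(out && slot);
    out->header      = fake_readq(&slot[0]);
    out->host_addr   = fake_readq(&slot[3]);
    out->size        = 0;
    out->device_addr = 0;
}
```

Counting reads this way only shows the access pattern; the ~500–1000 ns saving quoted above comes from the author's profile on real hardware, not from this sketch.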

is_pushable() readq of the SQ mbox context showed up at ~2.8 % in the v1 profile because the submit_handler re-checks it on every command. Tail moves only from our own push_mx_command and head only grows as HW consumes, so the locally tracked free_space is a conservative lower bound — if we already see room for at least two commands there is no need to read HW for just this one. Keep the readq for the genuinely-full case so the HW refresh still drives forward progress when the queue fills up.
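The lower-bound argument can be made concrete with a small userspace model. Everything here is a hypothetical sketch: `struct sq`, `DEPTH`, `read_hw_free`, `sq_init`, and `push` are illustrative names, the queue depth is arbitrary, and a plain field stands in for the HW-owned head register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEPTH 8u   /* arbitrary queue depth for the sketch */

struct sq {
    volatile uint32_t hw_head;  /* simulated register: only HW advances it */
    uint32_t tail;              /* only the host advances it, on push */
    uint32_t cached_free;       /* last known free slots, never above the truth */
    unsigned long hw_reads;     /* counts simulated register reads */
};

void sq_init(struct sq *q)
{
    q->hw_head = 0;
    q->tail = 0;
    q->cached_free = DEPTH;
    q->hw_reads = 0;
}

/* Stand-in for the readq of the SQ context: the expensive path. */
uint32_t read_hw_free(struct sq *q)
{
    q->hw_reads++;
    return DEPTH - (q->tail - q->hw_head);
}

/* Skip the HW read while the cached value shows room for at least two
 * commands. The cache can only underestimate free space, because the
 * host is the only writer of tail and HW only frees slots. */
bool is_pushable(struct sq *q)
{
    if (q->cached_free >= 2)
        return true;
    q->cached_free = read_hw_free(q);   /* nearly full: refresh from HW */
    return q->cached_free >= 1;
}

void push(struct sq *q)
{
    assert(q->cached_free >= 1);
    q->tail++;
    q->cached_free--;
}
```

With a depth of 8, the first seven pushes in this model never touch the simulated register; the read happens only once the cached value drops below two, which is also the point where a refresh is needed to notice slots the HW has since freed.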

daeyeong-XCENA added 8 commits April 23, 2026 10:53

daeyeong-XCENA closed this Apr 24, 2026

daeyeong-XCENA deleted the dylee/perf/handlers_hot_conservative branch April 24, 2026 05:15

daeyeong-XCENA wants to merge 8 commits into main from dylee/perf/handlers_hot_conservative
