perf(driver): minimize driver-side DMA submission overhead #18

Closed
daeyeong-XCENA wants to merge 8 commits into main from dylee/perf/handlers_hot_conservative

@daeyeong-XCENA
Collaborator

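🤔 Background and motivation (Why)

  • Target: the overhead the driver adds to the DMA submit/complete path.
  • Measurement basis: the minimum transfer size (8 B), where payload
    transfer cost approaches zero, so only pure driver overhead is
    exposed.
  • Cost axes observed repeatedly in profiles:
    • Per-submit slab alloc/free churn
    • Excessive MMIO access in the v1 submit_handler
    • Wake/run latency of the handler kthreads relative to the user path
    • Deep C-state exit latency overlapping cold submissions
  • No change to functionality, ABI, device file interface, or HW command
    layout. Changes are limited to the structural level. Each commit is
    independently revertable.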

🏗️ Design changes

  • Introduce inline storage (pages_inline, sg_inline, cmd_inline) inside
    struct mx_transfer to eliminate per-submit slab churn
  • Split out a dedicated kmem_cache for mx_transfer (SLAB_HWCACHE_ALIGN)
  • Reduce MMIO in v1 pop_mx_command / is_pushable: fewer readq calls
    and skips based on a local cache
  • IO handler kthreads: SCHED_FIFO low band, binding to the device-local
    NUMA cpumask, and pre-wake of sq/cq_wait at DMA/ioctl entry
  • Hold a cpu_latency_qos_add_request(50us) request for the device
    lifetime

📝 Implementation details

Per-submit slab churn reduction

  • Inline branch: the free path identifies inline storage by pointer
    identity, which prevents erroneous sg_free_table / kfree calls on the
    embedded static arrays.
  • cmd_inline size: BUILD_BUG_ON catches growth of mx_command at
    build time.
  • kmem_cache lifetime: bound to module load/unload. Drained within the
    existing teardown order.
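
The inline-storage idea above can be modelled in plain userspace C. This is only a sketch under stated assumptions: `struct transfer_model`, `struct mx_command_model`, `transfer_setup`, `transfer_teardown`, and the `heap_frees` counter are hypothetical names, not the driver's actual definitions, and `_Static_assert` stands in for the kernel's BUILD_BUG_ON. It shows the free path deciding by pointer identity whether a buffer was ever heap-allocated.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the 32 B v1 command; the real layout is not
 * shown in the PR description. */
struct mx_command_model { uint64_t words[4]; };

#define CMD_INLINE_SIZE 64

/* Hypothetical model of a transfer with embedded single-page storage. */
struct transfer_model {
    void **pages;                 /* points at pages_inline or a heap array */
    void *pages_inline[1];
    void *cmd;                    /* points at cmd_inline or a heap buffer */
    unsigned char cmd_inline[CMD_INLINE_SIZE];
};

/* Userspace analogue of BUILD_BUG_ON: fail the build if the command ever
 * outgrows the inline area. */
_Static_assert(sizeof(struct mx_command_model) <= CMD_INLINE_SIZE,
               "command no longer fits in cmd_inline");

/* Counts real heap frees so the effect is observable (illustration only). */
static int heap_frees;

/* Single-page transfers use the embedded slot; larger ones allocate. */
int transfer_setup(struct transfer_model *t, size_t nr_pages)
{
    if (nr_pages == 1) {
        t->pages = t->pages_inline;
    } else {
        t->pages = calloc(nr_pages, sizeof(*t->pages));
        if (!t->pages)
            return -1;
    }
    t->cmd = t->cmd_inline;
    return 0;
}

/* Free path: pointer identity tells inline storage from heap storage, so
 * free() is never called on memory embedded in the struct. */
void transfer_teardown(struct transfer_model *t)
{
    if (t->pages != t->pages_inline) {
        free(t->pages);
        heap_frees++;
    }
    if (t->cmd != (void *)t->cmd_inline) {
        free(t->cmd);
        heap_frees++;
    }
    t->pages = NULL;
    t->cmd = NULL;
}
```

In this model a one-page transfer goes through setup and teardown without touching the allocator, while a multi-page transfer still takes the heap path and is freed exactly once.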

v1 MMIO reduction

  • pop_mx_command: readq only the two words the completion path
    consumes (previously readq × 4).
  • is_pushable: skip the HW read by using the local free_space as a
    conservative lower bound. Fall back to a HW re-read only when the
    queue is nearly full.
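
The pop_mx_command change can be illustrated with a small userspace model that counts reads. Everything here is an assumption made for illustration: `struct cmd_model`, its word order (header, size, device_addr, host_addr), `model_readq`, `pop_full`, and `pop_sparse` are hypothetical, and a plain volatile array stands in for the MMIO slot. The sketch contrasts a full four-word copy with reading only the two words the completion path consumes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 32 B command layout; the real word order is an assumption. */
struct cmd_model {
    uint64_t header;       /* id / control */
    uint64_t size;         /* producer-side field */
    uint64_t device_addr;  /* producer-side field */
    uint64_t host_addr;    /* carries the result on completion */
};

/* Counts simulated MMIO reads (illustration only). */
static unsigned mmio_reads;

/* Stand-in for readq(): one counted 64-bit read from the device slot. */
static uint64_t model_readq(const volatile uint64_t *addr)
{
    mmio_reads++;
    return *addr;
}

/* Old behaviour: copy all four words, as a 32 B memcpy_fromio would. */
void pop_full(const volatile uint64_t *slot, struct cmd_model *out)
{
    out->header      = model_readq(&slot[0]);
    out->size        = model_readq(&slot[1]);
    out->device_addr = model_readq(&slot[2]);
    out->host_addr   = model_readq(&slot[3]);
}

/* New behaviour: read only the consumed words and zero the rest so that
 * debug prints never show uninitialized data. */
void pop_sparse(const volatile uint64_t *slot, struct cmd_model *out)
{
    memset(out, 0, sizeof(*out));
    out->header    = model_readq(&slot[0]);
    out->host_addr = model_readq(&slot[3]);
}
```

The model only demonstrates the read count dropping from four to two; the per-operation saving quoted in the PR comes from the author's measurements on real hardware, not from this sketch.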

Handler scheduling & affinity

  • SCHED_FIFO lowest RT band. cond_resched / swait_event are retained,
    so softlockup / RCU stall bounds are preserved.
  • Uses set_cpus_allowed_ptr. Because it sets an allowed mask, operators
    can still re-adjust placement with taskset.
  • Pre-wake: no-op when the handler is already running.

CPU power state

  • Acquired at probe, released at remove. Symmetric, including the
    out_fail path.
  • Shallow idle remains allowed. Avoids forcing polling idle.

📦 Release Note

Lowest RT band keeps the I/O handlers ahead of CFS noise so userspace
submissions don't pay CFS wake latency under CPU pressure.  Handlers
still yield via cond_resched() and sleep in swait_event when idle,
so softlockup/RCU stalls remain bounded.

Wake sq_wait and cq_wait at the top of every data/context/ioctl path
so the handler kthreads start running in parallel with page pinning,
DMA mapping, and command construction.  The wake is a cheap no-op
when the handler is already running, and removes the cold-start
component of wake latency when it wasn't.

Restrict mx_submit_thd and mx_complete_thd to the device's NUMA node
via set_cpus_allowed_ptr at queue init.  Keeps handler cache traffic
(descriptor ring, sq/cq_wait, transfer structs) node-local instead
of letting the scheduler place them on any CPU in the system.

Uses set_cpus_allowed_ptr rather than kthread_bind so operators can
still taskset to colocate handlers with a specific userspace CPU for
tighter tuning.  No-op on devices without NUMA affinity.

Register a cpu_latency_qos request with a 50us wake-up budget at
device probe and release it at device remove.  This blocks deep
idle states, whose exit latency stretches the frequency ramp-up
window (the governor reaches boost frequency only after the CPU
wakes from deep idle).  We observed that window adding ~12us to
cold DMA submissions.

Held across the device's lifetime; shallow idle remains allowed
so we don't force a polling-idle CPU.  Freed on both success and
out_fail paths via destroy_mx_pdev.

Add pages_inline[1], sg_inline[1], and a 64 B cmd_inline area to
struct mx_transfer so the single-page hot path skips kcalloc(pages),
sg_alloc_table_from_pages(), and kzalloc(mx_command).

Free paths detect inline use by pointer identity and skip the
corresponding kfree / sg_free_table. BUILD_BUG_ON guards cmd_inline
against future growth of struct mx_command (v1=32 B, v2=64 B today).

Replace the generic kmalloc bucket allocation with a SLAB_HWCACHE_ALIGN
kmem_cache sized exactly to struct mx_transfer. The per-cpu slab
magazine keeps freshly freed transfers hot for the next allocation,
cutting slab-partial contention on repeated small-I/O loops.

Cache lifetime is tied to module load: create after class_create() in
mxdma_init(), destroy after the PCI / bus teardown that drains all
in-flight transfers in mxdma_exit().

v1 profile showed memcpy_fromio(sizeof(struct mx_command)) at ~6.5 %
of total cycles — four MMIO readq per pop — while the completion
path only consumes the header (id / control) and host_addr (result).
size and device_addr are producer-side fields that the host never
reads on completion.

Replace the full 32 B memcpy_fromio with two explicit readq calls covering
just the required words, saving ~500–1000 ns per op.  Zero the
untouched words so dev_dbg doesn't print stack garbage for them.

is_pushable() readq of the SQ mbox context showed up at ~2.8 % in the
v1 profile because the submit_handler re-checks it on every command.
Tail moves only from our own push_mx_command and head only grows as
HW consumes, so the locally tracked free_space is a conservative
lower bound — if we already see room for at least two commands there
is no need to read HW for just this one.

Keep the readq for the genuinely-full case so the HW refresh still
drives forward progress when the queue fills up.
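
The conservative lower-bound argument above can be checked with a small userspace ring model. This is a sketch under assumptions, not the driver's code: `struct sq_model`, `local_free`, `sq_is_pushable`, `sq_push`, the free-running 32-bit head/tail counters, and the `hw_reads` counter are all hypothetical, and a volatile variable stands in for the SQ mailbox context read via readq. Because the real head can only be ahead of the cached one, the locally computed free space never overestimates what the hardware has.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical submission-queue model with free-running counters. */
struct sq_model {
    uint32_t capacity;
    uint32_t tail;               /* advanced only by our own push */
    uint32_t cached_head;        /* last head value read from "HW" */
    volatile uint32_t *hw_head;  /* stand-in for the MMIO mailbox context */
    unsigned hw_reads;           /* counts simulated MMIO reads */
};

/* Free slots according to the cached head. The real head is never behind
 * the cached one, so this is a lower bound on the true free space. */
static uint32_t local_free(const struct sq_model *q)
{
    return q->capacity - (q->tail - q->cached_head);
}

/* Skip the HW read while the local view shows room for at least two
 * commands; otherwise refresh the head from HW and decide on real data. */
bool sq_is_pushable(struct sq_model *q)
{
    if (local_free(q) >= 2)
        return true;

    q->cached_head = *q->hw_head;
    q->hw_reads++;
    return local_free(q) >= 1;
}

/* Producer side: only this function moves the tail. */
void sq_push(struct sq_model *q)
{
    q->tail++;
}
```

With a capacity of four, the first three pushes in this model need no hardware read at all; the read only happens once the local view drops below two free slots, which is also the path that picks up a head that has advanced in the meantime.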
daeyeong-XCENA deleted the dylee/perf/handlers_hot_conservative branch April 24, 2026 05:15