feat: opt-in KSM support for VMM guest memory #569
airhorns wants to merge 1 commit into urunc-dev:main
Add a new [ksm] section to the urunc configuration that, when enabled,
causes urunc to call prctl(PR_SET_MEMORY_MERGE, 1) on itself before
execing the VMM. The flag lives in MMF_INIT_MASK so it survives execve,
and every anonymous mapping the VMM creates afterwards (including guest
RAM) is auto-marked MADV_MERGEABLE. The host's ksmd then deduplicates
identical pages across VMM processes.
This matters in particular for Firecracker, whose seccomp filter only
permits madvise(MADV_DONTNEED), so the VMM cannot opt itself in. Doing
the prctl in urunc before exec is the only portable way to make
Firecracker guest memory KSM-eligible.
The feature is off by default and opt-in via /etc/urunc/config.toml:
[ksm]
enable = true
Configuration follows the existing [log] / [timestamps] pattern: the
value is loaded at urunc create, stored in state.json annotations, and
round-tripped through UruncConfig.Map / UruncConfigFromMap so it
survives reexec.
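The round trip can be pictured with a small Go sketch. The type, function names, and annotation key below are all hypothetical stand-ins, not the actual UruncConfig.Map / UruncConfigFromMap code; the sketch only shows the general idea of flattening a boolean option into string annotations and reading it back with a safe default.

```go
package main

import (
	"fmt"
	"strconv"
)

// ksmConfig is a hypothetical stand-in for the [ksm] config section.
type ksmConfig struct {
	Enable bool
}

// ksmEnableKey is an assumed annotation key, chosen only for this example.
const ksmEnableKey = "urunc_config.ksm.enable"

// toMap flattens the section into string key/value pairs, the shape that
// state annotations use.
func (c ksmConfig) toMap() map[string]string {
	return map[string]string{ksmEnableKey: strconv.FormatBool(c.Enable)}
}

// ksmConfigFromMap rebuilds the section. A missing or malformed value falls
// back to the default (KSM off).
func ksmConfigFromMap(m map[string]string) ksmConfig {
	enable, err := strconv.ParseBool(m[ksmEnableKey])
	if err != nil {
		return ksmConfig{}
	}
	return ksmConfig{Enable: enable}
}

func main() {
	saved := ksmConfig{Enable: true}.toMap()
	restored := ksmConfigFromMap(saved)
	fmt.Println("ksm enabled after round trip:", restored.Enable)
}
```

Defaulting to off on any parse failure keeps the feature opt-in even if the stored state is incomplete.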
Requires Linux 6.4+ with CONFIG_KSM=y. On older kernels prctl returns
EINVAL; urunc logs a warning and continues without KSM. The host's ksmd
must also be running (/sys/kernel/mm/ksm/run=1) for actual merging to
happen; urunc does not manage ksmd itself.
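Since urunc leaves ksmd alone, an operator may want a quick check that it is actually running. The following Go sketch is not part of this PR; it reads the run file and treats only the value 1 as active (per the kernel documentation, 0 stops ksmd and 2 stops it and unmerges all pages). The helper name is hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// ksmRunPath is the sysfs control file for ksmd.
const ksmRunPath = "/sys/kernel/mm/ksm/run"

// ksmdActive (hypothetical helper) interprets the contents of the run file.
// Only "1" means ksmd is scanning and merging pages.
func ksmdActive(contents string) bool {
	return strings.TrimSpace(contents) == "1"
}

func main() {
	data, err := os.ReadFile(ksmRunPath)
	if err != nil {
		// The file is absent when the kernel was built without CONFIG_KSM.
		fmt.Println("cannot read KSM state:", err)
		return
	}
	if ksmdActive(string(data)) {
		fmt.Println("ksmd is running")
	} else {
		fmt.Println("ksmd is not running; pages marked mergeable will not be merged")
	}
}
```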
The prctl call must happen before vmm.PreExec, since HVT's PreExec
installs a seccomp filter that does not permit prctl(PR_SET_MEMORY_MERGE).
urunc-deploy gains an ENABLE_KSM env var (default false) that writes
ksm.enable=true into the installed config.toml via tomlq.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Hello @airhorns, thank you for this PR. In general, please follow the contribution guide and in particular:
Regarding the suggested feature, I am very skeptical due to the security implications. Still, if there are use cases where this can help you, and since it is introduced as a configuration option, it could be fine. There are a few questions, though:
|
Apologies -- this is a very fair expectation!
Indeed. My organization is still assessing whether we'll turn it on, but I figured we should still upstream the support we added to try it out, as long as it is opt-in.
One small note: you can't actually read sibling guest memory. There's no shared memory region, and no holes have been discovered that let one guest read arbitrary pages from another. Instead, the CVE above is a Spectre-style attack: via a timing side channel, you can tell whether someone else on the same machine has the same page in memory as you, which lets you effectively guess what might be in memory elsewhere. Say you're an attacker trying to dump secret keys from some other process that you suspect has them in memory: you have to enumerate all the possible 4KiB pages your victim might hold and run the timing attack on each one to identify which are actually present. Similar to Spectre, there are mitigations for this, like reducing timer resolution for guests or not allowing guests to run long enough to do the enumeration.
Also notable: the sandboxing platform Tencent released a few days ago also includes KSM support: https://github.com/TencentCloud/CubeSandbox
My take is that you could conceivably make it secure with appropriate mitigations, so this feature is worth including, but I am just a humble contributor!
Yup, it should! As long as the memory is marked
Yep absolutely. You should just need to turn KSM on on your host machine:
echo 1 | sudo tee /sys/kernel/mm/ksm/run
and then start several instances of the same image with the
[ksm]
enable = true
then you can review the KSM stats over time
cat /sys/kernel/mm/ksm/pages_shared
cat /sys/kernel/mm/ksm/pages_sharing
cat /sys/kernel/mm/ksm/pages_unshared
cat /sys/kernel/mm/ksm/full_scans
KSM is working if
grep -A20 KSM /proc/<pid>/smaps |
|
Hello @airhorns
I see. As long as this is a configuration option and easily disabled, we can merge it. The users can decide whether to enable it or not.
Nice. One of the reasons for asking is to decide whether it should be a monitor-specific (e.g. Qemu/Firecracker) configuration or a generic one. However, it seems it should be generic.
I will try that out! Thank you for the steps. Overall, we have no objection to reviewing and merging this PR. One point worth discussing is the scope at which this should be applied. If implemented as a |
This adds a new opt-in configuration that lets host nodes deduplicate identical memory pages using the kernel's KSM feature.
We add a [ksm] section to the urunc configuration that causes urunc to call prctl(PR_SET_MEMORY_MERGE, 1) on itself before execing the VMM. The flag lives in MMF_INIT_MASK so it survives execve, and every anonymous mapping the VMM creates afterwards (including guest RAM) is auto-marked MADV_MERGEABLE. The host's ksmd then deduplicates identical pages across VMM processes.
This matters in particular for Firecracker, whose seccomp filter only permits madvise(MADV_DONTNEED), so the VMM cannot opt itself in. Doing the prctl in urunc before exec is the only portable way to make Firecracker guest memory KSM-eligible.
Usage
Off by default. Opt in via /etc/urunc/config.toml:
[ksm]
enable = true
Configuration follows the existing [log] / [timestamps] pattern: the value is loaded at urunc create, stored in state.json annotations, and round-tripped through UruncConfig.Map / UruncConfigFromMap so it survives reexec.
urunc-deploy gains an ENABLE_KSM env var (default false) that writes ksm.enable=true into the installed config.toml via tomlq.
Note that KSM tends to be disabled by default because it presents a timing side channel through which attacker guests can profile which pages other guests have in memory. See https://www.sentinelone.com/vulnerability-database/cve-2024-0564/ for more details. I think the considerations of turning KSM on or off are upstream of urunc, but I think this means we should default it to off (and not change behaviour at all).
Requirements
Linux 6.4+ with CONFIG_KSM=y. On older kernels prctl returns EINVAL; urunc logs a warning and continues without KSM.
The host's ksmd must be running (/sys/kernel/mm/ksm/run=1) for actual merging to happen. urunc does not manage ksmd itself.
Implementation notes
The prctl call must happen before vmm.PreExec, since HVT's PreExec installs a seccomp filter that does not permit prctl(PR_SET_MEMORY_MERGE).
ksm_linux.go / ksm_other.go with build tags so non-Linux builds still compile (the non-Linux stub is a no-op).
Observed impact
Measured on a 3-node GKE cluster running 14 firecracker-backed workerd unikernels: ~90% of each firecracker's guest RAM merged once ksmd settled, +5.3 GiB cluster-wide RSS savings vs. the same workload without KSM.