feat: opt-in KSM support for VMM guest memory #569

Open

airhorns wants to merge 1 commit into urunc-dev:main from airhorns:feat/enable-ksm

Conversation


@airhorns airhorns commented Apr 21, 2026

This adds a new opt-in configuration that allows host nodes to deduplicate identical memory pages using the kernel's KSM (Kernel Samepage Merging) feature.

We add a [ksm] section to the urunc configuration that causes urunc to call prctl(PR_SET_MEMORY_MERGE, 1) on itself before execing the VMM. The flag lives in MMF_INIT_MASK so it survives execve, and every anonymous mapping the VMM creates afterwards (including guest RAM) is auto-marked MADV_MERGEABLE. The host's ksmd then deduplicates identical pages across VMM processes.

This matters in particular for Firecracker, whose seccomp filter only permits madvise(MADV_DONTNEED), so the VMM cannot opt itself in. Doing the prctl in urunc before exec is the only portable way to make Firecracker guest memory KSM-eligible.

Usage

Off by default. Opt in via /etc/urunc/config.toml:

[ksm]
enable = true

Configuration follows the existing [log] / [timestamps] pattern: the value is loaded at urunc create, stored in state.json annotations, and round-tripped through UruncConfig.Map / UruncConfigFromMap so it survives reexec.

urunc-deploy gains an ENABLE_KSM env var (default false) that writes ksm.enable=true into the installed config.toml via tomlq.

Note that KSM tends to be disabled by default because it opens a timing side channel: an attacker guest can infer which pages other guests have in memory. See https://www.sentinelone.com/vulnerability-database/cve-2024-0564/ for more details. The trade-offs of turning KSM on or off are upstream of urunc, but they do mean we should default it to off (changing no behaviour at all).

Requirements

  • Linux 6.4+ with CONFIG_KSM=y. On older kernels prctl returns EINVAL; urunc logs a warning and continues without KSM.
  • Host ksmd must be running (/sys/kernel/mm/ksm/run=1) for actual merging to happen. urunc does not manage ksmd itself.

Implementation notes

  • The prctl call must happen before vmm.PreExec, since HVT's PreExec installs a seccomp filter that does not permit prctl(PR_SET_MEMORY_MERGE).
  • Platform split: ksm_linux.go / ksm_other.go with build tags so non-Linux builds still compile (the non-Linux stub is a no-op).

Observed impact

Measured on a 3-node GKE cluster running 14 firecracker-backed workerd unikernels: ~90% of each firecracker's guest RAM merged once ksmd settled, +5.3 GiB cluster-wide RSS savings vs. the same workload without KSM.


netlify Bot commented Apr 21, 2026

Deploy Preview for urunc ready!

Latest commit: 9087490
Latest deploy log: https://app.netlify.com/projects/urunc/deploys/69e7a87b6424a7000845639b
Deploy Preview: https://deploy-preview-569--urunc.netlify.app

@airhorns airhorns marked this pull request as ready for review April 21, 2026 16:44
Contributor

cmainas commented Apr 22, 2026

Hello @airhorns,

thank you for this PR. In general, please follow the contribution guide and in particular:

  • Please open an issue before submitting a PR for a specific bug / feature so we can discuss and verify it.
  • Please do not overwrite the PR template.

Regarding the suggested feature, I am very skeptical due to the security implications. However, since there are use cases where this can help you, and since it is introduced as a configuration option, it could be fine. That said, there are a few questions:

  • Does KSM apply for non KVM-based sandboxes (e.g. for seccomp-based ones) too?
  • Would it be possible to share how you performed the evaluation so we can also verify the impact you mention in the PR?

Author

airhorns commented Apr 23, 2026

> In general, please follow the contribution guide

Apologies -- this is a very fair expectation!

> I am very skeptical due to the security implications

Indeed. My organization is still assessing if we'll turn it on or not, but I figured we should still upstream the support we added to try it out! As long as it is opt-in.

One small note: you can't actually read sibling guest memory. There's no shared memory region, and no hole has been discovered through which one guest can read arbitrary pages from another guest. Instead, the CVE above is a Spectre-style attack: via a timing side channel, you can tell whether someone else on the same machine has the same page contents in memory as you, which lets you guess at what might be in memory elsewhere. Say you're an attacker trying to dump secret keys from another process you suspect holds them: you have to enumerate every possible 4 KiB page your victim might have in memory and run the timing attack on each to identify whether it is present. As with Spectre, there are mitigations, like reducing timer resolution for guests or not letting guests run long enough to do the enumeration.

Also notable: Tencent's new sandboxing platform, released a few days ago, also includes KSM support: https://github.com/TencentCloud/CubeSandbox

My take is that you could conceivably make it secure with appropriate mitigations, so this feature is worth including, but I am just a humble contributor!

> Does KSM apply for non KVM-based sandboxes (e.g. for seccomp-based ones) too?

Yup, it should! As long as the memory is marked MADV_MERGEABLE then the kernel can deduplicate it. I believe KSM is most often used in virtualization contexts, but applies just the same outside of them too!

> Would it be possible to share how you performed the evaluation so we can also verify the impact you mention in the PR?

Yep absolutely. You should just need to turn KSM on on your host machine:

echo 1 | sudo tee /sys/kernel/mm/ksm/run

and then start several instances of the same image with the urunc from this branch, with

[ksm]
enable = true

then you can review the KSM stats over time

cat /sys/kernel/mm/ksm/pages_shared
cat /sys/kernel/mm/ksm/pages_sharing
cat /sys/kernel/mm/ksm/pages_unshared
cat /sys/kernel/mm/ksm/full_scans

KSM is working if pages_shared goes up the more VMs you start. You can also look at the smaps for a process to see how much of its memory is being shared:

grep -A20 KSM /proc/<pid>/smaps

Contributor

cmainas commented Apr 27, 2026

Hello @airhorns

> I am very skeptical due to the security implications

> Indeed. My organization is still assessing if we'll turn it on or not, but I figured we should still upstream the support we added to try it out! As long as it is opt-in.
>
> One small note: you can't actually read sibling guest memory. […]

I see. As long as this is a configuration option and easily disabled, we can merge it. The users can decide whether to enable it or not.

> Does KSM apply for non KVM-based sandboxes (e.g. for seccomp-based ones) too?

> Yup, it should! As long as the memory is marked MADV_MERGEABLE then the kernel can deduplicate it. […]

Nice. One of the reasons for asking is to decide whether it should be a monitor-specific (e.g. Qemu/Firecracker) configuration or a generic one. However, it seems it should be generic.

> Would it be possible to share how you performed the evaluation so we can also verify the impact you mention in the PR?

> Yep absolutely. You should just need to turn KSM on on your host machine, start several instances of the same image with the urunc from this branch, and review the KSM stats over time. […]

I will try that out! Thank you for the steps.

Overall, we have no objections to reviewing and merging this PR. One point worth discussing is the scope at which this should be applied: if implemented as a urunc configuration option, it would affect all containers; alternatively, it could be introduced as a runtime annotation, enabling it only for specific containers.
