
Write provisional state.json with PID -1 before cgroup Apply #5257

Open
HirazawaUi wants to merge 2 commits into opencontainers:main from HirazawaUi:write-state-json

Conversation

@HirazawaUi (Contributor)

Write a provisional state.json with PID -1 (a sentinel) before cgroup Apply in initProcess.start(); this is achieved by temporarily setting initProcess to nil.

If we do not set initProcess to nil, state.json will contain the real PID of the STAGE_PARENT process. This PID belongs to the first-stage runc init process, which will be reaped (killed) in waitForChildExit() and replaced by the final STAGE_INIT PID.

The problem is: if an external tool (such as conmon) invokes runc state within this time window, it will:

  • Read the STAGE_PARENT PID from state.json
  • Check whether this PID is alive (yes, it is temporarily still alive)
  • Check whether the exec.fifo exists (yes, it is created at the beginning of start())
  • Conclude that the container status is "created"

The external tool then calls runc start, and runc start begins monitoring the health status of this STAGE_PARENT PID. However, almost simultaneously, waitForChildExit() inside runc create will reap this PID, causing runc start to detect that the "process is dead" and report an error.

By setting initProcess to nil, the PID written to state.json is -1 (a nonexistent process). When an external tool calls runc state, it will:

  • Read PID -1
  • Check /proc/-1/stat → does not exist
  • hasInit() returns false → the container state is determined to be "stopped"
  • The external tool sees "stopped" and continues waiting (it will not call runc start)

This continues until runc create reaches the procReady stage, at which point the actual STAGE_INIT PID has stabilized, and updateState(p) overwrites state.json with the correct PID, formally transitioning the container to "created".

Signed-off-by: HirazawaUi <695097494plus@gmail.com>
@HirazawaUi (Contributor, Author)

HirazawaUi commented Apr 19, 2026

/assign @rata @kolyshkin

This PR aims to once again fix the issue in #4757.

@rata (Member) left a comment

@HirazawaUi thanks for tackling this again. The approach seems reasonable to me :)

_, uerr := p.container.updateState(nil)
p.container.initProcess = savedInit
if uerr != nil {
return fmt.Errorf("unable to store init state: %w", uerr)

nit: we are using the same string above; let's use a different message here so it's not ambiguous later if we hit this error.

savedInit := p.container.initProcess
p.container.initProcess = nil
_, uerr := p.container.updateState(nil)
p.container.initProcess = savedInit

I'd move this to a new function

@thaJeztah (Member)

I haven't looked in-depth, but saw the PR title, and recall some cases where PID -1 leaked to callers that didn't use it as a sentinel value, and tried to kill process -1, which, erm, is rather broad.


rata commented Apr 20, 2026

@thaJeztah ouch. None of these scenarios is good:

  If pid is positive, then signal sig is sent to the process with the ID specified by pid.
  If pid equals 0, then sig is sent to every process in the process group of the calling process.
  If  pid  equals  -1,  then sig is sent to every process for which the calling process has permission to send signals, except for process 1 (init), but see below.
  If pid is less than -1, then sig is sent to every process in the process group whose ID is -pid.

It's not obvious how to fix this. Maybe we need another file (not state.json; something like cleanup.json) with the cgroup path, and runc delete should check that when deleting too. Not sure if there is a better way to handle this.

@rata (Member) left a comment

The -1 as PID has a meaning and can be problematic, as @thaJeztah pointed out.

I think we need to rework this; the only reasonable way out I see now is using a cleanup.json or so that runc delete honors. Does anyone have another idea?

@thaJeztah (Member)

ouch. Not sure one of the scenarios is good:

Yeah, it's a tricky one; to be clear, I think we already have cases where either 0 or -1 could be returned, so users should check for positive values.

I think at the time I went looking through containerd and moby to see possible paths missing a check, but it's very easy to miss, and (e.g.) have something like:

if pid != 0 {
    // kill process
}

As I said, I don't think it's new; it's mostly that I recall at least one case where it, erm, led to some fun, but I don't think we ever found the actual code-path that could trigger it 😅 https://x.com/ibuildthecloud/status/1159143536597450752?s=12&t=Y92XZlAhVJRKCtPDml0Csg


@kolyshkin (Contributor)

I don't like the idea of writing JSON a few times during create.

Maybe we can just create some empty file (say .creating, in the same state dir as state.json?) and use its existence as a flag for runc state? Then remove the flag file once state.json is created. Something like this.


rata commented Apr 21, 2026

@kolyshkin the thing is, we need to clean up the cgroup (IIRC, @HirazawaUi correct me if I'm wrong). So we need to know the cgroup path or something like that to clean it up.

@HirazawaUi (Contributor, Author)

I think we need to rework this, the only reasonable way out I see now is using a cleanup.json or so that runc delete honors. Anyone has another idea?

I’m leaning towards this option. We could introduce a temporary file and remove it once state.json is created. This file would only be accessed during runc delete: if the temporary file exists but state.json is missing, we can clean up the cgroup based on the temporary file. If state.json is already present, we will continue to rely on it for the cleanup.


rata commented Apr 22, 2026

@kolyshkin I'm not a fan either. I wish there was a simpler way to handle this. Maybe with systemd (when it manages the cgroups) we can ask it to clean it up somehow? It won't work for all cases, but almost everyone is using the systemd driver, so it would cover a good chunk of users.

@HirazawaUi (Contributor, Author)

@kolyshkin I'm not a fan either. I wish there was a simpler way to handle this. Maybe with systemd (when it manages the cgroups) we can ask it to clean it up somehow? It won't work for all cases, but almost everyone is using the systemd driver, so it would cover a good chunk of users.

However, how would systemd distinguish whether a cgroup was frozen intentionally by the user or unintentionally? Determining the intent behind the freeze could be quite a challenging issue to address.


rata commented Apr 23, 2026

@HirazawaUi I don't know; my hope is that maybe we can instruct systemd to clean up "orphan" cgroups or so. I doubt it is possible, but as systemd is always running, there might be something it can do for us.

@kolyshkin (Contributor)

@rata that won't work for at least non-systemd users (we still have fs drivers and support those).

I need to look deeper to provide a good answer, but off the top of my head I think we can write just the cgroupPath to the .creating file (or make it a JSON but with fewer fields, not the complete state.json) and use it.

@HirazawaUi (Contributor, Author)

I have carefully reviewed this issue again.

For runc delete, the current internal path is:
runc delete -> getContainer() -> libcontainer.Load() -> refreshState() -> hasInit() -> Stopped -> Destroy(), so it does not end up signaling PID -1 in this flow.

The real concern is the sentinel value leaking via state.json to external consumers (or any future code path that might use the raw PID unsafely).

If we omit init_process_pid (using omitempty) during state serialization when it is unknown, rather than explicitly writing -1, this keeps delete behavior unchanged while removing the -1 leak. Consumers should treat a missing or non-positive PID as "not available" and only signal when pid > 0.

I have submitted the code and verified that it passes all existing tests, which should ensure no regressions in current functionality. Could you please let me know what you think of this approach? Thanks!

@rata @kolyshkin @thaJeztah


rata commented Apr 27, 2026

@HirazawaUi if we omit it, then we need to validate what users do. It is entirely possible that it takes the zero value in Go, which is 0, and that also has a special meaning. (I think @thaJeztah pointed to this too? Or my answer with the manpage quote?)


HirazawaUi commented Apr 27, 2026

@rata I have previously considered this issue. Just as we avoid signaling PID -1, we should also avoid signaling PID 0, as the "blast radius" is equally significant. If an external system or a runc code path were to read a PID of 0 and attempt to kill it, the signal would be broadcast to all processes in its process group. I believe this would similarly lead to system instability or unavailability.


rata commented Apr 27, 2026

Exactly. My point is: skipping this field may lead to that behavior. We at least need to check what the main users do to see if that will happen. But, at the same time, I'm unsure if just using another file is cleaner and avoids that completely.

@HirazawaUi (Contributor, Author)

I believe both omitting init_process_pid (using omitempty) during state serialization when unknown and using a separate file to record cgroupPath are viable approaches.

Regarding the former, it has not only been verified in this PR, but I also manually ran the test cases from the github.com/containers/common project, and it performed well there as well.

While this may not fully guarantee that all major runc users will be unaffected by the change, I think it could be a valid option to consider if further investigation yields no other solutions.

