
Write provisional state.json with PID -1 before cgroup Apply #5257

Open
HirazawaUi wants to merge 2 commits into opencontainers:main from HirazawaUi:write-state-json

Conversation

@HirazawaUi (Contributor)

Write a provisional state.json with PID -1 (a sentinel) before cgroup Apply in initProcess.start(); this is achieved by temporarily setting initProcess to nil.

If we do not set initProcess to nil, state.json will contain the real PID of the STAGE_PARENT process. This PID belongs to the first-stage runc init process, which will be reaped (killed) in waitForChildExit() and replaced by the final STAGE_INIT PID.

The problem is: if an external tool (such as conmon) invokes runc state within this time window, it will:

  • Read the STAGE_PARENT PID from state.json
  • Check whether this PID is alive (yes, it is temporarily still alive)
  • Check whether the exec.fifo exists (yes, it is created at the beginning of start())
  • Conclude that the container status is "created"

The external tool then calls runc start, and runc start begins monitoring the health status of this STAGE_PARENT PID. However, almost simultaneously, waitForChildExit() inside runc create will reap this PID, causing runc start to detect that the "process is dead" and report an error.

By setting initProcess to nil, the PID written to state.json is -1 (a nonexistent process). When an external tool calls runc state, it will:

  • Read PID -1
  • Check /proc/-1/stat → does not exist
  • hasInit() returns false → the container state is determined to be "stopped"
  • The external tool sees "stopped" and continues waiting (it will not call runc start)

This continues until runc create reaches the procReady stage, at which point the actual STAGE_INIT PID has stabilized, and updateState(p) overwrites state.json with the correct PID, formally transitioning the container to "created".

Signed-off-by: HirazawaUi <695097494plus@gmail.com>
@HirazawaUi (Contributor, Author)

HirazawaUi commented Apr 19, 2026

/assign @rata @kolyshkin

This PR aims to once again fix the issue in #4757.

@rata (Member) left a comment

@HirazawaUi thanks for tackling this again. The approach seems reasonable to me :)

_, uerr := p.container.updateState(nil)
p.container.initProcess = savedInit
if uerr != nil {
return fmt.Errorf("unable to store init state: %w", uerr)

nit: we are using the same string above; let's use a different message here so it's not ambiguous later if we hit this error.

savedInit := p.container.initProcess
p.container.initProcess = nil
_, uerr := p.container.updateState(nil)
p.container.initProcess = savedInit

I'd move this to a new function

@thaJeztah (Member)

I haven't looked in-depth, but saw the PR title, and recall some cases where PID -1 leaked to callers that didn't use it as a sentinel value, and tried to kill process -1, which, erm, is rather broad.


rata commented Apr 20, 2026

@thaJeztah ouch. None of these scenarios is good:

  If pid is positive, then signal sig is sent to the process with the ID specified by pid.
  If pid equals 0, then sig is sent to every process in the process group of the calling process.
  If  pid  equals  -1,  then sig is sent to every process for which the calling process has permission to send signals, except for process 1 (init), but see below.
  If pid is less than -1, then sig is sent to every process in the process group whose ID is -pid.

It's not obvious how to fix this. Maybe we need another file (not state.json; something like cleanup.json) with the cgroup path, and runc delete should check that when deleting too. Not sure if there is a better way to handle this.

@rata (Member) left a comment

The -1 as PID has a meaning and can be problematic, as @thaJeztah pointed out.

I think we need to rework this; the only reasonable way out I see now is using a cleanup.json or so that runc delete honors. Does anyone have another idea?

@thaJeztah (Member)

ouch. Not sure one of the scenarios is good:

Yeah, it's a tricky one; to be clear, I think we already have cases where either 0 or -1 could be returned, so users should check for positive values.

I think at the time I went looking through containerd and moby to see possible paths missing a check, but it's very easy to miss, and (e.g.) have something like:

if pid != 0 {
    // kill process
}

As I said, I don't think it's new; it's mostly that I recall at least one case where it, erm, led to some fun, but I don't think we ever found the actual code-path that could trigger it 😅 https://x.com/ibuildthecloud/status/1159143536597450752?s=12&t=Y92XZlAhVJRKCtPDml0Csg


@kolyshkin (Contributor)

I don't like the idea of writing JSON a few times during create.

Maybe we can just create some empty file (say .creating, in the same state dir as state.json?) and use its existence as a flag for runc state? Then remove the flag file once state.json is created. Something like this.


rata commented Apr 21, 2026

@kolyshkin the thing is, we need to clean up the cgroup (IIRC, @HirazawaUi correct me if I'm wrong). So we need to know the cgroup path or something like that to clean it up.

@HirazawaUi (Contributor, Author)

I think we need to rework this, the only reasonable way out I see now is using a cleanup.json or so that runc delete honors. Anyone has another idea?

I’m leaning towards this option. We could introduce a temporary file and remove it once state.json is created. This file would only be accessed during runc delete: if the temporary file exists but state.json is missing, we can clean up the cgroup based on the temporary file. If state.json is already present, we will continue to rely on it for the cleanup.


rata commented Apr 22, 2026

@kolyshkin I'm not a fan either. I wish there was a simpler way to handle this. Maybe with systemd (when it manages the cgroups) we can ask it to clean it up somehow? It won't work for all cases, but almost everyone is using the systemd driver, so it would cover a good chunk of users.

@HirazawaUi (Contributor, Author)

@kolyshkin I'm not a fan either. I wish there was a simpler way to handle this. Maybe with systemd (when it manages the cgroups) we can ask it to clean it up somehow? It won't work for all cases, but almost everyone is using the systemd driver, so it would cover a good chunk of users.

However, how would systemd distinguish whether a cgroup was frozen intentionally by the user or unintentionally? Determining the intent behind the freeze could be quite a challenging issue to address.


rata commented Apr 23, 2026

@HirazawaUi I don't know; my hope is that maybe we can instruct systemd to clean up "orphan" cgroups or so. I doubt it is possible, but as systemd is always running, there might be something it can do for us.

@kolyshkin (Contributor)

@rata that won't work for at least non-systemd users (we still have fs drivers and support those).

I need to look deeper to provide a good answer, but off the top of my head I think we can write just the cgroupPath to the .creating file (or make it a JSON but with fewer fields, not the complete state.json) and use it.

@HirazawaUi (Contributor, Author)

I have carefully reviewed this issue again.

For runc delete, the current internal path is:
runc delete -> getContainer() -> libcontainer.Load() -> refreshState() -> hasInit() -> Stopped -> Destroy(), so it does not end up signaling PID -1 in this flow.

The real concern is the sentinel value leaking via state.json to external consumers (or any future code path that might use the raw PID unsafely).

If we omit init_process_pid (using omitempty) during state serialization when it is unknown, rather than explicitly writing -1, this keeps delete behavior unchanged while removing the -1 leak. Consumers should treat a missing or non-positive PID as "not available" and only signal when pid > 0.

I have submitted the code and verified that it passes all existing tests, which should ensure no regressions in current functionality. Could you please let me know what you think of this approach? Thanks!

@rata @kolyshkin @thaJeztah


rata commented Apr 27, 2026

@HirazawaUi if we omit it, then we need to validate what users do. It is entirely possible that it takes the zero value in Go, which is 0, and that also has a special meaning. (I think @thaJeztah pointed to this too? Or my answer with the manpage quote?)


HirazawaUi commented Apr 27, 2026

@rata I have previously considered this issue. Just as we avoid signaling PID -1, we should also avoid signaling PID 0, as the "blast radius" is equally significant. If an external system or a runc code path were to read a PID of 0 and attempt to kill it, the signal would be broadcast to all processes in its process group. I believe this would similarly lead to system instability or unavailability.


rata commented Apr 27, 2026

Exactly. My point is: skipping this field may lead to that behavior. We at least need to check what the main users do to see if that will happen. But, at the same time, I'm unsure if just using another file is cleaner and avoids that completely.

@HirazawaUi (Contributor, Author)

I believe both omitting init_process_pid (using omitempty) during state serialization when unknown and using a separate file to record cgroupPath are viable approaches.

Regarding the former, it has not only been verified in this PR, but I also manually ran the test cases from the github.com/containers/common project, and it performed well there as well.

While this may not fully guarantee that all major runc users will be unaffected by the change, I think it could be a valid option to consider if further investigation yields no other solutions.

