Write provisional state.json with PID -1 before cgroup Apply#5257
Write provisional state.json with PID -1 before cgroup Apply#5257HirazawaUi wants to merge 2 commits intoopencontainers:mainfrom
Conversation
Signed-off-by: HirazawaUi <695097494plus@gmail.com>
45a880f to
5ca26e6
Compare
|
/assign @rata @kolyshkin This PR aims to once again fix the issue in #4757. |
rata
left a comment
There was a problem hiding this comment.
@HirazawaUi thanks for tackling this again. The approach seems reasonable to me :)
| _, uerr := p.container.updateState(nil) | ||
| p.container.initProcess = savedInit | ||
| if uerr != nil { | ||
| return fmt.Errorf("unable to store init state: %w", uerr) |
There was a problem hiding this comment.
nit: we are using the same string above, let's use a different message here so it's not ambigous later if we hit this error.
| savedInit := p.container.initProcess | ||
| p.container.initProcess = nil | ||
| _, uerr := p.container.updateState(nil) | ||
| p.container.initProcess = savedInit |
|
I haven't looked in-depth, but saw the PR title, and recall some cases where |
|
@thaJeztah ouch. Not sure one of the scenarios is good:
It's not obvious how to fix this. Maybe we need another file (not state.json, something like cleanup.json) with the cgroup path and runc delete should check that when deleting too. Not sure if there is a better way to handle this. |
rata
left a comment
There was a problem hiding this comment.
The -1 as PID has a meaning and can be problematic, as @thaJeztah pointed out.
I think we need to rework this, the only reasonable way out I see now is using a cleanup.json or so that runc delete honors. Anyone has another idea?
Yeah, it's a tricky one; to be clear, I think we already have cases where either I think at the time I went looking through containerd and moby to see possible paths missing a check, but it's very easy to miss, and (e.g.) have something like; if pid != 0 {
// kill process
}As I said; I don't think it's new mostly that I recall at least one case where it was, erm, lead to some fun, but I don't think we ever found the actual code-path that could trigger it 😅 https://x.com/ibuildthecloud/status/1159143536597450752?s=12&t=Y92XZlAhVJRKCtPDml0Csg
|
|
I don't like the idea of writing JSON a few times during create. Maybe we can just create some empty file (say |
|
@kolyshkin the thing is, we need to cleanup the cgroup (IIRC, @HirazawaUi correct me if I'm wrong). So we need to know the cgroup path or something like that to clean it up. |
I’m leaning towards this option. We could introduce a temporary file and remove it once |
|
@kolyshkin I'm not a fan either. I wish there was a simpler way to handle this. Maybe with systemd (when it manages the cgroups) we can ask it to clean it up somehow? It won't work for all cases, but almost everyone is using systemd driver, it is a good chunk of users |
However, how does systemd distinguish between whether a cgroup was frozen intentionally by the user or unintentionally? It seems that determining the intent behind the freeze could be quite a challenging issue to address. |
|
@HirazawaUi I don't know, my hope is that maybe we can instruct systemd to cleanup "orphan" cgroups or so. I doubt it is possible, but as systemd is always running, there might be something it can do for us. |
|
@rata that won't work for at least non-systemd users (we still have fs drivers and support those). I need to look deeper to provide a good answer, but from the top of my head I think we can write just the cgroupPath to the |
|
I have carefully reviewed this issue again. For The real concern is the sentinel value leaking via If we omit This keeps delete behavior unchanged while removing the I have submitted the code and verified that it passes all existing tests, which should ensure no regressions in current functionality. Could you please let me know what you think of this approach? Thanks! |
|
@HirazawaUi if we omit it, then we need to validate what users do. It is completely possible that it takes the zero-value in golang, that is |
|
@rata I have previously considered this issue. Just as we avoid signaling PID -1, we should also avoid signaling PID 0, as the "blast radius" is equally significant. If an external system or a |
|
Exactly. My point is: skipping this field my lead to that behavior. We at least need to check the users to see if that will happen with main users. But, at the same time, I'm unsure if just using another file is cleaner and avoids that completely. |
|
I believe both omitting Regarding the former, it has not only been verified in this PR, but I also manually ran the test cases from the While this may not fully guarantee that all major |

Write a provisional
state.jsonwith PID -1 (sentinel) before cgroup Apply ininitProcess.start(), and setinitProcessto nil to achieve this.If we do not set
initProcessto nil,state.jsonwill contain the real PID of the STAGE_PARENT process. This PID belongs to the first-stagerunc initprocess, which will be reaped (killed) inwaitForChildExit()and replaced by the final STAGE_INIT PID.The problem is: if an external tool (such as conmon) invokes
runc statewithin this time window, it will:state.jsonexec.fifoexists (yes, it is created at the beginning ofstart())"created"The external tool then calls
runc start, andrunc startbegins monitoring the health status of this STAGE_PARENT PID. However, almost simultaneously,waitForChildExit()insiderunc createwill reap this PID, causingrunc startto detect that the "process is dead" and report an error.By setting
initProcessto nil, the PID written tostate.jsonis -1 (a nonexistent process). When an external tool callsrunc state, it will:/proc/-1/stat→ does not existhasInit()returns false → the container state is determined to be"stopped""stopped"and continues waiting (it will not callrunc start)This continues until
runc createreaches theprocReadystage, at which point the actual STAGE_INIT PID has stabilized, andupdateState(p)overwritesstate.jsonwith the correct PID, formally transitioning the container to"created".