Description
What happened?
ProcessManager.stopProcess() calls destroy()/destroyForcibly() to terminate child processes but never calls Process.waitFor() to collect the exit status. On POSIX systems, this leaves the terminated child as a zombie (state Z/defunct) in the kernel process table until the parent process exits.
In long-running environments such as Flink TaskManagers running with --environment_type=PROCESS, expansion service processes (/opt/apache/beam/java_boot) are repeatedly spawned and stopped but never reaped. Over time this leads to significant zombie accumulation: we observed 176+ zombie Java processes on production Flink TaskManager pods.
Container-level init systems (e.g. dumb-init, tini) cannot help because the zombies are children of the still-running Java TaskManager process, and only a parent can reap its own children.
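For illustration, the missing reaping step could look roughly like the sketch below. This is not Beam's actual ProcessManager code: the class name ProcessReaper, the method stopAndReap, and the grace-period parameter are hypothetical, and the signal behaviour noted in the comments is what common POSIX JVMs do rather than something the Process API guarantees.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper for illustration; not part of Beam. */
final class ProcessReaper {
  private ProcessReaper() {}

  /**
   * Asks the child to terminate, escalates to a forcible kill if it does not
   * exit within the grace period, and always waits for it so that its exit
   * status is collected.
   *
   * @return the child's exit code
   */
  static int stopAndReap(Process process, long gracePeriodMillis)
      throws InterruptedException {
    // Request termination (typically SIGTERM on POSIX JVMs).
    process.destroy();

    // Wait up to the grace period for the child to exit.
    if (!process.waitFor(gracePeriodMillis, TimeUnit.MILLISECONDS)) {
      // Still running: escalate (typically SIGKILL), then block until it is gone.
      process.destroyForcibly();
      process.waitFor();
    }

    // The process has terminated, so exitValue() will not throw here.
    return process.exitValue();
  }
}
```

A caller would replace a bare destroy() with something like ProcessReaper.stopAndReap(process, 2000). The point of the design is that every termination path ends in a waitFor() call.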
How to reproduce
- Run a Flink pipeline with --environment_type=PROCESS
- Let it process work for an extended period (hours/days)
- Check for zombie processes: ps aux | grep defunct
Zombies will accumulate over time as expansion service processes are started and stopped without being reaped.
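To track the accumulation per parent rather than grepping ps output, the zombie check can be sketched in Java as below. This is a Linux-only sketch that assumes the /proc/[pid]/stat layout documented in proc(5) (pid, command name in parentheses, state, parent pid, ...). The class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical Linux-only diagnostic; not part of Beam. */
final class ZombieCounter {
  private ZombieCounter() {}

  /** Returns true if a /proc/[pid]/stat line describes a zombie child of parentPid. */
  static boolean isZombieChildOf(String statLine, long parentPid) {
    // The command name may contain spaces or parentheses, so parse the
    // fields that follow the LAST closing parenthesis.
    int close = statLine.lastIndexOf(')');
    if (close < 0 || close + 2 >= statLine.length()) {
      return false;
    }
    String[] fields = statLine.substring(close + 2).trim().split("\\s+");
    if (fields.length < 2) {
      return false;
    }
    // fields[0] is the state character, fields[1] is the parent pid.
    return "Z".equals(fields[0]) && Long.parseLong(fields[1]) == parentPid;
  }

  /** Counts zombie children of parentPid by scanning /proc. */
  static int countZombieChildren(long parentPid) throws IOException {
    int count = 0;
    try (DirectoryStream<Path> dirs =
        Files.newDirectoryStream(Paths.get("/proc"), "[0-9]*")) {
      for (Path dir : dirs) {
        try {
          byte[] raw = Files.readAllBytes(dir.resolve("stat"));
          String line = new String(raw, StandardCharsets.UTF_8);
          if (isZombieChildOf(line, parentPid)) {
            count++;
          }
        } catch (IOException e) {
          // The process exited between listing and reading; skip it.
        }
      }
    }
    return count;
  }
}
```

Calling ZombieCounter.countZombieChildren with the TaskManager's pid at intervals should show the count growing if children are not being reaped.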
Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
Issue Components
- Component: Python SDK
- Component: Java SDK
- Component: Go SDK
- Component: Typescript SDK
- Component: IO connector
- Component: Beam YAML
- Component: Beam examples
- Component: Beam playground
- Component: Beam katas
- Component: Website
- Component: Infrastructure
- Component: Spark Runner
- Component: Flink Runner
- Component: Samza Runner
- Component: Twister2 Runner
- Component: Hazelcast Jet Runner
- Component: Google Cloud Dataflow Runner