[Bug]: ProcessManager does not reap child processes, causing zombie accumulation on long-running Flink deployments

### What happened?

`ProcessManager.stopProcess()` calls `destroy()`/`destroyForcibly()` to terminate child processes but never calls `Process.waitFor()` to collect the exit status. On POSIX systems, this leaves the terminated child as a zombie (state `Z`, shown as `<defunct>`) in the kernel process table until the parent reaps it or exits.

In long-running environments such as Flink TaskManagers using `--environment_type=PROCESS`, expansion service processes (`/opt/apache/beam/java_boot`) are repeatedly spawned and stopped but never reaped. Over time this leads to significant zombie accumulation: we observed 176+ zombie Java processes on production Flink TaskManager pods.

Container-level init systems (e.g. dumb-init, tini) cannot help because the zombies are children of the still-running Java TaskManager process, and only the parent can reap its own children.
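
For illustration, the missing reaping step could look roughly like the sketch below. This is not the actual `ProcessManager` code: the class name `ReapingStopSketch`, the helper `stopAndReap`, and the grace-period parameter are hypothetical, and the `sleep` child in `main` only stands in for a real worker process. It uses only standard `java.lang.Process` calls.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

public class ReapingStopSketch {

    /**
     * Hypothetical helper: terminates a child process and then waits for it,
     * so that its exit status is collected instead of being left pending.
     * Returns the child's exit code.
     */
    static int stopAndReap(Process process, long gracePeriodMillis)
            throws InterruptedException {
        // Request termination (typically SIGTERM on POSIX JVMs).
        process.destroy();

        // waitFor(timeout, unit) returns true once the process has exited.
        if (!process.waitFor(gracePeriodMillis, TimeUnit.MILLISECONDS)) {
            // Still running after the grace period: escalate to a forced kill.
            process.destroyForcibly();
        }

        // Block until the process has exited and return its exit status.
        return process.waitFor();
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // Assumption: a POSIX system where the "sleep" command is available.
        Process child = new ProcessBuilder("sleep", "60").start();
        int exitCode = stopAndReap(child, 2000);
        System.out.println("alive=" + child.isAlive() + " exitCode=" + exitCode);
    }
}
```

The bounded wait before `destroyForcibly()` is one possible design choice: it gives the child a chance to shut down cleanly while still guaranteeing that the final `waitFor()` is reached.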
   
### How to reproduce

1. Run a Flink pipeline with `--environment_type=PROCESS`.
2. Let it process work for an extended period (hours/days).
3. Check for zombie processes: `ps aux | grep defunct`.

Zombies will accumulate over time as expansion service processes are started and stopped without being reaped.

### Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

### Issue Components

- [ ] Component: Python SDK
- [ ] Component: Java SDK
- [ ] Component: Go SDK
- [ ] Component: Typescript SDK
- [ ] Component: IO connector
- [ ] Component: Beam YAML
- [ ] Component: Beam examples
- [ ] Component: Beam playground
- [ ] Component: Beam katas
- [ ] Component: Website
- [ ] Component: Infrastructure
- [ ] Component: Spark Runner
- [x] Component: Flink Runner
- [ ] Component: Samza Runner
- [ ] Component: Twister2 Runner
- [ ] Component: Hazelcast Jet Runner
- [ ] Component: Google Cloud Dataflow Runner

[Bug]: ProcessManager does not reap child processes, causing zombie accumulation on long-running Flink deployments #37930
