[Bug]: ProcessManager does not reap child processes, causing zombie accumulation on long-running Flink deployments #37930

@andresti

Description

What happened?

ProcessManager.stopProcess() calls destroy()/destroyForcibly() to terminate child processes but never calls Process.waitFor() to collect the exit status. On POSIX systems, this leaves the terminated child as a zombie (state Z/defunct) in the kernel process table until the parent process exits.

In long-running environments like Flink TaskManagers using --environment_type=PROCESS, expansion service processes (/opt/apache/beam/java_boot) are repeatedly spawned and stopped but never reaped. Over time this leads to significant zombie accumulation: we observed 176+ zombie Java processes on production Flink TaskManager pods.

Container-level init systems (e.g. dumb-init, tini) cannot help here. They only reap orphaned processes that have been reparented to PID 1, whereas these zombies are still children of the running Java TaskManager process, and only a parent can reap its own children.
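
For illustration, here is a minimal sketch of what "terminate, then reap" could look like. It is not the actual ProcessManager code: the class name `ReapingStopper`, the method `stopAndReap`, and the grace-period parameter are hypothetical, and the `sleep` child in `main` assumes a POSIX system. The only point is that `destroy()`/`destroyForcibly()` is followed by `waitFor()` so the exit status is collected.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Beam: shows destroy() followed by waitFor().
final class ReapingStopper {

  /**
   * Asks the process to terminate, escalates to a forcible kill if it is still
   * alive after the grace period, and then waits for it to exit so that its
   * exit status is collected. Returns the exit code.
   */
  static int stopAndReap(Process process, long gracePeriodMillis)
      throws InterruptedException {
    process.destroy(); // graceful termination request
    if (!process.waitFor(gracePeriodMillis, TimeUnit.MILLISECONDS)) {
      process.destroyForcibly(); // escalate if the child ignored the request
    }
    // Blocks until the child has exited; returns immediately if it already has.
    return process.waitFor();
  }

  public static void main(String[] args) throws Exception {
    // Assumption: a POSIX `sleep` binary is on the PATH.
    Process child = new ProcessBuilder("sleep", "60").start();
    int exitCode = stopAndReap(child, 2000);
    System.out.println("alive=" + child.isAlive() + " exitCode=" + exitCode);
  }
}
```

A bounded wait before escalating keeps a stop call from hanging indefinitely on a child that ignores the termination request. The timeout value itself would need tuning for the real code path.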

How to reproduce

  1. Run a Flink pipeline with --environment_type=PROCESS
  2. Let it process work for an extended period (hours/days)
  3. Check for zombie processes: ps aux | grep defunct

Zombies will accumulate over time as expansion service processes are started and stopped without being reaped.
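
To track accumulation over time rather than checking by hand, the check in step 3 can also be scripted. The sketch below is an assumption-laden example, not part of Beam: `ZombieCounter` is a hypothetical name, it relies on the Linux /proc filesystem, and it counts every zombie visible in the current PID namespace, not only children of the TaskManager.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: counts processes in state Z by reading /proc/<pid>/stat (Linux only).
final class ZombieCounter {

  /** Extracts the one-character process state from the contents of a stat file. */
  static char parseState(String statLine) {
    // Layout is "pid (comm) state ..."; comm may itself contain spaces or ')',
    // so locate the state relative to the last ')'.
    int close = statLine.lastIndexOf(')');
    return statLine.charAt(close + 2);
  }

  /** Returns the number of zombie processes currently visible under /proc. */
  static int countZombies() throws IOException {
    int zombies = 0;
    try (DirectoryStream<Path> dirs =
        Files.newDirectoryStream(Paths.get("/proc"), "[0-9]*")) {
      for (Path dir : dirs) {
        try {
          String stat =
              new String(Files.readAllBytes(dir.resolve("stat")), StandardCharsets.UTF_8);
          if (parseState(stat) == 'Z') {
            zombies++;
          }
        } catch (IOException e) {
          // The process disappeared between listing and reading; skip it.
        }
      }
    }
    return zombies;
  }

  public static void main(String[] args) throws IOException {
    System.out.println("zombie processes: " + countZombies());
  }
}
```

Running this periodically inside a TaskManager pod should show the count growing if the issue is present.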

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Infrastructure
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
