fix: cap cluster worker count via IFRAMELY_WORKERS_COUNT (OOMKilled on large nodes) #4

Merged
pratapalakshmi merged 1 commit into main from fix/iframely-worker-count-oom
Jun 8, 2026

@pratapalakshmi

Problem (customer incident)

iframely's cluster forks os.cpus().length workers (graceful-cluster default). That's the host node's vCPU count and ignores the container CPU limit. On a 32-vCPU node, a pod with a 1000m CPU limit still forks ~32 workers, each independently loading ~1886 domains + connecting to Redis + fetching AWS Secrets Manager. Combined startup memory blows past the pod memory limit → OOMKilled / CrashLoopBackOff within ~40s.

There was no supported way to cap the worker count: cluster.js never passed workersCount, so neither an env var nor config.local.js could limit it.

Fix

  • cluster.js: pass workersCount: CONFIG.CLUSTER_WORKERS_COUNT to GracefulCluster.start.
  • config.loader.js: set CLUSTER_WORKERS_COUNT from IFRAMELY_WORKERS_COUNT (alias IFRAMELY_WORKERS).
  • When unset, behaviour is unchanged (graceful-cluster falls back to os.cpus().length).
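
The env resolution described above can be sketched roughly as follows. This is a minimal illustration, not the code from config.loader.js: the helper name resolveWorkersCount is hypothetical, and it assumes that an invalid canonical value is ignored outright rather than falling through to the alias, which the PR does not specify.

```javascript
// Hypothetical helper (not the actual config.loader.js code): resolve the
// worker cap from environment variables.
// Returns a positive integer, or undefined so that workersCount stays unset
// and graceful-cluster keeps its os.cpus().length default.
function resolveWorkersCount(env) {
  // The canonical name wins over the alias when both are set.
  const raw = env.IFRAMELY_WORKERS_COUNT ?? env.IFRAMELY_WORKERS;
  if (raw === undefined) return undefined;

  const n = Number(raw);
  // 0, negative, fractional and non-numeric values are all ignored.
  if (!Number.isInteger(n) || n <= 0) return undefined;
  return n;
}

// Usage: the result would be stored as CONFIG.CLUSTER_WORKERS_COUNT and
// passed to GracefulCluster.start as workersCount.
const workersCount = resolveWorkersCount(process.env);
console.log(
  workersCount === undefined
    ? 'no cap set; library default applies'
    : `capping at ${workersCount} workers`
);
```

Returning undefined rather than a fallback number keeps the "unset means unchanged behaviour" property in one place: the caller never has to know what the library default is.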

This complements the existing IFRAMELY_WORKER_MAX_MEMORY_MB knob (added in #3): cap the worker count to the CPU/memory the container actually has.

Example

IFRAMELY_WORKERS_COUNT=4          # 4 workers regardless of node size
IFRAMELY_WORKER_MAX_MEMORY_MB=400
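
One way to pick a value for IFRAMELY_WORKERS_COUNT is to take the smaller of what the CPU limit and the memory limit allow. The sketch below is only a sizing heuristic under stated assumptions: suggestWorkersCount and its parameters are hypothetical, the 2048 MB pod limit is an illustrative figure that does not come from the incident, and it treats IFRAMELY_WORKER_MAX_MEMORY_MB as a per-worker memory budget.

```javascript
// Hypothetical sizing helper; not part of iframely or graceful-cluster.
// Suggests a worker count that fits both the container CPU limit and the
// container memory limit.
function suggestWorkersCount({ cpuLimitMillicores, memoryLimitMb, perWorkerMb, reservedMb = 0 }) {
  // One worker per whole CPU in the limit, but never fewer than one.
  const byCpu = Math.max(1, Math.floor(cpuLimitMillicores / 1000));
  // How many per-worker budgets fit after setting aside reservedMb
  // (e.g. for the cluster primary process).
  const byMemory = Math.max(1, Math.floor((memoryLimitMb - reservedMb) / perWorkerMb));
  return Math.min(byCpu, byMemory);
}

// Illustrative values only: a 4000m CPU limit and an assumed 2048 MB pod
// memory limit, with 400 MB budgeted per worker.
const suggested = suggestWorkersCount({
  cpuLimitMillicores: 4000,
  memoryLimitMb: 2048,
  perWorkerMb: 400,
});
console.log(`IFRAMELY_WORKERS_COUNT=${suggested}`);
```

With a 1000m CPU limit, as in the incident, the same heuristic suggests a single worker regardless of how many vCPUs the node has.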

Testing

  • node --check on both files.
  • Unit-tested the env resolution: canonical var, alias, canonical-wins-over-alias, unset→fallback, 0→ignored, non-numeric→ignored — all pass.

Note

This is a follow-up to the merged #3 (released as v2.5.2). A customer hit this on 32-vCPU nodes. Recommend cutting a patch release once merged so it can flow through downstream image mirrors.

🤖 Generated with Claude Code

graceful-cluster defaults to os.cpus().length workers, which is the HOST
node's vCPU count and ignores the container's CPU limit. On large nodes
(e.g. 32 vCPU) this forks ~32 workers, each independently loading ~1886
domains + Redis + Secrets Manager, exhausting the pod memory limit ->
OOMKilled/CrashLoopBackOff.

Pass workersCount to GracefulCluster.start from CONFIG.CLUSTER_WORKERS_COUNT,
settable via IFRAMELY_WORKERS_COUNT (alias IFRAMELY_WORKERS). When unset,
behaviour is unchanged (falls back to os.cpus().length).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
pratapalakshmi force-pushed the fix/iframely-worker-count-oom branch from bdd9b1b to 2193519 on June 8, 2026 04:56
pratapalakshmi merged commit 957d938 into main on June 8, 2026
pratapalakshmi deleted the fix/iframely-worker-count-oom branch on June 8, 2026 04:58