fix: cap cluster worker count via IFRAMELY_WORKERS_COUNT (OOMKilled on large nodes) #4
Merged
graceful-cluster defaults to os.cpus().length workers, which is the HOST node's vCPU count and ignores the container's CPU limit. On large nodes (e.g. 32 vCPU) this forks ~32 workers, each independently loading ~1886 domains + Redis + Secrets Manager, exhausting the pod memory limit -> OOMKilled/CrashLoopBackOff.

Pass workersCount to GracefulCluster.start from CONFIG.CLUSTER_WORKERS_COUNT, settable via IFRAMELY_WORKERS_COUNT (alias IFRAMELY_WORKERS). When unset, behaviour is unchanged (falls back to os.cpus().length).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
bdd9b1b to
2193519
Compare
Problem (customer incident)
iframely's cluster forks os.cpus().length workers (the graceful-cluster default). That is the host node's vCPU count, and it ignores the container CPU limit. On a 32-vCPU node, a pod with a 1000m CPU limit still forks ~32 workers, each independently loading ~1886 domains, connecting to Redis, and fetching from AWS Secrets Manager. The combined startup memory blows past the pod memory limit → OOMKilled / CrashLoopBackOff within ~40s.

There was no supported way to cap the worker count: cluster.js never passed workersCount, so neither an env var nor config.local.js could limit it.

Fix
- cluster.js: pass workersCount: CONFIG.CLUSTER_WORKERS_COUNT to GracefulCluster.start.
- config.loader.js: set CLUSTER_WORKERS_COUNT from IFRAMELY_WORKERS_COUNT (alias IFRAMELY_WORKERS).
- When unset, behaviour is unchanged (graceful-cluster falls back to os.cpus().length).

This complements the existing IFRAMELY_WORKER_MAX_MEMORY_MB knob (added in #3): cap the worker count to the CPU/memory the container actually has.

Example
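A minimal sketch of how the env var could be resolved into a worker count. It is not the actual config.loader.js code. The helper name parseWorkersCount is hypothetical, and letting IFRAMELY_WORKERS_COUNT win over the IFRAMELY_WORKERS alias when both are set is an assumption. Invalid values (0, non-numeric) are ignored, so graceful-cluster keeps its os.cpus().length default.

```javascript
// Hypothetical helper: illustrates the parsing rules described in this PR,
// not the real implementation in config.loader.js.
function parseWorkersCount(env) {
  // Assumption: the primary name takes precedence over the alias.
  const raw = env.IFRAMELY_WORKERS_COUNT ?? env.IFRAMELY_WORKERS;
  if (raw === undefined) {
    return undefined; // unset: leave the graceful-cluster default in place
  }
  const n = Number(raw);
  // Ignore 0, negatives, fractions, and non-numeric strings.
  return Number.isInteger(n) && n > 0 ? n : undefined;
}

// Sketch of the wiring in cluster.js (serverFunction is a placeholder):
// GracefulCluster.start({ serverFunction, workersCount: parseWorkersCount(process.env) });
```

With this shape, a pod that sets IFRAMELY_WORKERS_COUNT to "2" would fork two workers regardless of how many vCPUs the host node reports.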
Testing
- node --check on both files.
- Value parsing: 0 → ignored, non-numeric → ignored; all pass.

Note
This is a follow-up to the merged #3 (released as v2.5.2). A customer hit this on 32-vCPU nodes. Recommend cutting a patch release once merged so it can flow through downstream image mirrors.
🤖 Generated with Claude Code