stanza_service: loosen healthcheck so busy workers aren't killed by mircealungu · Pull Request #662 · zeeguu/api

mircealungu · 2026-07-03T10:29:01Z

Change (`stanza_service/Dockerfile`)

HEALTHCHECK timeout 5s → 15s, chosen from measured latency rather than guesswork: across ~41,340 real tokenizations, 0.11% took longer than 5s (so 5s sits at roughly the p99.9), and the legitimate slow tail reaches ~12s for large texts. The old 5s ceiling therefore tripped on genuine big-text work. 15s clears the observed maximum with margin; interval/retries stay at 30s/3, so a real outage is still caught within ~90s (~60–90s, depending on where the outage falls relative to the probe schedule).
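A minimal sketch of how a timeout like this can be derived from latency samples: measure what share of calls exceed the current threshold, then place the new timeout above the slowest legitimate call. The helper names, the `margin` value, and the sample list are all assumptions made for illustration; they are not the PR's measurements or tooling.

```python
from typing import Sequence


def share_above(latencies_s: Sequence[float], threshold_s: float) -> float:
    """Fraction of samples strictly slower than threshold_s."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    slow = sum(1 for value in latencies_s if value > threshold_s)
    return slow / len(latencies_s)


def timeout_with_margin(latencies_s: Sequence[float], margin: float = 1.25) -> float:
    """Slowest observed call scaled by a safety margin (margin is an assumed knob)."""
    return max(latencies_s) * margin


# Made-up samples for illustration only: 996 fast calls plus a slow tail.
samples = [0.2] * 996 + [5.5, 7.0, 9.0, 12.0]

print(share_above(samples, 5.0))     # share of calls a 5s timeout would fail
print(timeout_with_margin(samples))  # candidate timeout above the observed max
```

With real data, the same two numbers answer the questions this PR asks: how often does the old timeout fire on healthy work, and how high must the new one be to clear the tail.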

Why it matters

In the deployed stack, this false positive let autoheal restart a healthy-but-busy stanza container (~15–30s of downtime plus a model reload), stalling the API and surfacing as transient ~15s page loads.

Note

The deployed docker-compose.yml defines its own stanza healthcheck, which overrides this one at runtime; that definition is updated in the companion ops PR (which also adds GUNICORN_WORKERS: "2", mem_limit: 16g, and crawl cpu_shares). The Dockerfile change here keeps standalone `docker run` builds consistent.

🤖 Generated with Claude Code

Tokenization is CPU-bound and a long text can keep the (single) gunicorn worker busy, delaying /health. The aggressive 5s timeout / 3 retries let orchestrators (autoheal in the deployed stack) restart a healthy-but-busy container under load, causing ~15-30s of downtime + model reload. Loosen to timeout 30s / retries 5. Note: the deployed compose file defines its own stanza healthcheck which overrides this; that is updated in the ops repo. This keeps standalone `docker run` builds consistent. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
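The failure mode described in this commit message can be modelled in a few lines: with a single synchronous worker, a `/health` request waits behind whatever tokenization is already running, so the probe's response time is roughly the remaining work plus the handler's own cost. This is a simplified sketch; `probe_outcome` and the `health_handler_s` default are hypothetical and not taken from the service.

```python
def probe_outcome(busy_remaining_s: float, timeout_s: float,
                  health_handler_s: float = 0.05) -> str:
    """Classify one health probe against a single busy worker.

    busy_remaining_s: time the in-flight tokenization still needs.
    health_handler_s: assumed cost of the /health handler itself.
    """
    response_time_s = busy_remaining_s + health_handler_s
    return "healthy" if response_time_s <= timeout_s else "timeout"


# A large text that still needs ~12s when the probe arrives:
print(probe_outcome(12.0, timeout_s=5.0))   # old 5s timeout
print(probe_outcome(12.0, timeout_s=15.0))  # 15s timeout
```

Under this model the worker is never actually unhealthy; only the probe's patience changes the verdict.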

github-actions · 2026-07-03T10:29:44Z

ArchLens - No architecturally relevant changes to the existing views

>5s is the p99.9 of real tokenizations and the legitimate tail reaches ~12s for large texts, so 5s tripped on genuine work. 15s clears it; keep 30s/3 for ~90s outage detection. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

mircealungu and others added 3 commits July 3, 2026 12:58

stanza healthcheck: timeout 10s / retries 3, not 30s / 5

23c5982

Keep ~90s detection of a genuine outage (30s interval x 3 retries) instead of ballooning to ~2.5min; only widen the per-probe timeout 5s -> 10s. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

comment: detection is ~60-90s, not a flat ~90s

7fae296

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
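The ~60–90s range in the commit title above follows from simple arithmetic. The sketch below assumes Docker's documented scheduling (each probe starts `interval` seconds after the previous one completes) and that a container is flagged after `retries` consecutive failures; `detection_window_s` and `probe_fail_s` are hypothetical names introduced here.

```python
from typing import Tuple


def detection_window_s(interval_s: float, retries: int,
                       probe_fail_s: float = 0.0) -> Tuple[float, float]:
    """(best, worst) seconds from outage start to the last required failure.

    probe_fail_s: how long each failing probe takes. Roughly 0 when the
    connection is refused, up to the configured timeout when the service hangs.
    """
    failing_time = retries * probe_fail_s
    best = (retries - 1) * interval_s + failing_time  # outage just before a probe
    worst = retries * interval_s + failing_time       # outage just after a probe
    return best, worst


print(detection_window_s(30, 3))                     # fast-failing probes
print(detection_window_s(30, 3, probe_fail_s=15.0))  # every probe hangs to timeout
print(detection_window_s(30, 5))                     # the earlier retries-5 variant
```

For fast-failing probes, 30s/3 gives a 60 to 90 second window, while retries 5 stretches the worst case to 150 seconds, the "~2.5min" mentioned in the earlier commit. If probes hang until the 15s timeout, the window grows by the time spent waiting on each probe.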

mircealungu mentioned this pull request Jul 3, 2026

Revisit preload_app for stanza via the ADR-017 OMP fix (gated on latency histogram) #664

Open

mircealungu merged commit 371e4fa into master Jul 3, 2026
1 of 3 checks passed

mircealungu merged 4 commits into master from fix-stanza-healthcheck-timeout
