stanza_service: expose tokenize latency as a Prometheus histogram #663
Merged
We could only see the >5s tail (STANZA-SLOW logs) — no p50/p90/p99. But
log_request already computes `elapsed` for every request and discards the
sub-5s values. This records them into cumulative Prometheus histogram
buckets (0.1s..60s) and emits stanza_tokenize_duration_seconds_{bucket,sum,
count} on /metrics, which Prometheus already scrapes.
Overhead is ~nil: a handful of integer increments under the lock
log_request already holds; no per-request I/O, no DB, no new dependency.
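
To make the bookkeeping concrete, here is a minimal sketch of cumulative bucket counting and the text rendering for /metrics. It is not the code from this PR: the class name LatencyHistogram, its attributes, and the render() helper are hypothetical, and the lock is a stand-in for the one log_request already holds. Only the metric name, the bucket bounds, and the worker label come from the description.

```python
import os
import threading

# Bucket upper bounds in seconds, as listed in the PR description.
BUCKETS = (0.1, 0.25, 0.5, 1, 2, 5, 10, 15, 30, 60)
METRIC = "stanza_tokenize_duration_seconds"


class LatencyHistogram:
    """Hypothetical in-memory histogram; one instance per gunicorn worker."""

    def __init__(self, bounds=BUCKETS):
        self.bounds = tuple(bounds)
        self.bucket_counts = [0] * len(self.bounds)  # cumulative counts
        self.count = 0
        self.total = 0.0
        self.lock = threading.Lock()  # stand-in for log_request's lock

    def observe(self, elapsed):
        """Record one request duration in seconds."""
        with self.lock:
            self.count += 1
            self.total += elapsed
            # Cumulative: bump every bucket whose bound covers the value.
            for i, bound in enumerate(self.bounds):
                if elapsed <= bound:
                    self.bucket_counts[i] += 1

    def render(self, worker=None):
        """Return the series in Prometheus text exposition format."""
        worker = os.getpid() if worker is None else worker
        with self.lock:
            counts, count, total = list(self.bucket_counts), self.count, self.total
        labels = f'worker="{worker}"'
        lines = [f"# TYPE {METRIC} histogram"]
        for bound, c in zip(self.bounds, counts):
            lines.append(f'{METRIC}_bucket{{{labels},le="{bound}"}} {c}')
        # The +Inf bucket always equals the total observation count.
        lines.append(f'{METRIC}_bucket{{{labels},le="+Inf"}} {count}')
        lines.append(f"{METRIC}_sum{{{labels}}} {total}")
        lines.append(f"{METRIC}_count{{{labels}}} {count}")
        return "\n".join(lines) + "\n"
```

Requests slower than 60s only show up in the +Inf bucket and in _count, so the histogram bounds the tail rather than resolving it.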
Per-worker note: with preload_app off each gunicorn worker keeps its own
counters, so the series is labelled by pid (worker=) to stay monotonic;
aggregate in PromQL with
histogram_quantile(0.9, sum by (le) (rate(stanza_tokenize_duration_seconds_bucket[5m])))
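
For intuition about what that query computes, the sketch below sums cumulative buckets across workers (the sum by (le) step) and then interpolates linearly inside the bucket that contains the target rank, which approximates what histogram_quantile does. It works on raw counts rather than rates, and the function names merge_workers and estimate_quantile are hypothetical, so treat it as an illustration rather than a reimplementation of PromQL.

```python
import math


def merge_workers(per_worker):
    """Sum cumulative bucket counts across workers, keyed by upper bound."""
    merged = {}
    for buckets in per_worker:
        for le, count in buckets.items():
            merged[le] = merged.get(le, 0) + count
    return merged


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative buckets {upper_bound: count}.

    The dict must include math.inf as the catch-all bucket.
    """
    bounds = sorted(buckets)  # math.inf sorts last
    total = buckets[bounds[-1]]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0  # the first bucket is assumed to start at 0
    for bound in bounds:
        count = buckets[bound]
        if count >= rank:
            if math.isinf(bound):
                # Rank falls past the last finite bucket: report that bound.
                return prev_bound
            if count == prev_count:
                return bound  # only reachable when rank is 0
            # Linear interpolation within the bucket that holds the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound
```

Because the estimate interpolates between bucket bounds, its resolution is limited by the bucket layout: a p90 that lands between 0.5s and 1s can only be placed approximately within that range.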
Metrics-only; no behaviour change. Safe to deploy ahead of the healthcheck
work so we can tune timeouts against real percentiles.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ArchLens - No architecturally relevant changes to the existing views
Why
We can currently only see the >5s tail of tokenization latency (the STANZA-SLOW log lines), with no p50/p90/p99. That left us tuning healthcheck/client timeouts against one point of the distribution instead of the real shape.

What
log_request already computes elapsed for every request and throws away everything under 5s. This records those values into cumulative Prometheus histogram buckets and exposes them on /metrics (which Prometheus already scrapes) as stanza_tokenize_duration_seconds_{bucket,sum,count}.

Buckets: 0.1, 0.25, 0.5, 1, 2, 5, 10, 15, 30, 60s. Fine-grained under 1s (where ~99% of requests land), with tail coverage to 60s.

Overhead
~nil: a few integer increments under the lock log_request already holds. No per-request I/O, no DB, no new dependency. Metrics-only, no behaviour change.

Per-worker note
preload_app is off, so each gunicorn worker keeps its own in-memory counters; a bare series would sawtooth as scrapes hit different workers. Series are labelled by worker=<pid> to stay monotonic; aggregate in PromQL:
histogram_quantile(0.9, sum by (le) (rate(stanza_tokenize_duration_seconds_bucket[5m])))

Follow-ups this unlocks
stanza_crawl's throttled latency distribution (it's CPU-limited via cpu_shares), to inform a progress-based liveness check.

Safe to merge/deploy ahead of the healthcheck PRs.
🤖 Generated with Claude Code