stanza_service: expose tokenize latency as a Prometheus histogram #663

Merged: mircealungu merged 1 commit into master from stanza-latency-histogram on Jul 3, 2026
@mircealungu

Member

Why

We can currently only see the >5s tail of tokenization latency (the STANZA-SLOW log lines) — no p50/p90/p99. That left us tuning healthcheck/client timeouts against one point of the distribution instead of the real shape.

What

log_request already computes elapsed for every request but discards everything under 5s. This PR records those values into cumulative Prometheus histogram buckets and exposes them on /metrics (which Prometheus already scrapes):

stanza_tokenize_duration_seconds_bucket{worker="<pid>",le="0.1"} ...
stanza_tokenize_duration_seconds_bucket{worker="<pid>",le="+Inf"} <count>
stanza_tokenize_duration_seconds_sum{worker="<pid>"} <sum>
stanza_tokenize_duration_seconds_count{worker="<pid>"} <count>

Buckets: 0.1, 0.25, 0.5, 1, 2, 5, 10, 15, 30, 60s — fine-grained under 1s (where ~99% of requests land), tail coverage to 60s.
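
For readers unfamiliar with cumulative buckets, here is a minimal sketch of the recording and rendering steps. It is not the code from this PR: the class name `LatencyHistogram`, its methods, and the dedicated lock are hypothetical (the PR reuses the lock log_request already holds). Only the metric name, the `worker` label, and the bucket bounds come from the description above.

```python
import os
import threading

# Bucket upper bounds in seconds, as listed in the PR description.
BUCKETS = (0.1, 0.25, 0.5, 1, 2, 5, 10, 15, 30, 60)


class LatencyHistogram:
    """Hypothetical in-memory cumulative histogram (illustration only)."""

    def __init__(self, buckets=BUCKETS):
        self.buckets = tuple(buckets)
        # Cumulative: counts[i] is the number of observations <= buckets[i].
        self.counts = [0] * len(self.buckets)
        self.total = 0
        self.sum = 0.0
        # Assumption: a dedicated lock; the PR piggybacks on an existing one.
        self.lock = threading.Lock()

    def observe(self, elapsed):
        """Record one request duration in seconds."""
        with self.lock:
            for i, bound in enumerate(self.buckets):
                if elapsed <= bound:
                    self.counts[i] += 1
            self.total += 1
            self.sum += elapsed

    def render(self, name="stanza_tokenize_duration_seconds", worker=None):
        """Return the series in Prometheus text exposition format."""
        worker = os.getpid() if worker is None else worker
        with self.lock:
            counts, total, total_sum = list(self.counts), self.total, self.sum
        lines = [f"# TYPE {name} histogram"]
        for bound, count in zip(self.buckets, counts):
            lines.append(f'{name}_bucket{{worker="{worker}",le="{bound}"}} {count}')
        # The +Inf bucket always equals the total observation count.
        lines.append(f'{name}_bucket{{worker="{worker}",le="+Inf"}} {total}')
        lines.append(f'{name}_sum{{worker="{worker}"}} {total_sum}')
        lines.append(f'{name}_count{{worker="{worker}"}} {total}')
        return "\n".join(lines) + "\n"
```

Because every bucket whose bound is at or above the observed value is incremented, each `le` series only ever grows within a worker, which is what `rate()` expects.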

Overhead

~nil: a few integer increments under the lock log_request already holds. No per-request I/O, no DB, no new dependency. Metrics-only — no behaviour change.

Per-worker note

preload_app is off, so each gunicorn worker keeps its own in-memory counters; a bare series would sawtooth as scrapes hit different workers. Series are labelled by worker=<pid> to stay monotonic — aggregate in PromQL:

histogram_quantile(0.9, sum by (le) (rate(stanza_tokenize_duration_seconds_bucket[5m])))
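
To make the aggregation concrete, the sketch below mimics what `sum by (le)` followed by a quantile estimate does: per-worker cumulative counts are summed per bucket bound, then the quantile is linearly interpolated inside the bucket that contains the target rank. It is a simplified illustration rather than Prometheus's exact algorithm, it works on raw counts instead of `rate()` output, and the function names and sample numbers are hypothetical.

```python
import math
from collections import defaultdict


def sum_by_le(per_worker):
    """Sum cumulative counts across workers: {worker: {le: count}} -> {le: count}."""
    totals = defaultdict(float)
    for buckets in per_worker.values():
        for le, count in buckets.items():
            totals[le] += count
    return dict(totals)


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative buckets {upper_bound: count}.

    Interpolates linearly within the bucket holding the target rank; if the
    rank falls in the +Inf bucket, the highest finite bound is returned.
    """
    bounds = sorted(buckets)  # math.inf sorts last
    total = buckets[bounds[-1]]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound in bounds:
        count = buckets[bound]
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Made-up counts for two workers, keyed by pid.
per_worker = {
    "101": {0.1: 5, 0.25: 8, math.inf: 10},
    "102": {0.1: 5, 0.25: 10, math.inf: 10},
}
merged = sum_by_le(per_worker)
p90 = estimate_quantile(0.9, merged)
```

Summing before estimating matters: quantiles computed per worker cannot be averaged into a meaningful fleet-wide quantile, whereas bucket counts add up exactly.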

Follow-ups this unlocks

  • Real p50/p90/p99 to tune the stanza healthcheck timeout (currently set from the >5s tail alone).
  • Characterize stanza_crawl's throttled latency distribution (it's CPU-limited via cpu_shares), to inform a progress-based liveness check.

Safe to merge/deploy ahead of the healthcheck PRs.

🤖 Generated with Claude Code

We could only see the >5s tail (STANZA-SLOW logs) — no p50/p90/p99. But
log_request already computes `elapsed` for every request and discards the
sub-5s values. This records them into cumulative Prometheus histogram
buckets (0.1s..60s) and emits stanza_tokenize_duration_seconds_{bucket,sum,
count} on /metrics, which Prometheus already scrapes.

Overhead is ~nil: a handful of integer increments under the lock
log_request already holds; no per-request I/O, no DB, no new dependency.

Per-worker note: with preload_app off each gunicorn worker keeps its own
counters, so the series is labelled by pid (worker=) to stay monotonic;
aggregate in PromQL with
  histogram_quantile(0.9, sum by (le) (rate(stanza_tokenize_duration_seconds_bucket[5m])))

Metrics-only; no behaviour change. Safe to deploy ahead of the healthcheck
work so we can tune timeouts against real percentiles.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jul 3, 2026


ArchLens - No architecturally relevant changes to the existing views

@mircealungu mircealungu merged commit 47bd29e into master Jul 3, 2026
1 of 3 checks passed