stanza_service: expose tokenize latency as a Prometheus histogram by mircealungu · Pull Request #663 · zeeguu/api

mircealungu · 2026-07-03T12:50:09Z

Why

We can currently only see the >5s tail of tokenization latency (the STANZA-SLOW log lines) — no p50/p90/p99. That left us tuning healthcheck/client timeouts against one point of the distribution instead of the real shape.

What

log_request already computes elapsed for every request but only reports it when it exceeds 5s, so everything under 5s is thrown away. This PR records every elapsed value into cumulative Prometheus histogram buckets and exposes them on /metrics (which Prometheus already scrapes):

stanza_tokenize_duration_seconds_bucket{worker="<pid>",le="0.1"} ...
stanza_tokenize_duration_seconds_bucket{worker="<pid>",le="+Inf"} <count>
stanza_tokenize_duration_seconds_sum{worker="<pid>"} <sum>
stanza_tokenize_duration_seconds_count{worker="<pid>"} <count>

Buckets: 0.1, 0.25, 0.5, 1, 2, 5, 10, 15, 30, 60s — fine-grained under 1s (where ~99% of requests land), tail coverage to 60s.
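
For readers who have not hand-rolled a Prometheus histogram before, here is a minimal sketch of the recording and rendering described above. It is not the code from this PR: the names `observe` and `render_metrics` and the module-level state are hypothetical, and `_lock` merely stands in for the lock log_request already holds. Only the metric name, the `worker` label and the bucket bounds are taken from the description.

```python
import os
import threading

# Upper bounds in seconds, as listed in the PR description.
BUCKETS = (0.1, 0.25, 0.5, 1, 2, 5, 10, 15, 30, 60)

_lock = threading.Lock()             # assumption: stand-in for log_request's existing lock
_bucket_counts = [0] * len(BUCKETS)  # cumulative: slot i counts observations <= BUCKETS[i]
_sum = 0.0
_count = 0


def observe(elapsed):
    """Record one request duration in seconds (hypothetical helper)."""
    global _sum, _count
    with _lock:
        # Cumulative buckets: bump every bucket whose upper bound covers the value.
        for i, upper in enumerate(BUCKETS):
            if elapsed <= upper:
                _bucket_counts[i] += 1
        _sum += elapsed
        _count += 1


def render_metrics(worker=None):
    """Return the histogram in Prometheus text exposition format (hypothetical helper)."""
    worker = os.getpid() if worker is None else worker
    name = "stanza_tokenize_duration_seconds"
    with _lock:
        counts, total, count = list(_bucket_counts), _sum, _count
    lines = [f"# TYPE {name} histogram"]
    for upper, c in zip(BUCKETS, counts):
        lines.append(f'{name}_bucket{{worker="{worker}",le="{upper}"}} {c}')
    # The +Inf bucket always equals the total observation count.
    lines.append(f'{name}_bucket{{worker="{worker}",le="+Inf"}} {count}')
    lines.append(f'{name}_sum{{worker="{worker}"}} {total}')
    lines.append(f'{name}_count{{worker="{worker}"}} {count}')
    return "\n".join(lines) + "\n"
```

Because the buckets are cumulative, a 7s request increments the 10, 15, 30 and 60 buckets, while a request slower than 60s shows up only in `+Inf`, `_sum` and `_count`.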

Overhead

~nil: a few integer increments under the lock log_request already holds. No per-request I/O, no DB, no new dependency. Metrics-only — no behaviour change.

Per-worker note

preload_app is off, so each gunicorn worker keeps its own in-memory counters; a bare series would sawtooth as scrapes hit different workers. Series are labelled by worker=<pid> to stay monotonic — aggregate in PromQL:

histogram_quantile(0.9, sum by (le) (rate(stanza_tokenize_duration_seconds_bucket[5m])))
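
To make the aggregation concrete, the sketch below mimics in plain Python what that query does conceptually: add up the per-worker bucket values for each `le`, then interpolate linearly inside the bucket that contains the requested rank. It approximates Prometheus' classic-histogram behaviour and is not its implementation; the function names and the input shape (a dict of worker to `{le: value}`) are assumptions made for illustration, and the values stand in for the per-bucket rates the query computes.

```python
from collections import defaultdict


def sum_by_le(per_worker):
    """Add up cumulative bucket values across workers, keyed by upper bound."""
    totals = defaultdict(float)
    for buckets in per_worker.values():
        for le, value in buckets.items():
            totals[le] += value
    return sorted(totals.items())  # float("inf") sorts last


def estimate_quantile(q, buckets):
    """Estimate quantile q from sorted (le, cumulative value) pairs ending in +Inf."""
    total = buckets[-1][1]  # the +Inf bucket holds the overall total
    if total == 0:
        return float("nan")
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if le == float("inf"):
                return prev_le  # cannot interpolate into an unbounded bucket
            if count == prev_count:
                return le  # empty bucket at rank 0; avoid dividing by zero
            # Assume observations are spread evenly within the bucket.
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return prev_le
```

Calling `estimate_quantile(0.9, sum_by_le(per_worker))` on two workers' buckets gives one p90 for the whole service. Summing before interpolating is what makes the per-worker labels harmless: each worker's series stays monotonic on its own, and the sum restores the service-wide distribution.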

Follow-ups this unlocks

Real p50/p90/p99 to tune the stanza healthcheck timeout (currently set from the >5s tail alone).
Characterize stanza_crawl's throttled latency distribution (it's CPU-limited via cpu_shares), to inform a progress-based liveness check.

Safe to merge/deploy ahead of the healthcheck PRs.

🤖 Generated with Claude Code

We could only see the >5s tail (STANZA-SLOW logs), with no p50/p90/p99. But log_request already computes `elapsed` for every request and discards the sub-5s values. This records them into cumulative Prometheus histogram buckets (0.1s..60s) and emits stanza_tokenize_duration_seconds_{bucket,sum,count} on /metrics, which Prometheus already scrapes.

Overhead is ~nil: a handful of integer increments under the lock log_request already holds; no per-request I/O, no DB, no new dependency.

Per-worker note: with preload_app off each gunicorn worker keeps its own counters, so the series is labelled by pid (worker=<pid>) to stay monotonic; aggregate in PromQL with histogram_quantile(0.9, sum by (le) (rate(stanza_tokenize_duration_seconds_bucket[5m])))

Metrics-only; no behaviour change. Safe to deploy ahead of the healthcheck work so we can tune timeouts against real percentiles.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-07-03T12:50:47Z

ArchLens - No architecturally relevant changes to the existing views

mircealungu merged commit 47bd29e into master Jul 3, 2026
1 of 3 checks passed

mircealungu mentioned this pull request Jul 3, 2026

Revisit preload_app for stanza via the ADR-017 OMP fix (gated on latency histogram) #664

Open

mircealungu merged 1 commit into master from stanza-latency-histogram
