feat(metrics): usage & system-scale observability#239
Open
appleboy wants to merge 7 commits into
Open
Conversation
Implements plan.md (workstreams B–E + docs) end to end.
- WS-B: rate_limit_exceeded_total{endpoint} recorded in the 429 handler
(Endpoint + Recorder threaded through middleware.RateLimitConfig).
- WS-C: oauth_consent_granted_total{flow} / oauth_consent_revoked_total{actor}
via AuthorizationService (recorder injected, nil->noop default).
- WS-D: authgate_users_active{auth_source}, authgate_users_registered,
authgate_oauth_clients_registered{status}, authgate_build_info{version,commit}
reconciled through the cached gauge-update job. New store aggregates
CountActiveUsersByAuthSource / CountAllUsers; MetricsStore + CacheWrapper extended.
- WS-E: RecordSessionExpired decrements sessions_active and increments
sessions_expired_total{reason} on idle-timeout and fingerprint-mismatch,
fixing the upward drift.
- Docs: METRICS.md drift fixed (removed phantom auth_external_api_duration_seconds
/ sessions_invalidated_total + dead INTEGRATION.md link), documented existing
cache_* metrics and all new series with Grafana queries + alerts; MONITORING.md
and CLAUDE.md summaries synced.
Open-question answers folded in: authgate_ prefix, auth_source label on active
users, and the build_info gauge.
Tests: metrics unit (new recorders + cache getters), store
(CountActiveUsersByAuthSource / CountAllUsers), middleware integration
(rate-limit 429 -> metric, idle-timeout + fingerprint-mismatch drift), consent
service paths, and bootstrap-level /metrics scrape e2e (happy-path series
exposed, disabled-silent 404, bearer-token auth). mock_metrics.go / mock_store.go
regenerated via make generate. make fmt/lint/test all green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Post-review cleanup (no behavior change): - Reuse: replace the duplicate buildInfoVersion() in bootstrap with the existing version-default logic, now exported as version.GetVersion(). - Altitude: derive the authgate_users_active label set from the canonical auth.Provider* constants instead of hardcoded "github"/"gitea"/... strings, so the metric labels can't silently drift from the provider registry. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
…a live session
Code-review follow-up. The sessions_active drift fix (WS-E) covered the
idle-timeout and fingerprint-mismatch session-clear paths but missed the third
path the plan itself named: loadUserFromSession clears a *live* authenticated
cookie session when the backing user was deleted or disabled, yet never
decremented the gauge — so disabling/deleting an active user left sessions_active
drifting upward (the same bug WS-E set out to close).
- middleware/auth.go: add clearLiveSession() which clears + records
sessions_expired_total{reason="account_inactive"} and decrements
sessions_active only on a successful save (mirrors the idle/fingerprint paths).
Thread an optional recorder through loadUserFromSession / RequireAuth /
OptionalAuth (nil-safe).
- bootstrap/router.go: pass the recorder to the auth middleware at all call sites.
- Tests: TestRequireAuth_DeletedUser_* now asserts the decrement + counter.
- Docs/comments: document the new account_inactive reason.
make fmt/lint/test all green (1763 tests).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Re-pad the session metrics table so the rows match the widened labels column
Performance follow-up from the code review. - Index the columns the population-gauge reconciliation filters on so the per-source / per-status COUNTs use index scans instead of full table scans: a composite users(is_active, auth_source) index for authgate_users_active and an oauth_applications(status) index for authgate_oauth_clients_registered. Both are added via gorm tags and created by the existing AutoMigrate. - Harden rate_limit_exceeded_total: when a limiter has no Endpoint set, fall back to the constant "unknown" instead of the raw request path, so a misconfigured limiter can't explode the metric's label cardinality. - Add a store test asserting both indexes exist (guards the perf fix). - Document the index-backed COUNTs in METRICS.md. make fmt/lint/test all green (1764 tests). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…oups Cleanup follow-up. setupAllRoutes already constructs optionalAuth and injectPending once and reuses them; the auth-middleware threading added a fresh middleware.RequireAuth(h.userService, prometheusMetrics) at each of the 6 route groups. Build it once into a local and reuse, matching the surrounding pattern and collapsing the duplicated multi-line .Use() blocks. No behavior change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ly has
Code-review follow-up. The population gauge labelled authgate_users_active by
auth_source using {local, http_api, microsoft, github, gitea, gitlab}, but
User.AuthSource never holds a provider name: createUserWithOAuth stores every
OAuth user (GitHub/Gitea/GitLab/Microsoft) as "local" (the provider lives in
OAuthConnection), and external users are "http_api". So the four provider labels
were permanently-zero series and the "local" count silently conflated local +
OAuth users.
Reduce metricsAuthSources to the values auth_source can actually take
({local, http_api}); drop the now-unused internal/auth import. Per-provider
breakdown remains available via the runtime-labelled auth_login_total{auth_source}.
Docs/comments and the example labels in two tests updated to match.
make fmt/lint/test all green (1764 tests).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Operators running AuthGate behind Prometheus/Grafana could see request/token/auth rates but not how the system is actually used — active users, registered clients, consent activity, or rate-limit abuse — and the
sessions_activegauge drifted upward because non-logout expiries never decremented it. This PR adds four categories of usage/system-scale metrics, fixes the session gauge drift (across all server-observable clear paths), and correctsdocs/METRICS.mddrift. New recording is additive (no renamed/removed series) and a no-op whenMETRICS_ENABLED=false.rate_limit_exceeded_total{endpoint}recorded in the 429 handler (Endpoint+Recorderthreaded throughmiddleware.RateLimitConfig).oauth_consent_granted_total{flow}/oauth_consent_revoked_total{actor}viaAuthorizationService(recorder injected,nil→noop default).authgate_users_active{auth_source},authgate_users_registered,authgate_oauth_clients_registered{status},authgate_build_info{version,commit}, reconciled through the cached gauge-update job. New read-only store aggregatesCountActiveUsersByAuthSource/CountAllUsers/CountClientsByStatus.sessions_activedrift fix:RecordSessionExpired(reason)incrementssessions_expired_total{reason}and decrementssessions_activeon idle-timeout, fingerprint-mismatch, and account-inactive (deleted/disabled user) clear paths — previously onlylogoutdecremented.INTEGRATION.mdlink fromdocs/METRICS.md; documented the existingcache_*metrics and all new series with Grafana queries + alerts; synceddocs/MONITORING.mdandCLAUDE.md.Architecture / flow
How recording is wired. Every node maps to a file this PR changes; the two green nodes are the review-critical follow-ups (the
account_inactivedrift fix and the index-backed COUNTs). All recording funnels throughcore.Recorder, which is the realMetricswhen enabled and a no-op otherwise.flowchart TD subgraph req[Request-path recording] RL["ratelimit.go — 429 handler"] SESS["auth.go — session clear<br/>idle / fingerprint / account_inactive"] CONS["authorization.go — consent grant/revoke"] end subgraph bg[Background job + startup] JOB["server.go — updatePopulationGauges"] STORE[("store COUNT — index-backed")] BUILD["cache.go — SetBuildInfo (startup)"] end REC{{"core.Recorder<br/>Metrics or Noop"}} RL -->|RecordRateLimitExceeded| REC SESS -->|RecordSessionExpired| REC CONS -->|"RecordConsentGranted / Revoked"| REC JOB --> STORE STORE -->|"SetUsers* / SetClients*Count"| REC BUILD -->|SetBuildInfo| REC REC --> SCRAPE["GET /metrics (router.go)"] subgraph series[New series exposed] S1["rate_limit_exceeded_total"] S2["sessions_active / sessions_expired_total"] S3["oauth_consent_granted / revoked_total"] S4["authgate_users_active / _registered"] S5["authgate_oauth_clients_registered"] S6["authgate_build_info"] end SCRAPE --> S1 & S2 & S3 & S4 & S5 & S6 style SESS fill:#dff0d8,stroke:#3c763d style STORE fill:#dff0d8,stroke:#3c763dTwo indexes are added via gorm tags and created by the existing
AutoMigrateon next startup:users(is_active, auth_source)→idx_users_active_source(forauthgate_users_active)oauth_applications(status)(forauthgate_oauth_clients_registered)They make the population-gauge
COUNTqueries index scans instead of full table scans. Deploy note: on a large existing Postgres table the first startup runs a one-timeCREATE INDEX(brief write lock); SQLite is fast. No columns/tables added, no data migration.Post-review follow-ups (commits 2–5)
The first commit was the feature; the rest are review/perf follow-ups (all green):
refactor— reuseversion.GetVersion()instead of a duplicate helper; derive the auth_source label set fromauth.Provider*constants instead of magic strings.fix— complete thesessions_activedrift fix for the account-inactive clear path (the third path the plan named but the feature commit missed); thread a nil-safe recorder throughRequireAuth/OptionalAuth.perf— the two indexes above; hardenrate_limit_exceeded_totalto fall back to a constant"unknown"label (never the raw request path) so a misconfigured limiter can't explode cardinality.docs— table alignment.AI Authorship
Change classification
Touches the shared
core.Recorderinterface, session/rate-limit/auth middleware, bootstrap wiring, and the DB schema (indexes). Mitigating factor: all new recording is additive and a no-op whenMETRICS_ENABLED=false; the indexes are additive and don't change query results.Plan reference
plan.md(force-added to the branch). Done-definition checklist is fully ticked.Verification
TestPopulationGaugeIndexes).middleware/metrics_integration_test.go(429→counter, idle-timeout + fingerprint-mismatch decrement);services/authorization_metrics_test.go(all four consent paths);middleware/auth_test.go(deleted-user clear →account_inactivedecrement).bootstrap/metrics_endpoint_test.go): happy-path scrape exposes the new series;METRICS_ENABLED=false→/metrics404 (silent); bearer-token auth (401 without/wrong, 200 with).make generate && make build;METRICS_ENABLED=true ./bin/authgate server;curl -s localhost:8080/metrics | grep -E '^(rate_limit_exceeded|oauth_consent|authgate_users_|authgate_oauth_clients_registered|cache_(hits|misses))'.Gates:
make generate(mocks regenerated, no diff),make fmt,make lint(0 issues),make test(1764 passed / 20 packages).Verifiability check
docs/METRICS.md.database_query_errors_totalcovers gauge-reconciliation query failures.Security check
s3cr3t-metrics-token) inmetrics_endpoint_test.go.endpointlabel now falls back to a constant"unknown", not the raw path, so cardinality stays bounded even if misconfigured./metricsbearer-token auth path is exercised by the e2e test.session.Save().Risk & rollback
CREATE INDEX(brief write lock on large Postgres tables).git revertthe relevant commit(s). Reverting code leaves the (harmless, unused) indexes in place — drop them manually if desired. No data or config migration.Reviewer guide
internal/middleware/auth.go—clearLiveSession+ theRecordSessionExpiredcalls on the idle-timeout / fingerprint-mismatch / account-inactive clear paths (confirm no control-flow change; recorded only after a successful save).internal/models/user.go+oauth_application.go— the new index tags (the migration).internal/core/metrics.go+internal/metrics/{metrics,http,noop}.go— new Recorder methods land in interface + Metrics + Noop + mock together.internal/bootstrap/server.go/router.go—updatePopulationGaugesreconciliation and the recorder threaded through allRequireAuth/OptionalAuthcall sites.internal/services/authorization.go— the four consent recording sites.mock_metrics.go/mock_store.go;docs/*andCLAUDE.md; test files; the mechanical*_test.goconstructor-signature updates.🤖 Generated with Claude Code