Skip to content

feat(metrics): usage & system-scale observability#239

Open
appleboy wants to merge 7 commits into
mainfrom
feat/metrics-usage-observability
Open

feat(metrics): usage & system-scale observability#239
appleboy wants to merge 7 commits into
mainfrom
feat/metrics-usage-observability

Conversation

@appleboy

@appleboy appleboy commented Jun 5, 2026

Copy link
Copy Markdown
Member

Summary

Operators running AuthGate behind Prometheus/Grafana could see request/token/auth rates but not how the system is actually used — active users, registered clients, consent activity, or rate-limit abuse — and the sessions_active gauge drifted upward because non-logout expiries never decremented it. This PR adds four categories of usage/system-scale metrics, fixes the session gauge drift (across all server-observable clear paths), and corrects docs/METRICS.md drift. New recording is additive (no renamed/removed series) and a no-op when METRICS_ENABLED=false.

  • WS-B — Rate-limit hits: rate_limit_exceeded_total{endpoint} recorded in the 429 handler (Endpoint + Recorder threaded through middleware.RateLimitConfig).
  • WS-C — Consent: oauth_consent_granted_total{flow} / oauth_consent_revoked_total{actor} via AuthorizationService (recorder injected, nil→noop default).
  • WS-D — Population gauges: authgate_users_active{auth_source}, authgate_users_registered, authgate_oauth_clients_registered{status}, authgate_build_info{version,commit}, reconciled through the cached gauge-update job. New read-only store aggregates CountActiveUsersByAuthSource / CountAllUsers / CountClientsByStatus.
  • WS-E — sessions_active drift fix: RecordSessionExpired(reason) increments sessions_expired_total{reason} and decrements sessions_active on idle-timeout, fingerprint-mismatch, and account-inactive (deleted/disabled user) clear paths — previously only logout decremented.
  • Docs: removed phantom metrics + dead INTEGRATION.md link from docs/METRICS.md; documented the existing cache_* metrics and all new series with Grafana queries + alerts; synced docs/MONITORING.md and CLAUDE.md.

Architecture / flow

How recording is wired. Every node maps to a file this PR changes; the two green nodes are the review-critical follow-ups (the account_inactive drift fix and the index-backed COUNTs). All recording funnels through core.Recorder, which is the real Metrics when enabled and a no-op otherwise.

flowchart TD
    subgraph req[Request-path recording]
        RL["ratelimit.go — 429 handler"]
        SESS["auth.go — session clear<br/>idle / fingerprint / account_inactive"]
        CONS["authorization.go — consent grant/revoke"]
    end
    subgraph bg[Background job + startup]
        JOB["server.go — updatePopulationGauges"]
        STORE[("store COUNT — index-backed")]
        BUILD["cache.go — SetBuildInfo (startup)"]
    end

    REC{{"core.Recorder<br/>Metrics or Noop"}}

    RL -->|RecordRateLimitExceeded| REC
    SESS -->|RecordSessionExpired| REC
    CONS -->|"RecordConsentGranted / Revoked"| REC
    JOB --> STORE
    STORE -->|"SetUsers* / SetClients*Count"| REC
    BUILD -->|SetBuildInfo| REC

    REC --> SCRAPE["GET /metrics (router.go)"]

    subgraph series[New series exposed]
        S1["rate_limit_exceeded_total"]
        S2["sessions_active / sessions_expired_total"]
        S3["oauth_consent_granted / revoked_total"]
        S4["authgate_users_active / _registered"]
        S5["authgate_oauth_clients_registered"]
        S6["authgate_build_info"]
    end
    SCRAPE --> S1 & S2 & S3 & S4 & S5 & S6

    style SESS fill:#dff0d8,stroke:#3c763d
    style STORE fill:#dff0d8,stroke:#3c763d
Loading

⚠️ This PR includes a DB schema change (migration)

Two indexes are added via gorm tags and created by the existing AutoMigrate on next startup:

  • composite users(is_active, auth_source)idx_users_active_source (for authgate_users_active)
  • oauth_applications(status) (for authgate_oauth_clients_registered)

They make the population-gauge COUNT queries index scans instead of full table scans. Deploy note: on a large existing Postgres table the first startup runs a one-time CREATE INDEX (brief write lock); SQLite is fast. No columns/tables added, no data migration.

Post-review follow-ups (commits 2–5)

The first commit was the feature; the rest are review/perf follow-ups (all green):

  • refactor — reuse version.GetVersion() instead of a duplicate helper; derive the auth_source label set from auth.Provider* constants instead of magic strings.
  • fix — complete the sessions_active drift fix for the account-inactive clear path (the third path the plan named but the feature commit missed); thread a nil-safe recorder through RequireAuth/OptionalAuth.
  • perf — the two indexes above; harden rate_limit_exceeded_total to fall back to a constant "unknown" label (never the raw request path) so a misconfigured limiter can't explode cardinality.
  • docs — table alignment.

AI Authorship

  • AI was used — this branch is entirely AI-authored.
    • Tool / model: Claude Code (Opus 4.8), across the feature, a max-effort code review (which surfaced and fixed the account-inactive drift bug), and a performance review (which surfaced and fixed the unindexed COUNTs).
    • AI-authored files: all files in this PR.
    • Human line-by-line reviewed: none yet — treat as unreviewed AI code; read the core paths line-by-line (see Reviewer guide).

Change classification

  • Core code (broad impact — needs line-by-line review)

Touches the shared core.Recorder interface, session/rate-limit/auth middleware, bootstrap wiring, and the DB schema (indexes). Mitigating factor: all new recording is additive and a no-op when METRICS_ENABLED=false; the indexes are additive and don't change query results.

Plan reference

plan.md (force-added to the branch). Done-definition checklist is fully ticked.

Verification

  • Unit tests — one per new Recorder method (real Prometheus registry); noop coverage; cache getters; store aggregates; a regression test asserting both new indexes exist (TestPopulationGaugeIndexes).
  • Integration testsmiddleware/metrics_integration_test.go (429→counter, idle-timeout + fingerprint-mismatch decrement); services/authorization_metrics_test.go (all four consent paths); middleware/auth_test.go (deleted-user clear → account_inactive decrement).
  • 3 e2e tests (bootstrap/metrics_endpoint_test.go): happy-path scrape exposes the new series; METRICS_ENABLED=false/metrics 404 (silent); bearer-token auth (401 without/wrong, 200 with).
  • Stress / soak test: N/A — no new long-running/async paths; gauge job reuses the existing cached ticker loop.
  • Manual: make generate && make build; METRICS_ENABLED=true ./bin/authgate server; curl -s localhost:8080/metrics | grep -E '^(rate_limit_exceeded|oauth_consent|authgate_users_|authgate_oauth_clients_registered|cache_(hits|misses))'.

Gates: make generate (mocks regenerated, no diff), make fmt, make lint (0 issues), make test (1764 passed / 20 packages).

Verifiability check

  • New series, labels (fixed low-cardinality enums), and Grafana/alert queries documented in docs/METRICS.md.
  • Reviewer can judge correctness from the e2e scrape assertions + recorder/store unit tests without reading every implementation line.
  • Failures surface in monitoring (these are the monitoring); database_query_errors_total covers gauge-reconciliation query failures.

Security check

  • No real secrets — the only credential-looking string is a test literal (s3cr3t-metrics-token) in metrics_endpoint_test.go.
  • All metric labels are fixed enums (endpoint, flow, actor, reason, status, auth_source) — never user_id/client_id/IP. The rate-limit endpoint label now falls back to a constant "unknown", not the raw path, so cardinality stays bounded even if misconfigured.
  • /metrics bearer-token auth path is exercised by the e2e test.
  • No change to auth/session/rate-limit control flow — recording is added alongside existing audit calls; session redirect behavior is byte-identical. The account-inactive path records only after a successful session.Save().
  • Population counts are read-only aggregate queries; the new indexes don't change results.

Risk & rollback

  • Risk: (1) a mis-wired gauge/label — bounded, metrics are off by default; (2) the index migration — additive, but the first deploy runs CREATE INDEX (brief write lock on large Postgres tables).
  • Rollback: git revert the relevant commit(s). Reverting code leaves the (harmless, unused) indexes in place — drop them manually if desired. No data or config migration.

Reviewer guide

  • Read carefully:
    • internal/middleware/auth.goclearLiveSession + the RecordSessionExpired calls on the idle-timeout / fingerprint-mismatch / account-inactive clear paths (confirm no control-flow change; recorded only after a successful save).
    • internal/models/user.go + oauth_application.go — the new index tags (the migration).
    • internal/core/metrics.go + internal/metrics/{metrics,http,noop}.go — new Recorder methods land in interface + Metrics + Noop + mock together.
    • internal/bootstrap/server.go / router.goupdatePopulationGauges reconciliation and the recorder threaded through all RequireAuth/OptionalAuth call sites.
    • internal/services/authorization.go — the four consent recording sites.
  • Spot-check OK: generated mock_metrics.go / mock_store.go; docs/* and CLAUDE.md; test files; the mechanical *_test.go constructor-signature updates.

🤖 Generated with Claude Code

appleboy and others added 2 commits June 5, 2026 22:11
Implements plan.md (workstreams B–E + docs) end to end.

- WS-B: rate_limit_exceeded_total{endpoint} recorded in the 429 handler
  (Endpoint + Recorder threaded through middleware.RateLimitConfig).
- WS-C: oauth_consent_granted_total{flow} / oauth_consent_revoked_total{actor}
  via AuthorizationService (recorder injected, nil->noop default).
- WS-D: authgate_users_active{auth_source}, authgate_users_registered,
  authgate_oauth_clients_registered{status}, authgate_build_info{version,commit}
  reconciled through the cached gauge-update job. New store aggregates
  CountActiveUsersByAuthSource / CountAllUsers; MetricsStore + CacheWrapper extended.
- WS-E: RecordSessionExpired decrements sessions_active and increments
  sessions_expired_total{reason} on idle-timeout and fingerprint-mismatch,
  fixing the upward drift.
- Docs: METRICS.md drift fixed (removed phantom auth_external_api_duration_seconds
  / sessions_invalidated_total + dead INTEGRATION.md link), documented existing
  cache_* metrics and all new series with Grafana queries + alerts; MONITORING.md
  and CLAUDE.md summaries synced.

Open-question answers folded in: authgate_ prefix, auth_source label on active
users, and the build_info gauge.

Tests: metrics unit (new recorders + cache getters), store
(CountActiveUsersByAuthSource / CountAllUsers), middleware integration
(rate-limit 429 -> metric, idle-timeout + fingerprint-mismatch drift), consent
service paths, and bootstrap-level /metrics scrape e2e (happy-path series
exposed, disabled-silent 404, bearer-token auth). mock_metrics.go / mock_store.go
regenerated via make generate. make fmt/lint/test all green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Post-review cleanup (no behavior change):

- Reuse: replace the duplicate buildInfoVersion() in bootstrap with the
  existing version-default logic, now exported as version.GetVersion().
- Altitude: derive the authgate_users_active label set from the canonical
  auth.Provider* constants instead of hardcoded "github"/"gitea"/... strings,
  so the metric labels can't silently drift from the provider registry.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@codecov

codecov Bot commented Jun 5, 2026

Copy link
Copy Markdown

appleboy and others added 5 commits June 5, 2026 23:41
…a live session

Code-review follow-up. The sessions_active drift fix (WS-E) covered the
idle-timeout and fingerprint-mismatch session-clear paths but missed the third
path the plan itself named: loadUserFromSession clears a *live* authenticated
cookie session when the backing user was deleted or disabled, yet never
decremented the gauge — so disabling/deleting an active user left sessions_active
drifting upward (the same bug WS-E set out to close).

- middleware/auth.go: add clearLiveSession() which clears + records
  sessions_expired_total{reason="account_inactive"} and decrements
  sessions_active only on a successful save (mirrors the idle/fingerprint paths).
  Thread an optional recorder through loadUserFromSession / RequireAuth /
  OptionalAuth (nil-safe).
- bootstrap/router.go: pass the recorder to the auth middleware at all call sites.
- Tests: TestRequireAuth_DeletedUser_* now asserts the decrement + counter.
- Docs/comments: document the new account_inactive reason.

make fmt/lint/test all green (1763 tests).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Re-pad the session metrics table so the rows match the widened labels column
Performance follow-up from the code review.

- Index the columns the population-gauge reconciliation filters on so the
  per-source / per-status COUNTs use index scans instead of full table scans:
  a composite users(is_active, auth_source) index for authgate_users_active and
  an oauth_applications(status) index for authgate_oauth_clients_registered.
  Both are added via gorm tags and created by the existing AutoMigrate.
- Harden rate_limit_exceeded_total: when a limiter has no Endpoint set, fall
  back to the constant "unknown" instead of the raw request path, so a
  misconfigured limiter can't explode the metric's label cardinality.
- Add a store test asserting both indexes exist (guards the perf fix).
- Document the index-backed COUNTs in METRICS.md.

make fmt/lint/test all green (1764 tests).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…oups

Cleanup follow-up. setupAllRoutes already constructs optionalAuth and
injectPending once and reuses them; the auth-middleware threading added a fresh
middleware.RequireAuth(h.userService, prometheusMetrics) at each of the 6 route
groups. Build it once into a local and reuse, matching the surrounding pattern
and collapsing the duplicated multi-line .Use() blocks. No behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ly has

Code-review follow-up. The population gauge labelled authgate_users_active by
auth_source using {local, http_api, microsoft, github, gitea, gitlab}, but
User.AuthSource never holds a provider name: createUserWithOAuth stores every
OAuth user (GitHub/Gitea/GitLab/Microsoft) as "local" (the provider lives in
OAuthConnection), and external users are "http_api". So the four provider labels
were permanently-zero series and the "local" count silently conflated local +
OAuth users.

Reduce metricsAuthSources to the values auth_source can actually take
({local, http_api}); drop the now-unused internal/auth import. Per-provider
breakdown remains available via the runtime-labelled auth_login_total{auth_source}.
Docs/comments and the example labels in two tests updated to match.

make fmt/lint/test all green (1764 tests).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant