feat(auth): Microsoft Entra JWT authentication for the Context Intelligence Server#29

Open

colombod wants to merge 15 commits into main from feat/entra-auth

Conversation

@colombod

Collaborator

Adds an optional auth_mode=entra that validates Microsoft Entra JWTs (RS256, dual-audience, tid/scp/oid checks) and maps the token oid to a write-once created_by contributor, with a clean static<->entra switch via a PrincipalResolver seam (StaticKeyResolver | EntraResolver).

Highlights:

  • Fail-CLOSED: no-auth configured -> loud startup refusal (no silent fail-open); post-startup JWKS-unreachable -> 401.
  • AC10 proven LIVE: a real az token -> isolated DTU server (auth_mode=entra) -> created_by=colombod read back from the Neo4j graph.
  • Operator + developer + ops docs (docs/entra-auth-setup.md) + AGENTS.md Entra section (placeholders only).
  • Reduced JWKS hardening: pinned PyJWKClient lifespan, distinct auth log tags (auth_denied vs resolver_unexpected_exception); no over-built lock/cap (per design-review council).
  • Minimal web_ui_enabled API-only lockdown: docs/openapi off, web routes + /logs/stream exempt removed.
  • ~1536 unit/integration tests; built test-first; persona-council reviewed at each step (found+fixed 2 live 500-crash bugs and a silent fail-open).

Deferred (named, off the pilot critical path): the app-profile factory, JWKS concurrency/rotation tests, and a data.timestamp ingest-validation gap (see ops runbook §4.4).

colombod and others added 10 commits June 27, 2026 19:08
Adds pyjwt[crypto]>=2.8.0 (resolves pyjwt 2.13.0 + cryptography 49) and an
import smoke test. Foundation dependency for the EntraResolver (T4); not yet used.
…(T2)

Pure refactor, no behaviour change. BearerTokenMiddleware now delegates token
resolution to a PrincipalResolver protocol; StaticKeyResolver wraps the existing
sha256 keystore lookup; the resolver is constructed in create_asgi_app(). Prepares
the seam for an EntraResolver (T4) without adding it. All existing test_auth.py
tests pass unchanged; 15 new seam tests added.
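
A minimal sketch of the seam described above, assuming a keystore that maps a token's sha256 hex digest to a contributor id. The `resolve` method name and the `authenticate` helper are hypothetical and chosen for illustration; the real protocol's method names are not shown in this PR.

```python
import hashlib
from typing import Optional, Protocol


class PrincipalResolver(Protocol):
    """Anything that can turn a bearer token into a contributor id."""

    def resolve(self, token: str) -> Optional[str]:
        ...


class StaticKeyResolver:
    """Looks up the sha256 digest of the token in a keystore (digest -> contributor)."""

    def __init__(self, keystore: dict[str, str]) -> None:
        self._keystore = keystore

    def resolve(self, token: str) -> Optional[str]:
        digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
        return self._keystore.get(digest)


def authenticate(resolver: PrincipalResolver, authorization: str) -> Optional[str]:
    """Hypothetical middleware helper: parse the header, then delegate to the resolver."""
    scheme, _, token = authorization.partition(" ")
    if scheme.lower() != "bearer" or not token.strip():
        return None
    return resolver.resolve(token.strip())
```

Because the helper depends only on the protocol, a second resolver can be dropped in without touching the header parsing.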
…dators (T3)

Adds the Entra-auth config surface to Settings (no resolver yet — that's T4):
- auth_mode: Literal[static, entra] (default static; existing behaviour unchanged)
- azure_client_id / azure_tenant_id (empty/whitespace normalized to None)
- entra_identities: oid->contributor map, exact api_keys parity, value {id} only
- _validate_entra_identities (mirrors _validate_api_keys): GUID re.fullmatch keys
  (rejects braces / urn:uuid / trailing-junk / all-zeros), non-dict value and
  missing/empty/whitespace id rejection, lowercase-normalized keys
- model_validator (AC7): auth_mode=entra requires client_id + tenant_id + identities,
  else a loud startup refusal
- build_identity_map() mirrors build_keystore()

56 new config tests cover the plan's edge matrix incl. the env-var (production) path,
GUID edges (unicode / zero-width / braces), coexistence with api_keys, and the
duplicate-oid last-wins documentation. tester-breaker-reviewed (verdict CONCERN:
nothing breaks; gaps were test-coverage, now closed). 106 config tests green; 1438 suite green.
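
The GUID-key rules listed above could look roughly like the sketch below. This is an illustration under assumptions, not the shipped `_validate_entra_identities`: the function name, the error wording, and the choice to keep only the `id` field are hypothetical.

```python
import re

# Bare 8-4-4-4-12 hex GUID. An explicit ASCII hex class (not \d) keeps unicode
# digits out, and fullmatch rejects braces, urn:uuid prefixes and trailing junk.
_GUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)
_ALL_ZEROS = "00000000-0000-0000-0000-000000000000"


def validate_entra_identities(raw: dict) -> dict:
    """Return an oid -> {"id": contributor} map with lowercase-normalized keys."""
    normalized = {}
    for oid, meta in raw.items():
        if not isinstance(oid, str) or _GUID_RE.fullmatch(oid) is None:
            raise ValueError(f"entra_identities key is not a bare GUID: {oid!r}")
        key = oid.lower()
        if key == _ALL_ZEROS:
            raise ValueError("entra_identities key must not be the all-zeros GUID")
        if not isinstance(meta, dict):
            raise ValueError(f"entra_identities[{key}] must be a mapping with an 'id'")
        contributor = meta.get("id")
        if not isinstance(contributor, str) or not contributor.strip():
            raise ValueError(f"entra_identities[{key}] has a missing or empty 'id'")
        # Keys that differ only by case collapse here, so the last one wins.
        normalized[key] = {"id": contributor}
    return normalized
```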
Second PrincipalResolver; drops into the T2 seam with no middleware rewrite.
- RS256-pinned jwt.decode, dual audience [client_id, api://client_id], v2 issuer;
  explicit tid check; scp must contain access_as_user (space-split, no substring
  trap); oid extracted and mapped oid->contributor.
- AuthError(status): 401 for invalid/missing-oid token; 403 for a valid token whose
  oid is not in entra_identities (the 403 names the oid for operator diagnosis).
- Eager JWKS prefetch at construction, fail-closed: raises if the endpoint is
  unreachable OR returns zero signing keys.
- BearerTokenMiddleware maps AuthError->status and has a fail-closed catch-all
  (an unexpected resolver exception denies with 401 + a loud log, never a 500).

Adversarially reviewed (tester-breaker): found and fixed two live 500-crash bugs
(non-string oid/scp) and a 403-vs-401 semantic bug. 46 tests incl. a real-crypto
tier proving expired / wrong-aud / tampered / alg=none / HS256 rejection, dual-aud
(bare GUID + api://), nbf, app-only(roles)/no-scp, alg case variants, empty/garbage
bearer, aud-array, and empty-JWKS-at-startup. Full suite 1484 green.

Not yet wired into create_asgi_app (auth_mode switch = T7); JWKS global cap +
per-kid dedup lock = T5 (TODO left in code).
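
The claim checks that follow signature verification can be sketched as below. The sketch assumes the claims dict has already been produced by an RS256-pinned `jwt.decode` with the audience and issuer checks described above, and it assumes that tid and scp failures map to 401; `resolve_contributor` and the error messages are hypothetical names used only for illustration.

```python
class AuthError(Exception):
    """Carries the HTTP status the middleware should return."""

    def __init__(self, status: int, message: str) -> None:
        super().__init__(message)
        self.status = status


REQUIRED_SCOPE = "access_as_user"


def resolve_contributor(claims: dict, tenant_id: str, identity_map: dict) -> str:
    """Apply the tid / scp / oid checks to already-verified token claims."""
    if claims.get("tid") != tenant_id:
        raise AuthError(401, "token tenant does not match the configured tenant")

    scp = claims.get("scp")
    # scp is a space-separated string. Splitting before the membership test
    # avoids the substring trap ("access_as_user_admin" must not pass), and the
    # isinstance guard keeps a non-string scp from crashing the request.
    if not isinstance(scp, str) or REQUIRED_SCOPE not in scp.split():
        raise AuthError(401, "required delegated scope is missing")

    oid = claims.get("oid")
    if not isinstance(oid, str) or not oid:
        raise AuthError(401, "token carries no usable oid")

    contributor = identity_map.get(oid.lower())
    if contributor is None:
        # A valid token for an unknown user: 403, naming the oid for the operator.
        raise AuthError(403, f"oid {oid} is not in entra_identities")
    return contributor
```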

🤖 Generated with [Amplifier](https://github.com/microsoft/amplifier)

Co-Authored-By: Amplifier <240397093+microsoft-amplifier@users.noreply.github.com>
create_asgi_app() now selects the resolver from settings.auth_mode: entra ->
EntraResolver(client_id, tenant_id, build_identity_map()); static ->
StaticKeyResolver(build_keystore()). The middleware no longer special-cases a
concrete resolver type -- PrincipalResolver gains an auth_enabled property
(StaticKeyResolver: bool(keystore); EntraResolver: True).

Closes a CRITICAL silent fail-open (AC13/H2): a server with no auth configured
(static + no keys -- previously a pass-through) now REFUSES to start with a loud
RuntimeError, unless the explicit, default-false allow_unauthenticated flag is set
(test harness only). A six-lens council review flagged this as the #1 issue
(restless-old-brian verdict: FAIL) -- auth_mode=entra was previously inert and
silently unauthenticated.

Cleanups (cranky-old-sam): delete dead _is_hex() + its test class; drop
@runtime_checkable + the circular protocol test; remove dead isinstance(meta,dict)
branches in both validators; correct the PrincipalResolver / _validate_api_keys
docstrings (resolvers raise AuthError 401 OR 403).

13 new switch/fail-closed tests incl. AC13 startup-refusal (RED->GREEN) and AC8
(auth_mode actually changes the resolver). Full suite 1493 green.
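
A rough sketch of the switch and the fail-closed gate described in this commit. The classes below are simplified stand-ins (the real `Settings` and resolvers take more arguments), and `select_resolver` is a hypothetical name for the logic inside `create_asgi_app()`.

```python
from dataclasses import dataclass, field


@dataclass
class Settings:
    # Simplified stand-in for the real settings object.
    auth_mode: str = "static"
    api_keys: dict = field(default_factory=dict)
    entra_identities: dict = field(default_factory=dict)
    allow_unauthenticated: bool = False


@dataclass
class StaticKeyResolver:
    keystore: dict

    @property
    def auth_enabled(self) -> bool:
        # An empty keystore means nothing would be enforced.
        return bool(self.keystore)


@dataclass
class EntraResolver:
    identity_map: dict

    @property
    def auth_enabled(self) -> bool:
        return True


def select_resolver(settings: Settings):
    """Pick the resolver from auth_mode and refuse to start if auth is off."""
    if settings.auth_mode == "entra":
        resolver = EntraResolver(settings.entra_identities)
    else:
        resolver = StaticKeyResolver(settings.api_keys)

    # The gate asks the resolver whether auth is on, so no concrete type is
    # special-cased. Only the explicit opt-out flag bypasses the refusal.
    if not resolver.auth_enabled and not settings.allow_unauthenticated:
        raise RuntimeError("no authentication configured; refusing to start")
    return resolver
```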
… real HTTP (T8)

Proves the seams unit tests can't: a real RS256-signed token (in-test keypair,
stub JWKS) through httpx -> asgi_app in auth_mode=entra:
- valid mapped-oid token -> 202 and created_by == the mapped contributor in the
  queued payload (AC2/AC9, the load-bearing provenance chain, asserted via the
  durable-queue capture technique, no Neo4j needed)
- valid unmapped-oid token -> HTTP 403 (real response, not a mocked ASGI send)
- expired / garbage / missing bearer -> HTTP 401
- /status and /skills/* still exempt under entra mode
- static auth path regression intact

Tests only, no production change. Full non-neo4j suite 1543 green.
…tra section

Adds docs/entra-auth-setup.md (the council's #1 unblock -- an operator can't build
the oid->contributor map and a developer can't get a token without it):
- operator guide: config shape (YAML + env), `az ad user show` for an oid, a bold
  PII/secret-hygiene warning, the 403-names-oid recovery loop, the real startup-
  validator messages, and the fail-closed allow_unauthenticated note
- developer guide: scope access_as_user on api://<client-id>, az account
  get-access-token, the Bearer header + a full curl (incl. data.timestamp), and a
  401-vs-403 table from the caller's POV
- ops runbook: write-once wrong-oid permanence (+ verify-before-apply), JWKS
  ~5-min cache / ~6-week rotation guidance, and reading auth logs

AGENTS.md gains an Entra-auth subsection alongside the static-key section plus a
secret-hygiene rule. Placeholders only -- no real oids/client-ids/tenant-ids in
the product repo.

🤖 Generated with [Amplifier](https://github.com/microsoft/amplifier)

Co-Authored-By: Amplifier <240397093+microsoft-amplifier@users.noreply.github.com>
…d-run + log tags

Council-trimmed T5 (cranky-old-sam + crusty: no custom per-kid lock or global cap --
PyJWKClient handles per-kid caching + lifespan-bounded refresh natively):
- pin PyJWKClient(lifespan=JWKS_CACHE_LIFESPAN_SECONDS=300) so the cache-TTL contract
  is visible in code
- distinct, greppable auth log tags: auth_event=auth_denied (INFO, normal denial) vs
  auth_event=resolver_unexpected_exception (ERROR, catch-all) so an operator can tell
  'rejected a bad token' from 'resolver is broken'; raw token never logged
- prove post-startup JWKS-unreachable is fail-closed: a signing-key fetch that fails
  mid-run (connection error OR malformed JWKS) -> AuthError(401), not an unhandled 500;
  a previously-cached kid still resolves

14 new tests; zero concurrency primitives added (council direction). Full suite 1516 green.
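
The two log tags and the fail-closed catch-all might be wired roughly as follows. This is a sketch under assumptions: `authorize`, the logger name, and the extra log fields are hypothetical; only the two `auth_event=` tag values and their levels come from the commit message.

```python
import logging
from typing import Callable, Optional, Tuple

logger = logging.getLogger("auth")


class AuthError(Exception):
    def __init__(self, status: int, message: str) -> None:
        super().__init__(message)
        self.status = status


def authorize(resolve: Callable[[str], str], token: str) -> Tuple[int, Optional[str]]:
    """Return (http_status, contributor). The raw token is never logged."""
    try:
        return 200, resolve(token)
    except AuthError as exc:
        # Normal denial: the resolver rejected a bad or unmapped token.
        logger.info("auth_event=auth_denied status=%d reason=%s", exc.status, exc)
        return exc.status, None
    except Exception:
        # Catch-all: the resolver itself is broken. Deny with 401 rather than
        # surfacing a 500, and log at ERROR with the traceback.
        logger.exception("auth_event=resolver_unexpected_exception")
        return 401, None
```

With tags like these, an operator can grep for `auth_event=resolver_unexpected_exception` to separate resolver faults from ordinary rejections.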

🤖 Generated with Amplifier

Co-Authored-By: Amplifier <240397093+microsoft-amplifier@users.noreply.github.com>
…kdown

Council revised plan §3b (factory -> one conditional): a single `web_ui_enabled`
settings flag (default True), read once at app construction -- NOT a two-factory
split (cranky-old-sam: two profiles, trivial divergence). When web_ui_enabled=false
(the locked-down pilot profile):
- FastAPI built with docs_url/redoc_url/openapi_url=None (no Swagger, no schema leak)
- browser routes not registered: /, /dashboard, /static, /logs/stream
- kept: the API, /status, /version, and /skills/* (the bundle fetches skills here)
- the auth-exempt set narrows to {/status,/version}; /logs/stream LEAVES the exempt
  set (it was auth-exempt and only the dashboard used it -> no unauthenticated log
  drain). Unauth /logs/stream -> 401; with a token -> 404 (route absent).

20 tests incl. the tester-breaker F4 bypass guards: /openapi.json not 200, /dashboard
invalid-token not bypassed, /logs/stream unauth -> 401, /skills still reachable;
web_ui_enabled=true regression intact. Full suite 1536 green.

Generated with Amplifier

Co-Authored-By: Amplifier <240397093+microsoft-amplifier@users.noreply.github.com>
Adds ops-runbook note: an authenticated event missing data.timestamp is still accepted
(202, created_by stamped) but the durable drainer dead-letters it (no graph node). Ingest
validation, not auth -- surfaced in the live AC10 run; matters for curl/hand-rolled payloads.

🤖 Generated with [Amplifier](https://github.com/microsoft/amplifier)

Co-Authored-By: Amplifier <240397093+microsoft-amplifier@users.noreply.github.com>
colombod and others added 2 commits June 28, 2026 10:47
…silent dead-letter

The /events endpoint accepted events whose data.timestamp was missing/empty (202,
created_by stamped) but the durable drainer then crashed building the graph node
(datetime.fromisoformat('') -> ValueError), retried, and dead-lettered them -- no node,
and no error surfaced to the caller. Now:
- post_events validates data.timestamp is present, a non-empty string, and valid
  ISO-8601 BEFORE queuing; otherwise HTTP 400 with a clear, value-naming message.
- make_node_id wraps the parse and re-raises a NAMED error (event + session in the
  message) so any malformed event that bypasses ingest dead-letters legibly, not as a
  bare 'Invalid isoformat string'.

Verified safe against real traffic: 224,530 real events across 759 on-disk records all
carry data.timestamp, so the 400 only catches malformed/hand-rolled payloads (the gap
surfaced in the live AC10 run). 11 new tests; 15 pre-existing /events tests that sent
timestamp-less payloads updated to well-formed bodies (assertions unchanged). Suite 1376 green.
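
The ingest check described above amounts to something like this sketch. `validate_event_timestamp` is a hypothetical helper, and the assumption is that the endpoint turns the raised `ValueError` into an HTTP 400 before anything is queued.

```python
from datetime import datetime


def validate_event_timestamp(event: dict) -> datetime:
    """Require data.timestamp to be a non-empty, parseable ISO-8601 string."""
    data = event.get("data")
    value = data.get("timestamp") if isinstance(data, dict) else None
    if not isinstance(value, str) or not value.strip():
        raise ValueError(
            f"data.timestamp must be a non-empty ISO-8601 string, got {value!r}"
        )
    try:
        return datetime.fromisoformat(value)
    except ValueError as exc:
        # Re-raise with the offending value named, instead of a bare
        # "Invalid isoformat string".
        raise ValueError(f"data.timestamp is not valid ISO-8601: {value!r}") from exc
```

Note that on Python versions before 3.11, `datetime.fromisoformat` does not accept a trailing `Z`, so a stricter or more lenient parser may be needed depending on what clients send.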

🤖 Generated with [Amplifier](https://github.com/microsoft/amplifier)

Co-Authored-By: Amplifier <240397093+microsoft-amplifier@users.noreply.github.com>
… after folding the ingest fix

The folded-in ingest validation (HTTP 400 on missing data.timestamp) made the entra
created_by integration test's event body well-formed; the other entra integration tests
are short-circuited by auth (401/403) before ingest and were unaffected. Full suite 1547 green.

🤖 Generated with [Amplifier](https://github.com/microsoft/amplifier)

Co-Authored-By: Amplifier <240397093+microsoft-amplifier@users.noreply.github.com>
@colombod

Collaborator Author

Folded in the data.timestamp ingest-validation fix (was PR #30): /events now returns HTTP 400 for a missing/empty/invalid data.timestamp instead of accepting it (202) and silently dead-lettering it in the graph drainer; make_node_id re-raises a named error as defense-in-depth. Verified safe against real traffic (224,530 on-disk events, 0 missing the field). Full suite 1547 green.

colombod and others added 3 commits June 28, 2026 11:40
…iagrams

Audited every server doc and DOT diagram against the implemented Entra-auth feature:
- entra-auth-setup.md: §4.4 rewritten — a missing/invalid data.timestamp now returns
  HTTP 400 at ingest (was documented as silent 202 -> dead-letter); §4.3 states the live
  auth_event=auth_denied / resolver_unexpected_exception log tags (dropped the stale
  "finalized in T5" note); §3.4 notes the exempt set shrinks under web_ui_enabled=false.
- README.md: added the six missing settings rows (auth_mode, azure_client_id,
  azure_tenant_id, entra_identities, allow_unauthenticated, web_ui_enabled) and an Entra
  option in First-Run Setup.
- service-setup.md and managing-api-keys.md cross-reference entra mode; AGENTS.md corrects
  the auth.py description to the Bearer-token middleware / resolver model.
- architecture diagrams: NEW 06-auth-flow.dot (per-request: bearer -> middleware ->
  resolver[static|entra] -> 401/403 -> created_by -> 202) and 07-auth-startup.dot
  (create_asgi_app auth_mode switch + fail-closed gate + web_ui_enabled exempt selection);
  extended 05-durable-ingest-queue.dot with the auth middleware, the data.timestamp 400,
  and created_by stamping; architecture/README.md indexes both. PNGs rendered (graphviz);
  both new diagrams vision-checked (readable + correct flow).

Docs only; no code change. Verified against the live source.
…app (fail-open)

The context-intelligence-server-dev DTU profile launched `uvicorn ...main:app` -- the raw
FastAPI app with BearerTokenMiddleware NOT in the chain -- so an unauthenticated write
returned 422 (body validation), not 401, even though the profile generates and configures an
API key. Silent fail-open. Now serves `main:asgi_app` (auth-wrapped) in both the start and
update flows and exports AMPLIFIER_CONTEXT_INTELLIGENCE_SERVER_API_KEY=$CI_KEY so the
generated key is actually enforced (authenticated write -> 202, unauthenticated -> 401).
Mirrors the live-verified AC10 entra variant. Rule: launchers/profiles/Dockerfiles MUST serve
main:asgi_app, never main:app.

🤖 Generated with [Amplifier](https://github.com/microsoft/amplifier)

Co-Authored-By: Amplifier <240397093+microsoft-amplifier@users.noreply.github.com>
…egated az-CLI token model

Adds an "Authentication model & Entra App Registration" section to entra-auth-setup.md
making the current setup explicit:
- the server accepts DELEGATED (user) tokens only — scp must contain access_as_user, a
  claim that exists only on user-context tokens
- the single App Registration: Expose an API (api://<client>, delegated scope access_as_user,
  admin+user consent); the scope GUID is internal and never referenced by callers; a
  "what the server checks" table tied to auth.py (RS256, dual aud, v2 issuer, tid, scp,
  oid -> created_by)
- how a token is obtained today: `az login` + `az account get-access-token --resource
  api://<client>`; and DefaultAzureCredential().get_token("api://<client>/.default"),
  compatible ONLY when it resolves to a user-context credential (AzureCliCredential / VS Code
  / interactive)
- limitation: app-only credentials (Managed Identity / SP client-secret) carry `roles`, not
  `scp`, and are rejected; supporting them needs an App Role + server `roles` handling (not done)

AGENTS.md gains a one-line capture of the delegated-only model. Placeholders only; verified
against auth.py (algorithms=[RS256], expected_aud=[client_id, api://client_id], tid check,
scp.split() membership).
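
For a caller, the az-CLI path above can be scripted along these lines. This is a sketch that assumes `az login` has already been run as a user; the helper names are hypothetical, and `--query accessToken --output tsv` is used only to print the bare token.

```python
import subprocess


def token_command(client_id: str) -> list:
    """Build the az CLI call that requests a delegated token for the API."""
    return [
        "az", "account", "get-access-token",
        "--resource", f"api://{client_id}",
        "--query", "accessToken",
        "--output", "tsv",
    ]


def fetch_token(client_id: str) -> str:
    """Run the az CLI and return the access token (requires a prior `az login`)."""
    result = subprocess.run(
        token_command(client_id), check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def auth_headers(token: str) -> dict:
    """Headers to attach to requests against the server."""
    return {"Authorization": f"Bearer {token}"}
```

A token obtained this way is user-context, so it carries `scp` and passes the delegated-only check; an app-only credential would not.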
colombod added a commit that referenced this pull request Jun 29, 2026
… + live static keystore

T1-T3 of runtime identity-map management:
- config: admin_api_key (YAML config and/or env, consistent Settings pattern); api-keys &
  entra store paths
- IdentityStore (identity_store.py): durable JSON map with write-file-then-swap-memory commit
  order and fail-closed-on-corrupt load (never crash-loop); a live flat_dict reference
- wire the static keystore to the live store (first-boot seed from config, store-wins); a put()
  is visible to the resolver immediately, no restart
41 new tests; suite 1406 green.
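
The write-file-then-swap-memory commit order can be sketched as below. This is not the code in identity_store.py: the temp-file-plus-`os.replace` technique and the choice to treat a corrupt file as an empty map are assumptions made for the illustration.

```python
import json
import os
import tempfile


class IdentityStore:
    """Durable JSON map; flat_dict is a live reference shared with the resolver."""

    def __init__(self, path: str) -> None:
        self._path = path
        self.flat_dict: dict = {}
        self._load()

    def _load(self) -> None:
        try:
            with open(self._path, encoding="utf-8") as fh:
                loaded = json.load(fh)
            if not isinstance(loaded, dict):
                raise ValueError("store root must be a JSON object")
        except FileNotFoundError:
            loaded = {}
        except (ValueError, OSError):
            # Assumed fail-closed behaviour: a corrupt store yields no
            # identities (every lookup is denied) instead of crashing on boot.
            loaded = {}
        self.flat_dict.clear()
        self.flat_dict.update(loaded)

    def put(self, key: str, value) -> None:
        updated = {**self.flat_dict, key: value}
        directory = os.path.dirname(os.path.abspath(self._path))
        fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
        try:
            # Step 1: make the new state durable on disk.
            with os.fdopen(fd, "w", encoding="utf-8") as fh:
                json.dump(updated, fh)
                fh.flush()
                os.fsync(fh.fileno())
            os.replace(tmp, self._path)  # atomic swap of the file
        except BaseException:
            if os.path.exists(tmp):
                os.unlink(tmp)
            raise
        # Step 2: only after the file is committed, update memory in place so
        # holders of flat_dict see the change without a restart.
        self.flat_dict[key] = value
```

Committing to disk first means a failed write never leaves memory ahead of the file.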

NOTE: branched from main, which lacks the Entra auth code (auth_mode / EntraResolver /
entra_identities — those are on PR #29 / feat/entra-auth). The entra-side store wiring is
deferred until this work is re-based onto feat/entra-auth.

🤖 Generated with [Amplifier](https://github.com/microsoft/amplifier)

Co-Authored-By: Amplifier <240397093+microsoft-amplifier@users.noreply.github.com>