Skip to content

Refactor JWT verification and implement session versioning#647

Merged
jfrench9 merged 6 commits into
mainfrom
refactor/jwt-cache-invalidation-fix
May 1, 2026
Merged

Refactor JWT verification and implement session versioning#647
jfrench9 merged 6 commits into
mainfrom
refactor/jwt-cache-invalidation-fix

Conversation

@jfrench9
Copy link
Copy Markdown
Member

@jfrench9 jfrench9 commented May 1, 2026

Summary

Refactors the JWT verification pipeline and auth cache invalidation logic, introducing session versioning to enable reliable, granular token revocation without relying solely on token blocklists.

Key Changes

Session Versioning

  • Added session_version column to both users and API keys tables via a new Alembic migration
  • JWT tokens now embed a session version at creation time, which is validated against the current database value during verification
  • Incrementing a user's session version instantly invalidates all previously issued tokens without needing to track individual JWTs

JWT Verification Refactor

  • Restructured the JWT verification function to cleanly separate token decoding, claims validation, and session version checking
  • Improved the return value structure from the verification pipeline to carry richer context for downstream consumers

Auth Cache Invalidation Improvements

  • Enhanced the auth cache layer with session-version-aware invalidation logic
  • Cache entries are now tied to session versions, ensuring stale cached auth state is properly evicted when tokens are revoked

Auth Router Updates

  • Updated all auth-related routers (login, register, SSO, email verification, password reset, session management) to integrate with the new session versioning system
  • Session router now supports explicit session version bumps for "logout everywhere" functionality

Model Enhancements

  • Extended User and UserApiKey models with session version fields and methods for version management

Breaking Changes

  • JWT payload structure: Tokens now include a session_version claim. Existing tokens without this claim will fail verification and require re-authentication.
  • Database migration required: A new migration adds session_version columns to the users and API keys tables.
  • Verification return type: The verify function return value has changed, which affects any code consuming its output directly (all internal call sites and tests have been updated).

Testing

  • Comprehensive test updates across 8 test files covering:
    • Auth dependency injection with new verification return values
    • Session versioning lifecycle (creation, increment, validation)
    • JWT revocation via session version bumps
    • Graph access control with updated mock signatures
    • Email verification and password reset flows
    • SSO token exchange with session versioning
  • Net addition of ~624 lines of test code reflecting expanded coverage

Infrastructure Considerations

  • A database migration must be applied before deploying this change to add the session_version columns
  • Cache infrastructure should be monitored after deployment for any increase in invalidation rates during the transition period as existing sessions re-authenticate
  • Rolling deployments should account for the JWT payload change — old application instances will not recognize the new session_version claim, so a coordinated deployment is recommended

🤖 Generated with Claude Code

Branch Info:

  • Source: refactor/jwt-cache-invalidation-fix
  • Target: main
  • Type: refactor

Co-Authored-By: Claude noreply@anthropic.com

- Added session_version to User model to track session state changes.
- Updated JWT creation and verification to include session_version claims.
- Implemented session invalidation on password reset and user deactivation.
- Enhanced JWT verification to check session_version against the current user session.
- Updated SSO token generation to include session_version.
- Modified API key model to include a deterministic fingerprint for cache invalidation.
- Improved tests to cover new session_version logic and cache interactions.
…n logic

- Renamed `verify_jwt_token` to `verify_jwt_claims` for clarity in its purpose.
- Updated all references to the renamed function across the codebase, including middleware, routers, and tests.
- Enhanced `_invalidate_auth_cache` method in the User model to ensure both cache invalidation attempts succeed before considering the operation successful, with added logging for failures.
- Added regression tests for the `_invalidate_auth_cache` method to cover various scenarios, including success, transient failures, and critical failure logging.
- Updated comments and documentation to reflect changes in the cache invalidation strategy and JWT verification process.
@jfrench9
Copy link
Copy Markdown
Member Author

jfrench9 commented May 1, 2026

@claude please review this PR

@claude
Copy link
Copy Markdown

claude Bot commented May 1, 2026

Claude finished @jfrench9's task in 4m 24s —— View job


PR Review: Refactor JWT verification and implement session versioning

Tasks

  • Gather context and understand request
  • Read migration file
  • Read JWT and auth middleware changes
  • Read model changes
  • Read router changes
  • Read test changes
  • Post review feedback

Overall this is a well-structured security refactor with a clear design rationale. The session-versioning approach is sound, the backward-compatibility story is good, and the PR found and fixed at least one pre-existing bug (API key cache invalidation). A few issues worth addressing before merge, plus several smaller notes below.


Significant Find: Pre-existing API Key Cache Invalidation Bug — Fixed ✅

In validate_api_key (utils.py:156), the cache key has always been sha256(plaintext_key). But the old _invalidate_cache was calling api_key_cache.invalidate_api_key(self.key_hash) where key_hash is a bcrypt hash — a completely different value. Every call to _invalidate_cache was trying to delete a cache entry that didn't exist. This PR corrects it by introducing key_fingerprint (SHA-256) and using that instead. Worth calling out in the PR description since it's a real security fix, not just a refactor artifact.


Medium Issues

1. create_jwt_token performs a redundant DB query

_get_user_session_version opens a new DB session (or re-queries via the passed session) to read user.session_version. In every call site the caller already holds a loaded User object with session_version populated — yet the function re-queries:

# robosystems/middleware/auth/jwt.py
"session_version": _get_user_session_version(user_id, session=session) or 0,

vs. every caller having:

create_jwt_token(str(user.id), device_fingerprint, session=session)
# user.session_version is right there

The cleaner approach — one that also eliminates any sync-vs-async session ambiguity — is to accept session_version: int directly:

def create_jwt_token(
    user_id: str,
    device_fingerprint: dict[str, Any] | None = None,
    session_version: int = 0,
) -> str:

Callers already have the value: create_jwt_token(user.id, device_fingerprint, session_version=user.session_version). The helper _get_user_session_version could be removed entirely. Fix this →

2. Dead code: cache_jwt_validation / get_cached_jwt_validation are now unreachable

The old per-token jwt:* cache entries (cache.py:949, cache.py:1006) are no longer written to or read from anywhere in the new code. They should be removed. Similarly, the api_key_cache.invalidate_jwt_token(jwt_token) call in session.py:264 is now a no-op (no code writes to those entries after this refactor) — it can be removed to avoid misleading future readers.

3. time.sleep() inside a model method (user.py)

_invalidate_auth_cache calls time.sleep(0.05) (line ~196) for retry backoff. This blocks the calling thread. In this codebase routers use sync sessions, so it won't stall the event loop directly, but it's still a 50ms penalty on every password reset / deactivation that hits a Redis blip. A better pattern for retry backoff here would be to let the caller (router) decide — or accept that one Redis failure is logged loudly and operators page on it. The code comment already acknowledges the tradeoff; just worth reconsidering the sleep.


Low / Nit

4. PR description overstates the breaking change

"Existing tokens without this claim will fail verification and require re-authentication."

This is not accurate. verify_jwt_claims defaults payload.get("session_version", 0) to 0, and all existing users get session_version = 0 (via server_default). Old tokens are silently treated as version 0 and continue to work. This is actually a nice backward-compatible design — the description should reflect it so operators don't brace for a forced re-auth wave.

5. get_cached_jwt_user_data returns the cache envelope, not user data

The function is documented as returning "cached JWT user data" but actually returns cache_data — the full wrapper dict ({ user_data, session_version, cached_at, cache_version }). The caller in _get_user_for_verified_jwt correctly does cached_data.get("user_data", {}), but the function name and docstring create a false expectation. Consider renaming to get_cached_jwt_auth_entry or similar.

6. Stale session_version-mismatch cache entries not evicted

In get_cached_jwt_user_data, when cached_version != token_session_version, the stale entry is returned as None but not deleted:

# cache.py ~line 858
if cached_version != int(session_version):
    logger.debug(...)
    return None   # ← no delete

Under normal operation _invalidate_auth_cache already deleted this entry, so the case is uncommon. But if invalidation failed (the "fail-open" window), the stale version-0 entry survives for the full cache TTL and every version-1 request falls back to DB for that whole window. Evicting it here on version mismatch would halve the DB overhead during that window. Fix this →

7. Redundant is_active check in get_current_user_with_graph

# dependencies.py:358
if user and bool(user.is_active):

_get_user_for_verified_jwt already returns None for inactive users, so user.is_active is always True here. The outer check is harmless but misleading.

8. Session version bump on deactivate()

deactivate() now increments session_version. This means if a user is temporarily deactivated and then reactivated, they'll be forced to re-authenticate. That may be intentional but it's a behavior change worth documenting on the method.


Positive Security Notes

  • Grace period path hardened: refresh_session's expired-token decode now validates issuer and audience claims (previously missing for the verify_exp=False decode path). Good catch.
  • Password reset revocation fix: The old code iterated UserToken email/reset records and called revoke_jwt_token on them — those are never JWTs, so the revocation was a no-op. The new user.invalidate_sessions() correctly bumps the version and invalidates all JWT sessions.
  • Fail-open is explicitly documented: The comment in _invalidate_auth_cache naming the race condition and its bounded window is exactly the right level of documentation for a security-adjacent tradeoff.
  • key_fingerprint backfill strategy is clearly documented: The deploy note in user_api_key.py about pre-migration rows is thorough.

Test Coverage

The test restructuring is clean — separating TestJWTTokenVerification (claim-level, no DB) from TestGetUserForVerifiedJWT (DB/version comparison layer) mirrors the new design boundary well. The autouse _stub_session_version fixture in test_jwt_revocation.py is a good pattern for keeping these tests database-independent.

One gap: there's no test for the case where get_cached_jwt_user_data finds a cached entry but its session_version doesn't match the token (i.e., the invalidation raced with a cache hit). A unit test for that path in TestGetUserForVerifiedJWT would be good to have.


Comment thread robosystems/models/core/user/user_api_key.py Fixed
Comment thread robosystems/middleware/auth/cache.py Fixed
@jfrench9
Copy link
Copy Markdown
Member Author

jfrench9 commented May 1, 2026

@claude have we resolved all issues

@claude
Copy link
Copy Markdown

claude Bot commented May 1, 2026

Claude finished @jfrench9's task in 3m 23s —— View job


Tasks


Issue Status Review

Not all issues from the previous review have been resolved. Here's the current status:

✅ Resolved

  • Pre-existing API key cache bug_invalidate_cache now uses key_fingerprint (SHA-256) matching the cache key, fixing the bcrypt/SHA-256 mismatch that silently no-op'd every cache invalidation.
  • Dead code: cache_jwt_validation / get_cached_jwt_validation — these old per-token JWT cache functions are no longer present in the codebase.
  • api_key_cache.invalidate_jwt_token call in session.py:264 — that call is gone; the correct revoke_jwt_token is used instead.
  • Missing test for cache version mismatch pathtest_dependencies.py:413 (test_cache_version_mismatch_falls_through_to_db) covers the dependency behavior when get_cached_jwt_user_data returns None on a version mismatch.

❌ Still Open

Medium #1create_jwt_token performs a redundant DB query (jwt.py:252)

_get_user_session_version still exists and is called unconditionally; when a session is passed it queries the DB even though every caller already holds a loaded User. Refactoring to accept session_version: int directly would eliminate both the helper and the extra query.

# jwt.py:252 — still queries DB:
"session_version": _get_user_session_version(user_id, session=session) or 0,

Fix this →


Medium #3time.sleep(0.05) inside _invalidate_auth_cache (user.py:217)

The blocking sleep is still there on the retry path. Not critical, but a 50ms thread block on every Redis blip that hits a password reset or deactivation.


CodeQL false positive — SHA-256 flagged as "weak password hashing" (user_api_key.py:310)

The CodeQL alert is a false positive: _fingerprint_api_key uses SHA-256 as a cache lookup key for API keys, not for password storage. The actual password hashing uses bcrypt (lines 323-330). Adding a suppression comment or inline # nosec annotation would silence the scanner without changing behavior.


Code quality — Silent empty except in cache.py:878-880

except Exception:
    pass   # ← no log, no explanation

The github-code-quality bot flagged this. Changing to except Exception as e: logger.debug(f"cleanup failed: {e}") would satisfy the linter and preserve the best-effort semantics.
Fix this →


Low #5get_cached_jwt_user_data name/docstring mismatch (cache.py:816)

The function returns the full auth envelope ({ user_data, session_version, cached_at, cache_version }) but its name and docstring imply it returns user data. Callers correctly unwrap with .get("user_data", {}), but the mismatch is a readability hazard. Consider renaming to get_cached_jwt_auth_entry.


Low #6 — Stale cache entries not evicted on session_version mismatch (cache.py:844-850)

When cached_version != int(session_version), the stale entry is returned as None but not deleted. During the fail-open window (invalidation failed), every request for the new version falls back to DB for the full cache TTL.

if cached_version != int(session_version):
    logger.debug(...)
    return None   # ← should also delete cache_key, signature_key here

Fix this →


Low #7 — Redundant is_active check (dependencies.py:358)

if user and bool(user.is_active):   # user.is_active is always True here

_get_user_for_verified_jwt already returns None for inactive users (line 150), so the outer bool(user.is_active) guard is always True and misleads readers into thinking user might be inactive at this point.


Low #8 — deactivate() session version bump is undocumented (user.py:138)

deactivate() increments session_version, meaning a reactivated user must re-authenticate. This behavior change is not noted in the method docstring.


Summary

4 of the 12 flagged issues are resolved. The 8 remaining are all low-to-medium severity with the most impactful being the redundant DB query in create_jwt_token (medium) and the empty except flagged by the automated quality scanner (will block merge if gated on CodeQL/quality checks).

Comment thread robosystems/models/core/user/user_api_key.py Dismissed
@jfrench9 jfrench9 merged commit feb4f1f into main May 1, 2026
11 checks passed
@jfrench9 jfrench9 deleted the refactor/jwt-cache-invalidation-fix branch May 1, 2026 02:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants