Add LRU capacity to ValidatingCache, remove sentinel pattern, add storage Update#4669
Add LRU capacity to ValidatingCache, remove sentinel pattern, add storage Update#4669
Conversation
There was a problem hiding this comment.
Large PR Detected
This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.
How to unblock this PR:
Add a section to your PR description with the following format:
## Large PR Justification
[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformationAlternative:
Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.
See our Contributing Guidelines for more details.
This review will be automatically dismissed once you add the justification section.
There was a problem hiding this comment.
Pull request overview
This PR introduces a fully-typed, capacity-bounded LRU ValidatingCache and updates the session manager/storage integration to remove the prior sentinel-based anti-resurrection approach by using a conditional storage Update (SET XX semantics).
Changes:
- Add
pkg/cache.ValidatingCache[K,V]with optional LRU capacity and singleflight-deduplicated loads. - Extend
transport/session.DataStoragewithUpdate(conditional overwrite) and wire it into session metadata updates/decoration to prevent resurrection after deletes. - Replace sessionmanager’s internal cache/sentinel pattern with the shared cache package and simplify termination flow.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/vmcp/server/sessionmanager/session_manager.go | Switch to pkg/cache.ValidatingCache, remove sentinel-based termination/metadata logic, use storage Update. |
| pkg/vmcp/server/sessionmanager/session_manager_test.go | Update tests to the typed cache and storage Update behavior; adjust race/resurrection coverage. |
| pkg/vmcp/server/sessionmanager/factory.go | Add FactoryConfig.CacheCapacity to configure node-local cache size. |
| pkg/vmcp/server/sessionmanager/cache.go | Remove the old RestorableCache implementation and sentinel-related APIs. |
| pkg/transport/session/session_data_storage.go | Add DataStorage.Update to the interface contract (SET XX semantics). |
| pkg/transport/session/session_data_storage_redis.go | Implement Redis-backed Update via SET XX + TTL. |
| pkg/transport/session/session_data_storage_local.go | Implement local Update with mutex to be atomic w.r.t. Delete. |
| pkg/cache/validating_cache.go | New shared typed cache with LRU capacity, per-hit validation, and singleflight miss deduplication. |
| pkg/cache/validating_cache_test.go | Move/reshape cache tests to cover LRU eviction and typed APIs. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #4669 +/- ##
==========================================
+ Coverage 68.60% 68.62% +0.01%
==========================================
Files 515 515
Lines 53518 53641 +123
==========================================
+ Hits 36718 36811 +93
- Misses 13958 13975 +17
- Partials 2842 2855 +13 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Large PR justification has been provided. Thank you!
|
✅ Large PR justification has been provided. The size review has been dismissed and this PR can now proceed with normal review. |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (1)
pkg/vmcp/server/sessionmanager/session_manager.go:142
- The onEvict callback now runs for both expiry-driven evictions and capacity/LRU evictions, but the log message says "evicted expired session". This will be misleading once CacheCapacity is set. Consider adjusting the message (or passing eviction reason through the cache) so LRU evictions aren’t reported as expirations.
func(id string, sess vmcpsession.MultiSession) {
if closeErr := sess.Close(); closeErr != nil {
slog.Warn("session cache: error closing evicted session",
"session_id", id, "error", closeErr)
}
slog.Warn("session cache: evicted expired session from node-local cache",
"session_id", id)
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
36eeed9 to
49e8631
Compare
…pdate doc comment
…-healing Remove the Delete/Peek/CompareAndSwap methods from ValidatingCache — the cache now self-heals lazily: on the next Get, checkSession loads from storage and evicts if the session is gone or terminated. Terminate detects phase (full MultiSession vs placeholder) by loading from storage and checking MetadataKeyTokenHash rather than Peeking the cache. Full sessions trigger a storage.Delete; placeholders are marked terminated in storage. The node-local cache catches up on next access. DecorateSession drops the CAS+rollback pattern in favour of sessions.Set followed by storage.Update (SET XX), which acts as the concurrency guard. A concurrent termination between the two calls is detected by Update returning false. LocalSessionDataStorage replaces its mixed sync.Map+sync.Mutex with a plain map guarded exclusively by a single sync.Mutex, eliminating the dual-primitive complexity. The Update contract tests are extended to cover missing-key, existing-key, and post-Delete cases. Add defaultCacheCapacity = 1000 so capacity=0 no longer means "unlimited"; document the long-term goal of unifying cache and storage behind a single interface.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.
Comments suppressed due to low confidence (1)
pkg/cache/validating_cache_test.go:405
- These tests use
sync.WaitGroup.Go(...), but elsewhere in this same filesync.WaitGroupis used withAdd+go+Done. Unless you’re relying on a Go version that addsWaitGroup.Go, this won’t compile; even if it does, the mixed styles are confusing. Consider rewriting these to the standardAdd/Donepattern (or useerrgroup.Groupif you want aGohelper).
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
JAORMX
left a comment
There was a problem hiding this comment.
Nice work on this one! The move to hashicorp/golang-lru/v2 and the storage.Update (SET XX) pattern are solid choices. The sentinel removal cleans things up a lot. I do have a few things I'd like to discuss before we merge though.
The most important one I couldn't leave as an inline comment because server.go isn't in the diff:
pkg/vmcp/server/server.go:453-460 builds the FactoryConfig without setting CacheCapacity. Go defaults int to 0, and sessionmanager.New (line 105) now rejects anything < 1. So the vMCP server won't start on this branch. This needs to be wired from config or given a sensible default.
Other things worth discussing:
- Phase 1 Terminate still uses
Upsertinstead ofUpdate(line 485), which can recreate a deleted key... the exact race we're fixing. - Phase 2 Terminate deletes from storage but doesn't evict from cache.
ValidatingCachehas noDeletemethod. Terminated sessions hold backend connections until the nextGetor LRU pressure. DecorateSessionwrites cache before storage (line 722). Flipping the order would avoid cache/storage divergence on transient storage errors.- Duplicate Update test cases in
session_data_storage_test.go. cleanupFailedPlaceholdermutates the caller's map in place (line 392).
See the inline comments for details on each. None of these are blocking... but the server.go wiring and the Upsert on line 485 are pretty important to address.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
jerm-dro
left a comment
There was a problem hiding this comment.
Minor suggestions inline — all can be addressed in follow-up PRs.
Summary
Fixes #4494
Type of change
Test plan
task test)task test-e2e)task lint-fix)Changes
Does this introduce a user-facing change?
Special notes for reviewers
Large PR Justification
This is a complete pr covering a single functionality. Cannot be split.