Skip to content

feat:[ENG-2073] add short-lived result cache in SwarmCoordinator#390

Merged
danhdoan merged 2 commits intoproj/swarm-memoryfrom
feat/ENG-2073
Apr 13, 2026
Merged

feat:[ENG-2073] add short-lived result cache in SwarmCoordinator#390
danhdoan merged 2 commits intoproj/swarm-memoryfrom
feat/ENG-2073

Conversation

@cuongdo-byterover
Copy link
Copy Markdown
Collaborator

Problem:** The agent calls swarm_query multiple times during a single conversation, and each call re-queries all providers from scratch (340ms+). GBrain subprocess alone takes 200-300ms. Identical queries seconds apart produce identical results — wasted work.

  • Why it matters: Agent conversations involve bursts of tool calls. A 3-query burst takes 1s+ when the results are already known. This is noticeable latency in the agent's response loop.
  • What changed: Added a short-lived, in-memory LRU result cache in SwarmCoordinator. Identical queries within a 10-second TTL window return instantly from cache. Cache is invalidated on store() (content changed) and refreshHealth() (provider availability changed). Configurable via performance.result_cache_ttl_ms (default 10s, 0 disables).
  • What did NOT change (scope boundary): No changes to the query pipeline, provider adapters, search precision, result merging, write routing, CLI output formatting, or agent tool interface. The cache is fully transparent — same SwarmQueryResult type, same JSON output structure.

Type of change

  • Bug fix
  • New feature
  • Refactor (no behavior change)
  • Documentation
  • Test
  • Chore (build, dependencies, CI)

Scope (select all touched areas)

  • TUI / REPL
  • Agent / Tools
  • LLM Providers
  • Server / Daemon
  • Shared (constants, types, transport events)
  • CLI Commands (oclif)
  • Hub / Connectors
  • Cloud Sync
  • CI/CD / Infra

Linked issues

  • Closes ENG-2073

Root cause (bug fixes only, otherwise write N/A)

N/A

Test plan

  • Coverage added:
    • Unit test
    • Integration test
    • Manual verification only
  • Test file(s):
    • test/unit/agent/swarm/swarm-coordinator.test.ts (+12 tests)
    • test/unit/agent/swarm/provider-factory.test.ts (type alignment)
  • Key scenario(s) covered:
    • Cache hit: identical query → graph not called, result returned in 0ms
    • Cache hit: case insensitive ("JWT Refresh" → "jwt refresh")
    • Cache hit: whitespace normalized ("jwt refresh" → "jwt refresh")
    • Cache miss: different query, different scope, different maxResults
    • TTL expiry: wait > TTL → cache miss, full re-execution
    • TTL 0: disables caching entirely
    • Invalidation: store() clears cache on successful write
    • Invalidation: refreshHealth() clears cache
    • LRU eviction: 21 queries → oldest entry evicted, newest retained
    • Mutation safety: mutating returned result doesn't corrupt cache
    • Clone verification: cached results are different object references
  • Manual verification (in-process):
    • First query: 417ms → cached: 0ms → after store invalidation: 248ms

User-visible changes

  • New config option: performance.result_cache_ttl_ms (default: 10000, 0 to disable)
  • Agent tool swarm_query returns faster on repeated identical queries within 10s
  • No change to CLI behavior (each CLI invocation creates a new coordinator)
  • No change to output format or structure

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets

In-process manual test:

Test 1: First query        417ms, 7 results
Test 2: Cached (same)        0ms, 7 results  PASS
Test 3: Cached (CAPS)        0ms, 7 results  PASS
Test 4: Cache miss (new)   256ms, 1 results  PASS
Test 5: Mutation safety    results intact     PASS
Test 6: After store()     248ms, re-executed  PASS

Checklist

  • Tests added or updated and passing (npm test) — 5892 passing
  • Lint passes (npm run lint) — pre-commit hook verified
  • Type check passes (npm run typecheck)
  • Build succeeds (npm run build)
  • Commits follow Conventional Commits format
  • Documentation updated (if applicable)
  • No breaking changes (or clearly documented above)
  • Branch is up to date with main

Risks and mitigations

  • Risk: Cached result returned by reference could be mutated by downstream consumers, corrupting the cache.

    • Mitigation: Both cache store and cache hit clone the result object ({...result, results: [...result.results]}). Verified by mutation-safe test.
  • Risk: Cache key collision if query text contains the delimiter character.

    • Mitigation: Cache key uses JSON.stringify([query, scope, maxResults]) — no delimiter ambiguity.
  • Risk: Stale TTL-expired entries could accumulate in memory.

    • Mitigation: Expired entries are deleted on access. Max 20 entries with LRU eviction. Cache dies with the coordinator instance (one per agent session).
  • Risk: totalQueries counter could undercount when cache hits bypass the main execute path.

    • Mitigation: totalQueries++ is called on both cache hit and cache miss paths.

- Add resultCacheTtlMs (default 10s) to performance config schema
- Cache check at start of execute() returns instantly on hit
- LRU semantics via delete+re-set on cache hit (Map insertion order)
- Cache store after successful execution with max 20 entries
- Invalidation on store() (content changed) and refreshHealth()
  (provider availability changed)
- TTL 0 disables caching entirely
- 11 test cases covering hits, misses, TTL, invalidation, LRU eviction
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 13, 2026

Claude finished @cuongdo-byterover's task in 2m 29s —— View job


PR Review: feat:[ENG-2073] add short-lived result cache in SwarmCoordinator

  • Gather context
  • Read changed files
  • Review implementation
  • Check tests
  • Post final review

Summary

Overall Score: 4 / 5

The implementation is clean, well-tested, and transparent to callers. The LRU semantics are correctly implemented via Map insertion-order, mutation safety is addressed with shallow cloning, and the config schema change is consistent with existing patterns. The cache invalidation strategy (on store() success and refreshHealth()) is conservative and correct.

Two correctness bugs in buildCacheKey block a full score — both involve cache parameters that affect query routing but are not included in the key.


Issues

suggestion: Cache key missing request.type — can return wrong cached result

File: src/agent/infra/swarm/swarm-coordinator.ts, lines 415–419

QueryRequest.type is an optional caller-supplied override that bypasses classifyQuery() and directly controls which providers are selected via selectProviders(queryType, healthyIds). If two calls share the same query text but differ in type (e.g., type: 'factual' vs type: 'temporal'), the second call will get a cache hit from a different provider selection — silently returning wrong results.

// Current — type is not included
private buildCacheKey(request: QueryRequest): string {
  const q = request.query.toLowerCase().trim().replaceAll(/\s+/g, ' ')
  const scope = request.scope ?? ''
  const max = request.maxResults ?? this.config.routing.defaultMaxResults
  return JSON.stringify([q, scope, max])
}

// Fix — add type (undefined serializes consistently in JSON.stringify)
private buildCacheKey(request: QueryRequest): string {
  const q = request.query.toLowerCase().trim().replaceAll(/\s+/g, ' ')
  const scope = request.scope ?? ''
  const max = request.maxResults ?? this.config.routing.defaultMaxResults
  return JSON.stringify([q, scope, max, request.type])
}

Fix this →


suggestion: Cache key missing request.timeRange — can return stale temporal results

File: src/agent/infra/swarm/swarm-coordinator.ts, lines 415–419

request.timeRange is forwarded to this.graph.execute(request, activeIds) and from there to individual provider adapters as a temporal filter. Two calls with the same query but different time ranges (e.g., after: Date.now() - 3600_000 vs after: Date.now() - 86400_000) will hit the same cache entry. This is especially problematic for providers that have strong temporal filtering (e.g., hindsight).

// Fix — include timeRange in key
return JSON.stringify([q, scope, max, request.type, request.timeRange])

Fix this →


nitpick: Shallow clone doesn't protect nested object mutations

File: src/agent/infra/swarm/swarm-coordinator.ts, lines 201 and 279–281

[...result.results] copies the array reference but not the QueryResult objects inside it. A consumer mutating r1.results[0].score = 0 or r1.results[0].metadata.source = 'tampered' would corrupt the cached entry through the shared reference.

The PR's mutation-safe test only covers the case of replacing the array (r1.results.length = 0), not mutating items within it. Whether this matters in practice depends on downstream consumers. If QueryResult is treated as immutable everywhere, the current approach is fine — but worth documenting.

// Option A: document the assumption
// Cache store — clones array reference; individual QueryResult items are shared.
// Callers must treat QueryResult as immutable.
result: {...result, results: [...result.results]},

// Option B: deep-clone items (zero-cost at 10s TTL, avoids a whole class of bugs)
result: {...result, results: result.results.map(r => ({...r, metadata: {...r.metadata}}))},

nitpick: resultCacheTtlMs = 0 still writes entries to cache

File: src/agent/infra/swarm/swarm-coordinator.ts, lines 278–283

When resultCacheTtlMs = 0, cache entries are written but can never be retrieved (since Date.now() - timestamp < 0 is always false). Entries accumulate up to the 20-slot LRU limit and trigger LRU eviction runs on every query — pure overhead. Simple guard:

if (this.resultCacheTtlMs > 0) {
  this.resultCache.set(cacheKey, {
    result: {...result, results: [...result.results]},
    timestamp: Date.now(),
  })
  this.evictIfOverSize()
}

nitpick: resultCache and healthCache can be readonly

File: src/agent/infra/swarm/swarm-coordinator.ts, lines 157–160

Both maps are only mutated in-place (.set(), .delete(), .clear()) and never reassigned. Adding readonly documents intent and prevents accidental reassignment:

private readonly healthCache: Map<string, boolean> = new Map()
private readonly resultCache: Map<string, {result: SwarmQueryResult; timestamp: number}> = new Map()

What's done well

  • LRU semantics via Map insertion-order + delete-then-set on hit is idiomatic and correct.
  • Cache invalidation is conservative (clears on any successful store(), any refreshHealth()) — right tradeoff for a 10s TTL.
  • resultCacheTtlMs: z.number().int().min(0) correctly allows 0 (unlike sibling fields that use .positive()).
  • totalQueries++ on cache hit path — good, keeps accounting consistent.
  • Test coverage is thorough: 12 cases covering hit, miss, case normalization, whitespace, TTL expiry, TTL=0, both invalidation paths, LRU eviction, and mutation safety. The eviction test pattern (21 entries, verify oldest evicted, newest retained) is exactly right.
  • JSON.stringify as key avoids delimiter-collision issues that string concatenation would introduce.

- Add request.type and request.timeRange to cache key to prevent
  cross-type and cross-timerange cache collisions
- Deep-clone QueryResult items (including metadata) on both cache
  store and cache hit to prevent nested object mutation
- Skip cache store when resultCacheTtlMs=0 to avoid dead writes
  and unnecessary LRU eviction runs
- Mark healthCache and resultCache as readonly (mutated in-place
  via .set/.delete/.clear, never reassigned)
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 13, 2026

Claude finished @cuongdo-byterover's task in 3m 0s —— View job


PR Review: feat:[ENG-2073] add short-lived result cache in SwarmCoordinator

  • Gather context
  • Read changed files
  • Review implementation
  • Check tests
  • Post final review

Summary

Overall Score: 5 / 5

All five issues from the previous review are resolved. The implementation is correct, well-tested, and transparent to callers. Two new minor observations below — neither is a blocker.

Previous issues — all fixed:

  • request.type added to buildCacheKey ✓ (swarm-coordinator.ts:421)
  • request.timeRange added to buildCacheKey ✓ (swarm-coordinator.ts:421)
  • Deep clone upgraded to include metadata: {...r, metadata: {...r.metadata}} on both store and hit paths ✓ (swarm-coordinator.ts:201, 281)
  • if (this.resultCacheTtlMs > 0) guard prevents writes when TTL=0 ✓ (swarm-coordinator.ts:279)
  • Both healthCache and resultCache now readonly ✓ (swarm-coordinator.ts:157, 160)

Issues

suggestion: Missing tests for type and timeRange cache key discrimination

src/agent/infra/swarm/swarm-coordinator.ts, lines 417–421

Both request.type and request.timeRange were added to the cache key as correctness fixes from the previous review. The tests cover cache misses for query, scope, and maxResults, but there are no corresponding tests for the two new key components. If the type/timeRange fields are later accidentally dropped from buildCacheKey, nothing will catch it.

// Suggested additions to swarm-coordinator.test.ts:

it('cache miss for different type', async () => {
  await coordinator.execute({query: 'auth', type: 'factual'})
  await coordinator.execute({query: 'auth', type: 'temporal'})
  expect((p1.query as sinon.SinonStub).callCount).to.equal(2)
})

it('cache miss for different timeRange', async () => {
  await coordinator.execute({query: 'auth', timeRange: {after: 1_000}})
  await coordinator.execute({query: 'auth', timeRange: {after: 2_000}})
  expect((p1.query as sinon.SinonStub).callCount).to.equal(2)
})

Fix this →


nitpick: meta.providers not cloned — consumer mutation reaches the cache

src/agent/infra/swarm/swarm-coordinator.ts, lines 201 and 279–281

The spread {...cached.result, results: ...} on both the store and hit paths clones the top-level SwarmQueryResult shape and deep-clones each QueryResult item (including metadata). However, meta.providers (Record<string, ProviderQueryMeta>) is still a shared reference — a consumer mutating result.meta.providers['byterover'].latencyMs = 0 would corrupt the cached entry. In practice this is unlikely since meta is read-only in the agent tool, but it's inconsistent with the mutation-safety intent applied to results[i].metadata.

// Option: clone meta.providers as well
return {
  ...cached.result,
  meta: {...cached.result.meta, providers: {...cached.result.meta.providers}},
  results: cached.result.results.map((r) => ({...r, metadata: {...r.metadata}})),
}

What's done well

  • All five correctness/quality issues from the previous review are addressed cleanly without scope creep.
  • buildCacheKey completeness: JSON.stringify([q, scope, max, request.type, request.timeRange]) now covers all fields that influence provider selection and result filtering.
  • resultCacheTtlMs > 0 guard: Zero-TTL no longer writes phantom entries that burn LRU slots.
  • readonly on both Maps: Documents and enforces the invariant that the map reference is never replaced, only mutated in-place.
  • Deep clone for results[i].metadata: Upgrades the previous shallow clone to protect nested objects within result items.
  • Test coverage: 12 cases covering hit, miss (query/scope/maxResults), case normalization, whitespace, TTL expiry, TTL=0, both invalidation paths, LRU eviction, and mutation safety — all correct and well-structured.
  • LRU eviction via delete+set on cache hit correctly promotes accessed entries to the "newest" end of Map insertion order.

Copy link
Copy Markdown
Collaborator

@danhdoan danhdoan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@danhdoan danhdoan merged commit 2ea6993 into proj/swarm-memory Apr 13, 2026
6 checks passed
@bao-byterover bao-byterover mentioned this pull request Apr 14, 2026
29 tasks
@cuongdo-byterover cuongdo-byterover deleted the feat/ENG-2073 branch April 14, 2026 14:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants