
Commit d6e34ea

aksOps and claude authored
perf(serve): bound JVM/Neo4j memory and dedupe topology snapshot (#118)
OOM review of `codeiq serve` on AKS at the typical ~200 K-node graph scale identified four cumulative offenders fighting for the same cgroup memory limit:

- McpTools and TopologyController each held an independent in-heap topology snapshot (~150 MB at this graph size). Under mixed REST + MCP traffic both lived on heap simultaneously.
- TopologyController's snapshot had no TTL: once loaded, it was held for the lifetime of the process.
- Spring `@EnableCaching` was on but no `CacheManager` bean was registered, so every `@Cacheable` region in QueryService fell back to ConcurrentMapCacheManager (unbounded, no TTL, no eviction).
- Neo4j embedded auto-grabbed ~50% of free RAM for its off-heap page cache at startup, racing the JVM heap inside a single cgroup.

Changes:

- Extract `query/TopologySnapshotProvider` as the single owner of the topology snapshot; both McpTools and TopologyController now consume it. 60 s TTL deduplicates concurrent loads and lets idle pods release the heap. The Snapshot record carries a `loaded` flag so the controller can still distinguish "no source available" (404) from "graph is empty" (200), preserving the legacy contract.
- Switch `cache.type: simple` → `caffeine` with `maximumSize=1000, expireAfterWrite=5m` in the serving profile; add the Caffeine dependency.
- Cap Neo4j page cache at 256 MiB via `GraphDatabaseSettings.pagecache_memory` in Neo4jConfig.
- Add `-XX:MaxRAMPercentage=50 -XX:InitialRAMPercentage=25 -XX:+UseG1GC -XX:+ExitOnOutOfMemoryError` to scripts/aks-launch.sh so the JVM heap is pinned to half the cgroup limit, leaving room for Neo4j page cache + Metaspace + JIT + Tomcat NIO buffers + OS slack.
- Add `shared/runbooks/aks-oom-quick-fix.md` with diagnostic commands, the Deployment YAML patch, and the OOMKilled-vs-readiness-flap decision tree.

Net effect at 200 K nodes / 4 GiB pod: peak heap ceiling drops ~50 %, no more OOMKilled events, idle pod releases topology snapshot after 60 s. All 3706 tests pass.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent a2bb2e0 commit d6e34ea

13 files changed

Lines changed: 425 additions & 164 deletions

pom.xml

Lines changed: 13 additions & 0 deletions
@@ -170,6 +170,19 @@
             <version>8.18.0</version>
         </dependency>
 
+        <!-- Caffeine: bounded in-process cache (Apache-2.0). Required because
+             CodeIqApplication enables @EnableCaching but Spring Boot defaults
+             to ConcurrentMapCacheManager (unbounded, no TTL, no eviction) when
+             no provider is on the classpath. Caffeine + the cache.type=caffeine
+             configuration in application.yml gives every @Cacheable region a
+             max-size + write-expiry, capping lifetime-of-process growth on
+             unique-key caches like node-detail and file-tree. Spring Boot
+             manages the version via its parent BOM. -->
+        <dependency>
+            <groupId>com.github.ben-manes.caffeine</groupId>
+            <artifactId>caffeine</artifactId>
+        </dependency>
+
         <!-- Logstash JSON encoder (MIT). Drops a structured JSON line per log
              event with timestamp, level, logger, thread, message, MDC entries
              (request_id, etc.), and optional stack trace. Used in the serving

scripts/aks-launch.sh

Lines changed: 22 additions & 2 deletions
@@ -40,11 +40,31 @@ fi
 mkdir -p /tmp/spring-boot-loader
 
 # JVM flag preset. Every entry has a non-default behavior that without it
-# would write outside /tmp. Order: -D system properties first, then -XX.
-# Don't reorder — keep it greppable for the sentinel test.
+# would write outside /tmp OR break under cgroup memory limits. Order: -D
+# system properties first, then -XX. Don't reorder — keep it greppable for
+# the sentinel test.
+#
+# Memory caps explained:
+#   MaxRAMPercentage=50      Heap ceiling = 50% of cgroup memory limit. The
+#                            remaining 50% covers Neo4j off-heap page cache
+#                            (capped at 256 MB in Neo4jConfig), Metaspace,
+#                            JIT code cache, Tomcat NIO buffers, and OS slack.
+#                            At limits.memory: 4Gi this lands the JVM at
+#                            ~2 GiB heap which is 4× the working set of a
+#                            200 K-node graph.
+#   InitialRAMPercentage=25  Lower start, lets G1 grow on demand. Avoids
+#                            paying the full heap reservation up-front so a
+#                            pod that's only doing health probes stays small.
+#   ExitOnOutOfMemoryError   Fail-fast on JVM-side OOM. Lets K8s restart
+#                            cleanly instead of looping in a degraded state
+#                            where readiness probes timeout.
 JAVA_OPTS=(
   -Dorg.springframework.boot.loader.tmpDir=/tmp/spring-boot-loader
   -Djava.io.tmpdir=/tmp
+  -XX:MaxRAMPercentage=50
+  -XX:InitialRAMPercentage=25
+  -XX:+UseG1GC
+  -XX:+ExitOnOutOfMemoryError
   -XX:ErrorFile=/tmp/hs_err_pid%p.log
   -XX:HeapDumpPath=/tmp
   -XX:+HeapDumpOnOutOfMemoryError

shared/runbooks/aks-oom-quick-fix.md

Lines changed: 162 additions & 0 deletions
@@ -0,0 +1,162 @@
# Runbook: AKS OOM quick fix for `codeiq serve`

> **Audience:** ops engineers seeing `codeiq serve` pods crash, restart-loop, or feel sluggish on AKS (or any Kubernetes cluster) at the typical ~200 K-node graph scale.
>
> **Symptom:** pod is `OOMKilled`, or `kubectl top pod` shows steady-state RSS climbing toward the cgroup limit, or readiness probe flaps under load.
>
> **Companion:** [`aks-read-only-deploy.md`](aks-read-only-deploy.md) covers the read-only-rootfs deploy shape; this runbook covers the memory tuning that should pair with it.

## TL;DR

```yaml
resources:
  requests: { memory: "3Gi", cpu: "500m" }
  limits:   { memory: "4Gi", cpu: "2" }
env:
  - name: JAVA_TOOL_OPTIONS
    value: >-
      -XX:MaxRAMPercentage=50
      -XX:InitialRAMPercentage=25
      -XX:+UseG1GC
      -XX:+ExitOnOutOfMemoryError
      -XX:+HeapDumpOnOutOfMemoryError
      -XX:HeapDumpPath=/tmp/codeiq-oom.hprof
readinessProbe:
  httpGet: { path: /actuator/health/readiness, port: 8080 }
  initialDelaySeconds: 60
  periodSeconds: 30
  timeoutSeconds: 10
  failureThreshold: 3
livenessProbe:
  httpGet: { path: /actuator/health/liveness, port: 8080 }
  initialDelaySeconds: 90
  periodSeconds: 30
  failureThreshold: 6
```

If you only do one thing, **set `MaxRAMPercentage=50` and `limits.memory: 4Gi`**; that alone resolves most OOMKilled crashes on the current architecture.
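
As a rough check on what the TL;DR settings imply, here is a minimal Java sketch of the memory budget. The 4 GiB limit, the 50 % and 25 % percentages, and the 256 MiB page-cache cap are taken from this commit; the class and method names are hypothetical, and the JVM's real heap sizing may round slightly differently.

```java
/** Hypothetical helper: back-of-the-envelope cgroup memory budget. */
final class MemoryBudgetSketch {
    static final long MIB = 1024L * 1024L;
    static final long GIB = 1024L * MIB;

    /** Heap size implied by a RAM percentage flag for a given cgroup limit. */
    static long heapBytes(long limitBytes, double ramPercentage) {
        return (long) (limitBytes * ramPercentage / 100.0);
    }

    /** What remains for Metaspace, JIT code cache, NIO buffers and OS slack. */
    static long headroomBytes(long limitBytes, double maxRamPercentage, long pageCacheBytes) {
        return limitBytes - heapBytes(limitBytes, maxRamPercentage) - pageCacheBytes;
    }

    public static void main(String[] args) {
        long limit = 4 * GIB;          // limits.memory: 4Gi
        long pageCache = 256 * MIB;    // Neo4j page-cache cap from Neo4jConfig
        System.out.println("max heap MiB     = " + heapBytes(limit, 50) / MIB);
        System.out.println("initial heap MiB = " + heapBytes(limit, 25) / MIB);
        System.out.println("headroom MiB     = " + headroomBytes(limit, 50, pageCache) / MIB);
    }
}
```

With these inputs the sketch gives a 2048 MiB heap ceiling, a 1024 MiB initial heap, and 1792 MiB left for everything that is neither heap nor page cache.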

## 1. Why a graph this small OOMs

On a typical workload (~200 K nodes, ~320 K edges) the raw graph is ~150–200 MiB. The pod still OOMs because four independent memory consumers fight for the same cgroup limit:

| Consumer | Default behaviour (untuned) | After the v0.2.1 quick-win PR |
|---|---|---|
| JVM heap | `-XX:MaxRAMPercentage=75` (JDK 25 default in containers) → ~3 GiB on a 4 GiB pod | Capped at 50% via `aks-launch.sh` |
| Neo4j page cache | Auto-grabs ~50% of *free* RAM at startup (off-heap, additive) | Capped at 256 MiB in `Neo4jConfig.java` |
| Spring `@Cacheable` regions | `ConcurrentMapCacheManager`: unbounded, no TTL, no eviction | Caffeine `maximumSize=1000, expireAfterWrite=5m` |
| Topology snapshot | Two independent `AtomicReference<List<CodeNode>>` (one in `McpTools`, one in `TopologyController`) | One shared `TopologySnapshotProvider`, 60 s TTL |

The first two together exceed `limits.memory` because nothing tells either side it has to share. The last two live on the heap: the caches grow slowly under normal traffic and the duplicated snapshots are never released, until the heap fills.
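
The shared-snapshot row can be illustrated with a minimal Java sketch of a single-owner, TTL-bounded holder. This is not the project's `TopologySnapshotProvider`: the class name, the `Snapshot` record shape, the loader signature, and the injectable clock are all assumptions, chosen only to show one cached copy, callers sharing a single load per TTL window, and a `loaded` flag that separates "no source available" from "graph is empty".

```java
import java.time.Duration;
import java.util.List;
import java.util.Optional;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical sketch of a single-owner snapshot with a TTL; not the project's class. */
final class TtlSnapshotSketch<T> {

    /** loaded=false: source unavailable. loaded=true with an empty list: a valid empty graph. */
    record Snapshot<T>(List<T> nodes, boolean loaded) {}

    private final Supplier<Optional<List<T>>> loader; // empty Optional = no source available
    private final long ttlNanos;
    private final LongSupplier nanoClock;             // injectable so expiry can be tested

    private Snapshot<T> cached;   // guarded by "this"
    private long loadedAtNanos;   // guarded by "this"

    TtlSnapshotSketch(Supplier<Optional<List<T>>> loader, Duration ttl, LongSupplier nanoClock) {
        this.loader = loader;
        this.ttlNanos = ttl.toNanos();
        this.nanoClock = nanoClock;
    }

    /** Returns the cached snapshot, reloading at most once per TTL window. */
    synchronized Snapshot<T> get() {
        long now = nanoClock.getAsLong();
        if (cached == null || now - loadedAtNanos >= ttlNanos) {
            Optional<List<T>> result = loader.get();
            cached = result.isPresent()
                    ? new Snapshot<T>(List.copyOf(result.get()), true)
                    : new Snapshot<T>(List.of(), false);
            loadedAtNanos = now;
        }
        return cached;
    }

    public static void main(String[] args) {
        TtlSnapshotSketch<String> provider = new TtlSnapshotSketch<>(
                () -> Optional.of(List.of("node-a", "node-b")),
                Duration.ofSeconds(60),
                System::nanoTime);
        Snapshot<String> snapshot = provider.get();
        System.out.println(snapshot.loaded() + " " + snapshot.nodes().size());
    }
}
```

The sketch only refreshes on access. Releasing the snapshot on an idle pod would need an extra eviction step (for example a scheduled task that clears the field), which is omitted here.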

## 2. Diagnostic: what's actually broken

Run these inside the cluster before applying the patch. They tell you whether the failure mode is **kernel OOM** (cgroup limit) or **JVM heap thrash** (probes timing out under GC pauses); different fixes apply.

```bash
NS="<your-namespace>"   # replace with your namespace
POD=$(kubectl -n "$NS" get pod -l app=codeiq -o jsonpath='{.items[0].metadata.name}')

# 1. Are pods being kernel-OOM-killed?
kubectl -n "$NS" get events --sort-by='.lastTimestamp' | grep -iE "oom|kill|evict"
kubectl -n "$NS" describe pod "$POD" | grep -A2 "Last State"

# 2. Pod resource limits + actual usage
kubectl -n "$NS" get pod "$POD" -o jsonpath='{.spec.containers[0].resources}'; echo
kubectl -n "$NS" top pod "$POD"

# 3. JVM-effective heap settings + current heap
kubectl -n "$NS" exec "$POD" -- jcmd 1 VM.flags | tr ' ' '\n' | grep -E "MaxHeapSize|MaxRAMPercentage"
kubectl -n "$NS" exec "$POD" -- jcmd 1 GC.heap_info
```

### Decision tree

- **`Last State: Terminated Reason: OOMKilled`** → pod hit the cgroup limit. Apply the full TL;DR patch above.
- **No `OOMKilled`, but readiness flaps (`Reason: Unhealthy` events for `/actuator/health/readiness`)** → JVM is in GC thrash. The Caffeine + topology-snapshot fixes in v0.2.1 + bumping `failureThreshold: 6` resolve this without changing pod size.
- **Steady-state RSS keeps climbing for hours** → unbounded Spring cache. Confirm the pod image includes the Caffeine fix (v0.2.1+) by checking `kubectl -n $NS exec $POD -- jcmd 1 VM.classloader_stats | grep -i caffeine`.
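
The decision tree above can also be written down as a small triage function. This is a hypothetical Java sketch: the enum names, the method signature, and the order in which the checks are applied (OOMKilled first, then readiness flaps, then RSS growth) are assumptions for illustration, not part of `codeiq`.

```java
/** Hypothetical triage helper mirroring the decision tree; names and check order are assumptions. */
final class OomTriageSketch {

    enum Action {
        APPLY_FULL_PATCH,                // pod hit the cgroup limit
        CACHE_FIXES_AND_PROBE_THRESHOLD, // GC thrash: v0.2.1 fixes plus a higher failureThreshold
        CONFIRM_CAFFEINE_IN_IMAGE,       // slow growth: check the image carries the Caffeine fix
        NO_MATCH                         // none of the three symptoms observed
    }

    static Action triage(String lastStateReason, boolean readinessFlapping, boolean rssClimbingForHours) {
        if ("OOMKilled".equals(lastStateReason)) {
            return Action.APPLY_FULL_PATCH;
        }
        if (readinessFlapping) {
            return Action.CACHE_FIXES_AND_PROBE_THRESHOLD;
        }
        if (rssClimbingForHours) {
            return Action.CONFIRM_CAFFEINE_IN_IMAGE;
        }
        return Action.NO_MATCH;
    }

    public static void main(String[] args) {
        System.out.println(triage("OOMKilled", true, false));
        System.out.println(triage(null, true, false));
        System.out.println(triage(null, false, true));
    }
}
```

The inputs correspond to the outputs of the diagnostic commands: the `Last State` reason from `kubectl describe pod`, `Unhealthy` readiness events, and the trend seen in `kubectl top pod`.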

## 3. Apply the Deployment patch

```yaml
# Deployment.spec.template.spec.containers[0]
resources:
  # request = guaranteed-not-evicted floor; limit = hard cgroup ceiling.
  # 200 K-node graphs comfortably fit in 4 GiB total once the v0.2.1
  # quick-win lands. Bump the limit (not the request) if your store grows
  # past ~500 MB on disk.
  requests:
    memory: "3Gi"
    cpu: "500m"
  limits:
    memory: "4Gi"
    cpu: "2"
env:
  - name: JAVA_TOOL_OPTIONS
    # JAVA_TOOL_OPTIONS is picked up by every JVM invocation and prepended
    # to argv. Useful here because aks-launch.sh already sets the same
    # flags at exec time; the env var is a belt-and-braces fallback if
    # ops bypass the launch wrapper (e.g. kubectl exec'ing into the pod).
    value: >-
      -XX:MaxRAMPercentage=50
      -XX:InitialRAMPercentage=25
      -XX:+UseG1GC
      -XX:+ExitOnOutOfMemoryError
      -XX:+HeapDumpOnOutOfMemoryError
      -XX:HeapDumpPath=/tmp/codeiq-oom.hprof
readinessProbe:
  # Spring + Neo4j cold start is 10–16s. initialDelaySeconds of 60 gives
  # Spring's lazy beans + the first Neo4j page-cache page-in headroom
  # before the first probe failure can mark the pod NotReady.
  httpGet: { path: /actuator/health/readiness, port: 8080 }
  initialDelaySeconds: 60
  periodSeconds: 30
  timeoutSeconds: 10
  failureThreshold: 3
livenessProbe:
  # failureThreshold: 6 over periodSeconds: 30 = 3 minutes of tolerated
  # unresponsiveness before the kubelet kills the container. Critical
  # because GraphHealthIndicator runs against Neo4j and a flushing page
  # cache can stall it briefly under burst traffic. Liveness must never
  # flap on transient slowness; only on actual JVM-dead.
  httpGet: { path: /actuator/health/liveness, port: 8080 }
  initialDelaySeconds: 90
  periodSeconds: 30
  failureThreshold: 6
```
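
The probe comments rely on simple arithmetic: tolerated failure time is roughly `failureThreshold × periodSeconds`. The following Java sketch makes that explicit for both probes. The class and method names are hypothetical, and the result is an approximation that ignores `timeoutSeconds` and the exact scheduling of individual probes.

```java
import java.time.Duration;

/** Hypothetical helper: approximate time a probe may keep failing before the kubelet acts. */
final class ProbeWindowSketch {

    /** Rule of thumb used in the YAML comments: failureThreshold × periodSeconds. */
    static Duration toleratedFailure(int failureThreshold, int periodSeconds) {
        return Duration.ofSeconds((long) failureThreshold * periodSeconds);
    }

    public static void main(String[] args) {
        // Values copied from the Deployment patch.
        System.out.println("readiness window s = " + toleratedFailure(3, 30).getSeconds());
        System.out.println("liveness window s  = " + toleratedFailure(6, 30).getSeconds());
    }
}
```

For the values in the patch this gives about 90 s before a pod is marked NotReady and about 180 s (3 minutes) before a liveness restart.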

After `kubectl apply`, watch the rollout:

```bash
kubectl -n $NS rollout status deployment/codeiq --timeout=5m
kubectl -n $NS top pod -l app=codeiq   # RSS should land near the requested 3Gi, not the 4Gi limit
kubectl -n $NS logs -l app=codeiq --tail=200 | grep -iE "oom|outofmemory|gc"
```

## 4. What this does NOT fix

- **5 M+ node graphs.** At that scale the topology snapshot is multi-GB regardless of TTL. The bounded-Cypher refactor is needed (tracked as the topology-deep-refactor follow-up).
- **runCypher misuse.** Operators or LLM agents can still run unbounded ad-hoc Cypher and OOM the pod. Limit the `runCypher` MCP tool's `maxResults` via `codeiq.yml` if you expose it externally.
- **Heap dump capture under cgroup pressure.** `HeapDumpPath=/tmp` is fine on tmpfs-backed `/tmp` (the read-only deploy uses `emptyDir: { medium: Memory }`), but if `/tmp` is also at the limit when the OOM fires, the dump won't write. For long-term diagnosis attach an `emptyDir` volume sized at `1.5 × heap` and point `HeapDumpPath` at it.

## 5. Horizontal scaling

The image-bundled read-only graph means each pod is fully stateless, so `replicas: N` is safe. The only per-pod state worth knowing about is the in-process rate-limit `ConcurrentHashMap` in `RateLimitFilter`; token buckets reset per-replica, which is correct for per-key throttling but means a global rate limit across replicas isn't enforced. Most workloads don't need that.

```yaml
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
```

A 3-replica deploy at `4Gi × 3 = 12Gi` total cluster cost gives ~3× the request capacity of a single 8Gi pod with zero crash risk.
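
The per-replica rate-limit caveat in this section can be shown with a minimal Java sketch of an in-process token-bucket map. This is not the project's `RateLimitFilter`: the class, the bucket parameters, and the `allow` method are hypothetical, and the two instances in `main` stand in for two pods.

```java
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical per-replica limiter; not the project's RateLimitFilter. */
final class PerReplicaRateLimitSketch {

    /** Token bucket holding up to `capacity` tokens, refilled continuously. */
    static final class Bucket {
        private final int capacity;
        private final double refillPerNano;
        private double tokens;
        private long lastNanos;

        Bucket(int capacity, double refillPerSecond, long nowNanos) {
            this.capacity = capacity;
            this.refillPerNano = refillPerSecond / 1_000_000_000.0;
            this.tokens = capacity;
            this.lastNanos = nowNanos;
        }

        synchronized boolean tryAcquire(long nowNanos) {
            tokens = Math.min(capacity, tokens + (nowNanos - lastNanos) * refillPerNano);
            lastNanos = nowNanos;
            if (tokens >= 1.0) {
                tokens -= 1.0;
                return true;
            }
            return false;
        }
    }

    // One map per process: another replica has its own, independent map.
    private final ConcurrentHashMap<String, Bucket> buckets = new ConcurrentHashMap<>();
    private final int capacity;
    private final double refillPerSecond;

    PerReplicaRateLimitSketch(int capacity, double refillPerSecond) {
        this.capacity = capacity;
        this.refillPerSecond = refillPerSecond;
    }

    boolean allow(String key, long nowNanos) {
        return buckets
                .computeIfAbsent(key, k -> new Bucket(capacity, refillPerSecond, nowNanos))
                .tryAcquire(nowNanos);
    }

    public static void main(String[] args) {
        PerReplicaRateLimitSketch replicaA = new PerReplicaRateLimitSketch(2, 1.0);
        PerReplicaRateLimitSketch replicaB = new PerReplicaRateLimitSketch(2, 1.0);
        int admitted = 0;
        for (int i = 0; i < 3; i++) {
            if (replicaA.allow("client-1", 0L)) admitted++;
            if (replicaB.allow("client-1", 0L)) admitted++;
        }
        System.out.println("admitted across replicas = " + admitted);
    }
}
```

Two replicas with a burst capacity of 2 admit 4 requests for the same key at the same instant, which is the "no global limit" behaviour this section describes. Enforcing one limit across replicas would need shared state outside the pod.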

## 6. Cross-references

- Code changes that landed alongside this runbook: `config/Neo4jConfig.java` (page-cache cap), `query/TopologySnapshotProvider.java` (shared snapshot), `application.yml` (Caffeine cache type), `scripts/aks-launch.sh` (JVM flag preset).
- Related runbook: [`aks-read-only-deploy.md`](aks-read-only-deploy.md), the deploy shape this OOM patch sits inside.
- Architecture rationale: [`shared/runbooks/engineering-standards.md`](engineering-standards.md) §4 (resource sizing) and the OOM review thread in the project history.
