|
| 1 | +# Runbook: AKS OOM quick fix for `codeiq serve` |
| 2 | + |
| 3 | +> **Audience:** ops engineers seeing `codeiq serve` pods crash, restart-loop, or feel sluggish on AKS (or any Kubernetes cluster) at the typical ~200 K-node graph scale. |
| 4 | +> |
| 5 | +> **Symptom:** pod is `OOMKilled`, or `kubectl top pod` shows steady-state RSS climbing toward the cgroup limit, or readiness probe flaps under load. |
| 6 | +> |
| 7 | +> **Companion:** [`aks-read-only-deploy.md`](aks-read-only-deploy.md) covers the read-only-rootfs deploy shape; this runbook covers the memory tuning that should pair with it. |
| 8 | +
|
| 9 | +## TL;DR |
| 10 | + |
| 11 | +```yaml |
| 12 | +resources: |
| 13 | + requests: { memory: "3Gi", cpu: "500m" } |
| 14 | + limits: { memory: "4Gi", cpu: "2" } |
| 15 | +env: |
| 16 | + - name: JAVA_TOOL_OPTIONS |
| 17 | + value: >- |
| 18 | + -XX:MaxRAMPercentage=50 |
| 19 | + -XX:InitialRAMPercentage=25 |
| 20 | + -XX:+UseG1GC |
| 21 | + -XX:+ExitOnOutOfMemoryError |
| 22 | + -XX:+HeapDumpOnOutOfMemoryError |
| 23 | + -XX:HeapDumpPath=/tmp/codeiq-oom.hprof |
| 24 | +readinessProbe: |
| 25 | + httpGet: { path: /actuator/health/readiness, port: 8080 } |
| 26 | + initialDelaySeconds: 60 |
| 27 | + periodSeconds: 30 |
| 28 | + timeoutSeconds: 10 |
| 29 | + failureThreshold: 3 |
| 30 | +livenessProbe: |
| 31 | + httpGet: { path: /actuator/health/liveness, port: 8080 } |
| 32 | + initialDelaySeconds: 90 |
| 33 | + periodSeconds: 30 |
| 34 | + failureThreshold: 6 |
| 35 | +``` |
| 36 | +
|
| 37 | +If you only do one thing, **set `MaxRAMPercentage=50` and `limits.memory: 4Gi`** — that alone resolves most OOMKilled crashes on the current architecture. |
| 38 | + |
| 39 | +## 1. Why a graph this small OOMs |
| 40 | + |
| 41 | +On a typical workload (~200 K nodes, ~320 K edges) the raw graph is ~150–200 MiB. The pod still OOMs because four independent memory consumers compete for the same cgroup limit: |
| 42 | + |
| 43 | +| Consumer | Default behaviour (untuned) | After the v0.2.1 quick-win PR | |
| 44 | +|---|---|---| |
| 45 | +| JVM heap | `-XX:MaxRAMPercentage=75` (JDK 25 default in containers) → ~3 GiB on a 4 GiB pod | Capped at 50% via `aks-launch.sh` | |
| 46 | +| Neo4j page cache | Auto-grabs ~50% of *free* RAM at startup (off-heap, additive) | Capped at 256 MiB in `Neo4jConfig.java` | |
| 47 | +| Spring `@Cacheable` regions | `ConcurrentMapCacheManager` — unbounded, no TTL, no eviction | Caffeine `maximumSize=1000, expireAfterWrite=5m` | |
| 48 | +| Topology snapshot | Two independent `AtomicReference<List<CodeNode>>` (one in `McpTools`, one in `TopologyController`) | One shared `TopologySnapshotProvider`, 60 s TTL | |
| 49 | + |
| 50 | +The first two cumulatively exceed `limits.memory` because nothing tells either side it has to share. The next two leak slowly under normal traffic until the heap fills. |
| 51 | + |
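The sharing problem can be checked with quick integer arithmetic. The sketch below is illustrative only: `headroom_mib` is a hypothetical helper (not part of `aks-launch.sh`), and it assumes the page cache sits at the 256 MiB cap in both cases. It prints how many MiB of the cgroup limit remain for metaspace, thread stacks, and other off-heap memory.

```bash
# Hypothetical helper: MiB left over once heap and page cache are subtracted.
#   $1 = cgroup memory limit in MiB
#   $2 = MaxRAMPercentage (integer)
#   $3 = Neo4j page-cache cap in MiB
headroom_mib() (
  heap_mib=$(( $1 * $2 / 100 ))
  echo $(( $1 - heap_mib - $3 ))
)

headroom_mib 4096 50 256   # 2048 MiB heap + 256 MiB page cache -> prints 1792
headroom_mib 4096 75 256   # 3072 MiB heap + 256 MiB page cache -> prints 768
```

With a 75% heap, less than 1 GiB is left for everything else in the process, and that is before an uncapped page cache takes its share.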
| 52 | +## 2. Diagnostic — what's actually broken |
| 53 | + |
| 54 | +Run these inside the cluster before applying the patch. They tell you whether the failure mode is **kernel OOM** (cgroup limit) or **JVM heap thrash** (probes timing out under GC pauses) — different fixes apply. |
| 55 | + |
| 56 | +```bash |
| 57 | +NS=<your-namespace> |
| 58 | +POD=$(kubectl -n $NS get pod -l app=codeiq -o jsonpath='{.items[0].metadata.name}') |
| 59 | +
|
| 60 | +# 1. Are pods being kernel-OOM-killed? |
| 61 | +kubectl -n $NS get events --sort-by='.lastTimestamp' | grep -iE "oom|kill|evict" |
| 62 | +kubectl -n $NS describe pod $POD | grep -A2 "Last State" |
| 63 | +
|
| 64 | +# 2. Pod resource limits + actual usage |
| 65 | +kubectl -n $NS get pod $POD -o jsonpath='{.spec.containers[0].resources}'; echo |
| 66 | +kubectl -n $NS top pod $POD |
| 67 | +
|
| 68 | +# 3. JVM-effective heap settings + current heap |
| 69 | +kubectl -n $NS exec $POD -- jcmd 1 VM.flags | tr ' ' '\n' | grep -E "MaxHeapSize|MaxRAMPercentage" |
| 70 | +kubectl -n $NS exec $POD -- jcmd 1 GC.heap_info |
| 71 | +``` |
| 72 | + |
| 73 | +### Decision tree |
| 74 | + |
| 75 | +- **`Last State: Terminated Reason: OOMKilled`** → pod hit the cgroup limit. Apply the full TL;DR patch above. |
| 76 | +- **No `OOMKilled`, but readiness flaps (`Reason: Unhealthy` events for `/actuator/health/readiness`)** → JVM is in GC thrash. The Caffeine + topology-snapshot fixes in v0.2.1 + bumping `failureThreshold: 6` resolve this without changing pod size. |
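| 77 | +- **Steady-state RSS keeps climbing for hours** → unbounded Spring cache. Confirm the pod image includes the Caffeine fix (v0.2.1+) by checking `kubectl -n $NS exec $POD -- jcmd 1 GC.class_histogram | grep -i caffeine` (the histogram lists live classes by name, so Caffeine classes show up when the fix is present; note that it triggers a full GC). |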
| 78 | + |
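The decision tree above can be sketched as a small shell function. `classify_failure` and its output labels are hypothetical names chosen for illustration; the function only encodes the three branches and assumes you have already collected the two inputs yourself.

```bash
# Hypothetical helper: map two observations to the likely failure mode.
#   $1 = last termination reason of the container, for example from
#        kubectl -n $NS get pod $POD \
#          -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
#   $2 = number of "Unhealthy" readiness events observed (defaults to 0)
classify_failure() {
  if [ "$1" = "OOMKilled" ]; then
    echo "cgroup-oom"        # apply the full TL;DR patch
  elif [ "${2:-0}" -gt 0 ]; then
    echo "gc-thrash"         # v0.2.1 cache fixes plus probe tuning
  else
    echo "check-rss-trend"   # watch RSS over hours for an unbounded cache
  fi
}

classify_failure "OOMKilled" 0
classify_failure "" 4
```

The RSS trend itself still has to be observed over time with `kubectl top pod`; the function only tells you which branch to start from.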
| 79 | +## 3. Apply the Deployment patch |
| 80 | + |
| 81 | +```yaml |
| 82 | +# Deployment.spec.template.spec.containers[0] |
| 83 | +resources: |
| 84 | + # request = guaranteed-not-evicted floor; limit = hard cgroup ceiling. |
| 85 | + # 200 K-node graphs comfortably fit in 4 GiB total once the v0.2.1 |
| 86 | + # quick-win lands. Bump the limit (not the request) if your store grows |
| 87 | + # past ~500 MB on disk. |
| 88 | + requests: |
| 89 | + memory: "3Gi" |
| 90 | + cpu: "500m" |
| 91 | + limits: |
| 92 | + memory: "4Gi" |
| 93 | + cpu: "2" |
| 94 | +env: |
| 95 | + - name: JAVA_TOOL_OPTIONS |
| 96 | + # JAVA_TOOL_OPTIONS is picked up by every JVM invocation and prepended |
| 97 | + # to argv. Useful here because aks-launch.sh already sets the same |
| 98 | + # flags at exec time — the env var is a belt-and-braces fallback if |
| 99 | + # ops bypass the launch wrapper (e.g. kubectl exec'ing into the pod). |
| 100 | + value: >- |
| 101 | + -XX:MaxRAMPercentage=50 |
| 102 | + -XX:InitialRAMPercentage=25 |
| 103 | + -XX:+UseG1GC |
| 104 | + -XX:+ExitOnOutOfMemoryError |
| 105 | + -XX:+HeapDumpOnOutOfMemoryError |
| 106 | + -XX:HeapDumpPath=/tmp/codeiq-oom.hprof |
| 107 | +readinessProbe: |
| 108 | + # Spring + Neo4j cold start is 10–16s. initialDelaySeconds of 60 gives |
| 109 | + # Spring's lazy beans + the first Neo4j page-cache page-in headroom |
| 110 | + # before the first probe failure can mark the pod NotReady. |
| 111 | + httpGet: { path: /actuator/health/readiness, port: 8080 } |
| 112 | + initialDelaySeconds: 60 |
| 113 | + periodSeconds: 30 |
| 114 | + timeoutSeconds: 10 |
| 115 | + failureThreshold: 3 |
| 116 | +livenessProbe: |
| 117 | + # failureThreshold: 6 over periodSeconds: 30 = 3 minutes of tolerated |
| 118 | + # unresponsiveness before a restart. Critical because GraphHealthIndicator |
| 119 | + # runs against Neo4j and a flushing page cache can stall it briefly |
| 120 | + # under burst traffic. Liveness must never flap on transient slowness; |
| 121 | + # only on actual JVM-dead. |
| 122 | + httpGet: { path: /actuator/health/liveness, port: 8080 } |
| 123 | + initialDelaySeconds: 90 |
| 124 | + periodSeconds: 30 |
| 125 | + failureThreshold: 6 |
| 126 | +``` |
| 127 | + |
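The tolerance windows quoted in the probe comments follow from `failureThreshold × periodSeconds`. A minimal sketch of that arithmetic, using a hypothetical helper name and ignoring `timeoutSeconds` and `initialDelaySeconds` (so the real window is slightly longer):

```bash
# Hypothetical helper: approximate seconds of consecutive probe failures
# tolerated before Kubernetes acts on the probe.
#   $1 = periodSeconds, $2 = failureThreshold
probe_window_s() (
  echo $(( $1 * $2 ))
)

probe_window_s 30 6   # liveness:  prints 180 (3 minutes before a restart)
probe_window_s 30 3   # readiness: prints 90  (before the pod is marked NotReady)
```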
| 128 | +After `kubectl apply`, watch the rollout: |
| 129 | + |
| 130 | +```bash |
| 131 | +kubectl -n $NS rollout status deployment/codeiq --timeout=5m |
| 132 | +kubectl -n $NS top pod -l app=codeiq # RSS should land near the requested 3Gi, not the 4Gi limit |
| 133 | +kubectl -n $NS logs -l app=codeiq --tail=200 | grep -iE "oom|outofmemory|gc" |
| 134 | +``` |
| 135 | + |
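It can also be worth confirming that the flags from the patch actually reached the JVM. The sketch below assumes that `jcmd 1 VM.flags` lists explicitly set flags in `-XX:` form (the percentage is printed with decimals, hence the prefix match). `missing_flags` is a hypothetical helper, not an existing tool.

```bash
# Hypothetical helper: print each expected flag that is absent from the
# given flag dump. $1 = output of `jcmd 1 VM.flags`.
missing_flags() (
  # Flatten newlines so every flag is preceded by a space.
  flags=" $(printf '%s' "$1" | tr '\n' ' ')"
  for want in \
    '-XX:MaxRAMPercentage=50' \
    '-XX:+UseG1GC' \
    '-XX:+ExitOnOutOfMemoryError' \
    '-XX:+HeapDumpOnOutOfMemoryError'
  do
    case "$flags" in
      *" $want"*) ;;          # present (prefix match)
      *) echo "$want" ;;      # absent: report it
    esac
  done
)

# Sample input standing in for real jcmd output:
missing_flags "-XX:MaxRAMPercentage=50.000000 -XX:+UseG1GC"
```

Against a live pod the call would be `missing_flags "$(kubectl -n $NS exec $POD -- jcmd 1 VM.flags)"`; empty output means all four flags are in effect.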
| 136 | +## 4. What this does NOT fix |
| 137 | + |
| 138 | +- **5 M+ node graphs.** At that scale the topology snapshot is multi-GB regardless of TTL. The bounded-Cypher refactor is needed (tracked as the topology-deep-refactor follow-up). |
| 139 | +- **runCypher misuse.** Operators or LLM agents can still run unbounded ad-hoc Cypher and OOM the pod. Limit the `runCypher` MCP tool's `maxResults` via `codeiq.yml` if you expose it externally. |
| 140 | +- **Heap dump capture under cgroup pressure.** `HeapDumpPath=/tmp` is fine on tmpfs-backed `/tmp` (the read-only deploy uses `emptyDir: { medium: Memory }`), but if `/tmp` is also at the limit when the OOM fires, the dump won't write. For long-term diagnosis attach an `emptyDir` volume sized at `1.5 × heap` and point `HeapDumpPath` at it. |
| 141 | + |
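For the heap-dump volume, the `1.5 × heap` rule can be worked out from the pod limit and `MaxRAMPercentage`. This is a rough sizing sketch with a hypothetical helper name; it assumes the maximum heap is exactly `limit × percentage`.

```bash
# Hypothetical helper: suggested dump-volume size in MiB (1.5 x max heap).
#   $1 = cgroup memory limit in MiB, $2 = MaxRAMPercentage (integer)
dump_volume_mib() (
  heap_mib=$(( $1 * $2 / 100 ))
  echo $(( heap_mib * 3 / 2 ))
)

dump_volume_mib 4096 50   # 2048 MiB heap -> prints 3072
```

For the 4 GiB / 50% configuration in this runbook, that works out to a 3 GiB volume.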
| 142 | +## 5. Horizontal scaling |
| 143 | + |
| 144 | +The image-bundled read-only graph means each pod is fully stateless — `replicas: N` is safe. The only per-pod state worth knowing about is the in-process rate-limit `ConcurrentHashMap` in `RateLimitFilter`; token buckets reset per-replica, which is correct for per-key throttling but means a global rate limit across replicas isn't enforced. Most workloads don't need that. |
| 145 | + |
| 146 | +```yaml |
| 147 | +spec: |
| 148 | + replicas: 3 |
| 149 | + strategy: |
| 150 | + type: RollingUpdate |
| 151 | + rollingUpdate: |
| 152 | + maxUnavailable: 0 |
| 153 | + maxSurge: 1 |
| 154 | +``` |
| 155 | + |
| 156 | +A 3-replica deploy at `4Gi × 3 = 12Gi` total memory limit gives ~3× the request capacity of a single 8Gi pod, and an OOM in one replica no longer takes the whole service down. |
| 157 | + |
| 158 | +## 6. Cross-references |
| 159 | + |
| 160 | +- Code changes that landed alongside this runbook: `config/Neo4jConfig.java` (page-cache cap), `query/TopologySnapshotProvider.java` (shared snapshot), `application.yml` (Caffeine cache type), `scripts/aks-launch.sh` (JVM flag preset). |
| 161 | +- Related runbook: [`aks-read-only-deploy.md`](aks-read-only-deploy.md) — the deploy shape this OOM patch sits inside. |
| 162 | +- Architecture rationale: [`shared/runbooks/engineering-standards.md`](engineering-standards.md) §4 (resource sizing) and the OOM review thread in the project history. |