Number531 · May 16, 2026
diff --git a/.claude/skills/deploy/SKILL.md b/.claude/skills/deploy/SKILL.md
@@ -77,7 +77,7 @@ If gcloud auth fails with `Reauthentication failed` or `cannot prompt during non
 | Docker push fails | Check Docker auth: `gcloud auth configure-docker us-east1-docker.pkg.dev`. If network error, retry. |
 | `ZONE_RESOURCE_POOL_EXHAUSTED` | GCE has no capacity in `us-east1-c`. Wait 10-30 min and retry — GCE capacity is transient. If persistent, try changing machine type in deploy.sh (n2-standard-2 uses a different capacity pool than e2-medium). |
 | No running instance after 160s | GCE capacity exhausted in all zones. Wait and retry, or manually create in a different zone. |
-| Static IP assignment fails | Script retries 3 times. If still fails, the old instance may still hold the IP — wait 30s for MIG replacement. Static IP only works in us-east1 zones. |
+| Static IP assignment fails | Script waits up to 180s in pre-flight, then retries 8 times (~330s retry budget, ~510s total). It re-resolves the MIG instance name on every retry, so a mid-loop roll no longer leaves it targeting a deleted instance. If it still fails, follow the manual recovery commands printed at the end of Step 7. Static IP only works in us-east1 zones. |
 | Health check fails via SSH | Container may still be starting — wait 60s and check again. If DB shows ETIMEDOUT, the static IP `34.26.70.60` may not be in Cloud SQL whitelist |
 | Two instances running | Multiple MIGs may exist. The script finds the first RUNNING instance regardless of zone. If stale MIGs are found, delete them: `gcloud compute instance-groups managed delete super-legal-staging --zone=<stale-zone> --quiet` |
 | SSH `REMOTE HOST IDENTIFICATION HAS CHANGED` | Instance was replaced — new host key. Script clears these automatically, but if manual SSH is needed: `sed -i '' '/compute\./d' ~/.ssh/google_compute_known_hosts` |
@@ -124,54 +124,47 @@ gcloud compute ssh super-legal-staging-XXXX --zone=us-east1-c \
 ```
 Wait 60s, then verify via `curl http://34.26.70.60:3001/health | jq '.reconciliation'` — `error` field should disappear.
 
-### Static IP assignment race (recurring)
+### Static IP assignment race (recurring — two distinct failure modes)
 
-**Observed on**: 2026-04-27 v6.7.0 deploy AND 2026-04-28 v6.8.0 deploy.
+The script handles two related race conditions in Step 7. Both have hit production multiple times; both fixes are now in `deploy.sh`.
 
-**Symptom**: Step 7 reports `Failed to assign static IP after N attempts`. Container ends up on an ephemeral IP (e.g., `34.23.108.165`). Cloud SQL whitelist only includes `34.26.70.60`, so `ensureHookSchema()` fails.
+#### Pattern A — OLD instance still holds the IP
 
-**Root cause**: GCE Managed Instance Group terminates the OLD instance and creates the NEW one in parallel. The OLD instance still holds `34.26.70.60` for ~30-60s during graceful termination. The script's previous retry budget (3 attempts × 15s wait = 45s window) was too short.
+**Observed on**: 2026-04-27 v6.7.0, 2026-04-28 v6.8.0.
 
-**Fix in deploy.sh** (Step 7, applied 2026-04-28):
-1. Pre-flight wait: poll `gcloud compute addresses list --filter="address=$STATIC_IP"` until status is `RESERVED` (not `IN_USE`). Up to 90s window.
-2. Bumped retry attempts from 3 → 5 with 30s wait (total: 150s).
-3. Captured stderr to a tempfile so failures surface in the log instead of being silently swallowed by `2>/dev/null`.
-4. On final failure, print the manual recovery commands directly to stderr.
+**Symptom**: Step 7 retries fail because the OLD instance still has `34.26.70.60` bound during its graceful-termination window (~30-60s).
 
-**Manual recovery if step 7 still fails**:
-```bash
-INSTANCE=$(gcloud compute instances list --filter="name~super-legal-staging" --format="value(name)" | head -1)
-gcloud compute instances delete-access-config $INSTANCE --zone=us-east1-c --access-config-name=external-nat --quiet
-sleep 10
-gcloud compute instances add-access-config $INSTANCE --zone=us-east1-c --access-config-name=external-nat --address=34.26.70.60 --quiet
-gcloud compute ssh $INSTANCE --zone=us-east1-c --command='docker restart $(docker ps -q | head -1)'
-```
+**Fix in deploy.sh** (applied 2026-04-28, widened 2026-05-16):
+1. Pre-flight: poll `gcloud compute addresses list --filter="address=$STATIC_IP"` until status is `RESERVED` (not `IN_USE`). **12 × 15s = 180s window** (was 9 × 10s = 90s).
+2. Per-retry: 15s sleep between `delete-access-config` and `add-access-config` (was 5s — matches the duration that consistently works in manual recovery).
+3. Captured stderr to a tempfile so failures surface in the log.
 
-### Variant: MIG instance replacement mid-retries
+#### Pattern B — MIG rolls the instance mid-retry-loop
 
-**Observed on**: 2026-05-06 v7.0.1 deploy.
+**Observed on**: 2026-05-06 v7.0.1, then THREE consecutive deploys 2026-05-12 / 2026-05-15 / 2026-05-16.
 
-**Symptom**: Step 7 retries 5x with `Could not fetch resource: super-legal-staging-XXXX` even though the script logs `IP is RESERVED` and proceeds to `Attempt 1/5: Assigning ...`. The log line keeps showing the SAME instance name across all 5 attempts. Meanwhile, `gcloud compute instances list` reveals a DIFFERENT instance name is actually running.
+**Symptom**: Script logs `IP is RESERVED` then `Attempt 1/N: Assigning <name>` — but the log keeps targeting the SAME `<name>` across all retries while `gcloud compute instances list` shows a DIFFERENT running instance. Every `add-access-config` returns `Could not fetch resource: <stale-name>`.
 
-**Root cause**: The MIG terminated the instance the script was targeting (e.g., `super-legal-staging-0239`) and rolled forward to a new one (e.g., `super-legal-staging-bzx4`) DURING step 7's retry budget. The script captured the original instance name in step 6 and did not re-resolve it on each retry. Every `add-access-config` call hits a deleted resource.
+**Root cause**: MIG terminated the instance the script captured in Step 6 and rolled to a new one DURING Step 7's retry budget. The script's `$INSTANCE` variable was stale.
 
-**Detection between retries**:
-```bash
-gcloud compute instances list --filter='name~super-legal-staging AND status=RUNNING' --format='value(name)'
-```
-If this returns a different instance name than what the script's log shows, the variant has triggered.
+**Fix in deploy.sh** (applied 2026-05-16):
+1. **Re-resolve `$INSTANCE` on EVERY retry iteration** via a `resolve_current_instance()` helper that queries `gcloud compute instances list --filter='name~super-legal-staging AND status=RUNNING'`. The script now re-targets and logs `MIG rolled: <old> → <new> — re-targeting` whenever the name changes between iterations.
+2. **On "Could not fetch resource" errors specifically**, sleep 10s (not 30s) before re-resolving — fast recovery on the most common transient.
+3. **Bumped retries 5 → 8** with re-resolution on each, expanding the retry budget from 150s to ~330s. With the 180s pre-flight added, the total Step 7 budget is ~510s (~8.5 min).
+4. **Updated manual recovery commands** in the failure message to re-resolve `INSTANCE` first (was using whatever the script had cached).
+
+#### Manual recovery if Step 7 still exhausts both budgets
 
-**Manual recovery on the new instance**:
 ```bash
-NEW_INSTANCE=$(gcloud compute instances list --filter='name~super-legal-staging AND status=RUNNING' --format='value(name)' | head -1)
-gcloud compute instances delete-access-config $NEW_INSTANCE --zone=us-east1-c --access-config-name=external-nat --quiet
-sleep 10
-gcloud compute instances add-access-config $NEW_INSTANCE --zone=us-east1-c --access-config-name=external-nat --address=34.26.70.60 --quiet
-sed -i '' '/compute\./d' ~/.ssh/google_compute_known_hosts
-gcloud compute ssh $NEW_INSTANCE --zone=us-east1-c --command='docker restart $(docker ps -q | head -1)'
+INSTANCE=$(gcloud compute instances list --filter='name~super-legal-staging AND status=RUNNING' --format='value(name)' | head -1)
+gcloud compute instances delete-access-config $INSTANCE --zone=us-east1-c --access-config-name=external-nat --quiet
+sleep 15
+gcloud compute instances add-access-config $INSTANCE --zone=us-east1-c --access-config-name=external-nat --address=34.26.70.60 --quiet
+sed -i '' '/compute\./d' ~/.ssh/google_compute_known_hosts  # if SSH balks at host-key change
+gcloud compute ssh $INSTANCE --zone=us-east1-c --command='docker restart $(docker ps -q | head -1)'
 ```
 
-Wait 60s, then verify via `curl http://34.26.70.60:3001/health`.
+Wait 60s, then verify via `curl http://34.26.70.60:3001/health | jq '.dependencies.database'`. `status: "ok"` confirms the static-IP + Cloud-SQL-whitelist path is working again.
 
 **Future deploy.sh hardening** (not yet implemented): Step 7's retry loop should re-resolve the instance name on each attempt:
 ```bash

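Reviewer note: the ~330s and ~510s figures quoted for Step 7 can be reproduced from the sleep constants in the patched loop. The sketch below is a back-of-envelope check, not part of `deploy.sh`. All variable names are hypothetical; only the numeric constants come from the diff. It counts sleep time only and assumes `gcloud` call latency is negligible, so real wall-clock time will be somewhat longer.

```bash
#!/usr/bin/env bash
# Sleep-budget arithmetic for Step 7 (illustrative; variable names are not from deploy.sh).
PREFLIGHT_POLLS=12     # pre-flight loop iterations
PREFLIGHT_SLEEP=15     # seconds slept after each failed pre-flight poll
RETRIES=8              # assignment attempts
PROPAGATE_SLEEP=15     # seconds between delete-access-config and add-access-config
BACKOFF_DEFAULT=30     # seconds slept after a generic failure
BACKOFF_ROLL=10        # seconds slept after "Could not fetch resource"

PREFLIGHT=$(( PREFLIGHT_POLLS * PREFLIGHT_SLEEP ))

# Appears to be the documented figure: no back-off counted after the last attempt.
RETRY_DOC=$(( RETRIES * PROPAGATE_SLEEP + (RETRIES - 1) * BACKOFF_DEFAULT ))

# What the loop literally sleeps if every attempt fails generically:
# it also backs off after attempt 8 before reporting failure.
RETRY_WORST=$(( RETRIES * (PROPAGATE_SLEEP + BACKOFF_DEFAULT) ))

# Every attempt fails with "Could not fetch resource" (pattern B path).
RETRY_ROLL=$(( RETRIES * (PROPAGATE_SLEEP + BACKOFF_ROLL) ))

TOTAL_DOC=$(( PREFLIGHT + RETRY_DOC ))

echo "pre-flight:              ${PREFLIGHT}s"    # 180s
echo "retries (as documented): ${RETRY_DOC}s"    # 330s
echo "retries (literal worst): ${RETRY_WORST}s"  # 360s
echo "retries (all pattern B): ${RETRY_ROLL}s"   # 200s
echo "total (as documented):   ${TOTAL_DOC}s"    # 510s
```

On this reading, ~330s is the retry loop alone and ~510s is pre-flight plus retries. Because the loop also sleeps after the eighth failure, the literal worst case is about 30s longer than the documented total.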
diff --git a/.claude/skills/deploy/scripts/deploy.sh b/.claude/skills/deploy/scripts/deploy.sh
@@ -142,22 +142,31 @@ log "Ensuring firewall tags on $INSTANCE..."
 gcloud compute instances add-tags "$INSTANCE" --zone="$INSTANCE_ZONE" --tags=super-legal-mcp 2>/dev/null || true
 
 # ── Step 7: Assign static IP ────────────────────────────────────────────
-# The recurring failure mode (observed v6.7.0 + v6.8.0): MIG terminates the
-# OLD instance and creates the new one in parallel; the OLD instance still
-# holds 34.26.70.60 for ~30-60s during graceful termination. Trying to assign
-# the static IP to the NEW instance during that window fails with errors
-# silently swallowed by 2>/dev/null. Three fixes applied:
-#   1) Pre-flight: poll until the address resource shows status=RESERVED
-#      (or no users), proving the previous instance has fully released it.
-#   2) Bumped retries 3→5 with 30s waits (total window 150s vs old 45s).
-#   3) Captured stderr to a file so failure messages surface in the log.
+# Two recurring failure modes (Apr–May 2026):
+#   A) Race (v6.7.0, v6.8.0): MIG terminates OLD instance + creates NEW one
+#      in parallel; OLD still holds 34.26.70.60 for ~30-60s during graceful
+#      termination. add-access-config to NEW conflicts.
+#   B) Stale INSTANCE handle (v7.0.1, 2026-05-12, 2026-05-15, 2026-05-16):
+#      MIG rolls the instance AGAIN during retry budget. Script captured
+#      $INSTANCE in step 6; every retry hammers a dead resource and gets
+#      "Could not fetch resource: <stale-name>". Hit 3× in a row May 2026.
+#
+# Fixes applied (2026-05-16 — closes pattern B):
+#   1) Pre-flight: 12 × 15s = 180s waiting for status=RESERVED (was 90s).
+#   2) Re-resolve $INSTANCE from MIG on EVERY retry iteration (the manual
+#      recovery does this implicitly; the script did not).
+#   3) Bumped retries 5 → 8, sleep 5 → 15 between delete+add (matches the
+#      sleep duration that consistently works in manual recovery).
+#   4) On "Could not fetch resource" specifically, log it as a roll-event
+#      and re-resolve before next retry without waiting the full 30s.
+#   5) Captured stderr to a file so failure messages surface in the log.
 step "Step 7: Assign static IP ($STATIC_IP)"
 
 # Pre-flight: wait for the static address to be released by the old instance.
-# Up to 90s window — covers GCE graceful-termination latency.
+# Up to 180s window — covers GCE graceful-termination latency + 2nd MIG roll.
 log "Pre-flight: waiting for $STATIC_IP to be released by previous instance..."
 RELEASED=false
-for wait_attempt in 1 2 3 4 5 6 7 8 9; do
+for wait_attempt in 1 2 3 4 5 6 7 8 9 10 11 12; do
   ADDR_STATUS=$(gcloud compute addresses list \
     --filter="address=$STATIC_IP" \
     --format="value(status,users)" 2>/dev/null | head -1)
@@ -167,42 +176,72 @@ for wait_attempt in 1 2 3 4 5 6 7 8 9; do
     RELEASED=true
     break
   fi
-  log "  attempt $wait_attempt: $ADDR_STATUS"
-  sleep 10
+  log "  attempt $wait_attempt/12: $ADDR_STATUS"
+  sleep 15
 done
 if [ "$RELEASED" = false ]; then
-  warn "Static IP still bound after 90s — proceeding anyway"
+  warn "Static IP still bound after 180s — proceeding anyway"
 fi
 
+# Helper: re-resolve the current running MIG instance name. Returns empty if
+# nothing is RUNNING. Used at every retry to defeat pattern B.
+resolve_current_instance() {
+  gcloud compute instances list \
+    --filter="name~super-legal-staging AND status=RUNNING" \
+    --format="value(name)" 2>/dev/null | head -1
+}
+
 ASSIGNED=false
 ERR_FILE=$(mktemp)
 trap 'rm -f "$ERR_FILE"' EXIT
-for attempt in 1 2 3 4 5; do
-  log "Attempt $attempt/5: Assigning $STATIC_IP to $INSTANCE..."
+for attempt in 1 2 3 4 5 6 7 8; do
+  # Re-resolve INSTANCE on every iteration — protects against MIG rolling
+  # mid-loop (pattern B). $INSTANCE_ZONE assumed stable (single-zone MIG).
+  CURRENT_INSTANCE=$(resolve_current_instance)
+  if [ -z "$CURRENT_INSTANCE" ]; then
+    warn "  attempt $attempt/8: no RUNNING instance found — MIG may be rolling, waiting 20s"
+    sleep 20
+    continue
+  fi
+  if [ "$CURRENT_INSTANCE" != "$INSTANCE" ]; then
+    log "  MIG rolled: $INSTANCE → $CURRENT_INSTANCE — re-targeting"
+    INSTANCE="$CURRENT_INSTANCE"
+  fi
+  log "Attempt $attempt/8: Assigning $STATIC_IP to $INSTANCE..."
   # Remove existing access config if present (try lowercase first, fallback uppercase)
   gcloud compute instances delete-access-config "$INSTANCE" --zone="$INSTANCE_ZONE" \
     --access-config-name="external-nat" --quiet 2>/dev/null || \
   gcloud compute instances delete-access-config "$INSTANCE" --zone="$INSTANCE_ZONE" \
     --access-config-name="External NAT" --quiet 2>/dev/null || true
-  sleep 5  # let the delete propagate before re-adding
+  sleep 15  # let the delete propagate; 5s was too short, 10–15s consistently works
   if gcloud compute instances add-access-config "$INSTANCE" --zone="$INSTANCE_ZONE" \
     --access-config-name="external-nat" --address="$STATIC_IP" 2>"$ERR_FILE"; then
     log "Static IP assigned successfully"
     ASSIGNED=true
     break
   fi
-  warn "  failure: $(head -1 "$ERR_FILE")"
-  warn "  retrying in 30s..."
-  sleep 30
+  ERR_MSG=$(head -1 "$ERR_FILE")
+  warn "  failure: $ERR_MSG"
+  # Pattern B detection: "Could not fetch resource" usually means $INSTANCE
+  # was deleted by MIG mid-call. Re-resolve on the next iteration immediately
+  # rather than waiting the full 30s.
+  if echo "$ERR_MSG" | grep -q "Could not fetch resource"; then
+    warn "  → likely MIG rolled the instance; re-resolving in 10s"
+    sleep 10
+  else
+    warn "  retrying in 30s..."
+    sleep 30
+  fi
 done
 
 if [ "$ASSIGNED" = false ]; then
-  err "Failed to assign static IP after 5 attempts (total wait: 150s + 90s pre-flight)"
-  warn "Manual recovery — run these commands:"
-  warn "  gcloud compute instances delete-access-config $INSTANCE --zone=$INSTANCE_ZONE --access-config-name=external-nat --quiet"
-  warn "  sleep 10"
-  warn "  gcloud compute instances add-access-config $INSTANCE --zone=$INSTANCE_ZONE --access-config-name=external-nat --address=$STATIC_IP --quiet"
-  warn "  gcloud compute ssh $INSTANCE --zone=$INSTANCE_ZONE --command='docker restart \$(docker ps -q | head -1)'"
+  err "Failed to assign static IP after 8 attempts (~330s of retries, after up to 180s pre-flight)"
+  warn "Manual recovery — run these commands (re-resolve INSTANCE first):"
+  warn "  INSTANCE=\$(gcloud compute instances list --filter='name~super-legal-staging AND status=RUNNING' --format='value(name)' | head -1)"
+  warn "  gcloud compute instances delete-access-config \$INSTANCE --zone=$INSTANCE_ZONE --access-config-name=external-nat --quiet"
+  warn "  sleep 15"
+  warn "  gcloud compute instances add-access-config \$INSTANCE --zone=$INSTANCE_ZONE --access-config-name=external-nat --address=$STATIC_IP --quiet"
+  warn "  gcloud compute ssh \$INSTANCE --zone=$INSTANCE_ZONE --command='docker restart \$(docker ps -q | head -1)'"
   FINAL_IP=$(gcloud compute instances describe "$INSTANCE" --zone="$INSTANCE_ZONE" \
     --format="value(networkInterfaces[0].accessConfigs[0].natIP)" 2>/dev/null)
   warn "Using ephemeral IP: $FINAL_IP"