Merged
63 changes: 28 additions & 35 deletions .claude/skills/deploy/SKILL.md
@@ -77,7 +77,7 @@ If gcloud auth fails with `Reauthentication failed` or `cannot prompt during non
| Docker push fails | Check Docker auth: `gcloud auth configure-docker us-east1-docker.pkg.dev`. If network error, retry. |
| `ZONE_RESOURCE_POOL_EXHAUSTED` | GCE has no capacity in `us-east1-c`. Wait 10-30 min and retry — GCE capacity is transient. If persistent, try changing machine type in deploy.sh (n2-standard-2 uses a different capacity pool than e2-medium). |
| No running instance after 160s | GCE capacity exhausted in all zones. Wait and retry, or manually create in a different zone. |
| Static IP assignment fails | Script retries 3 times. If still fails, the old instance may still hold the IP — wait 30s for MIG replacement. Static IP only works in us-east1 zones. |
| Static IP assignment fails | Script retries 8 times (~330s total budget including 180s pre-flight) and re-resolves the MIG instance name on every retry, so a mid-loop MIG roll no longer leaves it targeting a deleted instance. If it still fails, follow the manual recovery commands printed at the end of Step 7. Static IP only works in us-east1 zones. |
| Health check fails via SSH | Container may still be starting — wait 60s and check again. If DB shows ETIMEDOUT, the static IP `34.26.70.60` may not be in the Cloud SQL whitelist. |
| Two instances running | Multiple MIGs may exist. The script finds the first RUNNING instance regardless of zone. If stale MIGs are found, delete them: `gcloud compute instance-groups managed delete super-legal-staging --zone=<stale-zone> --quiet` |
| SSH `REMOTE HOST IDENTIFICATION HAS CHANGED` | Instance was replaced — new host key. Script clears these automatically, but if manual SSH is needed: `sed -i '' '/compute\./d' ~/.ssh/google_compute_known_hosts` |
@@ -124,54 +124,47 @@ gcloud compute ssh super-legal-staging-XXXX --zone=us-east1-c \
```
Wait 60s, then verify via `curl http://34.26.70.60:3001/health | jq '.reconciliation'` — `error` field should disappear.

### Static IP assignment race (recurring)
### Static IP assignment race (recurring — two distinct failure modes)

**Observed on**: 2026-04-27 v6.7.0 deploy AND 2026-04-28 v6.8.0 deploy.
The script handles two related race conditions in Step 7. Both have hit production multiple times; both fixes are now in `deploy.sh`.

**Symptom**: Step 7 reports `Failed to assign static IP after N attempts`. Container ends up on an ephemeral IP (e.g., `34.23.108.165`). Cloud SQL whitelist only includes `34.26.70.60`, so `ensureHookSchema()` fails.
#### Pattern A — OLD instance still holds the IP

**Root cause**: GCE Managed Instance Group terminates the OLD instance and creates the NEW one in parallel. The OLD instance still holds `34.26.70.60` for ~30-60s during graceful termination. The script's previous retry budget (3 attempts × 15s wait = 45s window) was too short.
**Observed on**: 2026-04-27 v6.7.0, 2026-04-28 v6.8.0.

**Fix in deploy.sh** (Step 7, applied 2026-04-28):
1. Pre-flight wait: poll `gcloud compute addresses list --filter="address=$STATIC_IP"` until status is `RESERVED` (not `IN_USE`). Up to 90s window.
2. Bumped retry attempts from 3 → 5 with 30s wait (total: 150s).
3. Captured stderr to a tempfile so failures surface in the log instead of being silently swallowed by `2>/dev/null`.
4. On final failure, print the manual recovery commands directly to stderr.
**Symptom**: Step 7 retries fail because the OLD instance still has `34.26.70.60` bound during its graceful-termination window (~30-60s).

**Manual recovery if step 7 still fails**:
```bash
INSTANCE=$(gcloud compute instances list --filter="name~super-legal-staging" --format="value(name)" | head -1)
gcloud compute instances delete-access-config $INSTANCE --zone=us-east1-c --access-config-name=external-nat --quiet
sleep 10
gcloud compute instances add-access-config $INSTANCE --zone=us-east1-c --access-config-name=external-nat --address=34.26.70.60 --quiet
gcloud compute ssh $INSTANCE --zone=us-east1-c --command='docker restart $(docker ps -q | head -1)'
```
**Fix in deploy.sh** (applied 2026-04-28, widened 2026-05-16):
1. Pre-flight: poll `gcloud compute addresses list --filter="address=$STATIC_IP"` until status is `RESERVED` (not `IN_USE`). **12 × 15s = 180s window** (was 9 × 10s = 90s).
2. Per-retry: 15s sleep between `delete-access-config` and `add-access-config` (was 5s — matches the duration that consistently works in manual recovery).
3. Captured stderr to a tempfile so failures surface in the log.
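
The pre-flight wait in item 1 can be exercised without touching GCE. The following is a minimal dry-run sketch, not the code in `deploy.sh`: `fetch_address_status`, `wait_for_release`, `STUB_CALLS`, `WAITED`, and `SLEEP_CMD` are hypothetical names, and the stub simply reports `IN_USE` twice before `RESERVED`. In real use the stub would be replaced by the `gcloud compute addresses list` query and `SLEEP_CMD` set to `sleep`.

```bash
#!/usr/bin/env bash
# Dry-run sketch of the pre-flight poll. All names here are illustrative.

# Hypothetical stub standing in for:
#   gcloud compute addresses list --filter="address=$STATIC_IP" --format="value(status)"
# It sets ADDR_STATUS to IN_USE on the first two calls and RESERVED afterwards.
STUB_CALLS=0
fetch_address_status() {
  STUB_CALLS=$((STUB_CALLS + 1))
  if [ "$STUB_CALLS" -ge 3 ]; then
    ADDR_STATUS="RESERVED"
  else
    ADDR_STATUS="IN_USE"
  fi
}

# wait_for_release MAX_POLLS INTERVAL_SECONDS
# Returns 0 as soon as the address reports RESERVED, 1 if the budget runs out.
# WAITED holds the number of seconds that would have been slept.
# SLEEP_CMD defaults to `true` (no real sleeping); set SLEEP_CMD=sleep for real waits.
wait_for_release() {
  max_polls=$1
  interval=$2
  WAITED=0
  poll=1
  while [ "$poll" -le "$max_polls" ]; do
    fetch_address_status
    if [ "$ADDR_STATUS" = "RESERVED" ]; then
      return 0
    fi
    "${SLEEP_CMD:-true}" "$interval"
    WAITED=$((WAITED + interval))
    poll=$((poll + 1))
  done
  return 1
}

# 12 polls x 15s mirrors the widened 180s window.
if wait_for_release 12 15; then
  echo "released after ${WAITED}s"
else
  echo "still bound after ${WAITED}s"
fi
```

Keeping the sleep command injectable means the loop bounds can be checked in seconds rather than minutes; with the stub above the address is seen as released on the third poll.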

### Variant: MIG instance replacement mid-retries
#### Pattern B — MIG rolls the instance mid-retry-loop

**Observed on**: 2026-05-06 v7.0.1 deploy.
**Observed on**: 2026-05-06 v7.0.1, then THREE consecutive deploys 2026-05-12 / 2026-05-15 / 2026-05-16.

**Symptom**: Step 7 retries 5x with `Could not fetch resource: super-legal-staging-XXXX` even though the script logs `IP is RESERVED` and proceeds to `Attempt 1/5: Assigning ...`. The log line keeps showing the SAME instance name across all 5 attempts. Meanwhile, `gcloud compute instances list` reveals a DIFFERENT instance name is actually running.
**Symptom**: Script logs `IP is RESERVED` then `Attempt 1/N: Assigning <name>` — but the log keeps targeting the SAME `<name>` across all retries while `gcloud compute instances list` shows a DIFFERENT running instance. Every `add-access-config` returns `Could not fetch resource: <stale-name>`.

**Root cause**: The MIG terminated the instance the script was targeting (e.g., `super-legal-staging-0239`) and rolled forward to a new one (e.g., `super-legal-staging-bzx4`) DURING step 7's retry budget. The script captured the original instance name in step 6 and did not re-resolve it on each retry. Every `add-access-config` call hits a deleted resource.
**Root cause**: The MIG terminated the instance the script captured in Step 6 and rolled to a new one DURING Step 7's retry budget, so the script's `$INSTANCE` variable went stale and every retry targeted a deleted resource.
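
The stale-handle behaviour can be reproduced in isolation. This is a toy sketch with stubs in place of `gcloud`, assuming hypothetical instance names (`super-legal-staging-aaaa` for the deleted instance, `super-legal-staging-bbbb` for the running one) and hypothetical helpers `assign_ip` and `try_assign`; it only illustrates why a cached name fails on every attempt while a freshly resolved name succeeds.

```bash
#!/usr/bin/env bash
# Toy reproduction of Pattern B. Names and helpers are illustrative only.

# Stub state: the instance the MIG is running right now (hypothetical name).
RUNNING_INSTANCE="super-legal-staging-bbbb"

# Stand-in for:
#   gcloud compute instances list \
#     --filter='name~super-legal-staging AND status=RUNNING' --format='value(name)'
resolve_current_instance() {
  echo "$RUNNING_INSTANCE"
}

# Stand-in for add-access-config: succeeds only for the running instance,
# otherwise mimics the error text quoted in the symptom.
assign_ip() {
  if [ "$1" = "$RUNNING_INSTANCE" ]; then
    return 0
  fi
  echo "Could not fetch resource: $1" >&2
  return 1
}

# try_assign MODE INITIAL_NAME ATTEMPTS
# MODE=cached reuses INITIAL_NAME on every attempt (the old behaviour);
# MODE=resolve re-queries the running instance before each attempt.
try_assign() {
  mode=$1
  target=$2
  attempts=$3
  attempt=1
  while [ "$attempt" -le "$attempts" ]; do
    if [ "$mode" = "resolve" ]; then
      target=$(resolve_current_instance)
    fi
    if assign_ip "$target" 2>/dev/null; then
      echo "assigned to $target on attempt $attempt"
      return 0
    fi
    attempt=$((attempt + 1))
  done
  echo "failed after $attempts attempts targeting $target"
  return 1
}

try_assign cached  "super-legal-staging-aaaa" 5 || true
try_assign resolve "super-legal-staging-aaaa" 5
```

The cached run exhausts all five attempts against the deleted name, while the resolving run succeeds on its first attempt because it never trusts the name captured earlier.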

**Detection between retries**:
```bash
gcloud compute instances list --filter='name~super-legal-staging AND status=RUNNING' --format='value(name)'
```
If this returns a different instance name than what the script's log shows, the variant has triggered.
**Fix in deploy.sh** (applied 2026-05-16):
1. **Re-resolve `$INSTANCE` on EVERY retry iteration** via a `resolve_current_instance()` helper that queries `gcloud compute instances list --filter='name~super-legal-staging AND status=RUNNING'`. The script now re-targets and logs `MIG rolled: <old> → <new> — re-targeting` whenever the name changes between iterations.
2. **On "Could not fetch resource" errors specifically**, sleep 10s (not 30s) before re-resolving — fast recovery on the most common transient.
3. **Bumped retries 5 → 8** with re-resolution on each, expanding the wall budget from 150s to ~330s (with 180s pre-flight). Total step 7 budget: ~510s (~8.5 min).
4. **Updated manual recovery commands** in the failure message to re-resolve `INSTANCE` first (was using whatever the script had cached).
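
The budget figures can be sanity-checked by summing only the explicit sleeps in Step 7. This is a rough sketch under stated assumptions: the constants are taken from the fix lists (12 × 15s pre-flight, 8 retries, 15s between delete and add, 30s or 10s back-off), every attempt is assumed to fail, and `gcloud` call latency and the 20s wait taken when no RUNNING instance is found are ignored. The helper names `preflight_budget` and `retry_budget` are hypothetical, and the result is an estimate rather than a measured runtime, so it need not match the rounded figures quoted in the text.

```bash
#!/usr/bin/env bash
# Sleep-only estimate of the Step 7 budget. Helper names are illustrative.

PREFLIGHT_POLLS=12        # pre-flight poll count
PREFLIGHT_INTERVAL=15     # seconds between pre-flight polls
RETRIES=8                 # assignment attempts
DELETE_TO_ADD_SLEEP=15    # seconds between delete-access-config and add-access-config

# Seconds slept if the address is never released during pre-flight.
preflight_budget() {
  echo $(( PREFLIGHT_POLLS * PREFLIGHT_INTERVAL ))
}

# retry_budget BACKOFF_SECONDS
# Seconds slept if every attempt fails with the given back-off after each failure.
retry_budget() {
  echo $(( RETRIES * (DELETE_TO_ADD_SLEEP + $1) ))
}

echo "pre-flight:                  $(preflight_budget)s"
echo "retries with 30s back-off:   $(retry_budget 30)s"
echo "retries with 10s back-off:   $(retry_budget 10)s"
echo "pre-flight + 30s back-off:   $(( $(preflight_budget) + $(retry_budget 30) ))s"
```

With these constants the sketch reports 180s for pre-flight, 360s and 200s for the two retry cases, and 540s for the combined sleep-only worst case.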

#### Manual recovery if Step 7 still exhausts both budgets

**Manual recovery on the new instance**:
```bash
NEW_INSTANCE=$(gcloud compute instances list --filter='name~super-legal-staging AND status=RUNNING' --format='value(name)' | head -1)
gcloud compute instances delete-access-config $NEW_INSTANCE --zone=us-east1-c --access-config-name=external-nat --quiet
sleep 10
gcloud compute instances add-access-config $NEW_INSTANCE --zone=us-east1-c --access-config-name=external-nat --address=34.26.70.60 --quiet
sed -i '' '/compute\./d' ~/.ssh/google_compute_known_hosts
gcloud compute ssh $NEW_INSTANCE --zone=us-east1-c --command='docker restart $(docker ps -q | head -1)'
INSTANCE=$(gcloud compute instances list --filter='name~super-legal-staging AND status=RUNNING' --format='value(name)' | head -1)
gcloud compute instances delete-access-config $INSTANCE --zone=us-east1-c --access-config-name=external-nat --quiet
sleep 15
gcloud compute instances add-access-config $INSTANCE --zone=us-east1-c --access-config-name=external-nat --address=34.26.70.60 --quiet
sed -i '' '/compute\./d' ~/.ssh/google_compute_known_hosts # if SSH balks at host-key change
gcloud compute ssh $INSTANCE --zone=us-east1-c --command='docker restart $(docker ps -q | head -1)'
```

Wait 60s, then verify via `curl http://34.26.70.60:3001/health`.
Wait 60s, then verify via `curl http://34.26.70.60:3001/health | jq '.dependencies.database'`. `status: "ok"` confirms the static-IP + Cloud-SQL-whitelist path is working again.

**Future deploy.sh hardening** (not yet implemented): Step 7's retry loop should re-resolve the instance name on each attempt:
```bash
91 changes: 65 additions & 26 deletions .claude/skills/deploy/scripts/deploy.sh
@@ -142,22 +142,31 @@ log "Ensuring firewall tags on $INSTANCE..."
gcloud compute instances add-tags "$INSTANCE" --zone="$INSTANCE_ZONE" --tags=super-legal-mcp 2>/dev/null || true

# ── Step 7: Assign static IP ────────────────────────────────────────────
# The recurring failure mode (observed v6.7.0 + v6.8.0): MIG terminates the
# OLD instance and creates the new one in parallel; the OLD instance still
# holds 34.26.70.60 for ~30-60s during graceful termination. Trying to assign
# the static IP to the NEW instance during that window fails with errors
# silently swallowed by 2>/dev/null. Three fixes applied:
# 1) Pre-flight: poll until the address resource shows status=RESERVED
# (or no users), proving the previous instance has fully released it.
# 2) Bumped retries 3→5 with 30s waits (total window 150s vs old 45s).
# 3) Captured stderr to a file so failure messages surface in the log.
# Two recurring failure modes (Apr–May 2026):
# A) Race (v6.7.0, v6.8.0): MIG terminates OLD instance + creates NEW one
# in parallel; OLD still holds 34.26.70.60 for ~30-60s during graceful
# termination. add-access-config to NEW conflicts.
# B) Stale INSTANCE handle (v7.0.1, 2026-05-12, 2026-05-15, 2026-05-16):
# MIG rolls the instance AGAIN during retry budget. Script captured
# $INSTANCE in step 6; every retry hammers a dead resource and gets
# "Could not fetch resource: <stale-name>". Hit 3× in a row May 2026.
#
# Fixes applied (2026-05-16 — closes pattern B):
# 1) Pre-flight: 12 × 15s = 180s waiting for status=RESERVED (was 90s).
# 2) Re-resolve $INSTANCE from MIG on EVERY retry iteration (the manual
# recovery does this implicitly; the script did not).
# 3) Bumped retries 5 → 8, sleep 5 → 15 between delete+add (matches the
# sleep duration that consistently works in manual recovery).
# 4) On "Could not fetch resource" specifically, log it as a roll-event
# and re-resolve before next retry without waiting the full 30s.
# 5) Captured stderr to a file so failure messages surface in the log.
step "Step 7: Assign static IP ($STATIC_IP)"

# Pre-flight: wait for the static address to be released by the old instance.
# Up to 90s window — covers GCE graceful-termination latency.
# Up to 180s window — covers GCE graceful-termination latency + 2nd MIG roll.
log "Pre-flight: waiting for $STATIC_IP to be released by previous instance..."
RELEASED=false
for wait_attempt in 1 2 3 4 5 6 7 8 9; do
for wait_attempt in 1 2 3 4 5 6 7 8 9 10 11 12; do
ADDR_STATUS=$(gcloud compute addresses list \
--filter="address=$STATIC_IP" \
--format="value(status,users)" 2>/dev/null | head -1)
@@ -167,42 +176,72 @@ for wait_attempt in 1 2 3 4 5 6 7 8 9; do
RELEASED=true
break
fi
log " attempt $wait_attempt: $ADDR_STATUS"
sleep 10
log " attempt $wait_attempt/12: $ADDR_STATUS"
sleep 15
done
if [ "$RELEASED" = false ]; then
warn "Static IP still bound after 90s — proceeding anyway"
warn "Static IP still bound after 180s — proceeding anyway"
fi

# Helper: re-resolve the current running MIG instance name. Returns empty if
# nothing is RUNNING. Used at every retry to defeat pattern B.
resolve_current_instance() {
gcloud compute instances list \
--filter="name~super-legal-staging AND status=RUNNING" \
--format="value(name)" 2>/dev/null | head -1
}

ASSIGNED=false
ERR_FILE=$(mktemp)
trap 'rm -f "$ERR_FILE"' EXIT
for attempt in 1 2 3 4 5; do
log "Attempt $attempt/5: Assigning $STATIC_IP to $INSTANCE..."
for attempt in 1 2 3 4 5 6 7 8; do
# Re-resolve INSTANCE on every iteration — protects against MIG rolling
# mid-loop (pattern B). $INSTANCE_ZONE assumed stable (single-zone MIG).
CURRENT_INSTANCE=$(resolve_current_instance)
if [ -z "$CURRENT_INSTANCE" ]; then
warn " attempt $attempt/8: no RUNNING instance found — MIG may be rolling, waiting 20s"
sleep 20
continue
fi
if [ "$CURRENT_INSTANCE" != "$INSTANCE" ]; then
log " MIG rolled: $INSTANCE → $CURRENT_INSTANCE — re-targeting"
INSTANCE="$CURRENT_INSTANCE"
fi
log "Attempt $attempt/8: Assigning $STATIC_IP to $INSTANCE..."
# Remove existing access config if present (try lowercase first, fallback uppercase)
gcloud compute instances delete-access-config "$INSTANCE" --zone="$INSTANCE_ZONE" \
--access-config-name="external-nat" --quiet 2>/dev/null || \
gcloud compute instances delete-access-config "$INSTANCE" --zone="$INSTANCE_ZONE" \
--access-config-name="External NAT" --quiet 2>/dev/null || true
sleep 5 # let the delete propagate before re-adding
sleep 15 # let the delete propagate; 5s was too short, 10–15s consistently works
if gcloud compute instances add-access-config "$INSTANCE" --zone="$INSTANCE_ZONE" \
--access-config-name="external-nat" --address="$STATIC_IP" 2>"$ERR_FILE"; then
log "Static IP assigned successfully"
ASSIGNED=true
break
fi
warn " failure: $(head -1 "$ERR_FILE")"
warn " retrying in 30s..."
sleep 30
ERR_MSG=$(head -1 "$ERR_FILE")
warn " failure: $ERR_MSG"
# Pattern B detection: "Could not fetch resource" usually means $INSTANCE
# was deleted by MIG mid-call. Re-resolve on the next iteration immediately
# rather than waiting the full 30s.
if echo "$ERR_MSG" | grep -q "Could not fetch resource"; then
warn " → likely MIG rolled the instance; re-resolving in 10s"
sleep 10
else
warn " retrying in 30s..."
sleep 30
fi
done

if [ "$ASSIGNED" = false ]; then
err "Failed to assign static IP after 5 attempts (total wait: 150s + 90s pre-flight)"
warn "Manual recovery — run these commands:"
warn " gcloud compute instances delete-access-config $INSTANCE --zone=$INSTANCE_ZONE --access-config-name=external-nat --quiet"
warn " sleep 10"
warn " gcloud compute instances add-access-config $INSTANCE --zone=$INSTANCE_ZONE --access-config-name=external-nat --address=$STATIC_IP --quiet"
warn " gcloud compute ssh $INSTANCE --zone=$INSTANCE_ZONE --command='docker restart \$(docker ps -q | head -1)'"
err "Failed to assign static IP after 8 attempts (~330s total: 180s pre-flight + 8 retries)"
warn "Manual recovery — run these commands (re-resolve INSTANCE first):"
warn " INSTANCE=\$(gcloud compute instances list --filter='name~super-legal-staging AND status=RUNNING' --format='value(name)' | head -1)"
warn " gcloud compute instances delete-access-config \$INSTANCE --zone=$INSTANCE_ZONE --access-config-name=external-nat --quiet"
warn " sleep 15"
warn " gcloud compute instances add-access-config \$INSTANCE --zone=$INSTANCE_ZONE --access-config-name=external-nat --address=$STATIC_IP --quiet"
warn " gcloud compute ssh \$INSTANCE --zone=$INSTANCE_ZONE --command='docker restart \$(docker ps -q | head -1)'"
FINAL_IP=$(gcloud compute instances describe "$INSTANCE" --zone="$INSTANCE_ZONE" \
--format="value(networkInterfaces[0].accessConfigs[0].natIP)" 2>/dev/null)
warn "Using ephemeral IP: $FINAL_IP"