consult: 3-way CMAP silently degrades when gemini/codex fail #837

@swiftraccoon

Bug: 3-way CMAP infrastructure degrades silently when gemini/codex fail

Summary

The 3-way CMAP integration review prescribed by the architect-role doc (consult -m gemini/codex/claude --type integration --issue N, run in parallel) silently degrades to a 1-way review when the gemini and/or codex consults fail. Nothing alerts the operator when a model's success rate is 0%. The review pipeline continues as if all three succeeded and produces a claude-only review, leaving the architect to notice the gap and synthesize from a single source.

Reproduction

consult stats --days 30 --json

In our project (apple-bluetooth, private), this returns:

{
  "byModel": [
    {"model": "claude", "count": 134, "successRate": 99.25, "successCount": 133},
    {"model": "gemini", "count": 84,  "successRate": 0,     "successCount": 0},
    {"model": "codex",  "count": 81,  "successRate": 0,     "successCount": 0}
  ]
}

That is 165 gemini and codex invocations (84 + 81) over 30 days, every one of which failed, and nothing surfaced this until the architect ran consult stats directly.
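The check the architect currently does by eye can be sketched in a few lines of Python. This is a minimal sketch, assuming the JSON shape shown above (a `byModel` list with `model`, `count`, and `successRate` fields). The function name `degraded_models` and the 50% threshold are hypothetical and not part of consult.

```python
import json

# Sample copied from the `consult stats --days 30 --json` output above.
SAMPLE_STATS = """
{
  "byModel": [
    {"model": "claude", "count": 134, "successRate": 99.25, "successCount": 133},
    {"model": "gemini", "count": 84,  "successRate": 0,     "successCount": 0},
    {"model": "codex",  "count": 81,  "successRate": 0,     "successCount": 0}
  ]
}
"""


def degraded_models(stats_json: str, min_success_rate: float = 50.0) -> list[str]:
    """Return models that were invoked at least once but fall below the threshold.

    The threshold is an assumed default for illustration only.
    """
    stats = json.loads(stats_json)
    return [
        entry["model"]
        for entry in stats.get("byModel", [])
        if entry["count"] > 0 and entry["successRate"] < min_success_rate
    ]


for model in degraded_models(SAMPLE_STATS):
    print(f"WARNING: {model} is degraded; 3-way CMAP will not be 3-way")
```

Run against the sample, this flags gemini and codex and leaves claude alone. A wrapper could run it before each integration review and refuse to proceed, or at least print the warning, when the list is non-empty.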

Specific failure modes observed:

  • gemini: "You have exhausted your capacity on this model" — quota exhausted; consult retries indefinitely (loop until killed)
  • codex: Codex Exec exited with signal SIGKILL — process aborts; no retry; immediate failure

Impact

The architect-role doc prescribes:

consult -m gemini --type integration --issue N --output /tmp/cmap-gemini-N.md &
consult -m codex  --type integration --issue N --output /tmp/cmap-codex-N.md  &
consult -m claude --type integration --issue N --output /tmp/cmap-claude-N.md &
wait

If 2/3 silently fail and the architect doesn't notice, "3-way CMAP" becomes "1-way claude" — defeating the multi-model-review-redundancy design rationale.
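Part of the problem is that a bare `wait` in the shell returns status zero no matter how the background jobs exited, so the prescribed snippet cannot reveal a failed consult by itself. One possible wrapper, sketched in Python, launches the same three commands and reports which ones exited non-zero. The helper names (`cmap_commands`, `run_parallel`, `failed_models`) are hypothetical. The consult flags are copied from the snippet above, and the sketch assumes consult exits non-zero on failure.

```python
import subprocess


def cmap_commands(issue: int) -> dict[str, list[str]]:
    """Build the three consult invocations from the architect-role doc."""
    return {
        model: [
            "consult", "-m", model,
            "--type", "integration",
            "--issue", str(issue),
            "--output", f"/tmp/cmap-{model}-{issue}.md",
        ]
        for model in ("gemini", "codex", "claude")
    }


def run_parallel(commands: dict[str, list[str]]) -> dict[str, int]:
    """Start every command before waiting on any, then collect exit codes.

    On POSIX a negative code means the process was killed by a signal,
    for example -9 for SIGKILL.
    """
    procs = {label: subprocess.Popen(argv) for label, argv in commands.items()}
    return {label: proc.wait() for label, proc in procs.items()}


def failed_models(exit_codes: dict[str, int]) -> list[str]:
    """Return the labels whose process did not exit with status 0."""
    return sorted(label for label, code in exit_codes.items() if code != 0)
```

A caller would run `failed_models(run_parallel(cmap_commands(N)))` and treat any non-empty result as a degraded review, rather than discovering later that two of the three output files are missing or empty.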

Recommended fixes

  1. Pre-flight reachability probe: consult --probe -m <model> does a no-op invocation; cache result for 5 minutes. Fail fast on probe failure before launching the expensive review.

  2. Degraded-mode warning: when consult -m <model> fails, or its success rate over the last N invocations is 0, log a prominent warning. The architect-direction document would treat that warning as "this model is unavailable; proceed with caution OR pin to working models in .codev/config.json".

  3. Reduce gemini retry loop: use 3 retries with exponential backoff capped at 30s, not 6+ retries with 30+s waits. Currently a failed gemini consult ties up a process for ~5 minutes while producing nothing.

  4. Codex SIGKILL diagnostic: surface the actual underlying error rather than just "exited with signal SIGKILL". As it stands, the failure is hard to diagnose without strace.

  5. Architect-role doc note: update the architect-role doc to recommend running consult stats --days 30 periodically and pinning .codev/config.json porch.consultation.models to known-working models when degradation is observed.
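To make fix 3 concrete, here is a minimal sketch of a bounded retry with capped exponential backoff. It is not consult's actual retry code. The 5-second base delay, the function names, and the convention that the attempt callable returns `True` on success are all assumptions for illustration.

```python
import time
from typing import Callable


def backoff_delays(retries: int = 3, base: float = 5.0, cap: float = 30.0) -> list[float]:
    """Delay before each retry: base, 2*base, 4*base, ... with each value capped."""
    return [min(cap, base * 2 ** attempt) for attempt in range(retries)]


def call_with_retries(
    attempt_fn: Callable[[], bool],
    retries: int = 3,
    base: float = 5.0,
    cap: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try once, then retry at most `retries` times, sleeping before each retry.

    Returns True as soon as an attempt succeeds and False once the retry
    budget is spent, so the caller can fail fast instead of looping.
    """
    if attempt_fn():
        return True
    for delay in backoff_delays(retries, base, cap):
        sleep(delay)
        if attempt_fn():
            return True
    return False
```

With these assumed defaults the delays are 5, 10, and 20 seconds, so a model with exhausted quota costs at most 35 seconds of waiting plus four attempts, rather than about 5 minutes. The `sleep` parameter is injectable so the schedule can be tested without actually waiting.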

Workaround we used

For our project, we pinned .codev/config.json:

{
  "porch": {
    "consultation": {
      "models": "claude"
    }
  }
}

This switches porch's per-phase consultation invocations to claude-only (which has 99.25% success). For the architect-side 3-way CMAP integration review, we attempted the full 3-way, noted the gemini/codex failures in the synthesized PR comment, and relied on architect-direct review for the missing 2/3 of reviewer redundancy.

Discovered

2026-05-24 during the integration review of PR #4 in our SPIR-protocol security-research project. We document the workaround in our project's CLAUDE.md.
