Bug: 3-way CMAP infrastructure degrades silently when gemini/codex fail
Summary
The architect-role-doc-prescribed 3-way CMAP integration review (consult -m gemini/codex/claude --type integration --issue N in parallel) silently degrades to a 1-way review when gemini and/or codex consults fail. There is no operator-visible alert when a model's success rate is at 0%; the review pipeline continues as if all three succeeded, producing a single-claude review the architect must notice and synthesize.
Reproduction
consult stats --days 30 --json
In our project (apple-bluetooth, private), this returns:
{
"byModel": [
{"model": "claude", "count": 134, "successRate": 99.25, "successCount": 133},
{"model": "gemini", "count": 84, "successRate": 0, "successCount": 0},
{"model": "codex", "count": 81, "successRate": 0, "successCount": 0}
]
}
165 invocations of gemini+codex over 30 days, all failed, with no operator awareness until the architect inspects consult stats directly.
Specific failure modes observed:
- gemini:
"You have exhausted your capacity on this model" — quota exhausted; consult retries indefinitely (loop until killed)
- codex:
Codex Exec exited with signal SIGKILL — process aborts; no retry; immediate failure
Impact
The architect-role doc prescribes:
consult -m gemini --type integration --issue N --output /tmp/cmap-gemini-N.md &
consult -m codex --type integration --issue N --output /tmp/cmap-codex-N.md &
consult -m claude --type integration --issue N --output /tmp/cmap-claude-N.md &
wait
If 2/3 silently fail and the architect doesn't notice, "3-way CMAP" becomes "1-way claude" — defeating the multi-model-review-redundancy design rationale.
Recommended fixes
-
Pre-flight reachability probe: consult --probe -m <model> does a no-op invocation; cache result for 5 minutes. Fail fast on probe failure before launching the expensive review.
-
Degraded-mode warning: when consult -m <model> fails or success-rate over the last N invocations is 0, log a prominent warning that the architect-direction document would treat as "this model is unavailable; proceed with caution OR pin to working models in .codev/config.json".
-
Reduce gemini retry loop: 3 retries with exponential backoff cap at 30s, not 6+ retries with 30+s waits. Currently a failed gemini consult ties up a process for ~5 minutes producing nothing.
-
Codex SIGKILL diagnostic: surface the actual underlying error rather than just "exited with signal SIGKILL". Hard to diagnose without strace.
-
Architect-role doc note: update the architect-role doc to recommend running consult stats --days 30 periodically and pinning .codev/config.json porch.consultation.models to known-working models when degradation is observed.
Workaround we used
For our project, we pinned .codev/config.json:
{
"porch": {
"consultation": {
"models": "claude"
}
}
}
This switches porch's per-phase consultation invocations to claude-only (which has 99.25% success). For the architect-side 3-way CMAP integration review, we attempted the full 3-way, noted the gemini/codex failures in the synthesized PR comment, and relied on architect-direct review for the missing 2/3 of reviewer redundancy.
Discovered
2026-05-24 during the integration review of PR #4 in our SPIR-protocol security-research project. We document the workaround in our project's CLAUDE.md.
Bug: 3-way CMAP infrastructure degrades silently when gemini/codex fail
Summary
The architect-role-doc-prescribed 3-way CMAP integration review (
consult -m gemini/codex/claude --type integration --issue Nin parallel) silently degrades to a 1-way review when gemini and/or codex consults fail. There is no operator-visible alert when a model's success rate is at 0%; the review pipeline continues as if all three succeeded, producing a single-claude review the architect must notice and synthesize.Reproduction
In our project (
apple-bluetooth, private), this returns:{ "byModel": [ {"model": "claude", "count": 134, "successRate": 99.25, "successCount": 133}, {"model": "gemini", "count": 84, "successRate": 0, "successCount": 0}, {"model": "codex", "count": 81, "successRate": 0, "successCount": 0} ] }165 invocations of gemini+codex over 30 days, all failed, with no operator awareness until the architect inspects
consult statsdirectly.Specific failure modes observed:
"You have exhausted your capacity on this model"— quota exhausted; consult retries indefinitely (loop until killed)Codex Exec exited with signal SIGKILL— process aborts; no retry; immediate failureImpact
The architect-role doc prescribes:
If 2/3 silently fail and the architect doesn't notice, "3-way CMAP" becomes "1-way claude" — defeating the multi-model-review-redundancy design rationale.
Recommended fixes
Pre-flight reachability probe:
consult --probe -m <model>does a no-op invocation; cache result for 5 minutes. Fail fast on probe failure before launching the expensive review.Degraded-mode warning: when
consult -m <model>fails or success-rate over the last N invocations is 0, log a prominent warning that the architect-direction document would treat as "this model is unavailable; proceed with caution OR pin to working models in.codev/config.json".Reduce gemini retry loop: 3 retries with exponential backoff cap at 30s, not 6+ retries with 30+s waits. Currently a failed gemini consult ties up a process for ~5 minutes producing nothing.
Codex SIGKILL diagnostic: surface the actual underlying error rather than just "exited with signal SIGKILL". Hard to diagnose without strace.
Architect-role doc note: update the architect-role doc to recommend running
consult stats --days 30periodically and pinning.codev/config.jsonporch.consultation.modelsto known-working models when degradation is observed.Workaround we used
For our project, we pinned
.codev/config.json:{ "porch": { "consultation": { "models": "claude" } } }This switches porch's per-phase consultation invocations to claude-only (which has 99.25% success). For the architect-side 3-way CMAP integration review, we attempted the full 3-way, noted the gemini/codex failures in the synthesized PR comment, and relied on architect-direct review for the missing 2/3 of reviewer redundancy.
Discovered
2026-05-24 during the integration review of PR #4 in our SPIR-protocol security-research project. We document the workaround in our project's CLAUDE.md.