feat: /challenge skill — Polya 4-stage plan stress-test#1231

Draft
amargandhi wants to merge 1 commit into garrytan:main from amargandhi:pr/challenge-skill
Conversation

@amargandhi

Summary

Adds /challenge — a pure-analysis plan stress-test grounded in Polya's 1945 *How to Solve It*. Walks a plan through 4 stages (Understand → Devise → Carry out → Look back) with 3-5 hard adversarial questions per stage. Each question includes the agent's recommended answer and a P1/P2/P3 priority. Verdict at top: READY / OPEN QUESTIONS / CRITICAL GAPS.
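To make the review concrete, here is a minimal sketch of the report shape the summary describes. The type names and the priority-to-verdict mapping are assumptions for illustration, not taken from SKILL.md.tmpl:

```typescript
// Hypothetical shape of a /challenge report. The verdict rule below
// (any P1 => CRITICAL GAPS, any P2 => OPEN QUESTIONS, else READY) is
// one plausible reading of the priorities, not the skill's actual logic.

type Priority = "P1" | "P2" | "P3";
type Stage = "Understand" | "Devise" | "Carry out" | "Look back";

interface Question {
  id: string; // e.g. "Q2.3" — stage 2, question 3
  stage: Stage;
  text: string;
  recommendedAnswer: string;
  priority: Priority;
}

type Verdict = "READY" | "OPEN QUESTIONS" | "CRITICAL GAPS";

function verdict(questions: Question[]): Verdict {
  if (questions.some((q) => q.priority === "P1")) return "CRITICAL GAPS";
  if (questions.some((q) => q.priority === "P2")) return "OPEN QUESTIONS";
  return "READY";
}
```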

Why a new skill

Plans that land in /ship unchallenged tend to ship bugs that a 5-minute stress-test would have caught. The failure mode isn't stupidity — it's that the author is too close to the plan to see what's missing. /plan-eng-review covers architecture mechanics; /plan-ceo-review covers scope. Neither walks Polya's stages methodically — unstated assumptions, missing rollback, ambiguous acceptance criteria, "what if it takes 3x longer." /challenge fills that gap.

Complement to /codex challenge: /codex challenge gets a cross-model adversarial opinion. /challenge is methodological + deterministic — same skill, same model, same questions. Use both for high-stakes plans.

Hard gate

Pure analysis. No plan edits, no code, no implementation. Output is a report at ~/.gstack/challenges/<date>-<slug>.md. If the user wants to apply fixes after, that's a separate /plan-ceo-review, /plan-eng-review, or direct edit invocation. Documented prominently in the template.
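The report path convention (`~/.gstack/challenges/<date>-<slug>.md`) can be sketched as below. `slugify` and `reportPath` are illustrative helpers, not the skill's actual code:

```typescript
// Sketch of the ~/.gstack/challenges/<date>-<slug>.md naming scheme.
// Helper names are hypothetical.

function slugify(title: string): string {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse non-alphanumeric runs to hyphens
    .replace(/^-+|-+$/g, ""); // trim leading/trailing hyphens
}

function reportPath(planTitle: string, date: Date): string {
  const day = date.toISOString().slice(0, 10); // YYYY-MM-DD
  return `~/.gstack/challenges/${day}-${slugify(planTitle)}.md`;
}
```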

What this PR adds

  1. challenge/SKILL.md.tmpl (354 lines) — defines the 4-stage walk with question banks per stage (Q1.1-Q4.5), HARD GATE prose, output format, and ranking heuristics.

  2. scripts/resolvers/preamble/generate-routing-injection.ts — adds one bullet so Claude/Codex auto-route phrases like "stress-test", "poke holes", "what could go wrong", "red-team" to /challenge.

  3. Generated SKILL.md across all 10 hosts (Claude, Codex, OpenCode, Cursor, Factory, Slate, Kiro, Hermes, GBrain, OpenClaw) — ~12K tokens per generated skill, well under the 40K ceiling.
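The routing bullet in item 2 asks the host model to route certain phrases to /challenge. The effect it requests looks roughly like the phrase check below — an illustration only; the real generate-routing-injection.ts just emits a preamble bullet, not matching code:

```typescript
// Hypothetical illustration of the routing behavior the preamble bullet
// requests from Claude/Codex. Trigger phrases are taken from the PR text.

const CHALLENGE_TRIGGERS = [
  "stress-test",
  "poke holes",
  "what could go wrong",
  "red-team",
];

function routesToChallenge(prompt: string): boolean {
  const p = prompt.toLowerCase();
  return CHALLENGE_TRIGGERS.some((t) => p.includes(t));
}
```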

Arguments

  • --scope <stage> — focus on one Polya stage instead of all 4
  • --dry-run — show questions without saving the report
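A minimal sketch of how the two flags could be parsed. The kebab-case stage names and the `ChallengeArgs` shape are assumptions for illustration:

```typescript
// Hypothetical parser for /challenge's two flags. Stage argument values
// are assumed to mirror Polya's four stages in kebab-case.

const STAGES = ["understand", "devise", "carry-out", "look-back"] as const;
type StageArg = (typeof STAGES)[number];

interface ChallengeArgs {
  scope?: StageArg; // --scope <stage>: limit the walk to one stage
  dryRun: boolean; // --dry-run: print questions, skip the report file
}

function parseChallengeArgs(argv: string[]): ChallengeArgs {
  const args: ChallengeArgs = { dryRun: false };
  for (let i = 0; i < argv.length; i++) {
    if (argv[i] === "--dry-run") {
      args.dryRun = true;
    } else if (argv[i] === "--scope") {
      const stage = argv[++i] as StageArg;
      if (!STAGES.includes(stage)) throw new Error(`unknown stage: ${stage}`);
      args.scope = stage;
    }
  }
  return args;
}
```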

Test plan

  • bun test test/skill-validation.test.ts test/gen-skill-docs.test.ts — 689/689 pass
  • bun run gen:skill-docs --host all — clean regen, no warnings, no token-ceiling alerts
  • Built against upstream/main at v1.15.0.0
  • Plan-mode E2E coverage via a new runPlanSkillObservation() harness is out of scope for this PR; happy to add it as a follow-up if requested

Notes for review

  • This is one of seven canonical-literature-grounded skills on my fork. The others (/glossary — Evans DDD, /cso --stability — Nygard patterns, /investigate --file-issue, Fowler's 24 code smells in /review, etc.) are larger and I'm holding them for separate PRs based on your appetite.
  • Voice and style follow gstack conventions: short sentences, named sources, no AI vocabulary, no em dashes in the new prose. I've kept the template style consistent with /plan-eng-review and /plan-ceo-review.

Pure-analysis plan stress-test grounded in Polya's 1945 *How to Solve It*.
Stress-tests a plan across 4 stages (Understand → Devise → Carry out →
Look back) with 3-5 hard adversarial questions per stage. Each question
includes the agent's recommended answer and a P1/P2/P3 priority. Verdict
at top: READY / OPEN QUESTIONS / CRITICAL GAPS.

**Why a new skill:** Plans that land in /ship unchallenged tend to ship
bugs that a five-minute stress-test would have caught. The author is too
close to the plan to see what's missing. /plan-eng-review covers
architecture mechanics; /plan-ceo-review covers scope. Neither walks
Polya's stages methodically — unstated assumptions, missing rollback,
ambiguous acceptance criteria, what-if-it-takes-3x-longer. /challenge
fills that gap.

**Hard gate:** Pure analysis. No plan edits, no code, no implementation.
Output is a report at ~/.gstack/challenges/<date>-<slug>.md. If the user
wants to apply fixes after, that's a separate /plan-ceo-review,
/plan-eng-review, or direct edit invocation.

**Arguments:**
- `--scope <stage>` — focus on one Polya stage instead of all 4
- `--dry-run` — show questions without saving the report

**Where it fits:** Run before /ship when reversibility matters
(production database changes, public API changes, anything you can't
easily undo). Complements /codex challenge (which gets a second-model
opinion) — /challenge is methodological + deterministic; /codex is
cross-model + non-deterministic.

What this PR adds:

1. `challenge/SKILL.md.tmpl` (354 lines) — the skill template. Defines
   the 4-stage walk with question banks per stage (Q1.1-Q4.5), HARD GATE
   prose, output format, and ranking heuristics.
2. `scripts/resolvers/preamble/generate-routing-injection.ts` — adds
   one bullet so Claude/Codex auto-route "stress-test", "poke holes",
   "what could go wrong", "red-team" prompts to /challenge.
3. Generated SKILL.md across all 10 hosts (Claude, Codex, OpenCode,
   Cursor, Factory, Slate, Kiro, Hermes, GBrain, OpenClaw) — ~12K tokens
   per skill, well under the 40K ceiling.

Tests: 689/689 pass on `bun test test/skill-validation.test.ts
test/gen-skill-docs.test.ts`.

Built and tested against upstream/main at v1.15.0.0.
