feat: /challenge skill - Polya 4-stage plan stress-test #1231
Draft
amargandhi wants to merge 1 commit into garrytan:main from
Pure-analysis plan stress-test grounded in Polya's 1945 *How to Solve It*. Stress-tests a plan across 4 stages (Understand → Devise → Carry out → Look back) with 3-5 hard adversarial questions per stage. Each question includes the agent's recommended answer and a P1/P2/P3 priority. Verdict at top: READY / OPEN QUESTIONS / CRITICAL GAPS.

**Why a new skill:** Plans that land in /ship unchallenged tend to ship bugs that a five-minute stress-test would have caught. The author is too close to the plan to see what's missing. /plan-eng-review covers architecture mechanics; /plan-ceo-review covers scope. Neither walks Polya's stages methodically: unstated assumptions, missing rollback, ambiguous acceptance criteria, what-if-it-takes-3x-longer. /challenge fills that gap.

**Hard gate:** Pure analysis. No plan edits, no code, no implementation. Output is a report at ~/.gstack/challenges/<date>-<slug>.md. If the user wants to apply fixes after, that's a separate /plan-ceo-review, /plan-eng-review, or direct edit invocation.

**Arguments:**
- `--scope <stage>`: focus on one Polya stage instead of all 4
- `--dry-run`: show questions without saving the report

**Where it fits:** Run before /ship when reversibility matters (production database changes, public API changes, anything you can't easily undo). Complements /codex challenge, which gets a second-model opinion: /challenge is methodological + deterministic; /codex is cross-model + non-deterministic.

What this PR adds:
1. `challenge/SKILL.md.tmpl` (354 lines): the skill template. Defines the 4-stage walk with question banks per stage (Q1.1-Q4.5), HARD GATE prose, output format, and ranking heuristics.
2. `scripts/resolvers/preamble/generate-routing-injection.ts`: adds one bullet so Claude/Codex auto-route "stress-test", "poke holes", "what could go wrong", "red-team" prompts to /challenge.
3. Generated SKILL.md across all 10 hosts (Claude, Codex, OpenCode, Cursor, Factory, Slate, Kiro, Hermes, GBrain, OpenClaw): ~12K tokens per skill, well under the 40K ceiling.

Tests: 689/689 pass on `bun test test/skill-validation.test.ts test/gen-skill-docs.test.ts`. Built and tested against upstream/main at v1.15.0.0.
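For reviewers who want a concrete picture of the two arguments and the report location, here is a minimal TypeScript sketch. It is not code from this PR: the skill itself is a prompt template, and every name below (`parseChallengeArgs`, `reportPath`, `slugify`, the lowercase stage identifiers) is hypothetical. The `YYYY-MM-DD` date format and the slug rules are assumptions, since the description only gives the pattern `<date>-<slug>.md`.

```typescript
// Hypothetical sketch only: none of these names come from the PR.
type PolyaStage = "understand" | "devise" | "carry-out" | "look-back";

interface ChallengeOptions {
  scope?: PolyaStage; // --scope <stage>: focus on one stage instead of all 4
  dryRun: boolean;    // --dry-run: show questions without saving the report
}

// Assumed spellings for the four stages named in the description.
const STAGES: PolyaStage[] = ["understand", "devise", "carry-out", "look-back"];

function parseChallengeArgs(argv: string[]): ChallengeOptions {
  const opts: ChallengeOptions = { dryRun: false };
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg === "--dry-run") {
      opts.dryRun = true;
    } else if (arg === "--scope") {
      i += 1;
      const value = argv[i];
      if (value === undefined || STAGES.indexOf(value as PolyaStage) === -1) {
        throw new Error("--scope expects one of: " + STAGES.join(", "));
      }
      opts.scope = value as PolyaStage;
    } else {
      throw new Error("unknown argument: " + arg);
    }
  }
  return opts;
}

// Assumed slug rule: lowercase, runs of non-alphanumerics become one hyphen.
function slugify(title: string): string {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Builds <home>/.gstack/challenges/<date>-<slug>.md.
// The date is rendered as YYYY-MM-DD in UTC, which is an assumption.
function reportPath(homeDir: string, date: Date, planTitle: string): string {
  const day = date.toISOString().slice(0, 10);
  return homeDir + "/.gstack/challenges/" + day + "-" + slugify(planTitle) + ".md";
}
```

Under these assumptions, a caller would skip the write step entirely when `dryRun` is true and otherwise save the report at the path `reportPath` returns. Nothing here edits the plan itself, in keeping with the hard gate.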
Summary
Adds /challenge — a pure-analysis plan stress-test grounded in Polya 1945 How to Solve It. Walks a plan through 4 stages (Understand → Devise → Carry out → Look back) with 3-5 hard adversarial questions per stage. Each question includes the agent's recommended answer and a P1/P2/P3 priority. Verdict at top: READY / OPEN QUESTIONS / CRITICAL GAPS.
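To make the report shape concrete, the sketch below models a question and derives a verdict from the open ones. The three verdict labels and the P1/P2/P3 priorities come from the summary; everything else is an assumption for illustration. In particular, the `resolved` flag and the rule "any open P1 means CRITICAL GAPS, any other open question means OPEN QUESTIONS" are hypothetical, because the description does not say how the template picks the verdict.

```typescript
// Hypothetical data model: field names and the verdict rule are assumptions.
type Priority = "P1" | "P2" | "P3";
type Verdict = "READY" | "OPEN QUESTIONS" | "CRITICAL GAPS";

interface ChallengeQuestion {
  id: string;                // e.g. "Q1.1" (stage 1, question 1)
  stage: 1 | 2 | 3 | 4;      // Understand, Devise, Carry out, Look back
  question: string;
  recommendedAnswer: string; // the agent's recommended answer
  priority: Priority;
  resolved: boolean;         // assumed flag: does the plan already answer this?
}

// Assumed rule: an unresolved P1 is a critical gap; any other unresolved
// question leaves the plan with open questions; otherwise it is ready.
function verdictFor(questions: ChallengeQuestion[]): Verdict {
  const open = questions.filter((q) => !q.resolved);
  if (open.some((q) => q.priority === "P1")) {
    return "CRITICAL GAPS";
  }
  return open.length > 0 ? "OPEN QUESTIONS" : "READY";
}
```

A different mapping (for example, treating several open P2s as critical) would fit the same types; only `verdictFor` would change.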
Why a new skill
Plans that land in /ship unchallenged tend to ship bugs that a 5-minute stress-test would have caught. The failure mode isn't stupidity; it's that the author is too close to the plan to see what's missing. /plan-eng-review covers architecture mechanics; /plan-ceo-review covers scope. Neither walks Polya's stages methodically: unstated assumptions, missing rollback, ambiguous acceptance criteria, "what if it takes 3x longer." /challenge fills that gap.
Complement to /codex challenge: /codex challenge gets a cross-model adversarial opinion. /challenge is methodological + deterministic: same skill, same model, same questions. Use both for high-stakes plans.
Hard gate
Pure analysis. No plan edits, no code, no implementation. Output is a report at ~/.gstack/challenges/<date>-<slug>.md. If the user wants to apply fixes after, that's a separate /plan-ceo-review, /plan-eng-review, or direct edit invocation. Documented prominently in the template.
What this PR adds
- `challenge/SKILL.md.tmpl` (354 lines): defines the 4-stage walk with question banks per stage (Q1.1-Q4.5), HARD GATE prose, output format, and ranking heuristics.
- `scripts/resolvers/preamble/generate-routing-injection.ts`: adds one bullet so Claude/Codex auto-route phrases like "stress-test", "poke holes", "what could go wrong", "red-team" to /challenge.
- Generated SKILL.md across all 10 hosts (Claude, Codex, OpenCode, Cursor, Factory, Slate, Kiro, Hermes, GBrain, OpenClaw): ~12K tokens per generated skill, well under the 40K ceiling.
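The token-ceiling claim can be illustrated with a small check like the one below. This is a sketch, not the repository's validation code: `estimateTokens`, `hostsOverCeiling`, and `GeneratedSkill` are hypothetical names, and the four-characters-per-token estimate is a rough heuristic assumed here because the description does not say how tokens are counted. Only the 40K ceiling comes from the PR text.

```typescript
// Hypothetical sketch: the real generator's token accounting is not shown in this PR.
const TOKEN_CEILING = 40000; // the 40K ceiling mentioned in the description

interface GeneratedSkill {
  host: string;    // e.g. "Claude", "Codex"
  content: string; // generated SKILL.md text for that host
}

// Rough heuristic (assumption): about 4 characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Returns the hosts whose generated skill exceeds the ceiling.
function hostsOverCeiling(
  skills: GeneratedSkill[],
  ceiling: number = TOKEN_CEILING
): string[] {
  return skills
    .filter((skill) => estimateTokens(skill.content) > ceiling)
    .map((skill) => skill.host);
}
```

With this heuristic, a generated skill of roughly 48,000 characters would land near the ~12K figure quoted above, leaving ample headroom under the ceiling.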
Arguments
- `--scope <stage>`: focus on one Polya stage instead of all 4
- `--dry-run`: show questions without saving the report
Test plan
- `bun test test/skill-validation.test.ts test/gen-skill-docs.test.ts`: 689/689 pass
- `bun run gen:skill-docs --host all`: clean regen, no warnings, no token-ceiling alerts
- `runPlanSkillObservation()` harness: out of scope for this PR; happy to add as a follow-up if requested
Notes for review
- /glossary for Evans DDD, /cso --stability for Nygard patterns, /investigate --file-issue, Fowler's 24 code smells in /review, etc.) are larger and I'm holding them for separate PRs based on your appetite.
- /plan-eng-review and /plan-ceo-review.