fix: remove parallel_sub_agents experiment from smoke-pi workflow #37344
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -17,24 +17,6 @@ permissions: | |
| issues: read | ||
| pull-requests: read | ||
| name: Smoke Pi | ||
| experiments: | ||
| sub_agent_decomposition: | ||
| variants: [single_agent, parallel_sub_agents] | ||
| description: "Test whether decomposing smoke tests into parallel sub-agents reduces token cost" | ||
| hypothesis: "H0: no change in effective token consumption. H1: parallel sub-agents reduce tokens by 15-25% by eliminating unnecessary context sharing" | ||
| metric: effective_token_count | ||
| secondary_metrics: [run_duration_seconds, test_pass_rate, false_failure_rate] | ||
| guardrail_metrics: | ||
| - name: test_completion_rate | ||
| threshold: ">=0.95" | ||
| - name: overall_pass_rate | ||
| threshold: ">=0.80" | ||
| min_samples: 20 | ||
| weight: [50, 50] | ||
| start_date: "2026-05-22" | ||
| analysis_type: mann_whitney | ||
| tags: [cost_optimization, smoke_tests, pi_engine] | ||
| # issue: PLACEHOLDER_ISSUE_NUMBER | ||
| engine: | ||
| id: pi | ||
| model: copilot/gpt-5.4 | ||
|
|
@@ -96,26 +78,13 @@ timeout-minutes: 10 | |
|
|
||
| ## Test Requirements | ||
|
|
||
| {{#if experiments.sub_agent_decomposition == 'parallel_sub_agents'}} | ||
| Launch five parallel `task` agents using mode: "background" to execute each smoke test independently. Use the `task` agent type with `description` field for each: | ||
|
|
||
| 1. **GitHub MCP Test Agent**: Fetch 2 merged PR titles from ${{ github.repository }} | ||
| 2. **Web Fetch Test Agent**: Fetch https://github.com and verify "GitHub" in response using web-fetch MCP | ||
| 3. **File I/O Test Agent**: Create `/tmp/gh-aw/agent/smoke-test-pi-${{ github.run_id }}.txt` with timestamp | ||
| 4. **Bash Test Agent**: Verify file creation with `cat` command | ||
| 5. **Build Test Agent**: Run `GOCACHE=/tmp/gh-aw/agent/go-cache GOMODCACHE=/tmp/gh-aw/agent/go-mod make build` | ||
|
|
||
| Wait for all five agents to complete (you'll receive notifications). Read each agent's result using `read_agent`. Aggregate the results into a unified report with ✅/❌ status for each test. | ||
|
|
||
| {{else}} | ||
| Execute the following tests sequentially in a single turn: | ||
|
Contributor
[/tdd] The bug was caught in production (~50% failure rate, ~17K tokens wasted per run). There is no regression test or compile-time check to detect if someone re-introduces an incompatible experiment variant for Pi.
💡 Possible guard options
Option 1 is most robust; Option 3 is the lowest-friction starting point.
||
|
|
||
| 1. **GitHub MCP Testing**: Use GitHub MCP tools to fetch details of exactly 2 merged pull requests from ${{ github.repository }} (title and number only) | ||
| 2. **Web Fetch Testing**: Use the web-fetch MCP tool to fetch https://github.com and verify the response contains "GitHub" (do NOT use bash or playwright for this test - use the web-fetch MCP tool directly) | ||
| 3. **File Writing Testing**: Create a test file `/tmp/gh-aw/agent/smoke-test-pi-${{ github.run_id }}.txt` with content "Smoke test passed for Pi at $(date)" (create the directory if it doesn't exist) | ||
| 4. **Bash Tool Testing**: Execute bash commands to verify file creation was successful (use `cat` to read the file back) | ||
| 5. **Build gh-aw**: Run `GOCACHE=/tmp/gh-aw/agent/go-cache GOMODCACHE=/tmp/gh-aw/agent/go-mod make build` to verify the agent can successfully build the gh-aw project. If the command fails, mark this test as ❌ and report the failure. | ||
| {{/if}} | ||
|
|
||
| ## Output | ||
|
|
||
|
|
||
[/diagnose] The `id: pi` config gives no indication that Pi runs in single-pass mode (`--print --mode json --no-session`) and therefore cannot support experiments requiring `task`/`read_agent` tools. Without this being documented here, the next engineer authoring an experiment for this workflow faces the same trap.
💡 Suggested addition
Add a comment directly below `id: pi`:
This makes the constraint discoverable at the point where future experiments are authored rather than only surfacing in production failures.
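For illustration only, here is a rough sketch of what a lightweight guard against this class of regression might look like. It is not one of the options proposed in the review and does not reflect how gh-aw validates workflows. The capability table `UNSUPPORTED_TOOLS` and the function `find_unsupported_tools` are hypothetical names. The only fact taken from this thread is that Pi cannot use `task` or `read_agent`. The check simply scans prompt text for backticked tool names, which is a heuristic and not a real parser.

```python
import re

# Hypothetical capability table: engine id -> tools that engine cannot use.
# Only the "pi" entry comes from the review comment; the table is an assumption.
UNSUPPORTED_TOOLS = {"pi": {"task", "read_agent"}}

# Matches simple identifiers wrapped in backticks, e.g. `task` or `read_agent`.
BACKTICKED_NAME = re.compile(r"`([A-Za-z_][A-Za-z0-9_]*)`")


def find_unsupported_tools(engine_id: str, prompt_text: str) -> set:
    """Return unsupported tool names that the prompt mentions in backticks."""
    mentioned = set(BACKTICKED_NAME.findall(prompt_text))
    return mentioned & UNSUPPORTED_TOOLS.get(engine_id, set())


if __name__ == "__main__":
    variant_prompt = (
        "Launch five parallel `task` agents. "
        "Read each agent's result using `read_agent`."
    )
    offending = find_unsupported_tools("pi", variant_prompt)
    if offending:
        print("prompt references tools unavailable to this engine:", sorted(offending))
```

A check along these lines could run in CI against each experiment variant's prompt branch. A stricter version would need to inspect the compiled workflow instead of pattern-matching prose.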