Merged
31 commits
f5f7e32
feat: add shared call-time isConductor() helper
garrytan Jun 12, 2026
ba640b2
test: harden question-preference-hook harness against ambient Conduct…
garrytan Jun 12, 2026
d74a874
feat: PreToolUse hook denies AskUserQuestion in Conductor, redirects …
garrytan Jun 12, 2026
752865f
feat: Conductor renders AskUserQuestion decisions as prose by default
garrytan Jun 12, 2026
ec63a2d
feat: deploy the Conductor AskUserQuestion hook (setup + upgrade migr…
garrytan Jun 12, 2026
788f35f
test: E2E for Conductor prose + fix auto-decide-preserved GSTACK_HOME…
garrytan Jun 12, 2026
fcdcde9
v1.58.0.0 feat: Conductor renders AskUserQuestion decisions as prose
garrytan Jun 12, 2026
0f165cb
test: strip ambient Conductor env in memory-cache-injection hook harness
garrytan Jun 12, 2026
d1fc21c
feat: gstack-detach — run agent eval/bench jobs in their own session
garrytan Jun 12, 2026
f8a0dc0
feat: eval:bg* scripts — detached eval runs for agents
garrytan Jun 12, 2026
b608b06
docs: CLAUDE.md — agents must run long evals via gstack-detach
garrytan Jun 12, 2026
a4fb10e
feat: harden gstack-detach against all four eval-infra killers
garrytan Jun 12, 2026
6eea549
feat: eval:bg* use run-scoped logs + machine lock + watchdog
garrytan Jun 12, 2026
a38089a
docs: wire detached-eval guidance into /ship + correct CLAUDE.md flags
garrytan Jun 12, 2026
9e63c2d
refactor: extract pure promotedEnv() from conductor-env-shim
garrytan Jun 12, 2026
2d56961
feat: hermetic child-env builder for E2E runners
garrytan Jun 12, 2026
c3e65b1
feat: session-runner spawns hermetic children + isolation canaries
garrytan Jun 12, 2026
9799593
feat: PTY runner spawns hermetic claude sessions
garrytan Jun 12, 2026
b89ce26
feat: codex/gemini runners spawn hermetic children
garrytan Jun 12, 2026
7046198
feat: agent-sdk-runner spawns hermetic children via complete Options.env
garrytan Jun 12, 2026
65549a4
test: static tripwire pins hermetic wiring in all five runners
garrytan Jun 12, 2026
fce00b2
test: refresh codex/factory ship goldens with detached-eval block
garrytan Jun 12, 2026
778042b
docs: hermetic local E2E is the default; retire stale SDK env warning
garrytan Jun 12, 2026
6b715c7
fix: operational-learning fixture copies lib/jsonl-store.ts with the bin
garrytan Jun 12, 2026
6d2274e
fix: ios-qa daemon scenarios use unique pidfiles under --concurrent
garrytan Jun 12, 2026
c15f789
fix: workflow judge re-appends body-carved sections after the marker …
garrytan Jun 13, 2026
350cce0
Merge remote-tracking branch 'origin/main' into garrytan/conductor-sk…
garrytan Jun 14, 2026
a4197b9
harden: hermetic temp-dir GC grace window + half-seed cleanup
garrytan Jun 14, 2026
d5b869a
docs: document hermetic-by-default E2E + eval:bg detached runs in CON…
garrytan Jun 14, 2026
264da8a
chore: sync package.json to 1.58.1.0
garrytan Jun 14, 2026
80c02eb
docs: regenerate diagram SKILL.md with Conductor prose preamble
garrytan Jun 14, 2026
102 changes: 102 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,107 @@
# Changelog

## [1.58.1.0] - 2026-06-14

## **Local evals stop lying. Spawned `claude` test children run in a sealed clean room,**
## **and in Conductor every decision is a plain-text brief you answer with a letter.**

Two things shipped here. First, the local E2E harness is now hermetic by default:
every spawned agent (claude -p, the real-PTY plan-mode runner, the Agent SDK
runner, plus the codex and gemini runners) gets an allowlist-scrubbed environment,
a fresh seeded `CLAUDE_CONFIG_DIR`, a temp `GSTACK_HOME`, and `--strict-mcp-config`.
Before this, a dev machine leaked the operator's `~/.claude` config, MCP servers
(gbrain, Conductor), skills, `~/.gstack` decision logs, and `CONDUCTOR_*`/`CLAUDECODE`
env into every child, so local eval results disagreed with CI for reasons that had
nothing to do with the code under test. Now local signal matches CI. Set
`EVALS_HERMETIC=0` to debug against real operator state.
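
A minimal sketch of the allowlist idea described above, not the actual `test/helpers/hermetic-env.ts`. The function name `scrubEnv`, the specific allowlisted names, and the `gstack-home` directory name are assumptions for illustration. Only `extraAllow`, `CLAUDE_CONFIG_DIR`, `GSTACK_HOME`, and the `<runRoot>/.claude` path come from this changelog.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical allowlist; the real one in hermetic-env.ts may differ.
const ALLOW_EXACT = ["PATH", "HOME", "LANG", "TERM", "HTTPS_PROXY", "ANTHROPIC_API_KEY"];

// Build the child environment from scratch: copy only allowlisted names, then
// point config and state at a per-run temp root instead of operator dirs.
function scrubEnv(source: Env, runRoot: string, extraAllow: string[] = []): Record<string, string> {
  const allowed = ALLOW_EXACT.concat(extraAllow);
  const child: Record<string, string> = {};
  for (const name of allowed) {
    const value = source[name];
    if (value !== undefined) child[name] = value;
  }
  child["CLAUDE_CONFIG_DIR"] = runRoot + "/.claude";
  child["GSTACK_HOME"] = runRoot + "/gstack-home"; // assumed directory name
  return child;
}
```

Starting from an empty object and copying names in, rather than deleting known-bad names from a copy, means a new operator variable is excluded by default.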

Second, in a Conductor session gstack no longer fights Conductor's flaky
AskUserQuestion tool. It detects the session and renders every decision as a prose
brief: a labeled question with a recommendation, per-option completeness scores, and
a "reply with a letter" prompt. A PreToolUse hook enforces this by denying the tool
and redirecting to prose. Destructive confirmations demand an explicit typed answer.
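
The session detection can be sketched as a pure function. This mirrors the shell check in the SKILL.md preamble from this PR (either `CONDUCTOR_WORKSPACE_PATH` or `CONDUCTOR_PORT` non-empty, and the session not headless). The names `normalizeSessionKind` and `isConductorSession` are hypothetical, and this is not claimed to be the shared `isConductor()` helper itself.

```typescript
type SessionKind = "spawned" | "headless" | "interactive";
type Env = Record<string, string | undefined>;

// Unknown values fall back to "interactive", as the preamble's case statement does.
function normalizeSessionKind(raw: string): SessionKind {
  if (raw === "spawned" || raw === "headless" || raw === "interactive") return raw;
  return "interactive";
}

// Headless runs inside Conductor must not render prose to nobody, so they are
// excluded even when the Conductor variables are set.
function isConductorSession(env: Env, kind: SessionKind): boolean {
  if (kind === "headless") return false;
  return Boolean(env["CONDUCTOR_WORKSPACE_PATH"]) || Boolean(env["CONDUCTOR_PORT"]);
}
```

Taking `env` as a parameter instead of reading a global keeps the check evaluated at call time and easy to test against a synthetic environment.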

Agents that launch long eval runs get `gstack-detach`: a SIGTERM-proof, idle-sleep-proof
wrapper (fresh session + `caffeinate`) with a machine-wide lock so concurrent
worktrees serialize instead of saturating the model API, run-scoped logs, and a
guaranteed `EXIT=` sentinel so a poller never mistakes silence for success.

### The numbers that matter

Measured against the gate eval suite on a contaminated dev box (gbrain MCP up, live
Conductor session, sibling worktrees). Reproduce: `bun test` (free unit + wiring
tripwire) and `EVALS=1 EVALS_TIER=gate bun test test/skill-e2e-hermetic-canary.test.ts`.

| Metric | Before | After | Δ |
|--------|--------|-------|---|
| Spawned-child env | full operator `process.env` | allowlist-scrubbed | sealed |
| Runners hermeticized | 0 of 5 | 5 of 5 | +5 |
| Operator MCP servers visible to child | all (gbrain, Conductor) | 0 (`--strict-mcp-config`) | isolated |
| Config isolation proof | none | poisoned-operator sentinel canary | falsifiable |
| Long eval runs surviving a turn-boundary SIGTERM | no | yes (`gstack-detach`) | survives |

The clean room is falsifiable, not asserted: a `hermetic-sentinel` gate canary
plants a poisoned operator config (a user `CLAUDE.md` + an MCP server) and fails if
the child can see any of it, and a free static tripwire fails CI if any runner
reverts to a raw `process.env` spread.
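
A static tripwire of this kind can be as simple as scanning runner source text for a raw environment spread. The sketch below is an assumption about the approach, not the contents of `test/hermetic-wiring.test.ts`; `findRawEnvSpreads` is a hypothetical name.

```typescript
// Return the 1-based line numbers where source text spreads the raw operator
// environment into an object literal, e.g. `{ ...process.env }`.
function findRawEnvSpreads(sourceText: string): number[] {
  const pattern = /\.\.\.\s*process\.env\b/;
  const hits: number[] = [];
  sourceText.split("\n").forEach((line, index) => {
    if (pattern.test(line)) hits.push(index + 1);
  });
  return hits;
}
```

A test would read each runner file and assert that the returned list is empty. The check is textual, so a spread mentioned inside a comment would also be flagged; for a tripwire that is usually an acceptable false positive.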

### What this means for contributors

Run evals locally and trust the result. You no longer have to push to CI to find
out whether a failure was real or just your machine bleeding context into the agent.
Three latent bugs the old harness hid surfaced the moment the suite ran clean and
are fixed: a coverage-judge that scored carved skills against half a document, an
ios-qa daemon test that collided on a shared pidfile under concurrency, and an
operational-learning fixture missing a lib it imports. Start a run with
`bun run eval:bg:gate`; flip `EVALS_HERMETIC=0` only when you deliberately want your
real `~/.claude` in the loop.

### Itemized changes

#### Added
- **Hermetic E2E environment** (`test/helpers/hermetic-env.ts`): allowlist env
builder (process basics, network/proxy vars, named `ANTHROPIC_*` auth, per-runner
`extraAllow`), pure `promotedEnv()` shared with `lib/conductor-env-shim.ts`, a
sync-memoized singleton temp dir (`<runRoot>/.claude` keeps the plan-file path
contract), a seeded `.claude.json` for non-interactive first run, and pid-aware GC
of crashed runs. Default-on; `EVALS_HERMETIC=0` restores the legacy env AND drops
`--strict-mcp-config`.
- **Two gate-tier isolation canaries** (`test/skill-e2e-hermetic-canary.test.ts`):
`hermetic-canary` asserts env redirect + scrub + zero MCP servers + nonzero
API-key cost from the Bash tool_result (not model prose); `hermetic-sentinel`
proves the child cannot see a planted poisoned operator config.
- **Static wiring tripwire** (`test/hermetic-wiring.test.ts`): free-tier invariants
that fail CI if any of the five runners drops `hermeticChildEnv()`, the gated
`--strict-mcp-config`, or leaks `process.env` through a callsite override.
- **`gstack-detach`** + `eval:bg` / `eval:bg:all` / `eval:bg:gate` / `eval:bg:periodic`
scripts: detached, SIGTERM-proof, `caffeinate`-wrapped eval runs with a machine-wide
lock, per-run logs under `~/.gstack-dev/eval-runs/`, a watchdog, and an `EXIT=`
sentinel.
- **Conductor prose AskUserQuestion**: when a Conductor session is detected, every
decision renders as a prose brief (labeled question, recommendation, per-option
completeness, reply-with-a-letter), enforced by a PreToolUse hook that denies the
tool and redirects. Auto-decide preferences still apply first; destructive
confirmations require an explicit typed answer. Installed for Conductor even in
non-interactive setup, with an upgrade migration for existing installs.
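
To make the prose-brief shape concrete, here is a hypothetical renderer. The field names, the 0 to 10 completeness scale, and the exact wording of each line are assumptions; the changelog only specifies a labeled question, a recommendation, per-option completeness, and a reply-with-a-letter instruction.

```typescript
interface BriefOption {
  label: string;
  completeness: number; // assumed 0-10 scale
}

interface Decision {
  question: string;
  options: BriefOption[];
  recommendedIndex: number;
}

// Render a decision as plain text the user can answer with a single letter.
function renderProseBrief(decision: Decision): string {
  const lines: string[] = ["Question: " + decision.question];
  decision.options.forEach((option, index) => {
    const letter = String.fromCharCode(65 + index); // 0 -> "A", 1 -> "B", ...
    const recommended = index === decision.recommendedIndex ? " (recommended)" : "";
    lines.push(`${letter}) ${option.label}${recommended} [completeness ${option.completeness}/10]`);
  });
  lines.push("Reply with a letter.");
  return lines.join("\n");
}
```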

#### Changed
- All five E2E runners (`session-runner`, `claude-pty-runner`, `agent-sdk-runner`,
`codex-session-runner`, `gemini-session-runner`) spawn children through
`hermeticChildEnv()`. The Agent SDK runner now receives a COMPLETE hermetic env
via `Options.env` (the old "never pass env: to the SDK" rule guarded against
partial-env replacement; a complete env is safe).
- `hermetic-env.ts` is a global touchfile, so any change to it selects every E2E +
judge test.
- CLAUDE.md documents hermetic-by-default local evals and retires the stale SDK env
warning.

#### Fixed
- The workflow LLM-judge now re-appends body-carved `sections/*.md` after the marker
slice, so carved skills (document-release) are judged on the full workflow the
agent executes instead of a half-document.
- ios-qa daemon scenarios use unique pidfiles, fixing `already_running` collisions
under `bun test --concurrent`.

## [1.58.0.0] - 2026-06-12

## **Your documents grow diagrams. Mermaid and excalidraw fences render as real pictures,**
53 changes: 48 additions & 5 deletions CLAUDE.md
@@ -31,11 +31,26 @@ use Codex's own auth from `~/.codex/` config — no `OPENAI_API_KEY` env var nee
`lib/conductor-env-shim.ts`) promotes `GSTACK_ANTHROPIC_API_KEY` /
`GSTACK_OPENAI_API_KEY` to their canonical names inside gstack's TS binaries.
Tests run through gstack entrypoints inherit this promotion automatically.
-Don't echo the key value to stdout, logs, or shell history. When passing to a
-test's Agent SDK, do NOT pass `env: {...}` to `runAgentSdkTest` — the SDK's
-auth pipeline doesn't pick up the key the same way when env is supplied as an
-object (confirmed failure mode). Mutate `process.env.ANTHROPIC_API_KEY`
-ambiently before the call and restore in `finally`.
Don't echo the key value to stdout, logs, or shell history. The historical
"never pass `env:` to `runAgentSdkTest`" rule is retired: the failure was
partial-env replacement (the SDK's `Options.env` REPLACES the child's entire
environment, so an object without the key broke auth). The runner now always
passes a COMPLETE hermetic env with per-test `env:` merged last, so per-test
overrides are safe; ambient `process.env.ANTHROPIC_API_KEY` mutation also
still works (the env builder reads process.env at call time).
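
The difference between partial and complete replacement can be sketched as follows. `mergeChildEnv` and the sample values are hypothetical; the only behavior taken from the text above is that the environment passed to the child replaces everything, and that per-test overrides are merged last.

```typescript
type ChildEnv = Record<string, string>;

// Per-test overrides are spread last, so they win over the hermetic base
// while every base key (including auth) is still present.
function mergeChildEnv(hermeticBase: ChildEnv, perTest: ChildEnv = {}): ChildEnv {
  return { ...hermeticBase, ...perTest };
}

const hermeticBase: ChildEnv = { PATH: "/usr/bin", ANTHROPIC_API_KEY: "sk-test-placeholder" };
const perTestOverride: ChildEnv = { CONDUCTOR_WORKSPACE_PATH: "/tmp/ws" };

// Old failure mode: handing the override alone to an API that REPLACES the
// environment leaves the child with no API key.
const partialEnv: ChildEnv = perTestOverride;

// Current approach: a complete env with the override merged last.
const completeEnv: ChildEnv = mergeChildEnv(hermeticBase, perTestOverride);
```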

**Hermetic local E2E (default).** Every E2E runner (claude -p, PTY, Agent
SDK, codex, gemini) spawns children through `test/helpers/hermetic-env.ts`:
allowlist-scrubbed env (operator `CONDUCTOR_*`, `CLAUDE_*`, `GSTACK_*`,
`MCP_*`, `GBRAIN_*`, and credentials like `GH_TOKEN` never reach children),
a fresh seeded `CLAUDE_CONFIG_DIR` (no operator `~/.claude` CLAUDE.md /
MCP servers / skills), a temp `GSTACK_HOME`, and `--strict-mcp-config`.
Local eval signal matches CI. Debug against real operator state with
`EVALS_HERMETIC=0` (restores the legacy env AND drops the strict-MCP flag).
Per-test `env:` overrides merge last, so deliberate contamination
(`CONDUCTOR_WORKSPACE_PATH`, per-test `GSTACK_HOME`) keeps working. Wiring
is pinned by `test/hermetic-wiring.test.ts` (static tripwire) and two
gate-tier canaries in `test/skill-e2e-hermetic-canary.test.ts`.

E2E tests stream progress in real-time (tool-by-tool via `--output-format stream-json
--verbose`). Results are persisted to `~/.gstack-dev/evals/` with auto-comparison
@@ -828,6 +843,34 @@ them. Report progress at each check (which tests passed, which are running, any
failures so far). The user wants to see the run complete, not a promise that
you'll check later.

## Running evals as an agent: always detach (SIGTERM-proof)

When **you (an agent/harness)** launch a long eval/benchmark run, run it through
`bin/gstack-detach` — NEVER as a plain backgrounded Bash task. A plain background
task lives in the harness's process group, so a SIGTERM ("polite quit") on a turn
boundary, a stopped Monitor, or an interruption kills the run mid-flight (observed:
`script "test:gate" was terminated by signal SIGTERM` ~40 min into a run). On macOS
the run can also die to idle-sleep. `gstack-detach` fixes both: a fresh session
(escapes the group SIGTERM) wrapped in `caffeinate -i` (blocks idle-sleep).

- Use the `eval:bg*` scripts (`eval:bg`, `eval:bg:all`, `eval:bg:gate`,
`eval:bg:periodic`) — they wrap the eval command in `gstack-detach` with the
machine-wide `gstack-evals` lock (concurrent worktrees serialize instead of
saturating the shared model API), a per-tier watchdog, and a **run-scoped** log
under `~/.gstack-dev/eval-runs/` (no shared-`/tmp` collision). Each prints its
log path. Or call `gstack-detach [--lock NAME] [--timeout SECS] [--label LBL] --
<cmd>` directly for any long agent job. Export `ANTHROPIC_API_KEY` first (never
pass keys in argv).
- Then **poll the printed logfile** with a death-aware watcher: break on the
guaranteed `### gstack-detach EXIT=<code> ###` sentinel (success AND failure are
both marked, so silence is never mistaken for success). The detached run survives
even if your watcher gets reaped, so re-checking the log always works.
- Why the lock: a shared dev box with several Conductor worktrees will rate-limit
the model API if two eval suites run at once (15-way concurrency each), causing E2E
tests to time out en masse. The lock makes the second run WAIT, not collide.
- Humans running `bun run test:evals` foreground in their own terminal don't need
this — Ctrl-C is intended there. Detachment is for agent-launched runs only.
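
A death-aware watcher reduces to one question per poll: has the sentinel line appeared yet? The sketch below parses log text for the `### gstack-detach EXIT=<code> ###` line quoted above. `readRunState` is a hypothetical name, and treating `<code>` as a non-negative decimal integer is an assumption.

```typescript
type RunState =
  | { done: false }
  | { done: true; exitCode: number; ok: boolean };

// Matches the sentinel on a line of its own anywhere in the log.
const EXIT_SENTINEL = /^### gstack-detach EXIT=(\d+) ###\s*$/m;

function readRunState(logText: string): RunState {
  const match = EXIT_SENTINEL.exec(logText);
  if (match === null) return { done: false }; // silence is not success
  const exitCode = Number(match[1]);
  return { done: true, exitCode, ok: exitCode === 0 };
}
```

A poller would re-read the printed logfile on an interval, pass its contents to `readRunState`, and stop only when `done` is true. Because the run is detached, the same check works after the watcher itself has been restarted.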

## E2E test fixtures: extract, don't copy

**NEVER copy a full SKILL.md file into an E2E test fixture.** SKILL.md files are
31 changes: 31 additions & 0 deletions CONTRIBUTING.md
@@ -176,6 +176,18 @@ EVALS=1 bun test test/skill-e2e-*.test.ts
- Saves full NDJSON transcripts and failure JSON for debugging
- Tests live in `test/skill-e2e-*.test.ts` (split by category), runner logic in `test/helpers/session-runner.ts`

**Hermetic by default.** Every E2E runner (claude -p, the real-PTY plan-mode
runner, the Agent SDK runner, plus the codex and gemini runners) spawns its child
through `test/helpers/hermetic-env.ts`: an allowlist-scrubbed environment, a fresh
seeded `CLAUDE_CONFIG_DIR`, a temp `GSTACK_HOME`, and `--strict-mcp-config`. Your
operator `~/.claude` config, MCP servers (gbrain, Conductor), skills, `~/.gstack`
decision logs, and `CONDUCTOR_*` env never leak into the child, so local eval
signal matches CI instead of disagreeing for reasons unrelated to the code under
test. Set `EVALS_HERMETIC=0` to debug against your real operator state (this also
drops `--strict-mcp-config`). The wiring is pinned by `test/hermetic-wiring.test.ts`
(a free static tripwire) and two gate-tier isolation canaries in
`test/skill-e2e-hermetic-canary.test.ts`.

### E2E observability

When E2E tests run, they produce machine-readable artifacts in `~/.gstack-dev/`:
@@ -198,6 +210,25 @@ bun run eval:compare # compare two runs — shows per-test deltas + Take
bun run eval:summary # aggregate stats + per-test efficiency averages across runs
```

**Detached runs for agents and long suites.** When an agent (or you, for a run
you don't want to babysit) launches a long eval, use the `eval:bg*` scripts. They
wrap the eval command in `bin/gstack-detach`: a fresh session that escapes a
turn-boundary SIGTERM, a `caffeinate` wrapper that blocks idle-sleep, a machine-wide
`gstack-evals` lock so concurrent worktrees serialize instead of saturating the
model API, a run-scoped log under `~/.gstack-dev/eval-runs/`, a per-tier watchdog,
and a guaranteed `### gstack-detach EXIT=<code> ###` sentinel so a poller never
mistakes silence for success.

```bash
bun run eval:bg # detached test:evals (diff-based)
bun run eval:bg:all # detached test:evals:all
bun run eval:bg:gate # detached gate-tier suite
bun run eval:bg:periodic # detached periodic-tier suite
```

Each prints its log path. Humans running `bun run test:evals` foreground in their
own terminal don't need this — Ctrl-C is intended there.

**Eval comparison commentary:** `eval:compare` generates natural-language Takeaway sections interpreting what changed between runs — flagging regressions, noting improvements, calling out efficiency gains (fewer turns, faster, cheaper), and producing an overall summary. This is driven by `generateCommentary()` in `eval-store.ts`.

Artifacts are never cleaned up — they accumulate in `~/.gstack-dev/` for post-mortem debugging and trend analysis.
7 changes: 7 additions & 0 deletions SKILL.md
@@ -48,6 +48,13 @@ echo "REPO_MODE: $REPO_MODE"
_SESSION_KIND=$(~/.claude/skills/gstack/bin/gstack-session-kind 2>/dev/null || echo "interactive")
case "$_SESSION_KIND" in spawned|headless|interactive) ;; *) _SESSION_KIND="interactive" ;; esac
echo "SESSION_KIND: $_SESSION_KIND"
# Conductor host: AskUserQuestion is unreliable here (native disabled, MCP
# variant flaky), so skills render decisions as prose instead of calling the
# tool. Gated on !headless so an eval/CI run INSIDE Conductor (GSTACK_HEADLESS)
# still BLOCKs rather than rendering prose to nobody.
if [ "$_SESSION_KIND" != "headless" ] && { [ -n "${CONDUCTOR_WORKSPACE_PATH:-}" ] || [ -n "${CONDUCTOR_PORT:-}" ]; }; then
echo "CONDUCTOR_SESSION: true"
fi
_LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no")
echo "LAKE_INTRO: $_LAKE_SEEN"
_TEL=$(~/.claude/skills/gstack/bin/gstack-config get telemetry 2>/dev/null || true)
2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
-1.58.0.0
+1.58.1.0