rocicorp · 0xcadams · Jun 30, 2026 · Jun 16, 2026 · Jun 25, 2026 · Jun 25, 2026
diff --git a/.agents/skills/benchmark-compare/SKILL.md b/.agents/skills/benchmark-compare/SKILL.md
@@ -0,0 +1,127 @@
+---
+name: benchmark-compare
+description: Compare benchmark results between two refs, branches, tags, or commits; produce aggregate before/after summaries and investigate major regressions.
+---
+
+# Benchmark Compare
+
+Use this skill when comparing benchmark performance across two code versions, especially for release notes or regression investigation.
+
+## Goal
+
+Produce a repeatable before/after benchmark comparison with raw evidence, clear aggregate results, and documented caveats.
+
+## Inputs
+
+Gather:
+
+- Repository path.
+- Baseline ref.
+- Target ref.
+- Benchmark command or commands.
+- Metric direction: higher-is-better or lower-is-better.
+- Desired output: release-note summary, investigation, or raw report.
+
+Ask one short question if any required input is unclear.
+
+## Workflow
+
+1. Resolve baseline and target refs to exact SHAs.
+2. Use separate temporary worktrees so active working trees are not touched.
+3. Set up each worktree using that ref's normal project tooling.
+4. Record environment details: machine, OS, runtime versions, package manager versions, and relevant env vars.
+5. Run equivalent benchmark commands on both refs.
+6. Save raw output and logs for both runs.
+7. Confirm the benchmark commands succeeded before comparing results.
+8. Normalize results into benchmark name, metric value, unit, and direction.
+9. Compare matching benchmark names only.
+10. Summarize aggregate results and investigate major outliers.
+
+## Comparison Rules
+
+Compare only benchmark cases that exist on both refs.
+
+Exclude changed benchmark definitions unless the same definition was intentionally run on both refs.
+
+If benchmark-only changes are backported for comparison, do that only in the temporary baseline worktree and document it.
+
+Do not backport production code when measuring a performance commit. Only backport benchmark files or benchmark harness changes needed to run the same benchmark definition on both refs.
+
+Before relying on a benchmark matrix for release notes, check whether important performance commits are actually covered:
+
+- Inspect the commit message, PR description, and changed files to identify the optimized path.
+- Look for durable benchmark coverage, usually `*.bench.ts` files run by the normal benchmark command.
+- Treat gated `PERF=1` tests, `*.perf.test.ts` files, one-off scripts, and custom console tables as evidence, not matrix coverage.
+- Check whether the benchmark exists on both refs with `git ls-tree <ref> -- <path>` (empty output means the path is absent on that ref) or an equivalent read-only check.
+- If a target-only benchmark covers the path, temp-backport it into the baseline worktree and run it on both refs.
+- If no durable benchmark exists, create the smallest benchmark that targets the optimized path using APIs available on both refs.
+- Validate the benchmark once on each ref before starting multi-run batches.
+- Record all benchmark-only temp backports: files changed, refs, SHAs, commands, and whether the benchmark definition was identical across refs.
+
+Examples:
+
+- A gated `zqlite/src/*.perf.test.ts` for flipped-join batching did not appear in the matrix because normal benchmark runs only picked up `*.bench.ts`. The fix was to add a standard `zql-benchmarks/src/*.bench.ts`, temp-backport it into both release worktrees, and report the result separately from the original matrix.
+- `array-view-transaction.bench.ts` covered transaction copy-on-write on the target ref but did not exist on the baseline ref. The fix was to temp-backport that benchmark file to the baseline worktree and run the same benchmark on both refs.
+
+Use ratios consistently:
+
+- Higher-is-better: `target / baseline`.
+- Lower-is-better: `baseline / target`.
+
+A ratio above `1.0` means target improved. A ratio below `1.0` means target regressed.
+
+## Summary
+
+Report:
+
+- Comparable benchmark count.
+- Median ratio.
+- Geometric mean ratio.
+- Improved count above a chosen threshold, usually 5%.
+- Regressed count above the same threshold.
+- Top improvements.
+- Top regressions.
+- Exclusions or caveats.
+
+If one outlier dominates the aggregate, report both the full result and the result excluding that named outlier. Do not hide the outlier.
+
+## Troubleshooting
+
+If output is empty or partial, inspect raw logs before comparing.
+
+If benchmark output is piped through a converter, ensure benchmark failures are not hidden.
+
+If runtime-specific build artifacts fail, rebuild or reinstall using the project's documented tooling.
+
+If benchmark arguments are not forwarded correctly, run from the package or directory where the benchmark command is defined, or invoke the underlying runner directly.
+
+If results look noisy, rerun the affected benchmark subset and report uncertainty.
+
+## Regression Investigation
+
+For severe regressions:
+
+1. Verify the result from raw output or a rerun.
+2. Confirm the benchmark definition did not change.
+3. Inspect relevant code diffs.
+4. Check tests for changed behavior or expectations.
+5. Use commit history or blame to identify likely causes.
+6. Link relevant commits or PRs when available.
+7. State whether the cause appears proven or uncertain.
+
+## Release Notes
+
+Prefer aggregate performance summaries over listing every performance PR.
+
+Include:
+
+- Baseline ref and SHA.
+- Target ref and SHA.
+- Benchmark suite or command.
+- Environment.
+- Comparable benchmark count.
+- Median and geometric mean results.
+- Important wins and regressions.
+- Caveats and exclusions.
+
+Keep claims scoped to the measured benchmark suite and environment.
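
The ratio rules and summary fields described in this skill can be sketched as a small script. This is a minimal illustration, not part of the PR: the function names, the dictionary-based input shape, and the choice to count a regression as a ratio below `1 - threshold` are all assumptions made for the example.

```python
import math
import statistics


def ratio(baseline: float, target: float, higher_is_better: bool) -> float:
    """Return a ratio where a value above 1.0 means the target improved."""
    return target / baseline if higher_is_better else baseline / target


def summarize(baseline: dict, target: dict, higher_is_better: bool,
              threshold: float = 0.05) -> dict:
    """Aggregate before/after results (hypothetical helper, not from the PR)."""
    # Compare only benchmark names that exist on both refs.
    names = sorted(set(baseline) & set(target))
    if not names:
        raise ValueError("no comparable benchmarks")
    ratios = {n: ratio(baseline[n], target[n], higher_is_better) for n in names}
    values = list(ratios.values())
    # Geometric mean computed via logs; every ratio must be positive.
    geomean = math.exp(sum(math.log(v) for v in values) / len(values))
    return {
        "comparable": len(names),
        "median": statistics.median(values),
        "geomean": geomean,
        # Assumption: the threshold is applied symmetrically around 1.0.
        "improved": sum(1 for v in values if v > 1 + threshold),
        "regressed": sum(1 for v in values if v < 1 - threshold),
        "ratios": ratios,
    }
```

Calling `summarize` once with all benchmarks and once with a named outlier removed from both inputs gives the two aggregates the skill asks for when a single outlier dominates.
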
diff --git a/.agents/skills/release-notes/SKILL.md b/.agents/skills/release-notes/SKILL.md
@@ -30,23 +30,67 @@ Produce a release note draft that is intentionally over-inclusive so a human can
 ## Workflow
 
 1. Determine commit range between release tags:
-   - `git log --oneline --no-merges <prevTag>..<targetTag>`
-2. Classify commits using conventional commit prefixes:
+   - First list every non-merge commit in the raw range:
+     - `git log --reverse --oneline --no-merges <prevTag>..<targetTag>`
+   - Keep this as the audit source until the human has reviewed categorization.
+2. Remove commits already included in the previous release through cherry-picks:
+   - Find the common ancestor between previous release and target:
+     - `git merge-base <prevTag> <targetTag>`
+   - Inspect commits on the previous-release side after that ancestor:
+     - `git log --reverse --format='%h%x09%s%n%b%n---END---' <mergeBase>..<prevTag>`
+   - Look for `-x` cherry-pick trailers such as `(cherry picked from commit <sha>)`.
+   - Also use patch-equivalence to catch cherry-picks whose lockfiles or package-manager files differ:
+     - `git log --right-only --no-merges --cherry-mark --format='%m%x09%h%x09%s' <prevTag>...<targetTag>`
+   - Treat `=` commits as already present in the previous release and categorize them as `skip`, unless the target commit contains materially different user-facing changes.
+   - When a previous-release maintenance commit names a target commit in its cherry-pick trailer, categorize that target commit as `skip`.
+3. Run the protocol compatibility check immediately:
+   - Find protocol constants in mono before drafting any notes.
+   - Check both previous release and target values, for example:
+     - `git show <prevTag>:packages/zero-protocol/src/protocol-version.ts`
+     - `git show <targetTag>:packages/zero-protocol/src/protocol-version.ts`
+   - Ensure the target version's minimum supported sync protocol is `<=` the previous release's `PROTOCOL_VERSION`.
+   - If compatibility fails, this is a release-blocking breaking change. Record it loudly in `.releases/<major>.<minor>/commits.md` with `BREAKING`, and mark the responsible commit `BREAKING` if it can be identified.
+   - If compatibility passes, still record the result in `.releases/<major>.<minor>/commits.md` so later release reviews do not need to rediscover it.
+4. Categorize every commit and flag potential breaking changes before drafting. Use only these categories:
+   - `perf`
+   - `feature`
+   - `fix`
+   - `skip`
+   - Also identify potential breaking changes in every commit, including skipped/internal-looking commits.
+   - Look early for API renames/removals, env var/config changes, default behavior flips, migration requirements, protocol changes, package export/import changes, dependency/peer dependency changes that can affect install/runtime behavior, and commit text containing "breaking".
+   - Record breaking-change status separately from category.
+   - Use `-` for commits that are not believed breaking.
+   - Use `MAYBE` for commits that could be breaking and need human review; explain why in the note.
+   - If a commit is believed to be breaking, make it extremely visible with `BREAKING`. This should be rare; Zero release planning aims to avoid breaking changes.
+   - Treat breaking-change detection as an early warning system for ongoing release review, not something to defer until the final draft.
+5. Present the categorization to the human for review before writing release notes:
+   - Use a Markdown table with `Commit`, `Category`, `Breaking?`, and `Note`.
+   - The note should be a few sentences when useful: summarize what changed, why the category was chosen, whether it was skipped due to cherry-pick, revert, internal-only scope, or lack of user-facing impact, and why it is or is not a potential breaking change.
+   - Prefer `skip` for CI, release tooling, benchmark-only, sample-only, dependency hygiene with no identified user-facing effect, reverted changes, and commits already in the previous release.
+   - Use `fix` for customer-observable behavior even when the commit is labeled `chore`, e.g. packaging changes that prevent duplicate runtime dependencies from breaking checks like `Pool instanceof`.
+   - Use `perf` for fixes whose primary user-facing value is measured speed/CPU/allocation improvement.
+   - Use `feature` for new user/operator/debugging capability.
+   - Reclassify suspicious commits while building the table:
+     - Include `chore` commits that look user-facing, behavior-changing, protocol-affecting, package/export-affecting, or crash/fix related.
+     - Check dependency update commits when they affect runtime, install, protocol, query correctness, or performance-critical packages. Look at upstream changelogs when needed.
+   - Treat the `Breaking?` column as the breaking-change pass:
+     - Look for API renames/removals, env var/config changes, behavior flips, migration requirements, protocol changes, package export/import changes, dependency/peer dependency changes, and semantically breaking behavior even if unlabeled.
+     - Also scan commit text for "breaking".
+   - Flag performance follow-ups early for commits that change query compilation, index use, dependency implementations, hot loops, or runtime semantics. Put the performance concern in the note or open questions even if the commit category is `fix`.
+6. Save the reviewed categorization in the docs worktree before drafting:
+   - Use a stable release working-state directory under `.releases/<major>.<minor>/`, for example `.releases/1.7/commits.md`.
+   - Include release version, previous tag, target, merge base, the exact commands used, the reviewed table, potential breaking changes, and any unresolved questions.
+   - On later sessions, read this file first and evolve it instead of redoing the whole commit audit.
+7. After human review of the categorization, classify the non-skipped commits using conventional commit prefixes as a starting point:
    - `feat` -> Features
    - `fix` -> Fixes
    - `perf` -> Performance (if meaningful)
    - `chore` -> ignored by default
-3. Reclassify suspicious commits:
-   - Include chore commits that look user-facing, behavior-changing, protocol-affecting, or crash/fix related.
-   - Check dependency update commits (especially performance-critical packages like `compare-utf8`, `litestream`, etc.) - look at the upstream changelogs to see if they contain notable perf or fix items.
-4. Breaking change pass (separate from type labels):
-   - Look for API renames/removals, env var/config changes, behavior flips, migration requirements, protocol changes.
-   - Also scan commit text for "breaking" or semantically breaking behavior even if unlabeled.
-5. Protocol compatibility check:
-   - Find protocol constants in mono.
-   - Ensure `MIN_PROTOCOL_VERSION` in the version being documented is `<=` `PROTOCOL_VERSION` in the previous release version.
-   - Flag any mismatch in the draft.
-6. Build draft release notes in the latest format used in this repo:
+8. Before drafting, revisit the reviewed table:
+   - Confirm all non-skipped commits are represented or intentionally omitted.
+   - Re-check any `MAYBE` or `BREAKING` rows and summarize the decision in the draft or in the saved release state.
+   - Re-check any performance follow-ups recorded in `.releases/<major>.<minor>/commits.md`.
+9. Build draft release notes in the latest format used in this repo:
    - Before drafting, read `contents/docs/release-notes/0.26.mdx` as the canonical long-form style reference to avoid format drift.
    - Frontmatter with `title` and `description`
    - `## Installation`
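
The cherry-pick filtering in step 2 can be partly mechanized by parsing the output of the `--cherry-mark` command shown there. The sketch below is illustrative only and not part of the PR: it assumes the exact `%m%x09%h%x09%s` format (mark, tab, short SHA, tab, subject), and the type and function names are hypothetical.

```python
from typing import NamedTuple


class MarkedCommit(NamedTuple):
    sha: str
    subject: str
    already_released: bool  # True when git marked the commit with "="


def parse_cherry_mark(output: str) -> list[MarkedCommit]:
    """Parse `git log --cherry-mark --format='%m%x09%h%x09%s'` output."""
    commits = []
    for line in output.splitlines():
        if not line.strip():
            continue
        # Split on the first two tabs only, so tabs in the subject survive.
        mark, sha, subject = line.split("\t", 2)
        commits.append(MarkedCommit(sha, subject, mark == "="))
    return commits
```

Commits with `already_released` set are candidates for `skip`; they still need the human check for materially different user-facing changes that the workflow describes.
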

diff --git a/assets/search-index.json b/assets/search-index.json
@@ -8923,4 +8923,4 @@
     "content": "Scalar subqueries are not currently integrated with Zero's planner. You need to manually choose when to use them.",
     "kind": "section"
   }
-]
+]