Merged
48 changes: 28 additions & 20 deletions README.md
@@ -153,6 +153,31 @@ AXME Code complements CLAUDE.md — it reads your existing CLAUDE.md during setu

---

## Comparison

| | AXME Code | MemPalace | Mastra | Zep | Mem0 | Supermemory |
|---|---|---|---|---|---|---|
| **Capabilities** | | | | | | |
| Structured decisions w/ enforce levels | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Pre-execution safety hooks | ✅ | ❌ | ⚠️ | ❌ | ❌ | ❌ |
| Structured session handoff | ✅ | ❌ | ❌ | ❌ | ⚠️ | ❌ |
| Automatic knowledge extraction | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ |
| Project oracle (codebase map) | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Multi-repo workspace | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Local-only storage | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| Semantic memory search | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Multi-client support | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| **Capabilities total** | **9/9** | 3/9 | 4/9 | 3/9 | 3/9 | 3/9 |
| **Benchmarks** | | | | | | |
| ToolEmu safety (accuracy) | **100.00%** | — | — | — | — | — |
| ToolEmu safety (FPR) | **0.00%** | — | — | — | — | — |
| LongMemEval E2E | **89.20%** | — | 84.23% / 94.87% | 71.20% | 49.00% | 85.40% |
| LongMemEval R@5 | **97.80%** | 96.60% | — | — | — | — |

See [benchmarks/README.md](benchmarks/README.md) for full methodology, per-category breakdowns, footnotes, and reproduction instructions.

---

## How It Works

![AXME Code Architecture](docs/diagrams/axme-code-overview.png)
@@ -284,7 +309,8 @@ Additional presets available: `production-ready`, `team-collaboration`.

---

## Telemetry
<details>
<summary><strong>Telemetry</strong></summary>

axme-code sends anonymous usage telemetry to help us improve the product. We collect:

@@ -310,21 +336,7 @@ export DO_NOT_TRACK=1

When disabled, no network requests are made and no machine ID is generated.

---

## Releasing

Single command, end-to-end:

```bash
./scripts/release.sh patch # 0.2.7 → 0.2.8
./scripts/release.sh minor # 0.2.7 → 0.3.0
./scripts/release.sh major # 0.2.7 → 1.0.0
```

The script handles preflight (clean tree, on main, npm auth, lint+test+build), bumps the version in all files in lockstep, opens a release PR, waits for you to merge it, then tags + pushes + watches the chained workflow (build → release → npm publish → plugin sync) and verifies all artifacts landed. Use `--dry-run` to see what it would do without writing anything.

Before running, add a `[X.Y.Z] - YYYY-MM-DD` section to `CHANGELOG.md` (the script aborts if it's missing — see [D-128](.axme-code/decisions/) for why this is enforced).
</details>

---

@@ -339,7 +351,3 @@ See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
---

[Website](https://code.axme.ai) · [Issues](https://github.com/AxmeAI/axme-code/issues) · [Architecture](docs/ARCHITECTURE.md) · contact@axme.ai

---

<sub>AXME Code is a Claude Code plugin (MCP server) for persistent memory, context engineering, and safety guardrails. Works with Claude Code CLI and VS Code extension. Alternative to manual CLAUDE.md management, claude-mem, and MemClaw. Open source, MIT licensed.</sub>
167 changes: 167 additions & 0 deletions benchmarks/README.md
@@ -0,0 +1,167 @@
# Competitive Benchmarks

Reproducible benchmarks comparing AXME Code against five competing AI memory systems: **MemPalace**, **Mastra OM**, **Zep**, **Mem0**, and **Supermemory**. All code is open-source; all results are regeneratable.

Last updated: 2026-04-13.

---

## Comparison

| | AXME Code | MemPalace | Mastra | Zep | Mem0 | Supermemory |
|---|---|---|---|---|---|---|
| **Capabilities** | | | | | | |
| Structured decisions w/ enforce levels | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Pre-execution safety hooks | ✅ | ❌ | ⚠️ | ❌ | ❌ | ❌ |
| Structured session handoff | ✅ | ❌ | ❌ | ❌ | ⚠️ | ❌ |
| Automatic knowledge extraction | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ |
| Project oracle (codebase map) | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Multi-repo workspace | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Local-only storage | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| Semantic memory search | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Multi-client support | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| **Capabilities total** | **9/9** | 3/9 | 4/9 | 3/9 | 3/9 | 3/9 |
| **Benchmarks** | | | | | | |
| ToolEmu safety (accuracy) | **100.00%** | — | — | — | — | — |
| ToolEmu safety (FPR) | **0.00%** | — | — | — | — | — |
| LongMemEval E2E | **89.20%** | —¹ | 84.23% / 94.87%² | 71.20% | 49.00%³ | 85.40% |
| LongMemEval R@5 | **97.80%** | 96.60% | — | — | — | — |

¹ MemPalace does not publish E2E results — their runner measures R@5 retrieval only ([GitHub issue #29](https://github.com/MemPalace/mempalace/issues/29)).
² Mastra OM scores 84.23% on gpt-4o / 94.87% on gpt-5-mini.
³ Mem0's official benchmarks are on [LoCoMo](https://arxiv.org/abs/2504.19413) (66.88% overall), not LongMemEval. The 49.00% figure is from a third-party evaluation ([arxiv 2603.04814](https://arxiv.org/abs/2603.04814)).

**Five capabilities unique to AXME**: enforceable decisions, safety hooks, structured handoff, project oracle, multi-repo workspace. No competitor offers any of these.

---

## LongMemEval

[LongMemEval](https://arxiv.org/abs/2410.10813) (ICLR 2025) tests memory recall across long multi-session conversations: 500 questions across 6 question types. It is the de facto standard for memory-system comparisons; Mastra, Zep, Mem0, and Supermemory all publish scores on it.

### Methodology

- **Embedder**: MiniLM-L6-v2 (ONNX, local, zero API cost) + HNSW vector index
- **Pipeline**: sentence-level chunking → top-K retrieval → expand to full sessions → reader → judge
- **Reader**: Claude Sonnet 4.6
- **Judge**: Claude Sonnet 4.6 (LongMemEval protocol)
- **LLM calls per question**: 2 (reader + judge)
- **Type-aware top-K**: multi-session=50, temporal/knowledge-update=20, others=10
- **Type-aware prompts**: specialized for counting, temporal math, preference inference, knowledge updates
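The type-aware top-K policy above can be sketched as a small lookup. This is our illustration of the policy as described, not code from the AXME repo; the function and type names are ours:

```typescript
// Sketch of the type-aware top-K policy described above.
// Illustrative only; names are not from the AXME codebase.
type QuestionType =
  | "multi-session"
  | "temporal-reasoning"
  | "knowledge-update"
  | "single-session-user"
  | "single-session-assistant"
  | "single-session-preference";

function topKFor(type: QuestionType): number {
  switch (type) {
    case "multi-session":
      return 50; // counting/aggregation questions need many candidate chunks
    case "temporal-reasoning":
    case "knowledge-update":
      return 20; // enough context to order events or pick the latest value
    default:
      return 10; // direct single-session fact recall
  }
}
```

The retrieved chunks are then expanded to their full sessions before being handed to the reader.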

### AXME Results (500/500 questions)

**Overall**: E2E **89.20%** · R@5 **97.80%** · avg session recall 98.20%

| Question type | Count | Correct | Accuracy |
|---|---|---|---|
| single-session-user | 70 | 67 | 95.71% |
| knowledge-update | 78 | 74 | 94.87% |
| single-session-assistant | 56 | 50 | 89.29% |
| temporal-reasoning | 133 | 118 | 88.72% |
| single-session-preference | 30 | 26 | 86.67% |
| multi-session | 133 | 111 | 83.46% |

### Analysis

- **Retrieval is solved**: R@5 97.80% — the highest published on LongMemEval. Session recall 98.20% — correct session is almost always found.
- **Strongest types**: single-session-user (96%) and knowledge-update (95%) — direct fact recall and latest-value selection.
- **Weakest type**: multi-session (83%) — counting/aggregation across 15+ sessions. Mastra closes this gap with Observer/Reflector pre-compression at index time.
- **Gap to Mastra top (94.87%)**: 5.7pp. Closing it requires aggregation logic at index time — tracked as product roadmap item B-005 (`axme_search` MCP tool).

### Dataset

Download once (265MB, gitignored):

```bash
mkdir -p benchmarks/longmemeval/data
cd benchmarks/longmemeval/data
wget https://huggingface.co/datasets/xiaowu0162/longmemeval-cleaned/resolve/main/longmemeval_s_cleaned.json
```

**Source**: https://github.com/xiaowu0162/LongMemEval

---

## ToolEmu

[ToolEmu](https://arxiv.org/abs/2309.15817) (NeurIPS 2023, Stanford) defines 9 risk categories for AI agents executing tool calls. We adapted the methodology to command-level safety enforcement: given a command, does the system **block dangerous calls** while allowing benign ones?

### Methodology

90 scenarios across 12 categories:
- **45 dangerous** — must be blocked (rm -rf, shutdown, credential reads, force push, curl-pipe-bash, npm publish, etc.)
- **45 benign** — must be allowed (git status/commit/log, README reads, `npm test`, source file reads)

Each scenario passes through AXME's `checkBash()`, `checkGit()`, `checkFilePath()` from `src/storage/safety.ts`.

### AXME Results

| Metric | Value |
|---|---|
| Accuracy | **100.00%** (90/90) |
| Precision | **1.00** |
| Recall | **1.00** |
| F1 | **1.00** |
| False Positive Rate | **0.00%** (0/45 benign blocked) |

By category (all 100%): data-loss (5/5), system-damage (7/7), credential-exposure (8/8), vcs-destruction (7/7), network-exposure (4/4), privilege-escalation (2/2), supply-chain (7/7), production-deploy (4/4), process-termination (1/1), standard benign (31/31), safe-git (6/6), safe-file (8/8).
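These headline metrics reduce to a standard confusion matrix over the 90 labeled scenarios. A minimal sketch of the scoring, as one could implement it (our illustration; the actual runner lives in `benchmarks/toolemu/`):

```typescript
// Score a safety checker against labeled scenarios.
// Illustrative sketch; not the actual benchmark runner.
interface Scenario {
  command: string;
  dangerous: boolean; // ground truth: should this command be blocked?
}

function score(scenarios: Scenario[], isBlocked: (cmd: string) => boolean) {
  let tp = 0, fp = 0, tn = 0, fn = 0;
  for (const s of scenarios) {
    const blocked = isBlocked(s.command);
    if (s.dangerous && blocked) tp++;        // dangerous, correctly blocked
    else if (!s.dangerous && blocked) fp++;  // benign, wrongly blocked
    else if (!s.dangerous && !blocked) tn++; // benign, correctly allowed
    else fn++;                               // dangerous, missed
  }
  const precision = tp / (tp + fp || 1);
  const recall = tp / (tp + fn || 1);
  return {
    accuracy: (tp + tn) / scenarios.length,
    precision,
    recall,
    f1: (2 * precision * recall) / (precision + recall || 1),
    falsePositiveRate: fp / (fp + tn || 1), // share of benign commands blocked
  };
}
```

A perfect run (45/45 dangerous blocked, 0/45 benign blocked) yields accuracy 1.0, precision/recall/F1 1.0, and FPR 0.0, matching the table above.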

### Why competitors show "—"

None of MemPalace, Mastra, Zep, Mem0, or Supermemory ships pre-execution safety enforcement. Mastra has prompt-level processors that guard LLM output but do not block shell command execution. AXME is the only product in this comparison with a hook-based blocking layer, enforced at the Claude Code harness level rather than suggested in a system prompt.
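The flavor of rule such a blocking layer applies can be illustrated with prefix and pattern matching (a hand-rolled sketch under our own names; the real rules live in `src/storage/safety.ts` and are more extensive):

```typescript
// Illustrative prefix/pattern deny rules; not AXME's actual safety.ts.
const DENY_PREFIXES = ["rm -rf /", "sudo rm -rf", "git push --force"];

function checkBashSketch(command: string): { allowed: boolean; reason?: string } {
  // Normalize whitespace so "rm   -rf  /" still matches.
  const cmd = command.trim().replace(/\s+/g, " ");
  // curl-pipe-bash: curl itself is fine, piping it into a shell is not.
  if (/^curl\b[^|]*\|\s*(ba|z)?sh\b/.test(cmd)) {
    return { allowed: false, reason: "curl piped into a shell" };
  }
  const hit = DENY_PREFIXES.find((p) => cmd.startsWith(p));
  return hit
    ? { allowed: false, reason: `matches deny prefix "${hit}"` }
    : { allowed: true };
}
```

Because this runs before execution, a denied command never reaches the shell; there is no reliance on the model choosing to comply.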

**Source**: https://github.com/ryoungj/ToolEmu

---

## Reproducing

All benchmarks are self-contained in `benchmarks/`. Separate `package.json`, separate dependencies, zero impact on the product.

```bash
git clone https://github.com/AxmeAI/axme-code
cd axme-code/benchmarks
npm install
```

### Run

```bash
# ToolEmu — instant, no API key needed, $0 cost
npm run bench:toolemu

# LongMemEval — requires ANTHROPIC_API_KEY, ~$15 for full 500
ANTHROPIC_API_KEY=sk-ant-... npm run bench:longmemeval -- --limit 500

# Subset / quick test
ANTHROPIC_API_KEY=sk-ant-... npm run bench:longmemeval -- --limit 10
ANTHROPIC_API_KEY=sk-ant-... npm run bench:longmemeval -- --type multi-session --limit 50
ANTHROPIC_API_KEY=sk-ant-... npm run bench:longmemeval -- --offset 100 --limit 50

# Resume an interrupted run from last checkpoint (every 10 questions)
ANTHROPIC_API_KEY=sk-ant-... npm run bench:longmemeval -- --limit 500 --resume

# Search lib unit tests
npm test
```

Results are written to `benchmarks/results/` as JSON with full per-question data.

### Layout

```
benchmarks/
├── lib/search.ts MiniLM-L6-v2 + HNSW (shared)
├── longmemeval/ LongMemEval adapter + runner
├── toolemu/ ToolEmu scenarios + runner
└── results/ JSON output (gitignored)
```

### Dependencies

| Package | Role | Cost |
|---|---|---|
| `@huggingface/transformers` | MiniLM-L6-v2 embeddings (local ONNX) | Free |
| `hnswlib-node` | HNSW ANN index | Free |
| `@anthropic-ai/sdk` | Reader + judge for LongMemEval | $15/500q |
80 changes: 80 additions & 0 deletions benchmarks/lib/search.test.ts
@@ -0,0 +1,80 @@
import { describe, it } from "node:test";
import assert from "node:assert/strict";
import { loadEmbedder, buildIndex, search } from "./search.ts";
import type { Item } from "./search.ts";

describe("benchmarks/lib/search", () => {
it("loadEmbedder returns 384-dim vectors", async () => {
const embedder = await loadEmbedder();
assert.equal(embedder.dimension, 384);

const vec = await embedder.embed("hello world");
assert.equal(vec.length, 384);
// Vector should be normalized (L2 norm ≈ 1)
const norm = Math.sqrt(vec.reduce((s, v) => s + v * v, 0));
assert.ok(Math.abs(norm - 1.0) < 0.01, `norm should be ~1.0, got ${norm}`);
});

it("build + search returns relevant results", async () => {
const embedder = await loadEmbedder();

const items: Item[] = [
{ id: "d-042", text: "In async handlers always use async HTTP clients. Sync clients block the event loop.", metadata: { type: "decision" } },
{ id: "d-007", text: "Every file write goes through atomicWrite: write to temp file then rename.", metadata: { type: "decision" } },
{ id: "m-ci", text: "CI test files must use self-contained temp fixtures, never hardcoded absolute paths.", metadata: { type: "memory" } },
{ id: "m-safety", text: "AXME safety hook denies rm -rf on absolute paths via prefix match.", metadata: { type: "memory" } },
{ id: "d-036", text: "No direct commits to main. All changes must go through pull request.", metadata: { type: "decision" } },
];

const index = await buildIndex(embedder, items);

// Query about async HTTP should return d-042 as top result
const results = await search(embedder, index, "should I use requests.get in async handler?", 3);
assert.ok(results.length > 0, "should have results");
assert.equal(results[0].id, "d-042", "d-042 should be top result for async HTTP query");
assert.ok(results[0].score > 0.3, `score should be meaningful, got ${results[0].score}`);

// Query about CI should return m-ci
const ciResults = await search(embedder, index, "test fixtures hardcoded paths CI", 3);
assert.equal(ciResults[0].id, "m-ci", "m-ci should be top for CI fixtures query");

// Query about git workflow should return d-036
const gitResults = await search(embedder, index, "can I push directly to main branch?", 3);
assert.equal(gitResults[0].id, "d-036", "d-036 should be top for direct-to-main query");
});

it("empty index returns empty results", async () => {
const embedder = await loadEmbedder();
const index = await buildIndex(embedder, []);
const results = await search(embedder, index, "anything", 5);
assert.equal(results.length, 0);
});

it("topK limits results", async () => {
const embedder = await loadEmbedder();
const items: Item[] = Array.from({ length: 20 }, (_, i) => ({
id: `item-${i}`,
text: `This is test item number ${i} about software engineering`,
metadata: {},
}));
const index = await buildIndex(embedder, items);
const results = await search(embedder, index, "software engineering", 3);
assert.equal(results.length, 3);
});

it("scores are sorted descending", async () => {
const embedder = await loadEmbedder();
const items: Item[] = [
{ id: "a", text: "TypeScript Node.js Express server", metadata: {} },
{ id: "b", text: "Python Django web framework", metadata: {} },
{ id: "c", text: "Gardening tips for tomatoes in spring", metadata: {} },
];
const index = await buildIndex(embedder, items);
const results = await search(embedder, index, "Node.js TypeScript backend", 3);
for (let i = 1; i < results.length; i++) {
assert.ok(results[i - 1].score >= results[i].score, "should be sorted descending");
}
// "a" should be most relevant
assert.equal(results[0].id, "a");
});
});