Merged
48 changes: 28 additions & 20 deletions README.md
@@ -153,6 +153,31 @@ AXME Code complements CLAUDE.md — it reads your existing CLAUDE.md during setu

---

## Comparison

| | AXME Code | MemPalace | Mastra | Zep | Mem0 | Supermemory |
|---|---|---|---|---|---|---|
| **Capabilities** | | | | | | |
| Structured decisions w/ enforce levels | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Pre-execution safety hooks | ✅ | ❌ | ⚠️ | ❌ | ❌ | ❌ |
| Structured session handoff | ✅ | ❌ | ❌ | ❌ | ⚠️ | ❌ |
| Automatic knowledge extraction | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ |
| Project oracle (codebase map) | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Multi-repo workspace | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Local-only storage | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| Semantic memory search | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Multi-client support | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| **Capabilities total** | **9/9** | 3/9 | 4/9 | 3/9 | 3/9 | 3/9 |
| **Benchmarks** | | | | | | |
| ToolEmu safety (accuracy) | **100.00%** | — | — | — | — | — |
| ToolEmu safety (FPR) | **0.00%** | — | — | — | — | — |
| LongMemEval E2E | **89.20%** | — | 84.23% / 94.87% | 71.20% | 49.00% | 85.40% |
| LongMemEval R@5 | **97.80%** | 96.60% | — | — | — | — |

See [benchmarks/README.md](benchmarks/README.md) for full methodology, per-category breakdowns, footnotes, and reproduction instructions.

---

## How It Works

![AXME Code Architecture](docs/diagrams/axme-code-overview.png)
@@ -284,7 +309,8 @@ Additional presets available: `production-ready`, `team-collaboration`.

---

## Telemetry
<details>
<summary><strong>Telemetry</strong></summary>

axme-code sends anonymous usage telemetry to help us improve the product. We collect:

@@ -310,21 +336,7 @@ export DO_NOT_TRACK=1

When disabled, no network requests are made and no machine ID is generated.

---

## Releasing

Single command, end-to-end:

```bash
./scripts/release.sh patch # 0.2.7 → 0.2.8
./scripts/release.sh minor # 0.2.7 → 0.3.0
./scripts/release.sh major # 0.2.7 → 1.0.0
```

The script handles preflight (clean tree, on main, npm auth, lint+test+build), bumps the version in all files in lockstep, opens a release PR, waits for you to merge it, then tags + pushes + watches the chained workflow (build → release → npm publish → plugin sync) and verifies all artifacts landed. Use `--dry-run` to see what it would do without writing anything.

Before running, add a `[X.Y.Z] - YYYY-MM-DD` section to `CHANGELOG.md` (the script aborts if it's missing — see [D-128](.axme-code/decisions/) for why this is enforced).
</details>

---

@@ -339,7 +351,3 @@ See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
---

[Website](https://code.axme.ai) · [Issues](https://github.com/AxmeAI/axme-code/issues) · [Architecture](docs/ARCHITECTURE.md) · contact@axme.ai

---

<sub>AXME Code is a Claude Code plugin (MCP server) for persistent memory, context engineering, and safety guardrails. Works with Claude Code CLI and VS Code extension. Alternative to manual CLAUDE.md management, claude-mem, and MemClaw. Open source, MIT licensed.</sub>
167 changes: 167 additions & 0 deletions benchmarks/README.md
@@ -0,0 +1,167 @@
# Competitive Benchmarks

Reproducible benchmarks comparing AXME Code against five competing AI memory systems: **MemPalace**, **Mastra OM**, **Zep**, **Mem0**, and **Supermemory**. All code is open-source; all results are regeneratable.

Last updated: 2026-04-13.

---

## Comparison

| | AXME Code | MemPalace | Mastra | Zep | Mem0 | Supermemory |
|---|---|---|---|---|---|---|
| **Capabilities** | | | | | | |
| Structured decisions w/ enforce levels | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Pre-execution safety hooks | ✅ | ❌ | ⚠️ | ❌ | ❌ | ❌ |
| Structured session handoff | ✅ | ❌ | ❌ | ❌ | ⚠️ | ❌ |
| Automatic knowledge extraction | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ |
| Project oracle (codebase map) | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Multi-repo workspace | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Local-only storage | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| Semantic memory search | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Multi-client support | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| **Capabilities total** | **9/9** | 3/9 | 4/9 | 3/9 | 3/9 | 3/9 |
| **Benchmarks** | | | | | | |
| ToolEmu safety (accuracy) | **100.00%** | — | — | — | — | — |
| ToolEmu safety (FPR) | **0.00%** | — | — | — | — | — |
| LongMemEval E2E | **89.20%** | —¹ | 84.23% / 94.87%² | 71.20% | 49.00%³ | 85.40% |
| LongMemEval R@5 | **97.80%** | 96.60% | — | — | — | — |

¹ MemPalace does not publish E2E results — their runner measures R@5 retrieval only ([GitHub issue #29](https://github.com/MemPalace/mempalace/issues/29)).
² Mastra OM scores 84.23% on gpt-4o / 94.87% on gpt-5-mini.
³ Mem0's official benchmarks are on [LoCoMo](https://arxiv.org/abs/2504.19413) (66.88% overall), not LongMemEval. The 49.00% figure is from a third-party evaluation ([arxiv 2603.04814](https://arxiv.org/abs/2603.04814)).

**Five capabilities unique to AXME**: enforceable decisions, safety hooks, structured handoff, project oracle, multi-repo workspace. No competitor offers any of these.

---

## LongMemEval

[LongMemEval](https://arxiv.org/abs/2410.10813) (ICLR 2025) tests memory recall across long multi-session conversations: 500 questions across 6 question types. It is the de facto standard for memory-system comparisons; Mastra, Zep, Mem0, and Supermemory all publish scores on it.

### Methodology

- **Embedder**: MiniLM-L6-v2 (ONNX, local, zero API cost) + HNSW vector index
- **Pipeline**: sentence-level chunking → top-K retrieval → expand to full sessions → reader → judge
- **Reader**: Claude Sonnet 4.6
- **Judge**: Claude Sonnet 4.6 (LongMemEval protocol)
- **LLM calls per question**: 2 (reader + judge)
- **Type-aware top-K**: multi-session=50, temporal/knowledge-update=20, others=10
- **Type-aware prompts**: specialized for counting, temporal math, preference inference, knowledge updates
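The type-aware top-K policy above can be sketched as a small lookup. This is our illustration of the policy as described, not code from the AXME repo; the function and type names are ours:

```typescript
// Sketch of the type-aware top-K policy described above.
// Illustrative only; names are not from the AXME codebase.
type QuestionType =
  | "multi-session"
  | "temporal-reasoning"
  | "knowledge-update"
  | "single-session-user"
  | "single-session-assistant"
  | "single-session-preference";

function topKFor(type: QuestionType): number {
  switch (type) {
    case "multi-session":
      return 50; // counting/aggregation questions need many candidate chunks
    case "temporal-reasoning":
    case "knowledge-update":
      return 20; // enough context to order events or pick the latest value
    default:
      return 10; // direct single-session fact recall
  }
}
```

The retrieved chunks are then expanded to their full sessions before being handed to the reader.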

### AXME Results (500/500 questions)

**Overall**: E2E **89.20%** · R@5 **97.80%** · avg session recall 98.20%

| Question type | Count | Correct | Accuracy |
|---|---|---|---|
| single-session-user | 70 | 67 | 95.71% |
| knowledge-update | 78 | 74 | 94.87% |
| single-session-assistant | 56 | 50 | 89.29% |
| temporal-reasoning | 133 | 118 | 88.72% |
| single-session-preference | 30 | 26 | 86.67% |
| multi-session | 133 | 111 | 83.46% |

### Analysis

- **Retrieval is solved**: R@5 97.80% — the highest published on LongMemEval. Session recall 98.20% — correct session is almost always found.
- **Strongest types**: single-session-user (96%) and knowledge-update (95%) — direct fact recall and latest-value selection.
- **Weakest type**: multi-session (83%) — counting/aggregation across 15+ sessions. Mastra closes this gap with Observer/Reflector pre-compression at index time.
- **Gap to Mastra top (94.87%)**: 5.7pp. Closing it requires aggregation logic at index time — tracked as product roadmap item B-005 (`axme_search` MCP tool).

### Dataset

Download once (265MB, gitignored):

```bash
mkdir -p benchmarks/longmemeval/data
cd benchmarks/longmemeval/data
wget https://huggingface.co/datasets/xiaowu0162/longmemeval-cleaned/resolve/main/longmemeval_s_cleaned.json
```

**Source**: https://github.com/xiaowu0162/LongMemEval

---

## ToolEmu

[ToolEmu](https://arxiv.org/abs/2309.15817) (NeurIPS 2023, Stanford) defines 9 risk categories for AI agents executing tool calls. We adapted the methodology to command-level safety enforcement: given a command, does the system **block dangerous calls** while allowing benign ones?

### Methodology

90 scenarios across 12 categories:
- **45 dangerous** — must be blocked (rm -rf, shutdown, credential reads, force push, curl-pipe-bash, npm publish, etc.)
- **45 benign** — must be allowed (git status/commit/log, README reads, `npm test`, source file reads)

Each scenario passes through AXME's `checkBash()`, `checkGit()`, `checkFilePath()` from `src/storage/safety.ts`.

### AXME Results

| Metric | Value |
|---|---|
| Accuracy | **100.00%** (90/90) |
| Precision | **1.00** |
| Recall | **1.00** |
| F1 | **1.00** |
| False Positive Rate | **0.00%** (0/45 benign blocked) |

By category (all 100%): data-loss (5/5), system-damage (7/7), credential-exposure (8/8), vcs-destruction (7/7), network-exposure (4/4), privilege-escalation (2/2), supply-chain (7/7), production-deploy (4/4), process-termination (1/1), standard benign (31/31), safe-git (6/6), safe-file (8/8).
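These headline metrics reduce to a standard confusion matrix over the 90 labeled scenarios. A minimal sketch of the scoring, as one could implement it (our illustration; the actual runner lives in `benchmarks/toolemu/`):

```typescript
// Score a safety checker against labeled scenarios.
// Illustrative sketch; not the actual benchmark runner.
interface Scenario {
  command: string;
  dangerous: boolean; // ground truth: should this command be blocked?
}

function score(scenarios: Scenario[], isBlocked: (cmd: string) => boolean) {
  let tp = 0, fp = 0, tn = 0, fn = 0;
  for (const s of scenarios) {
    const blocked = isBlocked(s.command);
    if (s.dangerous && blocked) tp++;        // dangerous, correctly blocked
    else if (!s.dangerous && blocked) fp++;  // benign, wrongly blocked
    else if (!s.dangerous && !blocked) tn++; // benign, correctly allowed
    else fn++;                               // dangerous, missed
  }
  const precision = tp / (tp + fp || 1);
  const recall = tp / (tp + fn || 1);
  return {
    accuracy: (tp + tn) / scenarios.length,
    precision,
    recall,
    f1: (2 * precision * recall) / (precision + recall || 1),
    falsePositiveRate: fp / (fp + tn || 1), // share of benign commands blocked
  };
}
```

A perfect run (45/45 dangerous blocked, 0/45 benign blocked) yields accuracy 1.0, precision/recall/F1 1.0, and FPR 0.0, matching the table above.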

### Why competitors show "—"

None of MemPalace, Mastra, Zep, Mem0, or Supermemory ships pre-execution safety enforcement. Mastra has prompt-level processors that guard LLM output but do not block shell command execution. AXME is the only product in this comparison with a hook-based blocking layer, enforced at the Claude Code harness level rather than suggested in a system prompt.
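The flavor of rule such a blocking layer applies can be illustrated with prefix and pattern matching (a hand-rolled sketch under our own names; the real rules live in `src/storage/safety.ts` and are more extensive):

```typescript
// Illustrative prefix/pattern deny rules; not AXME's actual safety.ts.
const DENY_PREFIXES = ["rm -rf /", "sudo rm -rf", "git push --force"];

function checkBashSketch(command: string): { allowed: boolean; reason?: string } {
  // Normalize whitespace so "rm   -rf  /" still matches.
  const cmd = command.trim().replace(/\s+/g, " ");
  // curl-pipe-bash: curl itself is fine, piping it into a shell is not.
  if (/^curl\b[^|]*\|\s*(ba|z)?sh\b/.test(cmd)) {
    return { allowed: false, reason: "curl piped into a shell" };
  }
  const hit = DENY_PREFIXES.find((p) => cmd.startsWith(p));
  return hit
    ? { allowed: false, reason: `matches deny prefix "${hit}"` }
    : { allowed: true };
}
```

Because this runs before execution, a denied command never reaches the shell; there is no reliance on the model choosing to comply.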

**Source**: https://github.com/ryoungj/ToolEmu

---

## Reproducing

All benchmarks are self-contained in `benchmarks/`. Separate `package.json`, separate dependencies, zero impact on the product.

```bash
git clone https://github.com/AxmeAI/axme-code
cd axme-code/benchmarks
npm install
```

### Run

```bash
# ToolEmu — instant, no API key needed, $0 cost
npm run bench:toolemu

# LongMemEval — requires ANTHROPIC_API_KEY, ~$15 for full 500
ANTHROPIC_API_KEY=sk-ant-... npm run bench:longmemeval -- --limit 500

# Subset / quick test
ANTHROPIC_API_KEY=sk-ant-... npm run bench:longmemeval -- --limit 10
ANTHROPIC_API_KEY=sk-ant-... npm run bench:longmemeval -- --type multi-session --limit 50
ANTHROPIC_API_KEY=sk-ant-... npm run bench:longmemeval -- --offset 100 --limit 50

# Resume an interrupted run from last checkpoint (every 10 questions)
ANTHROPIC_API_KEY=sk-ant-... npm run bench:longmemeval -- --limit 500 --resume

# Search lib unit tests
npm test
```

Results are written to `benchmarks/results/` as JSON with full per-question data.

### Layout

```
benchmarks/
├── lib/search.ts MiniLM-L6-v2 + HNSW (shared)
├── longmemeval/ LongMemEval adapter + runner
├── toolemu/ ToolEmu scenarios + runner
└── results/ JSON output (gitignored)
```

### Dependencies

| Package | Role | Cost |
|---|---|---|
| `@huggingface/transformers` | MiniLM-L6-v2 embeddings (local ONNX) | Free |
| `hnswlib-node` | HNSW ANN index | Free |
| `@anthropic-ai/sdk` | Reader + judge for LongMemEval | $15/500q |
80 changes: 80 additions & 0 deletions benchmarks/lib/search.test.ts
@@ -0,0 +1,80 @@
import { describe, it } from "node:test";
import assert from "node:assert/strict";
import { loadEmbedder, buildIndex, search } from "./search.ts";
import type { Item } from "./search.ts";

describe("benchmarks/lib/search", () => {
it("loadEmbedder returns 384-dim vectors", async () => {
const embedder = await loadEmbedder();
assert.equal(embedder.dimension, 384);

const vec = await embedder.embed("hello world");
assert.equal(vec.length, 384);
// Vector should be normalized (L2 norm ≈ 1)
const norm = Math.sqrt(vec.reduce((s, v) => s + v * v, 0));
assert.ok(Math.abs(norm - 1.0) < 0.01, `norm should be ~1.0, got ${norm}`);
});

it("build + search returns relevant results", async () => {
const embedder = await loadEmbedder();

const items: Item[] = [
{ id: "d-042", text: "In async handlers always use async HTTP clients. Sync clients block the event loop.", metadata: { type: "decision" } },
{ id: "d-007", text: "Every file write goes through atomicWrite: write to temp file then rename.", metadata: { type: "decision" } },
{ id: "m-ci", text: "CI test files must use self-contained temp fixtures, never hardcoded absolute paths.", metadata: { type: "memory" } },
{ id: "m-safety", text: "AXME safety hook denies rm -rf on absolute paths via prefix match.", metadata: { type: "memory" } },
{ id: "d-036", text: "No direct commits to main. All changes must go through pull request.", metadata: { type: "decision" } },
];

const index = await buildIndex(embedder, items);

// Query about async HTTP should return d-042 as top result
const results = await search(embedder, index, "should I use requests.get in async handler?", 3);
assert.ok(results.length > 0, "should have results");
assert.equal(results[0].id, "d-042", "d-042 should be top result for async HTTP query");
assert.ok(results[0].score > 0.3, `score should be meaningful, got ${results[0].score}`);

// Query about CI should return m-ci
const ciResults = await search(embedder, index, "test fixtures hardcoded paths CI", 3);
assert.equal(ciResults[0].id, "m-ci", "m-ci should be top for CI fixtures query");

// Query about git workflow should return d-036
const gitResults = await search(embedder, index, "can I push directly to main branch?", 3);
assert.equal(gitResults[0].id, "d-036", "d-036 should be top for direct-to-main query");
});

it("empty index returns empty results", async () => {
const embedder = await loadEmbedder();
const index = await buildIndex(embedder, []);
const results = await search(embedder, index, "anything", 5);
assert.equal(results.length, 0);
});

it("topK limits results", async () => {
const embedder = await loadEmbedder();
const items: Item[] = Array.from({ length: 20 }, (_, i) => ({
id: `item-${i}`,
text: `This is test item number ${i} about software engineering`,
metadata: {},
}));
const index = await buildIndex(embedder, items);
const results = await search(embedder, index, "software engineering", 3);
assert.equal(results.length, 3);
});

it("scores are sorted descending", async () => {
const embedder = await loadEmbedder();
const items: Item[] = [
{ id: "a", text: "TypeScript Node.js Express server", metadata: {} },
{ id: "b", text: "Python Django web framework", metadata: {} },
{ id: "c", text: "Gardening tips for tomatoes in spring", metadata: {} },
];
const index = await buildIndex(embedder, items);
const results = await search(embedder, index, "Node.js TypeScript backend", 3);
for (let i = 1; i < results.length; i++) {
assert.ok(results[i - 1].score >= results[i].score, "should be sorted descending");
}
// "a" should be most relevant
assert.equal(results[0].id, "a");
});
});