Reproducible evaluation suites and benchmark results for AFS (Agentic File System) — a virtual file system abstraction layer that gives AI agents a unified, path-based interface to heterogeneous storage backends.
This repository accompanies the paper "Everything is a Path: An Asset Substrate for Agent Runtimes" (working title) and the AFS-UI paper, providing the runnable code, fixtures, methodology notes, and result artefacts behind every numerical claim.
About AFS. AFS unifies access to filesystems, databases, key-value stores, APIs, and cloud services behind a path-based protocol that any AI agent can navigate. The core repo is at github.com/AIGNE-io/afs (Apache-2.0). This repo contains the evaluations only — the AFS implementation itself lives there.
| Suite | Question it answers | Headline result |
|---|---|---|
| `agent-substrate-bench/` | How does AFS compare to MCP / LangChain / FS-CLI / Raw-SDK as an agent substrate? | AFS + FS-CLI deterministic 0% leak; others 15–85% on a substrate-neutral ACL prompt across 1,200 trials |
| `afs-protocol-evaluations/` | Protocol-level properties: provider conformance density, schema cost, scaling | E1–E11 experiments — protocol-shape numbers for the paper's evaluation section |
| `afs-ui-evaluations/` | Three UI generation paradigms compared (AUP vs HTML vs Markdown) | Performance, cost, interoperability, maintainability across 3 paradigms |
| `afs-runtime-gpt-5.5/` | Live-runtime regression: does the codebase actually deliver the paper's runtime claims? | RQ1 20/20, RQ2 5/5, RQ3 5/5, RQ11 6/6, conformance 1669/0 |
| `locomo/` | Long-term conversational memory recall (Snap, ACL'24) | LOCOMO Recall@5 — 67.5% with embeddings, 57.8% FTS-only |
| `longmemeval/` | Memory abilities + abstention (UEdinburgh+Tsinghua, ICLR'25) | Recall@5 99.1% with embeddings; QA accuracy 67.6% with claude-haiku-4-5 reader |
| `perltqa/` | Personal long-term memory, multi-category (Du et al., NAACL'24) | Recall@5 86.6% English, 98.4% Chinese (FTS) |
| `dmr/` | Deep Memory Retrieval — canonical multi-session test (MemGPT, 2023) | Recall@5 96.6% with embeddings |
| `MEMORY-BENCHMARKS-SUMMARY.md` | Cross-benchmark headline | One table comparing all four memory benchmarks |
The cumulative claim: AFS is a substrate-uniform asset protocol — its access-control, provenance, and discoverability properties hold across heterogeneous backend shapes (FS, KV, SQLite, JSON, HTTP, vault, …) where POSIX `chmod` cannot. The agent-substrate-bench data above is the empirical evidence; structural conformance tests in the AFS core repo (suites `visibility-acl`, `canonical-paths`, `search-provenance`) protect those properties from regression.
- Bun ≥ 1.3 (used as the test runner across all suites)
- Node.js ≥ 20 (for some scripts that shell out to npm packages)
- pnpm 10.x (workspace package manager)
- Anthropic / OpenAI API keys for suites that drive real LLMs
evaluation/
├── _shared/ shared LLM bridge + QA reader/judge utilities
├── MEMORY-BENCHMARKS-SUMMARY.md cross-benchmark headline table
├── agent-substrate-bench/ paper §IX — AFS vs MCP/LangChain/FS-CLI/Raw-SDK
│ ├── platforms/ one mock substrate adapter per platform
│ ├── tasks/ afs-intrinsic, hotpot-multi, swe-bench
│ ├── runners/ run-suite, plus result analyzers
│ ├── planning/ design.md, task-selection, conformance-promotions
│ └── results/ v1/v2/v3 result CSVs + reports
├── afs-protocol-evaluations/ paper §IX — protocol-level (E1–E11)
├── afs-ui-evaluations/ AUP paper — RQ1/RQ2/RQ3
├── afs-runtime-gpt-5.5/ live-runtime regression on real codebase
├── locomo/, longmemeval/,   public memory benchmarks
└── perltqa/, dmr/           (data must be downloaded separately, see below)
Every suite has its own README with reproducible commands. The simplest entry points:
# Substrate comparison (agent-substrate-bench v3 — 1,200 trials)
cd agent-substrate-bench
bun runners/run-suite.ts --suite afs-intrinsic \
--platforms afs,mcp,langchain,fs-cli,raw-sdk \
--trials 10 \
--models claude-haiku-4-5,claude-sonnet-4-5 \
--out results/your-rerun
# Memory benchmark (e.g. LongMemEval)
cd longmemeval
bun scripts/run.ts --mode s --limit 100   # see longmemeval/README.md

Memory benchmarks (locomo, longmemeval, perltqa, dmr) require their
upstream datasets, which are NOT bundled here for license/size reasons:
| Suite | Dataset source | License |
|---|---|---|
| LOCOMO | snap-research/locomo | per upstream |
| LongMemEval | xiaowu0162/LongMemEval | per upstream |
| PerLTQA | Elvin-Yiming-Du/PerLTQA | per upstream |
| DMR | MemGPT paper appendix — MemGPT codebase | per upstream |
Each suite's README documents the exact path to drop the downloaded data into
(<suite>/data/...).
The agent-substrate-bench/results/ and afs-ui-evaluations/results/
directories contain full transcripts (the v1/v2/v3 paper-grade evidence —
small enough to bundle). The memory benchmarks' raw trials.jsonl files
are not bundled (each is hundreds of MB to several GB); their summary
reports (.md, .csv) are bundled. To regenerate transcripts, re-run
the suite locally.
- 5 substrates under test: AFS (production `@aigne/afs`), MCP-fair (mock MCP server), LangChain-style (BaseRetriever + BaseStore mget), FS-CLI (POSIX `chmod` + grep + cat), Raw-SDK (no ACL primitive)
- 4 tasks probing distinct AFS-intrinsic properties: conformance discovery, access-control, canonical-path provenance, cross-aggregate
- Substrate-neutral prompts — entries named by bare key, "use whatever fetch primitive your substrate provides" — to remove AFS-pathy phrasing bias
- Strict + lenient verifiers — strict checks the paper-grade canonical form; lenient checks whether the agent got the right idea (see the sketch after this list)
- Intrinsic probes — substrate-property metrics (canonical-path-rate, namespace-acl, failure-envelope-rate, cross-provider-density) measured alongside extrinsic success
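As referenced in the verifier bullet above, here is a minimal TypeScript sketch of the strict/lenient split, assuming a simplified answer format; the `verify` function and its arguments are illustrative, not the runner's actual API:

```ts
// Illustrative sketch only: the real verifiers live under runners/.
interface Verdict {
  strict: boolean;  // paper-grade: answer cites the canonical form
  lenient: boolean; // right idea: correct content, form not enforced
}

function verify(answer: string, expectedValue: string, canonicalPath: string): Verdict {
  // lenient: the agent surfaced the right content at all
  const lenient = answer.includes(expectedValue);
  // strict: additionally cites provenance in canonical-path form
  // (e.g. a full /mount/provider/key path rather than a bare key)
  const strict = lenient && answer.includes(canonicalPath);
  return { strict, lenient };
}
```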
See agent-substrate-bench/planning/design.md and
agent-substrate-bench/results/v3-PAPER_REPORT.md
for the full methodology + collaborator-reviewed honest data.
- Same hit rule everyone publishes: a retrieved chunk is a hit if (a) the gold answer appears as a substring after normalisation, OR (b) ≥ 50% of meaningful tokens (length > 3, stopwords removed) appear in the chunk (see the sketch after this list)
- Mirrors YourMemory's methodology documentation
- Scoreable subset = excludes adversarial-refusal categories (LOCOMO 5, LongMemEval _abs); inclusion would conflate retrieval with refusal logic
- Top-line headline: Recall@5 (paper convention)
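A minimal TypeScript sketch of that hit rule as stated above; the helper names and the abbreviated stopword list are illustrative, not the repo's actual implementation:

```ts
// Abbreviated stopword list for illustration; the real harness uses a fuller one.
const STOPWORDS = new Set(["the", "and", "that", "with", "this", "from"]);

// Lowercase, strip punctuation, collapse whitespace.
const normalise = (s: string) =>
  s.toLowerCase().replace(/[^\p{L}\p{N}\s]/gu, " ").replace(/\s+/g, " ").trim();

function isHit(chunk: string, goldAnswer: string): boolean {
  const c = normalise(chunk);
  const g = normalise(goldAnswer);

  // (a) gold answer appears as a substring after normalisation
  if (c.includes(g)) return true;

  // (b) ≥ 50% of meaningful gold tokens (length > 3, stopwords removed) appear in the chunk
  const tokens = g.split(" ").filter((t) => t.length > 3 && !STOPWORDS.has(t));
  if (tokens.length === 0) return false;
  const matched = tokens.filter((t) => c.includes(t)).length;
  return matched / tokens.length >= 0.5;
}

// Recall@5: a question scores 1 if any of the top-5 retrieved chunks is a hit.
const recallAt5 = (top5: string[], gold: string) =>
  top5.some((chunk) => isHit(chunk, gold)) ? 1 : 0;
```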
See MEMORY-BENCHMARKS-SUMMARY.md for cross-suite comparison.
A central charter of this evaluation effort is that paper findings → real
codebase changes. The agent-substrate-bench v1 → v2 → v3 progression closed
that loop end-to-end:
- v1 (700 trials) surfaced 6 substrate-adapter improvements
- v2 shipped 5/6 to the substrate adapter; 700-trial confirmation
- v3 collaborator review removed 2 methodology biases; 5/5 paper-relevant changes promoted to `@aigne/afs` core; 1,200 trials confirmed AFS + FS-CLI as the only deterministic 0%-leak substrates
Then the structural backing was filled in:
| Layer | Artifact in AFS core repo |
|---|---|
| Protocol primitive | packages/core/src/afs.ts — visibility:meta enforcement; MountOptions.visibility universal hook |
| L5 conformance | packages/testing/src/suites/visibility-acl.ts (Proxy invariant: provider.search never invoked; sketched below) |
| L1 conformance | packages/testing/src/suites/canonical-paths.ts, search-provenance.ts |
| Production substrates | providers/core/vault/test/visibility.test.ts, providers/core/kv/test/visibility-mount.test.ts |
| Sweep | 30+ providers verified clean (core / platform / cost / messaging / iot / runtime) |
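For illustration, a minimal sketch of the L5 Proxy invariant referenced in the table above; the `Provider` shape and the test wiring are assumptions, and the authoritative version is `packages/testing/src/suites/visibility-acl.ts` in the core repo:

```ts
// Sketch only: wrap a provider in a Proxy that records every property access,
// mount it with visibility restricted to metadata, run an AFS-level search,
// and assert the provider's search() was never reached.
interface Provider {
  search(query: string): Promise<string[]>;
  stat(path: string): Promise<{ path: string }>;
}

function guardProvider(inner: Provider) {
  const calls: string[] = [];
  const provider = new Proxy(inner, {
    get(target, prop) {
      calls.push(String(prop)); // record which provider members the ACL layer reaches
      return Reflect.get(target, prop);
    },
  }) as Provider;
  return { provider, calls };
}

// In the conformance test: mount the guarded provider with metadata-only
// visibility, issue an AFS-level search, then assert the ACL layer
// short-circuited before the provider:
//   expect(calls).not.toContain("search");
```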
This is an evaluation repo — issues, methodology critiques, and reproductions are welcome. PRs that add a benchmark / substrate / metric in the existing methodology style are happily reviewed.
For changes to the AFS core itself (new providers, protocol features,
conformance suites), please open issues / PRs in
AIGNE-io/afs instead. Keep
discussion of what AFS does in the core repo and how AFS measures up
in this one.
@misc{afs-evaluation,
title = {AFS Evaluation: Benchmark Suites for the Agentic File System},
author = {ArcBlock},
year = {2026},
howpublished = {\url{https://github.com/ArcBlock/afs-evaluation}},
note = {Reproducible evaluation harnesses and result artefacts for AFS
(Agentic File System) — accompanies the "Everything is a Path"
paper.}
}

MIT — see LICENSE.
The benchmark code, methodology, and result analyses in this repository are ours. The third-party dataset references (LOCOMO, LongMemEval, PerLTQA, DMR) remain under their original licenses; we link to them rather than redistribute.