Reproducible evaluation suites and benchmark results for AFS (Agentic File System) — a virtual file system abstraction layer that gives AI agents a unified, path-based interface to heterogeneous storage backends.
This repository accompanies the paper "Everything is a Path: An Asset Substrate for Agent Runtimes" (working title) and the AFS-UI paper, providing the runnable code, fixtures, methodology notes, and result artefacts behind every numerical claim.
About AFS. AFS unifies access to filesystems, databases, key-value stores, APIs, and cloud services behind a path-based protocol that any AI agent can navigate. The core repo is at github.com/AIGNE-io/afs (Apache-2.0). This repo contains the evaluations only — the AFS implementation itself lives there.
| Suite | Question it answers | Headline result |
|---|---|---|
| `agent-substrate-bench/` | How does AFS compare to MCP / LangChain / FS-CLI / Raw-SDK as an agent substrate? | AFS + FS-CLI deterministic 0% leak; others 15–85% on a substrate-neutral ACL prompt across 1,200 trials |
| `afs-protocol-evaluations/` | Protocol-level properties: provider conformance density, schema cost, scaling | E1–E11 experiments — protocol-shape numbers for the paper's evaluation section |
| `afs-ui-evaluations/` | Three UI generation paradigms compared (AUP vs HTML vs Markdown) | Performance, cost, interoperability, maintainability across 3 paradigms |
| `afs-runtime-gpt-5.5/` | Live-runtime regression: does the codebase actually deliver the paper's runtime claims? | RQ1 20/20, RQ2 5/5, RQ3 5/5, RQ11 6/6, conformance 1669/0 |
| `locomo/` | Long-term conversational memory recall (Snap, ACL'24) | LOCOMO Recall@5 — 67.5% with embeddings, 57.8% FTS-only |
| `longmemeval/` | Memory abilities + abstention (UEdinburgh+Tsinghua, ICLR'25) | Recall@5 99.1% with embeddings; QA accuracy 67.6% with claude-haiku-4-5 reader |
| `perltqa/` | Personal long-term memory, multi-category (Du et al., NAACL'24) | Recall@5 86.6% English, 98.4% Chinese (FTS) |
| `dmr/` | Deep Memory Retrieval — canonical multi-session test (MemGPT, 2023) | Recall@5 96.6% with embeddings |
| `MEMORY-BENCHMARKS-SUMMARY.md` | Cross-benchmark headline | One table comparing all four memory benchmarks |
The cumulative claim: AFS is a substrate-uniform asset protocol — its access-control, provenance, and discoverability properties hold across heterogeneous backend shapes (FS, KV, SQLite, JSON, HTTP, vault, …) where POSIX `chmod` cannot. The agent-substrate-bench data above is the empirical evidence; structural conformance tests in the AFS core repo (suites `visibility-acl`, `canonical-paths`, `search-provenance`) protect those properties from regression.
- Bun ≥ 1.3 (used as the test runner across all suites)
- Node.js ≥ 20 (for some scripts that shell out to npm packages)
- pnpm 10.x (workspace package manager)
- Anthropic / OpenAI API keys for suites that drive real LLMs
evaluation/
├── _shared/ shared LLM bridge + QA reader/judge utilities
├── MEMORY-BENCHMARKS-SUMMARY.md cross-benchmark headline table
├── agent-substrate-bench/ paper §IX — AFS vs MCP/LangChain/FS-CLI/Raw-SDK
│ ├── platforms/ one mock substrate adapter per platform
│ ├── tasks/ afs-intrinsic, hotpot-multi, swe-bench
│ ├── runners/ run-suite, plus result analyzers
│ ├── planning/ design.md, task-selection, conformance-promotions
│ └── results/ v1/v2/v3 result CSVs + reports
├── afs-protocol-evaluations/ paper §IX — protocol-level (E1–E11)
├── afs-ui-evaluations/ AUP paper — RQ1/RQ2/RQ3
├── afs-runtime-gpt-5.5/ live-runtime regression on real codebase
├── locomo/, longmemeval/,   public memory benchmarks
└── perltqa/, dmr/           (data must be downloaded separately, see below)
Every suite has its own README with reproducible commands. The simplest entry points:
# Substrate comparison (agent-substrate-bench v3 — 1,200 trials)
cd agent-substrate-bench
bun runners/run-suite.ts --suite afs-intrinsic \
--platforms afs,mcp,langchain,fs-cli,raw-sdk \
--trials 10 \
--models claude-haiku-4-5,claude-sonnet-4-5 \
--out results/your-rerun
# Memory benchmark (e.g. LongMemEval)
cd longmemeval
bun scripts/run.ts --mode s --limit 100   # see longmemeval/README.md

Memory benchmarks (locomo, longmemeval, perltqa, dmr) require their
upstream datasets, which are NOT bundled here for license/size reasons:
| Suite | Dataset source | License |
|---|---|---|
| LOCOMO | snap-research/locomo | per upstream |
| LongMemEval | xiaowu0162/LongMemEval | per upstream |
| PerLTQA | Elvin-Yiming-Du/PerLTQA | per upstream |
| DMR | MemGPT paper appendix — MemGPT codebase | per upstream |
Each suite's README documents the exact path to drop the downloaded data into
(<suite>/data/...).
The agent-substrate-bench/results/ and afs-ui-evaluations/results/
directories contain full transcripts (the v1/v2/v3 paper-grade evidence —
small enough to bundle). The memory benchmarks' raw trials.jsonl files
are not bundled (each is hundreds of MB to several GB); their summary
reports (.md, .csv) are bundled. To regenerate transcripts, re-run
the suite locally.
- 5 substrates under test: AFS (production `@aigne/afs`), MCP-fair (mock MCP server), LangChain-style (BaseRetriever + BaseStore mget), FS-CLI (POSIX `chmod` + grep + cat), Raw-SDK (no ACL primitive)
- 4 tasks probing distinct AFS-intrinsic properties: conformance discovery, access-control, canonical-path provenance, cross-aggregate
- Substrate-neutral prompts — entries named by bare key, "use whatever fetch primitive your substrate provides" — to remove AFS-pathy phrasing bias
- Strict + lenient verifiers — strict checks the paper-grade canonical form; lenient checks whether the agent got the right idea (see the sketch after this list)
- Intrinsic probes — substrate-property metrics (canonical-path-rate, namespace-acl, failure-envelope-rate, cross-provider-density) measured alongside extrinsic success
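As referenced in the verifier bullet above, here is a minimal TypeScript sketch of the strict/lenient split, assuming a simplified answer format; the `verify` function and its arguments are illustrative, not the runner's actual API:

```ts
// Illustrative sketch only: the real verifiers live under runners/.
interface Verdict {
  strict: boolean;  // paper-grade: answer cites the canonical form
  lenient: boolean; // right idea: correct content, form not enforced
}

function verify(answer: string, expectedValue: string, canonicalPath: string): Verdict {
  // lenient: the agent surfaced the right content at all
  const lenient = answer.includes(expectedValue);
  // strict: additionally cites provenance in canonical-path form
  // (e.g. a full /mount/provider/key path rather than a bare key)
  const strict = lenient && answer.includes(canonicalPath);
  return { strict, lenient };
}
```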
See agent-substrate-bench/planning/design.md and
agent-substrate-bench/results/v3-PAPER_REPORT.md
for the full methodology + collaborator-reviewed honest data.
- Same hit rule everyone publishes: a retrieved chunk is a hit if (a) the gold answer appears as a substring after normalisation, OR (b) ≥ 50% of meaningful tokens (length > 3, stopwords removed) appear in the chunk (see the sketch after this list)
- Mirrors YourMemory's methodology documentation
- Scoreable subset = excludes adversarial-refusal categories (LOCOMO 5, LongMemEval _abs); inclusion would conflate retrieval with refusal logic
- Top-line headline: Recall@5 (paper convention)
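A minimal TypeScript sketch of that hit rule as stated above; the helper names and the abbreviated stopword list are illustrative, not the repo's actual implementation:

```ts
// Abbreviated stopword list for illustration; the real harness uses a fuller one.
const STOPWORDS = new Set(["the", "and", "that", "with", "this", "from"]);

// Lowercase, strip punctuation, collapse whitespace.
const normalise = (s: string) =>
  s.toLowerCase().replace(/[^\p{L}\p{N}\s]/gu, " ").replace(/\s+/g, " ").trim();

function isHit(chunk: string, goldAnswer: string): boolean {
  const c = normalise(chunk);
  const g = normalise(goldAnswer);

  // (a) gold answer appears as a substring after normalisation
  if (c.includes(g)) return true;

  // (b) ≥ 50% of meaningful gold tokens (length > 3, stopwords removed) appear in the chunk
  const tokens = g.split(" ").filter((t) => t.length > 3 && !STOPWORDS.has(t));
  if (tokens.length === 0) return false;
  const matched = tokens.filter((t) => c.includes(t)).length;
  return matched / tokens.length >= 0.5;
}

// Recall@5: a question scores 1 if any of the top-5 retrieved chunks is a hit.
const recallAt5 = (top5: string[], gold: string) =>
  top5.some((chunk) => isHit(chunk, gold)) ? 1 : 0;
```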
See MEMORY-BENCHMARKS-SUMMARY.md for cross-suite comparison.
A central charter of this evaluation effort is that paper findings → real
codebase changes. The agent-substrate-bench v1 → v2 → v3 progression closed
that loop end-to-end:
- v1 (700 trials) surfaced 6 substrate-adapter improvements
- v2 shipped 5/6 to the substrate adapter; 700-trial confirmation
- v3 collaborator review removed 2 methodology biases; 5/5 paper-relevant changes promoted to `@aigne/afs` core; 1,200 trials confirmed AFS + FS-CLI as the only deterministic 0%-leak substrates
Then the structural backing was filled in:
| Layer | Artifact in AFS core repo |
|---|---|
| Protocol primitive | packages/core/src/afs.ts — visibility:meta enforcement; MountOptions.visibility universal hook |
| L5 conformance | packages/testing/src/suites/visibility-acl.ts (Proxy invariant: provider.search never invoked; sketched below) |
| L1 conformance | packages/testing/src/suites/canonical-paths.ts, search-provenance.ts |
| Production substrates | providers/core/vault/test/visibility.test.ts, providers/core/kv/test/visibility-mount.test.ts |
| Sweep | 30+ providers verified clean (core / platform / cost / messaging / iot / runtime) |
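For illustration, a minimal sketch of the L5 Proxy invariant referenced in the table above; the `Provider` shape and the test wiring are assumptions, and the authoritative version is `packages/testing/src/suites/visibility-acl.ts` in the core repo:

```ts
// Sketch only: wrap a provider in a Proxy that records every property access,
// mount it with visibility restricted to metadata, run an AFS-level search,
// and assert the provider's search() was never reached.
interface Provider {
  search(query: string): Promise<string[]>;
  stat(path: string): Promise<{ path: string }>;
}

function guardProvider(inner: Provider) {
  const calls: string[] = [];
  const provider = new Proxy(inner, {
    get(target, prop) {
      calls.push(String(prop)); // record which provider members the ACL layer reaches
      return Reflect.get(target, prop);
    },
  }) as Provider;
  return { provider, calls };
}

// In the conformance test: mount the guarded provider with metadata-only
// visibility, issue an AFS-level search, then assert the ACL layer
// short-circuited before the provider:
//   expect(calls).not.toContain("search");
```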
This is an evaluation repo — issues, methodology critiques, and reproductions are welcome. PRs that add a benchmark / substrate / metric in the existing methodology style are happily reviewed.
For changes to the AFS core itself (new providers, protocol features,
conformance suites), please open issues / PRs in
AIGNE-io/afs instead. Keep
discussion of what AFS does in the core repo and how AFS measures up
in this one.
@misc{afs-evaluation,
title = {AFS Evaluation: Benchmark Suites for the Agentic File System},
author = {ArcBlock},
year = {2026},
howpublished = {\url{https://github.com/ArcBlock/afs-evaluation}},
note = {Reproducible evaluation harnesses and result artefacts for AFS
(Agentic File System) — accompanies the "Everything is a Path"
paper.}
}

MIT — see LICENSE.
The benchmark code, methodology, and result analyses in this repository are ours. The third-party dataset references (LOCOMO, LongMemEval, PerLTQA, DMR) remain under their original licenses; we link to them rather than redistribute.