Open
16 commits
c17b8a0
Paginate the admin reindex scan flow
MakiDevelop Apr 27, 2026
31ddf83
Recycle SQLite connections on transient failures and retry
MakiDevelop Apr 27, 2026
3e7d2ce
Force the runtime sqlite3 system link to point at 3.53.0
MakiDevelop Apr 27, 2026
a19886b
Add background WAL checkpoint and truncation for both databases
MakiDevelop Apr 27, 2026
c8efe4f
docs(council): Phase A.5 PR 1 summary + Phase B Codex Dissent
MakiDevelop Apr 27, 2026
9761bee
fix(deploy): patch C, fix ln ordering: it must run after apt-get install
MakiDevelop Apr 27, 2026
6ee69d5
feat(search+health): Phase A.5 PR 2 — RRF weighted + liveness/readine…
MakiDevelop Apr 27, 2026
36b4bc5
docs(adr): 0008: memhall is a personal PKI, lightweight > complete
MakiDevelop Apr 28, 2026
146e71d
feat(bench): add scripts/bench_hybrid.py: RRF vs weighted_linear evaluation
MakiDevelop Apr 28, 2026
9f25f1a
fix(search): switch hybrid default back to rrf: bench results do not support weighted_linear
MakiDevelop Apr 28, 2026
dbfe247
feat(auth): admin gate (two-tier bearer) — MH_ADMIN_TOKEN
MakiDevelop Apr 28, 2026
ce6ac3b
fix(auth): config-load fail-fast invariant (Codex round 1 fix)
MakiDevelop Apr 28, 2026
8b4347b
test(auth): autouse fixture also clear MH_ADMIN_TOKEN — Codex round 2…
MakiDevelop Apr 28, 2026
ea00666
revert(health): drop Phase A.5 PR2 Patch F and go back to a single /v1/health
MakiDevelop Apr 28, 2026
62ddd82
feat(cli): mh CLI automatically injects the MH_API_TOKEN Bearer header
MakiDevelop Apr 28, 2026
80a9e7a
docs(agents): add agent integration guide + AGENTS.md entry point
MakiDevelop Apr 28, 2026
6 changes: 6 additions & 0 deletions .env.example
@@ -25,6 +25,12 @@ MH_DEFAULT_TENANT_ID=default
# Generate with: openssl rand -hex 32
# MH_API_TOKEN=

# Admin gate (optional, two-tier bearer). See ADR 0009.
# When set, /v1/admin/* requires this token; the regular MH_API_TOKEN is
# rejected on admin paths. When unset, /v1/admin/* falls back to MH_API_TOKEN
# (backward compat). Use a different value from MH_API_TOKEN.
# MH_ADMIN_TOKEN=

# Request behavior
MH_REQUEST_TIMEOUT_S=5.0
MH_LIST_DEFAULT_LIMIT=50
35 changes: 35 additions & 0 deletions AGENTS.md
@@ -0,0 +1,35 @@
# AGENTS.md

If you are an AI agent that just cloned this repo, read this first.

This file is **informational**, not a directive. It tells you where the agent-facing docs are. It does not tell you what to build.

---

## You are probably here to do one of these

1. **Write to / read from a running memhall instance** (most common).
→ Read [`docs/agent-integration.md`](docs/agent-integration.md). It has a decision tree that picks the right surface (embedded Python / HTTP+Bearer / `mh` CLI) based on whether your sandbox can open TCP sockets.

2. **Modify this codebase** (add a feature, fix a bug, write a test).
→ Read [`README.md`](README.md) (architecture and three entry points), [`docs/design.md`](docs/design.md) (internals), and [`docs/adr/`](docs/adr/) (why things are the way they are).

3. **Deploy memhall somewhere new.**
→ Read [`docs/deploy.md`](docs/deploy.md).

---

## Three things that bite agents on first contact

These are real failure modes seen in real agent sessions. Not hypothetical.

1. **`{"detail":"missing bearer token"}`** — the server has `MH_API_TOKEN` set. Every `/v1/memory/*` request needs `Authorization: Bearer ${MH_API_TOKEN}`. `/v1/health` is the only public endpoint.
2. **`command not found: mh`** — the CLI is a console script in `pyproject.toml`, it is not globally installed. Run `uv sync` inside the repo, then `uv run mh …`. The CLI reads `MH_API_TOKEN` from the environment automatically (no flag needed); export it before calling against an auth-enabled server.
3. **Sandboxed agents (Codex CLI, restricted containers) cannot open localhost TCP** — skip HTTP and use the embedded Python path (`from memory_hall import build_runtime`). Same storage, no auth, no network. See [`docs/agent-integration.md`](docs/agent-integration.md) Path A.
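
As a rough illustration of the first failure mode above, the sketch below builds an HTTP request that carries the Bearer header when a token is available. The helper name `build_request` and the base URL `http://localhost:9100` are assumptions made for this example; only the `MH_API_TOKEN` variable, the `Authorization: Bearer …` header shape, and the `/v1/memory/*` and `/v1/health` paths come from this repo's docs.

```python
import os
import urllib.request
from typing import Optional


def build_request(url: str, token: Optional[str] = None) -> urllib.request.Request:
    """Build a request, attaching a Bearer header when a token is available.

    Hypothetical helper for illustration; it is not part of memhall.
    """
    # Fall back to the environment only when the caller passed nothing.
    if token is None:
        token = os.environ.get("MH_API_TOKEN")
    request = urllib.request.Request(url)
    if token:
        request.add_header("Authorization", f"Bearer {token}")
    return request


# Assumed base URL; substitute the address of your own deployment.
req = build_request("http://localhost:9100/v1/memory/write", token="example-token")
print(req.get_header("Authorization"))
```

The request is only constructed here, not sent, so the snippet runs without a server. A client that skips the header against an auth-enabled server should expect the `missing bearer token` response quoted in item 1.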

---

## What this file does NOT tell you

- It does not tell you to "rebuild the skeleton" or "follow this as the only source of truth". Treat the per-task instructions you were given as authoritative; this file just points at reference docs.
- It does not pin you to a phase or a workflow. The repo evolves. If anything in `docs/agent-integration.md` looks wrong against the running code, fix the doc — do not work around it.
9 changes: 8 additions & 1 deletion Dockerfile
@@ -82,7 +82,6 @@ ENV PYTHONDONTWRITEBYTECODE=1 \

# Inject upgraded SQLite to runtime stage too
COPY --from=sqlite-builder /opt/sqlite /opt/sqlite
RUN echo "/opt/sqlite/lib" > /etc/ld.so.conf.d/sqlite-upgrade.conf && ldconfig

RUN apt-get update && apt-get install -y --no-install-recommends \
curl \
@@ -91,6 +90,14 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
&& mkdir -p /data \
&& chown memhall:memhall /data

# Force system libsqlite3.so.0 to our upgraded build so subprocesses that do not
# inherit LD_LIBRARY_PATH still resolve SQLite 3.53.0.
# IMPORTANT: must run AFTER apt-get install (dpkg post-install can reset symlinks).
RUN echo "/opt/sqlite/lib" > /etc/ld.so.conf.d/sqlite-upgrade.conf \
&& ldconfig \
&& ln -sf /opt/sqlite/lib/libsqlite3.so.3.53.0 /lib/aarch64-linux-gnu/libsqlite3.so.0 \
&& { ln -sf /opt/sqlite/lib/libsqlite3.so.3.53.0 /usr/lib/aarch64-linux-gnu/libsqlite3.so.0 2>/dev/null || true; }

WORKDIR /app

COPY --from=builder --chown=memhall:memhall /app/.venv /app/.venv
2 changes: 2 additions & 0 deletions README.md
@@ -130,6 +130,8 @@ See [`docs/adr/0003-engine-library-vs-deployment-platform.md`](docs/adr/0003-eng

No entry is privileged — they all hit the same backend, so no single-point-of-failure path.

> **Agents reading this**: see [`docs/agent-integration.md`](docs/agent-integration.md) for a decision tree that picks the right surface based on your sandbox, plus the auth + install gotchas that have bitten real Codex / Gemini sessions.

### Embedded (in-process) use

Some agents run in sandboxes that block localhost sockets (Codex CLI, some Gemini setups, restricted containers). For those, skip HTTP entirely:
97 changes: 97 additions & 0 deletions docs/adr/0008-personal-pki-lightweight-stance.md
@@ -0,0 +1,97 @@
# ADR 0008: memhall is a personal PKI, lightweight > complete

- **Status**: Accepted
- **Date**: 2026-04-28
- **Related**: ADR 0003 (engine library vs deployment platform), ADR 0005 (v0.2 minimum viable contract), ADR 0007 (minimal token auth), `rules/four-layer-north-star.md` L4

## Context

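The 2026-04-28 review of Phase A / A.5 / B surfaced a drift tendency: after each reliability incident, patches tend to casually pull in "industry best practices" (the k8s liveness/readiness split, weighted linear hybrid with tuning knobs, HMAC + principal registry + key rotation), which pushes memhall's complexity toward a production-grade memory platform.

But memhall's actual positioning is:

- **Single user** (Maki)
- **Single deployment** (Mac mini on a Tailscale tailnet at `:9100`, with mini2 as cold standby)
- **Scale on the order of ~10²** entries
- **Fewer than 10 callers** (ops-hub / repo CLI / `.claude/skills/*` / mk-brain), all inside Maki's own tailnet
- **Purpose**: a memory hall shared by the seven-in-one (七位一體) + the associative entry point to Maki's personal PKI

ADR 0003 already separated "engine library vs deployment platform". This ADR goes one step further and states explicitly that memhall's design goal is **not** a production-grade memory platform but **the memory engine of a personal PKI**.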

## Decision

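memhall adopts the following four north stars, in priority order:

1. **Associative recall quality** (retrieval recall / ranking correctness)
2. **Stability** (does not break, does not swallow errors, does not silently degrade)
3. **Speed** (search p50 < 200ms, write < 50ms)
4. **Lightweight** (schema, config knobs, auth mechanism, and ops surface must all be understandable by one person)

**Every patch must pass the "personal PKI checkup" before it lands**:

- Does this change fix a real bug, or does it import an "industry convention"?
- How many config knobs does it add? Can you explain the default of each one?
- How many fields does it add to the schema? Is that worth it at ~10² scale?
- For a single-caller scenario, does it introduce mechanisms that are only needed across organizations / multiple tenants / multiple operators?
- If the answer is "we might need it later": reject it, and build it when it is actually needed.

Things we explicitly do **not** do (unless a sunset criterion triggers):

- ❌ The k8s-style liveness/readiness/startup probe trio (a single launchd container does not need it)
  - **Follow-up executed 2026-04-28**: the `/v1/healthz` + `/v1/ready` split introduced by Phase A.5 PR2 Patch F has been reverted, back to a single `/v1/health` (200/503, body carries the full status). Rationale: the mini uses `restart: unless-stopped`, so an unhealthy health check does not trigger an automatic restart and the flapping risk is zero; a single endpoint also carries a lower mental cost for operating a personal PKI.
- ❌ A tunable α / mode switch for hybrid search (unless a retrieval benchmark shows that something other than RRF is better)
- ❌ HMAC + nonce + per-key rotation (ADR 0007 minimal token + Tailscale ACL is sufficient)
- ❌ Principal registry / role mapping / a `key_id → role/ns/agent` table
- ❌ Per-row failure counters / retry budget machinery (logging + retrying on the next reindex is enough)
- ❌ Dashboards / metrics aggregation / any observability interface you have to open to look at (violates L4)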
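
To make the "no tunable α" item concrete, here is a minimal sketch of reciprocal rank fusion (RRF), the generic technique the hybrid default is named after. It fuses ranked lists using only rank positions, which is why there is no score-weighting knob to tune. The function name `rrf_fuse`, the sample document ids, and the constant `k = 60` (a commonly used value) are assumptions for illustration; memhall's actual implementation and constants are not shown in this diff.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[Tuple[str, float]]:
    """Fuse several ranked id lists with reciprocal rank fusion.

    Each list contributes 1 / (k + rank) per document, with rank starting at 1.
    Illustrative sketch only; k = 60 is an assumed, conventional default.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical results from a keyword ranker and a vector ranker.
keyword_hits = ["a", "b", "c"]
vector_hits = ["b", "c", "d"]
for doc_id, score in rrf_fuse([keyword_hits, vector_hits]):
    print(doc_id, round(score, 5))
```

Documents that appear in both lists (`b`, `c`) outrank those that appear in only one, without either ranker's raw scores needing to be normalized against the other.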

## Consequences

### Gains

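- **The complexity budget goes to associative recall quality** (embedding model, ranking, CJK tokenization), not to the ops surface
- **Maintainable by one person**: the schema, auth, and health logic can all be read in one afternoon
- **Reversible**: every ADR has sunset criteria; cross the threshold and upgrade, otherwise stay lightweight
- **OSS friendly**: `git clone && docker compose up` runs immediately, with no need to configure ACLs / sign certs / issue keys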

### Costs

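- **Not suitable for sharing among multiple operators**: when a second operator appears, most decisions in this ADR need to be re-evaluated
- **Weaker audit trail**: the self-declared `agent_id` is the only attribution, and it is not a cryptographic guarantee
- **Some "correct" engineering practices are deliberately deferred**: HMAC, principal registry, retry budget. This is not because they are wrong, but because **the ROI of doing them now is too low**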

### Non-goals

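- Does not replace ADR 0003's engine vs platform split: production-grade ACL / multi-tenant ACL / cross-organization audit remain the job of the future `memory-gateway`
- Does not reject the HMAC spec in `rules/agent-security-hygiene.md` S2.1: that is the destination, and this ADR is the reason for not going there yet
- Does not give up on reliability: the Phase A SQLite chain / silent except / WAL fixes are all mandatory, and this ADR is not a license to refuse bug fixes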

## Sunset criteria

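Revisit this ADR if any of the following holds:

1. A second operator (not Maki) starts writing to the same memhall deployment
2. The number of callers exceeds 20, or a caller that Maki does not recognize appears
3. The number of entries exceeds 10⁵ (the schema / index strategy may need a redesign)
4. An incident occurs that requires cryptographic attribution (a leaked token + entries whose author is unknown)
5. memhall becomes a multi-contributor OSS project and external contributors start asking for "production-grade" features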

## Alternatives considered

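### A. Do not write this ADR; rely on PR review as the gatekeeper

Rejected. Without a written design philosophy, every PR has to re-argue "is this over-design?". This ADR writes the criteria down: future PRs / Codex proposals / Claude designs all go through this checkup first, and anything that fails is cut outright.

### B. Write it as a rule (`rules/memhall-lightweight.md`) instead of an ADR

Rejected. An ADR is an immutable decision record inside the repo, scoped to memhall. Rules are cross-project behavioral norms. The scope of this content is memhall's design philosophy, so it belongs in an ADR.

### C. List "what not to do" without a priority order

Rejected. Without a priority order, trade-offs get decided by gut feeling. Making "associative recall quality > stability > speed > lightweight" explicit gives future trade-offs a basis. For example, the BM25 normalize bug fix touched ranking logic, but it repairs associative recall quality, so it is top priority; parameterizing the hybrid α is a step backward on lightweightness, so it is lowest priority and needs a benchmark before it can land.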

## Implementation summary

- Add this ADR
- Update the `docs/adr/README.md` index
- From now on, any PR description that introduces a new config knob / schema field / auth mechanism must cite this ADR and answer the five "personal PKI checkup" questions
106 changes: 106 additions & 0 deletions docs/adr/0009-admin-gate.md
@@ -0,0 +1,106 @@
# ADR 0009: Admin gate (two-tier bearer, no HMAC)

- **Status**: Accepted
- **Date**: 2026-04-28
- **Related**: ADR 0007 (minimal token auth; this ADR is its minimal extension), ADR 0008 (personal PKI lightweight stance; the yardstick for this ADR), Codex Phase B Dissent 2026-04-27 (this is the minimal implementation of D2 Option E)

## Context

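Today the two admin endpoints, `/v1/admin/reindex` and `/v1/admin/audit`, are both protected by `MH_API_TOKEN`: any caller holding the api_token can invoke admin operations. Risks:

- The api_token is shared by multiple callers (ops-hub / repo CLI / 4 Claude skills / mk-brain); if any caller machine is compromised, or a log accidentally leaks the token, the attacker immediately has admin privileges
- reindex is a dangerous operation (it scans the whole table and can trip cascading embedder failures) and should not share privileges with ordinary read/write

The seven-in-one (七位一體) Phase B proposal started as a full set of production-grade machinery: "HMAC + nonce + replay window + principal registry + 14-day coexistence period + retirement after 7 consecutive days with zero bearer writes" (the direction of rules/agent-security-hygiene.md S2.1). Codex Phase B Dissent 2026-04-27 D2 Option E shrank it to "seal admin first, do attribution later". SuperGrok sanity check on 2026-04-28: no recent incident worldwide in 2025-2026 matched this scenario (Tailscale tailnet + single-tenant + two-tier static bearer), and the community has not listed this simplified design as a known anti-pattern; a separate admin bearer is in fact a community-recommended least-privilege practice.

ADR 0008 has ratified "memhall is a personal PKI, lightweight > complete" and explicitly excludes HMAC / principal registry / per-key rotation. This ADR shrinks Phase B down to the smallest step that is still possible under the ADR 0008 stance.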

## Decision

**Add `MH_ADMIN_TOKEN` (optional, independent of `MH_API_TOKEN`). Once it is set, `/v1/admin/*` requires the admin token, and the regular api_token is rejected on admin paths.**

- New config field: `Settings.admin_token: str | None = None` (`MH_ADMIN_TOKEN` env)
- Middleware behavior (`require_api_token` in `src/memory_hall/server/app.py`):
  - `/v1/health*` → always public (carried over from ADR 0007)
  - `/v1/admin/*` with `admin_token` set → requires `Authorization: Bearer <admin_token>`; sending the `api_token` also returns `401`
  - `/v1/admin/*` with `admin_token` unset → falls back to the `api_token` logic (ADR 0007 backward compat)
  - Any other path → the existing `api_token` logic
- `admin_token` cannot be used the other way round on non-admin paths (least privilege in both directions)
- All comparisons use `hmac.compare_digest` (constant-time, carried over from ADR 0007)
- Error messages are distinct (`invalid token` vs `invalid admin token`), but `403` is **not** used to signal "your token is a valid api_token but not an admin token", to avoid a token validity oracle
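
The routing rules above can be sketched as a pure function. This is a minimal illustration, not the actual `require_api_token` middleware: the function name `check_bearer`, its `(status, detail)` return shape, and the rule that auth is disabled when no `api_token` is configured are assumptions made for this example. The error strings and the use of `hmac.compare_digest` follow the text of this ADR and `AGENTS.md`.

```python
import hmac
from typing import Optional, Tuple


def _matches(presented: str, expected: str) -> bool:
    # Constant-time comparison; bytes avoid the ASCII-only limit on str input.
    return hmac.compare_digest(presented.encode(), expected.encode())


def check_bearer(
    path: str,
    authorization: Optional[str],
    api_token: Optional[str],
    admin_token: Optional[str] = None,
) -> Tuple[int, str]:
    """Return (status, detail) for a request. Hypothetical helper."""
    if path.startswith("/v1/health"):
        return 200, "public"
    if api_token is None:
        # Assumption: with no api_token configured, auth is off entirely.
        return 200, "auth disabled"

    # Pick the expected token by path, never by what the caller presented,
    # so the response does not reveal which kind of token was sent.
    if path.startswith("/v1/admin/") and admin_token is not None:
        expected, failure = admin_token, "invalid admin token"
    else:
        expected, failure = api_token, "invalid token"

    if not authorization or not authorization.startswith("Bearer "):
        return 401, "missing bearer token"
    presented = authorization[len("Bearer "):]
    if not _matches(presented, expected):
        return 401, failure  # always 401, never 403
    return 200, "ok"


print(check_bearer("/v1/admin/reindex", "Bearer api", "api", "adm"))
```

Because the expected token depends only on the path, a valid api_token on an admin path and a garbage token on the same path produce the same `401` response, which is the oracle-avoidance property the last bullet describes.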

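Non-code companions (docs only, not written into repo code):

- On the mini, use the Tailscale ACL to lock the `/v1/admin/*` path to Maki's own devices (second layer of defense in depth)
- Generate the token with `openssl rand -hex 32`, using a different value from `MH_API_TOKEN`
- Do not log the `Authorization` header (src/memory_hall/ was grepped and currently has no such log; this PR does not introduce one)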

## Consequences

### Gains

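- **Admin operations are isolated from the shared token**: a leaked regular caller token no longer means admin is compromised
- **Backward compatible**: with `MH_ADMIN_TOKEN` unset, behavior is identical to ADR 0007, and existing deployments need no changes
- **Implementation is ~30 lines** (1 line of config + ~20 lines of middleware changes + 6 test cases), done within 1.5 hours
- **Passes the personal PKI checkup**: 1 new config knob, 0 new schema fields, 0 cross-organization mechanisms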

### Costs

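- **Still possession-based**: a leaked admin_token = admin compromised, with no cryptographic attribution
- **No rotation infra**: rotating the admin_token = change the env + restart the container + notify the few callers, same as for the api_token
- **Two invariants fail fast at config load** (hardening added after Codex review 2026-04-28 PR1 round 1; a 5-line pydantic validator):
  - `admin_token` set but `api_token` unset → refuse to start (otherwise non-admin paths fail open)
  - `admin_token == api_token` → refuse to start (otherwise the two tiers are silently cancelled out)
  - These two checks do not violate the ADR 0008 lightweight principle: they prevent operator misconfiguration from causing a silent security regression, and 5 lines of code that block a high-severity hole are a clear ROI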
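
The two invariants can be sketched as a validation step that runs when the settings object is constructed. The real code is described as a pydantic validator on `Settings`; the stand-in below uses a stdlib dataclass so it runs without third-party packages, and the class name `AuthSettings` and the error messages are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AuthSettings:
    """Hypothetical stand-in for the auth-related fields of Settings."""

    api_token: Optional[str] = None
    admin_token: Optional[str] = None

    def __post_init__(self) -> None:
        if self.admin_token is None:
            return  # single-tier or no auth: nothing to cross-check
        # Invariant 1: an admin token without an api token would leave
        # non-admin paths unauthenticated (fail-open).
        if self.api_token is None:
            raise ValueError("MH_ADMIN_TOKEN requires MH_API_TOKEN to be set")
        # Invariant 2: identical tokens silently collapse the two tiers.
        if self.admin_token == self.api_token:
            raise ValueError("MH_ADMIN_TOKEN must differ from MH_API_TOKEN")


AuthSettings(api_token="api-secret", admin_token="admin-secret")  # accepted
try:
    AuthSettings(admin_token="admin-secret")
except ValueError as exc:
    print(f"refused to start: {exc}")
```

Raising at construction time means a misconfigured process never begins serving requests, which is the fail-fast behavior described above.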

### Non-goals

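- Does not replace HMAC (rules/agent-security-hygiene.md S2.1 is still the destination, but the sunset criteria have not triggered)
- Does not introduce a principal registry / role mapping
- No 14-day sunset window (there is no old mechanism to retire)
- Does not enforce the Tailscale ACL in code (infra config should be maintained by ops)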

## Alternatives considered

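### A. The full Codex Phase B Option E (registry + HMAC + 14-day coexistence period + retirement after 7 consecutive days with zero bearer writes)

Rejected: the sunset criteria have not triggered (single operator / fewer than 10 callers / all inside Maki's tailnet). HMAC is introduced only once ADR 0008 sunset criterion 1 (a second operator) or criterion 4 (a token-leak incident) occurs.

### B. Use `403 Forbidden` to distinguish "valid api_token used on an admin path"

Rejected: it creates a token validity oracle (an attacker who sends garbage gets 401, sends a valid api_token gets 403, and can thereby infer whether a token is valid). Returning 401 uniformly is safer. For internal callers, the "invalid admin token" message string is enough to tell the cases apart when debugging.

### C. No admin gate; rely on the Tailscale ACL to lock the path

Rejected: ACLs operate at device level and cannot tell "ops-hub's read-only flow on a device" apart from "a LINE bot on the same device that should not call reindex". Code-level self-defense plus ACL as defense in depth is stronger than ACL alone.

### D. Make admin_token required by default (not backward compatible)

Rejected: it would affect existing deployments (mini production) and require a migration window. This ADR takes the reversible path: start as opt-in, and supersede later if enforcement becomes necessary.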

## Sunset criteria

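Revisit if any of the following holds:

1. Any ADR 0008 sunset criterion triggers (which automatically pulls this ADR along)
2. An admin_token leak incident occurs (this ADR ships without rotation infra, so if an incident happens, rotation is the first thing to build)
3. The number of callers calls for per-caller admin attribution (for example, knowing whether ops-hub or mk-brain triggered a reindex)
4. A need for a third permission tier appears (read-only / write / admin → read-only / write / reindex / audit / superuser)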

## Implementation summary

- `src/memory_hall/config.py`: add the `admin_token` field
- `src/memory_hall/server/app.py`: extend the `require_api_token` middleware with an admin-path branch
- `tests/test_auth.py`: 8 new cases (6 for middleware behavior + 2 for config-invariant fail-fast)
- `.env.example`: add an `MH_ADMIN_TOKEN=` example section
- `docs/api.md`: add an "Admin gate (two-tier bearer)" section

Total: ~140 lines across 6 files. `pytest`: 16 passed (auth); full suite 59 passed, 1 skipped.

## Round 1 review history

- 2026-04-28 Codex review: REJECT, 2 findings:
  1. [HIGH] `admin_token` set + `api_token` unset → non-admin paths fail open (measured: POST /v1/memory/write returns 201)
  2. [MEDIUM] `admin_token == api_token` → silently cancels the two tiers
- Fix: add a `_validate_auth_tokens` model_validator to `Settings` that fails fast at config load
- Added 2 unit tests to lock the invariants
2 changes: 2 additions & 0 deletions docs/adr/README.md
@@ -11,6 +11,8 @@ Numbered, immutable records of significant design choices. Append new entries; n
| [0005](0005-v0.2-minimum-viable-contract.md) | v0.2 Minimum Viable Contract (production-facing freeze) | Accepted (2026-04-19) |
| [0006](0006-http-embedder-embed-queue-isolation.md) | HttpEmbedder: embed path isolation from LLM queue | Accepted (2026-04-20) |
| [0007](0007-minimal-token-auth.md) | Minimal Token auth (single-tenant deployment shim) | Accepted (2026-04-23) |
| [0008](0008-personal-pki-lightweight-stance.md) | memhall is a personal PKI, lightweight > complete | Accepted (2026-04-28) |
| [0009](0009-admin-gate.md) | Admin gate (two-tier bearer, no HMAC) | Accepted (2026-04-28) |

## Format
