feat(memory): MD-on-disk content + participant bucketing + async pipeline foundation#1008
Conversation
…ct, quantity
The previous LLM extractor schema only emitted person / organization /
location / event / product. That covers classic NER but leaves real
gaps for chat content:
* `datetime` — "Friday", "Q2 2026", "EOD tomorrow"
(deadlines mentioned inside chunks)
* `technology` — "Rust", "OAuth", "Slack API", "nomic-embed"
(tools / frameworks / languages / services)
* `artifact` — "PR tinyhumansai#934", "src/openhuman/...", "OH-42"
(code / ticket / doc references)
* `quantity` — "$5K", "20/min", "10k tokens"
(amounts / metrics / money)
Without these kinds, the affected mentions either get dropped or fall
into Misc — both flatten retrieval. This adds them to the EntityKind
enum, the LLM extractor's allowed_kinds + parse_kind synonyms, and the
SYSTEM_PROMPT (with a Kinds guide and an explicit "skip URLs/emails —
those are extracted by regex" instruction so the model doesn't double
up with the mechanical extractors).
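For context, a minimal sketch of the shape this change takes; the variant list matches this message, but the synonym sets in `parse_kind` are illustrative assumptions, not the exhaustive shipped mapping:

```rust
// Sketch only: the real EntityKind lives in the extract module.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EntityKind {
    Person,
    Organization,
    Location,
    Event,
    Product,
    // New kinds from this commit:
    Datetime,   // "Friday", "Q2 2026", "EOD tomorrow"
    Technology, // "Rust", "OAuth", "Slack API", "nomic-embed"
    Artifact,   // code / ticket / doc references
    Quantity,   // "$5K", "20/min", "10k tokens"
}

/// Map an LLM-emitted kind string (plus common synonyms) onto EntityKind.
/// The synonym lists here are assumptions for illustration.
pub fn parse_kind(raw: &str) -> Option<EntityKind> {
    match raw.trim().to_ascii_lowercase().as_str() {
        "person" => Some(EntityKind::Person),
        "organization" | "org" | "company" => Some(EntityKind::Organization),
        "location" | "place" => Some(EntityKind::Location),
        "event" => Some(EntityKind::Event),
        "product" => Some(EntityKind::Product),
        "datetime" | "date" | "time" | "deadline" => Some(EntityKind::Datetime),
        "technology" | "tech" | "tool" | "framework" => Some(EntityKind::Technology),
        "artifact" | "ticket" | "pr" | "file" => Some(EntityKind::Artifact),
        "quantity" | "amount" | "metric" | "money" => Some(EntityKind::Quantity),
        _ => None, // unknown kinds are dropped rather than guessed
    }
}
```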
The probe-ratelimit / probe-pacing-ms flags from the prior commit were
auto-formatted slightly; that formatting change is included here.
Tests: extract suite 36/36, full memory_tree 337/337.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…gest trigger
Move memory-tree work onto a SQLite-backed job queue so summariser /
extractor / cascade work runs out of band of ingest.
Pipeline shape after this change:
ingest_chat → write chunk (lifecycle=pending_extraction)
→ enqueue extract_chunk
worker (×3, LLM semaphore=3):
extract_chunk → LLM extract → admit/drop → enqueue append_buffer + topic_route
append_buffer → push to L0 → enqueue seal if gate met
seal → seal one level → enqueue parent seal if cascading
→ for source trees: enqueue topic_route(Summary)
topic_route → maybe_spawn_topic_tree (counters + backfill if hot)
→ enqueue per-topic append_buffer
digest_daily → end_of_day_digest(date)
flush_stale → enqueue seal jobs for stale buffers
scheduler (1 task, daily UTC 00:05):
enqueue digest_daily(yesterday) + flush_stale (dedupe-keyed by date)
Key additions:
* mem_tree_jobs table with dedupe keys and per-status retry/backoff
(a schema sketch follows this list)
* lifecycle_status column on mem_tree_chunks
* Per-level seal jobs (each level its own crash-recovery checkpoint)
* LabelStrategy enum driving summary entity/topic population:
- source-tree seal: ExtractFromContent(regex+LLM emit_topics=true)
- global L0 + L1+: UnionFromChildren (no extra LLM call)
- topic-tree seal: Empty (scope already pins the topic)
* NodeRef::{Leaf, Summary} payload generalisation so summary entities
flow back into topic-tree spawn / route pipeline
* 30-day backfill window on topic-tree spawn aligned with hotness
recency cliff
* memory_tree.trigger_digest RPC + jobs::backfill_missing_digests
catch-up helper
* extract::build_summary_extractor honest builder (no ScoringConfig
misuse for seal-time extraction)
* Lenient day-of-week parser in email_clean for real-world MTA quirks
* Slack/Gmail canonicalisers, composio providers, slack_ingestion module
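For orientation, a hypothetical shape for the mem_tree_jobs table mentioned in the first bullet. Only the dedupe key, status, and retry/backoff behaviour come from this message; the rest of the column set is an assumption:

```rust
// Hypothetical migration string; the shipped schema may differ.
const CREATE_MEM_TREE_JOBS: &str = "
CREATE TABLE IF NOT EXISTS mem_tree_jobs (
    id           INTEGER PRIMARY KEY,
    kind         TEXT NOT NULL,         -- extract_chunk, append_buffer, seal, ...
    payload      TEXT NOT NULL,         -- JSON job arguments
    dedupe_key   TEXT UNIQUE,           -- e.g. digest_daily keyed by date
    status       TEXT NOT NULL DEFAULT 'queued', -- queued | running | done | failed
    attempts     INTEGER NOT NULL DEFAULT 0,
    run_after_ms INTEGER NOT NULL DEFAULT 0,     -- retry backoff gate
    last_error   TEXT
);
";
```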
Tests: 522 passing across memory_tree, slack_ingestion, composio_providers.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ration assertion
The memory_tree_trigger_digest controller was added in the async digest work, but tests/json_rpc_e2e.rs still expected only 3 registered methods, causing json_rpc_memory_tree_end_to_end to fail with 4 != 3.
Major
- jobs/worker.rs: log warn on `recover_stale_locks` startup failure
instead of swallowing with `let _ = ...`. Stale `running` rows are
now operator-visible at startup.
- jobs/worker.rs: persist full anyhow chain in `last_error` via
`format!("{err:#}")` so readers of mem_tree_jobs see the root cause,
matching the `{:#}` log format used three lines above.
- source_tree/bucket_seal.rs: stop hardcoding `TreeKind::Source` in
`seal_one_level`'s `SummaryContext` and `SummaryNode`. The function
also runs for topic trees (handle_seal, cascade_all_from, flush) —
topic-tree summaries were being persisted with `tree_kind='source'`,
breaking any query filtering on tree_kind. Use `tree.kind`.
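For clarity, the shape of that fix as a self-contained sketch (the struct fields are simplified stand-ins):

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TreeKind { Source, Topic, Global }

struct Tree { kind: TreeKind }
struct SummaryContext { tree_kind: TreeKind } // other fields elided

fn summary_context_for(tree: &Tree) -> SummaryContext {
    // Before: hardcoded TreeKind::Source, so topic-tree seals persisted
    // summaries with tree_kind = 'source'. After: take it from the tree.
    SummaryContext { tree_kind: tree.kind }
}
```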
Minor
- jobs/types.rs: drop `#TBD` placeholder from module doc.
- jobs/handlers/mod.rs: add `debug!` log on the no-op `entity_ids`
empty path in `handle_topic_route` for diagnosis.
- jobs/scheduler.rs: comment why `.single()` cannot return `None`
(UTC has no DST gaps).
Tests
- Add `topic_tree_seal_persists_topic_kind_not_source` regression
test guarding the bucket_seal `tree.kind` fix.
- 371/371 memory::tree tests pass · 23/23 memory::tree::jobs tests
pass · cargo fmt clean.
Cargo.lock: bump openhuman version 0.53.3 → 0.53.4 (matches Cargo.toml).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…g (Phase A)
- Add `content_dir: Option<PathBuf>` to `MemoryTreeConfig` with serde default None.
- Add `OPENHUMAN_MEMORY_TREE_CONTENT_DIR` env-var override in `apply_env_overrides`; empty string falls back to None (consistent with other memory_tree env vars).
- Add `Config::memory_tree_content_root()` helper that resolves the effective content root: explicit config path > `<workspace>/memory_tree/content/`.
- Add additive, idempotent `ALTER TABLE` migrations in `with_connection` for `mem_tree_chunks.content_path TEXT` and `mem_tree_chunks.content_sha256 TEXT`.
- Tests: default None, env-override logic, schema columns present after migration.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
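A minimal sketch of the resolution order; for brevity the env override is folded into the helper here, whereas in the real code it is applied in apply_env_overrides:

```rust
use std::path::{Path, PathBuf};

struct MemoryTreeConfig {
    content_dir: Option<PathBuf>, // serde default: None
}

fn memory_tree_content_root(cfg: &MemoryTreeConfig, workspace: &Path) -> PathBuf {
    // Env override wins; an empty string falls back to None, consistent
    // with the other memory_tree env vars.
    if let Ok(v) = std::env::var("OPENHUMAN_MEMORY_TREE_CONTENT_DIR") {
        if !v.is_empty() {
            return PathBuf::from(v);
        }
    }
    // Explicit config path > <workspace>/memory_tree/content/
    cfg.content_dir
        .clone()
        .unwrap_or_else(|| workspace.join("memory_tree").join("content"))
}
```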
- Add `partial_message: bool` to `Chunk` (transient; always false in DB)
- Rewrite chunker to dispatch on SourceKind: Chat splits at `## ` boundaries, Email at `---\nFrom:` boundaries, Document uses paragraph budget packing; oversize units sub-split with partial_message=true
- Strip leading `# ...` headers from chat, email, and document canonicalisers (front-matter will carry that info in Phase C)
- Update all 22 Chunk struct literals across test modules to include partial_message field
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
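A lossy sketch of the dispatch; the real chunker keeps the boundary text with each unit and then budget-packs, this only shows the per-kind split points:

```rust
enum SourceKind { Chat, Email, Document }

fn split_units(kind: SourceKind, markdown: &str) -> Vec<String> {
    let boundary = match kind {
        SourceKind::Chat => "\n## ",         // one unit per chat message
        SourceKind::Email => "\n---\nFrom:", // one unit per email in a thread
        SourceKind::Document => "\n\n",      // paragraphs, packed to a budget
    };
    markdown
        .split(boundary)
        .map(|u| u.trim().to_string())
        .filter(|u| !u.is_empty())
        .collect()
    // Oversize units then go through split_by_token_budget with
    // partial_message = true (not shown).
}
```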
…atter (Phase C)
- New src/openhuman/memory/tree/content_store/ module with 5 sub-files:
  - paths.rs: chunk_rel_path / chunk_abs_path / slugify_source_id
  - compose.rs: YAML front-matter composition, split_front_matter, rewrite_tags
  - atomic.rs: write_if_new (tempfile+fsync+rename), sha256_hex (body-only)
  - read.rs: read_chunk_file + verify_chunk_file
  - tags.rs: update_chunk_tags, slugify_tag_kind/value, entity_tag
- StagedChunk struct + stage_chunks() orchestrator in mod.rs
- 27 unit tests passing across all sub-modules
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
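The tempfile+fsync+rename pattern in atomic.rs, sketched with simplified error handling; the real helper's exact signature may differ:

```rust
use std::fs::{self, File};
use std::io::Write;
use std::path::Path;

/// Write bytes to abs_path only if it does not exist yet: first write wins,
/// re-runs are no-ops, and readers never observe a half-written file.
fn write_if_new(abs_path: &Path, bytes: &[u8]) -> std::io::Result<bool> {
    if abs_path.exists() {
        return Ok(false); // idempotent
    }
    if let Some(parent) = abs_path.parent() {
        fs::create_dir_all(parent)?;
    }
    let tmp = abs_path.with_extension("tmp");
    let mut f = File::create(&tmp)?;
    f.write_all(bytes)?;
    f.sync_all()?; // flush contents before the rename makes them visible
    fs::rename(&tmp, abs_path)?; // atomic on POSIX filesystems
    Ok(true)
}
```

(The review below additionally asks for a parent-directory fsync after the rename so the directory entry itself survives a crash.)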
…rite (Phase D)
- ingest::persist() calls content_store::stage_chunks() before the SQLite tx, so chunk .md files land on disk before DB pointers are written
- store: add upsert_staged_chunks_tx() writing content_path + content_sha256; add get_chunk_content_path() for post-extraction tag lookup
- handle_extract: after tx.commit(), convert entity IDs to Obsidian tags (kind/Value) and rewrite the chunk .md file's tags: block atomically
- Update ingest test: Phase B chunker produces one chunk per message (2 messages → 2 chunks); adjust assertions accordingly
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
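The ordering Phase D relies on, as a stub sketch (all helpers here are stand-ins for the real module APIs):

```rust
use anyhow::Result;

struct Chunk;
struct StagedChunk;

fn stage_chunks(_c: &[Chunk]) -> Result<Vec<StagedChunk>> { Ok(Vec::new()) }
fn upsert_staged_chunks_tx(_s: &[StagedChunk]) -> Result<()> { Ok(()) }
fn enqueue_extract_jobs_tx(_s: &[StagedChunk]) -> Result<()> { Ok(()) }

fn persist(chunks: &[Chunk]) -> Result<()> {
    // 1. Disk first: .md files land before any DB row points at them.
    let staged = stage_chunks(chunks)?;
    // 2.+3. Pointer rows and extract jobs commit together, so a crash
    //    between the steps leaves orphan files, never dangling pointers.
    upsert_staged_chunks_tx(&staged)?;
    enqueue_extract_jobs_tx(&staged)?;
    Ok(())
}
```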
…line (Phase E)
- New [[bin]] gmail-backfill-3d: authenticates via Composio, paginates GMAIL_FETCH_EMAILS, groups by thread_id, drives ingest_email per thread (content_store writes .md files + SQLite), drains the worker pool, then runs a SHA-256 integrity check against all content_path rows
- store: add get_chunk_content_pointers() returning (content_path, sha256) as a public helper for the binary's verify loop
- Cargo.toml: register gmail-backfill-3d [[bin]] entry (autobins=false)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
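A sketch of what the verify step could look like. Note the stored sha covers body bytes only, so front-matter has to be stripped before hashing; the split helper below is a simplified stand-in for split_front_matter:

```rust
use std::fs;
use std::path::Path;
use sha2::{Digest, Sha256};

fn body_of(file: &str) -> &str {
    file.strip_prefix("---\n")
        .and_then(|rest| rest.split_once("\n---\n"))
        .map(|(_, body)| body)
        .unwrap_or(file)
}

/// `pointers` stands in for store::get_chunk_content_pointers() rows of
/// (content_path, sha256). Returns (ok, failed) counts.
fn verify(content_root: &Path, pointers: &[(String, String)]) -> (usize, usize) {
    let (mut ok, mut bad) = (0, 0);
    for (rel_path, expected) in pointers {
        match fs::read_to_string(content_root.join(rel_path)) {
            Ok(text) => {
                let digest = Sha256::digest(body_of(&text).as_bytes());
                let actual: String = digest.iter().map(|b| format!("{b:02x}")).collect();
                if &actual == expected { ok += 1 } else { bad += 1 }
            }
            Err(_) => bad += 1, // missing file is an integrity failure
        }
    }
    (ok, bad)
}
```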
- Drop three unused imports left from earlier phases (tokio::fs::File, super::atomic::write_if_new, source_tree::append_leaf_deferred).
- Add --wipe to gmail-backfill-3d so the bin can re-run cleanly after the message-aware chunker change invalidates existing chunk IDs: removes chunks.db (+ wal/shm) and <content_root>/chunks/ before fetch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…n cleanup
## What changed (6 changes)
1. **bucket_by_sender_and_thread solo fallback** (`gmail/ingest.rs`):
Messages without `threadId` now share a single `_solo_` bucket per
sender instead of getting a unique `solo:{message_id}` bucket. Fixes
the root-cause where each threadless message produced a separate
`source_id` and therefore a separate path subtree.
2. **Per-thread source_id in ingest_page_into_memory_tree** (`gmail/ingest.rs`):
Dropped the `source_id: &str` parameter. The function now derives
`source_id = \"gmail:{sender}:{thread_id}\"` per (sender, thread) pair
so every thread lands in its own path subtree. Made `ingest` a `pub mod`
so the bin can import it directly.
3. **Path scheme — sender/thread/chunk_id** (`content_store/paths.rs`):
`chunk_rel_path` is now source-kind-aware. Email: parses
`gmail:{sender}:{thread}` into 4-level path
`email/<sender_slug>/<thread_slug>/<chunk_id>.md`. Chat/Document: 3-level
`<kind>/<source_slug>/<chunk_id>.md` (chunk_id replaces seq as filename).
Malformed email source_id falls back to 2-level layout without panicking.
Added `email_clean` to canonicalize module exports so the type is reachable.
4. **Chunker greedy packing** (`chunker.rs`):
Chat and Email branches now pack consecutive units into a single chunk
until adding the next would exceed `max_tokens` (default 10k). Oversize
single units still fall back to `split_by_token_budget` with
`partial_message = true`. Document branch unchanged (packing sketched
after this list).
5. **Bin cleanup** (`bin/gmail_backfill_3d.rs`):
Deleted `ingest_page`, `bucket_by_thread`, `raw_to_email_message`,
`parse_date`, `parse_addr_list`, `strip_re_fwd` and their tests (~185
lines of duplicated logic). Main loop now calls
`ingest_page_into_memory_tree` directly. Bin LOC: 576 → 391.
6. **Front-matter email fields** (`content_store/compose.rs`):
Email chunks now emit `aliases`, `sender`, and `thread_id` YAML fields
parsed from the `gmail:{sender}:{thread_id}` source_id. SHA-256 is
still over body bytes only (front-matter changes don't invalidate hash).
TODO: thread_subject requires carrying it through Metadata.
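A sketch of the greedy packing from item 4, with a crude whitespace token count standing in for the real tokenizer and the oversize sub-split elided:

```rust
fn pack_units(units: Vec<String>, max_tokens: usize) -> Vec<String> {
    let tokens = |s: &str| s.split_whitespace().count(); // stand-in count
    let mut chunks: Vec<String> = Vec::new();
    let mut current = String::new();
    for unit in units {
        // Close the current chunk if adding this unit would exceed budget.
        if !current.is_empty() && tokens(&current) + tokens(&unit) > max_tokens {
            chunks.push(std::mem::take(&mut current));
        }
        if !current.is_empty() {
            current.push_str("\n\n");
        }
        current.push_str(&unit);
        // Oversize single units would fall back to split_by_token_budget
        // with partial_message = true (not shown).
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}
```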
## Why
Path collisions caused 9 surviving files out of 44 expected chunks on the
first backfill run. All three root causes fixed: (a) solo bucket per message
→ unified _solo_ bucket, (b) single source_id for all threads → per-thread
source_id, (c) seq-as-filename → chunk_id-as-filename.
## Bin LOC
Before: 576 lines. After: 391 lines.
## cargo check
Passes clean. All pre-existing clippy warnings are in unrelated files
(threads/ops.rs, voice/server.rs, webview_accounts/ops.rs, etc.) and
are intentionally not fixed in this commit.
## Tests
- memory::tree: 431 passed, 0 failed
- composio::providers::gmail: 57 passed, 0 failed
- Updated ingest_chat_writes_and_queue_drains_to_admitted_chunk to expect
1 packed chunk (was 2, before greedy packing).
## New path formats
- Email: email/<sender_slug>/<thread_slug>/<chunk_id>.md
- Chat: chat/<source_slug>/<chunk_id>.md
- Document: document/<source_slug>/<chunk_id>.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…gration
- Bucket by participants (sender ∪ to-list, sorted, deduped, CC ignored).
Drops threadId from grouping entirely; the key derivation is sketched below.
- source_id = gmail:{participants} (e.g. gmail:alice@x.com|bob@y.com)
- Path: email/<participants_slug>/<chunk_id>.md (single layer, no thread folder)
- canonicalize/email.rs now calls clean_body on each message body to strip
reply chains, marketing footers, legal disclaimers, and template noise
- Front-matter: replace sender/thread_id with participants: YAML list +
human-readable aliases (first <-> second, or first <-> N others for groups)
- chunk_ids rotate (content + source_id both change); --wipe required for re-ingest
- All new tests pass (435 memory::tree + 60 composio::gmail); cargo check clean
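For reference, the bucket-key derivation from the first bullet as a sketch; address extraction is simplified to already-parsed plain addresses:

```rust
/// sender ∪ to-list, lowercased, deduped, sorted; CC ignored; no threadId.
fn participants_key(sender: &str, to: &[&str]) -> String {
    let mut all: Vec<String> = std::iter::once(sender)
        .chain(to.iter().copied())
        .map(|a| a.trim().to_ascii_lowercase())
        .filter(|a| !a.is_empty())
        .collect();
    all.sort();
    all.dedup();
    format!("gmail:{}", all.join("|"))
}

// participants_key("Bob@y.com", &["alice@x.com", "bob@y.com"])
//   == "gmail:alice@x.com|bob@y.com"
```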
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the chunk pattern: bodies on disk under `summaries/<source|global|topic>/...`, SQLite holds pointer + sha + 500-char preview. All three seal paths (source/global/topic) call stage_summary before upsert_summary_tx. handle_seal calls update_summary_tags after committing entity_index. gmail-backfill-3d bin's verify step extended to summaries. Schema migration is additive + nullable; legacy rows still readable. Summary filenames sanitize ':' → '-' for cross-platform NTFS compatibility.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sized for the local 1B summariser (gemma3:1b-it-qat), which produces noticeably better summaries with ≤4-5k input than at the previous 10k cap. Chunker max sits below the seal budget so each L0 buffer accumulates roughly 1-3 chunks before sealing — natural pacing.
- DEFAULT_CHUNK_MAX_TOKENS: 10_000 → 3_000
- TOKEN_BUDGET (L0 seal): 10_000 → 4_500
- Test fixtures rewritten to use TOKEN_BUDGET-relative sizes (chunk_count = TOKEN_BUDGET * 6 / 10) so they auto-track the constants if we tune again.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
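The constants and the fixture rule spelled out; the module placement is an assumption:

```rust
pub const DEFAULT_CHUNK_MAX_TOKENS: usize = 3_000; // was 10_000
pub const TOKEN_BUDGET: usize = 4_500;             // L0 seal budget, was 10_000

// Fixture sizing mirrors the commit message so tests auto-track retuning:
pub const fn fixture_chunk_count() -> usize {
    TOKEN_BUDGET * 6 / 10
}
```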
Previously a single transport failure (Ollama briefly down, transient connection refused, DNS hiccup) would silently return empty extraction with no retry. For chunks this is OK because the whole job retries via the async pipeline, but for summary extraction (which runs INLINE in seal_one_level) a one-shot transport failure leaves the summary with empty entities + topics permanently — no async retry path.
Now: try up to 3 attempts with exponential backoff (250ms / 500ms / 1000ms ≈ 1.75s total) before falling back to empty. Non-transport failures (HTTP non-success, malformed JSON) still fall back immediately because retrying the same input yields the same bad response. Both summary-time and chunk-time extraction benefit since they share the same LlmEntityExtractor. Cascade ordering is preserved: parent seals always see populated entity_index from children.
Caught by the v8 backfill run: Otter summary's seal hit a transport hiccup at exactly the wrong millisecond, leaving its tags empty while the GitHub summary (sealed seconds later) extracted cleanly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
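A sketch of the retry shape described above; the error type and LLM call are stand-ins for the extractor internals (tokio assumed for sleeping):

```rust
use std::time::Duration;

struct LlmError { transport: bool }
impl LlmError {
    fn is_transport(&self) -> bool { self.transport }
}

// Stand-in for the Ollama call inside LlmEntityExtractor.
async fn call_llm(_input: &str) -> Result<String, LlmError> {
    Err(LlmError { transport: true })
}

/// Up to 3 attempts with 250ms / 500ms / 1000ms backoff (~1.75s total),
/// transport failures only; bad responses fail fast because retrying the
/// same input yields the same bad response.
async fn extract_with_retry(input: &str) -> Option<String> {
    for attempt in 0u32..3 {
        match call_llm(input).await {
            Ok(parsed) => return Some(parsed),
            Err(e) if !e.is_transport() => return None,
            Err(_) if attempt < 2 => {
                tokio::time::sleep(Duration::from_millis(250u64 << attempt)).await
            }
            Err(_) => break,
        }
    }
    None // fall back to empty extraction
}
```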
📝 Walkthrough
Adds a disk-backed MD content store, an SQLite job queue + worker pool for async memory-tree processing, Gmail backfill CLI and ingest logic, email/chat/document canonicalisation and chunking changes, label/summary staging, scheduling (daily digest/flush), and many schema, retrieval, and test updates.
Sequence Diagram(s)
```mermaid
sequenceDiagram
participant Client
participant Ingest
participant ContentStore
participant DB
participant MemJobs
participant Scheduler
participant WorkerPool
participant Handler
Client->>Ingest: submit(source)
Ingest->>Ingest: score_chunks_fast()
Ingest->>ContentStore: stage_chunks() (atomic write)
Ingest->>DB: upsert_staged_chunks_tx()
Ingest->>MemJobs: enqueue(ExtractChunk)
Ingest-->>Client: return (queued)
Scheduler->>MemJobs: enqueue(digest_daily / flush_stale)
Scheduler->>WorkerPool: wake_workers()
WorkerPool->>MemJobs: claim_next()
MemJobs-->>WorkerPool: job
WorkerPool->>Handler: handle_job(job)
alt ExtractChunk
Handler->>ContentStore: read_chunk_body()
Handler->>DB: persist scores, set lifecycle
Handler->>MemJobs: enqueue(AppendBuffer, TopicRoute)
else AppendBuffer
Handler->>DB: insert leaf (L0 buffer)
Handler->>MemJobs: enqueue(Seal) if gate met
else Seal
Handler->>ContentStore: stage_summary()
Handler->>DB: insert_summary_tx(staged)
Handler->>MemJobs: enqueue(parent_seal / topic_route)
else TopicRoute
Handler->>DB: resolve entities, ensure topic trees
Handler->>MemJobs: enqueue(AppendBuffer per entity)
else Digest/Flush
Handler->>DB: union inputs, insert daily summary
Handler->>ContentStore: stage_summary()
end
WorkerPool->>MemJobs: mark_done()
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 5 of 5 passed.
…n line
Pure formatting; no semantic changes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A previous cleanup commit dropped File from the tokio::fs use list believing it was unused — but it's referenced by the directory-fsync helper at load.rs:471. Caught by upstream CI clippy (E0433: undeclared type `File`).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Actionable comments posted: 20
Note
Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
src/openhuman/memory/tree/retrieval/drill_down.rs (1)
259-272: ⚠️ Potential issue | 🟡 Minor
Formatting gate is failing CI.
`cargo fmt --check` is failing around Line 259 and Line 271. Please run `cargo fmt --all` so the PR passes the type-check workflow.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/retrieval/drill_down.rs` around lines 259 - 272, Formatting in the block that constructs a chunk and calls upsert_chunks and append_leaf (see functions upsert_chunks and append_leaf and the LeafRef construction with fields token_count, seq_in_source, created_at/ timestamp, partial_message) is not rustfmt-compliant; run rustfmt by executing cargo fmt --all and commit the resulting changes so the crate passes cargo fmt --check (ensure the struct literal formatting and trailing commas match rustfmt output around the chunk creation and LeafRef append).src/openhuman/memory/tree/source_tree/store.rs (1)
202-244: ⚠️ Potential issue | 🔴 Critical
This truncates every staged summary for all current readers.
Once `staged` is `Some`, `mem_tree_summaries.content` only keeps the first 500 chars, but the existing read path in this module still returns `content` directly and never hydrates `content_path`. That means staged summaries are now recursively summarised and retrieved from previews instead of full bodies. Please either hydrate the markdown file on read before building `SummaryNode`, or keep writing full `node.content` here until all readers switch to the on-disk path.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/source_tree/store.rs` around lines 202 - 244, The current insert into mem_tree_summaries truncates staged content via content_preview when staged is Some, causing readers to get only the 500-char preview; instead preserve the full node.content when writing the DB until readers are updated. Change the staged match (variables staged, content_preview, content_path, content_sha256) so that the value inserted for the content column is node.content (not the 500-char preview) while still setting content_path and content_sha256 from the staged struct; keep the preview logic only for any separate preview column or consumers that explicitly need it, or alternatively implement hydration in the SummaryNode read path later (references: staged, content_preview, node.content, content_path, content_sha256, the tx.execute INSERT into mem_tree_summaries, and the SummaryNode read/hydrate code).src/openhuman/memory/tree/topic_tree/backfill.rs (1)
162-173: ⚠️ Potential issue | 🟡 Minor
Don't count no-op `append_leaf` calls as appended leaves.
`append_leaf` is documented here as idempotent, but `appended += 1` runs unconditionally. A repeat backfill will still report those hits as newly appended, so the return value no longer matches the function contract.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/topic_tree/backfill.rs` around lines 162 - 173, The code unconditionally increments appended after calling append_leaf (in the backfill flow using LabelStrategy::Empty), but append_leaf is idempotent and may be a no-op; update the call to capture append_leaf's result and only increment appended when the result indicates a real append occurred (e.g., a boolean/Option/enum returned by append_leaf signaling change). Keep the existing .with_context(...) error handling around the await, then branch on the successful result from append_leaf (using the return value from append_leaf in the backfill function) to decide whether to do appended += 1.src/openhuman/memory/tree/ingest.rs (1)
155-203: ⚠️ Potential issue | 🟡 Minor
`chunks_written` counts upserts, not queued work.
This closure returns `n` from `upsert_staged_chunks_tx`, which is always `staged.len()`. On a duplicate re-ingest, `to_schedule` can stay empty, but the API still reports those chunks as "written and queued". Return `to_schedule.len()` here, or expose separate `stored_chunks` and `queued_chunks` counts.
Also applies to: 210-214
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/ingest.rs` around lines 155 - 203, The returned value `n` is from store::upsert_staged_chunks_tx and thus equals staged.len(), which misreports queued work when duplicates are skipped via the to_schedule HashSet; change the closure to return the number of actually scheduled chunks (to_schedule.len()) or return a tuple/struct with both stored_chunks (result of upsert_staged_chunks_tx) and queued_chunks (to_schedule.len()), and update callers accordingly; locate logic around upsert_staged_chunks_tx, the to_schedule set, score::persist_score_tx and jobs::enqueue_tx where tx.commit() is called and replace the single `n` return with the chosen corrected count(s).
🟡 Minor comments (6)
docs/memory-tree-async-pipeline.excalidraw (1)
44-44: ⚠️ Potential issue | 🟡 Minor
Use “job queue” wording for clarity.
“jobs queue” reads awkwardly; consider “job queue” for cleaner phrasing in the subtitle.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/memory-tree-async-pipeline.excalidraw` at line 44, Update the subtitle string "Leaf ingestion -> jobs queue -> workers -> source/topic/global tree building" to use the singular phrasing "job queue" instead of "jobs queue" so it reads "Leaf ingestion -> job queue -> workers -> source/topic/global tree building".
src/bin/slack_backfill.rs (1)
261-263: ⚠️ Potential issue | 🟡 Minor
Validate `--probe-ratelimit` input to reject zero calls.
0, which produces a probe run with no calls and misleading output.Proposed input validation
if let Some(n) = cli.probe_ratelimit { + if n == 0 { + bail!("--probe-ratelimit must be >= 1"); + } + // Pure quota probe: fire N back-to-backAlso applies to: 303-340
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/bin/slack_backfill.rs` around lines 261 - 263, The probe-ratelimit branch accepts 0 and produces a no-op run; add input validation where you destructure cli.probe_ratelimit (the if let Some(n) = cli.probe_ratelimit { ... } blocks) to reject zero: if n == 0 print a clear error and exit (or return Err) instead of proceeding. Apply the same check in both probe blocks (the first if let Some(n) = cli.probe_ratelimit { ... } and the later analogous block around lines ~303-340) so any --probe-ratelimit 0 is handled early and does not run a misleading probe. Ensure the error message references the flag name (--probe-ratelimit) and required positive integer semantics.src/openhuman/memory/tree/canonicalize/document.rs-42-44 (1)
42-44: ⚠️ Potential issue | 🟡 Minor
Update module docs to match the new no-header behavior.
The implementation/tests now remove the leading title header, but the file-level docs still say a title header is added. Please align docs to avoid misleading future changes.
📝 Suggested doc fix
```diff
-//! Document sources are single-record (no grouping): one Notion page, one
-//! Drive doc, one meeting-note file. The canonicaliser adds a small title
-//! header and passes through the body; if the body is already markdown it
-//! is kept verbatim.
+//! Document sources are single-record (no grouping): one Notion page, one
+//! Drive doc, one meeting-note file. Canonical output is the trimmed body
+//! only; provider/title metadata is carried outside the body (front-matter
+//! in the MD-on-disk pipeline).
```
Also applies to: 89-102
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/canonicalize/document.rs` around lines 42 - 44, Update the file-level/module documentation to reflect the new behavior that the canonicalizer no longer adds a leading `# provider — title` header; change any comments that state a title header is added to say the provider/title belong in front-matter and that the code (e.g., the canonicalization that calls md.push_str(doc.body.trim())) strips/removes such headers; apply the same wording change to the other comment block referenced around the region corresponding to lines 89–102 so all docs consistently state "no leading header; provider/title belong in MD front-matter."src/openhuman/memory/tree/retrieval/source.rs-333-345 (1)
333-345: ⚠️ Potential issue | 🟡 Minor
Run rustfmt on this file to unblock CI.
Line 333 and Line 344 are part of the formatting diff currently failing `cargo fmt --check`. Please run `cargo fmt --all` and commit the formatted output.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/retrieval/source.rs` around lines 333 - 345, The file fails rustfmt; run the formatter and commit the changes: run `cargo fmt --all` (or your IDE’s Rustfmt) and reformat this file so the blocks around upsert_chunks, append_leaf, the LeafRef struct literal, and TOKEN_BUDGET arithmetic conform to rustfmt style; then stage and commit the updated file so CI's `cargo fmt --check` passes.src/bin/gmail_backfill_3d.rs-449-455 (1)
449-455: ⚠️ Potential issue | 🟡 Minor
The summary `no_pointer` metric is currently wrong.
Returning a hard-coded `0` makes the integrity summary lie on mixed legacy/new databases, which this PR explicitly supports. Please count pointerless summary rows or drop this field from the report.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/bin/gmail_backfill_3d.rs` around lines 449 - 455, The summary field no_pointer is hard-coded to 0 which will under-report pointerless rows; replace it with an actual count by querying the DB for rows where content_path IS NULL (e.g. run SELECT COUNT(*) FROM ... WHERE content_path IS NULL using the same client used to compute rows_with_pointer) and assign that result to no_pointer (handle errors/unwrap as done for other counts), or if you prefer to remove the misleading metric simply drop no_pointer from the summary output and any references to it; locate the variable no_pointer and the code that computes rows_with_pointer to add the new COUNT query or remove the field consistently.src/openhuman/memory/tree/score/extract/llm.rs-204-206 (1)
204-206: ⚠️ Potential issue | 🟡 Minor
Run `cargo fmt --all` on this hunk.
CI is already failing `cargo fmt --check` on the `log::warn!` formatting here.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/score/extract/llm.rs` around lines 204 - 206, The log::warn! invocation in memory_tree::extract::llm is incorrectly formatted; run rustfmt (cargo fmt --all) or manually reformat the macro invocation so it follows rustfmt style (e.g., break arguments/format string onto separate lines as rustfmt expects) to satisfy cargo fmt --check and fix the CI failure.
🧹 Nitpick comments (6)
docs/memory-tree-async-pipeline.excalidraw (1)
557-557: Avoid hard-coding worker/semaphore counts in architecture docs.
`spawn 3 worker tasks` and `Semaphore(3)` can drift from runtime config. Prefer wording like “N workers” / “shared semaphore limit (configurable)” to keep the diagram accurate as settings evolve.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/memory-tree-async-pipeline.excalidraw` at line 557, Replace hard-coded numeric worker/semaphore text in the diagram labels: change "spawn 3 worker tasks" and "shared Semaphore(3) for LLM-bound work" to use configurable wording such as "spawn N workers (configurable)" and "shared semaphore limit (configurable)" so the diagram matches runtime settings; keep surrounding labels ("jobs::start(workspace_dir)", "recover_stale_locks()", "Notify wakeup + 5s polling fallback") unchanged..gitignore (1)
69-69: Consider using `/.target-codex/` for consistency with other target directory patterns.
The `.target-codex/` pattern will match at any directory level, while other target directories use either a leading slash (e.g., `/target/` to match only root) or specific paths. If this directory is only expected at the root level, using `/.target-codex/` would align with the existing conventions.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.gitignore at line 69, Update the .gitignore entry ".target-codex/" to use a rooted pattern "/.target-codex/" so it only matches the repository root like other target dirs; replace the existing ".target-codex/" pattern with "/.target-codex/" (keep the trailing slash) to maintain consistency with the project's ignore conventions.tests/json_rpc_e2e.rs (1)
824-831: Prefer subset assertion over exact controller-count equality.
`assert_eq!(controllers.len(), expected_methods.len())` makes this test brittle when new memory-tree RPCs are added.
🧪 Proposed test hardening
```diff
-    assert_eq!(controllers.len(), expected_methods.len());
     for method in &expected_methods {
         assert!(
             controllers
                 .iter()
                 .any(|controller| controller.rpc_method_name() == *method),
             "expected memory_tree controller registration for {method}"
         );
     }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/json_rpc_e2e.rs` around lines 824 - 831, The test currently enforces exact controller count with assert_eq!(controllers.len(), expected_methods.len()), which is brittle; change the assertion to verify that each entry in expected_methods is present in controllers (i.e., a subset check) instead of requiring lengths to match. Locate the variables expected_methods and controllers in the test (around the vector initialization and loop) and replace the length equality check with logic that iterates over expected_methods and asserts controllers.contains(method) (or uses a set/subset comparison) so new RPCs can be added without breaking the test.src/openhuman/memory/tree/source_tree/summariser/llm.rs (1)
318-327: Remove the now-unused prompt `budget` parameter.
`system_prompt` no longer templates length, so carrying `_budget` through `build_request` is stale API surface.
♻️ Proposed cleanup
```diff
-    fn build_request(&self, prompt_body: &str, budget: u32) -> OllamaChatRequest {
+    fn build_request(&self, prompt_body: &str) -> OllamaChatRequest {
         OllamaChatRequest {
             model: self.cfg.model.clone(),
             messages: vec![
                 OllamaMessage {
                     role: "system".to_string(),
-                    content: system_prompt(budget),
+                    content: system_prompt(),
                 },
                 OllamaMessage {
                     role: "user".to_string(),
                     content: prompt_body.to_string(),
@@
-fn system_prompt(_budget: u32) -> String {
+fn system_prompt() -> String {
@@
-        let req = s.build_request("body", 2048);
+        let req = s.build_request("body");
@@
-        let p = system_prompt(4096);
+        let p = system_prompt();
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/source_tree/summariser/llm.rs` around lines 318 - 327, The system_prompt function still accepts an unused _budget parameter; change its signature to fn system_prompt() -> String (remove the parameter) and update all call sites that pass a budget (notably the build_request calls referencing system_prompt) to call system_prompt() with no arguments; ensure any intermediate helpers (e.g., build_request or other functions that forwarded budget solely to system_prompt) drop the budget parameter or stop forwarding it so the stale API surface is removed while preserving behavior.src/openhuman/memory/tree/score/extract/mod.rs (1)
21-89: Keep `mod.rs` export-focused.
`build_summary_extractor` is operational builder logic, so it should live in a sibling module (ops.rs/builder.rs) and be re-exported here instead of expanding `mod.rs`.
As per coding guidelines, "src/openhuman/**/*.rs: Use camelCase for variable names and keep domain `mod.rs` export-focused, with operational code in `ops.rs`, `store.rs`, `types.rs`, etc."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/score/extract/mod.rs` around lines 21 - 89, The build_summary_extractor function (and its helpers/references like LlmExtractorConfig, LlmEntityExtractor, RegexEntityExtractor, CompositeExtractor) is operational logic and should be moved out of the domain module file into a sibling implementation module (e.g., ops.rs or builder.rs); create that new module, move the entire build_summary_extractor implementation there, update imports/visibility so it compiles, and then re-export it from mod.rs with a pub use so mod.rs remains export-focused and contains no operational builder code.src/openhuman/memory/tree/content_store/mod.rs (1)
29-105: Keep `content_store/mod.rs` export-focused.
This entrypoint now owns operational logic plus local tests. Please move `StagedChunk`, `stage_chunks`, and the wrapper API into sibling files such as `types.rs`/`ops.rs`, then re-export them from `mod.rs`.
As per coding guidelines, "keep domain `mod.rs` export-focused, with operational code in `ops.rs`, `store.rs`, `types.rs`, etc."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/content_store/mod.rs` around lines 29 - 105, Summary: mod.rs contains operational code and tests; split types and ops and keep mod.rs export-focused. Move the StagedChunk struct into a new sibling types.rs, move stage_chunks and the wrapper update_summary_tags into a new ops.rs (or ops.rs for functions and types.rs for the struct), migrate any local tests into ops.rs/tests, and re-export them from mod.rs via pub use crate::openhuman::memory::tree::content_store::types::StagedChunk and pub use crate::openhuman::memory::tree::content_store::ops::{stage_chunks, update_summary_tags}; ensure compose::compose_chunk_file, atomic::write_if_new, atomic::sha256_hex and paths::chunk_* references remain imported in ops.rs and update module imports accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/bin/gmail_backfill_3d.rs`:
- Around line 353-359: The code builds abs_path by blindly pushing rel_path
components into content_root (see abs_path, content_root, rel_path), which
allows path traversal; before constructing abs_path validate rel_path by
splitting into components and rejecting any component that is empty, "." or
"..", any component containing path separators or leading '/' (absolute
segments), or any non-normal component; if validation fails, skip/log the row
and do not open the file. Apply the same validation logic for the other
identical block (the one around the 409-415 area) to prevent escaping
content_root.
In `@src/bin/slack_backfill.rs`:
- Around line 319-323: The current ratelimit detection uses case-sensitive
substring checks on the local variable err and misses variants like "Too Many
Requests" or different capitalizations; update the check in the slack backfill
logic that inspects err (the r.error.as_deref().unwrap_or(...) result) to
perform a case-insensitive match or use a regex that looks for common rate-limit
tokens (e.g., "rate", "limit", "too many requests", "429", "rate_limited",
"ratelimited") and treat any match as a rate limit error; ensure you normalize
err (to_lowercase or equivalent) or compile a case-insensitive pattern and
replace the three contains(...) branches with this single robust check so other
flow logic (the code that currently handles this branch) continues to trigger
correctly.
- Around line 261-293: The probe mode block triggered by cli.probe_ratelimit
runs before the normal connection selection and can target the wrong Slack
connection; modify the logic so probe mode respects the same connection
filtering as the main path (i.e., use the existing cli.connection_id selection
behavior rather than unconditionally calling SLACK_LIST_CONVERSATIONS), either
by moving the probe_ratelimit handling after the connection selection code or by
adding a guard that validates the discovered channel belongs to the selected
connection (reusing the same filtering logic used later for cli.connection_id)
before firing client.execute_tool("SLACK_LIST_CONVERSATIONS", ...) and running
the N calls.
In `@src/openhuman/composio/providers/gmail/ingest.rs`:
- Around line 205-243: The logs currently emit raw PII via participants,
source_id, and owner; instead compute and use redacted values (e.g.,
participant_count and a stable hash/slug of participants and owner like
sha256_hex(participants) or sha256_hex(owner)) and replace the log calls that
reference participants, source_id, and owner (the log::debug!, log::info!, and
log::warn! lines around the EmailThread creation and ingest_email call) to print
only the count and hashed/slugged identifiers; keep source_id as-is for internal
use if needed but never emit its raw value in logs—emit the hashed_source_id and
hashed_owner instead. Ensure the same hashed format is used consistently across
all three logging sites so you can correlate events without exposing PII.
- Around line 63-87: The participants_bucket_key currently returns the same
literal "unknown" when from and to fail to parse, which collapses unrelated
messages; change the fallback to use a message-specific identifier: after
computing all and finding it empty, check raw.get("threadId").and_then(|v|
v.as_str()) and use that (lowercased) as the bucket key if present, else check
raw.get("id").and_then(|v| v.as_str()) and use that, and only if neither exists
fall back to the literal "unknown"; keep using the existing helpers
(extract_email, parse_address_list_for_bucket) and preserve sorting/dedup
behavior, but replace the final if-branch to prefer "threadId" then "id" as the
unique fallback instead of always returning "unknown".
In `@src/openhuman/config/schema/load.rs`:
- Line 16: The unix-only use of File in the sync_directory function is missing
because File was removed from the tokio::fs import; restore it by importing File
(e.g., include File in the existing import tokio::fs::{self, OpenOptions, File})
or add a cfg(unix) conditional import for std::fs::File so that sync_directory
can call File::open(path) on unix targets without compilation errors.
In `@src/openhuman/memory/tree/chunker.rs`:
- Around line 82-86: The debug log in memory_tree::chunker currently prints raw
input.source_id (alongside input.markdown.len()), which can leak PII; before
logging compute and use a redacted identifier (e.g., a short stable hash or slug
derived from input.source_id such as SHA256 or blake3 truncated to 8–12 hex
chars) and log that instead; update both occurrences that call log::debug! (the
lines referencing input.source_id and input.markdown) to replace input.source_id
with the redacted value so no full email addresses or participant data are
emitted.
In `@src/openhuman/memory/tree/content_store/atomic.rs`:
- Around line 42-76: The rename currently returns Ok(true) immediately in the
Ok(()) arm but doesn't fsync the parent directory, so a crash can lose the
directory entry; after a successful std::fs::rename(&tmp_path, abs_path) inside
the Ok(()) branch (e.g., in write_if_new/atomic rename handling in atomic.rs),
open the parent directory (abs_path.parent() or "." if none) as a File (with
read permissions), call sync_all() on that dir file and propagate/log any error
(i.e., only return Ok(true) after the directory sync succeeds), and keep the
existing logging; on failure of the dir fsync, return an error similar to the
other map_err/anyhow style used in this file.
- Around line 125-153: The early-return when abs_path.exists() must verify the
on-disk contents match the composed body before returning the staged SHA: inside
the existing if abs_path.exists() branch (around the check that currently logs
and returns a StagedSummary), open/read the file bytes at abs_path, compute the
SHA-256 of those bytes and compare to the computed sha256 for composed.full; if
they match, return the StagedSummary as before (summary_id, rel_path,
content_sha256), otherwise return a failure (propagate an Err) so callers know
the on-disk content disagrees (do not silently trust the new sha); reference the
variables/functions composed.full, sha256, abs_path, StagedSummary, and
write_if_new to locate and implement the change.
In `@src/openhuman/memory/tree/content_store/paths.rs`:
- Around line 147-152: The debug log currently prints the raw email source_id
(which may contain full participant emails); change the log to avoid PII by
logging the redacted slug or a stable hash instead of source_id. Locate the code
around slugify_source_id(source_id) and chunk_id in the fallback branch (the
block that formats "email/{}/{}.md") and replace the log argument to reference
the computed slug (or a short hash/hex of source_id) rather than source_id; keep
the rest of the message context the same so the layout and chunk_id remain
identifiable without exposing raw emails.
In `@src/openhuman/memory/tree/content_store/read.rs`:
- Around line 45-50: The warning currently logs the full filesystem path via
abs_path.display(), which may expose PII; replace that with a non-sensitive
identifier (e.g., a stable content id or a redacted version of content_path)
when calling log::warn! in this file. Locate the log::warn! calls that reference
abs_path (and the variables content_path / abs_path and contents) and change the
message to include a redacted identifier (for example a contents.id, a
SHA256-of-path, or a redaction helper like redact_path(content_path)) instead of
the full path; apply the same change to the similar warn block around the other
occurrence mentioned (lines ~93-98) so neither log prints full content-store
paths. Ensure the rest of the logged fields (expected_sha256, contents.sha256)
remain unchanged.
In `@src/openhuman/memory/tree/content_store/tags.rs`:
- Around line 111-132: The current flow builds tags from entity_ids (via
list_entity_ids_for_node and entity_tag) and then calls
compose_rewrite_summary_tags with only those entity-derived tags, dropping any
existing front-matter topic labels; change the logic to first parse the existing
summary front-matter from old_bytes (use the same parser that
compose_rewrite_summary_tags expects or add a helper), merge existing topic tags
with the entity-derived tags (preserving uniqueness and sort/dedup semantics),
and then pass the merged tag list to compose_rewrite_summary_tags(&old_bytes,
&merged_tags) so existing topics are preserved when rewriting.
In `@src/openhuman/memory/tree/jobs/handlers/mod.rs`:
- Around line 60-92: The transaction commits score/lifecycle changes in
chunk_store::with_connection (around score::persist_score_tx and the UPDATE of
mem_tree_chunks) before enqueueing downstream work, which can leave an admitted
chunk without queued jobs on crash; to fix, construct the NewJob instances for
append_buffer and topic_route (as used by handle_extract) before committing and
call store::enqueue_tx within the same tx (using the tx from
chunk_store::with_connection) so the enqueue operations are part of the same
SQLite transaction, then tx.commit() only after enqueue_tx succeeds; apply the
same change to the second occurrence noted (lines ~140-159).
In `@src/openhuman/memory/tree/jobs/store.rs`:
- Around line 141-210: mark_done and mark_failed currently update rows by id
only and can overwrite another worker's claim; add a lease/claim token to the
Job (e.g., new column lease_token or computed from attempts+started_at_ms) that
is set when the worker acquires the lock and include that token in the WHERE
clause of both mark_done and mark_failed (modify functions mark_done and
mark_failed to accept the claim token parameter and update SQL to WHERE id = ?
AND lease_token = ?); ensure the code that acquires locks writes the token, and
if the UPDATE affects 0 rows log a warning and return Ok() so stale/incorrect
claims don't corrupt retry accounting.
In `@src/openhuman/memory/tree/score/extract/llm.rs`:
- Around line 470-489: into_extracted_entities is unconditionally forwarding
self.topics into ExtractedEntities so callers with emit_topics = false still
receive topics; update the mapping to check the emit_topics flag (e.g.,
self.emit_topics or the method parameter controlling emission) and only map
self.topics into ExtractedTopic entries when that flag is true, otherwise return
an empty topics Vec; ensure you still trim/filter empty labels the same way and
construct ExtractedTopic with score as before, and return the resulting topics
into the ExtractedEntities struct.
In `@src/openhuman/memory/tree/score/extract/mod.rs`:
- Around line 67-75: The current log in the LlmEntityExtractor::new success
branch prints the user-supplied endpoint at info level (via log::info) which may
leak credentials; change this to log::debug and stop logging the full
endpoint—either omit endpoint altogether or log a redacted host only (e.g.,
strip path/credentials and show host or "[redacted]"). Update the call that
currently includes endpoint, model, timeout_ms to use debug level and remove or
replace endpoint with the sanitized value, keeping model and timeout_ms as
needed for diagnostics.
In `@src/openhuman/memory/tree/source_tree/bucket_seal.rs`:
- Around line 470-485: The current scope_slug logic in the bucket sealing code
(scope_slug, tree.scope, TreeKind::Topic, slugify_source_id) strips everything
before the first ':' for all non-topic trees and causes collisions across
different source types; change the non-Topic branch to only special-case
"gmail:" by detecting if tree.scope starts_with "gmail:" and then slugifying
just the participants portion (after the colon), otherwise call
slugify_source_id on the entire tree.scope; keep the existing Topic branch
behavior unchanged.
- Around line 501-529: The code currently creates an empty StagedSummary on
stage_summary(...) failure and continues, which results in committing a DB
summary row without an on-disk .md file; instead abort the seal so the buffer
remains for retry. Replace the fallback branch in the stage_summary match (the
Err(e) arm around stage_summary(&content_root, &compose_input, &scope_slug,
None)) to propagate/return an error from the enclosing bucket_seal function (or
otherwise short-circuit the transaction) rather than constructing StagedSummary;
reference stage_summary, node.id, compose_input, content_root, scope_slug,
staged_opt and StagedSummary so you locate the code to change. Ensure the
transaction is not committed when stage_summary fails so
content_path/content_sha256 are not persisted.
In `@src/openhuman/memory/tree/topic_tree/backfill.rs`:
- Around line 80-86: The logs in the backfill routine are printing raw entity_id
(a PII email for some topic trees); update each log::info call that includes
entity_id (the logger lines shown and the other occurrences around the backfill
flow) to redact or hash entity_id before logging—e.g., compute a short stable
hash or mask using a helper (e.g., redact_entity_id or hash_entity_id) and log
that value instead of the raw entity_id while leaving other fields such as
tree.id, BACKFILL_WINDOW_DAYS, and cutoff_ms unchanged; apply this change to all
occurrences noted (the log lines around the backfill function where entity_id is
interpolated).
---
Outside diff comments:
In `@src/openhuman/memory/tree/ingest.rs`:
- Around line 155-203: The returned value `n` is from
store::upsert_staged_chunks_tx and thus equals staged.len(), which misreports
queued work when duplicates are skipped via the to_schedule HashSet; change the
closure to return the number of actually scheduled chunks (to_schedule.len()) or
return a tuple/struct with both stored_chunks (result of
upsert_staged_chunks_tx) and queued_chunks (to_schedule.len()), and update
callers accordingly; locate logic around upsert_staged_chunks_tx, the
to_schedule set, score::persist_score_tx and jobs::enqueue_tx where tx.commit()
is called and replace the single `n` return with the chosen corrected count(s).
In `@src/openhuman/memory/tree/retrieval/drill_down.rs`:
- Around line 259-272: Formatting in the block that constructs a chunk and calls
upsert_chunks and append_leaf (see functions upsert_chunks and append_leaf and
the LeafRef construction with fields token_count, seq_in_source, created_at/
timestamp, partial_message) is not rustfmt-compliant; run rustfmt by executing
cargo fmt --all and commit the resulting changes so the crate passes cargo fmt
--check (ensure the struct literal formatting and trailing commas match rustfmt
output around the chunk creation and LeafRef append).
In `@src/openhuman/memory/tree/source_tree/store.rs`:
- Around line 202-244: The current insert into mem_tree_summaries truncates
staged content via content_preview when staged is Some, causing readers to get
only the 500-char preview; instead preserve the full node.content when writing
the DB until readers are updated. Change the staged match (variables staged,
content_preview, content_path, content_sha256) so that the value inserted for
the content column is node.content (not the 500-char preview) while still
setting content_path and content_sha256 from the staged struct; keep the preview
logic only for any separate preview column or consumers that explicitly need it,
or alternatively implement hydration in the SummaryNode read path later
(references: staged, content_preview, node.content, content_path,
content_sha256, the tx.execute INSERT into mem_tree_summaries, and the
SummaryNode read/hydrate code).
In `@src/openhuman/memory/tree/topic_tree/backfill.rs`:
- Around line 162-173: The code unconditionally increments appended after
calling append_leaf (in the backfill flow using LabelStrategy::Empty), but
append_leaf is idempotent and may be a no-op; update the call to capture
append_leaf's result and only increment appended when the result indicates a
real append occurred (e.g., a boolean/Option/enum returned by append_leaf
signaling change). Keep the existing .with_context(...) error handling around
the await, then branch on the successful result from append_leaf (using the
return value from append_leaf in the backfill function) to decide whether to do
appended += 1.
---
Minor comments:
In `@docs/memory-tree-async-pipeline.excalidraw`:
- Line 44: Update the subtitle string "Leaf ingestion -> jobs queue -> workers
-> source/topic/global tree building" to use the singular phrasing "job queue"
instead of "jobs queue" so it reads "Leaf ingestion -> job queue -> workers ->
source/topic/global tree building".
In `@src/bin/gmail_backfill_3d.rs`:
- Around line 449-455: The summary field no_pointer is hard-coded to 0 which
will under-report pointerless rows; replace it with an actual count by querying
the DB for rows where content_path IS NULL (e.g. run SELECT COUNT(*) FROM ...
WHERE content_path IS NULL using the same client used to compute
rows_with_pointer) and assign that result to no_pointer (handle errors/unwrap as
done for other counts), or if you prefer to remove the misleading metric simply
drop no_pointer from the summary output and any references to it; locate the
variable no_pointer and the code that computes rows_with_pointer to add the new
COUNT query or remove the field consistently.
In `@src/bin/slack_backfill.rs`:
- Around line 261-263: The probe-ratelimit branch accepts 0 and produces a no-op
run; add input validation where you destructure cli.probe_ratelimit (the if let
Some(n) = cli.probe_ratelimit { ... } blocks) to reject zero: if n == 0 print a
clear error and exit (or return Err) instead of proceeding. Apply the same check
in both probe blocks (the first if let Some(n) = cli.probe_ratelimit { ... } and
the later analogous block around lines ~303-340) so any --probe-ratelimit 0 is
handled early and does not run a misleading probe. Ensure the error message
references the flag name (--probe-ratelimit) and required positive integer
semantics.
In `@src/openhuman/memory/tree/canonicalize/document.rs`:
- Around line 42-44: Update the file-level/module documentation to reflect the
new behavior that the canonicalizer no longer adds a leading `# provider —
title` header; change any comments that state a title header is added to say the
provider/title belong in front-matter and that the code (e.g., the
canonicalization that calls md.push_str(doc.body.trim())) strips/removes such
headers; apply the same wording change to the other comment block referenced
around the region corresponding to lines 89–102 so all docs consistently state
"no leading header; provider/title belong in MD front-matter."
In `@src/openhuman/memory/tree/retrieval/source.rs`:
- Around line 333-345: The file fails rustfmt; run the formatter and commit the
changes: run `cargo fmt --all` (or your IDE’s Rustfmt) and reformat this file so
the blocks around upsert_chunks, append_leaf, the LeafRef struct literal, and
TOKEN_BUDGET arithmetic conform to rustfmt style; then stage and commit the
updated file so CI's `cargo fmt --check` passes.
In `@src/openhuman/memory/tree/score/extract/llm.rs`:
- Around line 204-206: The log::warn! invocation in memory_tree::extract::llm is
incorrectly formatted; run rustfmt (cargo fmt --all) or manually reformat the
macro invocation so it follows rustfmt style (e.g., break arguments/format
string onto separate lines as rustfmt expects) to satisfy cargo fmt --check and
fix the CI failure.
---
Nitpick comments:
In @.gitignore:
- Line 69: Update the .gitignore entry ".target-codex/" to use a rooted pattern
"/.target-codex/" so it only matches the repository root like other target dirs;
replace the existing ".target-codex/" pattern with "/.target-codex/" (keep the
trailing slash) to maintain consistency with the project's ignore conventions.
In `@docs/memory-tree-async-pipeline.excalidraw`:
- Line 557: Replace hard-coded numeric worker/semaphore text in the diagram
labels: change "spawn 3 worker tasks" and "shared Semaphore(3) for LLM-bound
work" to use configurable wording such as "spawn N workers (configurable)" and
"shared semaphore limit (configurable)" so the diagram matches runtime settings;
keep surrounding labels ("jobs::start(workspace_dir)", "recover_stale_locks()",
"Notify wakeup + 5s polling fallback") unchanged.
In `@src/openhuman/memory/tree/content_store/mod.rs`:
- Around line 29-105: Summary: mod.rs contains operational code and tests; split
types and ops and keep mod.rs export-focused. Move the StagedChunk struct into a
new sibling types.rs, move stage_chunks and the wrapper update_summary_tags into
a new ops.rs (or ops.rs for functions and types.rs for the struct), migrate any
local tests into ops.rs/tests, and re-export them from mod.rs via pub use
crate::openhuman::memory::tree::content_store::types::StagedChunk and pub use
crate::openhuman::memory::tree::content_store::ops::{stage_chunks,
update_summary_tags}; ensure compose::compose_chunk_file, atomic::write_if_new,
atomic::sha256_hex and paths::chunk_* references remain imported in ops.rs and
update module imports accordingly.
In `@src/openhuman/memory/tree/score/extract/mod.rs`:
- Around line 21-89: The build_summary_extractor function (and its
helpers/references like LlmExtractorConfig, LlmEntityExtractor,
RegexEntityExtractor, CompositeExtractor) is operational logic and should be
moved out of the domain module file into a sibling implementation module (e.g.,
ops.rs or builder.rs); create that new module, move the entire
build_summary_extractor implementation there, update imports/visibility so it
compiles, and then re-export it from mod.rs with a pub use so mod.rs remains
export-focused and contains no operational builder code.
In `@src/openhuman/memory/tree/source_tree/summariser/llm.rs`:
- Around line 318-327: The system_prompt function still accepts an unused
_budget parameter; change its signature to fn system_prompt() -> String (remove
the parameter) and update all call sites that pass a budget (notably the
build_request calls referencing system_prompt) to call system_prompt() with no
arguments; ensure any intermediate helpers (e.g., build_request or other
functions that forwarded budget solely to system_prompt) drop the budget
parameter or stop forwarding it so the stale API surface is removed while
preserving behavior.
In `@tests/json_rpc_e2e.rs`:
- Around line 824-831: The test currently enforces exact controller count with
assert_eq!(controllers.len(), expected_methods.len()), which is brittle; change
the assertion to verify that each entry in expected_methods is present in
controllers (i.e., a subset check) instead of requiring lengths to match. Locate
the variables expected_methods and controllers in the test (around the vector
initialization and loop) and replace the length equality check with logic that
iterates over expected_methods and asserts controllers.contains(method) (or uses
a set/subset comparison) so new RPCs can be added without breaking the test.
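A sketch of the relaxed subset-style assertion (the method names and the shape of `controllers` are assumptions for illustration; the real test may build them differently):

```rust
#[test]
fn registered_methods_include_expected() {
    // Stand-in for the controllers list the e2e test builds; extra
    // registered RPCs no longer fail the test.
    let controllers: Vec<String> = vec![
        "memory_tree.trigger_digest".to_string(),
        "memory_tree.some_new_rpc".to_string(),
    ];
    let expected_methods = ["memory_tree.trigger_digest"];

    for method in expected_methods {
        assert!(
            controllers.iter().any(|c| c == method),
            "expected RPC method `{method}` to be registered"
        );
    }
}
```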
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 48d26b85-545f-4947-a96f-f7a831671a96
⛔ Files ignored due to path filters (1)
Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (67)
.gitignore
Cargo.toml
docs/memory-tree-async-pipeline.excalidraw
src/bin/gmail_backfill_3d.rs
src/bin/slack_backfill.rs
src/core/jsonrpc.rs
src/openhuman/composio/providers/gmail/ingest.rs
src/openhuman/composio/providers/gmail/mod.rs
src/openhuman/config/schema/load.rs
src/openhuman/config/schema/storage_memory.rs
src/openhuman/config/schema/types.rs
src/openhuman/memory/tree/canonicalize/chat.rs
src/openhuman/memory/tree/canonicalize/document.rs
src/openhuman/memory/tree/canonicalize/email.rs
src/openhuman/memory/tree/canonicalize/email_clean.rs
src/openhuman/memory/tree/canonicalize/mod.rs
src/openhuman/memory/tree/chunker.rs
src/openhuman/memory/tree/content_store/atomic.rs
src/openhuman/memory/tree/content_store/compose.rs
src/openhuman/memory/tree/content_store/mod.rs
src/openhuman/memory/tree/content_store/paths.rs
src/openhuman/memory/tree/content_store/read.rs
src/openhuman/memory/tree/content_store/tags.rs
src/openhuman/memory/tree/global_tree/digest.rs
src/openhuman/memory/tree/global_tree/digest_tests.rs
src/openhuman/memory/tree/global_tree/recap.rs
src/openhuman/memory/tree/global_tree/seal.rs
src/openhuman/memory/tree/ingest.rs
src/openhuman/memory/tree/jobs/handlers/mod.rs
src/openhuman/memory/tree/jobs/mod.rs
src/openhuman/memory/tree/jobs/scheduler.rs
src/openhuman/memory/tree/jobs/store.rs
src/openhuman/memory/tree/jobs/testing.rs
src/openhuman/memory/tree/jobs/types.rs
src/openhuman/memory/tree/jobs/worker.rs
src/openhuman/memory/tree/mod.rs
src/openhuman/memory/tree/retrieval/drill_down.rs
src/openhuman/memory/tree/retrieval/fetch.rs
src/openhuman/memory/tree/retrieval/global.rs
src/openhuman/memory/tree/retrieval/integration_test.rs
src/openhuman/memory/tree/retrieval/rpc.rs
src/openhuman/memory/tree/retrieval/source.rs
src/openhuman/memory/tree/retrieval/topic.rs
src/openhuman/memory/tree/rpc.rs
src/openhuman/memory/tree/schemas.rs
src/openhuman/memory/tree/score/extract/llm.rs
src/openhuman/memory/tree/score/extract/llm_tests.rs
src/openhuman/memory/tree/score/extract/mod.rs
src/openhuman/memory/tree/score/extract/types.rs
src/openhuman/memory/tree/score/mod.rs
src/openhuman/memory/tree/score/mod_tests.rs
src/openhuman/memory/tree/score/store.rs
src/openhuman/memory/tree/source_tree/bucket_seal.rs
src/openhuman/memory/tree/source_tree/bucket_seal_tests.rs
src/openhuman/memory/tree/source_tree/flush.rs
src/openhuman/memory/tree/source_tree/mod.rs
src/openhuman/memory/tree/source_tree/store.rs
src/openhuman/memory/tree/source_tree/store_tests.rs
src/openhuman/memory/tree/source_tree/summariser/llm.rs
src/openhuman/memory/tree/source_tree/types.rs
src/openhuman/memory/tree/store.rs
src/openhuman/memory/tree/store_tests.rs
src/openhuman/memory/tree/topic_tree/backfill.rs
src/openhuman/memory/tree/topic_tree/curator.rs
src/openhuman/memory/tree/topic_tree/routing.rs
src/openhuman/memory/tree/types.rs
tests/json_rpc_e2e.rs
🧹 Nitpick comments (1)
src/openhuman/memory/tree/retrieval/source.rs (1)
333-334: Extract the repeated token-count expression into one local constant. The same `TOKEN_BUDGET * 6 / 10` expression is duplicated for both `Chunk.token_count` and `LeafRef.token_count`; a local constant avoids drift in future edits.
♻️ Suggested cleanup
 async fn seed_source(cfg: &Config, scope: &str, ts: DateTime<Utc>) {
     let tree = get_or_create_source_tree(cfg, scope).unwrap();
     let summariser = InertSummariser::new();
+    const TEST_LEAF_TOKEN_COUNT: u32 =
+        crate::openhuman::memory::tree::source_tree::types::TOKEN_BUDGET * 6 / 10;
     for seq in 0..2u32 {
         let c = Chunk {
@@
-            token_count: crate::openhuman::memory::tree::source_tree::types::TOKEN_BUDGET * 6
-                / 10,
+            token_count: TEST_LEAF_TOKEN_COUNT,
@@
-            token_count: crate::openhuman::memory::tree::source_tree::types::TOKEN_BUDGET
-                * 6
-                / 10,
+            token_count: TEST_LEAF_TOKEN_COUNT,

Also applies to: 345-347
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/retrieval/source.rs` around lines 333 - 334, Extract the duplicated TOKEN_BUDGET * 6 / 10 expression into a single local constant (e.g., token_quota) near where the Chunk and LeafRef are created, then replace both Chunk.token_count and LeafRef.token_count uses with that constant to prevent drift; reference the TOKEN_BUDGET symbol and the Chunk.token_count and LeafRef.token_count fields when making the change so both occurrences (including the one around lines 345-347) are updated.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@src/openhuman/memory/tree/retrieval/source.rs`:
- Around line 333-334: Extract the duplicated TOKEN_BUDGET * 6 / 10 expression
into a single local constant (e.g., token_quota) near where the Chunk and
LeafRef are created, then replace both Chunk.token_count and LeafRef.token_count
uses with that constant to prevent drift; reference the TOKEN_BUDGET symbol and
the Chunk.token_count and LeafRef.token_count fields when making the change so
both occurrences (including the one around lines 345-347) are updated.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 710a73ea-1a66-457f-8bc0-de763ec349d4
📒 Files selected for processing (4)
src/openhuman/config/schema/load.rs
src/openhuman/memory/tree/retrieval/drill_down.rs
src/openhuman/memory/tree/retrieval/source.rs
src/openhuman/memory/tree/score/extract/llm.rs
✅ Files skipped from review due to trivial changes (1)
- src/openhuman/memory/tree/score/extract/llm.rs
🚧 Files skipped from review as they are similar to previous changes (2)
- src/openhuman/config/schema/load.rs
- src/openhuman/memory/tree/retrieval/drill_down.rs
…ems)
tinyhumansai#4 gmail unknown bucket — orphan:{id} fallback instead of collapsing all no-address messages into one "unknown" source tree; messages with neither address nor id are dropped with a warn (skip sentinel).
tinyhumansai#5 PII in gmail/ingest.rs — participants, source_id, and owner wrapped with redact() before logging; no email addresses in log output.
tinyhumansai#7 PII in chunker.rs — source_id in debug logs replaced with redact(source_id) hash.
tinyhumansai#8 Parent dir fsync after rename in atomic.rs — on Unix, open the parent directory and call sync_all() after a successful rename so the dirent is durable across a crash; best-effort warn on failure, non-fatal.
tinyhumansai#9 Body verify on re-stage in atomic.rs — when the path already exists, read and hash the on-disk body; if sha differs from the new body, remove the stale file and re-write atomically so SQLite content_sha256 always matches the on-disk bytes. New unit test covers the re-write path.
tinyhumansai#10 PII in paths.rs — malformed source_id fallback log now emits redact(source_id) instead of the raw string.
tinyhumansai#11 PII in read.rs — sha256 mismatch warn logs now emit redact(abs_path) instead of abs_path.display().
tinyhumansai#13 Same-tx enqueue in handle_extract — append_buffer and topic_route jobs are built before the transaction opens and enqueued via enqueue_tx() inside the same tx that commits the lifecycle update, eliminating the crash window where lifecycle commits but jobs are lost.
tinyhumansai#16 PII in score/extract/mod.rs — LLM extractor info! log demoted to debug! and endpoint wrapped with redact_endpoint() to strip path/query/credentials from the URL.
tinyhumansai#17 Non-email scope slug in bucket_seal.rs — only the gmail: prefix is stripped before slugify; all other source kinds (slack:, discord:, document:, …) slugify the full scope string so slack:#eng and discord:#eng no longer produce the same "eng" slug and collide in summaries/source/.
tinyhumansai#18 Don't commit summary on stage_summary failure — bucket_seal.rs, global_tree/digest.rs, and global_tree/seal.rs now propagate stage_summary errors via .context()? instead of falling back to a NULL content_path row; the seal is retried via the normal job-retry path. insert_summary_tx callers in those three paths now always pass Some(&staged) instead of Option.
tinyhumansai#20 PII in topic_tree/backfill.rs — all entity_id occurrences in log lines (info/debug/warn) replaced with redact(entity_id).
New shared helper: src/openhuman/memory/tree/util/redact.rs
- redact(s) → stable 8-hex-char SHA-256 prefix for PII fields
- redact_endpoint(u) → host:port only (strips scheme, creds, path, query)
- Full unit-test coverage for both helpers
Items tinyhumansai#14 (lease tokens) and tinyhumansai#19 (preview-vs-disk reader audit) intentionally deferred to follow-up commits — both have larger blast radius and need separate review.
Items tinyhumansai#1, tinyhumansai#12, tinyhumansai#15 pushed back on PR thread (verified false positive or covered by upstream contract).
Items #2, tinyhumansai#3 deferred (pre-existing in slack_backfill.rs, not introduced by this PR).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
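For reference, a minimal sketch of what a `redact()` helper with these properties can look like, assuming the `sha2` crate; the real `util/redact.rs` may differ in details:

```rust
use sha2::{Digest, Sha256};

/// Stable 8-hex-char SHA-256 prefix: the same value always redacts to the
/// same token, so log lines stay correlatable without exposing the PII.
pub fn redact(s: &str) -> String {
    let digest = Sha256::digest(s.as_bytes());
    // Two hex chars per byte: the first 4 digest bytes give 8 characters.
    digest[..4].iter().map(|b| format!("{b:02x}")).collect()
}

fn main() {
    assert_eq!(redact("alice@example.com"), redact("alice@example.com"));
    assert_eq!(redact("alice@example.com").len(), 8);
}
```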
…mansai#14) Job settlement (mark_done / mark_failed) now gates on (attempts, started_at_ms) so a stale worker whose lease expired and got the row re-claimed by another worker can't clobber the new lessee's state.

The race we close:
t=0      Worker A claims job_X, locked_until_ms = t + lock_duration
t=lock   Lease expires while A's handler is still running (Ollama hung)
t=lock+  recover_stale_locks resets job_X to ready
t=lock+  Worker B claims job_X (attempts++, new started_at_ms)
t=...    B's handler completes successfully, marks done
t=...    A's handler finally returns Ok, calls mark_done
─────── before this fix: A's UPDATE clobbers B's state
─────── after this fix: A's UPDATE matches 0 rows, logs warn, no-op

Implementation:
- mark_done / mark_failed now take &Job (carrying attempts + started_at_ms snapshots from the worker's claim_next call)
- SQL WHERE adds: id = ? AND attempts = ? AND started_at_ms IS ? (IS handles the NULL-vs-NULL equality semantics SQLite needs)
- rows_affected == 0 → warn-log "stale lease, settlement no-op", return Ok (this is a known race outcome, not a failure)
- All 5 existing tests updated to pass &job snapshots
- 3 new tests:
  · mark_done_succeeds_for_current_lessee (happy-path guard)
  · stale_worker_settlement_is_noop (the race itself)
  · stale_worker_mark_failed_is_noop (same contract for failure path)

Worker call sites updated to thread &job through. scheduler.rs test fixture updated to claim before settling (needed for &Job snapshot).

Addresses CodeRabbit review item tinyhumansai#14 (Critical) on PR tinyhumansai#1008.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
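A sketch of the guarded settlement UPDATE described above, using rusqlite; the table and column names follow the commit message, everything else is illustrative:

```rust
struct Job {
    id: i64,
    attempts: i64,
    // NULL in SQL maps to None; SQL `IS ?` treats NULL-vs-NULL as equal.
    started_at_ms: Option<i64>,
}

fn mark_done(conn: &rusqlite::Connection, job: &Job) -> anyhow::Result<()> {
    let rows = conn.execute(
        "UPDATE mem_tree_jobs
            SET status = 'done'
          WHERE id = ?1 AND attempts = ?2 AND started_at_ms IS ?3",
        rusqlite::params![job.id, job.attempts, job.started_at_ms],
    )?;
    if rows == 0 {
        // The lease was re-claimed while we ran; the new lessee's state wins
        // and our settlement is a deliberate no-op, not a failure.
        log::warn!("stale lease, settlement no-op: job_id={}", job.id);
    }
    Ok(())
}
```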
…nyhumansai#1008 CR-19) After the MD-on-disk migration `mem_tree_chunks.content` and `mem_tree_summaries.content` store only a ≤500-char preview. All readers that feed LLMs or the retrieval API were still consuming the preview as if it were the full body.

Fix: add `content_store::read::read_chunk_body` / `read_summary_body` (disk read + SHA-256 verify) and thread them through every full-body call site:
• `bucket_seal::hydrate_leaf_inputs` / `hydrate_summary_inputs` (LLM summariser input — fatal on missing content_path)
• `handlers::handle_extract` — scorer + embedder (LLM extractor + embedding — fatal)
• `handlers::handle_append_buffer` Leaf + Summary branches (topic-tree ingest — fatal)
• `retrieval::fetch::hit_from_chunk` — non-fatal warn+fallback
• `retrieval::drill_down` chunk + summary branches — non-fatal
• `retrieval::source::collect_hits_and_nodes` — non-fatal
• `retrieval::topic::fetch_topic_tree_root_summary` + `entity_hit_to_retrieval_hit` — non-fatal
• `global_tree::digest::pick_source_contribution` — non-fatal

Also fix `content_store::paths::chunk_rel_path` to sanitize chunk IDs into cross-platform filenames (colons are illegal on Windows NTFS; summary_rel_path already did this).

Test fixtures: every test that calls `upsert_chunks` and then triggers a seal or a retrieval hit now calls `stage_test_chunks` (a local helper that writes .md files + stores the content_path pointer in SQLite) so `read_chunk_body` can resolve the on-disk file. Affected: bucket_seal_tests, drill_down, fetch, global, integration_test, rpc, source (retrieval), flush, recap, digest_tests.

All 477 memory::tree lib tests pass; 62 gmail lib tests pass.
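A sketch of the read-plus-verify shape described above, assuming the `sha2` and `hex` crates; the real `content_store::read` signatures and front-matter handling may differ:

```rust
use sha2::{Digest, Sha256};
use std::path::Path;

fn read_chunk_body(abs_path: &Path, stored_sha: &str) -> anyhow::Result<String> {
    let bytes = std::fs::read(abs_path)?;
    // In the real store the sha covers the body only (front-matter excluded);
    // this sketch skips that split for brevity.
    let actual = hex::encode(Sha256::digest(&bytes));
    anyhow::ensure!(
        actual == stored_sha,
        // Path deliberately omitted from the message per the PII guideline.
        "content sha mismatch: stored != on-disk"
    );
    Ok(String::from_utf8(bytes)?)
}
```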
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/openhuman/memory/tree/source_tree/bucket_seal.rs (1)
560-567: ⚠️ Potential issue | 🟠 Major
Persist `node.topics` into the entity index too. After `resolve_labels`, the summary row carries both `entities` and `topics`, but this transaction only indexes `node.entities`. That leaves topic summaries discoverable via `summary.topics` but invisible to the paths that read `mem_tree_entity_index` (`query_topic`, `search_entities(kind=topic)`, and the tag rewrite).
Based on learnings, summary topics are stored in `mem_tree_entity_index` with `entity_kind = "topic"`, and `update_summary_tags` reads that index as the source of truth.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/source_tree/bucket_seal.rs` around lines 560 - 567, The transaction only indexes node.entities but must also persist node.topics into mem_tree_entity_index so topic summaries become discoverable; call the same index routine (crate::openhuman::memory::tree::score::store::index_summary_entity_ids_tx) for node.topics as well (or extend that function to accept both entities and topics) ensuring the entries are written with entity_kind = "topic" (so update_summary_tags, query_topic, and search_entities(kind=topic) will see them); ensure you pass the same tx, node.id, node.score, timestamp, and tree_id context when indexing topics.
♻️ Duplicate comments (1)
src/openhuman/memory/tree/content_store/read.rs (1)
24-29: ⚠️ Potential issue | 🟠 Major
Redact the path in these error strings too. `read_chunk_file()` still bakes `abs_path` into its read/UTF-8/front-matter errors. Those errors bubble up through handler contexts and are logged with full chains, so participant-bucket paths can still leak even though the explicit `warn!` calls are redacted. `read_summary_file()` inherits the same behavior through this helper.
As per coding guidelines, "Never log secrets or full PII."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/content_store/read.rs` around lines 24 - 29, The error messages in read_chunk_file()/read_summary_file() (in read.rs) include the full abs_path in the read/UTF-8/front-matter errors; replace those messages to avoid leaking PII by redacting the path before interpolating into error strings (e.g., use a helper or only include non-sensitive parts such as file stem or a fixed "[redacted]" token) for the std::fs::read map_err, the from_utf8 map_err, and the split_front_matter no_front_matter error that currently reference abs_path; ensure you update references around abs_path and split_front_matter so all three error branches report a redacted path or placeholder instead of the full abs_path.
🧹 Nitpick comments (4)
src/openhuman/composio/providers/gmail/ingest.rs (1)
228-258: Align new locals with the repo's camelCase rule. Names like `total_chunks`, `raw_msgs`, `source_id`, and `thread_subject` don't match the `src/openhuman/**/*.rs` naming convention. I'd normalize them here while the module is still new so it stays consistent with the rest of `openhuman`. As per coding guidelines, `src/openhuman/**/*.rs`: "Use camelCase for variable names".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/composio/providers/gmail/ingest.rs` around lines 228 - 258, Rename snake_case local variables to camelCase: change total_chunks -> totalChunks, total_buckets -> totalBuckets, raw_msgs -> rawMsgs, source_id -> sourceId, and thread_subject -> threadSubject; update their declarations and every use in this function (the for loop binding (participants, rawMsgs), the call to raw_to_email_message, creation of sourceId = format!("gmail:{}", participants), and EmailThread { thread_subject -> threadSubject }) so all references match the new names, keeping other symbols like bucket_by_participants, raw_to_email_message, EmailThread, pick_thread_subject, and DEFAULT_TAGS unchanged.
src/openhuman/memory/tree/content_store/paths.rs (1)
198-202: Variable naming in this file diverges from the repo convention. At line 198 onward, local variables use snake_case (e.g., `pending_underscore`, `last_dash`). If you want strict consistency with the project rule, these should be camelCase.
As per coding guidelines, "Use camelCase for variable names..."
🤖 Prompt for AI Agents
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/content_store/paths.rs` around lines 198 - 202, Local variables in src/openhuman/memory/tree/content_store/paths.rs use snake_case (e.g., pending_underscore, last_dash) but the repo requires camelCase; rename these locals (for example pending_underscore -> pendingUnderscore, last_dash -> lastDash, and any other snake_case locals in this function such as lower if needed) and update every use site within the same function (and any closures) to the new names, keeping the same types/semantics and preserving the existing initialization (e.g., let mut pendingUnderscore = false;). Run cargo build/tests to ensure no unresolved references remain.
src/openhuman/memory/tree/chunker.rs (1)
73-198: Rename the new locals to camelCase to match the repo convention. This hunk introduces several new snake_case locals (`max_tokens`, `max_chars`, `unit_separator`, `acc_chars`, etc.), so the file now mixes conventions inside the same module.
As per coding guidelines, "Use camelCase for variable names."
🤖 Prompt for AI Agents
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/chunker.rs` around lines 73 - 198, The new locals use snake_case but the repo prefers camelCase; rename variables and closure parameters consistently (e.g., max_tokens -> maxTokens, max_chars -> maxChars, unit_separator -> unitSeparator, acc_chars -> accChars, sub_pieces -> subPieces) and update the flush closure signature and its uses (flush(&mut acc, &mut accChars, &mut out)) as well as every reference to approx_token_count, split_by_token_budget, units, and Chunk construction sites to use the new names so the logic is unchanged but naming matches camelCase conventions.
src/openhuman/memory/tree/jobs/scheduler.rs (1)
28-127: Use camelCase for the new scheduler params and locals. The added helpers introduce repo-inconsistent snake_case names such as `days_back`, `date_iso`, `today_iso`, and `next_sleep_duration`.
As per coding guidelines, "Use camelCase for variable names."
🤖 Prompt for AI Agents
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/jobs/scheduler.rs` around lines 28 - 127, Rename snake_case scheduler parameters, locals, and the helper function to camelCase: change the backfill_missing_digests parameter days_back to daysBack and update all uses inside backfill_missing_digests (including the log call and loop bounds); rename local variables date_iso and today_iso in enqueue_daily_jobs (and trigger_digest's payload field references) to dateIso and todayIso; rename the function next_sleep_duration to nextSleepDuration and update its internal locals and any call sites; make sure to update NewJob::digest_daily/flush_stale payload field names references (date_iso -> dateIso) and all log messages to use the new identifiers so the code compiles and follows the camelCase variable naming guideline.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/openhuman/composio/providers/gmail/ingest.rs`:
- Around line 126-140: The current splitting on commas in
parse_address_list_for_bucket (and the other helper used at 189-203) breaks
RFC-822 quoted display names like "Doe, John" and must be replaced with an
RFC-aware parser; create a single shared parser function that uses an RFC
mailbox parser (for example mailparse::addrparse_header or another RFC-822/2822
mailbox parser) to parse the header/value into MailAddr entries and then collect
the mailbox addresses (or full mailbox strings) into Vec<String>, and update
parse_address_list_for_bucket and the other helper to call that parser instead
of using split(',') so quoted names and escaped commas are handled correctly.
In `@src/openhuman/memory/tree/content_store/paths.rs`:
- Around line 69-71: The code currently panics in summary_rel_path by calling
expect on date_for_global for SummaryTreeKind::Global; change summary_rel_path
to return a Result (e.g., Result<String, E> or anyhow::Result<String>) instead
of unwrapping, detect when kind == SummaryTreeKind::Global and
date_for_global.is_none() and return a descriptive error (include context about
missing date_for_global), replace the expect(...) call with
date_for_global.ok_or_else(|| /* error with context */)?.format(...) or map_err
to convert to the chosen error type, and propagate this Result up from callers
so the async pipeline can handle retry/failure instead of panicking.
In `@src/openhuman/memory/tree/content_store/read.rs`:
- Around line 154-179: The code ignores the stored SHA from pointers and returns
the file body without verification; change the tuple binding to capture the
stored SHA (rename _sha256 -> stored_sha), compute the SHA256 (hex) of the read
bytes returned by read_chunk_file, compare it to stored_sha, and return an error
with context (same style as with_context) if they differ; reference the
variables chunk_id, rel_path, stored_sha, the call to
read_chunk_file(&abs_path), and use the existing logging/context helpers
(redact, with_context) to produce the failure message so corrupted or stale
files are rejected before returning result.body.
- Around line 148-153: The code currently treats a NULL return from
get_chunk_content_pointers(...) as an error and hard-fails; instead preserve the
legacy-row fallback by treating None as a valid case and proceeding to read the
inline content column. Modify the call around get_chunk_content_pointers in
read.rs (the variable pointers from get_chunk_content_pointers) so it does not
err_on_none but branches: if Some(pointers) use the external
content_path/content_sha256 logic, otherwise continue with the pre-MD migration
path that reads the inline content field; apply the same change to the other
occurrence referenced (the block around lines 204-209) so old rows are not
turned into permanent failures.
In `@src/openhuman/memory/tree/source_tree/bucket_seal.rs`:
- Around line 164-170: The debug log currently prints the raw chunk id
(leaf.chunk_id) which can leak PII; update the log::debug call in this
bucket_seal/append_leaf area to avoid printing leaf.chunk_id directly—replace it
with a redacted or deterministic hashed representation (e.g., hash or fixed
structural prefix like "chunk-<redacted>" or first N chars of a secure hash)
while leaving tree.id, leaf.token_count, and strategy unchanged; ensure the new
value is computed safely before the log call and used in the log::debug format
string instead of leaf.chunk_id.
- Around line 726-743: The current code silently swallows index read errors and
unconditionally calls content_read::read_chunk_body/read_summary_body which
fails for pre-migration inline rows; update hydrate_leaf_inputs (and the similar
summary path) to: 1) stop using
list_entity_ids_for_node(...).unwrap_or_default() — match on the Result and only
default to empty when the error is a benign "index missing/empty" sentinel (log
it) but propagate real DB/file/SHA errors upward; 2) before calling
content_read::read_chunk_body or content_read::read_summary_body, check the
chunk's content_path presence and, if absent, use the inline content field from
the row as the body fallback; otherwise call the read_* function and propagate
its errors. Reference list_entity_ids_for_node, hydrate_leaf_inputs,
content_read::read_chunk_body and content_read::read_summary_body when making
these changes.
In `@src/openhuman/memory/tree/util/redact.rs`:
- Around line 33-46: The redact_endpoint function currently uses split_once('@')
which removes everything before the first '@' and can leak credential fragments
(e.g., "user:p@ss@example.com"); update redact_endpoint to first isolate the
authority (take after_scheme up to the first '/', '?', or '#'), then remove
userinfo by using rsplit_once('@') on that authority so the last '@' is honored,
and finally return only the host:port portion; reference the symbols
redact_endpoint, after_scheme, after_creds/authority and host_port when making
this change.
---
Outside diff comments:
In `@src/openhuman/memory/tree/source_tree/bucket_seal.rs`:
- Around line 560-567: The transaction only indexes node.entities but must also
persist node.topics into mem_tree_entity_index so topic summaries become
discoverable; call the same index routine
(crate::openhuman::memory::tree::score::store::index_summary_entity_ids_tx) for
node.topics as well (or extend that function to accept both entities and topics)
ensuring the entries are written with entity_kind = "topic" (so
update_summary_tags, query_topic, and search_entities(kind=topic) will see
them); ensure you pass the same tx, node.id, node.score, timestamp, and tree_id
context when indexing topics.
---
Duplicate comments:
In `@src/openhuman/memory/tree/content_store/read.rs`:
- Around line 24-29: The error messages in read_chunk_file()/read_summary_file()
(in read.rs) include the full abs_path in the read/UTF-8/front-matter errors;
replace those messages to avoid leaking PII by redacting the path before
interpolating into error strings (e.g., use a helper or only include
non-sensitive parts such as file stem or a fixed "[redacted]" token) for the
std::fs::read map_err, the from_utf8 map_err, and the split_front_matter
no_front_matter error that currently reference abs_path; ensure you update
references around abs_path and split_front_matter so all three error branches
report a redacted path or placeholder instead of the full abs_path.
---
Nitpick comments:
In `@src/openhuman/composio/providers/gmail/ingest.rs`:
- Around line 228-258: Rename snake_case local variables to camelCase: change
total_chunks -> totalChunks, total_buckets -> totalBuckets, raw_msgs -> rawMsgs,
source_id -> sourceId, and thread_subject -> threadSubject; update their
declarations and every use in this function (the for loop binding (participants,
rawMsgs), the call to raw_to_email_message, creation of sourceId =
format!("gmail:{}", participants), and EmailThread { thread_subject ->
threadSubject }) so all references match the new names, keeping other symbols
like bucket_by_participants, raw_to_email_message, EmailThread,
pick_thread_subject, and DEFAULT_TAGS unchanged.
In `@src/openhuman/memory/tree/chunker.rs`:
- Around line 73-198: The new locals use snake_case but the repo prefers
camelCase; rename variables and closure parameters consistently (e.g.,
max_tokens -> maxTokens, max_chars -> maxChars, unit_separator -> unitSeparator,
acc_chars -> accChars, sub_pieces -> subPieces) and update the flush closure
signature and its uses (flush(&mut acc, &mut accChars, &mut out)) as well as
every reference to approx_token_count, split_by_token_budget, units, and Chunk
construction sites to use the new names so the logic is unchanged but naming
matches camelCase conventions.
In `@src/openhuman/memory/tree/content_store/paths.rs`:
- Around line 198-202: Local variables in
src/openhuman/memory/tree/content_store/paths.rs use snake_case (e.g.,
pending_underscore, last_dash) but the repo requires camelCase; rename these
locals (for example pending_underscore -> pendingUnderscore, last_dash ->
lastDash, and any other snake_case locals in this function such as lower if
needed) and update every use site within the same function (and any closures) to
the new names, keeping the same types/semantics and preserving the existing
initialization (e.g., let mut pendingUnderscore = false;). Run cargo build/tests
to ensure no unresolved references remain.
In `@src/openhuman/memory/tree/jobs/scheduler.rs`:
- Around line 28-127: Rename snake_case scheduler parameters, locals, and the
helper function to camelCase: change the backfill_missing_digests parameter
days_back to daysBack and update all uses inside backfill_missing_digests
(including the log call and loop bounds); rename local variables date_iso and
today_iso in enqueue_daily_jobs (and trigger_digest's payload field references)
to dateIso and todayIso; rename the function next_sleep_duration to
nextSleepDuration and update its internal locals and any call sites; make sure
to update NewJob::digest_daily/flush_stale payload field names references
(date_iso -> dateIso) and all log messages to use the new identifiers so the
code compiles and follows the camelCase variable naming guideline.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a0a9ccb3-ddc1-49f9-8fd6-598dc2007c83
📒 Files selected for processing (28)
src/openhuman/composio/providers/gmail/ingest.rs
src/openhuman/memory/tree/chunker.rs
src/openhuman/memory/tree/content_store/atomic.rs
src/openhuman/memory/tree/content_store/paths.rs
src/openhuman/memory/tree/content_store/read.rs
src/openhuman/memory/tree/global_tree/digest.rs
src/openhuman/memory/tree/global_tree/digest_tests.rs
src/openhuman/memory/tree/global_tree/recap.rs
src/openhuman/memory/tree/global_tree/seal.rs
src/openhuman/memory/tree/jobs/handlers/mod.rs
src/openhuman/memory/tree/jobs/scheduler.rs
src/openhuman/memory/tree/jobs/store.rs
src/openhuman/memory/tree/jobs/worker.rs
src/openhuman/memory/tree/mod.rs
src/openhuman/memory/tree/retrieval/drill_down.rs
src/openhuman/memory/tree/retrieval/fetch.rs
src/openhuman/memory/tree/retrieval/global.rs
src/openhuman/memory/tree/retrieval/integration_test.rs
src/openhuman/memory/tree/retrieval/rpc.rs
src/openhuman/memory/tree/retrieval/source.rs
src/openhuman/memory/tree/retrieval/topic.rs
src/openhuman/memory/tree/score/extract/mod.rs
src/openhuman/memory/tree/source_tree/bucket_seal.rs
src/openhuman/memory/tree/source_tree/bucket_seal_tests.rs
src/openhuman/memory/tree/source_tree/flush.rs
src/openhuman/memory/tree/topic_tree/backfill.rs
src/openhuman/memory/tree/util/mod.rs
src/openhuman/memory/tree/util/redact.rs
✅ Files skipped from review due to trivial changes (2)
- src/openhuman/memory/tree/jobs/worker.rs
- src/openhuman/memory/tree/topic_tree/backfill.rs
🚧 Files skipped from review as they are similar to previous changes (12)
- src/openhuman/memory/tree/mod.rs
- src/openhuman/memory/tree/global_tree/recap.rs
- src/openhuman/memory/tree/retrieval/fetch.rs
- src/openhuman/memory/tree/retrieval/topic.rs
- src/openhuman/memory/tree/retrieval/integration_test.rs
- src/openhuman/memory/tree/global_tree/seal.rs
- src/openhuman/memory/tree/retrieval/source.rs
- src/openhuman/memory/tree/score/extract/mod.rs
- src/openhuman/memory/tree/global_tree/digest.rs
- src/openhuman/memory/tree/source_tree/flush.rs
- src/openhuman/memory/tree/jobs/store.rs
- src/openhuman/memory/tree/content_store/atomic.rs
… in path util, last-@ for endpoint redact

Three small correctness fixes from the CodeRabbit re-review of the preview-vs-disk reader audit (tinyhumansai#19) and surrounding work:

- read_chunk_body / read_summary_body now compare the on-disk body's sha256 to the value stored in SQLite at write time. Mismatch returns Err rather than silently serving corrupt bytes into the LLM pipeline. The previous code computed the sha but discarded it.
- summary_rel_path no longer panics when date_for_global is None for the Global tree kind. A path utility shouldn't take the process down; falls back to an `unknown-date` sentinel segment with a warn log surfacing the caller bug. Test flipped from #[should_panic] to asserting the sentinel.
- redact_endpoint splits the authority on the LAST `@` (rsplit_once) instead of the first, so URLs with `@` in user-info segments (e.g. `https://user:p@ass@host.com`) extract the host correctly. Also tightened the order-of-operations: take the authority first (cut at /?#), then split userinfo within it, so a `@` in path or query string can't be misread as a userinfo separator.

Addresses CodeRabbit comments 3157361914, 3157361930, 3157361952.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
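A minimal sketch of that authority-first parse order (illustrative only; the real helper may handle more edge cases):

```rust
pub fn redact_endpoint(u: &str) -> String {
    // 1. Drop the scheme if present.
    let after_scheme = u.split_once("://").map_or(u, |(_, rest)| rest);
    // 2. Isolate the authority: everything before the first '/', '?' or '#',
    //    so an '@' in the path or query can't be mistaken for userinfo.
    let authority = after_scheme
        .split(|c: char| c == '/' || c == '?' || c == '#')
        .next()
        .unwrap_or(after_scheme);
    // 3. Strip userinfo on the LAST '@' so "user:p@ss@host" keeps the host.
    let host_port = authority.rsplit_once('@').map_or(authority, |(_, h)| h);
    host_port.to_string()
}

fn main() {
    assert_eq!(
        redact_endpoint("https://user:p@ss@example.com:11434/api?x=a@b"),
        "example.com:11434"
    );
}
```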
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/openhuman/memory/tree/content_store/read.rs (1)
120-124: Reword NULL-pointer failures as migration regressions, not legacy rows. The hard-fail itself is right, but the comments and error text still frame `None` as a plausible "pre-MD-migration row" case. In this branch that state is supposed to be eliminated by wipe + re-ingest, so this wording will send triage toward a non-existent compatibility path instead of the real causes: a staging/write-path regression or a bad ingest state.
📝 Suggested wording
-// - If `content_path` / `content_sha256` are NULL (legacy rows ingested before
-//   the MD-on-disk migration), return `Err` — callers must handle the
-//   "pre-migration chunk" case explicitly. The job pipeline propagates the
-//   error and retries; retrieval falls back gracefully.
+// - If `content_path` / `content_sha256` are NULL, return `Err` — in this
+//   migration that indicates an unexpected staging/schema regression or an
+//   incomplete wipe + re-ingest transition.
-        "[content_store::read] no content_path for chunk_id={} (pre-MD-migration row?)",
+        "[content_store::read] no content_path for chunk_id={} (expected disk-backed pointer missing)",
-        "[content_store::read] no content_path for summary_id={} (pre-MD-migration row?)",
+        "[content_store::read] no content_path for summary_id={} (expected disk-backed pointer missing)",
Based on learnings, this PR's deployment model is a `--wipe` + re-ingest migration, so NULL `content_path` is intended to signal a write-path/schema regression rather than a supported legacy-row case.
Also applies to: 148-153, 207-209, 221-226
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/content_store/read.rs` around lines 120 - 124, Update the wording in the NULL-pointer failure comments and any error messages to treat missing content_path/content_sha256 as a migration/regression (write-path or ingest state bug) rather than a supported "pre-migration legacy row" case; locate references to content_path and content_sha256 in read.rs (including the comment block above the "Error policy" and any error strings returned where code checks for None) and replace language that suggests a graceful legacy compatibility path with wording that clearly signals a migration/regression and instructs operators to wipe+re-ingest or investigate write/ingest pipeline failures.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/openhuman/memory/tree/content_store/read.rs`:
- Around line 23-29: read_chunk_file currently embeds the full abs_path into its
anyhow! error messages (in the std::fs::read map_err, the UTF-8 map_err, and the
split_front_matter ok_or_else), which can leak PII; change those error messages
to avoid printing the full path by using a sanitized/redacted representation
(e.g., a small helper like sanitize_path(abs_path) or only
abs_path.file_name().and_then(|s| s.to_str()).unwrap_or("<redacted>")) and use
that redacted value in the anyhow! calls in read_chunk_file so
split_front_matter, std::fs::read and UTF-8 errors no longer expose the full
path.
---
Nitpick comments:
In `@src/openhuman/memory/tree/content_store/read.rs`:
- Around line 120-124: Update the wording in the NULL-pointer failure comments
and any error messages to treat missing content_path/content_sha256 as a
migration/regression (write-path or ingest state bug) rather than a supported
"pre-migration legacy row" case; locate references to content_path and
content_sha256 in read.rs (including the comment block above the "Error policy"
and any error strings returned where code checks for None) and replace language
that suggests a graceful legacy compatibility path with wording that clearly
signals a migration/regression and instructs operators to wipe+re-ingest or
investigate write/ingest pipeline failures.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 88842fcc-4ebd-446c-b6bf-d2761152116d
📒 Files selected for processing (3)
src/openhuman/memory/tree/content_store/paths.rs
src/openhuman/memory/tree/content_store/read.rs
src/openhuman/memory/tree/util/redact.rs
pub fn read_chunk_file(abs_path: &Path) -> anyhow::Result<ChunkFileContents> {
    let raw = std::fs::read(abs_path).map_err(|e| anyhow::anyhow!("read {:?}: {e}", abs_path))?;
    let content = std::str::from_utf8(&raw)
        .map_err(|e| anyhow::anyhow!("invalid UTF-8 in {:?}: {e}", abs_path))?;

    let (_fm, body) = split_front_matter(content)
        .ok_or_else(|| anyhow::anyhow!("no front-matter in {:?}", abs_path))?;
Redact the path in low-level reader errors too.
read_chunk_file() still embeds abs_path verbatim in its anyhow! messages. Those errors are later bubbled up and logged with {e:#} in retrieval/job paths, so a failed read can still expose participant-bucket paths even though the explicit warn logs were already sanitized.
🔐 Suggested change
pub fn read_chunk_file(abs_path: &Path) -> anyhow::Result<ChunkFileContents> {
- let raw = std::fs::read(abs_path).map_err(|e| anyhow::anyhow!("read {:?}: {e}", abs_path))?;
+ let path_hash = redact(&abs_path.to_string_lossy());
+ let raw = std::fs::read(abs_path).map_err(|e| {
+ anyhow::anyhow!("[content_store::read] read failed path_hash={}: {e}", path_hash)
+ })?;
let content = std::str::from_utf8(&raw)
- .map_err(|e| anyhow::anyhow!("invalid UTF-8 in {:?}: {e}", abs_path))?;
+ .map_err(|e| {
+ anyhow::anyhow!("[content_store::read] invalid UTF-8 path_hash={}: {e}", path_hash)
+ })?;
let (_fm, body) = split_front_matter(content)
- .ok_or_else(|| anyhow::anyhow!("no front-matter in {:?}", abs_path))?;
+ .ok_or_else(|| {
+ anyhow::anyhow!("[content_store::read] no front-matter path_hash={}", path_hash)
+        })?;
As per coding guidelines, "Never log secrets or full PII."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/openhuman/memory/tree/content_store/read.rs` around lines 23 - 29,
read_chunk_file currently embeds the full abs_path into its anyhow! error
messages (in the std::fs::read map_err, the UTF-8 map_err, and the
split_front_matter ok_or_else), which can leak PII; change those error messages
to avoid printing the full path by using a sanitized/redacted representation
(e.g., a small helper like sanitize_path(abs_path) or only
abs_path.file_name().and_then(|s| s.to_str()).unwrap_or("<redacted>")) and use
that redacted value in the anyhow! calls in read_chunk_file so
split_front_matter, std::fs::read and UTF-8 errors no longer expose the full
path.
Summary
- `.md` files on disk; SQLite holds pointer + sha256 + 500-char preview + state. Body bytes are immutable; sha is over body only; front-matter is mutable for tag rewrites.
- Participant bucketing: `email/<participants_slug>/<chunk_id>.md`.
- Per-tree label strategies (`ExtractFromContent`/`UnionFromChildren`/`Empty`), summariser/extractor split, manual `memory_tree.trigger_digest` RPC.
- Message-aware chunker: chat (`^##` boundaries) and email (`^---\nFrom:` boundaries); document fallback unchanged. Pack-to-budget greedy emit so a thread's emails stay together when they fit.
- `summaries/{source|global|topic}/<scope>/L<n>/<id>.md`. After-extract tag rewrite merges entities + topics into the summary's front-matter `tags:` (atomic, body-untouched).
- `gmail-backfill-3d`: paginates `GMAIL_FETCH_EMAILS` via Composio, ingests last N days, drains the queue, re-hashes every chunk + summary file against `content_sha256`. Validated against real Gmail (40 chunks, 2 summaries, 100% sha-verified).
Problem
Today every chunk and summary body lives inline in
`mem_tree_chunks.text` / `mem_tree_summaries.summary`. That couples three things that should be decoupled: (1) human/portable artifact, (2) deterministic identity, (3) state machine. Specific consequences: `ingest_chat` / `ingest_email` calls wait for the LLM scoring + summariser. The async pipeline (feat(memory): async pipeline + per-tree label fixes + summariser split + memory tree enabled at app startup #997) decoupled this; this PR builds on that foundation.
Solution
Storage layer. New
`content_store/` module owns paths, compose, atomic write (tempfile + fsync + rename), read-with-sha-verify, and tag rewrites. Body bytes hashed; YAML front-matter is mutable metadata that doesn't enter the hash. Atomicity contract: write file first, then SQLite tx commits the pointer — file always exists before any row claims to point at it; transient orphan MDs cleanable by janitor.
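A sketch of that write-then-commit ordering with the `tempfile` crate (the name `write_if_new` follows the PR; the body is an assumption, not the repo's implementation):

```rust
use std::io::Write;
use std::path::Path;

fn write_if_new(final_path: &Path, body: &[u8]) -> anyhow::Result<()> {
    if final_path.exists() {
        return Ok(()); // idempotent re-stage; the real code also re-verifies the sha
    }
    let dir = final_path.parent().expect("chunk paths always have a parent");
    std::fs::create_dir_all(dir)?;
    let mut tmp = tempfile::NamedTempFile::new_in(dir)?;
    tmp.write_all(body)?;
    tmp.as_file().sync_all()?; // flush file bytes to disk first
    // Atomic rename onto the final name; only after this succeeds does the
    // SQLite tx commit the pointer row, so a row never points at a missing file.
    tmp.persist(final_path).map_err(|e| e.error)?;
    // Review item #8: on Unix the parent directory is also fsynced after the
    // rename so the new dirent itself survives a crash (omitted here).
    Ok(())
}
```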
Bucketing. `bucket_by_participants(messages)` produces source trees keyed by the sorted set of distinct participants (`from` ∪ `to_list`). All correspondence between Alice and Bob lands in one tree regardless of subject thread, regardless of direction.
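A sketch of the bucketing key derivation (the message shape is simplified for illustration; the real `bucket_by_participants` works on the canonicalised email types):

```rust
use std::collections::{BTreeMap, BTreeSet};

struct Msg {
    from: String,
    to_list: Vec<String>,
}

fn bucket_by_participants(messages: Vec<Msg>) -> BTreeMap<String, Vec<Msg>> {
    let mut buckets: BTreeMap<String, Vec<Msg>> = BTreeMap::new();
    for msg in messages {
        // BTreeSet gives dedup + deterministic (sorted) ordering in one go,
        // so direction doesn't matter: {alice, bob} == {bob, alice}.
        let mut participants = BTreeSet::new();
        participants.insert(msg.from.to_lowercase());
        participants.extend(msg.to_list.iter().map(|a| a.to_lowercase()));
        let key = participants.into_iter().collect::<Vec<_>>().join("+");
        buckets.entry(key).or_default().push(msg);
    }
    buckets
}

fn main() {
    let a2b = Msg { from: "alice@x.com".into(), to_list: vec!["bob@x.com".into()] };
    let b2a = Msg { from: "bob@x.com".into(), to_list: vec!["alice@x.com".into()] };
    assert_eq!(bucket_by_participants(vec![a2b, b2a]).len(), 1);
}
```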
Chunker. Message-aware splitter dispatches on `SourceKind`. Chat splits at H2 headings, email at `---\nFrom:` blocks, document falls through to paragraph splitting. Greedy-pack messages up to `DEFAULT_CHUNK_MAX_TOKENS = 3000` (sized below `TOKEN_BUDGET = 4500` so each L0 buffer accumulates ~1-3 chunks before sealing — natural pacing for the local 1B summariser, which produces noticeably better summaries with ≤4-5k input than at the previous 10k cap).
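A sketch of the greedy pack-to-budget loop (the token heuristic and string-based unit type are assumptions; the real chunker dispatches on `SourceKind` first and sub-splits oversized single units):

```rust
const DEFAULT_CHUNK_MAX_TOKENS: u32 = 3000;

fn approx_token_count(s: &str) -> u32 {
    // Common cheap heuristic (~4 chars per token); the real chunker may differ.
    (s.chars().count() as u32 + 3) / 4
}

fn pack_units(units: Vec<String>) -> Vec<String> {
    let mut chunks = Vec::new();
    let (mut acc, mut acc_tokens) = (String::new(), 0u32);
    for unit in units {
        let t = approx_token_count(&unit);
        // Flush before appending if this unit would cross the budget,
        // so whole messages stay together inside a chunk when they fit.
        if acc_tokens > 0 && acc_tokens + t > DEFAULT_CHUNK_MAX_TOKENS {
            chunks.push(std::mem::take(&mut acc));
            acc_tokens = 0;
        }
        if !acc.is_empty() {
            acc.push_str("\n\n");
        }
        acc.push_str(&unit);
        acc_tokens += t;
    }
    if !acc.is_empty() {
        chunks.push(acc); // emit the trailing partial chunk
    }
    chunks
}
```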
Email body cleanup. `canonicalize/email.rs` now calls `email_clean::clean_body` on each message body — strips quoted reply chains (On <date>, <name> wrote:), forwarded-message separators, and footer noise (Unsubscribe, copyright lines, address blocks).
Summary MD path layout.
- `summaries/source/<participants_slug>/L<n>/<id>.md` — per source tree
- `summaries/global/<yyyy-mm-dd>/L<n>/<id>.md` — per daily digest
- `summaries/topic/<entity_slug>/L<n>/<id>.md` — per topic tree (lazy-materialised)
Tag rewrites. Post-extract handler walks `mem_tree_entity_index` rows for the just-sealed node, formats each as `<kind>/<surface_slug>` (Obsidian hierarchical tag format), atomically rewrites the YAML `tags:` block in the on-disk file. Body bytes pass through byte-for-byte unchanged. Summary extraction emits topics (`emit_topics=true`) so summary tags include both entities and topics; chunk extraction emits entities only.
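A sketch of the `<kind>/<surface_slug>` formatting (the slug rules here are simplified assumptions, not the repo's exact slugifier):

```rust
fn slugify(surface: &str) -> String {
    surface
        .chars()
        .map(|c| if c.is_alphanumeric() { c } else { '-' })
        .collect::<String>()
        .trim_matches('-')
        .to_string()
}

fn format_tag(kind: &str, surface: &str) -> String {
    format!("{kind}/{}", slugify(surface))
}

fn main() {
    // Obsidian renders these as nested tags in its tag pane.
    assert_eq!(format_tag("person", "Alex Johnson"), "person/Alex-Johnson");
    assert_eq!(format_tag("topic", "incident 2604"), "topic/incident-2604");
}
```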
Bounded retry on transport failures. `LlmEntityExtractor::extract` now retries up to 3 times with exponential backoff (250ms / 500ms / 1000ms) on transport-level failures before falling back to empty extraction. Caught a single Ollama hiccup during validation that left one summary's tags empty — would now self-heal.
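A sketch of that retry schedule (blocking for illustration; the real extractor is async and only retries transport-level failures):

```rust
use std::time::Duration;

fn extract_with_retry<T>(mut call: impl FnMut() -> anyhow::Result<T>) -> Option<T> {
    const BACKOFFS_MS: [u64; 3] = [250, 500, 1000];
    let mut attempt = 0;
    loop {
        match call() {
            Ok(v) => return Some(v),
            Err(e) if attempt < BACKOFFS_MS.len() => {
                log::warn!("extract attempt {} failed: {e:#}", attempt + 1);
                std::thread::sleep(Duration::from_millis(BACKOFFS_MS[attempt]));
                attempt += 1;
            }
            // Out of retries: caller falls back to empty extraction.
            Err(_) => return None,
        }
    }
}
```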
person/Alex-Johnson,org/Tinyhumans-AI,topic/incident-2604,gmail,ingested) so the content directory is directly browsable as an Obsidian vault —tags:field renders as clickable nested tags in the tag pane.Submission Checklist
memory::tree, 60 incomposio::providers::gmail. Coverage includes participants bucketing, message-aware chunker boundaries, content_store atomic write, sha verification, summary compose with all three tree kinds, tag rewrite (entities + topics merge, body bytes unchanged sanity check), retry semantics on transport failure.cargo run --bin gmail-backfill-3d -- --days 3against real Gmail via Composio. Validated 40 chunks + 2 source-tree L1 summaries; every file's body sha re-verified against the storedcontent_sha256(100% match). Manual digest trigger via existingmemory_tree.trigger_digestRPC.content_store/. Inline comments on the path-collision invariant, the atomic-write ordering, the body-immutable / front-matter-mutable split, and the_solo_bucketing fallback for messages without participants.Impact
Runtime. Desktop only (Tauri shell unaffected; this is core-side). The async-pipeline foundation enables non-blocking ingest. Per-chunk + per-seal MD writes are local fs operations (tens of microseconds); negligible compared to the LLM call cost they accompany.
Performance. Chunker max 3k + seal budget 4.5k produces ~1.5× more chunks per ingest than the previous 10k cap (acceptable for fine-grained retrieval), and ~3× more L1 seals (good for summary quality on the local 1B model — it produces noticeably better summaries on smaller inputs).
Migration. Schema migrations are additive + nullable (
content_path TEXT,content_sha256 TEXTon bothmem_tree_chunksandmem_tree_summaries). Legacy rows read back withcontent_path = NULLand full text insummary— read paths fall back gracefully. New writes always go through file-first / pointer-in-row.Compatibility.
mem_tree_summaries.summarycolumn shrinks to a 500-char preview at write time (mirrorsmem_tree_chunks.text). Existing rows untouched. Re-ingest after this PR rotates chunk_ids (chunker output changes) — bin's--wipeflag clears state for clean re-ingest.Security. No new external surfaces. Filesystem writes scoped to
content_root(env > config > workspace default). Atomic write contract closes the partial-write window that would have shown half-baked summaries to consumers.Related
ExtractSummaryjob kind to remove the seal-time inline extraction call entirely (today: 3-attempt retry; ideal: queue-backed retry with cascade-aware ordering)self_emailconfig knob to drop the user's address from participant bucket keys (cleaner folder names)mem_tree_entity_indexfrom MD front-matterchunker_max_tokensandseal_token_budgetconfig-driven so operators can tune without recompilingSummary by CodeRabbit
New Features
Improvements