feat(memory): MD-on-disk content + participant bucketing + async pipeline foundation#1008
Conversation
…ct, quantity
The previous LLM extractor schema only emitted person / organization /
location / event / product. That covers classic NER but leaves real
gaps for chat content:
* `datetime` — "Friday", "Q2 2026", "EOD tomorrow"
(deadlines mentioned inside chunks)
* `technology` — "Rust", "OAuth", "Slack API", "nomic-embed"
(tools / frameworks / languages / services)
* `artifact` — "PR tinyhumansai#934", "src/openhuman/...", "OH-42"
(code / ticket / doc references)
* `quantity` — "$5K", "20/min", "10k tokens"
(amounts / metrics / money)
Without these kinds, the affected mentions either get dropped or fall
into Misc — both flatten retrieval. This adds them to the EntityKind
enum, the LLM extractor's allowed_kinds + parse_kind synonyms, and the
SYSTEM_PROMPT (with a Kinds guide and an explicit "skip URLs/emails —
those are extracted by regex" instruction so the model doesn't double
up with the mechanical extractors).
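For context, a minimal sketch of the shape this change takes; the variant list matches this message, but the synonym sets in `parse_kind` are illustrative assumptions, not the exhaustive shipped mapping:

```rust
// Sketch only: the real EntityKind lives in the extract module.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EntityKind {
    Person,
    Organization,
    Location,
    Event,
    Product,
    // New kinds from this commit:
    Datetime,   // "Friday", "Q2 2026", "EOD tomorrow"
    Technology, // "Rust", "OAuth", "Slack API", "nomic-embed"
    Artifact,   // code / ticket / doc references
    Quantity,   // "$5K", "20/min", "10k tokens"
}

/// Map an LLM-emitted kind string (plus common synonyms) onto EntityKind.
/// The synonym lists here are assumptions for illustration.
pub fn parse_kind(raw: &str) -> Option<EntityKind> {
    match raw.trim().to_ascii_lowercase().as_str() {
        "person" => Some(EntityKind::Person),
        "organization" | "org" | "company" => Some(EntityKind::Organization),
        "location" | "place" => Some(EntityKind::Location),
        "event" => Some(EntityKind::Event),
        "product" => Some(EntityKind::Product),
        "datetime" | "date" | "time" | "deadline" => Some(EntityKind::Datetime),
        "technology" | "tech" | "tool" | "framework" => Some(EntityKind::Technology),
        "artifact" | "ticket" | "pr" | "file" => Some(EntityKind::Artifact),
        "quantity" | "amount" | "metric" | "money" => Some(EntityKind::Quantity),
        _ => None, // unknown kinds are dropped rather than guessed
    }
}
```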
The probe-ratelimit / probe-pacing-ms flags from the prior commit were
auto-formatted slightly; that formatting change is included here.
Tests: extract suite 36/36, full memory_tree 337/337.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…gest trigger
Move memory-tree work onto a SQLite-backed job queue so summariser /
extractor / cascade work runs out of band of ingest.
Pipeline shape after this change:
ingest_chat → write chunk (lifecycle=pending_extraction)
→ enqueue extract_chunk
worker (×3, LLM semaphore=3):
extract_chunk → LLM extract → admit/drop → enqueue append_buffer + topic_route
append_buffer → push to L0 → enqueue seal if gate met
seal → seal one level → enqueue parent seal if cascading
→ for source trees: enqueue topic_route(Summary)
topic_route → maybe_spawn_topic_tree (counters + backfill if hot)
→ enqueue per-topic append_buffer
digest_daily → end_of_day_digest(date)
flush_stale → enqueue seal jobs for stale buffers
scheduler (1 task, daily UTC 00:05):
enqueue digest_daily(yesterday) + flush_stale (dedupe-keyed by date)
Key additions:
* mem_tree_jobs table with dedupe keys and per-status retry/backoff
(a schema sketch follows this list)
* lifecycle_status column on mem_tree_chunks
* Per-level seal jobs (each level its own crash-recovery checkpoint)
* LabelStrategy enum driving summary entity/topic population:
- source-tree seal: ExtractFromContent(regex+LLM emit_topics=true)
- global L0 + L1+: UnionFromChildren (no extra LLM call)
- topic-tree seal: Empty (scope already pins the topic)
* NodeRef::{Leaf, Summary} payload generalisation so summary entities
flow back into topic-tree spawn / route pipeline
* 30-day backfill window on topic-tree spawn aligned with hotness
recency cliff
* memory_tree.trigger_digest RPC + jobs::backfill_missing_digests
catch-up helper
* extract::build_summary_extractor honest builder (no ScoringConfig
misuse for seal-time extraction)
* Lenient day-of-week parser in email_clean for real-world MTA quirks
* Slack/Gmail canonicalisers, composio providers, slack_ingestion module
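For orientation, a hypothetical shape for the mem_tree_jobs table mentioned in the first bullet. Only the dedupe key, status, and retry/backoff behaviour come from this message; the rest of the column set is an assumption:

```rust
// Hypothetical migration string; the shipped schema may differ.
const CREATE_MEM_TREE_JOBS: &str = "
CREATE TABLE IF NOT EXISTS mem_tree_jobs (
    id           INTEGER PRIMARY KEY,
    kind         TEXT NOT NULL,         -- extract_chunk, append_buffer, seal, ...
    payload      TEXT NOT NULL,         -- JSON job arguments
    dedupe_key   TEXT UNIQUE,           -- e.g. digest_daily keyed by date
    status       TEXT NOT NULL DEFAULT 'queued', -- queued | running | done | failed
    attempts     INTEGER NOT NULL DEFAULT 0,
    run_after_ms INTEGER NOT NULL DEFAULT 0,     -- retry backoff gate
    last_error   TEXT
);
";
```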
Tests: 522 passing across memory_tree, slack_ingestion, composio_providers.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ration assertion
The memory_tree_trigger_digest controller was added in the async digest work, but tests/json_rpc_e2e.rs still expected only 3 registered methods, causing json_rpc_memory_tree_end_to_end to fail with 4 != 3.
Major
- jobs/worker.rs: log warn on `recover_stale_locks` startup failure
instead of swallowing with `let _ = ...`. Stale `running` rows are
now operator-visible at startup.
- jobs/worker.rs: persist full anyhow chain in `last_error` via
`format!("{err:#}")` so readers of mem_tree_jobs see the root cause,
matching the `{:#}` log format used three lines above.
- source_tree/bucket_seal.rs: stop hardcoding `TreeKind::Source` in
`seal_one_level`'s `SummaryContext` and `SummaryNode`. The function
also runs for topic trees (handle_seal, cascade_all_from, flush) —
topic-tree summaries were being persisted with `tree_kind='source'`,
breaking any query filtering on tree_kind. Use `tree.kind`.
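For clarity, the shape of that fix as a self-contained sketch (the struct fields are simplified stand-ins):

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TreeKind { Source, Topic, Global }

struct Tree { kind: TreeKind }
struct SummaryContext { tree_kind: TreeKind } // other fields elided

fn summary_context_for(tree: &Tree) -> SummaryContext {
    // Before: hardcoded TreeKind::Source, so topic-tree seals persisted
    // summaries with tree_kind = 'source'. After: take it from the tree.
    SummaryContext { tree_kind: tree.kind }
}
```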
Minor
- jobs/types.rs: drop `#TBD` placeholder from module doc.
- jobs/handlers/mod.rs: add `debug!` log on the no-op `entity_ids`
empty path in `handle_topic_route` for diagnosis.
- jobs/scheduler.rs: comment why `.single()` cannot return `None`
(UTC has no DST gaps).
Tests
- Add `topic_tree_seal_persists_topic_kind_not_source` regression
test guarding the bucket_seal `tree.kind` fix.
- 371/371 memory::tree tests pass · 23/23 memory::tree::jobs tests
pass · cargo fmt clean.
Cargo.lock: bump openhuman version 0.53.3 → 0.53.4 (matches Cargo.toml).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…g (Phase A)
- Add `content_dir: Option<PathBuf>` to `MemoryTreeConfig` with serde default None.
- Add `OPENHUMAN_MEMORY_TREE_CONTENT_DIR` env-var override in `apply_env_overrides`; empty string falls back to None (consistent with other memory_tree env vars).
- Add `Config::memory_tree_content_root()` helper that resolves the effective content root: explicit config path > `<workspace>/memory_tree/content/`.
- Add additive, idempotent `ALTER TABLE` migrations in `with_connection` for `mem_tree_chunks.content_path TEXT` and `mem_tree_chunks.content_sha256 TEXT`.
- Tests: default None, env-override logic, schema columns present after migration.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
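A minimal sketch of the resolution order; for brevity the env override is folded into the helper here, whereas in the real code it is applied in apply_env_overrides:

```rust
use std::path::{Path, PathBuf};

struct MemoryTreeConfig {
    content_dir: Option<PathBuf>, // serde default: None
}

fn memory_tree_content_root(cfg: &MemoryTreeConfig, workspace: &Path) -> PathBuf {
    // Env override wins; an empty string falls back to None, consistent
    // with the other memory_tree env vars.
    if let Ok(v) = std::env::var("OPENHUMAN_MEMORY_TREE_CONTENT_DIR") {
        if !v.is_empty() {
            return PathBuf::from(v);
        }
    }
    // Explicit config path > <workspace>/memory_tree/content/
    cfg.content_dir
        .clone()
        .unwrap_or_else(|| workspace.join("memory_tree").join("content"))
}
```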
- Add `partial_message: bool` to `Chunk` (transient; always false in DB)
- Rewrite chunker to dispatch on SourceKind: Chat splits at `## ` boundaries, Email at `---\nFrom:` boundaries, Document uses paragraph budget packing; oversize units sub-split with partial_message=true
- Strip leading `# ...` headers from chat, email, and document canonicalisers (front-matter will carry that info in Phase C)
- Update all 22 Chunk struct literals across test modules to include partial_message field
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
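A lossy sketch of the dispatch; the real chunker keeps the boundary text with each unit and then budget-packs, this only shows the per-kind split points:

```rust
enum SourceKind { Chat, Email, Document }

fn split_units(kind: SourceKind, markdown: &str) -> Vec<String> {
    let boundary = match kind {
        SourceKind::Chat => "\n## ",         // one unit per chat message
        SourceKind::Email => "\n---\nFrom:", // one unit per email in a thread
        SourceKind::Document => "\n\n",      // paragraphs, packed to a budget
    };
    markdown
        .split(boundary)
        .map(|u| u.trim().to_string())
        .filter(|u| !u.is_empty())
        .collect()
    // Oversize units then go through split_by_token_budget with
    // partial_message = true (not shown).
}
```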
…atter (Phase C)
- New src/openhuman/memory/tree/content_store/ module with 5 sub-files:
  - paths.rs: chunk_rel_path / chunk_abs_path / slugify_source_id
  - compose.rs: YAML front-matter composition, split_front_matter, rewrite_tags
  - atomic.rs: write_if_new (tempfile+fsync+rename), sha256_hex (body-only)
  - read.rs: read_chunk_file + verify_chunk_file
  - tags.rs: update_chunk_tags, slugify_tag_kind/value, entity_tag
- StagedChunk struct + stage_chunks() orchestrator in mod.rs
- 27 unit tests passing across all sub-modules
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
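The tempfile+fsync+rename pattern in atomic.rs, sketched with simplified error handling; the real helper's exact signature may differ:

```rust
use std::fs::{self, File};
use std::io::Write;
use std::path::Path;

/// Write bytes to abs_path only if it does not exist yet: first write wins,
/// re-runs are no-ops, and readers never observe a half-written file.
fn write_if_new(abs_path: &Path, bytes: &[u8]) -> std::io::Result<bool> {
    if abs_path.exists() {
        return Ok(false); // idempotent
    }
    if let Some(parent) = abs_path.parent() {
        fs::create_dir_all(parent)?;
    }
    let tmp = abs_path.with_extension("tmp");
    let mut f = File::create(&tmp)?;
    f.write_all(bytes)?;
    f.sync_all()?; // flush contents before the rename makes them visible
    fs::rename(&tmp, abs_path)?; // atomic on POSIX filesystems
    Ok(true)
}
```

(The review below additionally asks for a parent-directory fsync after the rename so the directory entry itself survives a crash.)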
…rite (Phase D)
- ingest::persist() calls content_store::stage_chunks() before the SQLite tx, so chunk .md files land on disk before DB pointers are written
- store: add upsert_staged_chunks_tx() writing content_path + content_sha256; add get_chunk_content_path() for post-extraction tag lookup
- handle_extract: after tx.commit(), convert entity IDs to Obsidian tags (kind/Value) and rewrite the chunk .md file's tags: block atomically
- Update ingest test: Phase B chunker produces one chunk per message (2 messages → 2 chunks); adjust assertions accordingly
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
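The ordering Phase D relies on, as a stub sketch (all helpers here are stand-ins for the real module APIs):

```rust
use anyhow::Result;

struct Chunk;
struct StagedChunk;

fn stage_chunks(_c: &[Chunk]) -> Result<Vec<StagedChunk>> { Ok(Vec::new()) }
fn upsert_staged_chunks_tx(_s: &[StagedChunk]) -> Result<()> { Ok(()) }
fn enqueue_extract_jobs_tx(_s: &[StagedChunk]) -> Result<()> { Ok(()) }

fn persist(chunks: &[Chunk]) -> Result<()> {
    // 1. Disk first: .md files land before any DB row points at them.
    let staged = stage_chunks(chunks)?;
    // 2.+3. Pointer rows and extract jobs commit together, so a crash
    //    between the steps leaves orphan files, never dangling pointers.
    upsert_staged_chunks_tx(&staged)?;
    enqueue_extract_jobs_tx(&staged)?;
    Ok(())
}
```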
…line (Phase E)
- New [[bin]] gmail-backfill-3d: authenticates via Composio, paginates GMAIL_FETCH_EMAILS, groups by thread_id, drives ingest_email per thread (content_store writes .md files + SQLite), drains the worker pool, then runs a SHA-256 integrity check against all content_path rows
- store: add get_chunk_content_pointers() returning (content_path, sha256) as a public helper for the binary's verify loop
- Cargo.toml: register gmail-backfill-3d [[bin]] entry (autobins=false)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
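A sketch of what the verify step could look like. Note the stored sha covers body bytes only, so front-matter has to be stripped before hashing; the split helper below is a simplified stand-in for split_front_matter:

```rust
use std::fs;
use std::path::Path;
use sha2::{Digest, Sha256};

fn body_of(file: &str) -> &str {
    file.strip_prefix("---\n")
        .and_then(|rest| rest.split_once("\n---\n"))
        .map(|(_, body)| body)
        .unwrap_or(file)
}

/// `pointers` stands in for store::get_chunk_content_pointers() rows of
/// (content_path, sha256). Returns (ok, failed) counts.
fn verify(content_root: &Path, pointers: &[(String, String)]) -> (usize, usize) {
    let (mut ok, mut bad) = (0, 0);
    for (rel_path, expected) in pointers {
        match fs::read_to_string(content_root.join(rel_path)) {
            Ok(text) => {
                let digest = Sha256::digest(body_of(&text).as_bytes());
                let actual: String = digest.iter().map(|b| format!("{b:02x}")).collect();
                if &actual == expected { ok += 1 } else { bad += 1 }
            }
            Err(_) => bad += 1, // missing file is an integrity failure
        }
    }
    (ok, bad)
}
```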
- Drop three unused imports left from earlier phases (tokio::fs::File, super::atomic::write_if_new, source_tree::append_leaf_deferred).
- Add --wipe to gmail-backfill-3d so the bin can re-run cleanly after the message-aware chunker change invalidates existing chunk IDs: removes chunks.db (+ wal/shm) and <content_root>/chunks/ before fetch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…n cleanup
## What changed (6 changes)
1. **bucket_by_sender_and_thread solo fallback** (`gmail/ingest.rs`):
Messages without `threadId` now share a single `_solo_` bucket per
sender instead of getting a unique `solo:{message_id}` bucket. Fixes
the root-cause where each threadless message produced a separate
`source_id` and therefore a separate path subtree.
2. **Per-thread source_id in ingest_page_into_memory_tree** (`gmail/ingest.rs`):
Dropped the `source_id: &str` parameter. The function now derives
`source_id = \"gmail:{sender}:{thread_id}\"` per (sender, thread) pair
so every thread lands in its own path subtree. Made `ingest` a `pub mod`
so the bin can import it directly.
3. **Path scheme — sender/thread/chunk_id** (`content_store/paths.rs`):
`chunk_rel_path` is now source-kind-aware. Email: parses
`gmail:{sender}:{thread}` into 4-level path
`email/<sender_slug>/<thread_slug>/<chunk_id>.md`. Chat/Document: 3-level
`<kind>/<source_slug>/<chunk_id>.md` (chunk_id replaces seq as filename).
Malformed email source_id falls back to 2-level layout without panicking.
Added `email_clean` to canonicalize module exports so the type is reachable.
4. **Chunker greedy packing** (`chunker.rs`):
Chat and Email branches now pack consecutive units into a single chunk
until adding the next would exceed `max_tokens` (default 10k). Oversize
single units still fall back to `split_by_token_budget` with
`partial_message = true`. Document branch unchanged (packing sketched
after this list).
5. **Bin cleanup** (`bin/gmail_backfill_3d.rs`):
Deleted `ingest_page`, `bucket_by_thread`, `raw_to_email_message`,
`parse_date`, `parse_addr_list`, `strip_re_fwd` and their tests (~185
lines of duplicated logic). Main loop now calls
`ingest_page_into_memory_tree` directly. Bin LOC: 576 → 391.
6. **Front-matter email fields** (`content_store/compose.rs`):
Email chunks now emit `aliases`, `sender`, and `thread_id` YAML fields
parsed from the `gmail:{sender}:{thread_id}` source_id. SHA-256 is
still over body bytes only (front-matter changes don't invalidate hash).
TODO: thread_subject requires carrying it through Metadata.
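A sketch of the greedy packing from item 4, with a crude whitespace token count standing in for the real tokenizer and the oversize sub-split elided:

```rust
fn pack_units(units: Vec<String>, max_tokens: usize) -> Vec<String> {
    let tokens = |s: &str| s.split_whitespace().count(); // stand-in count
    let mut chunks: Vec<String> = Vec::new();
    let mut current = String::new();
    for unit in units {
        // Close the current chunk if adding this unit would exceed budget.
        if !current.is_empty() && tokens(&current) + tokens(&unit) > max_tokens {
            chunks.push(std::mem::take(&mut current));
        }
        if !current.is_empty() {
            current.push_str("\n\n");
        }
        current.push_str(&unit);
        // Oversize single units would fall back to split_by_token_budget
        // with partial_message = true (not shown).
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}
```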
## Why
Path collisions caused 9 surviving files out of 44 expected chunks on the
first backfill run. All three root causes fixed: (a) solo bucket per message
→ unified _solo_ bucket, (b) single source_id for all threads → per-thread
source_id, (c) seq-as-filename → chunk_id-as-filename.
## Bin LOC
Before: 576 lines. After: 391 lines.
## cargo check
Passes clean. All pre-existing clippy warnings are in unrelated files
(threads/ops.rs, voice/server.rs, webview_accounts/ops.rs, etc.) and
are intentionally not fixed in this commit.
## Tests
- memory::tree: 431 passed, 0 failed
- composio::providers::gmail: 57 passed, 0 failed
- Updated ingest_chat_writes_and_queue_drains_to_admitted_chunk to expect
1 packed chunk (was 2, before greedy packing).
## New path formats
- Email: email/<sender_slug>/<thread_slug>/<chunk_id>.md
- Chat: chat/<source_slug>/<chunk_id>.md
- Document: document/<source_slug>/<chunk_id>.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…gration
- Bucket by participants (sender ∪ to-list, sorted, deduped, CC ignored).
Drops threadId from grouping entirely; the key derivation is sketched below.
- source_id = gmail:{participants} (e.g. gmail:alice@x.com|bob@y.com)
- Path: email/<participants_slug>/<chunk_id>.md (single layer, no thread folder)
- canonicalize/email.rs now calls clean_body on each message body to strip
reply chains, marketing footers, legal disclaimers, and template noise
- Front-matter: replace sender/thread_id with participants: YAML list +
human-readable aliases (first <-> second, or first <-> N others for groups)
- chunk_ids rotate (content + source_id both change); --wipe required for re-ingest
- All new tests pass (435 memory::tree + 60 composio::gmail); cargo check clean
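For reference, the bucket-key derivation from the first bullet as a sketch; address extraction is simplified to already-parsed plain addresses:

```rust
/// sender ∪ to-list, lowercased, deduped, sorted; CC ignored; no threadId.
fn participants_key(sender: &str, to: &[&str]) -> String {
    let mut all: Vec<String> = std::iter::once(sender)
        .chain(to.iter().copied())
        .map(|a| a.trim().to_ascii_lowercase())
        .filter(|a| !a.is_empty())
        .collect();
    all.sort();
    all.dedup();
    format!("gmail:{}", all.join("|"))
}

// participants_key("Bob@y.com", &["alice@x.com", "bob@y.com"])
//   == "gmail:alice@x.com|bob@y.com"
```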
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the chunk pattern: bodies on disk under `summaries/<source|global|topic>/...`, SQLite holds pointer + sha + 500-char preview. All three seal paths (source/global/topic) call stage_summary before upsert_summary_tx. handle_seal calls update_summary_tags after committing entity_index. gmail-backfill-3d bin's verify step extended to summaries. Schema migration is additive + nullable; legacy rows still readable. Summary filenames sanitize ':' → '-' for cross-platform NTFS compatibility.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sized for the local 1B summariser (gemma3:1b-it-qat), which produces noticeably better summaries with ≤4-5k input than at the previous 10k cap. Chunker max sits below the seal budget so each L0 buffer accumulates roughly 1-3 chunks before sealing — natural pacing.
- DEFAULT_CHUNK_MAX_TOKENS: 10_000 → 3_000
- TOKEN_BUDGET (L0 seal): 10_000 → 4_500
- Test fixtures rewritten to use TOKEN_BUDGET-relative sizes (chunk_count = TOKEN_BUDGET * 6 / 10) so they auto-track the constants if we tune again.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
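The constants and the fixture rule spelled out; the module placement is an assumption:

```rust
pub const DEFAULT_CHUNK_MAX_TOKENS: usize = 3_000; // was 10_000
pub const TOKEN_BUDGET: usize = 4_500;             // L0 seal budget, was 10_000

// Fixture sizing mirrors the commit message so tests auto-track retuning:
pub const fn fixture_chunk_count() -> usize {
    TOKEN_BUDGET * 6 / 10
}
```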
Previously a single transport failure (Ollama briefly down, transient connection refused, DNS hiccup) would silently return empty extraction with no retry. For chunks this is OK because the whole job retries via the async pipeline, but for summary extraction (which runs INLINE in seal_one_level) a one-shot transport failure leaves the summary with empty entities + topics permanently — no async retry path.
Now: try up to 3 attempts with exponential backoff (250ms / 500ms / 1000ms ≈ 1.75s total) before falling back to empty. Non-transport failures (HTTP non-success, malformed JSON) still fall back immediately because retrying the same input yields the same bad response. Both summary-time and chunk-time extraction benefit since they share the same LlmEntityExtractor. Cascade ordering is preserved: parent seals always see populated entity_index from children.
Caught by the v8 backfill run: Otter summary's seal hit a transport hiccup at exactly the wrong millisecond, leaving its tags empty while the GitHub summary (sealed seconds later) extracted cleanly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
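A sketch of the retry shape described above; the error type and LLM call are stand-ins for the extractor internals (tokio assumed for sleeping):

```rust
use std::time::Duration;

struct LlmError { transport: bool }
impl LlmError {
    fn is_transport(&self) -> bool { self.transport }
}

// Stand-in for the Ollama call inside LlmEntityExtractor.
async fn call_llm(_input: &str) -> Result<String, LlmError> {
    Err(LlmError { transport: true })
}

/// Up to 3 attempts with 250ms / 500ms / 1000ms backoff (~1.75s total),
/// transport failures only; bad responses fail fast because retrying the
/// same input yields the same bad response.
async fn extract_with_retry(input: &str) -> Option<String> {
    for attempt in 0u32..3 {
        match call_llm(input).await {
            Ok(parsed) => return Some(parsed),
            Err(e) if !e.is_transport() => return None,
            Err(_) if attempt < 2 => {
                tokio::time::sleep(Duration::from_millis(250u64 << attempt)).await
            }
            Err(_) => break,
        }
    }
    None // fall back to empty extraction
}
```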
📝 Walkthrough
Adds a disk-backed MD content store, an SQLite job queue + worker pool for async memory-tree processing, Gmail backfill CLI and ingest logic, email/chat/document canonicalisation and chunking changes, label/summary staging, scheduling (daily digest/flush), and many schema, retrieval, and test updates.
Sequence Diagram(s)
```mermaid
sequenceDiagram
participant Client
participant Ingest
participant ContentStore
participant DB
participant MemJobs
participant Scheduler
participant WorkerPool
participant Handler
Client->>Ingest: submit(source)
Ingest->>Ingest: score_chunks_fast()
Ingest->>ContentStore: stage_chunks() (atomic write)
Ingest->>DB: upsert_staged_chunks_tx()
Ingest->>MemJobs: enqueue(ExtractChunk)
Ingest-->>Client: return (queued)
Scheduler->>MemJobs: enqueue(digest_daily / flush_stale)
Scheduler->>WorkerPool: wake_workers()
WorkerPool->>MemJobs: claim_next()
MemJobs-->>WorkerPool: job
WorkerPool->>Handler: handle_job(job)
alt ExtractChunk
Handler->>ContentStore: read_chunk_body()
Handler->>DB: persist scores, set lifecycle
Handler->>MemJobs: enqueue(AppendBuffer, TopicRoute)
else AppendBuffer
Handler->>DB: insert leaf (L0 buffer)
Handler->>MemJobs: enqueue(Seal) if gate met
else Seal
Handler->>ContentStore: stage_summary()
Handler->>DB: insert_summary_tx(staged)
Handler->>MemJobs: enqueue(parent_seal / topic_route)
else TopicRoute
Handler->>DB: resolve entities, ensure topic trees
Handler->>MemJobs: enqueue(AppendBuffer per entity)
else Digest/Flush
Handler->>DB: union inputs, insert daily summary
Handler->>ContentStore: stage_summary()
end
WorkerPool->>MemJobs: mark_done()
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 5 of 5 passed.
…n line
Pure formatting; no semantic changes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A previous cleanup commit dropped File from the tokio::fs use list believing it was unused — but it's referenced by the directory-fsync helper at load.rs:471. Caught by upstream CI clippy (E0433: undeclared type `File`).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Actionable comments posted: 20
Note
Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
src/openhuman/memory/tree/retrieval/drill_down.rs (1)
259-272: ⚠️ Potential issue | 🟡 Minor
Formatting gate is failing CI.
`cargo fmt --check` is failing around Line 259 and Line 271. Please run `cargo fmt --all` so the PR passes the type-check workflow.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/retrieval/drill_down.rs` around lines 259 - 272, Formatting in the block that constructs a chunk and calls upsert_chunks and append_leaf (see functions upsert_chunks and append_leaf and the LeafRef construction with fields token_count, seq_in_source, created_at/ timestamp, partial_message) is not rustfmt-compliant; run rustfmt by executing cargo fmt --all and commit the resulting changes so the crate passes cargo fmt --check (ensure the struct literal formatting and trailing commas match rustfmt output around the chunk creation and LeafRef append).src/openhuman/memory/tree/source_tree/store.rs (1)
202-244: ⚠️ Potential issue | 🔴 Critical
This truncates every staged summary for all current readers.
Once `staged` is `Some`, `mem_tree_summaries.content` only keeps the first 500 chars, but the existing read path in this module still returns `content` directly and never hydrates `content_path`. That means staged summaries are now recursively summarised and retrieved from previews instead of full bodies. Please either hydrate the markdown file on read before building `SummaryNode`, or keep writing full `node.content` here until all readers switch to the on-disk path.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/source_tree/store.rs` around lines 202 - 244, The current insert into mem_tree_summaries truncates staged content via content_preview when staged is Some, causing readers to get only the 500-char preview; instead preserve the full node.content when writing the DB until readers are updated. Change the staged match (variables staged, content_preview, content_path, content_sha256) so that the value inserted for the content column is node.content (not the 500-char preview) while still setting content_path and content_sha256 from the staged struct; keep the preview logic only for any separate preview column or consumers that explicitly need it, or alternatively implement hydration in the SummaryNode read path later (references: staged, content_preview, node.content, content_path, content_sha256, the tx.execute INSERT into mem_tree_summaries, and the SummaryNode read/hydrate code).src/openhuman/memory/tree/topic_tree/backfill.rs (1)
162-173: ⚠️ Potential issue | 🟡 Minor
Don't count no-op `append_leaf` calls as appended leaves.
`append_leaf` is documented here as idempotent, but `appended += 1` runs unconditionally. A repeat backfill will still report those hits as newly appended, so the return value no longer matches the function contract.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/topic_tree/backfill.rs` around lines 162 - 173, The code unconditionally increments appended after calling append_leaf (in the backfill flow using LabelStrategy::Empty), but append_leaf is idempotent and may be a no-op; update the call to capture append_leaf's result and only increment appended when the result indicates a real append occurred (e.g., a boolean/Option/enum returned by append_leaf signaling change). Keep the existing .with_context(...) error handling around the await, then branch on the successful result from append_leaf (using the return value from append_leaf in the backfill function) to decide whether to do appended += 1.src/openhuman/memory/tree/ingest.rs (1)
155-203: ⚠️ Potential issue | 🟡 Minor
`chunks_written` counts upserts, not queued work.
This closure returns `n` from `upsert_staged_chunks_tx`, which is always `staged.len()`. On a duplicate re-ingest, `to_schedule` can stay empty, but the API still reports those chunks as "written and queued". Return `to_schedule.len()` here, or expose separate `stored_chunks` and `queued_chunks` counts.
Also applies to: 210-214
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/ingest.rs` around lines 155 - 203, The returned value `n` is from store::upsert_staged_chunks_tx and thus equals staged.len(), which misreports queued work when duplicates are skipped via the to_schedule HashSet; change the closure to return the number of actually scheduled chunks (to_schedule.len()) or return a tuple/struct with both stored_chunks (result of upsert_staged_chunks_tx) and queued_chunks (to_schedule.len()), and update callers accordingly; locate logic around upsert_staged_chunks_tx, the to_schedule set, score::persist_score_tx and jobs::enqueue_tx where tx.commit() is called and replace the single `n` return with the chosen corrected count(s).
🟡 Minor comments (6)
docs/memory-tree-async-pipeline.excalidraw (1)
44-44: ⚠️ Potential issue | 🟡 Minor
Use “job queue” wording for clarity.
“jobs queue” reads awkwardly; consider “job queue” for cleaner phrasing in the subtitle.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/memory-tree-async-pipeline.excalidraw` at line 44, Update the subtitle string "Leaf ingestion -> jobs queue -> workers -> source/topic/global tree building" to use the singular phrasing "job queue" instead of "jobs queue" so it reads "Leaf ingestion -> job queue -> workers -> source/topic/global tree building".
src/bin/slack_backfill.rs (1)
261-263: ⚠️ Potential issue | 🟡 Minor
Validate `--probe-ratelimit` input to reject zero calls.
0, which produces a probe run with no calls and misleading output.Proposed input validation
if let Some(n) = cli.probe_ratelimit { + if n == 0 { + bail!("--probe-ratelimit must be >= 1"); + } + // Pure quota probe: fire N back-to-backAlso applies to: 303-340
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/bin/slack_backfill.rs` around lines 261 - 263, The probe-ratelimit branch accepts 0 and produces a no-op run; add input validation where you destructure cli.probe_ratelimit (the if let Some(n) = cli.probe_ratelimit { ... } blocks) to reject zero: if n == 0 print a clear error and exit (or return Err) instead of proceeding. Apply the same check in both probe blocks (the first if let Some(n) = cli.probe_ratelimit { ... } and the later analogous block around lines ~303-340) so any --probe-ratelimit 0 is handled early and does not run a misleading probe. Ensure the error message references the flag name (--probe-ratelimit) and required positive integer semantics.src/openhuman/memory/tree/canonicalize/document.rs-42-44 (1)
42-44: ⚠️ Potential issue | 🟡 Minor
Update module docs to match the new no-header behavior.
The implementation/tests now remove the leading title header, but the file-level docs still say a title header is added. Please align docs to avoid misleading future changes.
📝 Suggested doc fix
```diff
-//! Document sources are single-record (no grouping): one Notion page, one
-//! Drive doc, one meeting-note file. The canonicaliser adds a small title
-//! header and passes through the body; if the body is already markdown it
-//! is kept verbatim.
+//! Document sources are single-record (no grouping): one Notion page, one
+//! Drive doc, one meeting-note file. Canonical output is the trimmed body
+//! only; provider/title metadata is carried outside the body (front-matter
+//! in the MD-on-disk pipeline).
```
Also applies to: 89-102
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/canonicalize/document.rs` around lines 42 - 44, Update the file-level/module documentation to reflect the new behavior that the canonicalizer no longer adds a leading `# provider — title` header; change any comments that state a title header is added to say the provider/title belong in front-matter and that the code (e.g., the canonicalization that calls md.push_str(doc.body.trim())) strips/removes such headers; apply the same wording change to the other comment block referenced around the region corresponding to lines 89–102 so all docs consistently state "no leading header; provider/title belong in MD front-matter."src/openhuman/memory/tree/retrieval/source.rs-333-345 (1)
333-345: ⚠️ Potential issue | 🟡 Minor
Run rustfmt on this file to unblock CI.
Line 333 and Line 344 are part of the formatting diff currently failing `cargo fmt --check`. Please run `cargo fmt --all` and commit the formatted output.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/retrieval/source.rs` around lines 333 - 345, The file fails rustfmt; run the formatter and commit the changes: run `cargo fmt --all` (or your IDE’s Rustfmt) and reformat this file so the blocks around upsert_chunks, append_leaf, the LeafRef struct literal, and TOKEN_BUDGET arithmetic conform to rustfmt style; then stage and commit the updated file so CI's `cargo fmt --check` passes.src/bin/gmail_backfill_3d.rs-449-455 (1)
449-455: ⚠️ Potential issue | 🟡 Minor
The summary `no_pointer` metric is currently wrong.
Returning a hard-coded `0` makes the integrity summary lie on mixed legacy/new databases, which this PR explicitly supports. Please count pointerless summary rows or drop this field from the report.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/bin/gmail_backfill_3d.rs` around lines 449 - 455, The summary field no_pointer is hard-coded to 0 which will under-report pointerless rows; replace it with an actual count by querying the DB for rows where content_path IS NULL (e.g. run SELECT COUNT(*) FROM ... WHERE content_path IS NULL using the same client used to compute rows_with_pointer) and assign that result to no_pointer (handle errors/unwrap as done for other counts), or if you prefer to remove the misleading metric simply drop no_pointer from the summary output and any references to it; locate the variable no_pointer and the code that computes rows_with_pointer to add the new COUNT query or remove the field consistently.src/openhuman/memory/tree/score/extract/llm.rs-204-206 (1)
204-206: ⚠️ Potential issue | 🟡 Minor
Run `cargo fmt --all` on this hunk.
CI is already failing `cargo fmt --check` on the `log::warn!` formatting here.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/score/extract/llm.rs` around lines 204 - 206, The log::warn! invocation in memory_tree::extract::llm is incorrectly formatted; run rustfmt (cargo fmt --all) or manually reformat the macro invocation so it follows rustfmt style (e.g., break arguments/format string onto separate lines as rustfmt expects) to satisfy cargo fmt --check and fix the CI failure.
🧹 Nitpick comments (6)
docs/memory-tree-async-pipeline.excalidraw (1)
557-557: Avoid hard-coding worker/semaphore counts in architecture docs.
`spawn 3 worker tasks` and `Semaphore(3)` can drift from runtime config. Prefer wording like “N workers” / “shared semaphore limit (configurable)” to keep the diagram accurate as settings evolve.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/memory-tree-async-pipeline.excalidraw` at line 557, Replace hard-coded numeric worker/semaphore text in the diagram labels: change "spawn 3 worker tasks" and "shared Semaphore(3) for LLM-bound work" to use configurable wording such as "spawn N workers (configurable)" and "shared semaphore limit (configurable)" so the diagram matches runtime settings; keep surrounding labels ("jobs::start(workspace_dir)", "recover_stale_locks()", "Notify wakeup + 5s polling fallback") unchanged..gitignore (1)
69-69: Consider using `/.target-codex/` for consistency with other target directory patterns.
The `.target-codex/` pattern will match at any directory level, while other target directories use either a leading slash (e.g., `/target/` to match only root) or specific paths. If this directory is only expected at the root level, using `/.target-codex/` would align with the existing conventions.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.gitignore at line 69, Update the .gitignore entry ".target-codex/" to use a rooted pattern "/.target-codex/" so it only matches the repository root like other target dirs; replace the existing ".target-codex/" pattern with "/.target-codex/" (keep the trailing slash) to maintain consistency with the project's ignore conventions.tests/json_rpc_e2e.rs (1)
824-831: Prefer subset assertion over exact controller-count equality.
`assert_eq!(controllers.len(), expected_methods.len())` makes this test brittle when new memory-tree RPCs are added.
🧪 Proposed test hardening
```diff
-    assert_eq!(controllers.len(), expected_methods.len());
     for method in &expected_methods {
         assert!(
             controllers
                 .iter()
                 .any(|controller| controller.rpc_method_name() == *method),
             "expected memory_tree controller registration for {method}"
         );
     }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/json_rpc_e2e.rs` around lines 824 - 831, The test currently enforces exact controller count with assert_eq!(controllers.len(), expected_methods.len()), which is brittle; change the assertion to verify that each entry in expected_methods is present in controllers (i.e., a subset check) instead of requiring lengths to match. Locate the variables expected_methods and controllers in the test (around the vector initialization and loop) and replace the length equality check with logic that iterates over expected_methods and asserts controllers.contains(method) (or uses a set/subset comparison) so new RPCs can be added without breaking the test.src/openhuman/memory/tree/source_tree/summariser/llm.rs (1)
318-327: Remove the now-unused prompt `budget` parameter.
`system_prompt` no longer templates length, so carrying `_budget` through `build_request` is stale API surface.
♻️ Proposed cleanup
```diff
-    fn build_request(&self, prompt_body: &str, budget: u32) -> OllamaChatRequest {
+    fn build_request(&self, prompt_body: &str) -> OllamaChatRequest {
         OllamaChatRequest {
             model: self.cfg.model.clone(),
             messages: vec![
                 OllamaMessage {
                     role: "system".to_string(),
-                    content: system_prompt(budget),
+                    content: system_prompt(),
                 },
                 OllamaMessage {
                     role: "user".to_string(),
                     content: prompt_body.to_string(),
@@
-fn system_prompt(_budget: u32) -> String {
+fn system_prompt() -> String {
@@
-        let req = s.build_request("body", 2048);
+        let req = s.build_request("body");
@@
-        let p = system_prompt(4096);
+        let p = system_prompt();
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/source_tree/summariser/llm.rs` around lines 318 - 327, The system_prompt function still accepts an unused _budget parameter; change its signature to fn system_prompt() -> String (remove the parameter) and update all call sites that pass a budget (notably the build_request calls referencing system_prompt) to call system_prompt() with no arguments; ensure any intermediate helpers (e.g., build_request or other functions that forwarded budget solely to system_prompt) drop the budget parameter or stop forwarding it so the stale API surface is removed while preserving behavior.src/openhuman/memory/tree/score/extract/mod.rs (1)
21-89: Keep `mod.rs` export-focused.
`build_summary_extractor` is operational builder logic, so it should live in a sibling module (ops.rs/builder.rs) and be re-exported here instead of expanding `mod.rs`.
As per coding guidelines, "src/openhuman/**/*.rs: Use camelCase for variable names and keep domain `mod.rs` export-focused, with operational code in `ops.rs`, `store.rs`, `types.rs`, etc."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/score/extract/mod.rs` around lines 21 - 89, The build_summary_extractor function (and its helpers/references like LlmExtractorConfig, LlmEntityExtractor, RegexEntityExtractor, CompositeExtractor) is operational logic and should be moved out of the domain module file into a sibling implementation module (e.g., ops.rs or builder.rs); create that new module, move the entire build_summary_extractor implementation there, update imports/visibility so it compiles, and then re-export it from mod.rs with a pub use so mod.rs remains export-focused and contains no operational builder code.src/openhuman/memory/tree/content_store/mod.rs (1)
29-105: Keep `content_store/mod.rs` export-focused.
This entrypoint now owns operational logic plus local tests. Please move `StagedChunk`, `stage_chunks`, and the wrapper API into sibling files such as `types.rs`/`ops.rs`, then re-export them from `mod.rs`.
As per coding guidelines, "keep domain `mod.rs` export-focused, with operational code in `ops.rs`, `store.rs`, `types.rs`, etc."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/content_store/mod.rs` around lines 29 - 105, Summary: mod.rs contains operational code and tests; split types and ops and keep mod.rs export-focused. Move the StagedChunk struct into a new sibling types.rs, move stage_chunks and the wrapper update_summary_tags into a new ops.rs (or ops.rs for functions and types.rs for the struct), migrate any local tests into ops.rs/tests, and re-export them from mod.rs via pub use crate::openhuman::memory::tree::content_store::types::StagedChunk and pub use crate::openhuman::memory::tree::content_store::ops::{stage_chunks, update_summary_tags}; ensure compose::compose_chunk_file, atomic::write_if_new, atomic::sha256_hex and paths::chunk_* references remain imported in ops.rs and update module imports accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/bin/gmail_backfill_3d.rs`:
- Around line 353-359: The code builds abs_path by blindly pushing rel_path
components into content_root (see abs_path, content_root, rel_path), which
allows path traversal; before constructing abs_path validate rel_path by
splitting into components and rejecting any component that is empty, "." or
"..", any component containing path separators or leading '/' (absolute
segments), or any non-normal component; if validation fails, skip/log the row
and do not open the file. Apply the same validation logic for the other
identical block (the one around the 409-415 area) to prevent escaping
content_root.
In `@src/bin/slack_backfill.rs`:
- Around line 319-323: The current ratelimit detection uses case-sensitive
substring checks on the local variable err and misses variants like "Too Many
Requests" or different capitalizations; update the check in the slack backfill
logic that inspects err (the r.error.as_deref().unwrap_or(...) result) to
perform a case-insensitive match or use a regex that looks for common rate-limit
tokens (e.g., "rate", "limit", "too many requests", "429", "rate_limited",
"ratelimited") and treat any match as a rate limit error; ensure you normalize
err (to_lowercase or equivalent) or compile a case-insensitive pattern and
replace the three contains(...) branches with this single robust check so other
flow logic (the code that currently handles this branch) continues to trigger
correctly.
- Around line 261-293: The probe mode block triggered by cli.probe_ratelimit
runs before the normal connection selection and can target the wrong Slack
connection; modify the logic so probe mode respects the same connection
filtering as the main path (i.e., use the existing cli.connection_id selection
behavior rather than unconditionally calling SLACK_LIST_CONVERSATIONS), either
by moving the probe_ratelimit handling after the connection selection code or by
adding a guard that validates the discovered channel belongs to the selected
connection (reusing the same filtering logic used later for cli.connection_id)
before firing client.execute_tool("SLACK_LIST_CONVERSATIONS", ...) and running
the N calls.
In `@src/openhuman/composio/providers/gmail/ingest.rs`:
- Around line 205-243: The logs currently emit raw PII via participants,
source_id, and owner; instead compute and use redacted values (e.g.,
participant_count and a stable hash/slug of participants and owner like
sha256_hex(participants) or sha256_hex(owner)) and replace the log calls that
reference participants, source_id, and owner (the log::debug!, log::info!, and
log::warn! lines around the EmailThread creation and ingest_email call) to print
only the count and hashed/slugged identifiers; keep source_id as-is for internal
use if needed but never emit its raw value in logs—emit the hashed_source_id and
hashed_owner instead. Ensure the same hashed format is used consistently across
all three logging sites so you can correlate events without exposing PII.
- Around line 63-87: The participants_bucket_key currently returns the same
literal "unknown" when from and to fail to parse, which collapses unrelated
messages; change the fallback to use a message-specific identifier: after
computing all and finding it empty, check raw.get("threadId").and_then(|v|
v.as_str()) and use that (lowercased) as the bucket key if present, else check
raw.get("id").and_then(|v| v.as_str()) and use that, and only if neither exists
fall back to the literal "unknown"; keep using the existing helpers
(extract_email, parse_address_list_for_bucket) and preserve sorting/dedup
behavior, but replace the final if-branch to prefer "threadId" then "id" as the
unique fallback instead of always returning "unknown".
In `@src/openhuman/config/schema/load.rs`:
- Line 16: The unix-only use of File in the sync_directory function is missing
because File was removed from the tokio::fs import; restore it by importing File
(e.g., include File in the existing import tokio::fs::{self, OpenOptions, File})
or add a cfg(unix) conditional import for std::fs::File so that sync_directory
can call File::open(path) on unix targets without compilation errors.
In `@src/openhuman/memory/tree/chunker.rs`:
- Around line 82-86: The debug log in memory_tree::chunker currently prints raw
input.source_id (alongside input.markdown.len()), which can leak PII; before
logging compute and use a redacted identifier (e.g., a short stable hash or slug
derived from input.source_id such as SHA256 or blake3 truncated to 8–12 hex
chars) and log that instead; update both occurrences that call log::debug! (the
lines referencing input.source_id and input.markdown) to replace input.source_id
with the redacted value so no full email addresses or participant data are
emitted.
In `@src/openhuman/memory/tree/content_store/atomic.rs`:
- Around line 42-76: The rename currently returns Ok(true) immediately in the
Ok(()) arm but doesn't fsync the parent directory, so a crash can lose the
directory entry; after a successful std::fs::rename(&tmp_path, abs_path) inside
the Ok(()) branch (e.g., in write_if_new/atomic rename handling in atomic.rs),
open the parent directory (abs_path.parent() or "." if none) as a File (with
read permissions), call sync_all() on that dir file and propagate/log any error
(i.e., only return Ok(true) after the directory sync succeeds), and keep the
existing logging; on failure of the dir fsync, return an error similar to the
other map_err/anyhow style used in this file.
- Around line 125-153: The early-return when abs_path.exists() must verify the
on-disk contents match the composed body before returning the staged SHA: inside
the existing if abs_path.exists() branch (around the check that currently logs
and returns a StagedSummary), open/read the file bytes at abs_path, compute the
SHA-256 of those bytes and compare to the computed sha256 for composed.full; if
they match, return the StagedSummary as before (summary_id, rel_path,
content_sha256), otherwise return a failure (propagate an Err) so callers know
the on-disk content disagrees (do not silently trust the new sha); reference the
variables/functions composed.full, sha256, abs_path, StagedSummary, and
write_if_new to locate and implement the change.
In `@src/openhuman/memory/tree/content_store/paths.rs`:
- Around line 147-152: The debug log currently prints the raw email source_id
(which may contain full participant emails); change the log to avoid PII by
logging the redacted slug or a stable hash instead of source_id. Locate the code
around slugify_source_id(source_id) and chunk_id in the fallback branch (the
block that formats "email/{}/{}.md") and replace the log argument to reference
the computed slug (or a short hash/hex of source_id) rather than source_id; keep
the rest of the message context the same so the layout and chunk_id remain
identifiable without exposing raw emails.
In `@src/openhuman/memory/tree/content_store/read.rs`:
- Around line 45-50: The warning currently logs the full filesystem path via
abs_path.display(), which may expose PII; replace that with a non-sensitive
identifier (e.g., a stable content id or a redacted version of content_path)
when calling log::warn! in this file. Locate the log::warn! calls that reference
abs_path (and the variables content_path / abs_path and contents) and change the
message to include a redacted identifier (for example a contents.id, a
SHA256-of-path, or a redaction helper like redact_path(content_path)) instead of
the full path; apply the same change to the similar warn block around the other
occurrence mentioned (lines ~93-98) so neither log prints full content-store
paths. Ensure the rest of the logged fields (expected_sha256, contents.sha256)
remain unchanged.
In `@src/openhuman/memory/tree/content_store/tags.rs`:
- Around line 111-132: The current flow builds tags from entity_ids (via
list_entity_ids_for_node and entity_tag) and then calls
compose_rewrite_summary_tags with only those entity-derived tags, dropping any
existing front-matter topic labels; change the logic to first parse the existing
summary front-matter from old_bytes (use the same parser that
compose_rewrite_summary_tags expects or add a helper), merge existing topic tags
with the entity-derived tags (preserving uniqueness and sort/dedup semantics),
and then pass the merged tag list to compose_rewrite_summary_tags(&old_bytes,
&merged_tags) so existing topics are preserved when rewriting.
In `@src/openhuman/memory/tree/jobs/handlers/mod.rs`:
- Around line 60-92: The transaction commits score/lifecycle changes in
chunk_store::with_connection (around score::persist_score_tx and the UPDATE of
mem_tree_chunks) before enqueueing downstream work, which can leave an admitted
chunk without queued jobs on crash; to fix, construct the NewJob instances for
append_buffer and topic_route (as used by handle_extract) before committing and
call store::enqueue_tx within the same tx (using the tx from
chunk_store::with_connection) so the enqueue operations are part of the same
SQLite transaction, then tx.commit() only after enqueue_tx succeeds; apply the
same change to the second occurrence noted (lines ~140-159).
In `@src/openhuman/memory/tree/jobs/store.rs`:
- Around line 141-210: mark_done and mark_failed currently update rows by id
only and can overwrite another worker's claim; add a lease/claim token to the
Job (e.g., new column lease_token or computed from attempts+started_at_ms) that
is set when the worker acquires the lock and include that token in the WHERE
clause of both mark_done and mark_failed (modify functions mark_done and
mark_failed to accept the claim token parameter and update SQL to WHERE id = ?
AND lease_token = ?); ensure the code that acquires locks writes the token, and
if the UPDATE affects 0 rows log a warning and return Ok() so stale/incorrect
claims don't corrupt retry accounting.
In `@src/openhuman/memory/tree/score/extract/llm.rs`:
- Around line 470-489: into_extracted_entities is unconditionally forwarding
self.topics into ExtractedEntities so callers with emit_topics = false still
receive topics; update the mapping to check the emit_topics flag (e.g.,
self.emit_topics or the method parameter controlling emission) and only map
self.topics into ExtractedTopic entries when that flag is true, otherwise return
an empty topics Vec; ensure you still trim/filter empty labels the same way and
construct ExtractedTopic with score as before, and return the resulting topics
into the ExtractedEntities struct.
In `@src/openhuman/memory/tree/score/extract/mod.rs`:
- Around line 67-75: The current log in the LlmEntityExtractor::new success
branch prints the user-supplied endpoint at info level (via log::info) which may
leak credentials; change this to log::debug and stop logging the full
endpoint—either omit endpoint altogether or log a redacted host only (e.g.,
strip path/credentials and show host or "[redacted]"). Update the call that
currently includes endpoint, model, timeout_ms to use debug level and remove or
replace endpoint with the sanitized value, keeping model and timeout_ms as
needed for diagnostics.
In `@src/openhuman/memory/tree/source_tree/bucket_seal.rs`:
- Around line 470-485: The current scope_slug logic in the bucket sealing code
(scope_slug, tree.scope, TreeKind::Topic, slugify_source_id) strips everything
before the first ':' for all non-topic trees and causes collisions across
different source types; change the non-Topic branch to only special-case
"gmail:" by detecting if tree.scope starts_with "gmail:" and then slugifying
just the participants portion (after the colon), otherwise call
slugify_source_id on the entire tree.scope; keep the existing Topic branch
behavior unchanged.
- Around line 501-529: The code currently creates an empty StagedSummary on
stage_summary(...) failure and continues, which results in committing a DB
summary row without an on-disk .md file; instead abort the seal so the buffer
remains for retry. Replace the fallback branch in the stage_summary match (the
Err(e) arm around stage_summary(&content_root, &compose_input, &scope_slug,
None)) to propagate/return an error from the enclosing bucket_seal function (or
otherwise short-circuit the transaction) rather than constructing StagedSummary;
reference stage_summary, node.id, compose_input, content_root, scope_slug,
staged_opt and StagedSummary so you locate the code to change. Ensure the
transaction is not committed when stage_summary fails so
content_path/content_sha256 are not persisted.
In `@src/openhuman/memory/tree/topic_tree/backfill.rs`:
- Around line 80-86: The logs in the backfill routine are printing raw entity_id
(a PII email for some topic trees); update each log::info call that includes
entity_id (the logger lines shown and the other occurrences around the backfill
flow) to redact or hash entity_id before logging—e.g., compute a short stable
hash or mask using a helper (e.g., redact_entity_id or hash_entity_id) and log
that value instead of the raw entity_id while leaving other fields such as
tree.id, BACKFILL_WINDOW_DAYS, and cutoff_ms unchanged; apply this change to all
occurrences noted (the log lines around the backfill function where entity_id is
interpolated).
---
Outside diff comments:
In `@src/openhuman/memory/tree/ingest.rs`:
- Around line 155-203: The returned value `n` is from
store::upsert_staged_chunks_tx and thus equals staged.len(), which misreports
queued work when duplicates are skipped via the to_schedule HashSet; change the
closure to return the number of actually scheduled chunks (to_schedule.len()) or
return a tuple/struct with both stored_chunks (result of
upsert_staged_chunks_tx) and queued_chunks (to_schedule.len()), and update
callers accordingly; locate logic around upsert_staged_chunks_tx, the
to_schedule set, score::persist_score_tx and jobs::enqueue_tx where tx.commit()
is called and replace the single `n` return with the chosen corrected count(s).
In `@src/openhuman/memory/tree/retrieval/drill_down.rs`:
- Around line 259-272: Formatting in the block that constructs a chunk and calls
upsert_chunks and append_leaf (see functions upsert_chunks and append_leaf and
the LeafRef construction with fields token_count, seq_in_source, created_at/
timestamp, partial_message) is not rustfmt-compliant; run rustfmt by executing
cargo fmt --all and commit the resulting changes so the crate passes cargo fmt
--check (ensure the struct literal formatting and trailing commas match rustfmt
output around the chunk creation and LeafRef append).
In `@src/openhuman/memory/tree/source_tree/store.rs`:
- Around line 202-244: The current insert into mem_tree_summaries truncates
staged content via content_preview when staged is Some, causing readers to get
only the 500-char preview; instead preserve the full node.content when writing
the DB until readers are updated. Change the staged match (variables staged,
content_preview, content_path, content_sha256) so that the value inserted for
the content column is node.content (not the 500-char preview) while still
setting content_path and content_sha256 from the staged struct; keep the preview
logic only for any separate preview column or consumers that explicitly need it,
or alternatively implement hydration in the SummaryNode read path later
(references: staged, content_preview, node.content, content_path,
content_sha256, the tx.execute INSERT into mem_tree_summaries, and the
SummaryNode read/hydrate code).
In `@src/openhuman/memory/tree/topic_tree/backfill.rs`:
- Around line 162-173: The code unconditionally increments appended after
calling append_leaf (in the backfill flow using LabelStrategy::Empty), but
append_leaf is idempotent and may be a no-op; update the call to capture
append_leaf's result and only increment appended when the result indicates a
real append occurred (e.g., a boolean/Option/enum returned by append_leaf
signaling change). Keep the existing .with_context(...) error handling around
the await, then branch on the successful result from append_leaf (using the
return value from append_leaf in the backfill function) to decide whether to do
appended += 1.
---
Minor comments:
In `@docs/memory-tree-async-pipeline.excalidraw`:
- Line 44: Update the subtitle string "Leaf ingestion -> jobs queue -> workers
-> source/topic/global tree building" to use the singular phrasing "job queue"
instead of "jobs queue" so it reads "Leaf ingestion -> job queue -> workers ->
source/topic/global tree building".
In `@src/bin/gmail_backfill_3d.rs`:
- Around line 449-455: The summary field no_pointer is hard-coded to 0 which
will under-report pointerless rows; replace it with an actual count by querying
the DB for rows where content_path IS NULL (e.g. run SELECT COUNT(*) FROM ...
WHERE content_path IS NULL using the same client used to compute
rows_with_pointer) and assign that result to no_pointer (handle errors/unwrap as
done for other counts), or if you prefer to remove the misleading metric simply
drop no_pointer from the summary output and any references to it; locate the
variable no_pointer and the code that computes rows_with_pointer to add the new
COUNT query or remove the field consistently.
In `@src/bin/slack_backfill.rs`:
- Around line 261-263: The probe-ratelimit branch accepts 0 and produces a no-op
run; add input validation where you destructure cli.probe_ratelimit (the if let
Some(n) = cli.probe_ratelimit { ... } blocks) to reject zero: if n == 0 print a
clear error and exit (or return Err) instead of proceeding. Apply the same check
in both probe blocks (the first if let Some(n) = cli.probe_ratelimit { ... } and
the later analogous block around lines ~303-340) so any --probe-ratelimit 0 is
handled early and does not run a misleading probe. Ensure the error message
references the flag name (--probe-ratelimit) and required positive integer
semantics.
In `@src/openhuman/memory/tree/canonicalize/document.rs`:
- Around line 42-44: Update the file-level/module documentation to reflect the
new behavior that the canonicalizer no longer adds a leading `# provider —
title` header; change any comments that state a title header is added to say the
provider/title belong in front-matter and that the code (e.g., the
canonicalization that calls md.push_str(doc.body.trim())) strips/removes such
headers; apply the same wording change to the other comment block referenced
around the region corresponding to lines 89–102 so all docs consistently state
"no leading header; provider/title belong in MD front-matter."
In `@src/openhuman/memory/tree/retrieval/source.rs`:
- Around line 333-345: The file fails rustfmt; run the formatter and commit the
changes: run `cargo fmt --all` (or your IDE’s Rustfmt) and reformat this file so
the blocks around upsert_chunks, append_leaf, the LeafRef struct literal, and
TOKEN_BUDGET arithmetic conform to rustfmt style; then stage and commit the
updated file so CI's `cargo fmt --check` passes.
In `@src/openhuman/memory/tree/score/extract/llm.rs`:
- Around line 204-206: The log::warn! invocation in memory_tree::extract::llm is
incorrectly formatted; run rustfmt (cargo fmt --all) or manually reformat the
macro invocation so it follows rustfmt style (e.g., break arguments/format
string onto separate lines as rustfmt expects) to satisfy cargo fmt --check and
fix the CI failure.
---
Nitpick comments:
In @.gitignore:
- Line 69: Update the .gitignore entry ".target-codex/" to use a rooted pattern
"/.target-codex/" so it only matches the repository root like other target dirs;
replace the existing ".target-codex/" pattern with "/.target-codex/" (keep the
trailing slash) to maintain consistency with the project's ignore conventions.
In `@docs/memory-tree-async-pipeline.excalidraw`:
- Line 557: Replace hard-coded numeric worker/semaphore text in the diagram
labels: change "spawn 3 worker tasks" and "shared Semaphore(3) for LLM-bound
work" to use configurable wording such as "spawn N workers (configurable)" and
"shared semaphore limit (configurable)" so the diagram matches runtime settings;
keep surrounding labels ("jobs::start(workspace_dir)", "recover_stale_locks()",
"Notify wakeup + 5s polling fallback") unchanged.
In `@src/openhuman/memory/tree/content_store/mod.rs`:
- Around line 29-105: Summary: mod.rs contains operational code and tests; split
types and ops and keep mod.rs export-focused. Move the StagedChunk struct into a
new sibling types.rs, move stage_chunks and the wrapper update_summary_tags into
a new ops.rs (or ops.rs for functions and types.rs for the struct), migrate any
local tests into ops.rs/tests, and re-export them from mod.rs via pub use
crate::openhuman::memory::tree::content_store::types::StagedChunk and pub use
crate::openhuman::memory::tree::content_store::ops::{stage_chunks,
update_summary_tags}; ensure compose::compose_chunk_file, atomic::write_if_new,
atomic::sha256_hex and paths::chunk_* references remain imported in ops.rs and
update module imports accordingly.
In `@src/openhuman/memory/tree/score/extract/mod.rs`:
- Around line 21-89: The build_summary_extractor function (and its
helpers/references like LlmExtractorConfig, LlmEntityExtractor,
RegexEntityExtractor, CompositeExtractor) is operational logic and should be
moved out of the domain module file into a sibling implementation module (e.g.,
ops.rs or builder.rs); create that new module, move the entire
build_summary_extractor implementation there, update imports/visibility so it
compiles, and then re-export it from mod.rs with a pub use so mod.rs remains
export-focused and contains no operational builder code.
In `@src/openhuman/memory/tree/source_tree/summariser/llm.rs`:
- Around line 318-327: The system_prompt function still accepts an unused
_budget parameter; change its signature to fn system_prompt() -> String (remove
the parameter) and update all call sites that pass a budget (notably the
build_request calls referencing system_prompt) to call system_prompt() with no
arguments; ensure any intermediate helpers (e.g., build_request or other
functions that forwarded budget solely to system_prompt) drop the budget
parameter or stop forwarding it so the stale API surface is removed while
preserving behavior.
In `@tests/json_rpc_e2e.rs`:
- Around line 824-831: The test currently enforces exact controller count with
assert_eq!(controllers.len(), expected_methods.len()), which is brittle; change
the assertion to verify that each entry in expected_methods is present in
controllers (i.e., a subset check) instead of requiring lengths to match. Locate
the variables expected_methods and controllers in the test (around the vector
initialization and loop) and replace the length equality check with logic that
iterates over expected_methods and asserts controllers.contains(method) (or uses
a set/subset comparison) so new RPCs can be added without breaking the test.
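A sketch of the relaxed subset-style assertion (the method names and the shape of `controllers` are assumptions for illustration; the real test may build them differently):

```rust
#[test]
fn registered_methods_include_expected() {
    // Stand-in for the controllers list the e2e test builds; extra
    // registered RPCs no longer fail the test.
    let controllers: Vec<String> = vec![
        "memory_tree.trigger_digest".to_string(),
        "memory_tree.some_new_rpc".to_string(),
    ];
    let expected_methods = ["memory_tree.trigger_digest"];

    for method in expected_methods {
        assert!(
            controllers.iter().any(|c| c == method),
            "expected RPC method `{method}` to be registered"
        );
    }
}
```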
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 48d26b85-545f-4947-a96f-f7a831671a96
⛔ Files ignored due to path filters (1)
Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (67)
.gitignore
Cargo.toml
docs/memory-tree-async-pipeline.excalidraw
src/bin/gmail_backfill_3d.rs
src/bin/slack_backfill.rs
src/core/jsonrpc.rs
src/openhuman/composio/providers/gmail/ingest.rs
src/openhuman/composio/providers/gmail/mod.rs
src/openhuman/config/schema/load.rs
src/openhuman/config/schema/storage_memory.rs
src/openhuman/config/schema/types.rs
src/openhuman/memory/tree/canonicalize/chat.rs
src/openhuman/memory/tree/canonicalize/document.rs
src/openhuman/memory/tree/canonicalize/email.rs
src/openhuman/memory/tree/canonicalize/email_clean.rs
src/openhuman/memory/tree/canonicalize/mod.rs
src/openhuman/memory/tree/chunker.rs
src/openhuman/memory/tree/content_store/atomic.rs
src/openhuman/memory/tree/content_store/compose.rs
src/openhuman/memory/tree/content_store/mod.rs
src/openhuman/memory/tree/content_store/paths.rs
src/openhuman/memory/tree/content_store/read.rs
src/openhuman/memory/tree/content_store/tags.rs
src/openhuman/memory/tree/global_tree/digest.rs
src/openhuman/memory/tree/global_tree/digest_tests.rs
src/openhuman/memory/tree/global_tree/recap.rs
src/openhuman/memory/tree/global_tree/seal.rs
src/openhuman/memory/tree/ingest.rs
src/openhuman/memory/tree/jobs/handlers/mod.rs
src/openhuman/memory/tree/jobs/mod.rs
src/openhuman/memory/tree/jobs/scheduler.rs
src/openhuman/memory/tree/jobs/store.rs
src/openhuman/memory/tree/jobs/testing.rs
src/openhuman/memory/tree/jobs/types.rs
src/openhuman/memory/tree/jobs/worker.rs
src/openhuman/memory/tree/mod.rs
src/openhuman/memory/tree/retrieval/drill_down.rs
src/openhuman/memory/tree/retrieval/fetch.rs
src/openhuman/memory/tree/retrieval/global.rs
src/openhuman/memory/tree/retrieval/integration_test.rs
src/openhuman/memory/tree/retrieval/rpc.rs
src/openhuman/memory/tree/retrieval/source.rs
src/openhuman/memory/tree/retrieval/topic.rs
src/openhuman/memory/tree/rpc.rs
src/openhuman/memory/tree/schemas.rs
src/openhuman/memory/tree/score/extract/llm.rs
src/openhuman/memory/tree/score/extract/llm_tests.rs
src/openhuman/memory/tree/score/extract/mod.rs
src/openhuman/memory/tree/score/extract/types.rs
src/openhuman/memory/tree/score/mod.rs
src/openhuman/memory/tree/score/mod_tests.rs
src/openhuman/memory/tree/score/store.rs
src/openhuman/memory/tree/source_tree/bucket_seal.rs
src/openhuman/memory/tree/source_tree/bucket_seal_tests.rs
src/openhuman/memory/tree/source_tree/flush.rs
src/openhuman/memory/tree/source_tree/mod.rs
src/openhuman/memory/tree/source_tree/store.rs
src/openhuman/memory/tree/source_tree/store_tests.rs
src/openhuman/memory/tree/source_tree/summariser/llm.rs
src/openhuman/memory/tree/source_tree/types.rs
src/openhuman/memory/tree/store.rs
src/openhuman/memory/tree/store_tests.rs
src/openhuman/memory/tree/topic_tree/backfill.rs
src/openhuman/memory/tree/topic_tree/curator.rs
src/openhuman/memory/tree/topic_tree/routing.rs
src/openhuman/memory/tree/types.rs
tests/json_rpc_e2e.rs
🧹 Nitpick comments (1)
src/openhuman/memory/tree/retrieval/source.rs (1)
333-334: Extract the repeated token-count expression into one local constant. The same `TOKEN_BUDGET * 6 / 10` expression is duplicated for both `Chunk.token_count` and `LeafRef.token_count`; a local constant avoids drift in future edits.
♻️ Suggested cleanup
 async fn seed_source(cfg: &Config, scope: &str, ts: DateTime<Utc>) {
     let tree = get_or_create_source_tree(cfg, scope).unwrap();
     let summariser = InertSummariser::new();
+    const TEST_LEAF_TOKEN_COUNT: u32 =
+        crate::openhuman::memory::tree::source_tree::types::TOKEN_BUDGET * 6 / 10;
     for seq in 0..2u32 {
         let c = Chunk {
@@
-            token_count: crate::openhuman::memory::tree::source_tree::types::TOKEN_BUDGET * 6
-                / 10,
+            token_count: TEST_LEAF_TOKEN_COUNT,
@@
-            token_count: crate::openhuman::memory::tree::source_tree::types::TOKEN_BUDGET
-                * 6
-                / 10,
+            token_count: TEST_LEAF_TOKEN_COUNT,

Also applies to: 345-347
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/retrieval/source.rs` around lines 333 - 334, Extract the duplicated TOKEN_BUDGET * 6 / 10 expression into a single local constant (e.g., token_quota) near where the Chunk and LeafRef are created, then replace both Chunk.token_count and LeafRef.token_count uses with that constant to prevent drift; reference the TOKEN_BUDGET symbol and the Chunk.token_count and LeafRef.token_count fields when making the change so both occurrences (including the one around lines 345-347) are updated.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@src/openhuman/memory/tree/retrieval/source.rs`:
- Around line 333-334: Extract the duplicated TOKEN_BUDGET * 6 / 10 expression
into a single local constant (e.g., token_quota) near where the Chunk and
LeafRef are created, then replace both Chunk.token_count and LeafRef.token_count
uses with that constant to prevent drift; reference the TOKEN_BUDGET symbol and
the Chunk.token_count and LeafRef.token_count fields when making the change so
both occurrences (including the one around lines 345-347) are updated.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 710a73ea-1a66-457f-8bc0-de763ec349d4
📒 Files selected for processing (4)
src/openhuman/config/schema/load.rs
src/openhuman/memory/tree/retrieval/drill_down.rs
src/openhuman/memory/tree/retrieval/source.rs
src/openhuman/memory/tree/score/extract/llm.rs
✅ Files skipped from review due to trivial changes (1)
- src/openhuman/memory/tree/score/extract/llm.rs
🚧 Files skipped from review as they are similar to previous changes (2)
- src/openhuman/config/schema/load.rs
- src/openhuman/memory/tree/retrieval/drill_down.rs
…ems)
tinyhumansai#4 gmail unknown bucket — orphan:{id} fallback instead of collapsing all no-address messages into one "unknown" source tree; messages with neither address nor id are dropped with a warn (skip sentinel).
tinyhumansai#5 PII in gmail/ingest.rs — participants, source_id, and owner wrapped with redact() before logging; no email addresses in log output.
tinyhumansai#7 PII in chunker.rs — source_id in debug logs replaced with redact(source_id) hash.
tinyhumansai#8 Parent dir fsync after rename in atomic.rs — on Unix, open the parent directory and call sync_all() after a successful rename so the dirent is durable across a crash; best-effort warn on failure, non-fatal.
tinyhumansai#9 Body verify on re-stage in atomic.rs — when the path already exists, read and hash the on-disk body; if sha differs from the new body, remove the stale file and re-write atomically so SQLite content_sha256 always matches the on-disk bytes. New unit test covers the re-write path.
tinyhumansai#10 PII in paths.rs — malformed source_id fallback log now emits redact(source_id) instead of the raw string.
tinyhumansai#11 PII in read.rs — sha256 mismatch warn logs now emit redact(abs_path) instead of abs_path.display().
tinyhumansai#13 Same-tx enqueue in handle_extract — append_buffer and topic_route jobs are built before the transaction opens and enqueued via enqueue_tx() inside the same tx that commits the lifecycle update, eliminating the crash window where lifecycle commits but jobs are lost.
tinyhumansai#16 PII in score/extract/mod.rs — LLM extractor info! log demoted to debug! and endpoint wrapped with redact_endpoint() to strip path/query/credentials from the URL.
tinyhumansai#17 Non-email scope slug in bucket_seal.rs — only the gmail: prefix is stripped before slugify; all other source kinds (slack:, discord:, document:, …) slugify the full scope string so slack:#eng and discord:#eng no longer produce the same "eng" slug and collide in summaries/source/.
tinyhumansai#18 Don't commit summary on stage_summary failure — bucket_seal.rs, global_tree/digest.rs, and global_tree/seal.rs now propagate stage_summary errors via .context()? instead of falling back to a NULL content_path row; the seal is retried via the normal job-retry path. insert_summary_tx callers in those three paths now always pass Some(&staged) instead of Option.
tinyhumansai#20 PII in topic_tree/backfill.rs — all entity_id occurrences in log lines (info/debug/warn) replaced with redact(entity_id).
New shared helper: src/openhuman/memory/tree/util/redact.rs
- redact(s) → stable 8-hex-char SHA-256 prefix for PII fields
- redact_endpoint(u) → host:port only (strips scheme, creds, path, query)
- Full unit-test coverage for both helpers
Items tinyhumansai#14 (lease tokens) and tinyhumansai#19 (preview-vs-disk reader audit) intentionally deferred to follow-up commits — both have larger blast radius and need separate review.
Items tinyhumansai#1, tinyhumansai#12, tinyhumansai#15 pushed back on PR thread (verified false positive or covered by upstream contract).
Items #2, tinyhumansai#3 deferred (pre-existing in slack_backfill.rs, not introduced by this PR).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
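For reference, a minimal sketch of what a `redact()` helper with these properties can look like, assuming the `sha2` crate; the real `util/redact.rs` may differ in details:

```rust
use sha2::{Digest, Sha256};

/// Stable 8-hex-char SHA-256 prefix: the same value always redacts to the
/// same token, so log lines stay correlatable without exposing the PII.
pub fn redact(s: &str) -> String {
    let digest = Sha256::digest(s.as_bytes());
    // Two hex chars per byte: the first 4 digest bytes give 8 characters.
    digest[..4].iter().map(|b| format!("{b:02x}")).collect()
}

fn main() {
    assert_eq!(redact("alice@example.com"), redact("alice@example.com"));
    assert_eq!(redact("alice@example.com").len(), 8);
}
```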
…mansai#14) Job settlement (mark_done / mark_failed) now gates on (attempts, started_at_ms) so a stale worker whose lease expired and got the row re-claimed by another worker can't clobber the new lessee's state.

The race we close:
t=0      Worker A claims job_X, locked_until_ms = t + lock_duration
t=lock   Lease expires while A's handler is still running (Ollama hung)
t=lock+  recover_stale_locks resets job_X to ready
t=lock+  Worker B claims job_X (attempts++, new started_at_ms)
t=...    B's handler completes successfully, marks done
t=...    A's handler finally returns Ok, calls mark_done
─────── before this fix: A's UPDATE clobbers B's state
─────── after this fix: A's UPDATE matches 0 rows, logs warn, no-op

Implementation:
- mark_done / mark_failed now take &Job (carrying attempts + started_at_ms snapshots from the worker's claim_next call)
- SQL WHERE adds: id = ? AND attempts = ? AND started_at_ms IS ? (IS handles the NULL-vs-NULL equality semantics SQLite needs)
- rows_affected == 0 → warn-log "stale lease, settlement no-op", return Ok (this is a known race outcome, not a failure)
- All 5 existing tests updated to pass &job snapshots
- 3 new tests:
  · mark_done_succeeds_for_current_lessee (happy-path guard)
  · stale_worker_settlement_is_noop (the race itself)
  · stale_worker_mark_failed_is_noop (same contract for failure path)

Worker call sites updated to thread &job through. scheduler.rs test fixture updated to claim before settling (needed for &Job snapshot).

Addresses CodeRabbit review item tinyhumansai#14 (Critical) on PR tinyhumansai#1008.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
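A sketch of the guarded settlement UPDATE described above, using rusqlite; the table and column names follow the commit message, everything else is illustrative:

```rust
struct Job {
    id: i64,
    attempts: i64,
    // NULL in SQL maps to None; SQL `IS ?` treats NULL-vs-NULL as equal.
    started_at_ms: Option<i64>,
}

fn mark_done(conn: &rusqlite::Connection, job: &Job) -> anyhow::Result<()> {
    let rows = conn.execute(
        "UPDATE mem_tree_jobs
            SET status = 'done'
          WHERE id = ?1 AND attempts = ?2 AND started_at_ms IS ?3",
        rusqlite::params![job.id, job.attempts, job.started_at_ms],
    )?;
    if rows == 0 {
        // The lease was re-claimed while we ran; the new lessee's state wins
        // and our settlement is a deliberate no-op, not a failure.
        log::warn!("stale lease, settlement no-op: job_id={}", job.id);
    }
    Ok(())
}
```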
…nyhumansai#1008 CR-19) After the MD-on-disk migration `mem_tree_chunks.content` and `mem_tree_summaries.content` store only a ≤500-char preview. All readers that feed LLMs or the retrieval API were still consuming the preview as if it were the full body.

Fix: add `content_store::read::read_chunk_body` / `read_summary_body` (disk read + SHA-256 verify) and thread them through every full-body call site:
• `bucket_seal::hydrate_leaf_inputs` / `hydrate_summary_inputs` (LLM summariser input — fatal on missing content_path)
• `handlers::handle_extract` — scorer + embedder (LLM extractor + embedding — fatal)
• `handlers::handle_append_buffer` Leaf + Summary branches (topic-tree ingest — fatal)
• `retrieval::fetch::hit_from_chunk` — non-fatal warn+fallback
• `retrieval::drill_down` chunk + summary branches — non-fatal
• `retrieval::source::collect_hits_and_nodes` — non-fatal
• `retrieval::topic::fetch_topic_tree_root_summary` + `entity_hit_to_retrieval_hit` — non-fatal
• `global_tree::digest::pick_source_contribution` — non-fatal

Also fix `content_store::paths::chunk_rel_path` to sanitize chunk IDs into cross-platform filenames (colons are illegal on Windows NTFS; summary_rel_path already did this).

Test fixtures: every test that calls `upsert_chunks` and then triggers a seal or a retrieval hit now calls `stage_test_chunks` (a local helper that writes .md files + stores the content_path pointer in SQLite) so `read_chunk_body` can resolve the on-disk file. Affected: bucket_seal_tests, drill_down, fetch, global, integration_test, rpc, source (retrieval), flush, recap, digest_tests.

All 477 memory::tree lib tests pass; 62 gmail lib tests pass.
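A sketch of the read-plus-verify shape described above, assuming the `sha2` and `hex` crates; the real `content_store::read` signatures and front-matter handling may differ:

```rust
use sha2::{Digest, Sha256};
use std::path::Path;

fn read_chunk_body(abs_path: &Path, stored_sha: &str) -> anyhow::Result<String> {
    let bytes = std::fs::read(abs_path)?;
    // In the real store the sha covers the body only (front-matter excluded);
    // this sketch skips that split for brevity.
    let actual = hex::encode(Sha256::digest(&bytes));
    anyhow::ensure!(
        actual == stored_sha,
        // Path deliberately omitted from the message per the PII guideline.
        "content sha mismatch: stored != on-disk"
    );
    Ok(String::from_utf8(bytes)?)
}
```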
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/openhuman/memory/tree/source_tree/bucket_seal.rs (1)
560-567: ⚠️ Potential issue | 🟠 Major
Persist `node.topics` into the entity index too. After `resolve_labels`, the summary row carries both `entities` and `topics`, but this transaction only indexes `node.entities`. That leaves topic summaries discoverable via `summary.topics` but invisible to the paths that read `mem_tree_entity_index` (`query_topic`, `search_entities(kind=topic)`, and the tag rewrite).
Based on learnings, summary topics are stored in `mem_tree_entity_index` with `entity_kind = "topic"`, and `update_summary_tags` reads that index as the source of truth.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/source_tree/bucket_seal.rs` around lines 560 - 567, The transaction only indexes node.entities but must also persist node.topics into mem_tree_entity_index so topic summaries become discoverable; call the same index routine (crate::openhuman::memory::tree::score::store::index_summary_entity_ids_tx) for node.topics as well (or extend that function to accept both entities and topics) ensuring the entries are written with entity_kind = "topic" (so update_summary_tags, query_topic, and search_entities(kind=topic) will see them); ensure you pass the same tx, node.id, node.score, timestamp, and tree_id context when indexing topics.
♻️ Duplicate comments (1)
src/openhuman/memory/tree/content_store/read.rs (1)
24-29: ⚠️ Potential issue | 🟠 Major
Redact the path in these error strings too. `read_chunk_file()` still bakes `abs_path` into its read/UTF-8/front-matter errors. Those errors bubble up through handler contexts and are logged with full chains, so participant-bucket paths can still leak even though the explicit `warn!` calls are redacted. `read_summary_file()` inherits the same behavior through this helper.
As per coding guidelines, "Never log secrets or full PII."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/content_store/read.rs` around lines 24 - 29, The error messages in read_chunk_file()/read_summary_file() (in read.rs) include the full abs_path in the read/UTF-8/front-matter errors; replace those messages to avoid leaking PII by redacting the path before interpolating into error strings (e.g., use a helper or only include non-sensitive parts such as file stem or a fixed "[redacted]" token) for the std::fs::read map_err, the from_utf8 map_err, and the split_front_matter no_front_matter error that currently reference abs_path; ensure you update references around abs_path and split_front_matter so all three error branches report a redacted path or placeholder instead of the full abs_path.
🧹 Nitpick comments (4)
src/openhuman/composio/providers/gmail/ingest.rs (1)
228-258: Align new locals with the repo's camelCase rule. Names like `total_chunks`, `raw_msgs`, `source_id`, and `thread_subject` don't match the `src/openhuman/**/*.rs` naming convention. I'd normalize them here while the module is still new so it stays consistent with the rest of `openhuman`. As per coding guidelines, `src/openhuman/**/*.rs`: "Use camelCase for variable names".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/composio/providers/gmail/ingest.rs` around lines 228 - 258, Rename snake_case local variables to camelCase: change total_chunks -> totalChunks, total_buckets -> totalBuckets, raw_msgs -> rawMsgs, source_id -> sourceId, and thread_subject -> threadSubject; update their declarations and every use in this function (the for loop binding (participants, rawMsgs), the call to raw_to_email_message, creation of sourceId = format!("gmail:{}", participants), and EmailThread { thread_subject -> threadSubject }) so all references match the new names, keeping other symbols like bucket_by_participants, raw_to_email_message, EmailThread, pick_thread_subject, and DEFAULT_TAGS unchanged.
src/openhuman/memory/tree/content_store/paths.rs (1)
198-202: Variable naming in this file diverges from the repo convention. At line 198 onward, local variables use snake_case (e.g., `pending_underscore`, `last_dash`). If you want strict consistency with the project rule, these should be camelCase.
As per coding guidelines, "Use camelCase for variable names..."
🤖 Prompt for AI Agents
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/content_store/paths.rs` around lines 198 - 202, Local variables in src/openhuman/memory/tree/content_store/paths.rs use snake_case (e.g., pending_underscore, last_dash) but the repo requires camelCase; rename these locals (for example pending_underscore -> pendingUnderscore, last_dash -> lastDash, and any other snake_case locals in this function such as lower if needed) and update every use site within the same function (and any closures) to the new names, keeping the same types/semantics and preserving the existing initialization (e.g., let mut pendingUnderscore = false;). Run cargo build/tests to ensure no unresolved references remain.
src/openhuman/memory/tree/chunker.rs (1)
73-198: Rename the new locals to camelCase to match the repo convention. This hunk introduces several new snake_case locals (`max_tokens`, `max_chars`, `unit_separator`, `acc_chars`, etc.), so the file now mixes conventions inside the same module.
As per coding guidelines, "Use camelCase for variable names."
🤖 Prompt for AI Agents
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/chunker.rs` around lines 73 - 198, The new locals use snake_case but the repo prefers camelCase; rename variables and closure parameters consistently (e.g., max_tokens -> maxTokens, max_chars -> maxChars, unit_separator -> unitSeparator, acc_chars -> accChars, sub_pieces -> subPieces) and update the flush closure signature and its uses (flush(&mut acc, &mut accChars, &mut out)) as well as every reference to approx_token_count, split_by_token_budget, units, and Chunk construction sites to use the new names so the logic is unchanged but naming matches camelCase conventions.
src/openhuman/memory/tree/jobs/scheduler.rs (1)
28-127: Use camelCase for the new scheduler params and locals. The added helpers introduce repo-inconsistent snake_case names such as `days_back`, `date_iso`, `today_iso`, and `next_sleep_duration`.
As per coding guidelines, "Use camelCase for variable names."
🤖 Prompt for AI Agents
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/jobs/scheduler.rs` around lines 28 - 127, Rename snake_case scheduler parameters, locals, and the helper function to camelCase: change the backfill_missing_digests parameter days_back to daysBack and update all uses inside backfill_missing_digests (including the log call and loop bounds); rename local variables date_iso and today_iso in enqueue_daily_jobs (and trigger_digest's payload field references) to dateIso and todayIso; rename the function next_sleep_duration to nextSleepDuration and update its internal locals and any call sites; make sure to update NewJob::digest_daily/flush_stale payload field names references (date_iso -> dateIso) and all log messages to use the new identifiers so the code compiles and follows the camelCase variable naming guideline.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/openhuman/composio/providers/gmail/ingest.rs`:
- Around line 126-140: The current splitting on commas in
parse_address_list_for_bucket (and the other helper used at 189-203) breaks
RFC-822 quoted display names like "Doe, John" and must be replaced with an
RFC-aware parser; create a single shared parser function that uses an RFC
mailbox parser (for example mailparse::addrparse_header or another RFC-822/2822
mailbox parser) to parse the header/value into MailAddr entries and then collect
the mailbox addresses (or full mailbox strings) into Vec<String>, and update
parse_address_list_for_bucket and the other helper to call that parser instead
of using split(',') so quoted names and escaped commas are handled correctly.
In `@src/openhuman/memory/tree/content_store/paths.rs`:
- Around line 69-71: The code currently panics in summary_rel_path by calling
expect on date_for_global for SummaryTreeKind::Global; change summary_rel_path
to return a Result (e.g., Result<String, E> or anyhow::Result<String>) instead
of unwrapping, detect when kind == SummaryTreeKind::Global and
date_for_global.is_none() and return a descriptive error (include context about
missing date_for_global), replace the expect(...) call with
date_for_global.ok_or_else(|| /* error with context */)?.format(...) or map_err
to convert to the chosen error type, and propagate this Result up from callers
so the async pipeline can handle retry/failure instead of panicking.
In `@src/openhuman/memory/tree/content_store/read.rs`:
- Around line 154-179: The code ignores the stored SHA from pointers and returns
the file body without verification; change the tuple binding to capture the
stored SHA (rename _sha256 -> stored_sha), compute the SHA256 (hex) of the read
bytes returned by read_chunk_file, compare it to stored_sha, and return an error
with context (same style as with_context) if they differ; reference the
variables chunk_id, rel_path, stored_sha, the call to
read_chunk_file(&abs_path), and use the existing logging/context helpers
(redact, with_context) to produce the failure message so corrupted or stale
files are rejected before returning result.body.
- Around line 148-153: The code currently treats a NULL return from
get_chunk_content_pointers(...) as an error and hard-fails; instead preserve the
legacy-row fallback by treating None as a valid case and proceeding to read the
inline content column. Modify the call around get_chunk_content_pointers in
read.rs (the variable pointers from get_chunk_content_pointers) so it does not
err_on_none but branches: if Some(pointers) use the external
content_path/content_sha256 logic, otherwise continue with the pre-MD migration
path that reads the inline content field; apply the same change to the other
occurrence referenced (the block around lines 204-209) so old rows are not
turned into permanent failures.
In `@src/openhuman/memory/tree/source_tree/bucket_seal.rs`:
- Around line 164-170: The debug log currently prints the raw chunk id
(leaf.chunk_id) which can leak PII; update the log::debug call in this
bucket_seal/append_leaf area to avoid printing leaf.chunk_id directly—replace it
with a redacted or deterministic hashed representation (e.g., hash or fixed
structural prefix like "chunk-<redacted>" or first N chars of a secure hash)
while leaving tree.id, leaf.token_count, and strategy unchanged; ensure the new
value is computed safely before the log call and used in the log::debug format
string instead of leaf.chunk_id.
- Around line 726-743: The current code silently swallows index read errors and
unconditionally calls content_read::read_chunk_body/read_summary_body which
fails for pre-migration inline rows; update hydrate_leaf_inputs (and the similar
summary path) to: 1) stop using
list_entity_ids_for_node(...).unwrap_or_default() — match on the Result and only
default to empty when the error is a benign "index missing/empty" sentinel (log
it) but propagate real DB/file/SHA errors upward; 2) before calling
content_read::read_chunk_body or content_read::read_summary_body, check the
chunk's content_path presence and, if absent, use the inline content field from
the row as the body fallback; otherwise call the read_* function and propagate
its errors. Reference list_entity_ids_for_node, hydrate_leaf_inputs,
content_read::read_chunk_body and content_read::read_summary_body when making
these changes.
In `@src/openhuman/memory/tree/util/redact.rs`:
- Around line 33-46: The redact_endpoint function currently uses split_once('@')
which removes everything before the first '@' and can leak credential fragments
(e.g., "user:p@ss@example.com"); update redact_endpoint to first isolate the
authority (take after_scheme up to the first '/', '?', or '#'), then remove
userinfo by using rsplit_once('@') on that authority so the last '@' is honored,
and finally return only the host:port portion; reference the symbols
redact_endpoint, after_scheme, after_creds/authority and host_port when making
this change.
---
Outside diff comments:
In `@src/openhuman/memory/tree/source_tree/bucket_seal.rs`:
- Around line 560-567: The transaction only indexes node.entities but must also
persist node.topics into mem_tree_entity_index so topic summaries become
discoverable; call the same index routine
(crate::openhuman::memory::tree::score::store::index_summary_entity_ids_tx) for
node.topics as well (or extend that function to accept both entities and topics)
ensuring the entries are written with entity_kind = "topic" (so
update_summary_tags, query_topic, and search_entities(kind=topic) will see
them); ensure you pass the same tx, node.id, node.score, timestamp, and tree_id
context when indexing topics.
---
Duplicate comments:
In `@src/openhuman/memory/tree/content_store/read.rs`:
- Around line 24-29: The error messages in read_chunk_file()/read_summary_file()
(in read.rs) include the full abs_path in the read/UTF-8/front-matter errors;
replace those messages to avoid leaking PII by redacting the path before
interpolating into error strings (e.g., use a helper or only include
non-sensitive parts such as file stem or a fixed "[redacted]" token) for the
std::fs::read map_err, the from_utf8 map_err, and the split_front_matter
no_front_matter error that currently reference abs_path; ensure you update
references around abs_path and split_front_matter so all three error branches
report a redacted path or placeholder instead of the full abs_path.
---
Nitpick comments:
In `@src/openhuman/composio/providers/gmail/ingest.rs`:
- Around line 228-258: Rename snake_case local variables to camelCase: change
total_chunks -> totalChunks, total_buckets -> totalBuckets, raw_msgs -> rawMsgs,
source_id -> sourceId, and thread_subject -> threadSubject; update their
declarations and every use in this function (the for loop binding (participants,
rawMsgs), the call to raw_to_email_message, creation of sourceId =
format!("gmail:{}", participants), and EmailThread { thread_subject ->
threadSubject }) so all references match the new names, keeping other symbols
like bucket_by_participants, raw_to_email_message, EmailThread,
pick_thread_subject, and DEFAULT_TAGS unchanged.
In `@src/openhuman/memory/tree/chunker.rs`:
- Around line 73-198: The new locals use snake_case but the repo prefers
camelCase; rename variables and closure parameters consistently (e.g.,
max_tokens -> maxTokens, max_chars -> maxChars, unit_separator -> unitSeparator,
acc_chars -> accChars, sub_pieces -> subPieces) and update the flush closure
signature and its uses (flush(&mut acc, &mut accChars, &mut out)) as well as
every reference to approx_token_count, split_by_token_budget, units, and Chunk
construction sites to use the new names so the logic is unchanged but naming
matches camelCase conventions.
In `@src/openhuman/memory/tree/content_store/paths.rs`:
- Around line 198-202: Local variables in
src/openhuman/memory/tree/content_store/paths.rs use snake_case (e.g.,
pending_underscore, last_dash) but the repo requires camelCase; rename these
locals (for example pending_underscore -> pendingUnderscore, last_dash ->
lastDash, and any other snake_case locals in this function such as lower if
needed) and update every use site within the same function (and any closures) to
the new names, keeping the same types/semantics and preserving the existing
initialization (e.g., let mut pendingUnderscore = false;). Run cargo build/tests
to ensure no unresolved references remain.
In `@src/openhuman/memory/tree/jobs/scheduler.rs`:
- Around line 28-127: Rename snake_case scheduler parameters, locals, and the
helper function to camelCase: change the backfill_missing_digests parameter
days_back to daysBack and update all uses inside backfill_missing_digests
(including the log call and loop bounds); rename local variables date_iso and
today_iso in enqueue_daily_jobs (and trigger_digest's payload field references)
to dateIso and todayIso; rename the function next_sleep_duration to
nextSleepDuration and update its internal locals and any call sites; make sure
to update NewJob::digest_daily/flush_stale payload field names references
(date_iso -> dateIso) and all log messages to use the new identifiers so the
code compiles and follows the camelCase variable naming guideline.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a0a9ccb3-ddc1-49f9-8fd6-598dc2007c83
📒 Files selected for processing (28)
src/openhuman/composio/providers/gmail/ingest.rs
src/openhuman/memory/tree/chunker.rs
src/openhuman/memory/tree/content_store/atomic.rs
src/openhuman/memory/tree/content_store/paths.rs
src/openhuman/memory/tree/content_store/read.rs
src/openhuman/memory/tree/global_tree/digest.rs
src/openhuman/memory/tree/global_tree/digest_tests.rs
src/openhuman/memory/tree/global_tree/recap.rs
src/openhuman/memory/tree/global_tree/seal.rs
src/openhuman/memory/tree/jobs/handlers/mod.rs
src/openhuman/memory/tree/jobs/scheduler.rs
src/openhuman/memory/tree/jobs/store.rs
src/openhuman/memory/tree/jobs/worker.rs
src/openhuman/memory/tree/mod.rs
src/openhuman/memory/tree/retrieval/drill_down.rs
src/openhuman/memory/tree/retrieval/fetch.rs
src/openhuman/memory/tree/retrieval/global.rs
src/openhuman/memory/tree/retrieval/integration_test.rs
src/openhuman/memory/tree/retrieval/rpc.rs
src/openhuman/memory/tree/retrieval/source.rs
src/openhuman/memory/tree/retrieval/topic.rs
src/openhuman/memory/tree/score/extract/mod.rs
src/openhuman/memory/tree/source_tree/bucket_seal.rs
src/openhuman/memory/tree/source_tree/bucket_seal_tests.rs
src/openhuman/memory/tree/source_tree/flush.rs
src/openhuman/memory/tree/topic_tree/backfill.rs
src/openhuman/memory/tree/util/mod.rs
src/openhuman/memory/tree/util/redact.rs
✅ Files skipped from review due to trivial changes (2)
- src/openhuman/memory/tree/jobs/worker.rs
- src/openhuman/memory/tree/topic_tree/backfill.rs
🚧 Files skipped from review as they are similar to previous changes (12)
- src/openhuman/memory/tree/mod.rs
- src/openhuman/memory/tree/global_tree/recap.rs
- src/openhuman/memory/tree/retrieval/fetch.rs
- src/openhuman/memory/tree/retrieval/topic.rs
- src/openhuman/memory/tree/retrieval/integration_test.rs
- src/openhuman/memory/tree/global_tree/seal.rs
- src/openhuman/memory/tree/retrieval/source.rs
- src/openhuman/memory/tree/score/extract/mod.rs
- src/openhuman/memory/tree/global_tree/digest.rs
- src/openhuman/memory/tree/source_tree/flush.rs
- src/openhuman/memory/tree/jobs/store.rs
- src/openhuman/memory/tree/content_store/atomic.rs
… in path util, last-@ for endpoint redact

Three small correctness fixes from the CodeRabbit re-review of the preview-vs-disk reader audit (tinyhumansai#19) and surrounding work:

- read_chunk_body / read_summary_body now compare the on-disk body's sha256 to the value stored in SQLite at write time. Mismatch returns Err rather than silently serving corrupt bytes into the LLM pipeline. The previous code computed the sha but discarded it.
- summary_rel_path no longer panics when date_for_global is None for the Global tree kind. A path utility shouldn't take the process down; falls back to an `unknown-date` sentinel segment with a warn log surfacing the caller bug. Test flipped from #[should_panic] to asserting the sentinel.
- redact_endpoint splits the authority on the LAST `@` (rsplit_once) instead of the first, so URLs with `@` in user-info segments (e.g. `https://user:p@ass@host.com`) extract the host correctly. Also tightened the order-of-operations: take the authority first (cut at /?#), then split userinfo within it, so a `@` in path or query string can't be misread as a userinfo separator.

Addresses CodeRabbit comments 3157361914, 3157361930, 3157361952.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
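A minimal sketch of that authority-first parse order (illustrative only; the real helper may handle more edge cases):

```rust
pub fn redact_endpoint(u: &str) -> String {
    // 1. Drop the scheme if present.
    let after_scheme = u.split_once("://").map_or(u, |(_, rest)| rest);
    // 2. Isolate the authority: everything before the first '/', '?' or '#',
    //    so an '@' in the path or query can't be mistaken for userinfo.
    let authority = after_scheme
        .split(|c: char| c == '/' || c == '?' || c == '#')
        .next()
        .unwrap_or(after_scheme);
    // 3. Strip userinfo on the LAST '@' so "user:p@ss@host" keeps the host.
    let host_port = authority.rsplit_once('@').map_or(authority, |(_, h)| h);
    host_port.to_string()
}

fn main() {
    assert_eq!(
        redact_endpoint("https://user:p@ss@example.com:11434/api?x=a@b"),
        "example.com:11434"
    );
}
```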
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/openhuman/memory/tree/content_store/read.rs (1)
120-124: Reword NULL-pointer failures as migration regressions, not legacy rows. The hard-fail itself is right, but the comments and error text still frame `None` as a plausible "pre-MD-migration row" case. In this branch that state is supposed to be eliminated by wipe + re-ingest, so this wording will send triage toward a non-existent compatibility path instead of the real causes: a staging/write-path regression or a bad ingest state.
📝 Suggested wording
-// - If `content_path` / `content_sha256` are NULL (legacy rows ingested before
-//   the MD-on-disk migration), return `Err` — callers must handle the
-//   "pre-migration chunk" case explicitly. The job pipeline propagates the
-//   error and retries; retrieval falls back gracefully.
+// - If `content_path` / `content_sha256` are NULL, return `Err` — in this
+//   migration that indicates an unexpected staging/schema regression or an
+//   incomplete wipe + re-ingest transition.
-        "[content_store::read] no content_path for chunk_id={} (pre-MD-migration row?)",
+        "[content_store::read] no content_path for chunk_id={} (expected disk-backed pointer missing)",
-        "[content_store::read] no content_path for summary_id={} (pre-MD-migration row?)",
+        "[content_store::read] no content_path for summary_id={} (expected disk-backed pointer missing)",
Based on learnings, this PR's deployment model is a `--wipe` + re-ingest migration, so NULL `content_path` is intended to signal a write-path/schema regression rather than a supported legacy-row case.
Also applies to: 148-153, 207-209, 221-226
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/content_store/read.rs` around lines 120 - 124, Update the wording in the NULL-pointer failure comments and any error messages to treat missing content_path/content_sha256 as a migration/regression (write-path or ingest state bug) rather than a supported "pre-migration legacy row" case; locate references to content_path and content_sha256 in read.rs (including the comment block above the "Error policy" and any error strings returned where code checks for None) and replace language that suggests a graceful legacy compatibility path with wording that clearly signals a migration/regression and instructs operators to wipe+re-ingest or investigate write/ingest pipeline failures.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/openhuman/memory/tree/content_store/read.rs`:
- Around line 23-29: read_chunk_file currently embeds the full abs_path into its
anyhow! error messages (in the std::fs::read map_err, the UTF-8 map_err, and the
split_front_matter ok_or_else), which can leak PII; change those error messages
to avoid printing the full path by using a sanitized/redacted representation
(e.g., a small helper like sanitize_path(abs_path) or only
abs_path.file_name().and_then(|s| s.to_str()).unwrap_or("<redacted>")) and use
that redacted value in the anyhow! calls in read_chunk_file so
split_front_matter, std::fs::read and UTF-8 errors no longer expose the full
path.
---
Nitpick comments:
In `@src/openhuman/memory/tree/content_store/read.rs`:
- Around line 120-124: Update the wording in the NULL-pointer failure comments
and any error messages to treat missing content_path/content_sha256 as a
migration/regression (write-path or ingest state bug) rather than a supported
"pre-migration legacy row" case; locate references to content_path and
content_sha256 in read.rs (including the comment block above the "Error policy"
and any error strings returned where code checks for None) and replace language
that suggests a graceful legacy compatibility path with wording that clearly
signals a migration/regression and instructs operators to wipe+re-ingest or
investigate write/ingest pipeline failures.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 88842fcc-4ebd-446c-b6bf-d2761152116d
📒 Files selected for processing (3)
src/openhuman/memory/tree/content_store/paths.rs
src/openhuman/memory/tree/content_store/read.rs
src/openhuman/memory/tree/util/redact.rs
pub fn read_chunk_file(abs_path: &Path) -> anyhow::Result<ChunkFileContents> {
    let raw = std::fs::read(abs_path).map_err(|e| anyhow::anyhow!("read {:?}: {e}", abs_path))?;
    let content = std::str::from_utf8(&raw)
        .map_err(|e| anyhow::anyhow!("invalid UTF-8 in {:?}: {e}", abs_path))?;

    let (_fm, body) = split_front_matter(content)
        .ok_or_else(|| anyhow::anyhow!("no front-matter in {:?}", abs_path))?;
Redact the path in low-level reader errors too.
read_chunk_file() still embeds abs_path verbatim in its anyhow! messages. Those errors are later bubbled up and logged with {e:#} in retrieval/job paths, so a failed read can still expose participant-bucket paths even though the explicit warn logs were already sanitized.
🔐 Suggested change
pub fn read_chunk_file(abs_path: &Path) -> anyhow::Result<ChunkFileContents> {
- let raw = std::fs::read(abs_path).map_err(|e| anyhow::anyhow!("read {:?}: {e}", abs_path))?;
+ let path_hash = redact(&abs_path.to_string_lossy());
+ let raw = std::fs::read(abs_path).map_err(|e| {
+ anyhow::anyhow!("[content_store::read] read failed path_hash={}: {e}", path_hash)
+ })?;
let content = std::str::from_utf8(&raw)
- .map_err(|e| anyhow::anyhow!("invalid UTF-8 in {:?}: {e}", abs_path))?;
+ .map_err(|e| {
+ anyhow::anyhow!("[content_store::read] invalid UTF-8 path_hash={}: {e}", path_hash)
+ })?;
let (_fm, body) = split_front_matter(content)
- .ok_or_else(|| anyhow::anyhow!("no front-matter in {:?}", abs_path))?;
+ .ok_or_else(|| {
+ anyhow::anyhow!("[content_store::read] no front-matter path_hash={}", path_hash)
+        })?;
As per coding guidelines, "Never log secrets or full PII."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/openhuman/memory/tree/content_store/read.rs` around lines 23 - 29,
read_chunk_file currently embeds the full abs_path into its anyhow! error
messages (in the std::fs::read map_err, the UTF-8 map_err, and the
split_front_matter ok_or_else), which can leak PII; change those error messages
to avoid printing the full path by using a sanitized/redacted representation
(e.g., a small helper like sanitize_path(abs_path) or only
abs_path.file_name().and_then(|s| s.to_str()).unwrap_or("<redacted>")) and use
that redacted value in the anyhow! calls in read_chunk_file so
split_front_matter, std::fs::read and UTF-8 errors no longer expose the full
path.
Summary
- `.md` files on disk; SQLite holds pointer + sha256 + 500-char preview + state. Body bytes are immutable; sha is over body only; front-matter is mutable for tag rewrites.
- Participant bucketing: `email/<participants_slug>/<chunk_id>.md`.
- Per-tree label strategies (`ExtractFromContent`/`UnionFromChildren`/`Empty`), summariser/extractor split, manual `memory_tree.trigger_digest` RPC.
- Message-aware chunker: chat (`^##` boundaries) and email (`^---\nFrom:` boundaries); document fallback unchanged. Pack-to-budget greedy emit so a thread's emails stay together when they fit.
- `summaries/{source|global|topic}/<scope>/L<n>/<id>.md`. After-extract tag rewrite merges entities + topics into the summary's front-matter `tags:` (atomic, body-untouched).
- `gmail-backfill-3d`: paginates `GMAIL_FETCH_EMAILS` via Composio, ingests last N days, drains the queue, re-hashes every chunk + summary file against `content_sha256`. Validated against real Gmail (40 chunks, 2 summaries, 100% sha-verified).
Problem
Today every chunk and summary body lives inline in
`mem_tree_chunks.text` / `mem_tree_summaries.summary`. That couples three things that should be decoupled: (1) human/portable artifact, (2) deterministic identity, (3) state machine. Specific consequences: `ingest_chat` / `ingest_email` calls wait for the LLM scoring + summariser. The async pipeline (feat(memory): async pipeline + per-tree label fixes + summariser split + memory tree enabled at app startup #997) decoupled this; this PR builds on that foundation.
Solution
Storage layer. New
`content_store/` module owns paths, compose, atomic write (tempfile + fsync + rename), read-with-sha-verify, and tag rewrites. Body bytes hashed; YAML front-matter is mutable metadata that doesn't enter the hash. Atomicity contract: write file first, then SQLite tx commits the pointer — file always exists before any row claims to point at it; transient orphan MDs cleanable by janitor.
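A sketch of that write-then-commit ordering with the `tempfile` crate (the name `write_if_new` follows the PR; the body is an assumption, not the repo's implementation):

```rust
use std::io::Write;
use std::path::Path;

fn write_if_new(final_path: &Path, body: &[u8]) -> anyhow::Result<()> {
    if final_path.exists() {
        return Ok(()); // idempotent re-stage; the real code also re-verifies the sha
    }
    let dir = final_path.parent().expect("chunk paths always have a parent");
    std::fs::create_dir_all(dir)?;
    let mut tmp = tempfile::NamedTempFile::new_in(dir)?;
    tmp.write_all(body)?;
    tmp.as_file().sync_all()?; // flush file bytes to disk first
    // Atomic rename onto the final name; only after this succeeds does the
    // SQLite tx commit the pointer row, so a row never points at a missing file.
    tmp.persist(final_path).map_err(|e| e.error)?;
    // Review item #8: on Unix the parent directory is also fsynced after the
    // rename so the new dirent itself survives a crash (omitted here).
    Ok(())
}
```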
Bucketing. `bucket_by_participants(messages)` produces source trees keyed by the sorted set of distinct participants (`from` ∪ `to_list`). All correspondence between Alice and Bob lands in one tree regardless of subject thread, regardless of direction.
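A sketch of the bucketing key derivation (the message shape is simplified for illustration; the real `bucket_by_participants` works on the canonicalised email types):

```rust
use std::collections::{BTreeMap, BTreeSet};

struct Msg {
    from: String,
    to_list: Vec<String>,
}

fn bucket_by_participants(messages: Vec<Msg>) -> BTreeMap<String, Vec<Msg>> {
    let mut buckets: BTreeMap<String, Vec<Msg>> = BTreeMap::new();
    for msg in messages {
        // BTreeSet gives dedup + deterministic (sorted) ordering in one go,
        // so direction doesn't matter: {alice, bob} == {bob, alice}.
        let mut participants = BTreeSet::new();
        participants.insert(msg.from.to_lowercase());
        participants.extend(msg.to_list.iter().map(|a| a.to_lowercase()));
        let key = participants.into_iter().collect::<Vec<_>>().join("+");
        buckets.entry(key).or_default().push(msg);
    }
    buckets
}

fn main() {
    let a2b = Msg { from: "alice@x.com".into(), to_list: vec!["bob@x.com".into()] };
    let b2a = Msg { from: "bob@x.com".into(), to_list: vec!["alice@x.com".into()] };
    assert_eq!(bucket_by_participants(vec![a2b, b2a]).len(), 1);
}
```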
Chunker. Message-aware splitter dispatches on `SourceKind`. Chat splits at H2 headings, email at `---\nFrom:` blocks, document falls through to paragraph splitting. Greedy-pack messages up to `DEFAULT_CHUNK_MAX_TOKENS = 3000` (sized below `TOKEN_BUDGET = 4500` so each L0 buffer accumulates ~1-3 chunks before sealing — natural pacing for the local 1B summariser, which produces noticeably better summaries with ≤4-5k input than at the previous 10k cap).
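A sketch of the greedy pack-to-budget loop (the token heuristic and string-based unit type are assumptions; the real chunker dispatches on `SourceKind` first and sub-splits oversized single units):

```rust
const DEFAULT_CHUNK_MAX_TOKENS: u32 = 3000;

fn approx_token_count(s: &str) -> u32 {
    // Common cheap heuristic (~4 chars per token); the real chunker may differ.
    (s.chars().count() as u32 + 3) / 4
}

fn pack_units(units: Vec<String>) -> Vec<String> {
    let mut chunks = Vec::new();
    let (mut acc, mut acc_tokens) = (String::new(), 0u32);
    for unit in units {
        let t = approx_token_count(&unit);
        // Flush before appending if this unit would cross the budget,
        // so whole messages stay together inside a chunk when they fit.
        if acc_tokens > 0 && acc_tokens + t > DEFAULT_CHUNK_MAX_TOKENS {
            chunks.push(std::mem::take(&mut acc));
            acc_tokens = 0;
        }
        if !acc.is_empty() {
            acc.push_str("\n\n");
        }
        acc.push_str(&unit);
        acc_tokens += t;
    }
    if !acc.is_empty() {
        chunks.push(acc); // emit the trailing partial chunk
    }
    chunks
}
```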
Email body cleanup. `canonicalize/email.rs` now calls `email_clean::clean_body` on each message body — strips quoted reply chains (On <date>, <name> wrote:), forwarded-message separators, and footer noise (Unsubscribe, copyright lines, address blocks).
Summary MD path layout.
- `summaries/source/<participants_slug>/L<n>/<id>.md` — per source tree
- `summaries/global/<yyyy-mm-dd>/L<n>/<id>.md` — per daily digest
- `summaries/topic/<entity_slug>/L<n>/<id>.md` — per topic tree (lazy-materialised)
Tag rewrites. Post-extract handler walks `mem_tree_entity_index` rows for the just-sealed node, formats each as `<kind>/<surface_slug>` (Obsidian hierarchical tag format), atomically rewrites the YAML `tags:` block in the on-disk file. Body bytes pass through byte-for-byte unchanged. Summary extraction emits topics (`emit_topics=true`) so summary tags include both entities and topics; chunk extraction emits entities only.
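A sketch of the `<kind>/<surface_slug>` formatting (the slug rules here are simplified assumptions, not the repo's exact slugifier):

```rust
fn slugify(surface: &str) -> String {
    surface
        .chars()
        .map(|c| if c.is_alphanumeric() { c } else { '-' })
        .collect::<String>()
        .trim_matches('-')
        .to_string()
}

fn format_tag(kind: &str, surface: &str) -> String {
    format!("{kind}/{}", slugify(surface))
}

fn main() {
    // Obsidian renders these as nested tags in its tag pane.
    assert_eq!(format_tag("person", "Alex Johnson"), "person/Alex-Johnson");
    assert_eq!(format_tag("topic", "incident 2604"), "topic/incident-2604");
}
```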
Bounded retry on transport failures. `LlmEntityExtractor::extract` now retries up to 3 times with exponential backoff (250ms / 500ms / 1000ms) on transport-level failures before falling back to empty extraction. Caught a single Ollama hiccup during validation that left one summary's tags empty — would now self-heal.
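A sketch of that retry schedule (blocking for illustration; the real extractor is async and only retries transport-level failures):

```rust
use std::time::Duration;

fn extract_with_retry<T>(mut call: impl FnMut() -> anyhow::Result<T>) -> Option<T> {
    const BACKOFFS_MS: [u64; 3] = [250, 500, 1000];
    let mut attempt = 0;
    loop {
        match call() {
            Ok(v) => return Some(v),
            Err(e) if attempt < BACKOFFS_MS.len() => {
                log::warn!("extract attempt {} failed: {e:#}", attempt + 1);
                std::thread::sleep(Duration::from_millis(BACKOFFS_MS[attempt]));
                attempt += 1;
            }
            // Out of retries: caller falls back to empty extraction.
            Err(_) => return None,
        }
    }
}
```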
person/Alex-Johnson,org/Tinyhumans-AI,topic/incident-2604,gmail,ingested) so the content directory is directly browsable as an Obsidian vault —tags:field renders as clickable nested tags in the tag pane.Submission Checklist
memory::tree, 60 incomposio::providers::gmail. Coverage includes participants bucketing, message-aware chunker boundaries, content_store atomic write, sha verification, summary compose with all three tree kinds, tag rewrite (entities + topics merge, body bytes unchanged sanity check), retry semantics on transport failure.cargo run --bin gmail-backfill-3d -- --days 3against real Gmail via Composio. Validated 40 chunks + 2 source-tree L1 summaries; every file's body sha re-verified against the storedcontent_sha256(100% match). Manual digest trigger via existingmemory_tree.trigger_digestRPC.content_store/. Inline comments on the path-collision invariant, the atomic-write ordering, the body-immutable / front-matter-mutable split, and the_solo_bucketing fallback for messages without participants.Impact
Runtime. Desktop only (Tauri shell unaffected; this is core-side). The async-pipeline foundation enables non-blocking ingest. Per-chunk + per-seal MD writes are local fs operations (tens of microseconds); negligible compared to the LLM call cost they accompany.
Performance. Chunker max 3k + seal budget 4.5k produces ~1.5× more chunks per ingest than the previous 10k cap (acceptable for fine-grained retrieval), and ~3× more L1 seals (good for summary quality on the local 1B model — it produces noticeably better summaries on smaller inputs).
Migration. Schema migrations are additive + nullable (
content_path TEXT,content_sha256 TEXTon bothmem_tree_chunksandmem_tree_summaries). Legacy rows read back withcontent_path = NULLand full text insummary— read paths fall back gracefully. New writes always go through file-first / pointer-in-row.Compatibility.
mem_tree_summaries.summarycolumn shrinks to a 500-char preview at write time (mirrorsmem_tree_chunks.text). Existing rows untouched. Re-ingest after this PR rotates chunk_ids (chunker output changes) — bin's--wipeflag clears state for clean re-ingest.Security. No new external surfaces. Filesystem writes scoped to
content_root(env > config > workspace default). Atomic write contract closes the partial-write window that would have shown half-baked summaries to consumers.Related
ExtractSummaryjob kind to remove the seal-time inline extraction call entirely (today: 3-attempt retry; ideal: queue-backed retry with cascade-aware ordering)self_emailconfig knob to drop the user's address from participant bucket keys (cleaner folder names)mem_tree_entity_indexfrom MD front-matterchunker_max_tokensandseal_token_budgetconfig-driven so operators can tune without recompilingSummary by CodeRabbit
New Features
Improvements