feat(memory): Phase 1 memory tree - multi-source ingestion & canonical chunks (#707) #732
…l chunks (tinyhumansai#707)

Adds an isolated memory tree layer under src/openhuman/memory/tree/ implementing Phase 1 of the new memory architecture (umbrella tinyhumansai#711). Zero edits to existing memory/*.rs files - the new layer coexists with the legacy TinyHumans-backed client.

- Source adapters: chat / email / document -> canonical Markdown
- Token-bounded chunker with deterministic SHA-256 chunk IDs
- SQLite persistence at <workspace>/memory_tree/chunks.db with full provenance metadata (source_kind, source_id, owner, timestamps, tags, time_range) and back-pointer to raw source
- Unified JSON-RPC ingest (dispatches on source_kind + JSON payload): openhuman.memory_tree_ingest, _list_chunks, _get_chunk
- DataSource enum covering the 8 providers from m.excalidraw step 1 (Discord/Telegram/Whatsapp/Gmail/OtherEmail/Notion/MeetingNotes/DriveDocs)
- ~40 unit tests (chunk ID stability, UTF-8-safe splitting, canonicalisation idempotence, store round-trip, filter behavior)

Additive only: new tables in a new DB file, new JSON-RPC namespace, no existing behavior changes. Feeds tinyhumansai#708 (scoring), tinyhumansai#709 (summary trees), tinyhumansai#710 (query tools).

Closes tinyhumansai#707. Parent: tinyhumansai#711.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
📝 Walkthrough
Adds a new "memory_tree" ingestion subsystem: canonicalizers for chat/email/doc, markdown chunker, deterministic chunk IDs, SQLite persistence, JSON-RPC handlers and controller/schema registrations, plus end-to-end tests and registry integration.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client as Client
    participant Controller as memory_tree Controller
    participant Canonicalizer as Canonicalizer (chat/email/doc)
    participant Chunker as Markdown Chunker
    participant Store as SQLite Store
    Client->>Controller: JSON-RPC ingest request
    Controller->>Canonicalizer: canonicalise(source_id, owner, tags, payload)
    alt empty source
        Canonicalizer-->>Controller: Ok(None)
        Controller-->>Client: IngestResult { chunks_written: 0 }
    else non-empty
        Canonicalizer-->>Controller: Ok(CanonicalisedSource)
        Controller->>Chunker: chunk_markdown(markdown, metadata, opts)
        Chunker-->>Controller: Vec<Chunk>
        Controller->>Store: upsert_chunks(chunks)
        Store-->>Controller: rows_written
        Controller-->>Client: IngestResult { chunks_written, chunk_ids }
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
Actionable comments posted: 9
🧹 Nitpick comments (1)
src/openhuman/memory/tree/schemas.rs (1)
21-44: Please add the JSON-RPC E2E for this namespace before merge. This file wires a public transport surface, so the riskiest failures are end-to-end: registry aggregation, param deserialization, and response shaping. The current unit coverage in the leaf modules will not exercise that path.
Based on learnings: "Follow the feature design workflow in order: (1) specify against existing codebase, (2) implement in Rust with unit tests, (3) add JSON-RPC E2E tests, (4) build React UI with core_rpc_relay/coreRpcClient, (5) add app unit tests, (6) add desktop E2E specs"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/schemas.rs` around lines 21 - 44, Add a JSON-RPC end-to-end test exercising the public namespace wired up by all_controller_schemas() and all_registered_controllers(): write a test that starts the JSON-RPC transport/server with the same RegisteredController entries (schemas("ingest"), schemas("list_chunks"), schemas("get_chunk") wired to handle_ingest, handle_list_chunks, handle_get_chunk), sends real JSON-RPC requests for each method, verifies param deserialization and the full response shaping and error paths, and asserts registry aggregation works (all three methods are reachable); place the test alongside existing e2e tests and reuse the same schemas() helper and RegisteredController registration to ensure parity with production wiring.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/core/all.rs`:
- Around line 119-120: The namespace "memory_tree" has been registered via
controllers.extend(crate::openhuman::memory::all_memory_tree_registered_controllers())
but namespace_description() in src/core/all.rs lacks a matching arm, so CLI/help
shows no summary; update namespace_description() to add a "memory_tree" match
arm (similar style to other arms) that returns a short descriptive string for
the memory_tree namespace, ensuring the function covers "memory_tree" and any
related display helpers that consume namespace_description().
In `@src/openhuman/memory/tree/canonicalize/mod.rs`:
- Around line 16-18: Add a shared normalizer in the canonicalization layer that
trims and converts blank/whitespace-only source refs to None so adapters can't
persist Some("") pointers; implement a helper (e.g. canonicalize_source_ref or
normalize_source_ref) near the canonicalize module and call it wherever
Metadata.source_ref is set or transformed (reference types::Metadata and the
places noted around lines 16 and 32-44) so that any Option<String> source_ref is
trimmed and turned to None if empty.
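A minimal sketch of such a helper, assuming the `Option<String>` shape described in the comment (the name and the actual `SourceRef` wrapper type are illustrative, not taken from the codebase):

```rust
// Illustrative normalizer: trim a raw source ref and collapse
// blank/whitespace-only values to None so Some("") never persists.
fn normalize_source_ref(value: Option<String>) -> Option<String> {
    value
        .map(|v| v.trim().to_owned())
        .filter(|v| !v.is_empty())
}

fn main() {
    assert_eq!(normalize_source_ref(Some("   ".into())), None);
    assert_eq!(
        normalize_source_ref(Some(" notion://page/x ".into())),
        Some("notion://page/x".to_string())
    );
    assert_eq!(normalize_source_ref(None), None);
    println!("ok");
}
```

Calling this once at every point that sets `Metadata.source_ref` keeps the chat, email, and document adapters consistent.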
In `@src/openhuman/memory/tree/chunker.rs`:
- Around line 89-118: split_by_token_budget currently allows max_tokens == 0
which leads to hard_split_by_chars(..., 0) emitting a leading empty chunk and
incorrect behavior; clamp or reject zero up-front. In split_by_token_budget,
normalize the incoming max_tokens to at least 1 (e.g., let max_tokens =
max_tokens.max(1)) before calling approx_token_count, computing max_chars, or
invoking pack_segments/hard_split_by_chars so subsequent logic
(approx_token_count, pack_segments, hard_split_by_chars) never receives 0 and no
empty leading chunk is produced.
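As a hedged illustration of the clamp suggested above (the function name and surrounding structure are assumed, not the project's actual code), the normalization is a one-liner applied before any splitting math runs:

```rust
// Illustrative sketch only: promote a zero token budget to 1 so the
// downstream char-budget math never sees 0 and cannot emit an empty
// leading chunk. The real split_by_token_budget has more context.
fn clamp_token_budget(max_tokens: usize) -> usize {
    max_tokens.max(1)
}

fn main() {
    // A zero budget is promoted to the minimum useful budget.
    assert_eq!(clamp_token_budget(0), 1);
    // Normal budgets pass through unchanged.
    assert_eq!(clamp_token_budget(512), 512);
    println!("ok");
}
```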
In `@src/openhuman/memory/tree/ingest.rs`:
- Around line 99-103: The debug log is printing raw user content (doc.title);
update the log in the ingest path (the log::debug call referencing source_id and
doc.title) to avoid logging the raw title — log a redacted indicator such as
whether a title exists and its length (e.g., title.is_some() and title.map(|t|
t.len())) or a boolean like has_title instead of the full doc.title; keep
source_id as-is and ensure you do not include any raw PII in the log message.
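A small sketch of the redaction, with hypothetical helper and field names (only the shape matters: log presence and length instead of the raw title):

```rust
// Hypothetical helper: derive loggable facts about a title without
// exposing the raw user content itself.
fn title_log_fields(title: Option<&str>) -> (bool, usize) {
    (title.is_some(), title.map_or(0, str::len))
}

fn main() {
    let (has_title, title_len) = title_log_fields(Some("Launch Plan"));
    // A log line would carry source_id plus these two fields only.
    assert!(has_title);
    assert_eq!(title_len, 11);
    assert_eq!(title_log_fields(None), (false, 0));
    println!("ok");
}
```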
In `@src/openhuman/memory/tree/schemas.rs`:
- Around line 91-117: The outputs schema for ingest_rpc is incorrectly nested
under a single "result" object; change the outputs Vec in the schema (where
FieldSchema currently has name "result" and TypeSchema::Object) to instead
declare three top-level FieldSchema entries named "source_id"
(TypeSchema::String), "chunks_written" (TypeSchema::U64) and "chunk_ids"
(TypeSchema::Array(Box::new(TypeSchema::String))) so the schema matches
RpcOutcome<IngestResult>'s flattened shape; keep the existing comments and
required flags from the nested fields when you move them out.
In `@src/openhuman/memory/tree/store.rs`:
- Around line 128-160: In list_chunks (function list_chunks, variable sql)
change the ORDER BY clause to add seq_in_source as a secondary sort key so equal
timestamp_ms rows are deterministic; replace "ORDER BY timestamp_ms DESC LIMIT
?" with "ORDER BY timestamp_ms DESC, seq_in_source ASC LIMIT ?" to preserve
source order for multi-piece chunks.
- Around line 124-125: The RPC-provided struct field `limit: Option<usize>` must
be validated and safely converted before being used as an SQLite LIMIT; clamp
the requested value to a sane maximum (e.g. define a MAX_RESULTS constant such
as 10_000), default to 100 when None, and then convert using i64::try_from to
detect overflow. Locate the code that reads `limit` (the `limit` field of the
store/tree request and the place where it is cast to i64 for the SQL query) and
replace the direct cast with logic that: 1) maps None -> 100, 2) clamps Some(n)
to the range [1, MAX_RESULTS], 3) uses i64::try_from(clamped) and handles Err by
falling back to MAX_RESULTS (or returning an error), and 4) uses that safe i64
value for the SQLite LIMIT. Ensure no negative or overflowing values can reach
the SQL layer.
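Sketched under the assumptions in the comment (MAX_RESULTS = 10_000 and the default of 100 are the suggested values, not existing code):

```rust
// Illustrative sanitizer for an RPC-provided Option<usize> limit:
// default, clamp, then a checked conversion so no negative or
// overflowing value can reach the SQLite LIMIT clause.
const MAX_RESULTS: usize = 10_000;

fn sql_limit(requested: Option<usize>) -> i64 {
    let clamped = requested.unwrap_or(100).clamp(1, MAX_RESULTS);
    // try_from cannot fail for values <= 10_000, but the fallback
    // keeps the function total if MAX_RESULTS ever grows.
    i64::try_from(clamped).unwrap_or(MAX_RESULTS as i64)
}

fn main() {
    assert_eq!(sql_limit(None), 100);
    assert_eq!(sql_limit(Some(0)), 1);
    assert_eq!(sql_limit(Some(usize::MAX)), 10_000);
    println!("ok");
}
```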
- Around line 237-246: The with_connection function currently opens a rusqlite
Connection but doesn't configure concurrency settings; update with_connection to
set a busy timeout and enable WAL mode after opening the DB but before
initializing the schema: call
conn.set_busy_timeout(Some(Duration::from_secs(5))) (or an appropriate duration)
and execute the PRAGMA "PRAGMA journal_mode=WAL;" on the Connection (e.g., via
conn.execute_batch or conn.execute) before running conn.execute_batch(SCHEMA);
keep the same error contexts on failures so any errors from set_busy_timeout or
enabling WAL surface in logs.
In `@src/openhuman/memory/tree/types.rs`:
- Around line 8-10: The module docs incorrectly state chunk IDs use blake3 but
the implementation in chunk_id() uses SHA-256; update the top-of-module doc
comment to state the correct algorithm and truncation (SHA-256(source_kind |
source_id | seq) truncated to 32 hex chars) and/or add a short Rustdoc note
above the chunk_id() function that documents the exact input concatenation and
truncation behavior so docs and code match (or, if you prefer blake3, change the
chunk_id() implementation to use blake3 consistently—pick one and make both doc
and function agree).
---
Nitpick comments:
In `@src/openhuman/memory/tree/schemas.rs`:
- Around line 21-44: Add a JSON-RPC end-to-end test exercising the public
namespace wired up by all_controller_schemas() and all_registered_controllers():
write a test that starts the JSON-RPC transport/server with the same
RegisteredController entries (schemas("ingest"), schemas("list_chunks"),
schemas("get_chunk") wired to handle_ingest, handle_list_chunks,
handle_get_chunk), sends real JSON-RPC requests for each method, verifies param
deserialization and the full response shaping and error paths, and asserts
registry aggregation works (all three methods are reachable); place the test
alongside existing e2e tests and reuse the same schemas() helper and
RegisteredController registration to ensure parity with production wiring.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 7925714b-906f-4b33-bdea-9e2232dd9e18
📒 Files selected for processing (13)
- src/core/all.rs
- src/openhuman/memory/mod.rs
- src/openhuman/memory/tree/canonicalize/chat.rs
- src/openhuman/memory/tree/canonicalize/document.rs
- src/openhuman/memory/tree/canonicalize/email.rs
- src/openhuman/memory/tree/canonicalize/mod.rs
- src/openhuman/memory/tree/chunker.rs
- src/openhuman/memory/tree/ingest.rs
- src/openhuman/memory/tree/mod.rs
- src/openhuman/memory/tree/rpc.rs
- src/openhuman/memory/tree/schemas.rs
- src/openhuman/memory/tree/store.rs
- src/openhuman/memory/tree/types.rs
```rust
use serde::{Deserialize, Serialize};

use crate::openhuman::memory::tree::types::Metadata;
```
Normalize blank source_ref values once in the shared canonicalization layer.
The sibling adapters currently turn Some("") / whitespace-only refs into SourceRef, which means we can persist a provenance pointer that points nowhere. A small shared helper here keeps chat/email/document consistent and preserves source-link integrity.
♻️ Suggested helper
```diff
 use serde::{Deserialize, Serialize};
-use crate::openhuman::memory::tree::types::Metadata;
+use crate::openhuman::memory::tree::types::{Metadata, SourceRef};
@@
 pub struct CanonicaliseRequest<P> {
@@
     #[serde(default)]
     pub tags: Vec<String>,
 }
+
+pub(super) fn normalise_source_ref(value: Option<String>) -> Option<SourceRef> {
+    value
+        .map(|value| value.trim().to_owned())
+        .filter(|value| !value.is_empty())
+        .map(SourceRef::new)
+}
```

Also applies to: 32-44
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/openhuman/memory/tree/canonicalize/mod.rs` around lines 16 - 18, Add a
shared normalizer in the canonicalization layer that trims and converts
blank/whitespace-only source refs to None so adapters can't persist Some("")
pointers; implement a helper (e.g. canonicalize_source_ref or
normalize_source_ref) near the canonicalize module and call it wherever
Metadata.source_ref is set or transformed (reference types::Metadata and the
places noted around lines 16 and 32-44) so that any Option<String> source_ref is
trimmed and turned to None if empty.
- Added support for "memory_tree" namespace description in `namespace_description`.
- Implemented clamping for zero token budget in `chunk_markdown` to prevent empty leading chunks.
- Introduced detailed logging for document ingestion, including title length.
- Updated output schemas in `schemas.rs` for improved clarity.
- Enhanced chunk listing with clamping limits and ordering by sequence in `list_chunks`.
- Normalized source references in canonicalization functions to drop blank values.
- Added comprehensive tests for new features and edge cases in chunking and canonicalization.

These changes improve the robustness and usability of the memory tree layer, aligning with ongoing development efforts for Phase 1 of the memory architecture.
Actionable comments posted: 2
🧹 Nitpick comments (1)
tests/json_rpc_e2e.rs (1)
793-797: Prefer explicit method-name assertions for public API stability. Because `expected_methods` is derived from `memory_tree_schema(...)`, a rename in schema + registry could still pass this check while breaking external clients expecting fixed names. Consider asserting the three canonical strings directly.

Hardening suggestion
```diff
-    let expected_methods = vec![
-        rpc_method_name(&memory_tree_schema("ingest")),
-        rpc_method_name(&memory_tree_schema("list_chunks")),
-        rpc_method_name(&memory_tree_schema("get_chunk")),
-    ];
+    let expected_methods = vec![
+        "openhuman.memory_tree_ingest".to_string(),
+        "openhuman.memory_tree_list_chunks".to_string(),
+        "openhuman.memory_tree_get_chunk".to_string(),
+    ];
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/json_rpc_e2e.rs` around lines 793 - 797, The test currently builds expected_methods by calling rpc_method_name(&memory_tree_schema(...)) which masks schema/registry renames; replace that derivation with explicit string assertions for the three canonical RPC names (the literal method names that external clients rely on) instead of using memory_tree_schema or rpc_method_name to construct them; update the test around the expected_methods variable (and any assertions that compare to it) to assert the three fixed string values directly so public API names remain stable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/openhuman/memory/tree/chunker.rs`:
- Around line 82-88: The doc comment in src/openhuman/memory/tree/chunker.rs
lists sentence- and word-boundary splitting (items 3 and 4) but the
implementation only performs paragraph, line, and hard-character cuts; either
implement those strategies in the chunking code (e.g., add sentence and word
split logic to the Chunker::split / chunking function using a proper sentence
tokenizer or unicode-segmentation word/sentence boundaries and fallbacks that
preserve UTF-8 code points) or update the doc comment to accurately list only
paragraph, line, and hard-character splitting; pick one approach and make the
code and the doc comment consistent (refer to the Chunker type and its
split/chunk method when applying the change).
In `@src/openhuman/memory/tree/schemas.rs`:
- Around line 70-74: The schema currently marks the field name "owner" with
TypeSchema::String and required: true, but tree_rpc::IngestRequest accepts a
missing owner via #[serde(default)], so update the schema to match runtime
behavior: change the "owner" field to be optional (remove required: true or set
required: false / wrap in an Option-type schema) so it aligns with
IngestRequest's #[serde(default)] handling; reference the "owner" field
definition and the tree_rpc::IngestRequest serde default attribute when making
the change.
---
Nitpick comments:
In `@tests/json_rpc_e2e.rs`:
- Around line 793-797: The test currently builds expected_methods by calling
rpc_method_name(&memory_tree_schema(...)) which masks schema/registry renames;
replace that derivation with explicit string assertions for the three canonical
RPC names (the literal method names that external clients rely on) instead of
using memory_tree_schema or rpc_method_name to construct them; update the test
around the expected_methods variable (and any assertions that compare to it) to
assert the three fixed string values directly so public API names remain stable.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 88708079-b359-4b7d-bd78-d4ee66b57fe5
📒 Files selected for processing (11)
- src/core/all.rs
- src/openhuman/memory/tree/canonicalize/chat.rs
- src/openhuman/memory/tree/canonicalize/document.rs
- src/openhuman/memory/tree/canonicalize/email.rs
- src/openhuman/memory/tree/canonicalize/mod.rs
- src/openhuman/memory/tree/chunker.rs
- src/openhuman/memory/tree/ingest.rs
- src/openhuman/memory/tree/schemas.rs
- src/openhuman/memory/tree/store.rs
- src/openhuman/memory/tree/types.rs
- tests/json_rpc_e2e.rs
✅ Files skipped from review due to trivial changes (1)
- src/openhuman/memory/tree/types.rs
🚧 Files skipped from review as they are similar to previous changes (6)
- src/core/all.rs
- src/openhuman/memory/tree/ingest.rs
- src/openhuman/memory/tree/canonicalize/chat.rs
- src/openhuman/memory/tree/canonicalize/email.rs
- src/openhuman/memory/tree/canonicalize/document.rs
- src/openhuman/memory/tree/store.rs
🧹 Nitpick comments (2)
tests/json_rpc_e2e.rs (1)
810-867: Add a re-ingest idempotency check to this E2E flow. This path verifies ingest/list/get, but it doesn't yet assert the Phase-1 contract that re-ingesting identical content is stable and idempotent (same chunk IDs, no duplicate chunks).
💡 Proposed test extension
```diff
     let ingest_outer = assert_no_jsonrpc_error(&ingest, "memory_tree_ingest");
     let ingest_result = ingest_outer.get("result").unwrap_or(ingest_outer);
@@
     let chunk_ids = ingest_result
         .get("chunk_ids")
         .and_then(Value::as_array)
         .expect("chunk_ids array");
     assert_eq!(chunk_ids.len(), 1);
+    let chunk_ids_first = chunk_ids.clone();
+
+    // Re-ingest identical payload: should be idempotent/stable.
+    let ingest_again = post_json_rpc(
+        &rpc_base,
+        205,
+        &expected_methods[0],
+        json!({
+            "source_kind": "document",
+            "source_id": "notion:launch-plan",
+            "owner": "alice@example.com",
+            "tags": ["planning", "launch"],
+            "payload": {
+                "provider": "notion",
+                "title": "Launch Plan",
+                "body": "Alpha\n\nBeta",
+                "modified_at": 1700000000000_i64,
+                "source_ref": " notion://page/launch-plan "
+            }
+        }),
+    ).await;
+    let ingest_again_outer = assert_no_jsonrpc_error(&ingest_again, "memory_tree_ingest_reingest");
+    let ingest_again_result = ingest_again_outer.get("result").unwrap_or(ingest_again_outer);
+    let chunk_ids_again = ingest_again_result
+        .get("chunk_ids")
+        .and_then(Value::as_array)
+        .expect("chunk_ids array");
+    assert_eq!(chunk_ids_again, &chunk_ids_first, "chunk ids should remain stable across re-ingest");
@@
     assert_eq!(chunks.len(), 1);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/json_rpc_e2e.rs` around lines 810 - 867, Add an idempotency check after the initial ingest/list assertions: call post_json_rpc again with the same payload using the same expected_methods[0] (memory_tree_ingest) and verify via assert_no_jsonrpc_error that the returned ingest result contains the same "source_id" and identical "chunk_ids" as the first ingest (no new chunk IDs), and then call memory_tree_list_chunks (expected_methods[1]) again and assert the "chunks" array length is unchanged (still 1) and that the chunk at index 0 still has seq_in_source == 0 and the same pointer("/metadata/source_ref/value"); use the existing variables ingest_result, chunk_ids, list_result, and chunk to compare old vs new values to ensure re-ingest is stable and idempotent.

src/openhuman/memory/tree/chunker.rs (1)
123-138: Consider using `sep.chars().count()` for consistency. Line 124 uses `sep.len()` (byte length) while line 137 adds it to `current.chars().count()` (character count). For the ASCII separators `"\n"` and `"\n\n"` used internally, bytes equal chars so this works correctly. However, using `sep.chars().count()` would make the units consistent throughout.

🔧 Proposed fix for consistency
```diff
 fn pack_segments(segments: &[&str], sep: &str, max_chars: usize) -> Option<Vec<String>> {
-    let sep_len = sep.len();
+    let sep_len = sep.chars().count();
     let mut out: Vec<String> = Vec::new();
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/chunker.rs` around lines 123 - 138, In pack_segments change the separator length calculation to use character count rather than byte length for consistency: replace the sep_len = sep.len() usage with sep_chars = sep.chars().count() (and use that variable where sep_len is referenced) so all length math in pack_segments (e.g., projected = current.chars().count() + sep_chars + seg_len and any comparisons) uses character counts consistently; update the variable name (sep_chars) and its references in the function.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@src/openhuman/memory/tree/chunker.rs`:
- Around line 123-138: In pack_segments change the separator length calculation
to use character count rather than byte length for consistency: replace the
sep_len = sep.len() usage with sep_chars = sep.chars().count() (and use that
variable where sep_len is referenced) so all length math in pack_segments (e.g.,
projected = current.chars().count() + sep_chars + seg_len and any comparisons)
uses character counts consistently; update the variable name (sep_chars) and its
references in the function.
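Why this only matters for non-ASCII separators can be seen directly; a standalone illustration, not the project's code:

```rust
fn main() {
    // For the ASCII separators the chunker uses, bytes == chars:
    assert_eq!("\n\n".len(), "\n\n".chars().count());

    // A multi-byte separator would diverge: "→" is 3 bytes, 1 char.
    let sep = "→";
    assert_eq!(sep.len(), 3);
    assert_eq!(sep.chars().count(), 1);
    println!("ok");
}
```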
In `@tests/json_rpc_e2e.rs`:
- Around line 810-867: Add an idempotency check after the initial ingest/list
assertions: call post_json_rpc again with the same payload using the same
expected_methods[0] (memory_tree_ingest) and verify via assert_no_jsonrpc_error
that the returned ingest result contains the same "source_id" and identical
"chunk_ids" as the first ingest (no new chunk IDs), and then call
memory_tree_list_chunks (expected_methods[1]) again and assert the "chunks"
array length is unchanged (still 1) and that the chunk at index 0 still has
seq_in_source == 0 and the same pointer("/metadata/source_ref/value"); use the
existing variables ingest_result, chunk_ids, list_result, and chunk to compare
old vs new values to ensure re-ingest is stable and idempotent.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d11a345a-a8ab-4ed6-8d5f-d9be10410ce8
📒 Files selected for processing (3)
- src/openhuman/memory/tree/chunker.rs
- src/openhuman/memory/tree/schemas.rs
- tests/json_rpc_e2e.rs
✅ Files skipped from review due to trivial changes (1)
- src/openhuman/memory/tree/schemas.rs
… gate (#708) (#733)

* feat(memory): phase 1 memory tree - multi-source ingestion & canonical chunks (#707)

Adds an isolated memory tree layer under src/openhuman/memory/tree/ implementing Phase 1 of the new memory architecture (umbrella #711). Zero edits to existing memory/*.rs files - the new layer coexists with the legacy TinyHumans-backed client.

- Source adapters: chat / email / document -> canonical Markdown
- Token-bounded chunker with deterministic SHA-256 chunk IDs
- SQLite persistence at <workspace>/memory_tree/chunks.db with full provenance metadata (source_kind, source_id, owner, timestamps, tags, time_range) and back-pointer to raw source
- Unified JSON-RPC ingest (dispatches on source_kind + JSON payload): openhuman.memory_tree_ingest, _list_chunks, _get_chunk
- DataSource enum covering the 8 providers from m.excalidraw step 1 (Discord/Telegram/Whatsapp/Gmail/OtherEmail/Notion/MeetingNotes/DriveDocs)
- ~40 unit tests (chunk ID stability, UTF-8-safe splitting, canonicalisation idempotence, store round-trip, filter behavior)

Additive only: new tables in a new DB file, new JSON-RPC namespace, no existing behavior changes. Feeds #708 (scoring), #709 (summary trees), #710 (query tools).

Closes #707. Parent: #711.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(memory): phase 2 memory tree - preprocessing, scoring, admission gate (#708)

Adds the scoring / admission layer between Phase 1's chunker and store. Stacked on feat/707-memory-ingestion (PR #732) - depends on Phase 1's chunk substrate.

- Pluggable EntityExtractor trait + CompositeExtractor chain
- RegexEntityExtractor: mechanical entities (emails, URLs, @Handles, #hashtags) - always on, deterministic, zero deps, UTF-8-safe char spans
- Five weighted signals: token count, unique-word ratio, metadata weight, source weight (per-DataSource), interaction (reply/sent/mention/dm tags), entity density
- Exact-match entity canonicalisation (email lowercased, @ and # stripped)
- Admission gate drops chunks below configurable threshold (default 0.3)
- Score rationale persists for EVERY chunk (kept or dropped) for debugging
- Entities indexed for KEPT chunks only
- Two new SQLite tables added to the memory_tree DB:
  - mem_tree_score: per-chunk score rationale with all signal values
  - mem_tree_entity_index: inverted index entity_id -> node_id
- Idempotent ALTER TABLE migration adds embedding BLOB column to mem_tree_chunks (used in Phase 3 retrieval, wired but not populated here)
- Ingest pipeline converted to async to accommodate the extractor trait; blocking SQLite work isolated on spawn_blocking; JSON-RPC surface unchanged (same memory_tree_ingest / list / get methods)
- Phase 2 deliberately ships without GLiNER/semantic NER - per-chunk semantic entities land later behind a cargo feature flag; the composite extractor interface keeps that drop-in trivial

Additive only: new tables, new columns, new module. Existing Phase 1 behavior unchanged except that low-signal chunks are now dropped before reaching mem_tree_chunks. Raise score_drop_threshold to 0 to disable the gate and restore Phase-1-identical behavior.

Closes #708. Parent: #711. Depends on: #707 (#732).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix memory tree scoring persistence issues

* Fix memory tree scoring robustness issues from PR review

- ingest: fail fast if scorer returns fewer/more results than chunks (silent zip truncation would drop chunks or their score rationale)
- score::persist_score{,_tx}: clear stale entity-index rows before re-indexing a re-scored chunk, since INSERT OR REPLACE never deletes rows whose entity_id is no longer in the new extraction
- score::store::lookup_entity: clamp limit to i64::MAX before casting to prevent a large usize wrapping into a negative LIMIT

Adds clear_entity_index_drops_stale_rows regression test.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Steven Enamakel <enamakel@tinyhumans.ai>
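The exact-match entity canonicalisation mentioned in the Phase 2 notes above (email lowercased, @ and # stripped) can be sketched as follows. This is a hedged approximation: how entities are classified upstream by the regex extractor is assumed, not taken from the actual code.

```rust
// Illustrative canonicaliser: lowercase emails, strip the sigil from
// @handles and #hashtags, leave everything else untouched.
fn canonical_entity(raw: &str) -> String {
    let s = raw.trim();
    if s.starts_with('@') || s.starts_with('#') {
        s.trim_start_matches(|c| c == '@' || c == '#').to_string()
    } else if s.contains('@') {
        // Treat as an email address: exact match after lowercasing.
        s.to_ascii_lowercase()
    } else {
        s.to_string()
    }
}

fn main() {
    assert_eq!(canonical_entity("Alice@Example.COM"), "alice@example.com");
    assert_eq!(canonical_entity("@Handle"), "Handle");
    assert_eq!(canonical_entity("#Hashtag"), "Hashtag");
    println!("ok");
}
```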
Summary
- New isolated layer under src/openhuman/memory/tree/ implementing Phase 1 of the memory architecture umbrella ([Feature] Memory Architecture: Source-grounded wiki memory with phased ingestion, scoring, and traversal #711).
- Unified JSON-RPC ingest dispatching on source_kind, plus list and get surfaces.
- Zero edits to existing memory/*.rs files — the new layer lives alongside the legacy TinyHumans-backed client. Additive only, no existing behavior changes.

Problem
OpenHuman currently lacks a unified ingestion layer for long-lived personal/team knowledge across messaging, email, and documents. Without a normalized substrate, later memory features (filtering, hierarchical summarization, retrieval) can't be built on a consistent foundation. Issue #707 tracks this foundation.
Solution
New module src/openhuman/memory/tree/ containing:

- types.rs — Chunk, Metadata, SourceKind, SourceRef, DataSource enum (8 providers from m.excalidraw), deterministic SHA-256 chunk IDs
- canonicalize/{chat,email,document}.rs — source adapters to canonical Markdown
- chunker.rs — token-bounded Markdown chunker
- store.rs — SQLite persistence at <workspace>/memory_tree/chunks.db; idempotent upsert on chunk.id
- ingest.rs — ingest pipeline
- rpc.rs + schemas.rs — JSON-RPC surface for the memory_tree namespace
- mod.rs — module wiring

Exposed JSON-RPC methods (controller registry):
- openhuman.memory_tree_ingest — unified ingest; caller supplies source_kind + JSON payload (shape dispatches on kind)
- openhuman.memory_tree_list_chunks — filter by kind/source_id/owner/time window
- openhuman.memory_tree_get_chunk — fetch by deterministic id

Chunk ID: sha256(source_kind | source_id | seq) truncated to 32 hex chars. Stable across re-ingest; upserts are idempotent.

Design decisions / tradeoffs:
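The ID derivation can be sketched as follows. The hashing step itself (SHA-256 via an external crate such as sha2) is elided here, and the literal '|' byte layout of the pre-image is an assumption for illustration:

```rust
// Illustrative sketch of the chunk-ID pre-image. The real pipeline
// hashes this kind of concatenation with SHA-256 and keeps the first
// 32 hex characters of the digest as the chunk ID.
fn chunk_id_input(source_kind: &str, source_id: &str, seq: u64) -> String {
    format!("{source_kind}|{source_id}|{seq}")
}

fn main() {
    // Same inputs always yield the same pre-image, so the truncated
    // digest (the chunk ID) is stable across re-ingest.
    let a = chunk_id_input("document", "notion:launch-plan", 0);
    let b = chunk_id_input("document", "notion:launch-plan", 0);
    assert_eq!(a, b);
    assert_eq!(a, "document|notion:launch-plan|0");
    println!("ok");
}
```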
- Kept inside the existing memory module (memory/) per reviewer guidance — the legacy memory client and this new tree layer coexist until a future phase consolidates them. Avoids a risky rewrite of in-flight functionality.
- A single unified ingest instead of separate _ingest_chat/_ingest_email/_ingest_document — one JSON-RPC method with typed source_kind + generic payload that the handler validates. Simpler surface, easier to extend, no collision with the existing memory.doc_ingest.

Submission Checklist
- E2E tests (tests/memory_tree_e2e.rs) planned in a follow-up commit on this branch before marking ready-for-review
- Module-level //! + per-item /// on public types and non-obvious logic
- [memory_tree::*] debug logs throughout

Impact
- New memory_tree namespace; new SQLite DB file at <workspace>/memory_tree/chunks.db created lazily on first ingest. No auto-execution, no startup subscribers — new code paths fire only when explicitly called.
- Blocking SQLite work runs on tokio::task::spawn_blocking so the runtime thread isn't blocked.
- Existing memory namespace methods and tables are untouched.

Related
- MEMORY_ARCHITECTURE_LLD.md in the vezures workspace root (not committed to this repo)

Summary by CodeRabbit
New Features
Tests