
feat(memory): Phase 1 memory tree - multi-source ingestion & canonical chunks (#707)#732

Merged
senamakel merged 3 commits into tinyhumansai:main from sanil-23:feat/707-memory-ingestion
Apr 21, 2026

Conversation

Contributor

@sanil-23 sanil-23 commented Apr 21, 2026

Summary

Problem

OpenHuman currently lacks a unified ingestion layer for long-lived personal/team knowledge across messaging, email, and documents. Without a normalized substrate, later memory features (filtering, hierarchical summarization, retrieval) can't be built on a consistent foundation. Issue #707 tracks this foundation.

Solution

New module src/openhuman/memory/tree/ containing:

  • types.rs: Chunk, Metadata, SourceKind, SourceRef, DataSource enum (8 providers from m.excalidraw), deterministic SHA-256 chunk IDs
  • canonicalize/{chat,email,document}.rs: per-source-kind adapters → canonical Markdown
  • chunker.rs: paragraph-preferred, UTF-8-safe splitting with a ≤10k-token bound (aligned with the future #709 summariser budget)
  • store.rs: SQLite schema + CRUD at <workspace>/memory_tree/chunks.db; idempotent upsert on chunk.id
  • ingest.rs: orchestrator: canonicalise → chunk → persist
  • rpc.rs + schemas.rs: JSON-RPC surface under the memory_tree namespace
  • mod.rs: module exports

Exposed JSON-RPC methods (controller registry):

  • openhuman.memory_tree_ingest — unified ingest; caller supplies source_kind + JSON payload (shape dispatches on kind)
  • openhuman.memory_tree_list_chunks — filter by kind/source_id/owner/time window
  • openhuman.memory_tree_get_chunk — fetch by deterministic id
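
For illustration, a document ingest might carry params shaped like the payload exercised in the E2E test later in this thread. This is a hedged sketch rather than the authoritative schema; the exact payload fields are defined in rpc.rs and depend on source_kind:

fn example_ingest_params() -> serde_json::Value {
    // Hypothetical params for openhuman.memory_tree_ingest; values are illustrative.
    serde_json::json!({
        "source_kind": "document",            // selects the document canonicaliser
        "source_id": "notion:launch-plan",    // stable identifier for the raw source
        "owner": "alice@example.com",
        "tags": ["planning", "launch"],
        "payload": {                          // shape dispatches on source_kind
            "provider": "notion",
            "title": "Launch Plan",
            "body": "Alpha\n\nBeta",
            "modified_at": 1700000000000_i64,
            "source_ref": "notion://page/launch-plan"
        }
    })
}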

Chunk ID: sha256(source_kind | source_id | seq) truncated to 32 hex chars. Stable across re-ingest; upserts are idempotent.
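
A minimal sketch of that ID scheme, assuming the sha2 crate; the "|" delimiter is illustrative, since the PR text only fixes the inputs, the hash, and the 32-hex-char truncation:

use sha2::{Digest, Sha256};

fn chunk_id(source_kind: &str, source_id: &str, seq: u32) -> String {
    // sha256(source_kind | source_id | seq), hex-encoded, first 32 chars kept
    let mut hasher = Sha256::new();
    hasher.update(source_kind.as_bytes());
    hasher.update(b"|");
    hasher.update(source_id.as_bytes());
    hasher.update(b"|");
    hasher.update(seq.to_string().as_bytes());
    let hex: String = hasher.finalize().iter().map(|b| format!("{b:02x}")).collect();
    // Same inputs always yield the same ID, which is what makes upserts idempotent.
    hex[..32].to_string()
}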

Design decisions / tradeoffs:

Submission Checklist

  • Unit tests — ~40 tests co-located in each module covering chunk ID stability, UTF-8-safe splitting, canonicalisation idempotence, store round-trip, filter behavior, and re-ingest idempotence
  • E2E / integration — JSON-RPC integration test (tests/memory_tree_e2e.rs) planned in a follow-up commit on this branch before marking ready-for-review
  • N/A — not applicable to this PR
  • Doc comments — module-level //! + per-item /// on public types and non-obvious logic
  • Inline comments — hot paths and invariants annotated; structured [memory_tree::*] debug logs throughout

Impact

  • Runtime: additive only. Adds three JSON-RPC methods under a new memory_tree namespace; new SQLite DB file at <workspace>/memory_tree/chunks.db created lazily on first ingest. No auto-execution, no startup subscribers — new code paths fire only when explicitly called.
  • Platform: desktop; no platform-specific code.
  • Performance: ingest is sync per chunk (CPU-bound canonicalisation + small SQLite write); JSON-RPC handlers use tokio::task::spawn_blocking so the runtime thread isn't blocked (a minimal sketch of this pattern follows this list).
  • Security: no external network calls, no secret handling. SQLite file inherits workspace permissions.
  • Migration: none — new tables in a new DB file.
  • Compatibility: fully backward compatible. Existing memory namespace methods and tables are untouched.
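
A minimal sketch of that spawn_blocking pattern; write_chunks_blocking is a stand-in for the real store call in store.rs, not its actual API:

// Stand-in for the synchronous canonicalise/chunk/SQLite work done per ingest.
fn write_chunks_blocking(chunks: Vec<String>) -> anyhow::Result<usize> {
    Ok(chunks.len())
}

// Handler-side pattern: run blocking work on tokio's blocking pool so the async
// runtime thread servicing JSON-RPC requests never stalls on SQLite.
async fn persist_off_runtime(chunks: Vec<String>) -> anyhow::Result<usize> {
    tokio::task::spawn_blocking(move || write_chunks_blocking(chunks)).await?
}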

Related

Summary by CodeRabbit

  • New Features

    • Unified memory ingestion for chats, emails, and documents with canonicalization, deterministic chunking, and durable queryable storage.
    • JSON-RPC endpoints to ingest data, list chunks, and fetch individual chunks; the new memory_tree namespace is registered and discoverable.
  • Tests

    • End-to-end JSON-RPC tests validating controller registration, ingestion, listing, retrieval, and error cases.

…l chunks (tinyhumansai#707)

Adds an isolated memory tree layer under src/openhuman/memory/tree/
implementing Phase 1 of the new memory architecture (umbrella tinyhumansai#711).
Zero edits to existing memory/*.rs files - the new layer coexists
with the legacy TinyHumans-backed client.

- Source adapters: chat / email / document -> canonical Markdown
- Token-bounded chunker with deterministic SHA-256 chunk IDs
- SQLite persistence at <workspace>/memory_tree/chunks.db with full
  provenance metadata (source_kind, source_id, owner, timestamps,
  tags, time_range) and back-pointer to raw source
- Unified JSON-RPC ingest (dispatches on source_kind + JSON payload):
  openhuman.memory_tree_ingest, _list_chunks, _get_chunk
- DataSource enum covering the 8 providers from m.excalidraw step 1
  (Discord/Telegram/Whatsapp/Gmail/OtherEmail/Notion/MeetingNotes/DriveDocs)
- ~40 unit tests (chunk ID stability, UTF-8-safe splitting,
  canonicalisation idempotence, store round-trip, filter behavior)

Additive only: new tables in a new DB file, new JSON-RPC namespace,
no existing behavior changes. Feeds tinyhumansai#708 (scoring), tinyhumansai#709 (summary
trees), tinyhumansai#710 (query tools).

Closes tinyhumansai#707. Parent: tinyhumansai#711.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

coderabbitai Bot commented Apr 21, 2026

📝 Walkthrough

Adds a new "memory_tree" ingestion subsystem: canonicalizers for chat/email/doc, markdown chunker, deterministic chunk IDs, SQLite persistence, JSON-RPC handlers and controller/schema registrations, plus end-to-end tests and registry integration.

Changes

  • Registry Integration (src/core/all.rs): Appends all_memory_tree_registered_controllers() to registered controllers, adds all_memory_tree_controller_schemas() to declared schemas, and provides a memory_tree namespace description.
  • Module Re-exports (src/openhuman/memory/mod.rs): Adds pub mod tree; and re-exports all_memory_tree_controller_schemas, all_memory_tree_registered_controllers.
  • Module Entry & Re-exports (src/openhuman/memory/tree/mod.rs): New module root exposing submodules (canonicalize, chunker, ingest, rpc, schemas, store, types) and re-exporting schema/registration aggregators.
  • Canonicalizers (src/openhuman/memory/tree/canonicalize/*): New canonicalizers chat.rs, email.rs, document.rs produce CanonicalisedSource; mod.rs defines CanonicaliseRequest, CanonicalisedSource, normalize_source_ref.
  • Chunking (src/openhuman/memory/tree/chunker.rs): Implements markdown chunking by token budget with paragraph→line→UTF-8 fallback, deterministic chunk IDs, token estimation, and chunk metadata sequencing.
  • Ingestion Orchestration (src/openhuman/memory/tree/ingest.rs): Orchestrates ingest flows (ingest_chat, ingest_email, ingest_document): canonicalize, chunk, persist via store, return IngestResult.
  • JSON-RPC Handlers (src/openhuman/memory/tree/rpc.rs): Adds RPC request/response types and async handlers ingest_rpc, list_chunks_rpc, get_chunk_rpc with spawn_blocking for blocking work and payload deserialization.
  • Controller Schemas & Registration (src/openhuman/memory/tree/schemas.rs): Defines memory_tree namespace controller schemas (ingest, list_chunks, get_chunk) and registers handlers that parse params and invoke RPC module functions.
  • Persistence Layer (src/openhuman/memory/tree/store.rs): SQLite-backed storage at <workspace>/memory_tree/chunks.db with lazy init, WAL, schema; implements upsert_chunks, get_chunk, list_chunks, count_chunks, filtering, ordering and limit clamping.
  • Canonical Types (src/openhuman/memory/tree/types.rs): Defines SourceKind, DataSource, SourceRef, Metadata, Chunk, and utilities chunk_id() and approx_token_count().
  • End-to-end Test (tests/json_rpc_e2e.rs): Adds e2e test asserting controller registration and exercising openhuman.memory_tree_ingest, ..._list_chunks, ..._get_chunk flows and error cases.

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client
    participant Controller as memory_tree Controller
    participant Canonicalizer as Canonicalizer (chat/email/doc)
    participant Chunker as Markdown Chunker
    participant Store as SQLite Store

    Client->>Controller: JSON-RPC ingest request
    Controller->>Canonicalizer: canonicalise(source_id, owner, tags, payload)
    alt empty source
        Canonicalizer-->>Controller: Ok(None)
        Controller-->>Client: IngestResult { chunks_written: 0 }
    else non-empty
        Canonicalizer-->>Controller: Ok(CanonicalisedSource)
        Controller->>Chunker: chunk_markdown(markdown, metadata, opts)
        Chunker-->>Controller: Vec<Chunk>
        Controller->>Store: upsert_chunks(chunks)
        Store-->>Controller: rows_written
        Controller-->>Client: IngestResult { chunks_written, chunk_ids }
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

Possibly related PRs

Poem

🐇 I hop through chats, threads, pages bright,
I stitch them into chunks by moonlight,
SQLite nests each little leaf,
Metadata roots hold provenance brief,
Hooray — the memory tree takes flight!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Description Check (✅ Passed): Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title check (✅ Passed): The title clearly and specifically describes the main change: implementing Phase 1 of the memory tree with multi-source ingestion and canonical chunks, referencing the related issue #707.
  • Linked Issues check (✅ Passed): All major objectives from issue #707 are addressed: memory-ingestion domain structure, support for chat/email/document sources, canonical Markdown conversion, token-bounded chunking with deterministic IDs, comprehensive provenance metadata, source-link pointers, JSON-RPC exposure, and unit tests covering normalization and metadata.
  • Out of Scope Changes check (✅ Passed): All changes are directly aligned with Phase 1 memory tree ingestion scope. New module tree/ with canonicalize, chunker, ingest, rpc, schemas, store, and types; no unrelated modifications to existing code besides registry integration in src/core/all.rs.
  • Docstring Coverage (✅ Passed): Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 9

🧹 Nitpick comments (1)
src/openhuman/memory/tree/schemas.rs (1)

21-44: Please add the JSON-RPC E2E for this namespace before merge.

This file wires a public transport surface, so the riskiest failures are end-to-end: registry aggregation, param deserialization, and response shaping. The current unit coverage in the leaf modules will not exercise that path.

Based on learnings: "Follow the feature design workflow in order: (1) specify against existing codebase, (2) implement in Rust with unit tests, (3) add JSON-RPC E2E tests, (4) build React UI with core_rpc_relay / coreRpcClient, (5) add app unit tests, (6) add desktop E2E specs"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/memory/tree/schemas.rs` around lines 21 - 44, Add a JSON-RPC
end-to-end test exercising the public namespace wired up by
all_controller_schemas() and all_registered_controllers(): write a test that
starts the JSON-RPC transport/server with the same RegisteredController entries
(schemas("ingest"), schemas("list_chunks"), schemas("get_chunk") wired to
handle_ingest, handle_list_chunks, handle_get_chunk), sends real JSON-RPC
requests for each method, verifies param deserialization and the full response
shaping and error paths, and asserts registry aggregation works (all three
methods are reachable); place the test alongside existing e2e tests and reuse
the same schemas() helper and RegisteredController registration to ensure parity
with production wiring.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/core/all.rs`:
- Around line 119-120: The namespace "memory_tree" has been registered via
controllers.extend(crate::openhuman::memory::all_memory_tree_registered_controllers())
but namespace_description() in src/core/all.rs lacks a matching arm, so CLI/help
shows no summary; update namespace_description() to add a "memory_tree" match
arm (similar style to other arms) that returns a short descriptive string for
the memory_tree namespace, ensuring the function covers "memory_tree" and any
related display helpers that consume namespace_description().

In `@src/openhuman/memory/tree/canonicalize/mod.rs`:
- Around line 16-18: Add a shared normalizer in the canonicalization layer that
trims and converts blank/whitespace-only source refs to None so adapters can't
persist Some("") pointers; implement a helper (e.g. canonicalize_source_ref or
normalize_source_ref) near the canonicalize module and call it wherever
Metadata.source_ref is set or transformed (reference types::Metadata and the
places noted around lines 16 and 32-44) so that any Option<String> source_ref is
trimmed and turned to None if empty.

In `@src/openhuman/memory/tree/chunker.rs`:
- Around line 89-118: split_by_token_budget currently allows max_tokens == 0
which leads to hard_split_by_chars(..., 0) emitting a leading empty chunk and
incorrect behavior; clamp or reject zero up-front. In split_by_token_budget,
normalize the incoming max_tokens to at least 1 (e.g., let max_tokens =
max_tokens.max(1)) before calling approx_token_count, computing max_chars, or
invoking pack_segments/hard_split_by_chars so subsequent logic
(approx_token_count, pack_segments, hard_split_by_chars) never receives 0 and no
empty leading chunk is produced.
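
A sketch of that clamp together with the UTF-8-safe hard cut it protects; function bodies are illustrative (the real split first tries paragraph and line packing), and the 4-chars-per-token heuristic is an assumption:

fn split_by_token_budget_sketch(text: &str, max_tokens: usize) -> Vec<String> {
    // Clamp before any downstream math so a zero budget can't yield an empty leading chunk.
    let max_tokens = max_tokens.max(1);
    let max_chars = max_tokens * 4; // rough chars-per-token estimate (assumption)
    hard_split_by_chars_sketch(text, max_chars)
}

fn hard_split_by_chars_sketch(text: &str, max_chars: usize) -> Vec<String> {
    // Cut on char boundaries so a multi-byte UTF-8 sequence is never split.
    text.chars()
        .collect::<Vec<char>>()
        .chunks(max_chars.max(1))
        .map(|chunk| chunk.iter().collect())
        .collect()
}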

In `@src/openhuman/memory/tree/ingest.rs`:
- Around line 99-103: The debug log is printing raw user content (doc.title);
update the log in the ingest path (the log::debug call referencing source_id and
doc.title) to avoid logging the raw title — log a redacted indicator such as
whether a title exists and its length (e.g., title.is_some() and title.map(|t|
t.len())) or a boolean like has_title instead of the full doc.title; keep
source_id as-is and ensure you do not include any raw PII in the log message.
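
A sketch of the redacted form; the log line and field names are illustrative, not the crate's actual output:

fn log_document_ingest(source_id: &str, title: &Option<String>) {
    // Log presence and length of the title rather than its raw content.
    log::debug!(
        "[memory_tree::ingest] document source_id={} has_title={} title_len={}",
        source_id,
        title.is_some(),
        title.as_deref().map_or(0, str::len)
    );
}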

In `@src/openhuman/memory/tree/schemas.rs`:
- Around line 91-117: The outputs schema for ingest_rpc is incorrectly nested
under a single "result" object; change the outputs Vec in the schema (where
FieldSchema currently has name "result" and TypeSchema::Object) to instead
declare three top-level FieldSchema entries named "source_id"
(TypeSchema::String), "chunks_written" (TypeSchema::U64) and "chunk_ids"
(TypeSchema::Array(Box::new(TypeSchema::String))) so the schema matches
RpcOutcome<IngestResult>'s flattened shape; keep the existing comments and
required flags from the nested fields when you move them out.

In `@src/openhuman/memory/tree/store.rs`:
- Around line 128-160: In list_chunks (function list_chunks, variable sql)
change the ORDER BY clause to add seq_in_source as a secondary sort key so equal
timestamp_ms rows are deterministic; replace "ORDER BY timestamp_ms DESC LIMIT
?" with "ORDER BY timestamp_ms DESC, seq_in_source ASC LIMIT ?" to preserve
source order for multi-piece chunks.
- Around line 124-125: The RPC-provided struct field `limit: Option<usize>` must
be validated and safely converted before being used as an SQLite LIMIT; clamp
the requested value to a sane maximum (e.g. define a MAX_RESULTS constant such
as 10_000), default to 100 when None, and then convert using i64::try_from to
detect overflow. Locate the code that reads `limit` (the `limit` field of the
store/tree request and the place where it is cast to i64 for the SQL query) and
replace the direct cast with logic that: 1) maps None -> 100, 2) clamps Some(n)
to the range [1, MAX_RESULTS], 3) uses i64::try_from(clamped) and handles Err by
falling back to MAX_RESULTS (or returning an error), and 4) uses that safe i64
value for the SQLite LIMIT. Ensure no negative or overflowing values can reach
the SQL layer.
- Around line 237-246: The with_connection function currently opens a rusqlite
Connection but doesn't configure concurrency settings; update with_connection to
set a busy timeout and enable WAL mode after opening the DB but before
initializing the schema: call
conn.set_busy_timeout(Some(Duration::from_secs(5))) (or an appropriate duration)
and execute the PRAGMA "PRAGMA journal_mode=WAL;" on the Connection (e.g., via
conn.execute_batch or conn.execute) before running conn.execute_batch(SCHEMA);
keep the same error contexts on failures so any errors from set_busy_timeout or
enabling WAL surface in logs.
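
A sketch of that connection setup, assuming rusqlite (busy_timeout and execute_batch are rusqlite methods; SCHEMA here is a stand-in for the real schema constant in store.rs):

use std::time::Duration;
use rusqlite::Connection;

// Stand-in for the real table definitions in store.rs.
const SCHEMA: &str = "CREATE TABLE IF NOT EXISTS mem_tree_chunks (id TEXT PRIMARY KEY);";

fn open_configured(path: &std::path::Path) -> anyhow::Result<Connection> {
    let conn = Connection::open(path)?;
    // Wait up to 5s on a locked database instead of failing immediately.
    conn.busy_timeout(Duration::from_secs(5))?;
    // WAL lets readers proceed while a writer holds the lock.
    conn.execute_batch("PRAGMA journal_mode=WAL;")?;
    conn.execute_batch(SCHEMA)?;
    Ok(conn)
}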

In `@src/openhuman/memory/tree/types.rs`:
- Around line 8-10: The module docs incorrectly state chunk IDs use blake3 but
the implementation in chunk_id() uses SHA-256; update the top-of-module doc
comment to state the correct algorithm and truncation (SHA-256(source_kind |
source_id | seq) truncated to 32 hex chars) and/or add a short Rustdoc note
above the chunk_id() function that documents the exact input concatenation and
truncation behavior so docs and code match (or, if you prefer blake3, change the
chunk_id() implementation to use blake3 consistently—pick one and make both doc
and function agree).

---

Nitpick comments:
In `@src/openhuman/memory/tree/schemas.rs`:
- Around line 21-44: Add a JSON-RPC end-to-end test exercising the public
namespace wired up by all_controller_schemas() and all_registered_controllers():
write a test that starts the JSON-RPC transport/server with the same
RegisteredController entries (schemas("ingest"), schemas("list_chunks"),
schemas("get_chunk") wired to handle_ingest, handle_list_chunks,
handle_get_chunk), sends real JSON-RPC requests for each method, verifies param
deserialization and the full response shaping and error paths, and asserts
registry aggregation works (all three methods are reachable); place the test
alongside existing e2e tests and reuse the same schemas() helper and
RegisteredController registration to ensure parity with production wiring.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7925714b-906f-4b33-bdea-9e2232dd9e18

📥 Commits

Reviewing files that changed from the base of the PR and between a29056f and 135fd95.

📒 Files selected for processing (13)
  • src/core/all.rs
  • src/openhuman/memory/mod.rs
  • src/openhuman/memory/tree/canonicalize/chat.rs
  • src/openhuman/memory/tree/canonicalize/document.rs
  • src/openhuman/memory/tree/canonicalize/email.rs
  • src/openhuman/memory/tree/canonicalize/mod.rs
  • src/openhuman/memory/tree/chunker.rs
  • src/openhuman/memory/tree/ingest.rs
  • src/openhuman/memory/tree/mod.rs
  • src/openhuman/memory/tree/rpc.rs
  • src/openhuman/memory/tree/schemas.rs
  • src/openhuman/memory/tree/store.rs
  • src/openhuman/memory/tree/types.rs

Comment thread src/core/all.rs
Comment thread src/openhuman/memory/tree/canonicalize/mod.rs
Comment on lines +16 to +18
use serde::{Deserialize, Serialize};

use crate::openhuman::memory::tree::types::Metadata;
Contributor


⚠️ Potential issue | 🟡 Minor

Normalize blank source_ref values once in the shared canonicalization layer.

The sibling adapters currently turn Some("") / whitespace-only refs into SourceRef, which means we can persist a provenance pointer that points nowhere. A small shared helper here keeps chat/email/document consistent and preserves source-link integrity.

♻️ Suggested helper
 use serde::{Deserialize, Serialize};
 
-use crate::openhuman::memory::tree::types::Metadata;
+use crate::openhuman::memory::tree::types::{Metadata, SourceRef};
@@
 pub struct CanonicaliseRequest<P> {
@@
     #[serde(default)]
     pub tags: Vec<String>,
 }
+
+pub(super) fn normalise_source_ref(value: Option<String>) -> Option<SourceRef> {
+    value
+        .map(|value| value.trim().to_owned())
+        .filter(|value| !value.is_empty())
+        .map(SourceRef::new)
+}

Also applies to: 32-44

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/memory/tree/canonicalize/mod.rs` around lines 16 - 18, Add a
shared normalizer in the canonicalization layer that trims and converts
blank/whitespace-only source refs to None so adapters can't persist Some("")
pointers; implement a helper (e.g. canonicalize_source_ref or
normalize_source_ref) near the canonicalize module and call it wherever
Metadata.source_ref is set or transformed (reference types::Metadata and the
places noted around lines 16 and 32-44) so that any Option<String> source_ref is
trimmed and turned to None if empty.

Comment thread src/openhuman/memory/tree/chunker.rs
Comment thread src/openhuman/memory/tree/ingest.rs
Comment thread src/openhuman/memory/tree/schemas.rs Outdated
Comment thread src/openhuman/memory/tree/store.rs
Comment thread src/openhuman/memory/tree/store.rs Outdated
Comment thread src/openhuman/memory/tree/store.rs
Comment thread src/openhuman/memory/tree/types.rs Outdated
- Added support for "memory_tree" namespace description in `namespace_description`.
- Implemented clamping for zero token budget in `chunk_markdown` to prevent empty leading chunks.
- Introduced detailed logging for document ingestion, including title length.
- Updated output schemas in `schemas.rs` for improved clarity.
- Enhanced chunk listing with clamping limits and ordering by sequence in `list_chunks`.
- Normalized source references in canonicalization functions to drop blank values.
- Added comprehensive tests for new features and edge cases in chunking and canonicalization.

These changes improve the robustness and usability of the memory tree layer, aligning with ongoing development efforts for Phase 1 of the memory architecture.
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
tests/json_rpc_e2e.rs (1)

793-797: Prefer explicit method-name assertions for public API stability.

Because expected_methods is derived from memory_tree_schema(...), a rename in schema + registry could still pass this check while breaking external clients expecting fixed names. Consider asserting the three canonical strings directly.

Hardening suggestion
-    let expected_methods = vec![
-        rpc_method_name(&memory_tree_schema("ingest")),
-        rpc_method_name(&memory_tree_schema("list_chunks")),
-        rpc_method_name(&memory_tree_schema("get_chunk")),
-    ];
+    let expected_methods = vec![
+        "openhuman.memory_tree_ingest".to_string(),
+        "openhuman.memory_tree_list_chunks".to_string(),
+        "openhuman.memory_tree_get_chunk".to_string(),
+    ];
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/json_rpc_e2e.rs` around lines 793 - 797, The test currently builds
expected_methods by calling rpc_method_name(&memory_tree_schema(...)) which
masks schema/registry renames; replace that derivation with explicit string
assertions for the three canonical RPC names (the literal method names that
external clients rely on) instead of using memory_tree_schema or rpc_method_name
to construct them; update the test around the expected_methods variable (and any
assertions that compare to it) to assert the three fixed string values directly
so public API names remain stable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/openhuman/memory/tree/chunker.rs`:
- Around line 82-88: The doc comment in src/openhuman/memory/tree/chunker.rs
lists sentence- and word-boundary splitting (items 3 and 4) but the
implementation only performs paragraph, line, and hard-character cuts; either
implement those strategies in the chunking code (e.g., add sentence and word
split logic to the Chunker::split / chunking function using a proper sentence
tokenizer or unicode-segmentation word/sentence boundaries and fallbacks that
preserve UTF-8 code points) or update the doc comment to accurately list only
paragraph, line, and hard-character splitting; pick one approach and make the
code and the doc comment consistent (refer to the Chunker type and its
split/chunk method when applying the change).

In `@src/openhuman/memory/tree/schemas.rs`:
- Around line 70-74: The schema currently marks the field name "owner" with
TypeSchema::String and required: true, but tree_rpc::IngestRequest accepts a
missing owner via #[serde(default)], so update the schema to match runtime
behavior: change the "owner" field to be optional (remove required: true or set
required: false / wrap in an Option-type schema) so it aligns with
IngestRequest's #[serde(default)] handling; reference the "owner" field
definition and the tree_rpc::IngestRequest serde default attribute when making
the change.
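
For reference, the runtime behavior described above comes from serde's default handling; this is a simplified stand-in for tree_rpc::IngestRequest, not its real field set:

use serde::Deserialize;

#[derive(Deserialize)]
struct IngestRequestSketch {
    source_kind: String,
    source_id: String,
    // A missing "owner" deserializes to "" instead of erroring, so the schema
    // should not mark the field as required.
    #[serde(default)]
    owner: String,
    #[serde(default)]
    tags: Vec<String>,
}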

---

Nitpick comments:
In `@tests/json_rpc_e2e.rs`:
- Around line 793-797: The test currently builds expected_methods by calling
rpc_method_name(&memory_tree_schema(...)) which masks schema/registry renames;
replace that derivation with explicit string assertions for the three canonical
RPC names (the literal method names that external clients rely on) instead of
using memory_tree_schema or rpc_method_name to construct them; update the test
around the expected_methods variable (and any assertions that compare to it) to
assert the three fixed string values directly so public API names remain stable.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 88708079-b359-4b7d-bd78-d4ee66b57fe5

📥 Commits

Reviewing files that changed from the base of the PR and between 135fd95 and 9724d2b.

📒 Files selected for processing (11)
  • src/core/all.rs
  • src/openhuman/memory/tree/canonicalize/chat.rs
  • src/openhuman/memory/tree/canonicalize/document.rs
  • src/openhuman/memory/tree/canonicalize/email.rs
  • src/openhuman/memory/tree/canonicalize/mod.rs
  • src/openhuman/memory/tree/chunker.rs
  • src/openhuman/memory/tree/ingest.rs
  • src/openhuman/memory/tree/schemas.rs
  • src/openhuman/memory/tree/store.rs
  • src/openhuman/memory/tree/types.rs
  • tests/json_rpc_e2e.rs
✅ Files skipped from review due to trivial changes (1)
  • src/openhuman/memory/tree/types.rs
🚧 Files skipped from review as they are similar to previous changes (6)
  • src/core/all.rs
  • src/openhuman/memory/tree/ingest.rs
  • src/openhuman/memory/tree/canonicalize/chat.rs
  • src/openhuman/memory/tree/canonicalize/email.rs
  • src/openhuman/memory/tree/canonicalize/document.rs
  • src/openhuman/memory/tree/store.rs

Comment thread src/openhuman/memory/tree/chunker.rs Outdated
Comment thread src/openhuman/memory/tree/schemas.rs
Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (2)
tests/json_rpc_e2e.rs (1)

810-867: Add a re-ingest idempotency check to this E2E flow.

This path verifies ingest/list/get, but it doesn’t yet assert the Phase-1 contract that re-ingesting identical content is stable and idempotent (same chunk IDs, no duplicate chunks).

💡 Proposed test extension
     let ingest_outer = assert_no_jsonrpc_error(&ingest, "memory_tree_ingest");
     let ingest_result = ingest_outer.get("result").unwrap_or(ingest_outer);
@@
     let chunk_ids = ingest_result
         .get("chunk_ids")
         .and_then(Value::as_array)
         .expect("chunk_ids array");
     assert_eq!(chunk_ids.len(), 1);
+    let chunk_ids_first = chunk_ids.clone();
+
+    // Re-ingest identical payload: should be idempotent/stable.
+    let ingest_again = post_json_rpc(
+        &rpc_base,
+        205,
+        &expected_methods[0],
+        json!({
+            "source_kind": "document",
+            "source_id": "notion:launch-plan",
+            "owner": "alice@example.com",
+            "tags": ["planning", "launch"],
+            "payload": {
+                "provider": "notion",
+                "title": "Launch Plan",
+                "body": "Alpha\n\nBeta",
+                "modified_at": 1700000000000_i64,
+                "source_ref": " notion://page/launch-plan "
+            }
+        }),
+    ).await;
+    let ingest_again_outer = assert_no_jsonrpc_error(&ingest_again, "memory_tree_ingest_reingest");
+    let ingest_again_result = ingest_again_outer.get("result").unwrap_or(ingest_again_outer);
+    let chunk_ids_again = ingest_again_result
+        .get("chunk_ids")
+        .and_then(Value::as_array)
+        .expect("chunk_ids array");
+    assert_eq!(chunk_ids_again, &chunk_ids_first, "chunk ids should remain stable across re-ingest");
@@
     assert_eq!(chunks.len(), 1);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/json_rpc_e2e.rs` around lines 810 - 867, Add an idempotency check after
the initial ingest/list assertions: call post_json_rpc again with the same
payload using the same expected_methods[0] (memory_tree_ingest) and verify via
assert_no_jsonrpc_error that the returned ingest result contains the same
"source_id" and identical "chunk_ids" as the first ingest (no new chunk IDs),
and then call memory_tree_list_chunks (expected_methods[1]) again and assert the
"chunks" array length is unchanged (still 1) and that the chunk at index 0 still
has seq_in_source == 0 and the same pointer("/metadata/source_ref/value"); use
the existing variables ingest_result, chunk_ids, list_result, and chunk to
compare old vs new values to ensure re-ingest is stable and idempotent.
src/openhuman/memory/tree/chunker.rs (1)

123-138: Consider using sep.chars().count() for consistency.

Line 124 uses sep.len() (byte length) while line 137 adds it to current.chars().count() (character count). For the ASCII separators "\n" and "\n\n" used internally, bytes equal chars so this works correctly. However, using sep.chars().count() would make the units consistent throughout.

🔧 Proposed fix for consistency
 fn pack_segments(segments: &[&str], sep: &str, max_chars: usize) -> Option<Vec<String>> {
-    let sep_len = sep.len();
+    let sep_len = sep.chars().count();
     let mut out: Vec<String> = Vec::new();
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/memory/tree/chunker.rs` around lines 123 - 138, In
pack_segments change the separator length calculation to use character count
rather than byte length for consistency: replace the sep_len = sep.len() usage
with sep_chars = sep.chars().count() (and use that variable where sep_len is
referenced) so all length math in pack_segments (e.g., projected =
current.chars().count() + sep_chars + seg_len and any comparisons) uses
character counts consistently; update the variable name (sep_chars) and its
references in the function.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/openhuman/memory/tree/chunker.rs`:
- Around line 123-138: In pack_segments change the separator length calculation
to use character count rather than byte length for consistency: replace the
sep_len = sep.len() usage with sep_chars = sep.chars().count() (and use that
variable where sep_len is referenced) so all length math in pack_segments (e.g.,
projected = current.chars().count() + sep_chars + seg_len and any comparisons)
uses character counts consistently; update the variable name (sep_chars) and its
references in the function.

In `@tests/json_rpc_e2e.rs`:
- Around line 810-867: Add an idempotency check after the initial ingest/list
assertions: call post_json_rpc again with the same payload using the same
expected_methods[0] (memory_tree_ingest) and verify via assert_no_jsonrpc_error
that the returned ingest result contains the same "source_id" and identical
"chunk_ids" as the first ingest (no new chunk IDs), and then call
memory_tree_list_chunks (expected_methods[1]) again and assert the "chunks"
array length is unchanged (still 1) and that the chunk at index 0 still has
seq_in_source == 0 and the same pointer("/metadata/source_ref/value"); use the
existing variables ingest_result, chunk_ids, list_result, and chunk to compare
old vs new values to ensure re-ingest is stable and idempotent.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d11a345a-a8ab-4ed6-8d5f-d9be10410ce8

📥 Commits

Reviewing files that changed from the base of the PR and between 9724d2b and 3528d2b.

📒 Files selected for processing (3)
  • src/openhuman/memory/tree/chunker.rs
  • src/openhuman/memory/tree/schemas.rs
  • tests/json_rpc_e2e.rs
✅ Files skipped from review due to trivial changes (1)
  • src/openhuman/memory/tree/schemas.rs

@senamakel senamakel merged commit 0ce1253 into tinyhumansai:main Apr 21, 2026
7 of 8 checks passed
senamakel added a commit that referenced this pull request Apr 21, 2026
… gate (#708) (#733)

* feat(memory): phase 1 memory tree - multi-source ingestion & canonical chunks (#707)

Adds an isolated memory tree layer under src/openhuman/memory/tree/
implementing Phase 1 of the new memory architecture (umbrella #711).
Zero edits to existing memory/*.rs files - the new layer coexists
with the legacy TinyHumans-backed client.

- Source adapters: chat / email / document -> canonical Markdown
- Token-bounded chunker with deterministic SHA-256 chunk IDs
- SQLite persistence at <workspace>/memory_tree/chunks.db with full
  provenance metadata (source_kind, source_id, owner, timestamps,
  tags, time_range) and back-pointer to raw source
- Unified JSON-RPC ingest (dispatches on source_kind + JSON payload):
  openhuman.memory_tree_ingest, _list_chunks, _get_chunk
- DataSource enum covering the 8 providers from m.excalidraw step 1
  (Discord/Telegram/Whatsapp/Gmail/OtherEmail/Notion/MeetingNotes/DriveDocs)
- ~40 unit tests (chunk ID stability, UTF-8-safe splitting,
  canonicalisation idempotence, store round-trip, filter behavior)

Additive only: new tables in a new DB file, new JSON-RPC namespace,
no existing behavior changes. Feeds #708 (scoring), #709 (summary
trees), #710 (query tools).

Closes #707. Parent: #711.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(memory): phase 2 memory tree - preprocessing, scoring, admission gate (#708)

Adds the scoring / admission layer between Phase 1's chunker and store.
Stacked on feat/707-memory-ingestion (PR #732) - depends on Phase 1's
chunk substrate.

- Pluggable EntityExtractor trait + CompositeExtractor chain
- RegexEntityExtractor: mechanical entities (emails, URLs, @Handles, #hashtags)
  - Always on, deterministic, zero deps, UTF-8-safe char spans
- Five weighted signals: token count, unique-word ratio, metadata weight,
  source weight (per-DataSource), interaction (reply/sent/mention/dm tags),
  entity density
- Exact-match entity canonicalisation (email lowercased, @ and # stripped)
- Admission gate drops chunks below configurable threshold (default 0.3)
  - Score rationale persists for EVERY chunk (kept or dropped) for debugging
  - Entities indexed for KEPT chunks only
- Two new SQLite tables added to the memory_tree DB:
  - mem_tree_score: per-chunk score rationale with all signal values
  - mem_tree_entity_index: inverted index entity_id -> node_id
- Idempotent ALTER TABLE migration adds embedding BLOB column to
  mem_tree_chunks (used in Phase 3 retrieval, wired but not populated here)
- Ingest pipeline converted to async to accommodate the extractor trait;
  blocking SQLite work isolated on spawn_blocking; JSON-RPC surface
  unchanged (same memory_tree_ingest / list / get methods)
- Phase 2 deliberately ships without GLiNER/semantic NER - per-chunk
  semantic entities land later behind a cargo feature flag; the composite
  extractor interface keeps that drop-in trivial

Additive only: new tables, new columns, new module. Existing Phase 1
behavior unchanged except that low-signal chunks are now dropped before
reaching mem_tree_chunks. Set score_drop_threshold to 0 to disable
the gate and restore Phase-1-identical behavior.

Closes #708. Parent: #711. Depends on: #707 (#732).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix memory tree scoring persistence issues

* Fix memory tree scoring robustness issues from PR review

- ingest: fail fast if scorer returns fewer/more results than chunks
  (silent zip truncation would drop chunks or their score rationale)
- score::persist_score{,_tx}: clear stale entity-index rows before
  re-indexing a re-scored chunk, since INSERT OR REPLACE never deletes
  rows whose entity_id is no longer in the new extraction
- score::store::lookup_entity: clamp limit to i64::MAX before casting
  to prevent a large usize wrapping into a negative LIMIT

Adds clear_entity_index_drops_stale_rows regression test.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Steven Enamakel <enamakel@tinyhumans.ai>


Development

Successfully merging this pull request may close these issues.

[Feature] Memory Architecture: Phase 1 - Multi-source ingestion and chunked canonicalization
