
feat(memory): Phase 1 memory tree - multi-source ingestion & canonical chunks (#707)#732

Merged
senamakel merged 3 commits into tinyhumansai:main from sanil-23:feat/707-memory-ingestion
Apr 21, 2026

Conversation

Contributor

@sanil-23 sanil-23 commented Apr 21, 2026

Summary

Problem

OpenHuman currently lacks a unified ingestion layer for long-lived personal/team knowledge across messaging, email, and documents. Without a normalized substrate, later memory features (filtering, hierarchical summarization, retrieval) can't be built on a consistent foundation. Issue #707 tracks this foundation.

Solution

New module src/openhuman/memory/tree/ containing:

  • types.rs: Chunk, Metadata, SourceKind, SourceRef, DataSource enum (8 providers from m.excalidraw), deterministic SHA-256 chunk IDs
  • canonicalize/{chat,email,document}.rs: per-source-kind adapters → canonical Markdown
  • chunker.rs: paragraph-preferred, UTF-8-safe splitting with a ≤10k-token bound (aligned with the future #709 summariser budget)
  • store.rs: SQLite schema + CRUD at <workspace>/memory_tree/chunks.db; idempotent upsert on chunk.id
  • ingest.rs: orchestrator: canonicalise → chunk → persist
  • rpc.rs + schemas.rs: JSON-RPC surface under the memory_tree namespace
  • mod.rs: module exports

Exposed JSON-RPC methods (controller registry):

  • openhuman.memory_tree_ingest — unified ingest; caller supplies source_kind + JSON payload (shape dispatches on kind)
  • openhuman.memory_tree_list_chunks — filter by kind/source_id/owner/time window
  • openhuman.memory_tree_get_chunk — fetch by deterministic id
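
For illustration, a document ingest might carry params shaped like the payload exercised in the E2E test later in this thread. This is a hedged sketch rather than the authoritative schema; the exact payload fields are defined in rpc.rs and depend on source_kind:

fn example_ingest_params() -> serde_json::Value {
    // Hypothetical params for openhuman.memory_tree_ingest; values are illustrative.
    serde_json::json!({
        "source_kind": "document",            // selects the document canonicaliser
        "source_id": "notion:launch-plan",    // stable identifier for the raw source
        "owner": "alice@example.com",
        "tags": ["planning", "launch"],
        "payload": {                          // shape dispatches on source_kind
            "provider": "notion",
            "title": "Launch Plan",
            "body": "Alpha\n\nBeta",
            "modified_at": 1700000000000_i64,
            "source_ref": "notion://page/launch-plan"
        }
    })
}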

Chunk ID: sha256(source_kind | source_id | seq) truncated to 32 hex chars. Stable across re-ingest; upserts are idempotent.
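
A minimal sketch of that ID scheme, assuming the sha2 crate; the "|" delimiter is illustrative, since the PR text only fixes the inputs, the hash, and the 32-hex-char truncation:

use sha2::{Digest, Sha256};

fn chunk_id(source_kind: &str, source_id: &str, seq: u32) -> String {
    // sha256(source_kind | source_id | seq), hex-encoded, first 32 chars kept
    let mut hasher = Sha256::new();
    hasher.update(source_kind.as_bytes());
    hasher.update(b"|");
    hasher.update(source_id.as_bytes());
    hasher.update(b"|");
    hasher.update(seq.to_string().as_bytes());
    let hex: String = hasher.finalize().iter().map(|b| format!("{b:02x}")).collect();
    // Same inputs always yield the same ID, which is what makes upserts idempotent.
    hex[..32].to_string()
}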

Design decisions / tradeoffs:

Submission Checklist

  • Unit tests — ~40 tests co-located in each module covering chunk ID stability, UTF-8-safe splitting, canonicalisation idempotence, store round-trip, filter behavior, and re-ingest idempotence
  • E2E / integration — JSON-RPC integration test (tests/memory_tree_e2e.rs) planned in a follow-up commit on this branch before marking ready-for-review
  • N/A — not applicable to this PR
  • Doc comments — module-level //! + per-item /// on public types and non-obvious logic
  • Inline comments — hot paths and invariants annotated; structured [memory_tree::*] debug logs throughout

Impact

  • Runtime: additive only. Adds three JSON-RPC methods under a new memory_tree namespace; new SQLite DB file at <workspace>/memory_tree/chunks.db created lazily on first ingest. No auto-execution, no startup subscribers — new code paths fire only when explicitly called.
  • Platform: desktop; no platform-specific code.
  • Performance: ingest is sync per chunk (CPU-bound canonicalisation + small SQLite write); JSON-RPC handlers use tokio::task::spawn_blocking so the runtime thread isn't blocked (a minimal sketch of this pattern follows this list).
  • Security: no external network calls, no secret handling. SQLite file inherits workspace permissions.
  • Migration: none — new tables in a new DB file.
  • Compatibility: fully backward compatible. Existing memory namespace methods and tables are untouched.
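
A minimal sketch of that spawn_blocking pattern; write_chunks_blocking is a stand-in for the real store call in store.rs, not its actual API:

// Stand-in for the synchronous canonicalise/chunk/SQLite work done per ingest.
fn write_chunks_blocking(chunks: Vec<String>) -> anyhow::Result<usize> {
    Ok(chunks.len())
}

// Handler-side pattern: run blocking work on tokio's blocking pool so the async
// runtime thread servicing JSON-RPC requests never stalls on SQLite.
async fn persist_off_runtime(chunks: Vec<String>) -> anyhow::Result<usize> {
    tokio::task::spawn_blocking(move || write_chunks_blocking(chunks)).await?
}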

Related

Summary by CodeRabbit

  • New Features

    • Unified memory ingestion for chats, emails, and documents with canonicalization, deterministic chunking, and durable queryable storage.
    • JSON-RPC endpoints to ingest data, list chunks, and fetch individual chunks; the new memory_tree namespace is registered and discoverable.
  • Tests

    • End-to-end JSON-RPC tests validating controller registration, ingestion, listing, retrieval, and error cases.

…l chunks (tinyhumansai#707)

Adds an isolated memory tree layer under src/openhuman/memory/tree/
implementing Phase 1 of the new memory architecture (umbrella tinyhumansai#711).
Zero edits to existing memory/*.rs files - the new layer coexists
with the legacy TinyHumans-backed client.

- Source adapters: chat / email / document -> canonical Markdown
- Token-bounded chunker with deterministic SHA-256 chunk IDs
- SQLite persistence at <workspace>/memory_tree/chunks.db with full
  provenance metadata (source_kind, source_id, owner, timestamps,
  tags, time_range) and back-pointer to raw source
- Unified JSON-RPC ingest (dispatches on source_kind + JSON payload):
  openhuman.memory_tree_ingest, _list_chunks, _get_chunk
- DataSource enum covering the 8 providers from m.excalidraw step 1
  (Discord/Telegram/Whatsapp/Gmail/OtherEmail/Notion/MeetingNotes/DriveDocs)
- ~40 unit tests (chunk ID stability, UTF-8-safe splitting,
  canonicalisation idempotence, store round-trip, filter behavior)

Additive only: new tables in a new DB file, new JSON-RPC namespace,
no existing behavior changes. Feeds tinyhumansai#708 (scoring), tinyhumansai#709 (summary
trees), tinyhumansai#710 (query tools).

Closes tinyhumansai#707. Parent: tinyhumansai#711.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

coderabbitai Bot commented Apr 21, 2026

📝 Walkthrough

Adds a new "memory_tree" ingestion subsystem: canonicalizers for chat/email/doc, markdown chunker, deterministic chunk IDs, SQLite persistence, JSON-RPC handlers and controller/schema registrations, plus end-to-end tests and registry integration.

Changes

  • Registry Integration (src/core/all.rs): Appends all_memory_tree_registered_controllers() to registered controllers, adds all_memory_tree_controller_schemas() to declared schemas, and provides a memory_tree namespace description.
  • Module Re-exports (src/openhuman/memory/mod.rs): Adds pub mod tree; and re-exports all_memory_tree_controller_schemas, all_memory_tree_registered_controllers.
  • Module Entry & Re-exports (src/openhuman/memory/tree/mod.rs): New module root exposing submodules (canonicalize, chunker, ingest, rpc, schemas, store, types) and re-exporting schema/registration aggregators.
  • Canonicalizers (src/openhuman/memory/tree/canonicalize/*): New canonicalizers chat.rs, email.rs, document.rs produce CanonicalisedSource; mod.rs defines CanonicaliseRequest, CanonicalisedSource, normalize_source_ref.
  • Chunking (src/openhuman/memory/tree/chunker.rs): Implements markdown chunking by token budget with paragraph→line→UTF-8 fallback, deterministic chunk IDs, token estimation, and chunk metadata sequencing.
  • Ingestion Orchestration (src/openhuman/memory/tree/ingest.rs): Orchestrates ingest flows (ingest_chat, ingest_email, ingest_document): canonicalize, chunk, persist via store, return IngestResult.
  • JSON-RPC Handlers (src/openhuman/memory/tree/rpc.rs): Adds RPC request/response types and async handlers ingest_rpc, list_chunks_rpc, get_chunk_rpc with spawn_blocking for blocking work and payload deserialization.
  • Controller Schemas & Registration (src/openhuman/memory/tree/schemas.rs): Defines memory_tree namespace controller schemas (ingest, list_chunks, get_chunk) and registers handlers that parse params and invoke RPC module functions.
  • Persistence Layer (src/openhuman/memory/tree/store.rs): SQLite-backed storage at <workspace>/memory_tree/chunks.db with lazy init, WAL, schema; implements upsert_chunks, get_chunk, list_chunks, count_chunks, filtering, ordering and limit clamping.
  • Canonical Types (src/openhuman/memory/tree/types.rs): Defines SourceKind, DataSource, SourceRef, Metadata, Chunk, and utilities chunk_id() and approx_token_count().
  • End-to-end Test (tests/json_rpc_e2e.rs): Adds e2e test asserting controller registration and exercising openhuman.memory_tree_ingest, ..._list_chunks, ..._get_chunk flows and error cases.

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client
    participant Controller as memory_tree Controller
    participant Canonicalizer as Canonicalizer (chat/email/doc)
    participant Chunker as Markdown Chunker
    participant Store as SQLite Store

    Client->>Controller: JSON-RPC ingest request
    Controller->>Canonicalizer: canonicalise(source_id, owner, tags, payload)
    alt empty source
        Canonicalizer-->>Controller: Ok(None)
        Controller-->>Client: IngestResult { chunks_written: 0 }
    else non-empty
        Canonicalizer-->>Controller: Ok(CanonicalisedSource)
        Controller->>Chunker: chunk_markdown(markdown, metadata, opts)
        Chunker-->>Controller: Vec<Chunk>
        Controller->>Store: upsert_chunks(chunks)
        Store-->>Controller: rows_written
        Controller-->>Client: IngestResult { chunks_written, chunk_ids }
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

Possibly related PRs

Poem

🐇 I hop through chats, threads, pages bright,
I stitch them into chunks by moonlight,
SQLite nests each little leaf,
Metadata roots hold provenance brief,
Hooray — the memory tree takes flight!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Description Check (✅ Passed): Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title check (✅ Passed): The title clearly and specifically describes the main change: implementing Phase 1 of the memory tree with multi-source ingestion and canonical chunks, referencing the related issue #707.
  • Linked Issues check (✅ Passed): All major objectives from issue #707 are addressed: memory-ingestion domain structure, support for chat/email/document sources, canonical Markdown conversion, token-bounded chunking with deterministic IDs, comprehensive provenance metadata, source-link pointers, JSON-RPC exposure, and unit tests covering normalization and metadata.
  • Out of Scope Changes check (✅ Passed): All changes are directly aligned with Phase 1 memory tree ingestion scope. New module tree/ with canonicalize, chunker, ingest, rpc, schemas, store, and types; no unrelated modifications to existing code besides registry integration in src/core/all.rs.
  • Docstring Coverage (✅ Passed): Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 9

🧹 Nitpick comments (1)
src/openhuman/memory/tree/schemas.rs (1)

21-44: Please add the JSON-RPC E2E for this namespace before merge.

This file wires a public transport surface, so the riskiest failures are end-to-end: registry aggregation, param deserialization, and response shaping. The current unit coverage in the leaf modules will not exercise that path.

Based on learnings: "Follow the feature design workflow in order: (1) specify against existing codebase, (2) implement in Rust with unit tests, (3) add JSON-RPC E2E tests, (4) build React UI with core_rpc_relay / coreRpcClient, (5) add app unit tests, (6) add desktop E2E specs"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/memory/tree/schemas.rs` around lines 21 - 44, Add a JSON-RPC
end-to-end test exercising the public namespace wired up by
all_controller_schemas() and all_registered_controllers(): write a test that
starts the JSON-RPC transport/server with the same RegisteredController entries
(schemas("ingest"), schemas("list_chunks"), schemas("get_chunk") wired to
handle_ingest, handle_list_chunks, handle_get_chunk), sends real JSON-RPC
requests for each method, verifies param deserialization and the full response
shaping and error paths, and asserts registry aggregation works (all three
methods are reachable); place the test alongside existing e2e tests and reuse
the same schemas() helper and RegisteredController registration to ensure parity
with production wiring.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/core/all.rs`:
- Around line 119-120: The namespace "memory_tree" has been registered via
controllers.extend(crate::openhuman::memory::all_memory_tree_registered_controllers())
but namespace_description() in src/core/all.rs lacks a matching arm, so CLI/help
shows no summary; update namespace_description() to add a "memory_tree" match
arm (similar style to other arms) that returns a short descriptive string for
the memory_tree namespace, ensuring the function covers "memory_tree" and any
related display helpers that consume namespace_description().

In `@src/openhuman/memory/tree/canonicalize/mod.rs`:
- Around line 16-18: Add a shared normalizer in the canonicalization layer that
trims and converts blank/whitespace-only source refs to None so adapters can't
persist Some("") pointers; implement a helper (e.g. canonicalize_source_ref or
normalize_source_ref) near the canonicalize module and call it wherever
Metadata.source_ref is set or transformed (reference types::Metadata and the
places noted around lines 16 and 32-44) so that any Option<String> source_ref is
trimmed and turned to None if empty.

In `@src/openhuman/memory/tree/chunker.rs`:
- Around line 89-118: split_by_token_budget currently allows max_tokens == 0
which leads to hard_split_by_chars(..., 0) emitting a leading empty chunk and
incorrect behavior; clamp or reject zero up-front. In split_by_token_budget,
normalize the incoming max_tokens to at least 1 (e.g., let max_tokens =
max_tokens.max(1)) before calling approx_token_count, computing max_chars, or
invoking pack_segments/hard_split_by_chars so subsequent logic
(approx_token_count, pack_segments, hard_split_by_chars) never receives 0 and no
empty leading chunk is produced.
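
A sketch of that clamp together with the UTF-8-safe hard cut it protects; function bodies are illustrative (the real split first tries paragraph and line packing), and the 4-chars-per-token heuristic is an assumption:

fn split_by_token_budget_sketch(text: &str, max_tokens: usize) -> Vec<String> {
    // Clamp before any downstream math so a zero budget can't yield an empty leading chunk.
    let max_tokens = max_tokens.max(1);
    let max_chars = max_tokens * 4; // rough chars-per-token estimate (assumption)
    hard_split_by_chars_sketch(text, max_chars)
}

fn hard_split_by_chars_sketch(text: &str, max_chars: usize) -> Vec<String> {
    // Cut on char boundaries so a multi-byte UTF-8 sequence is never split.
    text.chars()
        .collect::<Vec<char>>()
        .chunks(max_chars.max(1))
        .map(|chunk| chunk.iter().collect())
        .collect()
}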

In `@src/openhuman/memory/tree/ingest.rs`:
- Around line 99-103: The debug log is printing raw user content (doc.title);
update the log in the ingest path (the log::debug call referencing source_id and
doc.title) to avoid logging the raw title — log a redacted indicator such as
whether a title exists and its length (e.g., title.is_some() and title.map(|t|
t.len())) or a boolean like has_title instead of the full doc.title; keep
source_id as-is and ensure you do not include any raw PII in the log message.
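
A sketch of the redacted form; the log line and field names are illustrative, not the crate's actual output:

fn log_document_ingest(source_id: &str, title: &Option<String>) {
    // Log presence and length of the title rather than its raw content.
    log::debug!(
        "[memory_tree::ingest] document source_id={} has_title={} title_len={}",
        source_id,
        title.is_some(),
        title.as_deref().map_or(0, str::len)
    );
}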

In `@src/openhuman/memory/tree/schemas.rs`:
- Around line 91-117: The outputs schema for ingest_rpc is incorrectly nested
under a single "result" object; change the outputs Vec in the schema (where
FieldSchema currently has name "result" and TypeSchema::Object) to instead
declare three top-level FieldSchema entries named "source_id"
(TypeSchema::String), "chunks_written" (TypeSchema::U64) and "chunk_ids"
(TypeSchema::Array(Box::new(TypeSchema::String))) so the schema matches
RpcOutcome<IngestResult>'s flattened shape; keep the existing comments and
required flags from the nested fields when you move them out.

In `@src/openhuman/memory/tree/store.rs`:
- Around line 128-160: In list_chunks (function list_chunks, variable sql)
change the ORDER BY clause to add seq_in_source as a secondary sort key so equal
timestamp_ms rows are deterministic; replace "ORDER BY timestamp_ms DESC LIMIT
?" with "ORDER BY timestamp_ms DESC, seq_in_source ASC LIMIT ?" to preserve
source order for multi-piece chunks.
- Around line 124-125: The RPC-provided struct field `limit: Option<usize>` must
be validated and safely converted before being used as an SQLite LIMIT; clamp
the requested value to a sane maximum (e.g. define a MAX_RESULTS constant such
as 10_000), default to 100 when None, and then convert using i64::try_from to
detect overflow. Locate the code that reads `limit` (the `limit` field of the
store/tree request and the place where it is cast to i64 for the SQL query) and
replace the direct cast with logic that: 1) maps None -> 100, 2) clamps Some(n)
to the range [1, MAX_RESULTS], 3) uses i64::try_from(clamped) and handles Err by
falling back to MAX_RESULTS (or returning an error), and 4) uses that safe i64
value for the SQLite LIMIT. Ensure no negative or overflowing values can reach
the SQL layer.
- Around line 237-246: The with_connection function currently opens a rusqlite
Connection but doesn't configure concurrency settings; update with_connection to
set a busy timeout and enable WAL mode after opening the DB but before
initializing the schema: call
conn.set_busy_timeout(Some(Duration::from_secs(5))) (or an appropriate duration)
and execute the PRAGMA "PRAGMA journal_mode=WAL;" on the Connection (e.g., via
conn.execute_batch or conn.execute) before running conn.execute_batch(SCHEMA);
keep the same error contexts on failures so any errors from set_busy_timeout or
enabling WAL surface in logs.
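
A sketch of that connection setup, assuming rusqlite (busy_timeout and execute_batch are rusqlite methods; SCHEMA here is a stand-in for the real schema constant in store.rs):

use std::time::Duration;
use rusqlite::Connection;

// Stand-in for the real table definitions in store.rs.
const SCHEMA: &str = "CREATE TABLE IF NOT EXISTS mem_tree_chunks (id TEXT PRIMARY KEY);";

fn open_configured(path: &std::path::Path) -> anyhow::Result<Connection> {
    let conn = Connection::open(path)?;
    // Wait up to 5s on a locked database instead of failing immediately.
    conn.busy_timeout(Duration::from_secs(5))?;
    // WAL lets readers proceed while a writer holds the lock.
    conn.execute_batch("PRAGMA journal_mode=WAL;")?;
    conn.execute_batch(SCHEMA)?;
    Ok(conn)
}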

In `@src/openhuman/memory/tree/types.rs`:
- Around line 8-10: The module docs incorrectly state chunk IDs use blake3 but
the implementation in chunk_id() uses SHA-256; update the top-of-module doc
comment to state the correct algorithm and truncation (SHA-256(source_kind |
source_id | seq) truncated to 32 hex chars) and/or add a short Rustdoc note
above the chunk_id() function that documents the exact input concatenation and
truncation behavior so docs and code match (or, if you prefer blake3, change the
chunk_id() implementation to use blake3 consistently—pick one and make both doc
and function agree).

---

Nitpick comments:
In `@src/openhuman/memory/tree/schemas.rs`:
- Around line 21-44: Add a JSON-RPC end-to-end test exercising the public
namespace wired up by all_controller_schemas() and all_registered_controllers():
write a test that starts the JSON-RPC transport/server with the same
RegisteredController entries (schemas("ingest"), schemas("list_chunks"),
schemas("get_chunk") wired to handle_ingest, handle_list_chunks,
handle_get_chunk), sends real JSON-RPC requests for each method, verifies param
deserialization and the full response shaping and error paths, and asserts
registry aggregation works (all three methods are reachable); place the test
alongside existing e2e tests and reuse the same schemas() helper and
RegisteredController registration to ensure parity with production wiring.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7925714b-906f-4b33-bdea-9e2232dd9e18

📥 Commits

Reviewing files that changed from the base of the PR and between a29056f and 135fd95.

📒 Files selected for processing (13)
  • src/core/all.rs
  • src/openhuman/memory/mod.rs
  • src/openhuman/memory/tree/canonicalize/chat.rs
  • src/openhuman/memory/tree/canonicalize/document.rs
  • src/openhuman/memory/tree/canonicalize/email.rs
  • src/openhuman/memory/tree/canonicalize/mod.rs
  • src/openhuman/memory/tree/chunker.rs
  • src/openhuman/memory/tree/ingest.rs
  • src/openhuman/memory/tree/mod.rs
  • src/openhuman/memory/tree/rpc.rs
  • src/openhuman/memory/tree/schemas.rs
  • src/openhuman/memory/tree/store.rs
  • src/openhuman/memory/tree/types.rs

Comment thread src/core/all.rs
Comment thread src/openhuman/memory/tree/canonicalize/mod.rs
Comment on lines +16 to +18
use serde::{Deserialize, Serialize};

use crate::openhuman::memory::tree::types::Metadata;
Contributor


⚠️ Potential issue | 🟡 Minor

Normalize blank source_ref values once in the shared canonicalization layer.

The sibling adapters currently turn Some("") / whitespace-only refs into SourceRef, which means we can persist a provenance pointer that points nowhere. A small shared helper here keeps chat/email/document consistent and preserves source-link integrity.

♻️ Suggested helper
 use serde::{Deserialize, Serialize};
 
-use crate::openhuman::memory::tree::types::Metadata;
+use crate::openhuman::memory::tree::types::{Metadata, SourceRef};
@@
 pub struct CanonicaliseRequest<P> {
@@
     #[serde(default)]
     pub tags: Vec<String>,
 }
+
+pub(super) fn normalise_source_ref(value: Option<String>) -> Option<SourceRef> {
+    value
+        .map(|value| value.trim().to_owned())
+        .filter(|value| !value.is_empty())
+        .map(SourceRef::new)
+}

Also applies to: 32-44

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/memory/tree/canonicalize/mod.rs` around lines 16 - 18, Add a
shared normalizer in the canonicalization layer that trims and converts
blank/whitespace-only source refs to None so adapters can't persist Some("")
pointers; implement a helper (e.g. canonicalize_source_ref or
normalize_source_ref) near the canonicalize module and call it wherever
Metadata.source_ref is set or transformed (reference types::Metadata and the
places noted around lines 16 and 32-44) so that any Option<String> source_ref is
trimmed and turned to None if empty.

Comment thread src/openhuman/memory/tree/chunker.rs
Comment thread src/openhuman/memory/tree/ingest.rs
Comment thread src/openhuman/memory/tree/schemas.rs Outdated
Comment thread src/openhuman/memory/tree/store.rs
Comment thread src/openhuman/memory/tree/store.rs Outdated
Comment thread src/openhuman/memory/tree/store.rs
Comment thread src/openhuman/memory/tree/types.rs Outdated
- Added support for "memory_tree" namespace description in `namespace_description`.
- Implemented clamping for zero token budget in `chunk_markdown` to prevent empty leading chunks.
- Introduced detailed logging for document ingestion, including title length.
- Updated output schemas in `schemas.rs` for improved clarity.
- Enhanced chunk listing with clamping limits and ordering by sequence in `list_chunks`.
- Normalized source references in canonicalization functions to drop blank values.
- Added comprehensive tests for new features and edge cases in chunking and canonicalization.

These changes improve the robustness and usability of the memory tree layer, aligning with ongoing development efforts for Phase 1 of the memory architecture.
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
tests/json_rpc_e2e.rs (1)

793-797: Prefer explicit method-name assertions for public API stability.

Because expected_methods is derived from memory_tree_schema(...), a rename in schema + registry could still pass this check while breaking external clients expecting fixed names. Consider asserting the three canonical strings directly.

Hardening suggestion
-    let expected_methods = vec![
-        rpc_method_name(&memory_tree_schema("ingest")),
-        rpc_method_name(&memory_tree_schema("list_chunks")),
-        rpc_method_name(&memory_tree_schema("get_chunk")),
-    ];
+    let expected_methods = vec![
+        "openhuman.memory_tree_ingest".to_string(),
+        "openhuman.memory_tree_list_chunks".to_string(),
+        "openhuman.memory_tree_get_chunk".to_string(),
+    ];
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/json_rpc_e2e.rs` around lines 793 - 797, The test currently builds
expected_methods by calling rpc_method_name(&memory_tree_schema(...)) which
masks schema/registry renames; replace that derivation with explicit string
assertions for the three canonical RPC names (the literal method names that
external clients rely on) instead of using memory_tree_schema or rpc_method_name
to construct them; update the test around the expected_methods variable (and any
assertions that compare to it) to assert the three fixed string values directly
so public API names remain stable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/openhuman/memory/tree/chunker.rs`:
- Around line 82-88: The doc comment in src/openhuman/memory/tree/chunker.rs
lists sentence- and word-boundary splitting (items 3 and 4) but the
implementation only performs paragraph, line, and hard-character cuts; either
implement those strategies in the chunking code (e.g., add sentence and word
split logic to the Chunker::split / chunking function using a proper sentence
tokenizer or unicode-segmentation word/sentence boundaries and fallbacks that
preserve UTF-8 code points) or update the doc comment to accurately list only
paragraph, line, and hard-character splitting; pick one approach and make the
code and the doc comment consistent (refer to the Chunker type and its
split/chunk method when applying the change).

In `@src/openhuman/memory/tree/schemas.rs`:
- Around line 70-74: The schema currently marks the field name "owner" with
TypeSchema::String and required: true, but tree_rpc::IngestRequest accepts a
missing owner via #[serde(default)], so update the schema to match runtime
behavior: change the "owner" field to be optional (remove required: true or set
required: false / wrap in an Option-type schema) so it aligns with
IngestRequest's #[serde(default)] handling; reference the "owner" field
definition and the tree_rpc::IngestRequest serde default attribute when making
the change.
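
For reference, the runtime behavior described above comes from serde's default handling; this is a simplified stand-in for tree_rpc::IngestRequest, not its real field set:

use serde::Deserialize;

#[derive(Deserialize)]
struct IngestRequestSketch {
    source_kind: String,
    source_id: String,
    // A missing "owner" deserializes to "" instead of erroring, so the schema
    // should not mark the field as required.
    #[serde(default)]
    owner: String,
    #[serde(default)]
    tags: Vec<String>,
}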

---

Nitpick comments:
In `@tests/json_rpc_e2e.rs`:
- Around line 793-797: The test currently builds expected_methods by calling
rpc_method_name(&memory_tree_schema(...)) which masks schema/registry renames;
replace that derivation with explicit string assertions for the three canonical
RPC names (the literal method names that external clients rely on) instead of
using memory_tree_schema or rpc_method_name to construct them; update the test
around the expected_methods variable (and any assertions that compare to it) to
assert the three fixed string values directly so public API names remain stable.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 88708079-b359-4b7d-bd78-d4ee66b57fe5

📥 Commits

Reviewing files that changed from the base of the PR and between 135fd95 and 9724d2b.

📒 Files selected for processing (11)
  • src/core/all.rs
  • src/openhuman/memory/tree/canonicalize/chat.rs
  • src/openhuman/memory/tree/canonicalize/document.rs
  • src/openhuman/memory/tree/canonicalize/email.rs
  • src/openhuman/memory/tree/canonicalize/mod.rs
  • src/openhuman/memory/tree/chunker.rs
  • src/openhuman/memory/tree/ingest.rs
  • src/openhuman/memory/tree/schemas.rs
  • src/openhuman/memory/tree/store.rs
  • src/openhuman/memory/tree/types.rs
  • tests/json_rpc_e2e.rs
✅ Files skipped from review due to trivial changes (1)
  • src/openhuman/memory/tree/types.rs
🚧 Files skipped from review as they are similar to previous changes (6)
  • src/core/all.rs
  • src/openhuman/memory/tree/ingest.rs
  • src/openhuman/memory/tree/canonicalize/chat.rs
  • src/openhuman/memory/tree/canonicalize/email.rs
  • src/openhuman/memory/tree/canonicalize/document.rs
  • src/openhuman/memory/tree/store.rs

Comment thread src/openhuman/memory/tree/chunker.rs Outdated
Comment thread src/openhuman/memory/tree/schemas.rs
Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (2)
tests/json_rpc_e2e.rs (1)

810-867: Add a re-ingest idempotency check to this E2E flow.

This path verifies ingest/list/get, but it doesn’t yet assert the Phase-1 contract that re-ingesting identical content is stable and idempotent (same chunk IDs, no duplicate chunks).

💡 Proposed test extension
     let ingest_outer = assert_no_jsonrpc_error(&ingest, "memory_tree_ingest");
     let ingest_result = ingest_outer.get("result").unwrap_or(ingest_outer);
@@
     let chunk_ids = ingest_result
         .get("chunk_ids")
         .and_then(Value::as_array)
         .expect("chunk_ids array");
     assert_eq!(chunk_ids.len(), 1);
+    let chunk_ids_first = chunk_ids.clone();
+
+    // Re-ingest identical payload: should be idempotent/stable.
+    let ingest_again = post_json_rpc(
+        &rpc_base,
+        205,
+        &expected_methods[0],
+        json!({
+            "source_kind": "document",
+            "source_id": "notion:launch-plan",
+            "owner": "alice@example.com",
+            "tags": ["planning", "launch"],
+            "payload": {
+                "provider": "notion",
+                "title": "Launch Plan",
+                "body": "Alpha\n\nBeta",
+                "modified_at": 1700000000000_i64,
+                "source_ref": " notion://page/launch-plan "
+            }
+        }),
+    ).await;
+    let ingest_again_outer = assert_no_jsonrpc_error(&ingest_again, "memory_tree_ingest_reingest");
+    let ingest_again_result = ingest_again_outer.get("result").unwrap_or(ingest_again_outer);
+    let chunk_ids_again = ingest_again_result
+        .get("chunk_ids")
+        .and_then(Value::as_array)
+        .expect("chunk_ids array");
+    assert_eq!(chunk_ids_again, &chunk_ids_first, "chunk ids should remain stable across re-ingest");
@@
     assert_eq!(chunks.len(), 1);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/json_rpc_e2e.rs` around lines 810 - 867, Add an idempotency check after
the initial ingest/list assertions: call post_json_rpc again with the same
payload using the same expected_methods[0] (memory_tree_ingest) and verify via
assert_no_jsonrpc_error that the returned ingest result contains the same
"source_id" and identical "chunk_ids" as the first ingest (no new chunk IDs),
and then call memory_tree_list_chunks (expected_methods[1]) again and assert the
"chunks" array length is unchanged (still 1) and that the chunk at index 0 still
has seq_in_source == 0 and the same pointer("/metadata/source_ref/value"); use
the existing variables ingest_result, chunk_ids, list_result, and chunk to
compare old vs new values to ensure re-ingest is stable and idempotent.
src/openhuman/memory/tree/chunker.rs (1)

123-138: Consider using sep.chars().count() for consistency.

Line 124 uses sep.len() (byte length) while line 137 adds it to current.chars().count() (character count). For the ASCII separators "\n" and "\n\n" used internally, bytes equal chars so this works correctly. However, using sep.chars().count() would make the units consistent throughout.

🔧 Proposed fix for consistency
 fn pack_segments(segments: &[&str], sep: &str, max_chars: usize) -> Option<Vec<String>> {
-    let sep_len = sep.len();
+    let sep_len = sep.chars().count();
     let mut out: Vec<String> = Vec::new();
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/memory/tree/chunker.rs` around lines 123 - 138, In
pack_segments change the separator length calculation to use character count
rather than byte length for consistency: replace the sep_len = sep.len() usage
with sep_chars = sep.chars().count() (and use that variable where sep_len is
referenced) so all length math in pack_segments (e.g., projected =
current.chars().count() + sep_chars + seg_len and any comparisons) uses
character counts consistently; update the variable name (sep_chars) and its
references in the function.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/openhuman/memory/tree/chunker.rs`:
- Around line 123-138: In pack_segments change the separator length calculation
to use character count rather than byte length for consistency: replace the
sep_len = sep.len() usage with sep_chars = sep.chars().count() (and use that
variable where sep_len is referenced) so all length math in pack_segments (e.g.,
projected = current.chars().count() + sep_chars + seg_len and any comparisons)
uses character counts consistently; update the variable name (sep_chars) and its
references in the function.

In `@tests/json_rpc_e2e.rs`:
- Around line 810-867: Add an idempotency check after the initial ingest/list
assertions: call post_json_rpc again with the same payload using the same
expected_methods[0] (memory_tree_ingest) and verify via assert_no_jsonrpc_error
that the returned ingest result contains the same "source_id" and identical
"chunk_ids" as the first ingest (no new chunk IDs), and then call
memory_tree_list_chunks (expected_methods[1]) again and assert the "chunks"
array length is unchanged (still 1) and that the chunk at index 0 still has
seq_in_source == 0 and the same pointer("/metadata/source_ref/value"); use the
existing variables ingest_result, chunk_ids, list_result, and chunk to compare
old vs new values to ensure re-ingest is stable and idempotent.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d11a345a-a8ab-4ed6-8d5f-d9be10410ce8

📥 Commits

Reviewing files that changed from the base of the PR and between 9724d2b and 3528d2b.

📒 Files selected for processing (3)
  • src/openhuman/memory/tree/chunker.rs
  • src/openhuman/memory/tree/schemas.rs
  • tests/json_rpc_e2e.rs
✅ Files skipped from review due to trivial changes (1)
  • src/openhuman/memory/tree/schemas.rs

@senamakel senamakel merged commit 0ce1253 into tinyhumansai:main Apr 21, 2026
7 of 8 checks passed
senamakel added a commit that referenced this pull request Apr 21, 2026
… gate (#708) (#733)

* feat(memory): phase 1 memory tree - multi-source ingestion & canonical chunks (#707)

Adds an isolated memory tree layer under src/openhuman/memory/tree/
implementing Phase 1 of the new memory architecture (umbrella #711).
Zero edits to existing memory/*.rs files - the new layer coexists
with the legacy TinyHumans-backed client.

- Source adapters: chat / email / document -> canonical Markdown
- Token-bounded chunker with deterministic SHA-256 chunk IDs
- SQLite persistence at <workspace>/memory_tree/chunks.db with full
  provenance metadata (source_kind, source_id, owner, timestamps,
  tags, time_range) and back-pointer to raw source
- Unified JSON-RPC ingest (dispatches on source_kind + JSON payload):
  openhuman.memory_tree_ingest, _list_chunks, _get_chunk
- DataSource enum covering the 8 providers from m.excalidraw step 1
  (Discord/Telegram/Whatsapp/Gmail/OtherEmail/Notion/MeetingNotes/DriveDocs)
- ~40 unit tests (chunk ID stability, UTF-8-safe splitting,
  canonicalisation idempotence, store round-trip, filter behavior)

Additive only: new tables in a new DB file, new JSON-RPC namespace,
no existing behavior changes. Feeds #708 (scoring), #709 (summary
trees), #710 (query tools).

Closes #707. Parent: #711.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(memory): phase 2 memory tree - preprocessing, scoring, admission gate (#708)

Adds the scoring / admission layer between Phase 1's chunker and store.
Stacked on feat/707-memory-ingestion (PR #732) - depends on Phase 1's
chunk substrate.

- Pluggable EntityExtractor trait + CompositeExtractor chain
- RegexEntityExtractor: mechanical entities (emails, URLs, @Handles, #hashtags)
  - Always on, deterministic, zero deps, UTF-8-safe char spans
- Five weighted signals: token count, unique-word ratio, metadata weight,
  source weight (per-DataSource), interaction (reply/sent/mention/dm tags),
  entity density
- Exact-match entity canonicalisation (email lowercased, @ and # stripped)
- Admission gate drops chunks below configurable threshold (default 0.3)
  - Score rationale persists for EVERY chunk (kept or dropped) for debugging
  - Entities indexed for KEPT chunks only
- Two new SQLite tables added to the memory_tree DB:
  - mem_tree_score: per-chunk score rationale with all signal values
  - mem_tree_entity_index: inverted index entity_id -> node_id
- Idempotent ALTER TABLE migration adds embedding BLOB column to
  mem_tree_chunks (used in Phase 3 retrieval, wired but not populated here)
- Ingest pipeline converted to async to accommodate the extractor trait;
  blocking SQLite work isolated on spawn_blocking; JSON-RPC surface
  unchanged (same memory_tree_ingest / list / get methods)
- Phase 2 deliberately ships without GLiNER/semantic NER - per-chunk
  semantic entities land later behind a cargo feature flag; the composite
  extractor interface keeps that drop-in trivial

Additive only: new tables, new columns, new module. Existing Phase 1
behavior unchanged except that low-signal chunks are now dropped before
reaching mem_tree_chunks. Set score_drop_threshold to 0 to disable
the gate and restore Phase-1-identical behavior.

Closes #708. Parent: #711. Depends on: #707 (#732).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix memory tree scoring persistence issues

* Fix memory tree scoring robustness issues from PR review

- ingest: fail fast if scorer returns fewer/more results than chunks
  (silent zip truncation would drop chunks or their score rationale)
- score::persist_score{,_tx}: clear stale entity-index rows before
  re-indexing a re-scored chunk, since INSERT OR REPLACE never deletes
  rows whose entity_id is no longer in the new extraction
- score::store::lookup_entity: clamp limit to i64::MAX before casting
  to prevent a large usize wrapping into a negative LIMIT

Adds clear_entity_index_drops_stale_rows regression test.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Steven Enamakel <enamakel@tinyhumans.ai>


Development

Successfully merging this pull request may close these issues.

[Feature] Memory Architecture: Phase 1 - Multi-source ingestion and chunked canonicalization
