feat(memory): Phase 2 memory tree - preprocessing, scoring, admission gate (#708) #733
Conversation
…l chunks (tinyhumansai#707)

Adds an isolated memory tree layer under src/openhuman/memory/tree/ implementing Phase 1 of the new memory architecture (umbrella tinyhumansai#711). Zero edits to existing memory/*.rs files - the new layer coexists with the legacy TinyHumans-backed client.

- Source adapters: chat / email / document -> canonical Markdown
- Token-bounded chunker with deterministic SHA-256 chunk IDs
- SQLite persistence at <workspace>/memory_tree/chunks.db with full provenance metadata (source_kind, source_id, owner, timestamps, tags, time_range) and back-pointer to raw source
- Unified JSON-RPC ingest (dispatches on source_kind + JSON payload): openhuman.memory_tree_ingest, _list_chunks, _get_chunk
- DataSource enum covering the 8 providers from m.excalidraw step 1 (Discord/Telegram/Whatsapp/Gmail/OtherEmail/Notion/MeetingNotes/DriveDocs)
- ~40 unit tests (chunk ID stability, UTF-8-safe splitting, canonicalisation idempotence, store round-trip, filter behavior)

Additive only: new tables in a new DB file, new JSON-RPC namespace, no existing behavior changes. Feeds tinyhumansai#708 (scoring), tinyhumansai#709 (summary trees), tinyhumansai#710 (query tools).

Closes tinyhumansai#707. Parent: tinyhumansai#711.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
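The Phase 1 chunking and ID scheme can be sketched in a few lines. This is a simplified illustration, not the code from src/openhuman/memory/tree/chunker.rs: the word-based token approximation and the use of std's DefaultHasher in place of SHA-256 (to keep the sketch dependency-free) are assumptions; the real chunk IDs are SHA-256 digests.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Rough token estimate: whitespace-separated words. The real chunker
/// presumably uses a proper tokenizer; this is a stand-in.
fn approx_token_count(text: &str) -> usize {
    text.split_whitespace().count()
}

/// Split canonical Markdown into chunks of at most `max_tokens` words.
/// Splitting on whitespace keeps every boundary on a char boundary,
/// so multi-byte UTF-8 text is never cut mid-character.
fn chunk(text: &str, max_tokens: usize) -> Vec<String> {
    let words: Vec<&str> = text.split_whitespace().collect();
    words.chunks(max_tokens).map(|w| w.join(" ")).collect()
}

/// Deterministic chunk ID from (source_id, seq, content). The PR uses
/// SHA-256; std's DefaultHasher stands in here. Same inputs, same ID.
fn chunk_id(source_id: &str, seq: usize, content: &str) -> String {
    let mut h = DefaultHasher::new();
    (source_id, seq, content).hash(&mut h);
    format!("{:016x}", h.finish())
}

fn main() {
    let chunks = chunk("héllo wörld from a chat transcript", 2);
    assert_eq!(chunks, vec!["héllo wörld", "from a", "chat transcript"]);
    // ID stability: hashing the same triple twice gives the same ID.
    assert_eq!(chunk_id("discord:42", 0, &chunks[0]), chunk_id("discord:42", 0, &chunks[0]));
    assert_eq!(approx_token_count(&chunks[0]), 2);
    println!("ok");
}
```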
… gate (tinyhumansai#708)

Adds the scoring / admission layer between Phase 1's chunker and store. Stacked on feat/707-memory-ingestion (PR tinyhumansai#732) - depends on Phase 1's chunk substrate.

- Pluggable EntityExtractor trait + CompositeExtractor chain
- RegexEntityExtractor: mechanical entities (emails, URLs, @Handles, #hashtags)
  - Always on, deterministic, zero deps, UTF-8-safe char spans
- Six weighted signals: token count, unique-word ratio, metadata weight, source weight (per-DataSource), interaction (reply/sent/mention/dm tags), entity density
- Exact-match entity canonicalisation (email lowercased, @ and # stripped)
- Admission gate drops chunks below configurable threshold (default 0.3)
- Score rationale persists for EVERY chunk (kept or dropped) for debugging
- Entities indexed for KEPT chunks only
- Two new SQLite tables added to the memory_tree DB:
  - mem_tree_score: per-chunk score rationale with all signal values
  - mem_tree_entity_index: inverted index entity_id -> node_id
- Idempotent ALTER TABLE migration adds embedding BLOB column to mem_tree_chunks (used in Phase 3 retrieval, wired but not populated here)
- Ingest pipeline converted to async to accommodate the extractor trait; blocking SQLite work isolated on spawn_blocking; JSON-RPC surface unchanged (same memory_tree_ingest / list / get methods)
- Phase 2 deliberately ships without GLiNER/semantic NER - per-chunk semantic entities land later behind a cargo feature flag; the composite extractor interface keeps that drop-in trivial

Additive only: new tables, new columns, new module. Existing Phase 1 behavior unchanged except that low-signal chunks are now dropped before reaching mem_tree_chunks. Set score_drop_threshold to 0 to disable the gate and restore Phase-1-identical behavior.

Closes tinyhumansai#708. Parent: tinyhumansai#711. Depends on: tinyhumansai#707 (tinyhumansai#732).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
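The signal combination and admission gate might look roughly like the following. The Signal struct, the weighted-average combine(), and the example weights are all assumptions for illustration; only the 0.3 default threshold and the "0 disables the gate" behavior come from the PR description.

```rust
/// One normalised signal value in [0.0, 1.0] plus its weight.
struct Signal {
    value: f64,
    weight: f64,
}

/// Weighted average of the signals: sum(w_i * v_i) / sum(w_i).
/// (How the real combine() mixes signals is an assumption here.)
fn combine(signals: &[Signal]) -> f64 {
    let total_weight: f64 = signals.iter().map(|s| s.weight).sum();
    if total_weight == 0.0 {
        return 0.0;
    }
    signals.iter().map(|s| s.value * s.weight).sum::<f64>() / total_weight
}

/// Admission gate: keep the chunk only if its score clears the threshold.
/// A threshold of 0.0 keeps everything (Phase-1-identical behaviour).
fn admit(score: f64, drop_threshold: f64) -> bool {
    score >= drop_threshold
}

fn main() {
    let strong = [
        Signal { value: 0.9, weight: 2.0 }, // token count
        Signal { value: 0.8, weight: 1.0 }, // unique-word ratio
        Signal { value: 0.7, weight: 1.0 }, // entity density
    ];
    let weak = [
        Signal { value: 0.1, weight: 2.0 },
        Signal { value: 0.2, weight: 1.0 },
        Signal { value: 0.0, weight: 1.0 },
    ];
    assert!(admit(combine(&strong), 0.3));
    assert!(!admit(combine(&weak), 0.3));
    assert!(admit(combine(&weak), 0.0)); // gate disabled
    println!("ok");
}
```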
📝 Walkthrough

Added a Phase‑2 scoring/admission pipeline into ingestion: async scoring of chunks (kept vs dropped), persistence of scores and only-kept chunks/entities, new scoring modules (extract, resolver, signals, store), async ingest APIs, schema updates, and SQLite schema/indexing changes plus tests updated to cover admission behavior.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant RPC as RPC handler
    participant Ingest as Ingest orchestration
    participant Scorer as Scoring service (score::score_chunks)
    participant Extract as Entity Extractor
    participant Resolver as Canonicaliser
    participant DB as SQLite store
    RPC->>Ingest: parse payload & call async ingest_*()
    Ingest->>Ingest: chunk content -> chunking
    Ingest->>Scorer: score_chunks(chunks, cfg)
    Scorer->>Extract: extract entities per chunk
    Scorer->>Scorer: compute signals, combine -> decide kept/dropped
    Scorer->>Resolver: canonicalise(extracted) (for kept and for rationale)
    Scorer-->>Ingest: ScoreResults (kept/dropped + canonical entities)
    Ingest->>DB: persist kept chunks (upsert) and embeddings
    Ingest->>DB: persist all score rows (kept + dropped)
    Ingest->>DB: index canonical entities only for kept chunks
    DB-->>Ingest: ack persisted rows / counts
    Ingest-->>RPC: return IngestResult { chunks_written, chunks_dropped, chunk_ids }
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
# Conflicts:
#	src/openhuman/memory/tree/canonicalize/chat.rs
#	src/openhuman/memory/tree/canonicalize/document.rs
#	src/openhuman/memory/tree/canonicalize/email.rs
#	src/openhuman/memory/tree/canonicalize/mod.rs
#	src/openhuman/memory/tree/chunker.rs
#	src/openhuman/memory/tree/ingest.rs
#	src/openhuman/memory/tree/mod.rs
#	src/openhuman/memory/tree/rpc.rs
#	src/openhuman/memory/tree/schemas.rs
#	src/openhuman/memory/tree/store.rs
#	src/openhuman/memory/tree/types.rs
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/openhuman/memory/tree/store.rs (2)
111-115: ⚠️ Potential issue | 🟠 Major
`INSERT OR REPLACE` will wipe stored embeddings on re-ingest.
Now that `mem_tree_chunks` has an `embedding` column updated out-of-band, replacing a row with an insert that omits `embedding` resets it to `NULL`. A second ingest of the same chunk silently discards any vector previously written via `set_chunk_embedding()`.
Suggested fix
```diff
-        "INSERT OR REPLACE INTO mem_tree_chunks (
+        "INSERT INTO mem_tree_chunks (
             id, source_kind, source_id, source_ref, owner,
             timestamp_ms, time_range_start_ms, time_range_end_ms,
             tags_json, content, token_count, seq_in_source, created_at_ms
-        ) VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, ?8, ?9, ?10, ?11, ?12, ?13)",
+        ) VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, ?8, ?9, ?10, ?11, ?12, ?13)
+        ON CONFLICT(id) DO UPDATE SET
+            source_kind = excluded.source_kind,
+            source_id = excluded.source_id,
+            source_ref = excluded.source_ref,
+            owner = excluded.owner,
+            timestamp_ms = excluded.timestamp_ms,
+            time_range_start_ms = excluded.time_range_start_ms,
+            time_range_end_ms = excluded.time_range_end_ms,
+            tags_json = excluded.tags_json,
+            content = excluded.content,
+            token_count = excluded.token_count,
+            seq_in_source = excluded.seq_in_source,
+            created_at_ms = excluded.created_at_ms",
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/store.rs` around lines 111 - 115, The SQL uses "INSERT OR REPLACE INTO mem_tree_chunks ..." which discards the existing embedding column on re-insert; change the statement to perform an UPSERT that preserves the stored embedding (e.g., use "INSERT ... ON CONFLICT(id) DO UPDATE SET ..." and in the DO UPDATE avoid overwriting embedding or set embedding = COALESCE(EXCLUDED.embedding, mem_tree_chunks.embedding)) so that repeated ingests update fields like content/tokens but keep vectors written via set_chunk_embedding(); update the SQL string in store.rs (the mem_tree_chunks INSERT statement) accordingly.
301-333: ⚠️ Potential issue | 🟠 Major
The lazy `ALTER TABLE` migration is racy under concurrent first access.
`add_column_if_missing()` does a read-then-alter across separate statements. If two ingests open a fresh DB at the same time, both can observe the column missing and one of the `ALTER TABLE` calls will fail with `duplicate column name`, aborting that request.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/store.rs` around lines 301 - 333, The add_column_if_missing function is racy because it does a separate read-then-ALTER; change it to perform the alter under a short serialized transaction (e.g., BEGIN IMMEDIATE / COMMIT around the pragma check+ALTER) or attempt the ALTER unconditionally and handle the specific SQLite error for duplicate column name by treating it as success. Update add_column_if_missing to wrap the operation in a transaction (or catch rusqlite::Error with sqlite_error_code() indicating duplicate column name) so concurrent calls don't fail the request.
🧹 Nitpick comments (4)
src/openhuman/memory/tree/score/signals/mod.rs (1)
22-107: Keep `signals/mod.rs` export-focused.
`ScoreSignals`, `SignalWeights`, `compute()`, and `combine()` are all operational code in the module root. Moving those into sibling files such as `types.rs`/`ops.rs` will keep this module aligned with the repo structure rule and make future signal growth easier to manage.
As per coding guidelines, "Keep domain `mod.rs` files light and export-focused; place operational code in sibling files (`ops.rs`, `store.rs`, `schedule.rs`, `types.rs`, `rpc.rs`, `schemas.rs`) and re-export the public API".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/score/signals/mod.rs` around lines 22 - 107, Refactor this module to keep mod.rs export-focused: move the data types ScoreSignals and SignalWeights into a new sibling types.rs, move compute, entity_density_score, and combine into a new sibling ops.rs (or similarly named files), and have mod.rs simply pub use the exported symbols (ScoreSignals, SignalWeights, compute, entity_density_score, combine) so the public API is unchanged; ensure any internal helper functions referenced by compute/entity_density_score are moved or re-exported as needed and update use paths accordingly.
src/openhuman/memory/tree/score/extract/mod.rs (1)
15-79: Move the extractor implementation pieces out of `extract/mod.rs`.
This `mod.rs` now owns the trait, adapter, and composite execution logic. Splitting those into sibling files and keeping `mod.rs` to exports/re-exports will make this module easier to grow once semantic extractors land.
As per coding guidelines, "Keep domain `mod.rs` files light and export-focused; place operational code in sibling files (`ops.rs`, `store.rs`, `schedule.rs`, `types.rs`, `rpc.rs`, `schemas.rs`) and re-export the public API".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/score/extract/mod.rs` around lines 15 - 79, Split the operational code out of extract/mod.rs: move the EntityExtractor trait and its async impls (the EntityExtractor trait, the RegexEntityExtractor struct and its impl block including name() and extract(), and the CompositeExtractor struct plus its impls including new(), regex_only(), and the EntityExtractor impl for CompositeExtractor) into one or more sibling files (e.g., trait_and_regex.rs and composite.rs or extractor.rs and composite.rs), then update mod.rs to only re-export the public symbols (pub use self::... for EntityExtractor, RegexEntityExtractor, CompositeExtractor, and related constructors) and keep the doc comments in mod.rs; ensure all references (method names: name, extract, new, regex_only; types: ExtractedEntities) are preserved and imports (async_trait, anyhow, log, regex::extract) are added in the new files.
src/openhuman/memory/tree/score/mod.rs (2)
122-131: Consider concurrent chunk scoring for improved throughput.
The sequential loop processes chunks one at a time. Since `score_chunk` is pure and independent per chunk, you could use concurrent execution (e.g., `futures::future::try_join_all`) for better throughput on large batches. This is optional given the current workload characteristics.
♻️ Optional: Concurrent batch scoring
```diff
+use futures::future::try_join_all;
+
 pub async fn score_chunks(chunks: &[Chunk], cfg: &ScoringConfig) -> Result<Vec<ScoreResult>> {
-    let mut out = Vec::with_capacity(chunks.len());
-    for c in chunks {
-        out.push(score_chunk(c, cfg).await?);
-    }
-    Ok(out)
+    let futures: Vec<_> = chunks.iter().map(|c| score_chunk(c, cfg)).collect();
+    try_join_all(futures).await
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/score/mod.rs` around lines 122 - 131, The current score_chunks function runs score_chunk sequentially which limits throughput; change it to run chunk scoring concurrently by mapping chunks to async tasks and awaiting them all with a concurrency-aware join (e.g., use futures::future::try_join_all on an iterator of futures or spawn into a FuturesUnordered and collect results) so that score_chunk(c, cfg) is invoked for all chunks concurrently and errors still propagate as a batch Result; update score_chunks to build the futures from score_chunk and await the joined result, preserving the Result<Vec<ScoreResult>> return and error semantics.
142-170: Clarify timestamp semantics: `timestamp_ms` parameter vs `Utc::now()` for `computed_at_ms`.
The function receives `timestamp_ms` as a parameter (used for entity indexing at line 164) but uses `Utc::now()` for `computed_at_ms` in the score row. This creates two different time semantics:
- `timestamp_ms`: source/ingest timestamp (passed in)
- `computed_at_ms`: wall-clock time when scoring ran
If this is intentional (scoring time vs source time), consider adding a brief comment clarifying the distinction. If they should match, use `timestamp_ms` for both.
📝 Add clarifying comment
```diff
 let row = store::ScoreRow {
     chunk_id: result.chunk_id.clone(),
     total: result.total,
     signals: result.signals.clone(),
     dropped: !result.kept,
     reason: result.drop_reason.clone(),
+    // computed_at_ms is wall-clock when scoring ran (vs timestamp_ms which
+    // reflects the source's original timestamp for entity indexing)
     computed_at_ms: Utc::now().timestamp_millis(),
 };
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/score/mod.rs` around lines 142 - 170, In persist_score, computed_at_ms in the constructed store::ScoreRow currently uses Utc::now() while the function accepts timestamp_ms (also passed to store::index_entities), creating ambiguous time semantics; either replace Utc::now().timestamp_millis() with the provided timestamp_ms to keep a single consistent timestamp, or if the distinction is intentional, add a short clarifying comment above the persist_score implementation explaining that timestamp_ms is the source/ingest time whereas computed_at_ms (in store::ScoreRow) is the wall-clock scoring time so reviewers/readers understand the difference.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/openhuman/memory/tree/ingest.rs`:
- Around line 159-166: The current spawn_blocking closure commits chunks via
store::upsert_chunks and then calls score::persist_score separately, which can
leave mem_tree_chunks committed while subsequent score/index writes fail; wrap
all related DB writes in a single SQLite transaction/connection so they
atomically succeed or fail together: perform store::upsert_chunks and all
score::persist_score calls inside the same blocking closure using the DB
transaction API (e.g., begin/commit/rollback or a Transaction object) or provide
a new store function that accepts a Transaction and does both chunk upsert and
score persistence, ensuring you use the same connection/transaction for
mem_tree_chunks, score table, and entity index to avoid partial commits.
In `@src/openhuman/memory/tree/score/extract/types.rs`:
- Around line 109-116: unique_entity_count() currently only normalizes by
text.to_lowercase(), so entities with different kinds but same surface (e.g.,
`@alice` vs `#alice`) are collapsed; update the uniqueness key in
unique_entity_count() to include the entity kind along with the case-folded text
(for example combine e.text.to_lowercase() and e.kind or e.kind.as_str()) so
each (kind, normalized-text) is treated distinct; ensure the BTreeSet collects
that combined tuple/string to compute the correct length and that any callers
like entity_density_score() will then receive the corrected count.
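A minimal sketch of the fix this prompt describes, assuming hypothetical EntityKind variants and an Entity shape; the point is keying uniqueness on (kind, case-folded text) rather than text alone:

```rust
use std::collections::BTreeSet;

// Hypothetical entity kinds; the real enum lives in the extract module.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
enum EntityKind { Email, Url, Handle, Hashtag }

struct Entity {
    kind: EntityKind,
    text: String,
}

/// Count unique entities by (kind, case-folded text) so `@alice` and
/// `#alice` stay distinct while `@Alice` / `@alice` collapse to one.
fn unique_entity_count(entities: &[Entity]) -> usize {
    entities
        .iter()
        .map(|e| (e.kind, e.text.to_lowercase()))
        .collect::<BTreeSet<_>>()
        .len()
}

fn main() {
    let es = vec![
        Entity { kind: EntityKind::Handle, text: "alice".into() },
        Entity { kind: EntityKind::Handle, text: "Alice".into() },  // case duplicate
        Entity { kind: EntityKind::Hashtag, text: "alice".into() }, // distinct kind
    ];
    assert_eq!(unique_entity_count(&es), 2);
    println!("ok");
}
```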
In `@src/openhuman/memory/tree/score/resolver.rs`:
- Around line 54-60: The canonical_id_for function is lowercasing the entire
surface which breaks exact-match URLs; update canonical_id_for to preserve the
original case when kind == EntityKind::Url by trimming whitespace and leading
'@'/'#' but skipping the to_lowercase() step for URL kinds, while keeping the
existing lowercasing behavior for other kinds; locate the function
canonical_id_for and the EntityKind::Url branch to implement this conditional
handling so URL path/query case is preserved unless you later add parser-aware
normalization.
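The conditional handling the prompt asks for could look like this; EntityKind and the exact trimming rules are assumed for the sketch, with only the "skip lowercasing for URLs" behavior taken from the finding:

```rust
// Hypothetical entity kinds mirroring the resolver's enum.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum EntityKind { Email, Url, Handle, Hashtag }

/// Canonicalise an entity surface: trim whitespace, strip a leading
/// '@'/'#', and lowercase -- except URLs, whose path/query segments are
/// case-sensitive, so only trimming applies there.
fn canonical_id_for(kind: EntityKind, surface: &str) -> String {
    let s = surface.trim().trim_start_matches(&['@', '#'][..]);
    match kind {
        EntityKind::Url => s.to_string(), // preserve case
        _ => s.to_lowercase(),
    }
}

fn main() {
    assert_eq!(canonical_id_for(EntityKind::Handle, "@Alice"), "alice");
    assert_eq!(canonical_id_for(EntityKind::Email, "Bob@Example.COM"), "bob@example.com");
    assert_eq!(
        canonical_id_for(EntityKind::Url, "https://example.com/Case/Sensitive?Q=1"),
        "https://example.com/Case/Sensitive?Q=1"
    );
    println!("ok");
}
```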
In `@src/openhuman/memory/tree/score/signals/source_weight.rs`:
- Around line 22-29: infer_data_source currently scans Metadata.tags (used for
user labels) and treats any matching string (via DataSource::parse) as transport
provider info, causing user content tags like "notion" to override source_kind;
fix by restricting detection to a namespaced/tag-prefix or dedicated metadata
field: update infer_data_source to only consider tags that start with a provider
prefix such as "provider:" (e.g., check tag.starts_with("provider:") then strip
the prefix and call DataSource::parse) or read from a new
Metadata.transport_tags / Metadata.provider field instead of Metadata.tags;
update references to DataSource::parse and add a PROVIDER_PREFIX constant to
make the behavior explicit.
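A sketch of the namespaced-tag approach, with a hypothetical three-variant DataSource subset and parse() standing in for the real enum:

```rust
/// Only tags namespaced as "provider:<name>" are treated as transport
/// info; free-form user tags like "notion" no longer override source_kind.
const PROVIDER_PREFIX: &str = "provider:";

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum DataSource { Discord, Telegram, Notion }

impl DataSource {
    // Stand-in for the real DataSource::parse.
    fn parse(s: &str) -> Option<Self> {
        match s {
            "discord" => Some(Self::Discord),
            "telegram" => Some(Self::Telegram),
            "notion" => Some(Self::Notion),
            _ => None,
        }
    }
}

fn infer_data_source(tags: &[String]) -> Option<DataSource> {
    tags.iter()
        .filter_map(|t| t.strip_prefix(PROVIDER_PREFIX))
        .find_map(DataSource::parse)
}

fn main() {
    let tags = vec!["notion".to_string(), "provider:discord".to_string()];
    // The bare "notion" content tag is ignored; only the namespaced tag counts.
    assert_eq!(infer_data_source(&tags), Some(DataSource::Discord));
    assert_eq!(infer_data_source(&["notion".to_string()]), None);
    println!("ok");
}
```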
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 3961de54-ae23-4a02-8844-ecbe7cdb4796
⛔ Files ignored due to path filters (1)
`Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (18)
- src/openhuman/memory/tree/ingest.rs
- src/openhuman/memory/tree/mod.rs
- src/openhuman/memory/tree/rpc.rs
- src/openhuman/memory/tree/schemas.rs
- src/openhuman/memory/tree/score/extract/mod.rs
- src/openhuman/memory/tree/score/extract/regex.rs
- src/openhuman/memory/tree/score/extract/types.rs
- src/openhuman/memory/tree/score/mod.rs
- src/openhuman/memory/tree/score/resolver.rs
- src/openhuman/memory/tree/score/signals/interaction.rs
- src/openhuman/memory/tree/score/signals/metadata_weight.rs
- src/openhuman/memory/tree/score/signals/mod.rs
- src/openhuman/memory/tree/score/signals/source_weight.rs
- src/openhuman/memory/tree/score/signals/token_count.rs
- src/openhuman/memory/tree/score/signals/unique_words.rs
- src/openhuman/memory/tree/score/store.rs
- src/openhuman/memory/tree/store.rs
- tests/json_rpc_e2e.rs
Actionable comments posted: 5
🧹 Nitpick comments (3)
src/openhuman/memory/tree/store.rs (1)
378-424: Split the embedding persistence into a sibling module.
This file is now ~600 lines and mixes chunk CRUD, schema bootstrap, migration helpers, and embedding storage. Moving the embedding-specific code out would keep the boundaries cleaner and bring the file back under the repo’s size target.
As per coding guidelines, "Keep source files ≤ ~500 lines; split modules when growing to maintain single responsibility."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/store.rs` around lines 378 - 424, Extract the embedding persistence functions set_chunk_embedding and get_chunk_embedding (and any closely related helpers like their use of with_connection and Config) into a new sibling module (e.g., src/openhuman/memory/tree/embedding.rs) and re-export or import them from the original store.rs; move only the embedding-specific logic and tests, keep shared types (Config, with_connection) referenced via pub use or module import, update store.rs to delegate to the new module so store.rs no longer contains embedding packing/unpacking code, and ensure all imports/paths and crate visibility are adjusted to preserve existing public API and unit tests.
src/openhuman/memory/tree/score/mod.rs (2)
156-157: Bound chunk-scoring concurrency before semantic extractors are enabled.
`try_join_all` drives every `score_chunk` future at once. That is fine for the current regex-only extractor, but it becomes an unbounded fan-out on large ingests once model-backed extractors are added. A buffered stream limit will keep memory/CPU usage predictable.
♻️ One way to cap in-flight scoring
```diff
-use futures_util::future::try_join_all;
+use futures_util::{stream, StreamExt, TryStreamExt};
 ...
 pub async fn score_chunks(chunks: &[Chunk], cfg: &ScoringConfig) -> Result<Vec<ScoreResult>> {
-    try_join_all(chunks.iter().map(|chunk| score_chunk(chunk, cfg))).await
+    const MAX_IN_FLIGHT: usize = 8;
+    stream::iter(chunks.iter().map(|chunk| score_chunk(chunk, cfg)))
+        .buffered(MAX_IN_FLIGHT)
+        .try_collect()
+        .await
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/score/mod.rs` around lines 156 - 157, The current score_chunks function spawns all score_chunk futures at once via try_join_all which can cause unbounded concurrency; change it to drive the chunk iterator through a buffered stream (e.g., futures::stream::iter(chunks).map(|c| score_chunk(c, cfg)).buffer_unordered(n)) and collect the results to Vec<ScoreResult>, using a concurrency cap pulled from ScoringConfig (add a field like max_concurrency or reuse an existing concurrency setting on ScoringConfig). Update score_chunks to use futures::stream::{iter, StreamExt} and buffer_unordered with cfg.max_concurrency and handle flattening/collecting Result items into the same Result<Vec<ScoreResult>> return.
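A dependency-free way to see the bounded-concurrency idea: a synchronous analogue using std::thread::scope that never runs more than max_in_flight scorers at once. This only illustrates the cap; it is not the async futures-based implementation the review suggests.

```rust
use std::thread;

/// Synchronous analogue of `.buffered(MAX_IN_FLIGHT)`: process the batch
/// in windows of at most `max_in_flight` items, each window in parallel
/// threads, so no more than `max_in_flight` scorers run at once.
fn score_all<T, R, F>(items: &[T], max_in_flight: usize, score: F) -> Vec<R>
where
    T: Sync,
    R: Send,
    F: Fn(&T) -> R + Sync,
{
    let score = &score;
    let mut out = Vec::with_capacity(items.len());
    for window in items.chunks(max_in_flight) {
        thread::scope(|s| {
            let handles: Vec<_> = window
                .iter()
                .map(|it| s.spawn(move || score(it)))
                .collect();
            out.extend(handles.into_iter().map(|h| h.join().unwrap()));
        });
    }
    out
}

fn main() {
    let chunks: Vec<u32> = (0..10).collect();
    let scores = score_all(&chunks, 3, |c| *c as f64 / 10.0);
    assert_eq!(scores.len(), 10);
    assert_eq!(scores[9], 0.9);
    println!("ok");
}
```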
69-73: Log the actual scored inputs and final decision on the keep path.
For chat chunks, the gate uses `scoring_token_count` after header stripping, but the entry log reports `chunk.token_count`, and kept chunks never log their computed total/threshold. Add one final debug line with the derived token count, total, threshold, entity count, and `kept` so score tuning is traceable end-to-end.
As per coding guidelines, "`{src/**/*.rs,app/src/**/*.{ts,tsx}}`: Add substantial, development-oriented logs while implementing features or fixes so issues are easy to trace end-to-end; log critical checkpoints including entry/exit points, branch decisions, external calls, retries/timeouts, state transitions, and error handling paths."
Also applies to: 117-123
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/score/mod.rs` around lines 69 - 73, The entry log currently prints chunk.token_count but the gate uses scoring_token_count and there is no final log when chunks are kept; update the scoring path in the function handling chunk scoring (e.g., score_chunk) to (1) use the derived scoring_token_count for the initial/logging context where appropriate, and (2) add one final debug log on the kept branch that includes scoring_token_count (the header-stripped token count for chat chunks), total, threshold, entity_count, and kept=true/false so the computed total vs threshold decision is recorded; replicate the same change for the second scoring block referenced around the 117-123 region. Ensure the log message clearly identifies these fields and the decision to aid end-to-end score tuning.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/openhuman/memory/tree/ingest.rs`:
- Around line 139-145: The loop using chunks.iter().zip(scores.into_iter()) can
silently truncate when scores.len() != chunks.len(); before iterating, validate
that scores.len() == chunks.len() and return or propagate an error (or panic) if
they differ so we fail fast; locate the scope where chunks, scores, kept_chunks,
and all_results are used (the block that currently iterates and pushes
kept_chunks and all_results) and insert the explicit length check there,
emitting a clear error message that includes the expected chunks.len() and
actual scores.len() to aid debugging.
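The fail-fast length check could be factored like this; pair_scores and the String error type are illustrative, not the actual ingest.rs signature:

```rust
/// Guard against silent truncation: `zip` stops at the shorter iterator,
/// so a scorer bug that returns fewer results than chunks would quietly
/// drop the tail. Fail fast with a descriptive error instead.
fn pair_scores<'a>(
    chunks: &'a [String],
    scores: Vec<f64>,
) -> Result<Vec<(&'a String, f64)>, String> {
    if scores.len() != chunks.len() {
        return Err(format!(
            "score count mismatch: expected {} (chunks), got {} (scores)",
            chunks.len(),
            scores.len()
        ));
    }
    Ok(chunks.iter().zip(scores).collect())
}

fn main() {
    let chunks = vec!["a".to_string(), "b".to_string()];
    assert!(pair_scores(&chunks, vec![0.5, 0.7]).is_ok());
    let err = pair_scores(&chunks, vec![0.5]).unwrap_err();
    assert!(err.contains("expected 2"));
    println!("ok");
}
```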
In `@src/openhuman/memory/tree/score/mod.rs`:
- Around line 8-226: The mod.rs currently contains operational scoring logic
that should be moved into sibling files: extract the data types/constants
(ScoreResult, ScoringConfig, DEFAULT_DROP_THRESHOLD and related serde/impls) and
helper types into a new types.rs, and move runtime functions and persistence
logic (score_chunk, score_chunks, scoring_content_for_chunk, score_row,
persist_score, persist_score_tx and any helper imports like approx_token_count,
canonicalise, compute/combine from signals) into a new ops.rs; keep mod.rs
minimal by declaring pub mod extract; pub mod resolver; pub mod signals; pub mod
store; pub mod types; pub mod ops; and re-export the public API from mod.rs
(e.g., pub use types::{ScoreResult, ScoringConfig, DEFAULT_DROP_THRESHOLD}; pub
use ops::{score_chunk, score_chunks, persist_score, persist_score_tx}); ensure
necessary use/import lines are moved into the respective files (e.g.,
chrono::Utc and rusqlite::Transaction into ops.rs for score_row/persist_*), and
run cargo build to fix any missing imports.
- Around line 178-186: When re-scoring a chunk (check result.kept) before
calling store::index_entities or store::index_entities_tx, delete existing
mem_tree_entity_index rows for that chunk/node_id to remove stale entries;
specifically, add a deletion by node_id (chunk_id) in the non-transactional path
(around the result.kept && !result.canonical_entities block) and the
transactional path (around the index_entities_tx call) so that INSERT OR REPLACE
only inserts the current set, ensuring removed entities are cleared; reference
the functions index_entities and index_entities_tx and the
mem_tree_entity_index/node_id identifiers when adding the DELETE step (or
alternatively validate entries against current chunk state on read).
In `@src/openhuman/memory/tree/score/signals/types.rs`:
- Around line 26-35: The default SignalWeights produce a neutral baseline that
exceeds the Phase 2 gate; update the SignalWeights::default to rebalance the
per-signal weights so neutral (0.5) values don't already pass the default
threshold—specifically reduce metadata_weight, source_weight, and interaction
(e.g., set metadata_weight and source_weight to 1.0 and interaction to 2.0) in
the SignalWeights::default implementation so near-empty chat chunks are rejected
unless token_count/unique_words/entity_density contribute; adjust only the
values in the SignalWeights::default block.
In `@src/openhuman/memory/tree/score/store.rs`:
- Around line 243-254: The current code casts limit with `limit.unwrap_or(100)
as i64` which can wrap for large `usize` values; clamp the optional `limit` to
the maximum representable positive i64 before casting so callers cannot bypass
the LIMIT (same approach as the chunk listing code). Concretely, take the
`limit` optional, default to 100, clamp it to `i64::MAX` (or `i64::MAX as
usize`) and then cast to i64, and use that clamped `limit` when preparing the
query against `mem_tree_entity_index` in the closure passed to
`with_connection`. Ensure you update the binding used in `params![entity_id,
limit]` so the SQL always receives a non-negative, non-wrapped i64.
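The clamp-before-cast pattern in isolation (sql_limit is a hypothetical helper name; the 100 default comes from the finding above):

```rust
/// SQLite binds LIMIT as i64. A plain `as i64` cast of a huge usize can
/// wrap negative, and SQLite treats a negative LIMIT as "no limit",
/// letting a caller bypass the cap. Clamp before casting instead.
fn sql_limit(limit: Option<usize>) -> i64 {
    limit.unwrap_or(100).min(i64::MAX as usize) as i64
}

fn main() {
    assert_eq!(sql_limit(None), 100);
    assert_eq!(sql_limit(Some(25)), 25);
    // usize::MAX would wrap to -1 with a bare cast; clamped it stays positive.
    assert!(sql_limit(Some(usize::MAX)) > 0);
    println!("ok");
}
```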
---
Nitpick comments:
In `@src/openhuman/memory/tree/score/mod.rs`:
- Around line 156-157: The current score_chunks function spawns all score_chunk
futures at once via try_join_all which can cause unbounded concurrency; change
it to drive the chunk iterator through a buffered stream (e.g.,
futures::stream::iter(chunks).map(|c| score_chunk(c, cfg)).buffer_unordered(n))
and collect the results to Vec<ScoreResult>, using a concurrency cap pulled from
ScoringConfig (add a field like max_concurrency or reuse an existing concurrency
setting on ScoringConfig). Update score_chunks to use futures::stream::{iter,
StreamExt} and buffer_unordered with cfg.max_concurrency and handle
flattening/collecting Result items into the same Result<Vec<ScoreResult>>
return.
- Around line 69-73: The entry log currently prints chunk.token_count but the
gate uses scoring_token_count and there is no final log when chunks are kept;
update the scoring path in the function handling chunk scoring (e.g.,
score_chunk) to (1) use the derived scoring_token_count for the initial/logging
context where appropriate, and (2) add one final debug log on the kept branch
that includes scoring_token_count (the header-stripped token count for chat
chunks), total, threshold, entity_count, and kept=true/false so the computed
total vs threshold decision is recorded; replicate the same change for the
second scoring block referenced around the 117-123 region. Ensure the log
message clearly identifies these fields and the decision to aid end-to-end score
tuning.
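The bounded-fan-out idea behind the `buffer_unordered` suggestion for `score_chunks` — never more than `cap` items in flight at once — can be sketched without the `futures` crate using std threads. This is a swapped-in illustration of the same concurrency-cap principle, not the async stream code the comment asks for:

```rust
use std::sync::mpsc;
use std::thread;

// Stand-in for score_chunk; the real function is async and fallible.
fn score(x: u32) -> u32 {
    x * 2
}

// Process items with at most `cap` running concurrently, preserving order.
fn score_all_bounded(items: Vec<u32>, cap: usize) -> Vec<u32> {
    let mut out = Vec::with_capacity(items.len());
    for batch in items.chunks(cap) {
        let (tx, rx) = mpsc::channel();
        for (i, &item) in batch.iter().enumerate() {
            let tx = tx.clone();
            thread::spawn(move || tx.send((i, score(item))).unwrap());
        }
        drop(tx); // rx.iter() ends once all worker senders are dropped
        let mut results: Vec<(usize, u32)> = rx.iter().collect();
        results.sort_by_key(|&(i, _)| i);
        out.extend(results.into_iter().map(|(_, v)| v));
    }
    out
}
```

In the async version, `futures::stream::iter(...).map(score_chunk).buffer_unordered(cfg.max_concurrency)` plays the role of the batching loop.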
In `@src/openhuman/memory/tree/store.rs`:
- Around line 378-424: Extract the embedding persistence functions
set_chunk_embedding and get_chunk_embedding (and any closely related helpers
like their use of with_connection and Config) into a new sibling module (e.g.,
src/openhuman/memory/tree/embedding.rs) and re-export or import them from the
original store.rs; move only the embedding-specific logic and tests, keep shared
types (Config, with_connection) referenced via pub use or module import, update
store.rs to delegate to the new module so store.rs no longer contains embedding
packing/unpacking code, and ensure all imports/paths and crate visibility are
adjusted to preserve existing public API and unit tests.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 445d3e66-424a-4a71-9db8-fa3acf3e81e8
📒 Files selected for processing (12)
- src/openhuman/memory/tree/ingest.rs
- src/openhuman/memory/tree/score/extract/extractor.rs
- src/openhuman/memory/tree/score/extract/mod.rs
- src/openhuman/memory/tree/score/extract/types.rs
- src/openhuman/memory/tree/score/mod.rs
- src/openhuman/memory/tree/score/resolver.rs
- src/openhuman/memory/tree/score/signals/mod.rs
- src/openhuman/memory/tree/score/signals/ops.rs
- src/openhuman/memory/tree/score/signals/source_weight.rs
- src/openhuman/memory/tree/score/signals/types.rs
- src/openhuman/memory/tree/score/store.rs
- src/openhuman/memory/tree/store.rs
✅ Files skipped from review due to trivial changes (2)
- src/openhuman/memory/tree/score/extract/mod.rs
- src/openhuman/memory/tree/score/signals/mod.rs
🚧 Files skipped from review as they are similar to previous changes (1)
- src/openhuman/memory/tree/score/extract/types.rs
```rust
pub mod extract;
pub mod resolver;
pub mod signals;
pub mod store;

use std::sync::Arc;

use anyhow::Result;
use chrono::Utc;
use futures_util::future::try_join_all;
use rusqlite::Transaction;
use serde::{Deserialize, Serialize};

use self::extract::{EntityExtractor, ExtractedEntities};
use self::resolver::{canonicalise, CanonicalEntity};
use self::signals::{ScoreSignals, SignalWeights};
use crate::openhuman::memory::tree::types::{approx_token_count, Chunk, SourceKind};

/// Default drop threshold. Chunks with `total < DEFAULT_DROP_THRESHOLD`
/// are tombstoned and never reach the chunk store.
pub const DEFAULT_DROP_THRESHOLD: f32 = 0.3;

/// Whole outcome of [`score_chunk`].
#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct ScoreResult {
    pub chunk_id: String,
    pub total: f32,
    pub signals: ScoreSignals,
    pub kept: bool,
    pub drop_reason: Option<String>,
    pub extracted: ExtractedEntities,
    pub canonical_entities: Vec<CanonicalEntity>,
}

/// Configuration passed through the ingest pipeline for Phase 2 behaviour.
///
/// Held as a struct (vs config struct fields) so callers can override per-run
/// without mutating global config — useful for tests and explicit threshold
/// tuning.
pub struct ScoringConfig {
    pub extractor: Arc<dyn EntityExtractor>,
    pub weights: SignalWeights,
    pub drop_threshold: f32,
}

impl ScoringConfig {
    /// Phase 2 default: regex-only extractor, default weights, default threshold.
    pub fn default_regex_only() -> Self {
        Self {
            extractor: Arc::new(extract::CompositeExtractor::regex_only()),
            weights: SignalWeights::default(),
            drop_threshold: DEFAULT_DROP_THRESHOLD,
        }
    }
}

/// Compute the score for one chunk.
///
/// Pure function — does not touch the store. Callers decide what to persist
/// based on [`ScoreResult::kept`].
pub async fn score_chunk(chunk: &Chunk, cfg: &ScoringConfig) -> Result<ScoreResult> {
    log::debug!(
        "[memory_tree::score] score_chunk chunk_id={} tokens={}",
        chunk.id,
        chunk.token_count
    );

    let scoring_content = scoring_content_for_chunk(chunk);
    let scoring_token_count = approx_token_count(&scoring_content);

    // 1. Extract entities (regex + any configured semantic extractors)
    let extracted = cfg.extractor.extract(&scoring_content).await?;

    // 2. Compute signals
    let signals = self::signals::compute(
        &chunk.metadata,
        &scoring_content,
        scoring_token_count,
        &extracted,
    );

    // 3. Weighted combine
    let total = self::signals::combine(&signals, &cfg.weights);

    // 4. Admission gate. Source and interaction priors are deliberately
    // non-zero, so guard against very short entity-free chatter being kept by
    // metadata alone.
    let tiny_entity_free =
        scoring_token_count < self::signals::token_count::TOKEN_MIN && extracted.is_empty();
    let kept = !tiny_entity_free && total >= cfg.drop_threshold;
    let drop_reason = if kept {
        None
    } else if tiny_entity_free {
        Some(format!(
            "token_count {} < minimum {} and no entities extracted",
            scoring_token_count,
            self::signals::token_count::TOKEN_MIN
        ))
    } else {
        Some(format!(
            "total {total:.3} < threshold {:.3}",
            cfg.drop_threshold
        ))
    };

    // 5. Canonicalise for indexing (only meaningful when kept — but we
    // canonicalise unconditionally so the result is inspectable in tests)
    let canonical_entities = canonicalise(&extracted);

    if !kept {
        log::debug!(
            "[memory_tree::score] drop chunk_id={} total={:.3} reason={:?}",
            chunk.id,
            total,
            drop_reason
        );
    }

    Ok(ScoreResult {
        chunk_id: chunk.id.clone(),
        total,
        signals,
        kept,
        drop_reason,
        extracted,
        canonical_entities,
    })
}

fn scoring_content_for_chunk(chunk: &Chunk) -> String {
    if chunk.metadata.source_kind != SourceKind::Chat {
        return chunk.content.clone();
    }

    chunk
        .content
        .lines()
        .filter(|line| {
            let trimmed = line.trim_start();
            !trimmed.starts_with("# Chat transcript") && !trimmed.starts_with("## ")
        })
        .collect::<Vec<_>>()
        .join("\n")
}

/// Score a batch of chunks. Errors from any single chunk fail the batch —
/// scoring is pure-ish (only the extractor may error) and a failure here is
/// a real bug, not a per-chunk issue to tolerate silently.
pub async fn score_chunks(chunks: &[Chunk], cfg: &ScoringConfig) -> Result<Vec<ScoreResult>> {
    try_join_all(chunks.iter().map(|chunk| score_chunk(chunk, cfg))).await
}

// ── Persistence helpers used by the ingest orchestrator ─────────────────

/// Persist the score row + entity-index rows for one kept chunk.
///
/// The caller is responsible for having already written the chunk itself
/// into `mem_tree_chunks` (so the FK-like relation is satisfied). Dropped
/// chunks still get a score row persisted for diagnostics — callers should
/// pass `None` for `tree_id` in that case, since the chunk won't appear in
/// a tree.
pub fn persist_score(
    config: &crate::openhuman::config::Config,
    result: &ScoreResult,
    timestamp_ms: i64,
    tree_id: Option<&str>,
) -> Result<()> {
    let row = score_row(result);
    store::upsert_score(config, &row)?;

    if result.kept && !result.canonical_entities.is_empty() {
        store::index_entities(
            config,
            &result.canonical_entities,
            &result.chunk_id,
            "leaf",
            timestamp_ms,
            tree_id,
        )?;
    }

    Ok(())
}

pub(crate) fn persist_score_tx(
    tx: &Transaction<'_>,
    result: &ScoreResult,
    timestamp_ms: i64,
    tree_id: Option<&str>,
) -> Result<()> {
    let row = score_row(result);
    store::upsert_score_tx(tx, &row)?;

    if result.kept && !result.canonical_entities.is_empty() {
        store::index_entities_tx(
            tx,
            &result.canonical_entities,
            &result.chunk_id,
            "leaf",
            timestamp_ms,
            tree_id,
        )?;
    }

    Ok(())
}

fn score_row(result: &ScoreResult) -> store::ScoreRow {
    // Score rows keep wall-clock scoring time; the separate timestamp_ms
    // argument used for entity indexes is the source/ingest ordering time.
    store::ScoreRow {
        chunk_id: result.chunk_id.clone(),
        total: result.total,
        signals: result.signals.clone(),
        dropped: !result.kept,
        reason: result.drop_reason.clone(),
        computed_at_ms: Utc::now().timestamp_millis(),
    }
}
```
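The admission gate in `score_chunk` reduces to two conditions: reject when the chunk is both tiny and entity-free, otherwise keep iff the weighted total clears the threshold. A minimal standalone restatement — the `TOKEN_MIN` value here is an assumed placeholder, since the real constant lives in `signals::token_count` and is not shown in this diff:

```rust
// Assumed placeholder; the real value is signals::token_count::TOKEN_MIN.
const TOKEN_MIN: usize = 8;

// Mirrors the step-4 gate in score_chunk: metadata/source priors alone must
// not admit very short entity-free chatter.
fn admit(token_count: usize, has_entities: bool, total: f32, threshold: f32) -> bool {
    let tiny_entity_free = token_count < TOKEN_MIN && !has_entities;
    !tiny_entity_free && total >= threshold
}
```

For example, a 2-token chunk with a high total is still rejected unless it carries at least one entity.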
🛠️ Refactor suggestion | 🟠 Major
Move the runtime scoring pipeline out of mod.rs.
This module now owns scoring, batching, and persistence behavior, which conflicts with the repo rule for src/openhuman/**/mod.rs. Please split the operational code into sibling files such as types.rs / ops.rs and keep mod.rs to declarations and re-exports.
As per coding guidelines, "src/openhuman/**/mod.rs: Keep domain mod.rs files light and export-focused; place operational code in sibling files (ops.rs, store.rs, schedule.rs, types.rs, rpc.rs, schemas.rs) and re-export the public API."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/openhuman/memory/tree/score/mod.rs` around lines 8 - 226, The mod.rs
currently contains operational scoring logic that should be moved into sibling
files: extract the data types/constants (ScoreResult, ScoringConfig,
DEFAULT_DROP_THRESHOLD and related serde/impls) and helper types into a new
types.rs, and move runtime functions and persistence logic (score_chunk,
score_chunks, scoring_content_for_chunk, score_row, persist_score,
persist_score_tx and any helper imports like approx_token_count, canonicalise,
compute/combine from signals) into a new ops.rs; keep mod.rs minimal by
declaring pub mod extract; pub mod resolver; pub mod signals; pub mod store; pub
mod types; pub mod ops; and re-export the public API from mod.rs (e.g., pub use
types::{ScoreResult, ScoringConfig, DEFAULT_DROP_THRESHOLD}; pub use
ops::{score_chunk, score_chunks, persist_score, persist_score_tx}); ensure
necessary use/import lines are moved into the respective files (e.g.,
chrono::Utc and rusqlite::Transaction into ops.rs for score_row/persist_*), and
run cargo build to fix any missing imports.
```rust
impl Default for SignalWeights {
    fn default() -> Self {
        Self {
            token_count: 1.0,
            unique_words: 1.0,
            metadata_weight: 1.5,
            source_weight: 1.5,
            interaction: 3.0, // strongest signal — direct user engagement
            entity_density: 1.0,
        }
    }
}
```
These defaults put the neutral baseline above the default gate.
With the current weights, a chat chunk that only gets the neutral 0.5 from metadata_weight, source_weight, and interaction already lands at 0.333…. At the Phase 2 default threshold of 0.3, near-empty chat chunks will be admitted before token/entity signals add anything. Rebalance the defaults or raise the default threshold so the gate can actually reject low-value content.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/openhuman/memory/tree/score/signals/types.rs` around lines 26 - 35, The
default SignalWeights produce a neutral baseline that exceeds the Phase 2 gate;
update the SignalWeights::default to rebalance the per-signal weights so neutral
(0.5) values don't already pass the default threshold—specifically reduce
metadata_weight, source_weight, and interaction (e.g., set metadata_weight and
source_weight to 1.0 and interaction to 2.0) in the SignalWeights::default
implementation so near-empty chat chunks are rejected unless
token_count/unique_words/entity_density contribute; adjust only the values in
the SignalWeights::default block.
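The arithmetic behind this finding checks out, assuming `signals::combine` is a weighted mean (Σ w·s / Σ w) — the combine formula itself is not shown in this diff, so treat this as an illustration of the review's numbers rather than the library's code:

```rust
// Weighted mean: sum(weight * signal) / sum(weight).
fn weighted_mean(pairs: &[(f32, f32)]) -> f32 {
    let num: f32 = pairs.iter().map(|(w, s)| w * s).sum();
    let den: f32 = pairs.iter().map(|(w, _)| w).sum();
    num / den
}

// (weight, signal) pairs under the current defaults; only the three prior
// signals (metadata, source, interaction) sit at neutral 0.5.
const CURRENT: [(f32, f32); 6] = [
    (1.0, 0.0), // token_count
    (1.0, 0.0), // unique_words
    (1.5, 0.5), // metadata_weight
    (1.5, 0.5), // source_weight
    (3.0, 0.5), // interaction
    (1.0, 0.0), // entity_density
];

// Rebalanced as the review suggests: metadata/source -> 1.0, interaction -> 2.0.
const REBALANCED: [(f32, f32); 6] = [
    (1.0, 0.0),
    (1.0, 0.0),
    (1.0, 0.5),
    (1.0, 0.5),
    (2.0, 0.5),
    (1.0, 0.0),
];
```

`weighted_mean(&CURRENT)` is 3.0 / 9.0 ≈ 0.333, which clears the 0.3 gate with every content signal at zero; `weighted_mean(&REBALANCED)` is 2.0 / 7.0 ≈ 0.286, which does not.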
- ingest: fail fast if scorer returns fewer/more results than chunks
(silent zip truncation would drop chunks or their score rationale)
- score::persist_score{,_tx}: clear stale entity-index rows before
re-indexing a re-scored chunk, since INSERT OR REPLACE never deletes
rows whose entity_id is no longer in the new extraction
- score::store::lookup_entity: clamp limit to i64::MAX before casting
to prevent a large usize wrapping into a negative LIMIT
Adds clear_entity_index_drops_stale_rows regression test.
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/openhuman/memory/tree/score/mod.rs (1)
178-184: ⚠️ Potential issue | 🟠 Major
Clear entity lookups when a chunk flips from kept to dropped.
`clear_entity_index_for_node(_tx)` is still gated by `result.kept`. If a chunk was indexed on an earlier ingest and now scores below threshold, the score row flips to dropped but the old `mem_tree_entity_index` rows survive, so `lookup_entity` can still return a chunk that admission just rejected. Move the clear outside the `if result.kept` guard, then only reinsert when `result.kept` is true.

Suggested change:

```diff
-    if result.kept {
-        // Clear any stale entity-index rows for this chunk before re-indexing.
-        // INSERT OR REPLACE on (entity_id, node_id) never deletes rows whose
-        // entity_id is no longer present in the new extraction — so a re-score
-        // that drops an entity would otherwise leave a phantom index row.
-        store::clear_entity_index_for_node(config, &result.chunk_id)?;
-        if !result.canonical_entities.is_empty() {
+    store::clear_entity_index_for_node(config, &result.chunk_id)?;
+    if result.kept && !result.canonical_entities.is_empty() {
         store::index_entities(
             config,
             &result.canonical_entities,
             &result.chunk_id,
             "leaf",
             timestamp_ms,
             tree_id,
         )?;
-        }
-    }
```

Also applies to: 208-210
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/score/mod.rs` around lines 178 - 184, The entity-index clear is incorrectly conditional on result.kept, so when a chunk flips from kept to dropped stale mem_tree_entity_index rows remain; move the call to store::clear_entity_index_for_node(config, &result.chunk_id) so it runs unconditionally whenever processing the score result, and keep the insertion/reinsertion of entities (using result.canonical_entities) only inside the if result.kept branch; apply the same change for the second occurrence that mirrors this logic (the later block around the other result handling).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/openhuman/memory/tree/ingest.rs`:
- Around line 128-130: The current early-return when chunks.is_empty()
(returning IngestResult::empty(source_id)) causes old mem_tree_chunks rows to
remain because we only upsert kept_chunks; change the ingest flow to perform a
transactional replace of the source’s chunk set: begin a DB transaction, delete
existing mem_tree_chunks for the given source_id, then insert the kept_chunks
(which may be zero rows) and commit; do the same replacement logic for the other
upsert branch referenced around the 169-179 section to avoid append-only
behavior. Also add a regression test that simulates a source that first produces
kept chunks and then re-ingests with (a) lowered admission causing kept→dropped
and (b) an empty chunk set, asserting that mem_tree_chunks contains exactly the
expected rows after each re-ingest. Ensure you reference the same identifiers
used in ingest.rs (kept_chunks, source_id, IngestResult::empty, mem_tree_chunks)
when updating the code and tests.
---
Duplicate comments:
In `@src/openhuman/memory/tree/score/mod.rs`:
- Around line 178-184: The entity-index clear is incorrectly conditional on
result.kept, so when a chunk flips from kept to dropped stale
mem_tree_entity_index rows remain; move the call to
store::clear_entity_index_for_node(config, &result.chunk_id) so it runs
unconditionally whenever processing the score result, and keep the
insertion/reinsertion of entities (using result.canonical_entities) only inside
the if result.kept branch; apply the same change for the second occurrence that
mirrors this logic (the later block around the other result handling).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 03a05e78-ffc7-48e5-841b-8d4f525d7694
📒 Files selected for processing (3)
- src/openhuman/memory/tree/ingest.rs
- src/openhuman/memory/tree/score/mod.rs
- src/openhuman/memory/tree/score/store.rs
🚧 Files skipped from review as they are similar to previous changes (1)
- src/openhuman/memory/tree/score/store.rs
```rust
    if chunks.is_empty() {
        return Ok(IngestResult::empty(source_id));
    }
```
Prune stale chunk rows on re-ingest.
This path only upserts kept_chunks and returns early when chunks.is_empty(). If a source previously produced persisted chunks and a later ingest now chunks to zero or falls below the admission threshold, the old mem_tree_chunks rows survive, so retrieval can still surface content that this ingest just rejected. Please replace the source’s chunk set transactionally instead of doing append-only upserts, and add a regression test for kept→dropped / kept→empty re-ingests.
Also applies to: 169-179
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/openhuman/memory/tree/ingest.rs` around lines 128 - 130, The current
early-return when chunks.is_empty() (returning IngestResult::empty(source_id))
causes old mem_tree_chunks rows to remain because we only upsert kept_chunks;
change the ingest flow to perform a transactional replace of the source’s chunk
set: begin a DB transaction, delete existing mem_tree_chunks for the given
source_id, then insert the kept_chunks (which may be zero rows) and commit; do
the same replacement logic for the other upsert branch referenced around the
169-179 section to avoid append-only behavior. Also add a regression test that
simulates a source that first produces kept chunks and then re-ingests with (a)
lowered admission causing kept→dropped and (b) an empty chunk set, asserting
that mem_tree_chunks contains exactly the expected rows after each re-ingest.
Ensure you reference the same identifiers used in ingest.rs (kept_chunks,
source_id, IngestResult::empty, mem_tree_chunks) when updating the code and
tests.
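The append-only-vs-replace behavior flagged here can be shown with a small in-memory model. This is an illustration, not the real `ingest.rs` code: `ChunkTable`, `upsert_only`, and `replace` are hypothetical names, and a real fix would run DELETE + INSERT for the source inside one SQLite transaction:

```rust
use std::collections::HashMap;

// In-memory model of mem_tree_chunks keyed by source_id, contrasting the
// current append-only upsert with the transactional replace the review asks for.
#[derive(Default)]
struct ChunkTable {
    by_source: HashMap<String, Vec<String>>,
}

impl ChunkTable {
    // Current behavior: upsert kept chunks; an empty re-ingest changes nothing.
    fn upsert_only(&mut self, source_id: &str, kept: &[&str]) {
        if kept.is_empty() {
            return; // mirrors the early return on chunks.is_empty()
        }
        let rows = self.by_source.entry(source_id.to_string()).or_default();
        for c in kept {
            if !rows.iter().any(|r| r.as_str() == *c) {
                rows.push(c.to_string());
            }
        }
    }

    // Suggested behavior: replace the source's chunk set wholesale,
    // even when the new set is empty.
    fn replace(&mut self, source_id: &str, kept: &[&str]) {
        self.by_source.insert(
            source_id.to_string(),
            kept.iter().map(|s| s.to_string()).collect(),
        );
    }

    fn rows(&self, source_id: &str) -> Vec<String> {
        self.by_source.get(source_id).cloned().unwrap_or_default()
    }
}
```

Under `upsert_only`, a source that re-ingests to zero chunks keeps its stale rows forever; `replace` clears them.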
…ort-circuit (#708) (#775)

* feat(memory): LLM-based NER + importance signal with cheap-signals short-circuit (#708)

Adds an Ollama-backed entity extractor and an LLM-derived importance score as a new signal in the admission gate. Builds on merged Phase 2 (#733) without changing its surface — additive only.

Why LLM (not GLiNER): zero new deps (reuses existing reqwest + async infra), reuses the same Ollama setup openhuman uses for embeddings, better per-entity quality scales with model choice, no native ONNX runtime to ship. Latency cost (~100-300ms per chunk on a small model like qwen2.5:0.5b) is amortised by the short-circuit. The admission gate stays a hybrid — cheap deterministic signals always run; the LLM is one signal among many, never the sole arbiter. Backwards-compatible: with default SignalWeights the LLM weight is 0.0 and behaviour matches pre-LLM Phase 2 exactly.

What's new:
- src/openhuman/memory/tree/score/extract/llm.rs: LlmEntityExtractor implementing the existing EntityExtractor trait. Posts a structured JSON request to an Ollama-compatible /api/chat endpoint asking for NER + an importance rating in one call. Span recovery via string search; hallucinated entities (surface not in source text) dropped. Soft fallback: HTTP failures log a warn and return empty extraction.
- ExtractedEntities gains llm_importance (Option<f32>) + llm_importance_reason (Option<String>); merge() takes max importance.
- ScoreSignals gains llm_importance (f32, defaults to 0.0).
- SignalWeights gains llm_importance (default 0.0; with_llm_enabled() helper sets it to 2.0 — comparable to metadata/source weights).
- signals::combine_cheap_only(): variant that excludes the LLM signal, used for the short-circuit decision in score_chunk.
- ScoringConfig gains llm_extractor (Option<Arc<dyn EntityExtractor>>), definite_keep_threshold (default 0.85), definite_drop_threshold (default 0.15). New with_llm_extractor() constructor.
- score_chunk() pipeline:
  1. Always-on regex extraction
  2. Cheap signals + combine_cheap_only -> cheap_total
  3. If cheap_total >= definite_keep OR <= definite_drop: skip LLM
  4. Else (borderline band): run LLM extractor, merge results, recompute
  5. Final combine + admission gate against drop_threshold
- mem_tree_score table gains nullable llm_importance + llm_importance_reason columns via idempotent ALTER TABLE migration.
- ScoreRow + upsert/get persist the new column.

What's not changed:
- Existing JSON-RPC surface
- Default behaviour (LLM weight=0, no llm_extractor configured = same outputs as before)
- mem_tree_chunks / mem_tree_entity_index schemas

Tests added:
- LLM extractor: prompt construction, JSON parsing, hallucination drop, importance clamping, strict vs lenient unknown-kind handling
- Score pipeline: short-circuit on definite_keep/definite_drop skips LLM call (verified via FakeLlm call counter); borderline band consults LLM once; LLM failure falls back gracefully without erroring

LLM-NER work follows up on #708. Parent: #711.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(memory): address CI failures and CodeRabbit review on LLM-NER (#708)

CI was red on PR #775 due to 10 E0063 errors — existing test-only struct literals for ExtractedEntities / ScoreSignals / ScoreRow weren't updated when the new optional llm_importance fields landed. Local cargo check on lib-only passed; cargo check --tests caught them, as does CI.

CodeRabbit also flagged four semantic issues:

1. LlmEntityExtractor::extract() returned Err on HTTP transport, non-2xx status, and JSON parse failures. The soft-fallback contract (documented in the module header) only worked because score_chunk catches errors; any other caller got a hard error. Now split into extract_or_empty() which logs a warn on each failure mode and returns ExtractedEntities::default() unconditionally. The public extract() stays on the async-trait surface but never returns Err.
2. find_char_span() always searched from byte offset 0, so when the LLM returned the same surface twice (explicitly asked for duplicates in the prompt), both entities got (start=0, end=len) — same span, not distinct mentions. Added find_char_span_from(haystack, needle, byte_from, char_from) that resumes from a cursor. into_extracted_entities now tracks a per-surface (byte_after, char_after) cursor in a HashMap<String, (usize, u32)>, so duplicate mentions get distinct non-overlapping spans and over-claim (LLM returns 3 "Alice" when source has 2) drops the excess entries with a debug log.
3. combine() always included w.llm_importance in the denominator even when llm_importance=0.0 because the LLM didn't run. That artificially lowered the total on short-circuit paths. score_chunk now branches on llm_consulted: uses combine() when LLM ran (importance actually contributes), combine_cheap_only() when LLM was skipped or failed (denominator excludes the LLM weight). Observable in two new tests: short_circuit_reports_cheap_only_total verifies r.total == combine_cheap_only(r.signals, weights) and strictly exceeds combine(r.signals, weights) when llm_importance=0; llm_consulted_reports_full_total verifies the opposite path.
4. Same issue from a different angle (CodeRabbit flagged both). Fixed by the same llm_consulted branch.

Test-literal fixes:
- signals/ops.rs: ScoreSignals literals pick up llm_importance field; make_entities helper uses ..Default::default() on ExtractedEntities.
- extract/types.rs: three test ExtractedEntities literals gain llm_importance: None, llm_importance_reason: None.
- resolver.rs: canonicalise_batch_preserves_spans literal gains the two fields.
- score/store.rs: sample_row gains signals.llm_importance and llm_importance_reason.

New tests (7 added):
- find_char_span_from_advances_past_prior_match
- find_char_span_from_returns_none_after_exhaustion
- find_char_span_from_preserves_utf8
- find_char_span_from_rejects_non_char_boundary
- into_extracted_entities_gives_distinct_spans_to_duplicate_mentions
- into_extracted_entities_drops_extra_duplicate_when_source_only_has_one
- extract_soft_fallback_on_unreachable_endpoint (transport failure → empty extraction, not Err)
- short_circuit_reports_cheap_only_total
- llm_consulted_reports_full_total

cargo check --lib clean; cargo fmt clean. The integration-test rmeta errors visible locally are stale-metadata artifacts from incremental builds with prior struct shapes; CI's clean build resolves them and tests/*.rs files do not construct any of the touched types.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
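The cursor-resumed span search described in the fix can be sketched in a few lines. This models the idea only — the real `find_char_span_from` also threads a `char_from` cursor and lives alongside `into_extracted_entities`, so treat the signature here as an assumption:

```rust
// Find `needle` in `haystack` starting at `byte_from`, returning a char-indexed
// (start, end) span. Resuming from a cursor gives duplicate mentions distinct,
// non-overlapping spans instead of all mapping to the first occurrence.
fn find_char_span_from(haystack: &str, needle: &str, byte_from: usize) -> Option<(usize, usize)> {
    // get(..) returns None when byte_from is not a char boundary — this is
    // what makes the search UTF-8-safe.
    let rel = haystack.get(byte_from..)?.find(needle)?;
    let start_byte = byte_from + rel;
    let start_char = haystack[..start_byte].chars().count();
    Some((start_char, start_char + needle.chars().count()))
}
```

A caller tracking a per-surface byte cursor would pass the end byte of the previous match as the next `byte_from`, so the second "Alice" in "Alice met Alice" gets its own span.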
* feat(memory): LLM-based NER + importance signal with cheap-signals short-circuit (#708) Adds an Ollama-backed entity extractor and an LLM-derived importance score as a new signal in the admission gate. Builds on merged Phase 2 (#733) without changing its surface — additive only. Why LLM (not GLiNER): zero new deps (reuses existing reqwest + async infra), reuses the same Ollama setup openhuman uses for embeddings, better per-entity quality scales with model choice, no native ONNX runtime to ship. Latency cost (~100-300ms per chunk on a small model like qwen2.5:0.5b) is amortised by the short-circuit. The admission gate stays a hybrid - cheap deterministic signals always run; the LLM is one signal among many, never the sole arbiter. Backwards-compatible: with default SignalWeights the LLM weight is 0.0 and behaviour matches pre-LLM Phase 2 exactly. What's new: - src/openhuman/memory/tree/score/extract/llm.rs: LlmEntityExtractor implementing the existing EntityExtractor trait. Posts a structured JSON request to an Ollama-compatible /api/chat endpoint asking for NER + an importance rating in one call. Span recovery via string search; hallucinated entities (surface not in source text) dropped. Soft fallback: HTTP failures log a warn and return empty extraction. - ExtractedEntities gains llm_importance (Option<f32>) + llm_importance_reason (Option<String>); merge() takes max importance. - ScoreSignals gains llm_importance (f32, defaults to 0.0). - SignalWeights gains llm_importance (default 0.0; with_llm_enabled() helper sets it to 2.0 - comparable to metadata/source weights). - signals::combine_cheap_only(): variant that excludes the LLM signal, used for the short-circuit decision in score_chunk. - ScoringConfig gains llm_extractor (Option<Arc<dyn EntityExtractor>>), definite_keep_threshold (default 0.85), definite_drop_threshold (default 0.15). New with_llm_extractor() constructor. - score_chunk() pipeline: 1. Always-on regex extraction 2. 
Cheap signals + combine_cheap_only -> cheap_total 3. If cheap_total >= definite_keep OR <= definite_drop: skip LLM 4. Else (borderline band): run LLM extractor, merge results, recompute 5. Final combine + admission gate against drop_threshold - mem_tree_score table gains nullable llm_importance + llm_importance_reason columns via idempotent ALTER TABLE migration. - ScoreRow + upsert/get persist the new column. What's not changed: - Existing JSON-RPC surface - Default behaviour (LLM weight=0, no llm_extractor configured = same outputs as before) - mem_tree_chunks / mem_tree_entity_index schemas Tests added: - LLM extractor: prompt construction, JSON parsing, hallucination drop, importance clamping, strict vs lenient unknown-kind handling - Score pipeline: short-circuit on definite_keep/definite_drop skips LLM call (verified via FakeLlm call counter); borderline band consults LLM once; LLM failure falls back gracefully without erroring LLM-NER work follows up on #708. Parent: #711. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(memory): address CI failures and CodeRabbit review on LLM-NER (#708) CI was red on PR #775 due to 10 E0063 errors — existing test-only struct literals for ExtractedEntities / ScoreSignals / ScoreRow weren't updated when the new optional llm_importance fields landed. Local cargo check on lib-only passed; cargo check --tests caught them, as does CI. CodeRabbit also flagged four semantic issues: 1. LlmEntityExtractor::extract() returned Err on HTTP transport, non-2xx status, and JSON parse failures. The soft-fallback contract (documented in the module header) only worked because score_chunk catches errors; any other caller got a hard error. Now split into extract_or_empty() which logs a warn on each failure mode and returns ExtractedEntities::default() unconditionally. The public extract() stays on the async-trait surface but never returns Err. 2. 
find_char_span() always searched from byte offset 0, so when the LLM returned the same surface twice (explicitly asked for duplicates in the prompt), both entities got (start=0, end=len) — same span, not distinct mentions. Added find_char_span_from(haystack, needle, byte_from, char_from) that resumes from a cursor. into_extracted_entities now tracks a per-surface (byte_after, char_after) cursor in a HashMap<String, (usize, u32)>, so duplicate mentions get distinct non-overlapping spans and over-claim (LLM returns 3 "Alice" when source has 2) drops the excess entries with a debug log. 3. combine() always included w.llm_importance in the denominator even when llm_importance=0.0 because the LLM didn't run. That artificially lowered the total on short-circuit paths. score_chunk now branches on llm_consulted: uses combine() when LLM ran (importance actually contributes), combine_cheap_only() when LLM was skipped or failed (denominator excludes the LLM weight). Observable in two new tests: short_circuit_reports_cheap_only_total verifies r.total == combine_cheap_only(r.signals, weights) and strictly exceeds combine(r.signals, weights) when llm_importance=0; llm_consulted_ reports_full_total verifies the opposite path. 4. Same issue from a different angle (CodeRabbit flagged both). Fixed by the same llm_consulted branch. Test-literal fixes: - signals/ops.rs: ScoreSignals literals pick up llm_importance field; make_entities helper uses ..Default::default() on ExtractedEntities. - extract/types.rs: three test ExtractedEntities literals gain llm_importance: None, llm_importance_reason: None. - resolver.rs: canonicalise_batch_preserves_spans literal gains the two fields. - score/store.rs: sample_row gains signals.llm_importance and llm_importance_reason. 
New tests (7 added):

- find_char_span_from_advances_past_prior_match
- find_char_span_from_returns_none_after_exhaustion
- find_char_span_from_preserves_utf8
- find_char_span_from_rejects_non_char_boundary
- into_extracted_entities_gives_distinct_spans_to_duplicate_mentions
- into_extracted_entities_drops_extra_duplicate_when_source_only_has_one
- extract_soft_fallback_on_unreachable_endpoint (transport failure → empty extraction, not Err)
- short_circuit_reports_cheap_only_total
- llm_consulted_reports_full_total

cargo check --lib clean; cargo fmt clean. The integration-test rmeta errors visible locally are stale-metadata artifacts from incremental builds with prior struct shapes; CI's clean build resolves them, and the tests/*.rs files do not construct any of the touched types.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(memory): source trees + bucket-seal foundation (#709)

Phase 3a of the memory architecture (#711 umbrella). Lifts admitted leaves into a per-source hierarchy so Phase 4 retrieval has a tree to walk.

## What's in this PR

* **Schema** — three new tables in `chunks.db` alongside Phase 1/2: `mem_tree_trees`, `mem_tree_summaries`, `mem_tree_buffers`. Plus a `parent_summary_id` backlink column on `mem_tree_chunks`. All migrations are additive and idempotent (`ALTER TABLE ADD COLUMN`, same pattern as Phase 2).
* **`source_tree/` module** — isolated subdir; does not touch the legacy `openhuman::memory` layer.
  - `types.rs` — `Tree`, `SummaryNode`, `Buffer`, `TreeKind` (Source/Topic/Global), `TreeStatus`. Constants: `TOKEN_BUDGET = 10_000`, `DEFAULT_FLUSH_AGE_SECS = 7d`.
  - `store.rs` — CRUD for all three tables via the shared `with_connection` entry point. Writes are transactional; summary insert is idempotent on primary key (INSERT OR IGNORE).
  - `registry.rs` — `get_or_create_source_tree(scope)` — idempotent lookup keyed on `(kind, scope)`.
  - `bucket_seal.rs` — `append_leaf` + cascade: push a leaf into L0, seal when `token_sum >= TOKEN_BUDGET`, recurse upward. One seal = one transaction; children get a parent backlink (leaves via `parent_summary_id`, summaries via `parent_id`). The tree's `max_level` and `root_id` bump on root split.
  - `flush.rs` — `flush_stale_buffers(max_age)` force-seals any buffer whose `oldest_at` is older than `max_age`. Primitive only; wiring into a scheduler is out of scope.
  - `summariser/` — `Summariser` trait + `InertSummariser` fallback (concatenate children, union entities preserving first-seen order, hard-truncate to budget). The trait is async; an Ollama implementation slots in later without breaking callers.
* **Ingest wiring** — `tree/ingest.rs::persist` now calls `append_leaf` once per kept chunk after chunks + scores land. Failures are logged at warn level and do not fail the ingest — leaves are already persisted and the next flush can still rebuild the tree.

## Concurrency / LLM

The async summariser call happens OUTSIDE any DB transaction, so a slow LLM never holds SQLite locks. Blocking DB calls inside `append_leaf` are acceptable for Phase 3a because the Inert summariser does no real I/O; when a networked summariser lands, wrap the DB calls in `tokio::task::spawn_blocking`.
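The L0 seal rule above can be sketched in a few lines. This is a minimal in-memory stand-in — the real `append_leaf` persists to SQLite, writes backlinks, and cascades upward on seal:

```rust
/// Same budget constant as source_tree/types.rs.
const TOKEN_BUDGET: u32 = 10_000;

/// In-memory stand-in for the L0 buffer row (illustration only).
#[derive(Default)]
struct Buffer {
    item_ids: Vec<String>,
    token_sum: u32,
}

/// Push a leaf; once the buffer crosses TOKEN_BUDGET, "seal" it: drain the
/// children (they become a summary's child_ids) and reset the buffer.
/// Returns the sealed child ids, or None if the leaf was only buffered.
fn append_leaf(buf: &mut Buffer, leaf_id: &str, tokens: u32) -> Option<Vec<String>> {
    buf.item_ids.push(leaf_id.to_string());
    buf.token_sum += tokens;
    if buf.token_sum >= TOKEN_BUDGET {
        let children = std::mem::take(&mut buf.item_ids);
        buf.token_sum = 0;
        Some(children)
    } else {
        None
    }
}
```

Three 4k-token leaves cross the 10k budget on the third append — the same 12k-token cross-over scenario the `bucket_seal` tests below exercise.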
## Tests (23 new, all green)

* `types` — enum round-trips, buffer staleness predicates
* `store` — tree insert + unique scope, summary insert + idempotence, buffer upsert/clear, stale-buffer ordering
* `registry` — `get_or_create` idempotence, distinct scopes yield distinct trees, id prefix format
* `summariser/inert` — provenance-prefixed concat, entity union with order preservation, budget truncation, empty-content skip
* `bucket_seal` — append-below-budget is buffered only (no seal); 12k-token cross-over produces an L1 summary with correct `child_ids`, L0 cleared, L1 carries the new summary id, tree `max_level`/`root_id` updated, leaf → parent backlink populated
* `flush` — stale buffer force-seals under budget; recent buffer is a no-op

Broader `memory::tree` suite: 160 tests pass (23 new + 137 existing Phase 1/2), no regressions.

## Stacked on #775

This PR applies on top of `feat/708-memory-llm-ner` (PR #775, the Phase 2 LLM-NER follow-up). Merge #775 first; this branch will then rebase cleanly onto main.

## Out of scope (tracked separately)

* **3b — Global activity digest tree** — time-aligned cross-source recap
* **3c — Topic trees** — hotness-driven per-entity materialisation
* **Phase 4 (#710)** — retrieval / query / gate

Closes part of #709 (source-tree foundation).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* style(memory): cargo fmt on source_tree module (#709)

CI rustfmt caught a few multi-line function signatures and import groups that rustfmt wants collapsed / reordered. No behaviour change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(memory): address CodeRabbit review on source-tree (#709)

Three correctness fixes from CodeRabbit on PR #789, plus a design clarification for the InertSummariser. False-positive findings (schema-out-of-sync, rustfmt — already in b7395a8) are not addressed here because they are not real issues.
## Changes

### Idempotent buffer append (`bucket_seal.rs`)

`append_to_buffer` commits before the async seal/cascade runs, so a retry after a failed seal (summariser timeout, crash, transient ingest error) would re-push the same leaf and double-count tokens. Added a dedup check on `item_ids` so the append is a true no-op for already-buffered chunks.

### InertSummariser: honest stub, no entity propagation (`summariser/inert.rs`)

Summary-level entities and topics are **LLM-derived from the summary's own synthesised content** by design, not a mechanical union of children's labels. The networked summariser (future) does its own extraction on its output. The inert fallback has no NER, so it emits empty entities/topics — an honest stub rather than a misleading union. Removed the `union_preserve_order` helper and updated the corresponding test.

### Forward-compat entity index on seal (`bucket_seal.rs`, `score/store.rs`)

Added `index_summary_entity_ids_tx` — a summary-specific indexer that takes canonical ids only (matching the `Vec<String>` shape of `SummaryOutput.entities`) and writes them into `mem_tree_entity_index` with `node_kind='summary'`, placeholder `entity_kind`/`surface` (both set to the canonical id), and the summary's score. Called from `seal_one_level`'s write transaction. A no-op today (InertSummariser emits empty), but wired so the Ollama summariser lands as a drop-in without touching bucket-seal.

### UNIQUE race in `get_or_create_source_tree` (`registry.rs`)

Two concurrent callers for the same scope could both pass the lookup and then race on `insert_tree`; the loser got a UNIQUE violation bubbled up as an error instead of the pre-existing row. Now catches `ConstraintViolation` (and the message-based fallback for chained errors), re-queries by scope, and returns the winning row. New test `get_or_create_recovers_from_unique_race` covers both the recovery path and the `is_unique_violation` detector.
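The idempotent-append guard described above amounts to a membership check before the push. A minimal sketch (the struct and function names mirror the commit, but the body is an assumption — the real code works against the `mem_tree_buffers` row):

```rust
/// In-memory stand-in for a buffer row (illustration only).
struct BufferState {
    item_ids: Vec<String>,
    token_sum: u32,
}

/// Re-appending an already-buffered chunk is a true no-op, so a retry after a
/// failed seal cannot double-count tokens. Returns whether anything changed.
fn append_to_buffer(buf: &mut BufferState, chunk_id: &str, tokens: u32) -> bool {
    if buf.item_ids.iter().any(|id| id == chunk_id) {
        return false; // already buffered — skip the push and the token bump
    }
    buf.item_ids.push(chunk_id.to_string());
    buf.token_sum += tokens;
    true
}
```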
## Tests

161 passed / 0 failed across `memory::tree` (+1 new test vs the prior baseline of 160). No regressions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* style(memory): cargo fmt on source_tree::registry (#709)

CI's rustfmt check flagged the race-recovery block in get_or_create_source_tree — rustfmt prefers chaining `?.ok_or_else(...)` on one line and wrapping the assert! macro. No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(memory): index_summary_entity_ids_tx writes parseable entity_kind (#709)

CodeRabbit on PR #789 caught a bug in the summary entity indexer added for Phase 3a. The helper was writing the full canonical id (e.g. "email:alice@example.com") into the `entity_kind` column instead of the kind prefix. `lookup_entity()` later runs `EntityKind::parse()` on that column and would fail with `unknown EntityKind` the first time a result set mixed leaf and summary hits — blocking any Phase 4 read that needed summary-level entity resolution.

Fix: split the canonical id on `:` and store the prefix only. Falls back to the full id with a `warn!` for malformed ids that lack a `:`, so bad data surfaces rather than staying silently latent.

Added regression test `summary_entity_index_kind_is_parseable` that:

- Indexes a leaf entity row (the existing happy path)
- Indexes two summary entity rows via index_summary_entity_ids_tx
- Runs lookup_entity for email, hashtag, and a cross-kind id
- Asserts each row comes back with a correctly-parsed EntityKind

Before the fix, the lookup_entity calls in this test would fail with a FromSqlConversionFailure on the summary row's entity_kind column. After the fix, mixed leaf+summary lookups round-trip cleanly.

Tests: 162 passed / 0 failed in memory::tree (+1 vs the 161 baseline).
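The prefix-split fix boils down to one small helper. A sketch under the stated contract (the function name is illustrative, not the actual one):

```rust
/// Derive the entity_kind column value from a canonical id by taking the
/// prefix before ':' — e.g. "email:alice@example.com" indexes as "email",
/// which EntityKind::parse() can round-trip. Malformed ids without a ':'
/// fall back to the full id (the real code also logs a warn! there) so bad
/// data surfaces instead of staying latent.
fn entity_kind_for_index(canonical_id: &str) -> &str {
    match canonical_id.split_once(':') {
        Some((kind, _rest)) => kind,
        None => canonical_id,
    }
}
```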
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(memory): promote extracted topics to canonical entities (#709)

Addresses a design gap surfaced during PR #789 review: the Phase 3 plan promised topic trees would scope to "entities / topics", but the implementation routed on canonical_entities only. A chunk saying "Phoenix migration ships Friday" produced `r.extracted.topics = ["phoenix", "migration"]` but `canonical_entities = []` — meaning no topic tree named Phoenix would ever spawn.

Fix (Shape A from the review discussion): extend the canonical entity stream with topic rows. No new tables, no new routing path, no new hotness counters.

## Changes

- `score/extract/types.rs`: add `EntityKind::Topic` with as_str/parse round-trip. The enum is `non_exhaustive`, so this is an additive change.
- `score/resolver.rs::canonicalise`: after emitting canonical entities from `extracted.entities`, also emit one per `extracted.topics` entry with `kind: Topic`, `canonical_id: "topic:<lowercased>"`. Span fields are 0 (topics are chunk-level, not substring-scoped). Dedups same-label topics internally; keeps `hashtag:launch` and `topic:launch` as separate entities by design.

## Downstream effect (zero-touch)

- `score::store::index_entities_tx` indexes topic entities the same way as email entities — rows land in `mem_tree_entity_index` with `entity_kind = "topic"`.
- Phase 3c routing iterates `canonical_entities` — topics now trigger topic-tree routing automatically. A chunk about "phoenix" starts accumulating hotness for `topic:phoenix`; once the threshold is crossed, `get_or_create_topic_tree("topic:phoenix")` materialises a dedicated topic tree. Backfill via `lookup_entity("topic:phoenix")` then hydrates historic leaves.
- Phase 4 retrieval can filter `WHERE entity_kind = 'topic'` to surface themes without mixing in people/emails.
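The topic-promotion step in `canonicalise` can be sketched as a small id-builder. The function is a hypothetical stand-in for the topic-emitting tail of the real resolver, which also carries kind/span fields:

```rust
use std::collections::HashSet;

/// For each extracted topic label, emit a canonical "topic:<lowercased>" id.
/// Labels are trimmed + lowercased and deduped on the normalised label;
/// "hashtag:<label>" ids remain separate by design (different kind prefix).
fn canonical_topic_ids(topics: &[&str]) -> Vec<String> {
    let mut seen = HashSet::new();
    let mut out = Vec::new();
    for t in topics {
        let label = t.trim().to_lowercase();
        if label.is_empty() || !seen.insert(label.clone()) {
            continue; // skip blanks and same-label duplicates
        }
        out.push(format!("topic:{label}"));
    }
    out
}
```

So `["Phoenix", "migration"]` yields `topic:phoenix` and `topic:migration`, which then flow through `index_entities_tx` and Phase 3c routing like any other canonical entity.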
## Tests

Five new tests in `resolver.rs`:

- topics → canonical entities with `kind: Topic`
- lowercase normalisation + dedup on label
- hashtag + topic with the same label coexist (different kind prefix)
- entities emitted first, topics appended
- `EntityKind::Topic` round-trips through `parse` / `as_str`

Broader memory::tree: 167 passed / 0 failed (162 baseline + 5 new). No regressions.

## Caveats for follow-up

1. Topic noise — LLM-extracted themes are a softer signal than regex entities. The hotness threshold (TOPIC_CREATION_THRESHOLD=10.0) gates tree creation, so transient topics won't spawn trees.
2. Topic/hashtag duplication — "Phoenix #phoenix" creates both `topic:phoenix` and `hashtag:phoenix` trees. Tolerable (the kinds are semantically distinct), but a dedup policy may be worth revisiting in a Phase 4 retrieval follow-up.
3. Topic normalisation is lowercase+trim only. Plurals / stemming are deferred.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Steven Enamakel <31011319+senamakel@users.noreply.github.com>
Summary
Adds the scoring / admission layer on top of Phase 1's `mem_tree_chunks` substrate.

Problem
#708's brief: raw ingested chunks are mixed-signal. Without a preprocessing layer, "+1" reactions and multi-page pasted logs enter the index with the same weight as substantive content. We need preprocessing, scoring, and an admission gate between Phase 1's chunker and the store.
Solution
New module at `src/openhuman/memory/tree/score/`:
Pipeline
```
canonicalise → chunk → extract entities → compute signals → combine (weighted)
→ admission gate (drop if total < threshold)
→ canonicalise entities (cross-platform merge)
→ persist: chunks (kept only) + score rationale (all) + entity index (kept)
```
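The combine + gate steps above can be sketched as a weighted average followed by a threshold check. Signal and weight names here are illustrative stand-ins for the real `ScoreSignals`/weights types, and the values are assumed to be pre-normalised:

```rust
/// Illustrative signal bundle; each value is assumed normalised to 0.0..=1.0.
struct Signals {
    token_count: f64,
    unique_word_ratio: f64,
    metadata_weight: f64,
    source_weight: f64,
    interaction: f64,
    entity_density: f64,
}

/// Weighted average over the cheap signals.
fn combine(s: &Signals, w: &[f64; 6]) -> f64 {
    let vals = [
        s.token_count,
        s.unique_word_ratio,
        s.metadata_weight,
        s.source_weight,
        s.interaction,
        s.entity_density,
    ];
    let total: f64 = vals.iter().zip(w).map(|(v, wi)| v * wi).sum();
    let denom: f64 = w.iter().sum();
    if denom == 0.0 { 0.0 } else { total / denom }
}

/// Admission gate: keep the chunk only when its score clears the threshold
/// (default 0.3; setting the threshold to 0 disables the gate).
fn admit(score: f64, drop_threshold: f64) -> bool {
    score >= drop_threshold
}
```

Dividing by the sum of weights is what lets individual weights be retuned (or zeroed out) without rescaling the threshold.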
Schema
Phase 1's `with_connection` helper now applies two additional tables (via `CREATE TABLE IF NOT EXISTS`) plus an idempotent `ALTER TABLE ADD COLUMN embedding BLOB` migration. The embedding column is wired but unpopulated in this PR — it lands properly in Phase 3 retrieval work.
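The idempotent column migration reduces to "only ALTER if the column is missing". A pure-logic stand-in (the real migration inspects the live schema through the connection rather than taking a column list):

```rust
/// Decide whether the embedding-column migration still needs to run, given
/// the table's current column names. Re-running against an already-migrated
/// table yields None — i.e. the migration is a no-op, hence idempotent.
fn embedding_migration_sql(existing_columns: &[&str]) -> Option<&'static str> {
    if existing_columns.contains(&"embedding") {
        None
    } else {
        Some("ALTER TABLE mem_tree_chunks ADD COLUMN embedding BLOB")
    }
}
```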
Async surface
`ingest_chat` / `ingest_email` / `ingest_document` are now `async` to accommodate the `EntityExtractor` trait. Blocking SQLite work stays isolated on `tokio::task::spawn_blocking` inside `persist`. The `memory_tree` JSON-RPC surface is unchanged.
Design decisions / tradeoffs
Semantic NER (GLiNER) is deliberately deferred: it would pull in the `ort` crate + a ~170MB ONNX file + a soft-fallback loader. That belongs in its own PR behind a cargo feature. Phase 2 ships the pipeline skeleton so the drop-in is trivial.

Submission Checklist
Impact
Related