feat(memory): Phase 2 memory tree - preprocessing, scoring, admission gate (#708) #733
Conversation
…l chunks (tinyhumansai#707)

Adds an isolated memory tree layer under src/openhuman/memory/tree/ implementing Phase 1 of the new memory architecture (umbrella tinyhumansai#711). Zero edits to existing memory/*.rs files - the new layer coexists with the legacy TinyHumans-backed client.

- Source adapters: chat / email / document -> canonical Markdown
- Token-bounded chunker with deterministic SHA-256 chunk IDs
- SQLite persistence at <workspace>/memory_tree/chunks.db with full provenance metadata (source_kind, source_id, owner, timestamps, tags, time_range) and back-pointer to raw source
- Unified JSON-RPC ingest (dispatches on source_kind + JSON payload): openhuman.memory_tree_ingest, _list_chunks, _get_chunk
- DataSource enum covering the 8 providers from m.excalidraw step 1 (Discord/Telegram/Whatsapp/Gmail/OtherEmail/Notion/MeetingNotes/DriveDocs)
- ~40 unit tests (chunk ID stability, UTF-8-safe splitting, canonicalisation idempotence, store round-trip, filter behavior)

Additive only: new tables in a new DB file, new JSON-RPC namespace, no existing behavior changes. Feeds tinyhumansai#708 (scoring), tinyhumansai#709 (summary trees), tinyhumansai#710 (query tools).

Closes tinyhumansai#707. Parent: tinyhumansai#711.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
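The Phase 1 chunking and ID scheme can be sketched in a few lines. This is a simplified illustration, not the code from src/openhuman/memory/tree/chunker.rs: the word-based token approximation and the use of std's DefaultHasher in place of SHA-256 (to keep the sketch dependency-free) are assumptions; the real chunk IDs are SHA-256 digests.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Rough token estimate: whitespace-separated words. The real chunker
/// presumably uses a proper tokenizer; this is a stand-in.
fn approx_token_count(text: &str) -> usize {
    text.split_whitespace().count()
}

/// Split canonical Markdown into chunks of at most `max_tokens` words.
/// Splitting on whitespace keeps every boundary on a char boundary,
/// so multi-byte UTF-8 text is never cut mid-character.
fn chunk(text: &str, max_tokens: usize) -> Vec<String> {
    let words: Vec<&str> = text.split_whitespace().collect();
    words.chunks(max_tokens).map(|w| w.join(" ")).collect()
}

/// Deterministic chunk ID from (source_id, seq, content). The PR uses
/// SHA-256; std's DefaultHasher stands in here. Same inputs, same ID.
fn chunk_id(source_id: &str, seq: usize, content: &str) -> String {
    let mut h = DefaultHasher::new();
    (source_id, seq, content).hash(&mut h);
    format!("{:016x}", h.finish())
}

fn main() {
    let chunks = chunk("héllo wörld from a chat transcript", 2);
    assert_eq!(chunks, vec!["héllo wörld", "from a", "chat transcript"]);
    // ID stability: hashing the same triple twice gives the same ID.
    assert_eq!(chunk_id("discord:42", 0, &chunks[0]), chunk_id("discord:42", 0, &chunks[0]));
    assert_eq!(approx_token_count(&chunks[0]), 2);
    println!("ok");
}
```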
… gate (tinyhumansai#708)

Adds the scoring / admission layer between Phase 1's chunker and store. Stacked on feat/707-memory-ingestion (PR tinyhumansai#732) - depends on Phase 1's chunk substrate.

- Pluggable EntityExtractor trait + CompositeExtractor chain
- RegexEntityExtractor: mechanical entities (emails, URLs, @Handles, #hashtags)
  - Always on, deterministic, zero deps, UTF-8-safe char spans
- Six weighted signals: token count, unique-word ratio, metadata weight, source weight (per-DataSource), interaction (reply/sent/mention/dm tags), entity density
- Exact-match entity canonicalisation (email lowercased, @ and # stripped)
- Admission gate drops chunks below configurable threshold (default 0.3)
- Score rationale persists for EVERY chunk (kept or dropped) for debugging
- Entities indexed for KEPT chunks only
- Two new SQLite tables added to the memory_tree DB:
  - mem_tree_score: per-chunk score rationale with all signal values
  - mem_tree_entity_index: inverted index entity_id -> node_id
- Idempotent ALTER TABLE migration adds embedding BLOB column to mem_tree_chunks (used in Phase 3 retrieval, wired but not populated here)
- Ingest pipeline converted to async to accommodate the extractor trait; blocking SQLite work isolated on spawn_blocking; JSON-RPC surface unchanged (same memory_tree_ingest / list / get methods)
- Phase 2 deliberately ships without GLiNER/semantic NER - per-chunk semantic entities land later behind a cargo feature flag; the composite extractor interface keeps that drop-in trivial

Additive only: new tables, new columns, new module. Existing Phase 1 behavior unchanged except that low-signal chunks are now dropped before reaching mem_tree_chunks. Set score_drop_threshold to 0 to disable the gate and restore Phase-1-identical behavior.

Closes tinyhumansai#708. Parent: tinyhumansai#711. Depends on: tinyhumansai#707 (tinyhumansai#732).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
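The signal combination and admission gate might look roughly like the following. The Signal struct, the weighted-average combine(), and the example weights are all assumptions for illustration; only the 0.3 default threshold and the "0 disables the gate" behavior come from the PR description.

```rust
/// One normalised signal value in [0.0, 1.0] plus its weight.
struct Signal {
    value: f64,
    weight: f64,
}

/// Weighted average of the signals: sum(w_i * v_i) / sum(w_i).
/// (How the real combine() mixes signals is an assumption here.)
fn combine(signals: &[Signal]) -> f64 {
    let total_weight: f64 = signals.iter().map(|s| s.weight).sum();
    if total_weight == 0.0 {
        return 0.0;
    }
    signals.iter().map(|s| s.value * s.weight).sum::<f64>() / total_weight
}

/// Admission gate: keep the chunk only if its score clears the threshold.
/// A threshold of 0.0 keeps everything (Phase-1-identical behaviour).
fn admit(score: f64, drop_threshold: f64) -> bool {
    score >= drop_threshold
}

fn main() {
    let strong = [
        Signal { value: 0.9, weight: 2.0 }, // token count
        Signal { value: 0.8, weight: 1.0 }, // unique-word ratio
        Signal { value: 0.7, weight: 1.0 }, // entity density
    ];
    let weak = [
        Signal { value: 0.1, weight: 2.0 },
        Signal { value: 0.2, weight: 1.0 },
        Signal { value: 0.0, weight: 1.0 },
    ];
    assert!(admit(combine(&strong), 0.3));
    assert!(!admit(combine(&weak), 0.3));
    assert!(admit(combine(&weak), 0.0)); // gate disabled
    println!("ok");
}
```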
📝 Walkthrough

Added a Phase‑2 scoring/admission pipeline into ingestion: async scoring of chunks (kept vs dropped), persistence of scores and only-kept chunks/entities, new scoring modules (extract, resolver, signals, store), async ingest APIs, schema updates, and SQLite schema/indexing changes plus tests updated to cover admission behavior.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant RPC as RPC handler
    participant Ingest as Ingest orchestration
    participant Scorer as Scoring service (score::score_chunks)
    participant Extract as Entity Extractor
    participant Resolver as Canonicaliser
    participant DB as SQLite store
    RPC->>Ingest: parse payload & call async ingest_*()
    Ingest->>Ingest: chunk content -> chunking
    Ingest->>Scorer: score_chunks(chunks, cfg)
    Scorer->>Extract: extract entities per chunk
    Scorer->>Scorer: compute signals, combine -> decide kept/dropped
    Scorer->>Resolver: canonicalise(extracted) (for kept and for rationale)
    Scorer-->>Ingest: ScoreResults (kept/dropped + canonical entities)
    Ingest->>DB: persist kept chunks (upsert) and embeddings
    Ingest->>DB: persist all score rows (kept + dropped)
    Ingest->>DB: index canonical entities only for kept chunks
    DB-->>Ingest: ack persisted rows / counts
    Ingest-->>RPC: return IngestResult { chunks_written, chunks_dropped, chunk_ids }
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
# Conflicts:
#	src/openhuman/memory/tree/canonicalize/chat.rs
#	src/openhuman/memory/tree/canonicalize/document.rs
#	src/openhuman/memory/tree/canonicalize/email.rs
#	src/openhuman/memory/tree/canonicalize/mod.rs
#	src/openhuman/memory/tree/chunker.rs
#	src/openhuman/memory/tree/ingest.rs
#	src/openhuman/memory/tree/mod.rs
#	src/openhuman/memory/tree/rpc.rs
#	src/openhuman/memory/tree/schemas.rs
#	src/openhuman/memory/tree/store.rs
#	src/openhuman/memory/tree/types.rs
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/openhuman/memory/tree/store.rs (2)
111-115: ⚠️ Potential issue | 🟠 Major
`INSERT OR REPLACE` will wipe stored embeddings on re-ingest.
Now that `mem_tree_chunks` has an `embedding` column updated out-of-band, replacing a row with an insert that omits `embedding` resets it to `NULL`. A second ingest of the same chunk silently discards any vector previously written via `set_chunk_embedding()`.
Suggested fix
```diff
-        "INSERT OR REPLACE INTO mem_tree_chunks (
+        "INSERT INTO mem_tree_chunks (
             id, source_kind, source_id, source_ref, owner,
             timestamp_ms, time_range_start_ms, time_range_end_ms,
             tags_json, content, token_count, seq_in_source, created_at_ms
-        ) VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, ?8, ?9, ?10, ?11, ?12, ?13)",
+        ) VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, ?8, ?9, ?10, ?11, ?12, ?13)
+        ON CONFLICT(id) DO UPDATE SET
+            source_kind = excluded.source_kind,
+            source_id = excluded.source_id,
+            source_ref = excluded.source_ref,
+            owner = excluded.owner,
+            timestamp_ms = excluded.timestamp_ms,
+            time_range_start_ms = excluded.time_range_start_ms,
+            time_range_end_ms = excluded.time_range_end_ms,
+            tags_json = excluded.tags_json,
+            content = excluded.content,
+            token_count = excluded.token_count,
+            seq_in_source = excluded.seq_in_source,
+            created_at_ms = excluded.created_at_ms",
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/store.rs` around lines 111 - 115, The SQL uses "INSERT OR REPLACE INTO mem_tree_chunks ..." which discards the existing embedding column on re-insert; change the statement to perform an UPSERT that preserves the stored embedding (e.g., use "INSERT ... ON CONFLICT(id) DO UPDATE SET ..." and in the DO UPDATE avoid overwriting embedding or set embedding = COALESCE(EXCLUDED.embedding, mem_tree_chunks.embedding)) so that repeated ingests update fields like content/tokens but keep vectors written via set_chunk_embedding(); update the SQL string in store.rs (the mem_tree_chunks INSERT statement) accordingly.
301-333: ⚠️ Potential issue | 🟠 Major
The lazy `ALTER TABLE` migration is racy under concurrent first access.
`add_column_if_missing()` does a read-then-alter across separate statements. If two ingests open a fresh DB at the same time, both can observe the column missing and one of the `ALTER TABLE` calls will fail with `duplicate column name`, aborting that request.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/store.rs` around lines 301 - 333, The add_column_if_missing function is racy because it does a separate read-then-ALTER; change it to perform the alter under a short serialized transaction (e.g., BEGIN IMMEDIATE / COMMIT around the pragma check+ALTER) or attempt the ALTER unconditionally and handle the specific SQLite error for duplicate column name by treating it as success. Update add_column_if_missing to wrap the operation in a transaction (or catch rusqlite::Error with sqlite_error_code() indicating duplicate column name) so concurrent calls don't fail the request.
🧹 Nitpick comments (4)
src/openhuman/memory/tree/score/signals/mod.rs (1)
22-107: Keep `signals/mod.rs` export-focused.
`ScoreSignals`, `SignalWeights`, `compute()`, and `combine()` are all operational code in the module root. Moving those into sibling files such as `types.rs`/`ops.rs` will keep this module aligned with the repo structure rule and make future signal growth easier to manage.
As per coding guidelines, "Keep domain `mod.rs` files light and export-focused; place operational code in sibling files (`ops.rs`, `store.rs`, `schedule.rs`, `types.rs`, `rpc.rs`, `schemas.rs`) and re-export the public API".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/score/signals/mod.rs` around lines 22 - 107, Refactor this module to keep mod.rs export-focused: move the data types ScoreSignals and SignalWeights into a new sibling types.rs, move compute, entity_density_score, and combine into a new sibling ops.rs (or similarly named files), and have mod.rs simply pub use the exported symbols (ScoreSignals, SignalWeights, compute, entity_density_score, combine) so the public API is unchanged; ensure any internal helper functions referenced by compute/entity_density_score are moved or re-exported as needed and update use paths accordingly.
src/openhuman/memory/tree/score/extract/mod.rs (1)
15-79: Move the extractor implementation pieces out of `extract/mod.rs`.
This `mod.rs` now owns the trait, adapter, and composite execution logic. Splitting those into sibling files and keeping `mod.rs` to exports/re-exports will make this module easier to grow once semantic extractors land.
As per coding guidelines, "Keep domain `mod.rs` files light and export-focused; place operational code in sibling files (`ops.rs`, `store.rs`, `schedule.rs`, `types.rs`, `rpc.rs`, `schemas.rs`) and re-export the public API".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/score/extract/mod.rs` around lines 15 - 79, Split the operational code out of extract/mod.rs: move the EntityExtractor trait and its async impls (the EntityExtractor trait, the RegexEntityExtractor struct and its impl block including name() and extract(), and the CompositeExtractor struct plus its impls including new(), regex_only(), and the EntityExtractor impl for CompositeExtractor) into one or more sibling files (e.g., trait_and_regex.rs and composite.rs or extractor.rs and composite.rs), then update mod.rs to only re-export the public symbols (pub use self::... for EntityExtractor, RegexEntityExtractor, CompositeExtractor, and related constructors) and keep the doc comments in mod.rs; ensure all references (method names: name, extract, new, regex_only; types: ExtractedEntities) are preserved and imports (async_trait, anyhow, log, regex::extract) are added in the new files.
src/openhuman/memory/tree/score/mod.rs (2)
122-131: Consider concurrent chunk scoring for improved throughput.
The sequential loop processes chunks one at a time. Since `score_chunk` is pure and independent per chunk, you could use concurrent execution (e.g., `futures::future::try_join_all`) for better throughput on large batches. This is optional given the current workload characteristics.
♻️ Optional: Concurrent batch scoring
```diff
+use futures::future::try_join_all;
+
 pub async fn score_chunks(chunks: &[Chunk], cfg: &ScoringConfig) -> Result<Vec<ScoreResult>> {
-    let mut out = Vec::with_capacity(chunks.len());
-    for c in chunks {
-        out.push(score_chunk(c, cfg).await?);
-    }
-    Ok(out)
+    let futures: Vec<_> = chunks.iter().map(|c| score_chunk(c, cfg)).collect();
+    try_join_all(futures).await
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/score/mod.rs` around lines 122 - 131, The current score_chunks function runs score_chunk sequentially which limits throughput; change it to run chunk scoring concurrently by mapping chunks to async tasks and awaiting them all with a concurrency-aware join (e.g., use futures::future::try_join_all on an iterator of futures or spawn into a FuturesUnordered and collect results) so that score_chunk(c, cfg) is invoked for all chunks concurrently and errors still propagate as a batch Result; update score_chunks to build the futures from score_chunk and await the joined result, preserving the Result<Vec<ScoreResult>> return and error semantics.
142-170: Clarify timestamp semantics: `timestamp_ms` parameter vs `Utc::now()` for `computed_at_ms`.
The function receives `timestamp_ms` as a parameter (used for entity indexing at line 164) but uses `Utc::now()` for `computed_at_ms` in the score row. This creates two different time semantics:
- `timestamp_ms`: source/ingest timestamp (passed in)
- `computed_at_ms`: wall-clock time when scoring ran
If this is intentional (scoring time vs source time), consider adding a brief comment clarifying the distinction. If they should match, use `timestamp_ms` for both.
📝 Add clarifying comment
```diff
 let row = store::ScoreRow {
     chunk_id: result.chunk_id.clone(),
     total: result.total,
     signals: result.signals.clone(),
     dropped: !result.kept,
     reason: result.drop_reason.clone(),
+    // computed_at_ms is wall-clock when scoring ran (vs timestamp_ms which
+    // reflects the source's original timestamp for entity indexing)
     computed_at_ms: Utc::now().timestamp_millis(),
 };
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/score/mod.rs` around lines 142 - 170, In persist_score, computed_at_ms in the constructed store::ScoreRow currently uses Utc::now() while the function accepts timestamp_ms (also passed to store::index_entities), creating ambiguous time semantics; either replace Utc::now().timestamp_millis() with the provided timestamp_ms to keep a single consistent timestamp, or if the distinction is intentional, add a short clarifying comment above the persist_score implementation explaining that timestamp_ms is the source/ingest time whereas computed_at_ms (in store::ScoreRow) is the wall-clock scoring time so reviewers/readers understand the difference.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/openhuman/memory/tree/ingest.rs`:
- Around line 159-166: The current spawn_blocking closure commits chunks via
store::upsert_chunks and then calls score::persist_score separately, which can
leave mem_tree_chunks committed while subsequent score/index writes fail; wrap
all related DB writes in a single SQLite transaction/connection so they
atomically succeed or fail together: perform store::upsert_chunks and all
score::persist_score calls inside the same blocking closure using the DB
transaction API (e.g., begin/commit/rollback or a Transaction object) or provide
a new store function that accepts a Transaction and does both chunk upsert and
score persistence, ensuring you use the same connection/transaction for
mem_tree_chunks, score table, and entity index to avoid partial commits.
In `@src/openhuman/memory/tree/score/extract/types.rs`:
- Around line 109-116: unique_entity_count() currently only normalizes by
text.to_lowercase(), so entities with different kinds but same surface (e.g.,
`@alice` vs `#alice`) are collapsed; update the uniqueness key in
unique_entity_count() to include the entity kind along with the case-folded text
(for example combine e.text.to_lowercase() and e.kind or e.kind.as_str()) so
each (kind, normalized-text) is treated distinct; ensure the BTreeSet collects
that combined tuple/string to compute the correct length and that any callers
like entity_density_score() will then receive the corrected count.
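A minimal sketch of the fix this prompt describes, assuming hypothetical EntityKind variants and an Entity shape; the point is keying uniqueness on (kind, case-folded text) rather than text alone:

```rust
use std::collections::BTreeSet;

// Hypothetical entity kinds; the real enum lives in the extract module.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
enum EntityKind { Email, Url, Handle, Hashtag }

struct Entity {
    kind: EntityKind,
    text: String,
}

/// Count unique entities by (kind, case-folded text) so `@alice` and
/// `#alice` stay distinct while `@Alice` / `@alice` collapse to one.
fn unique_entity_count(entities: &[Entity]) -> usize {
    entities
        .iter()
        .map(|e| (e.kind, e.text.to_lowercase()))
        .collect::<BTreeSet<_>>()
        .len()
}

fn main() {
    let es = vec![
        Entity { kind: EntityKind::Handle, text: "alice".into() },
        Entity { kind: EntityKind::Handle, text: "Alice".into() },  // case duplicate
        Entity { kind: EntityKind::Hashtag, text: "alice".into() }, // distinct kind
    ];
    assert_eq!(unique_entity_count(&es), 2);
    println!("ok");
}
```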
In `@src/openhuman/memory/tree/score/resolver.rs`:
- Around line 54-60: The canonical_id_for function is lowercasing the entire
surface which breaks exact-match URLs; update canonical_id_for to preserve the
original case when kind == EntityKind::Url by trimming whitespace and leading
'@'/'#' but skipping the to_lowercase() step for URL kinds, while keeping the
existing lowercasing behavior for other kinds; locate the function
canonical_id_for and the EntityKind::Url branch to implement this conditional
handling so URL path/query case is preserved unless you later add parser-aware
normalization.
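The conditional handling the prompt asks for could look like this; EntityKind and the exact trimming rules are assumed for the sketch, with only the "skip lowercasing for URLs" behavior taken from the finding:

```rust
// Hypothetical entity kinds mirroring the resolver's enum.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum EntityKind { Email, Url, Handle, Hashtag }

/// Canonicalise an entity surface: trim whitespace, strip a leading
/// '@'/'#', and lowercase -- except URLs, whose path/query segments are
/// case-sensitive, so only trimming applies there.
fn canonical_id_for(kind: EntityKind, surface: &str) -> String {
    let s = surface.trim().trim_start_matches(&['@', '#'][..]);
    match kind {
        EntityKind::Url => s.to_string(), // preserve case
        _ => s.to_lowercase(),
    }
}

fn main() {
    assert_eq!(canonical_id_for(EntityKind::Handle, "@Alice"), "alice");
    assert_eq!(canonical_id_for(EntityKind::Email, "Bob@Example.COM"), "bob@example.com");
    assert_eq!(
        canonical_id_for(EntityKind::Url, "https://example.com/Case/Sensitive?Q=1"),
        "https://example.com/Case/Sensitive?Q=1"
    );
    println!("ok");
}
```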
In `@src/openhuman/memory/tree/score/signals/source_weight.rs`:
- Around line 22-29: infer_data_source currently scans Metadata.tags (used for
user labels) and treats any matching string (via DataSource::parse) as transport
provider info, causing user content tags like "notion" to override source_kind;
fix by restricting detection to a namespaced/tag-prefix or dedicated metadata
field: update infer_data_source to only consider tags that start with a provider
prefix such as "provider:" (e.g., check tag.starts_with("provider:") then strip
the prefix and call DataSource::parse) or read from a new
Metadata.transport_tags / Metadata.provider field instead of Metadata.tags;
update references to DataSource::parse and add a PROVIDER_PREFIX constant to
make the behavior explicit.
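A sketch of the namespaced-tag approach, with a hypothetical three-variant DataSource subset and parse() standing in for the real enum:

```rust
/// Only tags namespaced as "provider:<name>" are treated as transport
/// info; free-form user tags like "notion" no longer override source_kind.
const PROVIDER_PREFIX: &str = "provider:";

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum DataSource { Discord, Telegram, Notion }

impl DataSource {
    // Stand-in for the real DataSource::parse.
    fn parse(s: &str) -> Option<Self> {
        match s {
            "discord" => Some(Self::Discord),
            "telegram" => Some(Self::Telegram),
            "notion" => Some(Self::Notion),
            _ => None,
        }
    }
}

fn infer_data_source(tags: &[String]) -> Option<DataSource> {
    tags.iter()
        .filter_map(|t| t.strip_prefix(PROVIDER_PREFIX))
        .find_map(DataSource::parse)
}

fn main() {
    let tags = vec!["notion".to_string(), "provider:discord".to_string()];
    // The bare "notion" content tag is ignored; only the namespaced tag counts.
    assert_eq!(infer_data_source(&tags), Some(DataSource::Discord));
    assert_eq!(infer_data_source(&["notion".to_string()]), None);
    println!("ok");
}
```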
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 3961de54-ae23-4a02-8844-ecbe7cdb4796
⛔ Files ignored due to path filters (1)
`Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (18)
- src/openhuman/memory/tree/ingest.rs
- src/openhuman/memory/tree/mod.rs
- src/openhuman/memory/tree/rpc.rs
- src/openhuman/memory/tree/schemas.rs
- src/openhuman/memory/tree/score/extract/mod.rs
- src/openhuman/memory/tree/score/extract/regex.rs
- src/openhuman/memory/tree/score/extract/types.rs
- src/openhuman/memory/tree/score/mod.rs
- src/openhuman/memory/tree/score/resolver.rs
- src/openhuman/memory/tree/score/signals/interaction.rs
- src/openhuman/memory/tree/score/signals/metadata_weight.rs
- src/openhuman/memory/tree/score/signals/mod.rs
- src/openhuman/memory/tree/score/signals/source_weight.rs
- src/openhuman/memory/tree/score/signals/token_count.rs
- src/openhuman/memory/tree/score/signals/unique_words.rs
- src/openhuman/memory/tree/score/store.rs
- src/openhuman/memory/tree/store.rs
- tests/json_rpc_e2e.rs
Actionable comments posted: 5
🧹 Nitpick comments (3)
src/openhuman/memory/tree/store.rs (1)
378-424: Split the embedding persistence into a sibling module.
This file is now ~600 lines and mixes chunk CRUD, schema bootstrap, migration helpers, and embedding storage. Moving the embedding-specific code out would keep the boundaries cleaner and bring the file back under the repo’s size target.
As per coding guidelines, "Keep source files ≤ ~500 lines; split modules when growing to maintain single responsibility."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/store.rs` around lines 378 - 424, Extract the embedding persistence functions set_chunk_embedding and get_chunk_embedding (and any closely related helpers like their use of with_connection and Config) into a new sibling module (e.g., src/openhuman/memory/tree/embedding.rs) and re-export or import them from the original store.rs; move only the embedding-specific logic and tests, keep shared types (Config, with_connection) referenced via pub use or module import, update store.rs to delegate to the new module so store.rs no longer contains embedding packing/unpacking code, and ensure all imports/paths and crate visibility are adjusted to preserve existing public API and unit tests.
src/openhuman/memory/tree/score/mod.rs (2)
156-157: Bound chunk-scoring concurrency before semantic extractors are enabled.
`try_join_all` drives every `score_chunk` future at once. That is fine for the current regex-only extractor, but it becomes an unbounded fan-out on large ingests once model-backed extractors are added. A buffered stream limit will keep memory/CPU usage predictable.
♻️ One way to cap in-flight scoring
```diff
-use futures_util::future::try_join_all;
+use futures_util::{stream, StreamExt, TryStreamExt};
 ...
 pub async fn score_chunks(chunks: &[Chunk], cfg: &ScoringConfig) -> Result<Vec<ScoreResult>> {
-    try_join_all(chunks.iter().map(|chunk| score_chunk(chunk, cfg))).await
+    const MAX_IN_FLIGHT: usize = 8;
+    stream::iter(chunks.iter().map(|chunk| score_chunk(chunk, cfg)))
+        .buffered(MAX_IN_FLIGHT)
+        .try_collect()
+        .await
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/score/mod.rs` around lines 156 - 157, The current score_chunks function spawns all score_chunk futures at once via try_join_all which can cause unbounded concurrency; change it to drive the chunk iterator through a buffered stream (e.g., futures::stream::iter(chunks).map(|c| score_chunk(c, cfg)).buffer_unordered(n)) and collect the results to Vec<ScoreResult>, using a concurrency cap pulled from ScoringConfig (add a field like max_concurrency or reuse an existing concurrency setting on ScoringConfig). Update score_chunks to use futures::stream::{iter, StreamExt} and buffer_unordered with cfg.max_concurrency and handle flattening/collecting Result items into the same Result<Vec<ScoreResult>> return.
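A dependency-free way to see the bounded-concurrency idea: a synchronous analogue using std::thread::scope that never runs more than max_in_flight scorers at once. This only illustrates the cap; it is not the async futures-based implementation the review suggests.

```rust
use std::thread;

/// Synchronous analogue of `.buffered(MAX_IN_FLIGHT)`: process the batch
/// in windows of at most `max_in_flight` items, each window in parallel
/// threads, so no more than `max_in_flight` scorers run at once.
fn score_all<T, R, F>(items: &[T], max_in_flight: usize, score: F) -> Vec<R>
where
    T: Sync,
    R: Send,
    F: Fn(&T) -> R + Sync,
{
    let score = &score;
    let mut out = Vec::with_capacity(items.len());
    for window in items.chunks(max_in_flight) {
        thread::scope(|s| {
            let handles: Vec<_> = window
                .iter()
                .map(|it| s.spawn(move || score(it)))
                .collect();
            out.extend(handles.into_iter().map(|h| h.join().unwrap()));
        });
    }
    out
}

fn main() {
    let chunks: Vec<u32> = (0..10).collect();
    let scores = score_all(&chunks, 3, |c| *c as f64 / 10.0);
    assert_eq!(scores.len(), 10);
    assert_eq!(scores[9], 0.9);
    println!("ok");
}
```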
69-73: Log the actual scored inputs and final decision on the keep path.
For chat chunks, the gate uses `scoring_token_count` after header stripping, but the entry log reports `chunk.token_count`, and kept chunks never log their computed total/threshold. Add one final debug line with the derived token count, total, threshold, entity count, and `kept` so score tuning is traceable end-to-end.
As per coding guidelines, "`{src/**/*.rs,app/src/**/*.{ts,tsx}}`: Add substantial, development-oriented logs while implementing features or fixes so issues are easy to trace end-to-end; log critical checkpoints including entry/exit points, branch decisions, external calls, retries/timeouts, state transitions, and error handling paths."
Also applies to: 117-123
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/score/mod.rs` around lines 69 - 73, The entry log currently prints chunk.token_count but the gate uses scoring_token_count and there is no final log when chunks are kept; update the scoring path in the function handling chunk scoring (e.g., score_chunk) to (1) use the derived scoring_token_count for the initial/logging context where appropriate, and (2) add one final debug log on the kept branch that includes scoring_token_count (the header-stripped token count for chat chunks), total, threshold, entity_count, and kept=true/false so the computed total vs threshold decision is recorded; replicate the same change for the second scoring block referenced around the 117-123 region. Ensure the log message clearly identifies these fields and the decision to aid end-to-end score tuning.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/openhuman/memory/tree/ingest.rs`:
- Around line 139-145: The loop using chunks.iter().zip(scores.into_iter()) can
silently truncate when scores.len() != chunks.len(); before iterating, validate
that scores.len() == chunks.len() and return or propagate an error (or panic) if
they differ so we fail fast; locate the scope where chunks, scores, kept_chunks,
and all_results are used (the block that currently iterates and pushes
kept_chunks and all_results) and insert the explicit length check there,
emitting a clear error message that includes the expected chunks.len() and
actual scores.len() to aid debugging.
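The fail-fast length check could be factored like this; pair_scores and the String error type are illustrative, not the actual ingest.rs signature:

```rust
/// Guard against silent truncation: `zip` stops at the shorter iterator,
/// so a scorer bug that returns fewer results than chunks would quietly
/// drop the tail. Fail fast with a descriptive error instead.
fn pair_scores<'a>(
    chunks: &'a [String],
    scores: Vec<f64>,
) -> Result<Vec<(&'a String, f64)>, String> {
    if scores.len() != chunks.len() {
        return Err(format!(
            "score count mismatch: expected {} (chunks), got {} (scores)",
            chunks.len(),
            scores.len()
        ));
    }
    Ok(chunks.iter().zip(scores).collect())
}

fn main() {
    let chunks = vec!["a".to_string(), "b".to_string()];
    assert!(pair_scores(&chunks, vec![0.5, 0.7]).is_ok());
    let err = pair_scores(&chunks, vec![0.5]).unwrap_err();
    assert!(err.contains("expected 2"));
    println!("ok");
}
```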
In `@src/openhuman/memory/tree/score/mod.rs`:
- Around line 8-226: The mod.rs currently contains operational scoring logic
that should be moved into sibling files: extract the data types/constants
(ScoreResult, ScoringConfig, DEFAULT_DROP_THRESHOLD and related serde/impls) and
helper types into a new types.rs, and move runtime functions and persistence
logic (score_chunk, score_chunks, scoring_content_for_chunk, score_row,
persist_score, persist_score_tx and any helper imports like approx_token_count,
canonicalise, compute/combine from signals) into a new ops.rs; keep mod.rs
minimal by declaring pub mod extract; pub mod resolver; pub mod signals; pub mod
store; pub mod types; pub mod ops; and re-export the public API from mod.rs
(e.g., pub use types::{ScoreResult, ScoringConfig, DEFAULT_DROP_THRESHOLD}; pub
use ops::{score_chunk, score_chunks, persist_score, persist_score_tx}); ensure
necessary use/import lines are moved into the respective files (e.g.,
chrono::Utc and rusqlite::Transaction into ops.rs for score_row/persist_*), and
run cargo build to fix any missing imports.
- Around line 178-186: When re-scoring a chunk (check result.kept) before
calling store::index_entities or store::index_entities_tx, delete existing
mem_tree_entity_index rows for that chunk/node_id to remove stale entries;
specifically, add a deletion by node_id (chunk_id) in the non-transactional path
(around the result.kept && !result.canonical_entities block) and the
transactional path (around the index_entities_tx call) so that INSERT OR REPLACE
only inserts the current set, ensuring removed entities are cleared; reference
the functions index_entities and index_entities_tx and the
mem_tree_entity_index/node_id identifiers when adding the DELETE step (or
alternatively validate entries against current chunk state on read).
In `@src/openhuman/memory/tree/score/signals/types.rs`:
- Around line 26-35: The default SignalWeights produce a neutral baseline that
exceeds the Phase 2 gate; update the SignalWeights::default to rebalance the
per-signal weights so neutral (0.5) values don't already pass the default
threshold—specifically reduce metadata_weight, source_weight, and interaction
(e.g., set metadata_weight and source_weight to 1.0 and interaction to 2.0) in
the SignalWeights::default implementation so near-empty chat chunks are rejected
unless token_count/unique_words/entity_density contribute; adjust only the
values in the SignalWeights::default block.
In `@src/openhuman/memory/tree/score/store.rs`:
- Around line 243-254: The current code casts limit with `limit.unwrap_or(100)
as i64` which can wrap for large `usize` values; clamp the optional `limit` to
the maximum representable positive i64 before casting so callers cannot bypass
the LIMIT (same approach as the chunk listing code). Concretely, take the
`limit` optional, default to 100, clamp it to `i64::MAX` (or `i64::MAX as
usize`) and then cast to i64, and use that clamped `limit` when preparing the
query against `mem_tree_entity_index` in the closure passed to
`with_connection`. Ensure you update the binding used in `params![entity_id,
limit]` so the SQL always receives a non-negative, non-wrapped i64.
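The clamp-before-cast pattern in isolation (sql_limit is a hypothetical helper name; the 100 default comes from the finding above):

```rust
/// SQLite binds LIMIT as i64. A plain `as i64` cast of a huge usize can
/// wrap negative, and SQLite treats a negative LIMIT as "no limit",
/// letting a caller bypass the cap. Clamp before casting instead.
fn sql_limit(limit: Option<usize>) -> i64 {
    limit.unwrap_or(100).min(i64::MAX as usize) as i64
}

fn main() {
    assert_eq!(sql_limit(None), 100);
    assert_eq!(sql_limit(Some(25)), 25);
    // usize::MAX would wrap to -1 with a bare cast; clamped it stays positive.
    assert!(sql_limit(Some(usize::MAX)) > 0);
    println!("ok");
}
```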
---
Nitpick comments:
In `@src/openhuman/memory/tree/score/mod.rs`:
- Around line 156-157: The current score_chunks function spawns all score_chunk
futures at once via try_join_all which can cause unbounded concurrency; change
it to drive the chunk iterator through a buffered stream (e.g.,
futures::stream::iter(chunks).map(|c| score_chunk(c, cfg)).buffer_unordered(n))
and collect the results to Vec<ScoreResult>, using a concurrency cap pulled from
ScoringConfig (add a field like max_concurrency or reuse an existing concurrency
setting on ScoringConfig). Update score_chunks to use futures::stream::{iter,
StreamExt} and buffer_unordered with cfg.max_concurrency and handle
flattening/collecting Result items into the same Result<Vec<ScoreResult>>
return.
- Around line 69-73: The entry log currently prints chunk.token_count but the
gate uses scoring_token_count and there is no final log when chunks are kept;
update the scoring path in the function handling chunk scoring (e.g.,
score_chunk) to (1) use the derived scoring_token_count for the initial/logging
context where appropriate, and (2) add one final debug log on the kept branch
that includes scoring_token_count (the header-stripped token count for chat
chunks), total, threshold, entity_count, and kept=true/false so the computed
total vs threshold decision is recorded; replicate the same change for the
second scoring block referenced around the 117-123 region. Ensure the log
message clearly identifies these fields and the decision to aid end-to-end score
tuning.
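The bounded-fan-out idea behind the `buffer_unordered` suggestion for `score_chunks` — never more than `cap` items in flight at once — can be sketched without the `futures` crate using std threads. This is a swapped-in illustration of the same concurrency-cap principle, not the async stream code the comment asks for:

```rust
use std::sync::mpsc;
use std::thread;

// Stand-in for score_chunk; the real function is async and fallible.
fn score(x: u32) -> u32 {
    x * 2
}

// Process items with at most `cap` running concurrently, preserving order.
fn score_all_bounded(items: Vec<u32>, cap: usize) -> Vec<u32> {
    let mut out = Vec::with_capacity(items.len());
    for batch in items.chunks(cap) {
        let (tx, rx) = mpsc::channel();
        for (i, &item) in batch.iter().enumerate() {
            let tx = tx.clone();
            thread::spawn(move || tx.send((i, score(item))).unwrap());
        }
        drop(tx); // rx.iter() ends once all worker senders are dropped
        let mut results: Vec<(usize, u32)> = rx.iter().collect();
        results.sort_by_key(|&(i, _)| i);
        out.extend(results.into_iter().map(|(_, v)| v));
    }
    out
}
```

In the async version, `futures::stream::iter(...).map(score_chunk).buffer_unordered(cfg.max_concurrency)` plays the role of the batching loop.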
In `@src/openhuman/memory/tree/store.rs`:
- Around line 378-424: Extract the embedding persistence functions
set_chunk_embedding and get_chunk_embedding (and any closely related helpers
like their use of with_connection and Config) into a new sibling module (e.g.,
src/openhuman/memory/tree/embedding.rs) and re-export or import them from the
original store.rs; move only the embedding-specific logic and tests, keep shared
types (Config, with_connection) referenced via pub use or module import, update
store.rs to delegate to the new module so store.rs no longer contains embedding
packing/unpacking code, and ensure all imports/paths and crate visibility are
adjusted to preserve existing public API and unit tests.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 445d3e66-424a-4a71-9db8-fa3acf3e81e8
📒 Files selected for processing (12)
- src/openhuman/memory/tree/ingest.rs
- src/openhuman/memory/tree/score/extract/extractor.rs
- src/openhuman/memory/tree/score/extract/mod.rs
- src/openhuman/memory/tree/score/extract/types.rs
- src/openhuman/memory/tree/score/mod.rs
- src/openhuman/memory/tree/score/resolver.rs
- src/openhuman/memory/tree/score/signals/mod.rs
- src/openhuman/memory/tree/score/signals/ops.rs
- src/openhuman/memory/tree/score/signals/source_weight.rs
- src/openhuman/memory/tree/score/signals/types.rs
- src/openhuman/memory/tree/score/store.rs
- src/openhuman/memory/tree/store.rs
✅ Files skipped from review due to trivial changes (2)
- src/openhuman/memory/tree/score/extract/mod.rs
- src/openhuman/memory/tree/score/signals/mod.rs
🚧 Files skipped from review as they are similar to previous changes (1)
- src/openhuman/memory/tree/score/extract/types.rs
```rust
pub mod extract;
pub mod resolver;
pub mod signals;
pub mod store;

use std::sync::Arc;

use anyhow::Result;
use chrono::Utc;
use futures_util::future::try_join_all;
use rusqlite::Transaction;
use serde::{Deserialize, Serialize};

use self::extract::{EntityExtractor, ExtractedEntities};
use self::resolver::{canonicalise, CanonicalEntity};
use self::signals::{ScoreSignals, SignalWeights};
use crate::openhuman::memory::tree::types::{approx_token_count, Chunk, SourceKind};

/// Default drop threshold. Chunks with `total < DEFAULT_DROP_THRESHOLD`
/// are tombstoned and never reach the chunk store.
pub const DEFAULT_DROP_THRESHOLD: f32 = 0.3;

/// Whole outcome of [`score_chunk`].
#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct ScoreResult {
    pub chunk_id: String,
    pub total: f32,
    pub signals: ScoreSignals,
    pub kept: bool,
    pub drop_reason: Option<String>,
    pub extracted: ExtractedEntities,
    pub canonical_entities: Vec<CanonicalEntity>,
}

/// Configuration passed through the ingest pipeline for Phase 2 behaviour.
///
/// Held as a struct (vs config struct fields) so callers can override per-run
/// without mutating global config — useful for tests and explicit threshold
/// tuning.
pub struct ScoringConfig {
    pub extractor: Arc<dyn EntityExtractor>,
    pub weights: SignalWeights,
    pub drop_threshold: f32,
}

impl ScoringConfig {
    /// Phase 2 default: regex-only extractor, default weights, default threshold.
    pub fn default_regex_only() -> Self {
        Self {
            extractor: Arc::new(extract::CompositeExtractor::regex_only()),
            weights: SignalWeights::default(),
            drop_threshold: DEFAULT_DROP_THRESHOLD,
        }
    }
}

/// Compute the score for one chunk.
///
/// Pure function — does not touch the store. Callers decide what to persist
/// based on [`ScoreResult::kept`].
pub async fn score_chunk(chunk: &Chunk, cfg: &ScoringConfig) -> Result<ScoreResult> {
    log::debug!(
        "[memory_tree::score] score_chunk chunk_id={} tokens={}",
        chunk.id,
        chunk.token_count
    );

    let scoring_content = scoring_content_for_chunk(chunk);
    let scoring_token_count = approx_token_count(&scoring_content);

    // 1. Extract entities (regex + any configured semantic extractors)
    let extracted = cfg.extractor.extract(&scoring_content).await?;

    // 2. Compute signals
    let signals = self::signals::compute(
        &chunk.metadata,
        &scoring_content,
        scoring_token_count,
        &extracted,
    );

    // 3. Weighted combine
    let total = self::signals::combine(&signals, &cfg.weights);

    // 4. Admission gate. Source and interaction priors are deliberately
    // non-zero, so guard against very short entity-free chatter being kept by
    // metadata alone.
    let tiny_entity_free =
        scoring_token_count < self::signals::token_count::TOKEN_MIN && extracted.is_empty();
    let kept = !tiny_entity_free && total >= cfg.drop_threshold;
    let drop_reason = if kept {
        None
    } else if tiny_entity_free {
        Some(format!(
            "token_count {} < minimum {} and no entities extracted",
            scoring_token_count,
            self::signals::token_count::TOKEN_MIN
        ))
    } else {
        Some(format!(
            "total {total:.3} < threshold {:.3}",
            cfg.drop_threshold
        ))
    };

    // 5. Canonicalise for indexing (only meaningful when kept — but we
    // canonicalise unconditionally so the result is inspectable in tests)
    let canonical_entities = canonicalise(&extracted);

    if !kept {
        log::debug!(
            "[memory_tree::score] drop chunk_id={} total={:.3} reason={:?}",
            chunk.id,
            total,
            drop_reason
        );
    }

    Ok(ScoreResult {
        chunk_id: chunk.id.clone(),
        total,
        signals,
        kept,
        drop_reason,
        extracted,
        canonical_entities,
    })
}

fn scoring_content_for_chunk(chunk: &Chunk) -> String {
    if chunk.metadata.source_kind != SourceKind::Chat {
        return chunk.content.clone();
    }

    chunk
        .content
        .lines()
        .filter(|line| {
            let trimmed = line.trim_start();
            !trimmed.starts_with("# Chat transcript") && !trimmed.starts_with("## ")
        })
        .collect::<Vec<_>>()
        .join("\n")
}

/// Score a batch of chunks. Errors from any single chunk fail the batch —
/// scoring is pure-ish (only the extractor may error) and a failure here is
/// a real bug, not a per-chunk issue to tolerate silently.
pub async fn score_chunks(chunks: &[Chunk], cfg: &ScoringConfig) -> Result<Vec<ScoreResult>> {
    try_join_all(chunks.iter().map(|chunk| score_chunk(chunk, cfg))).await
}

// ── Persistence helpers used by the ingest orchestrator ─────────────────

/// Persist the score row + entity-index rows for one kept chunk.
///
/// The caller is responsible for having already written the chunk itself
/// into `mem_tree_chunks` (so the FK-like relation is satisfied). Dropped
/// chunks still get a score row persisted for diagnostics — callers should
/// pass `None` for `tree_id` in that case, since the chunk won't appear in
/// a tree.
pub fn persist_score(
    config: &crate::openhuman::config::Config,
    result: &ScoreResult,
    timestamp_ms: i64,
    tree_id: Option<&str>,
) -> Result<()> {
    let row = score_row(result);
    store::upsert_score(config, &row)?;

    if result.kept && !result.canonical_entities.is_empty() {
        store::index_entities(
            config,
            &result.canonical_entities,
            &result.chunk_id,
            "leaf",
            timestamp_ms,
            tree_id,
        )?;
    }

    Ok(())
}

pub(crate) fn persist_score_tx(
    tx: &Transaction<'_>,
    result: &ScoreResult,
    timestamp_ms: i64,
    tree_id: Option<&str>,
) -> Result<()> {
    let row = score_row(result);
    store::upsert_score_tx(tx, &row)?;

    if result.kept && !result.canonical_entities.is_empty() {
        store::index_entities_tx(
            tx,
            &result.canonical_entities,
            &result.chunk_id,
            "leaf",
            timestamp_ms,
            tree_id,
        )?;
    }

    Ok(())
}

fn score_row(result: &ScoreResult) -> store::ScoreRow {
    // Score rows keep wall-clock scoring time; the separate timestamp_ms
    // argument used for entity indexes is the source/ingest ordering time.
    store::ScoreRow {
        chunk_id: result.chunk_id.clone(),
        total: result.total,
        signals: result.signals.clone(),
        dropped: !result.kept,
        reason: result.drop_reason.clone(),
        computed_at_ms: Utc::now().timestamp_millis(),
    }
}
```
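The admission gate in `score_chunk` reduces to two conditions: reject when the chunk is both tiny and entity-free, otherwise keep iff the weighted total clears the threshold. A minimal standalone restatement — the `TOKEN_MIN` value here is an assumed placeholder, since the real constant lives in `signals::token_count` and is not shown in this diff:

```rust
// Assumed placeholder; the real value is signals::token_count::TOKEN_MIN.
const TOKEN_MIN: usize = 8;

// Mirrors the step-4 gate in score_chunk: metadata/source priors alone must
// not admit very short entity-free chatter.
fn admit(token_count: usize, has_entities: bool, total: f32, threshold: f32) -> bool {
    let tiny_entity_free = token_count < TOKEN_MIN && !has_entities;
    !tiny_entity_free && total >= threshold
}
```

For example, a 2-token chunk with a high total is still rejected unless it carries at least one entity.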
🛠️ Refactor suggestion | 🟠 Major
Move the runtime scoring pipeline out of mod.rs.
This module now owns scoring, batching, and persistence behavior, which conflicts with the repo rule for src/openhuman/**/mod.rs. Please split the operational code into sibling files such as types.rs / ops.rs and keep mod.rs to declarations and re-exports.
As per coding guidelines, "src/openhuman/**/mod.rs: Keep domain mod.rs files light and export-focused; place operational code in sibling files (ops.rs, store.rs, schedule.rs, types.rs, rpc.rs, schemas.rs) and re-export the public API."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/openhuman/memory/tree/score/mod.rs` around lines 8 - 226, The mod.rs
currently contains operational scoring logic that should be moved into sibling
files: extract the data types/constants (ScoreResult, ScoringConfig,
DEFAULT_DROP_THRESHOLD and related serde/impls) and helper types into a new
types.rs, and move runtime functions and persistence logic (score_chunk,
score_chunks, scoring_content_for_chunk, score_row, persist_score,
persist_score_tx and any helper imports like approx_token_count, canonicalise,
compute/combine from signals) into a new ops.rs; keep mod.rs minimal by
declaring pub mod extract; pub mod resolver; pub mod signals; pub mod store; pub
mod types; pub mod ops; and re-export the public API from mod.rs (e.g., pub use
types::{ScoreResult, ScoringConfig, DEFAULT_DROP_THRESHOLD}; pub use
ops::{score_chunk, score_chunks, persist_score, persist_score_tx}); ensure
necessary use/import lines are moved into the respective files (e.g.,
chrono::Utc and rusqlite::Transaction into ops.rs for score_row/persist_*), and
run cargo build to fix any missing imports.
```rust
impl Default for SignalWeights {
    fn default() -> Self {
        Self {
            token_count: 1.0,
            unique_words: 1.0,
            metadata_weight: 1.5,
            source_weight: 1.5,
            interaction: 3.0, // strongest signal — direct user engagement
            entity_density: 1.0,
        }
    }
}
```
These defaults put the neutral baseline above the default gate.
With the current weights, a chat chunk that only gets the neutral 0.5 from metadata_weight, source_weight, and interaction already lands at 0.333…. At the Phase 2 default threshold of 0.3, near-empty chat chunks will be admitted before token/entity signals add anything. Rebalance the defaults or raise the default threshold so the gate can actually reject low-value content.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/openhuman/memory/tree/score/signals/types.rs` around lines 26 - 35, The
default SignalWeights produce a neutral baseline that exceeds the Phase 2 gate;
update the SignalWeights::default to rebalance the per-signal weights so neutral
(0.5) values don't already pass the default threshold—specifically reduce
metadata_weight, source_weight, and interaction (e.g., set metadata_weight and
source_weight to 1.0 and interaction to 2.0) in the SignalWeights::default
implementation so near-empty chat chunks are rejected unless
token_count/unique_words/entity_density contribute; adjust only the values in
the SignalWeights::default block.
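The arithmetic behind this finding checks out, assuming `signals::combine` is a weighted mean (Σ w·s / Σ w) — the combine formula itself is not shown in this diff, so treat this as an illustration of the review's numbers rather than the library's code:

```rust
// Weighted mean: sum(weight * signal) / sum(weight).
fn weighted_mean(pairs: &[(f32, f32)]) -> f32 {
    let num: f32 = pairs.iter().map(|(w, s)| w * s).sum();
    let den: f32 = pairs.iter().map(|(w, _)| w).sum();
    num / den
}

// (weight, signal) pairs under the current defaults; only the three prior
// signals (metadata, source, interaction) sit at neutral 0.5.
const CURRENT: [(f32, f32); 6] = [
    (1.0, 0.0), // token_count
    (1.0, 0.0), // unique_words
    (1.5, 0.5), // metadata_weight
    (1.5, 0.5), // source_weight
    (3.0, 0.5), // interaction
    (1.0, 0.0), // entity_density
];

// Rebalanced as the review suggests: metadata/source -> 1.0, interaction -> 2.0.
const REBALANCED: [(f32, f32); 6] = [
    (1.0, 0.0),
    (1.0, 0.0),
    (1.0, 0.5),
    (1.0, 0.5),
    (2.0, 0.5),
    (1.0, 0.0),
];
```

`weighted_mean(&CURRENT)` is 3.0 / 9.0 ≈ 0.333, which clears the 0.3 gate with every content signal at zero; `weighted_mean(&REBALANCED)` is 2.0 / 7.0 ≈ 0.286, which does not.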
- ingest: fail fast if scorer returns fewer/more results than chunks
(silent zip truncation would drop chunks or their score rationale)
- score::persist_score{,_tx}: clear stale entity-index rows before
re-indexing a re-scored chunk, since INSERT OR REPLACE never deletes
rows whose entity_id is no longer in the new extraction
- score::store::lookup_entity: clamp limit to i64::MAX before casting
to prevent a large usize wrapping into a negative LIMIT
Adds clear_entity_index_drops_stale_rows regression test.
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/openhuman/memory/tree/score/mod.rs (1)
178-184: ⚠️ Potential issue | 🟠 Major
Clear entity lookups when a chunk flips from kept to dropped.
`clear_entity_index_for_node(_tx)` is still gated by `result.kept`. If a chunk was indexed on an earlier ingest and now scores below threshold, the score row flips to dropped but the old `mem_tree_entity_index` rows survive, so `lookup_entity` can still return a chunk that admission just rejected. Move the clear outside the `if result.kept` guard, then only reinsert when `result.kept` is true.

Suggested change:

```diff
-    if result.kept {
-        // Clear any stale entity-index rows for this chunk before re-indexing.
-        // INSERT OR REPLACE on (entity_id, node_id) never deletes rows whose
-        // entity_id is no longer present in the new extraction — so a re-score
-        // that drops an entity would otherwise leave a phantom index row.
-        store::clear_entity_index_for_node(config, &result.chunk_id)?;
-        if !result.canonical_entities.is_empty() {
+    store::clear_entity_index_for_node(config, &result.chunk_id)?;
+    if result.kept && !result.canonical_entities.is_empty() {
         store::index_entities(
             config,
             &result.canonical_entities,
             &result.chunk_id,
             "leaf",
             timestamp_ms,
             tree_id,
         )?;
-        }
-    }
```

Also applies to: 208-210
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/memory/tree/score/mod.rs` around lines 178 - 184, The entity-index clear is incorrectly conditional on result.kept, so when a chunk flips from kept to dropped stale mem_tree_entity_index rows remain; move the call to store::clear_entity_index_for_node(config, &result.chunk_id) so it runs unconditionally whenever processing the score result, and keep the insertion/reinsertion of entities (using result.canonical_entities) only inside the if result.kept branch; apply the same change for the second occurrence that mirrors this logic (the later block around the other result handling).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/openhuman/memory/tree/ingest.rs`:
- Around line 128-130: The current early-return when chunks.is_empty()
(returning IngestResult::empty(source_id)) causes old mem_tree_chunks rows to
remain because we only upsert kept_chunks; change the ingest flow to perform a
transactional replace of the source’s chunk set: begin a DB transaction, delete
existing mem_tree_chunks for the given source_id, then insert the kept_chunks
(which may be zero rows) and commit; do the same replacement logic for the other
upsert branch referenced around the 169-179 section to avoid append-only
behavior. Also add a regression test that simulates a source that first produces
kept chunks and then re-ingests with (a) lowered admission causing kept→dropped
and (b) an empty chunk set, asserting that mem_tree_chunks contains exactly the
expected rows after each re-ingest. Ensure you reference the same identifiers
used in ingest.rs (kept_chunks, source_id, IngestResult::empty, mem_tree_chunks)
when updating the code and tests.
---
Duplicate comments:
In `@src/openhuman/memory/tree/score/mod.rs`:
- Around line 178-184: The entity-index clear is incorrectly conditional on
result.kept, so when a chunk flips from kept to dropped stale
mem_tree_entity_index rows remain; move the call to
store::clear_entity_index_for_node(config, &result.chunk_id) so it runs
unconditionally whenever processing the score result, and keep the
insertion/reinsertion of entities (using result.canonical_entities) only inside
the if result.kept branch; apply the same change for the second occurrence that
mirrors this logic (the later block around the other result handling).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 03a05e78-ffc7-48e5-841b-8d4f525d7694
📒 Files selected for processing (3)
- src/openhuman/memory/tree/ingest.rs
- src/openhuman/memory/tree/score/mod.rs
- src/openhuman/memory/tree/score/store.rs
🚧 Files skipped from review as they are similar to previous changes (1)
- src/openhuman/memory/tree/score/store.rs
```rust
    if chunks.is_empty() {
        return Ok(IngestResult::empty(source_id));
    }
```
Prune stale chunk rows on re-ingest.
This path only upserts kept_chunks and returns early when chunks.is_empty(). If a source previously produced persisted chunks and a later ingest now chunks to zero or falls below the admission threshold, the old mem_tree_chunks rows survive, so retrieval can still surface content that this ingest just rejected. Please replace the source’s chunk set transactionally instead of doing append-only upserts, and add a regression test for kept→dropped / kept→empty re-ingests.
Also applies to: 169-179
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/openhuman/memory/tree/ingest.rs` around lines 128 - 130, The current
early-return when chunks.is_empty() (returning IngestResult::empty(source_id))
causes old mem_tree_chunks rows to remain because we only upsert kept_chunks;
change the ingest flow to perform a transactional replace of the source’s chunk
set: begin a DB transaction, delete existing mem_tree_chunks for the given
source_id, then insert the kept_chunks (which may be zero rows) and commit; do
the same replacement logic for the other upsert branch referenced around the
169-179 section to avoid append-only behavior. Also add a regression test that
simulates a source that first produces kept chunks and then re-ingests with (a)
lowered admission causing kept→dropped and (b) an empty chunk set, asserting
that mem_tree_chunks contains exactly the expected rows after each re-ingest.
Ensure you reference the same identifiers used in ingest.rs (kept_chunks,
source_id, IngestResult::empty, mem_tree_chunks) when updating the code and
tests.
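The append-only-vs-replace behavior flagged here can be shown with a small in-memory model. This is an illustration, not the real `ingest.rs` code: `ChunkTable`, `upsert_only`, and `replace` are hypothetical names, and a real fix would run DELETE + INSERT for the source inside one SQLite transaction:

```rust
use std::collections::HashMap;

// In-memory model of mem_tree_chunks keyed by source_id, contrasting the
// current append-only upsert with the transactional replace the review asks for.
#[derive(Default)]
struct ChunkTable {
    by_source: HashMap<String, Vec<String>>,
}

impl ChunkTable {
    // Current behavior: upsert kept chunks; an empty re-ingest changes nothing.
    fn upsert_only(&mut self, source_id: &str, kept: &[&str]) {
        if kept.is_empty() {
            return; // mirrors the early return on chunks.is_empty()
        }
        let rows = self.by_source.entry(source_id.to_string()).or_default();
        for c in kept {
            if !rows.iter().any(|r| r.as_str() == *c) {
                rows.push(c.to_string());
            }
        }
    }

    // Suggested behavior: replace the source's chunk set wholesale,
    // even when the new set is empty.
    fn replace(&mut self, source_id: &str, kept: &[&str]) {
        self.by_source.insert(
            source_id.to_string(),
            kept.iter().map(|s| s.to_string()).collect(),
        );
    }

    fn rows(&self, source_id: &str) -> Vec<String> {
        self.by_source.get(source_id).cloned().unwrap_or_default()
    }
}
```

Under `upsert_only`, a source that re-ingests to zero chunks keeps its stale rows forever; `replace` clears them.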
…ort-circuit (#708) (#775)

* feat(memory): LLM-based NER + importance signal with cheap-signals short-circuit (#708)

Adds an Ollama-backed entity extractor and an LLM-derived importance score as a new signal in the admission gate. Builds on merged Phase 2 (#733) without changing its surface — additive only.

Why LLM (not GLiNER): zero new deps (reuses existing reqwest + async infra), reuses the same Ollama setup openhuman uses for embeddings, better per-entity quality scales with model choice, no native ONNX runtime to ship. Latency cost (~100-300ms per chunk on a small model like qwen2.5:0.5b) is amortised by the short-circuit. The admission gate stays a hybrid — cheap deterministic signals always run; the LLM is one signal among many, never the sole arbiter. Backwards-compatible: with default SignalWeights the LLM weight is 0.0 and behaviour matches pre-LLM Phase 2 exactly.

What's new:
- src/openhuman/memory/tree/score/extract/llm.rs: LlmEntityExtractor implementing the existing EntityExtractor trait. Posts a structured JSON request to an Ollama-compatible /api/chat endpoint asking for NER + an importance rating in one call. Span recovery via string search; hallucinated entities (surface not in source text) dropped. Soft fallback: HTTP failures log a warn and return empty extraction.
- ExtractedEntities gains llm_importance (Option<f32>) + llm_importance_reason (Option<String>); merge() takes max importance.
- ScoreSignals gains llm_importance (f32, defaults to 0.0).
- SignalWeights gains llm_importance (default 0.0; with_llm_enabled() helper sets it to 2.0 — comparable to metadata/source weights).
- signals::combine_cheap_only(): variant that excludes the LLM signal, used for the short-circuit decision in score_chunk.
- ScoringConfig gains llm_extractor (Option<Arc<dyn EntityExtractor>>), definite_keep_threshold (default 0.85), definite_drop_threshold (default 0.15). New with_llm_extractor() constructor.
- score_chunk() pipeline:
  1. Always-on regex extraction
  2. Cheap signals + combine_cheap_only -> cheap_total
  3. If cheap_total >= definite_keep OR <= definite_drop: skip LLM
  4. Else (borderline band): run LLM extractor, merge results, recompute
  5. Final combine + admission gate against drop_threshold
- mem_tree_score table gains nullable llm_importance + llm_importance_reason columns via idempotent ALTER TABLE migration.
- ScoreRow + upsert/get persist the new column.

What's not changed:
- Existing JSON-RPC surface
- Default behaviour (LLM weight=0, no llm_extractor configured = same outputs as before)
- mem_tree_chunks / mem_tree_entity_index schemas

Tests added:
- LLM extractor: prompt construction, JSON parsing, hallucination drop, importance clamping, strict vs lenient unknown-kind handling
- Score pipeline: short-circuit on definite_keep/definite_drop skips LLM call (verified via FakeLlm call counter); borderline band consults LLM once; LLM failure falls back gracefully without erroring

LLM-NER work follows up on #708. Parent: #711.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(memory): address CI failures and CodeRabbit review on LLM-NER (#708)

CI was red on PR #775 due to 10 E0063 errors — existing test-only struct literals for ExtractedEntities / ScoreSignals / ScoreRow weren't updated when the new optional llm_importance fields landed. Local cargo check on lib-only passed; cargo check --tests caught them, as does CI.

CodeRabbit also flagged four semantic issues:

1. LlmEntityExtractor::extract() returned Err on HTTP transport, non-2xx status, and JSON parse failures. The soft-fallback contract (documented in the module header) only worked because score_chunk catches errors; any other caller got a hard error. Now split into extract_or_empty() which logs a warn on each failure mode and returns ExtractedEntities::default() unconditionally. The public extract() stays on the async-trait surface but never returns Err.
2. find_char_span() always searched from byte offset 0, so when the LLM returned the same surface twice (explicitly asked for duplicates in the prompt), both entities got (start=0, end=len) — same span, not distinct mentions. Added find_char_span_from(haystack, needle, byte_from, char_from) that resumes from a cursor. into_extracted_entities now tracks a per-surface (byte_after, char_after) cursor in a HashMap<String, (usize, u32)>, so duplicate mentions get distinct non-overlapping spans and over-claim (LLM returns 3 "Alice" when source has 2) drops the excess entries with a debug log.
3. combine() always included w.llm_importance in the denominator even when llm_importance=0.0 because the LLM didn't run. That artificially lowered the total on short-circuit paths. score_chunk now branches on llm_consulted: uses combine() when LLM ran (importance actually contributes), combine_cheap_only() when LLM was skipped or failed (denominator excludes the LLM weight). Observable in two new tests: short_circuit_reports_cheap_only_total verifies r.total == combine_cheap_only(r.signals, weights) and strictly exceeds combine(r.signals, weights) when llm_importance=0; llm_consulted_reports_full_total verifies the opposite path.
4. Same issue from a different angle (CodeRabbit flagged both). Fixed by the same llm_consulted branch.

Test-literal fixes:
- signals/ops.rs: ScoreSignals literals pick up llm_importance field; make_entities helper uses ..Default::default() on ExtractedEntities.
- extract/types.rs: three test ExtractedEntities literals gain llm_importance: None, llm_importance_reason: None.
- resolver.rs: canonicalise_batch_preserves_spans literal gains the two fields.
- score/store.rs: sample_row gains signals.llm_importance and llm_importance_reason.

New tests (7 added):
- find_char_span_from_advances_past_prior_match
- find_char_span_from_returns_none_after_exhaustion
- find_char_span_from_preserves_utf8
- find_char_span_from_rejects_non_char_boundary
- into_extracted_entities_gives_distinct_spans_to_duplicate_mentions
- into_extracted_entities_drops_extra_duplicate_when_source_only_has_one
- extract_soft_fallback_on_unreachable_endpoint (transport failure → empty extraction, not Err)
- short_circuit_reports_cheap_only_total
- llm_consulted_reports_full_total

cargo check --lib clean; cargo fmt clean. The integration-test rmeta errors visible locally are stale-metadata artifacts from incremental builds with prior struct shapes; CI's clean build resolves them and tests/*.rs files do not construct any of the touched types.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
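The cursor-resumed span search described in the fix can be sketched in a few lines. This models the idea only — the real `find_char_span_from` also threads a `char_from` cursor and lives alongside `into_extracted_entities`, so treat the signature here as an assumption:

```rust
// Find `needle` in `haystack` starting at `byte_from`, returning a char-indexed
// (start, end) span. Resuming from a cursor gives duplicate mentions distinct,
// non-overlapping spans instead of all mapping to the first occurrence.
fn find_char_span_from(haystack: &str, needle: &str, byte_from: usize) -> Option<(usize, usize)> {
    // get(..) returns None when byte_from is not a char boundary — this is
    // what makes the search UTF-8-safe.
    let rel = haystack.get(byte_from..)?.find(needle)?;
    let start_byte = byte_from + rel;
    let start_char = haystack[..start_byte].chars().count();
    Some((start_char, start_char + needle.chars().count()))
}
```

A caller tracking a per-surface byte cursor would pass the end byte of the previous match as the next `byte_from`, so the second "Alice" in "Alice met Alice" gets its own span.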
* feat(memory): LLM-based NER + importance signal with cheap-signals short-circuit (#708) Adds an Ollama-backed entity extractor and an LLM-derived importance score as a new signal in the admission gate. Builds on merged Phase 2 (#733) without changing its surface — additive only. Why LLM (not GLiNER): zero new deps (reuses existing reqwest + async infra), reuses the same Ollama setup openhuman uses for embeddings, better per-entity quality scales with model choice, no native ONNX runtime to ship. Latency cost (~100-300ms per chunk on a small model like qwen2.5:0.5b) is amortised by the short-circuit. The admission gate stays a hybrid - cheap deterministic signals always run; the LLM is one signal among many, never the sole arbiter. Backwards-compatible: with default SignalWeights the LLM weight is 0.0 and behaviour matches pre-LLM Phase 2 exactly. What's new: - src/openhuman/memory/tree/score/extract/llm.rs: LlmEntityExtractor implementing the existing EntityExtractor trait. Posts a structured JSON request to an Ollama-compatible /api/chat endpoint asking for NER + an importance rating in one call. Span recovery via string search; hallucinated entities (surface not in source text) dropped. Soft fallback: HTTP failures log a warn and return empty extraction. - ExtractedEntities gains llm_importance (Option<f32>) + llm_importance_reason (Option<String>); merge() takes max importance. - ScoreSignals gains llm_importance (f32, defaults to 0.0). - SignalWeights gains llm_importance (default 0.0; with_llm_enabled() helper sets it to 2.0 - comparable to metadata/source weights). - signals::combine_cheap_only(): variant that excludes the LLM signal, used for the short-circuit decision in score_chunk. - ScoringConfig gains llm_extractor (Option<Arc<dyn EntityExtractor>>), definite_keep_threshold (default 0.85), definite_drop_threshold (default 0.15). New with_llm_extractor() constructor. - score_chunk() pipeline: 1. Always-on regex extraction 2. 
Cheap signals + combine_cheap_only -> cheap_total 3. If cheap_total >= definite_keep OR <= definite_drop: skip LLM 4. Else (borderline band): run LLM extractor, merge results, recompute 5. Final combine + admission gate against drop_threshold - mem_tree_score table gains nullable llm_importance + llm_importance_reason columns via idempotent ALTER TABLE migration. - ScoreRow + upsert/get persist the new column. What's not changed: - Existing JSON-RPC surface - Default behaviour (LLM weight=0, no llm_extractor configured = same outputs as before) - mem_tree_chunks / mem_tree_entity_index schemas Tests added: - LLM extractor: prompt construction, JSON parsing, hallucination drop, importance clamping, strict vs lenient unknown-kind handling - Score pipeline: short-circuit on definite_keep/definite_drop skips LLM call (verified via FakeLlm call counter); borderline band consults LLM once; LLM failure falls back gracefully without erroring LLM-NER work follows up on #708. Parent: #711. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(memory): address CI failures and CodeRabbit review on LLM-NER (#708) CI was red on PR #775 due to 10 E0063 errors — existing test-only struct literals for ExtractedEntities / ScoreSignals / ScoreRow weren't updated when the new optional llm_importance fields landed. Local cargo check on lib-only passed; cargo check --tests caught them, as does CI. CodeRabbit also flagged four semantic issues: 1. LlmEntityExtractor::extract() returned Err on HTTP transport, non-2xx status, and JSON parse failures. The soft-fallback contract (documented in the module header) only worked because score_chunk catches errors; any other caller got a hard error. Now split into extract_or_empty() which logs a warn on each failure mode and returns ExtractedEntities::default() unconditionally. The public extract() stays on the async-trait surface but never returns Err. 2. 
find_char_span() always searched from byte offset 0, so when the LLM returned the same surface twice (explicitly asked for duplicates in the prompt), both entities got (start=0, end=len) — same span, not distinct mentions. Added find_char_span_from(haystack, needle, byte_from, char_from) that resumes from a cursor. into_extracted_entities now tracks a per-surface (byte_after, char_after) cursor in a HashMap<String, (usize, u32)>, so duplicate mentions get distinct non-overlapping spans and over-claim (LLM returns 3 "Alice" when source has 2) drops the excess entries with a debug log. 3. combine() always included w.llm_importance in the denominator even when llm_importance=0.0 because the LLM didn't run. That artificially lowered the total on short-circuit paths. score_chunk now branches on llm_consulted: uses combine() when LLM ran (importance actually contributes), combine_cheap_only() when LLM was skipped or failed (denominator excludes the LLM weight). Observable in two new tests: short_circuit_reports_cheap_only_total verifies r.total == combine_cheap_only(r.signals, weights) and strictly exceeds combine(r.signals, weights) when llm_importance=0; llm_consulted_ reports_full_total verifies the opposite path. 4. Same issue from a different angle (CodeRabbit flagged both). Fixed by the same llm_consulted branch. Test-literal fixes: - signals/ops.rs: ScoreSignals literals pick up llm_importance field; make_entities helper uses ..Default::default() on ExtractedEntities. - extract/types.rs: three test ExtractedEntities literals gain llm_importance: None, llm_importance_reason: None. - resolver.rs: canonicalise_batch_preserves_spans literal gains the two fields. - score/store.rs: sample_row gains signals.llm_importance and llm_importance_reason. 
New tests (7 added):

- find_char_span_from_advances_past_prior_match
- find_char_span_from_returns_none_after_exhaustion
- find_char_span_from_preserves_utf8
- find_char_span_from_rejects_non_char_boundary
- into_extracted_entities_gives_distinct_spans_to_duplicate_mentions
- into_extracted_entities_drops_extra_duplicate_when_source_only_has_one
- extract_soft_fallback_on_unreachable_endpoint (transport failure → empty extraction, not Err)
- short_circuit_reports_cheap_only_total
- llm_consulted_reports_full_total

cargo check --lib clean; cargo fmt clean. The integration-test rmeta errors visible locally are stale-metadata artifacts from incremental builds with prior struct shapes; CI's clean build resolves them, and the tests/*.rs files do not construct any of the touched types.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(memory): source trees + bucket-seal foundation (#709)

Phase 3a of the memory architecture (#711 umbrella). Lifts admitted leaves into a per-source hierarchy so Phase 4 retrieval has a tree to walk.

## What's in this PR

* **Schema** — three new tables in `chunks.db` alongside Phase 1/2: `mem_tree_trees`, `mem_tree_summaries`, `mem_tree_buffers`. Plus a `parent_summary_id` backlink column on `mem_tree_chunks`. All migrations are additive and idempotent (`ALTER TABLE ADD COLUMN`, same pattern as Phase 2).
* **`source_tree/` module** — isolated subdir; does not touch the legacy `openhuman::memory` layer.
  - `types.rs` — `Tree`, `SummaryNode`, `Buffer`, `TreeKind` (Source/Topic/Global), `TreeStatus`. Constants: `TOKEN_BUDGET = 10_000`, `DEFAULT_FLUSH_AGE_SECS = 7d`.
  - `store.rs` — CRUD for all three tables via the shared `with_connection` entry point. Writes are transactional; summary insert is idempotent on primary key (INSERT OR IGNORE).
  - `registry.rs` — `get_or_create_source_tree(scope)` — idempotent lookup keyed on `(kind, scope)`.
  - `bucket_seal.rs` — `append_leaf` + cascade: push a leaf into L0, seal when `token_sum >= TOKEN_BUDGET`, recurse upward. One seal = one transaction; children get a parent backlink (leaves via `parent_summary_id`, summaries via `parent_id`). The tree's `max_level` and `root_id` bump on root split.
  - `flush.rs` — `flush_stale_buffers(max_age)` force-seals any buffer whose `oldest_at` is older than `max_age`. Primitive only; wiring into a scheduler is out of scope.
  - `summariser/` — `Summariser` trait + `InertSummariser` fallback (concatenate children, union entities preserving first-seen order, hard-truncate to budget). The trait is async; an Ollama implementation slots in later without breaking callers.
* **Ingest wiring** — `tree/ingest.rs::persist` now calls `append_leaf` once per kept chunk after chunks + scores land. Failures are logged at warn level and do not fail the ingest — leaves are already persisted and the next flush can still rebuild the tree.

## Concurrency / LLM

The async summariser call happens OUTSIDE any DB transaction, so a slow LLM never holds SQLite locks. Blocking DB calls inside `append_leaf` are acceptable for Phase 3a because the Inert summariser does no real I/O; when a networked summariser lands, wrap the DB calls in `tokio::task::spawn_blocking`.
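The L0 seal rule above can be sketched in a few lines. This is a minimal in-memory stand-in — the real `append_leaf` persists to SQLite, writes backlinks, and cascades upward on seal:

```rust
/// Same budget constant as source_tree/types.rs.
const TOKEN_BUDGET: u32 = 10_000;

/// In-memory stand-in for the L0 buffer row (illustration only).
#[derive(Default)]
struct Buffer {
    item_ids: Vec<String>,
    token_sum: u32,
}

/// Push a leaf; once the buffer crosses TOKEN_BUDGET, "seal" it: drain the
/// children (they become a summary's child_ids) and reset the buffer.
/// Returns the sealed child ids, or None if the leaf was only buffered.
fn append_leaf(buf: &mut Buffer, leaf_id: &str, tokens: u32) -> Option<Vec<String>> {
    buf.item_ids.push(leaf_id.to_string());
    buf.token_sum += tokens;
    if buf.token_sum >= TOKEN_BUDGET {
        let children = std::mem::take(&mut buf.item_ids);
        buf.token_sum = 0;
        Some(children)
    } else {
        None
    }
}
```

Three 4k-token leaves cross the 10k budget on the third append — the same 12k-token cross-over scenario the `bucket_seal` tests below exercise.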
## Tests (23 new, all green)

* `types` — enum round-trips, buffer staleness predicates
* `store` — tree insert + unique scope, summary insert + idempotence, buffer upsert/clear, stale-buffer ordering
* `registry` — `get_or_create` idempotence, distinct scopes yield distinct trees, id prefix format
* `summariser/inert` — provenance-prefixed concat, entity union with order preservation, budget truncation, empty-content skip
* `bucket_seal` — append-below-budget is buffered only (no seal); 12k-token cross-over produces an L1 summary with correct `child_ids`, L0 cleared, L1 carries the new summary id, tree `max_level`/`root_id` updated, leaf → parent backlink populated
* `flush` — stale buffer force-seals under budget; recent buffer is a no-op

Broader `memory::tree` suite: 160 tests pass (23 new + 137 existing Phase 1/2), no regressions.

## Stacked on #775

This PR applies on top of `feat/708-memory-llm-ner` (PR #775, the Phase 2 LLM-NER follow-up). Merge #775 first; this branch will then rebase cleanly onto main.

## Out of scope (tracked separately)

* **3b — Global activity digest tree** — time-aligned cross-source recap
* **3c — Topic trees** — hotness-driven per-entity materialisation
* **Phase 4 (#710)** — retrieval / query / gate

Closes part of #709 (source-tree foundation).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* style(memory): cargo fmt on source_tree module (#709)

CI rustfmt caught a few multi-line function signatures and import groups that rustfmt wants collapsed / reordered. No behaviour change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(memory): address CodeRabbit review on source-tree (#709)

Three correctness fixes from CodeRabbit on PR #789, plus a design clarification for the InertSummariser. False-positive findings (schema-out-of-sync, rustfmt — already in b7395a8) are not addressed here because they are not real issues.
## Changes

### Idempotent buffer append (`bucket_seal.rs`)

`append_to_buffer` commits before the async seal/cascade runs, so a retry after a failed seal (summariser timeout, crash, transient ingest error) would re-push the same leaf and double-count tokens. Added a dedup check on `item_ids` so the append is a true no-op for already-buffered chunks.

### InertSummariser: honest stub, no entity propagation (`summariser/inert.rs`)

Summary-level entities and topics are **LLM-derived from the summary's own synthesised content** by design, not a mechanical union of children's labels. The networked summariser (future) does its own extraction on its output. The inert fallback has no NER, so it emits empty entities/topics — an honest stub rather than a misleading union. Removed the `union_preserve_order` helper and updated the corresponding test.

### Forward-compat entity index on seal (`bucket_seal.rs`, `score/store.rs`)

Added `index_summary_entity_ids_tx` — a summary-specific indexer that takes canonical ids only (matching the `Vec<String>` shape of `SummaryOutput.entities`) and writes them into `mem_tree_entity_index` with `node_kind='summary'`, placeholder `entity_kind`/`surface` (both set to the canonical id), and the summary's score. Called from `seal_one_level`'s write transaction. A no-op today (InertSummariser emits empty), but wired so the Ollama summariser lands as a drop-in without touching bucket-seal.

### UNIQUE race in `get_or_create_source_tree` (`registry.rs`)

Two concurrent callers for the same scope could both pass the lookup and then race on `insert_tree`; the loser got a UNIQUE violation bubbled up as an error instead of the pre-existing row. Now catches `ConstraintViolation` (and the message-based fallback for chained errors), re-queries by scope, and returns the winning row. New test `get_or_create_recovers_from_unique_race` covers both the recovery path and the `is_unique_violation` detector.
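The idempotent-append guard described above amounts to a membership check before the push. A minimal sketch (the struct and function names mirror the commit, but the body is an assumption — the real code works against the `mem_tree_buffers` row):

```rust
/// In-memory stand-in for a buffer row (illustration only).
struct BufferState {
    item_ids: Vec<String>,
    token_sum: u32,
}

/// Re-appending an already-buffered chunk is a true no-op, so a retry after a
/// failed seal cannot double-count tokens. Returns whether anything changed.
fn append_to_buffer(buf: &mut BufferState, chunk_id: &str, tokens: u32) -> bool {
    if buf.item_ids.iter().any(|id| id == chunk_id) {
        return false; // already buffered — skip the push and the token bump
    }
    buf.item_ids.push(chunk_id.to_string());
    buf.token_sum += tokens;
    true
}
```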
## Tests

161 passed / 0 failed across `memory::tree` (+1 new test vs the prior baseline of 160). No regressions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* style(memory): cargo fmt on source_tree::registry (#709)

CI's rustfmt check flagged the race-recovery block in get_or_create_source_tree — rustfmt prefers chaining `?.ok_or_else(...)` on one line and wrapping the assert! macro. No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(memory): index_summary_entity_ids_tx writes parseable entity_kind (#709)

CodeRabbit on PR #789 caught a bug in the summary entity indexer added for Phase 3a. The helper was writing the full canonical id (e.g. "email:alice@example.com") into the `entity_kind` column instead of the kind prefix. `lookup_entity()` later runs `EntityKind::parse()` on that column and would fail with `unknown EntityKind` the first time a result set mixed leaf and summary hits — blocking any Phase 4 read that needed summary-level entity resolution.

Fix: split the canonical id on `:` and store the prefix only. Falls back to the full id with a `warn!` for malformed ids that lack a `:`, so bad data surfaces rather than staying silently latent.

Added regression test `summary_entity_index_kind_is_parseable` that:

- Indexes a leaf entity row (the existing happy path)
- Indexes two summary entity rows via index_summary_entity_ids_tx
- Runs lookup_entity for email, hashtag, and a cross-kind id
- Asserts each row comes back with a correctly-parsed EntityKind

Before the fix, the lookup_entity calls in this test would fail with a FromSqlConversionFailure on the summary row's entity_kind column. After the fix, mixed leaf+summary lookups round-trip cleanly.

Tests: 162 passed / 0 failed in memory::tree (+1 vs the 161 baseline).
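The prefix-split fix boils down to one small helper. A sketch under the stated contract (the function name is illustrative, not the actual one):

```rust
/// Derive the entity_kind column value from a canonical id by taking the
/// prefix before ':' — e.g. "email:alice@example.com" indexes as "email",
/// which EntityKind::parse() can round-trip. Malformed ids without a ':'
/// fall back to the full id (the real code also logs a warn! there) so bad
/// data surfaces instead of staying latent.
fn entity_kind_for_index(canonical_id: &str) -> &str {
    match canonical_id.split_once(':') {
        Some((kind, _rest)) => kind,
        None => canonical_id,
    }
}
```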
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(memory): promote extracted topics to canonical entities (#709)

Addresses a design gap surfaced during PR #789 review: the Phase 3 plan promised topic trees would scope to "entities / topics", but the implementation routed on canonical_entities only. A chunk saying "Phoenix migration ships Friday" produced `r.extracted.topics = ["phoenix", "migration"]` but `canonical_entities = []` — meaning no topic tree named Phoenix would ever spawn.

Fix (Shape A from the review discussion): extend the canonical entity stream with topic rows. No new tables, no new routing path, no new hotness counters.

## Changes

- `score/extract/types.rs`: add `EntityKind::Topic` with as_str/parse round-trip. The enum is `non_exhaustive`, so this is an additive change.
- `score/resolver.rs::canonicalise`: after emitting canonical entities from `extracted.entities`, also emit one per `extracted.topics` entry with `kind: Topic`, `canonical_id: "topic:<lowercased>"`. Span fields are 0 (topics are chunk-level, not substring-scoped). Dedups same-label topics internally; keeps `hashtag:launch` and `topic:launch` as separate entities by design.

## Downstream effect (zero-touch)

- `score::store::index_entities_tx` indexes topic entities the same way as email entities — rows land in `mem_tree_entity_index` with `entity_kind = "topic"`.
- Phase 3c routing iterates `canonical_entities` — topics now trigger topic-tree routing automatically. A chunk about "phoenix" starts accumulating hotness for `topic:phoenix`; once the threshold is crossed, `get_or_create_topic_tree("topic:phoenix")` materialises a dedicated topic tree. Backfill via `lookup_entity("topic:phoenix")` then hydrates historic leaves.
- Phase 4 retrieval can filter `WHERE entity_kind = 'topic'` to surface themes without mixing in people/emails.
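The topic-promotion step in `canonicalise` can be sketched as a small id-builder. The function is a hypothetical stand-in for the topic-emitting tail of the real resolver, which also carries kind/span fields:

```rust
use std::collections::HashSet;

/// For each extracted topic label, emit a canonical "topic:<lowercased>" id.
/// Labels are trimmed + lowercased and deduped on the normalised label;
/// "hashtag:<label>" ids remain separate by design (different kind prefix).
fn canonical_topic_ids(topics: &[&str]) -> Vec<String> {
    let mut seen = HashSet::new();
    let mut out = Vec::new();
    for t in topics {
        let label = t.trim().to_lowercase();
        if label.is_empty() || !seen.insert(label.clone()) {
            continue; // skip blanks and same-label duplicates
        }
        out.push(format!("topic:{label}"));
    }
    out
}
```

So `["Phoenix", "migration"]` yields `topic:phoenix` and `topic:migration`, which then flow through `index_entities_tx` and Phase 3c routing like any other canonical entity.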
## Tests

Five new tests in `resolver.rs`:

- topics → canonical entities with `kind: Topic`
- lowercase normalisation + dedup on label
- hashtag + topic with the same label coexist (different kind prefix)
- entities emitted first, topics appended
- `EntityKind::Topic` round-trips through `parse` / `as_str`

Broader memory::tree: 167 passed / 0 failed (162 baseline + 5 new). No regressions.

## Caveats for follow-up

1. Topic noise — LLM-extracted themes are a softer signal than regex entities. The hotness threshold (TOPIC_CREATION_THRESHOLD=10.0) gates tree creation, so transient topics won't spawn trees.
2. Topic/hashtag duplication — "Phoenix #phoenix" creates both `topic:phoenix` and `hashtag:phoenix` trees. Tolerable (the kinds are semantically distinct), but a dedup policy may be worth revisiting in a Phase 4 retrieval follow-up.
3. Topic normalisation is lowercase+trim only. Plurals / stemming are deferred.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Steven Enamakel <31011319+senamakel@users.noreply.github.com>
Summary
Adds the scoring / admission layer on top of Phase 1's `mem_tree_chunks` substrate.

Problem
#708's brief: raw ingested chunks are mixed-signal. Without a preprocessing layer, "+1" reactions and multi-page pasted logs enter the index with the same weight as substantive content. We need preprocessing, scoring, and an admission gate between Phase 1's chunker and the store.
Solution
New module at `src/openhuman/memory/tree/score/`:
Pipeline
```
canonicalise → chunk → extract entities → compute signals → combine (weighted)
→ admission gate (drop if total < threshold)
→ canonicalise entities (cross-platform merge)
→ persist: chunks (kept only) + score rationale (all) + entity index (kept)
```
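The combine + gate steps above can be sketched as a weighted average followed by a threshold check. Signal and weight names here are illustrative stand-ins for the real `ScoreSignals`/weights types, and the values are assumed to be pre-normalised:

```rust
/// Illustrative signal bundle; each value is assumed normalised to 0.0..=1.0.
struct Signals {
    token_count: f64,
    unique_word_ratio: f64,
    metadata_weight: f64,
    source_weight: f64,
    interaction: f64,
    entity_density: f64,
}

/// Weighted average over the cheap signals.
fn combine(s: &Signals, w: &[f64; 6]) -> f64 {
    let vals = [
        s.token_count,
        s.unique_word_ratio,
        s.metadata_weight,
        s.source_weight,
        s.interaction,
        s.entity_density,
    ];
    let total: f64 = vals.iter().zip(w).map(|(v, wi)| v * wi).sum();
    let denom: f64 = w.iter().sum();
    if denom == 0.0 { 0.0 } else { total / denom }
}

/// Admission gate: keep the chunk only when its score clears the threshold
/// (default 0.3; setting the threshold to 0 disables the gate).
fn admit(score: f64, drop_threshold: f64) -> bool {
    score >= drop_threshold
}
```

Dividing by the sum of weights is what lets individual weights be retuned (or zeroed out) without rescaling the threshold.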
Schema
Phase 1's `with_connection` helper now applies two additional tables (via `CREATE TABLE IF NOT EXISTS`) plus an idempotent `ALTER TABLE ADD COLUMN embedding BLOB` migration. The embedding column is wired but unpopulated in this PR — it lands properly in Phase 3 retrieval work.
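The idempotent column migration reduces to "only ALTER if the column is missing". A pure-logic stand-in (the real migration inspects the live schema through the connection rather than taking a column list):

```rust
/// Decide whether the embedding-column migration still needs to run, given
/// the table's current column names. Re-running against an already-migrated
/// table yields None — i.e. the migration is a no-op, hence idempotent.
fn embedding_migration_sql(existing_columns: &[&str]) -> Option<&'static str> {
    if existing_columns.contains(&"embedding") {
        None
    } else {
        Some("ALTER TABLE mem_tree_chunks ADD COLUMN embedding BLOB")
    }
}
```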
Async surface
`ingest_chat` / `ingest_email` / `ingest_document` are now `async` to accommodate the `EntityExtractor` trait. Blocking SQLite work stays isolated on `tokio::task::spawn_blocking` inside `persist`. The `memory_tree` JSON-RPC surface is unchanged.
Design decisions / tradeoffs
Semantic NER (GLiNER) is deliberately deferred: it would pull in the `ort` crate + a ~170MB ONNX file + a soft-fallback loader. That belongs in its own PR behind a cargo feature. Phase 2 ships the pipeline skeleton so the drop-in is trivial.

Submission Checklist
Impact
Related