
feat(memory): topic trees (lazy hotness-driven materialisation) (#709)#799

Merged
senamakel merged 1 commit into tinyhumansai:main from sanil-23:feat/709c-topic-trees
Apr 22, 2026

Conversation

Contributor

@sanil-23 sanil-23 commented Apr 22, 2026

Summary

Phase 3c of the memory architecture (umbrella #711). Adds per-entity topic trees spawned lazily when an entity's hotness crosses a threshold, so Phase 4 retrieval can resolve "what did Alice say about Phoenix?" without scanning the full chunk pool.

No frontier LLM curator — pure arithmetic hotness over pre-existing signals.

Mechanics

  • Hotness (pure math): log(mention_count_30d) + distinct_sources*0.5 + recency_decay + centrality + query_hits*2
  • Thresholds: TOPIC_CREATION_THRESHOLD=10.0, TOPIC_ARCHIVE_THRESHOLD=2.0, TOPIC_RECHECK_EVERY=100 (ingests between recomputes)
  • Spawn flow: on each ingest, route_leaf_to_topic_trees bumps counters and appends to every existing topic tree whose entity the chunk mentions. Every TOPIC_RECHECK_EVERY ingests, the curator recomputes hotness; if over threshold, a new topic tree is created and backfilled from mem_tree_entity_index.
  • Reuses source_tree::bucket_seal::append_leaf as-is — topic trees are just TreeKind::Topic rows in the existing schema.
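The pure-arithmetic scorer can be sketched in a few lines. This is a sketch, not the shipped code: the PR pins only the term weights, so the `log(1 + n)` guard for the zero-mention case, the struct layout, and the placeholder decay helper are assumptions.

```rust
// Sketch of the arithmetic hotness score described above. The ln(1 + n)
// guard and this simple decay helper are assumptions; the PR specifies the
// weights: log(mentions) + sources*0.5 + decay + centrality + hits*2.
struct HotnessCounters {
    mention_count_30d: u64,
    distinct_sources: u64,
    centrality: f64,
    query_hits: u64,
}

fn recency_decay(days_since_last_seen: f64) -> f64 {
    // Placeholder: full credit for fresh mentions, fading to zero at 30 days.
    (1.0 - days_since_last_seen / 30.0).clamp(0.0, 1.0)
}

fn hotness(c: &HotnessCounters, days_since_last_seen: f64) -> f64 {
    (1.0 + c.mention_count_30d as f64).ln()
        + c.distinct_sources as f64 * 0.5
        + recency_decay(days_since_last_seen)
        + c.centrality
        + c.query_hits as f64 * 2.0
}

const TOPIC_CREATION_THRESHOLD: f64 = 10.0;

fn main() {
    let cold = HotnessCounters { mention_count_30d: 0, distinct_sources: 0, centrality: 0.0, query_hits: 0 };
    let hot = HotnessCounters { mention_count_30d: 40, distinct_sources: 3, centrality: 1.0, query_hits: 3 };
    // A never-mentioned entity stays well below the spawn threshold.
    assert!(hotness(&cold, 30.0) < TOPIC_CREATION_THRESHOLD);
    // A widely-cited, recently-queried entity crosses it.
    assert!(hotness(&hot, 0.5) >= TOPIC_CREATION_THRESHOLD);
}
```

Because every term is monotone in its counter, the score needs no LLM in the loop and is cheap enough to recompute on the ingest path.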

What's in this PR

New module src/openhuman/memory/tree/topic_tree/ (8 files):

  • types.rs — EntityIndexStats, HotnessCounters, threshold constants
  • hotness.rs — pure scorer + recency_decay helper
  • store.rs — CRUD for the new mem_tree_entity_hotness table
  • registry.rs — get_or_create_topic_tree(entity_id), archive_topic_tree, list_topic_trees, race-recovery via is_unique_violation
  • curator.rs — maybe_spawn_topic_tree, force_recompute, SpawnOutcome enum
  • backfill.rs — backfill_topic_tree hydrates historic leaves for a newly-spawned entity via score::store::lookup_entity
  • routing.rs — route_leaf_to_topic_trees(config, leaf, canonical_entities, summariser) — non-fatal ingest hook

Schema (additive, idempotent — same pattern as Phase 3a):

  • New table mem_tree_entity_hotness (entity_id PK + counters + last_hotness + ingests_since_check)

Ingest wiring:

  • tree/ingest.rs::append_leaves_to_tree now calls route_leaf_to_topic_trees after the source-tree append — failures log at warn and DO NOT fail the ingest.
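A minimal sketch of that warn-and-continue wiring; the error type, the routing stub, and the eprintln!-based logging are stand-ins for the real signatures in tree/ingest.rs.

```rust
// Sketch of the non-fatal topic-tree hook: routing failures are logged and
// swallowed so the source-tree ingest still succeeds. Types here are
// illustrative stand-ins, not the crate's real signatures.
fn route_leaf_to_topic_trees(leaf: &str) -> Result<(), String> {
    if leaf.is_empty() { Err("no entities resolved".into()) } else { Ok(()) }
}

fn append_leaves_to_tree(leaf: &str) -> Result<(), String> {
    // 1. Source-tree append happens first (elided here).
    // 2. Topic-tree routing is best-effort: warn on failure, never propagate.
    if let Err(e) = route_leaf_to_topic_trees(leaf) {
        eprintln!("warn: topic-tree routing failed: {e}");
    }
    Ok(())
}

fn main() {
    // Ingest succeeds even when routing errors out.
    assert!(append_leaves_to_tree("").is_ok());
    assert!(append_leaves_to_tree("alice said hi").is_ok());
}
```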

🚨 Stacked on #789

Base is feat/709-summary-trees (PR #789, Phase 3a source trees). Merge #789 first, then rebase this branch onto main. The diff against main right now includes #775 + #789's commits.

Independent of #798 (Phase 3b Global digest) — 3b and 3c can merge in any order after #789.

Test plan

  • cargo check --lib clean
  • cargo test --lib memory::tree — 201 passed / 0 failed (161 baseline + 40 new)
  • cargo fmt --check clean
  • Hotness edge cases: zero mentions, spike recovery, old-but-widely-cited
  • Curator: no creation below threshold; single-creation when crossed; idempotent on retry
  • Backfill: new topic tree populated from pre-existing entity_index hits
  • Routing: leaf mentioning Alice appends to both source tree AND Alice topic tree
  • Registry race recovery on UNIQUE
  • Archive flips status='archived' but keeps rows readable
  • Cadence: hotness recompute only fires every TOPIC_RECHECK_EVERY=100 ingests
  • Integration: ingest across 2 sources mentioning Alice → hotness crosses → topic tree materialised → subsequent Alice leaf routes into both trees
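The cadence case above (recompute fires only every TOPIC_RECHECK_EVERY ingests) reduces to a small counter; the field layout below is assumed from the PR description.

```rust
// Sketch of the recheck cadence: counters are bumped on every ingest, but
// the full hotness recompute only fires every TOPIC_RECHECK_EVERY ingests.
const TOPIC_RECHECK_EVERY: u64 = 100;

struct Counters {
    ingests_since_check: u64,
}

/// Returns true when this ingest should trigger a full hotness recompute.
fn tick(c: &mut Counters) -> bool {
    c.ingests_since_check += 1;
    if c.ingests_since_check >= TOPIC_RECHECK_EVERY {
        c.ingests_since_check = 0; // recompute resets the cadence counter
        true
    } else {
        false
    }
}

fn main() {
    let mut c = Counters { ingests_since_check: 0 };
    let recomputes = (0..250).filter(|_| tick(&mut c)).count();
    assert_eq!(recomputes, 2); // fires at ingest 100 and 200 only
}
```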

Out of scope (deferred to follow-ups)

  • JSON-RPC surfaces — core library only
  • Archive cron sweep — archive_topic_tree primitive ships; scheduled job deferred
  • Graph centrality and query-hit counter increments — columns exist, writes come in Phase 4
  • 30-day mention-count windowing — currently grows monotonically (TODO documented inline)
  • LLM summariser — InertSummariser fallback; topic trees inherit the honest-stub entity/topic pattern from Phase 3a

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Topic trees now automatically materialize based on entity relevance scores
    • New information automatically distributes to matching topic trees
    • Entity activity tracking with mention frequency and recency analysis
    • Archive management for inactive topic trees
  • Chores

    • Updated database schema to persist entity activity metrics

@sanil-23 sanil-23 requested a review from a team April 22, 2026 17:44
Contributor

coderabbitai Bot commented Apr 22, 2026

Caution

Review failed

Pull request was closed or merged during review

📝 Walkthrough


The changes implement Phase 3c lazy topic-tree materialization for the memory system. New leaf appends are routed to matching entity topic trees via a curator that accumulates ingest-derived hotness metrics and spawns trees when thresholds are met. The feature includes hotness scoring algorithms, a topic-tree registry, backfill logic, routing, and SQLite persistence for entity hotness counters.

Changes

  • Ingest Pipeline Extension — src/openhuman/memory/tree/ingest.rs, src/openhuman/memory/tree/mod.rs: After each leaf append in the ingest pipeline, newly appended leaves are routed to matching topic trees. Module exports extend to include the new topic_tree submodule.
  • Hotness Scoring & Types — src/openhuman/memory/tree/topic_tree/hotness.rs, src/openhuman/memory/tree/topic_tree/types.rs: Implements deterministic hotness scoring with log-dampened mentions, distinct-source weighting, piecewise-linear recency decay (1/7/30 day breakpoints), graph centrality, and query-hit boosting. Defines tuning constants (TOPIC_CREATION_THRESHOLD, TOPIC_ARCHIVE_THRESHOLD, TOPIC_RECHECK_EVERY) and core data structures (EntityIndexStats, HotnessCounters).
  • Topic Tree Curation — src/openhuman/memory/tree/topic_tree/curator.rs: Tracks per-entity hotness via lightweight counters (mention_count_30d, last_seen_ms, ingests_since_check). Batches hotness recomputation by cadence; on threshold crossing, spawns topic trees and triggers backfill. Exposes the SpawnOutcome enum and the maybe_spawn_topic_tree/force_recompute functions.
  • Topic Tree Lifecycle Management — src/openhuman/memory/tree/topic_tree/registry.rs: Manages mem_tree_trees rows with kind=Topic. Provides idempotent get_or_create_topic_tree, list_topic_trees, and archive_topic_tree with race-condition handling for concurrent creations.
  • Topic Tree Routing — src/openhuman/memory/tree/topic_tree/routing.rs: Fans out ingest leaves to all matching entity topic trees. Routes each entity independently while ticking the curator for spawn decisions. Handles errors per entity as non-fatal warnings.
  • Topic Tree Backfill — src/openhuman/memory/tree/topic_tree/backfill.rs: Best-effort historical population of newly created topic trees. Queries the entity index for up to BACKFILL_LIMIT matching chunks, filters to leaf entries, and appends them; missing chunks are logged and skipped.
  • Database Schema & Persistence — src/openhuman/memory/tree/store.rs, src/openhuman/memory/tree/topic_tree/store.rs: The tree-level store adds the mem_tree_entity_hotness SQLite table with an index on last_hotness. The topic-tree store provides get, get_or_fresh, upsert, and distinct_sources_for APIs for hotness counter persistence and distinct-source counting.
  • Topic Tree Module Entrypoint — src/openhuman/memory/tree/topic_tree/mod.rs: Declares and re-exports topic-tree submodules and public APIs: curator, hotness, registry, routing, types, backfill, and store functions for external callers.

Sequence Diagrams

sequenceDiagram
    participant Ingest as Ingest Pipeline
    participant Router as Routing Module
    participant Curator as Curator
    participant Registry as Registry
    participant Backfill as Backfill
    
    Ingest->>Ingest: append_leaf() succeeds
    Ingest->>Router: route_leaf_to_topic_trees(leaf, entities)
    
    loop for each entity
        Router->>Registry: check active topic tree
        alt tree exists & active
            Router->>Router: append leaf to topic tree
        end
        Router->>Curator: maybe_spawn_topic_tree(entity_id)
        
        alt hotness meets threshold (cadence triggered)
            Curator->>Curator: run_full_recompute()
            alt hotness >= TOPIC_CREATION_THRESHOLD
                Curator->>Registry: get_or_create_topic_tree(entity_id)
                Registry-->>Curator: Tree
                Curator->>Backfill: backfill_topic_tree(tree, entity_id)
                Backfill->>Backfill: query entity-index matches
                Backfill->>Backfill: append historical chunks
                Backfill-->>Curator: count appended
                Curator-->>Router: Spawned
            else
                Curator-->>Router: BelowThreshold or TreeExists
            end
        else cadence not triggered
            Curator-->>Router: CountersBumped
        end
    end
    
    Router-->>Ingest: success (routing errors non-fatal)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • PR #733: Introduces scoring and canonical entity extraction in the ingest pipeline that this PR consumes to route leaves by entity ID to topic trees.
  • PR #732: Directly extends the leaf append behavior in append_leaves_to_tree to call route_leaf_to_topic_trees after each successful append.
  • PR #789: Modifies the same append_leaves_to_tree ingest pipeline entry point that forms the integration point for topic tree routing.

Suggested reviewers

  • senamakel

Poem

🐰 A fuzzy forest grows in hotness glow,
Where lazy topic trees now bloom and grow.
Each entity's warmth, once scattered and wide,
Now clusters in trees where meanings reside.
Hop through the scorer, bound to the curator's leap!
🌱✨

🚥 Pre-merge checks | ✅ 5 passed

  • Description Check — ✅ Passed: Check skipped — CodeRabbit’s high-level summary is enabled.
  • Title Check — ✅ Passed: The PR title clearly and specifically describes the main change: implementing topic trees with lazy hotness-driven materialisation as part of Phase 3c.
  • Docstring Coverage — ✅ Passed: Docstring coverage is 100.00%, which is sufficient (required threshold: 80.00%).
  • Linked Issues Check — ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check — ✅ Passed: Check skipped because no linked issues were found for this pull request.



Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 11

🧹 Nitpick comments (8)
src/openhuman/memory/tree/source_tree/bucket_seal.rs (2)

385-388: entities always empty for leaf inputs.

In hydrate_leaf_inputs, entities are always set to Vec::new() regardless of what's in the score row. This means leaf-level entity information isn't propagated to summaries. The comment at lines 265-270 mentions this is intentional for Phase 3a (InertSummariser emits empty entities), but when the Ollama summariser lands, this may need revisiting.

Consider propagating entities from score rows

If ScoreResult contains extracted entity IDs, they could be propagated here:

         let (score_value, entities, topics) = match &score {
-            Some(row) => (row.total, Vec::new(), chunk.metadata.tags.clone()),
-            None => (0.0, Vec::new(), chunk.metadata.tags.clone()),
+            Some(row) => (row.total, row.entity_ids.clone(), chunk.metadata.tags.clone()),
+            None => (0.0, Vec::new(), chunk.metadata.tags.clone()),
         };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/memory/tree/source_tree/bucket_seal.rs` around lines 385 - 388,
In hydrate_leaf_inputs (bucket_seal.rs) the match that sets (score_value,
entities, topics) always uses Vec::new() for entities; update the Some(row) arm
to propagate entity IDs from the ScoreResult (e.g., row.entities or
row.entity_ids) instead of an empty Vec so leaf-level entity info flows into
summaries; keep the None arm as is, and ensure any type conversions required for
the entities variable match the expected type used downstream in
hydrate_leaf_inputs and subsequent summariser code.

91-91: unchecked_transaction() usage note.

unchecked_transaction() doesn't check for existing transactions, which is fine here since with_connection opens a fresh connection per call. However, if the connection management changes to connection pooling or reuse, this could silently create nested transactions. Consider adding a comment or using transaction() with proper error handling for future-proofing.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/memory/tree/source_tree/bucket_seal.rs` at line 91, The call to
conn.unchecked_transaction() can create nested transactions silently if
connections are later pooled/reused; update the code in bucket_seal.rs around
the conn.unchecked_transaction() usage to either (a) replace
unchecked_transaction() with conn.transaction() and handle the potential error
return, or (b) if you intentionally need unchecked behavior, add a clear inline
comment referencing with_connection() and why a fresh connection is guaranteed
(so future maintainers won't accidentally introduce nested-transaction bugs).
Ensure you update the surrounding error handling to match the chosen API
(transaction() returns a Result) and keep the transaction lifecycle tied to the
connection usage.
src/openhuman/memory/tree/score/extract/llm.rs (1)

369-379: parse_kind accepts aliases not mentioned in the prompt.

The prompt (line 216) specifies "person|organization|location|event|product", but parse_kind also accepts "people", "organisation", "org", "place", "loc", "miscellaneous", and "other". This is defensive, but the prompt could be updated to mention accepted aliases, or the parser could be stricter to match the prompt exactly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/memory/tree/score/extract/llm.rs` around lines 369 - 379, The
parser function parse_kind currently accepts extra aliases ("people",
"organisation", "org", "place", "loc", "miscellaneous", "other") that are not
listed in the prompt; either tighten parse_kind to only accept the exact
canonical strings from the prompt ("person", "organization", "location",
"event", "product") by removing aliases from parse_kind, or update the prompt
text (the constant or string near the prompt generation code that lists allowed
kinds) to explicitly include the accepted aliases; locate the parse_kind
function and the prompt text that references the allowed kinds and make the two
consistent (choose strict acceptance in parse_kind or expand the prompt to list
the aliases).
src/openhuman/memory/tree/topic_tree/registry.rs (2)

126-134: ConstraintViolation is broader than UNIQUE violations.

rusqlite::ErrorCode::ConstraintViolation covers UNIQUE, CHECK, NOT NULL, and FOREIGN KEY violations. The fallback string check ("UNIQUE constraint failed") is more precise. Consider checking sqlite_err.extended_code for SQLITE_CONSTRAINT_UNIQUE (2067) for accuracy, or document that the over-broad match is intentional and safe due to the re-query fallback.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/memory/tree/topic_tree/registry.rs` around lines 126 - 134, The
current is_unique_violation function treats any rusqlite::Error::SqliteFailure
with ErrorCode::ConstraintViolation as a UNIQUE error; change it to check the
extended error code for the specific UNIQUE constraint (SQLITE_CONSTRAINT_UNIQUE
= 2067) instead of relying on the broad ErrorCode::ConstraintViolation, i.e.,
when matching rusqlite::Error::SqliteFailure(sqlite_err, _) inspect
sqlite_err.extended_code (compare to 2067 or a named constant) and only return
true for that value; keep the existing string fallback ("UNIQUE constraint
failed") or add a short comment if you intentionally want the broader match.

140-192: Consider making row_to_tree shared rather than duplicating.

The comment acknowledges this duplication. A cleaner approach would be to make source_tree::store::row_to_tree pub(crate) and reuse it here, reducing maintenance burden when the schema evolves.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/memory/tree/topic_tree/registry.rs` around lines 140 - 192, The
local duplicate row_to_tree_loose implementation should be removed and the
shared parser reused: make source_tree::store::row_to_tree pub(crate) (or pub)
so this module can call source_tree::store::row_to_tree instead of duplicating
logic, then replace usages of row_to_tree_loose with a direct call to
source_tree::store::row_to_tree (adjust imports/path as needed) and delete the
local row_to_tree_loose function to avoid maintenance drift; ensure function
signature and error types match (or adapt call sites) and update any
tests/imports that reference the old private function.
src/openhuman/memory/tree/source_tree/store.rs (1)

313-340: upsert_buffer_tx generates now_ms internally.

This means the updated_at_ms is set when the function is called, not when the transaction commits. For buffer operations this is fine, but note that multiple buffer upserts in the same transaction will have slightly different updated_at_ms values. If consistent timestamps across a transaction are needed, consider passing now_ms as a parameter.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/memory/tree/source_tree/store.rs` around lines 313 - 340, The
upsert_buffer_tx currently calls Utc::now() inside the function, causing
different updated_at_ms for multiple upserts in the same transaction; change the
signature of upsert_buffer_tx to accept a now_ms parameter (e.g.,
upsert_buffer_tx(tx: &Transaction<'_>, buf: &Buffer, now_ms: i64)) and use that
value instead of computing Utc::now() locally, then update all callers to
compute a single now_ms before beginning the transaction (or at transaction
start) and pass it to each upsert_buffer_tx so all buffer rows in the same
transaction share a consistent updated_at_ms.
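The suggested fix is just threading one timestamp through the transaction. A minimal sketch, with Buffer and the store reduced to in-memory stand-ins:

```rust
// Sketch of the suggested fix: compute now_ms once per transaction and pass
// it to every upsert so all rows share one updated_at_ms. Buffer and the
// Vec-backed "table" are stand-ins for the real store.
struct Buffer { id: u32 }

fn upsert_buffer_tx(rows: &mut Vec<(u32, i64)>, buf: &Buffer, now_ms: i64) {
    rows.push((buf.id, now_ms)); // updated_at_ms comes from the caller
}

fn main() {
    let now_ms = 1_700_000_000_000; // computed once, before the transaction
    let mut rows = Vec::new();
    for id in 0..3 {
        upsert_buffer_tx(&mut rows, &Buffer { id }, now_ms);
    }
    // Every row written in this "transaction" carries the same timestamp.
    assert!(rows.iter().all(|&(_, ts)| ts == now_ms));
}
```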
src/openhuman/memory/tree/topic_tree/types.rs (2)

23-32: Consider making thresholds configurable.

The hardcoded constants TOPIC_CREATION_THRESHOLD=10.0 and TOPIC_ARCHIVE_THRESHOLD=2.0 are reasonable defaults, but different deployments might need tuning. Consider moving these to Config with defaults, or document that they're intentionally fixed for Phase 3c.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/memory/tree/topic_tree/types.rs` around lines 23 - 32, The
magic-number thresholds TOPIC_CREATION_THRESHOLD and TOPIC_ARCHIVE_THRESHOLD
should be made configurable instead of hardcoded; add them as configurable
fields (e.g., topic_creation_threshold: f32 and topic_archive_threshold: f32) on
your existing Config (or create a TopicThresholds sub-struct), provide the
current values as defaults (10.0 and 2.0), and update all uses of
TOPIC_CREATION_THRESHOLD, TOPIC_ARCHIVE_THRESHOLD (and optionally
TOPIC_RECHECK_EVERY) to read from Config (or the new sub-struct) so deployments
can override via configuration.
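A sketch of what the configurable variant could look like, keeping the Phase 3c values as defaults; the struct and field names are assumptions, not the crate's real Config.

```rust
// Sketch of moving the hardcoded thresholds onto config with the current
// values as defaults, as the nitpick suggests. Names are assumptions.
struct TopicThresholds {
    creation: f64,
    archive: f64,
    recheck_every: u64,
}

impl Default for TopicThresholds {
    fn default() -> Self {
        // Phase 3c defaults: spawn at 10.0, archive at 2.0, recheck every 100.
        Self { creation: 10.0, archive: 2.0, recheck_every: 100 }
    }
}

fn main() {
    let defaults = TopicThresholds::default();
    assert_eq!(defaults.creation, 10.0);
    assert_eq!(defaults.recheck_every, 100);
    // A deployment override tunes one knob and keeps the rest at defaults.
    let tuned = TopicThresholds { creation: 6.5, ..Default::default() };
    assert_eq!(tuned.archive, 2.0);
}
```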

79-93: Add a doc comment to fresh() explaining that callers must initialize mention_count_30d.

Verification confirms that all callers (store.rs, routing.rs, curator.rs) correctly increment mention_count_30d after calling fresh(), and the test explicitly validates the 0 initialization. This is intentional design—fresh() creates a skeleton, and callers are responsible for populating mention counts. Add explicit documentation to make this contract clear to future maintainers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/memory/tree/topic_tree/types.rs` around lines 79 - 93, Add a
doc comment to the HotnessCounters::fresh constructor documenting that it
returns a skeleton HotnessCounters and that callers are required to
initialize/increment mention_count_30d after calling fresh(); explicitly mention
that other fields are zero/None by design (mention_count_30d = 0,
distinct_sources = 0, last_seen_ms = None, etc.) so future maintainers
understand the contract enforced by callers like store.rs, routing.rs and
curator.rs.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/openhuman/memory/tree/score/extract/llm.rs`:
- Around line 184-187: The debug log uses envelope.message.content.len() (byte
length) which is inconsistent with the earlier use of text.chars().count()
(character count) and misreports lengths for non-ASCII content; change the
logging to use envelope.message.content.chars().count() instead so the
log::debug call reports character count consistently (replace
envelope.message.content.len() with envelope.message.content.chars().count()).

In `@src/openhuman/memory/tree/score/extract/types.rs`:
- Around line 109-116: ExtractedEntities::is_empty currently only checks
entities and topics, so extractor output that includes only llm_importance or
llm_importance_reason is treated as empty and dropped; update the is_empty
method on the ExtractedEntities type to also return false when
llm_importance.is_some() or llm_importance_reason.as_deref().map(str::is_empty)
== Some(false) (i.e., non-empty string), ensuring the new fields (llm_importance
and llm_importance_reason) are considered when deciding emptiness.

In `@src/openhuman/memory/tree/score/mod.rs`:
- Around line 154-165: The current logic sets llm_consulted = true whenever
llm.extract(&scoring_content) succeeds, causing the "full combine" path to be
used even if the extractor returned llm_importance = None; update the gating so
that you only take the full combine branch when
extracted.llm_importance.is_some() rather than on success of llm.extract.
Concretely, after merging into extracted in the Ok(more) branch of llm.extract,
set llm_consulted (or the condition used later for the full combine) only if
extracted.llm_importance.is_some(), and ensure the recompute of signals via
self::signals::compute(&chunk.metadata, &scoring_content, scoring_token_count,
&extracted) remains correct for the entity/topic-only case.

In `@src/openhuman/memory/tree/score/store.rs`:
- Around line 233-263: index_summary_entity_ids_tx is currently writing the full
canonical id into the entity_kind column which breaks lookup_entity (it expects
values parsable by EntityKind::parse); change the value passed for entity_kind
in the stmt.execute call to a parseable kind (e.g. extract the prefix before the
first ':' from canonical_id, or call an existing helper that returns the
EntityKind string) instead of writing the entire canonical_id, leaving surface
as the full canonical_id and keeping other params the same so lookup_entity and
EntityKind::parse receive valid input.

In `@src/openhuman/memory/tree/source_tree/summariser/inert.rs`:
- Around line 80-90: truncate_to_budget currently uses a single char ceiling
(budget * 4) which can still yield token counts > budget; update
truncate_to_budget to enforce tokens <= budget by iteratively shrinking the
string until approx_token_count(&truncated) <= budget. Use the existing
approx_token_count to measure tokens, reduce the char length by a conservative
step (e.g., subtract a few characters or scale by token/char ratio) in a loop,
and return the final truncated String and its token count; ensure the function
always returns tokens <= budget and keep references to truncate_to_budget and
approx_token_count when locating the change.
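A sketch of the iterative variant. approx_token_count is reduced here to a words-based stand-in (the real helper may differ) so the char-ceiling overshoot the comment describes is easy to reproduce.

```rust
// Sketch of the suggested fix: shrink iteratively until the token estimate
// actually fits the budget. The estimator is a stand-in, not the real one.
fn approx_token_count(s: &str) -> usize {
    // Stand-in estimator: one token per whitespace-separated word.
    s.split_whitespace().count()
}

fn truncate_to_budget(text: &str, budget: usize) -> (String, usize) {
    // Start from the old single-pass char ceiling (budget * 4 chars)...
    let mut truncated: String = text.chars().take(budget * 4).collect();
    // ...then keep shrinking until the estimate fits the budget.
    let mut tokens = approx_token_count(&truncated);
    while tokens > budget && !truncated.is_empty() {
        truncated.pop(); // conservative: drop one char and re-measure
        tokens = approx_token_count(&truncated);
    }
    (truncated, tokens)
}

fn main() {
    // Many short words: the char ceiling alone would overshoot the budget.
    let text = "a ".repeat(50);
    let (out, tokens) = truncate_to_budget(&text, 3);
    assert!(tokens <= 3);
    assert!(approx_token_count(&out) <= 3);
}
```

The loop always terminates because the string shrinks by one char per iteration and the empty string has zero tokens.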

In `@src/openhuman/memory/tree/source_tree/summariser/mod.rs`:
- Around line 54-61: The summarise contract currently only advises staying under
ctx.token_budget; make it enforceable by adding a boundary check: after calling
summarise(...) validate SummaryOutput.token_count <= ctx.token_budget and return
an Err if exceeded. Implement this as a default wrapper method on the same trait
(e.g. summarise_checked or have summarise call a private summarise_inner) so
implementors still implement the summarise core (or rename existing method to
summarise_inner) and the public summarise enforces the invariant; reference
SummaryInput, SummaryContext, SummaryOutput, token_count and ctx.token_budget
when adding the check and error return.

In `@src/openhuman/memory/tree/topic_tree/backfill.rs`:
- Around line 88-96: Backfill is building LeafRef entries with per-entity
EntityHit.score and a single entity_id, but live ingest (see ingest.rs) sets
LeafRef.score to the chunk admission score (result.total) and includes all
canonical entities; update the backfill construction in LeafRef (in backfill.rs)
to fetch and use the chunk's admission score (the same value used as
result.total during ingest — e.g., from the chunk record or its
metadata/admission field) instead of EntityHit.score, and populate entities with
the full canonical entity list for that chunk (not just vec![entity_id]) so
leaves match live ingest behavior.
- Around line 44-106: The backfill loop does synchronous SQLite reads on the
async worker via lookup_entity and repeated get_chunk calls, which can block
Tokio; move all DB work off the async executor by collecting hits and
materialising required chunks inside a blocking context (e.g., spawn_blocking or
a single blocking transaction) before entering the async append_leaf(tree,
summariser). Specifically: call lookup_entity inside spawn_blocking (or open one
blocking transaction) to get hits, then inside the same blocking scope fetch and
build the Vec<LeafRef> (using get_chunk and skipping missing/non-leaf hits), and
only after that return to the async context to iterate and await append_leaf;
this preserves the existing append_leaf, LeafRef, and hit filtering logic but
prevents blocking during the await loop.

In `@src/openhuman/memory/tree/topic_tree/curator.rs`:
- Around line 67-88: The current flow reads and mutates counters via
get_or_fresh then upserts, which can race and clobber
mention_count_30d/ingests_since_check; change the bump/reset logic to be atomic
by performing the increment and last_seen/last_updated updates inside a single
DB transaction or an in-place SQL UPDATE/RETURNING before evaluating
TOPIC_RECHECK_EVERY. Specifically, replace the read-modify-upsert sequence
around get_or_fresh, the fields counters.mention_count_30d,
counters.ingests_since_check, counters.last_seen_ms, and
counters.last_updated_ms, and the subsequent upsert() + branch that checks
TOPIC_RECHECK_EVERY with a single DB operation that returns the updated counters
(or lock the row in a transaction), then use that returned state to decide
whether to call run_full_recompute.

In `@src/openhuman/memory/tree/topic_tree/routing.rs`:
- Around line 78-104: The append to the topic-specific tree currently uses
append_leaf(...)? which returns early on error and prevents
maybe_spawn_topic_tree(...) from running; change this to call append_leaf inside
a local match or if let Err(e) pattern, log the failure (including context like
leaf.chunk_id, tree.id and entity_id) and do not propagate the error, then
continue to always call maybe_spawn_topic_tree(config, entity_id, summariser).
Locate the append call in the block that builds topic_leaf (LeafRef) and replace
the ? propagation with local error handling so the curator tick
(maybe_spawn_topic_tree) runs regardless of append_leaf outcome.

In `@src/openhuman/memory/tree/topic_tree/store.rs`:
- Around line 97-109: distinct_sources_for currently counts every distinct
tree_id from mem_tree_entity_index which includes topic/global summaries and
causes entities to count their own topic tree; modify the SQL in
distinct_sources_for to only count tree_ids that correspond to actual source
trees by joining mem_tree (or whatever table holds tree metadata) and filtering
on its kind (e.g., WHERE mt.kind NOT IN ('topic','global') or WHERE mt.kind =
'source') so the query becomes a COUNT(DISTINCT mtei.tree_id) with an INNER JOIN
mem_tree mt ON mt.id = mtei.tree_id and the additional kind filter; keep
function name distinct_sources_for and table mem_tree_entity_index to locate and
update the query.

---

Nitpick comments:
In `@src/openhuman/memory/tree/score/extract/llm.rs`:
- Around line 369-379: The parser function parse_kind currently accepts extra
aliases ("people", "organisation", "org", "place", "loc", "miscellaneous",
"other") that are not listed in the prompt; either tighten parse_kind to only
accept the exact canonical strings from the prompt ("person", "organization",
"location", "event", "product") by removing aliases from parse_kind, or update
the prompt text (the constant or string near the prompt generation code that
lists allowed kinds) to explicitly include the accepted aliases; locate the
parse_kind function and the prompt text that references the allowed kinds and
make the two consistent (choose strict acceptance in parse_kind or expand the
prompt to list the aliases).

In `@src/openhuman/memory/tree/source_tree/bucket_seal.rs`:
- Around line 385-388: In hydrate_leaf_inputs (bucket_seal.rs) the match that
sets (score_value, entities, topics) always uses Vec::new() for entities; update
the Some(row) arm to propagate entity IDs from the ScoreResult (e.g.,
row.entities or row.entity_ids) instead of an empty Vec so leaf-level entity
info flows into summaries; keep the None arm as is, and ensure any type
conversions required for the entities variable match the expected type used
downstream in hydrate_leaf_inputs and subsequent summariser code.
- Line 91: The call to conn.unchecked_transaction() can create nested
transactions silently if connections are later pooled/reused; update the code in
bucket_seal.rs around the conn.unchecked_transaction() usage to either (a)
replace unchecked_transaction() with conn.transaction() and handle the potential
error return, or (b) if you intentionally need unchecked behavior, add a clear
inline comment referencing with_connection() and why a fresh connection is
guaranteed (so future maintainers won't accidentally introduce
nested-transaction bugs). Ensure you update the surrounding error handling to
match the chosen API (transaction() returns a Result) and keep the transaction
lifecycle tied to the connection usage.

In `@src/openhuman/memory/tree/source_tree/store.rs`:
- Around line 313-340: The upsert_buffer_tx currently calls Utc::now() inside
the function, causing different updated_at_ms for multiple upserts in the same
transaction; change the signature of upsert_buffer_tx to accept a now_ms
parameter (e.g., upsert_buffer_tx(tx: &Transaction<'_>, buf: &Buffer, now_ms:
i64)) and use that value instead of computing Utc::now() locally, then update
all callers to compute a single now_ms before beginning the transaction (or at
transaction start) and pass it to each upsert_buffer_tx so all buffer rows in
the same transaction share a consistent updated_at_ms.

In `@src/openhuman/memory/tree/topic_tree/registry.rs`:
- Around line 126-134: The current is_unique_violation function treats any
rusqlite::Error::SqliteFailure with ErrorCode::ConstraintViolation as a UNIQUE
error; change it to check the extended error code for the specific UNIQUE
constraint (SQLITE_CONSTRAINT_UNIQUE = 2067) instead of relying on the broad
ErrorCode::ConstraintViolation, i.e., when matching
rusqlite::Error::SqliteFailure(sqlite_err, _) inspect sqlite_err.extended_code
(compare to 2067 or a named constant) and only return true for that value; keep
the existing string fallback ("UNIQUE constraint failed") or add a short comment
if you intentionally want the broader match.
- Around line 140-192: The local duplicate row_to_tree_loose implementation
should be removed and the shared parser reused: make
source_tree::store::row_to_tree pub(crate) (or pub) so this module can call
source_tree::store::row_to_tree instead of duplicating logic, then replace
usages of row_to_tree_loose with a direct call to
source_tree::store::row_to_tree (adjust imports/path as needed) and delete the
local row_to_tree_loose function to avoid maintenance drift; ensure function
signature and error types match (or adapt call sites) and update any
tests/imports that reference the old private function.
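The extended-code narrowing suggested in the first `registry.rs` comment can be sketched like this. The rusqlite error destructuring is elided (the real function matches `rusqlite::Error::SqliteFailure(err, msg)` and reads `err.extended_code`); only the comparison is shown.

```rust
/// SQLite extended result code for a UNIQUE constraint failure:
/// SQLITE_CONSTRAINT (19) | (8 << 8) = 2067.
const SQLITE_CONSTRAINT_UNIQUE: i32 = 2067;

// Hypothetical simplification of is_unique_violation: the extended code
// distinguishes UNIQUE from CHECK/NOT NULL/FK constraint failures, which
// all share the broad ConstraintViolation primary code.
fn is_unique_violation(extended_code: i32, message: Option<&str>) -> bool {
    extended_code == SQLITE_CONSTRAINT_UNIQUE
        // String fallback kept for paths that only surface the message.
        || message.is_some_and(|m| m.contains("UNIQUE constraint failed"))
}

fn main() {
    assert!(is_unique_violation(2067, None));
    // A CHECK-constraint failure (extended code 275) no longer matches.
    assert!(!is_unique_violation(275, None));
}
```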

In `@src/openhuman/memory/tree/topic_tree/types.rs`:
- Around line 23-32: The magic-number thresholds TOPIC_CREATION_THRESHOLD and
TOPIC_ARCHIVE_THRESHOLD should be made configurable instead of hardcoded; add
them as configurable fields (e.g., topic_creation_threshold: f32 and
topic_archive_threshold: f32) on your existing Config (or create a
TopicThresholds sub-struct), provide the current values as defaults (10.0 and
2.0), and update all uses of TOPIC_CREATION_THRESHOLD, TOPIC_ARCHIVE_THRESHOLD
(and optionally TOPIC_RECHECK_EVERY) to read from Config (or the new sub-struct)
so deployments can override via configuration.
- Around line 79-93: Add a doc comment to the HotnessCounters::fresh constructor
documenting that it returns a skeleton HotnessCounters and that callers are
required to initialize/increment mention_count_30d after calling fresh();
explicitly mention that other fields are zero/None by design (mention_count_30d
= 0, distinct_sources = 0, last_seen_ms = None, etc.) so future maintainers
understand the contract enforced by callers like store.rs, routing.rs and
curator.rs.
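One possible shape for the thresholds sub-struct suggested in the first `types.rs` comment. Field names follow the comment's suggestion; wiring into the real `Config` is left out.

```rust
/// Topic-tree tuning knobs, overridable via configuration.
#[derive(Debug, Clone)]
pub struct TopicThresholds {
    pub topic_creation_threshold: f32,
    pub topic_archive_threshold: f32,
    pub topic_recheck_every: u32,
}

impl Default for TopicThresholds {
    fn default() -> Self {
        // Defaults preserve the previously hardcoded constants.
        Self {
            topic_creation_threshold: 10.0,
            topic_archive_threshold: 2.0,
            topic_recheck_every: 100,
        }
    }
}

fn main() {
    let t = TopicThresholds::default();
    assert_eq!(t.topic_creation_threshold, 10.0);
    assert_eq!(t.topic_recheck_every, 100);
}
```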

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e7fd9c00-8616-42d7-87db-1a7dbb367945

📥 Commits

Reviewing files that changed from the base of the PR and between b3b451f and 19c21ad.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (29)
  • src/openhuman/memory/tree/ingest.rs
  • src/openhuman/memory/tree/mod.rs
  • src/openhuman/memory/tree/score/extract/llm.rs
  • src/openhuman/memory/tree/score/extract/mod.rs
  • src/openhuman/memory/tree/score/extract/regex.rs
  • src/openhuman/memory/tree/score/extract/types.rs
  • src/openhuman/memory/tree/score/mod.rs
  • src/openhuman/memory/tree/score/resolver.rs
  • src/openhuman/memory/tree/score/signals/mod.rs
  • src/openhuman/memory/tree/score/signals/ops.rs
  • src/openhuman/memory/tree/score/signals/types.rs
  • src/openhuman/memory/tree/score/store.rs
  • src/openhuman/memory/tree/source_tree/bucket_seal.rs
  • src/openhuman/memory/tree/source_tree/flush.rs
  • src/openhuman/memory/tree/source_tree/mod.rs
  • src/openhuman/memory/tree/source_tree/registry.rs
  • src/openhuman/memory/tree/source_tree/store.rs
  • src/openhuman/memory/tree/source_tree/summariser/inert.rs
  • src/openhuman/memory/tree/source_tree/summariser/mod.rs
  • src/openhuman/memory/tree/source_tree/types.rs
  • src/openhuman/memory/tree/store.rs
  • src/openhuman/memory/tree/topic_tree/backfill.rs
  • src/openhuman/memory/tree/topic_tree/curator.rs
  • src/openhuman/memory/tree/topic_tree/hotness.rs
  • src/openhuman/memory/tree/topic_tree/mod.rs
  • src/openhuman/memory/tree/topic_tree/registry.rs
  • src/openhuman/memory/tree/topic_tree/routing.rs
  • src/openhuman/memory/tree/topic_tree/store.rs
  • src/openhuman/memory/tree/topic_tree/types.rs

Comment on lines +184 to +187
log::debug!(
"[memory_tree::extract::llm] response chars={}",
envelope.message.content.len()
);

⚠️ Potential issue | 🟡 Minor

Inconsistent character vs byte count in logging.

Line 149 correctly uses text.chars().count() for character count, but line 186 uses envelope.message.content.len() which returns byte length. For consistency and accurate logging with non-ASCII content, use .chars().count() here too.

Proposed fix
         log::debug!(
-            "[memory_tree::extract::llm] response chars={}",
-            envelope.message.content.len()
+            "[memory_tree::extract::llm] response chars={}",
+            envelope.message.content.chars().count()
         );
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/memory/tree/score/extract/llm.rs` around lines 184 - 187, The
debug log uses envelope.message.content.len() (byte length) which is
inconsistent with the earlier use of text.chars().count() (character count) and
misreports lengths for non-ASCII content; change the logging to use
envelope.message.content.chars().count() instead so the log::debug call reports
character count consistently (replace envelope.message.content.len() with
envelope.message.content.chars().count()).
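The byte/character distinction only diverges for non-ASCII content; a quick illustration:

```rust
fn main() {
    let s = "café"; // 'é' encodes as two bytes in UTF-8
    assert_eq!(s.len(), 5);           // byte length
    assert_eq!(s.chars().count(), 4); // character count
}
```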

Comment on lines +109 to +116
/// Optional LLM-rated importance in `[0.0, 1.0]` for this chunk.
/// `None` means no LLM signal is available.
#[serde(default)]
pub llm_importance: Option<f32>,
/// One-line audit trail from the LLM explaining the importance rating.
/// Used purely for diagnostics; never feeds back into scoring.
#[serde(default)]
pub llm_importance_reason: Option<String>,

⚠️ Potential issue | 🟠 Major

Treat LLM-only extractor output as non-empty.

ExtractedEntities::is_empty() still only checks entities and topics. With these new fields, an extractor that returns only llm_importance / llm_importance_reason is still treated as empty, so the new signal can be dropped before scoring.

Possible fix
 pub fn is_empty(&self) -> bool {
-    self.entities.is_empty() && self.topics.is_empty()
+    self.entities.is_empty()
+        && self.topics.is_empty()
+        && self.llm_importance.is_none()
+        && self.llm_importance_reason.is_none()
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/memory/tree/score/extract/types.rs` around lines 109 - 116,
ExtractedEntities::is_empty currently only checks entities and topics, so
extractor output that includes only llm_importance or llm_importance_reason is
treated as empty and dropped; update the is_empty method on the
ExtractedEntities type to also return false when llm_importance.is_some() or
llm_importance_reason.as_deref().map(str::is_empty) == Some(false) (i.e.,
non-empty string), ensuring the new fields (llm_importance and
llm_importance_reason) are considered when deciding emptiness.
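A sketch of the widened check, with the struct trimmed to the fields in question (the real `ExtractedEntities` carries more):

```rust
#[derive(Default)]
struct ExtractedEntities {
    entities: Vec<String>,
    topics: Vec<String>,
    llm_importance: Option<f32>,
    llm_importance_reason: Option<String>,
}

impl ExtractedEntities {
    fn is_empty(&self) -> bool {
        self.entities.is_empty()
            && self.topics.is_empty()
            && self.llm_importance.is_none()
            && self.llm_importance_reason.is_none()
    }
}

fn main() {
    // An importance-only result is no longer dropped as "empty".
    let e = ExtractedEntities {
        llm_importance: Some(0.7),
        ..Default::default()
    };
    assert!(!e.is_empty());
    assert!(ExtractedEntities::default().is_empty());
}
```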

Comment on lines +154 to +165
match llm.extract(&scoring_content).await {
Ok(more) => {
extracted.merge(more);
// Recompute signals so llm_importance flows in.
signals = self::signals::compute(
&chunk.metadata,
&scoring_content,
scoring_token_count,
&extracted,
);
true
}

⚠️ Potential issue | 🟠 Major

Only use the full combine when an LLM importance value actually exists.

llm_consulted flips to true after any successful extractor call, but the extractor contract allows llm_importance: None. In that case Lines 205-209 still divide by the full weight set and silently lower borderline scores even though the LLM only contributed entities/topics. Please gate the full combine on extracted.llm_importance.is_some() instead of “the LLM call succeeded”.

🔧 Minimal fix sketch
             match llm.extract(&scoring_content).await {
                 Ok(more) => {
                     extracted.merge(more);
                     // Recompute signals so llm_importance flows in.
                     signals = self::signals::compute(
                         &chunk.metadata,
                         &scoring_content,
                         scoring_token_count,
                         &extracted,
                     );
-                    true
+                    extracted.llm_importance.is_some()
                 }
@@
-    let total = if llm_consulted {
+    let total = if llm_consulted {
         self::signals::combine(&signals, &cfg.weights)
     } else {
         self::signals::combine_cheap_only(&signals, &cfg.weights)
     };

Also applies to: 205-209

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/memory/tree/score/mod.rs` around lines 154 - 165, The current
logic sets llm_consulted = true whenever llm.extract(&scoring_content) succeeds,
causing the "full combine" path to be used even if the extractor returned
llm_importance = None; update the gating so that you only take the full combine
branch when extracted.llm_importance.is_some() rather than on success of
llm.extract. Concretely, after merging into extracted in the Ok(more) branch of
llm.extract, set llm_consulted (or the condition used later for the full
combine) only if extracted.llm_importance.is_some(), and ensure the recompute of
signals via self::signals::compute(&chunk.metadata, &scoring_content,
scoring_token_count, &extracted) remains correct for the entity/topic-only case.
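The dilution problem is easiest to see in a reduced two-weight model. `combine` and its parameters are hypothetical stand-ins; the real code combines more signals, but the gating principle is the same.

```rust
// Full combine divides by the full weight set; the cheap-only path must not.
fn combine(cheap: f32, llm_importance: Option<f32>, w_cheap: f32, w_llm: f32) -> f32 {
    match llm_importance {
        Some(imp) => (cheap * w_cheap + imp * w_llm) / (w_cheap + w_llm),
        // No LLM importance value: never dilute by the unused weight.
        None => cheap,
    }
}

fn main() {
    // Gating on Some(...) keeps an entities-only LLM reply from halving the score.
    assert!((combine(0.8, None, 1.0, 1.0) - 0.8).abs() < 1e-6);
    assert!((combine(0.8, Some(0.4), 1.0, 1.0) - 0.6).abs() < 1e-6);
}
```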

Comment thread src/openhuman/memory/tree/score/store.rs
Comment on lines +80 to +90
fn truncate_to_budget(text: &str, budget: u32) -> (String, u32) {
let initial = approx_token_count(text);
if initial <= budget {
return (text.to_string(), initial);
}
// Character ceiling derived from the same ~4 chars/token heuristic.
let char_ceiling = (budget as usize).saturating_mul(4);
let truncated: String = text.chars().take(char_ceiling).collect();
let tokens = approx_token_count(&truncated);
(truncated, tokens)
}

⚠️ Potential issue | 🟠 Major

truncate_to_budget does not guarantee budget compliance.

The current one-shot budget * 4 clamp can still return token_count > budget (your test at Line 160 already tolerates overflow). This should strictly enforce <= budget.

Proposed strict truncation fix
 fn truncate_to_budget(text: &str, budget: u32) -> (String, u32) {
+    if budget == 0 {
+        return (String::new(), 0);
+    }
     let initial = approx_token_count(text);
     if initial <= budget {
         return (text.to_string(), initial);
     }
-    // Character ceiling derived from the same ~4 chars/token heuristic.
+    // Character ceiling derived from the same ~4 chars/token heuristic.
     let char_ceiling = (budget as usize).saturating_mul(4);
-    let truncated: String = text.chars().take(char_ceiling).collect();
-    let tokens = approx_token_count(&truncated);
-    (truncated, tokens)
+    let mut truncated: String = text.chars().take(char_ceiling).collect();
+    let mut tokens = approx_token_count(&truncated);
+    while tokens > budget && !truncated.is_empty() {
+        truncated.pop(); // UTF-8 safe (char boundary)
+        tokens = approx_token_count(&truncated);
+    }
+    (truncated, tokens)
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/memory/tree/source_tree/summariser/inert.rs` around lines 80 -
90, truncate_to_budget currently uses a single char ceiling (budget * 4) which
can still yield token counts > budget; update truncate_to_budget to enforce
tokens <= budget by iteratively shrinking the string until
approx_token_count(&truncated) <= budget. Use the existing approx_token_count to
measure tokens, reduce the char length by a conservative step (e.g., subtract a
few characters or scale by token/char ratio) in a loop, and return the final
truncated String and its token count; ensure the function always returns tokens
<= budget and keep references to truncate_to_budget and approx_token_count when
locating the change.
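A self-contained version of the strict loop. Here `approx_token_count` is a stand-in using the same ~4 chars/token heuristic; the real helper may round differently, which is exactly why the trailing shrink loop is the safety net.

```rust
// Stand-in for the project's heuristic token counter (assumption).
fn approx_token_count(text: &str) -> u32 {
    (text.chars().count() as u32).div_ceil(4)
}

fn truncate_to_budget(text: &str, budget: u32) -> (String, u32) {
    if budget == 0 {
        return (String::new(), 0);
    }
    let initial = approx_token_count(text);
    if initial <= budget {
        return (text.to_string(), initial);
    }
    // Character ceiling from the same ~4 chars/token heuristic.
    let char_ceiling = (budget as usize).saturating_mul(4);
    let mut truncated: String = text.chars().take(char_ceiling).collect();
    let mut tokens = approx_token_count(&truncated);
    // Shrink until the estimate actually fits; pop() is char-boundary safe.
    while tokens > budget && !truncated.is_empty() {
        truncated.pop();
        tokens = approx_token_count(&truncated);
    }
    (truncated, tokens)
}

fn main() {
    let long = "a".repeat(100);
    let (s, tokens) = truncate_to_budget(&long, 10);
    assert!(tokens <= 10);
    assert!(s.chars().count() <= 40);
    assert_eq!(truncate_to_budget("anything", 0), (String::new(), 0));
}
```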

Comment on lines +44 to +106
let hits = lookup_entity(config, entity_id, Some(BACKFILL_LIMIT))
.with_context(|| format!("failed to lookup entity {entity_id}"))?;

if hits.is_empty() {
log::debug!(
"[topic_tree::backfill] no entity-index hits for entity_id={} — empty backfill",
entity_id
);
return Ok(0);
}

// Sort by timestamp ASC so the buffer's `oldest_at` and the sealed
// summary's `time_range_start` reflect the true historical order, not
// the DESC ordering `lookup_entity` returns.
let mut hits = hits;
hits.sort_by_key(|h| h.timestamp_ms);

let mut appended = 0usize;
for hit in hits {
// Skip summary-node hits — Phase 3c backfill only routes raw leaves
// into the topic tree. Including summary nodes would fold
// summaries-of-summaries across unrelated sources, which defeats
// the point.
if hit.node_kind != "leaf" {
log::debug!(
"[topic_tree::backfill] skipping non-leaf hit node_id={} kind={}",
hit.node_id,
hit.node_kind
);
continue;
}

let chunk = match get_chunk(config, &hit.node_id)? {
Some(c) => c,
None => {
log::warn!(
"[topic_tree::backfill] missing chunk {} for entity {} — skipping",
hit.node_id,
entity_id
);
continue;
}
};

let leaf = LeafRef {
chunk_id: chunk.id.clone(),
token_count: chunk.token_count,
timestamp: chunk.metadata.timestamp,
content: chunk.content.clone(),
entities: vec![entity_id.to_string()],
topics: chunk.metadata.tags.clone(),
score: hit.score,
};

append_leaf(config, tree, &leaf, summariser)
.await
.with_context(|| {
format!(
"backfill append_leaf failed tree_id={} chunk_id={}",
tree.id, chunk.id
)
})?;
appended += 1;

⚠️ Potential issue | 🟠 Major

Move the SQLite read phase off the async worker.

lookup_entity() and every get_chunk() call go through blocking SQLite helpers, so materialising one hot topic tree can do 1 + up to 500 synchronous DB reads inline on the Tokio worker. Because this path runs from ingest, a large backfill can stall unrelated async work. Please move the read phase behind spawn_blocking or batch it inside one blocking transaction before the async append_leaf(...).await loop.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/memory/tree/topic_tree/backfill.rs` around lines 44 - 106, The
backfill loop does synchronous SQLite reads on the async worker via
lookup_entity and repeated get_chunk calls, which can block Tokio; move all DB
work off the async executor by collecting hits and materialising required chunks
inside a blocking context (e.g., spawn_blocking or a single blocking
transaction) before entering the async append_leaf(tree, summariser).
Specifically: call lookup_entity inside spawn_blocking (or open one blocking
transaction) to get hits, then inside the same blocking scope fetch and build
the Vec<LeafRef> (using get_chunk and skipping missing/non-leaf hits), and only
after that return to the async context to iterate and await append_leaf; this
preserves the existing append_leaf, LeafRef, and hit filtering logic but
prevents blocking during the await loop.
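The two-phase shape, reduced to plain functions. The hit tuples and `LeafRef` are simplified stand-ins; in the real code the read phase would run under `tokio::task::spawn_blocking` before the `append_leaf(...).await` loop.

```rust
struct LeafRef {
    chunk_id: String,
}

// Phase 1 (blocking): the lookup_entity + get_chunk equivalents run here,
// inside one blocking scope, so no SQLite call lands on a Tokio worker
// between awaits. Non-leaf hits are filtered out in the same pass.
fn read_phase(hits: Vec<(String, &'static str)>) -> Vec<LeafRef> {
    hits.into_iter()
        .filter(|(_, kind)| *kind == "leaf") // skip summary-node hits
        .map(|(node_id, _)| LeafRef { chunk_id: node_id })
        .collect()
}

fn main() {
    // Phase 2 (async, not shown): iterate the Vec and await append_leaf per item.
    let leaves = read_phase(vec![
        ("c1".into(), "leaf"),
        ("s1".into(), "summary"),
        ("c2".into(), "leaf"),
    ]);
    assert_eq!(leaves.len(), 2);
    assert_eq!(leaves[0].chunk_id, "c1");
}
```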

Comment thread src/openhuman/memory/tree/topic_tree/backfill.rs
Comment on lines +67 to +88
let mut counters = get_or_fresh(config, entity_id)?;

// 2. Cheap per-ingest bumps.
counters.mention_count_30d = counters.mention_count_30d.saturating_add(1);
counters.last_seen_ms = Some(now_ms);
counters.ingests_since_check = counters.ingests_since_check.saturating_add(1);
counters.last_updated_ms = now_ms;

// 3. Decide whether to run the full recompute.
if counters.ingests_since_check < TOPIC_RECHECK_EVERY {
upsert(config, &counters)?;
log::debug!(
"[topic_tree::curator] bumped counters entity={} mentions={} ingests_since_check={}",
entity_id,
counters.mention_count_30d,
counters.ingests_since_check
);
return Ok(SpawnOutcome::CountersBumped);
}

// 4. Full recompute.
run_full_recompute(config, entity_id, &mut counters, now_ms, summariser).await

⚠️ Potential issue | 🟠 Major

Make the hotness counter updates atomic.

Both entrypoints read the current row, mutate it in memory, and then overwrite the whole record later. If two ingests touch the same entity concurrently, one worker can clobber the other's mention_count_30d / ingests_since_check, which also makes the cadence gate and spawn path nondeterministic. Please move the bump/reset flow into a single transaction or an in-place SQL update before deciding whether to spawn/backfill.

Also applies to: 100-102, 159-160

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/memory/tree/topic_tree/curator.rs` around lines 67 - 88, The
current flow reads and mutates counters via get_or_fresh then upserts, which can
race and clobber mention_count_30d/ingests_since_check; change the bump/reset
logic to be atomic by performing the increment and last_seen/last_updated
updates inside a single DB transaction or an in-place SQL UPDATE/RETURNING
before evaluating TOPIC_RECHECK_EVERY. Specifically, replace the
read-modify-upsert sequence around get_or_fresh, the fields
counters.mention_count_30d, counters.ingests_since_check, counters.last_seen_ms,
and counters.last_updated_ms, and the subsequent upsert() + branch that checks
TOPIC_RECHECK_EVERY with a single DB operation that returns the updated counters
(or lock the row in a transaction), then use that returned state to decide
whether to call run_full_recompute.
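One atomic shape for the bump: a single in-place UPDATE with RETURNING. The column names follow the PR's counters; the exact DDL is an assumption, and RETURNING needs SQLite ≥ 3.35.

```rust
// The whole read-modify-write collapses into one statement; concurrent
// ingests then serialize on the row instead of clobbering each other.
const BUMP_SQL: &str = "
    UPDATE mem_tree_entity_hotness
       SET mention_count_30d   = mention_count_30d + 1,
           ingests_since_check = ingests_since_check + 1,
           last_seen_ms        = ?2,
           last_updated_ms     = ?2
     WHERE entity_id = ?1
 RETURNING mention_count_30d, ingests_since_check;
";

fn main() {
    // The returned ingests_since_check is what the TOPIC_RECHECK_EVERY gate reads.
    assert!(BUMP_SQL.contains("RETURNING"));
}
```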

Comment thread src/openhuman/memory/tree/topic_tree/routing.rs
Comment on lines +97 to +109
pub fn distinct_sources_for(config: &Config, entity_id: &str) -> Result<u32> {
with_connection(config, |conn| {
let n: i64 = conn
.query_row(
"SELECT COUNT(DISTINCT tree_id)
FROM mem_tree_entity_index
WHERE entity_id = ?1 AND tree_id IS NOT NULL",
params![entity_id],
|r| r.get(0),
)
.context("failed to count distinct sources")?;
Ok(n.max(0) as u32)
})

⚠️ Potential issue | 🟠 Major

Filter distinct_sources_for() down to source trees.

This query counts every distinct tree_id in mem_tree_entity_index. But src/openhuman/memory/tree/source_tree/bucket_seal.rs:245-326 indexes summary entities with the current tree id for all tree kinds, so once topic-tree summaries exist the entity will count its own topic tree as an extra “source”. That artificially boosts hotness and can skew later archive/recheck decisions.

🔧 Query shape to avoid counting topic/global trees
-        let n: i64 = conn
-            .query_row(
-                "SELECT COUNT(DISTINCT tree_id)
-                   FROM mem_tree_entity_index
-                  WHERE entity_id = ?1 AND tree_id IS NOT NULL",
-                params![entity_id],
-                |r| r.get(0),
-            )
+        let n: i64 = conn
+            .query_row(
+                "SELECT COUNT(DISTINCT t.scope)
+                   FROM mem_tree_entity_index idx
+                   JOIN mem_tree_trees t ON t.id = idx.tree_id
+                  WHERE idx.entity_id = ?1
+                    AND t.kind = 'source'",
+                params![entity_id],
+                |r| r.get(0),
+            )
             .context("failed to count distinct sources")?;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/memory/tree/topic_tree/store.rs` around lines 97 - 109,
distinct_sources_for currently counts every distinct tree_id from
mem_tree_entity_index which includes topic/global summaries and causes entities
to count their own topic tree; modify the SQL in distinct_sources_for to only
count tree_ids that correspond to actual source trees by joining mem_tree (or
whatever table holds tree metadata) and filtering on its kind (e.g., WHERE
mt.kind NOT IN ('topic','global') or WHERE mt.kind = 'source') so the query
becomes a COUNT(DISTINCT mtei.tree_id) with an INNER JOIN mem_tree mt ON mt.id =
mtei.tree_id and the additional kind filter; keep function name
distinct_sources_for and table mem_tree_entity_index to locate and update the
query.

…humansai#709)

Phase 3c of the memory architecture (umbrella tinyhumansai#711). Adds per-entity
topic trees spawned lazily when an entity's hotness crosses a threshold,
so Phase 4 retrieval can resolve "what did Alice say about Phoenix?"
without scanning the full chunk pool.

## What's in this commit

New module `src/openhuman/memory/tree/topic_tree/`:

- `types.rs` — `EntityIndexStats` (hotness input), `HotnessCounters` (the
  persisted row) and the `TOPIC_CREATION_THRESHOLD=10.0`,
  `TOPIC_ARCHIVE_THRESHOLD=2.0`, `TOPIC_RECHECK_EVERY=100` constants.
- `hotness.rs` — pure arithmetic scorer plus deterministic
  `hotness_at(entity_id, stats, now_ms)` variant for tests, and a
  piecewise-linear `recency_decay` helper.
- `store.rs` — SQLite helpers (`get`, `get_or_fresh`, `upsert`,
  `distinct_sources_for`, `count`) for the new `mem_tree_entity_hotness`
  table.
- `registry.rs` — `get_or_create_topic_tree`, `force_create_topic_tree`,
  `list_topic_trees`, `archive_topic_tree`. Race-recovers on UNIQUE
  violations via `is_unique_violation`, mirroring `source_tree::registry`.
- `curator.rs` — `maybe_spawn_topic_tree` (per-ingest tick) and
  `force_recompute` (admin path). Implements the
  `TOPIC_RECHECK_EVERY`-gated recompute: refresh `distinct_sources` from
  the entity index, compute hotness, spawn + backfill if threshold
  crossed.
- `backfill.rs` — `backfill_topic_tree` walks the entity index and routes
  every historic leaf (ts-ASC ordered) through
  `source_tree::bucket_seal::append_leaf`. Capped at 500 leaves. Skips
  summary-node hits so topic trees only ever ingest raw leaves. Missing
  chunks log a warn and are skipped, never failing the spawn.
- `routing.rs` — `route_leaf_to_topic_trees` fans a kept leaf out to
  every active matching topic tree and ticks the curator for each
  entity. Archived trees are skipped but counters still bump.

Wired from ingest (`tree/ingest.rs::append_leaves_to_tree`) AFTER the
source-tree `append_leaf` succeeds — non-fatal on error, logged at warn
level so routing issues never poison the ingest hot path.

## Hotness scoring (pure)

```
hotness = ln(mentions + 1)          // dampened volume
        + 0.5 * distinct_sources    // cross-source bonus
        + recency_decay(last_seen)  // 1.0 @ day 0 → 0 @ day 30
        + graph_centrality          // Phase 4+; None → 0
        + 2.0 * query_hits          // retrieval feedback; Phase 4+
```

`graph_centrality` and `query_hits_30d` columns are persisted but the
increment code paths are deferred to later phases, as planned.

## Schema (additive, idempotent)

New table `mem_tree_entity_hotness` (keyed on `entity_id`) added to the
Phase 1 `SCHEMA` constant in `tree/store.rs` so it migrates through the
same `with_connection` path as the existing tables. `CREATE TABLE IF NOT
EXISTS` keeps it re-run-safe. An ancillary index on `last_hotness`
supports future sweep queries.
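An assumed shape for that table, with columns inferred from the counters this PR describes; the authoritative DDL lives in `tree/store.rs`'s `SCHEMA` constant.

```rust
// Additive, re-run-safe migration fragment (column set is an assumption).
const HOTNESS_SCHEMA: &str = "
CREATE TABLE IF NOT EXISTS mem_tree_entity_hotness (
    entity_id           TEXT PRIMARY KEY,
    mention_count_30d   INTEGER NOT NULL DEFAULT 0,
    distinct_sources    INTEGER NOT NULL DEFAULT 0,
    last_seen_ms        INTEGER,
    graph_centrality    REAL,
    query_hits_30d      INTEGER NOT NULL DEFAULT 0,
    last_hotness        REAL NOT NULL DEFAULT 0.0,
    ingests_since_check INTEGER NOT NULL DEFAULT 0,
    last_updated_ms     INTEGER NOT NULL DEFAULT 0
);
-- Ancillary index supporting future sweep queries over last_hotness.
CREATE INDEX IF NOT EXISTS idx_mem_tree_entity_hotness_last_hotness
    ON mem_tree_entity_hotness (last_hotness);
";

fn main() {
    // Re-run safety hinges on IF NOT EXISTS on both statements.
    assert_eq!(HOTNESS_SCHEMA.matches("IF NOT EXISTS").count(), 2);
}
```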

## Routing

The ingest path is the only non-admin caller. After `append_leaf` puts a
leaf in the source tree, `route_leaf_to_topic_trees` runs with the
chunk's canonical entity list. For each entity:

1. If an active topic tree exists, append the leaf to it (reusing
   `source_tree::bucket_seal::append_leaf` with `entities=[entity_id]`).
2. Tick `maybe_spawn_topic_tree` — may bump counters or, on cadence,
   recompute hotness and spawn + backfill a new tree.

Per-entity errors are caught and logged; a top-level failure is demoted
to a warn in the ingest caller so the source-tree append always wins.
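The two steps above reduce to a small fan-out loop. This is a synchronous sketch with the tree append and curator tick stubbed as closures; in the PR both are fallible async calls and per-entity errors are only logged.

```rust
// Hypothetical reduction of route_leaf_to_topic_trees: fan one kept leaf
// out per entity, isolating per-entity failures from each other.
fn route_leaf(
    entities: &[&str],
    mut append_to_active_tree: impl FnMut(&str) -> Result<bool, String>,
    mut tick_curator: impl FnMut(&str) -> Result<(), String>,
) -> usize {
    let mut appended = 0;
    for entity_id in entities {
        // Step 1: append to an existing active topic tree, if any.
        match append_to_active_tree(entity_id) {
            Ok(true) => appended += 1,
            Ok(false) => {} // no active tree (or archived): append skipped
            Err(e) => eprintln!("append failed for {entity_id}: {e}"), // logged, not fatal
        }
        // Step 2: curator ticks (counters bump) even when the append was skipped.
        if let Err(e) = tick_curator(entity_id) {
            eprintln!("curator tick failed for {entity_id}: {e}");
        }
    }
    appended
}

fn main() {
    let n = route_leaf(
        &["alice", "phoenix"],
        |e| Ok(e == "alice"), // only alice has an active topic tree
        |_| Ok(()),
    );
    assert_eq!(n, 1);
}
```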

## Reuses from Phase 3a

- `source_tree::bucket_seal::append_leaf` — same `&Tree` API works for
  `TreeKind::Topic` end-to-end.
- `source_tree::summariser::{Summariser, InertSummariser}` — honest stub
  emits empty entity/topic vecs on summary nodes.
- `mem_tree_trees` / `mem_tree_summaries` / `mem_tree_buffers` schema,
  discriminated by `kind = 'topic'` and `scope = <entity canonical id>`.
- Registry race recovery via `is_unique_violation` (catches UNIQUE on
  `insert_tree`, re-queries on collision).
- Idempotent append (duplicate `item_ids` in the L0 buffer are no-ops),
  so backfill is safe to re-run.

## Tests

40 new tests (10 hotness, 3 types, 6 store, 7 registry, 4 curator, 4
backfill, 5 routing, 1 integration). `memory::tree` suite: 201 passing,
0 regressions (Phase 3a baseline was 161 → +40 new).

Coverage highlights:
- Hotness pure math (zero-entity, spike-over-threshold, old-but-widely-
  cited retains signal, query-hit boost, recency decay edges).
- Curator: first-ingest-just-bumps, no-spawn-below-threshold,
  spawn-fires-exactly-once-when-crossed, cadence gating.
- Backfill: appends-all-entity-leaves, skips missing chunks,
  idempotent, skips summary nodes.
- Routing: empty-entities-noop, appends-to-existing-tree, archived-tree-
  skipped, multi-entity-fan-out, end-to-end-integration-materialisation.
- Registry: idempotent get-or-create, UNIQUE race recovery, archive
  flips status (not deletion), kind/scope cleanly separated from source.

## Deliberate non-goals (deferred)

- No JSON-RPC surface (out of scope for Phase 3c core).
- No archive cron sweep — `archive_topic_tree` primitive only.
- `graph_centrality` / `query_hits_30d` increments deferred to later
  phases (columns exist, reads work, writes are Phase 4+).

## Stacked on tinyhumansai#789

Base is `feat/709-summary-trees`. Merge tinyhumansai#789 first, then rebase this
branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@sanil-23 sanil-23 force-pushed the feat/709c-topic-trees branch from 19c21ad to ad618a2 Compare April 22, 2026 20:36
@senamakel senamakel merged commit 32b5afd into tinyhumansai:main Apr 22, 2026
6 of 8 checks passed