
feat(tree-summarizer): hierarchical summary tree module with CLI #423

Merged
senamakel merged 10 commits into tinyhumansai:main from senamakel:feat/summary-tree
Apr 8, 2026

Conversation

@senamakel
Member

@senamakel senamakel commented Apr 8, 2026

Summary

  • Adds a new tree_summarizer domain module that maintains a hierarchical tree of time-based summaries (Hour → Day → Month → Year → Root) stored as markdown files in memory namespaces
  • Each level has graduated token budgets (1k/2k/4k/8k/20k) to preserve information as summaries compress upward
  • Includes a dedicated CLI (openhuman tree-summarizer) for ingesting content, running summarization jobs, querying the tree, and inspecting status
  • Integrates with the event bus, controller registry, and provider system following existing domain patterns
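
The graduated budgets described above can be sketched as a small level enum. This is a minimal, hypothetical illustration of the shape described in the summary (the real `NodeLevel` lives in `types.rs` and may differ), showing how budgets grow toward the root so upward compression loses less information:

```rust
// Hypothetical sketch of the graduated per-level token budgets; names mirror
// the PR description, not necessarily the actual source in types.rs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NodeLevel {
    Hour,
    Day,
    Month,
    Year,
    Root,
}

impl NodeLevel {
    /// Token budget grows toward the root (1k/2k/4k/8k/20k).
    fn max_tokens(self) -> usize {
        match self {
            NodeLevel::Hour => 1_000,
            NodeLevel::Day => 2_000,
            NodeLevel::Month => 4_000,
            NodeLevel::Year => 8_000,
            NodeLevel::Root => 20_000,
        }
    }

    /// Next level up in the tree, if any.
    fn parent(self) -> Option<NodeLevel> {
        match self {
            NodeLevel::Hour => Some(NodeLevel::Day),
            NodeLevel::Day => Some(NodeLevel::Month),
            NodeLevel::Month => Some(NodeLevel::Year),
            NodeLevel::Year => Some(NodeLevel::Root),
            NodeLevel::Root => None,
        }
    }
}

fn main() {
    assert_eq!(NodeLevel::Hour.max_tokens(), 1_000);
    assert_eq!(NodeLevel::Root.max_tokens(), 20_000);
    assert_eq!(NodeLevel::Year.parent(), Some(NodeLevel::Root));
    assert!(NodeLevel::Root.parent().is_none());
}
```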

New files

| File | Purpose |
| --- | --- |
| `src/openhuman/tree_summarizer/types.rs` | `NodeLevel`, `TreeNode`, helpers (token estimation, node-ID derivation) |
| `src/openhuman/tree_summarizer/store.rs` | Markdown file persistence, buffer management, tree traversal |
| `src/openhuman/tree_summarizer/engine.rs` | Core summarization: ingest → hour leaf → propagate upward via LLM |
| `src/openhuman/tree_summarizer/ops.rs` | RPC operation wrappers (ingest, run, query, status, rebuild) |
| `src/openhuman/tree_summarizer/schemas.rs` | Controller schema definitions + handler registration |
| `src/openhuman/tree_summarizer/bus.rs` | Event bus subscriber stub |
| `src/openhuman/tree_summarizer/mod.rs` | Module exports |
| `src/core/tree_summarizer_cli.rs` | Dedicated CLI with file/stdin input, positional args |

CLI usage

```sh
# Ingest raw content into a namespace buffer (inline, from a file, or via stdin)
openhuman tree-summarizer ingest my-ns --content 'raw data'
openhuman tree-summarizer ingest my-ns --file notes.txt
cat data.md | openhuman tree-summarizer ingest my-ns --file -

# Run a summarization job, query a node, inspect status, or rebuild the tree
openhuman tree-summarizer run my-ns
openhuman tree-summarizer query my-ns 2024/03/15
openhuman tree-summarizer status my-ns
openhuman tree-summarizer rebuild my-ns
```

Test plan

  • cargo check passes cleanly
  • cargo fmt applied
  • cargo test -p openhuman -- tree_summarizer — store unit tests (roundtrip read/write, children, buffer ops)
  • Manual: ingest content via CLI, run summarization, query tree nodes
  • Manual: verify markdown files created in expected directory structure

Summary by CodeRabbit

  • New Features

    • Adds a tree-summarizer feature: top-level CLI command with subcommands (ingest, run, query, status, rebuild), hourly background summarization, bottom-up propagation, on-disk hierarchical summary tree, and rebuild support.
    • CLI help updated to show tree-summarizer usage.
  • Observability

    • Emits new events for hour completion, propagation progress, and rebuild completion; subscriber registered at startup.
  • Documentation

    • Added event-bus and CLI usage notes.
  • Tests

    • Added unit tests for storage, types, CLI parsing, and event handling.

…d event handling

- Introduced a new `tree_summarizer` module to manage hierarchical time-based summaries, organizing data into a tree structure (root → year → month → day → hour).
- Added functionality to ingest raw content, summarize it into hour leaves, and propagate summaries upward through the tree.
- Implemented event handling for summarization completion and tree rebuild events, enhancing observability and modularity.
- Created RPC operations for ingesting content, triggering summarization, querying the tree, and retrieving tree status.
- Added comprehensive tests to ensure the reliability of the summarization process and event handling.

This update significantly enhances the summarization capabilities of the system, allowing for efficient data organization and retrieval.
- Introduced a new `tree-summarizer` command to the CLI, allowing users to ingest content, run summarization jobs, query the summary tree, check status, and rebuild the tree.
- Updated the CLI help documentation to include the new command and its subcommands.
- Added a new module `tree_summarizer_cli` to encapsulate the tree summarization functionality.

This enhancement improves the usability of the summarization features, providing a streamlined interface for managing hierarchical summaries directly from the command line.
@coderabbitai
Contributor

coderabbitai Bot commented Apr 8, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds a new tree_summarizer namespace: types, storage, engine, RPCs, CLI entry, controller schemas, event definitions/subscriber, and startup/registry integration across multiple modules and tests.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **CLI Integration**<br>`src/core/cli.rs`, `src/core/mod.rs`, `src/core/tree_summarizer_cli.rs` | Add top-level `tree-summarizer` subcommand, export new core CLI module, and implement `run_tree_summarizer_command` with subcommands (ingest, run, query, status, rebuild) and argument parsing/runtime bootstrap. |
| **Registry & Core Wiring**<br>`src/core/all.rs` | Register `openhuman::tree_summarizer` controllers and add `namespace_description` entry for `"tree_summarizer"`. |
| **Openhuman Exports**<br>`src/openhuman/mod.rs`, `src/openhuman/tree_summarizer/mod.rs` | Export new `tree_summarizer` module; add module entrypoint that re-exports RPC ops, schemas, registered controllers, and types. |
| **Domain Types**<br>`src/openhuman/tree_summarizer/types.rs` | Introduce `NodeLevel`, `TreeNode`, `TreeStatus`, `IngestRequest`, `QueryResult`, token estimation, node-id/path derivation, and related tests. |
| **Persistence & Buffering**<br>`src/openhuman/tree_summarizer/store.rs` | Implement markdown+YAML frontmatter storage under `memory/namespaces/*/tree`: node read/write, children/ancestors, count/status/delete, buffer write/drain, parsing helpers, and comprehensive unit tests. |
| **Engine & Background Loop**<br>`src/openhuman/tree_summarizer/engine.rs` | Add summarization engine: `run_summarization`, propagation, `rebuild_tree`, and `run_hourly_loop` implementing buffer draining, summarization propagation, rebuild logic, and event emission. |
| **RPC Layer**<br>`src/openhuman/tree_summarizer/ops.rs` | Add RPC ops: `tree_summarizer_ingest`, `tree_summarizer_run`, `tree_summarizer_query`, `tree_summarizer_status`, `tree_summarizer_rebuild` with validation and a provider-creation helper. |
| **Controller Schemas & Handlers**<br>`src/openhuman/tree_summarizer/schemas.rs` | Add controller schemas, registered controllers, and async handlers that parse params, load RPC config, invoke ops, and serialize outcomes. |
| **Events & Bus Subscriber**<br>`src/openhuman/event_bus/events.rs`, `src/openhuman/tree_summarizer/bus.rs`, `src/openhuman/channels/runtime/startup.rs` | Add `DomainEvent` variants for tree_summarizer, route them to `"tree_summarizer"`, implement `TreeSummarizerEventSubscriber`, and register the subscriber at startup. |
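
The store cohort describes persisting nodes as markdown with YAML frontmatter. A minimal std-only sketch of splitting such a file (the field names and exact delimiter convention here are assumptions, not the repo's actual schema):

```rust
// Hypothetical node-file format: "---\n<yaml frontmatter>\n---\n<markdown body>".
// Returns (frontmatter, body), or None if the document has no frontmatter.
fn split_frontmatter(doc: &str) -> Option<(&str, &str)> {
    let rest = doc.strip_prefix("---\n")?;
    let end = rest.find("\n---\n")?;
    // "\n---\n" is 5 bytes, so the body starts 5 bytes past the match.
    Some((&rest[..end], &rest[end + 5..]))
}

fn main() {
    let doc = "---\nlevel: hour\nnode_id: 2024/03/15/09\n---\nSummary body.\n";
    let (front, body) = split_frontmatter(doc).expect("frontmatter present");
    assert!(front.contains("level: hour"));
    assert_eq!(body, "Summary body.\n");
    assert!(split_frontmatter("no frontmatter here").is_none());
}
```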

Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor User
    participant CLI as "CLI Handler"
    participant Ops as "RPC Ops"
    participant Engine as "Engine"
    participant Store as "Store"
    participant EventBus as "Event Bus"

    User->>CLI: tree-summarizer ingest --content "..."
    CLI->>Ops: tree_summarizer_ingest(namespace, content)
    Ops->>Store: buffer_write(content)
    Ops-->>CLI: RpcOutcome (buffer metadata)

    User->>CLI: tree-summarizer run
    CLI->>Ops: tree_summarizer_run(namespace)
    Ops->>Engine: run_summarization(namespace, ts)
    Engine->>Store: buffer_drain(namespace)
    Engine->>Engine: summarize hour node
    Engine->>Store: write_node(hour_node)
    Engine->>EventBus: emit TreeSummarizerHourCompleted
    loop propagate upward
        Engine->>Store: read_children(parent_id)
        Engine->>Engine: combine/summarize children
        Engine->>Store: write_node(parent_node)
        Engine->>EventBus: emit TreeSummarizerPropagated
    end
    Ops-->>CLI: RpcOutcome (result JSON)
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~55 minutes

Possibly related PRs

Suggested reviewers

  • YellowSnnowmann

Poem

🐰 I nibble buffers into tidy little lines,
Hour by hour I fold thoughts into pines,
Roots hum summaries, branches hum back,
I hop the paths and leave a cozy track,
Hooray — the tree now tells its signs!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title Check | ✅ Passed | The title 'feat(tree-summarizer): hierarchical summary tree module with CLI' clearly and specifically describes the primary change: adding a new tree-summarizer module with CLI support. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 10

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/openhuman/tree_summarizer/bus.rs`:
- Around line 1-4: bus.rs is only a stub and lacks a subscriber, so events
TreeSummarizerHourCompleted, TreeSummarizerPropagated, and
TreeSummarizerRebuildCompleted (defined in event_bus/events.rs and emitted from
engine.rs) are never handled; implement a TreeSummarizerEventSubscriber struct
in src/openhuman/tree_summarizer/bus.rs that implements the EventHandler trait
(matching the handler signature used by other domain subscribers) and processes
those three event types, then register it at startup by calling
event_bus::subscribe_global(Box::new(TreeSummarizerEventSubscriber::new())) in
the startup registration code (follow the pattern used by CronDeliverySubscriber
in channels/runtime/startup.rs) so the subscriber is active on application boot.

In `@src/openhuman/tree_summarizer/engine.rs`:
- Around line 117-129: The rebuild flow deletes the whole tree
(store::delete_tree) after collect_hour_leaves_recursive (which skips the
"tree/buffer"), causing unsummarized buffered inputs to be lost; before calling
store::delete_tree, locate the buffer directory under the tree base (from
store::tree_dir(config, namespace) + "/buffer"), atomically move or copy its
contents to a temporary safe location, then proceed with delete and rebuild, and
finally replay or restore the buffered files back into the namespace (using your
ingestion/replay function or reinsert logic) so buffered inputs are preserved
and reprocessed after the rebuild.
- Around line 35-73: The code currently uses the run timestamp (ts) and
derive_node_ids(&ts) to make a single hour leaf for all buffered entries, which
causes wrong bucketing and overwrites; change the flow to group the buffered
entries returned by store::buffer_drain(...) by each entry's own timestamp
(e.g., entry.timestamp or equivalent), call derive_node_ids(...) per-group/hour
to get that hour_id, produce a combined text per-hour and call
summarize_to_limit(provider, &combined_per_hour, NodeLevel::Hour.max_tokens(),
"hour", &hour_id).await for each group, then build a TreeNode per hour_id and
call store::write_node(config, &hour_node) per-hour (before writing, read
existing hour node if any and merge/accumulate summaries or child_count to avoid
overwriting). Ensure references to hour_id, derive_parent_id, TreeNode,
summarize_to_limit, and store::write_node are updated to operate per-hour group
rather than once for the global ts.
- Around line 283-303: After receiving the LLM output from
provider.chat_with_system (used with SUMMARIZATION_MODEL and
SUMMARIZATION_TEMP), validate the response against the allowed summary size (max
tokens/characters) and enforce a postcondition: compute the response size (using
your tokenizer if available, otherwise a character limit), and if it exceeds the
configured maximum (e.g., SUMMARIZATION_MAX_TOKENS or a MAX_SUMMARY_CHARS
constant), either truncate the response to that limit or return an
error/contextual failure including node_id and level_name; update the code path
around the response variable in engine.rs to perform this check before
logging/tracing and returning Ok(response).

In `@src/openhuman/tree_summarizer/ops.rs`:
- Around line 84-96: Validate the user-controlled node_id before calling
store::read_node and store::read_children: reject or normalize any input that is
not exactly "root" or matching the canonical date-path patterns (e.g., "YYYY",
"YYYY/MM", "YYYY/MM/DD", "YYYY/MM/DD/HH") and disallow path traversal characters
like "..", leading slashes, or empty segments; perform this check where node_id
is read (the node_id / target_id handling) and return a clear error before
invoking read_node/read_children so invalid IDs never reach filesystem code. Use
a strict whitelist regex for the allowed formats and trim namespace separately
(namespace.trim()) as currently done, and ensure the same validation is applied
to target_id before passing it to read_node and read_children.
- Around line 12-38: The ingest path drops optional metadata from the public
IngestRequest; update tree_summarizer_ingest to accept and validate the metadata
field from the request (match the optional metadata shape declared in schemas.rs
/ IngestRequest), then pass it through to storage instead of discarding
it—either extend store::buffer_write to take a metadata parameter (or call a new
store function that persists content+metadata) and include metadata in the
returned RpcOutcome JSON; ensure you also update any callers such as
handle_ingest to forward metadata and adjust store::buffer_write usage
accordingly so metadata is not lost silently.

In `@src/openhuman/tree_summarizer/schemas.rs`:
- Around line 124-147: The ControllerSchema entries for the "status" and
"rebuild" handlers currently declare an output field named "status" but the
handlers actually return RpcOutcome::single_log(...) serialized via
into_cli_compatible_json(), which produces an envelope { "result": ..., "logs":
[...] }; update both ControllerSchema blocks ("status" and "rebuild") to rename
the output FieldSchema name from "status" to "result" (keep ty: TypeSchema::Json
and the comment but adjust wording if desired) so the published contract matches
the actual CLI/RPC response envelope.

In `@src/openhuman/tree_summarizer/store.rs`:
- Around line 42-47: The current sanitize(namespace: &str) function
lossy-replaces reserved chars which can collapse distinct namespaces (e.g.,
"a/b" vs "a:b"); change it to either (A) perform reversible encoding (e.g.,
percent-encoding or url-safe Base64) so the on-disk folder maps 1:1 back to the
original namespace, or (B) reject namespaces containing invalid characters by
returning a Result and propagating validation to callers; update sanitize's
signature and all call sites accordingly (reference: sanitize function in
store.rs) so that callers handle the Result or use the reversible encoding
output instead of the previous underscore-mapped string.
- Around line 315-318: The cleanup loop currently ignores errors from
std::fs::remove_file causing buffer_drain to report success even when files
weren't deleted; update buffer_drain (or the function that iterates over
entries) to check the Result of remove_file and return/propagate an Err when any
removal fails instead of discarding it—i.e., replace the throwaway let _ =
std::fs::remove_file(path) with proper error handling (map_err or ?), so
buffer_drain fails on cleanup errors and prevents duplicate reads on the next
run.
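
The last comment above asks that `buffer_drain` propagate removal failures instead of discarding them. A minimal std-only sketch of that change, with `drain_cleanup` as a hypothetical stand-in for the removal loop inside `buffer_drain`:

```rust
use std::io;
use std::path::Path;

// Stand-in for the cleanup loop in buffer_drain: the `?` operator surfaces
// the first failed removal instead of swallowing it with `let _ = ...`,
// so the caller can abort and avoid re-reading the same entries next run.
fn drain_cleanup(paths: &[&Path]) -> io::Result<()> {
    for path in paths {
        std::fs::remove_file(path)?;
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let dir = std::env::temp_dir().join("tree_summarizer_drain_demo");
    std::fs::create_dir_all(&dir)?;
    let entry = dir.join("entry.md");
    std::fs::write(&entry, "buffered entry")?;

    drain_cleanup(&[entry.as_path()])?;
    assert!(!entry.exists());

    // Removing a now-missing file fails loudly instead of being ignored.
    assert!(drain_cleanup(&[entry.as_path()]).is_err());
    Ok(())
}
```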
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0518e168-93bb-44c9-91fa-0bb0a2f72b2e

📥 Commits

Reviewing files that changed from the base of the PR and between 3034ec1 and b63c06c.

📒 Files selected for processing (13)
  • src/core/all.rs
  • src/core/cli.rs
  • src/core/mod.rs
  • src/core/tree_summarizer_cli.rs
  • src/openhuman/event_bus/events.rs
  • src/openhuman/mod.rs
  • src/openhuman/tree_summarizer/bus.rs
  • src/openhuman/tree_summarizer/engine.rs
  • src/openhuman/tree_summarizer/mod.rs
  • src/openhuman/tree_summarizer/ops.rs
  • src/openhuman/tree_summarizer/schemas.rs
  • src/openhuman/tree_summarizer/store.rs
  • src/openhuman/tree_summarizer/types.rs

Comment on lines +45 to +99
```rust
fn parse_opts(args: &[String]) -> Result<(CliOpts, Vec<String>)> {
    let mut verbose = false;
    let mut content: Option<String> = None;
    let mut file: Option<String> = None;
    let mut node_id: Option<String> = None;
    let mut rest = Vec::new();
    let mut i = 0;

    while i < args.len() {
        match args[i].as_str() {
            "--content" | "-c" => {
                let val = args
                    .get(i + 1)
                    .ok_or_else(|| anyhow::anyhow!("missing value for --content"))?;
                content = Some(val.clone());
                i += 2;
            }
            "--file" | "-f" => {
                let val = args
                    .get(i + 1)
                    .ok_or_else(|| anyhow::anyhow!("missing value for --file"))?;
                file = Some(val.clone());
                i += 2;
            }
            "--node-id" | "--node" => {
                let val = args
                    .get(i + 1)
                    .ok_or_else(|| anyhow::anyhow!("missing value for --node-id"))?;
                node_id = Some(val.clone());
                i += 2;
            }
            "-v" | "--verbose" => {
                verbose = true;
                i += 1;
            }
            "-h" | "--help" => {
                rest.push(args[i].clone());
                i += 1;
            }
            _ => {
                rest.push(args[i].clone());
                i += 1;
            }
        }
    }

    Ok((
        CliOpts {
            verbose,
            content,
            file,
            node_id,
        },
        rest,
    ))
```

⚠️ Potential issue | 🟠 Major

Unknown flags and extra args are silently ignored.

At Line 84, unrecognized options are treated as positional args; then subcommands (e.g., Lines 185, 270, 305) consume only rest[0] and ignore leftovers. This can mask user mistakes (--verbsoe, stray args) and execute the wrong command shape.

🔧 Suggested fix (strict option/arity validation)
```diff
 fn parse_opts(args: &[String]) -> Result<(CliOpts, Vec<String>)> {
@@
-            _ => {
-                rest.push(args[i].clone());
-                i += 1;
-            }
+            other if other.starts_with('-') => {
+                return Err(anyhow::anyhow!("unknown option: {other}"));
+            }
+            _ => {
+                rest.push(args[i].clone());
+                i += 1;
+            }
         }
     }
@@
 fn run_ingest(args: &[String]) -> Result<()> {
     let (opts, rest) = parse_opts(args)?;
@@
-    let namespace = &rest[0];
+    if rest.len() != 1 {
+        return Err(anyhow::anyhow!(
+            "unexpected positional args: {:?}",
+            &rest[1..]
+        ));
+    }
+    let namespace = &rest[0];
@@
 fn run_summarize(args: &[String]) -> Result<()> {
     let (opts, rest) = parse_opts(args)?;
@@
-    let namespace = &rest[0];
+    if rest.len() != 1 {
+        return Err(anyhow::anyhow!(
+            "unexpected positional args: {:?}",
+            &rest[1..]
+        ));
+    }
+    let namespace = &rest[0];
@@
 fn run_query(args: &[String]) -> Result<()> {
     let (opts, rest) = parse_opts(args)?;
@@
+    if rest.len() > 2 {
+        return Err(anyhow::anyhow!(
+            "unexpected positional args: {:?}",
+            &rest[2..]
+        ));
+    }
     let namespace = &rest[0];
@@
 fn run_status(args: &[String]) -> Result<()> {
     let (opts, rest) = parse_opts(args)?;
@@
-    let namespace = &rest[0];
+    if rest.len() != 1 {
+        return Err(anyhow::anyhow!(
+            "unexpected positional args: {:?}",
+            &rest[1..]
+        ));
+    }
+    let namespace = &rest[0];
@@
 fn run_rebuild(args: &[String]) -> Result<()> {
     let (opts, rest) = parse_opts(args)?;
@@
-    let namespace = &rest[0];
+    if rest.len() != 1 {
+        return Err(anyhow::anyhow!(
+            "unexpected positional args: {:?}",
+            &rest[1..]
+        ));
+    }
+    let namespace = &rest[0];
```
Also applies to: 107-325

Comment thread src/openhuman/tree_summarizer/bus.rs Outdated
Comment on lines +1 to +4
//! Event bus integration for tree_summarizer.
//!
//! Stub module — no active subscribers yet. Events are defined in
//! `event_bus/events.rs` and published from `engine.rs`.

🛠️ Refactor suggestion | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail

# Verify whether tree_summarizer has a concrete event subscriber.
rg -n --type rust -C3 'impl\s+EventHandler\s+for|struct\s+\w+Subscriber|fn\s+name\(&self\)\s*->\s*&str' src/openhuman/tree_summarizer || true

# Verify startup/global registration points.
rg -n --type rust -C3 'subscribe_global\(|\.subscribe\(' src || true

# Verify any explicit tree_summarizer subscriber registration.
rg -n --type rust -C3 'tree_summarizer.*Subscriber|Subscriber.*tree_summarizer|tree_summarizer.*subscribe' src || true
```

Repository: tinyhumansai/openhuman

Length of output: 8876


🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail

# Check if tree_summarizer has any event variants in DomainEvent enum
rg -n --type rust 'TreeSummarizer|tree_summarizer' src/openhuman/event_bus/events.rs || echo "No tree_summarizer events found"

# Show the DomainEvent enum definition to understand current structure
rg -A 50 'pub enum DomainEvent' src/openhuman/event_bus/events.rs | head -70
```

Repository: tinyhumansai/openhuman

Length of output: 2322


Complete event-bus integration for tree_summarizer: Add subscriber implementation to bus.rs and register it in startup.

Tree_summarizer defines three events (TreeSummarizerHourCompleted, TreeSummarizerPropagated, TreeSummarizerRebuildCompleted) in events.rs, but the bus.rs stub has no handler. Events emitted from engine.rs will not trigger any domain logic until a subscriber is registered. Create a <Purpose>Subscriber struct (e.g., TreeSummarizerEventSubscriber) that implements EventHandler, then register it via event_bus::subscribe_global() during startup—typically in channels/runtime/startup.rs following the pattern used for CronDeliverySubscriber and other domain handlers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/tree_summarizer/bus.rs` around lines 1 - 4, bus.rs is only a
stub and lacks a subscriber, so events TreeSummarizerHourCompleted,
TreeSummarizerPropagated, and TreeSummarizerRebuildCompleted (defined in
event_bus/events.rs and emitted from engine.rs) are never handled; implement a
TreeSummarizerEventSubscriber struct in src/openhuman/tree_summarizer/bus.rs
that implements the EventHandler trait (matching the handler signature used by
other domain subscribers) and processes those three event types, then register
it at startup by calling
event_bus::subscribe_global(Box::new(TreeSummarizerEventSubscriber::new())) in
the startup registration code (follow the pattern used by CronDeliverySubscriber
in channels/runtime/startup.rs) so the subscriber is active on application boot.
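
The subscriber the comment describes could take roughly the following shape. This is a hedged, self-contained sketch: the real `EventHandler` trait, `DomainEvent` enum, and `subscribe_global` signature live in the repo's `event_bus` module, and everything below is a simplified stand-in to show the pattern, not the actual API.

```rust
// Simplified stand-ins for the repo's event_bus types (assumptions).
#[derive(Debug, Clone)]
enum DomainEvent {
    TreeSummarizerHourCompleted { namespace: String, hour_id: String },
    TreeSummarizerPropagated { namespace: String, node_id: String },
    TreeSummarizerRebuildCompleted { namespace: String },
    Other,
}

trait EventHandler {
    fn name(&self) -> &str;
    fn handle(&self, event: &DomainEvent);
}

struct TreeSummarizerEventSubscriber;

impl TreeSummarizerEventSubscriber {
    fn new() -> Self {
        Self
    }
}

impl EventHandler for TreeSummarizerEventSubscriber {
    fn name(&self) -> &str {
        "tree_summarizer"
    }

    // Process the three tree_summarizer event variants; other events pass through.
    fn handle(&self, event: &DomainEvent) {
        match event {
            DomainEvent::TreeSummarizerHourCompleted { namespace, hour_id } => {
                println!("hour {hour_id} completed in {namespace}");
            }
            DomainEvent::TreeSummarizerPropagated { namespace, node_id } => {
                println!("propagated {node_id} in {namespace}");
            }
            DomainEvent::TreeSummarizerRebuildCompleted { namespace } => {
                println!("rebuild completed in {namespace}");
            }
            DomainEvent::Other => {}
        }
    }
}

fn main() {
    // At startup the real code would do something like:
    // event_bus::subscribe_global(Box::new(TreeSummarizerEventSubscriber::new()));
    let sub: Box<dyn EventHandler> = Box::new(TreeSummarizerEventSubscriber::new());
    assert_eq!(sub.name(), "tree_summarizer");
    sub.handle(&DomainEvent::TreeSummarizerRebuildCompleted {
        namespace: "my-ns".into(),
    });
}
```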

Comment thread src/openhuman/tree_summarizer/engine.rs Outdated
Comment thread src/openhuman/tree_summarizer/engine.rs Outdated
Comment thread src/openhuman/tree_summarizer/engine.rs
Comment thread src/openhuman/tree_summarizer/ops.rs
Comment thread src/openhuman/tree_summarizer/ops.rs
Comment thread src/openhuman/tree_summarizer/schemas.rs
Comment on lines +42 to +47
```rust
fn sanitize(namespace: &str) -> String {
    namespace
        .replace(['/', '\\', ':', '*', '?', '"', '<', '>', '|'], "_")
        .trim()
        .to_string()
}
```

⚠️ Potential issue | 🟠 Major

Lossy namespace sanitization can merge unrelated trees.

Replacing reserved characters with _ makes distinct namespaces collide on disk ("a/b" and "a:b" both become a_b). The hourly scanner then only sees the sanitized directory name, so logs/events can drift from the original namespace as well. Reject invalid characters up front or use a reversible encoding.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/tree_summarizer/store.rs` around lines 42 - 47, The current
sanitize(namespace: &str) function lossy-replaces reserved chars which can
collapse distinct namespaces (e.g., "a/b" vs "a:b"); change it to either (A)
perform reversible encoding (e.g., percent-encoding or url-safe Base64) so the
on-disk folder maps 1:1 back to the original namespace, or (B) reject namespaces
containing invalid characters by returning a Result and propagating validation
to callers; update sanitize's signature and all call sites accordingly
(reference: sanitize function in store.rs) so that callers handle the Result or
use the reversible encoding output instead of the previous underscore-mapped
string.
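
Option (A) above, a reversible encoding, could be sketched with a std-only percent-style scheme. This is an illustration under assumptions (`encode_namespace` is a hypothetical name, and a real implementation might prefer an established percent-encoding crate):

```rust
// Reversible encoding for on-disk namespace folders: every reserved or
// non-printable byte becomes %XX, so distinct namespaces never collide.
fn encode_namespace(ns: &str) -> String {
    // '%' itself must be reserved, or the encoding would not round-trip.
    let reserved = ['/', '\\', ':', '*', '?', '"', '<', '>', '|', '%'];
    let mut out = String::with_capacity(ns.len());
    for b in ns.bytes() {
        let c = b as char;
        if b < 0x20 || b >= 0x7F || reserved.contains(&c) {
            out.push('%');
            out.push_str(&format!("{b:02X}"));
        } else {
            out.push(c);
        }
    }
    out
}

fn main() {
    // "a/b" and "a:b" now map to different directories.
    assert_eq!(encode_namespace("a/b"), "a%2Fb");
    assert_eq!(encode_namespace("a:b"), "a%3Ab");
    assert_ne!(encode_namespace("a/b"), encode_namespace("a:b"));
}
```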

Comment thread src/openhuman/tree_summarizer/store.rs Outdated

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
src/core/tree_summarizer_cli.rs (1)

45-87: ⚠️ Potential issue | 🟠 Major

Reject unknown flags and unexpected extra positional args.

Line 84 currently treats unknown options as positional args, and later handlers (Line 124, Line 185, Line 270, Line 305) consume only the first positional value. Typos like --verbsoe or stray args are silently accepted and can run the wrong command shape.

🔧 Suggested fix (strict option + arity validation)
```diff
 fn parse_opts(args: &[String]) -> Result<(CliOpts, Vec<String>)> {
@@
-            _ => {
+            other if other.starts_with('-') => {
+                return Err(anyhow::anyhow!("unknown option: {other}"));
+            }
+            _ => {
                 rest.push(args[i].clone());
                 i += 1;
             }
         }
     }
@@
 fn run_ingest(args: &[String]) -> Result<()> {
     let (opts, rest) = parse_opts(args)?;
@@
-    let namespace = &rest[0];
+    if rest.len() != 1 {
+        return Err(anyhow::anyhow!("unexpected positional args: {:?}", &rest[1..]));
+    }
+    let namespace = &rest[0];
@@
 fn run_summarize(args: &[String]) -> Result<()> {
@@
-    let namespace = &rest[0];
+    if rest.len() != 1 {
+        return Err(anyhow::anyhow!("unexpected positional args: {:?}", &rest[1..]));
+    }
+    let namespace = &rest[0];
@@
 fn run_query(args: &[String]) -> Result<()> {
@@
+    if rest.len() > 2 {
+        return Err(anyhow::anyhow!("unexpected positional args: {:?}", &rest[2..]));
+    }
     let namespace = &rest[0];
@@
 fn run_status(args: &[String]) -> Result<()> {
@@
-    let namespace = &rest[0];
+    if rest.len() != 1 {
+        return Err(anyhow::anyhow!("unexpected positional args: {:?}", &rest[1..]));
+    }
+    let namespace = &rest[0];
@@
 fn run_rebuild(args: &[String]) -> Result<()> {
@@
-    let namespace = &rest[0];
+    if rest.len() != 1 {
+        return Err(anyhow::anyhow!("unexpected positional args: {:?}", &rest[1..]));
+    }
+    let namespace = &rest[0];
```

Also applies to: 124-125, 185-186, 230-235, 270-271, 305-306

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/core/tree_summarizer_cli.rs` around lines 45 - 87, In parse_opts, stop
treating unknown options as positional: change the default branch so any arg
starting with '-' returns an Err with a clear message (e.g., "unknown option:
...") instead of pushing into rest, and keep allowing non-flag tokens to be
collected into rest; after the loop validate rest.len() against the expected
arity (reject unexpected extra positional args) and return an Err if there are
too many; refer to parse_opts and the local vars rest, content, file, node_id,
verbose when making these changes so flag/arity validation is enforced
consistently.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/core/tree_summarizer_cli.rs`:
- Around line 214-215: The CLI help text claims the query returns a node and its
direct children, but the backend RPC in tree_summarizer::ops
(src/openhuman/tree_summarizer/ops.rs) currently returns only {"node":
node.summary}; fix by either updating the help text in tree_summarizer_cli (the
println lines) to accurately state that only the node summary is returned, or
extend the RPC response in tree_summarizer::ops (e.g., add a "children" field
with direct children summaries) and update any associated serialization/response
types so the CLI text remains correct.
- Around line 339-342: The code currently swallows config load/init errors by
calling unwrap_or_default on crate::openhuman::config::Config::load_or_init(),
which can hide I/O/parse/permission problems; change the handling to propagate
the error instead of falling back to defaults (e.g., remove unwrap_or_default
and use the ? operator or return a Result) so that the failure from
Config::load_or_init() surfaces, and then call config.apply_env_overrides() only
after a successful load.

---

Duplicate comments:
In `@src/core/tree_summarizer_cli.rs`:
- Around line 45-87: In parse_opts, stop treating unknown options as positional:
change the default branch so any arg starting with '-' returns an Err with a
clear message (e.g., "unknown option: ...") instead of pushing into rest, and
keep allowing non-flag tokens to be collected into rest; after the loop validate
rest.len() against the expected arity (reject unexpected extra positional args)
and return an Err if there are too many; refer to parse_opts and the local vars
rest, content, file, node_id, verbose when making these changes so flag/arity
validation is enforced consistently.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 376c2f80-ed5f-4d6f-8ea4-7a0e35a1dec0

📥 Commits

Reviewing files that changed from the base of the PR and between b63c06c and d595c37.

📒 Files selected for processing (3)
  • src/core/cli.rs
  • src/core/mod.rs
  • src/core/tree_summarizer_cli.rs
✅ Files skipped from review due to trivial changes (2)
  • src/core/mod.rs
  • src/core/cli.rs

Comment on lines +214 to +215
```rust
println!("Read a summary tree node and its direct children.");
println!();
```

⚠️ Potential issue | 🟡 Minor

Query help text over-promises response shape.

Line 214 says query reads “direct children,” but src/openhuman/tree_summarizer/ops.rs currently returns only {"node": node.summary}. Please align the help text or expand the RPC response.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/core/tree_summarizer_cli.rs` around lines 214 - 215, The CLI help text
claims the query returns a node and its direct children, but the backend RPC in
tree_summarizer::ops (src/openhuman/tree_summarizer/ops.rs) currently returns
only {"node": node.summary}; fix by either updating the help text in
tree_summarizer_cli (the println lines) to accurately state that only the node
summary is returned, or extend the RPC response in tree_summarizer::ops (e.g.,
add a "children" field with direct children summaries) and update any associated
serialization/response types so the CLI text remains correct.

Comment on lines +339 to +342
let mut config = crate::openhuman::config::Config::load_or_init()
.await
.unwrap_or_default();
config.apply_env_overrides();

⚠️ Potential issue | 🟠 Major

Don’t silently fall back to default config on load/init failure.

Line 341 (unwrap_or_default) hides real config errors (I/O/parse/permission) and proceeds with defaults, which can cause unintended writes/behavior. Return the error instead.

🔧 Suggested fix
 async fn load_config() -> Result<crate::openhuman::config::Config> {
-    let mut config = crate::openhuman::config::Config::load_or_init()
-        .await
-        .unwrap_or_default();
+    let mut config = crate::openhuman::config::Config::load_or_init()
+        .await
+        .map_err(|e| anyhow::anyhow!("failed to load/init config: {e}"))?;
     config.apply_env_overrides();
     Ok(config)
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/core/tree_summarizer_cli.rs` around lines 339 - 342, The code currently
swallows config load/init errors by calling unwrap_or_default on
crate::openhuman::config::Config::load_or_init(), which can hide
I/O/parse/permission problems; change the handling to propagate the error
instead of falling back to defaults (e.g., remove unwrap_or_default and use the
? operator or return a Result) so that the failure from Config::load_or_init()
surfaces, and then call config.apply_env_overrides() only after a successful
load.

…servability logging

- Added a new `TreeSummarizerEventSubscriber` to log events related to tree summarization, enhancing observability.
- Updated the `start_channels` function to register the new subscriber.
- Refactored the `run_summarization` function to group buffered entries by hour and publish events upon completion of summarization.
- Improved documentation and added tests for the new subscriber functionality.

This update aims to provide better insights into the summarization process and facilitate future cross-module workflows.
…ature

- Simplified the buffer backup process by consolidating the rename operation with context handling for better error reporting.
- Cleaned up the function signature of `derive_node_ids_from_hour_id` for improved readability.

These changes enhance code clarity and maintainability within the tree summarization engine.
- Updated the `tree_summarizer_ingest` function to accept an optional metadata parameter, allowing users to include additional context during content ingestion.
- Refactored related functions to validate and handle metadata, improving the overall robustness of the summarization process.
- Adjusted the buffer write functionality to store metadata alongside content, enhancing the data structure for future retrieval and processing.

These changes aim to enrich the summarization capabilities and provide more context for ingested content.
…D validation

- Enhanced the formatting of error messages in the `validate_node_id` function for better readability and consistency.
- Adjusted the string formatting to use multi-line syntax, improving clarity in error reporting.
- Minor formatting changes in the `strip_buffer_frontmatter` function to enhance code readability.

These changes aim to improve the maintainability and clarity of error handling within the tree summarization module.

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (2)
src/openhuman/tree_summarizer/engine.rs (2)

173-196: ⚠️ Potential issue | 🔴 Critical

Move the buffer backup outside tree/ before calling delete_tree().

buffer_path.with_file_name("buffer_backup") resolves to <tree>/buffer_backup, but store::delete_tree() removes the whole <tree> directory. The backup is gone before the restore branch runs, so unsummarized buffered input is still lost during rebuild.

💡 Proposed fix
     // Preserve the buffer directory by moving it to a temp location
     let buffer_path = store::buffer_dir(config, namespace);
-    let buffer_backup = buffer_path.with_file_name("buffer_backup");
+    let buffer_backup = base
+        .parent()
+        .unwrap_or(&base)
+        .join(".tree_buffer_backup");
     let buffer_existed = buffer_path.exists();
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/tree_summarizer/engine.rs` around lines 173 - 196, The buffer
backup is being created inside the tree directory so
store::delete_tree(namespace) removes it; change the backup location to be
outside the tree before calling store::delete_tree() by computing buffer_backup
as a path outside the tree (e.g. sibling or temp directory) instead of
buffer_path.with_file_name("buffer_backup"), then perform the rename from
store::buffer_dir(config, namespace) to that external buffer_backup, call
store::delete_tree(config, namespace), and finally rename back from that
external buffer_backup to store::buffer_dir(config, namespace) in the restore
branch; update references to buffer_path, buffer_backup and the restore logic in
engine.rs accordingly.
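The path computation the fix relies on can be sketched in isolation. The backup must land beside the tree directory, not inside it, so that delete_tree() cannot remove it; the name ".tree_buffer_backup" is illustrative, not the project's actual choice:

```rust
use std::path::{Path, PathBuf};

// Sketch: compute a backup location that survives delete_tree() by placing
// it next to the tree directory rather than under it.
fn external_backup_path(tree_dir: &Path) -> PathBuf {
    tree_dir
        .parent()
        .unwrap_or(tree_dir)
        .join(".tree_buffer_backup")
}
```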

365-379: ⚠️ Potential issue | 🟠 Major

The per-node response cap is effectively disabled here.

Line 366 compares against max_chars.max(MAX_SUMMARY_CHARS), which is always the global 80k ceiling for the current hour/day/month/year/root budgets. Non-root summaries can therefore exceed their own cap without truncation and keep inflating parent nodes.

💡 Proposed fix
-    let response = if response.len() > max_chars.max(MAX_SUMMARY_CHARS) {
+    let hard_limit = max_chars.min(MAX_SUMMARY_CHARS);
+    let response = if response.len() > hard_limit {
         tracing::warn!(
             "[tree_summarizer] LLM response for node {} (level={}) was {} chars, truncating to {} chars",
             node_id,
             level_name,
             response.len(),
-            max_chars
+            hard_limit
         );
         // Truncate at a char boundary
-        let truncated = &response[..response.floor_char_boundary(max_chars)];
+        let truncated = &response[..response.floor_char_boundary(hard_limit)];
         truncated.to_string()
     } else {
         response
     };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/tree_summarizer/engine.rs` around lines 365 - 379, The per-node
cap is effectively disabled because the code uses
max_chars.max(MAX_SUMMARY_CHARS) (which yields the global MAX_SUMMARY_CHARS)
when checking/truncating the LLM response for node_id/level_name; change the
logic to compute an enforced limit as the minimum of max_chars and
MAX_SUMMARY_CHARS (e.g., let limit = std::cmp::min(max_chars,
MAX_SUMMARY_CHARS)) and then compare response.len() against that limit, use
response.floor_char_boundary(limit) for truncation, and update the tracing::warn
message to report the applied limit rather than max_chars.max(...).
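Note that the proposed diff uses str::floor_char_boundary, which is still unstable; a stable-Rust sketch of the same capped truncation, with the 80k ceiling taken from the review text, looks like this:

```rust
// Global ceiling per the review; the real constant lives in engine.rs.
const MAX_SUMMARY_CHARS: usize = 80_000;

// Sketch: clamp to the per-node budget, then back off to a char boundary
// manually (floor_char_boundary is nightly-only).
fn truncate_to_limit(response: &str, max_chars: usize) -> &str {
    let limit = max_chars.min(MAX_SUMMARY_CHARS);
    if response.len() <= limit {
        return response;
    }
    let mut end = limit;
    while end > 0 && !response.is_char_boundary(end) {
        end -= 1;
    }
    &response[..end]
}
```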
🧹 Nitpick comments (1)
src/openhuman/channels/runtime/startup.rs (1)

429-432: Use event_bus::subscribe_global() for startup registration.

This works because bus is the global singleton here, but new startup subscribers are supposed to register through the global helper rather than the instance method. Switching this one to subscribe_global() makes the tree-summarizer wiring match the repo convention and keeps new startup code from depending on the local bus handle.

Based on learnings: Register subscribers in startup (e.g., channels/runtime/startup.rs) via the singleton event_bus::subscribe_global() function

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/channels/runtime/startup.rs` around lines 429 - 432, Replace
the instance subscription call using the local bus handle with the global
helper: instead of calling
bus.subscribe(Arc::new(crate::openhuman::tree_summarizer::bus::TreeSummarizerEventSubscriber::new())),
register the subscriber via event_bus::subscribe_global(Arc::new(...)). Locate
the use of TreeSummarizerEventSubscriber and swap the subscribe invocation to
event_bus::subscribe_global so startup registration follows the repo convention
and stops depending on the local bus variable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/openhuman/tree_summarizer/engine.rs`:
- Around line 476-480: The code currently silently skips parse failures for hour
files causing malformed leaves to be lost; change the handling around
parse_node_markdown_pub so that instead of ignoring Err, you return an error
(propagate) including the node_id context (e.g., format!("failed to parse hour
leaf {}: {}", node_id, err)) — replace the if let Ok(...) block in the loop that
fills leaves with a match or ?-style propagation so parse_node_markdown_pub
errors abort the rebuild_tree, ensuring hour_leaves are not written when a file
is unreadable.
- Around line 41-45: buffer_drain() currently reads-and-deletes buffer files
causing permanent loss if subsequent writes fail; change the flow so
buffer_drain only reads non-destructively (or provides a read_without_delete
API) and delete each buffer file or batch only after its corresponding hour leaf
is durably written (i.e., after the hour-leaf write/flush/rename succeeds in the
tree-summarizer logic). In rebuild_tree(), stop creating the backup with
with_file_name() under the tree directory (which delete_tree() later removes);
instead move/rename the buffer directory to a sibling path outside the tree
(e.g., parent.join("buffer_backup")) before calling delete_tree(), so the backup
survives deletion; update references to buffer_drain, rebuild_tree, delete_tree,
and the with_file_name() backup creation to implement these two safeguards
(read-without-delete + post-write delete, and moving backup outside the tree).
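The error-propagation pattern the first prompt asks for can be sketched generically. parse_node_markdown_pub is the project's parser; its signature is not visible here, so a stand-in parser is used below purely to show the `?`-with-context shape:

```rust
// Sketch: propagate hour-leaf parse failures with node context instead of
// silently skipping them. `parse_leaf` stands in for parse_node_markdown_pub.
fn collect_leaves(raw_files: &[(String, String)]) -> Result<Vec<String>, String> {
    let mut leaves = Vec::new();
    for (node_id, raw) in raw_files {
        let parsed = parse_leaf(raw)
            .map_err(|e| format!("failed to parse hour leaf {node_id}: {e}"))?;
        leaves.push(parsed);
    }
    Ok(leaves)
}

// Illustrative stand-in for the real markdown parser.
fn parse_leaf(raw: &str) -> Result<String, String> {
    if raw.trim().is_empty() {
        Err("empty markdown".into())
    } else {
        Ok(raw.trim().to_string())
    }
}
```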

---

Duplicate comments:
In `@src/openhuman/tree_summarizer/engine.rs`:
- Around line 173-196: The buffer backup is being created inside the tree
directory so store::delete_tree(namespace) removes it; change the backup
location to be outside the tree before calling store::delete_tree() by computing
buffer_backup as a path outside the tree (e.g. sibling or temp directory)
instead of buffer_path.with_file_name("buffer_backup"), then perform the rename
from store::buffer_dir(config, namespace) to that external buffer_backup, call
store::delete_tree(config, namespace), and finally rename back from that
external buffer_backup to store::buffer_dir(config, namespace) in the restore
branch; update references to buffer_path, buffer_backup and the restore logic in
engine.rs accordingly.
- Around line 365-379: The per-node cap is effectively disabled because the code
uses max_chars.max(MAX_SUMMARY_CHARS) (which yields the global
MAX_SUMMARY_CHARS) when checking/truncating the LLM response for
node_id/level_name; change the logic to compute an enforced limit as the minimum
of max_chars and MAX_SUMMARY_CHARS (e.g., let limit = std::cmp::min(max_chars,
MAX_SUMMARY_CHARS)) and then compare response.len() against that limit, use
response.floor_char_boundary(limit) for truncation, and update the tracing::warn
message to report the applied limit rather than max_chars.max(...).

---

Nitpick comments:
In `@src/openhuman/channels/runtime/startup.rs`:
- Around line 429-432: Replace the instance subscription call using the local
bus handle with the global helper: instead of calling
bus.subscribe(Arc::new(crate::openhuman::tree_summarizer::bus::TreeSummarizerEventSubscriber::new())),
register the subscriber via event_bus::subscribe_global(Arc::new(...)). Locate
the use of TreeSummarizerEventSubscriber and swap the subscribe invocation to
event_bus::subscribe_global so startup registration follows the repo convention
and stops depending on the local bus variable.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b2d48290-f2b1-419a-a9c5-a7419a058d6f

📥 Commits

Reviewing files that changed from the base of the PR and between d595c37 and 40862fc.

📒 Files selected for processing (3)
  • src/openhuman/channels/runtime/startup.rs
  • src/openhuman/tree_summarizer/bus.rs
  • src/openhuman/tree_summarizer/engine.rs

…n process

- Replaced the buffer draining mechanism with a non-destructive read approach, allowing for safer data handling during summarization.
- Introduced a new `buffer_delete` function to explicitly manage the deletion of buffer entries after successful processing.
- Updated the `run_summarization` function to reflect these changes, ensuring that buffer entries are only deleted after durable writes are confirmed.
- Improved the backup process for the buffer directory during tree rebuilds, ensuring it is preserved outside the tree structure.

These modifications aim to improve data integrity and clarity in the summarization workflow.
…dling

- Updated the `parse_node_markdown` function to trim trailing whitespace from the body after splitting frontmatter, enhancing data cleanliness.
- Modified test cases to use specific timestamps instead of the current time, ensuring consistent and predictable test results.
- Adjusted assertions in tests to reflect the new timestamp-based ordering of entries.

These changes aim to improve the robustness of markdown parsing and the reliability of test outcomes in the tree summarization module.

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (3)
src/core/tree_summarizer_cli.rs (2)

338-343: ⚠️ Potential issue | 🟠 Major

Propagate config load/init errors instead of defaulting.

Falling back to Config::default() on any load_or_init() error masks parse/I/O/permission problems and can send the CLI to a different workspace than the user configured.

🩹 Proposed fix
 async fn load_config() -> Result<crate::openhuman::config::Config> {
     let mut config = crate::openhuman::config::Config::load_or_init()
         .await
-        .unwrap_or_default();
+        .map_err(|e| anyhow::anyhow!("failed to load/init config: {e}"))?;
     config.apply_env_overrides();
     Ok(config)
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/core/tree_summarizer_cli.rs` around lines 338 - 343, The current
load_config function swallows load/init errors by using unwrap_or_default on
crate::openhuman::config::Config::load_or_init(), which can hide
parse/I/O/permission problems; change that call to propagate errors (use ? on
the awaited Result from Config::load_or_init()) so the function returns the
actual error instead of falling back to Config::default(), keep calling
config.apply_env_overrides() afterwards and return Ok(config) as before.

45-99: ⚠️ Potential issue | 🟠 Major

Fail fast on unknown flags and stray positional args.

parse_opts() still pushes every unrecognized token into rest, and each subcommand then reads only the first positional argument. Typos like --verbsoe, duplicate namespaces, or a conflicting positional node_id can silently run with the wrong inputs instead of erroring.

Also applies to: 124-125, 185-185, 230-234, 270-270, 305-305

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/core/tree_summarizer_cli.rs` around lines 45 - 99, parse_opts currently
accepts any unknown token into rest which hides typos and stray args; change it
to fail fast: inside parse_opts (function parse_opts and the CliOpts handling)
update the default match arm so that if args[i] starts with '-' you return
Err(anyhow::anyhow!("unknown flag: {}", args[i])); otherwise push positional
args into rest but after parsing validate rest.len() (e.g. ensure at most one
positional or whatever the CLI expects) and return Err(...) if there are
unexpected extra positional arguments; apply the same change pattern to the
other parse functions/locations referenced (lines around 124-125, 185, 230-234,
270, 305) so unknown flags and extra positional args error out instead of being
silently accepted.
src/openhuman/tree_summarizer/store.rs (1)

24-30: ⚠️ Potential issue | 🟠 Major

Namespace-to-path mapping is still lossy.

sanitize() still collapses distinct namespaces onto the same on-disk folder (a/b, a:b, and a.b all map to a_b), and validate_namespace() lets those inputs through. That can merge unrelated trees and make status/delete operate on the wrong namespace. Reject reserved characters up front or switch to a reversible encoding before joining the path.

Also applies to: 43-65

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/tree_summarizer/store.rs` around lines 24 - 30, The
namespace-to-path mapping is lossy because sanitize() collapses distinct inputs
(e.g. '/', ':', '.') into the same folder; update validate_namespace() to
explicitly reject reserved/path characters (at minimum '/', '\\', ':', '.', and
any OS path separator or null) so those inputs are refused rather than
sanitized, and add the same validation checks where namespaces are accepted;
keep tree_dir() (and the other namespace-to-path helpers in the same file around
lines 43-65) unchanged in path construction but rely on the strengthened
validate_namespace() to prevent ambiguous, colliding namespace values.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/openhuman/tree_summarizer/store.rs`:
- Around line 259-319: The current nested loops over std::fs::read_dir use
into_iter().flatten().flatten() which silently swallows IO and DirEntry errors;
change each loop (the one iterating over base, year_dir, month_dir, day_dir, and
hour entries) to explicitly iterate over the Result returned by
std::fs::read_dir (for entry_res in std::fs::read_dir(&some_dir) { let entry =
entry_res?; ... }) and similarly handle file_type() calls using ? where they
return io::Result so filesystem errors propagate instead of being ignored;
update the blocks that reference name, month_entry.file_name(),
day_entry.file_name(), hname, timestamp_from_hour_path, oldest, newest and depth
to use the unwrapped entry variables after the ? propagation so any
permission/corruption error bubbles up as an Err from this function rather than
producing a partial TreeStatus.
- Around line 370-376: buffer_write currently writes metadata into a YAML
frontmatter block but buffer_drain blindly strips the first frontmatter it
finds, causing store-owned metadata to be discarded and also removing legitimate
user frontmatter from ingested files; update buffer_drain (and the shared
frontmatter helper used around buffer_write/buffer_drain) to detect and parse a
store-specific marker (e.g., a unique key like "op_store_metadata" or similar)
in the frontmatter, return that parsed metadata separately instead of removing
arbitrary frontmatter, and only strip the frontmatter when it matches the
store-owned marker; keep standard Markdown frontmatter intact for files that
don’t include the store marker.
- Around line 421-430: The current loop over entries that calls
std::fs::remove_file(path) can abort partway and drop already-read buffers;
change this to a two-phase (staging) delete: first atomically move/rename each
file to a temporary "to-delete" location (use std::fs::rename into a temp dir or
path.with_extension(".deleting") inside the same filesystem) in the same loop
that currently iterates entries, then in a second loop remove_file() only from
the temp locations; ensure removal attempts continue for all staged files and
aggregate or return a combined error instead of early-returning on the first
failure. This touches the loop variable entries and the remove_file calls shown
in the diff — implement atomic rename for staging and robust final deletion over
the staged names.

---

Duplicate comments:
In `@src/core/tree_summarizer_cli.rs`:
- Around line 338-343: The current load_config function swallows load/init
errors by using unwrap_or_default on
crate::openhuman::config::Config::load_or_init(), which can hide
parse/I/O/permission problems; change that call to propagate errors (use ? on
the awaited Result from Config::load_or_init()) so the function returns the
actual error instead of falling back to Config::default(), keep calling
config.apply_env_overrides() afterwards and return Ok(config) as before.
- Around line 45-99: parse_opts currently accepts any unknown token into rest
which hides typos and stray args; change it to fail fast: inside parse_opts
(function parse_opts and the CliOpts handling) update the default match arm so
that if args[i] starts with '-' you return Err(anyhow::anyhow!("unknown flag:
{}", args[i])); otherwise push positional args into rest but after parsing
validate rest.len() (e.g. ensure at most one positional or whatever the CLI
expects) and return Err(...) if there are unexpected extra positional arguments;
apply the same change pattern to the other parse functions/locations referenced
(lines around 124-125, 185, 230-234, 270, 305) so unknown flags and extra
positional args error out instead of being silently accepted.

In `@src/openhuman/tree_summarizer/store.rs`:
- Around line 24-30: The namespace-to-path mapping is lossy because sanitize()
collapses distinct inputs (e.g. '/', ':', '.') into the same folder; update
validate_namespace() to explicitly reject reserved/path characters (at minimum
'/', '\\', ':', '.', and any OS path separator or null) so those inputs are
refused rather than sanitized, and add the same validation checks where
namespaces are accepted; keep tree_dir() (and the other namespace-to-path
helpers in the same file around lines 43-65) unchanged in path construction but
rely on the strengthened validate_namespace() to prevent ambiguous, colliding
namespace values.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2f21ba8a-b832-4e58-8119-1d39ba535846

📥 Commits

Reviewing files that changed from the base of the PR and between 40862fc and 9799388.

⛔ Files ignored due to path filters (1)
  • app/src-tauri/Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (4)
  • src/core/tree_summarizer_cli.rs
  • src/openhuman/tree_summarizer/ops.rs
  • src/openhuman/tree_summarizer/schemas.rs
  • src/openhuman/tree_summarizer/store.rs
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/openhuman/tree_summarizer/schemas.rs
  • src/openhuman/tree_summarizer/ops.rs

Comment on lines +259 to +319
if base.exists() {
for entry in std::fs::read_dir(&base).into_iter().flatten().flatten() {
let name = entry.file_name().to_string_lossy().to_string();
if entry.file_type().map(|t| t.is_dir()).unwrap_or(false) && name.len() == 4 {
if depth < 2 {
depth = 2;
}
// Scan months, days, hours inside
let year_dir = entry.path();
for month_entry in std::fs::read_dir(&year_dir).into_iter().flatten().flatten() {
if month_entry.file_type().map(|t| t.is_dir()).unwrap_or(false) {
if depth < 3 {
depth = 3;
}
let month_dir = month_entry.path();
for day_entry in std::fs::read_dir(&month_dir)
.into_iter()
.flatten()
.flatten()
{
if day_entry.file_type().map(|t| t.is_dir()).unwrap_or(false) {
if depth < 4 {
depth = 4;
}
// Check for hour .md files
let day_dir = day_entry.path();
for hour_entry in
std::fs::read_dir(&day_dir).into_iter().flatten().flatten()
{
let hname =
hour_entry.file_name().to_string_lossy().to_string();
if hname.ends_with(".md") && hname != "summary.md" {
if depth < 5 {
depth = 5;
}
// Try to parse timestamp from path
if let Some(ts) = timestamp_from_hour_path(
&name,
&month_entry.file_name().to_string_lossy().to_string(),
&day_entry.file_name().to_string_lossy().to_string(),
&hname,
) {
match &oldest {
None => oldest = Some(ts),
Some(o) if ts < *o => oldest = Some(ts),
_ => {}
}
match &newest {
None => newest = Some(ts),
Some(n) if ts > *n => newest = Some(ts),
_ => {}
}
}
}
}
}
}
}
}
}
}

⚠️ Potential issue | 🟠 Major

Status scanning shouldn't silently suppress filesystem errors.

The nested into_iter().flatten().flatten() loops discard read_dir and DirEntry errors and then return a partial TreeStatus as success. A permission problem or corrupt entry will look like a smaller tree instead of surfacing an actual failure. Use explicit iteration with ? here.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/tree_summarizer/store.rs` around lines 259 - 319, The current
nested loops over std::fs::read_dir use into_iter().flatten().flatten() which
silently swallows IO and DirEntry errors; change each loop (the one iterating
over base, year_dir, month_dir, day_dir, and hour entries) to explicitly iterate
over the Result returned by std::fs::read_dir (for entry_res in
std::fs::read_dir(&some_dir) { let entry = entry_res?; ... }) and similarly
handle file_type() calls using ? where they return io::Result so filesystem
errors propagate instead of being ignored; update the blocks that reference
name, month_entry.file_name(), day_entry.file_name(), hname,
timestamp_from_hour_path, oldest, newest and depth to use the unwrapped entry
variables after the ? propagation so any permission/corruption error bubbles up
as an Err from this function rather than producing a partial TreeStatus.
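One level of the scan rewritten with explicit propagation looks like this; the four-character year check mirrors the code above, while the function name and return type are illustrative:

```rust
use std::io;
use std::path::Path;

// Sketch: any read_dir / DirEntry / file_type failure becomes an Err
// instead of being flattened away into a silently smaller tree.
fn year_dirs(base: &Path) -> io::Result<Vec<String>> {
    let mut years = Vec::new();
    for entry_res in std::fs::read_dir(base)? {
        let entry = entry_res?; // DirEntry errors propagate too
        if entry.file_type()?.is_dir() {
            let name = entry.file_name().to_string_lossy().to_string();
            if name.len() == 4 {
                years.push(name);
            }
        }
    }
    Ok(years)
}
```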

Comment on lines +370 to +376
// If metadata is provided, write it as a YAML frontmatter block
let file_content = if let Some(meta) = metadata {
let meta_str = serde_json::to_string(meta).unwrap_or_default();
format!("---\nmetadata: {meta_str}\n---\n\n{content}")
} else {
content.to_string()
};

⚠️ Potential issue | 🟠 Major

buffer_drain() treats any leading frontmatter as disposable metadata.

buffer_write() stores metadata only in the staged file's frontmatter, but buffer_drain() returns just the stripped body, so that metadata never reaches callers. The same helper also strips the first frontmatter block from ordinary Markdown content when the user ingests a file that already starts with ---. Parse a store-owned marker and return metadata separately instead of blanket-stripping the first block.

Also applies to: 416-418, 442-455

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/tree_summarizer/store.rs` around lines 370 - 376, buffer_write
currently writes metadata into a YAML frontmatter block but buffer_drain blindly
strips the first frontmatter it finds, causing store-owned metadata to be
discarded and also removing legitimate user frontmatter from ingested files;
update buffer_drain (and the shared frontmatter helper used around
buffer_write/buffer_drain) to detect and parse a store-specific marker (e.g., a
unique key like "op_store_metadata" or similar) in the frontmatter, return that
parsed metadata separately instead of removing arbitrary frontmatter, and only
strip the frontmatter when it matches the store-owned marker; keep standard
Markdown frontmatter intact for files that don’t include the store marker.
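The store-owned-marker idea can be sketched as a small splitter. The key name "op_store_metadata" is the illustrative one from the prompt, not a key the project necessarily uses:

```rust
// Sketch: only strip frontmatter that the store itself wrote (keyed on a
// store-owned marker), and leave ordinary user frontmatter untouched.
fn split_store_frontmatter(raw: &str) -> (Option<String>, &str) {
    if let Some(rest) = raw.strip_prefix("---\n") {
        if let Some((block, body)) = rest.split_once("\n---\n") {
            if let Some(meta) = block.strip_prefix("op_store_metadata: ") {
                return (Some(meta.to_string()), body.trim_start_matches('\n'));
            }
        }
    }
    (None, raw) // not store-owned: keep the frontmatter intact
}
```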

Comment on lines +421 to +430
// Delete after successful reads — propagate errors to prevent duplicates
for (name, path) in &entries {
std::fs::remove_file(path).with_context(|| {
format!(
"failed to remove buffer entry '{}' at {}",
name,
path.display()
)
})?;
}

⚠️ Potential issue | 🟠 Major

Abort-on-first-delete can drop already-drained entries.

If one remove_file() call fails, earlier files in the loop have already been deleted but their contents are never returned because the function exits with Err. A single cleanup failure can therefore lose buffered data. This drain needs an all-or-nothing staging step instead of deleting in place.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/tree_summarizer/store.rs` around lines 421 - 430, The current
loop over entries that calls std::fs::remove_file(path) can abort partway and
drop already-read buffers; change this to a two-phase (staging) delete: first
atomically move/rename each file to a temporary "to-delete" location (use
std::fs::rename into a temp dir or path.with_extension(".deleting") inside the
same filesystem) in the same loop that currently iterates entries, then in a
second loop remove_file() only from the temp locations; ensure removal attempts
continue for all staged files and aggregate or return a combined error instead
of early-returning on the first failure. This touches the loop variable entries
and the remove_file calls shown in the diff — implement atomic rename for
staging and robust final deletion over the staged names.
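The two-phase shape the prompt describes can be sketched as follows; the ".deleting" extension is illustrative, and a real implementation would also need to handle name collisions between staged files:

```rust
use std::io;
use std::path::PathBuf;

// Sketch: stage every file with an atomic rename first, then remove the
// staged names, collecting failures instead of aborting on the first one.
fn delete_staged(paths: &[PathBuf]) -> io::Result<()> {
    // Phase 1: rename everything on the same filesystem; a failure here
    // still leaves every file's contents on disk under some name.
    let mut staged = Vec::new();
    for path in paths {
        let tmp = path.with_extension("deleting");
        std::fs::rename(path, &tmp)?;
        staged.push(tmp);
    }
    // Phase 2: best-effort removal over all staged names.
    let mut failures = Vec::new();
    for tmp in &staged {
        if let Err(e) = std::fs::remove_file(tmp) {
            failures.push(format!("{}: {e}", tmp.display()));
        }
    }
    if failures.is_empty() {
        Ok(())
    } else {
        Err(io::Error::new(io::ErrorKind::Other, failures.join("; ")))
    }
}
```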
