Skip to content

feat(v0.7.2): data-flow correctness — fix the 27K→35 dashboard gap#175

Merged
Gradata merged 1 commit into
mainfrom
v0.7.2-data-flow-fixes
May 6, 2026
Merged

feat(v0.7.2): data-flow correctness — fix the 27K→35 dashboard gap#175
Gradata merged 1 commit into
mainfrom
v0.7.2-data-flow-fixes

Conversation

@Gradata

@Gradata Gradata commented May 6, 2026

Copy link
Copy Markdown
Owner

v0.7.2 — Data-flow correctness

Codex deep audit found 12 bugs in the SDK→cloud→dashboard event flow. This PR ships SDK-side fixes for Bugs 1, 2, 7, 8, 9, 10, 11. Companion gradata-cloud PR fixes the projector + adds a backfill script.

The headline finding

Oliver suspected "the data flow is corrupt or not up to par." He was right. A 3-vendor council + dataflow trace + codex audit converged on one root cause:

27,000 events were ingested by the cloud events table, but the dashboard corrections table only grew by 35 rows.

The cause was a stack of compounding bugs:

  1. SDK default sync mode was metrics_only — events never even reached /sync.
  2. When events DID reach /sync, they were emitted with session=0 (sentinel collision).
  3. Cloud projector deduped corrections by (session, description) — 27K events with identical (session=0, description="(no description)") collapsed into one row per real session.

SDK fixes in this PR

P0 — Default sync now includes the corpus (Bug 1)

src/gradata/_core.py:1340_cloud_sync_session was silently using metrics_only. Default is now full. Clear WARN log when metrics_only is set so users know dashboard will be empty. New CLI: gradata sync --full flips the mode and immediately backfills.

P0 — Stop emitting events with session=0 (Bug 2)

src/gradata/brain.py:Brain.emit() and src/gradata/_events.py — session passed as None propagates to write-time, where the active session is detected from brain state. Session is only coerced to int at serialization, never used as 0 sentinel.

P1 — Populate event identity at emit time (Bug 7)

src/gradata/_events.py now writes event_id (ULID), device_id (dev_<sha256>), content_hash (sha256) at emit. Migration 004 drops the legacy unique on (tenant_id, ts, type, source) and adds unique (brain_id, event_id). Two events with identical (ts, type, source) but different data now coexist.

P1 — Replace false-green identity test (Bug 8)

tests/test_migration_002_event_identity.py was asserting event_id IS NULL, pinning the broken behavior. Replaced with assertions that ULID format is correct, device_id is stable across calls, content_hash is deterministic.

P1 — BrainConfig propagation (Bug 9)

src/gradata/_config.py adds BrainConfig.load(brain_dir). Used by inject_brain_rules.py, agent_precontext.py, jit_inject.py, middleware/_core.py. max_recall_tokens and ranker now respected across all 4 injection paths. New tests/test_brain_config_propagation.py covers all 4 sites.

P1 — Hermes adapter undeclared YAML dep (Bug 10)

src/gradata/hooks/adapters/hermes.py had a hard import on pyyaml (in extras, not core). Replaced with a small in-tree YAML reader/writer for the Hermes config schema. No more ImportError when user runs gradata install --agent hermes.

P2 — README/CLI consistency (Bug 11)

gradata init --no-interactive flag now exists (was documented but missing).

Test impact

Before: 4169 passing, 5 failures (event identity / session / projection / hermes)
After:  4174 passing / 2 skipped / 5 deselected

ruff check, ruff format --check, pyright src/ all clean.

What's NOT in this PR (deliberate)

  • Cloud projector fixes — separate PR in Gradata/gradata-cloud repo: fixes the (session, description) upsert key, adds migration 021, adds reproject script.
  • Marketing/branding fixes (Apache vs AGPL, gradata.com vs gradata.ai) — also in the cloud repo PR.
  • brain.prove() holdout validation — flagged as the next bet-the-company item per council. Separate PR after we land the data-flow correctness baseline.

Council verdict that drove this prioritization

3-vendor council --full (7 lenses, 21 minutes) on cloud pivot: Convergence-as-a-Service is the wedge ($99/mo per brain × 10 customers = $1k MRR by demo day). The pitch requires a dashboard that accurately reflects events. This PR is the precondition.

Co-authored-by: Codex noreply@openai.com

Codex deep audit found 12 bugs in event flow; this PR ships fixes for
the SDK side (Bugs 1, 2, 7, 8, 9, 10, 11). Companion cloud PR fixes
projector. After both land, the dashboard accurately reflects events.

P0 — DEFAULT SYNC NOW INCLUDES THE CORPUS (Bug 1)
- _core.py:_cloud_sync_session now defaults sync_mode='full', not metrics_only
- New CLI: `gradata sync --full` flips mode + immediately backfills events
- Clear WARN log when sync_mode='metrics_only' so users know dashboard will be empty

P0 — STOP EMITTING events WITH session=0 (Bug 2)
- Brain.emit() preserves None when caller doesn't provide a session
- _events.emit() detects active session from brain state at write time
- Session only coerced to int at serialization, never used as 0 sentinel

P1 — POPULATE EVENT IDENTITY AT EMIT TIME (Bug 7)
- _events.py now writes event_id (ULID), device_id (dev_<sha256>), content_hash (sha256)
- Migration 004: drops legacy unique on (tenant_id, ts, type, source); adds unique (brain_id, event_id)
- Two events with identical (ts, type, source) but different data now coexist
- Replaces the false-green test_new_emit_leaves_identity_columns_null_for_now (Bug 8)
- New regression test asserts ULID format + uniqueness + content_hash determinism

P1 — BRAINCONFIG PROPAGATION (Bug 9)
- New BrainConfig.load(brain_dir) helper in _config.py
- Used by inject_brain_rules.py, agent_precontext.py, jit_inject.py, middleware/_core.py
- max_recall_tokens + ranker now respected across all 4 injection paths
- New tests/test_brain_config_propagation.py covers all 4 sites

P1 — REMOVE UNDECLARED PYYAML DEP (Bug 10)
- hooks/adapters/hermes.py: small in-tree YAML reader/writer for Hermes config
- No more sudden ImportError when user runs gradata install --agent hermes

P2 — README/CLI CONSISTENCY (Bug 11)
- gradata init now exposes --no-interactive flag (matches docs)
- gradata audit smoke-tests pass

Tests: 4174 passed / 2 skipped / 5 deselected (sandbox-only) on socket-free run.
ruff check + ruff format --check + pyright src/ all clean.

Council (3-vendor full) verdict on the wedge: Convergence-as-a-Service. This
PR is the precondition — the dashboard must accurately reflect data before
we can pitch convergence proof to YC.

Co-authored-by: Codex <noreply@openai.com>

@greptile-apps greptile-apps Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

@coderabbitai

coderabbitai Bot commented May 6, 2026

Copy link
Copy Markdown
📝 Walkthrough

Summary

  • Event identity at emit: Events now populate event_id (ULID), device_id (dev_), and content_hash (sha256) at emission time; migration 004 enforces unique constraint on (brain_id, event_id)

  • Session handling: Stop emitting events with session=0; preserve None session from callers and resolve active session at write time; new public API get_current_session() -> int | None

  • BrainConfig propagation: New BrainConfig.load() classmethod; propagated to four injection paths (agent_precontext, inject_brain_rules, jit_inject, middleware/_core) to respect max_recall_tokens and ranker settings

  • Default sync behavior: Default sync now includes corpus (sync_mode='full'); new CLI gradata sync --full for immediate backfill; CloudConfig adds sync_mode: str = "full" field

  • Remove PyYAML dependency: Hermes adapter now uses in-tree YAML reader/writer; removes undeclared pyyaml dependency

  • New public APIs:

    • emit(..., event_id: str | None = None) parameter added
    • get_current_session() -> int | None function
    • BrainConfig.load(brain_dir) classmethod
    • gradata sync CLI subcommand with --full option
  • Breaking changes:

    • _detect_session() return type changed from int to int | None
    • CallFeatures dataclass expanded with ~15 new fields
    • SalesProfileReport fields reorganized (removed deals_with_multithread, added total_outreach, reply_rate, positive_reply_rate)
  • Test updates: Identity tests now validate ULID format, stable device_id, deterministic content_hash; new test for BrainConfig propagation; session-aware event emit test added

  • Test results: 4174 passing, 2 skipped, 5 deselected (vs. prior 4169 passing + 5 failures); ruff and pyright checks clean

Walkthrough

This PR introduces event identity tracking and cloud synchronization enhancements, implements configuration-driven prompt sizing across rule-injection hooks, expands domain-specific analytics for call and sales profiles, and applies code quality improvements across the codebase. The core refactor centers on persistent event IDs, device tracking, deterministic content hashing, and cloud sync mode-awareness, with supporting session handling refinements.

Changes

Event Identity, Cloud Sync & Session Management

Layer / File(s) Summary
Data Shape & Schema
Gradata/src/gradata/_events.py
Events table gains brain_id, event_id, device_id, content_hash, correction_chain_id, origin_agent, schema_version columns. New helper functions _canonical_content_hash() and _legacy_event_id() compute deterministic identifiers. Session coercion via _coerce_session_for_storage() ensures valid session values.
Event Persistence Migration
Gradata/src/gradata/_migrations/004_event_identity_unique_key.py
New migration script adds/backfills identity columns, drops legacy dedup indexes, and creates unique index on (brain_id, event_id) for idempotent event storage.
Device ID Computation
Gradata/src/gradata/_migrations/device_uuid.py
Device ID generation now derives from brain identity (via _brain_identity()) and hostname, ensuring per-brain per-machine uniqueness.
Core Emission & Session Flow
Gradata/src/gradata/_events.py, Gradata/src/gradata/brain.py
emit() signature adds event_id and ts parameters; populates brain_id, event_id, device_id, content_hash on every emission. New get_current_session() returns optional session; _detect_session() return type becomes int | None with fallback to session 1 in emit. Brain.session property and track_rule() coalesce None to 0.
Context & Session Detection
Gradata/src/gradata/_context_packet.py, Gradata/src/gradata/brain.py
Session defaulting in build_packet changed from runtime detection to session or 0; Brain.emit() now forwards session as-is (allowing None).
Transcript Mining Event IDs
Gradata/src/gradata/_mine_transcripts.py
New _stable_event_id() generates deterministic replay IDs for mined events; integrated into _emit() call for idempotent processing.
Cloud Sync Integration
Gradata/src/gradata/_core.py
_cloud_sync_session() refactored: replaced environment-based API key lookup with layered cloud-credentials approach via gradata.cloud._credentials. Introduced endpoint resolution and sync-mode-aware branching (full sync vs. metrics-only). Maintains non-fatal error handling for cloud operations.
Cloud Configuration
Gradata/src/gradata/cloud/sync.py
CloudConfig dataclass adds sync_mode: str = "full" field; load_config() reads and persists sync mode.
Cloud Sync CLI
Gradata/src/gradata/cli.py
New cmd_sync() function and sync subcommand enable local-to-cloud event synchronization with --full backfill option.
Tests & Validation
Gradata/tests/test_migration_002_event_identity.py, Gradata/tests/test_brain_events.py
Test suite updated to verify event_id/device_id/content_hash population on emit; new test confirms active session is used when none is explicitly passed.

Configuration-Driven Prompt Sizing

Layer / File(s) Summary
Configuration Loading
Gradata/src/gradata/_config.py
New BrainConfig.load() classmethod delegates to _load_brain_config(), enabling runtime config resolution from brain directory.
Hook Integration Points
Gradata/src/gradata/hooks/agent_precontext.py, Gradata/src/gradata/hooks/inject_brain_rules.py, Gradata/src/gradata/hooks/jit_inject.py
Each hook lazily imports BrainConfig, derives max_prompt_chars from brain_cfg.max_recall_tokens * 4 (or defaults to 2000 * 4), and truncates generated rules/prompts to enforce the character limit.
Middleware Rule Building
Gradata/src/gradata/middleware/_core.py
RuleSource.__init__() initializes max_prompt_chars (default 8000) and optionally loads from BrainConfig. build_brain_rules_block() now truncates output to source.max_prompt_chars.
Tests & Propagation Validation
Gradata/tests/test_brain_config_propagation.py
New test verifies that max_recall_tokens configured in brain-config.json propagates through all injection paths (agent_precontext, inject_brain_rules, jit_inject, middleware) and outputs remain within configured character limits.

Domain Profile Analytics Enhancements

Layer / File(s) Summary
Call Profile Data Model
Gradata/examples/domain-profiles/call_profile.py
CallFeatures expanded with 20+ new fields (call_type, duration, word counts, talk_ratio, turn metrics, question breakdowns, objection/commitment counts, discovery minutes, etc.); added to_dict() / from_dict() serialization. New public CallOutcome dataclass pairs features with outcome/stage/notes. CallProfile adds win_features and loss_features summaries.
Call Profile Analysis Logic
Gradata/examples/domain-profiles/call_profile.py
Pattern discovery extended with granular graduation patterns (talk ratio deltas, objection-response efficiency, monologue length). PostCallAudit restructured to emit per-check dicts with rule/passed/actual/target/severity; post_call_audit() broadened to cover talk ratio, pain questions, total questions, stories, discovery depth, close attempts, commitment specificity, monologue length with delta-based and target-based checks.
Sales Profile Data Model
Gradata/examples/domain-profiles/sales_profile.py
OutreachEvent refactored to include prospect, replied, reply_sentiment fields in dataclass. SalesProfileReport removes deals_with_multithread, adds total_outreach, reply_rate, positive_reply_rate.
Sales Profile Computation
Gradata/examples/domain-profiles/sales_profile.py
Day-of-week reply rate reformatted to single-line comprehension. Guards added for zero-divide in touches_before and avg_contacts_per_deal. Per-prospect deviation calc uses zip(..., strict=False) for length-mismatch tolerance.

Code Quality & Formatting

Layer / File(s) Summary
Benchmark Script Refactoring
Gradata/bench/pmr_100.py
Imports reorganized and expanded (added UTC alias, defaultdict, dataclass imports); removed unused public type imports. Function signatures (run_one_session, run_benchmark) reformatted to multiline. Recall calculation for recall_at_1 uses explicit multiline any() block. CLI argument definitions extended to multiline blocks with --quick flag support. JSON output now uses UTC alias for timestamp.
Example Files Formatting
Gradata/examples/basic_usage.py, Gradata/examples/with_claude_code.py, Gradata/examples/with_openai.py
Blank lines added before/after import blocks and module docstrings for PEP 8 consistency.
Hermes Hook Adapter
Gradata/src/gradata/hooks/adapters/hermes.py
Replaced PyYAML dependency with in-house YAML-like parser/dumper (_parse_simple_yaml(), _dump_simple_yaml(), _parse_scalar(), _format_scalar()). install() updated to use atomic writes and the new serialization utilities while preserving logic and error handling.
Logging Formatting
Gradata/scripts/migrate_legacy_scopes.py
ImportError warning in load_domain_set changed from multiline to single-line message.

Sequence Diagram(s)

sequenceDiagram
    participant App as Application
    participant Brain as Brain
    participant Events as Events Layer
    participant DB as SQLite DB
    participant Cloud as Gradata Cloud

    App->>Brain: emit(event_type, source, data, session=None)
    activate Brain
    Brain->>Events: emit(...)
    activate Events
    Events->>Events: _detect_session() → int|None
    alt No session detected
        Events->>Events: session = 1 (default)
    end
    Events->>Events: tenant_id = tenant_for(brain_dir)
    Events->>Events: event_id = new_ulid()
    Events->>Events: device_id = get_or_create_device_id(brain_dir)
    Events->>Events: content_hash = _canonical_content_hash(...)
    Events->>DB: INSERT event with brain_id, event_id, device_id, content_hash
    activate DB
    DB->>DB: Check unique (brain_id, event_id)
    alt Idempotent match found
        DB-->>Events: Duplicate detected
    else New event
        DB-->>Events: Row inserted
    end
    deactivate DB
    Events->>Events: Emit JSONL with redacted data
    deactivate Events
    Events-->>Brain: Return event dict with identity fields
    deactivate Brain
    Brain-->>App: Event emitted with identity

    par Cloud Sync
        App->>Brain: sync (via CLI)
        activate Brain
        Brain->>Cloud: Sync events using event_id as dedup key
        activate Cloud
        Cloud->>Cloud: Filter by (brain_id, event_id) uniqueness
        Cloud->>Cloud: Resolve sync_mode from CloudConfig
        alt sync_mode == "full"
            Cloud->>Cloud: Sync all events + corrections
        else sync_mode == "metrics_only"
            Cloud->>Cloud: Report metrics only
        end
        Cloud-->>Brain: Sync results
        deactivate Cloud
        deactivate Brain
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

  • Gradata/gradata#162: Both PRs introduce and refine event identity and deterministic content hashing for cloud sync deduplication and backfill workflows.
  • Gradata/gradata#155: Both PRs refactor the PMR-100 bench script (run_one_session, run_benchmark signatures, CLI flags, and SessionResult dataclass).
  • Gradata/gradata#136: Both PRs modify rule-injection hooks (agent_precontext, inject_brain_rules, jit_inject) to implement prompt-size limiting and truncation behavior driven by brain configuration.

Suggested labels

feature, breaking-change, enhancement

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 46.48% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically summarizes the main change: fixing a data-flow correctness issue (the 27K→35 dashboard gap) in v0.7.2, which aligns with the PR's primary objective.
Description check ✅ Passed The description is comprehensive and directly related to the changeset, explaining the root causes (metrics_only default, session=0 sentinel, deduplication issues), detailing the seven SDK-side fixes, test results, and explicitly listing what is excluded from this PR.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch v0.7.2-data-flow-fixes

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 OpenGrep (1.20.0)

OpenGrep fatal error (exit code 2):
┌──────────────┐
│ Opengrep CLI │
└──────────────┘

�[32m✔�[39m �[1mOpengrep OSS�[0m
�[32m✔�[39m Basic security coverage for first-party code vulnerabilities.

�[1m Loading rules from local config...�[0m
[00.19][ERROR]: Error: exception Glob.Lexer.Syntax_error("malformed glob pattern: missing ']'")
Raised at Glob__Lexer.syntax_error in file "libs/glob/Lexer.mll", line 8, characters 2-26
Called from Glob__Lexer.__ocaml_lex_token_rec in file "libs/glob/Lexer.mll", line 29, characters 26-53
Cal


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot added bug Something isn't working feature breaking-change labels May 6, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
Gradata/examples/domain-profiles/call_profile.py (1)

429-443: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Average the newly added CallFeatures fields too.

avg_features, win_features, and loss_features now serialize the expanded schema, but _avg_features() still leaves several new fields at zero/default. That publishes incorrect aggregates for downstream audits and exports.

Suggested fix
     return CallFeatures(
+        call_type=fs[0].call_type,
+        total_words=round(sum(f.total_words for f in fs) / n),
+        user_words=round(sum(f.user_words for f in fs) / n),
+        prospect_words=round(sum(f.prospect_words for f in fs) / n),
         talk_ratio=sum(f.talk_ratio for f in fs) / n,
+        turn_count=round(sum(f.turn_count for f in fs) / n),
+        user_turns=round(sum(f.user_turns for f in fs) / n),
         question_count=round(sum(f.question_count for f in fs) / n),
+        open_question_count=round(sum(f.open_question_count for f in fs) / n),
+        closed_question_count=round(sum(f.closed_question_count for f in fs) / n),
         open_question_ratio=sum(f.open_question_ratio for f in fs) / n,
         pain_question_count=round(sum(f.pain_question_count for f in fs) / n),
         avg_turn_length=sum(f.avg_turn_length for f in fs) / n,
         longest_monologue=round(sum(f.longest_monologue for f in fs) / n),
         story_count=round(sum(f.story_count for f in fs) / n),
         objection_count=round(sum(f.objection_count for f in fs) / n),
         objection_responses=round(sum(f.objection_responses for f in fs) / n),
         close_attempts=round(sum(f.close_attempts for f in fs) / n),
+        commitment_count=round(sum(f.commitment_count for f in fs) / n),
+        specific_commitments=round(sum(f.specific_commitments for f in fs) / n),
         commitment_specificity=sum(f.commitment_specificity for f in fs) / n,
         discovery_minutes=sum(f.discovery_minutes for f in fs) / n,
+        first_pitch_minute=sum(f.first_pitch_minute for f in fs) / n,
         duration_minutes=sum(f.duration_minutes for f in fs) / n,
     )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Gradata/examples/domain-profiles/call_profile.py` around lines 429 - 443,
_avg_features() currently leaves several newly added CallFeatures fields at
defaults; update it to compute averages for every field on CallFeatures
(including commitment_specificity, discovery_minutes, duration_minutes and any
other fields added recently) by summing over fs and dividing by n, and apply the
same rounding semantics used elsewhere (e.g., round integer-like counts such as
question_count, pain_question_count, longest_monologue, story_count,
objection_count, objection_responses, close_attempts) so that avg_features,
win_features and loss_features serialize complete, correct aggregates; locate
and modify the _avg_features function and ensure it returns a CallFeatures
populated with the averaged values.
Gradata/src/gradata/_events.py (1)

700-704: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add logging for exception in _detect_session.

The except Exception: pass block silently swallows all errors without any logging. Per coding guidelines, exceptions should at minimum be logged with logger.warning(...) and exc_info=True to avoid silent failures in a memory product.

🛡️ Proposed fix
     except Exception:
-        pass
+        _log.debug("_detect_session query failed", exc_info=True)
 
     return None

As per coding guidelines: "Never use bare except: pass — use typed exceptions or at minimum logger.warning(...) with exc_info=True to avoid silent failure in a memory product."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Gradata/src/gradata/_events.py` around lines 700 - 704, The except block in
_detect_session currently swallows all exceptions; replace the bare except in
_detect_session with catching Exception as e and log it using logger.warning
with a descriptive message and exc_info=True (e.g., logger.warning("Failed to
detect session", exc_info=True)) before returning None so errors are not
silently ignored.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@Gradata/src/gradata/_core.py`:
- Around line 1241-1242: Before resolving shared credentials, check the brain's
sync flag and skip shared-credential resolution when sync is disabled: move or
guard the calls to _load_cloud_config(brain.dir) and
_cloud_creds.resolve_credential(...) behind a conditional that verifies the
brain's sync_enabled (or equivalent) property, and ensure brain_end_session()
uses that same check so we never auto-upload or resolve globals for brains that
are not cloud-enabled; update references in the brain_end_session flow, the
_load_cloud_config usage, and any code paths that call
_cloud_creds.resolve_credential to only run when sync_enabled is true.
- Around line 1345-1348: The global TOML should not override a per-brain
persisted sync_mode: after calling _parse_toml_cloud(config_path) only apply
cfg.get("sync_mode", ...) when the brain does not already have a sync_mode set;
e.g. keep the initial sync_mode from getattr(brain_cloud_cfg, "sync_mode",
"full") and replace the unconditional sync_mode = cfg.get(...) with a
conditional (if not hasattr(brain_cloud_cfg, "sync_mode") or
brain_cloud_cfg.sync_mode in (None, "")) then use cfg.get(...). This ensures
cmd_sync()'s per-brain "full" setting isn't clobbered by legacy
~/.gradata/config.toml.

In `@Gradata/src/gradata/_migrations/004_event_identity_unique_key.py`:
- Around line 104-117: The migration function _canonical_content_hash currently
calls json.dumps without default=str; update the json.dumps invocation inside
_canonical_content_hash to include default=str (i.e., json.dumps(...,
default=str, sort_keys=True, ...)) so the behavior matches the _events.py
version and handles non-serializable values consistently during backfill.

In `@Gradata/src/gradata/cli.py`:
- Around line 886-888: The brain_root resolution currently lets GRADATA_BRAIN
override an explicit --brain-dir; change the precedence so the CLI flag wins
first, then the BRAIN_DIR env var, then GRADATA_BRAIN, then Path.cwd(). Update
the expression that sets brain_root in gradata/cli.py to prefer getattr(args,
"brain_dir", None) first, then env_str("BRAIN_DIR"), then
env_str("GRADATA_BRAIN"), and finally Path.cwd(), and call .resolve() on the
resulting Path.

In `@Gradata/src/gradata/cloud/sync.py`:
- Line 101: The load code currently sets sync_mode directly from disk via
sync_mode=str(data.get("sync_mode", "full") or "full"), allowing arbitrary
values; change this to normalize (strip and lower) and validate against an
allowed set (e.g., {"full", "metrics_only"}) and fall back to "full" for any
invalid or missing value. Locate the code that reads data.get("sync_mode") (the
sync_mode assignment) and replace it with logic that normalizes the string,
checks membership in the allowed set, and assigns "full" when the value is
invalid or absent so the program cannot silently accept unsupported modes.

In `@Gradata/src/gradata/hooks/adapters/hermes.py`:
- Around line 101-107: The current _format_scalar uses repr(text) which can emit
Python-style escapes that your parser won't unescape; change it to wrap the
string in single quotes instead and escape any existing single quotes inside the
value by doubling them (i.e., replace ' with ''), so _format_scalar returns a
single-quoted safe string for values containing characters like :, #, {, }, [,
].

In `@Gradata/src/gradata/middleware/_core.py`:
- Around line 292-295: The current truncation slices the full block string and
can chop off the closing </brain-rules> tag; instead compute the available body
budget using source.max_prompt_chars minus the length of the opening
"<brain-rules>\n" and closing "\n</brain-rules>" (and any added newlines),
truncate the joined lines to that budget (operating on the "lines" content),
then assemble the final "block" by concatenating the opening tag, the truncated
body (with rstrip to remove trailing whitespace), and the closing tag so the
<brain-rules> framing is always intact.

---

Outside diff comments:
In `@Gradata/examples/domain-profiles/call_profile.py`:
- Around line 429-443: _avg_features() currently leaves several newly added
CallFeatures fields at defaults; update it to compute averages for every field
on CallFeatures (including commitment_specificity, discovery_minutes,
duration_minutes and any other fields added recently) by summing over fs and
dividing by n, and apply the same rounding semantics used elsewhere (e.g., round
integer-like counts such as question_count, pain_question_count,
longest_monologue, story_count, objection_count, objection_responses,
close_attempts) so that avg_features, win_features and loss_features serialize
complete, correct aggregates; locate and modify the _avg_features function and
ensure it returns a CallFeatures populated with the averaged values.

In `@Gradata/src/gradata/_events.py`:
- Around line 700-704: The except block in _detect_session currently swallows
all exceptions; replace the bare except in _detect_session with catching
Exception as e and log it using logger.warning with a descriptive message and
exc_info=True (e.g., logger.warning("Failed to detect session", exc_info=True))
before returning None so errors are not silently ignored.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 05b30e23-8d52-4ce0-9806-c0f4dbb9d1f4

📥 Commits

Reviewing files that changed from the base of the PR and between 3d7b7f8 and 4576028.

📒 Files selected for processing (25)
  • Gradata/bench/pmr_100.py
  • Gradata/examples/basic_usage.py
  • Gradata/examples/domain-profiles/call_profile.py
  • Gradata/examples/domain-profiles/sales_profile.py
  • Gradata/examples/with_claude_code.py
  • Gradata/examples/with_openai.py
  • Gradata/scripts/migrate_legacy_scopes.py
  • Gradata/src/gradata/_config.py
  • Gradata/src/gradata/_context_packet.py
  • Gradata/src/gradata/_core.py
  • Gradata/src/gradata/_events.py
  • Gradata/src/gradata/_migrations/004_event_identity_unique_key.py
  • Gradata/src/gradata/_migrations/device_uuid.py
  • Gradata/src/gradata/_mine_transcripts.py
  • Gradata/src/gradata/brain.py
  • Gradata/src/gradata/cli.py
  • Gradata/src/gradata/cloud/sync.py
  • Gradata/src/gradata/hooks/adapters/hermes.py
  • Gradata/src/gradata/hooks/agent_precontext.py
  • Gradata/src/gradata/hooks/inject_brain_rules.py
  • Gradata/src/gradata/hooks/jit_inject.py
  • Gradata/src/gradata/middleware/_core.py
  • Gradata/tests/test_brain_config_propagation.py
  • Gradata/tests/test_brain_events.py
  • Gradata/tests/test_migration_002_event_identity.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: pytest macos-latest / py3.11
  • GitHub Check: pytest ubuntu-latest / py3.12
  • GitHub Check: pytest windows-latest / py3.11
  • GitHub Check: pytest windows-latest / py3.12
  • GitHub Check: pytest macos-latest / py3.12
  • GitHub Check: pytest ubuntu-latest / py3.11
🧰 Additional context used
📓 Path-based instructions (2)
Gradata/src/**/*.py

📄 CodeRabbit inference engine (Gradata/AGENTS.md)

Gradata/src/**/*.py: Prefer sentence-transformers for local embeddings, google-genai for Gemini embeddings, cryptography for AES-GCM encrypted system.db, bm25s for BM25 rule ranking, and mem0ai for external memory adapters — guard all optional dependency imports with try / except ImportError at the call site, never at module level
Maintain strict layering: Layer 0 (Primitives: _types.py, _db.py, _events.py, _paths.py, _file_lock.py; Patterns: contrib/patterns/) must never import from Layer 1 (Enhancements: enhancements/, rules/) or Layer 2 (Public API: brain.py, cli.py, daemon.py, mcp_server.py)
Never use bare except: pass — use typed exceptions or at minimum logger.warning(...) with exc_info=True to avoid silent failure in a memory product
Never import from out-of-scope sibling directories ../Sprites/ or ../Hausgem/ within gradata/* code — that is a layering bug
Never leak private-sibling paths into public docs/code — no references to ../Sprites/, ../Hausgem/, email addresses, OneDrive paths, or Sprites-specific examples from inside gradata/*
Use atomic-write helper when writing JSON files to prevent corruption from mid-write crashes

Files:

  • Gradata/src/gradata/_config.py
  • Gradata/src/gradata/_mine_transcripts.py
  • Gradata/src/gradata/hooks/jit_inject.py
  • Gradata/src/gradata/middleware/_core.py
  • Gradata/src/gradata/hooks/agent_precontext.py
  • Gradata/src/gradata/cli.py
  • Gradata/src/gradata/hooks/inject_brain_rules.py
  • Gradata/src/gradata/cloud/sync.py
  • Gradata/src/gradata/hooks/adapters/hermes.py
  • Gradata/src/gradata/_context_packet.py
  • Gradata/src/gradata/_core.py
  • Gradata/src/gradata/brain.py
  • Gradata/src/gradata/_migrations/004_event_identity_unique_key.py
  • Gradata/src/gradata/_migrations/device_uuid.py
  • Gradata/src/gradata/_events.py
Gradata/tests/**/*.py

📄 CodeRabbit inference engine (Gradata/AGENTS.md)

Gradata/tests/**/*.py: Set BRAIN_DIR environment variable via tmp_path in conftest.py for test isolation — ensure _paths.py module cache refreshes when calling Brain.init() directly inside tests
Add unit tests in tests/test_*.py for every CI push without LLM calls (deterministic); mark integration tests with @pytest.mark.integration and skip them by default (they hit real LLM APIs)

Files:

  • Gradata/tests/test_brain_events.py
  • Gradata/tests/test_migration_002_event_identity.py
  • Gradata/tests/test_brain_config_propagation.py
🔇 Additional comments (22)
Gradata/scripts/migrate_legacy_scopes.py (1)

120-121: LGTM!

Minor formatting improvement consolidating the warning message to a single line. Functionality is unchanged.

Gradata/src/gradata/hooks/adapters/hermes.py (4)

17-67: Parser scope is appropriate for the use case.

The limited YAML subset (nested dicts/lists, simple scalars) matches the expected Hermes hook config structure. The "items" fallback on line 51 handles unexpected list-under-key structures defensively without failing.

One observation: lines 44-46 check if isinstance(container, dict) before assigning current_list to the container, but current_list is always reassigned regardless. This is safe since orphan lists won't corrupt data, though the comment could clarify this is intentional.


70-75: LGTM!

Scalar parsing handles the expected value types (booleans and strings) correctly. Numeric values would be preserved as strings, which is fine since hook configs use string IDs and commands.


78-98: LGTM!

The recursive dumping correctly handles nested dicts and lists with proper indentation. Multi-key dict items within lists use the expected YAML continuation style (first key with - , subsequent keys aligned).


110-139: LGTM!

The install function correctly uses the new parser/dumper and maintains atomic writes via atomic_write_text. Defensive type checks for the hooks structure are preserved, and error handling properly returns a failure result. As per coding guidelines, atomic-write helper is used to prevent corruption from mid-write crashes.

Gradata/examples/with_claude_code.py (1)

16-16: Formatting-only change is safe.

Line 16 is a non-functional spacing update; behavior and example flow remain unchanged.

Gradata/examples/with_openai.py (1)

10-10: Formatting-only change is safe.

Line 10 is purely cosmetic and does not alter runtime behavior.

Gradata/examples/basic_usage.py (1)

2-4: Spacing cleanup looks good.

Lines 2–4 are non-functional formatting adjustments only.

Gradata/src/gradata/_context_packet.py (1)

383-386: Session fallback logic is consistent here.

The Line 385 coalescing keeps session concrete for audit/wrap-up query paths and avoids None propagation into int-based filters.

Gradata/src/gradata/_config.py (1)

82-85: Good API addition for config loading.

BrainConfig.load(...) is a clean, centralized entrypoint that reduces duplication in consumers.

Gradata/src/gradata/middleware/_core.py (1)

125-132: Config-driven prompt cap wiring looks good.

Using per-brain max_recall_tokens to size middleware injection is a solid propagation of runtime config.

Gradata/src/gradata/hooks/agent_precontext.py (1)

148-154: Cap propagation and enforcement are consistent.

Lines 148–154 and 226–227 correctly apply per-brain prompt sizing with a safe fallback.

Also applies to: 226-227

Gradata/src/gradata/brain.py (1)

239-239: LGTM! Session handling changes are consistent with the PR objectives.

The changes correctly:

  1. Coalesce None to 0 in the session property (Line 239)
  2. Forward session value directly to emit() which now handles None internally (Line 1516)
  3. Coalesce missing session to 0 in track_rule (Line 1920)

This aligns with the stated goal of preserving None session from callers and resolving the active session at write time.

Also applies to: 1516-1516, 1920-1920

Gradata/tests/test_migration_002_event_identity.py (1)

171-194: LGTM! Well-structured test for event identity population.

The test correctly validates:

  • Two emitted events get distinct event_id values (ULID format)
  • device_id is stable across emits within the same brain
  • content_hash is identical for events with the same content
  • Returned event dicts match the persisted DB values
Gradata/tests/test_brain_config_propagation.py (1)

1-55: LGTM! Good integration test for BrainConfig propagation.

The test comprehensively validates that max_recall_tokens from brain-config.json is respected across all four injection paths (session injection, agent precontext, JIT inject, and middleware). The helper function generates appropriately large lessons to test truncation behavior.

One minor observation: Line 54's assertion (assert os.environ["GRADATA_JIT_ENABLED"] == "1") is redundant since you set it on line 34 with monkeypatch.setenv. However, this doesn't affect correctness.

Gradata/src/gradata/_mine_transcripts.py (1)

222-238: LGTM! Idempotent event IDs for transcript mining.

The _stable_event_id function generates a deterministic 32-character hash from event metadata, enabling idempotent replays of mined transcripts. The hardcoded session=0 is appropriate for backfill scenarios where historical session context isn't available.

Also applies to: 444-444

Gradata/src/gradata/_migrations/device_uuid.py (1)

30-47: LGTM! Brain-aware device ID generation.

The refactored implementation correctly:

  1. Derives brain identity from manifest metadata (with fallback to path)
  2. Seeds the device ID from hostname:brain_identity for per-brain, per-machine uniqueness
  3. Maintains backward compatibility with existing atomic write patterns

Also applies to: 75-75

Gradata/bench/pmr_100.py (1)

41-61: LGTM! Formatting and style improvements.

The benchmark file has been cleaned up with:

  • Reorganized imports with UTC alias for cleaner datetime usage
  • Multiline function signatures for better readability
  • Multiline CLI argument definitions matching broader CLI conventions
  • Added --quick flag for fast feedback (10 sessions vs 100)

No functional changes to the benchmark logic.

Also applies to: 296-298, 350-352, 453-469

Gradata/src/gradata/_migrations/004_event_identity_unique_key.py (1)

1-101: LGTM! Migration correctly adds brain_id and unique constraint.

The migration:

  1. Adds brain_id column and other identity columns
  2. Drops obsolete dedup indexes
  3. Backfills identity fields for existing rows
  4. Creates the unique (brain_id, event_id) index
Gradata/src/gradata/_events.py (3)

37-76: LGTM! Well-designed identity and session normalization helpers.

The new helper functions are well-structured:

  • _canonical_content_hash: Creates deterministic SHA256 for content deduplication
  • _legacy_event_id: Provides stable IDs for historical JSONL rows
  • _coerce_session_for_storage: Normalizes session values, returning None for invalid/non-positive values (aligning with the PR objective to stop emitting session=0)

165-209: LGTM! Schema additions for event identity tracking.

The schema correctly adds:

  • Identity columns: brain_id, event_id, device_id, content_hash
  • Lineage columns: correction_chain_id, origin_agent
  • Unique index on (brain_id, event_id) for deduplication

The use of contextlib.suppress(sqlite3.OperationalError) for ALTER TABLE statements handles the idempotency of schema migrations gracefully.


354-358: LGTM! Session and identity resolution with defensive fallbacks.

The emit() function now correctly:

  1. Resolves session at write time (defaulting to 1 if undetectable), then coerces to None for non-positive values
  2. Ensures redacted_data is always a dict (wrapping scalars in {"_raw": ...})
  3. Provides hash-based fallbacks for tenant_id and device_id resolution failures

This aligns with the PR objective of resolving active session at write time while maintaining robustness.

Also applies to: 398-412

Comment on lines +1241 to +1242
brain_cloud_cfg = _load_cloud_config(brain.dir)
api_key = _cloud_creds.resolve_credential(fallback=brain_cloud_cfg.token)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Honor sync_enabled before resolving shared credentials.

This auto-sync path now uploads whenever a global keyfile or env credential exists, even if this brain was never cloud-enabled. That can leak data from the wrong brain as soon as brain_end_session() runs.

Suggested fix
         brain_cloud_cfg = _load_cloud_config(brain.dir)
+        if not brain_cloud_cfg.sync_enabled:
+            return
         api_key = _cloud_creds.resolve_credential(fallback=brain_cloud_cfg.token)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
brain_cloud_cfg = _load_cloud_config(brain.dir)
api_key = _cloud_creds.resolve_credential(fallback=brain_cloud_cfg.token)
brain_cloud_cfg = _load_cloud_config(brain.dir)
if not brain_cloud_cfg.sync_enabled:
return
api_key = _cloud_creds.resolve_credential(fallback=brain_cloud_cfg.token)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Gradata/src/gradata/_core.py` around lines 1241 - 1242, Before resolving
shared credentials, check the brain's sync flag and skip shared-credential
resolution when sync is disabled: move or guard the calls to
_load_cloud_config(brain.dir) and _cloud_creds.resolve_credential(...) behind a
conditional that verifies the brain's sync_enabled (or equivalent) property, and
ensure brain_end_session() uses that same check so we never auto-upload or
resolve globals for brains that are not cloud-enabled; update references in the
brain_end_session flow, the _load_cloud_config usage, and any code paths that
call _cloud_creds.resolve_credential to only run when sync_enabled is true.

Comment on lines +1345 to +1348
sync_mode = getattr(brain_cloud_cfg, "sync_mode", "full") or "full"
try:
cfg = _parse_toml_cloud(config_path)
sync_mode = cfg.get("sync_mode", "metrics_only")
sync_mode = cfg.get("sync_mode", sync_mode)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not let legacy TOML override the per-brain sync_mode.

These lines can undo the "full" mode that cmd_sync() just persisted for this brain, so a stale ~/.gradata/config.toml with metrics_only will still skip event backfill and recreate the dashboard gap.

Suggested fix
-        sync_mode = getattr(brain_cloud_cfg, "sync_mode", "full") or "full"
-        try:
-            cfg = _parse_toml_cloud(config_path)
-            sync_mode = cfg.get("sync_mode", sync_mode)
-        except Exception:
-            pass
+        sync_mode = str(getattr(brain_cloud_cfg, "sync_mode", "") or "").strip().lower()
+        if not sync_mode and config_path.is_file():
+            try:
+                cfg = _parse_toml_cloud(config_path)
+                sync_mode = str(cfg.get("sync_mode", "") or "").strip().lower()
+            except Exception:
+                pass
+        sync_mode = sync_mode or "full"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
sync_mode = getattr(brain_cloud_cfg, "sync_mode", "full") or "full"
try:
cfg = _parse_toml_cloud(config_path)
sync_mode = cfg.get("sync_mode", "metrics_only")
sync_mode = cfg.get("sync_mode", sync_mode)
sync_mode = str(getattr(brain_cloud_cfg, "sync_mode", "") or "").strip().lower()
if not sync_mode and config_path.is_file():
try:
cfg = _parse_toml_cloud(config_path)
sync_mode = str(cfg.get("sync_mode", "") or "").strip().lower()
except Exception:
pass
sync_mode = sync_mode or "full"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Gradata/src/gradata/_core.py` around lines 1345 - 1348, The global TOML
should not override a per-brain persisted sync_mode: after calling
_parse_toml_cloud(config_path) only apply cfg.get("sync_mode", ...) when the
brain does not already have a sync_mode set; e.g. keep the initial sync_mode
from getattr(brain_cloud_cfg, "sync_mode", "full") and replace the unconditional
sync_mode = cfg.get(...) with a conditional (if not hasattr(brain_cloud_cfg,
"sync_mode") or brain_cloud_cfg.sync_mode in (None, "")) then use cfg.get(...).
This ensures cmd_sync()'s per-brain "full" setting isn't clobbered by legacy
~/.gradata/config.toml.

Comment on lines +104 to +117
def _canonical_content_hash(ev_type: str, source: str | None, data_json: str | None) -> str:
import hashlib

try:
data = json.loads(data_json) if data_json else {}
except (json.JSONDecodeError, TypeError):
data = {"_raw": data_json}
canonical = json.dumps(
{"type": ev_type, "source": source or "", "data": data},
sort_keys=True,
separators=(",", ":"),
ensure_ascii=False,
)
return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial | 💤 Low value

Minor inconsistency: _canonical_content_hash differs from _events.py version.

The migration's _canonical_content_hash is missing default=str in json.dumps() (Line 111-116), whereas the _events.py version includes it. This is acceptable for the migration's backfill use case since it processes data already serialized to JSON in the database. However, for consistency and future-proofing:

🔧 Optional: Add default=str for consistency
     canonical = json.dumps(
         {"type": ev_type, "source": source or "", "data": data},
         sort_keys=True,
         separators=(",", ":"),
         ensure_ascii=False,
+        default=str,
     )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Gradata/src/gradata/_migrations/004_event_identity_unique_key.py` around
lines 104 - 117, The migration function _canonical_content_hash currently calls
json.dumps without default=str; update the json.dumps invocation inside
_canonical_content_hash to include default=str (i.e., json.dumps(...,
default=str, sort_keys=True, ...)) so the behavior matches the _events.py
version and handles non-serializable values consistently during backfill.

Comment on lines +886 to +888
brain_root = Path(
env_str("GRADATA_BRAIN") or getattr(args, "brain_dir", None) or Path.cwd()
).resolve()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Resolve the sync target with the standard brain precedence.

This path ignores BRAIN_DIR and lets GRADATA_BRAIN override an explicit --brain-dir, so gradata sync --full can backfill the wrong brain.

Suggested fix
-    brain_root = Path(
-        env_str("GRADATA_BRAIN") or getattr(args, "brain_dir", None) or Path.cwd()
-    ).resolve()
+    brain_root = _resolve_brain_root(args).resolve()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Gradata/src/gradata/cli.py` around lines 886 - 888, The brain_root resolution
currently lets GRADATA_BRAIN override an explicit --brain-dir; change the
precedence so the CLI flag wins first, then the BRAIN_DIR env var, then
GRADATA_BRAIN, then Path.cwd(). Update the expression that sets brain_root in
gradata/cli.py to prefer getattr(args, "brain_dir", None) first, then
env_str("BRAIN_DIR"), then env_str("GRADATA_BRAIN"), and finally Path.cwd(), and
call .resolve() on the resulting Path.

token=str(data.get("token", "")),
api_base=_normalize_api_base(str(data.get("api_base", _DEFAULT_API_BASE))),
contribute_corpus=bool(data.get("contribute_corpus", False)),
sync_mode=str(data.get("sync_mode", "full") or "full"),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Normalize and validate sync_mode on load.

Line 101 accepts arbitrary strings from disk. Please constrain to supported modes (e.g., full, metrics_only) and default invalid values to full to avoid silent mode drift.

🔧 Proposed fix
-            sync_mode=str(data.get("sync_mode", "full") or "full"),
+            sync_mode=(
+                (lambda m: m if m in {"full", "metrics_only"} else "full")(
+                    str(data.get("sync_mode", "full") or "full").strip().lower()
+                )
+            ),
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Gradata/src/gradata/cloud/sync.py` at line 101, The load code currently sets
sync_mode directly from disk via sync_mode=str(data.get("sync_mode", "full") or
"full"), allowing arbitrary values; change this to normalize (strip and lower)
and validate against an allowed set (e.g., {"full", "metrics_only"}) and fall
back to "full" for any invalid or missing value. Locate the code that reads
data.get("sync_mode") (the sync_mode assignment) and replace it with logic that
normalizes the string, checks membership in the allowed set, and assigns "full"
when the value is invalid or absent so the program cannot silently accept
unsupported modes.

Comment on lines +101 to +107
def _format_scalar(value) -> str:
if isinstance(value, bool):
return "true" if value else "false"
text = str(value)
if any(ch in text for ch in (":", "#", "{", "}", "[", "]")):
return repr(text)
return text

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial | 💤 Low value

Minor edge case with repr() escaping.

Using repr() for quoting could produce Python-style escape sequences (e.g., \') that the custom parser's strip("'\"") won't unescape. For the expected hook config values (simple paths and IDs), this is unlikely to occur. If edge cases arise, consider using single-quote wrapping directly instead of repr().

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Gradata/src/gradata/hooks/adapters/hermes.py` around lines 101 - 107, The
current _format_scalar uses repr(text) which can emit Python-style escapes that
your parser won't unescape; change it to wrap the string in single quotes
instead and escape any existing single quotes inside the value by doubling them
(i.e., replace ' with ''), so _format_scalar returns a single-quoted safe string
for values containing characters like :, #, {, }, [, ].

Comment on lines +292 to +295
block = "<brain-rules>\n" + "\n".join(lines) + "\n</brain-rules>"
if len(block) > source.max_prompt_chars:
block = block[: source.max_prompt_chars].rstrip()
return block

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve valid <brain-rules> framing when truncating.

Line 294 truncates the full block string, which can cut </brain-rules> and return malformed markup. Compute the body budget first, then wrap.

🔧 Proposed fix
-    block = "<brain-rules>\n" + "\n".join(lines) + "\n</brain-rules>"
-    if len(block) > source.max_prompt_chars:
-        block = block[: source.max_prompt_chars].rstrip()
-    return block
+    open_tag = "<brain-rules>\n"
+    close_tag = "\n</brain-rules>"
+    max_chars = max(0, int(source.max_prompt_chars))
+    min_wrapper = len(open_tag) + len(close_tag)
+    if max_chars <= min_wrapper:
+        return ""
+
+    body = "\n".join(lines)
+    body_budget = max_chars - min_wrapper
+    if len(body) > body_budget:
+        body = body[:body_budget].rstrip()
+    return f"{open_tag}{body}{close_tag}"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
block = "<brain-rules>\n" + "\n".join(lines) + "\n</brain-rules>"
if len(block) > source.max_prompt_chars:
block = block[: source.max_prompt_chars].rstrip()
return block
open_tag = "<brain-rules>\n"
close_tag = "\n</brain-rules>"
max_chars = max(0, int(source.max_prompt_chars))
min_wrapper = len(open_tag) + len(close_tag)
if max_chars <= min_wrapper:
return ""
body = "\n".join(lines)
body_budget = max_chars - min_wrapper
if len(body) > body_budget:
body = body[:body_budget].rstrip()
return f"{open_tag}{body}{close_tag}"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Gradata/src/gradata/middleware/_core.py` around lines 292 - 295, The current
truncation slices the full block string and can chop off the closing
</brain-rules> tag; instead compute the available body budget using
source.max_prompt_chars minus the length of the opening "<brain-rules>\n" and
closing "\n</brain-rules>" (and any added newlines), truncate the joined lines
to that budget (operating on the "lines" content), then assemble the final
"block" by concatenating the opening tag, the truncated body (with rstrip to
remove trailing whitespace), and the closing tag so the <brain-rules> framing is
always intact.

@Gradata Gradata merged commit 30eb02d into main May 6, 2026
14 of 15 checks passed
@Gradata Gradata deleted the v0.7.2-data-flow-fixes branch May 6, 2026 17:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

breaking-change bug Something isn't working feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant