feat(v0.7.2): data-flow correctness — fix the 27K→35 dashboard gap#175
Conversation
Codex deep audit found 12 bugs in event flow; this PR ships fixes for the SDK side (Bugs 1, 2, 7, 8, 9, 10, 11). Companion cloud PR fixes projector. After both land, the dashboard accurately reflects events. P0 — DEFAULT SYNC NOW INCLUDES THE CORPUS (Bug 1) - _core.py:_cloud_sync_session now defaults sync_mode='full', not metrics_only - New CLI: `gradata sync --full` flips mode + immediately backfills events - Clear WARN log when sync_mode='metrics_only' so users know dashboard will be empty P0 — STOP EMITTING events WITH session=0 (Bug 2) - Brain.emit() preserves None when caller doesn't provide a session - _events.emit() detects active session from brain state at write time - Session only coerced to int at serialization, never used as 0 sentinel P1 — POPULATE EVENT IDENTITY AT EMIT TIME (Bug 7) - _events.py now writes event_id (ULID), device_id (dev_<sha256>), content_hash (sha256) - Migration 004: drops legacy unique on (tenant_id, ts, type, source); adds unique (brain_id, event_id) - Two events with identical (ts, type, source) but different data now coexist - Replaces the false-green test_new_emit_leaves_identity_columns_null_for_now (Bug 8) - New regression test asserts ULID format + uniqueness + content_hash determinism P1 — BRAINCONFIG PROPAGATION (Bug 9) - New BrainConfig.load(brain_dir) helper in _config.py - Used by inject_brain_rules.py, agent_precontext.py, jit_inject.py, middleware/_core.py - max_recall_tokens + ranker now respected across all 4 injection paths - New tests/test_brain_config_propagation.py covers all 4 sites P1 — REMOVE UNDECLARED PYYAML DEP (Bug 10) - hooks/adapters/hermes.py: small in-tree YAML reader/writer for Hermes config - No more sudden ImportError when user runs gradata install --agent hermes P2 — README/CLI CONSISTENCY (Bug 11) - gradata init now exposes --no-interactive flag (matches docs) - gradata audit smoke-tests pass Tests: 4174 passed / 2 skipped / 5 deselected (sandbox-only) on socket-free run. ruff check + ruff format --check + pyright src/ all clean. Council (3-vendor full) verdict on the wedge: Convergence-as-a-Service. This PR is the precondition — the dashboard must accurately reflect data before we can pitch convergence proof to YC. Co-authored-by: Codex <noreply@openai.com>
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
📝 WalkthroughSummary
WalkthroughThis PR introduces event identity tracking and cloud synchronization enhancements, implements configuration-driven prompt sizing across rule-injection hooks, expands domain-specific analytics for call and sales profiles, and applies code quality improvements across the codebase. The core refactor centers on persistent event IDs, device tracking, deterministic content hashing, and cloud sync mode-awareness, with supporting session handling refinements. ChangesEvent Identity, Cloud Sync & Session Management
Configuration-Driven Prompt Sizing
Domain Profile Analytics Enhancements
Code Quality & Formatting
Sequence Diagram(s)sequenceDiagram
participant App as Application
participant Brain as Brain
participant Events as Events Layer
participant DB as SQLite DB
participant Cloud as Gradata Cloud
App->>Brain: emit(event_type, source, data, session=None)
activate Brain
Brain->>Events: emit(...)
activate Events
Events->>Events: _detect_session() → int|None
alt No session detected
Events->>Events: session = 1 (default)
end
Events->>Events: tenant_id = tenant_for(brain_dir)
Events->>Events: event_id = new_ulid()
Events->>Events: device_id = get_or_create_device_id(brain_dir)
Events->>Events: content_hash = _canonical_content_hash(...)
Events->>DB: INSERT event with brain_id, event_id, device_id, content_hash
activate DB
DB->>DB: Check unique (brain_id, event_id)
alt Idempotent match found
DB-->>Events: Duplicate detected
else New event
DB-->>Events: Row inserted
end
deactivate DB
Events->>Events: Emit JSONL with redacted data
deactivate Events
Events-->>Brain: Return event dict with identity fields
deactivate Brain
Brain-->>App: Event emitted with identity
par Cloud Sync
App->>Brain: sync (via CLI)
activate Brain
Brain->>Cloud: Sync events using event_id as dedup key
activate Cloud
Cloud->>Cloud: Filter by (brain_id, event_id) uniqueness
Cloud->>Cloud: Resolve sync_mode from CloudConfig
alt sync_mode == "full"
Cloud->>Cloud: Sync all events + corrections
else sync_mode == "metrics_only"
Cloud->>Cloud: Report metrics only
end
Cloud-->>Brain: Sync results
deactivate Cloud
deactivate Brain
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Possibly related PRs
Suggested labels
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 OpenGrep (1.20.0)OpenGrep fatal error (exit code 2): �[32m✔�[39m �[1mOpengrep OSS�[0m �[1m Loading rules from local config...�[0m Comment |
There was a problem hiding this comment.
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
Gradata/examples/domain-profiles/call_profile.py (1)
429-443:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAverage the newly added
CallFeaturesfields too.
avg_features,win_features, andloss_featuresnow serialize the expanded schema, but_avg_features()still leaves several new fields at zero/default. That publishes incorrect aggregates for downstream audits and exports.Suggested fix
return CallFeatures( + call_type=fs[0].call_type, + total_words=round(sum(f.total_words for f in fs) / n), + user_words=round(sum(f.user_words for f in fs) / n), + prospect_words=round(sum(f.prospect_words for f in fs) / n), talk_ratio=sum(f.talk_ratio for f in fs) / n, + turn_count=round(sum(f.turn_count for f in fs) / n), + user_turns=round(sum(f.user_turns for f in fs) / n), question_count=round(sum(f.question_count for f in fs) / n), + open_question_count=round(sum(f.open_question_count for f in fs) / n), + closed_question_count=round(sum(f.closed_question_count for f in fs) / n), open_question_ratio=sum(f.open_question_ratio for f in fs) / n, pain_question_count=round(sum(f.pain_question_count for f in fs) / n), avg_turn_length=sum(f.avg_turn_length for f in fs) / n, longest_monologue=round(sum(f.longest_monologue for f in fs) / n), story_count=round(sum(f.story_count for f in fs) / n), objection_count=round(sum(f.objection_count for f in fs) / n), objection_responses=round(sum(f.objection_responses for f in fs) / n), close_attempts=round(sum(f.close_attempts for f in fs) / n), + commitment_count=round(sum(f.commitment_count for f in fs) / n), + specific_commitments=round(sum(f.specific_commitments for f in fs) / n), commitment_specificity=sum(f.commitment_specificity for f in fs) / n, discovery_minutes=sum(f.discovery_minutes for f in fs) / n, + first_pitch_minute=sum(f.first_pitch_minute for f in fs) / n, duration_minutes=sum(f.duration_minutes for f in fs) / n, )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@Gradata/examples/domain-profiles/call_profile.py` around lines 429 - 443, _avg_features() currently leaves several newly added CallFeatures fields at defaults; update it to compute averages for every field on CallFeatures (including commitment_specificity, discovery_minutes, duration_minutes and any other fields added recently) by summing over fs and dividing by n, and apply the same rounding semantics used elsewhere (e.g., round integer-like counts such as question_count, pain_question_count, longest_monologue, story_count, objection_count, objection_responses, close_attempts) so that avg_features, win_features and loss_features serialize complete, correct aggregates; locate and modify the _avg_features function and ensure it returns a CallFeatures populated with the averaged values.Gradata/src/gradata/_events.py (1)
700-704:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winAdd logging for exception in
_detect_session.The
except Exception: passblock silently swallows all errors without any logging. Per coding guidelines, exceptions should at minimum be logged withlogger.warning(...)andexc_info=Trueto avoid silent failures in a memory product.🛡️ Proposed fix
except Exception: - pass + _log.debug("_detect_session query failed", exc_info=True) return NoneAs per coding guidelines: "Never use bare
except: pass— use typed exceptions or at minimumlogger.warning(...)withexc_info=Trueto avoid silent failure in a memory product."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@Gradata/src/gradata/_events.py` around lines 700 - 704, The except block in _detect_session currently swallows all exceptions; replace the bare except in _detect_session with catching Exception as e and log it using logger.warning with a descriptive message and exc_info=True (e.g., logger.warning("Failed to detect session", exc_info=True)) before returning None so errors are not silently ignored.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@Gradata/src/gradata/_core.py`:
- Around line 1241-1242: Before resolving shared credentials, check the brain's
sync flag and skip shared-credential resolution when sync is disabled: move or
guard the calls to _load_cloud_config(brain.dir) and
_cloud_creds.resolve_credential(...) behind a conditional that verifies the
brain's sync_enabled (or equivalent) property, and ensure brain_end_session()
uses that same check so we never auto-upload or resolve globals for brains that
are not cloud-enabled; update references in the brain_end_session flow, the
_load_cloud_config usage, and any code paths that call
_cloud_creds.resolve_credential to only run when sync_enabled is true.
- Around line 1345-1348: The global TOML should not override a per-brain
persisted sync_mode: after calling _parse_toml_cloud(config_path) only apply
cfg.get("sync_mode", ...) when the brain does not already have a sync_mode set;
e.g. keep the initial sync_mode from getattr(brain_cloud_cfg, "sync_mode",
"full") and replace the unconditional sync_mode = cfg.get(...) with a
conditional (if not hasattr(brain_cloud_cfg, "sync_mode") or
brain_cloud_cfg.sync_mode in (None, "")) then use cfg.get(...). This ensures
cmd_sync()'s per-brain "full" setting isn't clobbered by legacy
~/.gradata/config.toml.
In `@Gradata/src/gradata/_migrations/004_event_identity_unique_key.py`:
- Around line 104-117: The migration function _canonical_content_hash currently
calls json.dumps without default=str; update the json.dumps invocation inside
_canonical_content_hash to include default=str (i.e., json.dumps(...,
default=str, sort_keys=True, ...)) so the behavior matches the _events.py
version and handles non-serializable values consistently during backfill.
In `@Gradata/src/gradata/cli.py`:
- Around line 886-888: The brain_root resolution currently lets GRADATA_BRAIN
override an explicit --brain-dir; change the precedence so the CLI flag wins
first, then the BRAIN_DIR env var, then GRADATA_BRAIN, then Path.cwd(). Update
the expression that sets brain_root in gradata/cli.py to prefer getattr(args,
"brain_dir", None) first, then env_str("BRAIN_DIR"), then
env_str("GRADATA_BRAIN"), and finally Path.cwd(), and call .resolve() on the
resulting Path.
In `@Gradata/src/gradata/cloud/sync.py`:
- Line 101: The load code currently sets sync_mode directly from disk via
sync_mode=str(data.get("sync_mode", "full") or "full"), allowing arbitrary
values; change this to normalize (strip and lower) and validate against an
allowed set (e.g., {"full", "metrics_only"}) and fall back to "full" for any
invalid or missing value. Locate the code that reads data.get("sync_mode") (the
sync_mode assignment) and replace it with logic that normalizes the string,
checks membership in the allowed set, and assigns "full" when the value is
invalid or absent so the program cannot silently accept unsupported modes.
In `@Gradata/src/gradata/hooks/adapters/hermes.py`:
- Around line 101-107: The current _format_scalar uses repr(text) which can emit
Python-style escapes that your parser won't unescape; change it to wrap the
string in single quotes instead and escape any existing single quotes inside the
value by doubling them (i.e., replace ' with ''), so _format_scalar returns a
single-quoted safe string for values containing characters like :, #, {, }, [,
].
In `@Gradata/src/gradata/middleware/_core.py`:
- Around line 292-295: The current truncation slices the full block string and
can chop off the closing </brain-rules> tag; instead compute the available body
budget using source.max_prompt_chars minus the length of the opening
"<brain-rules>\n" and closing "\n</brain-rules>" (and any added newlines),
truncate the joined lines to that budget (operating on the "lines" content),
then assemble the final "block" by concatenating the opening tag, the truncated
body (with rstrip to remove trailing whitespace), and the closing tag so the
<brain-rules> framing is always intact.
---
Outside diff comments:
In `@Gradata/examples/domain-profiles/call_profile.py`:
- Around line 429-443: _avg_features() currently leaves several newly added
CallFeatures fields at defaults; update it to compute averages for every field
on CallFeatures (including commitment_specificity, discovery_minutes,
duration_minutes and any other fields added recently) by summing over fs and
dividing by n, and apply the same rounding semantics used elsewhere (e.g., round
integer-like counts such as question_count, pain_question_count,
longest_monologue, story_count, objection_count, objection_responses,
close_attempts) so that avg_features, win_features and loss_features serialize
complete, correct aggregates; locate and modify the _avg_features function and
ensure it returns a CallFeatures populated with the averaged values.
In `@Gradata/src/gradata/_events.py`:
- Around line 700-704: The except block in _detect_session currently swallows
all exceptions; replace the bare except in _detect_session with catching
Exception as e and log it using logger.warning with a descriptive message and
exc_info=True (e.g., logger.warning("Failed to detect session", exc_info=True))
before returning None so errors are not silently ignored.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 05b30e23-8d52-4ce0-9806-c0f4dbb9d1f4
📒 Files selected for processing (25)
Gradata/bench/pmr_100.pyGradata/examples/basic_usage.pyGradata/examples/domain-profiles/call_profile.pyGradata/examples/domain-profiles/sales_profile.pyGradata/examples/with_claude_code.pyGradata/examples/with_openai.pyGradata/scripts/migrate_legacy_scopes.pyGradata/src/gradata/_config.pyGradata/src/gradata/_context_packet.pyGradata/src/gradata/_core.pyGradata/src/gradata/_events.pyGradata/src/gradata/_migrations/004_event_identity_unique_key.pyGradata/src/gradata/_migrations/device_uuid.pyGradata/src/gradata/_mine_transcripts.pyGradata/src/gradata/brain.pyGradata/src/gradata/cli.pyGradata/src/gradata/cloud/sync.pyGradata/src/gradata/hooks/adapters/hermes.pyGradata/src/gradata/hooks/agent_precontext.pyGradata/src/gradata/hooks/inject_brain_rules.pyGradata/src/gradata/hooks/jit_inject.pyGradata/src/gradata/middleware/_core.pyGradata/tests/test_brain_config_propagation.pyGradata/tests/test_brain_events.pyGradata/tests/test_migration_002_event_identity.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: pytest macos-latest / py3.11
- GitHub Check: pytest ubuntu-latest / py3.12
- GitHub Check: pytest windows-latest / py3.11
- GitHub Check: pytest windows-latest / py3.12
- GitHub Check: pytest macos-latest / py3.12
- GitHub Check: pytest ubuntu-latest / py3.11
🧰 Additional context used
📓 Path-based instructions (2)
Gradata/src/**/*.py
📄 CodeRabbit inference engine (Gradata/AGENTS.md)
Gradata/src/**/*.py: Prefersentence-transformersfor local embeddings,google-genaifor Gemini embeddings,cryptographyfor AES-GCM encrypted system.db,bm25sfor BM25 rule ranking, andmem0aifor external memory adapters — guard all optional dependency imports withtry / except ImportErrorat the call site, never at module level
Maintain strict layering: Layer 0 (Primitives: _types.py, _db.py, _events.py, _paths.py, _file_lock.py; Patterns: contrib/patterns/) must never import from Layer 1 (Enhancements: enhancements/, rules/) or Layer 2 (Public API: brain.py, cli.py, daemon.py, mcp_server.py)
Never use bareexcept: pass— use typed exceptions or at minimumlogger.warning(...)withexc_info=Trueto avoid silent failure in a memory product
Never import from out-of-scope sibling directories../Sprites/or../Hausgem/withingradata/*code — that is a layering bug
Never leak private-sibling paths into public docs/code — no references to../Sprites/,../Hausgem/, email addresses, OneDrive paths, or Sprites-specific examples from insidegradata/*
Use atomic-write helper when writing JSON files to prevent corruption from mid-write crashes
Files:
Gradata/src/gradata/_config.pyGradata/src/gradata/_mine_transcripts.pyGradata/src/gradata/hooks/jit_inject.pyGradata/src/gradata/middleware/_core.pyGradata/src/gradata/hooks/agent_precontext.pyGradata/src/gradata/cli.pyGradata/src/gradata/hooks/inject_brain_rules.pyGradata/src/gradata/cloud/sync.pyGradata/src/gradata/hooks/adapters/hermes.pyGradata/src/gradata/_context_packet.pyGradata/src/gradata/_core.pyGradata/src/gradata/brain.pyGradata/src/gradata/_migrations/004_event_identity_unique_key.pyGradata/src/gradata/_migrations/device_uuid.pyGradata/src/gradata/_events.py
Gradata/tests/**/*.py
📄 CodeRabbit inference engine (Gradata/AGENTS.md)
Gradata/tests/**/*.py: SetBRAIN_DIRenvironment variable viatmp_pathin conftest.py for test isolation — ensure_paths.pymodule cache refreshes when callingBrain.init()directly inside tests
Add unit tests intests/test_*.pyfor every CI push without LLM calls (deterministic); mark integration tests with@pytest.mark.integrationand skip them by default (they hit real LLM APIs)
Files:
Gradata/tests/test_brain_events.pyGradata/tests/test_migration_002_event_identity.pyGradata/tests/test_brain_config_propagation.py
🔇 Additional comments (22)
Gradata/scripts/migrate_legacy_scopes.py (1)
120-121: LGTM!Minor formatting improvement consolidating the warning message to a single line. Functionality is unchanged.
Gradata/src/gradata/hooks/adapters/hermes.py (4)
17-67: Parser scope is appropriate for the use case.The limited YAML subset (nested dicts/lists, simple scalars) matches the expected Hermes hook config structure. The "items" fallback on line 51 handles unexpected list-under-key structures defensively without failing.
One observation: lines 44-46 check
if isinstance(container, dict)before assigningcurrent_listto the container, butcurrent_listis always reassigned regardless. This is safe since orphan lists won't corrupt data, though the comment could clarify this is intentional.
70-75: LGTM!Scalar parsing handles the expected value types (booleans and strings) correctly. Numeric values would be preserved as strings, which is fine since hook configs use string IDs and commands.
78-98: LGTM!The recursive dumping correctly handles nested dicts and lists with proper indentation. Multi-key dict items within lists use the expected YAML continuation style (first key with
-, subsequent keys aligned).
110-139: LGTM!The
installfunction correctly uses the new parser/dumper and maintains atomic writes viaatomic_write_text. Defensive type checks for the hooks structure are preserved, and error handling properly returns a failure result. As per coding guidelines, atomic-write helper is used to prevent corruption from mid-write crashes.Gradata/examples/with_claude_code.py (1)
16-16: Formatting-only change is safe.Line 16 is a non-functional spacing update; behavior and example flow remain unchanged.
Gradata/examples/with_openai.py (1)
10-10: Formatting-only change is safe.Line 10 is purely cosmetic and does not alter runtime behavior.
Gradata/examples/basic_usage.py (1)
2-4: Spacing cleanup looks good.Lines 2–4 are non-functional formatting adjustments only.
Gradata/src/gradata/_context_packet.py (1)
383-386: Session fallback logic is consistent here.The Line 385 coalescing keeps
sessionconcrete for audit/wrap-up query paths and avoidsNonepropagation into int-based filters.Gradata/src/gradata/_config.py (1)
82-85: Good API addition for config loading.
BrainConfig.load(...)is a clean, centralized entrypoint that reduces duplication in consumers.Gradata/src/gradata/middleware/_core.py (1)
125-132: Config-driven prompt cap wiring looks good.Using per-brain
max_recall_tokensto size middleware injection is a solid propagation of runtime config.Gradata/src/gradata/hooks/agent_precontext.py (1)
148-154: Cap propagation and enforcement are consistent.Lines 148–154 and 226–227 correctly apply per-brain prompt sizing with a safe fallback.
Also applies to: 226-227
Gradata/src/gradata/brain.py (1)
239-239: LGTM! Session handling changes are consistent with the PR objectives.The changes correctly:
- Coalesce
Noneto0in thesessionproperty (Line 239)- Forward session value directly to
emit()which now handlesNoneinternally (Line 1516)- Coalesce missing session to
0intrack_rule(Line 1920)This aligns with the stated goal of preserving
Nonesession from callers and resolving the active session at write time.Also applies to: 1516-1516, 1920-1920
Gradata/tests/test_migration_002_event_identity.py (1)
171-194: LGTM! Well-structured test for event identity population.The test correctly validates:
- Two emitted events get distinct
event_idvalues (ULID format)device_idis stable across emits within the same braincontent_hashis identical for events with the same content- Returned event dicts match the persisted DB values
Gradata/tests/test_brain_config_propagation.py (1)
1-55: LGTM! Good integration test for BrainConfig propagation.The test comprehensively validates that
max_recall_tokensfrombrain-config.jsonis respected across all four injection paths (session injection, agent precontext, JIT inject, and middleware). The helper function generates appropriately large lessons to test truncation behavior.One minor observation: Line 54's assertion (
assert os.environ["GRADATA_JIT_ENABLED"] == "1") is redundant since you set it on line 34 withmonkeypatch.setenv. However, this doesn't affect correctness.Gradata/src/gradata/_mine_transcripts.py (1)
222-238: LGTM! Idempotent event IDs for transcript mining.The
_stable_event_idfunction generates a deterministic 32-character hash from event metadata, enabling idempotent replays of mined transcripts. The hardcodedsession=0is appropriate for backfill scenarios where historical session context isn't available.Also applies to: 444-444
Gradata/src/gradata/_migrations/device_uuid.py (1)
30-47: LGTM! Brain-aware device ID generation.The refactored implementation correctly:
- Derives brain identity from manifest metadata (with fallback to path)
- Seeds the device ID from
hostname:brain_identityfor per-brain, per-machine uniqueness- Maintains backward compatibility with existing atomic write patterns
Also applies to: 75-75
Gradata/bench/pmr_100.py (1)
41-61: LGTM! Formatting and style improvements.The benchmark file has been cleaned up with:
- Reorganized imports with UTC alias for cleaner datetime usage
- Multiline function signatures for better readability
- Multiline CLI argument definitions matching broader CLI conventions
- Added
--quickflag for fast feedback (10 sessions vs 100)No functional changes to the benchmark logic.
Also applies to: 296-298, 350-352, 453-469
Gradata/src/gradata/_migrations/004_event_identity_unique_key.py (1)
1-101: LGTM! Migration correctly adds brain_id and unique constraint.The migration:
- Adds
brain_idcolumn and other identity columns- Drops obsolete dedup indexes
- Backfills identity fields for existing rows
- Creates the unique
(brain_id, event_id)indexGradata/src/gradata/_events.py (3)
37-76: LGTM! Well-designed identity and session normalization helpers.The new helper functions are well-structured:
_canonical_content_hash: Creates deterministic SHA256 for content deduplication_legacy_event_id: Provides stable IDs for historical JSONL rows_coerce_session_for_storage: Normalizes session values, returningNonefor invalid/non-positive values (aligning with the PR objective to stop emittingsession=0)
165-209: LGTM! Schema additions for event identity tracking.The schema correctly adds:
- Identity columns:
brain_id,event_id,device_id,content_hash- Lineage columns:
correction_chain_id,origin_agent- Unique index on
(brain_id, event_id)for deduplicationThe use of
contextlib.suppress(sqlite3.OperationalError)for ALTER TABLE statements handles the idempotency of schema migrations gracefully.
354-358: LGTM! Session and identity resolution with defensive fallbacks.The
emit()function now correctly:
- Resolves session at write time (defaulting to 1 if undetectable), then coerces to
Nonefor non-positive values- Ensures
redacted_datais always a dict (wrapping scalars in{"_raw": ...})- Provides hash-based fallbacks for
tenant_idanddevice_idresolution failuresThis aligns with the PR objective of resolving active session at write time while maintaining robustness.
Also applies to: 398-412
| brain_cloud_cfg = _load_cloud_config(brain.dir) | ||
| api_key = _cloud_creds.resolve_credential(fallback=brain_cloud_cfg.token) |
There was a problem hiding this comment.
Honor sync_enabled before resolving shared credentials.
This auto-sync path now uploads whenever a global keyfile or env credential exists, even if this brain was never cloud-enabled. That can leak data from the wrong brain as soon as brain_end_session() runs.
Suggested fix
brain_cloud_cfg = _load_cloud_config(brain.dir)
+ if not brain_cloud_cfg.sync_enabled:
+ return
api_key = _cloud_creds.resolve_credential(fallback=brain_cloud_cfg.token)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| brain_cloud_cfg = _load_cloud_config(brain.dir) | |
| api_key = _cloud_creds.resolve_credential(fallback=brain_cloud_cfg.token) | |
| brain_cloud_cfg = _load_cloud_config(brain.dir) | |
| if not brain_cloud_cfg.sync_enabled: | |
| return | |
| api_key = _cloud_creds.resolve_credential(fallback=brain_cloud_cfg.token) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@Gradata/src/gradata/_core.py` around lines 1241 - 1242, Before resolving
shared credentials, check the brain's sync flag and skip shared-credential
resolution when sync is disabled: move or guard the calls to
_load_cloud_config(brain.dir) and _cloud_creds.resolve_credential(...) behind a
conditional that verifies the brain's sync_enabled (or equivalent) property, and
ensure brain_end_session() uses that same check so we never auto-upload or
resolve globals for brains that are not cloud-enabled; update references in the
brain_end_session flow, the _load_cloud_config usage, and any code paths that
call _cloud_creds.resolve_credential to only run when sync_enabled is true.
| sync_mode = getattr(brain_cloud_cfg, "sync_mode", "full") or "full" | ||
| try: | ||
| cfg = _parse_toml_cloud(config_path) | ||
| sync_mode = cfg.get("sync_mode", "metrics_only") | ||
| sync_mode = cfg.get("sync_mode", sync_mode) |
There was a problem hiding this comment.
Do not let legacy TOML override the per-brain sync_mode.
These lines can undo the "full" mode that cmd_sync() just persisted for this brain, so a stale ~/.gradata/config.toml with metrics_only will still skip event backfill and recreate the dashboard gap.
Suggested fix
- sync_mode = getattr(brain_cloud_cfg, "sync_mode", "full") or "full"
- try:
- cfg = _parse_toml_cloud(config_path)
- sync_mode = cfg.get("sync_mode", sync_mode)
- except Exception:
- pass
+ sync_mode = str(getattr(brain_cloud_cfg, "sync_mode", "") or "").strip().lower()
+ if not sync_mode and config_path.is_file():
+ try:
+ cfg = _parse_toml_cloud(config_path)
+ sync_mode = str(cfg.get("sync_mode", "") or "").strip().lower()
+ except Exception:
+ pass
+ sync_mode = sync_mode or "full"📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| sync_mode = getattr(brain_cloud_cfg, "sync_mode", "full") or "full" | |
| try: | |
| cfg = _parse_toml_cloud(config_path) | |
| sync_mode = cfg.get("sync_mode", "metrics_only") | |
| sync_mode = cfg.get("sync_mode", sync_mode) | |
| sync_mode = str(getattr(brain_cloud_cfg, "sync_mode", "") or "").strip().lower() | |
| if not sync_mode and config_path.is_file(): | |
| try: | |
| cfg = _parse_toml_cloud(config_path) | |
| sync_mode = str(cfg.get("sync_mode", "") or "").strip().lower() | |
| except Exception: | |
| pass | |
| sync_mode = sync_mode or "full" |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@Gradata/src/gradata/_core.py` around lines 1345 - 1348, The global TOML
should not override a per-brain persisted sync_mode: after calling
_parse_toml_cloud(config_path) only apply cfg.get("sync_mode", ...) when the
brain does not already have a sync_mode set; e.g. keep the initial sync_mode
from getattr(brain_cloud_cfg, "sync_mode", "full") and replace the unconditional
sync_mode = cfg.get(...) with a conditional (if not hasattr(brain_cloud_cfg,
"sync_mode") or brain_cloud_cfg.sync_mode in (None, "")) then use cfg.get(...).
This ensures cmd_sync()'s per-brain "full" setting isn't clobbered by legacy
~/.gradata/config.toml.
| def _canonical_content_hash(ev_type: str, source: str | None, data_json: str | None) -> str: | ||
| import hashlib | ||
|
|
||
| try: | ||
| data = json.loads(data_json) if data_json else {} | ||
| except (json.JSONDecodeError, TypeError): | ||
| data = {"_raw": data_json} | ||
| canonical = json.dumps( | ||
| {"type": ev_type, "source": source or "", "data": data}, | ||
| sort_keys=True, | ||
| separators=(",", ":"), | ||
| ensure_ascii=False, | ||
| ) | ||
| return hashlib.sha256(canonical.encode("utf-8")).hexdigest() |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial | 💤 Low value
Minor inconsistency: _canonical_content_hash differs from _events.py version.
The migration's _canonical_content_hash is missing default=str in json.dumps() (Line 111-116), whereas the _events.py version includes it. This is acceptable for the migration's backfill use case since it processes data already serialized to JSON in the database. However, for consistency and future-proofing:
🔧 Optional: Add default=str for consistency
canonical = json.dumps(
{"type": ev_type, "source": source or "", "data": data},
sort_keys=True,
separators=(",", ":"),
ensure_ascii=False,
+ default=str,
)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@Gradata/src/gradata/_migrations/004_event_identity_unique_key.py` around
lines 104 - 117, The migration function _canonical_content_hash currently calls
json.dumps without default=str; update the json.dumps invocation inside
_canonical_content_hash to include default=str (i.e., json.dumps(...,
default=str, sort_keys=True, ...)) so the behavior matches the _events.py
version and handles non-serializable values consistently during backfill.
| brain_root = Path( | ||
| env_str("GRADATA_BRAIN") or getattr(args, "brain_dir", None) or Path.cwd() | ||
| ).resolve() |
There was a problem hiding this comment.
Resolve the sync target with the standard brain precedence.
This path ignores BRAIN_DIR and lets GRADATA_BRAIN override an explicit --brain-dir, so gradata sync --full can backfill the wrong brain.
Suggested fix
- brain_root = Path(
- env_str("GRADATA_BRAIN") or getattr(args, "brain_dir", None) or Path.cwd()
- ).resolve()
+ brain_root = _resolve_brain_root(args).resolve()🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@Gradata/src/gradata/cli.py` around lines 886 - 888, The brain_root resolution
currently lets GRADATA_BRAIN override an explicit --brain-dir; change the
precedence so the CLI flag wins first, then the BRAIN_DIR env var, then
GRADATA_BRAIN, then Path.cwd(). Update the expression that sets brain_root in
gradata/cli.py to prefer getattr(args, "brain_dir", None) first, then
env_str("BRAIN_DIR"), then env_str("GRADATA_BRAIN"), and finally Path.cwd(), and
call .resolve() on the resulting Path.
| token=str(data.get("token", "")), | ||
| api_base=_normalize_api_base(str(data.get("api_base", _DEFAULT_API_BASE))), | ||
| contribute_corpus=bool(data.get("contribute_corpus", False)), | ||
| sync_mode=str(data.get("sync_mode", "full") or "full"), |
There was a problem hiding this comment.
Normalize and validate sync_mode on load.
Line 101 accepts arbitrary strings from disk. Please constrain to supported modes (e.g., full, metrics_only) and default invalid values to full to avoid silent mode drift.
🔧 Proposed fix
- sync_mode=str(data.get("sync_mode", "full") or "full"),
+ sync_mode=(
+ (lambda m: m if m in {"full", "metrics_only"} else "full")(
+ str(data.get("sync_mode", "full") or "full").strip().lower()
+ )
+ ),🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@Gradata/src/gradata/cloud/sync.py` at line 101, The load code currently sets
sync_mode directly from disk via sync_mode=str(data.get("sync_mode", "full") or
"full"), allowing arbitrary values; change this to normalize (strip and lower)
and validate against an allowed set (e.g., {"full", "metrics_only"}) and fall
back to "full" for any invalid or missing value. Locate the code that reads
data.get("sync_mode") (the sync_mode assignment) and replace it with logic that
normalizes the string, checks membership in the allowed set, and assigns "full"
when the value is invalid or absent so the program cannot silently accept
unsupported modes.
| def _format_scalar(value) -> str: | ||
| if isinstance(value, bool): | ||
| return "true" if value else "false" | ||
| text = str(value) | ||
| if any(ch in text for ch in (":", "#", "{", "}", "[", "]")): | ||
| return repr(text) | ||
| return text |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial | 💤 Low value
Minor edge case with repr() escaping.
Using repr() for quoting could produce Python-style escape sequences (e.g., \') that the custom parser's strip("'\"") won't unescape. For the expected hook config values (simple paths and IDs), this is unlikely to occur. If edge cases arise, consider using single-quote wrapping directly instead of repr().
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@Gradata/src/gradata/hooks/adapters/hermes.py` around lines 101 - 107, The
current _format_scalar uses repr(text) which can emit Python-style escapes that
your parser won't unescape; change it to wrap the string in single quotes
instead and escape any existing single quotes inside the value by doubling them
(i.e., replace ' with ''), so _format_scalar returns a single-quoted safe string
for values containing characters like :, #, {, }, [, ].
| block = "<brain-rules>\n" + "\n".join(lines) + "\n</brain-rules>" | ||
| if len(block) > source.max_prompt_chars: | ||
| block = block[: source.max_prompt_chars].rstrip() | ||
| return block |
There was a problem hiding this comment.
Preserve valid <brain-rules> framing when truncating.
Line 294 truncates the full block string, which can cut </brain-rules> and return malformed markup. Compute the body budget first, then wrap.
🔧 Proposed fix
- block = "<brain-rules>\n" + "\n".join(lines) + "\n</brain-rules>"
- if len(block) > source.max_prompt_chars:
- block = block[: source.max_prompt_chars].rstrip()
- return block
+ open_tag = "<brain-rules>\n"
+ close_tag = "\n</brain-rules>"
+ max_chars = max(0, int(source.max_prompt_chars))
+ min_wrapper = len(open_tag) + len(close_tag)
+ if max_chars <= min_wrapper:
+ return ""
+
+ body = "\n".join(lines)
+ body_budget = max_chars - min_wrapper
+ if len(body) > body_budget:
+ body = body[:body_budget].rstrip()
+ return f"{open_tag}{body}{close_tag}"📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| block = "<brain-rules>\n" + "\n".join(lines) + "\n</brain-rules>" | |
| if len(block) > source.max_prompt_chars: | |
| block = block[: source.max_prompt_chars].rstrip() | |
| return block | |
| open_tag = "<brain-rules>\n" | |
| close_tag = "\n</brain-rules>" | |
| max_chars = max(0, int(source.max_prompt_chars)) | |
| min_wrapper = len(open_tag) + len(close_tag) | |
| if max_chars <= min_wrapper: | |
| return "" | |
| body = "\n".join(lines) | |
| body_budget = max_chars - min_wrapper | |
| if len(body) > body_budget: | |
| body = body[:body_budget].rstrip() | |
| return f"{open_tag}{body}{close_tag}" |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@Gradata/src/gradata/middleware/_core.py` around lines 292 - 295, The current
truncation slices the full block string and can chop off the closing
</brain-rules> tag; instead compute the available body budget using
source.max_prompt_chars minus the length of the opening "<brain-rules>\n" and
closing "\n</brain-rules>" (and any added newlines), truncate the joined lines
to that budget (operating on the "lines" content), then assemble the final
"block" by concatenating the opening tag, the truncated body (with rstrip to
remove trailing whitespace), and the closing tag so the <brain-rules> framing is
always intact.
v0.7.2 — Data-flow correctness
Codex deep audit found 12 bugs in the SDK→cloud→dashboard event flow. This PR ships SDK-side fixes for Bugs 1, 2, 7, 8, 9, 10, 11. Companion
gradata-cloudPR fixes the projector + adds a backfill script.The headline finding
Oliver suspected "the data flow is corrupt or not up to par." He was right. A 3-vendor council + dataflow trace + codex audit converged on one root cause:
27,000 events were ingested by the cloud
eventstable, but the dashboardcorrectionstable only grew by 35 rows.The cause was a stack of compounding bugs:
metrics_only— events never even reached/sync./sync, they were emitted withsession=0(sentinel collision).(session, description)— 27K events with identical(session=0, description="(no description)")collapsed into one row per real session.SDK fixes in this PR
P0 — Default sync now includes the corpus (Bug 1)
src/gradata/_core.py:1340—_cloud_sync_sessionwas silently usingmetrics_only. Default is nowfull. Clear WARN log whenmetrics_onlyis set so users know dashboard will be empty. New CLI:gradata sync --fullflips the mode and immediately backfills.P0 — Stop emitting events with session=0 (Bug 2)
src/gradata/brain.py:Brain.emit()andsrc/gradata/_events.py— session passed as None propagates to write-time, where the active session is detected from brain state. Session is only coerced to int at serialization, never used as 0 sentinel.P1 — Populate event identity at emit time (Bug 7)
src/gradata/_events.pynow writesevent_id(ULID),device_id(dev_<sha256>),content_hash(sha256) at emit. Migration 004 drops the legacy unique on(tenant_id, ts, type, source)and adds unique(brain_id, event_id). Two events with identical(ts, type, source)but different data now coexist.P1 — Replace false-green identity test (Bug 8)
tests/test_migration_002_event_identity.pywas assertingevent_id IS NULL, pinning the broken behavior. Replaced with assertions that ULID format is correct, device_id is stable across calls, content_hash is deterministic.P1 — BrainConfig propagation (Bug 9)
src/gradata/_config.pyaddsBrainConfig.load(brain_dir). Used byinject_brain_rules.py,agent_precontext.py,jit_inject.py,middleware/_core.py.max_recall_tokensandrankernow respected across all 4 injection paths. Newtests/test_brain_config_propagation.pycovers all 4 sites.P1 — Hermes adapter undeclared YAML dep (Bug 10)
src/gradata/hooks/adapters/hermes.pyhad a hard import onpyyaml(in extras, not core). Replaced with a small in-tree YAML reader/writer for the Hermes config schema. No moreImportErrorwhen user runsgradata install --agent hermes.P2 — README/CLI consistency (Bug 11)
gradata init --no-interactiveflag now exists (was documented but missing).Test impact
ruff check,ruff format --check,pyright src/all clean.What's NOT in this PR (deliberate)
Gradata/gradata-cloudrepo: fixes the(session, description)upsert key, adds migration 021, adds reproject script.brain.prove()holdout validation — flagged as the next bet-the-company item per council. Separate PR after we land the data-flow correctness baseline.Council verdict that drove this prioritization
3-vendor
council --full(7 lenses, 21 minutes) on cloud pivot: Convergence-as-a-Service is the wedge ($99/mo per brain × 10 customers = $1k MRR by demo day). The pitch requires a dashboard that accurately reflects events. This PR is the precondition.Co-authored-by: Codex noreply@openai.com