feat: eval suite + entity injection in prompt hook by EtanHey · Pull Request #72 · EtanHey/brainlayer

EtanHey · 2026-03-09T11:18:53Z

Summary

Phase 0 (Baselines): 23-case eval suite across 8 search quality domains — entity routing, tag filter, recency, Hebrew FTS, cross-project, decision retrieval, memory recall, and real mined queries. Baseline recorded at 94.7% brain_search quality.
Phase A (Entity routing in hook): brainlayer-prompt-search.py now detects known entity names (person, company, agent types) in the user's prompt and injects [Entity: Name — type] section + linked chunks before FTS results.

Before/After

Metric	Before	After
brain_search quality	94.7% (18/19)	94.7% (unchanged)
hook entity injection	25% (1/4)	100% (4/4)
Combined	82.6%	95.7% (+13.1pp)
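
The combined row is consistent with pooling the 19 brain_search cases and the 4 hook cases into one 23-case total. That pooling is an inference from the table, not something the PR states. A quick arithmetic check under that assumption:

```python
# Pass counts are taken from the table above; pooling them is an assumption.
search_pass, search_total = 18, 19
hook_before, hook_after, hook_total = 1, 4, 4

total = search_total + hook_total  # 23 cases
before = round(100 * (search_pass + hook_before) / total, 1)
after = round(100 * (search_pass + hook_after) / total, 1)
delta = round(after - before, 1)  # difference of the rounded percentages

print(before, after, delta)
```

Note that +13.1pp is the difference of the two rounded percentages; the unrounded difference (3/23) is closer to 13.0pp.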

What Changed

tests/test_eval_baselines.py — 23 pytest test cases marked @pytest.mark.live that run against the real production DB. Known gaps are marked @pytest.mark.xfail. Run with pytest -m live or python tests/test_eval_baselines.py.

scripts/run_evals.py — CLI runner: python scripts/run_evals.py --diff compares to saved baseline.

hooks/brainlayer-prompt-search.py — Entity detection additions:

detect_entities_in_prompt(): checks bigrams + single words (4+ chars, capitalized) against kg_entities WHERE entity_type IN (person, company, agent)
Possessive stripping: "Simon's" → "Simon" for correct bigram matching
Max 2 entities per prompt, each gets 2 linked chunks from kg_entity_chunks
Entity section prepended before FTS results
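
The detection steps listed above can be sketched roughly as follows. This is not the hook's actual code: the `name` column, the helper names `candidate_names` and `detect_entities`, and the choice to apply the 4+ character rule only to single words are all assumptions made for illustration.

```python
import re
import sqlite3

ENTITY_TYPES = ("person", "company", "agent")
MAX_ENTITIES = 2  # "Max 2 entities per prompt"


def candidate_names(prompt):
    """Return capitalized bigrams first, then capitalized single words."""
    # Strip surrounding punctuation, then a trailing possessive 's.
    words = [re.sub(r"'s$", "", w.strip('.,!?;:"()')) for w in prompt.split()]
    words = [w for w in words if w]
    bigrams = [
        f"{a} {b}"
        for a, b in zip(words, words[1:])
        if a[0].isupper() and b[0].isupper()
    ]
    singles = [
        w for w in words if len(w) >= 4 and w[0].isupper() and not w.isupper()
    ]
    return bigrams + singles


def detect_entities(conn, prompt):
    """Look candidates up in kg_entities, keeping only the allowed types."""
    found = []
    for name in candidate_names(prompt):
        row = conn.execute(
            "SELECT name, entity_type FROM kg_entities "
            "WHERE name = ? COLLATE NOCASE AND entity_type IN (?, ?, ?)",
            (name, *ENTITY_TYPES),
        ).fetchone()
        if row and row not in found:
            found.append(row)
        if len(found) == MAX_ENTITIES:
            break
    return found


# Hypothetical in-memory table standing in for the real knowledge graph.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kg_entities (name TEXT, entity_type TEXT)")
conn.executemany(
    "INSERT INTO kg_entities VALUES (?, ?)",
    [("Avi Simon", "person"), ("Python", "technology")],
)
print(detect_entities(conn, "What are Avi Simon's meeting preferences?"))
print(detect_entities(conn, "How does authentication work in Python?"))
```

Checking bigrams before single words lets a full name win over its parts, and the type filter in the query is what keeps technology entities such as Python out of the injected section.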

tests/eval_baselines.json — Committed baseline. Future runs compare against this.

Test Plan

pytest tests/test_eval_baselines.py -q — 22 pass, 2 xfailed, 1 xpassed
python tests/test_eval_baselines.py — prints before/after scores
Hook with "What are Avi Simon's meeting preferences?" → [Entity: Avi Simon — person] in first line
Hook with "How does authentication work in Python?" → no entity injection (Python is technology type)

🤖 Generated with Claude Code

Phase 0 (Baselines):
- tests/test_eval_baselines.py: 23-case eval suite across 8 domains (entity routing, tag filter, recency, Hebrew FTS, cross-project, decision retrieval, memory, mined real queries)
- tests/eval_baselines.json: recorded baseline results
- scripts/run_evals.py: CLI runner for before/after comparison
- tests/conftest.py: register `live` pytest mark

Phase A (Entity routing in prompt hook):
- hooks/brainlayer-prompt-search.py: detect known entity names (person, company, agent) in user prompt → inject [Entity: Name — type] section + linked chunks before FTS results
- Possessive stripping ("Simon's" → "Simon") for bigram matching
- Filter: only person/company/agent types (skip technology/concept noise)

Before/After scores (run: python tests/test_eval_baselines.py):
- brain_search quality: 94.7% (18/19), unchanged (already good)
- hook entity injection: 25% → 100% (3 tests now pass)
- combined: 82.6% → 95.7% (+13.1pp)

Known gaps (xfail):
- Hebrew semantic accuracy (query returns unrelated Hebrew content)
- "today" temporal awareness in raw hybrid_search

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

EtanHey · 2026-03-09T11:32:05Z

@cursorbot review this PR please

EtanHey · 2026-03-09T11:48:02Z

@cursor-bugbot review

EtanHey · 2026-03-09T11:54:00Z

@cursor @BugBot review

cursor · 2026-03-09T11:54:06Z

Taking a look!

EtanHey · 2026-03-09T11:54:29Z

@cursor @BugBot review

cursor · 2026-03-09T11:54:34Z

🐛 Bug Review Complete

I've analyzed all 5 changed files and found 8 issues (2 critical, 2 medium, 4 minor). Here's the breakdown:

🔴 CRITICAL — Must Fix Before Merge

Bug #1: Possessive Name Truncation
📍 hooks/brainlayer-prompt-search.py:144-148

Names ending in a bare apostrophe (with no trailing s) are incorrectly truncated:

# Current behavior:
"Simon's" → "Simon"  ✓ correct
"James'"  → "Jame"   ✗ WRONG
"Jones'"  → "Jone"   ✗ WRONG

Root cause: re.search(r"'s?$", w) matches both a trailing 's and a bare trailing ', so the code strips the final "s" in both cases.

Fix:

# Line 147 — change regex to only match 's, not just '
if re.search(r"'s$", w):  # Remove the ? quantifier
    cleaned = cleaned[:-1]
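
A minimal standalone sketch of the narrower check. `strip_possessive` is a hypothetical helper name, and the hook's real code around line 147 may be structured differently (for example, it may strip the apostrophe in an earlier step).

```python
import re


def strip_possessive(word):
    """Drop a trailing 's or a bare trailing apostrophe, keeping the name intact."""
    if re.search(r"'s$", word):
        return word[:-2]  # "Simon's" -> "Simon"
    if word.endswith("'"):
        return word[:-1]  # "James'" -> "James"
    return word


for w in ["Simon's", "James'", "Jones'", "Avi"]:
    print(w, "->", strip_possessive(w))
```

Handling the two endings in separate branches keeps the final "s" of names like James and Jones. Typographic apostrophes (’) are not covered here and would need an extra case.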

Bug #2: FTS5 Query Injection via Unescaped Quotes
📍 hooks/brainlayer-prompt-search.py:248

Keywords containing " break FTS5 with "unterminated string" error:

# Example that crashes:
keywords = ['test', 'with"quote']
fts_query = " OR ".join(f'"{kw}"' for kw in keywords)
# Result: "test" OR "with"quote"  ← syntax error

Fix option 1 (safer — strip quotes during extraction):

# Line 102 — modify keyword extraction regex
words = re.findall(r"[a-zA-Z0-9][\w-]*", text.lower())
# Add after line 110:
keywords = [w.replace('"', '') for w in keywords if w not in STOP_WORDS and len(w) > 2 and w not in seen]

Fix option 2 (escape quotes in FTS query):

# Line 248
fts_query = " OR ".join(f'"{kw.replace(chr(34), chr(34)*2)}"' for kw in keywords)
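
Option 2 relies on the FTS5 rule that a double quote inside a quoted string is escaped by doubling it. A small sketch of that escaping as a standalone function; `build_fts_query` is a hypothetical name, not a function in the hook.

```python
def build_fts_query(keywords):
    """Join keywords as quoted FTS5 strings, doubling embedded double quotes."""
    quoted = ['"' + kw.replace('"', '""') + '"' for kw in keywords]
    return " OR ".join(quoted)


print(build_fts_query(["test", 'with"quote']))
# prints: "test" OR "with""quote"
```

Building the quoted string by concatenation rather than inside an f-string avoids the nested-quote restrictions that presumably motivated the chr(34) form in the suggestion.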

🟡 MEDIUM — Should Fix

Bug #3: Timeout Budget Not Enforced
📍 hooks/brainlayer-prompt-search.py:254, 267

Two independent if elapsed_ms(start) < DEADLINE_MS checks don't prevent cumulative timeout:

# Entity detection: 250ms ✓ (under 450ms)
# FTS search: 300ms ✓ (under 450ms)
# Total: 550ms ✗ (exceeds <500ms target)

Fix: Calculate remaining budget:

# After line 265
if elapsed_ms(start) < DEADLINE_MS:
    remaining = DEADLINE_MS - elapsed_ms(start)
    if remaining < 100:  # Need at least 100ms for FTS
        conn.close()
        if lines:
            print("\n".join(lines))
        sys.exit(0)
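
The same idea can be expressed as a shared budget that every step consults before starting. This is a sketch only: the 450 ms value is inferred from the example above, the 100 ms floor comes from the suggested fix, and `remaining_ms` and `can_start_fts` are hypothetical helpers. The optional `now` argument exists just to make the behaviour easy to check.

```python
import time

DEADLINE_MS = 450        # inferred from the "under 450ms" example above
MIN_FTS_BUDGET_MS = 100  # floor taken from the suggested fix


def elapsed_ms(start, now=None):
    """Milliseconds since start, using a monotonic clock by default."""
    now = time.monotonic() if now is None else now
    return (now - start) * 1000.0


def remaining_ms(start, now=None):
    """Budget left before the overall deadline, never negative."""
    return max(0.0, DEADLINE_MS - elapsed_ms(start, now))


def can_start_fts(start, now=None):
    """Only start the FTS step if enough of the shared budget is left."""
    return remaining_ms(start, now) >= MIN_FTS_BUDGET_MS


start = time.monotonic()
# ... entity detection would run here ...
if can_start_fts(start):
    print(f"FTS may run with {remaining_ms(start):.0f} ms left")
```

A remaining-budget check only prevents a late step from starting; a step that begins in time can still overrun unless the query itself is bounded.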

Bug #4: Hardcoded Date in Recency Test
📍 tests/test_eval_baselines.py:188-189

The test checks for "2026-03-09" or "2026-03-10", so it will start failing after March 10.

Fix:

# Replace lines 188-189
from datetime import date, timedelta
today = date.today()
yesterday = today - timedelta(days=1)
if today.isoformat() in date_str or yesterday.isoformat() in date_str:
    recent_found = True
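
One way to keep such a check deterministic is to pass the reference date in rather than reading the clock inside the assertion. A sketch, where `is_recent` and its parameters are hypothetical and not part of the test file:

```python
from datetime import date, timedelta


def is_recent(date_str, today=None, window_days=1):
    """True if date_str contains today's ISO date or one within window_days before it."""
    today = today or date.today()
    recent = {
        (today - timedelta(days=d)).isoformat() for d in range(window_days + 1)
    }
    return any(day in date_str for day in recent)


print(is_recent("2026-03-09T11:18:53Z", today=date(2026, 3, 10)))
print(is_recent("2026-03-01T08:00:00Z", today=date(2026, 3, 10)))
```

In the live test the default `today=None` gives the same behaviour as the fix above, while unit tests can pin the date.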

🟢 MINOR — Nice to Fix

Issue #5: Linter Violations
📍 scripts/run_evals.py:59, 72

ruff check scripts/run_evals.py
# F541: f-string without any placeholders (lines 59, 72)

Fix: Run ruff check --fix scripts/run_evals.py

Issue #6: Capitalization Filter Too Restrictive
📍 hooks/brainlayer-prompt-search.py:166

Current filter misses 3-char entities and includes sentence starters:

len(w) >= 4 and w[0].isupper() and not w.isupper()
# Misses: "Avi" (len < 4); "IBM", "AWS" (len < 4, and also rejected by not w.isupper())
# Includes: "What", "Tell", "This" (sentence starters)

Recommendation: Lower the minimum length to 3 characters and check candidates against a list of capitalized stop words.
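
A sketch of what the relaxed filter might look like. The stop-word set and the name `is_candidate` are illustrative assumptions, and this version also drops the all-caps exclusion so that acronyms can pass, which goes slightly beyond the recommendation as written.

```python
# Hypothetical list of capitalized words that commonly start a sentence.
CAPITALIZED_STOP_WORDS = {
    "What", "Tell", "This", "How", "Who", "When", "Where", "Why", "The",
    "Does", "Can",
}


def is_candidate(word):
    """Capitalized, at least 3 characters, and not a common sentence starter."""
    return (
        len(word) >= 3
        and word[0].isupper()
        and word not in CAPITALIZED_STOP_WORDS
    )


prompt = "What does Avi think about IBM"
print([w for w in prompt.split() if is_candidate(w)])
```

Allowing all-caps tokens also admits words like "TODO", so the later lookup against kg_entities has to do the real filtering.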

Issue #7: Silent Error Swallowing
📍 hooks/brainlayer-prompt-search.py:185, 299

except sqlite3.Error: pass hides all DB errors including missing tables, making production debugging difficult.

Recommendation: Add optional debug logging:

# After imports
DEBUG = os.environ.get("BRAINLAYER_HOOK_DEBUG")

# In except blocks
except sqlite3.Error as e:
    if DEBUG:
        sys.stderr.write(f"Entity detection failed: {e}\n")
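
For context, here is a self-contained sketch of the pattern around a lookup that fails because the table is missing. `lookup_entity`, its `debug` parameter, and the `name` column are hypothetical; only the environment variable name comes from the suggestion above.

```python
import os
import sqlite3
import sys

DEBUG = bool(os.environ.get("BRAINLAYER_HOOK_DEBUG"))


def lookup_entity(conn, name, debug=DEBUG):
    """Return the matching row, or None on a miss or any database error."""
    try:
        return conn.execute(
            "SELECT name, entity_type FROM kg_entities WHERE name = ?", (name,)
        ).fetchone()
    except sqlite3.Error as exc:
        if debug:
            # Report on stderr so the hook's stdout output stays clean.
            print(f"Entity detection failed: {exc}", file=sys.stderr)
        return None


# An empty in-memory database has no kg_entities table, so this hits the except path.
conn = sqlite3.connect(":memory:")
print(lookup_entity(conn, "Avi Simon", debug=True))
```

The hook still degrades silently by default, but setting the variable surfaces problems such as a missing table.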

Issue #8: Fragile Negative Assertion
📍 tests/test_eval_baselines.py:411-413

Complex assertion could pass for wrong reasons:

assert _passes(docs, ["auth", "JWT"]) and not _passes(docs, ["hybrid_search", "VectorStore"])

If BrainLayer code mentions auth (e.g., "VectorStore auth methods"), test fails incorrectly. Already marked xfail, but consider testing for positive patterns instead.

✅ Verified Safe

SQL injection protected (parameterized queries)
Baseline calculations accurate (94.7% → 95.7%)
Test collection works (25 tests)
Hook gracefully handles edge cases

📋 Action Items

Priority order:

Fix Bug #1 (possessive regex): breaks common names like "James'"
Fix Bug #2 (quote escaping): causes silent search failures
Fix Bug #4 (hardcoded dates): will break tomorrow
Fix Bug #3 (timeout): affects UX latency
Run ruff check --fix for style issues

Would you like me to apply these fixes?

EtanHey and others added 2 commits March 9, 2026 13:24

fix: resolve ruff lint errors (unused import, f-string placeholders)

0d5cc75

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

style: format test_eval_baselines.py with ruff

8823418

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

EtanHey merged commit ef3cdd5 into main Mar 9, 2026
5 checks passed

EtanHey deleted the feat/eval-baselines-phase-a branch March 9, 2026 12:10

EtanHey mentioned this pull request Mar 17, 2026

feat: BrainBar Swift daemon — MCP over Unix socket #84

Merged

6 tasks

EtanHey merged 3 commits into main from feat/eval-baselines-phase-a