feat: eval suite + entity injection in prompt hook#72

Merged
EtanHey merged 3 commits into main from feat/eval-baselines-phase-a
Mar 9, 2026

Conversation

@EtanHey
Owner

@EtanHey EtanHey commented Mar 9, 2026

Summary

  • Phase 0 (Baselines): 23-case eval suite across 8 search quality domains — entity routing, tag filter, recency, Hebrew FTS, cross-project, decision retrieval, memory recall, and real mined queries. Baseline recorded at 94.7% brain_search quality.
  • Phase A (Entity routing in hook): brainlayer-prompt-search.py now detects known entity names (person, company, agent types) in the user's prompt and injects [Entity: Name — type] section + linked chunks before FTS results.

Before/After

| Metric | Before | After |
| --- | --- | --- |
| brain_search quality | 94.7% (18/19) | 94.7% (unchanged) |
| hook entity injection | 25% (1/4) | 100% (4/4) |
| Combined | 82.6% | 95.7% (+13.1pp) |
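
The combined figures are consistent with pooling both groups into one pass rate over all 23 cases (19 brain_search cases plus 4 hook cases). That pooling is an inference from the numbers, not something the eval script is confirmed to do; a minimal sketch of the arithmetic under that assumption:

```python
def pct(passed, total):
    """Pass rate as a percentage, rounded to one decimal place."""
    return round(100 * passed / total, 1)


def combined(*groups):
    """Pool (passed, total) pairs into a single pass rate (assumed method)."""
    passed = sum(p for p, _ in groups)
    total = sum(t for _, t in groups)
    return pct(passed, total)


search = (18, 19)                        # brain_search quality, unchanged
hook_before, hook_after = (1, 4), (4, 4)  # hook entity injection

before = combined(search, hook_before)   # 19/23
after = combined(search, hook_after)     # 22/23
delta_pp = round(after - before, 1)
print(before, after, delta_pp)
```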

What Changed

tests/test_eval_baselines.py: 23 pytest test cases marked @pytest.mark.live that run against the real production DB. Known gaps are marked @pytest.mark.xfail. Run with pytest -m live or python tests/test_eval_baselines.py.
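
For readers unfamiliar with custom marks: pytest warns about unregistered marks, which is presumably why tests/conftest.py registers `live`. The sketch below shows one common way to do that with pytest's `addinivalue_line` hook API; the marker description text is an assumption, and the actual conftest.py in this PR may differ.

```python
# conftest.py (sketch, not the file from this PR)
LIVE_MARKER = "live: test runs against the real production DB"  # wording assumed


def pytest_configure(config):
    """Register the custom `live` mark so `pytest -m live` selects these tests."""
    config.addinivalue_line("markers", LIVE_MARKER)
```

With the mark registered, `pytest -m live` runs only the live cases and `pytest -m "not live"` skips them.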

scripts/run_evals.py — CLI runner: python scripts/run_evals.py --diff compares to saved baseline.

hooks/brainlayer-prompt-search.py — Entity detection additions:

  • detect_entities_in_prompt(): checks bigrams + single words (4+ chars, capitalized) against kg_entities WHERE entity_type IN (person, company, agent)
  • Possessive stripping: "Simon's" → "Simon" for correct bigram matching
  • Max 2 entities per prompt, each gets 2 linked chunks from kg_entity_chunks
  • Entity section prepended before FTS results
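
The detection steps listed above can be sketched roughly as follows. This is an illustration under assumptions, not the hook's actual code: the `name` column, the case-insensitive match, and the helper names `candidate_names` and `detect_entities` are hypothetical; only `kg_entities`, `entity_type`, the type filter, and the two-entity cap come from the description.

```python
import re
import sqlite3

ENTITY_TYPES = ("person", "company", "agent")
MAX_ENTITIES = 2


def candidate_names(prompt):
    """Bigrams first, then capitalized single words of 4+ characters."""
    # Trim surrounding punctuation, then drop a trailing possessive.
    words = [re.sub(r"['’]s?$", "", w.strip('.,!?:;"()')) for w in prompt.split()]
    words = [w for w in words if w]
    bigrams = [f"{a} {b}" for a, b in zip(words, words[1:])]
    singles = [w for w in words if len(w) >= 4 and w[0].isupper() and not w.isupper()]
    return bigrams + singles


def detect_entities(conn, prompt):
    """Return up to MAX_ENTITIES (name, entity_type) rows matched in the prompt."""
    placeholders = ",".join("?" for _ in ENTITY_TYPES)
    sql = (
        "SELECT name, entity_type FROM kg_entities "
        f"WHERE name = ? COLLATE NOCASE AND entity_type IN ({placeholders})"
    )
    found = []
    for cand in candidate_names(prompt):
        row = conn.execute(sql, (cand, *ENTITY_TYPES)).fetchone()
        if row and row not in found:
            found.append(row)
        if len(found) >= MAX_ENTITIES:
            break
    return found
```

Checking bigrams before single words means "Avi Simon" is matched as one entity before "Simon" is tried on its own.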

tests/eval_baselines.json — Committed baseline. Future runs compare against this.

Test Plan

  • pytest tests/test_eval_baselines.py -q — 22 pass, 2 xfailed, 1 xpassed
  • python tests/test_eval_baselines.py — prints before/after scores
  • Hook with "What are Avi Simon's meeting preferences?" → [Entity: Avi Simon — person] in first line
  • Hook with "How does authentication work in Python?" → no entity injection (Python is technology type)

🤖 Generated with Claude Code

Phase 0 — Baselines:
- tests/test_eval_baselines.py: 23-case eval suite across 8 domains
  (entity routing, tag filter, recency, Hebrew FTS, cross-project,
   decision retrieval, memory, mined real queries)
- tests/eval_baselines.json: recorded baseline results
- scripts/run_evals.py: CLI runner for before/after comparison
- tests/conftest.py: register `live` pytest mark

Phase A — Entity routing in prompt hook:
- hooks/brainlayer-prompt-search.py: detect known entity names
  (person, company, agent) in user prompt → inject [Entity: Name — type]
  section + linked chunks before FTS results
- Possessive stripping ("Simon's" → "Simon") for bigram matching
- Filter: only person/company/agent types (skip technology/concept noise)

Before/After scores (run: python tests/test_eval_baselines.py):
- brain_search quality: 94.7% (18/19) — unchanged (already good)
- hook entity injection: 25% → 100% (3 tests now pass)
- combined: 82.6% → 95.7% (+13.1pp)

Known gaps (xfail):
- Hebrew semantic accuracy (query returns unrelated Hebrew content)
- "today" temporal awareness in raw hybrid_search

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@coderabbitai

coderabbitai Bot commented Mar 9, 2026

Warning

Rate limit exceeded

@EtanHey has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 10 minutes and 56 seconds before requesting another review.

📥 Commits

Reviewing files that changed from the base of the PR and between 73c2e4f and 8823418.

📒 Files selected for processing (5)
  • hooks/brainlayer-prompt-search.py
  • scripts/run_evals.py
  • tests/conftest.py
  • tests/eval_baselines.json
  • tests/test_eval_baselines.py

EtanHey and others added 2 commits March 9, 2026 13:24
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@EtanHey
Owner Author

EtanHey commented Mar 9, 2026

@cursorbot review this PR please

@EtanHey
Owner Author

EtanHey commented Mar 9, 2026

@cursor-bugbot review

@EtanHey
Owner Author

EtanHey commented Mar 9, 2026

@cursor @BugBot review

@cursor

cursor Bot commented Mar 9, 2026

Taking a look!

@EtanHey
Owner Author

EtanHey commented Mar 9, 2026

@cursor @BugBot review

@cursor

cursor Bot commented Mar 9, 2026

🐛 Bug Review Complete

I've analyzed all 5 changed files and found 8 issues (2 critical, 2 medium, 4 minor). Here's the breakdown:


🔴 CRITICAL — Must Fix Before Merge

Bug #1: Possessive Name Truncation
📍 hooks/brainlayer-prompt-search.py:144-148

Names ending with ' (without 's') are incorrectly truncated:

# Current behavior:
"Simon's" → "Simon"  # correct
"James'" → "Jame"    # WRONG
"Jones'" → "Jone"    # WRONG

Root cause: re.search(r"'s?$", w) matches both a trailing 's and a bare trailing ', so the code drops the final character even when that character is part of the name.

Fix:

# Line 147 — change regex to only match 's, not just '
if re.search(r"'s$", w):  # Remove the ? quantifier
    cleaned = cleaned[:-1]
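
An alternative to slicing off a fixed number of characters is to remove exactly the suffix the regex matched, which handles both possessive forms. The helper below is a hypothetical sketch (the name `strip_possessive` and the support for the curly apostrophe are assumptions, not code from the hook):

```python
import re

# Matches 's or a bare apostrophe at the end of a word (straight or curly).
_POSSESSIVE = re.compile(r"['’]s?$")


def strip_possessive(word):
    """Remove only the possessive suffix, leaving the name itself intact."""
    return _POSSESSIVE.sub("", word)


for raw in ("Simon's", "James'", "Jones'", "Avi"):
    print(raw, "->", strip_possessive(raw))
```

Because the substitution removes only what matched, a name that ends in "s" keeps its final letter.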

Bug #2: FTS5 Query Injection via Unescaped Quotes
📍 hooks/brainlayer-prompt-search.py:248

Keywords containing " break FTS5 with "unterminated string" error:

# Example that crashes:
keywords = ['test', 'with"quote']
fts_query = " OR ".join(f'"{kw}"' for kw in keywords)
# Result: "test" OR "with"quote"  ← syntax error

Fix option 1 (safer — strip quotes during extraction):

# Line 102 — modify keyword extraction regex
words = re.findall(r"[a-zA-Z0-9][\w-]*", text.lower())
# Add after line 110:
keywords = [w.replace('"', '') for w in keywords if w not in STOP_WORDS and len(w) > 2 and w not in seen]

Fix option 2 (escape quotes in FTS query):

# Line 248
fts_query = " OR ".join(f'"{kw.replace(chr(34), chr(34)*2)}"' for kw in keywords)
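
Option 2 relies on the FTS5 rule that a double quote inside a quoted string is escaped by doubling it. A small standalone sketch of that escaping, with `build_fts_query` as a hypothetical helper name:

```python
def build_fts_query(keywords):
    """Join keywords into an FTS5 OR query, doubling any embedded double quotes."""
    quoted = ['"' + kw.replace('"', '""') + '"' for kw in keywords]
    return " OR ".join(quoted)


print(build_fts_query(["test", 'with"quote']))
# prints: "test" OR "with""quote"
```

Escaping at query-build time keeps the fix in one place, whereas stripping quotes during extraction changes what the keywords contain.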

🟡 MEDIUM — Should Fix

Bug #3: Timeout Budget Not Enforced
📍 hooks/brainlayer-prompt-search.py:254, 267

Two independent if elapsed_ms(start) < DEADLINE_MS checks don't prevent cumulative timeout:

# Entity detection: 250ms ✓ (under 450ms)
# FTS search: 300ms ✓ (under 450ms)
# Total: 550ms ✗ (exceeds <500ms target)

Fix: Calculate remaining budget:

# After line 265
if elapsed_ms(start) < DEADLINE_MS:
    remaining = DEADLINE_MS - elapsed_ms(start)
    if remaining < 100:  # Need at least 100ms for FTS
        conn.close()
        if lines:
            print("\n".join(lines))
        sys.exit(0)
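
The same idea can be expressed as a small budget helper that each stage consults before starting. This is a sketch under assumptions: the 450 ms value is taken from the notes above rather than from the hook's source, the body of `elapsed_ms` is guessed from its name, and `remaining_ms` and `should_run_fts` are hypothetical helpers.

```python
import time

DEADLINE_MS = 450  # assumed from the "(under 450ms)" notes above
MIN_FTS_MS = 100   # minimum the review suggests reserving for FTS


def elapsed_ms(start, now=None):
    """Milliseconds since `start`; `now` can be injected for testing."""
    now = time.monotonic() if now is None else now
    return (now - start) * 1000.0


def remaining_ms(start, now=None):
    """Budget left before the deadline, never negative."""
    return max(0.0, DEADLINE_MS - elapsed_ms(start, now))


def should_run_fts(start, now=None):
    """Only start FTS if enough of the shared budget is left."""
    return remaining_ms(start, now) >= MIN_FTS_MS
```

A check of this kind only prevents a stage from starting late; a query that is already running can still overrun the deadline.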

Bug #4: Hardcoded Date in Recency Test
📍 tests/test_eval_baselines.py:188-189

Test checks for "2026-03-09" or "2026-03-10" which will be incorrect after March 10:

Fix:

# Replace lines 188-189
from datetime import date, timedelta
today = date.today()
yesterday = today - timedelta(days=1)
if today.isoformat() in date_str or yesterday.isoformat() in date_str:
    recent_found = True
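
To keep the recency check deterministic in tests, the reference date can be passed in rather than read from the clock. A minimal sketch, assuming `date_str` contains an ISO date as in the snippet above; `is_recent` and `window_days` are hypothetical names:

```python
from datetime import date, timedelta


def is_recent(date_str, today=None, window_days=1):
    """True if date_str contains today's date or one within window_days before it."""
    today = today or date.today()
    recent = {(today - timedelta(days=n)).isoformat() for n in range(window_days + 1)}
    return any(d in date_str for d in recent)


print(is_recent("2026-03-09T12:00:00", today=date(2026, 3, 10)))
```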

🟢 MINOR — Nice to Fix

Issue #5: Linter Violations
📍 scripts/run_evals.py:59, 72

ruff check scripts/run_evals.py
# F541: f-string without any placeholders (lines 59, 72)

Fix: Run ruff check --fix scripts/run_evals.py


Issue #6: Capitalization Filter Too Restrictive
📍 hooks/brainlayer-prompt-search.py:166

Current filter misses 3-char entities and includes sentence starters:

len(w) >= 4 and w[0].isupper() and not w.isupper()
# Misses: "Avi", "IBM", "AWS" (len < 4; the acronyms are also rejected by not w.isupper())
# Includes: "What", "Tell", "This" (sentence starters)

Recommendation: Lower to 3 chars and check against capitalized stop words list.
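
One way the recommendation could look is sketched below. The stop-word list is illustrative only, `is_entity_candidate` is a hypothetical name, and dropping the all-caps exclusion (so that acronyms such as "IBM" pass) is an assumption that goes beyond what the recommendation states.

```python
# Illustrative list; a real one would need to be tuned against actual prompts.
CAPITALIZED_STOP_WORDS = {"What", "Tell", "This", "That", "How", "Why", "When", "Where", "Who"}


def is_entity_candidate(word, min_len=3):
    """Capitalized word of at least min_len chars that is not a sentence starter."""
    return (
        len(word) >= min_len
        and word[0].isupper()
        and word not in CAPITALIZED_STOP_WORDS
    )


print([w for w in "What does Avi think about IBM".split() if is_entity_candidate(w)])
# prints: ['Avi', 'IBM']
```

Allowing all-caps words also admits emphasized words typed in capitals, so the lookup against known entities remains the real filter.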


Issue #7: Silent Error Swallowing
📍 hooks/brainlayer-prompt-search.py:185, 299

except sqlite3.Error: pass hides all DB errors including missing tables, making production debugging difficult.

Recommendation: Add optional debug logging:

# After imports (requires `import os`)
DEBUG = os.environ.get("BRAINLAYER_HOOK_DEBUG")

# In except blocks
except sqlite3.Error as e:
    if DEBUG:
        sys.stderr.write(f"Entity detection failed: {e}\n")

Issue #8: Fragile Negative Assertion
📍 tests/test_eval_baselines.py:411-413

Complex assertion could pass for wrong reasons:

assert _passes(docs, ["auth", "JWT"]) and not _passes(docs, ["hybrid_search", "VectorStore"])

If BrainLayer code mentions auth (e.g., "VectorStore auth methods"), test fails incorrectly. Already marked xfail, but consider testing for positive patterns instead.


Verified Safe

  • SQL injection protected (parameterized queries)
  • Baseline calculations accurate (brain_search 94.7%; combined 82.6% → 95.7%)
  • Test collection works (25 tests)
  • Hook gracefully handles edge cases

📋 Action Items

Priority order:

  1. Fix Bug #1 (possessive regex): breaks common names like "James'"
  2. Fix Bug #2 (quote escaping): causes silent search failures
  3. Fix Bug #4 (hardcoded dates): will break tomorrow
  4. Fix Bug #3 (timeout): affects UX latency
  5. Run ruff check --fix for style issues

Would you like me to apply these fixes?

@EtanHey EtanHey merged commit ef3cdd5 into main Mar 9, 2026
5 checks passed
@EtanHey EtanHey deleted the feat/eval-baselines-phase-a branch March 9, 2026 12:10