bench: PMR-100 procedural memory retention benchmark (rebased) #155
The one benchmark the council recommended Gradata ship before launch. 100 scripted sessions, 6 correction classes, recall@1 and recall@3 metrics with a per-class breakdown.

First baseline run (3 sessions, BEHAVIORAL class): 0% rules extracted, 0% recall. This is the work. Track on every PR. Ship at >=70% recall@1 across all classes.

Run: python -m bench.pmr_100 [--quick] [-n N]
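To make the metric concrete, here is a minimal sketch of how recall@k with a per-class breakdown could be computed. It is not the code in bench.pmr_100: the function names, the (class, rank) input shape, and the class label PLACEHOLDER_CLASS are assumptions made for illustration, and the ship gate is read here as "every class individually reaches the threshold", which is one possible reading of "across all classes".

```python
from collections import defaultdict


def recall_at_k(results, k):
    """Return (overall, per_class) recall@k.

    `results` is a list of (correction_class, rank) pairs, where rank is the
    1-based position of the expected rule among recalled rules, or None if
    the rule was not recalled at all. This input shape is an assumption.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for cls, rank in results:
        totals[cls] += 1
        if rank is not None and rank <= k:
            hits[cls] += 1
    per_class = {cls: hits[cls] / totals[cls] for cls in totals}
    overall = sum(hits.values()) / sum(totals.values()) if totals else 0.0
    return overall, per_class


def meets_ship_gate(per_class, threshold=0.70):
    """Assumed gate: every class must individually reach the threshold."""
    return bool(per_class) and all(v >= threshold for v in per_class.values())


# Illustrative data only; PLACEHOLDER_CLASS is not a real PMR-100 class name.
sample = [
    ("BEHAVIORAL", 1),
    ("BEHAVIORAL", 3),
    ("BEHAVIORAL", None),
    ("PLACEHOLDER_CLASS", 1),
]
overall_1, per_class_1 = recall_at_k(sample, k=1)
overall_3, per_class_3 = recall_at_k(sample, k=3)
print(overall_1, per_class_1, meets_ship_gate(per_class_1))
print(overall_3, per_class_3)
```

Reporting the per-class dictionary alongside the overall number matters because a strong class can hide a class that never graduates any rules.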
…atch

Fix the wrong assumption that apply_brain_rules returns a list of rule objects; it returns a formatted prompt string. Recall scoring now checks whether the expected keywords appear in the rendered text.

Smoke run (10 sessions): still 0% recall, which confirms that the kernel does not graduate rules from a single correction. Multiple reinforcements are needed before the lessons file populates. This is by design (FSRS scoring) and is the real launch question: how many reinforcements until rules become callable?
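The keyword-based scoring described in this commit could look roughly like the sketch below. It assumes only what the message states, namely that the rules arrive as one rendered prompt string. The helper name, the case-insensitive matching, and the require_all switch are assumptions; the commit does not say whether all keywords or any one keyword must appear.

```python
from typing import Iterable


def keywords_recalled(rendered_prompt: str,
                      expected_keywords: Iterable[str],
                      require_all: bool = True) -> bool:
    """Check whether expected keywords occur in the rendered rules text.

    Hypothetical helper: matching is plain case-insensitive substring search.
    An empty keyword list counts as a miss so it cannot inflate recall.
    """
    text = rendered_prompt.casefold()
    found = [kw.casefold() in text for kw in expected_keywords]
    if not found:
        return False
    return all(found) if require_all else any(found)


# An empty rendered string (no graduated rules yet) always scores as a miss.
print(keywords_recalled("", ["iso 8601"]))
print(keywords_recalled("Rule: always use ISO 8601 dates.", ["ISO 8601", "dates"]))
```

Substring matching is deliberately crude: it can count a rule as recalled when the keyword appears in unrelated text, so keyword choice affects how trustworthy the score is.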
📝 Walkthrough
Introduces the PMR-100 "Procedural Memory Retention" benchmark suite. Adds a benchmarking script that evaluates a Brain system's ability to extract and recall procedural rules through correction injection, distractor turns, and recall scoring. Includes documentation, package setup, and a CLI for running benchmarks with configurable parameters.
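The three phases named in the walkthrough (correction injection, distractor turns, recall scoring) can be sketched against a stand-in object. Everything here is hypothetical: StubBrain, its methods, and the fixed graduate_after counter are not the Gradata Brain API, and the counter is a deliberately simple substitute for the FSRS scoring the real kernel uses. The sketch only shows the shape of a session and of a sweep over reinforcement counts.

```python
class StubBrain:
    """Hypothetical stand-in for a Brain; NOT the real API.

    A rule "graduates" after a fixed number of identical corrections.
    The real kernel uses FSRS scoring, which this counter does not model.
    """

    def __init__(self, graduate_after=3):
        self.graduate_after = graduate_after
        self.counts = {}

    def correct(self, rule_text):
        self.counts[rule_text] = self.counts.get(rule_text, 0) + 1

    def chat(self, message):
        pass  # distractor turns leave rule state untouched in this stub

    def render_rules(self):
        graduated = [r for r, n in self.counts.items()
                     if n >= self.graduate_after]
        return "\n".join(f"- {r}" for r in graduated)


def run_session(brain, correction, keywords, reinforcements, distractors):
    """Inject corrections, play distractor turns, then score recall."""
    for _ in range(reinforcements):
        brain.correct(correction)
    for turn in distractors:
        brain.chat(turn)
    rendered = brain.render_rules().casefold()
    return all(kw.casefold() in rendered for kw in keywords)


def reinforcements_needed(make_brain, correction, keywords, max_n=10):
    """Smallest reinforcement count at which the rule is recalled, else None."""
    for n in range(1, max_n + 1):
        if run_session(make_brain(), correction, keywords, n, ["unrelated turn"]):
            return n
    return None


needed = reinforcements_needed(lambda: StubBrain(graduate_after=3),
                               "Always use ISO 8601 dates", ["iso 8601"])
print(needed)
```

A sweep like reinforcements_needed, pointed at the real system instead of the stub, is one way to measure how many reinforcements a rule needs before it shows up in the rendered text.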
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Clean rebase of #148.