bench: PMR-100 procedural memory retention benchmark (rebased)#155

Merged: Gradata merged 2 commits into main from rebase/pmr-100-benchmark on May 1, 2026

Gradata (Owner) commented May 1, 2026

Clean rebase of #148.

oliver added 2 commits May 1, 2026 09:08
The one benchmark the council recommended Gradata ship before launch: 100
scripted sessions, 6 correction classes, and recall@1/recall@3 metrics with
a per-class breakdown.

First baseline run (3 sessions, BEHAVIORAL class): 0% rules extracted, 0%
recall. This is the work. Track on every PR. Ship at >=70% recall@1
across all classes.

Run: python -m bench.pmr_100 [--quick] [-n N]
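
The per-class breakdown and the ship gate described in this commit message can be sketched roughly as follows. This is an illustration only, not the code in bench/pmr_100.py: `SessionOutcome`, `per_class_recall`, and `meets_ship_gate` are hypothetical names, and reading ">=70% recall@1 across all classes" as "every class individually reaches 70%" is an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class SessionOutcome:
    """Hypothetical per-session record; field names are assumptions."""
    correction_class: str
    hit_at_1: bool
    hit_at_3: bool


def per_class_recall(outcomes):
    """Aggregate recall@1 and recall@3 for each correction class."""
    totals = defaultdict(lambda: {"n": 0, "r1": 0, "r3": 0})
    for o in outcomes:
        t = totals[o.correction_class]
        t["n"] += 1
        t["r1"] += int(o.hit_at_1)
        t["r3"] += int(o.hit_at_3)
    return {
        cls: {
            "recall@1": t["r1"] / t["n"],
            "recall@3": t["r3"] / t["n"],
            "n": t["n"],
        }
        for cls, t in totals.items()
    }


def meets_ship_gate(breakdown, threshold=0.70):
    """Assumed reading of the gate: every class must reach the threshold."""
    return bool(breakdown) and all(
        m["recall@1"] >= threshold for m in breakdown.values()
    )
```

Under this reading, a single weak class blocks shipping even if the pooled recall@1 is above 70%. If the intended gate is the pooled figure, the check would compare one aggregate number instead.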
…atch

Fix wrong assumption that apply_brain_rules returns a list of rule
objects. It returns a formatted prompt string. Recall scoring now
checks whether expected keywords appear in the rendered text.
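
A minimal sketch of keyword-based scoring over a rendered prompt string, assuming case-insensitive substring matching. The function name `keyword_hit` and the `require_all` switch are hypothetical. The sketch deliberately takes a plain string and does not call apply_brain_rules, whose exact signature is not shown here.

```python
def keyword_hit(rendered_rules: str, expected_keywords: list[str],
                require_all: bool = True) -> bool:
    """Return True if the expected keywords appear in the rendered text.

    Matching is case-insensitive substring search (an assumption).
    An empty keyword list never counts as a hit.
    """
    text = rendered_rules.casefold()
    found = [kw.casefold() in text for kw in expected_keywords]
    if not found:
        return False
    return all(found) if require_all else any(found)
```

Substring matching is simple but loose: a keyword such as "tab" would also match "table", so longer or more distinctive keywords give a cleaner signal.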

Smoke (10 sessions): still 0% recall, which confirms the kernel does not
graduate rules from a single correction. Multiple reinforcements are
needed before the lessons file populates. This is by design (FSRS
scoring) and is the real launch question: how many reinforcements
until rules become callable?
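
One way to approach the launch question is to repeat the same correction and probe after each round until recall succeeds. The sketch below is a hypothetical harness, not part of this PR: `reinforce` and `probe` are caller-supplied callables, and `ToyBrain` is a stand-in that only pretends a rule graduates after a fixed number of corrections. It says nothing about how the real kernel or FSRS scoring behaves.

```python
from typing import Callable, Optional


def reinforcements_until_recall(reinforce: Callable[[], None],
                                probe: Callable[[], bool],
                                max_rounds: int = 10) -> Optional[int]:
    """Apply the same correction repeatedly, probing after each round.

    Returns the number of reinforcements needed for the first successful
    probe, or None if recall never succeeds within max_rounds.
    """
    for n in range(1, max_rounds + 1):
        reinforce()
        if probe():
            return n
    return None


class ToyBrain:
    """Stand-in only: a rule becomes recallable after `threshold` corrections."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self.count = 0

    def correct(self) -> None:
        self.count += 1

    def recalls(self) -> bool:
        return self.count >= self.threshold
```

With the stand-in, `reinforcements_until_recall(toy.correct, toy.recalls)` for `toy = ToyBrain(3)` returns 3. Against the real system, the two callables would wrap whatever correction and probing calls the benchmark already makes.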

coderabbitai Bot commented May 1, 2026

Caution: Review failed. The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7f14fe36-1123-467a-9c53-a97afa363cf7

📥 Commits

Reviewing files that changed from the base of the PR and between 5f5f87f and 7516950.

📒 Files selected for processing (4)
  • Gradata/.gitignore
  • Gradata/bench/README.md
  • Gradata/bench/__init__.py
  • Gradata/bench/pmr_100.py

📝 Walkthrough
  • New benchmark: Adds PMR-100 (Procedural Memory Retention) benchmarking suite with 100 scripted sessions across 6 correction classes
  • Metrics: Implements recall@1 and recall@3 scoring with per-class breakdowns; CLI invocation via python -m bench.pmr_100 [--quick] [-n N]
  • New public API: Exports Scenario, SessionResult, and BenchResult dataclasses, plus run_benchmark() and main() functions
  • Benchmark structure: Tests correction injection, distractor turns, probing, and keyword-based recall validation against expected outputs
  • Baseline data: Initial 3-session run shows 0% rule extraction and 0% recall (expected behavior per FSRS design)
  • Package setup: Enables bench/ as importable Python package
  • Result persistence: Outputs benchmark results to JSON with timestamp, config, summary stats, and per-session data
  • Documentation: Adds comprehensive README with workflow description, CLI usage, and scenario configuration guidance
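
The result-persistence bullet above can be illustrated with a small sketch. The class names `SessionResult` and `BenchResult` come from the walkthrough, but every field shown here is an assumption chosen to mirror "timestamp, config, summary stats, and per-session data". The actual dataclasses in bench/pmr_100.py may differ.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class SessionResult:
    # All fields are hypothetical placeholders.
    session_id: int
    correction_class: str
    rules_extracted: int
    recall_at_1: bool
    recall_at_3: bool


@dataclass
class BenchResult:
    # Mirrors the four parts named in the walkthrough; layout is assumed.
    timestamp: str
    config: dict
    summary: dict
    sessions: list = field(default_factory=list)


def to_json(result: BenchResult) -> str:
    """Serialize the result; asdict() recurses into nested dataclasses."""
    return json.dumps(asdict(result), indent=2)


def example_result() -> BenchResult:
    """Build a tiny illustrative result (values are made up for the sketch)."""
    session = SessionResult(0, "BEHAVIORAL", 0, False, False)
    return BenchResult(
        timestamp=datetime.now(timezone.utc).isoformat(),
        config={"sessions": 1, "seed": 0},
        summary={"recall@1": 0.0, "recall@3": 0.0},
        sessions=[session],
    )
```

Writing `to_json(example_result())` to a file under bench/results/ would match the ignore rule this PR adds, though the file naming scheme is not stated here.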

Walkthrough

Introduces the PMR-100 "Procedural Memory Retention" benchmark suite. Adds a benchmarking script that evaluates a Brain system's ability to extract and recall procedural rules through correction injection, distractor turns, and recall scoring. Includes documentation, package setup, and CLI interface for running benchmarks with configurable parameters.
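
The session flow described above (inject a correction, run distractor turns, then probe and score) might look roughly like this. It is a sketch under assumptions: the `Scenario` fields, the `send_turn` callable, and `run_session_sketch` are hypothetical and do not reflect the real run_one_session() signature.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class Scenario:
    # Field names are assumptions for illustration.
    correction_class: str
    correction: str
    probe: str
    expected_keywords: list


def run_session_sketch(scenario: Scenario,
                       send_turn: Callable[[str], str],
                       distractor_pool: list,
                       n_distractors: int,
                       rng: random.Random) -> bool:
    """Inject one correction, add distractor turns, then probe for recall."""
    send_turn(scenario.correction)  # 1. correction injection
    k = min(n_distractors, len(distractor_pool))
    for message in rng.sample(distractor_pool, k=k):  # 2. distractor turns
        send_turn(message)
    reply = send_turn(scenario.probe).casefold()  # 3. probe
    # 4. keyword-based validation (case-insensitive, all keywords required)
    return all(kw.casefold() in reply for kw in scenario.expected_keywords)
```

Passing a seeded `random.Random` keeps the distractor selection reproducible across runs, which matters when the same benchmark is tracked on every PR.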

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Package Setup: Gradata/.gitignore, Gradata/bench/__init__.py | Adds ignore rule for bench/results/ directory and marks bench/ as a Python package to enable module execution. |
| Documentation: Gradata/bench/README.md | Documents PMR-100 benchmark workflow, expected outputs, CLI commands, baseline results, and instructions for adding new scenarios. |
| Benchmark Implementation: Gradata/bench/pmr_100.py | Implements complete benchmarking script with dataclasses for scenario, session, and result definitions. Includes run_one_session() to execute individual benchmark sessions with correction injection and distractor turns, run_benchmark() to aggregate results across multiple sessions, and main() CLI entrypoint with configurable parameters (session count, distractor count, seed, quick mode). |
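
A CLI entrypoint with these parameters could be sketched with argparse as below. Only `--quick` and `-n N` appear in the documented invocation. The `--distractors` and `--seed` flag names, all defaults other than the 100-session full run, and the rule that an explicit `-n` takes precedence over `--quick` are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="python -m bench.pmr_100",
        description="PMR-100 procedural memory retention benchmark (sketch)",
    )
    # Documented in the PR: --quick and -n N.
    parser.add_argument("--quick", action="store_true",
                        help="run a reduced number of sessions")
    parser.add_argument("-n", dest="sessions", type=int, default=None,
                        help="number of sessions to run")
    # Hypothetical flag names for the distractor count and seed.
    parser.add_argument("--distractors", type=int, default=3,
                        help="distractor turns per session (assumed flag)")
    parser.add_argument("--seed", type=int, default=0,
                        help="random seed (assumed flag)")
    return parser


def resolve_session_count(args: argparse.Namespace,
                          full: int = 100, quick: int = 10) -> int:
    """Assumed precedence: explicit -n, then --quick, then the full run."""
    if args.sessions is not None:
        return args.sessions
    return quick if args.quick else full
```

For example, `build_parser().parse_args(["--quick"])` resolves to the quick session count under these assumptions, while `["-n", "5"]` resolves to 5 regardless of `--quick`.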

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

feature, docs

@Gradata Gradata merged commit b98d16c into main May 1, 2026
7 of 9 checks passed
@Gradata Gradata deleted the rebase/pmr-100-benchmark branch May 1, 2026 16:12