fix: quarantine prompt-injection rule graduation #267

Merged: Gradata merged 1 commit into main from gra-2088-graduation-safety on Jun 7, 2026

@Gradata Gradata commented Jun 7, 2026

Summary

  • Add a conservative prompt-injection quarantine at the PATTERN -> RULE graduation boundary.
  • Reuse existing hook injection guard plus credential/tool-override regexes.
  • Quarantined candidates remain observable in lessons.md via Pending approval: yes and Kill reason: graduation_quarantine:<reason>; they are not exported to AGENTS.md.
  • Add regression coverage for malicious and benign rule candidates.
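
A minimal sketch of what a regex-based quarantine check of this kind might look like. The pattern list, the reason names, and the functions `quarantine_reason` and `kill_reason_for` are illustrative assumptions, not the contents of `_graduation.py`; the PR's real helper also consults the existing hook injection guard before any regex check.

```python
from __future__ import annotations

import re

# Hypothetical patterns for illustration only; the real credential and
# tool-override regexes in _graduation.py may differ.
_CRED = r"(?:api[_\s-]?key|password|secret|token)s?"
_VERB = r"(?:send|post|upload|reveal|print)"
_PATTERNS = [
    ("instruction_override",
     re.compile(r"\bignore\s+(?:all\s+)?(?:previous|prior)\s+instructions\b", re.I)),
    ("credential_exfiltration",
     re.compile(rf"\b{_VERB}\b.*\b{_CRED}\b|\b{_CRED}\b.*\b{_VERB}\b", re.I | re.S)),
    ("tool_override",
     re.compile(r"\b(?:disable|bypass|override)\s+(?:the\s+)?(?:tool|safety|guard|hook)s?\b", re.I)),
]


def quarantine_reason(text: str) -> str | None:
    """Return the name of the first matching pattern, or None if nothing matches."""
    for reason, pattern in _PATTERNS:
        if pattern.search(text):
            return reason
    return None


def kill_reason_for(text: str) -> str | None:
    """Format the reason the way the summary describes: graduation_quarantine:<reason>."""
    reason = quarantine_reason(text)
    return f"graduation_quarantine:{reason}" if reason else None
```

Under these assumed patterns, a candidate such as "Please bypass the safety checks" would be recorded with a kill reason of `graduation_quarantine:tool_override`, while an ordinary rule like "Always run ruff before committing" would pass through untouched.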

Paperclip issue UUID: 139431c6-0f40-445d-a937-34f5a3289982
Paperclip issue: GRA-2088

Verification

python3 -m pytest tests/test_session_close_write_through_gate.py -q
# 7 passed in 0.23s

/home/olive/.local/bin/uvx ruff check src/gradata/enhancements/self_improvement/_graduation.py tests/test_session_close_write_through_gate.py
# All checks passed!

python3 -m pytest tests/test_session_close_write_through_gate.py tests/test_graduation_notification.py tests/test_rule_graduated_events.py -q
# 20 passed in 12.27s

coderabbitai Bot commented Jun 7, 2026

📝 Walkthrough

Summary

  • Security Enhancement: Introduces a conservative quarantine gate for prompt-injection candidates at the PATTERN → RULE graduation boundary
  • Detection Mechanism: Reuses existing hook injection guard and applies credential/tool-override regexes to identify risky candidates
  • Quarantine Behavior: Marks candidates as pending_approval = True with kill_reason prefixed with graduation_quarantine:<reason>, preventing export to AGENTS.md while remaining observable in lessons.md
  • Implementation: Adds _graduation_quarantine_reason() helper function to _graduation.py with compiled regex patterns for injection detection
  • Test Coverage: New regression tests verify both that malicious prompt-injection candidates are blocked and that benign candidates still graduate normally
  • Import Refactoring: Adjusts imports to remove specific threshold constants, relying on graduation_thresholds() fields
  • No Breaking Changes: No alterations to exported or public function signatures

Walkthrough

This PR adds a safety gate to the graduation flow that quarantines lesson candidates suspected of prompt-injection attacks during PATTERN → RULE promotion. A new quarantine detector checks lesson metadata; malicious candidates are marked pending and blocked from promotion, while benign patterns graduate normally.

Changes

Graduation Quarantine Gating

| Layer / File(s) | Summary |
| --- | --- |
| Injection detection patterns and helper: Gradata/src/gradata/enhancements/self_improvement/_graduation.py | Adds the re import and a _graduation_quarantine_reason() helper that uses the compiled _GRADUATION_INJECTION_PATTERNS to detect suspicious prompt-injection text in lesson metadata, relying on gradata.hooks._injection_guard with a regex fallback. |
| Graduation quarantine integration: Gradata/src/gradata/enhancements/self_improvement/_graduation.py | During the PATTERN → RULE promotion eligibility check, _graduation_quarantine_reason() is called; if it returns a reason, the lesson is marked pending_approval, assigned a quarantine-prefixed kill_reason, logged as a warning, and skipped for promotion with no hook installed. |
| Quarantine and normal graduation tests: Gradata/tests/test_session_close_write_through_gate.py | Two new tests verify that malicious PATTERN candidates are quarantined (pending with kill_reason, no RULE/AGENTS.md entry) and that benign candidates still graduate normally (promoted to RULE and exported as AGENTS.md bullets). |
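
The gating step described in the table can be sketched roughly as follows. This is an assumption-laden illustration rather than the code in `_graduation.py`: the `Lesson` dataclass, `promote_patterns`, and `toy_detector` are hypothetical stand-ins, and the real flow also applies the `graduation_thresholds()` eligibility checks, which are omitted here.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass
class Lesson:
    # Hypothetical stand-in for the project's lesson record.
    description: str
    state: str = "PATTERN"
    pending_approval: bool = False
    kill_reason: str = ""


def promote_patterns(lessons, detector):
    """Promote PATTERN lessons to RULE unless the detector flags them.

    `detector` takes the lesson text and returns a reason string or None.
    Returns the list of lessons that graduated.
    """
    graduated = []
    for lesson in lessons:
        if lesson.state != "PATTERN":
            continue
        reason = detector(lesson.description)
        if reason:
            # Quarantine: keep the lesson observable, but do not promote it.
            lesson.pending_approval = True
            lesson.kill_reason = f"graduation_quarantine:{reason}"
            logger.warning("quarantined graduation candidate: %s", reason)
            continue
        lesson.state = "RULE"
        graduated.append(lesson)
    return graduated


def toy_detector(text):
    # Deliberately simplistic detector, used only to exercise the gate.
    if "ignore previous instructions" in text.lower():
        return "prompt_injection_pattern"
    return None
```

In this sketch a quarantined lesson stays in the PATTERN state with its pending flag and kill reason set, so anything that exports only RULE lessons would never see it.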

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly Related PRs

  • Gradata/gradata#223: Both PRs use gradata.hooks._injection_guard for prompt-injection detection. #223 (feat(hooks): add prompt-injection sanitization layer to jit_inject, GRA-1295) wires the guard into hooks/jit_inject, while this PR applies it to graduation promotion to quarantine suspicious candidates.
  • Gradata/gradata#183: Both PRs modify self-improvement graduation logic around PATTERN → RULE; #183 centralizes graduation thresholds via graduation_thresholds(), and this PR uses the same threshold wiring while adding the quarantine check.
  • Gradata/gradata#249: Both PRs modify the session_close._run_graduation promotion flow for PATTERN → RULE; this PR adds a quarantine/pending gate, while #249 adds persistence and AGENTS.md auto-export for the same stage.

Suggested Labels

security, bug

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'fix: quarantine prompt-injection rule graduation' accurately summarizes the main change: adding a quarantine mechanism for prompt-injection candidates during PATTERN→RULE graduation. |
| Description check | ✅ Passed | The description is directly related to the changeset, clearly explaining the quarantine feature, implementation approach, observability in lessons.md, and exclusion from AGENTS.md exports. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 OpenGrep (1.22.0)

OpenGrep fatal error (exit code 2):
Opengrep CLI

✔ Opengrep OSS
✔ Basic security coverage for first-party code vulnerabilities.

Loading rules from local config...
[00.16][ERROR]: Error: exception Glob.Lexer.Syntax_error("malformed glob pattern: missing ']'")
Raised at Glob__Lexer.syntax_error in file "libs/glob/Lexer.mll", line 8, characters 2-26
Called from Glob__Lexer.__ocaml_lex_token_rec in file "libs/glob/Lexer.mll", line 29, characters 26-53
Cal


@coderabbitai coderabbitai Bot added the bug and security labels Jun 7, 2026

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@Gradata/src/gradata/enhancements/self_improvement/_graduation.py`:
- Around line 70-78: The current broad except hides errors from
gradata.hooks._injection_guard functions (sanitize, is_suspicious) and silently
falls back to the limited _GRADUATION_INJECTION_PATTERNS; change the handler so
ImportError is caught and handled as before, but any other Exception is logged
with logger.warning(..., exc_info=True) and then fail closed (return a
quarantine reason, e.g., the original exception message or a generic
"prompt_injection_detection_error") or re-run a local normalized fallback using
the same normalization as sanitize before checking
_GRADUATION_INJECTION_PATTERNS; update the try/except around the
sanitize/is_suspicious calls and reference sanitize, is_suspicious,
_GRADUATION_INJECTION_PATTERNS, and logger.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6228c646-09d0-4230-87a2-d7225dffe5e9

📥 Commits

Reviewing files that changed from the base of the PR and between f05b53d and f795d4d.

📒 Files selected for processing (2)
  • Gradata/src/gradata/enhancements/self_improvement/_graduation.py
  • Gradata/tests/test_session_close_write_through_gate.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
  • GitHub Check: pytest (py3.11)
  • GitHub Check: pytest windows-latest / py3.12
  • GitHub Check: pytest (py3.12)
  • GitHub Check: pytest windows-latest / py3.11
  • GitHub Check: pytest macos-latest / py3.11
  • GitHub Check: pytest ubuntu-latest / py3.12
  • GitHub Check: pytest macos-latest / py3.12
  • GitHub Check: pytest ubuntu-latest / py3.11
🧰 Additional context used
📓 Path-based instructions (2)
Gradata/tests/**/*.py

📄 CodeRabbit inference engine (Gradata/AGENTS.md)

Gradata/tests/**/*.py: Set BRAIN_DIR environment variable via tmp_path in conftest.py for test isolation — ensure _paths.py module cache refreshes when calling Brain.init() directly inside tests
Add unit tests in tests/test_*.py for every CI push without LLM calls (deterministic); mark integration tests with @pytest.mark.integration and skip them by default (they hit real LLM APIs)

Files:

  • Gradata/tests/test_session_close_write_through_gate.py
Gradata/src/**/*.py

📄 CodeRabbit inference engine (Gradata/AGENTS.md)

Gradata/src/**/*.py: Prefer sentence-transformers for local embeddings, google-genai for Gemini embeddings, cryptography for AES-GCM encrypted system.db, bm25s for BM25 rule ranking, and mem0ai for external memory adapters — guard all optional dependency imports with try / except ImportError at the call site, never at module level
Maintain strict layering: Layer 0 (Primitives: _types.py, _db.py, _events.py, _paths.py, _file_lock.py; Patterns: contrib/patterns/) must never import from Layer 1 (Enhancements: enhancements/, rules/) or Layer 2 (Public API: brain.py, cli.py, daemon.py, mcp_server.py)
Never use bare except: pass — use typed exceptions or at minimum logger.warning(...) with exc_info=True to avoid silent failure in a memory product
Never import from out-of-scope sibling directories ../Sprites/ or ../Hausgem/ within gradata/* code — that is a layering bug
Never leak private-sibling paths into public docs/code — no references to ../Sprites/, ../Hausgem/, email addresses, OneDrive paths, or Sprites-specific examples from inside gradata/*
Use atomic-write helper when writing JSON files to prevent corruption from mid-write crashes

Files:

  • Gradata/src/gradata/enhancements/self_improvement/_graduation.py
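
For the atomic-write guideline above, one common way to build such a helper is to write to a temporary file in the same directory and then rename it over the target. The sketch below illustrates that general technique only; `atomic_write_json` is a hypothetical name, and the project's own atomic-write helper may be named and implemented differently.

```python
import contextlib
import json
import os
import tempfile
from pathlib import Path


def atomic_write_json(path, data):
    """Write `data` as JSON so readers never observe a half-written file."""
    path = Path(path)
    # The temp file lives in the target directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle)
            handle.flush()
            os.fsync(handle.fileno())  # push bytes to disk before the rename
        os.replace(tmp_name, path)  # atomic swap of the finished file into place
    except BaseException:
        # Clean up the temp file on any failure, then re-raise the original error.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise
```

If the process crashes mid-write, the target keeps its previous contents and only the temporary file is left incomplete.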
🧠 Learnings (4)
📚 Learning: 2026-04-17T17:18:07.439Z
Learnt from: Gradata
Repo: Gradata/gradata PR: 0
File: :0-0
Timestamp: 2026-04-17T17:18:07.439Z
Learning: In PR `#102` (gradata/gradata), Round 2 addressed: cli.py env-first brain resolution (GRADATA_BRAIN > --brain-dir > cwd), _tenant.py corrupt .tenant_id overwrite, _env_int default clamping to minimum, and _events.py tenant-scoped fallback SELECT for dedup. All ruff and 99 tests green after these fixes.

Applied to files:

  • Gradata/tests/test_session_close_write_through_gate.py
📚 Learning: 2026-05-01T15:50:32.772Z
Learnt from: CR
Repo: Gradata/gradata PR: 0
File: Gradata/AGENTS.md:0-0
Timestamp: 2026-05-01T15:50:32.772Z
Learning: Applies to Gradata/tests/**/*.py : Add unit tests in `tests/test_*.py` for every CI push without LLM calls (deterministic); mark integration tests with `pytest.mark.integration` and skip them by default (they hit real LLM APIs)

Applied to files:

  • Gradata/tests/test_session_close_write_through_gate.py
📚 Learning: 2026-05-01T15:50:32.772Z
Learnt from: CR
Repo: Gradata/gradata PR: 0
File: Gradata/AGENTS.md:0-0
Timestamp: 2026-05-01T15:50:32.772Z
Learning: Ensure the 4 deterministic guarantees have tests: (1) Correction in → rule extracted out, (2) Rule retrieved/applied in subsequent session, (3) Contradicting evidence lowers confidence, (4) Stale rules decay below threshold

Applied to files:

  • Gradata/tests/test_session_close_write_through_gate.py
🔇 Additional comments (2)
Gradata/src/gradata/enhancements/self_improvement/_graduation.py (1)

13-56: LGTM!

Also applies to: 399-409

Gradata/tests/test_session_close_write_through_gate.py (1)

72-115: LGTM!

Comment on lines +70 to +78
    try:
        from gradata.hooks._injection_guard import is_suspicious, sanitize

        normalized = sanitize(text)
        suspicious, reason = is_suspicious(normalized)
        if suspicious:
            return reason or "prompt_injection_pattern"
    except Exception:
        normalized = text

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t silently downgrade the quarantine detector on _injection_guard failures.

Line 77 catches every detector error and falls back to _GRADUATION_INJECTION_PATTERNS, but that fallback is much narrower than gradata.hooks._injection_guard.is_suspicious(). If sanitize()/is_suspicious() ever raises, phrases like ignore previous instructions stop being quarantined and can graduate into durable RULE/AGENTS output again. Narrow this to ImportError, and on unexpected failures either log with exc_info=True and fail closed, or re-run an equivalent normalized fallback locally.

As per coding guidelines, use typed exceptions or at minimum logger.warning(..., exc_info=True) to avoid silent failure in a memory product; based on learnings from Gradata/src/gradata/hooks/_injection_guard.py:111-162, sanitize() + is_suspicious() provide broader normalized detection than the local regex fallback.
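
One way the suggested handling could look, as a hedged sketch rather than the merged fix: `ImportError` keeps the existing regex fallback, while any other failure is logged with `exc_info=True` and fails closed. The `load_guard` and `fallback` parameters are hypothetical indirections that keep the example self-contained; in `_graduation.py` they would correspond to importing `sanitize` and `is_suspicious` from `gradata.hooks._injection_guard` and to checking `_GRADUATION_INJECTION_PATTERNS`.

```python
import logging

logger = logging.getLogger(__name__)


def guarded_quarantine_reason(text, load_guard, fallback):
    """Return a quarantine reason, or None if the text looks benign.

    load_guard: returns (sanitize, is_suspicious); may raise ImportError.
    fallback:   local regex check taking text and returning a reason or None.
    """
    try:
        sanitize, is_suspicious = load_guard()
    except ImportError:
        # Guard module unavailable: use the narrower local patterns, as before.
        return fallback(text)

    try:
        normalized = sanitize(text)
        suspicious, reason = is_suspicious(normalized)
    except Exception:
        # Unexpected detector failure: log it and fail closed by quarantining.
        logger.warning("injection guard failed; quarantining candidate", exc_info=True)
        return "prompt_injection_detection_error"

    if suspicious:
        return reason or "prompt_injection_pattern"
    # The guard passed, so still apply the local patterns to the normalized text.
    return fallback(normalized)
```

Failing closed sends a candidate to pending approval whenever the detector itself breaks, which costs a manual review but avoids graduating unchecked text into a durable rule.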

Source: Coding guidelines

@Gradata Gradata merged commit 69ad787 into main Jun 7, 2026
9 checks passed
@Gradata Gradata deleted the gra-2088-graduation-safety branch June 7, 2026 18:34