Adversarial Enterprise Guard for Intrinsic Security
An open-source research project to create a dataset and training pipeline that improves open LLMs' resistance to prompt-based attacks while minimizing over-refusal.
Current LLM defenses leave a critical gap:
| Defense Layer | Protection | Issue |
|---|---|---|
| Base model | ~0% | Will do anything |
| Instruct/RLHF | ~60% | Basic safety training |
| Flagship (Claude/GPT) | ~75% | Must stay usable for everyone |
| Third-party guardrails | ~95% | 20%+ false positive rate |
Enterprises need 85-90% protection without an accompanying explosion in false positives.
TRYLOCK provides a three-layer defense stack:
```
┌───────────────────────────────────────────────────────────────────────┐
│                        TRYLOCK v2 DEFENSE STACK                       │
├───────────────────────────────────────────────────────────────────────┤
│                                                                       │
│  Layer 1: KNOWLEDGE (LoRA + DPO)                                      │
│  └── Teaches model what attacks look like through preference          │
│      learning on multi-turn trajectories                              │
│                                                                       │
│  Layer 2: INSTINCT (Representation Engineering)                       │
│  └── Dampens "attack compliance" direction with tunable α             │
│      coefficient (0.0 = research, 1.0 = balanced, 2.5 = lockdown)     │
│                                                                       │
│  Layer 3: OVERSIGHT (Security Sidecar)                                │
│  └── Parallel 8B classifier scores conversation state                 │
│      (SAFE | WARN | ATTACK) invisible to attacker                     │
│                                                                       │
└───────────────────────────────────────────────────────────────────────┘
```
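To make Layer 2 concrete, here is a minimal sketch of activation steering on a single hidden-state vector, assuming a pre-computed "attack compliance" direction (in TRYLOCK this would come from the published RepE steering vectors and be applied per layer inside the model; the NumPy version below, including the function name `apply_steering`, is purely illustrative):

```python
import numpy as np

def apply_steering(hidden: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Dampen the component of a hidden state along the steering direction.

    alpha is the tunable coefficient from the stack diagram:
    0.0 leaves the activation untouched, 1.0 removes the component
    along the direction entirely, and larger values over-correct.
    """
    unit = direction / np.linalg.norm(direction)
    projection = float(hidden @ unit)          # scalar component along the direction
    return hidden - alpha * projection * unit  # subtract alpha times that component

# Toy example: the component along [1, 0] is fully removed at alpha=1.0
h = np.array([3.0, 4.0])
v = np.array([1.0, 0.0])
steered = apply_steering(h, v, alpha=1.0)  # -> [0.0, 4.0]
```

In the real stack this subtraction happens on the residual stream at selected layers during generation, not on a standalone vector, but the alpha trade-off is the same.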
The TRYLOCK defense system is fully trained and available on HuggingFace:
- DPO Adapter: scthornton/trylock-mistral-7b-dpo
- RepE Steering Vectors: scthornton/trylock-repe-vectors
- Sidecar Classifier: scthornton/trylock-sidecar-classifier
- Public Sample: scthornton/trylock-demo-dataset (48 diverse examples)
- Full Training Set: Private (2,939 preference pairs - available upon request for academic research)
See paper/TRYLOCK_Canonical.md for the complete research paper documenting methodology, experiments, and results.
Performance: TRYLOCK achieves 82.8% reduction in attack success rate (from 100% to 17.2%) while maintaining low over-refusal (12.6%) on benign queries.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load DPO-trained model
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
model = PeftModel.from_pretrained(base, "scthornton/trylock-mistral-7b-dpo")

# See scripts/eval_cpu_only.py for full three-layer implementation
```

```bash
git clone https://github.com/scthornton/trylock.git
cd trylock
pip install -e ".[dev]"
```

```bash
# Generate 100 attack trajectories (mock mode - no API)
python scripts/generate_data.py generate --count 100 --mock

# Generate with real API calls
export ANTHROPIC_API_KEY="your-key"
python scripts/generate_data.py generate --count 100

# Generate benign hard negatives
python scripts/generate_data.py benign --count 50

# Validate generated data
python scripts/generate_data.py validate data/tier1_open/attacks/generated.jsonl
```

```python
from generation import TRYLOCKPipeline, PipelineConfig

# Configure pipeline
config = PipelineConfig(
    red_bot_model="claude-sonnet-4-20250514",
    judge_model="claude-sonnet-4-20250514",
    mutations_per_trajectory=3,
)

# Generate trajectories
pipeline = TRYLOCKPipeline(config)
trajectories = pipeline.run(count=100)

# Save output
pipeline.save(trajectories, "data/output.jsonl")
```

Each trajectory contains:
```json
{
  "id": "trylock-mt-00421",
  "version": "2.0",
  "attack_metadata": {
    "family": "multi_turn_manipulation",
    "type": "crescendo",
    "wrappers": ["adversarial_poetry"],
    "manipulation_tactics": ["context_anchoring"],
    "difficulty": "hard"
  },
  "conversation": [
    {
      "turn": 1,
      "role": "user",
      "content": "I'm researching security...",
      "turn_labels": {
        "attack_stage": "setup",
        "attack_intent_score": 0,
        "expected_action": "continue"
      }
    }
  ],
  "pivot_turn_index": 3,
  "responses": {
    "chosen": {"content": "...", "response_type": "soft_decline"},
    "rejected_unsafe": {"content": "...", "failure_mode": "..."},
    "rejected_overblock": {"content": "...", "failure_mode": "..."}
  }
}
```

TRYLOCK covers five attack families:
| Family | Description | Priority |
|---|---|---|
| Multi-turn Manipulation | Crescendo, context anchoring, boundary softening | HIGH |
| Indirect Injection | RAG poisoning, tool output injection | HIGH |
| Obfuscation Wrappers | Poetry, roleplay, encoding, translation | MEDIUM |
| Direct Injection | Classic jailbreaks, system prompt extraction | MEDIUM |
| Tool/Agent Abuse | Instruction hierarchy attacks, hidden goals | EMERGING |
See taxonomy/v2.0/attack_families.yaml for the full taxonomy.
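Each trajectory's `responses` block maps naturally onto DPO preference pairs: `chosen` is preferred over `rejected_unsafe` (attack compliance) and over `rejected_overblock` (over-refusal). A sketch of that conversion, assuming the trajectory schema shown above (the output dict layout and the helper name `to_dpo_pairs` are illustrative, not the project's actual training schema):

```python
def to_dpo_pairs(trajectory: dict) -> list[dict]:
    """Convert one trajectory into (prompt, chosen, rejected) preference pairs.

    The prompt is the flattened conversation; each rejected response
    becomes a separate pair against the same chosen response.
    """
    prompt = "\n".join(
        f'{turn["role"]}: {turn["content"]}'
        for turn in trajectory["conversation"]
    )
    chosen = trajectory["responses"]["chosen"]["content"]
    pairs = []
    for key in ("rejected_unsafe", "rejected_overblock"):
        if key in trajectory["responses"]:
            pairs.append({
                "prompt": prompt,
                "chosen": chosen,
                "rejected": trajectory["responses"][key]["content"],
            })
    return pairs
```

Training on both rejected variants is what lets a single preference dataset push attack resistance up and over-refusal down at the same time.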
```
trylock/
├── taxonomy/v2.0/              # Attack classification system
│   ├── attack_families.yaml
│   ├── manipulation_tactics.yaml
│   ├── attack_stages.yaml
│   └── response_types.yaml
│
├── data/
│   ├── schema/                 # JSON schema + validator
│   ├── tier1_open/             # Public dataset (Apache 2.0)
│   ├── tier2_gated/            # Research agreement required
│   └── tier3_private/          # Internal only
│
├── generation/                 # Data generation pipeline
│   ├── red_bot.py              # Attack generator
│   ├── victim_bot.py           # Target model simulator
│   ├── judge_bot.py            # Labeler + response generator
│   ├── mutation_engine.py      # Create attack variants
│   ├── activation_capture.py   # RepE training data
│   └── pipeline.py             # Orchestration
│
├── training/                   # Training pipeline (coming soon)
│   ├── sft_warmup.py
│   ├── dpo_preference.py
│   ├── repe_training.py
│   └── sidecar_classifier.py
│
├── eval/                       # Evaluation framework (coming soon)
│   ├── harness.py
│   ├── metrics.py
│   └── benchmarks/
│
└── scripts/                    # CLI tools
    └── generate_data.py
```
| Metric | Baseline | Target |
|---|---|---|
| Single-turn ASR | ~25% | ≤10% |
| Multi-turn ASR | ~35% | ≤15% |
| Indirect/RAG ASR | ~40% | ≤20% |
| Novel wrapper ASR | ~60% | ≤30% |
| Over-refusal rate | - | ≤ +2-4% |
| Capability preservation | 100% | ≥95% |
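The two headline metrics in this table are simple rates over labeled evaluation runs. A sketch of how they are computed (the function names are illustrative, not part of the TRYLOCK eval harness):

```python
def attack_success_rate(attack_outcomes: list[bool]) -> float:
    """Fraction of attack trajectories where the model complied.

    Each entry is True if the attack succeeded (model produced the
    disallowed content), False if it was deflected.
    """
    return sum(attack_outcomes) / len(attack_outcomes) if attack_outcomes else 0.0

def over_refusal_rate(benign_outcomes: list[bool]) -> float:
    """Fraction of benign queries the model wrongly refused (True = refused)."""
    return sum(benign_outcomes) / len(benign_outcomes) if benign_outcomes else 0.0

# Example: 2 of 8 attacks succeed -> ASR of 0.25 (25%)
asr = attack_success_rate([True, True] + [False] * 6)  # -> 0.25
```

The targets above then read as: push ASR per attack family below the target column while keeping the over-refusal rate within a few points of the baseline model's.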
- SecAlign: arXiv:2410.05451
- MTJ-Bench: arXiv:2508.06755
- PoisonedRAG: USENIX Security 2025
- Adversarial Poetry: arXiv:2511.15304
- LLMail-Inject: arXiv:2506.09956
We welcome contributions! Areas of interest:
- New attack patterns: Especially novel multi-turn and indirect injection
- Benign hard negatives: Cases that look like attacks but aren't
- Evaluation benchmarks: Integration with existing security benchmarks
- Training improvements: Better DPO/RepE configurations
Please see CONTRIBUTING.md for guidelines.
Apache 2.0 with a Responsible Use Addendum. See LICENSE.
The dataset is intended for defensive security research only. Do not use this data to:
- Train models intended to generate attacks
- Bypass security measures on systems you don't own
- Cause harm to individuals or organizations
```bibtex
@software{trylock2025,
  title  = {TRYLOCK: Adversarial Enterprise Guard for Intrinsic Security},
  author = {Thornton, Scott},
  year   = {2025},
  url    = {https://github.com/scthornton/trylock}
}
```

- Project Lead: Scott Thornton
- Organization: perfecXion.ai
- GitHub: @scthornton
- Dataset: huggingface.co/datasets/scthornton/trylock