Adversarial Enterprise Guard for Intrinsic Security
An open-source research project to create a dataset and training pipeline that improves open LLMs' resistance to prompt-based attacks while minimizing over-refusal.
Current LLM defenses leave a critical gap:
| Defense Layer | Protection | Issue |
|---|---|---|
| Base model | ~0% | Will do anything |
| Instruct/RLHF | ~60% | Basic safety training |
| Flagship (Claude/GPT) | ~75% | Must stay usable for everyone |
| Third-party guardrails | ~95% | 20%+ false positive rate |
Enterprises need 85-90% protection without an accompanying explosion in false positives.
TRYLOCK provides a three-layer defense stack:
```
┌───────────────────────────────────────────────────────────────────────┐
│                        TRYLOCK v2 DEFENSE STACK                       │
├───────────────────────────────────────────────────────────────────────┤
│                                                                       │
│  Layer 1: KNOWLEDGE (LoRA + DPO)                                      │
│  └── Teaches model what attacks look like through preference          │
│      learning on multi-turn trajectories                              │
│                                                                       │
│  Layer 2: INSTINCT (Representation Engineering)                       │
│  └── Dampens "attack compliance" direction with tunable α             │
│      coefficient (0.0 = research, 1.0 = balanced, 2.5 = lockdown)     │
│                                                                       │
│  Layer 3: OVERSIGHT (Security Sidecar)                                │
│  └── Parallel 8B classifier scores conversation state                 │
│      (SAFE | WARN | ATTACK) invisible to attacker                     │
│                                                                       │
└───────────────────────────────────────────────────────────────────────┘
```
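To make Layer 2 concrete, here is a minimal sketch of activation steering on a single hidden-state vector, assuming a pre-computed "attack compliance" direction (in TRYLOCK this would come from the published RepE steering vectors and be applied per layer inside the model; the NumPy version below, including the function name `apply_steering`, is purely illustrative):

```python
import numpy as np

def apply_steering(hidden: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Dampen the component of a hidden state along the steering direction.

    alpha is the tunable coefficient from the stack diagram:
    0.0 leaves the activation untouched, 1.0 removes the component
    along the direction entirely, and larger values over-correct.
    """
    unit = direction / np.linalg.norm(direction)
    projection = float(hidden @ unit)          # scalar component along the direction
    return hidden - alpha * projection * unit  # subtract alpha times that component

# Toy example: the component along [1, 0] is fully removed at alpha=1.0
h = np.array([3.0, 4.0])
v = np.array([1.0, 0.0])
steered = apply_steering(h, v, alpha=1.0)  # -> [0.0, 4.0]
```

In the real stack this subtraction happens on the residual stream at selected layers during generation, not on a standalone vector, but the alpha trade-off is the same.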
The TRYLOCK defense system is fully trained and available on HuggingFace:
- DPO Adapter: scthornton/trylock-mistral-7b-dpo
- RepE Steering Vectors: scthornton/trylock-repe-vectors
- Sidecar Classifier: scthornton/trylock-sidecar-classifier
- Public Sample: scthornton/trylock-demo-dataset (48 diverse examples)
- Full Training Set: Private (2,939 preference pairs - available upon request for academic research)
See paper/TRYLOCK_Canonical.md for the complete research paper documenting methodology, experiments, and results.
Performance: TRYLOCK achieves 82.8% reduction in attack success rate (from 100% to 17.2%) while maintaining low over-refusal (12.6%) on benign queries.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load DPO-trained model
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
model = PeftModel.from_pretrained(base, "scthornton/trylock-mistral-7b-dpo")

# See scripts/eval_cpu_only.py for full three-layer implementation
```

```bash
git clone https://github.com/scthornton/trylock.git
cd trylock
pip install -e ".[dev]"
```

```bash
# Generate 100 attack trajectories (mock mode - no API)
python scripts/generate_data.py generate --count 100 --mock

# Generate with real API calls
export ANTHROPIC_API_KEY="your-key"
python scripts/generate_data.py generate --count 100

# Generate benign hard negatives
python scripts/generate_data.py benign --count 50

# Validate generated data
python scripts/generate_data.py validate data/tier1_open/attacks/generated.jsonl
```

```python
from generation import TRYLOCKPipeline, PipelineConfig

# Configure pipeline
config = PipelineConfig(
    red_bot_model="claude-sonnet-4-20250514",
    judge_model="claude-sonnet-4-20250514",
    mutations_per_trajectory=3,
)

# Generate trajectories
pipeline = TRYLOCKPipeline(config)
trajectories = pipeline.run(count=100)

# Save output
pipeline.save(trajectories, "data/output.jsonl")
```

Each trajectory contains:
```json
{
  "id": "trylock-mt-00421",
  "version": "2.0",
  "attack_metadata": {
    "family": "multi_turn_manipulation",
    "type": "crescendo",
    "wrappers": ["adversarial_poetry"],
    "manipulation_tactics": ["context_anchoring"],
    "difficulty": "hard"
  },
  "conversation": [
    {
      "turn": 1,
      "role": "user",
      "content": "I'm researching security...",
      "turn_labels": {
        "attack_stage": "setup",
        "attack_intent_score": 0,
        "expected_action": "continue"
      }
    }
  ],
  "pivot_turn_index": 3,
  "responses": {
    "chosen": {"content": "...", "response_type": "soft_decline"},
    "rejected_unsafe": {"content": "...", "failure_mode": "..."},
    "rejected_overblock": {"content": "...", "failure_mode": "..."}
  }
}
```

TRYLOCK covers five attack families:
| Family | Description | Priority |
|---|---|---|
| Multi-turn Manipulation | Crescendo, context anchoring, boundary softening | HIGH |
| Indirect Injection | RAG poisoning, tool output injection | HIGH |
| Obfuscation Wrappers | Poetry, roleplay, encoding, translation | MEDIUM |
| Direct Injection | Classic jailbreaks, system prompt extraction | MEDIUM |
| Tool/Agent Abuse | Instruction hierarchy attacks, hidden goals | EMERGING |
See taxonomy/v2.0/attack_families.yaml for the full taxonomy.
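Each trajectory's `responses` block maps naturally onto DPO preference pairs: `chosen` is preferred over `rejected_unsafe` (attack compliance) and over `rejected_overblock` (over-refusal). A sketch of that conversion, assuming the trajectory schema shown above (the output dict layout and the helper name `to_dpo_pairs` are illustrative, not the project's actual training schema):

```python
def to_dpo_pairs(trajectory: dict) -> list[dict]:
    """Convert one trajectory into (prompt, chosen, rejected) preference pairs.

    The prompt is the flattened conversation; each rejected response
    becomes a separate pair against the same chosen response.
    """
    prompt = "\n".join(
        f'{turn["role"]}: {turn["content"]}'
        for turn in trajectory["conversation"]
    )
    chosen = trajectory["responses"]["chosen"]["content"]
    pairs = []
    for key in ("rejected_unsafe", "rejected_overblock"):
        if key in trajectory["responses"]:
            pairs.append({
                "prompt": prompt,
                "chosen": chosen,
                "rejected": trajectory["responses"][key]["content"],
            })
    return pairs
```

Training on both rejected variants is what lets a single preference dataset push attack resistance up and over-refusal down at the same time.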
```
trylock/
├── taxonomy/v2.0/              # Attack classification system
│   ├── attack_families.yaml
│   ├── manipulation_tactics.yaml
│   ├── attack_stages.yaml
│   └── response_types.yaml
│
├── data/
│   ├── schema/                 # JSON schema + validator
│   ├── tier1_open/             # Public dataset (Apache 2.0)
│   ├── tier2_gated/            # Research agreement required
│   └── tier3_private/          # Internal only
│
├── generation/                 # Data generation pipeline
│   ├── red_bot.py              # Attack generator
│   ├── victim_bot.py           # Target model simulator
│   ├── judge_bot.py            # Labeler + response generator
│   ├── mutation_engine.py      # Create attack variants
│   ├── activation_capture.py   # RepE training data
│   └── pipeline.py             # Orchestration
│
├── training/                   # Training pipeline (coming soon)
│   ├── sft_warmup.py
│   ├── dpo_preference.py
│   ├── repe_training.py
│   └── sidecar_classifier.py
│
├── eval/                       # Evaluation framework (coming soon)
│   ├── harness.py
│   ├── metrics.py
│   └── benchmarks/
│
└── scripts/                    # CLI tools
    └── generate_data.py
```
| Metric | Baseline | Target |
|---|---|---|
| Single-turn ASR | ~25% | ≤10% |
| Multi-turn ASR | ~35% | ≤15% |
| Indirect/RAG ASR | ~40% | ≤20% |
| Novel wrapper ASR | ~60% | ≤30% |
| Over-refusal rate | - | ≤ +2-4% |
| Capability preservation | 100% | ≥95% |
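The two headline metrics in this table are simple rates over labeled evaluation runs. A sketch of how they are computed (the function names are illustrative, not part of the TRYLOCK eval harness):

```python
def attack_success_rate(attack_outcomes: list[bool]) -> float:
    """Fraction of attack trajectories where the model complied.

    Each entry is True if the attack succeeded (model produced the
    disallowed content), False if it was deflected.
    """
    return sum(attack_outcomes) / len(attack_outcomes) if attack_outcomes else 0.0

def over_refusal_rate(benign_outcomes: list[bool]) -> float:
    """Fraction of benign queries the model wrongly refused (True = refused)."""
    return sum(benign_outcomes) / len(benign_outcomes) if benign_outcomes else 0.0

# Example: 2 of 8 attacks succeed -> ASR of 0.25 (25%)
asr = attack_success_rate([True, True] + [False] * 6)  # -> 0.25
```

The targets above then read as: push ASR per attack family below the target column while keeping the over-refusal rate within a few points of the baseline model's.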
- SecAlign: arXiv:2410.05451
- MTJ-Bench: arXiv:2508.06755
- PoisonedRAG: USENIX Security 2025
- Adversarial Poetry: arXiv:2511.15304
- LLMail-Inject: arXiv:2506.09956
We welcome contributions! Areas of interest:
- New attack patterns: Especially novel multi-turn and indirect injection
- Benign hard negatives: Cases that look like attacks but aren't
- Evaluation benchmarks: Integration with existing security benchmarks
- Training improvements: Better DPO/RepE configurations
Please see CONTRIBUTING.md for guidelines.
Apache 2.0 with a Responsible Use Addendum. See LICENSE.
The dataset is intended for defensive security research only. Do not use this data to:
- Train models intended to generate attacks
- Bypass security measures on systems you don't own
- Cause harm to individuals or organizations
```bibtex
@software{trylock2025,
  title  = {TRYLOCK: Adversarial Enterprise Guard for Intrinsic Security},
  author = {Thornton, Scott},
  year   = {2025},
  url    = {https://github.com/scthornton/trylock}
}
```

- Project Lead: Scott Thornton
- Organization: perfecXion.ai
- GitHub: @scthornton
- Dataset: huggingface.co/datasets/scthornton/trylock