python3 -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install --upgrade pip
pip install -r requirements.txtPython ≥ 3.11 recommended. The pinned requirements.txt records the exact package versions used to produce all results.
For gated models (Llama-3.1-8B, Mistral-7B-Instruct-v0.3) accept the licence on Hugging Face, then:
export HF_TOKEN=<your_hf_token>Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct require no token.
cd experiments
python run_ie_proof.py --allRuns the pre-registered L0-clean tagged-vault minimal pair plus the phrasing-neutral control on all four pre-registered open-weight models and prints the statistical report matching Table 1 and Table 3 of the paper. Walltime: roughly 4–6 hours on a single GPU or Apple M-series Mac.
Partial runs:
python run_ie_proof.py --model llama-3.1-8b # single model
python run_ie_proof.py --report-only # re-aggregate already-written outputsEvery JSON in results/ was produced by the final experiment runs. To regenerate statistics without re-running models:
cd experiments && python run_ie_proof.py --report-only| Model | HF identifier | Token? |
|---|---|---|
| Qwen2.5-1.5B-Instruct | Qwen/Qwen2.5-1.5B-Instruct |
No |
| Qwen2.5-7B-Instruct | Qwen/Qwen2.5-7B-Instruct |
No |
| Llama-3.1-8B-Instruct | meta-llama/Meta-Llama-3.1-8B-Instruct |
Yes |
| Mistral-7B-Instruct-v0.3 | mistralai/Mistral-7B-Instruct-v0.3 |
Yes |
All open-weight runs use torch.bfloat16. Hardware requirements per model are roughly 6 GB (Qwen-1.5B), 18 GB (Qwen-7B / Mistral-7B), and 20 GB (Llama-8B). CPU-only is possible but slow.
experiments/
run_ie_proof.py One-command entry point (§4.1)
run_l0_clean.py Pre-registered L0 tagged-vault minimal pair (§4.1)
run_l0_neutral.py Phrasing-neutral control — system-prompt axis (§4.1)
run_l0_sysmerge.py System-prompt → user-turn merge ablation (§4.1)
run_l0_policy.py Symmetric data-minimisation defence script
(not reported in paper; supplied for follow-up)
run_bootstrap_ci.py Bootstrap CIs + Holm correction (§4.1)
run_twosided_check.py Two-sided robustness check (App. B)
run_orthogonal_v2.py 2×2×2 orthogonal ablation (§4.3)
run_matched_context.py Matched-context identity isolation (App. H)
run_dissociation.py IIV five-test battery (§4.4 / App. C)
run_mechanistic.py Extended mechanistic analysis (§4.4 / App. C)
run_iiv_framing.py IIV framing-gradient bridge (App. D)
run_persona_vectors.py Persona-vectors baseline (App. F)
run_behavioral.py Cross-model A2A + framing-level / implicit / persona (App. A)
run_a2a_llama.py A2A protocol replication (App. A)
run_risk_normalized.py Risk-normalised frontier effect sizes (§4.5 / App. E)
validate_detector.py Leakage-detector validation (App. J)
scenarios_l0_clean.py 51 tagged-vault test scenarios + 20 calibration scenarios
scenarios_l0_neutral.py Scenarios for phrasing-neutral control
scenarios_neurips.py 40 legacy scenarios (A2A, framing gradient, persona)
scenarios_orthogonal.py 100 held-out scenarios for 2×2×2 ablation
config.py Model registry + shared utilities
results/
l0_clean/ PRIMARY — pre-registered tagged-vault minimal pair
PREREGISTRATION.json Pre-registration record (SHA-256 in paper)
calibration/ Calibration run (Llama-3.1-8B, 20 scenarios)
llama-3.1-8b/ Per-model raw draws (l0_clean_raw.json)
mistral-7b/
qwen2.5-1.5b/
qwen2.5-7b/
l0_clean_report.json Aggregated per-model statistics
l0_neutral/ Phrasing-neutral control (§4.1 / App. G)
<model>/l0_neutral_raw.json
l0_sysmerge/ System-prompt → user-turn merge ablation (§4.1)
<model>/l0_sysmerge_raw.json
frontier_iwpe/ Frontier-model replication (§4.5 / App. E)
benchmark_raw.json Raw per-trial API responses
frontier_replication_stats.json
stats/
l0_bootstrap_ci.json Bootstrap CIs + Holm correction
<model>/ Per-model legacy outputs for appendix analyses
behavioral_raw.json Cross-model A2A (App. A)
stats.json
behavioral_framing_levels_raw.json Framing gradient (App. B)
stats_framing_levels.json
behavioral_implicit_raw.json Implicit MCP framing (App. B)
stats_implicit.json
behavioral_orthogonal_v2_raw.json 2×2×2 ablation (§4.3 / App. I)
stats_orthogonal_v2.json
behavioral_matched_context_raw.json Matched-context (App. H)
stats_matched_context.json
behavioral_persona_control_raw.json Persona specificity (App. F)
stats_persona_control.json
dissociation_safety.json IIV five-test battery (§4.4 / App. C)
iiv_framing_gradient.json IIV framing gradient (App. D)
persona_vectors.json Persona vectors (App. F)
results.json Mechanistic analysis raw
a2a_behavioral_raw.json A2A replication (App. A)
a2a_stats.json
detector_validation.json Detector validation (App. J)
detector_validation_summary.txt
requirements.txt
LICENSE
| Script | Paper | Output |
|---|---|---|
run_l0_clean.py --calibrate --model llama-3.1-8b |
§4.1 calibration | results/l0_clean/PREREGISTRATION.json |
run_l0_clean.py --run --model <tag> |
§4.1 Table 1 | results/l0_clean/<tag>/l0_clean_raw.json |
run_l0_neutral.py --model <tag> |
§4.1 / App. G | results/l0_neutral/<tag>/l0_neutral_raw.json |
run_l0_sysmerge.py --model <tag> |
§4.1 sysmerge ablation | results/l0_sysmerge/<tag>/l0_sysmerge_raw.json |
run_bootstrap_ci.py |
§4.1 bootstrap CIs + Holm | results/stats/l0_bootstrap_ci.json |
run_twosided_check.py |
App. B two-sided check | results/stats/l0_twosided.json |
run_ie_proof.py --all |
Single entry point | all of the above |
| Script | Paper | Output |
|---|---|---|
run_orthogonal_v2.py |
§4.3 2×2×2 ablation | results/<tag>/behavioral_orthogonal_v2_raw.json, stats_orthogonal_v2.json |
run_matched_context.py |
App. H matched-context | results/<tag>/behavioral_matched_context_raw.json, stats_matched_context.json |
| Script | Paper | Output |
|---|---|---|
run_dissociation.py --model <tag> |
§4.4 / App. C IIV T1–T5 | results/<tag>/dissociation_safety.json |
run_mechanistic.py --model <tag> |
§4.4 / App. C extended | results/<tag>/results.json |
run_iiv_framing.py --model <tag> |
App. D IIV framing gradient | results/<tag>/iiv_framing_gradient.json |
| Script | Paper | Output |
|---|---|---|
run_risk_normalized.py |
§4.5 / App. E | results/stats/risk_normalized.json (reads frontier_iwpe/benchmark_raw.json) |
| Script | Paper | Output |
|---|---|---|
run_behavioral.py |
App. A cross-model A2A | results/<tag>/behavioral_raw.json, stats.json |
run_a2a_llama.py |
App. A A2A replication | results/<tag>/a2a_behavioral_raw.json, a2a_stats.json |
run_behavioral.py --mode framing_levels |
App. B framing gradient | results/<tag>/stats_framing_levels.json |
run_behavioral.py --mode implicit |
App. B implicit MCP | results/<tag>/stats_implicit.json |
run_behavioral.py --mode persona_control |
App. F persona specificity | results/<tag>/stats_persona_control.json |
run_persona_vectors.py |
App. F persona vectors | results/<tag>/persona_vectors.json |
validate_detector.py |
App. J detector validation | results/detector_validation.json |
Pre-registered tagged-vault minimal pair. 51 test scenarios, headroom window H ∈ (0.10, 0.90). All runs at T=0, 5 seeds × 5 paraphrase pairs per scenario per model.
| Model | n | Ā−H̄ | Cohen's d | 95% CI | Wilcoxon p |
|---|---|---|---|---|---|
| Llama-3.1-8B | 46 | +0.110 | +0.76 | [+0.46, +1.15] | 2.0×10⁻⁵ |
| Mistral-7B | 42 | +0.208 | +1.32 | [+0.99, +1.85] | 5.0×10⁻⁸ |
| Qwen2.5-1.5B | 50 | +0.226 | +1.41 | [+1.07, +1.89] | 3.4×10⁻⁹ |
| Qwen2.5-7B | 40 | +0.010 | +0.05 | [−0.28, +0.37] | 0.42 |
| Stouffer (4 models) | z = 7.70 | p = 6.7×10⁻¹⁵ |
Identical user request on both sides; system prompt identity only. Llama-3.1-8B: d = 0.93, p = 7.2×10⁻⁷. Pooled Stouffer z = 3.81, p = 6.8×10⁻⁵.
Llama-3.1-8B: d = 1.41 (n=44). Mistral-7B: d = 1.32 (n=42). Stouffer z = 7.56, p = 2.0×10⁻¹⁴. The dedicated system-prompt channel is not required.
Interlocutor main effect survives Holm correction on Qwen2.5-1.5B (F=8.40, p_corr=0.023) and Llama-3.1-8B (F=12.05, p_corr=0.003). Interlocutor × Extraction interaction null on all three models (p_corr=1.0): identity is additive with extraction directives.
| Model | Human | Agent | Δ (pp) | d | Wilcoxon p |
|---|---|---|---|---|---|
| Claude-3.5-Sonnet | 84.7% | 93.5% | +8.78 | 0.365 | 7×10⁻⁷ |
| GPT-4o | 87.2% | 93.0% | +5.86 | 0.243 | 7×10⁻⁴ |
| Llama-3.3-70B | 68.7% | 75.5% | +6.76 | 0.202 | 5×10⁻³ |
| Mistral-Large | 92.0% | 94.5% | +2.50 | 0.133 | 4×10⁻² |
Risk-normalised (Δ_norm): Claude +0.66, GPT-4o +0.57, Llama-3.3-70B +0.43, Mistral-Large +0.64. Agent framing closes 43–66% of residual safety headroom.
IIV linear separability accuracy: 1.000 on all four ≤8B models. Refusal orthogonality |cos| ∈ [0.105, 0.202]. Each model passes 4/5 IIV dissociation tests.
Field-level F₁ = 0.993 on 87 adversarial cases. Cohen's κ = 0.991 on 792 real model outputs.
results/l0_clean/PREREGISTRATION.json — committed before any L0-clean test run. Records: calibration model (Llama-3.1-8B), calibration window (0.10, 0.90), 51 test scenarios, seeds {42, 142, 242, 342, 442}, 5 paraphrase pairs, and the statistical plan. SHA-256 of this file is reported in the paper.
This repository is anonymised for double-blind review. It contains no author names, affiliations, contact information, or identifying metadata. Synthetic PII in scenario vaults is fictitious; no real individuals are identifiable.
MIT — see LICENSE.