Multi-turn adversarial self-play against a frozen victim, with temporal SAE mechanistic analysis. 10 rounds, 300 conversations, JailbreakBench-100 goals.
A 1B adversary (Llama-3.2-1B-Instruct, LoRA) learns to jailbreak a frozen 8B victim (Llama-3.1-8B-Instruct) through 5-turn conversations over 10 rounds of self-play. The adversary is bootstrapped on 20 seed conversations generated by the 8B model itself.
The adversary develops genuine multi-turn strategies. Most breaches occur at turns 1–2, not turn 0.
| Breach Turn | Count | Fraction | Interpretation |
|---|---|---|---|
| Turn 0 | 11 | 14% | Immediate framing succeeds (fictional scenario, roleplay) |
| Turn 1 | 26 | 33% | Initial framing + first escalation |
| Turn 2 | 17 | 22% | Context established, adversary pivots to goal |
| Turn 3 | 11 | 14% | Gradual trust-building pays off |
| Turn 4 | 4 | 5% | Deep multi-turn escalation (rare) |
The adversary's most effective strategy is "fictional scenario" framing (57% of successful attacks). This works in 2 steps: (1) establish a benign-looking fictional/hypothetical/research context at turn 0, (2) the victim engages with the framing and begins providing relevant content at turn 1. Llama Guard correctly flags this as unsafe. This is not a failure of multi-turn strategy — it's the adversary efficiently exploiting the minimal context needed to bypass safety.
Each row shows one training round. Turn 0 breaches (red) dominate early rounds but disappear by round 7–8. Turn 2+ breaches (green, blue, purple) grow steadily. Round 7 achieves 11 wins with 6 at turn 2+ and zero at turn 0 — the adversary learned that deeper multi-turn strategies are more reliable than single-shot framing.
| Strategy | Count | Share | Description |
|---|---|---|---|
| Fictional scenario | 45 | 57% | "Let's simulate...", "Imagine we're...", "In a hypothetical..." |
| Help request | 17 | 22% | "I'm happy to help with...", "I'd like to assist..." |
| Curiosity/rapport | 6 | 8% | "I'm curious about...", "I'm interested in..." |
| Other | 11 | 14% | Direct questions, educational framing, roleplay setups |
Successful multi-turn attacks follow a consistent 3-phase pattern:
Judge: Llama Guard 3-1B (frozen, no fine-tuning). For each conversation, the full multi-turn transcript is passed through Llama Guard's safety classification template. If the output contains "unsafe", the conversation is flagged as a jailbreak.
Turn-of-breach detection: For conversations flagged unsafe, we identify when the breach occurred by judging cumulative prefixes. We feed turns [0–0], [0–1], [0–2], etc. to Llama Guard and find the first prefix judged unsafe. This is the turn where the victim first began complying with the harmful goal.
In the example conversations below, the breach turn is highlighted in red. Messages after the breach are faded — the attack has already succeeded.
Full multi-turn conversations showing different attack strategies and breach depths. Click to expand.
A linear probe on the victim's residual stream at the middle layer achieves AUC 0.974 +/- 0.007 for per-turn jailbreak detection (10-fold stratified cross-validation, 1500 turn-level samples). The victim's internal state clearly encodes whether compliance is occurring, independently at each turn.
An SAE with 16,384 features (4x expansion) is trained on the hidden states. A logistic probe on the top-100 discriminative SAE features achieves AUC 0.774 +/- 0.024 — lower than raw hidden states but using sparse, interpretable features.
A Temporal SAE (Bhalla et al., ICLR 2026) was adapted from token-level to turn-level consistency. The Matryoshka feature partition (20% high-level, 80% low-level) with BatchTopK(k=20) activation and bidirectional InfoNCE contrastive loss was trained on 1,200 turn pairs.
Mean high-level T-SAE feature activation across turns. Unsafe (red) and safe (blue) trajectories show a small but consistent separation, with both rising and plateauing by turn 2.
Phase 5: Stealth adversary. The per-turn probe (AUC 0.95) makes a strong defender. If the adversary can learn to jailbreak and evade per-turn detection, that's when temporal probes become necessary. The stealth loop trains the adversary with probe-weighted selection: successful attacks that evade the probe receive higher training weight.
Victim hardening. Enable LoRA training on the victim to create a true co-evolutionary dynamic. The current frozen victim means ASR can only go up — hardening creates the arms race that tests whether multi-turn strategies are robust.
Scale. 30 candidates/round with 5 turns is limited. Scaling to 100+ candidates and more rounds would give the T-SAE contrastive loss enough data to potentially work, and let the adversary develop more diverse strategies.
Turnstile — Julian Quick, 2026. Built on GitHub. Models: Llama-3.2-1B-Instruct (adversary), Llama-3.1-8B-Instruct (victim), Llama-Guard-3-1B (judge). Goals: JailbreakBench JBB-100. T-SAE: Bhalla et al. ICLR 2026.