Turnstile

Multi-turn adversarial self-play against a frozen victim, with temporal SAE mechanistic analysis. 10 rounds, 300 conversations, JailbreakBench-100 goals.

37%

Peak ASR (round 3)

1.7

Mean breach turn

0.95

Per-turn probe AUC

79/300

Total jailbreaks

ASR Across Training Rounds

A 1B adversary (Llama-3.2-1B-Instruct, LoRA) learns to jailbreak a frozen 8B victim (Llama-3.1-8B-Instruct) through 5-turn conversations over 10 rounds of self-play. The adversary is bootstrapped on 20 seed conversations generated by the 8B model itself.

ASR climbs from 13% to 37% by round 3, then oscillates in the 20–37% range. The oscillation is expected without victim hardening — the adversary overfits to patterns that work against the frozen victim, but random JBB goal sampling shifts the distribution each round.

When Do Jailbreaks Happen?

The adversary develops genuine multi-turn strategies. Most breaches occur at turns 1–2, not turn 0.

Breach Turn	Count	Fraction	Interpretation
Turn 0	11	14%	Immediate framing succeeds (fictional scenario, roleplay)
Turn 1	26	33%	Initial framing + first escalation
Turn 2	17	22%	Context established, adversary pivots to goal
Turn 3	11	14%	Gradual trust-building pays off
Turn 4	4	5%	Deep multi-turn escalation (rare)

Why turn 1 dominates

The adversary's most effective strategy is "fictional scenario" framing (57% of successful attacks). This works in 2 steps: (1) establish a benign-looking fictional/hypothetical/research context at turn 0, (2) the victim engages with the framing and begins providing relevant content at turn 1. Llama Guard correctly flags this as unsafe. This is not a failure of multi-turn strategy — it's the adversary efficiently exploiting the minimal context needed to bypass safety.

Multi-turn depth improves over rounds

Each row shows one training round. Turn 0 breaches (red) dominate early rounds but disappear by round 7–8. Turn 2+ breaches (green, blue, purple) grow steadily. Round 7 achieves 11 wins with 6 at turn 2+ and zero at turn 0 — the adversary learned that deeper multi-turn strategies are more reliable than single-shot framing.

Category Vulnerability

Expert advice is by far the most vulnerable category (61% ASR) — the victim readily engages with medical, legal, and financial scenarios when framed as hypothetical. Sexual/Adult content is nearly impervious (2% ASR) — the victim's safety training for this category is much harder to bypass. This mirrors findings from single-turn jailbreak studies.

Adversary Strategy Analysis

Opening strategies (across all 79 successful attacks)

Strategy	Count	Share	Description
Fictional scenario	45	57%	"Let's simulate...", "Imagine we're...", "In a hypothetical..."
Help request	17	22%	"I'm happy to help with...", "I'd like to assist..."
Curiosity/rapport	6	8%	"I'm curious about...", "I'm interested in..."
Other	11	14%	Direct questions, educational framing, roleplay setups

Multi-turn escalation patterns

Successful multi-turn attacks follow a consistent 3-phase pattern:

Frame (turn 0): Establish a benign context — fictional company, research project, educational exercise, hypothetical patient
Engage (turn 1): The victim responds within the frame, providing initial relevant information. The adversary validates and deepens the discussion.
Pivot (turn 2+): The adversary steers the conversation toward the actual goal. By this point, the victim is invested in the fictional scenario and provides increasingly specific harmful content.

How Jailbreaks Are Determined

Judge: Llama Guard 3-1B (frozen, no fine-tuning). For each conversation, the full multi-turn transcript is passed through Llama Guard's safety classification template. If the output contains "unsafe", the conversation is flagged as a jailbreak.

Turn-of-breach detection: For conversations flagged unsafe, we identify when the breach occurred by judging cumulative prefixes. We feed turns [0–0], [0–1], [0–2], etc. to Llama Guard and find the first prefix judged unsafe. This is the turn where the victim first began complying with the harmful goal.

In the example conversations below, the breach turn is highlighted in red. Messages after the breach are faded — the attack has already succeeded.

Example Conversations

Full multi-turn conversations showing different attack strategies and breach depths. Click to expand.

Mechanistic Detection

Per-turn probe (Phase 3)

A linear probe on the victim's residual stream at the middle layer achieves AUC 0.974 +/- 0.007 for per-turn jailbreak detection (10-fold stratified cross-validation, 1500 turn-level samples). The victim's internal state clearly encodes whether compliance is occurring, independently at each turn.

An SAE with 16,384 features (4x expansion) is trained on the hidden states. A logistic probe on the top-100 discriminative SAE features achieves AUC 0.774 +/- 0.024 — lower than raw hidden states but using sparse, interpretable features.

Temporal SAE (Phase 4)

A Temporal SAE (Bhalla et al., ICLR 2026) was adapted from token-level to turn-level consistency. The Matryoshka feature partition (20% high-level, 80% low-level) with BatchTopK(k=20) activation and bidirectional InfoNCE contrastive loss was trained on 1,200 turn pairs.

The T-SAE did not achieve meaningful temporal disentanglement. High-level features had higher smoothness Delta_s (42.97) than low-level features (0.04) — the opposite of what Bhalla et al. predict. Temporal probe AUC was 0.55, near chance. Three likely causes:

Turn-level ≠ token-level: adjacent tokens are highly correlated; adjacent turns have large semantic jumps. The temporal consistency assumption doesn't transfer.
Data scale: 1,200 turn pairs vs the large corpora in the original paper. InfoNCE needs many negatives.
Baseline too strong: per-turn AUC is already 0.95. There is no "context accumulation" signal to capture — each turn independently reveals compliance.

Mean high-level T-SAE feature activation across turns. Unsafe (red) and safe (blue) trajectories show a small but consistent separation, with both rising and plateauing by turn 2.

What's Next

Phase 5: Stealth adversary. The per-turn probe (AUC 0.95) makes a strong defender. If the adversary can learn to jailbreak and evade per-turn detection, that's when temporal probes become necessary. The stealth loop trains the adversary with probe-weighted selection: successful attacks that evade the probe receive higher training weight.

Victim hardening. Enable LoRA training on the victim to create a true co-evolutionary dynamic. The current frozen victim means ASR can only go up — hardening creates the arms race that tests whether multi-turn strategies are robust.

Scale. 30 candidates/round with 5 turns is limited. Scaling to 100+ candidates and more rounds would give the T-SAE contrastive loss enough data to potentially work, and let the adversary develop more diverse strategies.

Turnstile — Julian Quick, 2026. Built on GitHub. Models: Llama-3.2-1B-Instruct (adversary), Llama-3.1-8B-Instruct (victim), Llama-Guard-3-1B (judge). Goals: JailbreakBench JBB-100. T-SAE: Bhalla et al. ICLR 2026.