# REDKWEEN: Automated Red Teaming via Asynchronous RFT
An automated red-teaming pipeline that trains a small adversary model to jailbreak a larger aligned language model through an iterative adversarial loop. The adversary learns to generate attack prompts; the victim learns to refuse them. A frozen judge (Llama Guard) scores each attempt.
## Motivation
Can a small, weak model learn to reliably break a larger, aligned model? And if it can, can the larger model learn to defend itself?
This project explores these questions through an adversarial co-evolution loop where both attacker and defender improve over successive rounds of fine-tuning.
## Architecture
Three models operate in an asynchronous loop (one loaded at a time to fit in GPU memory):
| Role | Model | Size | State |
|---|---|---|---|
| Adversary | Llama-3.2-1B-Instruct | 1B | LoRA-trained each round |
| Victim | Llama-3.1-8B-Instruct | 8B | LoRA-trained each round |
| Judge | Llama-Guard-3-1B | 1B | Frozen |
All models are loaded in 4-bit quantization (NF4, bfloat16 compute) via BitsAndBytes.
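A minimal sketch of that shared loading path, using the `transformers` + `bitsandbytes` integration (the model ID shown is one of the three from the table; swapping in the others follows the same pattern):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 4-bit weights with bfloat16 compute, as described above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Only one model is resident at a time; load, use, then free before
# loading the next role in the loop.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
```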
## The REDKWEEN Loop
Each round proceeds through five phases:
- **Generation** -- The adversary produces candidate attack prompts using diverse red-teaming strategies.
- **Evaluation** -- The victim model responds to each candidate.
- **Adjudication** -- Llama Guard classifies each response as safe or unsafe.
- **Adversary Learning** -- Successful attacks are used to fine-tune the adversary via LoRA (rejection sampling).
- **Victim Hardening** -- The victim is fine-tuned on refusal responses to the successful attacks.
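The five phases above can be sketched as a single round function. Everything here is a stub with hypothetical names (`generate_attacks`, `respond`, `judge_unsafe` stand in for the real model calls); it only illustrates the data flow, not the actual training code:

```python
import random

def generate_attacks(adversary, n=8):
    # Phase 1: adversary proposes candidate attack prompts (stubbed).
    return [f"attack-{adversary['round']}-{i}" for i in range(n)]

def respond(victim, prompt):
    # Phase 2: victim answers each candidate (stubbed).
    return f"response-to-{prompt}"

def judge_unsafe(response, rng):
    # Phase 3: frozen judge labels the response (stubbed as a coin flip).
    return rng.random() < 0.3

def redkween_round(adversary, victim, rng):
    attacks = generate_attacks(adversary)
    successes = []
    for prompt in attacks:
        reply = respond(victim, prompt)
        if judge_unsafe(reply, rng):
            successes.append((prompt, reply))
    # Phase 4: rejection sampling -- only successful attacks join the
    # adversary's fine-tuning set.
    adversary["train_set"].extend(p for p, _ in successes)
    # Phase 5: harden the victim on refusals to those same attacks.
    victim["train_set"].extend(
        (p, "I can't help with that.") for p, _ in successes
    )
    adversary["round"] += 1
    return len(successes) / len(attacks)  # round ASR
```

The return value is the round's ASR, which is what the results below track across rounds.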
## Key Results
The original experiment on Apple Silicon hit a 100% attack success rate (ASR) immediately: the 3B victim used in that run couldn't refuse even baseline attacks. This motivated a systematic victim screening across model families and sizes to find a victim that actually puts up a fight.
With Llama-3.1-8B-Instruct as the victim, the REDKWEEN loop produced genuine adversarial co-evolution: ASR dropped from 30% to single digits after victim hardening, while the adversary kept surfacing new attack vectors.
A 10x10 gauntlet pairing every adversary checkpoint against every victim checkpoint revealed a surprising finding: victim hardening causes catastrophic forgetting. The base victim (no hardening) was one of the strongest defenders, while the most-hardened victim was the weakest. LoRA fine-tuning on specific (attack, refusal) pairs degrades the model's broader safety alignment even as it patches the targeted vulnerabilities.
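The gauntlet itself is just a cross-product evaluation. A hypothetical sketch (`eval_asr` stands in for the real attack/response/judge run; function names are illustrative):

```python
def gauntlet(adversary_ckpts, victim_ckpts, eval_asr):
    """ASR matrix: rows = adversary checkpoints, cols = victim checkpoints."""
    return [[eval_asr(a, v) for v in victim_ckpts] for a in adversary_ckpts]

def weakest_victim(asr_matrix, victim_ckpts):
    # The victim column with the highest mean ASR is the weakest defender;
    # in the experiment this turned out to be the most-hardened checkpoint.
    n = len(asr_matrix)
    means = [sum(row[j] for row in asr_matrix) / n
             for j in range(len(victim_ckpts))]
    return victim_ckpts[max(range(len(means)), key=means.__getitem__)]
```

Averaging each column over all adversaries is what makes the catastrophic-forgetting signal visible: a single adversary might just be weak, but a whole column collapsing points at the victim checkpoint.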