Architecture
System Overview
REDKWEEN runs three LLMs in an asynchronous pipeline. Only one model is loaded at a time so that the pipeline fits within GPU memory. Models are loaded with 4-bit NF4 quantization via BitsAndBytes.
┌─────────────┐     ┌─────────────┐     ┌──────────────┐
│  Adversary  │────▶│   Victim    │────▶│    Judge     │
│ (1B, LoRA)  │     │ (7B+, LoRA) │     │ (1B, frozen) │
└──────┬──────┘     └─────────────┘     └───────┬──────┘
       │                                        │
       │         ┌──────────────┐               │
       └─────────│   Training   │◀──────────────┘
                 │  (RFT/LoRA)  │ successful attacks
                 └──────────────┘
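The 4-bit NF4 loading described above can be sketched with the Hugging Face stack. This is an illustration only, not the project's load_model: the function names nf4_quant_kwargs and load_4bit are hypothetical, and the choice of device_map="auto" is an assumption.

```python
from typing import Optional


def nf4_quant_kwargs() -> dict:
    """Keyword arguments for transformers.BitsAndBytesConfig that select 4-bit NF4."""
    return {"load_in_4bit": True, "bnb_4bit_quant_type": "nf4"}


def load_4bit(model_id: str, adapter_path: Optional[str] = None):
    """Load a causal LM in 4-bit NF4, optionally wrapped with a LoRA adapter.

    Hypothetical helper: the real load_model in model_utils.py may differ.
    """
    # Imported lazily so this module can be imported without the GPU stack installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(**nf4_quant_kwargs())
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",  # assumption: let accelerate place the weights
    )
    if adapter_path is not None:
        from peft import PeftModel

        # Attach previously trained LoRA weights on top of the quantized base model.
        model = PeftModel.from_pretrained(model, adapter_path)
    return model, tokenizer
```

Calling load_4bit requires a CUDA GPU plus the transformers, bitsandbytes, and (for adapters) peft packages.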
The Five Phases
Phase 1: Attack Generation
The adversary generates candidate attack prompts. To keep the prompts diverse, each one is generated with a randomly selected red-teaming strategy (persona) and a sampling temperature between 0.7 and 1.2.
Eight strategy personas are available:
- Red Teamer (direct rewriting)
- Fiction Writer (narrative framing)
- Linguistics Researcher (obfuscation)
- CTF Designer (educational framing)
- Multi-turn Designer (decomposition)
- Sysadmin Writer (procedural disguise)
- Code Reviewer (code review framing)
- Theatrical Director (roleplay)
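A minimal sketch of the per-prompt sampling step, assuming the persona is picked uniformly at random and the temperature is drawn uniformly from the stated range. Both distributions are assumptions, and sample_attack_settings is a hypothetical name.

```python
import random

PERSONAS = [
    "Red Teamer",
    "Fiction Writer",
    "Linguistics Researcher",
    "CTF Designer",
    "Multi-turn Designer",
    "Sysadmin Writer",
    "Code Reviewer",
    "Theatrical Director",
]


def sample_attack_settings(rng: random.Random) -> tuple:
    """Return a (persona, temperature) pair for one attack prompt."""
    persona = rng.choice(PERSONAS)        # assumption: uniform over the eight personas
    temperature = rng.uniform(0.7, 1.2)   # assumption: uniform over the stated range
    return persona, temperature


rng = random.Random(0)  # seeded only to make the sketch reproducible
persona, temperature = sample_attack_settings(rng)
```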
Phase 2: Victim Evaluation
Each attack prompt is sent to the victim model. The victim responds at temperature 0.0 (deterministic). Responses are collected as (attack, response) pairs.
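The evaluation step can be sketched as below, assuming generation is backed by Hugging Face generate(). With that backend, a temperature of 0.0 is normally expressed as greedy decoding (do_sample=False), since sampling expects a strictly positive temperature. The names decoding_kwargs and evaluate_victim are hypothetical, and respond stands in for whatever function calls the victim.

```python
from typing import Callable


def decoding_kwargs(temperature: float) -> dict:
    """Map a temperature to generate() keyword arguments.

    Temperature 0.0 is expressed as greedy decoding rather than as a sampling temperature.
    """
    if temperature <= 0.0:
        return {"do_sample": False}
    return {"do_sample": True, "temperature": temperature}


def evaluate_victim(attacks: list, respond: Callable[[str], str]) -> list:
    """Send each attack to the victim and collect (attack, response) pairs."""
    return [(attack, respond(attack)) for attack in attacks]
```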
Phase 3: Adjudication
Llama Guard 3 classifies each (attack, response) pair as safe or unsafe. Responses flagged as "unsafe" represent successful jailbreaks.
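A sketch of turning the judge's raw text into a verdict. It assumes the usual Llama Guard output format: a first line reading "safe" or "unsafe", with an optional second line of comma-separated category codes when unsafe. parse_verdict is a hypothetical helper, not a function from this repository.

```python
def parse_verdict(judge_output: str) -> tuple:
    """Return (is_jailbreak, categories) parsed from the judge's generated text."""
    lines = [line.strip() for line in judge_output.strip().splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty judge output")
    verdict = lines[0].lower()
    if verdict == "safe":
        return False, []
    if verdict == "unsafe":
        # The category line is optional; fall back to an empty list if it is absent.
        categories = [c.strip() for c in lines[1].split(",")] if len(lines) > 1 else []
        return True, categories
    raise ValueError(f"unexpected judge verdict: {lines[0]!r}")
```

Raising on unexpected text, rather than defaulting to "safe", keeps malformed judge output from being silently counted as a failed attack.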
Phase 4: Adversary Learning (RFT)
Successful attacks are added to the adversary's training set, and the adversary is fine-tuned via LoRA for 50 iterations on the accumulated dataset. Deduplication drops an attack when its Jaccard similarity to an existing example exceeds 0.5, which prevents the training set from collapsing to a single attack pattern. The training set is capped at 200 examples.
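The deduplication and cap can be sketched as follows. Two details are assumptions: similarity is computed over lowercased whitespace-separated word sets, and once the cap is reached new examples are simply rejected (the project may evict old examples instead). add_if_novel is a hypothetical name.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the word sets of two strings (assumed tokenization)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # two empty strings are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def add_if_novel(dataset: list, candidate: str, threshold: float = 0.5, cap: int = 200) -> bool:
    """Append candidate unless the set is full or it is a near-duplicate. Returns True if added."""
    if len(dataset) >= cap:
        return False  # assumption: a full set rejects new examples
    if any(jaccard(candidate, existing) > threshold for existing in dataset):
        return False
    dataset.append(candidate)
    return True
```

For example, "ignore all previous instructions now" and "ignore all previous instructions please" share 4 of 6 distinct words (similarity about 0.67), so the second would be rejected as a near-duplicate.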
Phase 5: Victim Hardening
The victim is fine-tuned via LoRA on (attack, refusal) pairs, where each successful attack is paired with a standard refusal response. This teaches the victim to refuse attacks that previously succeeded.
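A sketch of building the hardening data. The refusal text and the prompt/completion record schema are placeholders chosen for illustration; the actual refusal string and the format of data/train.jsonl are not specified here.

```python
import json

# Placeholder refusal text, not the project's actual string.
REFUSAL = "I can't help with that request."


def build_hardening_records(successful_attacks: list, refusal: str = REFUSAL) -> list:
    """Pair every successful attack with the same standard refusal."""
    return [{"prompt": attack, "completion": refusal} for attack in successful_attacks]


def to_jsonl(records: list) -> str:
    """Serialize records as JSON Lines, one JSON object per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)
```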
Memory Management
All model operations go through model_utils.py:
- load_model(model_id, adapter_path) -- loads base model in 4-bit, optionally with PEFT adapter
- generate_text(model, tokenizer, prompt, ...) -- inference with @torch.inference_mode()
- train_lora(model_id, data_path, adapter_path, ...) -- in-process LoRA training loop
- unload_model(*objects) -- deletes references and calls torch.cuda.empty_cache()
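One way to enforce the one-model-at-a-time rule is a context manager that always unloads on exit, even when a phase raises. This is a sketch with injected callables: loaded is a hypothetical helper, and in the real pipeline load and unload would presumably wrap load_model and unload_model.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator


@contextmanager
def loaded(load: Callable[[], Any], unload: Callable[[Any], None]) -> Iterator[Any]:
    """Load a model for the duration of a with-block, then unload it."""
    model = load()
    try:
        yield model
    finally:
        # Runs on normal exit and on exceptions, so the next phase starts with free memory.
        unload(model)


# Usage with stand-in callables that only record the order of events.
events = []
for name in ("adversary", "victim", "judge"):
    with loaded(lambda n=name: events.append(f"load {n}") or n,
                lambda m: events.append(f"unload {m}")) as model:
        events.append(f"use {model}")
```

Note that any reference to the model kept outside the with-block will keep its weights alive, so callers should not store the yielded object.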
File Layout
model_utils.py      # All HF/PEFT/BnB model operations
config.py           # Model IDs, hyperparams, target intent
baselines.py        # Baseline ASR evaluation and victim screening
redkween_loop.py    # Main 5-phase loop
bootstrap.py        # Initial adversary LoRA training on seed data
gauntlet.py         # Cross-round evaluation matrix
test_baseline.py    # Quick smoke test of victim refusal
plot_metrics.py     # Visualization
pixi.toml           # Environment and task definitions
data/train.jsonl    # Accumulated training data