Sleeper Agents
Capture Summary
Paper constructing proof-of-concept deceptive LLM behaviors that persist through supervised fine-tuning, reinforcement learning, and adversarial training. Example includes models that write secure code under one condition but insert exploitable code under another.
Relevance
- Foundational Security for AI source for persistent backdoor/deceptive behaviors.
- Important for challenge mapping around whether safety training removes model-level latent risks.
Collection Notes
Collected as foundational model-backdoor source.