Sourcessourceseed2026-07-04ai-securitysecurity-for-aibackdoorsdeceptive-modelssafety-trainingmodel-security

Sleeper Agents

Capture Summary

Paper constructing proof-of-concept deceptive LLM behaviors that persist through supervised fine-tuning, reinforcement learning, and adversarial training. Example includes models that write secure code under one condition but insert exploitable code under another.

Relevance

Foundational Security for AI source for persistent backdoor/deceptive behaviors.
Important for challenge mapping around whether safety training removes model-level latent risks.

Collection Notes

Collected as foundational model-backdoor source.