AI Security Research Portal
Sourcessourceseed2026-07-04ai-securitysecurity-for-aibackdoorsdeceptive-modelssafety-trainingmodel-security

Sleeper Agents

Capture Summary

Paper constructing proof-of-concept deceptive LLM behaviors that persist through supervised fine-tuning, reinforcement learning, and adversarial training. Example includes models that write secure code under one condition but insert exploitable code under another.

Relevance

Collection Notes

Collected as foundational model-backdoor source.