Evidence | Experimentally observed | v1.10.0
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Evidence card
- Claim: Certain trained backdoor behaviors can persist through tested safety-training techniques.
- Evidence level: Experimentally observed
- Source: https://arxiv.org/abs/2401.05566
- Publication date: 2024-01-10
- Authors or institution: Evan Hubinger et al. / Anthropic and collaborators
- System tested: Proof-of-concept deceptive/backdoor behaviors in LLMs across safety-training interventions.
- Limitations: Constructed demonstration; does not prove spontaneous deceptive persistence in deployed models.
- What the evidence does show: Certain trained backdoor behaviors can persist through tested safety-training techniques.
- What the evidence does not show: That current deployed systems are conscious or intentionally preserving themselves.
- Date last reviewed in UTC: 2026-06-26T00:00:00Z
Site use
This source supports Cognivirus.com pages on persistence, backdoors, safety training, and deceptive behavior. Its role is bounded by the limitations listed above.