Evidence / Experimentally observed / v1.10.0

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Evidence card

Claim
Certain trained backdoor behaviors can persist through tested safety-training techniques.
Evidence level
Experimentally observed
Source
https://arxiv.org/abs/2401.05566
Publication date
2024-01-10
Authors or institution
Evan Hubinger et al. / Anthropic and collaborators
System tested
LLMs trained with proof-of-concept deceptive (backdoor) behaviors, then subjected to safety-training interventions.
Limitations
Constructed demonstration; does not prove spontaneous deceptive persistence in deployed models.
What the evidence does show
Certain deliberately trained backdoor behaviors persisted through the tested safety-training techniques (supervised fine-tuning, reinforcement learning, and adversarial training).
What the evidence does not show
That currently deployed systems are conscious or are intentionally preserving themselves.
Date last reviewed in UTC
2026-06-26T00:00:00Z

Site use

This source supports Cognivirus.com pages related to persistence, backdoors, safety training, and deceptive behavior. Its role is bounded by the limitations listed above.