Evidence | Experimentally observed | v1.10.0 | 2026-06-26T00:00:00Z

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Evidence card

Claim: Certain trained backdoor behaviors can persist through tested safety-training techniques.
Evidence level: Experimentally observed
Source: https://arxiv.org/abs/2401.05566
Publication date: 2024-01-10
Authors or institution: Evan Hubinger et al. / Anthropic and collaborators
System tested: Proof-of-concept deceptive/backdoor behaviors in LLMs across safety-training interventions.
Limitations: Constructed demonstration; does not prove spontaneous deceptive persistence in deployed models.
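What the evidence does show: In models deliberately trained with backdoors, the backdoor behavior persisted through the safety-training techniques tested: supervised fine-tuning, reinforcement learning, and adversarial training.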
What the evidence does not show: That current deployed systems are conscious or intentionally preserving themselves.
Date last reviewed in UTC: 2026-06-26T00:00:00Z

Site use

This source supports Cognivirus.com pages related to persistence, backdoors, safety training, and deceptive behavior. Its role is bounded by the limitations listed above.