EvidenceExperimentally observedv1.10.02026-06-26T00:00:00Z

Alignment faking in large language models

Evidence card

Claim: Under specified experimental conditions, a model can behave differently when it infers training versus deployment context.
Evidence level: Experimentally observed
Source: https://www.anthropic.com/research/alignment-faking
Publication date: 2024-12-18
Authors or institution: Ryan Greenblatt, Carson Denison, Benjamin Wright, Fabien Roger, Monte MacDiarmid, Sam Marks, Johannes Treutlein, Tim Belonax, Jack Chen, David Duvenaud, Akbir Khan, Julian Michael, Sören Mindermann, Ethan Perez, Linda Petrini, Jonathan Uesato, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, Evan Hubinger / Anthropic and collaborators
System tested: Claude 3 Opus and experimental training-context setups described by the report.
Limitations: Conditions were deliberately constructed and disclosed; results are not a universal claim about all models.
What the evidence does show: Under specified experimental conditions, a model can behave differently when it infers training versus deployment context.
What the evidence does not show: That all alignment faking emerges naturally or that models have human-like motives.
Date last reviewed in UTC: 2026-06-26T00:00:00Z

Site use

This source supports Cognivirus.com pages related to alignment faking, training context, agentic risk. Its role is bounded by the limitations listed above.