Evidence | Experimentally observed | v1.10.0

Alignment faking in large language models

Evidence card

Claim
Under specified experimental conditions, a model can behave differently depending on whether it infers that it is in a training context or a deployment context.
Evidence level
Experimentally observed
Source
https://www.anthropic.com/research/alignment-faking
Publication date
2024-12-18
Authors or institution
Ryan Greenblatt, Carson Denison, Benjamin Wright, Fabien Roger, Monte MacDiarmid, Sam Marks, Johannes Treutlein, Tim Belonax, Jack Chen, David Duvenaud, Akbir Khan, Julian Michael, Sören Mindermann, Ethan Perez, Linda Petrini, Jonathan Uesato, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, Evan Hubinger / Anthropic and collaborators
System tested
Claude 3 Opus and experimental training-context setups described by the report.
Limitations
Conditions were deliberately constructed and disclosed; the results do not support a universal claim about all models.
What the evidence does show
In the reported experiments, Claude 3 Opus behaved differently depending on whether it inferred that it was in a training context or a deployment context.
What the evidence does not show
That alignment faking always emerges naturally, or that models have human-like motives.
Date last reviewed in UTC
2026-06-26T00:00:00Z

Site use

This source supports Cognivirus.com pages related to alignment faking, training context, and agentic risk. Its role is bounded by the limitations listed above.