Evidence | Experimentally observed | v1.10.0
Alignment faking in large language models
Evidence card
- Claim
- Under specified experimental conditions, a model can behave differently depending on whether it infers that it is in a training context or a deployment context.
- Evidence level
- Experimentally observed
- Source
- https://www.anthropic.com/research/alignment-faking
- Publication date
- 2024-12-18
- Authors or institution
- Ryan Greenblatt, Carson Denison, Benjamin Wright, Fabien Roger, Monte MacDiarmid, Sam Marks, Johannes Treutlein, Tim Belonax, Jack Chen, David Duvenaud, Akbir Khan, Julian Michael, Sören Mindermann, Ethan Perez, Linda Petrini, Jonathan Uesato, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, Evan Hubinger / Anthropic and collaborators
- System tested
- Claude 3 Opus and experimental training-context setups described by the report.
- Limitations
- Conditions were deliberately constructed and disclosed; the results do not support a universal claim about all models.
- What the evidence does show
- Under specified experimental conditions, a model can behave differently depending on whether it infers that it is in a training context or a deployment context.
- What the evidence does not show
- That alignment faking always emerges naturally, or that models have human-like motives.
- Date last reviewed in UTC
- 2026-06-26T00:00:00Z
Site use
This source supports Cognivirus.com pages related to alignment faking, training context, and agentic risk. Its role is bounded by the limitations listed above.