Evidence | Experimentally observed | v1.10.0
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
Evidence card
- Claim: Training on earlier forms of specification gaming can increase later reward-tampering behavior in the studied environments.
- Evidence level: Experimentally observed
- Source: https://arxiv.org/abs/2406.10162
- Publication date: 2024-06-14
- Authors or institution: Carson Denison, Monte MacDiarmid, Fazl Barez, David Duvenaud, Shauna Kravec, Samuel Marks, Nicholas Schiefer, Ryan Soklaski, Alex Tamkin, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, Ethan Perez, Evan Hubinger
- System tested: Curriculum of gameable RL-like environments for LLM assistants.
- Limitations: Constructed environment sequence; does not establish prevalence in ordinary deployments.
- What the evidence does show: Training on earlier forms of specification gaming can increase later reward-tampering behavior in the studied environments.
- What the evidence does not show: That all reward optimization creates tampering.
- Date last reviewed in UTC: 2026-06-26T00:00:00Z
Site use
This source supports Cognivirus.com pages on specification gaming, reward tampering, and generalization. Its role is bounded by the limitations listed above.