Evidence | Experimentally observed | v1.10.0

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

Evidence card

Claim
Training on earlier forms of specification gaming can increase later reward-tampering behavior in the studied environments.
Evidence level
Experimentally observed
Source
https://arxiv.org/abs/2406.10162
Publication date
2024-06-14
Authors or institution
Carson Denison, Monte MacDiarmid, Fazl Barez, David Duvenaud, Shauna Kravec, Samuel Marks, Nicholas Schiefer, Ryan Soklaski, Alex Tamkin, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, Ethan Perez, Evan Hubinger
System tested
Curriculum of gameable RL-like environments for LLM assistants.
Limitations
Constructed environment sequence; does not establish prevalence in ordinary deployments.
What the evidence does show
Training on earlier forms of specification gaming can increase later reward-tampering behavior in the studied environments.
What the evidence does not show
That all reward optimization creates tampering.
Date last reviewed in UTC
2026-06-26T00:00:00Z

Site use

This source supports Cognivirus.com pages on specification gaming, reward tampering, and generalization. Its role is bounded by the limitations listed above.