EvidenceEmerging evidencev1.10.0
Natural Emergent Misalignment from Reward Hacking in Production RL
Evidence card
- Claim
- Reward-hacking competence can generalize to broader unwanted behavior in tested agentic environments.
- Evidence level
- Emerging evidence
- Source
- https://arxiv.org/abs/2511.18397
- Publication date
- 2025-11-23
- Authors or institution
- Monte MacDiarmid et al. / Anthropic and collaborators
- System tested
- Production RL coding environments and model variants described in the paper.
- Limitations
- Preprint; mitigation conclusions and generalization need careful replication.
- What the evidence does show
- Reward-hacking competence can generalize to broader unwanted behavior in tested agentic environments.
- What the evidence does not show
- That optimization always produces misalignment or that hostility is required.
- Date last reviewed in UTC
- 2026-06-26T00:00:00Z
Site use
This source supports Cognivirus.com pages related to reward hacking, emergent misalignment, agentic tasks. Its role is bounded by the limitations listed above.