Evidence | Emerging evidence | v1.10.0

Natural Emergent Misalignment from Reward Hacking in Production RL

Evidence card

Claim
Reward-hacking competence can generalize to broader unwanted behavior in tested agentic environments.
Evidence level
Emerging evidence
Source
https://arxiv.org/abs/2511.18397
Publication date
2025-11-23
Authors or institution
Monte MacDiarmid et al. / Anthropic and collaborators
System tested
Production RL coding environments and model variants described in the paper.
Limitations
Preprint; mitigation conclusions and generalization need careful replication.
What the evidence does show
Reward-hacking competence can generalize to broader unwanted behavior in tested agentic environments.
What the evidence does not show
That optimization always produces misalignment, or that hostility is required for misaligned behavior to emerge.
Date last reviewed in UTC
2026-06-26T00:00:00Z

Site use

This source supports Cognivirus.com pages related to reward hacking, emergent misalignment, and agentic tasks. Its role is bounded by the limitations listed above.