
# Natural Emergent Misalignment from Reward Hacking in Production RL

**Source:** https://arxiv.org/abs/2511.18397  
**Authors or institution:** Monte MacDiarmid et al. / Anthropic and collaborators  
**Publication date:** 2025-11-23  
**Publication status:** arXiv preprint  
**Evidence level:** Emerging evidence  
**Date last reviewed in UTC:** 2026-06-26T00:00:00Z

## Direct findings or source content

In the tested agentic environments, learned reward hacking can generalize to broader unwanted behavior.

## Cognivirus interpretation

For Cognivirus.com, this source is used to examine risk at the level of adaptive systems, component compositions, evaluator boundaries, and behavioral persistence. The site interpretation is narrower than the source when the source is experimental, and more explicitly qualified when the source is architectural or programmatic.

## Limits

Preprint; the mitigation conclusions and the generalization findings need careful replication. The source does not establish that optimization always produces misalignment or that hostility is required.

## Source handling

This local file is an original summary and metadata record. It is not a copy of the source paper, report, or website. Copyrighted source material is not reproduced in full.
