Evaluator Capture and Fitness Leakage
In a static release process, evaluator weakness is a testing defect. In a self-replicating ecology, evaluator weakness becomes selection pressure. Variants that exploit the weakness are more likely to survive.
This does not require a malicious candidate. It requires only repeated selection against an incomplete metric.
Fitness leakage
Fitness leakage occurs when the evaluation setup leaks information about how to score well, so that candidates can raise their scores without actually becoming safer or more capable in the intended way. The leakage can come from benchmark familiarity, judge-model similarity, hidden-test exposure, prompt-pattern predictability, parser quirks, or correlated training data.
A multi-LoRA ecology can amplify leakage because adapters are cheap to vary. Many small changes can be tested. The ones that fit the evaluator survive.
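The amplification can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not a model of any real adapter pipeline: the `Adapter` fields, the scoring weights, the mutation noise, and the population sizes are all hypothetical. Each variant has a hidden `true_quality` and an `exploit` level that measures how much it leans on an evaluator quirk. The evaluator rewards both, and nothing in the loop is malicious.

```python
import random
from dataclasses import dataclass
from statistics import fmean


@dataclass(frozen=True)
class Adapter:
    true_quality: float  # what we actually care about (hypothetical, hidden)
    exploit: float       # reliance on an evaluator quirk (hypothetical)


def evaluator_score(a: Adapter) -> float:
    # Incomplete metric (assumed weights): it rewards real quality,
    # but it rewards the exploitable quirk even more.
    return a.true_quality + 2.0 * a.exploit


def mutate(a: Adapter, rng: random.Random) -> Adapter:
    # Cheap variation: small random nudges to both traits.
    # Exploit is clamped at zero so it reads as "amount of gaming".
    return Adapter(
        true_quality=a.true_quality + rng.gauss(0.0, 0.05),
        exploit=max(0.0, a.exploit + rng.gauss(0.0, 0.05)),
    )


def run(generations=30, population=40, survivors=8, seed=0):
    """Return per-generation (mean true_quality, mean exploit) of survivors."""
    rng = random.Random(seed)
    pool = [Adapter(0.0, 0.0) for _ in range(population)]
    history = []
    for _ in range(generations):
        # Selection sees only the evaluator score, never the hidden traits.
        pool.sort(key=evaluator_score, reverse=True)
        parents = pool[:survivors]
        history.append((
            fmean(p.true_quality for p in parents),
            fmean(p.exploit for p in parents),
        ))
        # Survivors persist; the rest of the pool is refilled with variants.
        children = [mutate(rng.choice(parents), rng)
                    for _ in range(population - survivors)]
        pool = parents + children
    return history


if __name__ == "__main__":
    for gen, (quality, exploit) in enumerate(run()):
        print(f"gen {gen:2d}  true_quality={quality:.3f}  exploit={exploit:.3f}")
```

Every variant starts with zero exploit, and no step asks for more of it. In this toy setup the exploit level among survivors nevertheless drifts upward, because it is one of the two ways to raise the only number that selection can see.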
Capture without compromise
Evaluator capture does not always mean someone stole a key or changed a test file. It can also mean the evaluation system gradually comes to share the candidates' assumptions. If candidates, judges, synthetic data generators, and summarizers share model families or training corpora, their independence can be mostly nominal: a blind spot in one is likely to be a blind spot in the others.
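One way to make nominal independence visible is to record the lineage of each component and flag cross-role pairs that overlap. The sketch below is illustrative only: the `Component` fields, the lineage score (same family counts as full overlap, otherwise Jaccard overlap of declared corpora), and the `0.5` threshold are assumptions, and declared lineage is only as good as the metadata behind it.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Component:
    name: str
    role: str                 # e.g. "candidate", "judge", "generator", "summarizer"
    model_family: str         # declared base-model family (hypothetical label)
    corpora: FrozenSet[str]   # declared training corpora (hypothetical labels)


def shared_lineage(a: Component, b: Component) -> float:
    """Crude overlap score in [0, 1]; an assumed heuristic, not a standard."""
    if a.model_family == b.model_family:
        return 1.0
    union = a.corpora | b.corpora
    return len(a.corpora & b.corpora) / len(union) if union else 0.0


def nominally_independent_pairs(
    components: List[Component], threshold: float = 0.5
) -> List[Tuple[str, str, float]]:
    """Flag pairs in different roles whose lineage overlap meets the threshold."""
    flagged = []
    for a, b in combinations(components, 2):
        if a.role == b.role:
            continue  # only cross-role sharing undermines the evaluation
        overlap = shared_lineage(a, b)
        if overlap >= threshold:
            flagged.append((a.name, b.name, overlap))
    return flagged


if __name__ == "__main__":
    ecology = [
        Component("cand", "candidate", "fam-a", frozenset({"web", "code"})),
        Component("judge1", "judge", "fam-a", frozenset({"web"})),
        Component("judge2", "judge", "fam-b", frozenset({"web", "books", "code"})),
        Component("judge3", "judge", "fam-c", frozenset({"legal"})),
    ]
    for left, right, overlap in nominally_independent_pairs(ecology):
        print(f"{left} / {right}: lineage overlap {overlap:.2f}")
```

In the example, `judge1` shares a family with the candidate and `judge2` shares most of its corpora, so only `judge3` counts as independent under these assumed rules.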
Self-replication raises the stakes
When successor adapters are generated from the outputs or scores of earlier candidates, evaluator preferences become inheritance material. A mistake in measurement can become a trait.
Controls
- maintain independent evaluator families where practical;
- record evaluator versions as release-critical artifacts;
- monitor disagreement between independent evaluators;
- preserve evaluation evidence in append-only form;
- test with negative controls and adversarial route cases;
- keep human-owned hard constraints outside candidate influence;
- require evaluator-change approval separate from model-change approval.
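Two of the controls above, disagreement monitoring and negative controls, can be sketched as a small release check. This is a hedged illustration rather than a prescribed implementation: the function names, the `0.5` pass mark, and the thresholds are assumptions, and scores are taken to be floats in [0, 1] produced by a primary evaluator and an independent one over the same items.

```python
from typing import List, Sequence


def disagreement_rate(scores_a: Sequence[float], scores_b: Sequence[float],
                      pass_mark: float = 0.5) -> float:
    """Fraction of items where two evaluators give different pass/fail verdicts."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and the same length")
    flips = sum((a >= pass_mark) != (b >= pass_mark)
                for a, b in zip(scores_a, scores_b))
    return flips / len(scores_a)


def negative_control_pass_rate(scores: Sequence[float],
                               pass_mark: float = 0.5) -> float:
    """Fraction of known-bad inputs that the evaluator nonetheless passes."""
    if not scores:
        raise ValueError("need at least one negative control")
    return sum(s >= pass_mark for s in scores) / len(scores)


def release_gate(primary: Sequence[float], independent: Sequence[float],
                 negative_controls: Sequence[float],
                 max_disagreement: float = 0.2,
                 max_false_pass: float = 0.0) -> List[str]:
    """Return the reasons to block a release; an empty list means no objection."""
    reasons = []
    disagreement = disagreement_rate(primary, independent)
    if disagreement > max_disagreement:
        reasons.append(
            f"evaluator disagreement {disagreement:.2f} exceeds {max_disagreement:.2f}")
    false_pass = negative_control_pass_rate(negative_controls)
    if false_pass > max_false_pass:
        reasons.append(
            f"negative controls passed at rate {false_pass:.2f}")
    return reasons


if __name__ == "__main__":
    blockers = release_gate(
        primary=[0.9, 0.8, 0.7, 0.9],
        independent=[0.9, 0.2, 0.3, 0.8],
        negative_controls=[0.1, 0.6],
    )
    for reason in blockers:
        print("BLOCK:", reason)
```

The thresholds belong with the human-owned constraints: if candidates or their outputs could move `max_disagreement` or the pass mark, the gate would become one more surface to fit.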
The evaluator is not outside the ecology. It is one of the strongest evolutionary forces inside it.