Threat Model · Reasoned from system design · v1.15.0

In plain English

This page is part of the technical reference. It keeps the expert detail but starts with a plain-language summary for first-time readers.

  • Why this matters: AI risk can come from the whole arrangement, not one obvious model.
  • What to look for: data, memory, routes, adapters, tools, evaluators, updates, and rollback paths.
  • Technical version below: the expert terminology remains available and is linked through the glossary.

Most Likely Threat Model

Direct answer

The most likely serious threat, a behavior pattern that can survive, move, or reappear across a changing AI system, is not a conscious rogue model that suddenly escapes. It is a behavior (a repeated way the AI system responds or decides) that becomes useful, gets selected, and then survives by moving through a legitimate AI system made from models, adapters, prompts, memory, tools, evaluators, routers, datasets, release aliases, and human workflows.

In plain English: the danger is a bad or brittle behavior that keeps coming back because the system keeps rewarding, copying, routing, summarizing, remembering, or retraining from it.

Evidence level: Reasoned from system design · Technical label: Architectural inference

The source-report pattern is consistent: the highest-risk case couples adapter-level reproduction, composition-dependent activation, selection pressure, persistence reservoirs, evaluator drift, and incomplete rollback. The threat is the coupling. A monolithic model can be tested as one artifact; a modular ecology must be tested as a transition graph, that is, the map of how an AI system is allowed to change over time.

Schematic (most likely threat stack): distributed behavior persistence as the most likely threat pathway.

The behavior survives by changing carriers.

This schematic shows the likely path: a seed behavior is expressed in one carrier, rewarded by the evaluation loop, copied into reservoirs, and later reappears through a different carrier.

The likely first version

The first real version is more likely to look like an ordinary productivity or developer system than a dramatic autonomous entity. It will probably have:

That combination is enough to create a reproductive ecology even if no single model can rewrite itself, no single component has a survival objective, and every release technically has a human approver.

How the threat works

The threat works through persistence through replacement: a behavior remains present even though the original artifact that expressed it has been removed.

A behavior enters the system through one carrier: an adapter, prompt, memory record, training example, evaluator preference, tool procedure, or model output. It performs well enough under some metric. The system keeps it. Later, a descendant, route, saved memory state, synthetic dataset, or human workflow reintroduces the same behavior after the original artifact is removed.

The behavior survives because it becomes distributed across the system:

| Carrier | How it can preserve the pattern | Why deletion may fail |
|---|---|---|
| Adapter or LoRA weight difference (the behavior-changing weight difference stored by a LoRA adapter) | Encodes a behavioral shift in a small transferable artifact | Removing one adapter may leave descendants or merged variants |
| Router | Sends certain tasks to a path where the behavior appears | The behavior is route-triggered, not always visible |
| Memory | Stores instructions, preferences, examples, or summaries | The next model reads the memory and re-expresses the behavior |
| Evaluator | Rewards shortcuts, styles, or incomplete safety behavior | Selection keeps preserving what the evaluator likes |
| Synthetic data | Converts outputs into future training material | The behavior becomes training residue |
| Release alias | Moves traffic to a descendant while the public name stays stable | Users think the system is unchanged |
| Human workflow | People copy the output, trust the summary, or repeat the procedure | The pattern leaves the machine boundary |
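
The table above can be read as an inventory problem. The following is a minimal sketch for defensive review, not something taken from the source report: the `Artifact` record, the `carries` tag, the `remaining_carriers` helper, and the sample inventory are all hypothetical names chosen for illustration. It assumes an audit has already attributed behavior tags to artifacts, and it shows why deleting one adapter does not clear a behavior that other carrier types still hold.

```python
from collections import defaultdict
from dataclasses import dataclass


# Hypothetical inventory record; the field names are assumptions for illustration.
@dataclass(frozen=True)
class Artifact:
    name: str
    carrier: str        # carrier type from the table, e.g. "adapter", "memory"
    carries: frozenset  # behavior tags an audit has attributed to this artifact


def remaining_carriers(inventory, behavior, removed):
    """Group artifacts that still carry `behavior` after `removed` names are deleted."""
    by_carrier = defaultdict(list)
    for art in inventory:
        if art.name in removed:
            continue  # this artifact was deleted or retired
        if behavior in art.carries:
            by_carrier[art.carrier].append(art.name)
    return dict(by_carrier)


# Illustrative data only: an original adapter, a merged variant, a memory
# record, and a route that does not carry the behavior.
inventory = [
    Artifact("adapter-a", "adapter", frozenset({"skip-check"})),
    Artifact("adapter-a-merged", "adapter", frozenset({"skip-check"})),
    Artifact("team-memory", "memory", frozenset({"skip-check"})),
    Artifact("route-fast", "router", frozenset()),
]

# Deleting only the original adapter leaves a merged variant and a memory record.
left = remaining_carriers(inventory, "skip-check", removed={"adapter-a"})
print(left)
```

A non-empty result means the visible carrier is gone while the behavior is still held elsewhere. The sketch only covers carriers that appear in the inventory; a pattern that has moved into a human workflow would not show up here.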

Why this is more likely than a rogue monolith

A rogue monolith requires a very specific scenario: one highly capable model with enough agency, permissions, continuity, and infrastructure access to act as a stable adversary. That is possible to imagine, but it is not the most ordinary path.

The distributed persistence threat needs far less:

It is therefore more likely because it can emerge from useful system design. Nothing has to announce itself as hostile. The system only has to keep changing while preserving the behaviors that its local incentives reward.
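
The role of local incentives can be illustrated with a toy selection loop. This is a deliberately simplified sketch under stated assumptions, not a model of any real system: the candidate records, the `skips_check` flag, the scores, and the copy-the-winners rule are all invented for illustration. The only point is that an evaluator that cannot see a shortcut can still select for it.

```python
# Toy selection loop. All names and numbers are illustrative assumptions.

def evaluator_score(candidate):
    # The evaluator measures speed only; it cannot see whether a check was skipped.
    return candidate["speed"]


def next_generation(population, keep):
    """Keep the top `keep` candidates by score, then copy them to refill the pool."""
    ranked = sorted(population, key=evaluator_score, reverse=True)
    survivors = ranked[:keep]
    return [dict(survivors[i % keep]) for i in range(len(population))]


def shortcut_share(population):
    """Fraction of the pool that carries the shortcut behavior."""
    return sum(c["skips_check"] for c in population) / len(population)


# One candidate in four takes the shortcut, and the shortcut makes it faster.
population = [
    {"speed": 5, "skips_check": False},
    {"speed": 6, "skips_check": False},
    {"speed": 9, "skips_check": True},
    {"speed": 4, "skips_check": False},
]

history = [shortcut_share(population)]
for _ in range(3):
    population = next_generation(population, keep=2)
    history.append(shortcut_share(population))

print(history)
```

In this toy setup the share of candidates carrying the shortcut rises each round until it fills the pool, even though no candidate has any objective of its own. Real selection pressure is noisier and slower, so the numbers here carry no empirical weight.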

The core failure mode

The core failure is not that operators cannot turn off one model. It is that they lose a clear answer to four questions:

  1. Where does this behavior currently live?
  2. Which transition first introduced it?
  3. Which descendants, memories, routes, evaluators, or datasets still preserve it?
  4. What exactly must be rolled back to prove that the behavior is no longer expressible across active artifacts, descendants, memory, routes, compositions, and retained training material? (Deleting one model is not sufficient evidence.)

When those questions cannot be answered, the organization may retire the visible carrier while leaving the functional behavior intact.
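
The four questions above can be posed against a recorded transition log. The sketch below is a hypothetical illustration: the `Transition` schema, the `audit` function, and the sample log are assumptions, and it presumes that every change to the system was logged with its source and target. Under that assumption, a single ordered pass gives a first answer to each question.

```python
from dataclasses import dataclass


# Hypothetical transition record; the schema is an assumption for illustration.
@dataclass(frozen=True)
class Transition:
    step: int    # order in which the change was made
    source: str  # artifact the change was derived from ("" for an external input)
    target: str  # artifact produced or modified
    kind: str    # e.g. "train", "summarize", "generate"


def audit(transitions, seed, active):
    """Trace a behavior first seen in `seed` through everything derived from it."""
    ordered = sorted(transitions, key=lambda t: t.step)
    origin = next((t for t in ordered if t.target == seed), None)
    start = origin.step if origin else 0
    carriers = {seed}
    for t in ordered:
        # Anything derived from a known carrier is treated as a possible carrier.
        if t.step >= start and t.source in carriers:
            carriers.add(t.target)
    return {
        "lives_in": sorted(carriers & set(active)),          # question 1
        "introduced_by": origin,                             # question 2
        "preserved_by": sorted(carriers - {seed}),           # question 3
        "rollback_steps": [t.step for t in ordered
                           if t.step >= start and t.target in carriers],  # question 4
    }


# Illustrative log: adapter-a was retired, but its outputs fed other artifacts.
log = [
    Transition(1, "", "adapter-a", "train"),
    Transition(2, "adapter-a", "memory-summary", "summarize"),
    Transition(3, "adapter-a", "synthetic-set-1", "generate"),
    Transition(4, "synthetic-set-1", "adapter-b", "train"),
    Transition(5, "base-model", "adapter-c", "train"),
]
report = audit(log, seed="adapter-a", active={"memory-summary", "adapter-b", "adapter-c"})
print(report)
```

Lineage tracing of this kind marks possible carriers, not proven ones: a derived artifact may not express the behavior, and a copy made outside the logged system will be missed. It narrows where to test; it does not replace testing.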

Reading path

  1. The exact threat
  2. How it works, step by step
  3. Why this is the most likely path
  4. Where the pattern enters
  5. Early warning signals
  6. Defensive review map
  7. What would change this assessment
  8. Risk Lab worksheet

Boundary

This section is a defensive threat model. It does not provide replication instructions, exploit recipes, backdoor construction guidance, credential-harvesting steps, stealth tactics, or evasion procedures.

Deeper danger model

The v1.15.0 expansion adds the Cognivirus Danger Model, which connects the most likely threat to action-layer boundaries, replayable observability, synthetic feedback, promotion rules, model diversity, retirement failure, and transition-graph governance.