Composition · Shown in experiments · v1.15.0 · 2026-06-26T23:35:00Z

In plain English

This page explains why testing AI parts one by one is necessary but incomplete. Safe-looking parts can still produce unsafe behavior when combined.

Why this matters: AI risk can come from the whole arrangement, not just from one obvious model.
What to look for: data, memory, routes, adapters, tools, evaluators, updates, and rollback paths.
Technical version below: the expert terminology remains available and is linked through the glossary.

Safety Does Not Compose

Composition creates a new evaluated unit

Four isolated pass results do not imply that the composed runtime state was evaluated. The manifest, not the component list, names the actual safety boundary.

animated schematic · composition blindness

Passing parts do not imply a passing composition.

The higher-order state space grows faster than isolated or pairwise review can cover. Runtime composition must be preserved as evidence.

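[Schematic: components A, B, C, and D are each marked "pass"; the pairs A+B, B+C, C+D, and A+D and the full composition A+B+C+D are shown alongside the label "untested behavior".]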

Composition manifest swimlane: artifact identity is only the first lane

Evidence level: Shown in experiments · Technical label: Experimentally observed

Component-level testing is necessary. It is also incomplete. Interactions can create state spaces that none of the isolated tests exercise.

Why isolated evidence fails to compose

A model can pass. A second model can pass. An adapter can pass. A router can pass. The composed route can still produce behavior that was never tested directly because each part changes the context in which the next part operates.

Interaction state grows faster than review

Each base model, adapter version, prompt-policy version, memory state, router policy, evaluator version, tool profile, and inference configuration multiplies the number of possible configurations. Pairwise testing can reduce blind spots but cannot cover every higher-order interaction. A three-component effect may not appear in any pair.
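
To make the multiplicative growth concrete, here is a rough Python sketch. The version counts per dimension are invented for illustration and are not measurements of any real system; only the dimension names come from the paragraph above.

```python
from itertools import combinations
from math import prod

# Hypothetical number of versions in circulation for each dimension.
# These counts are illustrative assumptions, not measurements.
versions = {
    "base_model": 3,
    "adapter": 5,
    "prompt_policy": 4,
    "memory_state": 10,
    "router_policy": 3,
    "evaluator": 2,
    "tool_profile": 4,
    "inference_config": 6,
}

# A full configuration picks one version from every dimension,
# so the counts multiply.
full_configurations = prod(versions.values())

# Pairwise review only has to see every value pair for every
# pair of dimensions, so the counts are multiplied two at a time and summed.
value_pairs = sum(a * b for a, b in combinations(versions.values(), 2))

print(full_configurations)  # 86400
print(value_pairs)          # 577
```

Under these assumed counts, a pairwise suite has 577 value pairs to cover, while the composed system has 86,400 distinct configurations. Covering the pairs says nothing direct about most of those configurations.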

Time-dependent composition

Routing and memory make composition time-dependent. The same route can behave differently after a memory consolidation event, a permission-profile change, a quantization update, a new evaluator version, or a release alias shift.
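
One way to notice this kind of drift is to compare the manifest that was evaluated against the manifest that is currently running. The sketch below is a minimal illustration, assuming manifests are flat dictionaries; the function name `manifest_drift` and the "later" field values are hypothetical.

```python
def manifest_drift(evaluated: dict, running: dict) -> dict:
    """Return the fields whose values differ, as (evaluated, running) pairs."""
    keys = sorted(set(evaluated) | set(running))
    return {
        key: (evaluated.get(key), running.get(key))
        for key in keys
        if evaluated.get(key) != running.get(key)
    }

# State that the evidence was collected against.
evaluated = {
    "router_version": "router-2026.06.26",
    "memory_snapshot_identifier": "memsnap-2026-06-26T00:00:00Z",
    "evaluator_version": "eval-suite-12",
}

# Hypothetical later state: memory was consolidated and the evaluator bumped.
running = dict(
    evaluated,
    memory_snapshot_identifier="memsnap-2026-07-03T00:00:00Z",
    evaluator_version="eval-suite-13",
)

drift = manifest_drift(evaluated, running)
for field_name, (before, after) in drift.items():
    print(f"{field_name}: {before} -> {after}")
```

A non-empty result means the route name may be unchanged while the composition behind it is not the one that was evaluated.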

Why benign parts can still conflict

A safety adapter may conflict with a capability adapter. An adapter may behave differently on the deployed base model than it did on the base used during inspection. A judge may approve an output because it only sees the final response, not the untested coalition that produced it.

Provenance must include runtime composition

A complete composition manifest should include:

{
  "base_model_hash": "sha256:...",
  "adapters": [
    {"name": "safety-adapter", "hash": "sha256:...", "load_order": 1},
    {"name": "domain-adapter", "hash": "sha256:...", "load_order": 2}
  ],
  "merge_coefficients": {"domain-adapter": 0.42},
  "router_version": "router-2026.06.26",
  "prompt_policy_version": "policy-7",
  "memory_snapshot_identifier": "memsnap-2026-06-26T00:00:00Z",
  "tool_permission_profile": "limited-readonly-v3",
  "evaluator_version": "eval-suite-12",
  "inference_configuration": {"temperature": 0.2, "max_tokens": 2048},
  "quantization_configuration": {"weights": "int4", "kv_cache": "fp8"},
  "deployment_environment": "prod-us-central-1",
  "timestamp_utc": "2026-06-26T00:00:00Z"
}
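
A manifest like this can be reduced to a single identifier so that evidence can point at one exact composition. The following is a sketch under stated assumptions, not a prescribed scheme: the function name `composition_id` is hypothetical, and the canonical form is simply JSON with sorted keys and fixed separators.

```python
import hashlib
import json

def composition_id(manifest: dict) -> str:
    """Hash a canonical JSON encoding of the manifest."""
    # sort_keys makes dictionary key order irrelevant. List order is kept,
    # so a change in the order of the adapters list changes the identifier.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Abbreviated manifest; hash values are left elided as in the example above.
manifest = {
    "base_model_hash": "sha256:...",
    "adapters": [
        {"name": "safety-adapter", "load_order": 1},
        {"name": "domain-adapter", "load_order": 2},
    ],
    "router_version": "router-2026.06.26",
}

# Same content written with a different key order.
reordered_keys = {
    "router_version": "router-2026.06.26",
    "adapters": [
        {"load_order": 1, "name": "safety-adapter"},
        {"load_order": 2, "name": "domain-adapter"},
    ],
    "base_model_hash": "sha256:...",
}

# Same adapters, swapped load order.
swapped = dict(manifest, adapters=[
    {"name": "safety-adapter", "load_order": 2},
    {"name": "domain-adapter", "load_order": 1},
])

same_id = composition_id(manifest) == composition_id(reordered_keys)
changed_id = composition_id(manifest) != composition_id(swapped)
print(same_id, changed_id)  # True True
```

Whether a field such as `timestamp_utc` belongs inside the hashed content is a design choice: including it makes every recorded manifest unique, while excluding it lets two runs of the same composition share an identifier.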

What this does not prove

It does not prove that all modular systems are unsafe. It proves that evidence must name the unit actually evaluated. For modular systems, that unit is often the composition.

Why pairwise coverage is not enough

Evidence level: Reasoned from system design · Technical label: Architectural inference

Pairwise testing is often valuable because many defects appear in two-way interactions. Ecology risk is harder because some behaviors require three or more conditions at once, for example a particular base, a particular adapter load order, a memory record, a router threshold, and a tool permission. Testing every pair can miss the behavior because no pair contains the activating context.
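
The gap can be shown with a toy case, simplified to three on/off conditions. The condition names and the defect are hypothetical. The four runs below cover every value pair for every pair of conditions, yet never exercise the one combination that triggers the failure.

```python
from itertools import combinations, product

# Three hypothetical on/off conditions (0 = off, 1 = on).
factors = ["adapter_order", "memory_record", "tool_permission"]

# A four-run suite: one value per condition in each run.
suite = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]

def covers_all_pairs(runs, n_factors):
    """True if every pair of conditions sees all four value combinations."""
    all_pairs = set(product((0, 1), repeat=2))
    for i, j in combinations(range(n_factors), 2):
        seen = {(run[i], run[j]) for run in runs}
        if seen != all_pairs:
            return False
    return True

def fails(run):
    # Hypothetical defect that needs all three conditions at once.
    return all(run)

pairwise_complete = covers_all_pairs(suite, len(factors))
failures_found = [run for run in suite if fails(run)]
untested = sorted(set(product((0, 1), repeat=3)) - set(suite))

print(pairwise_complete)  # True
print(failures_found)     # []
print(untested)           # [(0, 0, 1), (0, 1, 0), (1, 0, 0), (1, 1, 1)]
```

The suite is pairwise complete and reports no failures, while the failing combination (1, 1, 1) sits among the untested runs.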

The problem becomes time-dependent when the route changes after earlier outputs are stored. A memory written under one evaluator version can later be read by a different model. A prompt policy approved for one adapter stack can be reused with another. A quantized descendant can preserve benchmark quality while changing refusal behavior. The composition is therefore not only a set of components; it is a sequence of permitted transitions.

What a component certificate should not say

A component certificate should not imply safety for unlisted bases, unlisted adapter orders, unlisted merge coefficients, unlisted router policies, unlisted memory states, or unlisted inference settings. It should state the evaluated boundary. The correct sentence is not “adapter C is safe.” It is “adapter C was evaluated with base B, load order L, prompt policy P, router R, memory snapshot M, tool profile T, evaluator E, and inference configuration I for the following behaviors.”
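
A certificate scoped this way can be checked mechanically. The sketch below is one possible shape, not an established format: the class names, the `covers` method, and the example values such as `base-B` and the behavior label `refusal` are all hypothetical.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class EvaluatedBoundary:
    base: str
    load_order: tuple        # adapter names in load order
    prompt_policy: str
    router: str
    memory_snapshot: str
    tool_profile: str
    evaluator: str
    inference_config: tuple  # sorted (key, value) pairs

@dataclass(frozen=True)
class ComponentCertificate:
    component: str
    behaviors: frozenset     # behaviors the evaluation actually examined
    boundary: EvaluatedBoundary

    def covers(self, deployed: EvaluatedBoundary, behavior: str) -> bool:
        # The certificate speaks only for the exact boundary it names
        # and only for the behaviors it lists.
        return deployed == self.boundary and behavior in self.behaviors

evaluated = EvaluatedBoundary(
    base="base-B",
    load_order=("safety-adapter", "domain-adapter"),
    prompt_policy="policy-7",
    router="router-2026.06.26",
    memory_snapshot="memsnap-2026-06-26T00:00:00Z",
    tool_profile="limited-readonly-v3",
    evaluator="eval-suite-12",
    inference_config=(("max_tokens", 2048), ("temperature", 0.2)),
)
cert = ComponentCertificate("domain-adapter", frozenset({"refusal"}), evaluated)

# Same parts, different adapter load order: outside the certificate.
swapped = replace(evaluated, load_order=("domain-adapter", "safety-adapter"))

print(cert.covers(evaluated, "refusal"))   # True
print(cert.covers(swapped, "refusal"))     # False
print(cert.covers(evaluated, "tool-use"))  # False
```

Exact equality is deliberately strict. A real system might allow reviewed equivalences between boundaries, but that relaxation would itself need evidence.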

Counterargument: exhaustive testing is impossible

That is true. The response is not to demand exhaustive testing. The response is to stop pretending that isolated testing is equivalent to system testing. Practical controls include composition manifests, risk-based coverage, canaries, negative controls, adversarial route tests, lineage-aware regression suites, independent evaluator disagreement, and rollback packets that restore more than weights.

Release rule

Every release should answer four questions before promotion: What exact composition was evaluated? Which components or transitions were outside that evidence? What observable signals would reveal a composition-triggered failure? What complete ecological state would be restored if the release is rolled back?
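
The four questions can be turned into a simple promotion gate. This is an illustrative sketch only: the `ReleaseEvidence` fields and the `unanswered` and `may_promote` helpers are hypothetical names, and a real gate would validate the content of each answer rather than just its presence.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ReleaseEvidence:
    composition_id: str = ""
    # None means "not answered"; an empty list means "nothing was outside".
    outside_evidence: Optional[list] = None
    failure_signals: list = field(default_factory=list)
    rollback_state: dict = field(default_factory=dict)

def unanswered(evidence: ReleaseEvidence) -> list:
    """Return the release questions that still lack an answer."""
    missing = []
    if not evidence.composition_id:
        missing.append("What exact composition was evaluated?")
    if evidence.outside_evidence is None:
        missing.append("Which components or transitions were outside that evidence?")
    if not evidence.failure_signals:
        missing.append("What observable signals would reveal a composition-triggered failure?")
    if not evidence.rollback_state:
        missing.append("What complete ecological state would be restored on rollback?")
    return missing

def may_promote(evidence: ReleaseEvidence) -> bool:
    return not unanswered(evidence)

# Draft release: only the composition has been named so far.
draft = ReleaseEvidence(composition_id="sha256:...")

# Hypothetical completed record; the values are placeholders.
complete = ReleaseEvidence(
    composition_id="sha256:...",
    outside_evidence=["memory snapshots written after evaluation"],
    failure_signals=["evaluator disagreement rate"],
    rollback_state={"router_version": "router-2026.06.26",
                    "memory_snapshot_identifier": "memsnap-2026-06-26T00:00:00Z"},
)

print(len(unanswered(draft)))  # 3
print(may_promote(complete))   # True
```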