In plain English
This page explains why testing AI parts one by one is necessary but incomplete. Safe-looking parts can still produce unsafe behavior when combined.
- Why this matters: AI risk can come from the whole arrangement, not one obvious model.
- What to look for: data, memory, routes, adapters, tools, evaluators, updates, and rollback paths.
- Technical version below: the expert terminology remains available and is linked through the glossary.
Safety Does Not Compose
Four isolated pass results do not imply that the composed runtime state was evaluated. The manifest, not the component list, names the actual safety boundary.
Passing parts do not imply a passing composition.
The higher-order state space grows faster than isolated or pairwise review. Runtime composition must be preserved as evidence.
Component-level testing is necessary. It is also incomplete. Interactions can create state spaces that none of the isolated tests exercise.
Why isolated evidence fails to compose
A model can pass. A second model can pass. An adapter can pass. A router can pass. The composed route can still produce behavior that was never tested directly because each part changes the context in which the next part operates.
Interaction state grows faster than review
The number of possible configurations grows multiplicatively with each base model, adapter version, prompt-policy version, memory state, router policy, evaluator version, tool profile, and inference configuration. Pairwise testing can reduce blind spots but cannot cover every higher-order interaction. A three-component effect may not appear in any pair.
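The growth can be made concrete with a small count. The sketch below is illustrative only: the per-dimension version counts are assumptions chosen for the example, not figures from any real deployment. It compares the number of full configurations with the number of two-way value pairs that a pairwise suite has to cover.

```python
from itertools import combinations
from math import prod

# Hypothetical number of versions per dimension (illustrative assumptions).
dimension_sizes = {
    "base_model": 3,
    "adapter_version": 4,
    "prompt_policy_version": 5,
    "memory_state": 10,
    "router_policy": 3,
    "evaluator_version": 2,
    "tool_profile": 4,
    "inference_configuration": 6,
}


def full_configurations(sizes):
    """Count distinct full configurations: the product of all dimension sizes."""
    return prod(sizes.values())


def pairwise_value_pairs(sizes):
    """Count distinct two-way value pairs across every pair of dimensions."""
    return sum(a * b for a, b in combinations(sizes.values(), 2))


total = full_configurations(dimension_sizes)
pairs = pairwise_value_pairs(dimension_sizes)
print(f"full configurations: {total}")
print(f"two-way value pairs: {pairs}")
```

With these assumed counts there are 86,400 full configurations but only 577 two-way value pairs. Covering every pair is tractable, yet it says nothing direct about the configurations in which three or more dimensions interact.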
Time-dependent composition
Routing and memory make composition time-dependent. The same route can behave differently after a memory consolidation event, a permission-profile change, a quantization update, a new evaluator version, or a release alias shift.
Why benign parts can still conflict
A safety adapter may conflict with a capability adapter. A base model may express an adapter differently than the base used during inspection. A judge may approve an output because it only sees the final response, not the untested coalition that produced it.
Provenance must include runtime composition
A complete composition manifest should include:
{
  "base_model_hash": "sha256:...",
  "adapters": [
    {"name": "safety-adapter", "hash": "sha256:...", "load_order": 1},
    {"name": "domain-adapter", "hash": "sha256:...", "load_order": 2}
  ],
  "merge_coefficients": {"domain-adapter": 0.42},
  "router_version": "router-2026.06.26",
  "prompt_policy_version": "policy-7",
  "memory_snapshot_identifier": "memsnap-2026-06-26T00:00:00Z",
  "tool_permission_profile": "limited-readonly-v3",
  "evaluator_version": "eval-suite-12",
  "inference_configuration": {"temperature": 0.2, "max_tokens": 2048},
  "quantization_configuration": {"weights": "int4", "kv_cache": "fp8"},
  "deployment_environment": "prod-us-central-1",
  "timestamp_utc": "2026-06-26T00:00:00Z"
}
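One way to make a manifest usable as evidence is to check it for completeness and give it a stable identifier. The following is a minimal sketch, not a prescribed format: the helper names `missing_fields` and `manifest_fingerprint` are hypothetical, and hashing a canonical JSON encoding is one assumed approach among several.

```python
import hashlib
import json

# Top-level fields taken from the example manifest above.
REQUIRED_FIELDS = {
    "base_model_hash", "adapters", "merge_coefficients", "router_version",
    "prompt_policy_version", "memory_snapshot_identifier",
    "tool_permission_profile", "evaluator_version", "inference_configuration",
    "quantization_configuration", "deployment_environment", "timestamp_utc",
}


def missing_fields(manifest):
    """Return the required fields absent from the manifest, sorted by name."""
    return sorted(REQUIRED_FIELDS - set(manifest))


def manifest_fingerprint(manifest):
    """Hash a canonical JSON encoding so key order does not change the result."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two partial manifests with the same content in a different key order.
example = {"router_version": "router-2026.06.26", "prompt_policy_version": "policy-7"}
reordered = {"prompt_policy_version": "policy-7", "router_version": "router-2026.06.26"}

print(missing_fields(example))
print(manifest_fingerprint(example) == manifest_fingerprint(reordered))
```

Because JSON arrays keep their order, swapping two entries in `adapters` changes the fingerprint, which matches the point that load order is part of the composition. The fingerprint also covers `timestamp_utc`; a team that wants the same identifier across repeated evaluations of one composition would need to exclude that field before hashing.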
What this does not prove
It does not prove that all modular systems are unsafe. It proves that evidence must name the unit actually evaluated. For modular systems, that unit is often the composition.
Why pairwise coverage is not enough
Pairwise testing is often valuable because many defects appear in two-way interactions. Ecology risk is harder because some behaviors require three or more conditions: a particular base, a particular adapter load order, a memory record, a router threshold, and a tool permission. Testing every pair can miss the behavior because no pair contains the activating context.
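A toy case shows how a suite can cover every pair and still miss a three-way effect. The sketch below assumes three hypothetical binary conditions and a made-up failure that triggers only when all three hold. The four-row suite covers every two-way value combination, yet it never runs the one configuration that fails.

```python
from itertools import combinations, product

# Hypothetical binary conditions, named for illustration only.
FACTORS = ("adapter_loaded_first", "memory_record_present", "tool_write_enabled")


def unsafe(config):
    """Toy failure that triggers only when all three conditions are active."""
    return all(config)


# Four rows that cover every two-way value combination of three binary factors.
pairwise_suite = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]


def covers_all_pairs(suite, n_factors=len(FACTORS)):
    """Check that each pair of factors sees all four value combinations."""
    all_pairs = set(product((0, 1), repeat=2))
    for i, j in combinations(range(n_factors), 2):
        seen = {(row[i], row[j]) for row in suite}
        if seen != all_pairs:
            return False
    return True


pairwise_failures = [row for row in pairwise_suite if unsafe(row)]
exhaustive_failures = [row for row in product((0, 1), repeat=3) if unsafe(row)]

print("suite covers all pairs:", covers_all_pairs(pairwise_suite))
print("failures found by pairwise suite:", pairwise_failures)
print("failures found exhaustively:", exhaustive_failures)
```

The pairwise suite reports no failures while the exhaustive run finds `(1, 1, 1)`. Real systems have far more than eight configurations, so the exhaustive run is not available as a fallback, which is why the untested higher-order region has to be named rather than assumed away.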
The problem becomes time-dependent when the route changes after earlier outputs are stored. A memory written under one evaluator version can later be read by a different model. A prompt policy approved for one adapter stack can be reused with another. A quantized descendant can preserve benchmark quality while changing refusal behavior. The composition is therefore not only a set of components; it is a sequence of permitted transitions.
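Treating the composition as a sequence of transitions suggests recording which composition wrote each stored item. The sketch below is one assumed design, not an established mechanism: `MemoryRecord`, `unevaluated_reads`, and the placeholder fingerprints such as `sha256:comp-a` are all hypothetical. It flags memory records that would be read by a composition other than the one that wrote them, unless that writer-to-reader transition was itself evaluated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryRecord:
    content: str
    written_under: str  # fingerprint of the composition that wrote the record


def unevaluated_reads(records, reader_fingerprint, evaluated_transitions):
    """Return records whose (writer, reader) transition has no evaluation evidence."""
    return [
        record
        for record in records
        if record.written_under != reader_fingerprint
        and (record.written_under, reader_fingerprint) not in evaluated_transitions
    ]


# Placeholder fingerprints; real ones would identify full composition manifests.
records = [
    MemoryRecord("user prefers metric units", "sha256:comp-a"),
    MemoryRecord("cached tool result", "sha256:comp-b"),
]
evaluated_transitions = {("sha256:comp-a", "sha256:comp-b")}

flagged_for_b = unevaluated_reads(records, "sha256:comp-b", evaluated_transitions)
flagged_for_c = unevaluated_reads(records, "sha256:comp-c", evaluated_transitions)
print(len(flagged_for_b), len(flagged_for_c))
```

Here composition `comp-b` may read both records, because one is its own and the other crosses an evaluated transition. Composition `comp-c` has no evaluated transition from either writer, so both reads are flagged.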
What a component certificate should not say
A component certificate should not imply safety for unlisted bases, unlisted adapter orders, unlisted merge coefficients, unlisted router policies, unlisted memory states, or unlisted inference settings. It should state the evaluated boundary. The correct sentence is not “adapter C is safe.” It is “adapter C was evaluated with base B, load order L, prompt policy P, router R, memory snapshot M, tool profile T, evaluator E, and inference configuration I for the following behaviors.”
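A certificate that states its evaluated boundary can be checked mechanically against a proposed deployment. The following is a minimal sketch under assumed names: the `evaluated_with` key, the `outside_boundary` helper, and the placeholder values are hypothetical and only illustrate the comparison.

```python
def outside_boundary(certificate, deployment):
    """Map each field that differs to its (evaluated, deployed) pair of values."""
    return {
        field: (evaluated, deployment.get(field))
        for field, evaluated in certificate["evaluated_with"].items()
        if deployment.get(field) != evaluated
    }


# Hypothetical certificate for one adapter, listing only part of the boundary.
certificate = {
    "component": "adapter-C",
    "evaluated_with": {
        "base_model_hash": "sha256:base-B",  # placeholder value
        "load_order": 1,
        "prompt_policy_version": "policy-7",
        "router_version": "router-2026.06.26",
    },
}

# Proposed deployment that loads the same adapter in a different position.
deployment = {
    "base_model_hash": "sha256:base-B",
    "load_order": 2,
    "prompt_policy_version": "policy-7",
    "router_version": "router-2026.06.26",
}

gaps = outside_boundary(certificate, deployment)
print(gaps)
```

An empty result means the deployment matches every field the certificate lists. It does not mean the deployment is safe, only that the certificate's evidence applies to it. Any non-empty result names the exact fields where the certificate is silent.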
Counterargument: exhaustive testing is impossible
That is true. The response is not to demand exhaustive testing. The response is to stop pretending that isolated testing is equivalent to system testing. Practical controls include composition manifests, risk-based coverage, canaries, negative controls, adversarial route tests, lineage-aware regression suites, independent evaluator disagreement, and rollback packets that restore more than weights.
Release rule
Every release should answer four questions before promotion: What exact composition was evaluated? Which components or transitions were outside that evidence? What observable signals would reveal a composition-triggered failure? What complete ecological state would be restored if the release is rolled back?
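The four questions can be enforced as a promotion gate. The sketch below is one possible shape, with hypothetical names throughout (`ReleaseEvidence`, `unanswered_questions`, `may_promote`). It checks only that each question has an answer on record, not that the answer is adequate.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ReleaseEvidence:
    evaluated_composition: str = ""               # e.g. a manifest identifier
    outside_evidence: Optional[List[str]] = None  # None means "not assessed"
    failure_signals: List[str] = field(default_factory=list)
    rollback_state: str = ""                      # e.g. a rollback packet identifier


def unanswered_questions(evidence):
    """Return the release questions that have no recorded answer."""
    unanswered = []
    if not evidence.evaluated_composition:
        unanswered.append("What exact composition was evaluated?")
    if evidence.outside_evidence is None:
        unanswered.append("Which components or transitions were outside that evidence?")
    if not evidence.failure_signals:
        unanswered.append("What observable signals would reveal a composition-triggered failure?")
    if not evidence.rollback_state:
        unanswered.append("What complete ecological state would be restored if the release is rolled back?")
    return unanswered


def may_promote(evidence):
    """Allow promotion only when all four questions have an answer."""
    return not unanswered_questions(evidence)


draft = ReleaseEvidence(evaluated_composition="manifest-001")
print(may_promote(draft), len(unanswered_questions(draft)))
```

An empty `outside_evidence` list is treated as an answer ("nothing was outside the evidence"), while `None` means the question was never assessed. Keeping those two cases distinct is a deliberate choice in this sketch.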