Adaptation that stays within bounds
An adaptive system that can drift without limit is not safe to deploy. The policy engine is the mechanism that makes adaptation bounded and inspectable: it converts evaluated outcomes into policy-vector updates, clamps those updates so no single event causes a large behavioral shift, persists policy snapshots, and emits policy.updated records so every change is observable.
The goal is controlled drift: a system that learns from experience without oscillating or diverging.
What the policy vector represents
The policy vector encodes behavioral weights across several dimensions. Three are particularly interesting because they interact in non-obvious ways: explorationFactor, riskBias, and methodTrust.
explorationFactor controls how willing the system is to try unfamiliar approaches when the current strategy could be improved. A high value means the system readily explores; a low value means it sticks to what it knows.
riskBias controls how strongly the system prefers safe, conservative actions over potentially high-reward but uncertain ones.
methodTrust controls confidence in established methods and known-good procedures.
These three dimensions are not independent. When one shifts, the others shift too, because they all respond to the same stream of reinforcement signals. Understanding how they move together is the key to understanding how policy drift works.
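As a minimal sketch of the data structure, assuming each weight lives in the range [0, 1] (the text does not state the range), the three dimensions can be modelled as an immutable record that is shifted by a single delta. The `PolicyVector` class, the `shifted` method, and the numeric values are illustrative assumptions, not the engine's actual API.

```python
from dataclasses import dataclass
from typing import Mapping


def clamp01(value: float) -> float:
    """Keep a weight inside the assumed [0, 1] range."""
    return max(0.0, min(1.0, value))


@dataclass(frozen=True)
class PolicyVector:
    # Field names mirror the three dimensions discussed above.
    explorationFactor: float
    riskBias: float
    methodTrust: float

    def shifted(self, delta: Mapping[str, float]) -> "PolicyVector":
        """Return a new vector with each named dimension moved by its delta."""
        return PolicyVector(
            explorationFactor=clamp01(self.explorationFactor + delta.get("explorationFactor", 0.0)),
            riskBias=clamp01(self.riskBias + delta.get("riskBias", 0.0)),
            methodTrust=clamp01(self.methodTrust + delta.get("methodTrust", 0.0)),
        )


# One reinforcement signal produces one delta that touches all three dimensions.
# The numbers here are made up for illustration.
before = PolicyVector(explorationFactor=0.5, riskBias=0.5, methodTrust=0.5)
after = before.shifted({"explorationFactor": -0.004, "riskBias": 0.002, "methodTrust": 0.003})
```

Returning a new vector rather than mutating in place keeps every prior state available for comparison, which fits the snapshot-based design described later.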
What Experiment 12 revealed
Experiment 12 ran two conditions over 100 turns against a mixed corpus of memories, tracking how all three policy dimensions evolved:
Condition 1: corpus mix (positive-dominant signals, roughly the distribution of a healthy system with mostly successful actions). After 100 turns, explorationFactor had drifted down, while riskBias and methodTrust had both drifted up.
Condition 2: contradiction-heavy (a high rate of signals involving contradictions, failed actions, and unreliable memories). After 100 turns, explorationFactor had collapsed, while riskBias and methodTrust had both risen.
The pattern is consistent: both conditions reduce exploration, increase risk aversion, and increase method trust. But contradiction-heavy signals produce a much more dramatic collapse in explorationFactor.
Why contradictions suppress exploration more than failures
In the update formula, the reward is multiplied by a bracketed term that grows with the contradiction signal.
This formula has a structural property that makes contradiction episodes particularly suppressive. A contradiction episode produces a negative or near-zero reward combined with a high contradiction signal. A negative reward times a large contradiction bracket produces a large negative update to explorationFactor.
A pure failure with no contradiction (negative reward, low contradiction signal) produces a smaller negative update because the contradiction bracket is smaller.
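The asymmetry can be illustrated with a hypothetical update of the same shape: a reward multiplied by a bracket that grows with contradiction. The function name, the `learning_rate` and `contradiction_weight` parameters, and the exact form of the bracket are assumptions made for this sketch; they are not the engine's actual formula.

```python
def exploration_delta(
    reward: float,
    contradiction: float,
    learning_rate: float = 0.002,       # assumed value, for illustration only
    contradiction_weight: float = 2.0,  # assumed value, for illustration only
) -> float:
    """Hypothetical update: reward scaled by a bracket that grows with contradiction."""
    bracket = 1.0 + contradiction_weight * contradiction
    return learning_rate * reward * bracket


# Same negative reward, different contradiction levels.
pure_failure = exploration_delta(reward=-0.5, contradiction=0.1)
contradiction_episode = exploration_delta(reward=-0.5, contradiction=0.9)

# The contradiction episode pushes explorationFactor down harder.
print(pure_failure, contradiction_episode)
```

Under these assumptions the two updates differ only through the bracket, which is why the same failure costs more exploration when it arrives together with contradictory evidence.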
The system is encoding a specific logic: contradiction is different from failure. A failure means the strategy did not work this time. A contradiction means something believed to be true turned out to be false. Contradiction is evidence that the system's model of the world is wrong, not just that an action misfired. Reducing exploration in response to contradiction makes sense: if you cannot trust your mental model, exploring based on that model is risky. Exploit what you know works until your model is repaired.
This is a behavioral safety property: the system becomes more conservative and cautious when it encounters contradictory evidence, rather than more exploratory. Exploration resumes naturally as contradiction signals diminish.
Bounded drift and per-step magnitude
The drift magnitudes in Experiment 12 are instructive. Over 100 turns, the maximum observed per-step drift was 0.005, well below the configured maximum of 0.08. This is consistent with the clamping mechanism working as designed: each individual outcome can shift the policy vector only by a small amount, regardless of how extreme the signal is.
This property prevents a single dramatic event (a catastrophic failure, an unusually good outcome) from rewriting the system's behavioral profile. It also means that policy change is always visible as a gradual trend in the audit log rather than as a sudden jump.
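A minimal sketch of per-step clamping, assuming the 0.08 limit is applied to each dimension independently (the text does not say whether the limit is per dimension or on the vector as a whole). The names `MAX_STEP`, `clamp_step`, and `clamp_update` are hypothetical.

```python
from typing import Dict

MAX_STEP = 0.08  # configured per-step maximum quoted in the text


def clamp_step(delta: float, max_step: float = MAX_STEP) -> float:
    """Limit a single proposed change to the range [-max_step, +max_step]."""
    return max(-max_step, min(max_step, delta))


def clamp_update(raw: Dict[str, float], max_step: float = MAX_STEP) -> Dict[str, float]:
    """Clamp every dimension of a proposed update (per-dimension clamping is an assumption)."""
    return {name: clamp_step(delta, max_step) for name, delta in raw.items()}


# An extreme signal proposes a large shift; the clamp caps it.
proposed = {"explorationFactor": -0.5, "riskBias": 0.003}
applied = clamp_update(proposed)
```

In this sketch a small update passes through unchanged, while a proposed shift of -0.5 is cut to -0.08, so no single outcome can move a dimension by more than the configured limit.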
The practical implication for deployments: a policy vector that is drifting in a consistently bad direction (say, explorationFactor approaching zero while methodTrust climbs toward saturation) is detectable well before the drift reaches an operationally concerning level. The audit trail of policy.updated events provides the data needed to intervene.
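One way an operator might watch for this is to scan recent policy.updated events for a sustained net change. The sketch below assumes each event is a dictionary with a `type` field and a `policy` field holding the full vector; that shape, the `drift_alerts` name, and the `window` and `threshold` defaults are all assumptions, not the engine's actual event schema.

```python
from typing import Dict, List


def drift_alerts(events: List[dict], window: int = 20, threshold: float = 0.05) -> Dict[str, float]:
    """Report dimensions whose net change over the last `window` updates reaches `threshold`."""
    updates = [e for e in events if e.get("type") == "policy.updated"]
    recent = updates[-window:]
    if len(recent) < 2:
        return {}
    first, last = recent[0]["policy"], recent[-1]["policy"]
    alerts = {}
    for name, start_value in first.items():
        net = last[name] - start_value
        if abs(net) >= threshold:
            alerts[name] = net
    return alerts


# Hypothetical audit log: explorationFactor falls by 0.005 per turn, methodTrust is flat.
log = [
    {"type": "policy.updated",
     "policy": {"explorationFactor": 0.5 - 0.005 * i, "methodTrust": 0.5}}
    for i in range(20)
]
alerts = drift_alerts(log)
```

Because each step is small, a check like this fires on the accumulated trend long before any single update looks alarming.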
Policy state persistence and snapshots
Each policy update is persisted as a versioned snapshot. The snapshot records the new policy vector, the delta applied, the outcome event that triggered the update, and the timestamp.
Snapshots serve two purposes. First, they make it possible to compare current policy to any prior version, allowing operators to diagnose when and why drift began. Second, they make rollback possible: if a series of updates produces an undesirable policy state, the system can be restored to a prior snapshot.
The snapshot structure is also what makes it possible to answer "why did the system do that?" for any past action. Given a timestamp and a session ID, the active policy vector at that moment can be reconstructed by replaying the snapshot history. Combined with the memory retrieval record for that session, this gives a complete causal account of the context in which a decision was made.
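The snapshot behaviour described above can be sketched as an append-only store. This is an illustration under stated assumptions: the `Snapshot` and `SnapshotStore` names, the field names, and the choice to record a rollback as a new snapshot are hypothetical, and session IDs are omitted for brevity. Since each snapshot here records the full vector, finding the active policy at a timestamp reduces to a lookup rather than a replay of deltas.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Snapshot:
    version: int
    timestamp: float
    policy: Dict[str, float]   # the new policy vector
    delta: Dict[str, float]    # the delta that was applied
    outcome_event_id: str      # the outcome event that triggered the update


class SnapshotStore:
    def __init__(self) -> None:
        self._snapshots: List[Snapshot] = []

    def append(self, timestamp: float, policy: Dict[str, float],
               delta: Dict[str, float], outcome_event_id: str) -> Snapshot:
        """Persist a new version; timestamps must not go backwards."""
        if self._snapshots and timestamp < self._snapshots[-1].timestamp:
            raise ValueError("snapshots must be appended in time order")
        snap = Snapshot(len(self._snapshots) + 1, timestamp,
                        dict(policy), dict(delta), outcome_event_id)
        self._snapshots.append(snap)
        return snap

    def active_at(self, timestamp: float) -> Optional[Snapshot]:
        """Return the latest snapshot at or before `timestamp`, if any."""
        times = [s.timestamp for s in self._snapshots]
        index = bisect_right(times, timestamp)
        return self._snapshots[index - 1] if index else None

    def rollback_to(self, version: int, timestamp: float) -> Snapshot:
        """Restore an earlier policy by appending it as a new version."""
        if not 1 <= version <= len(self._snapshots):
            raise ValueError("unknown snapshot version")
        target = self._snapshots[version - 1]
        current = self._snapshots[-1]
        delta = {k: target.policy[k] - current.policy[k] for k in target.policy}
        return self.append(timestamp, target.policy, delta, f"rollback:{version}")


store = SnapshotStore()
store.append(1.0, {"explorationFactor": 0.50}, {"explorationFactor": 0.0}, "evt-1")
store.append(2.0, {"explorationFactor": 0.49}, {"explorationFactor": -0.01}, "evt-2")
store.append(3.0, {"explorationFactor": 0.48}, {"explorationFactor": -0.01}, "evt-3")
```

Recording a rollback as a new version, rather than deleting later snapshots, keeps the history intact so the rollback itself remains auditable.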
Policy convergence is not guaranteed by the implementation
The engine provides bounded updates and observable drift. It does not guarantee that policy will converge to a useful state, improve task performance, or produce stable long-term behavior. Those outcomes depend on the quality and representativeness of the outcome signals the reinforcement engine produces.
If reinforcement signals are biased (consistently rewarding a bad strategy because the evaluation metric is wrong), the policy will drift in the direction of that bad strategy with perfect bounded regularity. The bounded drift property means the damage accumulates slowly; it does not prevent the accumulation.
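A small simulation makes the point concrete. It assumes a weight in [0, 1], a per-step clamp of 0.08, and a biased signal that pushes the same way every turn; the function name and the chosen numbers are illustrative only.

```python
from typing import List


def simulate_biased_drift(start: float, biased_delta: float, steps: int,
                          max_step: float = 0.08) -> List[float]:
    """Apply the same clamped delta every turn and record the trajectory."""
    value = start
    history = [value]
    for _ in range(steps):
        step = max(-max_step, min(max_step, biased_delta))  # per-step clamp
        value = max(0.0, min(1.0, value + step))            # assumed [0, 1] range
        history.append(value)
    return history


# A consistently biased signal of +0.005 per turn, applied for 100 turns.
trajectory = simulate_biased_drift(start=0.5, biased_delta=0.005, steps=100)
largest_step = max(abs(b - a) for a, b in zip(trajectory, trajectory[1:]))
```

Every step in this trajectory stays far below the clamp, yet the weight still ends at or near the top of its range: bounded steps limit the rate of drift, not its destination.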
This is why reinforcement and policy are separate from each other and from the evaluation framework. Each can be inspected and corrected independently. A bad outcome metric is a problem to fix in the evaluator, not by adding constraints to the policy update formula.
The relationship to identity and constitution
Policy drift is the raw material of behavioral change. But repeated drift in the same direction eventually shapes identity (the slowly-updating self-model of the agent's behavioral tendencies) and may conflict with constitutional invariants (the safety constraints that must not be overridden by learning).
Later stages in this series address both. The identity formation system tracks the long-term trajectory of policy drift and builds a narrative account of behavioral continuity. The constitutional stability layer monitors whether drift is producing unsafe changes and can quarantine proposed updates that would violate foundational constraints.
The policy engine is where change happens. Identity and constitution are where change is observed, understood, and constrained. Together they form the adaptation stack: learn from outcomes, track what kind of system the learning is producing, and ensure foundational safety conditions survive the learning process.