Steve Hutchinson · Big Pines
6 min read · Stage 5 · Cognitive Substrate

Policy Engine

The policy engine provides bounded behavioral drift, converting evaluated outcomes into clamped policy-vector updates and emitting inspectable adaptation records.

Adaptation that stays within bounds

Policy update cycle: evaluated outcomes produce a clamped reward delta that updates the versioned policy store, biasing future actions.

An adaptive system that can drift without limit is not safe to deploy. The policy engine is the mechanism that makes adaptation bounded and inspectable: it converts evaluated outcomes into policy-vector updates, clamps those updates so no single event causes a large behavioral shift, persists policy snapshots, and emits policy.updated records so every change is observable.

The goal is controlled drift: a system that learns from experience without oscillating or diverging.

What the policy vector represents

The policy vector encodes behavioral weights across several dimensions. Three are particularly interesting because they interact in non-obvious ways: explorationFactor, riskBias, and methodTrust.

explorationFactor controls how willing the system is to try unfamiliar approaches when the current strategy could be improved. A high value means the system readily explores; a low value means it sticks to what it knows.

riskBias controls how strongly the system prefers safe, conservative actions over potentially high-reward but uncertain ones.

methodTrust controls confidence in established methods and known-good procedures.

These three dimensions are not independent. When one shifts, the others shift too, because they all respond to the same stream of reinforcement signals. Understanding how they move together is the key to understanding how policy drift works.
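As a rough illustration, the three dimensions could be held in a small immutable record. This is a sketch, not the engine's actual type: the class name `PolicyVector`, the `[0, 1]` range, and the validation are assumptions; the 0.5 defaults mirror the starting values reported for Experiment 12 below.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class PolicyVector:
    """Hypothetical container for the three dimensions discussed here."""

    explorationFactor: float = 0.5  # willingness to try unfamiliar approaches
    riskBias: float = 0.5           # preference for safe, conservative actions
    methodTrust: float = 0.5        # confidence in established methods

    def __post_init__(self):
        # Assumption: every weight lives in [0, 1].
        for name, value in asdict(self).items():
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")


baseline = PolicyVector()
```

Making the record immutable means every update has to produce a new vector, which fits naturally with versioned snapshots.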

What Experiment 12 revealed

Experiment 12 ran two conditions over 100 turns against a mixed corpus of memories, tracking how all three policy dimensions evolved:

Condition 1: corpus mix (positive-dominant signals, roughly the distribution of a healthy system with mostly successful actions). After 100 turns, explorationFactor drifted from 0.500 to 0.428 (down 0.072). riskBias drifted from 0.500 to 0.608 (up 0.108). methodTrust drifted from 0.500 to 0.730 (up 0.230).

Condition 2: contradiction-heavy (high rate of signals involving contradictions, failed actions, and unreliable memories). After 100 turns, explorationFactor collapsed from 0.500 to 0.086 (down 0.411). riskBias rose to 0.746. methodTrust rose to 0.840.

The pattern is consistent: both conditions reduce exploration, increase risk aversion, and increase method trust. But contradiction-heavy signals produce a much more dramatic collapse in explorationFactor.

Why contradictions suppress exploration more than failures

The explorationFactor update formula is:

$$\Delta\,\text{explorationFactor} = \text{rewardDelta} \times (0.5 - \text{confidence} + \text{contradictionRisk}) \times 0.08$$

This formula has a structural property that makes contradiction episodes particularly suppressive. A contradiction episode produces a negative or near-zero rewardDelta combined with a high contradictionRisk. A negative reward times a large contradiction bracket produces a large negative update to explorationFactor.

A pure failure with no contradiction (negative rewardDelta, low contradictionRisk) produces a smaller negative update because the contradiction bracket is smaller.
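To make the difference concrete, here is a small worked example of the formula. The input values are illustrative assumptions, not data from Experiment 12, and the function name `exploration_delta` is hypothetical.

```python
def exploration_delta(reward_delta, confidence, contradiction_risk, rate=0.08):
    """Update to explorationFactor, following the formula given in the text."""
    bracket = 0.5 - confidence + contradiction_risk
    return reward_delta * bracket * rate


# Same negative reward and confidence; only contradictionRisk differs.
contradiction = exploration_delta(
    reward_delta=-0.05, confidence=0.4, contradiction_risk=0.9
)
pure_failure = exploration_delta(
    reward_delta=-0.05, confidence=0.4, contradiction_risk=0.1
)

print(f"contradiction: {contradiction:.4f}")
print(f"pure failure:  {pure_failure:.4f}")
```

With these inputs the bracket is about 1.0 for the contradiction and about 0.2 for the pure failure, so the contradiction update (about -0.004) is roughly five times larger than the failure update (about -0.0008). Note that in this form the bracket goes negative whenever confidence exceeds 0.5 + contradictionRisk, which flips the sign of the update.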

The system is encoding a specific logic: contradiction is different from failure. A failure means the strategy did not work this time. A contradiction means something believed to be true turned out to be false. Contradiction is evidence that the system's model of the world is wrong, not just that an action misfired. Reducing exploration in response to contradiction makes sense: if you cannot trust your mental model, exploring based on that model is risky. Exploit what you know works until your model is repaired.

This is a behavioral safety property: the system becomes more conservative and cautious when it encounters contradictory evidence, rather than more exploratory. Exploration resumes naturally as contradiction signals diminish.

Bounded drift and per-step magnitude

The drift magnitudes in Experiment 12 are instructive. Over 100 turns, the maximum observed per-step drift was 0.005, well within the configured maximum of 0.08. This is the clamping mechanism working as designed: each individual outcome can only shift the policy vector a small amount, regardless of how extreme the signal.
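A minimal sketch of per-step clamping follows. The text does not say whether the engine clamps the reward delta, the resulting policy delta, or both; this sketch assumes the policy delta is clamped to the configured maximum of 0.08 and that each weight is kept in `[0, 1]`. The function names are hypothetical.

```python
MAX_STEP = 0.08  # configured per-step maximum cited in the text


def clamp(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def apply_update(current, raw_delta, max_step=MAX_STEP):
    """Apply one clamped update to a single policy dimension."""
    step = clamp(raw_delta, -max_step, max_step)
    # Assumption: weights are bounded to [0, 1].
    return clamp(current + step, 0.0, 1.0)


extreme = apply_update(0.5, 10.0)     # an extreme signal still moves at most 0.08
typical = apply_update(0.5, -0.005)   # a step the size of the observed maximum
```

However large `raw_delta` is, the stored value moves by no more than `max_step` in a single update.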

This property prevents a single dramatic event (a catastrophic failure, an unusually good outcome) from rewriting the system's behavioral profile. It also means that policy change is always visible as a gradual trend in the audit log rather than as a sudden jump.

The practical implication for deployments: a policy vector that is drifting in a consistently bad direction (say, explorationFactor approaching zero while methodTrust climbs toward saturation) is detectable well before the drift reaches an operationally concerning level. The audit trail of policy.updated events provides the data needed to intervene.
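One way an operator might scan the audit trail is sketched below. It assumes each policy.updated record can be reduced to a dictionary of weights in chronological order; the thresholds and the function name `first_concerning_turn` are hypothetical, and the history is synthetic.

```python
def first_concerning_turn(history, floor=0.15, ceiling=0.80):
    """Return the index of the first record where explorationFactor is
    below `floor` while methodTrust is above `ceiling`, or None."""
    for turn, vector in enumerate(history):
        if vector["explorationFactor"] < floor and vector["methodTrust"] > ceiling:
            return turn
    return None


# Synthetic history for illustration only.
history = [
    {"explorationFactor": 0.50, "methodTrust": 0.50},
    {"explorationFactor": 0.30, "methodTrust": 0.70},
    {"explorationFactor": 0.12, "methodTrust": 0.82},
    {"explorationFactor": 0.09, "methodTrust": 0.84},
]

first_alert = first_concerning_turn(history)
```

Because each step is small, a threshold set well short of the operationally concerning level leaves many turns in which to intervene.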

Policy state persistence and snapshots

Each policy update is persisted as a versioned snapshot. The snapshot records the new policy vector, the delta applied, the outcome event that triggered the update, and the timestamp.
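A snapshot with those four pieces of information might look like the following. The field names, the integer version counter, and the helper `make_snapshot` are assumptions for illustration; the actual storage schema is not described here.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class PolicySnapshot:
    """Hypothetical snapshot record; field names are assumptions."""

    version: int
    policy: dict           # the new policy vector
    delta: dict            # the delta applied in this update
    outcome_event_id: str  # the outcome event that triggered the update
    timestamp: datetime


def make_snapshot(prev_version, old_policy, delta, outcome_event_id, now=None):
    """Build the next snapshot from the previous policy and an (already clamped) delta."""
    new_policy = {key: old_policy[key] + delta.get(key, 0.0) for key in old_policy}
    return PolicySnapshot(
        version=prev_version + 1,
        policy=new_policy,
        delta=dict(delta),
        outcome_event_id=outcome_event_id,
        timestamp=now or datetime.now(timezone.utc),
    )


snap = make_snapshot(
    prev_version=0,
    old_policy={"explorationFactor": 0.5, "riskBias": 0.5, "methodTrust": 0.5},
    delta={"explorationFactor": -0.004},
    outcome_event_id="outcome-001",  # hypothetical identifier
)
```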

Snapshots serve two purposes. First, they make it possible to compare current policy to any prior version, allowing operators to diagnose when and why drift began. Second, they make rollback possible: if a series of updates produces an undesirable policy state, the system can be restored to a prior snapshot.

The snapshot structure is also what makes it possible to answer "why did the system do that?" for any past action. Given a timestamp and a session ID, the active policy vector at that moment can be reconstructed by replaying the snapshot history. Combined with the memory retrieval record for that session, this gives a complete causal account of the context in which a decision was made.
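If each snapshot stores the full vector, finding the policy active at a given moment reduces to locating the latest snapshot at or before that time. The sketch below assumes snapshots are `(timestamp, policy)` pairs sorted by timestamp, uses integer timestamps for simplicity, and leaves out session filtering; `policy_at` is a hypothetical name.

```python
from bisect import bisect_right


def policy_at(snapshots, when):
    """Return the policy from the latest snapshot with timestamp <= when,
    or None if `when` precedes the first snapshot."""
    times = [timestamp for timestamp, _ in snapshots]
    index = bisect_right(times, when)
    if index == 0:
        return None
    return snapshots[index - 1][1]


# Synthetic snapshot history for illustration only.
snapshots = [
    (100, {"explorationFactor": 0.500}),
    (200, {"explorationFactor": 0.496}),
    (300, {"explorationFactor": 0.491}),
]

active = policy_at(snapshots, 250)
```

The same lookup gives the target state for a rollback: pick the timestamp before the unwanted drift began and restore the vector it returns.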

Policy convergence is not guaranteed by the implementation

The engine provides bounded updates and observable drift. It does not guarantee that policy will converge to a useful state, improve task performance, or produce stable long-term behavior. Those outcomes depend on the quality and representativeness of the outcome signals the reinforcement engine produces.

If reinforcement signals are biased (consistently rewarding a bad strategy because the evaluation metric is wrong), the policy will drift in the direction of that bad strategy with perfect bounded regularity. The bounded drift property means the damage accumulates slowly; it does not prevent the accumulation.
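A toy simulation shows how small, consistently biased steps add up. The constant per-turn delta of 0.004 is a hypothetical stand-in for a biased evaluator, and the `[0, 1]` bounds are an assumption.

```python
def simulate(turns, raw_delta, max_step=0.08, start=0.5):
    """Apply the same clamped delta every turn; return the final value
    and the largest single-step change observed."""
    value = start
    largest_step = 0.0
    for _ in range(turns):
        step = max(-max_step, min(max_step, raw_delta))
        new_value = max(0.0, min(1.0, value + step))
        largest_step = max(largest_step, abs(new_value - value))
        value = new_value
    return value, largest_step


final_value, largest_step = simulate(turns=100, raw_delta=0.004)
print(f"final: {final_value:.3f}, largest step: {largest_step:.3f}")
```

Every step stays far below the 0.08 cap, yet after 100 turns the weight has moved from 0.5 to roughly 0.9. Clamping limits the rate of change, not its direction.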

This is why reinforcement and policy are separate from each other and from the evaluation framework. Each can be inspected and corrected independently. A bad outcome metric is a problem to fix in the evaluator, not by adding constraints to the policy update formula.

The relationship to identity and constitution

Policy drift is the raw material of behavioral change. But repeated drift in the same direction eventually shapes identity (the slowly-updating self-model of the agent's behavioral tendencies) and may conflict with constitutional invariants (the safety constraints that must not be overridden by learning).

Later stages in this series address both. The identity formation system tracks the long-term trajectory of policy drift and builds a narrative account of behavioral continuity. The constitutional stability layer monitors whether drift is producing unsafe changes and can quarantine proposed updates that would violate foundational constraints.

The policy engine is where change happens. Identity and constitution are where change is observed, understood, and constrained. Together they form the adaptation stack: learn from outcomes, track what kind of system the learning is producing, and ensure foundational safety conditions survive the learning process.
