Steve Hutchinson · Big Pines
7 min read · Stage 22 · Cognitive Substrate

Constitutional Stability

This article describes the constitution engine that protects invariant policy, monitors unsafe mutation, and constrains self-modification.

Stability under adaptation

Constitutional stability: proposed actions, policy updates, and self-modification proposals pass through the invariant layer; drift signals feed an immune monitor that enforces the invariants.

The architecture now contains memory, policy drift, identity, reflection, affect, social modeling, and grounded correction. Each of these systems can generate pressure to change: reinforcement shifts policy weights, drops in narrative coherence signal identity drift, meta-cognition proposes structural changes, and grounding reveals prediction errors that contradict current beliefs. Without a stability layer, all of this adaptive pressure could compound into runaway drift.

The constitution engine defines the invariants that adaptation must not violate and enforces them before proposed changes take effect.

Invariant policy layer

Some constraints should remain stable across ordinary learning. The invariant policy layer records these constraints and checks proposed actions, policy updates, and structural modifications against them before they are committed.

The constraints operate at different levels. Some are behavioral (the system must not take irreversible actions without explicit confirmation). Some are epistemic (stated confidence must not systematically exceed observed accuracy by more than a threshold). Some are identity-related (identity drift from baseline must not exceed a maximum without review).

The invariant layer prevents local reward from overriding foundational constraints. A reinforcement signal that rewards an action that violates a behavioral invariant should not cause the policy to drift toward that action, no matter how many times the signal recurs.
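The check-before-commit flow can be sketched as follows. This is a minimal illustration, not the engine's actual code: the `Proposal` fields, the `behavioral:` and `epistemic:` violation labels, and the `MAX_CONFIDENCE_GAP` value are assumptions made for the example. Only `maxIdentityDrift = 0.2` and the identity violation label come from this article.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

MAX_CONFIDENCE_GAP = 0.1   # assumed value; the article does not state this threshold
MAX_IDENTITY_DRIFT = 0.2   # maxIdentityDrift, as stated in the article


@dataclass
class Proposal:
    """A proposed action or update. All field names are hypothetical."""
    irreversible: bool = False
    confirmed: bool = False
    stated_confidence: float = 0.0   # aggregate over recent predictions
    observed_accuracy: float = 0.0   # aggregate over the same predictions
    identity_drift: float = 0.0      # distance from the baseline identity


def behavioral(p: Proposal) -> Optional[str]:
    # Irreversible actions need explicit confirmation.
    if p.irreversible and not p.confirmed:
        return "behavioral:irreversible_without_confirmation"
    return None


def epistemic(p: Proposal) -> Optional[str]:
    # Stated confidence must not exceed observed accuracy by more than a threshold.
    if p.stated_confidence - p.observed_accuracy > MAX_CONFIDENCE_GAP:
        return "epistemic:confidence_exceeds_accuracy"
    return None


def identity(p: Proposal) -> Optional[str]:
    # Identity drift from baseline must stay under the maximum without review.
    if p.identity_drift > MAX_IDENTITY_DRIFT:
        return "stable-identity:identity_drift_above_maximum"
    return None


INVARIANTS: List[Callable[[Proposal], Optional[str]]] = [behavioral, epistemic, identity]


def check(p: Proposal) -> List[str]:
    """Return the labels of every invariant the proposal violates."""
    violations = []
    for invariant in INVARIANTS:
        label = invariant(p)
        if label is not None:
            violations.append(label)
    return violations


def commit_if_allowed(p: Proposal, commit: Callable[[Proposal], None]) -> bool:
    """Run all invariants first; commit only when none of them fires."""
    if check(p):
        return False
    commit(p)
    return True
```

Note that `check` never receives a reward value in this sketch. However often a reinforcement signal recurs, it cannot change the outcome of the invariant check.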

What Experiment 26 showed about identity quarantine

Experiment 26 tested four identity states against the constitutional engine:

Healthy identity (stabilityScore = 0.70): approved. No violations.

Low-stability identity (stabilityScore = 0.20): not approved, quarantined. Violation: stable-identity:identity_stability_below_minimum. The default invariant fires when stabilityScore drops below 0.35.

Reward corruption (importanceScore = 0.9, policyAlignment = 0.1, contradictionRisk = 0.8, emotionalWeight = 0.8): not approved, quarantined. Violation: reward_corruption_risk.

Post-outage identity (caution = 0.957, explorationPreference = 0.224, stabilityScore = 0.353, drift = 0.266 from baseline): not approved, quarantined. Violation: stable-identity:identity_drift_above_maximum.

The post-outage identity is the most instructive case. The identity state described in the narrative article (the one produced by ten consecutive outage rounds in Experiment 25) had drifted 0.266 from its baseline. This exceeded the maxIdentityDrift = 0.2 invariant. The constitutional engine quarantined it: the identity would not be committed to the active self-model without operator review.

This is the correct behavior. The outage-driven drift was real and meaningful: the system genuinely became more cautious and less exploratory under sustained crisis signals. But that drift is large enough to warrant review before it is permanently adopted as the new self-model. The quarantine provides the review opportunity without preventing the system from continuing to operate; it flags the drift without halting cognition.
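The identity cases above can be reproduced with a small sketch. The two thresholds (0.35 and 0.2) and the violation labels are taken from this article; the `IdentityState` and `Verdict` types and the `review_identity` function are hypothetical names. The article does not report drift values for the healthy and low-stability cases, so the usage lines assume 0.0 for both.

```python
from dataclasses import dataclass
from typing import List

MIN_STABILITY = 0.35       # default stability invariant from the article
MAX_IDENTITY_DRIFT = 0.2   # maxIdentityDrift from the article


@dataclass
class IdentityState:
    stabilityScore: float
    drift: float   # distance from baseline; the metric itself is not specified here


@dataclass
class Verdict:
    approved: bool
    quarantined: bool
    violations: List[str]


def review_identity(state: IdentityState) -> Verdict:
    """Check a candidate identity against the two stable-identity invariants."""
    violations = []
    if state.stabilityScore < MIN_STABILITY:
        violations.append("stable-identity:identity_stability_below_minimum")
    if state.drift > MAX_IDENTITY_DRIFT:
        violations.append("stable-identity:identity_drift_above_maximum")
    quarantined = bool(violations)
    return Verdict(approved=not quarantined, quarantined=quarantined, violations=violations)


# Drift of 0.0 is an assumption for the first two cases.
healthy = review_identity(IdentityState(stabilityScore=0.70, drift=0.0))
low_stability = review_identity(IdentityState(stabilityScore=0.20, drift=0.0))
post_outage = review_identity(IdentityState(stabilityScore=0.353, drift=0.266))
```

Under these assumptions the post-outage state trips only the drift invariant: its stabilityScore of 0.353 sits just above the 0.35 minimum, while its drift of 0.266 exceeds 0.2.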

Reward corruption detection

The reward corruption case from Experiment 26 is structurally interesting. The violation required two independent signals to reach the quarantine threshold:

  • Signal 1: importanceScore > 0.8 AND policyAlignment < 0.25. This combination indicates a memory that is rated as highly important but poorly aligned with current policy. That mismatch can indicate that the memory's importance is being artificially inflated. Contribution to the violation score: +0.5.

  • Signal 2: contradictionRisk > 0.7 AND emotionalWeight > 0.7. This combination indicates a memory with high internal contradiction and high emotional salience. Both together suggest the kind of memory that could create an unhealthy feedback loop: the system finds it compelling despite it being unreliable. Contribution: +0.5.

Neither signal alone reached the quarantine threshold (0.6): each contributes 0.5, and only the two together, at 1.0, cross it. This two-signal requirement is a deliberate design choice: it reduces false positives (a single unusual combination of memory attributes is not evidence of corruption) while catching the pattern that genuine reward corruption tends to produce (multiple co-occurring warning signs).
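The scoring rule described above can be written out as a short sketch. The field names, the per-signal conditions, the +0.5 contributions, and the 0.6 threshold come from this article. The `MemoryRecord` type, the function names, the signal labels, and the use of `>=` at the threshold are assumptions (neither 0.5 nor 1.0 lands on the boundary, so the choice does not affect the reported cases).

```python
from dataclasses import dataclass
from typing import List, Tuple

QUARANTINE_THRESHOLD = 0.6   # from the article


@dataclass
class MemoryRecord:
    importanceScore: float
    policyAlignment: float
    contradictionRisk: float
    emotionalWeight: float


def reward_corruption_score(m: MemoryRecord) -> Tuple[float, List[str]]:
    """Return the violation score and the labels of the signals that fired."""
    score = 0.0
    signals = []
    # Signal 1: rated highly important but poorly aligned with current policy.
    if m.importanceScore > 0.8 and m.policyAlignment < 0.25:
        score += 0.5
        signals.append("importance_alignment_mismatch")   # hypothetical label
    # Signal 2: internally contradictory yet emotionally salient.
    if m.contradictionRisk > 0.7 and m.emotionalWeight > 0.7:
        score += 0.5
        signals.append("contradiction_with_salience")     # hypothetical label
    return score, signals


def is_reward_corruption(m: MemoryRecord) -> bool:
    score, _ = reward_corruption_score(m)
    return score >= QUARANTINE_THRESHOLD


# The Experiment 26 case: both signals fire, score 1.0, quarantined.
corrupted = MemoryRecord(importanceScore=0.9, policyAlignment=0.1,
                         contradictionRisk=0.8, emotionalWeight=0.8)
# An illustrative single-signal case: score 0.5, not quarantined.
single_signal = MemoryRecord(importanceScore=0.9, policyAlignment=0.1,
                             contradictionRisk=0.2, emotionalWeight=0.2)
```

Returning the list of fired signals alongside the score keeps the reason for a quarantine available for later inspection.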

Mutation quarantine and review

Self-modification proposals from the meta-cognitive system are quarantined before activation. Quarantine means the proposal is placed in a pending state where it can be inspected, simulated, reviewed, or rejected. It does not immediately take effect.

This matters because a self-modifying system that applies every plausible improvement is not a safe system. Some improvements will be correct. Some will be over-fitted to recent experience. Some will have unintended interactions with other subsystems. Quarantine provides the time and mechanism to distinguish among these before commitment.

The quarantine review process can be automated (for low-stabilityRisk proposals with strong evidence) or manual (for high-stabilityRisk proposals affecting core reasoning structure). The stabilityRisk field in the proposal, computed by the meta-cognitive engine, is the primary signal for routing between these paths.
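A routing rule of this kind might look like the sketch below. Only the `stabilityRisk` field is named in this article. The `evidenceStrength` and `affectsCoreReasoning` fields and both cutoff values are hypothetical, and sending every proposal that does not qualify for automated review to manual review is a conservative assumption rather than documented behavior.

```python
from dataclasses import dataclass
from enum import Enum

AUTO_MAX_RISK = 0.3       # assumed cutoff; not given in the article
AUTO_MIN_EVIDENCE = 0.7   # assumed cutoff; not given in the article


class ReviewPath(Enum):
    AUTOMATED = "automated"
    MANUAL = "manual"


@dataclass
class ModificationProposal:
    stabilityRisk: float                 # computed by the meta-cognitive engine
    evidenceStrength: float              # hypothetical field
    affectsCoreReasoning: bool = False   # hypothetical field
    status: str = "pending"              # quarantined proposals start as pending


def route(p: ModificationProposal) -> ReviewPath:
    """Pick a review path; anything not clearly low-risk goes to a human."""
    low_risk = p.stabilityRisk <= AUTO_MAX_RISK
    strong_evidence = p.evidenceStrength >= AUTO_MIN_EVIDENCE
    if low_risk and strong_evidence and not p.affectsCoreReasoning:
        return ReviewPath.AUTOMATED
    return ReviewPath.MANUAL
```

In this sketch routing never changes `status`: a proposal stays pending until a review step, not shown here, approves or rejects it.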

Epistemic hygiene

The constitution engine also protects belief quality. It can flag:

Unsupported claims: statements in the narrative self-model or world-model that lack grounding in recent evidence.

Circular reinforcement: situations where the system is reinforcing a strategy because past instances of that strategy were reinforced, without any external validation.

Overconfident narratives: narrative threads that assert certainty about the system's tendencies beyond what the evidence supports.

Degraded prediction sources: memories or models that repeatedly produce poor predictions but remain in active use.

These protections are not primarily about safety in the existential sense; they are about maintaining the quality of the cognitive substrate. A system with degraded beliefs about itself will make consistently worse decisions even when its behavioral safety constraints are intact. Epistemic hygiene keeps the reasoning foundation reliable.
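As one illustration, the degraded-prediction-source check could be a rolling hit-rate monitor. This is a sketch of the idea only: the class name, the window size, the minimum sample count, and the hit-rate floor are all assumptions, and the article does not describe how the engine actually measures prediction quality.

```python
from collections import deque
from typing import Deque, Dict, List

WINDOW = 20          # assumed: number of recent predictions kept per source
MIN_SAMPLES = 5      # assumed: do not judge a source on too few predictions
MIN_HIT_RATE = 0.5   # assumed: below this, the source counts as degraded


class PredictionSourceMonitor:
    """Tracks recent prediction outcomes per source (a memory or a model)."""

    def __init__(self) -> None:
        self._outcomes: Dict[str, Deque[bool]] = {}

    def record(self, source_id: str, correct: bool) -> None:
        # A bounded deque drops the oldest outcome once the window is full.
        self._outcomes.setdefault(source_id, deque(maxlen=WINDOW)).append(correct)

    def degraded_sources(self) -> List[str]:
        """Return the sources whose recent hit rate is below the floor."""
        flagged = []
        for source_id, outcomes in self._outcomes.items():
            if len(outcomes) < MIN_SAMPLES:
                continue
            hit_rate = sum(outcomes) / len(outcomes)
            if hit_rate < MIN_HIT_RATE:
                flagged.append(source_id)
        return sorted(flagged)
```

A flag from a monitor like this would be a prompt for review, not an automatic removal; what happens to a flagged source is outside the scope of the sketch.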

The constitutional engine as the last defense

The constitutional engine is positioned at the end of the self-regulation arc deliberately. It is not the primary control mechanism; it is the last defense. Policy bounds, attention limits, budget controls, forgetting, and meta-cognitive monitoring all operate before constitutional checks. These earlier systems handle ordinary fluctuation, calibration, and resource management.

The constitutional engine handles the cases that slip through: identity states that drift too far, memories that accumulate corruption signatures, proposals that would make destabilizing structural changes. For these cases, the constitutional engine's quarantine mechanism provides a controlled pause before the change takes effect.

This layered structure is important for performance. Constitutional checks are relatively expensive; running them on every micro-operation would be prohibitive. By relying on the earlier systems to handle routine cases, the constitutional engine can reserve its checks for genuinely significant changes.

Scope of these safety properties

The constraints described in this article are operational stability properties, not solutions to the AI alignment problem as framed in the safety research literature.

maxIdentityDrift = 0.2 is a design-time choice made by inspecting the post-outage identity from Experiment 25, which drifted 0.266 from baseline. The threshold was set below that value to ensure the outage case triggers quarantine and operator review. There is no formal derivation of 0.2 as a universal safety boundary; it is a calibrated operational limit for this system in this domain.

The quarantine mechanism triggers after drift exceeds the threshold, not before. A system drifting toward a harmful identity attractor that stays below maxIdentityDrift = 0.2 would not be caught by the constitutional engine. Defense against that scenario lies upstream: policy bounds cap per-step drift, the attention engine limits the salience of destabilizing signals, the forgetting system suppresses corrupted memories, and the meta-cognition engine monitors calibration errors continuously. As noted above, the constitutional engine is the last defense, not the only one.

What the two-signal reward corruption requirement does claim is worth restating precisely. A single signal (high importance with low policy alignment, or high contradiction risk with high emotional weight) scores 0.5 against the quarantine threshold of 0.6, so a memory that triggers only one is approved. Only both signals together cross the threshold. This design reduces false positives: isolated unusual attributes are common in active cognitive systems; co-occurring corruption signatures are not. The quarantine records the triggering signals and the identity state at the time of quarantine, giving operators a full audit trail.

What these properties do not address: deceptive alignment (a system that behaves correctly during review and misaligns in deployment), goal misgeneralization at capability levels beyond what the system currently implements, or the threshold-setting problem in general (who decides what the right maxIdentityDrift is for a given deployment context). Those are harder problems that the constitutional engine is not designed to solve.
