Steve Hutchinson · Big Pines
9 min read · Stage 6 · Cognitive Substrate

Reinforcement Scoring Engine

The reinforcement layer turns outcome evidence into structured scoring signals for memory priority, policy evaluation, and identity-impact records.

Turning outcomes into scores, not policy changes

Reinforcement fan-out: a scored experience updates retrieval priority, votes on policy drift, and signals identity formation.

Reinforcement in this architecture is not policy mutation. The reinforcement layer observes outcomes, computes structured scoring signals, and emits those signals for downstream consumers (memory, policy engine, identity formation) to act on. Scoring and mutation are kept as separate stages so the consequences of any single outcome remain auditable and reversible.

This separation is a deliberate design constraint. It means the reinforcement layer can be tested and reasoned about independently of whether downstream policy changes are desirable. It also means a bad outcome signal can be corrected before it has permanently altered behavior.

The stateless scoring problem

Experiment 7 tested the most basic reinforcement loop: the engine observes outcomes, writes scores to OpenSearch, and the next retrieval cycle uses updated scores. The loop structure was sound. Contradiction suppression worked correctly: a memory with high contradiction risk (mem-c1) had its retrieval priority reduced from 0.30 to 0.248 after negative outcomes.

But the scoring was stateless. Each call to evaluate recomputed retrievalPriority from scratch using the current importance score and reward signal. Repeated positive reinforcement did not compound; it converged. Run the evaluation twenty times and the score settled at the same fixed point it would reach in one pass. The feedback loop was structurally correct but lacked the accumulation that makes reinforcement meaningful over time.
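A minimal sketch of why a stateless evaluator cannot accumulate. The actual scoring formula is not given in this article, so the linear blend below, along with the names `evaluate_stateless` and `blend`, is an illustrative assumption. The point is only that a function of the current inputs returns the same value no matter how many times it runs.

```python
def evaluate_stateless(importance: float, reward: float, blend: float = 0.5) -> float:
    """Hypothetical stateless evaluator: the result depends only on the
    current importance score and reward signal, never on history."""
    return (1.0 - blend) * importance + blend * reward


# Twenty evaluations with the same inputs (values are illustrative).
scores = [evaluate_stateless(importance=0.6, reward=0.8) for _ in range(20)]

# Every pass lands on the same value: nothing accumulates.
all_identical = len(set(scores)) == 1
print(all_identical)
```

Because no prior state enters the computation, the twentieth positive outcome carries exactly as much weight as the first.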

This is the core insight: reinforcement is not useful if a memory that has been retrieved and validated a hundred times has the same priority as one that has been retrieved once. Frequency of successful retrieval should leave a trace.

Exponential moving averages and their limits

The first attempt to introduce accumulation used an exponential moving average (EMA):

$$finalRp = pw \times prior + (1 - pw) \times newRp$$

where $prior$ is the previous retrieval priority and $pw$ is the prior weight. The idea was that the prior weight would cause the system to remember past scores, building up over time.

Experiment 8 found this did not work as intended at typical prior weights. The formula converges to a signal-determined fixed point regardless of $pw$. Over twenty turns at $pw = 0.3$, the average cluster-A retrieval priority was 0.6950. At $pw = 0.0$ (no memory of the past), it was 0.6941. The difference was smaller than the noise in the signals.

Experiment 9 extended this to 100 turns with five cycles through the corpus and added jitter to the signals. The gap between $pw = 0.3$ and $pw = 0.0$ oscillated around $\pm 0.01$ with no upward trend. The EMA was a regression-to-prior formula, not an accumulator. This is a mathematically precise statement: the formula has a fixed point at $newRp$ for any $pw < 1$, and iterating it drives the sequence toward that fixed point.
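One way to see the convergence: for a constant signal, each update shrinks the distance to the fixed point by a factor of $pw$, so $rp_n - newRp = pw^n (rp_0 - newRp)$. The sketch below iterates the EMA update under that simplifying assumption. The constant signal of 0.70 and the starting priority of 0.30 are illustrative values, not figures from the experiments.

```python
def ema_step(prior: float, new_rp: float, pw: float) -> float:
    """One EMA update: blend the previous priority with the new signal."""
    return pw * prior + (1.0 - pw) * new_rp


def iterate(start: float, new_rp: float, pw: float, turns: int) -> float:
    """Apply the EMA update repeatedly against a constant signal."""
    rp = start
    for _ in range(turns):
        rp = ema_step(rp, new_rp, pw)
    return rp


TARGET = 0.70  # hypothetical constant signal (newRp)
START = 0.30   # hypothetical starting retrieval priority

with_memory = iterate(START, TARGET, pw=0.3, turns=20)
no_memory = iterate(START, TARGET, pw=0.0, turns=20)

# Both runs end at (or indistinguishably near) the signal itself.
print(with_memory, no_memory)
```

With $pw = 0.3$ the residual after twenty turns is on the order of $0.3^{20}$, which is why the prior weight leaves no lasting trace.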

True accumulation requires a mechanism that does not converge. The answer was a count field.

Hebbian compounding with a count bonus

Experiment 10 introduced reinforcement_count: a field that tracks how many times a memory has received a positive reinforcement signal. This count feeds a count bonus:

$$finalRp \mathrel{+}= countBonus \times \log_2(1 + count) \times reinforcement$$

The logarithm grows without a fixed bound. A memory reinforced a hundred times will have a higher count bonus than one reinforced ten times, even if all the signals are identical. This is the accumulation property the EMA lacked.
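As a rough illustration of how the count term behaves, the sketch below evaluates the bonus at a few counts with the signal held fixed. The function name `count_bonus_term` is hypothetical; the coefficient 0.02 and the signal 0.72 are values quoted elsewhere in this article, and the counts are arbitrary.

```python
import math


def count_bonus_term(count_bonus: float, count: int, reinforcement: float) -> float:
    """Additive bonus: countBonus * log2(1 + count) * reinforcement."""
    return count_bonus * math.log2(1 + count) * reinforcement


# Same signal every time; only the reinforcement count differs.
bonuses = {
    count: count_bonus_term(0.02, count, reinforcement=0.72)
    for count in (1, 10, 100)
}

for count, bonus in bonuses.items():
    print(count, round(bonus, 4))
```

The bonus keeps rising with the count, but each additional reinforcement contributes less than the one before, which is the usual reason for choosing a logarithm over a linear count.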

The results were clear. At $countBonus = 0.02$, mem-a1's retrieval priority at turn 100 was 0.826, compared to 0.757 at $countBonus = 0$. At $countBonus = 0.05$, it reached 0.951. Cluster-A average went from 0.698 to 0.771 to 0.879 as $countBonus$ increased.

This is computational Hebbian learning: memories that are retrieved and reinforced repeatedly become stronger, just as synaptic connections that fire together repeatedly become more efficient.

The quality gate

Experiment 10 also revealed a failure mode. Without any quality filter, the count bonus lifted even bad memories. Mem-c1, a memory with high contradiction risk (0.8), still accumulated retrievals during the experiment because it existed in the corpus and was occasionally surfaced. Its retrieval priority rose to 0.309, above its importance-score baseline of 0.30. A consistently wrong memory was becoming more influential, not less.

The fix was to multiply the count bonus by the reinforcement signal itself:

$$finalRp \mathrel{+}= countBonus \times \log_2(1 + count) \times result.reinforcement$$

For trusted memories (cluster A), $result.reinforcement$ was approximately 0.72. For contradictory memories (cluster C), it was approximately 0.30. The quality gate therefore scales the bonus by the outcome quality of past retrievals. A memory that is often retrieved but often judged poor accumulates a smaller bonus than one that is often retrieved and often judged useful.

After this fix, mem-c1's retrieval priority at $countBonus = 0.02$ stayed at 0.269, well below its baseline importance. The system was now correctly distinguishing between "frequently retrieved" and "frequently useful."
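The effect of the gate can be sketched by comparing bonuses at the same reinforcement count. This is a sketch under stated assumptions: the ungated variant is assumed to simply omit the reinforcement factor, the count of 10 is arbitrary, and the function names are hypothetical. The signals 0.72 and 0.30 are the approximate cluster values quoted above.

```python
import math


def ungated_bonus(count_bonus: float, count: int) -> float:
    """Assumed pre-fix variant: the bonus ignores outcome quality."""
    return count_bonus * math.log2(1 + count)


def gated_bonus(count_bonus: float, count: int, reinforcement: float) -> float:
    """Quality-gated variant: the bonus is scaled by the reinforcement signal."""
    return count_bonus * math.log2(1 + count) * reinforcement


COUNT = 10  # hypothetical: both memories reinforced equally often

flat = ungated_bonus(0.02, COUNT)
trusted = gated_bonus(0.02, COUNT, reinforcement=0.72)        # cluster A
contradictory = gated_bonus(0.02, COUNT, reinforcement=0.30)  # cluster C

# Ungated, both memories would receive the same lift. Gated, the
# contradictory memory receives well under half of the trusted one's bonus.
print(round(flat, 4), round(trusted, 4), round(contradictory, 4))
```

At equal counts the ratio of the two gated bonuses is just the ratio of the signals, so retrieval frequency alone can no longer close the gap between a useful memory and an unreliable one.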

Propagation to arbitration

Experiment 11 traced the full consequence chain: reinforcement raises retrieval priority, higher retrieval priority increases agent confidence, higher agent confidence improves arbitration scores.

At baseline (no count bonus), cluster-A arbitration score was 0.793 and cluster-C was 0.497, a gap of 0.296. With countBonus=0.02, cluster-A reached 0.812 and cluster-C reached 0.501, a gap of 0.311. With countBonus=0.05, cluster-A reached 0.834 and cluster-C reached 0.508, a gap of 0.325.

The gap widened monotonically with countBonus. This matters because arbitration is where the system makes decisions. A larger gap between trusted and contradictory memories in the arbitration layer means the system is more likely to choose actions supported by reliable evidence and less likely to be swayed by evidence it has repeatedly found unreliable.
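The widening can be checked directly from the arbitration scores quoted above. The snippet is plain arithmetic on those published figures; the variable names are illustrative. Because the inputs are already rounded to three decimals, a recomputed gap can differ from a quoted gap in the last digit.

```python
# (cluster-A score, cluster-C score) per countBonus, as reported for Experiment 11.
arbitration = {
    0.00: (0.793, 0.497),
    0.02: (0.812, 0.501),
    0.05: (0.834, 0.508),
}

# Gap between trusted and contradictory clusters at each setting.
gaps = {cb: round(a - c, 3) for cb, (a, c) in arbitration.items()}

# The gap should grow as countBonus grows.
ordered = [gaps[cb] for cb in sorted(gaps)]
widens_monotonically = all(x < y for x, y in zip(ordered, ordered[1:]))

print(gaps, widens_monotonically)
```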

The chain is: reinforcement count rises, count bonus raises retrieval priority, retrieval priority raises agent confidence in memories, agent confidence raises the arbitration score for proposals backed by those memories. A single well-designed reinforcement formula propagates benefits through the entire decision stack.

The contradiction suppression property

Reinforcement also handles the inverse case: suppressing memories that repeatedly lead to poor outcomes. Experiment 7 confirmed that contradictionRisk = 0.8 on mem-c1 caused its retrieval priority to decay from 0.30 to 0.248 even under a basic stateless evaluator.

With quality-gated count bonuses, the suppression becomes more precise. A memory that has high contradiction risk will receive low reinforcement signals. Low reinforcement signals produce small count bonuses. Small count bonuses keep retrieval priority from rising. Meanwhile, the direct contradiction suppression penalty continues to reduce its score. The two mechanisms work together: positive reinforcement accumulates slowly for good memories, and contradiction scoring keeps bad memories down without needing a special suppression pass.

This is qualitatively different from explicit memory deletion. The memory remains in the index; it can still be retrieved, reflected on, and audited. But its retrieval priority drops to the point where it rarely surfaces in ordinary context hydration. Strategic forgetting, covered later in this series, handles the eventual removal of memories that have decayed past the point of usefulness.

Scoring structure and downstream consumers

The reinforcement engine emits structured outcome records rather than directly modifying policy or memory. Each record contains the memory ID, the outcome signal, the updated retrieval priority, and a contradiction assessment. Downstream consumers receive these records via the policy.evaluation topic and act on them independently.
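A minimal sketch of what such a record might look like. The class and field names are hypothetical, and modeling the contradiction assessment as a single risk score is an assumption; the article only lists the four pieces of information a record carries. The example values are the mem-c1 figures quoted earlier.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ReinforcementRecord:
    """Hypothetical shape of one emitted outcome record."""
    memory_id: str
    outcome_signal: float      # reinforcement signal for this outcome
    retrieval_priority: float  # updated priority computed by the scorer
    contradiction_risk: float  # assumed form of the contradiction assessment


record = ReinforcementRecord(
    memory_id="mem-c1",
    outcome_signal=0.30,
    retrieval_priority=0.269,
    contradiction_risk=0.8,
)

# Serialized form, as it might be published to a topic for consumers.
payload = json.dumps(asdict(record))
print(payload)
```

Freezing the dataclass mirrors the design intent: a record is a statement of what was scored, not a handle through which state gets changed.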

The policy engine uses the outcome signal to adjust behavioral weights. The consolidation engine uses updated retrieval priorities to make better candidate selections in the next consolidation pass. The identity formation layer uses the pattern of outcomes to update the agent's self-model. Each consumer applies the signal according to its own logic, and each can be inspected and adjusted independently.

This fan-out structure is the reason scoring and mutation are separated. A scoring error produces a bad signal that downstream consumers may handle poorly. But the error is visible in the emitted record, correctable in the evaluator, and not yet burned into permanent state. A mutation error (one that directly modified policy weights) would be harder to detect and harder to reverse.
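The fan-out can be sketched with an in-process stand-in for the topic. Everything here is illustrative: the `Topic` class, the consumer functions, and the record fields are hypothetical, and the real transport is not described in this article. The sketch shows one property only: each consumer works on its own copy, so the emitted record stays intact for inspection.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Consumer = Callable[[Record], None]


class Topic:
    """Hypothetical in-process stand-in for a topic such as policy.evaluation."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._consumers: List[Consumer] = []

    def subscribe(self, consumer: Consumer) -> None:
        self._consumers.append(consumer)

    def publish(self, record: Record) -> None:
        # Each consumer receives its own copy of the record.
        for consumer in self._consumers:
            consumer(dict(record))


policy_seen: List[Record] = []
consolidation_seen: List[Record] = []
identity_seen: List[Record] = []


def careless_identity_consumer(record: Record) -> None:
    # A consumer that rewrites the value it was handed; only its copy changes.
    record["retrieval_priority"] = 1.0
    identity_seen.append(record)


topic = Topic("policy.evaluation")
topic.subscribe(policy_seen.append)
topic.subscribe(consolidation_seen.append)
topic.subscribe(careless_identity_consumer)

emitted: Record = {"memory_id": "mem-c1", "outcome_signal": 0.30, "retrieval_priority": 0.269}
topic.publish(emitted)

print(emitted["retrieval_priority"], policy_seen[0]["retrieval_priority"])
```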

Calibration and what the experiments do and do not show

The experiments in this article validate the reinforcement design against a fixed synthetic corpus: nine memories organized into three clusters, with known importance and contradiction properties. That corpus was built to test specific architectural properties - Hebbian accumulation, quality gating, contradiction suppression - not to statistically represent a production event distribution.

The baseline in Experiment 1 is flat importance-ranked retrieval with no session novelty, achieving a 70% hit rate and 33% cluster coverage. Experiment 3's session-relative novelty (95%, 100%) is compared against that simpler configuration of the same system, not against an independent external baseline such as a Redis sorted set with TTL. The improvement is real, but the comparison is within the architecture, not across architectures.

Parameter choices are documented with experiment citations in the code. The novelty weight of 0.30 was raised from 0.14 after Experiment 6 found the engine and test harness disagreed on 40% of candidates at the original value. The count bonus of 0.02 was chosen after Experiment 10 showed that 0.05 elevated contradictory memories too aggressively. That iteration path is auditable: the code comments name the experiments that motivated each value.
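A sketch of what experiment-cited parameters could look like in code. The class and field names are hypothetical and the actual source layout is not shown in this article; the values and the experiments cited in the comments are the ones described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoringParams:
    """Hypothetical container for calibrated scoring parameters."""

    # Raised from 0.14 after Experiment 6: the engine and test harness
    # disagreed on 40% of candidates at the original value.
    novelty_weight: float = 0.30

    # Chosen after Experiment 10: 0.05 elevated contradictory memories
    # too aggressively.
    count_bonus: float = 0.02


params = ScoringParams()
print(params)
```

Keeping the citation next to the value means a later re-tuning on a different event distribution can start from the experiment that justified the current setting.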

What does not exist: a holdout corpus untouched during calibration, and a comparison against a non-cognitive external baseline. Parameters validated on the synthetic corpus may require re-tuning when deployed to a different event distribution - one with different importance ratios, novelty dynamics, or contradiction rates. The failure modes article documents the conditions under which the system degrades predictably.

The next article covers the cognitive agent loop: how a single session ties retrieval, reasoning, action, and evaluation into a causal thread, and why that loop structure is the scaffold on which all higher-order capabilities depend.
