Steve Hutchinson · Big Pines
Stage 38 · Cognitive Substrate

Design Philosophy: Scope, Vocabulary, and What This Project Claims

A retrospective on the vocabulary, methodology, and honest scope of the Cognitive Substrate series - what the architecture claims, what it demonstrates, and where those two things diverge.

Thirty-seven articles, forty-four experiments, one hosted demonstration. This is a reasonable point to be explicit about what this project actually claims - and what it does not.

On the vocabulary

The series borrows terms from cognitive science: episodic memory, semantic memory, salience, Hebbian compounding, affect, curiosity, identity, narrative. These words come with connotations. They suggest minds, not machines. That choice is intentional and worth defending.

The terms are functional analogies. They are not claims about consciousness, subjective experience, or biological implementation. "Episodic memory" here means a time-ordered, immutable log of experience events with context, action, and evaluation - not a claim that the system experiences episodes. "Salience" here means a composite priority signal derived from six measurable dimensions - not a claim about what the system notices or feels.

The vocabulary earns its use when it generates non-trivial predictions about system behavior that can be experimentally checked. Three examples:

The episodic framing predicts that retrieval without session-relative novelty will exhibit attentional capture - the most salient memories will crowd out all others regardless of relevance. Experiment 1 confirms this: flat importance ranking achieves 70% hit rate but only 33% cluster coverage, with an entire cluster of memories permanently invisible. Experiment 3's session-relative novelty formula raises coverage to 100%. That improvement is a prediction of the episodic framing, verifiable by inspection.
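The capture effect and its fix can be illustrated with a toy session. This is a minimal sketch, not the series' code: the additive formula `importance + weight / (1 + times_retrieved)`, the `Memory` class, the importance values, and the top-3 cutoff are all assumptions made for illustration. Only the 0.30 weight and the nine-memory, three-cluster shape come from the series. The sketch measures cluster coverage only and does not model hit rate.

```python
from dataclasses import dataclass

NOVELTY_WEIGHT = 0.30  # value cited in the series; the formula around it is assumed


@dataclass(frozen=True)
class Memory:
    memory_id: str
    cluster: str
    importance: float


def build_corpus() -> list[Memory]:
    """Nine synthetic memories in three clusters (importance values are invented)."""
    base = {"A": 0.90, "B": 0.80, "C": 0.72}
    return [
        Memory(f"{cluster}{i}", cluster, round(start - 0.02 * i, 2))
        for cluster, start in base.items()
        for i in range(3)
    ]


def run_session(corpus: list[Memory], turns: int, k: int, novelty_weight: float) -> float:
    """Retrieve top-k each turn; return the fraction of clusters ever retrieved."""
    retrieved_counts = {m.memory_id: 0 for m in corpus}
    seen_clusters: set[str] = set()

    for _ in range(turns):
        def score(m: Memory) -> float:
            # Assumed novelty term: decays as a memory is retrieved within the session.
            novelty = 1.0 / (1.0 + retrieved_counts[m.memory_id])
            return m.importance + novelty_weight * novelty

        top = sorted(corpus, key=score, reverse=True)[:k]
        for m in top:
            retrieved_counts[m.memory_id] += 1
            seen_clusters.add(m.cluster)

    return len(seen_clusters) / len({m.cluster for m in corpus})


corpus = build_corpus()
flat_coverage = run_session(corpus, turns=5, k=3, novelty_weight=0.0)
novelty_coverage = run_session(corpus, turns=5, k=3, novelty_weight=NOVELTY_WEIGHT)
print(f"flat: {flat_coverage:.0%}, with session novelty: {novelty_coverage:.0%}")
```

With the weight set to zero, the same three highest-importance memories win every turn and the other clusters never surface. With the novelty term, recently retrieved memories lose priority within the session, so lower-importance clusters get a turn.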

The attentional framing predicts that affect state should quantitatively modulate which signals receive cognitive resources. Experiment 21 measures this directly: the coupleAttention value for a risky, urgent candidate is 0.030 in a settled affect state and 0.334 in a stressed one - an 11-fold amplification. A simple priority-queue framing would not predict this relationship; the attentional framing demands it.
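The amplification figure follows directly from the two reported values:

```latex
\[
\frac{0.334}{0.030} \approx 11.1
\]
```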

The Hebbian framing predicts that memories reinforced repeatedly should compound faster than arithmetic accumulation. Experiment 8 tests exponential moving averages and shows they converge to a signal-determined fixed point regardless of prior weight - no accumulation. Experiment 10 introduces a logarithmic count bonus and measures genuine compounding: cluster-A retrieval priority at turn 100 is 0.826 with the count bonus versus 0.757 without. The Hebbian framing correctly identified why the EMA failed and what would fix it.
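The fixed-point behavior is easy to see in a few lines. This is a sketch under stated assumptions: the smoothing factor `alpha`, the constant signal of 0.7, and the way the count bonus is added to the score are all hypothetical. Only the 0.02 bonus value and the use of a logarithm come from the series, and the sketch does not attempt to reproduce the 0.826 and 0.757 figures.

```python
import math

COUNT_BONUS = 0.02  # value cited in the series; how it enters the score is assumed


def ema_update(weight: float, signal: float, alpha: float = 0.3) -> float:
    """One exponential-moving-average step toward the observed signal."""
    return (1.0 - alpha) * weight + alpha * signal


def run_ema(prior: float, signal: float, turns: int, alpha: float = 0.3) -> float:
    """Apply the EMA update repeatedly with a constant signal."""
    weight = prior
    for _ in range(turns):
        weight = ema_update(weight, signal, alpha)
    return weight


def priority_with_count_bonus(ema_weight: float, reinforcement_count: int,
                              bonus: float = COUNT_BONUS) -> float:
    """Assumed form: EMA weight plus a logarithmic bonus on reinforcement count."""
    return ema_weight + bonus * math.log1p(reinforcement_count)


# Two very different priors end up at the same place: the signal itself.
from_low = run_ema(prior=0.1, signal=0.7, turns=100)
from_high = run_ema(prior=0.9, signal=0.7, turns=100)
print(from_low, from_high)

# The count bonus keeps growing with reinforcement, so history is not forgotten.
print(priority_with_count_bonus(from_low, 10), priority_with_count_bonus(from_low, 100))
```

The EMA forgets its starting point geometrically, which is why repeated reinforcement alone cannot compound. A term that depends on the reinforcement count is one way to make history matter.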

These are testable predictions, not merely analogies. If the architectural bets that the cognitive vocabulary encodes are wrong, the vocabulary is wrong in checkable ways. That checkability is why the vocabulary is used.

On demonstrated scope versus designed scope

The system is designed to generalize. The ExperienceEvent schema is domain-independent: any agent that can express its inputs, internal state, actions, outcomes, and evaluations in structured form can use the substrate. The operational primitive taxonomy is vendor-agnostic: BACKPRESSURE_ACCUMULATION and QUEUE_GROWTH describe behavioral dynamics, not Kafka or PostgreSQL. The embedding and retrieval infrastructure makes no assumption about what the documents contain.
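To make the domain-independence claim concrete, here is a hypothetical sketch of what such a record might look like. The field names, types, and example values are assumptions rather than the project's actual schema. Only the five categories (inputs, internal state, actions, outcomes, evaluations) and the two primitive names come from the text above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Any, Mapping, Optional


class OperationalPrimitive(Enum):
    """Two primitives named in the article; the real taxonomy is larger."""
    BACKPRESSURE_ACCUMULATION = "BACKPRESSURE_ACCUMULATION"
    QUEUE_GROWTH = "QUEUE_GROWTH"


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class ExperienceEvent:
    event_id: str
    timestamp: datetime
    inputs: Mapping[str, Any]          # what the agent observed
    internal_state: Mapping[str, Any]  # agent state when it acted
    action: str                        # what the agent did
    outcome: str                       # what happened next
    evaluation: float                  # how the outcome was scored
    primitive: Optional[OperationalPrimitive] = None  # only for operational domains


# An SRE event and a code-review event fit the same shape (values are invented).
sre_event = ExperienceEvent(
    event_id="evt-001",
    timestamp=datetime(2025, 1, 1, tzinfo=timezone.utc),
    inputs={"consumer_lag": 12000},
    internal_state={"affect": "stressed"},
    action="raise_alert",
    outcome="lag_recovered",
    evaluation=0.8,
    primitive=OperationalPrimitive.QUEUE_GROWTH,
)

review_event = ExperienceEvent(
    event_id="evt-002",
    timestamp=datetime(2025, 1, 2, tzinfo=timezone.utc),
    inputs={"diff_lines": 42},
    internal_state={"affect": "settled"},
    action="request_changes",
    outcome="author_revised",
    evaluation=0.6,
)
```

The second event needs no operational primitive, which mirrors the point that a new domain would bring its own pattern library while reusing the same record shape.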

The only working end-to-end demonstration is SRE observability on Aiven-managed infrastructure. The real-telemetry experiment (Stage 37) takes live Kafka and OpenSearch metrics, converts them to ExperienceEvent records, embeds them with Vertex, indexes them in OpenSearch, and retrieves them via the same cognitive loop used for all reasoning. That pipeline is real and running.

Broader applicability - the claim that the same architecture would work for, say, a code review agent or a customer support agent - is an architectural hypothesis. The schema would accommodate those domains. The pattern library would need new primitives. The retrieval and reinforcement systems would work the same way. But "would work" is a different claim from "has been demonstrated to work." A reader should treat SRE observability as the demonstrated case and treat anything else as an untested extension.

On the experiment methodology

The series reports results from forty-four numbered experiments. These experiments are reproducible: each has a standalone script, a fixed corpus, and explicit hypotheses that pass or fail. The parameter choices throughout the series are documented with experiment citations in the code, so the calibration path is auditable.

What the experiments are not: a statistically representative evaluation across deployment distributions. The small-scale experiments use nine synthetic memories organized into three clusters with known properties. The operational signal experiments use a synthetic corpus of operational signals generated to represent degraded, outage, and recovery windows. These corpora were designed to test specific architectural properties, not to sample from the space of real-world deployments.

This means two things. First, the experiments demonstrate that the system behaves as designed on the design corpus. That is evidence, but it is evidence about the corpus used. Second, parameters calibrated on the synthetic corpus - the novelty weight of 0.30, the count bonus of 0.02, the re-consolidation interval of 5 epochs - may require tuning when deployed to a different event distribution with different importance ratios, novelty dynamics, or contradiction rates. The failure modes article documents the conditions under which the system degrades predictably.
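One way to keep those calibrated values visible and overridable is to hold them in a single immutable settings object. The class and field names below are hypothetical, and the override value of 0.45 is an arbitrary example; only the three defaults come from the series.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SubstrateParams:
    """Defaults calibrated on the synthetic corpus (names are hypothetical)."""
    novelty_weight: float = 0.30
    count_bonus: float = 0.02
    reconsolidation_interval_epochs: int = 5


DEFAULTS = SubstrateParams()

# A deployment with a different event distribution overrides only what it re-tunes.
tuned = replace(DEFAULTS, novelty_weight=0.45)
print(DEFAULTS, tuned)
```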

There is no holdout corpus that was untouched during calibration, and no comparison against an external non-cognitive baseline such as a keyword-indexed log store. The internal baselines - Experiment 1 (flat importance ranking) compared to Experiment 3 (session novelty), or Experiment 7 (stateless reinforcement) compared to Experiment 10 (Hebbian compounding) - are comparisons between configurations of the same system, not between the cognitive system and a simpler alternative.

What this project is not

It is not a claim to implement general cognition. The architecture implements specific functional behaviors - associative retrieval, reinforcement-weighted priority, salience-gated attention, identity tracking - that cognitive science vocabulary describes well. It does not implement, and does not claim to implement, the full range of cognitive capacities.

It is not a solution to the AI alignment problem. The constitutional engine provides operational stability constraints: identity drift is bounded, reward corruption requires two independent signals, self-modification proposals are quarantined for review. These are useful properties. They are not solutions to deceptive alignment, goal misgeneralization, or the broader problem of specifying what a system should optimize for. The constitutional stability article is explicit about this boundary.

It is not a demonstrated solution at arbitrary scale. The hosted deployment runs on a single ML node for embeddings, a single OpenSearch cluster, and a single Kafka broker. The architecture is designed to scale horizontally via Kafka consumer groups and OpenSearch replicas. Whether it does so gracefully under production load at scale has not been tested.

What this project is

A reference implementation of a persistent, learnable memory substrate for AI agents, built on standard infrastructure - Kafka, OpenSearch, ClickHouse, PostgreSQL - and instrumented for cognitive observability via OpenTelemetry.

The substrate is accompanied by forty-four reproducible experiments that test specific architectural properties, document failure modes as they were discovered, and record the parameter choices that resulted. One end-to-end demonstration shows live infrastructure telemetry flowing through the full pipeline into retrievable memory.

The cognitive vocabulary is the design language. The experiments are the evidence. The scope is SRE observability, demonstrated; broader domains are a hypothesis. The failure modes are known and documented. That is what this project claims.
