Steve HutchinsonBig Pines
·3 min read·Kafka and the Registry

Why I Chose Kafka for the Cognitive Substrate

The Cognitive Substrate needs to ingest a constant stream of events from many different sources - telemetry, tickets, Slack, logs. Here is why Kafka is the right foundation for that, and what I quickly learned it was not enough on its own.

When building the memory layer for the Cognitive Substrate, I needed something that could reliably handle a constant stream of events from many different sources - telemetry, logs, tickets, Slack messages, and more - without any of those sources needing to coordinate directly with each other.

I chose Kafka for three main reasons.

Why Kafka

High-throughput event streaming - Kafka is built for this. It handles thousands of events per second comfortably without special tuning, and scales well beyond that when needed.

Producer-consumer decoupling - the telemetry system publishes events without needing to know which workers will process them or when. The ingestion worker processes events without needing to know about the telemetry system. They communicate only through the Kafka topic. This decoupling makes the architecture easy to extend: adding a new event source means writing a producer, not changing any existing consumers.

Durability and replay - once data is written to Kafka, it is retained for a configurable period and can be replayed from any offset. This is valuable for the Cognitive Substrate specifically: when I add a new processing worker, I can replay historical events to populate its state rather than starting from scratch.

Kafka as the Central Nervous System

Kafka acts as the central event bus for the entire Cognitive Substrate. Every experience - whether it originated as a telemetry spike, a support ticket, or a Slack thread - first lands in a Kafka topic. Workers consume from those topics at their own pace, process the events, and emit outputs to other topics.

This design gives resilience (workers can crash and restart without losing events), scalability (add workers to increase throughput), and auditability (the full event history is available for replay and debugging).

Here is the CognitiveProducer.publish method. Every message is automatically mirrored to the immutable audit topic - a pattern that would be impossible to enforce consistently with ad-hoc producers:

// packages/kafka-bus/src/producer.ts

export class CognitiveProducer {
  async publish<T>(topic: TopicName, payload: T, options: PublishOptions = {}): Promise<void> {
    const headers = options.traceContext ? injectTraceContext(options.traceContext) : {}

    const messages: Array<{ topic: string; messages: Message[] }> = [
      {
        topic,
        messages: [
          {
            key: options.key ? Buffer.from(options.key) : null,
            value: Buffer.from(JSON.stringify(payload)),
            headers,
          },
        ],
      },
    ]

    // Every message is automatically mirrored to the immutable audit topic,
    // except for audit messages themselves (which would cause infinite recursion).
    if (this.enableAuditMirror && topic !== Topics.AUDIT_EVENTS) {
      const auditPayload = { originalTopic: topic, payload, timestamp: new Date().toISOString() }
      messages.push({
        topic: Topics.AUDIT_EVENTS,
        messages: [
          {
            key: options.key ? Buffer.from(options.key) : null,
            value: Buffer.from(JSON.stringify(auditPayload)),
            headers,
          },
        ],
      })
    }

    await this.producer.sendBatch({ topicMessages: messages, compression: CompressionTypes.GZIP })
  }
}

The audit mirror is opt-in per producer instance but enabled by default. Any message published to any Cognitive Substrate topic creates a corresponding audit record with the original topic name, payload, and timestamp - providing an immutable event history without any per-call boilerplate.

The Problem I Did Not See Coming

Using Kafka this way works well until you have more than one producer writing events. Then you discover the problem: without a formal contract about what the events look like, one innocent change to a message structure can silently break every downstream consumer.

That is the problem the Schema Registry solves. The next article covers how I hit it and what it cost.

Related Articles

This site collects anonymous usage data to understand how people read and navigate the blog. Accepting enables persistent reader preferences across visits.