Bringing embeddings closer to the index
Calling an external embedding service on every retrieval adds a network round-trip on the hot path. When the retrieval system needs a query embedding to run a vector search, it calls the embedding provider, waits for the response, and only then submits the search to OpenSearch. Under high retrieval load, every query pays that extra latency, and the total cost grows with query volume.
OpenSearch ML inference nodes eliminate that round-trip by running embedding and reranking models inside the cluster, co-located with the memory index they serve. A query submitted to an OpenSearch ML pipeline is embedded and searched within the same request, without any external network dependency.
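A minimal sketch of what "embedded and searched within the same request" can look like, assuming the cluster has the Neural Search plugin's neural query type available and an embedding model already deployed. The vector field name embedding, the example question, and the model ID placeholder are hypothetical. The function only builds the request body; sending it is left to whichever client the deployment uses.

```python
import json


def neural_search_body(query_text: str, model_id: str, k: int = 10,
                       vector_field: str = "embedding") -> dict:
    """Build a search body in which the cluster embeds query_text itself.

    The caller sends plain text; no query vector is computed client-side.
    """
    return {
        "size": k,
        "query": {
            "neural": {
                vector_field: {
                    "query_text": query_text,
                    "model_id": model_id,  # ID of a model deployed in the cluster
                    "k": k,
                }
            }
        },
    }


# Hypothetical model ID placeholder; a real deployment supplies its own.
body = neural_search_body("what failed in the last deploy?",
                          model_id="<embedding-model-id>")
print(json.dumps(body, indent=2))
```

The body would be sent as an ordinary search request against the memory index. The point of the sketch is that the client never handles a vector.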
What ML nodes provide
ML Commons, the OpenSearch plugin that enables in-cluster inference, exposes two operations that matter for the memory substrate.
Embedding at ingest time allows experience events to be embedded as they are written to the index, rather than requiring the ingestion worker to call an external embedding service before writing. The ingest pipeline intercepts each new document, passes the text field through the registered embedding model, and writes the resulting vector to the knn_vector field automatically. This keeps the ingestion hot path simple: the worker writes the text and metadata, and the cluster handles the vector.
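The ingest-time flow above can be sketched as two request bodies: an ingest pipeline that uses the text_embedding processor, and an index that sets that pipeline as its default. This assumes the text_embedding processor is installed alongside ML Commons. The field names text and embedding, the pipeline name, and the dimension value are illustrative assumptions; the dimension must match whatever model is actually registered.

```python
def embedding_ingest_pipeline(model_id: str, text_field: str = "text",
                              vector_field: str = "embedding") -> dict:
    """Body for PUT /_ingest/pipeline/<name>: embed text_field into vector_field."""
    return {
        "description": "Embed experience events at ingest time",
        "processors": [
            {
                "text_embedding": {
                    "model_id": model_id,
                    "field_map": {text_field: vector_field},
                }
            }
        ],
    }


def memory_index_body(pipeline_name: str, dimension: int,
                      text_field: str = "text",
                      vector_field: str = "embedding") -> dict:
    """Body for PUT /<index>: a kNN index that runs the pipeline on every write."""
    return {
        "settings": {"index.knn": True, "default_pipeline": pipeline_name},
        "mappings": {
            "properties": {
                text_field: {"type": "text"},
                vector_field: {"type": "knn_vector", "dimension": dimension},
            }
        },
    }


# Hypothetical names and dimension, for illustration only.
pipeline = embedding_ingest_pipeline("<embedding-model-id>")
index = memory_index_body("memory-embed", dimension=768)
```

With the default pipeline set on the index, the worker's write payload contains only the text and metadata, which matches the division of labor described above.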
Reranking at query time allows a retrieval result set to be re-scored using a cross-encoder model that reads the query and each candidate together, rather than comparing vectors that were computed separately. Cross-encoder reranking is more expensive than bi-encoder similarity search because it runs a model pass for every query-candidate pair, but it is typically more accurate because the model sees both texts jointly. Running it inside the cluster avoids shipping the full candidate set to an external reranker.
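As a rough sketch of in-cluster reranking, the bodies below assume a search pipeline with a rerank response processor backed by a cross-encoder model deployed through ML Commons. The document field name text and the model ID placeholder are hypothetical, and processor availability depends on the OpenSearch version in use.

```python
def rerank_search_pipeline(model_id: str, document_field: str = "text") -> dict:
    """Body for PUT /_search/pipeline/<name>: rerank hits with an in-cluster model."""
    return {
        "response_processors": [
            {
                "rerank": {
                    "ml_opensearch": {"model_id": model_id},
                    # Which document field(s) the cross-encoder reads.
                    "context": {"document_fields": [document_field]},
                }
            }
        ]
    }


def with_rerank_context(search_body: dict, query_text: str) -> dict:
    """Return a copy of search_body carrying the text the reranker compares against."""
    out = dict(search_body)
    out["ext"] = {"rerank": {"query_context": {"query_text": query_text}}}
    return out


rerank_pipeline = rerank_search_pipeline("<cross-encoder-model-id>")
search = with_rerank_context({"query": {"match_all": {}}, "size": 20},
                             "what failed in the last deploy?")
```

The candidate set never leaves the cluster: the first-stage query runs, and the response processor reorders its hits before they are returned.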
Model profiles and environment portability
Not every deployment has ML nodes available. A development environment may run a single-node OpenSearch cluster without ML Commons. A test environment may use stub embeddings to keep tests deterministic. A production environment may use ML nodes for embedding and an external provider for reranking, or vice versa.
The retrieval layer handles this through environment-derived model profiles. A profile specifies which embedding path to use (ML node pipeline, external provider, or stub) and which reranking path to use (ML node, external, or identity). The application logic that performs retrieval does not know which path is active; it calls the retrieval interface and the profile determines what happens underneath.
This design means the retrieval layer can switch between embedding providers without code changes. An experiment that compares ML node embedding quality against a hosted embedding service can be run by changing the active profile rather than modifying retrieval logic.
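One way such profiles could be modeled is shown below. This is a sketch, not the system's actual implementation: the environment variable name MEMORY_MODEL_PROFILE, the profile names, and the particular path combinations are all assumptions chosen for illustration.

```python
import os
from dataclasses import dataclass
from enum import Enum
from typing import Mapping, Optional


class EmbeddingPath(Enum):
    ML_NODE = "ml_node"
    EXTERNAL = "external"
    STUB = "stub"


class RerankPath(Enum):
    ML_NODE = "ml_node"
    EXTERNAL = "external"
    IDENTITY = "identity"  # leave the result order unchanged


@dataclass(frozen=True)
class ModelProfile:
    embedding: EmbeddingPath
    rerank: RerankPath


# Hypothetical profile table; real deployments would define their own.
PROFILES = {
    "development": ModelProfile(EmbeddingPath.EXTERNAL, RerankPath.IDENTITY),
    "test": ModelProfile(EmbeddingPath.STUB, RerankPath.IDENTITY),
    "production": ModelProfile(EmbeddingPath.ML_NODE, RerankPath.EXTERNAL),
}


def load_profile(env: Optional[Mapping[str, str]] = None) -> ModelProfile:
    """Pick the active profile from the environment (variable name is assumed)."""
    env = os.environ if env is None else env
    name = env.get("MEMORY_MODEL_PROFILE", "development")
    if name not in PROFILES:
        raise ValueError(f"unknown model profile: {name!r}")
    return PROFILES[name]


profile = load_profile({"MEMORY_MODEL_PROFILE": "test"})
print(profile.embedding.value, profile.rerank.value)
```

Retrieval code would receive a ModelProfile and dispatch on it, so comparing two embedding providers becomes a matter of changing one environment value.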
The kNN field constraint
One operational constraint applies regardless of which embedding path is active: in a kNN index, every document in the index shard must have the knn_vector field populated. A shard that contains documents with the field alongside documents without it produces errors at query time.
This constraint has a practical implication for mixed deployments. If some experience events were ingested with a stub embedder (producing no vector) and later events are ingested with a real embedder, the index becomes mixed and vector queries fail. The correct approach is to use a dedicated index for embedded documents and a separate index for non-embedded documents, or to ensure all documents in a vector-enabled index go through the same embedding path.
The consolidation article describes how the ingestion worker handles the empty-embedding case: when no embedding is available, the knn_vector field is omitted from the write payload rather than written as an empty array. This prevents the mapper rejection error, but it does not prevent the mixed-shard issue if some documents have the field and others do not. A clean index transition, rather than gradual backfill, is the reliable path when switching embedding providers on an existing index.
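The two rules in this section, omitting the field when there is no embedding and keeping embedded and non-embedded documents in separate indexes, can be combined in the write path. The sketch below assumes hypothetical index names and a vector field called embedding; it is not the ingestion worker's actual code.

```python
from typing import Any, Dict, Mapping, Optional, Sequence, Tuple

# Hypothetical index names, for illustration only.
EMBEDDED_INDEX = "experience-embedded"
TEXT_ONLY_INDEX = "experience-text"


def build_write(text: str, metadata: Mapping[str, Any],
                embedding: Optional[Sequence[float]] = None,
                vector_field: str = "embedding") -> Tuple[str, Dict[str, Any]]:
    """Return (target_index, document) for one experience event.

    A missing or empty embedding means the vector field is omitted entirely
    (never written as []), and the document is routed to the text-only index
    so the vector-enabled index stays uniform.
    """
    doc: Dict[str, Any] = {**metadata, "text": text}
    if embedding:  # None and [] both count as "no embedding"
        doc[vector_field] = list(embedding)
        return EMBEDDED_INDEX, doc
    return TEXT_ONLY_INDEX, doc


target, doc = build_write("deploy rolled back", {"source": "ci"}, embedding=[])
print(target, sorted(doc))
```

Routing on the presence of a vector makes the "same embedding path for every document" rule a property of the write function rather than something each caller has to remember.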
Reranking as a quality gate
The optional reranking step changes the nature of retrieval in a subtle but important way. Standard kNN retrieval returns the documents whose vectors are nearest to the query vector under the index's configured distance metric, such as cosine similarity. Geometric closeness is an approximation of relevance, not a direct measure of it.
A cross-encoder reranker computes a relevance score for each query-document pair directly, considering the full text of both together. The result is a reordered candidate set where the top entries are more likely to be genuinely relevant, even if they were not the geometrically nearest in vector space.
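The reordering step itself is small once a pairwise scorer exists. In the sketch below, token_overlap is a toy stand-in for a cross-encoder, used only so the example runs without a model; a real deployment would pass a function that calls the in-cluster or external reranker. All names here are hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def rerank(query: str, candidates: Sequence[str],
           score_pair: Callable[[str, str], float],
           top_n: Optional[int] = None) -> List[Tuple[str, float]]:
    """Score every (query, candidate) pair and return candidates best-first.

    Ties keep their first-stage order because Python's sort is stable.
    """
    scored = [(doc, score_pair(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored if top_n is None else scored[:top_n]


def token_overlap(query: str, doc: str) -> float:
    """Toy scorer: fraction of query tokens present in the document."""
    q_tokens = set(query.lower().split())
    d_tokens = set(doc.lower().split())
    return len(q_tokens & d_tokens) / len(q_tokens) if q_tokens else 0.0


# Candidates in first-stage (vector search) order.
first_stage = ["disk usage alert", "deploy failed on rollback", "weekly summary"]
reranked = rerank("why deploy failed", first_stage, token_overlap)
print([doc for doc, _ in reranked])
```

Here the second candidate moves to the top because the scorer sees the query and document text together, even though it was not the nearest hit in the first-stage order.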
The tradeoff is cost. Reranking every retrieval result for every query is expensive. The profile system allows reranking to be applied selectively: only for high-stakes retrievals in slow mode, or only when retrieval depth is above a threshold. This connects the ML inference configuration to the cognitive economics system: deeper retrieval with reranking is one of the capabilities that slow mode enables, and the budget gate determines when that investment is appropriate.
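A selective reranking gate might look like the sketch below. It combines the two conditions mentioned above (slow mode and a depth threshold) with a simple budget check. The policy fields, the cost units, and the decision to require all conditions at once are assumptions made for illustration; the actual budget gate may weigh them differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RerankPolicy:
    depth_threshold: int = 20    # rerank only when more than this many candidates
    slow_mode_only: bool = True  # skip reranking on the fast path


def should_rerank(policy: RerankPolicy, *, slow_mode: bool, depth: int,
                  budget_remaining: float, rerank_cost: float) -> bool:
    """Decide whether this retrieval should pay for a reranking pass."""
    if policy.slow_mode_only and not slow_mode:
        return False
    if depth <= policy.depth_threshold:
        return False
    # Hypothetical budget gate: only spend what is still available.
    return budget_remaining >= rerank_cost


policy = RerankPolicy()
decision = should_rerank(policy, slow_mode=True, depth=50,
                         budget_remaining=10.0, rerank_cost=4.0)
print(decision)
```

Keeping the decision in one pure function makes it straightforward to test and to tune the thresholds per profile.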