Building RAG Pipelines: Vector Search in Production

Large language models are powerful, but they have a fundamental limitation: they only know what was in their training data. They cannot access your company's internal docs, your product database, or anything that happened after their training cutoff. Retrieval-Augmented Generation (RAG) solves this by giving the LLM a way to search through your data at query time and use what it finds as context for generating answers.

I have built several RAG pipelines this year, from a simple documentation chatbot to a production system that searches through 500,000 technical documents. Here is everything I learned.

How RAG Works at a High Level

The flow is straightforward. A user asks a question. You convert that question into a vector embedding. You search a vector database for document chunks that are semantically similar to the question. You take those chunks and include them as context in your prompt to the LLM. The LLM generates an answer grounded in the retrieved documents.

That is it. The entire architecture is: embed, search, retrieve, generate. The complexity is in getting each step right.

Embedding Models

An embedding model converts text into a dense vector (a list of floating-point numbers, typically a few hundred to a few thousand dimensions). Texts with similar meaning end up close together in this vector space. "How do I reset my password?" and "I forgot my login credentials" would have vectors that are very close, even though they share almost no words.
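
To make "close together" concrete, here is a small sketch of cosine similarity, the measure most commonly used to compare embeddings. The three-dimensional vectors are made up for illustration; real embeddings have hundreds or thousands of dimensions.

```javascript
// Cosine similarity between two equal-length vectors.
// Returns a value in [-1, 1]; higher means the vectors point in a more similar direction.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same number of dimensions");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy vectors standing in for real embeddings (values invented for this example).
const resetPassword = [0.9, 0.1, 0.2];
const forgotLogin = [0.8, 0.2, 0.3];
const pizzaRecipe = [0.1, 0.9, 0.1];

console.log(
  cosineSimilarity(resetPassword, forgotLogin) >
    cosineSimilarity(resetPassword, pizzaRecipe)
); // true
```

The length check also shows why vectors from two different models cannot be compared directly: the dimensions often do not even match, and when they do, the axes mean different things.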

The embedding model you choose matters a lot. I have used three in production:

OpenAI text-embedding-3-small: good quality, cheap (0.02 dollars per million tokens), 1536 dimensions. My default choice for most projects.
OpenAI text-embedding-3-large: better quality, more expensive, 3072 dimensions. Use when retrieval precision really matters.
Sentence Transformers (all-MiniLM-L6-v2): free, runs locally, 384 dimensions. Great for prototyping and privacy-sensitive applications.

A key insight: the embedding model you use for indexing must be the same one you use for queries. You cannot embed your documents with OpenAI and query with Sentence Transformers. The vector spaces are incompatible.

Vector Databases Compared

I have used three vector databases in different projects. Here is my honest comparison:

ChromaDB is my pick for prototyping and small to medium projects. It is open source, runs in-process with Python or as a standalone server, and has an excellent developer experience. Setup is literally three lines of code. The limitation is scale. It works well up to about 1 million documents. Beyond that, query latency starts degrading.

Pinecone is the managed option. No infrastructure to maintain, scales to billions of vectors, and the query latency is consistently fast. The tradeoff is cost (70 dollars per month minimum for a production pod at the time of writing) and vendor lock-in. I use Pinecone when the project needs to scale beyond what ChromaDB handles comfortably.

pgvector is the "use what you already have" option. If you are already running Postgres, pgvector adds vector similarity search as an extension. No new infrastructure, no new service to monitor. The query performance is good enough for most applications. The syntax is familiar SQL. I use this when adding RAG to an existing Postgres-backed application and I want to avoid adding another database to the stack.
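
As a rough idea of what "familiar SQL" looks like here, the sketch below builds a parameterized nearest-neighbor query. The table name (doc_chunks) and the column names (content, source, embedding) are hypothetical; the `<=>` operator is pgvector's cosine distance.

```javascript
// Builds a parameterized pgvector query for the topK nearest chunks.
// Table and column names are hypothetical and only serve the example.
function buildSimilarityQuery(queryEmbedding, topK = 5) {
  // pgvector accepts a vector as a text literal such as "[0.1,0.2,0.3]".
  const vectorLiteral = "[" + queryEmbedding.join(",") + "]";
  return {
    // <=> is cosine distance: smaller values mean more similar vectors.
    text:
      "SELECT content, source, embedding <=> $1::vector AS distance " +
      "FROM doc_chunks " +
      "ORDER BY embedding <=> $1::vector " +
      "LIMIT $2",
    values: [vectorLiteral, topK],
  };
}

const query = buildSimilarityQuery([0.1, 0.2, 0.3], 3);
console.log(query.text);
console.log(query.values);
```

An object of this shape can be handed to a Postgres client such as node-postgres, which takes the SQL text and the parameter values separately.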

Document Chunking Strategies

This is where most RAG pipelines succeed or fail. You usually cannot embed an entire document as one vector. The embedding model has a token limit (from 256 tokens for small local models such as all-MiniLM-L6-v2 up to about 8192 for OpenAI's embedding models), and larger chunks produce less precise embeddings. You need to split documents into chunks.

Fixed-size chunking is the simplest approach. Split every N characters or tokens with some overlap. I typically use 500-token chunks with 50-token overlap. The overlap ensures that concepts spanning a chunk boundary are captured in at least one chunk.
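
A minimal sketch of fixed-size chunking with overlap follows. It assumes whitespace-separated words as a stand-in for model tokens; a real pipeline would count tokens with the embedding model's own tokenizer.

```javascript
// Splits text into chunks of `chunkSize` tokens, where neighbouring chunks
// share `overlap` tokens. Assumption: words approximate tokens.
function chunkFixedSize(text, chunkSize = 500, overlap = 50) {
  if (overlap >= chunkSize) {
    throw new Error("overlap must be smaller than chunkSize");
  }
  const tokens = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + chunkSize).join(" "));
    // Stop once this chunk has reached the end of the text.
    if (start + chunkSize >= tokens.length) break;
  }
  return chunks;
}

// Small sizes to make the overlap visible.
console.log(chunkFixedSize("a b c d e f g h i j", 4, 1));
// [ 'a b c d', 'd e f g', 'g h i j' ]
```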

Recursive character splitting is smarter. It splits on paragraph boundaries first, then sentence boundaries, then character boundaries. This preserves semantic coherence better than fixed-size chunking. LangChain's RecursiveCharacterTextSplitter is the standard implementation.
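
The idea can be sketched in a few lines. This is a simplified illustration, not LangChain's implementation: it measures size in characters, has no overlap, and drops the separator between two chunks. The default of 2000 characters is only a rough guess at 500 tokens of English text.

```javascript
// Separators from coarsest to finest: paragraph, line, sentence, word, character.
const SEPARATORS = ["\n\n", "\n", ". ", " ", ""];

// Packs pieces split on the coarsest separator into chunks of at most
// maxLength characters, recursing with finer separators for oversized pieces.
function recursiveSplit(text, maxLength = 2000, separators = SEPARATORS) {
  if (text.length <= maxLength) return text.trim() ? [text.trim()] : [];

  const [separator, ...finer] = separators;
  // Last resort: hard cut every maxLength characters.
  if (separator === "" || separator === undefined) {
    const pieces = [];
    for (let i = 0; i < text.length; i += maxLength) {
      pieces.push(text.slice(i, i + maxLength));
    }
    return pieces;
  }

  const chunks = [];
  let current = "";
  for (const part of text.split(separator)) {
    const candidate = current ? current + separator + part : part;
    if (candidate.length <= maxLength) {
      current = candidate; // keep packing parts into the current chunk
      continue;
    }
    // The current chunk is full (or is a single oversized part): flush it.
    if (current) chunks.push(...recursiveSplit(current, maxLength, finer));
    current = part;
  }
  if (current) chunks.push(...recursiveSplit(current, maxLength, finer));
  return chunks;
}

console.log(recursiveSplit("Para one is here.\n\nPara two is here.", 20));
// [ 'Para one is here.', 'Para two is here.' ]
```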

Semantic chunking uses the embedding model itself to determine where to split. You embed sliding windows of text and split where the similarity between adjacent windows drops below a threshold. This produces the most semantically coherent chunks but is slower and more expensive to compute.
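
Here is a small sketch of that splitting rule. To keep it self-contained, it takes one precomputed embedding per sentence instead of calling an embedding model, and the threshold of 0.75 is an arbitrary value that would need tuning on real data.

```javascript
// Cosine similarity of two equal-length vectors.
function cosine(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA && normB ? dot / Math.sqrt(normA * normB) : 0;
}

// Groups consecutive sentences into chunks and starts a new chunk wherever
// the similarity between neighbouring embeddings drops below `threshold`.
// embeddings[i] must be the embedding of sentences[i].
function semanticChunks(sentences, embeddings, threshold = 0.75) {
  if (sentences.length !== embeddings.length) {
    throw new Error("Need exactly one embedding per sentence");
  }
  if (sentences.length === 0) return [];
  const chunks = [];
  let current = [sentences[0]];
  for (let i = 1; i < sentences.length; i++) {
    if (cosine(embeddings[i - 1], embeddings[i]) < threshold) {
      chunks.push(current.join(" ")); // topic shift: close the current chunk
      current = [];
    }
    current.push(sentences[i]);
  }
  chunks.push(current.join(" "));
  return chunks;
}

// Toy 2-dimensional embeddings (invented): the third sentence changes topic.
const sentences = ["Reset your password.", "Use the reset link.", "Billing runs monthly."];
const embeddings = [[1, 0], [0.9, 0.1], [0, 1]];
console.log(semanticChunks(sentences, embeddings));
// [ 'Reset your password. Use the reset link.', 'Billing runs monthly.' ]
```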

My default is recursive character splitting with 500-token chunks and 50-token overlap. It is fast, cheap, and good enough for 90% of use cases. With an embedding model that has a shorter input limit, such as all-MiniLM-L6-v2, reduce the chunk size so that each chunk fits. I use semantic chunking only when retrieval precision is critical and the document corpus is small enough that the extra processing cost is acceptable.

A Practical Implementation

Here is a minimal RAG pipeline using ChromaDB and Node.js. It covers indexing, retrieval, and prompt construction; the chunking function and the LLM call are left out:

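// Run this as an ES module (for example a .mjs file) so that import and top-level await work.
// The import and option names below follow the 1.x chromadb JavaScript client; newer releases may differ.
import { ChromaClient, OpenAIEmbeddingFunction } from "chromadb";

// Connects to a Chroma server at the client's default address.
const client = new ChromaClient();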
const embedder = new OpenAIEmbeddingFunction({
  openai_api_key: process.env.OPENAI_API_KEY,
  openai_model: "text-embedding-3-small",
});

// Create or get collection
const collection = await client.getOrCreateCollection({
  name: "docs",
  embeddingFunction: embedder,
});

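// Index documents
// chunkDocument is not defined in this snippet. It is assumed to return an
// array of { id, text, source, title } objects for each document.
async function indexDocuments(documents) {
  const chunks = documents.flatMap((doc) => chunkDocument(doc));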

  await collection.add({
    ids: chunks.map((c) => c.id),
    documents: chunks.map((c) => c.text),
    metadatas: chunks.map((c) => ({
      source: c.source,
      title: c.title,
    })),
  });
}

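// Query: retrieve the topK most similar chunks and build the prompt.
// The caller sends the returned prompt to the LLM of its choice.
async function queryRAG(question, topK = 5) {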
  const results = await collection.query({
    queryTexts: [question],
    nResults: topK,
  });

  const context = results.documents[0].join("\n\n");
  const sources = results.metadatas[0];

  // Build prompt with retrieved context
  const prompt = [
    "Answer the question based on the following context.",
    "If the context does not contain enough information, say so.",
    "",
    "Context:",
    context,
    "",
    "Question: " + question,
  ].join("\n");

  return { prompt, sources };
}

Handling Hallucination with Source Attribution

The biggest risk with RAG is that the LLM still hallucinates. It might generate an answer that sounds authoritative but is not actually supported by the retrieved documents. Two strategies that help:

First, always include a system instruction like "Only answer based on the provided context. If the context does not contain the answer, say you do not know." This does not eliminate hallucination, but it reduces it significantly.

Second, return the source documents alongside the generated answer. Let the user verify. In my implementations, every answer includes clickable references to the original documents. This builds trust and gives users a way to validate the response.
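
The two strategies can be combined in one function, roughly as sketched below. The `retrieve` and `generate` parameters are hypothetical callbacks standing in for your vector search and your LLM client, and the chunk shape ({ text, source, title }) is an assumption. Skipping the LLM call when nothing is retrieved is a design choice of this sketch.

```javascript
// Retrieves context, asks the LLM to answer only from it, and returns the
// answer together with the sources it was based on.
// `retrieve(question)` resolves to [{ text, source, title }, ...] (assumed shape).
// `generate({ system, user })` resolves to the answer text (assumed shape).
async function answerWithSources(question, retrieve, generate) {
  const chunks = await retrieve(question);
  if (chunks.length === 0) {
    // Nothing retrieved: do not let the model improvise.
    return { answer: "I do not know.", sources: [] };
  }

  // Number each chunk so the model can cite it as [1], [2], ...
  const context = chunks.map((c, i) => `[${i + 1}] ${c.text}`).join("\n\n");
  const system =
    "Only answer based on the provided context. " +
    "If the context does not contain the answer, say you do not know. " +
    "Cite the chunks you used as [n].";
  const user = `Context:\n${context}\n\nQuestion: ${question}`;

  const answer = await generate({ system, user });

  // Several chunks can come from the same document: list each source once.
  const bySource = new Map(
    chunks.map((c) => [c.source, { source: c.source, title: c.title }])
  );
  return { answer, sources: [...bySource.values()] };
}
```

The returned sources array is what the user interface would render as clickable references next to the answer.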

Hybrid Search

Pure vector search has a weakness: it can miss exact keyword matches. If a user searches for "error code 4012" and that exact string appears in your docs, vector search might not surface it because the semantic meaning of "error code 4012" is not well-captured by embeddings.

Hybrid search combines vector similarity with keyword matching (BM25). You run both searches in parallel and merge the results using Reciprocal Rank Fusion (RRF). This gives you the semantic understanding of vector search plus the precision of keyword matching.
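
Reciprocal Rank Fusion itself is only a few lines. In the sketch below, each document scores the sum of 1 / (k + rank) over every result list it appears in, with ranks starting at 1. The constant k = 60 is the value commonly used for RRF, and the document ids are invented for the example.

```javascript
// Merges several ranked lists of document ids into one ranking.
// `rankings` is an array of arrays, each ordered from best to worst match.
function reciprocalRankFusion(rankings, k = 60) {
  const scores = new Map();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) || 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id, score]) => ({ id, score }));
}

const vectorHits = ["doc-a", "doc-b", "doc-c"]; // from vector search
const keywordHits = ["doc-c", "doc-a", "doc-d"]; // from BM25
const fused = reciprocalRankFusion([vectorHits, keywordHits]);
console.log(fused.map((r) => r.id));
// [ 'doc-a', 'doc-c', 'doc-b', 'doc-d' ]
```

Documents that appear in both lists rise to the top, which is why RRF works without having to normalize vector distances against BM25 scores.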

Pinecone supports hybrid search natively. With pgvector, you can combine it with Postgres full-text search. With ChromaDB, you would need to implement it yourself using a separate keyword index.

Production Considerations

Latency. A typical RAG query involves embedding the question (50 to 100ms), searching the vector database (20 to 200ms depending on scale), and generating the response (500ms to 3 seconds). With numbers like these, the bottleneck is almost always the LLM generation step. Tuning topK helps there (fewer results means less context to process and therefore faster generation, at the cost of recall), and approximate nearest neighbor search keeps the retrieval step fast as the index grows.

Cost. Embedding costs are negligible for most use cases. The real cost is LLM generation with large contexts. If you retrieve 10 chunks of 500 tokens each, that is 5000 tokens of context on every query. At GPT-4o pricing (2.50 dollars per million input tokens at the time of writing), that is roughly 0.0125 dollars per query just for the context. Scale that to 100,000 queries per month and you are looking at 1,250 dollars per month in context costs alone. Tune your topK and chunk size to keep context lean.
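
The arithmetic is easy to wrap in a helper so you can try other settings. The price of 2.50 dollars per million input tokens is the rate implied by the per-query figure above; check current pricing before relying on it.

```javascript
// Estimates what the retrieved context alone costs per query and per month.
function contextCost({ chunksPerQuery, tokensPerChunk, queriesPerMonth, dollarsPerMillionInputTokens }) {
  const tokensPerQuery = chunksPerQuery * tokensPerChunk;
  const costPerQuery = (tokensPerQuery / 1_000_000) * dollarsPerMillionInputTokens;
  return {
    tokensPerQuery,
    costPerQuery,
    monthlyCost: costPerQuery * queriesPerMonth,
  };
}

const estimate = contextCost({
  chunksPerQuery: 10,
  tokensPerChunk: 500,
  queriesPerMonth: 100_000,
  dollarsPerMillionInputTokens: 2.5, // assumed price, see above
});
console.log(estimate.tokensPerQuery); // 5000
console.log(estimate.costPerQuery.toFixed(4)); // "0.0125"
console.log(estimate.monthlyCost.toFixed(2)); // "1250.00"
```

Halving topK to 5 halves the context to 2500 tokens per query and the monthly figure to about 625 dollars.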

Refresh strategies. Your documents change. You need a strategy for keeping the vector database in sync. I use a simple approach: hash each document, store the hash as metadata, and run a nightly job that re-indexes any documents whose hash has changed. For real-time requirements, use a change data capture pipeline that triggers re-embedding on document updates.
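
The hash comparison behind such a nightly job could look roughly like this. The document shape ({ id, text }) and the Map of stored hashes are assumptions made for the sketch; in practice the stored hashes would come from the metadata in the vector database.

```javascript
// Returns the SHA-256 hex digest of a string.
// The dynamic import keeps the snippet usable from both CommonJS and ES modules.
async function sha256(text) {
  const { createHash } = await import("node:crypto");
  return createHash("sha256").update(text, "utf8").digest("hex");
}

// Compares the current documents with the hashes stored at the last indexing
// run and returns the ones that are new or have changed.
// `documents` is [{ id, text }]; `storedHashes` is a Map of id -> hash.
async function findDocumentsToReindex(documents, storedHashes) {
  const toReindex = [];
  for (const doc of documents) {
    const hash = await sha256(doc.text);
    if (storedHashes.get(doc.id) !== hash) {
      // Keep the new hash so it can be written back as metadata after re-embedding.
      toReindex.push({ id: doc.id, hash });
    }
  }
  return toReindex;
}
```

Only the documents returned here need to be re-chunked and re-embedded, which keeps the nightly run cheap when most of the corpus is unchanged.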

RAG is not a magic bullet. Getting it to production quality requires careful tuning of chunk sizes, embedding models, retrieval parameters, and prompt engineering. But when it works well, it transforms an LLM from a general-purpose text generator into a knowledgeable assistant that actually knows your data. That is a powerful capability, and it is only getting easier to build as the tooling matures.
