Vector Databases in Production: pgvector, Semantic Search, and When It Actually Matters

Embeddings and vector search are the engine behind every serious AI feature in 2025: semantic search that understands intent rather than keywords, RAG systems that ground LLM responses in your data, recommendation engines that find "similar items" without hand-crafted rules, and duplicate detection that catches near-identical content.

Before choosing infrastructure, you need to know when vector search actually outperforms keyword search - because "keywords vs. semantics" is not a binary. Most production systems that work well use both.

When semantic search genuinely wins

Keyword search (BM25, full-text) wins when:

Users know the exact terminology (internal tools, technical documentation)
Queries are short and specific ("iPhone 14 case black")
Typo correction and fuzzy matching cover most misses
You can control the vocabulary (product catalog with known attributes)

Vector search wins when:

Users describe what they want in natural language ("something for back pain when sitting at a desk")
Synonyms and paraphrases matter ("couch" vs "sofa", "heart attack" vs "myocardial infarction")
Cross-lingual search (query in Russian, find English documents)
Conceptual similarity matters ("articles about managing remote teams" finds leadership content, not just articles containing "remote")
Users phrase queries in new ways that share no keywords with the relevant documents

The honest answer for most apps: hybrid search. Vector search for conceptual relevance, keyword search for exact matches and terminology, combined via Reciprocal Rank Fusion (RRF). Start with keyword. Add vector when you see evidence that conceptual retrieval would improve results.

pgvector: the right choice for most teams

Unless you have more than roughly 10 million vectors or need sub-10ms latency at high query volume, pgvector in PostgreSQL is a better fit than a dedicated vector database for most production applications.

Advantages:

Same database you already have - no new infrastructure
ACID transactions across vector and relational data
SQL joins between vectors and your business data
Existing backup, monitoring, and ops practices apply
Free (not an API cost per query)

-- Enable pgvector
CREATE EXTENSION IF NOT EXISTS vector;

-- Table with embedding column
CREATE TABLE documents (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  content TEXT NOT NULL,
  metadata JSONB,
  embedding vector(1536), -- OpenAI text-embedding-3-small = 1536 dimensions
  created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Index for fast similarity search
-- HNSW: faster queries, but more memory and slower to build; IVFFlat: less memory and faster to build, but slower queries at the same recall
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- Semantic search query
SELECT 
  id, 
  content,
  metadata,
  1 - (embedding <=> $1) AS similarity
FROM documents
WHERE 1 - (embedding <=> $1) > 0.75  -- Minimum similarity threshold
ORDER BY embedding <=> $1              -- <=> is cosine distance operator
LIMIT 10;
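
The query above depends on `<=>` returning cosine distance, which is why similarity is written as `1 - distance`. As a rough illustration of that arithmetic, here is a minimal JavaScript sketch. The function names are illustrative only; pgvector computes this inside PostgreSQL.

```javascript
// Illustrative helpers, not part of pgvector.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same number of dimensions');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Mirrors what <=> returns: 0 for the same direction, 1 for orthogonal, 2 for opposite.
function cosineDistance(a, b) {
  return 1 - cosineSimilarity(a, b);
}
```

Under this definition, a similarity threshold of 0.75 is the same condition as a cosine distance below 0.25.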

Generating embeddings

import OpenAI from 'openai';

const openai = new OpenAI();

// Single document embedding
async function embedText(text) {
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small', // 1536 dimensions, good quality, cheap
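    // model: 'text-embedding-3-large', // 3072 dimensions, better for complex domains
    //   (pgvector indexes vector columns up to 2,000 dimensions, so pair it with halfvec or the dimensions parameter)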
    input: text,
    encoding_format: 'float',
  });
  return response.data[0].embedding;
}

// Batch embedding (one request instead of many; token cost is the same, network overhead is far lower)
async function embedBatch(texts) {
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: texts, // Up to 2048 inputs per request
  });
  return response.data.map(d => d.embedding);
}

// Index a document
async function indexDocument(content, metadata) {
  const embedding = await embedText(content);
  
  await db.query(
    'INSERT INTO documents (content, metadata, embedding) VALUES ($1, $2, $3)',
    [content, JSON.stringify(metadata), JSON.stringify(embedding)]
  );
}

// Search
async function semanticSearch(query, limit = 10, threshold = 0.75) {
  const queryEmbedding = await embedText(query);
  
  const results = await db.query(`
    SELECT id, content, metadata, 1 - (embedding <=> $1) AS similarity
    FROM documents
    WHERE 1 - (embedding <=> $1) > $2
    ORDER BY embedding <=> $1
    LIMIT $3
  `, [JSON.stringify(queryEmbedding), threshold, limit]);
  
  return results.rows;
}
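
Two gaps in the functions above are worth guarding against: an embedding whose length does not match the `vector(1536)` column, and input lists longer than the 2048-item request limit. The sketch below shows one possible way to handle both. `toVectorLiteral` and `embedInBatches` are hypothetical helpers, and `embedFn` is assumed to behave like `embedBatch` (texts in, embeddings out, same order).

```javascript
// Hypothetical helper: serialize an embedding into pgvector's text format, '[0.1,0.2,...]'.
// expectedDims defaults to 1536 to match the vector(1536) column.
function toVectorLiteral(embedding, expectedDims = 1536) {
  if (!Array.isArray(embedding) || embedding.length !== expectedDims) {
    throw new Error(`Expected an array of ${expectedDims} numbers`);
  }
  if (!embedding.every(Number.isFinite)) {
    throw new Error('Embedding contains a non-finite value');
  }
  return `[${embedding.join(',')}]`;
}

// Hypothetical helper: embed any number of texts by calling embedFn once per batch.
async function embedInBatches(texts, embedFn, batchSize = 2048) {
  const embeddings = [];
  for (let start = 0; start < texts.length; start += batchSize) {
    const batch = texts.slice(start, start + batchSize);
    embeddings.push(...(await embedFn(batch)));
  }
  return embeddings;
}
```

Failing early in application code tends to give a clearer error than a rejected INSERT, and batching keeps a large backfill within the request limit.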

Hybrid search: the production pattern

Pure vector search often loses to hybrid search. Users who search for "Python tutorial 2024" expect Python results, not JavaScript results about "programming for beginners 2024" (which might have a high semantic similarity).

async function hybridSearch(query, limit = 10) {
  const queryEmbedding = await embedText(query);
  
  // Run both searches in parallel
  const [vectorResults, keywordResults] = await Promise.all([
    // Vector search
    db.query(`
      SELECT id, content, 1 - (embedding <=> $1) AS score, 'vector' AS source
      FROM documents
      WHERE 1 - (embedding <=> $1) > 0.6
      ORDER BY embedding <=> $1
      LIMIT 20
    `, [JSON.stringify(queryEmbedding)]),
    
    // Keyword search (PostgreSQL full-text)
    db.query(`
      SELECT id, content, ts_rank(to_tsvector('english', content), plainto_tsquery('english', $1)) AS score, 'keyword' AS source
      FROM documents
      WHERE to_tsvector('english', content) @@ plainto_tsquery('english', $1)
      ORDER BY score DESC
      LIMIT 20
    `, [query]),
  ]);
  
  // Reciprocal Rank Fusion (RRF) to merge results
  // Pass k explicitly so that limit lands in the topN parameter, not in k
  return reciprocalRankFusion([vectorResults.rows, keywordResults.rows], 60, limit);
}

function reciprocalRankFusion(resultSets, k = 60, topN = 10) {
  const scores = new Map();
  
  for (const results of resultSets) {
    results.forEach((doc, rank) => {
      const rrfScore = 1 / (k + rank + 1);
      scores.set(doc.id, (scores.get(doc.id) || 0) + rrfScore);
    });
  }
  
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .slice(0, topN)
    .map(([id]) => id);
}
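
`reciprocalRankFusion` returns only ids, so the caller has to look the rows up again. If that is inconvenient, a variant can carry each row along with its fused score. The sketch below is one possible shape. `fuseWithDocs` is a hypothetical name, and it assumes every row has `id` and `content` fields, as in the queries above.

```javascript
// Hypothetical variant of RRF that returns rows instead of bare ids.
function fuseWithDocs(resultSets, k = 60, topN = 10) {
  const fused = new Map(); // id -> { doc, score }
  for (const results of resultSets) {
    results.forEach((doc, rank) => {
      const entry = fused.get(doc.id) || { doc, score: 0 };
      // rank is 0-based, so the top hit of each list contributes 1 / (k + 1)
      entry.score += 1 / (k + rank + 1);
      fused.set(doc.id, entry);
    });
  }
  return Array.from(fused.values())
    .sort((a, b) => b.score - a.score)
    .slice(0, topN)
    .map(({ doc, score }) => ({ id: doc.id, content: doc.content, rrfScore: score }));
}
```

With `k = 60`, a document ranked second in one list and first in the other scores 1/62 + 1/61, which beats a document that is first in only one list (1/61). That is the behavior hybrid search relies on: agreement between the two retrievers outweighs a single top rank.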

RAG implementation with pgvector

The most common use case for vector search in 2025: grounding LLM responses in your own content.

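import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

async function ragQuery(userQuestion, systemContext = '') {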
  // 1. Embed the user's question
  const questionEmbedding = await embedText(userQuestion);
  
  // 2. Find relevant documents from your knowledge base
  const relevantDocs = await db.query(`
    SELECT content, metadata, 1 - (embedding <=> $1) AS similarity
    FROM knowledge_base
    WHERE 1 - (embedding <=> $1) > 0.7
    ORDER BY embedding <=> $1
    LIMIT 5
  `, [JSON.stringify(questionEmbedding)]);
  
  if (relevantDocs.rows.length === 0) {
    // No relevant context found - answer without RAG or say "I don't know"
    return fallbackResponse(userQuestion);
  }
  
  // 3. Build context from retrieved documents
  const context = relevantDocs.rows
    .map((doc, i) => `[Source ${i + 1}]: ${doc.content}`)
    .join('\n\n');
  
  // 4. Call LLM with context
  const response = await anthropic.messages.create({
    model: 'claude-sonnet-4-6',
    max_tokens: 1024,
    system: `${systemContext}\n\nAnswer questions based ONLY on the provided context. 
             If the answer is not in the context, say so. 
             Cite sources using [Source N] notation.`,
    messages: [{
      role: 'user',
      content: `Context:\n${context}\n\nQuestion: ${userQuestion}`,
    }],
  });
  
  return {
    answer: response.content[0].text,
    sources: relevantDocs.rows.map(d => d.metadata),
  };
}
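
`ragQuery` returns the metadata of every retrieved chunk, whether or not the model cited it. Because the prompt asks for `[Source N]` citations, the answer text can be parsed to keep only the cited sources. The following is a minimal sketch; `citedSources` is a hypothetical helper and assumes the model follows the `[Source N]` notation exactly.

```javascript
// Hypothetical helper: return metadata only for the sources the answer actually cites.
function citedSources(answer, docs) {
  const cited = new Set();
  for (const match of answer.matchAll(/\[Source (\d+)\]/g)) {
    const index = Number(match[1]) - 1; // [Source 1] refers to docs[0]
    if (index >= 0 && index < docs.length) cited.add(index);
  }
  return [...cited].sort((a, b) => a - b).map(i => docs[i].metadata);
}
```

Out-of-range citations are ignored rather than treated as errors, since a model may occasionally cite a source number that was never provided.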

Chunking strategy (critical for quality)

How you split documents into chunks dramatically affects retrieval quality.

function chunkDocument(text, options = {}) {
  const { 
    chunkSize = 512,      // tokens per chunk
    chunkOverlap = 50,    // overlap between chunks (preserves context at boundaries)
  } = options;
  
  // Simple: split by sentence, respect chunk size
  // The trailing |$ keeps a final sentence that has no closing punctuation
  const sentences = text.match(/[^.!?]+(?:[.!?]+|$)/g) || [text];
  const chunks = [];
  let currentChunk = '';
  let currentSize = 0;
  
  for (const sentence of sentences) {
    const sentenceSize = estimateTokens(sentence);
    
    if (currentSize + sentenceSize > chunkSize && currentChunk) {
      chunks.push(currentChunk.trim());
      // Carry the last chunkOverlap tokens into the next chunk
      const overlapText = getLastNTokens(currentChunk, chunkOverlap);
      currentChunk = overlapText + ' ' + sentence;
      currentSize = estimateTokens(currentChunk);
    } else {
      currentChunk += ' ' + sentence;
      currentSize += sentenceSize;
    }
  }
  
  if (currentChunk.trim()) chunks.push(currentChunk.trim());
  
  return chunks;
}

// Chunk entire document and index all chunks
async function indexDocumentChunked(document) {
  const chunks = chunkDocument(document.content);
  const embeddings = await embedBatch(chunks);
  
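  // Run every statement on one connection: BEGIN/COMMIT sent through a pool
  // can land on different connections. Assumes db is a node-postgres Pool.
  const client = await db.connect();
  try {
    await client.query('BEGIN');
    for (let i = 0; i < chunks.length; i++) {
      await client.query(
        'INSERT INTO knowledge_base (content, embedding, metadata) VALUES ($1, $2, $3)',
        [
          chunks[i],
          JSON.stringify(embeddings[i]),
          JSON.stringify({ 
            sourceId: document.id,
            chunkIndex: i,
            totalChunks: chunks.length,
            title: document.title,
          }),
        ]
      );
    }
    await client.query('COMMIT');
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  } finally {
    client.release();
  }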
}
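
`chunkDocument` calls `estimateTokens` and `getLastNTokens`, which are not defined in the snippet above. The sketch below fills them in under a loud assumption: roughly 4 characters per token, a common rule of thumb for English text. It is not an exact tokenizer, and a real one would give more accurate chunk sizes.

```javascript
const CHARS_PER_TOKEN = 4; // rough heuristic for English text (assumption), not a real tokenizer

function estimateTokens(text) {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Return roughly the last n tokens of text, starting on a word boundary.
function getLastNTokens(text, n) {
  const maxChars = n * CHARS_PER_TOKEN;
  if (text.length <= maxChars) return text.trim();
  const tail = text.slice(-maxChars);
  const firstSpace = tail.indexOf(' ');
  // Drop the first (possibly partial) word so the overlap never starts mid-word
  return (firstSpace === -1 ? tail : tail.slice(firstSpace + 1)).trim();
}
```

Because the estimate is approximate, it is safer to leave some headroom below the embedding model's input limit than to chunk right up to it.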

Chunk size tips:

Too large (> 1000 tokens): the query embedding represents one specific question while the chunk embedding averages over a broad topic, so matching is less precise.
Too small (< 100 tokens): chunks lack context, so results are fragments that are hard to use.
512 tokens with 50-token overlap: good default for most document types.
Structured documents (FAQs, spec sheets): chunk by logical section (question+answer) rather than token count.
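
For the structured-document case, chunking can follow the document's own boundaries instead of a token budget. The sketch below assumes a plain-text FAQ in which every question line starts with `Q:`. Both that input format and the `chunkFaq` name are assumptions made for illustration.

```javascript
// Hypothetical helper: one chunk per question+answer pair.
function chunkFaq(text) {
  const chunks = [];
  let current = [];
  const flush = () => {
    const chunk = current.join('\n').trim();
    if (chunk) chunks.push(chunk);
    current = [];
  };
  for (const line of text.split('\n')) {
    // A new "Q:" line closes the previous question+answer pair
    if (line.trimStart().startsWith('Q:')) flush();
    current.push(line);
  }
  flush();
  return chunks;
}
```

Each chunk then embeds a complete question with its answer, so a user query that resembles the question retrieves the answer along with it.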

When to use a dedicated vector database

Choose Pinecone, Weaviate, or Qdrant instead of pgvector when you have:

More than 10 million vectors (pgvector HNSW index gets slow to build above this)
A sub-10ms p99 latency requirement at high QPS
Multi-tenant workloads that need isolated namespaces per customer
Real-time vector updates at very high write throughput

For most production apps at startup/scale-up stage: pgvector is sufficient, free, and reduces operational complexity. Migrate when you have evidence you've hit its limits.

// The migration path is straightforward - same embedding logic
// Just change the storage/query layer

// pgvector query (serialize the embedding to pgvector's '[...]' text format)
const pgResults = await postgres.query('SELECT ... ORDER BY embedding <=> $1 LIMIT 10', [JSON.stringify(embedding)]);

// Pinecone query (same embeddings, different storage)
const pineconeResults = await pinecone.index('my-index').query({
  vector: embedding,
  topK: 10,
  includeMetadata: true,
});

Keep your embedding generation logic separate from your storage layer, so that a migration only touches the storage and query code.
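
One way to get that separation is to have application code depend on a small store interface, with one adapter per backend. The sketch below is illustrative only. `createPgVectorStore`, `createPineconeStore`, and `search` are hypothetical names, and the Pinecone adapter assumes the chunk text was stored under a `content` metadata field.

```javascript
// Hypothetical adapter around a node-postgres style client: db.query(sql, params) -> { rows }
function createPgVectorStore(db) {
  return {
    async query(embedding, topK) {
      const result = await db.query(
        'SELECT id, content, 1 - (embedding <=> $1) AS score FROM documents ORDER BY embedding <=> $1 LIMIT $2',
        [JSON.stringify(embedding), topK]
      );
      return result.rows;
    },
  };
}

// Hypothetical adapter around a Pinecone index object, using the query call shown earlier
function createPineconeStore(index) {
  return {
    async query(embedding, topK) {
      const result = await index.query({ vector: embedding, topK, includeMetadata: true });
      // Assumes the chunk text was written to metadata.content at upsert time
      return result.matches.map(m => ({ id: m.id, content: m.metadata?.content, score: m.score }));
    },
  };
}

// Application code only knows about embed() and store.query()
async function search(queryText, { embed, store, topK = 10 }) {
  const embedding = await embed(queryText);
  return store.query(embedding, topK);
}
```

Switching backends then means constructing a different store, for example `search(q, { embed: embedText, store: createPineconeStore(pinecone.index('my-index')) })`, while the embedding code and the callers of `search` stay unchanged.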

Aunimeda builds AI-powered solutions - chatbots, AI agents, voice assistants, and automation systems for businesses.
