AI & Machine Learning | December 8, 2025 | 8 min read | Updated: May 18, 2026

Vector Databases in Production: pgvector, Pinecone, and When Semantic Search Actually Matters

Aunimeda


Embeddings and vector search are the engine behind every serious AI feature in 2025: semantic search that understands intent rather than keywords, RAG systems that ground LLM responses in your data, recommendation engines that find "similar items" without hand-crafted rules, and duplicate detection that catches near-identical content.

Before choosing infrastructure, you need to know when vector search actually outperforms keyword search - because "keywords vs. semantics" is not a binary. Most production systems that work well use both.


When semantic search genuinely wins

Keyword search (BM25, full-text) wins when:

  • Users know the exact terminology (internal tools, technical documentation)
  • Queries are short and specific ("iPhone 14 case black")
  • Typo correction and fuzzy matching cover most misses
  • You can control the vocabulary (product catalog with known attributes)

Vector search wins when:

  • Users describe what they want in natural language ("something for back pain when sitting at a desk")
  • Synonyms and paraphrases matter ("couch" vs "sofa", "heart attack" vs "myocardial infarction")
  • Cross-lingual search (query in Russian, find English documents)
  • Conceptual similarity matters ("articles about managing remote teams" finds leadership content, not just articles containing "remote")
  • Queries use wording that appears nowhere in your documents, so keyword matching returns nothing

The honest answer for most apps: hybrid search. Vector search for conceptual relevance, keyword search for exact matches and terminology, combined via Reciprocal Rank Fusion (RRF). Start with keyword. Add vector when you see evidence that conceptual retrieval would improve results.


pgvector: the right choice for most teams

Unless you have more than about 10 million vectors or need sub-10ms latency at high query volume, pgvector in PostgreSQL beats a dedicated vector database for most production applications.

Advantages:

  • Same database you already have - no new infrastructure
  • ACID transactions across vector and relational data
  • SQL joins between vectors and your business data
  • Existing backup, monitoring, and ops practices apply
  • Free (no per-query API cost)

-- Enable pgvector
CREATE EXTENSION IF NOT EXISTS vector;

-- Table with embedding column
CREATE TABLE documents (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  content TEXT NOT NULL,
  metadata JSONB,
  embedding vector(1536), -- OpenAI text-embedding-3-small = 1536 dimensions
  created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Index for fast similarity search
-- HNSW: better query speed and recall, but more memory and slower to build; IVFFlat: less memory, faster to build
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- Semantic search query
SELECT 
  id, 
  content,
  metadata,
  1 - (embedding <=> $1) AS similarity
FROM documents
WHERE 1 - (embedding <=> $1) > 0.75  -- Minimum similarity threshold
ORDER BY embedding <=> $1              -- <=> is cosine distance operator
LIMIT 10;
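
To make the `1 - (embedding <=> $1)` expression concrete, here is a minimal sketch in plain JavaScript of what cosine similarity and cosine distance compute. The function names are illustrative and not part of pgvector; in production the database does this work for you.

```javascript
// Cosine similarity: dot(a, b) / (|a| * |b|). Illustrative only;
// pgvector computes this inside PostgreSQL.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same number of dimensions');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Cosine distance is what the <=> operator returns: 1 - similarity.
function cosineDistance(a, b) {
  return 1 - cosineSimilarity(a, b);
}

const same = cosineSimilarity([1, 2, 3], [2, 4, 6]); // same direction, close to 1
const orthogonal = cosineSimilarity([1, 0], [0, 1]); // unrelated, 0
const opposite = cosineSimilarity([1, 0], [-1, 0]);  // opposite direction, -1
```

Because distance is 1 minus similarity, the 0.75 similarity threshold in the query above is the same as requiring a cosine distance below 0.25.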

Generating embeddings

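import OpenAI from 'openai';
import pg from 'pg';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Assumption: `db` is a node-postgres client configured through the PG* environment variables
const db = new pg.Client();
await db.connect();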

// Single document embedding
async function embedText(text) {
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small', // 1536 dimensions, good quality, cheap
    // model: 'text-embedding-3-large', // 3072 dimensions, better for complex domains
    input: text,
    encoding_format: 'float',
  });
  return response.data[0].embedding;
}

// Batch embedding (one request for many inputs: far fewer round trips; the per-token cost is unchanged)
async function embedBatch(texts) {
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: texts, // Up to 2048 inputs per request
  });
  return response.data.map(d => d.embedding);
}

// Index a document
async function indexDocument(content, metadata) {
  const embedding = await embedText(content);
  
  await db.query(
    'INSERT INTO documents (content, metadata, embedding) VALUES ($1, $2, $3)',
    [content, JSON.stringify(metadata), JSON.stringify(embedding)]
  );
}

// Search
async function semanticSearch(query, limit = 10, threshold = 0.75) {
  const queryEmbedding = await embedText(query);
  
  const results = await db.query(`
    SELECT id, content, metadata, 1 - (embedding <=> $1) AS similarity
    FROM documents
    WHERE 1 - (embedding <=> $1) > $2
    ORDER BY embedding <=> $1
    LIMIT $3
  `, [JSON.stringify(queryEmbedding), threshold, limit]);
  
  return results.rows;
}
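
The comment in `embedBatch` notes a limit of 2048 inputs per request, so a larger corpus has to be split before it is sent. The following is a minimal sketch of that splitting, not a definitive implementation. `toBatches` and `embedAll` are hypothetical helper names, and `embedFn` is assumed to behave like `embedBatch` above (texts in, one vector per text out, in the same order). Providers may also enforce per-request token limits, which this sketch does not handle.

```javascript
// Split an array into consecutive batches of at most `size` items.
function toBatches(items, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new Error('Batch size must be a positive integer');
  }
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Embed any number of texts by calling `embedFn` once per batch.
// Assumption: `embedFn` returns one vector per input, in input order.
async function embedAll(texts, embedFn, batchSize = 2048) {
  const embeddings = [];
  for (const batch of toBatches(texts, batchSize)) {
    const vectors = await embedFn(batch);
    if (vectors.length !== batch.length) {
      throw new Error('Embedding count does not match input count');
    }
    embeddings.push(...vectors);
  }
  return embeddings;
}
```

With the functions from this article, usage would look like `const vectors = await embedAll(chunks, embedBatch);`. Batches run sequentially here to stay clear of rate limits; running them in parallel is a trade-off you can make once you know your quota.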

Hybrid search: the production pattern

Pure vector search often loses to hybrid search. Users who search for "Python tutorial 2024" expect Python results, not JavaScript results about "programming for beginners 2024" (which might have a high semantic similarity).

async function hybridSearch(query, limit = 10) {
  const queryEmbedding = await embedText(query);
  
  // Run both searches in parallel
  const [vectorResults, keywordResults] = await Promise.all([
    // Vector search
    db.query(`
      SELECT id, content, 1 - (embedding <=> $1) AS score, 'vector' AS source
      FROM documents
      WHERE 1 - (embedding <=> $1) > 0.6
      ORDER BY embedding <=> $1
      LIMIT 20
    `, [JSON.stringify(queryEmbedding)]),
    
    // Keyword search (PostgreSQL full-text)
    db.query(`
      SELECT id, content, ts_rank(to_tsvector('english', content), plainto_tsquery('english', $1)) AS score, 'keyword' AS source
      FROM documents
      WHERE to_tsvector('english', content) @@ plainto_tsquery('english', $1)
      ORDER BY score DESC
      LIMIT 20
    `, [query]),
  ]);
  
  // Reciprocal Rank Fusion (RRF) to merge results.
  // Pass k explicitly so that `limit` lands in the topN parameter, not in k.
  return reciprocalRankFusion([vectorResults.rows, keywordResults.rows], 60, limit);
}

function reciprocalRankFusion(resultSets, k = 60, topN = 10) {
  const scores = new Map();
  
  for (const results of resultSets) {
    results.forEach((doc, rank) => {
      const rrfScore = 1 / (k + rank + 1);
      scores.set(doc.id, (scores.get(doc.id) || 0) + rrfScore);
    });
  }
  
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .slice(0, topN)
    .map(([id]) => id);
}
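
A small worked example shows why RRF favors documents that appear in both result lists. `fuseWithScores` is a hypothetical variant of the function above that keeps the fused score so it can be inspected; the document ids are made up.

```javascript
// Variant of reciprocal rank fusion that keeps the fused score for inspection.
function fuseWithScores(resultSets, k = 60) {
  const scores = new Map();
  for (const results of resultSets) {
    results.forEach((doc, rank) => {
      // rank is 0-based, so the top hit of each list contributes 1 / (k + 1)
      scores.set(doc.id, (scores.get(doc.id) || 0) + 1 / (k + rank + 1));
    });
  }
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([id, score]) => ({ id, score }));
}

const vectorHits = [{ id: 'a' }, { id: 'b' }, { id: 'c' }];
const keywordHits = [{ id: 'b' }, { id: 'd' }];

const fused = fuseWithScores([vectorHits, keywordHits]);
console.log(fused.map((r) => r.id)); // [ 'b', 'a', 'd', 'c' ]
```

Document `b` is second in the vector list and first in the keyword list, so it scores 1/62 + 1/61 and outranks `a`, which is first in only one list and scores 1/61.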

RAG implementation with pgvector

The most common use case for vector search in 2025: grounding LLM responses in your own content.

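import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

async function ragQuery(userQuestion, systemContext = '') {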
  // 1. Embed the user's question
  const questionEmbedding = await embedText(userQuestion);
  
  // 2. Find relevant documents from your knowledge base
  const relevantDocs = await db.query(`
    SELECT content, metadata, 1 - (embedding <=> $1) AS similarity
    FROM knowledge_base
    WHERE 1 - (embedding <=> $1) > 0.7
    ORDER BY embedding <=> $1
    LIMIT 5
  `, [JSON.stringify(questionEmbedding)]);
  
  if (relevantDocs.rows.length === 0) {
    // No relevant context found - answer without RAG or say "I don't know".
    // fallbackResponse is an application-specific helper that you need to supply.
    return fallbackResponse(userQuestion);
  }
  
  // 3. Build context from retrieved documents
  const context = relevantDocs.rows
    .map((doc, i) => `[Source ${i + 1}]: ${doc.content}`)
    .join('\n\n');
  
  // 4. Call LLM with context
  const response = await anthropic.messages.create({
    model: 'claude-sonnet-4-6',
    max_tokens: 1024,
    system: `${systemContext}\n\nAnswer questions based ONLY on the provided context. 
             If the answer is not in the context, say so. 
             Cite sources using [Source N] notation.`,
    messages: [{
      role: 'user',
      content: `Context:\n${context}\n\nQuestion: ${userQuestion}`,
    }],
  });
  
  return {
    answer: response.content[0].text,
    sources: relevantDocs.rows.map(d => d.metadata),
  };
}
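
Step 3 above concatenates every retrieved chunk, which can overrun your prompt budget when chunks are large. One possible guard is sketched below: a hypothetical `buildContext` helper that adds sources in ranked order and stops before a character budget is exceeded. The `maxChars` default is an arbitrary illustration, not a recommendation from the article.

```javascript
// Build a numbered context string from ranked docs, stopping before the
// character budget is exceeded. `maxChars` is an illustrative default.
function buildContext(docs, maxChars = 8000) {
  const parts = [];
  let used = 0;
  for (const doc of docs) {
    const part = `[Source ${parts.length + 1}]: ${doc.content}`;
    // Every part after the first also costs 2 characters for the "\n\n" separator.
    const cost = part.length + (parts.length > 0 ? 2 : 0);
    if (used + cost > maxChars) break;
    parts.push(part);
    used += cost;
  }
  return { context: parts.join('\n\n'), usedDocs: parts.length };
}
```

If you adopt something like this, slice the returned `sources` to `usedDocs` as well, so the `[Source N]` citations in the answer always point at documents the model actually saw.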

Chunking strategy (critical for quality)

How you split documents into chunks dramatically affects retrieval quality.

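// estimateTokens() and getLastNTokens() are helpers you need to supply (for example, backed by a tokenizer)
function chunkDocument(text, options = {}) {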
  const { 
    chunkSize = 512,      // tokens per chunk
    chunkOverlap = 50,    // overlap between chunks (preserves context at boundaries)
  } = options;
  
  // Simple: split by sentence, respect chunk size.
  // The second alternative keeps trailing text that lacks end punctuation.
  const sentences = text.match(/[^.!?]+[.!?]+|[^.!?]+$/g) || [text];
  const chunks = [];
  let currentChunk = '';
  let currentSize = 0;
  
  for (const sentence of sentences) {
    const sentenceSize = estimateTokens(sentence);
    
    if (currentSize + sentenceSize > chunkSize && currentChunk) {
      chunks.push(currentChunk.trim());
      // Carry the tail of the previous chunk forward as overlap
      const overlapText = getLastNTokens(currentChunk, chunkOverlap);
      currentChunk = overlapText + ' ' + sentence;
      currentSize = estimateTokens(currentChunk);
    } else {
      currentChunk += ' ' + sentence;
      currentSize += sentenceSize;
    }
  }
  
  if (currentChunk.trim()) chunks.push(currentChunk.trim());
  
  return chunks;
}

// Chunk entire document and index all chunks
async function indexDocumentChunked(document) {
  const chunks = chunkDocument(document.content);
  const embeddings = await embedBatch(chunks);
  
  // BEGIN/COMMIT must run on one connection; with a pg Pool, check out a single client first
  await db.query('BEGIN');
  try {
    for (let i = 0; i < chunks.length; i++) {
      await db.query(
        'INSERT INTO knowledge_base (content, embedding, metadata) VALUES ($1, $2, $3)',
        [
          chunks[i],
          JSON.stringify(embeddings[i]),
          JSON.stringify({ 
            sourceId: document.id,
            chunkIndex: i,
            totalChunks: chunks.length,
            title: document.title,
          }),
        ]
      );
    }
    await db.query('COMMIT');
  } catch (err) {
    await db.query('ROLLBACK');
    throw err;
  }
}
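
The `chunkDocument` function above calls `estimateTokens` and `getLastNTokens` without defining them. Below is one minimal way to fill them in, assuming roughly 4 characters per token for English text. That ratio is a rough heuristic and an assumption of this sketch; a real tokenizer will be more accurate.

```javascript
// Rough token estimate: about 4 characters per token for English text.
// This ratio is an assumption; use a real tokenizer when accuracy matters.
const CHARS_PER_TOKEN = 4;

function estimateTokens(text) {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Return the trailing words of `text` whose estimated size fits in `n` tokens.
// Works on whole words so the overlap never starts mid-word.
function getLastNTokens(text, n) {
  const words = text.trim().split(/\s+/).filter(Boolean);
  const kept = [];
  let size = 0;
  for (let i = words.length - 1; i >= 0; i--) {
    const wordSize = estimateTokens(words[i] + ' ');
    if (size + wordSize > n) break;
    kept.unshift(words[i]);
    size += wordSize;
  }
  return kept.join(' ');
}
```

Because the estimate rounds up per word, the overlap tends to come out slightly smaller than `n` real tokens, which is the safer direction for staying under the chunk size.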

Chunk size tips:

  • Too large (> 1000 tokens): the query embedding represents a question; the chunk embedding represents a broad topic. Less precise matching.
  • Too small (< 100 tokens): chunks lack context, results are fragment-level, hard to use.
  • 512 tokens with 50-token overlap: good default for most document types.
  • Structured documents (FAQs, spec sheets): chunk by logical section (question+answer) rather than token count.
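
For the structured-document case in the last tip, here is a minimal sketch of chunking an FAQ by question+answer pair instead of by token count. It assumes each entry starts with a line beginning `Q:`, which is a made-up format for illustration; `chunkFaq` is a hypothetical helper name.

```javascript
// Split FAQ text into one chunk per question+answer pair.
// Assumes each entry starts with a line beginning "Q:" (illustrative format).
// Any preamble before the first "Q:" is dropped.
function chunkFaq(text) {
  return text
    .split(/\n(?=Q:)/) // split before every line that starts with "Q:"
    .map((entry) => entry.trim())
    .filter((entry) => entry.startsWith('Q:'));
}

const faq = [
  'Q: Do you ship internationally?',
  'A: Yes, to most countries.',
  '',
  'Q: What is the return window?',
  'A: 30 days from delivery.',
].join('\n');

const faqChunks = chunkFaq(faq);
```

Each chunk then carries a complete question and its answer, so a retrieved result is usable on its own without neighboring chunks.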

When to use a dedicated vector database

Choose Pinecone, Weaviate, or Qdrant over pgvector when:

  • You have more than 10 million vectors (pgvector HNSW indexes get slow to build above this)
  • You need sub-10ms p99 latency at high QPS
  • You need multi-tenancy with isolated namespaces per customer
  • You need real-time vector updates at very high write throughput

For most production apps at startup/scale-up stage: pgvector is sufficient, free, and reduces operational complexity. Migrate when you have evidence you've hit its limits.

// The migration path is straightforward - same embedding logic
// Just change the storage/query layer

// pgvector query (the vector is passed in pgvector's '[1,2,3]' text format)
const pgResults = await postgres.query('SELECT ... ORDER BY embedding <=> $1 LIMIT 10', [JSON.stringify(embedding)]);

// Pinecone query (same embeddings, different storage)
const pineconeResults = await pinecone.index('my-index').query({
  vector: embedding,
  topK: 10,
  includeMetadata: true,
});

Build your embedding generation logic separate from your storage layer. This makes the migration trivial when you need it.
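
One way to keep that separation is to write application code against a small store interface. The sketch below uses a hypothetical `InMemoryVectorStore` with `upsert` and `query` methods as a stand-in; a pgvector- or Pinecone-backed class would expose the same two methods. All names here are assumptions for illustration and do not come from any library.

```javascript
// Minimal in-memory store with the two operations the application needs.
// A pgvector- or Pinecone-backed class would expose the same methods.
class InMemoryVectorStore {
  constructor() {
    this.items = new Map();
  }

  async upsert(id, embedding, metadata = {}) {
    this.items.set(id, { id, embedding, metadata });
  }

  // Return the topK items ranked by cosine similarity to `embedding`.
  async query(embedding, topK = 10) {
    const scored = [];
    for (const item of this.items.values()) {
      scored.push({
        id: item.id,
        metadata: item.metadata,
        score: cosine(embedding, item.embedding),
      });
    }
    return scored.sort((a, b) => b.score - a.score).slice(0, topK);
  }
}

function cosine(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Application code depends only on the store interface and an embed function.
async function search(store, embedFn, queryText, topK = 5) {
  const embedding = await embedFn(queryText);
  return store.query(embedding, topK);
}
```

With this shape, moving from pgvector to a dedicated database means writing one new class with the same two methods; `search` and the embedding code stay untouched, and the in-memory version doubles as a test fixture.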

