Vector Databases in Production: pgvector, Semantic Search, and When It Actually Matters
Embeddings and vector search sit behind many of the AI features shipping in 2025: semantic search that understands intent rather than keywords, RAG systems that ground LLM responses in your data, recommendation engines that find "similar items" without hand-crafted rules, and duplicate detection that catches near-identical content.
Before choosing infrastructure, you need to know when vector search actually outperforms keyword search - because "keywords vs. semantics" is not a binary. Most production systems that work well use both.
When semantic search genuinely wins
Keyword search (BM25, full-text) wins when:
- Users know the exact terminology (internal tools, technical documentation)
- Queries are short and specific ("iPhone 14 case black")
- Typo correction and fuzzy matching cover most misses
- You can control the vocabulary (product catalog with known attributes)
Vector search wins when:
- Users describe what they want in natural language ("something for back pain when sitting at a desk")
- Synonyms and paraphrases matter ("couch" vs "sofa", "heart attack" vs "myocardial infarction")
- Cross-lingual search (query in Russian, find English documents)
- Conceptual similarity matters ("articles about managing remote teams" finds leadership content, not just articles containing "remote")
- Queries are novel and phrased in words that match no term in your documents
The honest answer for most apps: hybrid search. Vector search for conceptual relevance, keyword search for exact matches and terminology, combined via Reciprocal Rank Fusion (RRF). Start with keyword. Add vector when you see evidence that conceptual retrieval would improve results.
pgvector: the right choice for most teams
Unless you have more than roughly 10 million vectors or need sub-10ms latency at large scale, pgvector in PostgreSQL beats a dedicated vector database for most production applications.
Advantages:
- Same database you already have - no new infrastructure
- ACID transactions across vector and relational data
- SQL joins between vectors and your business data
- Existing backup, monitoring, and ops practices apply
- Free (not an API cost per query)
-- Enable pgvector
CREATE EXTENSION IF NOT EXISTS vector;
-- Table with embedding column
CREATE TABLE documents (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
content TEXT NOT NULL,
metadata JSONB,
embedding vector(1536), -- OpenAI text-embedding-3-small = 1536 dimensions
created_at TIMESTAMPTZ DEFAULT NOW()
);
-- Index for fast similarity search
-- HNSW: better speed-recall tradeoff at query time, but more memory and slower to build
-- IVFFlat: less memory and faster to build, but weaker query performance
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- Semantic search query
SELECT
id,
content,
metadata,
1 - (embedding <=> $1) AS similarity
FROM documents
WHERE 1 - (embedding <=> $1) > 0.75 -- Minimum similarity threshold
ORDER BY embedding <=> $1 -- <=> is cosine distance operator
LIMIT 10;
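For intuition about the similarity column in the query above: <=> returns cosine distance, and the query converts it to a similarity by subtracting it from 1. A minimal sketch of the same arithmetic in plain JavaScript follows. The helper names cosineSimilarity and cosineDistance are hypothetical and exist only for illustration; pgvector does this work inside PostgreSQL.

```javascript
// Hypothetical helpers for illustration only; not part of pgvector or any library.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    throw new Error('Cosine similarity is undefined for a zero vector');
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Cosine distance: the quantity the <=> operator returns.
function cosineDistance(a, b) {
  return 1 - cosineSimilarity(a, b);
}
```

A similarity threshold of 0.75 is therefore the same condition as a cosine distance below 0.25.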
Generating embeddings
import OpenAI from 'openai';
const openai = new OpenAI();
// Single document embedding
async function embedText(text) {
const response = await openai.embeddings.create({
model: 'text-embedding-3-small', // 1536 dimensions, good quality, cheap
// model: 'text-embedding-3-large', // 3072 dimensions, better for complex domains
input: text,
encoding_format: 'float',
});
return response.data[0].embedding;
}
// Batch embedding (one request for many inputs: fewer round trips than one-by-one; the per-token price is the same)
async function embedBatch(texts) {
const response = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: texts, // Up to 2048 inputs per request
});
return response.data.map(d => d.embedding);
}
// Index a document
async function indexDocument(content, metadata) {
const embedding = await embedText(content);
await db.query(
'INSERT INTO documents (content, metadata, embedding) VALUES ($1, $2, $3)',
[content, JSON.stringify(metadata), JSON.stringify(embedding)]
);
}
// Search
async function semanticSearch(query, limit = 10, threshold = 0.75) {
const queryEmbedding = await embedText(query);
const results = await db.query(`
SELECT id, content, metadata, 1 - (embedding <=> $1) AS similarity
FROM documents
WHERE 1 - (embedding <=> $1) > $2
ORDER BY embedding <=> $1
LIMIT $3
`, [JSON.stringify(queryEmbedding), threshold, limit]);
return results.rows;
}
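Two small helpers can make the code above more defensive. This is a sketch under stated assumptions: toVectorLiteral and splitIntoBatches are hypothetical names, the default of 1536 matches the vector(1536) column defined earlier, and the default of 2048 mirrors the per-request input limit noted in embedBatch.

```javascript
// Hypothetical helper: serialize an embedding into pgvector's text format, e.g. "[0.1,0.2,0.3]".
// Checks length and values first so a bad embedding fails before it reaches the database.
function toVectorLiteral(embedding, expectedDimensions = 1536) {
  if (!Array.isArray(embedding) || embedding.length !== expectedDimensions) {
    throw new Error(`Expected an array of ${expectedDimensions} numbers`);
  }
  if (!embedding.every(Number.isFinite)) {
    throw new Error('Embedding contains a non-finite or non-numeric value');
  }
  return `[${embedding.join(',')}]`;
}

// Hypothetical helper: split a list of inputs into request-sized groups.
function splitIntoBatches(items, batchSize = 2048) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new Error('batchSize must be a positive integer');
  }
  const batches = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}
```

Each batch can then be passed to embedBatch, and each resulting embedding can go through toVectorLiteral in place of JSON.stringify.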
Hybrid search: the production pattern
Pure vector search often loses to hybrid search. Users who search for "Python tutorial 2024" expect Python results, not JavaScript results about "programming for beginners 2024" (which might have a high semantic similarity).
async function hybridSearch(query, limit = 10) {
const queryEmbedding = await embedText(query);
// Run both searches in parallel
const [vectorResults, keywordResults] = await Promise.all([
// Vector search
db.query(`
SELECT id, content, 1 - (embedding <=> $1) AS score, 'vector' AS source
FROM documents
WHERE 1 - (embedding <=> $1) > 0.6
ORDER BY embedding <=> $1
LIMIT 20
`, [JSON.stringify(queryEmbedding)]),
// Keyword search (PostgreSQL full-text)
db.query(`
SELECT id, content, ts_rank(to_tsvector('english', content), plainto_tsquery('english', $1)) AS score, 'keyword' AS source
FROM documents
WHERE to_tsvector('english', content) @@ plainto_tsquery('english', $1)
ORDER BY score DESC
LIMIT 20
`, [query]),
]);
// Reciprocal Rank Fusion (RRF) to merge results
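// Pass k explicitly: the signature is (resultSets, k, topN), so a bare `limit` here would be read as k
return reciprocalRankFusion([vectorResults.rows, keywordResults.rows], 60, limit);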
}
function reciprocalRankFusion(resultSets, k = 60, topN = 10) {
const scores = new Map();
for (const results of resultSets) {
results.forEach((doc, rank) => {
const rrfScore = 1 / (k + rank + 1);
scores.set(doc.id, (scores.get(doc.id) || 0) + rrfScore);
});
}
return Array.from(scores.entries())
.sort((a, b) => b[1] - a[1])
.slice(0, topN)
.map(([id]) => id);
}
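The fusion function above returns only ids, so callers would need a second lookup to get content. A variant that keeps the full rows and attaches the fused score is sketched below, assuming each row has an id field as in the queries above. The name fuseWithDocs and the rrfScore field are hypothetical, and the sample ids are made up.

```javascript
// Hypothetical variant of reciprocal rank fusion that returns rows instead of ids.
function fuseWithDocs(resultSets, { k = 60, topN = 10 } = {}) {
  const fused = new Map(); // id -> { doc, score }
  for (const results of resultSets) {
    results.forEach((doc, rank) => {
      // rank is 0-based, so the top hit in each list contributes 1 / (k + 1)
      const contribution = 1 / (k + rank + 1);
      // Keep the row from the first list in which the id appears
      const entry = fused.get(doc.id) || { doc, score: 0 };
      entry.score += contribution;
      fused.set(doc.id, entry);
    });
  }
  return Array.from(fused.values())
    .sort((a, b) => b.score - a.score)
    .slice(0, topN)
    .map(({ doc, score }) => ({ ...doc, rrfScore: score }));
}

// Worked example with made-up ids
const vectorHits = [{ id: 'a' }, { id: 'b' }, { id: 'c' }];
const keywordHits = [{ id: 'b' }, { id: 'd' }];
const merged = fuseWithDocs([vectorHits, keywordHits], { topN: 3 });
```

In this example b ranks first because it appears in both lists (1/62 + 1/61), ahead of a (1/61) and d (1/62).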
RAG implementation with pgvector
One of the most common use cases for vector search in 2025: grounding LLM responses in your own content.
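import Anthropic from '@anthropic-ai/sdk';
const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment
async function ragQuery(userQuestion, systemContext = '') {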
// 1. Embed the user's question
const questionEmbedding = await embedText(userQuestion);
// 2. Find relevant documents from your knowledge base
const relevantDocs = await db.query(`
SELECT content, metadata, 1 - (embedding <=> $1) AS similarity
FROM knowledge_base
WHERE 1 - (embedding <=> $1) > 0.7
ORDER BY embedding <=> $1
LIMIT 5
`, [JSON.stringify(questionEmbedding)]);
if (relevantDocs.rows.length === 0) {
// No relevant context found - answer without RAG or say "I don't know"
// (fallbackResponse is application-specific and not shown here)
return fallbackResponse(userQuestion);
}
// 3. Build context from retrieved documents
const context = relevantDocs.rows
.map((doc, i) => `[Source ${i + 1}]: ${doc.content}`)
.join('\n\n');
// 4. Call LLM with context
const response = await anthropic.messages.create({
model: 'claude-sonnet-4-6',
max_tokens: 1024,
system: `${systemContext}\n\nAnswer questions based ONLY on the provided context.
If the answer is not in the context, say so.
Cite sources using [Source N] notation.`,
messages: [{
role: 'user',
content: `Context:\n${context}\n\nQuestion: ${userQuestion}`,
}],
});
return {
answer: response.content[0].text,
sources: relevantDocs.rows.map(d => d.metadata),
};
}
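Step 3 above concatenates every retrieved chunk, which can overrun the prompt budget when chunks are long. A minimal sketch of a budgeted version follows. The name buildContext is hypothetical, the 8000-character default is an arbitrary placeholder, and character count is only a rough stand-in for tokens.

```javascript
// Hypothetical helper: build the context string from rows already sorted by relevance,
// stopping before the character budget is exceeded.
function buildContext(rows, maxChars = 8000) {
  const parts = [];
  const used = [];
  let total = 0;
  for (const row of rows) {
    const part = `[Source ${used.length + 1}]: ${row.content}`;
    const separatorLength = parts.length > 0 ? 2 : 0; // the "\n\n" joiner
    // Rows are in relevance order, so stop at the first one that does not fit
    if (total + separatorLength + part.length > maxChars) break;
    parts.push(part);
    used.push(row);
    total += separatorLength + part.length;
  }
  return { context: parts.join('\n\n'), used };
}
```

Returning the rows that were actually used keeps the [Source N] numbers in the prompt aligned with the sources list handed back to the caller.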
Chunking strategy (critical for quality)
How you split documents into chunks dramatically affects retrieval quality.
function chunkDocument(text, options = {}) {
const {
chunkSize = 512, // tokens per chunk
chunkOverlap = 50, // overlap between chunks (preserves context at boundaries)
} = options;
// Simple: split by sentence, respect chunk size.
// The second alternative keeps trailing text that has no closing punctuation.
const sentences = text.match(/[^.!?]+[.!?]+|[^.!?]+$/g) || [text];
const chunks = [];
let currentChunk = '';
let currentSize = 0;
for (const sentence of sentences) {
const sentenceSize = estimateTokens(sentence);
if (currentSize + sentenceSize > chunkSize && currentChunk) {
chunks.push(currentChunk.trim());
// Keep last few sentences for overlap
const overlapText = getLastNTokens(currentChunk, chunkOverlap);
currentChunk = overlapText + ' ' + sentence;
currentSize = estimateTokens(currentChunk);
} else {
currentChunk += ' ' + sentence;
currentSize += sentenceSize;
}
}
if (currentChunk.trim()) chunks.push(currentChunk.trim());
return chunks;
}
// Chunk entire document and index all chunks
async function indexDocumentChunked(document) {
const chunks = chunkDocument(document.content);
const embeddings = await embedBatch(chunks);
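// BEGIN/COMMIT must run on a single connection: if db is a node-postgres Pool,
// check out a client with db.connect() and use it for every query in this block
await db.query('BEGIN');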
try {
for (let i = 0; i < chunks.length; i++) {
await db.query(
'INSERT INTO knowledge_base (content, embedding, metadata) VALUES ($1, $2, $3)',
[
chunks[i],
JSON.stringify(embeddings[i]),
JSON.stringify({
sourceId: document.id,
chunkIndex: i,
totalChunks: chunks.length,
title: document.title,
}),
]
);
}
await db.query('COMMIT');
} catch (err) {
await db.query('ROLLBACK');
throw err;
}
}
Chunk size tips:
- Too large (> 1000 tokens): the query embedding represents a question; the chunk embedding represents a broad topic. Less precise matching.
- Too small (< 100 tokens): chunks lack context, results are fragment-level, hard to use.
- 512 tokens with 50-token overlap: good default for most document types.
- Structured documents (FAQs, spec sheets): chunk by logical section (question+answer) rather than token count.
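chunkDocument above calls estimateTokens and getLastNTokens without defining them. One possible pair of implementations is sketched here, assuming a rough ratio of 4 characters per token for English text. That ratio is a heuristic, not the tokenizer the embedding model uses, and the overlap is taken by words rather than whole sentences.

```javascript
// Assumption: about 4 characters per token. Swap in a real tokenizer for exact counts.
const CHARS_PER_TOKEN = 4;

// Approximate token count from character length.
function estimateTokens(text) {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Return roughly the last n tokens of text for use as chunk overlap.
function getLastNTokens(text, n) {
  const maxChars = n * CHARS_PER_TOKEN;
  if (text.length <= maxChars) return text.trim();
  const tail = text.slice(-maxChars);
  const firstSpace = tail.indexOf(' ');
  // Drop everything up to the first space so the overlap never starts mid-word
  // (this can also drop one whole word when the cut lands on a boundary).
  return (firstSpace === -1 ? tail : tail.slice(firstSpace + 1)).trim();
}
```

With these defaults, the 512-token chunk size corresponds to about 2,048 characters and the 50-token overlap to about 200.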
When to use a dedicated vector database
Consider Pinecone, Weaviate, or Qdrant instead of pgvector when:
- You have more than 10 million vectors (pgvector HNSW index gets slow to build above this)
- Sub-10ms p99 latency requirement at high QPS
- Multi-tenant with isolated namespaces per customer
- Need real-time vector updates at very high write throughput
For most production apps at startup/scale-up stage: pgvector is sufficient, free, and reduces operational complexity. Migrate when you have evidence you've hit its limits.
// The migration path is straightforward - same embedding logic
// Just change the storage/query layer
// pgvector query (embedding serialized to pgvector's text format, as in the earlier examples)
const pgResults = await postgres.query('SELECT ... ORDER BY embedding <=> $1 LIMIT 10', [JSON.stringify(embedding)]);
// Pinecone query (same embeddings, different storage)
const pineconeResults = await pinecone.index('my-index').query({
vector: embedding,
topK: 10,
includeMetadata: true,
});
Keep your embedding generation logic separate from your storage layer, so that a migration only touches the storage and query code.
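One way to keep that separation is to have application code depend on a small store interface rather than on pgvector or Pinecone directly. The sketch below is an illustration under assumptions: createSearchService, InMemoryVectorStore, and the upsert/query method names are all hypothetical, and the in-memory store stands in for a real backend so the example runs without a database.

```javascript
// Cosine similarity used by the in-memory stand-in below.
function cosineScore(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical store interface: upsert(id, embedding, metadata) and query(embedding, topK).
// A pgvector-backed or Pinecone-backed class would expose the same two methods.
class InMemoryVectorStore {
  constructor() {
    this.items = new Map();
  }
  async upsert(id, embedding, metadata = {}) {
    this.items.set(id, { id, embedding, metadata });
  }
  async query(embedding, topK = 10) {
    return Array.from(this.items.values())
      .map(item => ({
        id: item.id,
        metadata: item.metadata,
        score: cosineScore(embedding, item.embedding),
      }))
      .sort((a, b) => b.score - a.score)
      .slice(0, topK);
  }
}

// Application code sees only an embed function and a store; neither knows about the other.
function createSearchService(embed, store) {
  return {
    async index(id, text, metadata) {
      await store.upsert(id, await embed(text), metadata);
    },
    async search(query, topK = 10) {
      return store.query(await embed(query), topK);
    },
  };
}
```

Swapping backends then means writing one new class with the same two methods; the embedding function and the callers of createSearchService stay unchanged.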
Aunimeda builds AI-powered solutions - chatbots, AI agents, voice assistants, and automation systems for businesses.