RAG Architecture: How We Made GPT-4 Answer Only From Our Client's Knowledge Base

#rag#gpt-4#llm#pinecone#vector database#openai#ai#production#2023


A legal services company wanted an internal assistant that could answer questions about their case procedures, client intake forms, and regulatory guidelines. The problem with asking GPT-4 directly: it answers confidently from its training data, which doesn't include their proprietary documents, and it occasionally invents plausible-sounding but incorrect legal procedures.

Retrieval-Augmented Generation (RAG) solved both problems: the model answers only from retrieved documents, and we can point to which documents informed each answer.

The Hallucination Problem and Why RAG Fixes It

A language model's knowledge is frozen at its training cutoff. It has no access to documents you wrote last month. Worse, when asked about things it doesn't know, it often generates fluent, confident text that sounds correct but isn't, a failure mode known as "hallucination."

RAG addresses this by separating the retrieval step from the generation step:

User query
  → Retrieve: find the 3-5 most relevant documents from your knowledge base
  → Augment: inject those documents into the prompt as context
  → Generate: LLM answers based on the provided context, not its training data

If the answer isn't in the retrieved documents, a well-prompted model says "I don't have information on that in the provided documents" rather than inventing an answer.

The Architecture

Documents (PDF, Word, HTML)
  → Text extraction
  → Chunking (fixed-size chunks with overlap)
  → Embedding (OpenAI text-embedding-ada-002)
  → Pinecone vector database (indexed by embedding)

Query pipeline:
  User query
    → Embed query (same model)
    → Pinecone similarity search (top 5 chunks)
    → Build prompt: system + context chunks + user question
    → GPT-4 generates answer
    → Return answer + source citations
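
To make the similarity search step less abstract, here is a minimal in-memory sketch of what a vector database does conceptually: score every stored chunk against the query embedding and keep the best k. The names `StoredChunk`, `cosineSimilarity`, and `topK` are illustrative assumptions, not part of Pinecone's API, and a real index uses approximate nearest-neighbour search rather than a full scan.

```typescript
// Hypothetical in-memory stand-in for a vector index (illustration only).
interface StoredChunk {
  id: string;
  values: number[]; // embedding vector
  text: string;
}

// Cosine similarity: dot product divided by the product of the vector norms.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same dimension');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force top-k: score every chunk, sort descending, keep the first k.
function topK(query: number[], chunks: StoredChunk[], k: number) {
  return chunks
    .map(chunk => ({ chunk, score: cosineSimilarity(query, chunk.values) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

A full scan like this is fine for a few thousand chunks; the reason to reach for a dedicated index is that the scan cost grows linearly with the size of the knowledge base.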

Step 1: Document Ingestion

// ingestion/ingest.ts
import { OpenAI } from 'openai';
import { Pinecone } from '@pinecone-database/pinecone';
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';
import { PDFLoader } from 'langchain/document_loaders/fs/pdf';
import fs from 'fs/promises';
import path from 'path';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY });

const INDEX_NAME = 'company-knowledge-base';
const EMBEDDING_MODEL = 'text-embedding-ada-002';
const EMBEDDING_DIMENSIONS = 1536;

// Chunk size matters: too small = lost context; too large = noise in retrieval
const splitter = new RecursiveCharacterTextSplitter({
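  // Note: chunkSize and chunkOverlap are measured in characters by default, not tokens.
  // 500 characters is roughly 125 tokens of English text; raise the value or use a
  // token-aware splitter if you want chunks of about 500 tokens.
  chunkSize: 500,
  chunkOverlap: 100,      // Overlap prevents cutting mid-sentence context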
  separators: ['\n\n', '\n', '. ', ' ', ''],
});

async function ingestDocument(filePath: string) {
  const fileName = path.basename(filePath);
  console.log(`Ingesting: ${fileName}`);

  // Load and extract text
  let text: string;
  if (filePath.endsWith('.pdf')) {
    const loader = new PDFLoader(filePath);
    const docs = await loader.load();
    text = docs.map(d => d.pageContent).join('\n\n');
  } else {
    text = await fs.readFile(filePath, 'utf-8');
  }

  // Split into chunks
  const chunks = await splitter.splitText(text);
  console.log(`  ${chunks.length} chunks from ${fileName}`);

  // Embed all chunks in batches (API limit: 2048 inputs per request)
  const index = pinecone.index(INDEX_NAME);
  const BATCH_SIZE = 100;

  for (let i = 0; i < chunks.length; i += BATCH_SIZE) {
    const batch = chunks.slice(i, i + BATCH_SIZE);

    // Get embeddings for the batch
    const embeddingResponse = await openai.embeddings.create({
      model: EMBEDDING_MODEL,
      input: batch,
    });

    // Upsert vectors to Pinecone
    const vectors = batch.map((chunk, j) => ({
      id: `${fileName}-chunk-${i + j}`,
      values: embeddingResponse.data[j].embedding,
      metadata: {
        text: chunk,
        source: fileName,
        chunkIndex: i + j,
      },
    }));

    await index.upsert(vectors);
    console.log(`  Upserted batch ${i / BATCH_SIZE + 1}`);
  }
}

// Ingest all documents in a directory
async function ingestDirectory(dir: string) {
  const files = await fs.readdir(dir);
  for (const file of files) {
    if (file.match(/\.(pdf|txt|md)$/)) {
      await ingestDocument(path.join(dir, file));
    }
  }
}

ingestDirectory('./documents').catch(console.error);

Step 2: The Query Pipeline

// rag/query.ts
import { OpenAI } from 'openai';
import { Pinecone } from '@pinecone-database/pinecone';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY });

interface RagResult {
  answer: string;
  sources: Array<{
    source: string;
    chunkIndex: number;
    text: string;
    score: number;
  }>;
}

// Exported so the API route can import it
export async function queryKnowledgeBase(userQuestion: string): Promise<RagResult> {
  // 1. Embed the user's question
  const queryEmbedding = await openai.embeddings.create({
    model: 'text-embedding-ada-002',
    input: userQuestion,
  });
  const queryVector = queryEmbedding.data[0].embedding;

  // 2. Retrieve top-5 most similar chunks from Pinecone
  const index = pinecone.index('company-knowledge-base');
  const searchResults = await index.query({
    vector: queryVector,
    topK: 5,
    includeMetadata: true,
  });

  // Filter out low-relevance results (cosine similarity threshold)
  const relevantChunks = searchResults.matches.filter(m => (m.score ?? 0) > 0.75);

  if (relevantChunks.length === 0) {
    return {
      answer: "I couldn't find relevant information in the knowledge base to answer this question.",
      sources: [],
    };
  }

  // 3. Build the context string from retrieved chunks
  const context = relevantChunks
    .map((chunk, i) => `[Source ${i + 1}: ${chunk.metadata?.source}]\n${chunk.metadata?.text}`)
    .join('\n\n---\n\n');

  // 4. Generate answer with GPT-4
  const completion = await openai.chat.completions.create({
    model: 'gpt-4',
    messages: [
      {
        role: 'system',
        content: `You are an assistant for a legal services company. Answer questions based ONLY on the provided context documents. 
        
If the context doesn't contain enough information to answer the question, say so explicitly - do not make assumptions or use outside knowledge.

Always cite which source document your answer comes from, e.g., "According to [document name]..."

Context:
${context}`,
      },
      {
        role: 'user',
        content: userQuestion,
      },
    ],
    temperature: 0.1,  // Low temperature = more deterministic, less creative
    max_tokens: 1000,
  });

  return {
    answer: completion.choices[0].message.content ?? '',
    sources: relevantChunks.map(chunk => ({
      source: chunk.metadata?.source as string,
      chunkIndex: chunk.metadata?.chunkIndex as number,
      text: chunk.metadata?.text as string,
      score: chunk.score ?? 0,
    })),
  };
}

Step 3: The API Endpoint

// app/api/ask/route.ts (Next.js App Router)
import { NextRequest, NextResponse } from 'next/server';
import { queryKnowledgeBase } from '@/rag/query';
import { rateLimit } from '@/lib/rateLimit';

export async function POST(request: NextRequest) {
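  // Rate limiting: 20 queries/minute per client IP.
  // x-forwarded-for may hold a comma-separated proxy chain; the first entry is the original client.
  const identifier = request.headers.get('x-forwarded-for')?.split(',')[0].trim() || 'anonymous';
  const { success } = await rateLimit(identifier, { limit: 20, window: 60 });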
  if (!success) {
    return NextResponse.json({ error: 'Rate limit exceeded' }, { status: 429 });
  }

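  // A malformed JSON body should fail the validation below with a 400, not throw
  const body = await request.json().catch(() => null);
  const question = body?.question;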
  
  if (!question || typeof question !== 'string' || question.length > 500) {
    return NextResponse.json({ error: 'Invalid question' }, { status: 400 });
  }

  try {
    const result = await queryKnowledgeBase(question);
    return NextResponse.json(result);
  } catch (error) {
    console.error('RAG query failed:', error);
    return NextResponse.json({ error: 'Query failed' }, { status: 500 });
  }
}
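
The endpoint imports `rateLimit` from `@/lib/rateLimit`, which is not shown in this article. As a rough idea of what such a helper could look like, here is a fixed-window limiter kept in process memory. Everything in it (the `Bucket` type, the `checkRateLimit` helper, the in-memory `Map`) is an assumption for illustration; a deployment with several instances or serverless functions would need a shared store such as Redis instead.

```typescript
interface RateLimitOptions {
  limit: number;  // maximum requests allowed per window
  window: number; // window length in seconds
}

interface Bucket {
  count: number;   // requests seen in the current window
  resetAt: number; // epoch milliseconds when the window ends
}

// Hypothetical in-memory store; state is lost on restart and not shared across instances.
const buckets = new Map<string, Bucket>();

// Synchronous core, with an injectable clock so it can be tested deterministically.
function checkRateLimit(
  identifier: string,
  options: RateLimitOptions,
  now: number = Date.now(),
): { success: boolean; remaining: number } {
  const bucket = buckets.get(identifier);

  // No window yet, or the previous one has expired: start a fresh window.
  if (!bucket || now >= bucket.resetAt) {
    buckets.set(identifier, { count: 1, resetAt: now + options.window * 1000 });
    return { success: true, remaining: options.limit - 1 };
  }

  // Window is active and already full.
  if (bucket.count >= options.limit) {
    return { success: false, remaining: 0 };
  }

  bucket.count += 1;
  return { success: true, remaining: options.limit - bucket.count };
}

// Async wrapper matching the call shape used in the route handler.
async function rateLimit(identifier: string, options: RateLimitOptions) {
  return checkRateLimit(identifier, options);
}
```

A fixed window is the simplest choice; it allows short bursts at window boundaries, which is usually acceptable for an internal tool.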

What Made It Production-Reliable

Chunking strategy is critical

Fixed-size chunking (chunkSize: 500) worked but lost semantic boundaries - a chunk might start mid-procedure. Better for legal documents: chunk by section headers.

// Better chunking for structured documents
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 800,
  chunkOverlap: 150,
  separators: [
    '\n## ',     // H2 sections
    '\n### ',    // H3 sections  
    '\n\n',      // Paragraphs
    '\n',
    '. ',
  ],
});
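
The separator list above only makes the splitter prefer header boundaries; a long section is still cut by size. If you want strictly one unit per section before any size-based splitting, a pre-pass along these lines is one option. It assumes the documents have already been converted to Markdown with `##` and `###` headings, and `splitByHeadings` is a hypothetical helper, not a LangChain function.

```typescript
interface Section {
  heading: string; // empty string for text that appears before the first heading
  body: string;
}

// Split Markdown text into sections at H2/H3 headings.
function splitByHeadings(markdown: string): Section[] {
  const sections: Section[] = [];
  let current: Section = { heading: '', body: '' };

  const flush = () => {
    // Skip empty leading sections (no heading and no text).
    if (current.heading || current.body.trim()) sections.push(current);
  };

  for (const line of markdown.split('\n')) {
    if (/^#{2,3} /.test(line)) {
      flush();
      current = { heading: line.replace(/^#+\s*/, ''), body: '' };
    } else {
      current.body += line + '\n';
    }
  }
  flush();

  return sections.map(s => ({ heading: s.heading, body: s.body.trim() }));
}
```

Each section body can then be passed to the size-based splitter, with the heading stored in the chunk metadata so that citations can point to a named section rather than a chunk index.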

Hybrid search (semantic + keyword)

Pure semantic search misses exact matches. "Article 15, paragraph 3" is a keyword match, not a semantic concept. Pinecone's sparse-dense hybrid combines BM25 keyword scoring with vector similarity:

// Hybrid query (requires a Pinecone pod-based index with sparse-dense support, created with the dotproduct metric)
const results = await index.query({
  vector: denseVector,
  sparseVector: bm25Vector,  // Pre-computed BM25 sparse representation: { indices: number[], values: number[] }
  topK: 5,
  includeMetadata: true,
});

For most use cases, pure semantic search is sufficient. Hybrid is worth the complexity when users ask about specific identifiers (article numbers, case IDs, form names).
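
How much weight the keyword side gets is usually controlled on the client by scaling both vectors before the query. The sketch below shows one common approach, a convex combination with a weight `alpha`; the `hybridScale` name and the `SparseVector` type are assumptions for illustration, and the right `alpha` has to be found by testing against your own queries.

```typescript
interface SparseVector {
  indices: number[];
  values: number[];
}

// alpha = 1 means pure semantic search, alpha = 0 means pure keyword search.
function hybridScale(dense: number[], sparse: SparseVector, alpha: number) {
  if (alpha < 0 || alpha > 1) {
    throw new RangeError('alpha must be between 0 and 1');
  }
  return {
    vector: dense.map(v => v * alpha),
    sparseVector: {
      indices: sparse.indices,
      values: sparse.values.map(v => v * (1 - alpha)),
    },
  };
}
```

The returned `vector` and `sparseVector` can be passed to the query in place of the unscaled ones. Because the index scores with a dot product, scaling the inputs scales each side's contribution to the final score.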

Re-ranking retrieved chunks

Pinecone returns chunks ordered by vector similarity, but vector similarity isn't perfectly correlated with "best answer to this question." A re-ranker model reorders the retrieved chunks before building the context prompt:

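import { CohereClient } from 'cohere-ai';  // the cohere-ai SDK exports CohereClient

const cohere = new CohereClient({ token: process.env.COHERE_API_KEY });

// Retrieve 10 candidates from Pinecone (topK: 10), re-rank them, then use the top 5.
// top10Chunks is the `matches` array from that wider Pinecone query.
const reranked = await cohere.rerank({
  model: 'rerank-english-v3.0',  // assumed model name; use whichever rerank model your account offers
  query: userQuestion,
  documents: top10Chunks.map(c => c.metadata?.text as string),
  topN: 5,
});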

Cohere's re-ranker added ~200ms per query but improved answer quality measurably - fewer "the document mentions X but doesn't directly answer Y" situations.
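
A re-ranker typically returns positions into the document list you sent rather than the documents themselves, so the results have to be mapped back to the original chunks before building the prompt. A small sketch of that step follows; the `RerankResult` shape (an `index` plus a `relevanceScore`) is an assumption about the response format and should be checked against the SDK version in use.

```typescript
// Assumed shape of one re-ranker result.
interface RerankResult {
  index: number;          // position of the document in the list that was sent
  relevanceScore: number; // higher means more relevant to the query
}

// Map re-ranker results back onto the original chunk objects, preserving the new order.
function applyRerank<T>(
  chunks: T[],
  results: RerankResult[],
): Array<{ chunk: T; relevanceScore: number }> {
  return results
    .filter(r => r.index >= 0 && r.index < chunks.length) // ignore out-of-range indices
    .map(r => ({ chunk: chunks[r.index], relevanceScore: r.relevanceScore }));
}
```

Keeping the chunk objects intact means the source and chunk index metadata survive re-ranking, so citations still work.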

Evaluating RAG Quality

This was harder than building the pipeline. We used three metrics:

1. Faithfulness - does the answer only use information from retrieved context?
Test: manually check 50 queries. Flag any answer that contains information not in the retrieved chunks.

2. Retrieval relevance - does the retrieved context actually contain the answer?
Test: for 50 known Q&A pairs, check if the correct chunk is in the top-5 retrieved results.

3. Chunk coverage - after ingestion, can the system answer questions about every document?
Test: generate 5 test questions per document. Measure retrieval hit rate.

Our initial retrieval hit rate was 71% (71 of 100 test queries retrieved the right chunk). After tuning chunk size and adding overlap, it reached 89%.
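
The hit rate itself is simple to compute once each test question is labelled with the chunk that should be retrieved. The sketch below shows the calculation; the `EvalCase` shape and the `hitRateAtK` name are illustrative assumptions, and the labelled cases would come from your own Q&A set.

```typescript
interface EvalCase {
  question: string;
  expectedChunkId: string;     // the chunk a human judged to contain the answer
  retrievedChunkIds: string[]; // ids returned by the retriever, best first
}

// Fraction of cases where the expected chunk appears in the top k retrieved results.
function hitRateAtK(cases: EvalCase[], k: number): number {
  if (cases.length === 0) return 0;
  const hits = cases.filter(c =>
    c.retrievedChunkIds.slice(0, k).includes(c.expectedChunkId),
  ).length;
  return hits / cases.length;
}
```

Running the same labelled set after every chunking or retrieval change turns tuning into a measurable comparison rather than a judgment call.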

Cost at Scale

For this client (200 active users, ~150 queries/day):

Component	Cost/month
OpenAI text-embedding-ada-002	~$2 (150 queries × 500 tokens = 75K tokens)
GPT-4 (answer generation)	~$45 (150 queries × ~2K tokens = 300K tokens at $0.03/1K input + $0.06/1K output)
Pinecone Starter	$70/month (1M vectors, pod-based)
Total	~$117/month
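
Estimates like these are sensitive to the inputs, in particular how the tokens per query split between prompt and completion and how many days of traffic are counted. A small calculator makes it easy to redo the arithmetic with your own volumes. The function and field names below are assumptions for illustration, and the prices should be taken from the provider's current price list.

```typescript
interface UsageEstimate {
  queriesPerDay: number;
  inputTokensPerQuery: number;  // system prompt + retrieved context + question
  outputTokensPerQuery: number; // generated answer
  daysPerMonth: number;
}

interface Pricing {
  inputPer1K: number;  // price per 1,000 input tokens
  outputPer1K: number; // price per 1,000 output tokens
}

// Monthly generation cost = cost per query × queries per day × days per month.
function monthlyGenerationCost(usage: UsageEstimate, pricing: Pricing): number {
  const costPerQuery =
    (usage.inputTokensPerQuery / 1000) * pricing.inputPer1K +
    (usage.outputTokensPerQuery / 1000) * pricing.outputPer1K;
  return costPerQuery * usage.queriesPerDay * usage.daysPerMonth;
}
```

Input tokens usually dominate in RAG, because every query carries the retrieved chunks, so reducing the number or size of chunks in the prompt is the most direct lever on generation cost.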

Pinecone is the largest cost at scale. Alternatives: Qdrant (open source, self-hostable on a €5/month VPS), pgvector (PostgreSQL extension - no additional infrastructure if you already run PostgreSQL).

In 2024, GPT-4o has cut generation costs ~10× vs GPT-4 (June 2023 pricing). The economics of production RAG are significantly better now.

What RAG Can't Fix

Stale embeddings. When a document is updated, its old vectors remain in the index. You need a deletion + re-embedding pipeline triggered on document updates.
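
One way to drive that pipeline is to store a content hash and chunk count per file at ingestion time, then compare on each run. The sketch below plans the work without touching the index; it reuses the `fileName-chunk-N` id scheme from the ingestion script, while `IngestRecord` and `planReingest` are hypothetical names and where the records are stored is left open.

```typescript
import { createHash } from 'node:crypto';

// What we remember about a file from its last ingestion (storage is up to you).
interface IngestRecord {
  hash: string;       // SHA-256 of the extracted text
  chunkCount: number; // how many chunks were upserted
}

function contentHash(text: string): string {
  return createHash('sha256').update(text, 'utf8').digest('hex');
}

// Rebuild the ids that the ingestion script assigned: `${fileName}-chunk-${n}`.
function chunkIds(fileName: string, chunkCount: number): string[] {
  return Array.from({ length: chunkCount }, (_, i) => `${fileName}-chunk-${i}`);
}

// Decide whether a file needs re-ingesting and which old vectors must be removed first.
function planReingest(fileName: string, previous: IngestRecord | undefined, newText: string) {
  const hash = contentHash(newText);
  if (previous && previous.hash === hash) {
    return { action: 'skip' as const, idsToDelete: [] as string[], hash };
  }
  return {
    action: 'reingest' as const,
    idsToDelete: previous ? chunkIds(fileName, previous.chunkCount) : [],
    hash,
  };
}
```

Deleting the old ids before upserting matters when the updated document produces fewer chunks than before; otherwise the trailing chunks of the old version stay searchable. The ids can be removed with the vector database's delete-by-id call before the new chunks are upserted.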

Conflicting information across documents. If two documents contradict each other, the LLM will pick one (or hedge). Your ingestion pipeline should track document dates and prefer recent content.

Questions that require synthesis across many documents. "Summarize all changes to procedure X across the last 5 years" requires retrieving and synthesizing many chunks. RAG with a fixed context window handles this poorly. Approaches: hierarchical summarization, or query decomposition (break into sub-queries, combine results).
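
For query decomposition, the mechanical part is combining the results of several sub-queries into one context without duplicates. A minimal sketch of that merge step is below. It assumes each sub-query has already been run against the index and returned matches with an `id` and a `score`; the `mergeSubQueryResults` name is hypothetical, and how the question is broken into sub-queries (often by asking the LLM) is outside this sketch.

```typescript
interface Match {
  id: string;
  score: number;
}

// Merge matches from several sub-queries: dedupe by id, keep the best score, return the top `limit`.
function mergeSubQueryResults(resultLists: Match[][], limit: number): Match[] {
  const best = new Map<string, Match>();
  for (const results of resultLists) {
    for (const match of results) {
      const existing = best.get(match.id);
      if (!existing || match.score > existing.score) {
        best.set(match.id, match);
      }
    }
  }
  return Array.from(best.values())
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}
```

Keeping the maximum score per chunk is one reasonable choice; summing scores would instead favour chunks that are relevant to several sub-queries at once.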

The pattern holds for most production use cases: company knowledge bases, documentation assistants, customer support bots, research tools. RAG is the practical alternative to fine-tuning for knowledge-grounded question answering - faster to implement, cheaper to update, and easier to audit.

Aunimeda builds AI-powered solutions - chatbots, AI agents, voice assistants, and automation systems for businesses.
