#llm #openai #claude #cost optimization #production #ai #backend #2025


LLM in Production: How to Cut Your AI API Costs by 80% Without Degrading Quality

AI API costs scale non-linearly. A feature that costs $200/month in a staging environment with light usage can hit $8,000/month when real users start hammering it. The math is brutal: every token in, every token out, on every request, for every user.

We've shipped multiple LLM-powered features to production and learned - sometimes painfully - where the money goes and how to stop it. These are the concrete strategies, with real numbers.

Where the money actually goes

First, understand the billing model. Most LLM APIs charge per token:

Input tokens (prompt + context)
Output tokens (generated response)
Output tokens typically cost 3-5x more than input tokens

For reference, 1,000 tokens ≈ 750 words.

GPT-4o (2025): ~$2.50/1M input tokens, ~$10/1M output tokens
Claude Sonnet 4.6: ~$3/1M input tokens, ~$15/1M output tokens
GPT-4o mini: ~$0.15/1M input tokens, ~$0.60/1M output tokens

The cost difference between a "premium" model and its small sibling can be large: at the prices above, GPT-4o costs roughly 17x more per token than GPT-4o mini. The quality difference depends entirely on what you're asking them to do.
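To make the billing model concrete, here is a minimal sketch that turns per-token prices into a monthly figure. The helper name `estimateMonthlyCost` and the workload numbers (10,000 requests/day, 1,500 input and 300 output tokens per request) are assumptions chosen for illustration; only the prices come from the list above.

```javascript
// Prices in USD per 1M tokens, copied from the figures quoted above.
const PRICES_PER_MILLION = {
  'gpt-4o': { input: 2.5, output: 10 },
  'gpt-4o-mini': { input: 0.15, output: 0.6 },
};

// Hypothetical helper: monthly cost of a feature with a steady request rate.
function estimateMonthlyCost({ model, requestsPerDay, inputTokens, outputTokens, days = 30 }) {
  const price = PRICES_PER_MILLION[model];
  if (!price) throw new Error(`No price configured for ${model}`);

  // Cost of one request: input and output tokens are billed at different rates.
  const perRequest =
    (inputTokens / 1_000_000) * price.input +
    (outputTokens / 1_000_000) * price.output;

  return perRequest * requestsPerDay * days;
}

// Illustrative workload (assumed numbers, not measurements).
const workload = { requestsPerDay: 10_000, inputTokens: 1_500, outputTokens: 300 };
const premiumCost = estimateMonthlyCost({ model: 'gpt-4o', ...workload });
const miniCost = estimateMonthlyCost({ model: 'gpt-4o-mini', ...workload });

console.log(`gpt-4o: $${premiumCost.toFixed(2)}/month`);
console.log(`gpt-4o-mini: $${miniCost.toFixed(2)}/month`);
```

Running the same workload through both models is a quick way to see whether a cheaper model is worth evaluating before you touch any production code.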

Strategy 1: Model routing - the biggest lever

The most impactful change you can make: don't use the same model for everything.

Most LLM-powered features have two types of requests:

Simple classification/extraction - "Is this review positive or negative?", "Extract the name from this text"
Complex reasoning/generation - "Write a personalized email based on these 10 customer data points"

The simple requests don't need GPT-4o. They need a small model such as GPT-4o mini or Claude Haiku, which costs a fraction of the price (GPT-4o mini is roughly 17x cheaper than GPT-4o) and is often just as accurate for simple tasks.

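import { OpenAI } from 'openai';
import Anthropic from '@anthropic-ai/sdk';

const openai = new OpenAI();       // reads OPENAI_API_KEY from the environment
const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const router = {
  // Simple tasks → cheap model
  classifySentiment: { provider: 'openai', model: 'gpt-4o-mini', maxTokens: 20 },
  extractEntities: { provider: 'openai', model: 'gpt-4o-mini', maxTokens: 150 },
  categorizeTicket: { provider: 'openai', model: 'gpt-4o-mini', maxTokens: 50 },

  // Complex tasks → powerful model
  draftPersonalizedEmail: { provider: 'openai', model: 'gpt-4o', maxTokens: 800 },
  analyzeContractClauses: { provider: 'anthropic', model: 'claude-opus-4-6', maxTokens: 2000 },
  generateCodeReview: { provider: 'openai', model: 'gpt-4o', maxTokens: 1500 },
};

async function callLLM(task, prompt) {
  const config = router[task];
  if (!config) throw new Error(`No route configured for task: ${task}`);

  const request = {
    model: config.model,
    max_tokens: config.maxTokens,
    messages: [{ role: 'user', content: prompt }],
  };

  // Claude models must go through the Anthropic SDK; the OpenAI client cannot serve them.
  // Note: the two SDKs return differently shaped responses.
  return config.provider === 'anthropic'
    ? anthropic.messages.create(request)
    : openai.chat.completions.create(request);
}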

Real impact: One product we shipped had all requests on GPT-4. After we routed 70% of the simple tasks to GPT-4o mini, costs dropped from $4,200/month to $890/month. Same user experience - the simple tasks had identical output quality on the cheaper model.
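A back-of-the-envelope model helps you predict what routing will save before you build it. The sketch below is an assumption-laden illustration, not a reconstruction of the figures above: `costAfterRouting` is a hypothetical helper, `routedShare` is the share of current spend (not of request count) that moves to the cheaper model, and the $1,000 baseline is made up.

```javascript
// Hypothetical helper: spend after moving part of the workload to a cheaper model.
// routedShare is the share of current SPEND (not request count) that moves.
function costAfterRouting({ baselineCost, routedShare, priceRatio }) {
  if (routedShare < 0 || routedShare > 1) {
    throw new RangeError('routedShare must be between 0 and 1');
  }
  if (priceRatio <= 0) {
    throw new RangeError('priceRatio must be positive');
  }
  const untouched = baselineCost * (1 - routedShare);       // stays on the premium model
  const routed = (baselineCost * routedShare) / priceRatio; // moves to the cheap model
  return untouched + routed;
}

// GPT-4o vs GPT-4o mini, using the input prices quoted earlier in the article.
const priceRatio = 2.5 / 0.15;
const routedCost = costAfterRouting({ baselineCost: 1000, routedShare: 0.6, priceRatio });

console.log(`After routing: $${routedCost.toFixed(2)}/month`);
```

The formula also shows the ceiling: savings can never exceed `routedShare`, so it pays to measure which tasks dominate spend before deciding what to route.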

Strategy 2: Prompt caching

If your prompts have a large, static prefix (a system prompt, a document being analyzed, a knowledge base), prompt caching lets you pay for that context once per cache period instead of on every request.

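Both Anthropic and OpenAI support prompt caching, but the pricing differs. Anthropic bills cache reads at ~10% of the normal input token cost and charges a premium (about 25% for the 5-minute cache) on the request that writes the cache. OpenAI applies its discount automatically, and the size of the discount depends on the model (50% on GPT-4o), so check the current pricing page.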

// Anthropic: cache_control on the system prompt
const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-6',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: longSystemPrompt, // 2,000 tokens of static instructions
      cache_control: { type: 'ephemeral' }, // Cache for 5 minutes
    },
  ],
  messages: [
    {
      role: 'user',
      content: userQuestion, // Only this changes per request
    },
  ],
});

// OpenAI: automatic caching for prompts of 1,024 tokens or more
// Prompts that start with identical content are cached automatically
// No code change required - just keep the static content at the start of the prompt

Real impact: A document Q&A feature where every question loaded a 3,000-token document. After implementing prompt caching, input token cost dropped 82% (paying full price on first request per document per cache period, 10% on subsequent requests). At 200 questions per document per day: savings compound fast.
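The arithmetic behind a figure like this can be sketched as follows. Everything here is an assumption for illustration: the helper name `inputCostWithCaching`, the multipliers (a 1.25x surcharge on cache writes and a 0.1x rate on cache reads, which you should verify against your provider's pricing), and the workload of 200 requests with 10 cache writes.

```javascript
// Assumed multipliers relative to the normal input price; verify against current pricing.
const CACHE_WRITE_MULTIPLIER = 1.25; // assumption: surcharge on the request that writes the cache
const CACHE_READ_MULTIPLIER = 0.1;   // assumption: cached reads billed at ~10%

// Hypothetical helper: input cost (USD) for a batch of requests sharing a static prefix.
function inputCostWithCaching({ prefixTokens, dynamicTokens, requests, cacheWrites, pricePerMillion }) {
  if (cacheWrites > requests) throw new RangeError('cacheWrites cannot exceed requests');
  const cacheReads = requests - cacheWrites;

  // The static prefix is billed at the write rate on misses and the read rate on hits.
  const prefixTokenEquivalents =
    prefixTokens * (cacheWrites * CACHE_WRITE_MULTIPLIER + cacheReads * CACHE_READ_MULTIPLIER);
  // The per-request part (the user's question) is always billed at the normal rate.
  const dynamicTokenTotal = dynamicTokens * requests;

  return ((prefixTokenEquivalents + dynamicTokenTotal) / 1_000_000) * pricePerMillion;
}

// Illustrative day for one document: 3,000-token prefix, 50-token questions,
// 200 requests, and 10 cache writes because the cache expired between bursts.
const scenario = { prefixTokens: 3000, dynamicTokens: 50, requests: 200, pricePerMillion: 3 };
const uncachedCost = ((scenario.prefixTokens + scenario.dynamicTokens) * scenario.requests / 1_000_000) * scenario.pricePerMillion;
const cachedCost = inputCostWithCaching({ ...scenario, cacheWrites: 10 });
const savingsPct = (1 - cachedCost / uncachedCost) * 100;

console.log(`Uncached: $${uncachedCost.toFixed(4)}, cached: $${cachedCost.toFixed(4)}, saved: ${savingsPct.toFixed(1)}%`);
```

The number of cache writes is the variable to watch: traffic that arrives in bursts keeps the cache warm, while the same volume spread thinly across the day pays the write surcharge over and over.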

Strategy 3: Semantic caching

Standard HTTP caching doesn't work for LLMs because requests are free-form text. But semantically similar questions should return similar (or identical) answers.

import { createHash } from 'node:crypto';
import { OpenAI } from 'openai';
import { createClient } from '@supabase/supabase-js';

const openai = new OpenAI();
const supabase = createClient(process.env.SUPABASE_URL, process.env.SUPABASE_KEY);

// Stable fingerprint of the system prompt, so cached answers never leak across prompts
function hashString(text) {
  return createHash('sha256').update(text).digest('hex');
}

async function cachedLLMCall(userQuery, systemPrompt) {
  const systemPromptHash = hashString(systemPrompt);

  // 1. Embed the query
  const embeddingResponse = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: userQuery,
  });
  const queryEmbedding = embeddingResponse.data[0].embedding;
  
  // 2. Search for similar cached queries answered under the same system prompt
  const { data: similar } = await supabase.rpc('match_llm_cache', {
    query_embedding: queryEmbedding,
    filter_prompt_hash: systemPromptHash,
    match_threshold: 0.95, // High threshold - only near-identical queries
    match_count: 1,
  });
  
  if (similar?.length > 0) {
    // Cache hit - return cached response
    await supabase.from('llm_cache_hits').insert({ cache_id: similar[0].id });
    return similar[0].response;
  }
  
  // 3. Cache miss - call LLM
  const llmResponse = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      { role: 'system', content: systemPrompt },
      { role: 'user', content: userQuery },
    ],
  });
  const responseText = llmResponse.choices[0].message.content;
  
  // 4. Store in cache with embedding
  await supabase.from('llm_cache').insert({
    query: userQuery,
    response: responseText,
    embedding: queryEmbedding,
    system_prompt_hash: systemPromptHash,
    created_at: new Date().toISOString(),
  });
  
  return responseText;
}

-- Supabase/PostgreSQL: pgvector function for semantic search
-- (<=> is pgvector's cosine distance, so similarity = 1 - distance)
CREATE OR REPLACE FUNCTION match_llm_cache(
  query_embedding vector(1536),
  filter_prompt_hash text,
  match_threshold float,
  match_count int
)
RETURNS TABLE (id uuid, response text, similarity float)
LANGUAGE sql STABLE AS $$
  SELECT c.id, c.response, 1 - (c.embedding <=> query_embedding) AS similarity
  FROM llm_cache c
  WHERE c.system_prompt_hash = filter_prompt_hash
    AND 1 - (c.embedding <=> query_embedding) > match_threshold
  ORDER BY c.embedding <=> query_embedding
  LIMIT match_count;
$$;

For FAQ-style features, product Q&A, or customer support where users ask similar questions, semantic caching typically achieves 40-60% cache hit rate.

Real impact: Customer support bot, 8,000 messages/day. After semantic caching: 47% hit rate. Monthly LLM cost: $3,100 → $1,640.
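The hit rate depends heavily on the similarity threshold, so it helps to see what the database is computing. The sketch below is a plain JavaScript version of cosine similarity with the same 0.95 cutoff; the function names and the tiny 3-dimensional vectors are illustrative assumptions (real embeddings from `text-embedding-3-small` have 1,536 dimensions).

```javascript
// Cosine similarity: 1 means same direction, 0 means unrelated.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error('Vectors must have the same length');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero for empty vectors
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical helper mirroring the threshold check done in the database.
function isCacheHit(queryEmbedding, cachedEmbedding, threshold = 0.95) {
  return cosineSimilarity(queryEmbedding, cachedEmbedding) > threshold;
}

// Toy vectors standing in for embeddings.
const cachedQuestion = [1, 0, 0];
const nearDuplicate = [0.9, 0.1, 0];   // almost the same direction
const differentTopic = [0.5, 0.5, 0];  // noticeably different direction

console.log(isCacheHit(nearDuplicate, cachedQuestion));  // true (similarity ≈ 0.994)
console.log(isCacheHit(differentTopic, cachedQuestion)); // false (similarity ≈ 0.707)
```

Lowering the threshold raises the hit rate but also the risk of returning an answer to a question the user did not ask, so it is worth logging the similarity of every hit and reviewing samples near the cutoff.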

Strategy 4: Output length control

Output tokens are 3-5x more expensive than input tokens. Every unnecessary word in the response costs money.

// Bad: open-ended output with no length control
const verboseResponse = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Analyze this review: ' + reviewText }],
  // No max_tokens set - model will generate until it decides to stop
});

// Good: constrained output with explicit instruction
const conciseResponse = await openai.chat.completions.create({
  model: 'gpt-4o',
  max_tokens: 150, // Hard cap
  messages: [{
    role: 'user',
    content: `Analyze this review in 2-3 sentences maximum. Be concise.\n\nReview: ${reviewText}`,
  }],
});

// Best for structured data: JSON output mode
const jsonResponse = await openai.chat.completions.create({
  model: 'gpt-4o-mini',
  max_tokens: 100,
  response_format: { type: 'json_object' },
  messages: [{
    role: 'user',
    content: `Classify this review. Return JSON: {"sentiment":"positive|negative|neutral","score":1-5}. Review: ${reviewText}`,
  }],
});
// jsonResponse.choices[0].message.content is a JSON string such as:
// {"sentiment":"positive","score":4}
// Roughly 15 tokens instead of 150 for a paragraph-form analysis

Structured JSON output is especially powerful for classification, extraction, and scoring tasks - you get exactly the data you need in minimal tokens.
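JSON mode constrains the syntax of the reply, not its schema, and a tight `max_tokens` cap can cut a reply off mid-object. A small validation step keeps bad replies out of your data. The sketch below assumes the `sentiment`/`score` shape used in the prompt above; `parseClassification` is a hypothetical helper name.

```javascript
const ALLOWED_SENTIMENTS = new Set(['positive', 'negative', 'neutral']);

// Hypothetical helper: returns a validated object, or null if the reply is unusable.
function parseClassification(rawContent) {
  let parsed;
  try {
    parsed = JSON.parse(rawContent);
  } catch {
    return null; // truncated or malformed JSON
  }
  if (parsed === null || typeof parsed !== 'object') return null;

  const { sentiment, score } = parsed;
  if (!ALLOWED_SENTIMENTS.has(sentiment)) return null;
  if (!Number.isInteger(score) || score < 1 || score > 5) return null;

  // Return only the expected fields so extra keys never reach downstream code.
  return { sentiment, score };
}

console.log(parseClassification('{"sentiment":"positive","score":4}')); // valid reply
console.log(parseClassification('{"sentiment":"posi'));                 // null: truncated reply
console.log(parseClassification('{"sentiment":"positive","score":9}')); // null: score out of range
```

When the helper returns null, a single retry (or a fallback to a default label) is usually cheaper than letting an invalid record propagate.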

Strategy 5: Context window management

The most expensive scenario: sending the entire conversation history on every message.

// Naive: grows without bound
async function chat(userId, newMessage) {
  const fullHistory = await db.getAllMessages(userId); // Could be 100+ messages
  
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [...fullHistory, { role: 'user', content: newMessage }],
    // At 100 messages × 200 tokens average = 20,000 tokens input per request
    // At 2,000 requests/day = 40M tokens/day = $100/day on input alone
  });
}

// Smart: summarize old history, keep recent messages verbatim
async function chat(userId, newMessage) {
  const recentMessages = await db.getRecentMessages(userId, { limit: 10 });
  const summary = await db.getConversationSummary(userId);
  
  const systemMessage = summary
    ? `Previous context: ${summary}\n\nContinue the conversation.`
    : 'You are a helpful assistant.';
  
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      { role: 'system', content: systemMessage }, // ~200 tokens
      ...recentMessages, // ~2,000 tokens for 10 messages
      { role: 'user', content: newMessage },
      // Total: ~2,200 tokens instead of 20,000
    ],
  });
  
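  // Every 10 messages, update the summary. Checking recentMessages.length would
  // fire on every call once the history holds 10+ messages, so use the total count.
  // (db.countMessages is a hypothetical helper returning the user's total message count.)
  const totalMessages = await db.countMessages(userId);
  if (totalMessages > 0 && totalMessages % 10 === 0) {
    await updateConversationSummary(userId, recentMessages);
  }

  return response.choices[0].message.content;
}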
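A fixed count of 10 messages works when messages are short, but one pasted document can blow the budget. An alternative is to trim by estimated tokens instead. The sketch below is one possible approach under stated assumptions: `estimateTokens` uses the rule of thumb from earlier in this article (1,000 tokens ≈ 750 words) rather than a real tokenizer, and both function names are hypothetical.

```javascript
// Rough estimate from the rule of thumb above: 1,000 tokens ≈ 750 words.
// This is an approximation; use a real tokenizer when you need exact counts.
function estimateTokens(text) {
  const words = text.trim().split(/\s+/).filter(Boolean).length;
  return Math.ceil(words / 0.75);
}

// Keep the most recent messages that fit within maxTokens, preserving order.
function trimToTokenBudget(messages, maxTokens) {
  const kept = [];
  let used = 0;
  // Walk backwards so the newest messages are kept first.
  for (let i = messages.length - 1; i >= 0; i--) {
    const cost = estimateTokens(messages[i].content);
    if (used + cost > maxTokens) break;
    kept.unshift(messages[i]);
    used += cost;
  }
  return kept;
}

const history = [
  { role: 'user', content: 'one two three' },                          // ~4 tokens
  { role: 'assistant', content: 'four five six seven eight nine' },    // ~8 tokens
  { role: 'user', content: 'ten eleven twelve' },                      // ~4 tokens
];

// With a 12-token budget, only the last two messages fit.
console.log(trimToTokenBudget(history, 12).map((m) => m.content));
```

Trimming by budget gives a predictable upper bound on input cost per request, which makes the daily spend much easier to forecast.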

Strategy 6: Async processing with batching

Real-time responses are expensive because you can't batch requests. For non-real-time workloads, batching cuts costs 50%.

Both OpenAI (Batch API) and Anthropic (Message Batches) offer 50% discounts for async batch requests, with results within 24 hours.

// OpenAI Batch API - 50% discount, results within 24 hours
const batch = await openai.batches.create({
  input_file_id: uploadedFileId, // JSONL file with requests
  endpoint: '/v1/chat/completions',
  completion_window: '24h',
});

// Use cases: overnight content generation, weekly report synthesis,
// bulk categorization, email campaigns, SEO content

For product description generation, bulk SEO content, data analysis pipelines, and report generation: batch processing with the 50% discount is a no-brainer.
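The Batch API takes a JSONL file in which each line is one self-contained request. A minimal sketch of building that file content is shown below; `buildBatchJsonl`, the item shape (`id`, `prompt`), and the default model and token cap are assumptions for illustration.

```javascript
// Hypothetical helper: turn a list of work items into Batch API JSONL content.
// Each line is one request; custom_id lets you match results back to items.
function buildBatchJsonl(items, { model = 'gpt-4o-mini', maxTokens = 50 } = {}) {
  return items
    .map((item) =>
      JSON.stringify({
        custom_id: item.id,
        method: 'POST',
        url: '/v1/chat/completions',
        body: {
          model,
          max_tokens: maxTokens,
          messages: [{ role: 'user', content: item.prompt }],
        },
      })
    )
    .join('\n');
}

// Illustrative items (made-up data).
const tickets = [
  { id: 'ticket-1', prompt: 'Categorize this ticket: "My invoice is wrong"' },
  { id: 'ticket-2', prompt: 'Categorize this ticket: "The app crashes on login"' },
];

const jsonl = buildBatchJsonl(tickets);
console.log(jsonl);
```

The resulting string would be written to a file and uploaded with `openai.files.create` using `purpose: 'batch'`; the returned file ID is what goes into `input_file_id` when creating the batch. Results come back in a separate output file, not necessarily in submission order, which is why a meaningful `custom_id` matters.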

The cost monitoring setup you need

You can't optimize what you don't measure. Instrument every LLM call:

async function trackedLLMCall(task, messages, options = {}) {
  // Separate our own metadata from the parameters sent to the API:
  // userId is not an API parameter and must not be spread into the request.
  const { model = 'gpt-4o', userId, ...requestOptions } = options;
  const startTime = Date.now();
  
  const response = await openai.chat.completions.create({
    ...requestOptions,
    model,
    messages,
  });
  
  const usage = response.usage;
  const durationMs = Date.now() - startTime;
  
  // Log to your analytics
  await analytics.track('llm_call', {
    task,
    model, // the model actually used, including the default
    inputTokens: usage.prompt_tokens,
    outputTokens: usage.completion_tokens,
    cachedInputTokens: usage.prompt_tokens_details?.cached_tokens || 0,
    durationMs,
    estimatedCostUsd: calculateCost(model, usage),
    userId,
  });
  
  return response;
}

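function calculateCost(model, usage) {
  const PRICES = {
    'gpt-4o': { input: 2.5, output: 10 },           // USD per 1M tokens
    'gpt-4o-mini': { input: 0.15, output: 0.60 },
    'claude-sonnet-4-6': { input: 3, output: 15 },
    'claude-haiku-4-5-20251001': { input: 1, output: 5 },
  };
  
  const price = PRICES[model];
  if (!price) return null;
  
  // OpenAI reports prompt_tokens/completion_tokens; Anthropic reports input_tokens/output_tokens
  const inputTokens = usage.prompt_tokens ?? usage.input_tokens ?? 0;
  const outputTokens = usage.completion_tokens ?? usage.output_tokens ?? 0;
  
  // For OpenAI this is an upper bound: prompt_tokens includes cached tokens,
  // which are billed at a discount that is ignored here.
  return (
    (inputTokens / 1_000_000) * price.input +
    (outputTokens / 1_000_000) * price.output
  );
}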

Set up a daily cost report. Track cost per task, cost per user, input/output token ratio, and cache hit rate. Without this data, optimization is guesswork.
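A daily report can start as a simple aggregation over the logged events. The sketch below assumes events shaped like the `llm_call` payload above (`task`, `inputTokens`, `outputTokens`, `estimatedCostUsd`); `summarizeCostsByTask` is a hypothetical helper, and the sample events are made up.

```javascript
// Hypothetical helper: aggregate logged LLM calls into per-task totals,
// sorted so the most expensive task comes first.
function summarizeCostsByTask(events) {
  const byTask = new Map();
  for (const event of events) {
    if (!byTask.has(event.task)) {
      byTask.set(event.task, { task: event.task, calls: 0, inputTokens: 0, outputTokens: 0, costUsd: 0 });
    }
    const row = byTask.get(event.task);
    row.calls += 1;
    row.inputTokens += event.inputTokens;
    row.outputTokens += event.outputTokens;
    row.costUsd += event.estimatedCostUsd || 0; // unknown models log null cost
  }
  return [...byTask.values()].sort((a, b) => b.costUsd - a.costUsd);
}

// Illustrative events (made-up numbers).
const sampleEvents = [
  { task: 'classifySentiment', inputTokens: 300, outputTokens: 10, estimatedCostUsd: 0.00005 },
  { task: 'draftPersonalizedEmail', inputTokens: 1200, outputTokens: 600, estimatedCostUsd: 0.009 },
  { task: 'classifySentiment', inputTokens: 280, outputTokens: 12, estimatedCostUsd: 0.00005 },
];

const report = summarizeCostsByTask(sampleEvents);
for (const row of report) {
  const avgCost = row.costUsd / row.calls;
  console.log(`${row.task}: ${row.calls} calls, $${row.costUsd.toFixed(5)} total, $${avgCost.toFixed(5)} per call`);
}
```

Sorting by total cost puts the biggest cost driver at the top of the report, which is the first thing to look at before choosing a strategy.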

Putting it together: realistic savings

Strategy	Effort	Typical Savings
Model routing (mini for simple tasks)	Low	50-70%
Prompt caching (large static context)	Low	30-80% on cached content
Output length control	Low	20-50%
Semantic caching (FAQ-style)	Medium	30-60%
Context window management	Medium	40-80% for long conversations
Batch processing (non-real-time)	Medium	50% flat

Applied together on a real production system (customer support + content generation + analysis), we went from a projected $11,000/month at scale to $2,200/month. The user experience was unchanged because most of the savings came from smarter model selection and caching, not from degrading response quality.

The principle: measure first, then optimize specifically. Every system has a different cost profile. The 80% saving comes from finding your biggest cost driver - which is almost never where you'd guess before looking at the data.

Aunimeda builds AI-powered solutions - chatbots, AI agents, voice assistants, and automation systems for businesses.
