LLM in Production: How to Cut Your AI API Costs by 80% Without Degrading Quality
AI API costs grow multiplicatively: more users, more requests per user, and more tokens per request all compound. A feature that costs $200/month in a staging environment with light usage can hit $8,000/month when real users start hammering it. The math is brutal: every token in, every token out, on every request, for every user.
We've shipped multiple LLM-powered features to production and learned - sometimes painfully - where the money goes and how to stop it. These are the concrete strategies, with real numbers.
Where the money actually goes
First, understand the billing model. Most LLM APIs charge per token:
- Input tokens (prompt + context)
- Output tokens (generated response)
Output tokens typically cost 3-5x more than input tokens.
For reference, 1,000 tokens ≈ 750 words.
- GPT-4o (2025): ~$2.50/1M input tokens, ~$10/1M output tokens
- Claude Sonnet 4.6: ~$3/1M input tokens, ~$15/1M output tokens
- GPT-4o mini: ~$0.15/1M input tokens, ~$0.60/1M output tokens
The cost difference between "premium" and "mini/haiku" models is 15-20x. The quality difference depends entirely on what you're asking them to do.
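To make the per-token billing concrete, here is a minimal sketch that projects a monthly bill from traffic and token counts. The workload numbers (10,000 requests/day, 1,500 input and 400 output tokens per request) are hypothetical, and the prices are the approximate figures listed above, so treat the output as an estimate rather than a quote.

```javascript
// Prices in USD per 1M tokens, copied from the approximate list above.
const PRICES_PER_MILLION = {
  'gpt-4o': { input: 2.5, output: 10 },
  'gpt-4o-mini': { input: 0.15, output: 0.6 },
};

// Estimate monthly spend for one feature, assuming every request has the
// same token counts and traffic is flat across a 30-day month.
function estimateMonthlyCost({ model, requestsPerDay, inputTokens, outputTokens }) {
  const price = PRICES_PER_MILLION[model];
  if (!price) throw new Error(`No price configured for model: ${model}`);
  const costPerRequest =
    (inputTokens / 1_000_000) * price.input +
    (outputTokens / 1_000_000) * price.output;
  return costPerRequest * requestsPerDay * 30;
}

// Hypothetical workload, used only for illustration.
const workload = { requestsPerDay: 10_000, inputTokens: 1_500, outputTokens: 400 };
const premium = estimateMonthlyCost({ model: 'gpt-4o', ...workload });
const mini = estimateMonthlyCost({ model: 'gpt-4o-mini', ...workload });
console.log(`gpt-4o: $${premium.toFixed(2)}  gpt-4o-mini: $${mini.toFixed(2)}`);
```

Under these assumptions the same workload comes to roughly $2,325/month on GPT-4o and $139.50/month on GPT-4o mini, which is where the 15-20x gap shows up in practice.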
Strategy 1: Model routing - the biggest lever
The most impactful change you can make: don't use the same model for everything.
Most LLM-powered features have two types of requests:
- Simple classification/extraction - "Is this review positive or negative?", "Extract the name from this text"
- Complex reasoning/generation - "Write a personalized email based on these 10 customer data points"
The simple requests don't need GPT-4o. They need GPT-4o mini or Claude Haiku, which cost 15-20x less and are often just as accurate for simple tasks.
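// Assumes `openai` and `anthropic` SDK clients are already initialized.
const router = {
  // Simple tasks → cheap model
  classifySentiment: { provider: 'openai', model: 'gpt-4o-mini', maxTokens: 20 },
  extractEntities: { provider: 'openai', model: 'gpt-4o-mini', maxTokens: 150 },
  categorizeTicket: { provider: 'openai', model: 'gpt-4o-mini', maxTokens: 50 },
  // Complex tasks → powerful model
  draftPersonalizedEmail: { provider: 'openai', model: 'gpt-4o', maxTokens: 800 },
  analyzeContractClauses: { provider: 'anthropic', model: 'claude-opus-4-6', maxTokens: 2000 },
  generateCodeReview: { provider: 'openai', model: 'gpt-4o', maxTokens: 1500 },
};

async function callLLM(task, prompt) {
  const config = router[task];
  if (!config) throw new Error(`No route configured for task: ${task}`);
  const messages = [{ role: 'user', content: prompt }];

  // Claude models must go through the Anthropic client, not the OpenAI one.
  // Note: the two SDKs return differently shaped responses.
  if (config.provider === 'anthropic') {
    return anthropic.messages.create({
      model: config.model,
      max_tokens: config.maxTokens,
      messages,
    });
  }
  return openai.chat.completions.create({
    model: config.model,
    max_tokens: config.maxTokens,
    messages,
  });
}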
Real impact: One product we shipped had all requests on GPT-4. After routing 70% of simple tasks to GPT-4o mini: costs dropped from $4,200/month to $890/month. Same user experience - the simple tasks had identical output quality on the cheaper model.
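As a rough way to estimate what routing could save before you ship it, the blended cost can be sketched as below. This is a simplification under stated assumptions: `routedShare` is the share of current spend (not request count) that moves to the cheaper model, `priceRatio` is how many times cheaper that model is, and token volumes per request stay the same. Real bills depend on how token-heavy each task is, so measured savings can differ from this estimate.

```javascript
// Fraction of the original bill that remains after moving `routedShare`
// of spend to a model that is `priceRatio` times cheaper.
function blendedCostFraction(routedShare, priceRatio) {
  if (routedShare < 0 || routedShare > 1) {
    throw new RangeError('routedShare must be between 0 and 1');
  }
  if (priceRatio <= 0) {
    throw new RangeError('priceRatio must be positive');
  }
  return (1 - routedShare) + routedShare / priceRatio;
}

// Illustrative inputs: 70% of spend routed to a model that is 16x cheaper.
const remaining = blendedCostFraction(0.7, 16);
console.log(`Remaining cost: ${(remaining * 100).toFixed(1)}% of the original bill`);
```

With these illustrative inputs about 34% of the original bill remains. The unrouted share sets the floor: however cheap the small model is, the complex tasks still pay full price.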
Strategy 2: Prompt caching
If your prompts have a large, static prefix (a system prompt, a document being analyzed, a knowledge base), prompt caching lets you pay for that context once per cache period instead of on every request.
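Both Anthropic and OpenAI support prompt caching, but the discounts differ. On Anthropic, cache reads are billed at ~10% of the normal input price, while the request that writes the cache pays a premium of about 25%. OpenAI applies its discount automatically; on GPT-4o it is about 50% off cached input tokens. Check the current pricing pages before modeling savings.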
// Anthropic: cache_control on the system prompt
const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-6',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: longSystemPrompt, // 2,000 tokens of static instructions
      cache_control: { type: 'ephemeral' }, // Cache for 5 minutes
    },
  ],
  messages: [
    {
      role: 'user',
      content: userQuestion, // Only this changes per request
    },
  ],
});

// OpenAI: automatic caching for prompts of 1,024 tokens or more
// Prompts that start with identical content are cached automatically,
// so keep the static part first and the per-request part last
// No code change required - it's transparent
Real impact: A document Q&A feature where every question loaded a 3,000-token document. After implementing prompt caching, input token cost dropped 82% (paying full price on first request per document per cache period, 10% on subsequent requests). At 200 questions per document per day: savings compound fast.
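How much caching saves depends on how many requests reuse the prefix before the cache expires. The sketch below works through that arithmetic for the static prefix only. The multipliers are assumptions to verify against current provider pricing: the first request in a cache period writes the cache at 1.25x the normal input price, and later requests read it at 0.1x.

```javascript
// Relative cost of the static prefix across one cache period, where 1.0
// means "one request paying the normal input price for the prefix".
function prefixCostWithCaching(
  requestsInPeriod,
  { writeMultiplier = 1.25, readMultiplier = 0.1 } = {}
) {
  if (requestsInPeriod < 1) return 0;
  return writeMultiplier + readMultiplier * (requestsInPeriod - 1);
}

// Fractional saving on prefix tokens compared with no caching.
// Negative values mean caching cost more than it saved.
function prefixSavings(requestsInPeriod, multipliers) {
  if (requestsInPeriod < 1) return 0;
  const withCache = prefixCostWithCaching(requestsInPeriod, multipliers);
  const withoutCache = requestsInPeriod; // every request pays 1x for the prefix
  return 1 - withCache / withoutCache;
}

for (const n of [1, 2, 10, 50]) {
  console.log(`${n} requests per cache period: ${(prefixSavings(n) * 100).toFixed(1)}% saved`);
}
```

Under these assumptions a prefix used once per cache period costs more with caching than without, two uses save about a third, and ten uses save roughly 78%. The saving applies only to the cached prefix; the per-request question and the output tokens are billed as usual.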
Strategy 3: Semantic caching
Standard HTTP caching doesn't work for LLMs because requests are free-form text. But semantically similar questions should return similar (or identical) answers.
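import { createHash } from 'node:crypto';
import { OpenAI } from 'openai';
import { createClient } from '@supabase/supabase-js';

const openai = new OpenAI();
const supabase = createClient(process.env.SUPABASE_URL, process.env.SUPABASE_KEY);

// Stable hash that ties each cached answer to the system prompt that produced it
const hashString = (text) => createHash('sha256').update(text).digest('hex');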
async function cachedLLMCall(userQuery, systemPrompt) {
  // Scope the cache to this system prompt so answers generated under
  // different instructions are never reused
  const promptHash = hashString(systemPrompt);

  // 1. Embed the query
  const embeddingResponse = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: userQuery,
  });
  const queryEmbedding = embeddingResponse.data[0].embedding;

  // 2. Search for similar cached queries
  const { data: similar } = await supabase.rpc('match_llm_cache', {
    query_embedding: queryEmbedding,
    prompt_hash: promptHash,
    match_threshold: 0.95, // High threshold - only near-identical queries
    match_count: 1,
  });
  if (similar?.length > 0) {
    // Cache hit - return cached response
    await supabase.from('llm_cache_hits').insert({ cache_id: similar[0].id });
    return similar[0].response;
  }

  // 3. Cache miss - call LLM
  const llmResponse = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      { role: 'system', content: systemPrompt },
      { role: 'user', content: userQuery },
    ],
  });
  const responseText = llmResponse.choices[0].message.content;

  // 4. Store in cache with embedding
  await supabase.from('llm_cache').insert({
    query: userQuery,
    response: responseText,
    embedding: queryEmbedding,
    system_prompt_hash: promptHash,
    created_at: new Date().toISOString(),
  });
  return responseText;
}

-- Supabase/PostgreSQL: pgvector function for semantic search
-- (<=> is pgvector's cosine distance, so 1 - distance is cosine similarity)
CREATE OR REPLACE FUNCTION match_llm_cache(
  query_embedding vector(1536),
  prompt_hash text,
  match_threshold float,
  match_count int
)
RETURNS TABLE (id uuid, response text, similarity float)
LANGUAGE sql STABLE AS $$
  SELECT c.id, c.response, 1 - (c.embedding <=> query_embedding) AS similarity
  FROM llm_cache AS c
  WHERE c.system_prompt_hash = prompt_hash
    AND 1 - (c.embedding <=> query_embedding) > match_threshold
  ORDER BY c.embedding <=> query_embedding
  LIMIT match_count;
$$;
For FAQ-style features, product Q&A, or customer support where users ask similar questions, semantic caching typically achieves a 40-60% cache hit rate.
Real impact: Customer support bot, 8,000 messages/day. After semantic caching: 47% hit rate. Monthly LLM cost: $3,100 → $1,640.
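To see what the similarity threshold is doing, here is a minimal in-memory sketch of the same lookup without a database. It assumes each cache entry carries an `embedding` array and a `response` string. The tiny 2-dimensional vectors in the usage lines are made up for illustration, since real embeddings have hundreds or thousands of dimensions.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error('Vectors must have the same length');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the most similar cached entry above the threshold, or null on a miss.
function findCached(entries, queryEmbedding, threshold = 0.95) {
  let best = null;
  for (const entry of entries) {
    const similarity = cosineSimilarity(entry.embedding, queryEmbedding);
    if (similarity > threshold && (!best || similarity > best.similarity)) {
      best = { ...entry, similarity };
    }
  }
  return best;
}

// Toy data: made-up 2-D "embeddings" for two cached questions.
const toyCache = [
  { embedding: [1, 0], response: 'You can reset your password in Settings.' },
  { embedding: [0, 1], response: 'Refunds take 5-7 business days.' },
];
const hit = findCached(toyCache, [1, 0.1]); // nearly parallel to the first entry
const miss = findCached(toyCache, [1, 1]);  // halfway between both entries
console.log(hit ? hit.response : 'miss', '|', miss ? miss.response : 'miss');
```

Lowering the threshold raises the hit rate but also raises the chance of answering a different question than the one asked, so it is worth tuning against logged queries rather than guessing.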
Strategy 4: Output length control
Output tokens are 3-5x more expensive than input tokens. Every unnecessary word in the response costs money.
// Bad: open-ended output with no length control
const verbose = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Analyze this review: ' + reviewText }],
  // No max_tokens set - model will generate until it decides to stop
});

// Good: constrained output with explicit instruction
const concise = await openai.chat.completions.create({
  model: 'gpt-4o',
  max_tokens: 150, // Hard cap
  messages: [{
    role: 'user',
    content: `Analyze this review in 2-3 sentences maximum. Be concise.\n\nReview: ${reviewText}`,
  }],
});

// Best for structured data: JSON output mode
const structured = await openai.chat.completions.create({
  model: 'gpt-4o-mini',
  max_tokens: 100,
  response_format: { type: 'json_object' },
  messages: [{
    role: 'user',
    content: `Classify this review. Return JSON: {"sentiment":"positive|negative|neutral","score":1-5}. Review: ${reviewText}`,
  }],
});
// Returns something like: {"sentiment":"positive","score":4}
// ~15 tokens instead of ~150 for a paragraph-form analysis
// JSON mode produces syntactically valid JSON (unless max_tokens truncates it)
// but does not enforce this schema, so validate the result before using it
Structured JSON output is especially powerful for classification, extraction, and scoring tasks - you get exactly the data you need in minimal tokens.
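Because a hard token cap can truncate a reply and JSON mode does not enforce a particular shape, it helps to validate the result before trusting it. A minimal sketch, assuming the `sentiment` and `score` fields from the classification prompt above; `parseClassification` is a hypothetical helper name:

```javascript
const ALLOWED_SENTIMENTS = new Set(['positive', 'negative', 'neutral']);

// Parse and validate the model's reply. Returns null when the reply is not
// usable (invalid or truncated JSON, unexpected values) so the caller can
// retry or fall back.
function parseClassification(rawText) {
  let parsed;
  try {
    parsed = JSON.parse(rawText);
  } catch {
    return null;
  }
  if (parsed === null || typeof parsed !== 'object') return null;

  const { sentiment, score } = parsed;
  if (!ALLOWED_SENTIMENTS.has(sentiment)) return null;
  if (!Number.isInteger(score) || score < 1 || score > 5) return null;
  return { sentiment, score };
}

console.log(parseClassification('{"sentiment":"positive","score":4}'));
console.log(parseClassification('{"sentiment":"posi')); // truncated reply
```

Returning `null` instead of throwing keeps the retry decision with the caller, which matters when a retry costs another API call.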
Strategy 5: Context window management
The most expensive scenario: sending the entire conversation history on every message.
// Naive: grows without bound
async function chatNaive(userId, newMessage) {
  const fullHistory = await db.getAllMessages(userId); // Could be 100+ messages
  return openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [...fullHistory, { role: 'user', content: newMessage }],
    // At 100 messages × 200 tokens average = 20,000 tokens input per request
    // At 2,000 requests/day = 40M tokens/day = $100/day on input alone
  });
}

// Smart: summarize old history, keep recent messages verbatim
async function chat(userId, newMessage) {
  const recentMessages = await db.getRecentMessages(userId, { limit: 10 });
  const summary = await db.getConversationSummary(userId);
  const systemMessage = summary
    ? `Previous context: ${summary}\n\nContinue the conversation.`
    : 'You are a helpful assistant.';

  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      { role: 'system', content: systemMessage }, // ~200 tokens
      ...recentMessages, // ~2,000 tokens for 10 messages
      { role: 'user', content: newMessage },
      // Total: ~2,200 tokens instead of 20,000
    ],
  });

  // Every 10 messages, update the summary.
  // db.countMessages is a hypothetical helper returning the total stored so far;
  // checking recentMessages.length alone would re-summarize on every request.
  const totalMessages = await db.countMessages(userId);
  if (totalMessages > 0 && totalMessages % 10 === 0) {
    await updateConversationSummary(userId, recentMessages);
  }
  return response;
}
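A fixed message count is a blunt instrument when message lengths vary. An alternative is to trim history to a token budget instead, sketched below. This is not the approach used above, just a variation on it. The token estimate uses the rule of thumb from earlier (1,000 tokens ≈ 750 words), which is only approximate; a real tokenizer would be more accurate.

```javascript
// Rough token estimate from the rule of thumb: 1,000 tokens ≈ 750 words.
function estimateTokens(text) {
  const words = text.trim().split(/\s+/).filter(Boolean).length;
  return Math.ceil(words / 0.75);
}

// Keep the newest messages that fit within maxTokens, preserving their order.
// Stops at the first message that would exceed the budget.
function trimToTokenBudget(messages, maxTokens) {
  const kept = [];
  let used = 0;
  for (let i = messages.length - 1; i >= 0; i--) {
    const cost = estimateTokens(messages[i].content);
    if (used + cost > maxTokens) break;
    kept.unshift(messages[i]);
    used += cost;
  }
  return kept;
}

// Usage with made-up messages of three words (about 4 tokens) each.
const history = [
  { role: 'user', content: 'first short message' },
  { role: 'assistant', content: 'second short message' },
  { role: 'user', content: 'third short message' },
];
console.log(trimToTokenBudget(history, 8).map((m) => m.content));
```

With a budget of 8 estimated tokens, only the two most recent messages are kept; older ones would be covered by the conversation summary.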
Strategy 6: Async processing with batching
Real-time responses are expensive because you can't batch requests. For non-real-time workloads, batching cuts costs 50%.
Both OpenAI (Batch API) and Anthropic (Message Batches) offer 50% discounts for async batch requests, with results within 24 hours.
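// OpenAI Batch API - 50% discount, results within 24 hours
import fs from 'node:fs';

// 1. Upload the JSONL file with requests ('requests.jsonl' is a placeholder path)
const uploaded = await openai.files.create({
  file: fs.createReadStream('requests.jsonl'),
  purpose: 'batch',
});
// 2. Create the batch job, then poll openai.batches.retrieve(batch.id) for completion
const batch = await openai.batches.create({
  input_file_id: uploaded.id,
  endpoint: '/v1/chat/completions',
  completion_window: '24h',
});
// Use cases: overnight content generation, weekly report synthesis,
// bulk categorization, email campaigns, SEO content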
For product description generation, bulk SEO content, data analysis pipelines, and report generation: batch processing with the 50% discount is a no-brainer.
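The batch input is a JSONL file with one request object per line. A minimal sketch of building that payload for a bulk categorization job follows. The `items` array, the `item-<id>` naming, and the prompt wording are hypothetical; the `custom_id`, `method`, `url`, and `body` fields follow the OpenAI Batch API request format.

```javascript
// Build the JSONL payload for a batch: one JSON request object per line.
// `items` is a hypothetical array of { id, text } records to categorize.
function buildBatchJsonl(items, { model = 'gpt-4o-mini', maxTokens = 50 } = {}) {
  return items
    .map((item) =>
      JSON.stringify({
        custom_id: `item-${item.id}`, // used to match results back to inputs
        method: 'POST',
        url: '/v1/chat/completions',
        body: {
          model,
          max_tokens: maxTokens,
          messages: [
            { role: 'user', content: `Categorize this ticket: ${item.text}` },
          ],
        },
      })
    )
    .join('\n');
}

const jsonl = buildBatchJsonl([
  { id: 1, text: 'I was charged twice this month.' },
  { id: 2, text: 'The app crashes when I open settings.' },
]);
console.log(jsonl);
```

Results can come back in a different order than the input, which is why each request needs a unique `custom_id`.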
The cost monitoring setup you need
You can't optimize what you don't measure. Instrument every LLM call:
// `messages` is the array of chat messages to send
async function trackedLLMCall(task, messages, options = {}) {
  // Separate our own metadata (userId) from parameters the API accepts,
  // and resolve the default model once so logging matches what was called
  const { userId, model = 'gpt-4o', ...apiParams } = options;

  const startTime = Date.now();
  const response = await openai.chat.completions.create({
    ...apiParams,
    model,
    messages,
  });
  const usage = response.usage;
  const durationMs = Date.now() - startTime;

  // Log to your analytics
  await analytics.track('llm_call', {
    task,
    model,
    inputTokens: usage.prompt_tokens,
    outputTokens: usage.completion_tokens,
    cachedInputTokens: usage.prompt_tokens_details?.cached_tokens || 0,
    durationMs,
    estimatedCostUsd: calculateCost(model, usage),
    userId,
  });
  return response;
}
function calculateCost(model, usage) {
  const PRICES = {
    'gpt-4o': { input: 2.5, output: 10 }, // per 1M tokens
    'gpt-4o-mini': { input: 0.15, output: 0.60 },
    'claude-sonnet-4-6': { input: 3, output: 15 },
    'claude-haiku-4-5-20251001': { input: 1, output: 5 },
  };
  const price = PRICES[model];
  if (!price) return null;
  // Treats cached input tokens as full price, so this is an upper bound
  return (
    (usage.prompt_tokens / 1_000_000) * price.input +
    (usage.completion_tokens / 1_000_000) * price.output
  );
}
Set up a daily cost report. Track cost per task, cost per user, input/output token ratio, and cache hit rate. Without this data, optimization is guesswork.
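A daily report can start as a simple aggregation over the logged events. The sketch below assumes events shaped like the `llm_call` payload above (`task`, `estimatedCostUsd`, `inputTokens`, `outputTokens`, `cachedInputTokens`); how you load them from your analytics store is left out, and `summarizeByTask` is a hypothetical helper name.

```javascript
// Aggregate logged llm_call events into per-task totals and ratios.
function summarizeByTask(events) {
  const report = {};
  for (const event of events) {
    if (!report[event.task]) {
      report[event.task] = {
        calls: 0,
        costUsd: 0,
        inputTokens: 0,
        outputTokens: 0,
        cachedInputTokens: 0,
      };
    }
    const row = report[event.task];
    row.calls += 1;
    row.costUsd += event.estimatedCostUsd ?? 0; // null when the model has no price entry
    row.inputTokens += event.inputTokens;
    row.outputTokens += event.outputTokens;
    row.cachedInputTokens += event.cachedInputTokens ?? 0;
  }
  for (const row of Object.values(report)) {
    // Share of input tokens served from the provider's prompt cache
    row.cachedShare = row.inputTokens > 0 ? row.cachedInputTokens / row.inputTokens : 0;
    row.outputToInputRatio = row.inputTokens > 0 ? row.outputTokens / row.inputTokens : 0;
  }
  return report;
}

// Made-up events for illustration.
const sampleEvents = [
  { task: 'classify', estimatedCostUsd: 0.001, inputTokens: 1000, outputTokens: 10, cachedInputTokens: 500 },
  { task: 'classify', estimatedCostUsd: 0.001, inputTokens: 1000, outputTokens: 10, cachedInputTokens: 0 },
  { task: 'draft', estimatedCostUsd: 0.02, inputTokens: 2000, outputTokens: 800 },
];
console.table(summarizeByTask(sampleEvents));
```

Sorting the rows by `costUsd` is usually the quickest way to find the one task worth optimizing first.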
Putting it together: realistic savings
| Strategy | Effort | Typical Savings |
|---|---|---|
| Model routing (mini for simple tasks) | Low | 50-70% |
| Prompt caching (large static context) | Low | 30-80% on cached content |
| Output length control | Low | 20-50% |
| Semantic caching (FAQ-style) | Medium | 30-60% |
| Context window management | Medium | 40-80% for long conversations |
| Batch processing (non-real-time) | Medium | 50% flat |
Applied together on a real production system (customer support + content generation + analysis), we went from a projected $11,000/month at scale to $2,200/month. The user experience was unchanged because most of the savings came from smarter model selection and caching, not from degrading response quality.
The principle: measure first, then optimize specifically. Every system has a different cost profile. The 80% saving comes from finding your biggest cost driver - which is almost never where you'd guess before looking at the data.
Aunimeda builds AI-powered solutions - chatbots, AI agents, voice assistants, and automation systems for businesses.