The first version of our AI feature was a standard API route: send request, wait for complete response, return JSON. For a short classification task, fine. For a 500-word AI-generated summary, the user stared at a spinner for 8-12 seconds before anything appeared.
Streaming solves the perceived latency problem: instead of waiting for the full response, tokens stream to the client as they're generated. The first word appears in under 500ms. The experience feels instant even when generation takes 15 seconds total.
This post covers the full implementation: Edge Runtime for minimal cold starts, streaming responses, provider failover, and the gotchas that took us time to figure out.
Why Edge Runtime
Next.js routes can run in two environments:
- Node.js runtime (default): Full Node.js APIs, 150MB memory limit on Vercel, cold start ~500ms
- Edge Runtime: V8 isolates, runs close to the user (CDN edge), cold start ~5ms, limited to Web APIs (no fs, no net)
For AI streaming, the difference matters:
- Node.js runtime: cold start delays the first token by 500ms+ after a period of inactivity
- Edge Runtime: first token appears in under 100ms from a warm edge node
The trade-off: Edge Runtime can't use Node.js-specific packages. Anthropic's SDK and OpenAI's SDK both support Edge Runtime via the Web Streams API - this wasn't always the case (pre-2024 versions required Node.js streams).
// app/api/generate/route.ts
export const runtime = 'edge'; // This one line switches to Edge Runtime
Streaming With the Anthropic SDK
// app/api/generate/route.ts
import Anthropic from '@anthropic-ai/sdk';
import { NextRequest } from 'next/server';
export const runtime = 'edge';
const anthropic = new Anthropic({
apiKey: process.env.ANTHROPIC_API_KEY,
});
export async function POST(request: NextRequest) {
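// Caution: a client-supplied systemPrompt lets any caller replace your instructions.
// Keep the system prompt server-side unless every caller is trusted.
const { prompt, systemPrompt } = await request.json();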
if (!prompt || typeof prompt !== 'string') {
return new Response('Invalid request', { status: 400 });
}
// Create a streaming response
const stream = await anthropic.messages.stream({
model: 'claude-opus-4-6',
max_tokens: 1024,
system: systemPrompt ?? 'You are a helpful assistant.',
messages: [{ role: 'user', content: prompt }],
});
// Convert Anthropic's stream to a ReadableStream of SSE events
const encoder = new TextEncoder();
const readableStream = new ReadableStream({
async start(controller) {
try {
for await (const chunk of stream) {
if (
chunk.type === 'content_block_delta' &&
chunk.delta.type === 'text_delta'
) {
const text = chunk.delta.text;
// Server-Sent Events format
const data = `data: ${JSON.stringify({ text })}\n\n`;
controller.enqueue(encoder.encode(data));
}
}
// Signal end of stream
controller.enqueue(encoder.encode('data: [DONE]\n\n'));
controller.close();
} catch (error) {
controller.error(error);
}
},
});
return new Response(readableStream, {
headers: {
'Content-Type': 'text/event-stream',
'Cache-Control': 'no-cache',
'Connection': 'keep-alive',
},
});
}
The Client-Side Stream Consumer
// components/AiTextGenerator.tsx
'use client';
import { useState } from 'react';
export function AiTextGenerator() {
const [output, setOutput] = useState('');
const [isStreaming, setIsStreaming] = useState(false);
async function generate(prompt: string) {
setOutput('');
setIsStreaming(true);
try {
const response = await fetch('/api/generate', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ prompt }),
});
if (!response.ok) throw new Error(`HTTP ${response.status}`);
if (!response.body) throw new Error('No response body');
const reader = response.body.getReader();
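const decoder = new TextDecoder();
let buffer = ''; // Holds a partial SSE line when an event is split across network chunks
let finished = false;
while (!finished) {
const { done, value } = await reader.read();
if (done) break;
buffer += decoder.decode(value, { stream: true });
const lines = buffer.split('\n');
buffer = lines.pop() ?? ''; // Keep the incomplete last line for the next read
for (const line of lines) {
if (!line.startsWith('data: ')) continue;
const data = line.slice(6); // Remove 'data: ' prefix
if (data === '[DONE]') {
finished = true; // Also exit the outer read loop, not just this for loop
break;
}
try {
const { text } = JSON.parse(data);
if (typeof text === 'string') setOutput(prev => prev + text);
} catch {
// Ignore malformed chunks
}
}
}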
} catch (error) {
console.error('Streaming error:', error);
setOutput(prev => prev + '\n[Error: generation failed]');
} finally {
setIsStreaming(false);
}
}
return (
<div>
<button onClick={() => generate('Explain React Server Components')} disabled={isStreaming}>
{isStreaming ? 'Generating...' : 'Generate'}
</button>
<div className="output">{output}</div>
</div>
);
}
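Network chunks don't line up with SSE event boundaries: a single `data:` line can arrive split across two `reader.read()` calls. Here is a minimal sketch of the buffering logic as a standalone function, which makes it easy to unit-test outside React. The names `createSseParser` and `SseMessage` are illustrative and not part of any library; the sketch assumes the `data: {json}` / `data: [DONE]` format the route emits.

```typescript
// Shape of the JSON payloads carried in each `data:` line (assumed format).
type SseMessage = { text?: string; error?: string };

interface SseParser {
  /** Feed one decoded network chunk; returns the complete messages it contained. */
  push(chunk: string): SseMessage[];
  /** True once the `[DONE]` sentinel has been seen. */
  readonly done: boolean;
}

function createSseParser(): SseParser {
  let buffer = '';
  let done = false;

  return {
    get done() {
      return done;
    },
    push(chunk: string): SseMessage[] {
      const messages: SseMessage[] = [];
      if (done) return messages;

      buffer += chunk;
      const lines = buffer.split('\n');
      // The last element is either '' or an incomplete line; keep it for the next push.
      buffer = lines.pop() ?? '';

      for (const line of lines) {
        if (!line.startsWith('data: ')) continue;
        const data = line.slice('data: '.length);
        if (data === '[DONE]') {
          done = true;
          break;
        }
        try {
          messages.push(JSON.parse(data) as SseMessage);
        } catch {
          // Skip lines that are not valid JSON.
        }
      }
      return messages;
    },
  };
}
```

In the component, each decoded chunk would go through `parser.push(...)`, and the read loop would stop once `parser.done` is true.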
Provider Failover: Claude Primary, OpenAI Fallback
Anthropic's API occasionally has elevated latency or rate limit issues. For a production feature, we wanted automatic failover to OpenAI:
// lib/ai/generate.ts
import Anthropic from '@anthropic-ai/sdk';
import OpenAI from 'openai';
const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
interface StreamOptions {
prompt: string;
system?: string;
maxTokens?: number;
}
type Provider = 'anthropic' | 'openai';
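async function streamFromAnthropic(options: StreamOptions): Promise<ReadableStream> {
// messages.create({ stream: true }) resolves once the response headers arrive,
// so awaiting it is what the connection timeout below actually measures.
// (messages.stream() returns a stream object immediately, before any connection is made.)
const stream = await anthropic.messages.create({
model: 'claude-opus-4-6',
max_tokens: options.maxTokens ?? 1024,
system: options.system ?? 'You are a helpful assistant.',
messages: [{ role: 'user', content: options.prompt }],
stream: true,
});
return anthropicStreamToReadable(stream);
}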
async function streamFromOpenAI(options: StreamOptions): Promise<ReadableStream> {
const stream = await openai.chat.completions.create({
model: 'gpt-4o',
max_tokens: options.maxTokens ?? 1024,
stream: true,
messages: [
{ role: 'system', content: options.system ?? 'You are a helpful assistant.' },
{ role: 'user', content: options.prompt },
],
});
return openaiStreamToReadable(stream);
}
export async function generateWithFailover(
options: StreamOptions
): Promise<{ stream: ReadableStream; provider: Provider }> {
// Try Anthropic first with 5-second timeout on connection
let timer: ReturnType<typeof setTimeout> | undefined;
try {
const timeoutPromise = new Promise<never>((_, reject) => {
timer = setTimeout(() => reject(new Error('Anthropic timeout')), 5000);
});
const stream = await Promise.race([
streamFromAnthropic(options),
timeoutPromise,
]);
return { stream, provider: 'anthropic' };
} catch (error) {
console.warn('Anthropic failed, falling back to OpenAI:', error);
// Fallback to OpenAI
const stream = await streamFromOpenAI(options);
return { stream, provider: 'openai' };
} finally {
clearTimeout(timer); // Don't leave the timer pending once the race has settled
}
}
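The 5-second timeout is meant to cover connection establishment, not the full generation. This way we catch "Anthropic API is down" quickly without prematurely timing out legitimate slow generations. That only holds if the awaited call resolves when the response headers arrive; a call that hands back a stream object immediately will never trip the timer, and any failure then surfaces mid-stream, after the fallback decision has already been made.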
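The failover code calls `anthropicStreamToReadable` and `openaiStreamToReadable` without showing them. One way they might look, as a sketch rather than the exact helpers used here: both providers' streams are async iterables, so a single generic converter plus a per-provider text extractor is enough. The chunk types below are simplified stand-ins for the SDKs' own event types, and `toSseStream`, `anthropicText`, and `openaiText` are illustrative names.

```typescript
// Simplified stand-ins for the SDK chunk types (only the fields read here).
type AnthropicEvent = { type: string; delta?: { type: string; text?: string } };
type OpenAIChunk = { choices: Array<{ delta?: { content?: string | null } }> };

// Anthropic: text arrives in content_block_delta events carrying a text_delta.
function anthropicText(event: AnthropicEvent): string | null {
  if (event.type === 'content_block_delta' && event.delta?.type === 'text_delta') {
    return event.delta.text ?? null;
  }
  return null;
}

// OpenAI: text arrives in choices[0].delta.content.
function openaiText(chunk: OpenAIChunk): string | null {
  return chunk.choices[0]?.delta?.content ?? null;
}

// Convert any async iterable of provider chunks into an SSE byte stream.
function toSseStream<T>(
  source: AsyncIterable<T>,
  extractText: (chunk: T) => string | null,
): ReadableStream<Uint8Array> {
  const encoder = new TextEncoder();
  return new ReadableStream<Uint8Array>({
    async start(controller) {
      try {
        for await (const chunk of source) {
          const text = extractText(chunk);
          if (text) {
            controller.enqueue(encoder.encode(`data: ${JSON.stringify({ text })}\n\n`));
          }
        }
        controller.enqueue(encoder.encode('data: [DONE]\n\n'));
        controller.close();
      } catch (error) {
        controller.error(error);
      }
    },
  });
}

const anthropicStreamToReadable = (s: AsyncIterable<AnthropicEvent>) =>
  toSseStream(s, anthropicText);
const openaiStreamToReadable = (s: AsyncIterable<OpenAIChunk>) =>
  toSseStream(s, openaiText);
```

Because both adapters emit the same `data: {"text": ...}` events, the client does not need to know which provider served the request.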
Preventing Prompt Injection
An AI endpoint that accepts user input and passes it to a model is an injection target. Users can try to override the system prompt or extract system instructions.
// lib/ai/sanitize.ts
// Hard character limits - most legitimate prompts fit in 2000 chars
const MAX_PROMPT_LENGTH = 2000;
// Patterns that indicate prompt injection attempts
const INJECTION_PATTERNS = [
// `*` rather than `?` so stacked qualifiers ("ignore all previous instructions") also match
/ignore (all |previous |above )*instructions/i,
/you are now/i,
/new (system |)prompt:/i,
/\[system\]/i,
/\[assistant\]/i,
/forget (everything|all) (above|previous)/i,
];
export function sanitizePrompt(input: string): { safe: boolean; sanitized: string } {
const trimmed = input.trim().slice(0, MAX_PROMPT_LENGTH);
for (const pattern of INJECTION_PATTERNS) {
if (pattern.test(trimmed)) {
return { safe: false, sanitized: trimmed };
}
}
return { safe: true, sanitized: trimmed };
}
Pattern matching is easy to evade, so treat input sanitization as one layer among several. The rest of the defense-in-depth approach:
- Separate system/user turn - never interpolate user input into the system prompt string
- Output validation - for structured outputs, parse and validate before using
- Audit logging - log all inputs (redacted) + model/provider for anomaly detection
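To make the output-validation point concrete, here is a minimal sketch that parses a model response expected to be JSON and rejects anything that doesn't match the expected shape. The `Summary` type and `parseSummary` function are hypothetical; a schema library would do the same job with less hand-written checking.

```typescript
// Hypothetical structured output we asked the model to produce.
type Summary = { title: string; bullets: string[] };

// Returns the validated object, or null if the model output can't be trusted.
function parseSummary(raw: string): Summary | null {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return null; // Not JSON at all
  }
  if (typeof value !== 'object' || value === null) return null;

  const { title, bullets } = value as Record<string, unknown>;
  if (typeof title !== 'string' || title.length === 0) return null;
  if (!Array.isArray(bullets)) return null;
  if (!bullets.every((b): b is string => typeof b === 'string')) return null;

  // Copy only the validated fields; extra keys from the model are dropped.
  return { title, bullets };
}
```

Callers treat `null` as a failed generation (retry or show an error) instead of passing unvalidated model output further into the application.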
Rate Limiting at the Edge
Edge Runtime can't use Redis directly (no TCP connections). We use Upstash Redis, which provides a REST API compatible with Edge:
// lib/rateLimit.ts
import { Ratelimit } from '@upstash/ratelimit';
import { Redis } from '@upstash/redis';
const ratelimit = new Ratelimit({
redis: Redis.fromEnv(), // UPSTASH_REDIS_REST_URL + UPSTASH_REDIS_REST_TOKEN
limiter: Ratelimit.slidingWindow(10, '1 m'), // 10 requests per minute per IP
analytics: true,
});
export async function checkRateLimit(identifier: string) {
const { success, limit, remaining, reset } = await ratelimit.limit(identifier);
return { success, limit, remaining, reset };
}
// In the route:
const ip = request.headers.get('x-forwarded-for')?.split(',')[0] ?? 'anonymous';
const { success } = await checkRateLimit(`generate:${ip}`);
if (!success) {
return new Response('Rate limit exceeded', { status: 429 });
}
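A bare 429 leaves clients guessing when to retry. As a sketch, the values `checkRateLimit` already returns can be turned into response headers. This assumes `reset` is a Unix timestamp in milliseconds, which is worth confirming against the version of `@upstash/ratelimit` you run; `rateLimitHeaders` and `rateLimitResponse` are illustrative names.

```typescript
interface LimitResult {
  success: boolean;
  limit: number;
  remaining: number;
  reset: number; // Assumed: Unix timestamp (ms) at which the window resets
}

// Build standard-looking rate limit headers from a limiter result.
function rateLimitHeaders(result: LimitResult, nowMs: number = Date.now()): Record<string, string> {
  const retryAfterSeconds = Math.max(0, Math.ceil((result.reset - nowMs) / 1000));
  return {
    'X-RateLimit-Limit': String(result.limit),
    'X-RateLimit-Remaining': String(Math.max(0, result.remaining)),
    'X-RateLimit-Reset': String(result.reset),
    'Retry-After': String(retryAfterSeconds),
  };
}

// 429 response that tells the client how long to wait.
function rateLimitResponse(result: LimitResult, nowMs?: number): Response {
  return new Response('Rate limit exceeded', {
    status: 429,
    headers: rateLimitHeaders(result, nowMs),
  });
}
```

In the route, `return rateLimitResponse(result)` would replace the plain 429, with `result` being the full object returned by `checkRateLimit`.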
Tracking Costs Per User
AI tokens cost money. In a multi-tenant product, you want to attribute costs per user/organization:
// After the stream completes, Anthropic's SDK provides usage stats
const stream = anthropic.messages.stream({ ... });
// Get the final message. This promise resolves only once the stream has finished,
// so await it after the last chunk has been forwarded to the client.
const finalMessage = await stream.getFinalMessage();
const usage = {
inputTokens: finalMessage.usage.input_tokens,
outputTokens: finalMessage.usage.output_tokens,
// Rates: $3 per 1M input tokens, $15 per 1M output tokens (Claude Sonnet pricing at the time of writing).
// The route above calls an Opus model, so substitute the current rates for the model you actually use.
costUsd: (finalMessage.usage.input_tokens * 3 + finalMessage.usage.output_tokens * 15) / 1_000_000,
};
// Log asynchronously - don't block the stream
logUsageAsync({ userId, organizationId, ...usage });
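With failover in play, the rate depends on which model actually served the request, so it helps to look rates up by model instead of hard-coding them in the formula. A minimal sketch: the table entry reuses the $3 / $15 per-million figures from the snippet above under a placeholder model name, and `RATES_USD` and `estimateCostUsd` are illustrative names. Real rates need to come from the providers' current price lists.

```typescript
type Rates = { inputPerMTok: number; outputPerMTok: number };
type Usage = { input_tokens: number; output_tokens: number };

// Placeholder table: fill in one entry per model you call, using current prices.
const RATES_USD: Record<string, Rates> = {
  'example-model': { inputPerMTok: 3, outputPerMTok: 15 },
};

// Returns the estimated cost in USD, or null when the model has no known rate.
function estimateCostUsd(model: string, usage: Usage): number | null {
  const rates = RATES_USD[model];
  if (!rates) return null;
  return (
    (usage.input_tokens * rates.inputPerMTok + usage.output_tokens * rates.outputPerMTok) /
    1_000_000
  );
}
```

For example, 1,000 input tokens and 500 output tokens at these rates cost (1,000 × 3 + 500 × 15) / 1,000,000 = $0.0105. Returning `null` for an unknown model keeps a missing table entry visible instead of silently logging a cost of zero.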
Latency Benchmarks (Vercel Edge, eu-central-1)
| Metric | Node.js Runtime | Edge Runtime |
|---|---|---|
| Cold start (first request after idle) | 480ms | 8ms |
| Time to first token (warm) | 310ms | 95ms |
| Time to first token (cold) | 790ms | 103ms |
| P99 time to first token | 1,200ms | 280ms |
The cold start difference is the key win. In Node.js runtime, a user who hits the endpoint first after a 10-minute idle period waits nearly 800ms before seeing anything. Edge Runtime is consistent at ~100ms regardless of warm/cold state.
The Nuances We Learned
AbortController for cancelled requests. If the user navigates away mid-generation, the stream should stop. Without this, you pay for tokens that go nowhere:
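// Client-side: keep the controller in a ref so the unmount cleanup can reach it
const controllerRef = useRef<AbortController | null>(null);
// Inside generate():
controllerRef.current = new AbortController();
const response = await fetch('/api/generate', {
signal: controllerRef.current.signal,
...
});
// When user navigates away:
useEffect(() => {
return () => controllerRef.current?.abort(); // Cleanup on unmount
}, []);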
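In our setup the abort propagates through the Edge Runtime to the Anthropic SDK's internal fetch: the stream stops, and billing stops at the tokens generated up to that point. It is worth wiring this up explicitly, by cancelling the upstream request when the response stream is cancelled, so you are not relying on it happening incidentally.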
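On the server side, a `ReadableStream` can declare a `cancel()` hook, which runs when the consumer goes away (for example, when the client aborts the fetch). A minimal sketch of using that hook to stop the upstream request explicitly; `streamWithCancel` and its `onCancel` parameter are hypothetical names, and the source is any async iterable so the example stays independent of a particular SDK.

```typescript
// Wrap an async iterable in a byte stream that notifies the caller on cancellation.
function streamWithCancel<T>(
  source: AsyncIterable<T>,
  encode: (chunk: T) => Uint8Array | null,
  onCancel: () => void,
): ReadableStream<Uint8Array> {
  let cancelled = false;

  return new ReadableStream<Uint8Array>({
    async start(controller) {
      try {
        for await (const chunk of source) {
          // Returning from inside for-await also calls the iterator's return().
          if (cancelled) return;
          const bytes = encode(chunk);
          if (bytes) controller.enqueue(bytes);
        }
        if (!cancelled) controller.close();
      } catch (error) {
        if (!cancelled) controller.error(error);
      }
    },
    // Runs when the consumer cancels, e.g. the client disconnected.
    cancel() {
      cancelled = true;
      onCancel();
    },
  });
}
```

In the route, `onCancel` could be `() => upstream.abort()`, where `upstream` is an `AbortController` whose signal is handed to the SDK call through its request options. That relies on the SDK accepting a `signal` option, which is worth checking against the version you have installed.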
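Streaming and error handling. If Anthropic fails mid-stream (for example, a rate limit or overload error after the stream started), you can't signal it with an HTTP status code, since we already sent 200 OK when the stream began. On the wire the failure arrives as a special error event; the SDK surfaces it by throwing from the stream iterator, not by yielding a chunk. Catch it and forward it to the client as a normal data event:
try {
for await (const chunk of stream) {
// ... normal handling
}
} catch (error) {
const message = error instanceof Error ? error.message : 'Stream failed';
controller.enqueue(encoder.encode(`data: ${JSON.stringify({ error: message })}\n\n`));
controller.close();
}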
Don't serialize the full prompt in logs. User prompts may contain sensitive data. Log prompt length, first 50 characters, and a hash - enough for debugging without storing user content.
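A sketch of what such a log record could look like, using the Web Crypto API (`crypto.subtle`), which is available in Edge Runtime. `promptLogRecord` is an illustrative name. Note that even a 50-character preview can contain personal data, so some teams may prefer to drop it and keep only the length and hash.

```typescript
// Build a log-safe summary of a prompt: length, short preview, and SHA-256 hash.
async function promptLogRecord(prompt: string): Promise<{
  length: number;
  preview: string;
  sha256: string;
}> {
  const bytes = new TextEncoder().encode(prompt);
  const digest = await crypto.subtle.digest('SHA-256', bytes);
  // Hex-encode the digest so it can be stored and compared as a string.
  const sha256 = Array.from(new Uint8Array(digest), (b) =>
    b.toString(16).padStart(2, '0'),
  ).join('');

  return {
    length: prompt.length,
    preview: prompt.slice(0, 50),
    sha256,
  };
}
```

The hash lets you spot repeated or replayed prompts and correlate a bug report with a log line without storing the content itself.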
The Edge + Streaming combination is now our default for any AI feature that generates more than a short response. The 100ms time-to-first-token vs 800ms cold-start difference is the kind of improvement users notice immediately.
Aunimeda builds AI-powered solutions - chatbots, AI agents, voice assistants, and automation systems for businesses.