AI & Machine Learning | June 20, 2024 | 8 min read | Updated: June 22, 2026

Serverless AI: Streaming Claude and OpenAI Responses in Next.js 15 via Edge Runtime

Aunimeda

The first version of our AI feature was a standard API route: send request, wait for complete response, return JSON. For a short classification task, fine. For a 500-word AI-generated summary, the user stared at a spinner for 8-12 seconds before anything appeared.

Streaming solves the perceived latency problem: instead of waiting for the full response, tokens stream to the client as they're generated. The first word appears in under 500ms. The experience feels instant even when generation takes 15 seconds total.

This post covers the full implementation: Edge Runtime for minimal cold starts, streaming responses, provider failover, and the gotchas that took us time to figure out.


Why Edge Runtime

Next.js routes can run in two environments:

  • Node.js runtime (default): Full Node.js APIs, 150MB memory limit on Vercel, cold start ~500ms
  • Edge Runtime: V8 isolates, runs close to the user (CDN edge), cold start ~5ms, limited to Web APIs (no fs, no net)

For AI streaming, the difference matters:

  • Node.js runtime: cold start delays the first token by 500ms+ after a period of inactivity
  • Edge Runtime: first token appears in roughly 100ms or less from a warm edge node

The trade-off: Edge Runtime can't use Node.js-specific packages. Anthropic's SDK and OpenAI's SDK both support Edge Runtime via the Web Streams API - this wasn't always the case (pre-2024 versions required Node.js streams).

// app/api/generate/route.ts
export const runtime = 'edge';  // This one line switches to Edge Runtime

Streaming With the Anthropic SDK

// app/api/generate/route.ts
import Anthropic from '@anthropic-ai/sdk';
import { NextRequest } from 'next/server';

export const runtime = 'edge';

const anthropic = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
});

export async function POST(request: NextRequest) {
  const { prompt, systemPrompt } = await request.json();

  if (!prompt || typeof prompt !== 'string') {
    return new Response('Invalid request', { status: 400 });
  }

  // Create a streaming response
  const stream = await anthropic.messages.stream({
    model: 'claude-opus-4-6',
    max_tokens: 1024,
    system: systemPrompt ?? 'You are a helpful assistant.',
    messages: [{ role: 'user', content: prompt }],
  });

  // Convert Anthropic's stream to a ReadableStream of SSE events
  const encoder = new TextEncoder();
  
  const readableStream = new ReadableStream({
    async start(controller) {
      try {
        for await (const chunk of stream) {
          if (
            chunk.type === 'content_block_delta' &&
            chunk.delta.type === 'text_delta'
          ) {
            const text = chunk.delta.text;
            // Server-Sent Events format
            const data = `data: ${JSON.stringify({ text })}\n\n`;
            controller.enqueue(encoder.encode(data));
          }
        }
        // Signal end of stream
        controller.enqueue(encoder.encode('data: [DONE]\n\n'));
        controller.close();
      } catch (error) {
        controller.error(error);
      }
    },
  });

  return new Response(readableStream, {
    headers: {
      'Content-Type': 'text/event-stream',
      'Cache-Control': 'no-cache',
      'Connection': 'keep-alive',
    },
  });
}

The Client-Side Stream Consumer

// components/AiTextGenerator.tsx
'use client';

import { useState } from 'react';

export function AiTextGenerator() {
  const [output, setOutput] = useState('');
  const [isStreaming, setIsStreaming] = useState(false);

  async function generate(prompt: string) {
    setOutput('');
    setIsStreaming(true);

    try {
      const response = await fetch('/api/generate', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ prompt }),
      });

      if (!response.ok) throw new Error(`HTTP ${response.status}`);
      if (!response.body) throw new Error('No response body');

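      const reader = response.body.getReader();
      const decoder = new TextDecoder();
      let buffer = '';       // holds a partial SSE line split across network chunks
      let finished = false;  // set when the server sends [DONE]

      while (!finished) {
        const { done, value } = await reader.read();
        if (done) break;

        // A network chunk can end mid-line, so only process complete lines
        buffer += decoder.decode(value, { stream: true });
        const lines = buffer.split('\n');
        buffer = lines.pop() ?? '';

        for (const line of lines) {
          if (!line.startsWith('data: ')) continue;

          const data = line.slice(6);  // Remove 'data: ' prefix
          if (data === '[DONE]') {
            finished = true;  // a bare break would only exit the for loop
            break;
          }

          try {
            const { text } = JSON.parse(data);
            setOutput(prev => prev + text);
          } catch {
            // Ignore malformed chunks
          }
        }
      }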
    } catch (error) {
      console.error('Streaming error:', error);
      setOutput(prev => prev + '\n[Error: generation failed]');
    } finally {
      setIsStreaming(false);
    }
  }

  return (
    <div>
      <button onClick={() => generate('Explain React Server Components')} disabled={isStreaming}>
        {isStreaming ? 'Generating...' : 'Generate'}
      </button>
      <div className="output">{output}</div>
    </div>
  );
}
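
Network chunks do not have to line up with SSE event boundaries, so a single `data:` line can arrive split across two reads. The sketch below pulls the line handling out into a pure function so that case can be unit-tested without a network. `createSseParser` and its return shape are hypothetical names for illustration, not part of any library, and it only understands the `data: {...}` / `data: [DONE]` framing used by the route above.

```typescript
// Hypothetical helper (not from any library): incremental parser for the
// `data: {...}\n\n` framing produced by the route in this post.
type SseResult = { texts: string[]; done: boolean };

function createSseParser(): (chunk: string) => SseResult {
  let buffer = ''; // carries an incomplete trailing line between calls
  let done = false;

  return (chunk: string): SseResult => {
    const texts: string[] = [];
    if (done) return { texts, done };

    buffer += chunk;
    const lines = buffer.split('\n');
    buffer = lines.pop() ?? ''; // last element is '' or a partial line

    for (const line of lines) {
      if (!line.startsWith('data: ')) continue; // skips blank separator lines
      const data = line.slice('data: '.length);
      if (data === '[DONE]') {
        done = true;
        break;
      }
      try {
        const parsed: unknown = JSON.parse(data);
        if (typeof parsed === 'object' && parsed !== null && 'text' in parsed) {
          const text = (parsed as { text: unknown }).text;
          if (typeof text === 'string') texts.push(text);
        }
      } catch {
        // Malformed JSON on a complete line: skip it
      }
    }
    return { texts, done };
  };
}

// Usage: one event split mid-JSON across two chunks
const parseChunk = createSseParser();
const firstRead = parseChunk('data: {"text":"Hel');
const secondRead = parseChunk('lo"}\n\ndata: [DONE]\n\n');
```

The first call yields nothing because the line is still incomplete; the second call completes the event and sees the end marker.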

Provider Failover: Claude Primary, OpenAI Fallback

Anthropic's API occasionally has elevated latency or rate limit issues. For a production feature, we wanted automatic failover to OpenAI:

// lib/ai/generate.ts
import Anthropic from '@anthropic-ai/sdk';
import OpenAI from 'openai';

const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

interface StreamOptions {
  prompt: string;
  system?: string;
  maxTokens?: number;
}

type Provider = 'anthropic' | 'openai';

async function streamFromAnthropic(options: StreamOptions): Promise<ReadableStream> {
  const stream = await anthropic.messages.stream({
    model: 'claude-opus-4-6',
    max_tokens: options.maxTokens ?? 1024,
    system: options.system ?? 'You are a helpful assistant.',
    messages: [{ role: 'user', content: options.prompt }],
  });

  return anthropicStreamToReadable(stream);
}

async function streamFromOpenAI(options: StreamOptions): Promise<ReadableStream> {
  const stream = await openai.chat.completions.create({
    model: 'gpt-4o',
    max_tokens: options.maxTokens ?? 1024,
    stream: true,
    messages: [
      { role: 'system', content: options.system ?? 'You are a helpful assistant.' },
      { role: 'user', content: options.prompt },
    ],
  });

  return openaiStreamToReadable(stream);
}

export async function generateWithFailover(
  options: StreamOptions
): Promise<{ stream: ReadableStream; provider: Provider }> {
  // Try Anthropic first with 5-second timeout on connection
  let timer: ReturnType<typeof setTimeout> | undefined;
  try {
    const timeoutPromise = new Promise<never>((_, reject) => {
      timer = setTimeout(() => reject(new Error('Anthropic timeout')), 5000);
    });

    const stream = await Promise.race([
      streamFromAnthropic(options),
      timeoutPromise,
    ]);

    return { stream, provider: 'anthropic' };
  } catch (error) {
    console.warn('Anthropic failed, falling back to OpenAI:', error);

    // Fallback to OpenAI
    const stream = await streamFromOpenAI(options);
    return { stream, provider: 'openai' };
  } finally {
    // Don't leave the timer running once a provider has been chosen
    clearTimeout(timer);
  }
}
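
The two helpers `anthropicStreamToReadable` and `openaiStreamToReadable` are referenced above but not shown. What follows is one possible shape for them, not the original implementation. The `AnthropicEvent` and `OpenAIChunk` types are deliberately minimal structural stand-ins that cover only the fields read here; in a real project you would likely use the event types exported by each SDK. `textToSse` is a hypothetical name.

```typescript
// Minimal structural types covering only the fields read below.
// These are assumptions for the sketch; the real SDK types are richer.
type AnthropicEvent = { type: string; delta?: { type: string; text?: string } };
type OpenAIChunk = { choices: Array<{ delta?: { content?: string | null } }> };

// Encode an async sequence of text fragments as SSE, ending with [DONE].
function textToSse(fragments: AsyncIterable<string>): ReadableStream<Uint8Array> {
  const encoder = new TextEncoder();
  const iterator = fragments[Symbol.asyncIterator]();
  return new ReadableStream<Uint8Array>({
    async pull(controller) {
      try {
        const { done, value } = await iterator.next();
        if (done) {
          controller.enqueue(encoder.encode('data: [DONE]\n\n'));
          controller.close();
          return;
        }
        controller.enqueue(encoder.encode(`data: ${JSON.stringify({ text: value })}\n\n`));
      } catch (error) {
        controller.error(error);
      }
    },
  });
}

// Keep only text deltas from an Anthropic-style event stream.
async function* anthropicText(events: AsyncIterable<AnthropicEvent>): AsyncGenerator<string> {
  for await (const event of events) {
    if (event.type === 'content_block_delta' && event.delta?.type === 'text_delta' && event.delta.text) {
      yield event.delta.text;
    }
  }
}

// Keep only non-empty content deltas from an OpenAI-style chunk stream.
async function* openaiText(chunks: AsyncIterable<OpenAIChunk>): AsyncGenerator<string> {
  for await (const chunk of chunks) {
    const content = chunk.choices[0]?.delta?.content;
    if (content) yield content;
  }
}

function anthropicStreamToReadable(events: AsyncIterable<AnthropicEvent>): ReadableStream<Uint8Array> {
  return textToSse(anthropicText(events));
}

function openaiStreamToReadable(chunks: AsyncIterable<OpenAIChunk>): ReadableStream<Uint8Array> {
  return textToSse(openaiText(chunks));
}
```

Normalizing both providers to the same `{ text }` SSE payload means the client does not need to know which provider served the request.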

The 5-second timeout is on connection establishment, not the full generation. This way we catch "Anthropic API is down" quickly without prematurely timing out legitimate slow generations.
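
One caveat worth checking against the SDK version you run: if the SDK's streaming call hands back a stream object before the HTTP connection is actually established, the race above resolves immediately and connection failures only surface once iteration starts. A sketch of one way to cover that case is to wait for the first chunk under the timeout before committing to a provider. `primeStream` and `firstHealthy` are hypothetical helpers, not part of either SDK, and they work on any async iterable.

```typescript
// Hypothetical helpers, not part of either SDK.

// Waits for the first item of `source` (or its failure) within `timeoutMs`,
// then returns an iterable that replays that item followed by the rest.
async function primeStream<T>(source: AsyncIterable<T>, timeoutMs: number): Promise<AsyncIterable<T>> {
  const iterator = source[Symbol.asyncIterator]();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`No first chunk within ${timeoutMs}ms`)), timeoutMs);
  });

  try {
    const firstResult = await Promise.race([iterator.next(), timeout]);
    return (async function* () {
      let result = firstResult;
      while (!result.done) {
        yield result.value;
        result = await iterator.next();
      }
    })();
  } catch (error) {
    // Best-effort cleanup of the abandoned source
    void iterator.return?.().catch(() => {});
    throw error;
  } finally {
    clearTimeout(timer);
  }
}

// Tries each provider in order; moves on if one errors or stalls
// before producing its first chunk.
async function firstHealthy<T>(
  providers: Array<{ name: string; start: () => AsyncIterable<T> }>,
  timeoutMs: number,
): Promise<{ name: string; stream: AsyncIterable<T> }> {
  let lastError: unknown = new Error('No providers configured');
  for (const provider of providers) {
    try {
      return { name: provider.name, stream: await primeStream(provider.start(), timeoutMs) };
    } catch (error) {
      lastError = error;
    }
  }
  throw lastError;
}
```

The trade-off is that the timeout now covers time to first token rather than connection setup alone, so the threshold may need to be more generous than 5 seconds for long prompts.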


Preventing Prompt Injection

An AI endpoint that accepts user input and passes it to a model is an injection target. Users can try to override the system prompt or extract system instructions.

// lib/ai/sanitize.ts

// Hard character limits - most legitimate prompts fit in 2000 chars
const MAX_PROMPT_LENGTH = 2000;

// Patterns that indicate prompt injection attempts
const INJECTION_PATTERNS = [
  /ignore (all |previous |above )?instructions/i,
  /you are now/i,
  /new (system |)prompt:/i,
  /\[system\]/i,
  /\[assistant\]/i,
  /forget (everything|all) (above|previous)/i,
];

export function sanitizePrompt(input: string): { safe: boolean; sanitized: string } {
  const trimmed = input.trim().slice(0, MAX_PROMPT_LENGTH);

  for (const pattern of INJECTION_PATTERNS) {
    if (pattern.test(trimmed)) {
      return { safe: false, sanitized: trimmed };
    }
  }

  return { safe: true, sanitized: trimmed };
}

Input sanitization is only one layer, and a pattern list like this is easy to sidestep by rephrasing. The rest of our defense-in-depth approach:

  1. Separate system/user turn - never interpolate user input into the system prompt string
  2. Output validation - for structured outputs, parse and validate before using
  3. Audit logging - log all inputs (redacted) + model/provider for anomaly detection
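
To make the output validation step concrete, here is a minimal sketch for a feature that asks the model for JSON. The `SummaryResult` shape and the limits (200 characters, 10 bullets) are hypothetical and only for illustration; a schema library would do the same job with less hand-written checking.

```typescript
// Hypothetical shape for a structured model response; adjust to your feature.
interface SummaryResult {
  title: string;
  bullets: string[];
}

// Returns a validated result, or null if the model output can't be trusted.
function parseSummary(raw: string): SummaryResult | null {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return null; // not JSON at all
  }
  if (typeof value !== 'object' || value === null) return null;

  const { title, bullets } = value as Record<string, unknown>;
  if (typeof title !== 'string' || title.length === 0 || title.length > 200) return null;
  if (!Array.isArray(bullets) || bullets.length > 10) return null;
  if (bullets.some((item) => typeof item !== 'string')) return null;

  // Build a fresh object so unexpected extra fields are dropped
  return { title, bullets: bullets as string[] };
}
```

Treating the model's output as untrusted input means a successful injection can at worst produce a rejected response, not an arbitrary payload passed downstream.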

Rate Limiting at the Edge

Edge Runtime can't use Redis directly (no TCP connections). We use Upstash Redis, which provides a REST API compatible with Edge:

// lib/rateLimit.ts
import { Ratelimit } from '@upstash/ratelimit';
import { Redis } from '@upstash/redis';

const ratelimit = new Ratelimit({
  redis: Redis.fromEnv(),  // UPSTASH_REDIS_REST_URL + UPSTASH_REDIS_REST_TOKEN
  limiter: Ratelimit.slidingWindow(10, '1 m'),  // 10 requests per minute per IP
  analytics: true,
});

export async function checkRateLimit(identifier: string) {
  const { success, limit, remaining, reset } = await ratelimit.limit(identifier);
  return { success, limit, remaining, reset };
}

// In the route:
const ip = request.headers.get('x-forwarded-for')?.split(',')[0] ?? 'anonymous';
const { success } = await checkRateLimit(`generate:${ip}`);
if (!success) {
  return new Response('Rate limit exceeded', { status: 429 });
}
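
A bare 429 gives the client nothing to act on. The sketch below builds the rejection with a `Retry-After` header from the fields `checkRateLimit` already returns. It assumes `reset` is a Unix timestamp in milliseconds, which is how we understand the Upstash result; verify that against the library version you use. `rateLimitResponse` and the `X-RateLimit-*` header names are our own choices, not something the library sets for you.

```typescript
// Mirrors the fields returned by checkRateLimit above.
// Assumption: `reset` is a Unix timestamp in milliseconds.
interface LimitResult {
  success: boolean;
  limit: number;
  remaining: number;
  reset: number;
}

// Returns null when the request is allowed, otherwise a 429 response
// that tells the client how long to wait.
function rateLimitResponse(result: LimitResult, nowMs: number = Date.now()): Response | null {
  if (result.success) return null;

  const retryAfterSeconds = Math.max(1, Math.ceil((result.reset - nowMs) / 1000));
  return new Response('Rate limit exceeded', {
    status: 429,
    headers: {
      'Retry-After': String(retryAfterSeconds),
      'X-RateLimit-Limit': String(result.limit),
      'X-RateLimit-Remaining': String(result.remaining),
    },
  });
}
```

In the route this would replace the inline check: call `checkRateLimit`, pass the result to `rateLimitResponse`, and return the response if it is not null.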

Tracking Costs Per User

AI tokens cost money. In a multi-tenant product, you want to attribute costs per user/organization:

// After the stream completes, Anthropic's SDK provides usage stats
const stream = anthropic.messages.stream({ ... });

// Get the final message (available after stream completes)
const finalMessage = await stream.getFinalMessage();

const usage = {
  inputTokens: finalMessage.usage.input_tokens,
  outputTokens: finalMessage.usage.output_tokens,
  // $3 per 1M input tokens, $15 per 1M output tokens (Sonnet-tier rates); the routes above call claude-opus-4-6, so substitute the current rates for the model you actually use
  costUsd: (finalMessage.usage.input_tokens * 3 + finalMessage.usage.output_tokens * 15) / 1_000_000,
};

// Log asynchronously - don't block the stream
logUsageAsync({ userId, organizationId, ...usage });
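
Hard-coding rates inside the usage object gets awkward once more than one model or provider is in play. A small sketch of a per-model lookup follows; `PRICING`, `estimateCostUsd`, and the `example-model` key are hypothetical, and the only rate included is the $3 / $15 per million tokens figure quoted above. Fill in the table from each provider's current price list.

```typescript
interface ModelPricing {
  inputPerMTok: number;  // USD per 1M input tokens
  outputPerMTok: number; // USD per 1M output tokens
}

interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

// Placeholder table: the key is hypothetical and the rate is the one quoted
// in this post. Replace with current prices for the models you call.
const PRICING: Record<string, ModelPricing> = {
  'example-model': { inputPerMTok: 3, outputPerMTok: 15 },
};

function estimateCostUsd(model: string, usage: TokenUsage): number {
  const pricing = PRICING[model];
  // Fail loudly rather than silently logging a cost of zero
  if (!pricing) throw new Error(`No pricing configured for model "${model}"`);
  return (
    (usage.inputTokens * pricing.inputPerMTok + usage.outputTokens * pricing.outputPerMTok) /
    1_000_000
  );
}

// 1,000 input tokens and 500 output tokens at $3 / $15 per 1M:
// (1000 * 3 + 500 * 15) / 1,000,000 = $0.0105
const exampleCost = estimateCostUsd('example-model', { inputTokens: 1000, outputTokens: 500 });
```

Throwing on an unknown model is a deliberate choice here: a missing price should show up as an error in logs, not as free usage in the billing report.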

Latency Benchmarks (Vercel Edge, eu-central-1)

| Metric | Node.js Runtime | Edge Runtime |
|---|---|---|
| Cold start (first request after idle) | 480ms | 8ms |
| Time to first token (warm) | 310ms | 95ms |
| Time to first token (cold) | 790ms | 103ms |
| P99 time to first token | 1,200ms | 280ms |

The cold start difference is the key win. In Node.js runtime, a user who hits the endpoint first after a 10-minute idle period waits nearly 800ms before seeing anything. Edge Runtime is consistent at ~100ms regardless of warm/cold state.
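
If you want to reproduce this kind of measurement on your own deployment, one simple approach is to time how long a streamed response takes to deliver its first chunk. This is a sketch of that idea, not the harness behind the numbers above. `measureFirstChunkMs` is a hypothetical helper, and it measures the first network chunk, which is a close proxy for the first token only when the server flushes each token as it arrives.

```typescript
interface StreamTiming {
  firstChunkMs: number; // -1 if the stream ended without any data
  totalMs: number;
  bytes: number;
}

// `open` starts the request and resolves to the response body stream.
// Timing begins before `open` is called, so connection setup is included.
async function measureFirstChunkMs(
  open: () => Promise<ReadableStream<Uint8Array>>,
): Promise<StreamTiming> {
  const started = performance.now();
  const body = await open();
  const reader = body.getReader();

  let firstChunkMs = -1;
  let bytes = 0;
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    if (firstChunkMs < 0) firstChunkMs = performance.now() - started;
    bytes += value.byteLength;
  }
  return { firstChunkMs, totalMs: performance.now() - started, bytes };
}
```

Against the route in this post, `open` would be a function that POSTs to `/api/generate` with `fetch` and returns `response.body`. Run it once after an idle period and again immediately afterward to compare cold and warm behavior.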


The Nuances We Learned

AbortController for cancelled requests. If the user navigates away mid-generation, the stream should stop. Without this, you pay for tokens that go nowhere:

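// Client-side, inside the component (useEffect and useRef come from 'react')
// A ref keeps the same controller reachable from both generate() and the cleanup
const controllerRef = useRef<AbortController | null>(null);

async function generate(prompt: string) {
  controllerRef.current = new AbortController();

  const response = await fetch('/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt }),
    signal: controllerRef.current.signal,
  });
  // ... read the stream as in the consumer above
}

// When user navigates away:
useEffect(() => {
  return () => controllerRef.current?.abort();  // Cleanup on unmount
}, []);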

The Edge Runtime correctly propagates the abort signal to the Anthropic SDK's internal fetch - the stream stops and billing stops.
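
Whether the abort actually reaches the provider depends on the route wiring it through. A sketch of one way to do that with Web Streams: the `cancel()` callback on a `ReadableStream` runs when the consumer goes away, and it can abort a controller whose signal is handed to the upstream call. `cancellableSseStream` is a hypothetical helper, and the `produce` callback stands in for whatever starts the model stream. Passing the signal into the SDK request is an assumption to confirm against the SDK's documented request options.

```typescript
// Hypothetical helper: wraps a text producer in an SSE stream and aborts the
// producer's signal when the consumer cancels (e.g. the client disconnects).
function cancellableSseStream(
  produce: (signal: AbortSignal) => AsyncIterable<string>,
): { stream: ReadableStream<Uint8Array>; signal: AbortSignal } {
  const upstream = new AbortController();
  const encoder = new TextEncoder();
  const iterator = produce(upstream.signal)[Symbol.asyncIterator]();

  const stream = new ReadableStream<Uint8Array>({
    async pull(controller) {
      try {
        const { done, value } = await iterator.next();
        if (upstream.signal.aborted) return; // consumer already left
        if (done) {
          controller.enqueue(encoder.encode('data: [DONE]\n\n'));
          controller.close();
          return;
        }
        controller.enqueue(encoder.encode(`data: ${JSON.stringify({ text: value })}\n\n`));
      } catch (error) {
        // Errors caused by our own abort are expected; report anything else
        if (!upstream.signal.aborted) controller.error(error);
      }
    },
    cancel() {
      // Runs when the reader cancels the stream
      upstream.abort();
    },
  });

  return { stream, signal: upstream.signal };
}
```

In a route, `produce` would start the provider request with the given signal and yield text deltas, and the returned `stream` would be the `Response` body.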

Streaming and error handling. If Anthropic returns an error mid-stream (rate limit hit after the stream started), the error arrives as a special event, not an HTTP error code (since we already sent 200 OK when the stream began). Handle this:

for await (const chunk of stream) {
  if (chunk.type === 'error') {
    controller.enqueue(encoder.encode(`data: ${JSON.stringify({ error: chunk.error.message })}\n\n`));
    controller.close();
    return;
  }
  // ... normal handling
}

Don't serialize the full prompt in logs. User prompts may contain sensitive data. Log prompt length, first 50 characters, and a hash - enough for debugging without storing user content.
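
As a rough sketch of that log record, using only the Web Crypto API that is available in Edge Runtime: `describePrompt` and `PromptLogRecord` are hypothetical names. Note that the 50-character preview can still contain sensitive text, so drop it if your data policy requires.

```typescript
interface PromptLogRecord {
  length: number;  // prompt length in UTF-16 code units
  preview: string; // first 50 characters
  sha256: string;  // hex digest of the UTF-8 encoded prompt
}

// Builds a log-safe description of a prompt without storing its full content.
async function describePrompt(prompt: string): Promise<PromptLogRecord> {
  const bytes = new TextEncoder().encode(prompt);
  const digest = await crypto.subtle.digest('SHA-256', bytes);
  const sha256 = Array.from(new Uint8Array(digest))
    .map((byte) => byte.toString(16).padStart(2, '0'))
    .join('');

  return {
    length: prompt.length,
    preview: prompt.slice(0, 50),
    sha256,
  };
}
```

The hash lets you spot repeated or replayed prompts across log lines and correlate a user report with a request, without being able to recover the prompt itself.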

The Edge + Streaming combination is now our default for any AI feature that generates more than a short response. The 100ms time-to-first-token vs 800ms cold-start difference is the kind of improvement users notice immediately.


