Serverless AI: Streaming Claude and OpenAI Responses in Next.js 15 via Edge Runtime

Tags: claude api, openai, next.js, edge runtime, streaming, serverless, ai, 2024

The first version of our AI feature was a standard API route: send request, wait for complete response, return JSON. For a short classification task, fine. For a 500-word AI-generated summary, the user stared at a spinner for 8-12 seconds before anything appeared.

Streaming solves the perceived latency problem: instead of waiting for the full response, tokens stream to the client as they're generated. The first word appears in under 500ms. The experience feels instant even when generation takes 15 seconds total.

This post covers the full implementation: Edge Runtime for minimal cold starts, streaming responses, provider failover, and the gotchas that took us time to figure out.

Why Edge Runtime

Next.js routes can run in two environments:

Node.js runtime (default): Full Node.js APIs, 150MB memory limit on Vercel, cold start ~500ms
Edge Runtime: V8 isolates, runs close to the user (CDN edge), cold start ~5ms, limited to Web APIs (no fs, no net)

For AI streaming, the difference matters:

Node.js runtime: cold start delays the first token by 500ms+ after a period of inactivity
Edge Runtime: first token appears in ~100ms, warm or cold (see the benchmarks below)

The trade-off: Edge Runtime can't use Node.js-specific packages. Anthropic's SDK and OpenAI's SDK both support Edge Runtime via the Web Streams API - this wasn't always the case (pre-2024 versions required Node.js streams).

// app/api/generate/route.ts
export const runtime = 'edge';  // This one line switches to Edge Runtime

Streaming With the Anthropic SDK

// app/api/generate/route.ts
import Anthropic from '@anthropic-ai/sdk';
import { NextRequest } from 'next/server';

export const runtime = 'edge';

const anthropic = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
});

export async function POST(request: NextRequest) {
  const { prompt, systemPrompt } = await request.json();

  if (!prompt || typeof prompt !== 'string') {
    return new Response('Invalid request', { status: 400 });
  }

  // Create a streaming response
  const stream = await anthropic.messages.stream({
    model: 'claude-opus-4-6',
    max_tokens: 1024,
    system: systemPrompt ?? 'You are a helpful assistant.',
    messages: [{ role: 'user', content: prompt }],
  });

  // Convert Anthropic's stream to a ReadableStream of SSE events
  const encoder = new TextEncoder();
  
  const readableStream = new ReadableStream({
    async start(controller) {
      try {
        for await (const chunk of stream) {
          if (
            chunk.type === 'content_block_delta' &&
            chunk.delta.type === 'text_delta'
          ) {
            const text = chunk.delta.text;
            // Server-Sent Events format
            const data = `data: ${JSON.stringify({ text })}\n\n`;
            controller.enqueue(encoder.encode(data));
          }
        }
        // Signal end of stream
        controller.enqueue(encoder.encode('data: [DONE]\n\n'));
        controller.close();
      } catch (error) {
        controller.error(error);
      }
    },
  });

  return new Response(readableStream, {
    headers: {
      'Content-Type': 'text/event-stream',
      'Cache-Control': 'no-cache',
      'Connection': 'keep-alive',
    },
  });
}
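
The inline conversion above can be factored into small helpers so the same SSE framing serves more than one provider. The sketch below is one way that might look, not the code from our route: `sseFrame`, `anthropicText`, `openaiText`, and `toSseStream` are hypothetical names, and the event types are cut down to the fields the loop above reads (plus `choices[0].delta.content`, which is where OpenAI's chat completion chunks carry text).

```typescript
// Hypothetical helpers: names and minimal types are illustrative assumptions.

// Only the fields we read from an Anthropic stream event.
interface AnthropicEventLike {
  type: string;
  delta?: { type: string; text?: string };
}

// Only the fields we read from an OpenAI chat completion chunk.
interface OpenAIChunkLike {
  choices: Array<{ delta?: { content?: string | null } }>;
}

// Format one payload as a Server-Sent Events frame.
function sseFrame(payload: unknown): string {
  return `data: ${JSON.stringify(payload)}\n\n`;
}

// Return the text carried by an Anthropic event, or null for non-text events.
function anthropicText(event: AnthropicEventLike): string | null {
  if (event.type === 'content_block_delta' && event.delta?.type === 'text_delta') {
    return event.delta.text ?? null;
  }
  return null;
}

// Return the text carried by an OpenAI chunk, or null when the delta is empty.
function openaiText(chunk: OpenAIChunkLike): string | null {
  return chunk.choices[0]?.delta?.content ?? null;
}

// Wrap any async iterable of provider events in an SSE ReadableStream.
function toSseStream<T>(
  source: AsyncIterable<T>,
  extract: (event: T) => string | null,
): ReadableStream<Uint8Array> {
  const encoder = new TextEncoder();
  return new ReadableStream<Uint8Array>({
    async start(controller) {
      try {
        for await (const event of source) {
          const text = extract(event);
          if (text) controller.enqueue(encoder.encode(sseFrame({ text })));
        }
        // Signal end of stream with the same sentinel the route uses
        controller.enqueue(encoder.encode('data: [DONE]\n\n'));
        controller.close();
      } catch (error) {
        controller.error(error);
      }
    },
  });
}
```

Keeping the extractors pure means they can be unit-tested without opening a stream, and a second provider only needs its own extractor.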

The Client-Side Stream Consumer

// components/AiTextGenerator.tsx
'use client';

import { useState } from 'react';

export function AiTextGenerator() {
  const [output, setOutput] = useState('');
  const [isStreaming, setIsStreaming] = useState(false);

  async function generate(prompt: string) {
    setOutput('');
    setIsStreaming(true);

    try {
      const response = await fetch('/api/generate', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ prompt }),
      });

      if (!response.ok) throw new Error(`HTTP ${response.status}`);
      if (!response.body) throw new Error('No response body');

      const reader = response.body.getReader();
      const decoder = new TextDecoder();

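      let buffer = '';
      let finished = false;

      while (!finished) {
        const { done, value } = await reader.read();
        if (done) break;

        // One SSE event can be split across network chunks, so buffer until a full line arrives
        buffer += decoder.decode(value, { stream: true });
        const lines = buffer.split('\n');
        buffer = lines.pop() ?? '';  // Keep the trailing partial line for the next read

        for (const line of lines) {
          if (!line.startsWith('data: ')) continue;
          
          const data = line.slice(6);  // Remove 'data: ' prefix
          if (data === '[DONE]') {
            finished = true;  // Stop the outer read loop, not just this for loop
            break;
          }

          try {
            const { text } = JSON.parse(data);
            // Skip events without a text field (e.g. error events)
            if (typeof text === 'string') setOutput(prev => prev + text);
          } catch {
            // Ignore malformed chunks
          }
        }
      }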
    } catch (error) {
      console.error('Streaming error:', error);
      setOutput(prev => prev + '\n[Error: generation failed]');
    } finally {
      setIsStreaming(false);
    }
  }

  return (
    <div>
      <button onClick={() => generate('Explain React Server Components')} disabled={isStreaming}>
        {isStreaming ? 'Generating...' : 'Generate'}
      </button>
      <div className="output">{output}</div>
    </div>
  );
}
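
Network chunks do not line up with SSE events: a single `read()` can end halfway through a `data:` line. A small buffering parser handles that case. The following is a minimal sketch; `drainSseBuffer` and its return shape are hypothetical, and it assumes the `data: {...}` and `data: [DONE]` framing produced by the route above.

```typescript
// Hypothetical helper: parses the complete lines in a buffer and returns the remainder.
interface DrainResult {
  texts: string[]; // Text payloads from complete events
  rest: string;    // Trailing partial line to carry into the next read
  done: boolean;   // True once the [DONE] sentinel was seen
}

function drainSseBuffer(buffer: string): DrainResult {
  const lines = buffer.split('\n');
  const rest = lines.pop() ?? ''; // The last element is incomplete unless the buffer ended in '\n'
  const texts: string[] = [];
  let done = false;

  for (const line of lines) {
    if (!line.startsWith('data: ')) continue;
    const data = line.slice(6);
    if (data === '[DONE]') {
      done = true;
      break;
    }
    try {
      const parsed: unknown = JSON.parse(data);
      if (typeof parsed === 'object' && parsed !== null && 'text' in parsed) {
        const text = (parsed as { text: unknown }).text;
        if (typeof text === 'string') texts.push(text);
      }
    } catch {
      // Skip lines that are not valid JSON
    }
  }
  return { texts, rest, done };
}
```

In the read loop this would be used as `buffer += decoder.decode(value, { stream: true })`, then `const result = drainSseBuffer(buffer)`, then `buffer = result.rest`, appending `result.texts` to the output and stopping once `result.done` is true.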

Provider Failover: Claude Primary, OpenAI Fallback

Anthropic's API occasionally has elevated latency or rate limit issues. For a production feature, we wanted automatic failover to OpenAI:

// lib/ai/generate.ts
import Anthropic from '@anthropic-ai/sdk';
import OpenAI from 'openai';

const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

interface StreamOptions {
  prompt: string;
  system?: string;
  maxTokens?: number;
}

type Provider = 'anthropic' | 'openai';

async function streamFromAnthropic(options: StreamOptions): Promise<ReadableStream> {
  const stream = await anthropic.messages.stream({
    model: 'claude-opus-4-6',
    max_tokens: options.maxTokens ?? 1024,
    system: options.system ?? 'You are a helpful assistant.',
    messages: [{ role: 'user', content: options.prompt }],
  });

  return anthropicStreamToReadable(stream);
}

async function streamFromOpenAI(options: StreamOptions): Promise<ReadableStream> {
  const stream = await openai.chat.completions.create({
    model: 'gpt-4o',
    max_tokens: options.maxTokens ?? 1024,
    stream: true,
    messages: [
      { role: 'system', content: options.system ?? 'You are a helpful assistant.' },
      { role: 'user', content: options.prompt },
    ],
  });

  return openaiStreamToReadable(stream);
}

export async function generateWithFailover(
  options: StreamOptions
): Promise<{ stream: ReadableStream; provider: Provider }> {
  // Try Anthropic first with 5-second timeout on connection
  try {
    const timeoutPromise = new Promise<never>((_, reject) =>
      setTimeout(() => reject(new Error('Anthropic timeout')), 5000)
    );
    
    const stream = await Promise.race([
      streamFromAnthropic(options),
      timeoutPromise,
    ]);

    return { stream, provider: 'anthropic' };
  } catch (error) {
    console.warn('Anthropic failed, falling back to OpenAI:', error);

    // Fallback to OpenAI
    const stream = await streamFromOpenAI(options);
    return { stream, provider: 'openai' };
  }
}

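The 5-second timeout is on connection establishment, not the full generation. This way we catch "Anthropic API is down" quickly without prematurely timing out legitimate slow generations. The timeout only guards whatever streamFromAnthropic awaits, so that function must not resolve until the provider has actually responded. If the SDK call hands back a stream object before the HTTP response arrives, a connection failure surfaces later, while the stream is being read, and bypasses the fallback.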
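
A timeout race like the one above leaves its timer running after the winner settles. One way to package the pattern with cleanup is sketched below. `withTimeout` is a hypothetical helper written for illustration, not part of either SDK.

```typescript
// Hypothetical helper: rejects if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number, message: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;

  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(message)), ms);
  });

  // Whichever settles first wins; the timer is cleared either way.
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}
```

With this in place, the try block could read `const stream = await withTimeout(streamFromAnthropic(options), 5000, 'Anthropic timeout');`. Note that the race only stops waiting; it does not cancel the losing request.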

Preventing Prompt Injection

An AI endpoint that accepts user input and passes it to a model is an injection target. Users can try to override the system prompt or extract system instructions.

// lib/ai/sanitize.ts

// Hard character limits - most legitimate prompts fit in 2000 chars
const MAX_PROMPT_LENGTH = 2000;

// Patterns that indicate prompt injection attempts
const INJECTION_PATTERNS = [
  /ignore (all |previous |above )?instructions/i,
  /you are now/i,
  /new (system |)prompt:/i,
  /\[system\]/i,
  /\[assistant\]/i,
  /forget (everything|all) (above|previous)/i,
];

export function sanitizePrompt(input: string): { safe: boolean; sanitized: string } {
  const trimmed = input.trim().slice(0, MAX_PROMPT_LENGTH);

  for (const pattern of INJECTION_PATTERNS) {
    if (pattern.test(trimmed)) {
      return { safe: false, sanitized: trimmed };
    }
  }

  return { safe: true, sanitized: trimmed };
}

Beyond input sanitization, the defense-in-depth approach:

Separate system/user turn - never interpolate user input into the system prompt string
Output validation - for structured outputs, parse and validate before using
Audit logging - log all inputs (redacted) + model/provider for anomaly detection
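
For the output-validation item, here is a minimal sketch of "parse and validate before using". It assumes the model was asked to return JSON shaped like `{ "summary": string, "tags": string[] }`; that shape and the `parseSummaryOutput` name are hypothetical, chosen only for illustration.

```typescript
// Hypothetical output shape for a summarization prompt.
interface SummaryOutput {
  summary: string;
  tags: string[];
}

// Returns the validated object, or null if the model output does not match the shape.
function parseSummaryOutput(raw: string): SummaryOutput | null {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return null; // Not JSON at all
  }
  if (typeof value !== 'object' || value === null) return null;

  const { summary, tags } = value as Record<string, unknown>;
  if (typeof summary !== 'string') return null;
  if (!Array.isArray(tags) || !tags.every((tag): tag is string => typeof tag === 'string')) {
    return null;
  }
  // Copy only the expected fields so unexpected keys never reach downstream code
  return { summary, tags };
}
```

Treating a failed parse as "no result" rather than passing raw model text downstream keeps an injected response from reaching code that trusts the structure.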

Rate Limiting at the Edge

Edge Runtime can't use Redis directly (no TCP connections). We use Upstash Redis, which provides a REST API compatible with Edge:

// lib/rateLimit.ts
import { Ratelimit } from '@upstash/ratelimit';
import { Redis } from '@upstash/redis';

const ratelimit = new Ratelimit({
  redis: Redis.fromEnv(),  // UPSTASH_REDIS_REST_URL + UPSTASH_REDIS_REST_TOKEN
  limiter: Ratelimit.slidingWindow(10, '1 m'),  // 10 requests per minute per IP
  analytics: true,
});

export async function checkRateLimit(identifier: string) {
  const { success, limit, remaining, reset } = await ratelimit.limit(identifier);
  return { success, limit, remaining, reset };
}

// In the route:
const ip = request.headers.get('x-forwarded-for')?.split(',')[0] ?? 'anonymous';
const { success } = await checkRateLimit(`generate:${ip}`);
if (!success) {
  return new Response('Rate limit exceeded', { status: 429 });
}
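
A 429 is more useful to clients when it says how long to wait. The sketch below derives response headers from the values `checkRateLimit` returns, assuming `reset` is a Unix timestamp in milliseconds (which is how Upstash documents it). `clientIp` and `rateLimitHeaders` are hypothetical helper names.

```typescript
// Shape of the values returned by checkRateLimit above.
interface RateLimitResult {
  success: boolean;
  limit: number;
  remaining: number;
  reset: number; // Assumed: Unix timestamp in milliseconds
}

// Take the first address from x-forwarded-for, falling back to a shared bucket.
function clientIp(forwardedFor: string | null): string {
  const first = forwardedFor?.split(',')[0]?.trim();
  return first ? first : 'anonymous';
}

// Build headers for a rate-limited response; nowMs is injected to keep this testable.
function rateLimitHeaders(result: RateLimitResult, nowMs: number): Record<string, string> {
  const retryAfterSeconds = Math.max(0, Math.ceil((result.reset - nowMs) / 1000));
  return {
    'X-RateLimit-Limit': String(result.limit),
    'X-RateLimit-Remaining': String(result.remaining),
    'Retry-After': String(retryAfterSeconds),
  };
}
```

In the route, the 429 branch could then return `new Response('Rate limit exceeded', { status: 429, headers: rateLimitHeaders(result, Date.now()) })`.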

Tracking Costs Per User

AI tokens cost money. In a multi-tenant product, you want to attribute costs per user/organization:

// After the stream completes, Anthropic's SDK provides usage stats
const stream = anthropic.messages.stream({ ... });

// Get the final message (available after stream completes)
const finalMessage = await stream.getFinalMessage();

const usage = {
  inputTokens: finalMessage.usage.input_tokens,
  outputTokens: finalMessage.usage.output_tokens,
  // Rates assumed here: $3 per 1M input tokens, $15 per 1M output tokens (Sonnet-tier pricing); substitute the current rates for the model you actually call
  costUsd: (finalMessage.usage.input_tokens * 3 + finalMessage.usage.output_tokens * 15) / 1_000_000,
};

// Log asynchronously - don't block the stream
logUsageAsync({ userId, organizationId, ...usage });
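
Hard-coding rates inline gets brittle once failover means two providers with different prices. A small lookup table keeps the arithmetic in one place. This is a sketch under assumptions: the `PRICING` entries are placeholder values, not current list prices, and `estimateCostUsd` is a hypothetical name.

```typescript
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

interface Pricing {
  inputPerMillionUsd: number;
  outputPerMillionUsd: number;
}

// Placeholder rates for illustration only; fill in from the providers' pricing pages.
const PRICING: Record<string, Pricing> = {
  'example-model-a': { inputPerMillionUsd: 3, outputPerMillionUsd: 15 },
  'example-model-b': { inputPerMillionUsd: 5, outputPerMillionUsd: 20 },
};

// Returns the estimated cost in USD, or null when the model has no configured rate.
function estimateCostUsd(model: string, usage: TokenUsage): number | null {
  const pricing = PRICING[model];
  if (!pricing) return null;
  return (
    (usage.inputTokens * pricing.inputPerMillionUsd +
      usage.outputTokens * pricing.outputPerMillionUsd) /
    1_000_000
  );
}
```

Returning null for an unknown model makes a missing rate visible in the logs instead of silently recording a cost of zero.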

Latency Benchmarks (Vercel Edge, eu-central-1)

Metric	Node.js Runtime	Edge Runtime
Cold start (first request after idle)	480ms	8ms
Time to first token (warm)	310ms	95ms
Time to first token (cold)	790ms	103ms
P99 time to first token	1,200ms	280ms

The cold start difference is the key win. In Node.js runtime, a user who hits the endpoint first after a 10-minute idle period waits nearly 800ms before seeing anything. Edge Runtime is consistent at ~100ms regardless of warm/cold state.

The Nuances We Learned

AbortController for cancelled requests. If the user navigates away mid-generation, the stream should stop. Without this, you pay for tokens that go nowhere:

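// Client-side (inside the component)
// Keep the controller in a ref so the cleanup aborts the request that is actually in flight
const controllerRef = useRef<AbortController | null>(null);

async function generate(prompt: string) {
  controllerRef.current = new AbortController();

  const response = await fetch('/api/generate', {
    signal: controllerRef.current.signal,
    ...
  });
}

// When user navigates away:
useEffect(() => {
  return () => controllerRef.current?.abort();  // Cleanup on unmount
}, []);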

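When the client aborts, the Edge Runtime cancels the response stream. As long as the route forwards that cancellation to the Anthropic SDK's request (the SDK accepts an abort signal), the upstream stream stops and you stop paying for further tokens.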
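
On the server, the abort only saves tokens if something cancels the upstream request when the client disconnects. A `ReadableStream` exposes a `cancel()` hook for this. The sketch below shows one way to wire it, using a hypothetical `sseStreamWithCancel` helper; `onCancel` stands in for whatever aborts the provider call, for example an `AbortController` whose signal was passed to the SDK request.

```typescript
// Hypothetical helper: an SSE stream that notifies the caller when the consumer goes away.
function sseStreamWithCancel(
  source: AsyncIterable<string>,
  onCancel: () => void,
): ReadableStream<Uint8Array> {
  const encoder = new TextEncoder();
  let cancelled = false;

  return new ReadableStream<Uint8Array>({
    async start(controller) {
      try {
        for await (const text of source) {
          if (cancelled) return; // Stop pulling from the provider once the client is gone
          controller.enqueue(encoder.encode(`data: ${JSON.stringify({ text })}\n\n`));
        }
        if (!cancelled) {
          controller.enqueue(encoder.encode('data: [DONE]\n\n'));
          controller.close();
        }
      } catch (error) {
        if (!cancelled) controller.error(error);
      }
    },
    // Runs when the consumer cancels, e.g. after the client aborts its fetch
    cancel() {
      cancelled = true;
      onCancel();
    },
  });
}
```

In a route this might look like creating `const upstream = new AbortController()`, passing `upstream.signal` to the provider request, and passing `() => upstream.abort()` as `onCancel`.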

Streaming and error handling. If Anthropic returns an error mid-stream (rate limit hit after the stream started), the error arrives as a special event, not an HTTP error code (since we already sent 200 OK when the stream began). Handle this:

for await (const chunk of stream) {
  if (chunk.type === 'error') {
    controller.enqueue(encoder.encode(`data: ${JSON.stringify({ error: chunk.error.message })}\n\n`));
    controller.close();
    return;
  }
  // ... normal handling
}

Don't serialize the full prompt in logs. User prompts may contain sensitive data. Log prompt length, first 50 characters, and a hash - enough for debugging without storing user content.
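
A redacted log entry along those lines can be sketched as follows. The helper names are hypothetical, and the hash is a named substitution: this uses non-cryptographic 32-bit FNV-1a, which is enough to spot repeated prompts in logs but offers no protection against someone guessing short inputs. Where that matters, a cryptographic digest such as SHA-256 through Web Crypto's `crypto.subtle.digest` would be the better choice.

```typescript
// 32-bit FNV-1a over the UTF-8 bytes of the input, as 8 hex characters.
// Non-cryptographic: use only for correlating identical prompts in logs.
function fnv1aHex(input: string): string {
  const bytes = new TextEncoder().encode(input);
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < bytes.length; i++) {
    hash ^= bytes[i];
    hash = Math.imul(hash, 0x01000193) >>> 0; // Multiply by the FNV prime, keep unsigned 32 bits
  }
  return ('00000000' + hash.toString(16)).slice(-8);
}

interface PromptLogEntry {
  length: number;  // Full prompt length in UTF-16 code units
  preview: string; // First 50 characters only
  hash: string;    // Fingerprint for spotting repeated prompts
}

// Hypothetical helper: build the loggable summary of a prompt.
function describePromptForLog(prompt: string): PromptLogEntry {
  return {
    length: prompt.length,
    preview: prompt.slice(0, 50),
    hash: fnv1aHex(prompt),
  };
}
```

Even a 50-character preview can contain personal data, so some teams may prefer to drop the preview and keep only the length and hash.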

The Edge + Streaming combination is now our default for any AI feature that generates more than a short response. The 100ms time-to-first-token vs 800ms cold-start difference is the kind of improvement users notice immediately.

Aunimeda builds AI-powered solutions - chatbots, AI agents, voice assistants, and automation systems for businesses.
