
Real-Time AI Responses with FastAPI SSE: How I Cut Perceived Latency by 90%

A production guide to Server-Sent Events with FastAPI, OpenAI streaming, and asyncio backpressure control. From first token in 200ms to a fully streamed response with zero dropped connections.

8 min read
by Andrii Peretiatko
FastAPI · SSE · OpenAI · Python · asyncio

The Challenge

The first version of an AI assistant platform I built for a client returned the full answer after the model finished generating — sometimes 8-12 seconds for complex queries. The UX was a blank screen for 10 seconds, then a wall of text. Users thought the app was broken.

The numbers that forced the change:

  • Mean time to first byte (TTFB): 9,400ms
  • User-perceived "stuck" rate: 34% of sessions showed rage-clicks
  • Session abandonment before response: 21%

The goal: First token visible in under 300ms, full response streamed progressively.


The Investigation

Before implementing SSE, I mapped every millisecond of the existing request lifecycle:

Latency Breakdown (non-streaming endpoint)

| Stage | Time | % of Total |
| --- | --- | --- |
| Agent selection (vector search) | 4ms | 0.04% |
| Context assembly | 12ms | 0.1% |
| OpenAI API: waiting for full response | 9,100ms | 96.8% |
| JSON serialization | 180ms | 1.9% |
| HTTP response | 120ms | 1.2% |

96.8% of latency was waiting for OpenAI to finish generating the entire response before we sent a single byte.

The model was already streaming internally — we were just discarding that and waiting. Classic mistake.


The Solution: Async SSE Pipeline

1. FastAPI SSE Endpoint with EventSourceResponse

FastAPI doesn't ship SSE support out of the box, so I used sse-starlette with a custom async generator:

Before: blocking endpoint
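
The blocking shape looked roughly like this. The sketch below stubs the model client with a fake so it is self-contained; the real OpenAI call it stands in for is noted in the docstring:

```python
import asyncio

class FakeModel:
    """Stand-in for the OpenAI client so the sketch runs anywhere."""
    async def stream(self, prompt: str):
        for token in ["Hello", " ", "world"]:
            await asyncio.sleep(0)   # simulate network-paced token arrival
            yield token

async def fetch_full_completion(prompt: str, client) -> str:
    """Blocking pattern: the model streams tokens internally, but we
    buffer every one of them and return a single payload at the end.
    With the real client this is roughly:
        resp = await client.chat.completions.create(..., stream=False)
        return resp.choices[0].message.content
    Nothing reaches the user until generation finishes."""
    chunks = []
    async for token in client.stream(prompt):
        chunks.append(token)          # token arrived... and sits in memory
    return "".join(chunks)            # first byte leaves only here
```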

After: streaming SSE endpoint
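
And the streaming shape, as a sketch: the core is a plain async generator that yields one SSE event per token as it arrives. The FastAPI and sse-starlette wiring (which needs those packages installed, and whose route path and model name are illustrative) is shown in comments; the `[DONE]` sentinel is a common convention, not part of the SSE spec:

```python
import asyncio

async def sse_events(token_stream):
    """Yield one SSE `data:` event per token, then a sentinel.
    The client renders the first token while the model is still generating."""
    async for token in token_stream:
        yield {"data": token}        # sse-starlette serializes this as `data: ...`
    yield {"data": "[DONE]"}         # tells the client to close the EventSource

# FastAPI wiring, roughly (requires `fastapi` and `sse-starlette`):
#
#   from sse_starlette.sse import EventSourceResponse
#
#   @app.get("/chat/stream")
#   async def stream(prompt: str):
#       stream = await openai_client.chat.completions.create(
#           model="gpt-4o", messages=[{"role": "user", "content": prompt}],
#           stream=True)
#       tokens = (c.choices[0].delta.content or "" async for c in stream)
#       return EventSourceResponse(sse_events(tokens))
```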

Impact:

  • TTFB: 9,400ms → 180ms (98% reduction)
  • Users see first token while model is still generating

2. Nginx SSE Proxy Configuration

SSE breaks silently if Nginx buffers the response. The X-Accel-Buffering: no header works, but I also added explicit Nginx config to be safe:

Without proxy_buffering off, Nginx buffers the entire SSE stream and delivers it as a single response — defeating the whole point.
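
A minimal location block along these lines (the path and upstream address are illustrative):

```nginx
location /chat/stream {
    proxy_pass http://127.0.0.1:8000;
    proxy_http_version 1.1;
    proxy_set_header Connection "";   # keep the upstream connection open
    proxy_buffering off;              # flush each SSE chunk immediately
    proxy_cache off;                  # never serve a stream from cache
    proxy_read_timeout 300s;          # long-lived streams outlive the 60s default
}
```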

Impact:

  • Eliminated three instances of the "stream delivered as one chunk" bug in staging
  • Confirmed with curl -N (no-buffer mode)

3. asyncio Backpressure with Queue maxsize

Without backpressure, a fast producer (OpenAI) can flood a slow consumer (client with bad connection). The asyncio.Queue(maxsize=512) cap ensures the producer pauses when the consumer falls behind:

A maxsize of 512 tokens works out to roughly a 2,000-character buffer: enough to absorb network jitter without a memory blowout.
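
The pattern can be sketched as a producer/consumer pair around a bounded queue (function names here are illustrative). The key property is that `await queue.put` suspends the producer, and therefore the upstream read from the model, whenever the consumer falls behind:

```python
import asyncio

async def producer(stream, queue: asyncio.Queue):
    """Read tokens from the model and enqueue them.
    `put` blocks once maxsize is reached, pausing the upstream read."""
    async for token in stream:
        await queue.put(token)
    await queue.put(None)            # end-of-stream sentinel

async def consumer(queue: asyncio.Queue, send):
    """Drain the queue at whatever pace the client can accept."""
    while (token := await queue.get()) is not None:
        await send(token)            # e.g. write the SSE event to the socket

async def pipe(stream, send, maxsize: int = 512):
    queue = asyncio.Queue(maxsize=maxsize)   # hard cap on buffered tokens
    await asyncio.gather(producer(stream, queue), consumer(queue, send))
```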

Impact:

  • Memory per connection: capped at ~8KB regardless of model output length
  • Zero OOM events in 3 months of production traffic

4. Client-Side SSE with React 19

SSE connections drop. Mobile clients switch networks. Here's production-ready handling:
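
The original React component isn't reproduced here; the following is a framework-agnostic sketch of the reconnect logic (the endpoint path, callback names, and `[DONE]` sentinel are illustrative). In a React 19 component you would call `connectSSE` inside `useEffect` and return the cleanup function it hands back:

```javascript
function backoffDelay(attempt, baseMs = 1000, maxMs = 30000) {
  // Exponential backoff capped at maxMs: 1s, 2s, 4s, ... up to 30s.
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

function connectSSE(url, { onToken, onDone }) {
  let attempt = 0;
  let source;

  const open = () => {
    source = new EventSource(url);
    source.onopen = () => { attempt = 0; };       // reset backoff on success
    source.onmessage = (e) => {
      if (e.data === "[DONE]") { source.close(); onDone?.(); return; }
      onToken(e.data);
    };
    source.onerror = () => {
      source.close();                              // browsers auto-retry, but we
      setTimeout(open, backoffDelay(attempt++));   // control the backoff ourselves
    };
  };

  open();
  return () => source.close();                     // cleanup for useEffect
}
```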


5. Rate Limiting per User (Redis Sliding Window)

Streaming is expensive. I added per-user rate limiting with a Redis sliding window before the stream starts:
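
The sliding-window logic is simple enough to sketch in pure Python. In production the same four steps run as Redis sorted-set commands batched in one pipeline, as noted in the docstring; the function name and parameters here are illustrative:

```python
def allow_request(history: list, now: float, limit: int, window_s: float) -> bool:
    """Sliding-window rate check against a list of request timestamps.
    The Redis equivalent, issued as one pipeline (one round trip):
        ZREMRANGEBYSCORE key 0 (now - window)   # drop expired entries
        ZADD key {now: now}                     # record this request
        ZCARD key                               # count requests in window
        EXPIRE key window                       # GC idle keys
    """
    cutoff = now - window_s
    history[:] = [t for t in history if t > cutoff]   # evict old timestamps
    if len(history) >= limit:
        return False                                  # over budget: reject before streaming
    history.append(now)
    return True
```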

Impact:

  • P99 rate limiter overhead: 0.8ms (pipeline = 4 commands in 1 round trip)

Final Results

Before vs After Comparison

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| Time to First Byte (TTFB) | 9,400ms | 180ms | 98% faster |
| Time to First Token visible | 9,400ms | 210ms | 97.8% faster |
| User rage-click rate | 34% | 4% | 88% reduction |
| Session abandonment | 21% | 3% | 86% reduction |
| Memory per SSE connection | Unbounded | ~8KB | Stable |
| Nginx dropped connections | 12/day | 0/day | Eliminated |

Cost Savings

Streaming + rate limiting prevented over-generation:

  • Before: Average 2,100 tokens per response (model over-generated waiting for cutoff)
  • After: Average 890 tokens (users stop reading mid-stream, connection closes)
  • Cost reduction: ~58% per request

Key Takeaways

1. TTFB Is the Metric That Matters for AI UX

Total response time is irrelevant if the first token appears in 200ms. Users tolerate slow completion; they don't tolerate a blank screen.

2. Nginx Buffering Will Silently Break Your SSE

Always add proxy_buffering off explicitly. The X-Accel-Buffering: no header helps but isn't sufficient in all proxy configurations.

3. asyncio.Queue maxsize Is Mandatory for Backpressure

Without a cap, a slow client accumulates unbounded tokens in memory. 512 is a sensible default for conversational AI.

4. POST + Token Pattern for Authenticated SSE

EventSource is GET-only. Use a POST to create a session, return a short-lived token, then open the SSE GET with that token. Never put JWTs in SSE URLs — they end up in server logs.
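
A minimal sketch of the token handoff, assuming single-use tokens with a short TTL (function names and the 30-second default are illustrative):

```python
import secrets
import time

_stream_tokens: dict = {}   # token -> (user_id, expiry); use Redis when multi-process

def issue_stream_token(user_id: str, ttl_s: float = 30.0) -> str:
    """Called from the authenticated POST endpoint: mint a short-lived,
    single-use token the client passes to the SSE GET instead of its JWT."""
    token = secrets.token_urlsafe(24)
    _stream_tokens[token] = (user_id, time.monotonic() + ttl_s)
    return token

def redeem_stream_token(token: str):
    """Called from the SSE GET endpoint: valid exactly once, then gone."""
    entry = _stream_tokens.pop(token, None)   # pop makes it single-use
    if entry is None:
        return None
    user_id, expires_at = entry
    return user_id if time.monotonic() < expires_at else None
```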

5. Sliding Window Rate Limiting Costs <1ms with Redis Pipeline

4 Redis commands in a single pipeline = one round trip. At sub-millisecond overhead, there's no excuse to skip rate limiting on expensive AI endpoints.


Tools & Technologies Used

  • API Framework: FastAPI 0.189 + sse-starlette 2.1
  • Async: Python 3.12 asyncio, uvloop
  • OpenAI Client: openai 1.40 with stream=True
  • Cache / Rate Limit: Redis 7 sorted sets
  • Proxy: Nginx 1.26 with proxy_buffering off
  • Client: React 19, native EventSource API
  • Monitoring: OpenTelemetry traces per SSE connection

What's Next?

  1. WebSocket fallback: for clients that don't support SSE (some corporate proxies)
  2. Token budget enforcement: hard-stop streams at N tokens to cap cost per request
  3. Speculative decoding: pre-generate likely continuations to reduce inter-token latency
  4. Multi-agent streaming: fan-out to 3 agents in parallel, stream the first to respond

Conclusion

Switching from blocking OpenAI responses to SSE streaming reduced perceived latency from 9.4 seconds to 210ms — a 97.8% improvement — without changing a single line of model configuration.

The technical work was straightforward: async generator + asyncio.Queue + correct Nginx config. The production system is 80 lines of Python and 10 lines of Nginx config.

If your AI endpoints feel slow, check TTFB first. The model isn't slow — you're probably just waiting for it to finish before sending byte one.