# Real-Time AI Responses with FastAPI SSE: How I Cut Perceived Latency by 90%
A production guide to Server-Sent Events with FastAPI, OpenAI streaming, and asyncio backpressure control. From first token in 200ms to a fully streamed response with zero dropped connections.
## The Challenge
The first version of an AI assistant platform I built for a client returned the full answer after the model finished generating — sometimes 8-12 seconds for complex queries. The UX was a blank screen for 10 seconds, then a wall of text. Users thought the app was broken.
The numbers that forced the change:
- Mean time to first byte (TTFB): 9,400ms
- User-perceived "stuck" rate: 34% of sessions showed rage-clicks
- Session abandonment before response: 21%
The goal: First token visible in under 300ms, full response streamed progressively.
## The Investigation
Before implementing SSE, I mapped every millisecond of the existing request lifecycle:
### Latency Breakdown (non-streaming endpoint)
| Stage | Time | % of Total |
|---|---|---|
| Agent selection (vector search) | 4ms | 0.04% |
| Context assembly | 12ms | 0.1% |
| OpenAI API: waiting for full response | 9,100ms | 96.8% |
| JSON serialization | 180ms | 1.9% |
| HTTP response | 120ms | 1.2% |
96.8% of latency was waiting for OpenAI to finish generating the entire response before we sent a single byte.
The model was already streaming internally — we were just discarding that and waiting. Classic mistake.
## The Solution: Async SSE Pipeline

### 1. FastAPI SSE Endpoint with EventSourceResponse

FastAPI doesn't ship SSE support out of the box. I used `sse-starlette` with a custom async generator:
**Before: blocking endpoint**
**After: streaming SSE endpoint**
Impact:
- TTFB: 9,400ms → 180ms (98% reduction)
- Users see first token while model is still generating
### 2. Nginx SSE Proxy Configuration

SSE breaks silently if Nginx buffers the response. The `X-Accel-Buffering: no` header works, but I also added explicit Nginx config to be safe:
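A minimal location block for this (upstream name and path are illustrative):

```nginx
location /chat/stream {
    proxy_pass http://fastapi_upstream;
    proxy_http_version 1.1;

    proxy_buffering off;             # deliver SSE chunks as they arrive
    proxy_cache off;                 # never cache event streams
    proxy_set_header Connection '';  # keep the upstream connection open

    proxy_read_timeout 300s;         # allow long-lived streams
}
```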
Without `proxy_buffering off`, Nginx buffers the entire SSE stream and delivers it as a single response — defeating the whole point.
Impact:
- Eliminated 3 instances of "stream delivered as one chunk" bug in staging
- Confirmed with `curl -N` (no-buffer mode)
### 3. asyncio Backpressure with Queue maxsize

Without backpressure, a fast producer (OpenAI) can flood a slow consumer (a client on a bad connection). An `asyncio.Queue(maxsize=512)` cap ensures the producer pauses when the consumer falls behind:
A `maxsize` of 512 tokens is roughly a 2,000-character buffer: enough to absorb network jitter without a memory blowout.
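The backpressure mechanics can be sketched in isolation (token values are synthetic; in production the producer loop wraps the OpenAI stream):

```python
import asyncio

QUEUE_MAX = 512  # tokens buffered before the producer blocks

async def produce(queue: asyncio.Queue, tokens) -> None:
    """Push tokens; queue.put() suspends when the queue is full,
    pausing the producer until the consumer catches up (backpressure)."""
    for token in tokens:
        await queue.put(token)
    await queue.put(None)  # sentinel: stream finished

async def consume(queue: asyncio.Queue) -> list:
    """Drain tokens as fast as the client can accept them."""
    out = []
    while (token := await queue.get()) is not None:
        out.append(token)
    return out

async def main() -> list:
    queue: asyncio.Queue = asyncio.Queue(maxsize=QUEUE_MAX)
    tokens = [f"tok{i}" for i in range(2000)]  # far more than the cap
    _, received = await asyncio.gather(produce(queue, tokens), consume(queue))
    return received

received = asyncio.run(main())
print(len(received))  # 2000 -- every token delivered despite the 512 cap
```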
Impact:
- Memory per connection: capped at ~8KB regardless of model output length
- Zero OOM events in 3 months of production traffic
### 4. Client-Side SSE with React 19

SSE connections drop. Mobile clients switch networks. Here's production-ready handling:
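A sketch of the client-side handling (endpoint and event names are illustrative). The browser's `EventSource` auto-reconnects in simple cases, but not after a fatal error or an explicit `close()`, so we add capped exponential backoff ourselves:

```javascript
const BASE_DELAY_MS = 500;
const MAX_DELAY_MS = 15000;

// Pure helper: capped exponential backoff (0-based attempt counter).
function backoffDelay(attempt) {
  return Math.min(BASE_DELAY_MS * 2 ** attempt, MAX_DELAY_MS);
}

function streamChat(url, { onToken, onDone }) {
  let attempt = 0;
  let source;

  const connect = () => {
    source = new EventSource(url); // browser-native SSE client

    source.addEventListener("token", (e) => {
      attempt = 0; // healthy stream: reset the backoff counter
      onToken(e.data);
    });

    source.addEventListener("done", () => {
      source.close(); // server finished: stop EventSource from reconnecting
      onDone();
    });

    source.onerror = () => {
      // Network drop / proxy reset: close and retry with backoff
      source.close();
      setTimeout(connect, backoffDelay(attempt));
      attempt += 1;
    };
  };

  connect();
  return () => source.close(); // cancel function for unmount cleanup
}
```

In a React 19 component, call `streamChat` inside a `useEffect` and return the cancel function as the effect cleanup, so switching conversations or unmounting closes the stream.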
### 5. Rate Limiting per User (Redis Sliding Window)

Streaming is expensive. I added per-user rate limiting with a Redis sliding window before the stream starts:
Impact:
- P99 rate limiter overhead: 0.8ms (pipeline = 4 commands in 1 round trip)
## Final Results

### Before vs After Comparison
| Metric | Before | After | Improvement |
|---|---|---|---|
| Time to First Byte (TTFB) | 9,400ms | 180ms | 98% faster |
| Time to First Token visible | 9,400ms | 210ms | 97.8% faster |
| User rage-click rate | 34% | 4% | 88% reduction |
| Session abandonment | 21% | 3% | 86% reduction |
| Memory per SSE connection | Unbounded | ~8KB | Stable |
| Nginx dropped connections | 12/day | 0/day | Eliminated |
### Cost Savings
Streaming + rate limiting prevented over-generation:
- Before: Average 2,100 tokens per response (model over-generated waiting for cutoff)
- After: Average 890 tokens (users stop reading mid-stream, connection closes)
- Cost reduction: ~58% per request
## Key Takeaways

### 1. TTFB Is the Metric That Matters for AI UX
Total response time is irrelevant if the first token appears in 200ms. Users tolerate slow completion; they don't tolerate a blank screen.
### 2. Nginx Buffering Will Silently Break Your SSE

Always add `proxy_buffering off` explicitly. The `X-Accel-Buffering: no` header helps but isn't sufficient in all proxy configurations.
### 3. asyncio.Queue maxsize Is Mandatory for Backpressure

Without a cap, a slow client accumulates unbounded tokens in memory. A `maxsize` of 512 is a sensible default for conversational AI.
### 4. POST + Token Pattern for Authenticated SSE

`EventSource` is GET-only. Use POST to create a session, return a short-lived token, then open the SSE GET with that token. Never put JWTs in SSE URLs — they end up in server logs.
### 5. Sliding Window Rate Limiting Costs <1ms with a Redis Pipeline

Four Redis commands in a single pipeline means one round trip. At sub-millisecond overhead, there's no excuse to skip rate limiting on expensive AI endpoints.
## Tools & Technologies Used

- API Framework: FastAPI 0.189 + sse-starlette 2.1
- Async: Python 3.12 asyncio, uvloop
- OpenAI Client: openai 1.40 with `stream=True`
- Cache / Rate Limit: Redis 7 sorted sets
- Proxy: Nginx 1.26 with `proxy_buffering off`
- Client: React 19, native EventSource API
- Monitoring: OpenTelemetry traces per SSE connection
## What's Next?
- WebSocket fallback: for clients that don't support SSE (some corporate proxies)
- Token budget enforcement: hard-stop streams at N tokens to cap cost per request
- Speculative decoding: pre-generate likely continuations to reduce inter-token latency
- Multi-agent streaming: fan-out to 3 agents in parallel, stream the first to respond
## Conclusion
Switching from blocking OpenAI responses to SSE streaming reduced perceived latency from 9.4 seconds to 210ms — a 97.8% improvement — without changing a single line of model configuration.
The technical work was straightforward: async generator + asyncio.Queue + correct Nginx config. The production system is 80 lines of Python and 10 lines of Nginx config.
If your AI endpoints feel slow, check TTFB first. The model isn't slow — you're probably just waiting for it to finish before sending byte one.