
Real-Time AI Responses with FastAPI SSE: How I Cut Perceived Latency by 90%

A production guide to Server-Sent Events with FastAPI, OpenAI streaming, and asyncio backpressure control. From first token in 200ms to a fully streamed response with zero dropped connections.

8 min read
by Andrii Peretiatko
FastAPI · SSE · OpenAI · Python · asyncio

The Challenge

The first version of an AI assistant platform I built for a client returned the full answer after the model finished generating — sometimes 8-12 seconds for complex queries. The UX was a blank screen for 10 seconds, then a wall of text. Users thought the app was broken.

The numbers that forced the change:

  • Mean time to first byte (TTFB): 9,400ms
  • User-perceived "stuck" rate: 34% of sessions showed rage-clicks
  • Session abandonment before response: 21%

The goal: First token visible in under 300ms, full response streamed progressively.


The Investigation

Before implementing SSE, I mapped every millisecond of the existing request lifecycle:

Latency Breakdown (non-streaming endpoint)

| Stage | Time | % of Total |
| --- | --- | --- |
| Agent selection (vector search) | 4ms | 0.04% |
| Context assembly | 12ms | 0.1% |
| OpenAI API: waiting for full response | 9,100ms | 96.8% |
| JSON serialization | 180ms | 1.9% |
| HTTP response | 120ms | 1.2% |

96.8% of latency was waiting for OpenAI to finish generating the entire response before we sent a single byte.

The model was already streaming internally — we were just discarding that and waiting. Classic mistake.


The Solution: Async SSE Pipeline

1. FastAPI SSE Endpoint with EventSourceResponse

FastAPI doesn't ship SSE support out of the box, so I used sse-starlette with a custom async generator:

Before: blocking endpoint
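
The blocking shape looked roughly like this. The sketch below stubs the model client with a fake so it is self-contained; the real OpenAI call it stands in for is noted in the docstring:

```python
import asyncio

class FakeModel:
    """Stand-in for the OpenAI client so the sketch runs anywhere."""
    async def stream(self, prompt: str):
        for token in ["Hello", " ", "world"]:
            await asyncio.sleep(0)   # simulate network-paced token arrival
            yield token

async def fetch_full_completion(prompt: str, client) -> str:
    """Blocking pattern: the model streams tokens internally, but we
    buffer every one of them and return a single payload at the end.
    With the real client this is roughly:
        resp = await client.chat.completions.create(..., stream=False)
        return resp.choices[0].message.content
    Nothing reaches the user until generation finishes."""
    chunks = []
    async for token in client.stream(prompt):
        chunks.append(token)          # token arrived... and sits in memory
    return "".join(chunks)            # first byte leaves only here
```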

After: streaming SSE endpoint
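
And the streaming shape, as a sketch: the core is a plain async generator that yields one SSE event per token as it arrives. The FastAPI and sse-starlette wiring (which needs those packages installed, and whose route path and model name are illustrative) is shown in comments; the `[DONE]` sentinel is a common convention, not part of the SSE spec:

```python
import asyncio

async def sse_events(token_stream):
    """Yield one SSE `data:` event per token, then a sentinel.
    The client renders the first token while the model is still generating."""
    async for token in token_stream:
        yield {"data": token}        # sse-starlette serializes this as `data: ...`
    yield {"data": "[DONE]"}         # tells the client to close the EventSource

# FastAPI wiring, roughly (requires `fastapi` and `sse-starlette`):
#
#   from sse_starlette.sse import EventSourceResponse
#
#   @app.get("/chat/stream")
#   async def stream(prompt: str):
#       stream = await openai_client.chat.completions.create(
#           model="gpt-4o", messages=[{"role": "user", "content": prompt}],
#           stream=True)
#       tokens = (c.choices[0].delta.content or "" async for c in stream)
#       return EventSourceResponse(sse_events(tokens))
```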

Impact:

  • TTFB: 9,400ms → 180ms (98% reduction)
  • Users see first token while model is still generating

2. Nginx SSE Proxy Configuration

SSE breaks silently if Nginx buffers the response. The X-Accel-Buffering: no header works, but I also added explicit Nginx config to be safe:

Without proxy_buffering off, Nginx buffers the entire SSE stream and delivers it as a single response — defeating the whole point.
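
A minimal location block along these lines (the path and upstream address are illustrative):

```nginx
location /chat/stream {
    proxy_pass http://127.0.0.1:8000;
    proxy_http_version 1.1;
    proxy_set_header Connection "";   # keep the upstream connection open
    proxy_buffering off;              # flush each SSE chunk immediately
    proxy_cache off;                  # never serve a stream from cache
    proxy_read_timeout 300s;          # long-lived streams outlive the 60s default
}
```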

Impact:

  • Eliminated three instances of the "stream delivered as one chunk" bug in staging
  • Confirmed with curl -N (no-buffer mode)

3. asyncio Backpressure with Queue maxsize

Without backpressure, a fast producer (OpenAI) can flood a slow consumer (client with bad connection). The asyncio.Queue(maxsize=512) cap ensures the producer pauses when the consumer falls behind:

A maxsize of 512 tokens works out to roughly a 2,000-character buffer: enough to absorb network jitter without a memory blowout.
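
The pattern can be sketched as a producer/consumer pair around a bounded queue (function names here are illustrative). The key property is that `await queue.put` suspends the producer, and therefore the upstream read from the model, whenever the consumer falls behind:

```python
import asyncio

async def producer(stream, queue: asyncio.Queue):
    """Read tokens from the model and enqueue them.
    `put` blocks once maxsize is reached, pausing the upstream read."""
    async for token in stream:
        await queue.put(token)
    await queue.put(None)            # end-of-stream sentinel

async def consumer(queue: asyncio.Queue, send):
    """Drain the queue at whatever pace the client can accept."""
    while (token := await queue.get()) is not None:
        await send(token)            # e.g. write the SSE event to the socket

async def pipe(stream, send, maxsize: int = 512):
    queue = asyncio.Queue(maxsize=maxsize)   # hard cap on buffered tokens
    await asyncio.gather(producer(stream, queue), consumer(queue, send))
```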

Impact:

  • Memory per connection: capped at ~8KB regardless of model output length
  • Zero OOM events in 3 months of production traffic

4. Client-Side SSE with React 19

SSE connections drop. Mobile clients switch networks. Here's production-ready handling:
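
The original React component isn't reproduced here; the following is a framework-agnostic sketch of the reconnect logic (the endpoint path, callback names, and `[DONE]` sentinel are illustrative). In a React 19 component you would call `connectSSE` inside `useEffect` and return the cleanup function it hands back:

```javascript
function backoffDelay(attempt, baseMs = 1000, maxMs = 30000) {
  // Exponential backoff capped at maxMs: 1s, 2s, 4s, ... up to 30s.
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

function connectSSE(url, { onToken, onDone }) {
  let attempt = 0;
  let source;

  const open = () => {
    source = new EventSource(url);
    source.onopen = () => { attempt = 0; };       // reset backoff on success
    source.onmessage = (e) => {
      if (e.data === "[DONE]") { source.close(); onDone?.(); return; }
      onToken(e.data);
    };
    source.onerror = () => {
      source.close();                              // browsers auto-retry, but we
      setTimeout(open, backoffDelay(attempt++));   // control the backoff ourselves
    };
  };

  open();
  return () => source.close();                     // cleanup for useEffect
}
```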


5. Rate Limiting per User (Redis Sliding Window)

Streaming is expensive. I added per-user rate limiting with a Redis sliding window before the stream starts:
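
The sliding-window logic is simple enough to sketch in pure Python. In production the same four steps run as Redis sorted-set commands batched in one pipeline, as noted in the docstring; the function name and parameters here are illustrative:

```python
def allow_request(history: list, now: float, limit: int, window_s: float) -> bool:
    """Sliding-window rate check against a list of request timestamps.
    The Redis equivalent, issued as one pipeline (one round trip):
        ZREMRANGEBYSCORE key 0 (now - window)   # drop expired entries
        ZADD key {now: now}                     # record this request
        ZCARD key                               # count requests in window
        EXPIRE key window                       # GC idle keys
    """
    cutoff = now - window_s
    history[:] = [t for t in history if t > cutoff]   # evict old timestamps
    if len(history) >= limit:
        return False                                  # over budget: reject before streaming
    history.append(now)
    return True
```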

Impact:

  • P99 rate limiter overhead: 0.8ms (pipeline = 4 commands in 1 round trip)

Final Results

Before vs After Comparison

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| Time to First Byte (TTFB) | 9,400ms | 180ms | 98% faster |
| Time to First Token visible | 9,400ms | 210ms | 97.8% faster |
| User rage-click rate | 34% | 4% | 88% reduction |
| Session abandonment | 21% | 3% | 86% reduction |
| Memory per SSE connection | Unbounded | ~8KB | Stable |
| Nginx dropped connections | 12/day | 0/day | Eliminated |

Cost Savings

Streaming + rate limiting prevented over-generation:

  • Before: Average 2,100 tokens per response (model over-generated waiting for cutoff)
  • After: Average 890 tokens (users stop reading mid-stream, connection closes)
  • Cost reduction: ~58% per request

Key Takeaways

1. TTFB Is the Metric That Matters for AI UX

Total response time is irrelevant if the first token appears in 200ms. Users tolerate slow completion; they don't tolerate a blank screen.

2. Nginx Buffering Will Silently Break Your SSE

Always add proxy_buffering off explicitly. The X-Accel-Buffering: no header helps but isn't sufficient in all proxy configurations.

3. asyncio.Queue maxsize Is Mandatory for Backpressure

Without a cap, a slow client accumulates unbounded tokens in memory. 512 is a sensible default for conversational AI.

4. POST + Token Pattern for Authenticated SSE

EventSource is GET-only. Use a POST to create a session, return a short-lived token, then open the SSE GET with that token. Never put JWTs in SSE URLs — they end up in server logs.
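
A minimal sketch of the token handoff, assuming single-use tokens with a short TTL (function names and the 30-second default are illustrative):

```python
import secrets
import time

_stream_tokens: dict = {}   # token -> (user_id, expiry); use Redis when multi-process

def issue_stream_token(user_id: str, ttl_s: float = 30.0) -> str:
    """Called from the authenticated POST endpoint: mint a short-lived,
    single-use token the client passes to the SSE GET instead of its JWT."""
    token = secrets.token_urlsafe(24)
    _stream_tokens[token] = (user_id, time.monotonic() + ttl_s)
    return token

def redeem_stream_token(token: str):
    """Called from the SSE GET endpoint: valid exactly once, then gone."""
    entry = _stream_tokens.pop(token, None)   # pop makes it single-use
    if entry is None:
        return None
    user_id, expires_at = entry
    return user_id if time.monotonic() < expires_at else None
```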

5. Sliding Window Rate Limiting Costs <1ms with Redis Pipeline

4 Redis commands in a single pipeline = one round trip. At sub-millisecond overhead, there's no excuse to skip rate limiting on expensive AI endpoints.


Tools & Technologies Used

  • API Framework: FastAPI 0.189 + sse-starlette 2.1
  • Async: Python 3.12 asyncio, uvloop
  • OpenAI Client: openai 1.40 with stream=True
  • Cache / Rate Limit: Redis 7 sorted sets
  • Proxy: Nginx 1.26 with proxy_buffering off
  • Client: React 19, native EventSource API
  • Monitoring: OpenTelemetry traces per SSE connection

What's Next?

  1. WebSocket fallback: for clients that don't support SSE (some corporate proxies)
  2. Token budget enforcement: hard-stop streams at N tokens to cap cost per request
  3. Speculative decoding: pre-generate likely continuations to reduce inter-token latency
  4. Multi-agent streaming: fan-out to 3 agents in parallel, stream the first to respond

Conclusion

Switching from blocking OpenAI responses to SSE streaming reduced perceived latency from 9.4 seconds to 210ms — a 97.8% improvement — without changing a single line of model configuration.

The technical work was straightforward: async generator + asyncio.Queue + correct Nginx config. The production system is 80 lines of Python and 10 lines of Nginx config.

If your AI endpoints feel slow, check TTFB first. The model isn't slow — you're probably just waiting for it to finish before sending byte one.