The Voice AI Stack — STT, TTS, and Streaming
How voice AI actually works end-to-end — the components, the latency, and the trade-offs
Voice AI is three problems joined together: hearing what the user said, thinking about it, and speaking back. Each step has different tools, latency profiles, and cost structures. Understanding the stack prevents you from optimising the wrong bottleneck.
The three-stage pipeline
- STT — Speech to Text — Converts audio to a text transcript. Whisper (OpenAI) leads on quality. Deepgram leads on speed. Choose based on whether accuracy or latency matters more for your use case.
- LLM reasoning — Claude or GPT-4 processes the transcript and generates a text response. 300-800ms to first token.
- TTS — Text to Speech — Converts the text response to audio. ElevenLabs leads on quality. OpenAI TTS leads on speed and cost.
A voice assistant is a phone interpreter in real time
Imagine hiring an interpreter who listens to your question in English, thinks about the answer, and speaks it back in any language. The three stages are: hearing (STT), thinking (LLM), and speaking (TTS). Each stage has its own delay. When the stages run one after another, the total voice latency is the sum of all three, typically 1.5-3 seconds end-to-end.
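The hear, think, speak sequence can be sketched as three functions called in order, with a timer around each one. This is a minimal sketch, not a production design: `run_turn`, `TurnResult`, and the `fake_*` functions are hypothetical names invented for illustration, and the stand-ins return canned values instead of calling a real STT, LLM, or TTS provider.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class TurnResult:
    reply_audio: bytes
    stage_seconds: dict[str, float]

    @property
    def total_seconds(self) -> float:
        # The stages run back to back here, so their delays simply add up.
        return sum(self.stage_seconds.values())


def run_turn(
    audio: bytes,
    stt: Callable[[bytes], str],
    llm: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> TurnResult:
    """Run one hear -> think -> speak turn and time each stage."""
    timings: dict[str, float] = {}

    start = time.perf_counter()
    transcript = stt(audio)            # hearing
    timings["stt"] = time.perf_counter() - start

    start = time.perf_counter()
    reply_text = llm(transcript)       # thinking
    timings["llm"] = time.perf_counter() - start

    start = time.perf_counter()
    reply_audio = tts(reply_text)      # speaking
    timings["tts"] = time.perf_counter() - start

    return TurnResult(reply_audio, timings)


# Hypothetical stand-ins; replace each with a real provider call.
def fake_stt(audio: bytes) -> str:
    return "what is the weather"


def fake_llm(text: str) -> str:
    return f"You asked: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


result = run_turn(b"\x00\x01", fake_stt, fake_llm, fake_tts)
slowest = max(result.stage_seconds, key=result.stage_seconds.get)
print(f"total={result.total_seconds:.4f}s, slowest stage={slowest}")
```

Timing each stage separately is the point of the exercise: the slowest stage is the one worth optimising first.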
Latency targets by use case
- Conversational assistant — Total latency target: under 2 seconds. Streaming TTS essential — start playing audio before generation is complete.
- Voice form or command interface — Latency less critical — users expect a processing delay after they speak.
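- Real-time voice-to-voice — Sub-500ms total required. Only achievable with specialised speech-to-speech models (Gemini Live, OpenAI Realtime API). Not covered in this course, because these APIs need a persistent streaming connection (WebRTC or WebSockets) rather than simple request/response calls.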
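For the conversational case, "start playing audio before generation is complete" usually means cutting the LLM's token stream into sentences and sending each sentence to TTS as soon as it is finished. The sketch below shows only that chunking step. `sentences_from_tokens` is a hypothetical helper, the token list is made up, and the splitting rule is deliberately naive.

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as the token stream completes it.

    Naive rule: a sentence ends when the buffered text ends in ., ! or ?.
    Abbreviations and decimals (e.g. "3.5") would be split incorrectly.
    """
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    # Flush any trailing text that never reached a sentence terminator.
    if buffer.strip():
        yield buffer.strip()


# Made-up token stream standing in for a streaming LLM response.
tokens = ["It", " is", " sunny", ".", " Take", " sunglasses", "!", " Bye"]
chunks = list(sentences_from_tokens(tokens))
print(chunks)  # ['It is sunny.', 'Take sunglasses!', 'Bye']
```

In a real assistant each yielded sentence would go straight to a TTS request, so the first sentence can be playing while the LLM is still generating the rest.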
The cost model
- Whisper — $0.006/minute of audio. A 30-second interaction costs $0.003 for STT.
- ElevenLabs — $0.30 per 1,000 characters generated (Starter). A 100-word response ≈ 500 characters = $0.15.
- Claude Sonnet — ~$0.008 per voice interaction at typical lengths.
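- Total cost per voice interaction — Roughly $0.16-0.25. The example figures above sum to about $0.16, almost all of it TTS, so longer spoken replies move you toward the top of the range. Budget accordingly: voice features cost 10-20x as much as text features.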
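The arithmetic behind these figures can be written as a small calculator. The rates are the ones quoted in the list above and may not match current provider pricing. The function name `interaction_cost` and the flat per-interaction LLM cost are simplifying assumptions for illustration.

```python
WHISPER_USD_PER_MINUTE = 0.006       # STT rate quoted above
ELEVENLABS_USD_PER_1K_CHARS = 0.30   # TTS rate quoted above (Starter)
LLM_USD_PER_INTERACTION = 0.008      # flat estimate quoted above


def interaction_cost(audio_seconds: float, reply_chars: int) -> dict[str, float]:
    """Estimate the cost of one voice interaction, broken down by stage."""
    stt = audio_seconds / 60 * WHISPER_USD_PER_MINUTE
    tts = reply_chars / 1000 * ELEVENLABS_USD_PER_1K_CHARS
    llm = LLM_USD_PER_INTERACTION
    return {"stt": stt, "llm": llm, "tts": tts, "total": stt + llm + tts}


# 30 seconds of user audio, 100-word (about 500-character) spoken reply.
cost = interaction_cost(audio_seconds=30, reply_chars=500)
tts_share = cost["tts"] / cost["total"]
print(f"total=${cost['total']:.3f}, TTS share={tts_share:.0%}")
# total=$0.161, TTS share=93%
```

Because TTS accounts for most of the total at these rates, shortening the spoken reply is the most direct way to reduce cost per interaction.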
Try this
Record a 30-second voice note explaining what you want your voice AI to do. Listen back. Count how many distinct sentences you speak. Multiply by 60 characters per sentence. Assuming the assistant's spoken reply is about as long as your request, that is your approximate TTS character count per interaction. Run the cost calculation with the rates from the cost model.
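The exercise can be checked with a few lines of code. This is a sketch under the exercise's own rule of thumb (60 characters per sentence) and the ElevenLabs Starter rate quoted in the cost model. `estimate_tts` is a hypothetical helper, and the sentence count of 8 is only an example value.

```python
CHARS_PER_SENTENCE = 60        # rule of thumb from the exercise
TTS_USD_PER_1K_CHARS = 0.30    # ElevenLabs Starter rate from the cost model


def estimate_tts(sentence_count: int) -> tuple[int, float]:
    """Return (approximate characters, approximate TTS cost in USD)."""
    chars = sentence_count * CHARS_PER_SENTENCE
    return chars, chars / 1000 * TTS_USD_PER_1K_CHARS


# Example: you counted 8 sentences in your voice note.
chars, usd = estimate_tts(8)
print(f"{chars} characters, about ${usd:.3f} per interaction for TTS")
# 480 characters, about $0.144 per interaction for TTS
```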