Voice (STT / TTS)
Talk to Sophon and listen back — speech-to-text input, multi-provider text-to-speech output, and voice session management.
Sophon speaks and listens. On mobile, you can tap the microphone and have the agent respond in synthesized speech. On desktop (and the Electron app), voice mode turns the agent into a hands-free assistant. The STT side is handled by the client platform; the TTS side is pluggable across four providers so you can pick the voice you like.
How it's wired
    Mic → STT (client-side) → transcribed text → Gateway → Agent
                                                             │
                                                             ▼
                                                       Response text → TTS (server-side)
                                                                               │
                                                                               ▼
                                                                         Audio stream → Speaker

STT happens on the client to keep latency low and avoid streaming raw audio to the Gateway. TTS happens server-side because voice selection, caching, and provider credentials all live there.
Speech-to-text
Sophon uses platform-native STT on the client:
| Platform | STT backend |
|---|---|
| Mobile (iOS/Android) | Expo Speech Recognition (native iOS / Android SDKs under the hood) |
| Desktop (Electron) | Chromium SpeechRecognition API, fallback to Whisper via the Voice skill |
| Dashboard (web) | Chromium SpeechRecognition — permission-gated per browser |
| CLI | Not supported (CLI is a text interface) |
Transcription is streamed — you see your words appear in the chat input as you speak, and the agent starts processing as soon as you pause.
For languages not well-supported by the platform STT, the agent can fall back to the voice.transcribe skill which uses a configured Whisper-compatible provider.
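The fallback decision can be sketched as a small pure function. This is an illustration only: `SttConfig`, `choose_stt_backend`, and the `native_languages` set are hypothetical names, and the exact semantics of the `preferPlatformNative` and `whisperFallback` settings from voice.json are assumed here rather than taken from Sophon's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SttConfig:
    # Assumed to mirror the "stt" keys in voice.json.
    prefer_platform_native: bool = True
    whisper_fallback: bool = True


def choose_stt_backend(language: str, native_languages: set[str], config: SttConfig) -> str:
    """Return "native", "whisper", or "unsupported" for a language tag."""
    native_ok = language in native_languages
    # Prefer the platform recognizer when it handles the language.
    if native_ok and config.prefer_platform_native:
        return "native"
    # Otherwise route through the Whisper-compatible provider, if allowed.
    if config.whisper_fallback:
        return "whisper"
    # No fallback configured: native is the only remaining option.
    return "native" if native_ok else "unsupported"
```

For example, with the default settings a language missing from the platform's supported set resolves to "whisper", while a supported one stays on "native".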
Text-to-speech
Four providers are supported out of the box. Configure them in Settings → Voice.
| Provider | Quality | Latency | Cost | Notes |
|---|---|---|---|---|
| ElevenLabs | Best | Medium | Highest | Huge voice library, emotion control |
| OpenAI TTS | Very good | Low | Medium | 6 built-in voices |
| Google Cloud TTS | Good | Low | Low | WaveNet + Neural2 voices |
| Azure Speech | Good | Low | Low | Enterprise-friendly SSO |
Each provider has a different voice catalog. You pick:
- Provider — ElevenLabs / OpenAI / Google / Azure
- Voice — dropdown pre-filtered by provider (Rachel / Adam / Sarah / …)
- Language — auto-detected from response, overridable
- Speed — 0.5× to 2.0×
Streaming TTS
For long responses, the agent streams TTS sentence by sentence:
- Gateway receives the agent's response token-by-token from the LLM.
- A sentence buffer accumulates text until it hits a terminator (., ?, !, or a newline).
- Each completed sentence is handed off to the TTS provider.
- The provider returns an audio chunk (MP3 or OGG).
- The chunk is pushed to the client over SignalR.
- The client plays chunks in order — starts playing the first sentence while later sentences are still being rendered.
Net effect: ~1 second between "agent starts responding" and "you hear the first word."
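The sentence-buffer step above can be sketched as a generator that consumes LLM tokens and yields each sentence as soon as its terminator arrives. This is a minimal sketch, not the Gateway's actual code: `split_sentences` and `TERMINATORS` are hypothetical names, and the handling of leftover text and punctuation-only fragments is an assumption.

```python
from typing import Iterable, Iterator

# Terminators named in the steps above.
TERMINATORS = {".", "?", "!", "\n"}


def split_sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences from a token stream as soon as they end."""
    buffer = ""
    for token in tokens:
        # A token may contain zero, one, or several terminators.
        for char in token:
            buffer += char
            if char in TERMINATORS:
                sentence = buffer.strip()
                buffer = ""
                # Assumption: skip fragments with nothing speakable,
                # such as the extra dots of an ellipsis.
                if any(c.isalnum() for c in sentence):
                    yield sentence
    # Assumption: flush trailing text that never reached a terminator.
    tail = buffer.strip()
    if any(c.isalnum() for c in tail):
        yield tail


if __name__ == "__main__":
    stream = ["Hel", "lo the", "re. How", " are you", "? Fine"]
    for sentence in split_sentences(stream):
        print(sentence)
```

Because the function is a generator, the first sentence can be sent to the TTS provider while later tokens are still arriving, which is what keeps time-to-first-audio low. A splitter this naive also breaks on decimals and abbreviations ("3.14", "e.g."), so a real implementation would likely need extra rules.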
Voice mode (mobile)
On the mobile app, the Voice tab is a full-screen voice experience:
- Animated orb indicates listening / thinking / speaking
- Long-press to talk, release to send (push-to-talk), or tap to start hands-free mode
- Real-time transcription renders as you speak
- Responses play through device audio with visual waveform feedback
- Tap orb to interrupt / cancel
Quiet hours and push-notification rules don't apply — voice mode is user-initiated and opt-in per session.
Voice on desktop
The Electron desktop app supports a global push-to-talk keyboard shortcut (default: Ctrl+Shift+Space). Hold the shortcut, speak, and release; the agent processes your speech no matter which app has focus. The response plays through your default audio device.
Configure the shortcut in Settings → Voice → Push-to-talk shortcut.
Caching
TTS renders are cached by (provider, voice, language, text-hash). Repeated requests — e.g., a greeting the agent uses often — don't re-render. Cache lives at ~/.sophon/cache/tts/ with a 30-day TTL and 1 GB cap.
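One way to picture the cache key and the limits is the sketch below. It is an assumption-laden illustration: the SHA-256 hashing, the file naming, and the eviction order (expired entries first, then least recently used until the cache fits under the cap) are not documented behavior, and `cache_key`, `cache_path`, and `plan_eviction` are hypothetical names.

```python
import hashlib
from pathlib import Path

CACHE_DIR = Path.home() / ".sophon" / "cache" / "tts"
TTL_SECONDS = 30 * 24 * 60 * 60  # 30-day TTL
MAX_BYTES = 1024 ** 3            # 1 GB cap, read here as 1 GiB (assumption)


def cache_key(provider: str, voice: str, language: str, text: str) -> str:
    """Derive a stable key from (provider, voice, language, text-hash)."""
    text_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
    # A separator that cannot appear in the fields keeps them unambiguous.
    fields = "\x1f".join([provider.lower(), voice, language, text_hash])
    return hashlib.sha256(fields.encode("utf-8")).hexdigest()


def cache_path(provider: str, voice: str, language: str, text: str,
               extension: str = "mp3") -> Path:
    """Hypothetical on-disk location for one cached render."""
    return CACHE_DIR / f"{cache_key(provider, voice, language, text)}.{extension}"


def plan_eviction(entries, now, ttl=TTL_SECONDS, max_bytes=MAX_BYTES):
    """entries: (name, size_bytes, last_used_epoch) tuples. Returns names to delete."""
    expired = [e for e in entries if now - e[2] > ttl]
    live = sorted((e for e in entries if now - e[2] <= ttl), key=lambda e: e[2])
    to_delete = [name for name, _, _ in expired]
    total = sum(size for _, size, _ in live)
    # Drop the least recently used entries until the cache fits under the cap.
    for name, size, _ in live:
        if total <= max_bytes:
            break
        to_delete.append(name)
        total -= size
    return to_delete
```

Keeping `plan_eviction` free of filesystem calls makes the policy easy to test; a caller would gather sizes and timestamps from the cache directory and delete whatever names it returns.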
Agent tool
Agents have a voice.* tool family (optional — not enabled by default):
- voice.speak(text, voice?): generate speech and play it (in sessions that support playback)
- voice.transcribe(audioUrl): force transcription of an audio file
- voice.list_voices(provider?): enumerate available voices
Enable these in Agents → Tools if you want the agent to proactively speak beyond its chat responses — e.g., "announce this over voice when the build finishes."
Configuration
~/.sophon/config/voice.json:
{
  "tts": {
    "provider": "ElevenLabs",
    "voice": "Rachel",
    "speed": 1.0,
    "cachingEnabled": true
  },
  "stt": {
    "preferPlatformNative": true,
    "whisperFallback": true
  }
}

Provider credentials are stored in the credential vault. Add keys in Settings → Connections:
- ElevenLabs API key
- OpenAI API key (reused from your LLM config)
- Google Cloud service account JSON
- Azure Speech region + key
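If you generate or edit voice.json from a script, a small loader that merges the file with defaults and checks the documented ranges can catch mistakes early. The sketch below is hypothetical: `load_voice_config`, `DEFAULTS`, and `KNOWN_PROVIDERS` are illustrative names, treating the sample values above as defaults is an assumption, and so is matching provider names case-insensitively.

```python
import json

# Assumption: the sample values from voice.json double as defaults.
DEFAULTS = {
    "tts": {"provider": "ElevenLabs", "voice": "Rachel", "speed": 1.0, "cachingEnabled": True},
    "stt": {"preferPlatformNative": True, "whisperFallback": True},
}
# Assumption: provider names are matched case-insensitively.
KNOWN_PROVIDERS = {"elevenlabs", "openai", "google", "azure"}


def load_voice_config(raw_json: str) -> dict:
    """Parse voice.json text, fill in missing keys, and validate the result."""
    user = json.loads(raw_json)
    merged = {
        section: {**values, **user.get(section, {})}
        for section, values in DEFAULTS.items()
    }
    tts = merged["tts"]
    if str(tts["provider"]).lower() not in KNOWN_PROVIDERS:
        raise ValueError(f"unknown TTS provider: {tts['provider']!r}")
    # The speed setting is documented as 0.5x to 2.0x.
    if not isinstance(tts["speed"], (int, float)) or not 0.5 <= tts["speed"] <= 2.0:
        raise ValueError(f"speed must be between 0.5 and 2.0, got {tts['speed']!r}")
    return merged


if __name__ == "__main__":
    config = load_voice_config('{"tts": {"provider": "openai", "voice": "nova"}}')
    print(config["tts"])
```

Credentials are deliberately absent from this structure, since they live in the credential vault rather than in voice.json.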
CLI
sophon voice list-voices --provider elevenlabs
sophon voice speak "Hello, world." --provider openai --voice nova
sophon voice config --provider google --voice en-US-Neural2-F
Limits and gotchas
- Mobile STT requires microphone permission on iOS/Android.
- Expo Go does not support custom STT — you need a development or production build.
- Browser STT requires HTTPS (except localhost) per browser security policy.
- Some TTS providers charge per character. Streaming TTS counts all rendered characters, even if you cancel mid-sentence.
- ElevenLabs has rate limits on voice cloning; stock voices are fine at scale.
- If multiple TTS providers are configured, only one is active at a time — picking a new provider in settings is global.
Where to go next
- Mobile App — voice tab specifics
- Connections — configuring provider credentials
- Skills — other audio tools (transcription, format conversion)