Voice (STT / TTS)
Talk to Sophon and listen back — real-time server-side speech-to-text, multi-provider text-to-speech output, and voice session management.
Sophon speaks and listens. Tap the microphone on mobile or in the Dashboard and your words appear on screen as you say them; the agent answers in synthesized speech. Both directions run through the server: mic audio streams to the Gateway in real time, the Gateway bridges to a streaming STT provider, and transcripts flow back live. Listening behaves identically on every client, and provider credentials never leave the host.
How it's wired
Mic → audio stream (real time) → Gateway → streaming STT provider
│
├── interim transcripts → client (live, as you speak)
└── final transcript → Agent
│
▼
Response text → TTS (server-side)
│
▼
Audio stream → Speaker
The client's only job is to capture mic audio and play audio back. Transcription, provider selection, caching, and credentials all live server-side, which is also why the Dashboard now has listening without depending on browser speech APIs.
Listening (speech-to-text)
Interim transcripts render live as you speak, then finalize in one of two utterance modes:
- Auto (hands-free) — the provider's voice-activity detection ends the utterance on a natural pause; the final transcript is sent to the agent automatically.
- Manual (push-to-talk) — you control the boundary: hold to talk, and the utterance finalizes when you send.
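To make the Auto mode boundary concrete, here is a minimal sketch of pause-based endpointing. It is an illustration only: real voice-activity detection runs inside the STT provider and is provider-specific. The function name, the per-frame speech flags, and the 20 ms frame size are assumptions made for this example.

```python
from typing import Iterable, Optional


def find_utterance_end(
    frames: Iterable[bool],
    frame_ms: int = 20,          # assumed frame duration, not a Sophon setting
    endpointing_ms: int = 500,   # pause length that ends the utterance
) -> Optional[int]:
    """Return the index of the frame that ends the utterance, or None.

    `frames` holds one flag per audio frame (True = speech detected).
    The utterance ends once speech has started and the trailing silence
    reaches `endpointing_ms`.
    """
    silence_ms = 0
    speech_started = False
    for index, is_speech in enumerate(frames):
        if is_speech:
            speech_started = True
            silence_ms = 0          # any speech resets the pause timer
        elif speech_started:
            silence_ms += frame_ms
            if silence_ms >= endpointing_ms:
                return index
    return None                     # no qualifying pause yet
```

Leading silence is ignored, so the pause timer only starts after the first speech frame. Short gaps between words reset it rather than ending the utterance.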
Six STT providers are supported:
| Provider | Notes |
|---|---|
| Deepgram | Streaming-first; lowest interim-transcript latency |
| OpenAI | Reuses your OpenAI API key |
| Azure Speech | Same region + key as Azure TTS |
| Google Cloud Speech | Same service-account JSON as Google TTS |
| ElevenLabs | One key covers both STT and TTS |
| Sophon Managed Speech | Self-hosted: one endpoint + access token serves both STT and TTS, with a built-in health check |
Per client:
| Client | Listening |
|---|---|
| Dashboard | New — Voice page Listening tab with provider status and live transcripts |
| Mobile | Same unified server path (replaces the previous device-native STT) |
| Desktop (Electron) | Global push-to-talk shortcut, same server path |
| CLI | Text-only — the CLI is a text interface |
Configuring listening
Host-wide defaults live in stt.json under ~/.sophon/config/ and are managed from Settings → Voice → Listening or the REST API — no hand-editing needed. Provider API keys in stt.json are encrypted at rest.
| Host setting | Range / default | What it does |
|---|---|---|
| Provider list | — | Configured STT providers with credentials |
| Default provider | — | Handles utterances unless a user overrides |
| Default language | e.g. en-US | Transcription language hint |
| Interim results | on | Show the transcript live as you speak |
| Endpointing | 10–10,000 ms | Pause length that ends an utterance in Auto mode |
| Max utterance | 1–300 s, default 60 | Hard cap on a single utterance |
Each user can override the provider, interim results, endpointing, and max utterance in Voice Center → Listening (Dashboard and Mobile). Leave a field blank to inherit the host default.
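The inheritance rule can be sketched as a merge in which a blank (None) override falls back to the host default. The class and field names below are hypothetical and are not the actual stt.json schema. The range checks reuse the limits from the table above.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class ListeningSettings:
    """Effective settings (field names are illustrative assumptions)."""
    provider: str
    interim_results: bool
    endpointing_ms: int
    max_utterance_s: int


@dataclass(frozen=True)
class UserOverrides:
    """Per-user overrides; None means 'inherit the host default'."""
    provider: Optional[str] = None
    interim_results: Optional[bool] = None
    endpointing_ms: Optional[int] = None
    max_utterance_s: Optional[int] = None


def resolve(host: ListeningSettings, user: UserOverrides) -> ListeningSettings:
    """Merge user overrides over host defaults and validate the ranges."""
    merged = {}
    for field in fields(ListeningSettings):
        override = getattr(user, field.name)
        # Compare against None so an explicit False still counts as an override.
        merged[field.name] = getattr(host, field.name) if override is None else override
    result = ListeningSettings(**merged)
    if not 10 <= result.endpointing_ms <= 10_000:
        raise ValueError("endpointing must be between 10 and 10,000 ms")
    if not 1 <= result.max_utterance_s <= 300:
        raise ValueError("max utterance must be between 1 and 300 s")
    return result
```

Checking for None rather than truthiness matters here: a user who turns interim results off should not silently inherit the host's "on".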
REST surface:
| Endpoint | Purpose |
|---|---|
| GET / POST / DELETE /api/stt/providers | List, add, remove STT providers |
| POST /api/stt/providers/{id}/test | Health-check a provider |
| GET /api/stt/providers/{id}/languages | Enumerate supported languages |
| GET / PUT /api/stt/settings | Read / update host-wide defaults |
| GET / PUT /api/voice/preferences | Read / update per-user voice preferences |
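As a rough illustration of calling these endpoints, the sketch below builds requests with Python's standard library without sending them. Only the paths and methods come from the table. The base URL, the bearer-token header, and the request body fields are assumptions and may not match the real API.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Dict, Optional

BASE_URL = "http://localhost:8080"  # hypothetical Gateway address


def build_request(
    method: str,
    path: str,
    body: Optional[Dict[str, Any]] = None,
    token: Optional[str] = None,  # assumed bearer auth; the real scheme may differ
) -> urllib.request.Request:
    """Build (but do not send) a JSON request against the Gateway."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(BASE_URL + path, data=data, method=method)
    request.add_header("Accept", "application/json")
    if data is not None:
        request.add_header("Content-Type", "application/json")
    if token:
        request.add_header("Authorization", f"Bearer {token}")
    return request


def health_check_request(provider_id: str) -> urllib.request.Request:
    """POST /api/stt/providers/{id}/test"""
    safe_id = urllib.parse.quote(provider_id, safe="")
    return build_request("POST", f"/api/stt/providers/{safe_id}/test")


def update_settings_request(settings: Dict[str, Any]) -> urllib.request.Request:
    """PUT /api/stt/settings (the body shape is a guess, not a documented schema)."""
    return build_request("PUT", "/api/stt/settings", body=settings)
```

Passing a built request to `urllib.request.urlopen` would send it. Changing host-wide settings or testing providers would presumably also need Admin credentials.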
Fallbacks and budgets
If no STT provider is configured, the mic button shows a graceful status message and you can type instead. No audio is uploaded and no error is thrown.
- Utterance caps — each utterance is bounded by the max-utterance duration and a byte-size cap, so a stuck mic can't exhaust resources. Speech that hits the cap is still transcribed and answered.
- Barge-in — tap the mic again (Mobile) or click Stop (Dashboard) to cancel mid-utterance. The in-flight stream is cancelled cleanly; an empty transcript is never sent to the agent.
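The two budget rules above can be pictured as a small guard around the audio stream. This is a sketch under assumptions: the class name, the 4 MB byte cap, and the per-chunk duration argument are hypothetical, since the actual byte-size cap is not stated here.

```python
class UtteranceBudget:
    """Tracks one utterance against a duration cap and a byte-size cap."""

    def __init__(self, max_seconds: float = 60.0, max_bytes: int = 4 * 1024 * 1024):
        # 60 s mirrors the documented default; the byte cap is an assumed value.
        self.max_seconds = max_seconds
        self.max_bytes = max_bytes
        self.seconds_used = 0.0
        self.bytes_used = 0

    def accept(self, chunk: bytes, chunk_seconds: float) -> bool:
        """Record the chunk and return True if it fits.

        Returns False when either cap would be exceeded, which signals the
        caller to finalize the utterance with what was captured so far.
        """
        over_time = self.seconds_used + chunk_seconds > self.max_seconds
        over_size = self.bytes_used + len(chunk) > self.max_bytes
        if over_time or over_size:
            return False
        self.seconds_used += chunk_seconds
        self.bytes_used += len(chunk)
        return True


def should_send_to_agent(transcript: str) -> bool:
    """A cancelled or silent utterance yields no text, so nothing is sent."""
    return bool(transcript.strip())
```

Rejecting the chunk instead of raising an error keeps the capped utterance usable, which matches the behavior that speech hitting the cap is still transcribed and answered.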
Text-to-speech
Six providers supported out of the box. Configure in Settings → Voice.
| Provider | Quality | Latency | Cost | Notes |
|---|---|---|---|---|
| ElevenLabs | Best | Medium | Highest | Huge voice library, emotion control |
| OpenAI TTS | Very good | Low | Medium | 6 built-in voices |
| Google Cloud TTS | Good | Low | Low | WaveNet + Neural2 voices |
| Azure Speech | Good | Low | Low | Enterprise-friendly SSO |
| Deepgram | Good | Lowest | Low | Aura voices; streaming-first, pairs with its STT side |
| Sophon Managed Speech | Good | Low | Self-hosted | Same endpoint + token as its STT side |
Each provider has a different voice catalog. You pick:
- Provider — ElevenLabs / OpenAI / Google / Azure / Deepgram / Managed
- Voice — dropdown pre-filtered by provider (Rachel / Adam / Sarah / …)
- Language — auto-detected from response, overridable
- Speed — 0.5× to 2.0×
Streaming TTS
For long responses, the agent streams TTS sentence by sentence:
- Gateway receives the agent's response token-by-token from the LLM.
- A sentence buffer accumulates text until it hits a terminator (".", "?", "!", or a newline).
- Each completed sentence is handed off to the TTS provider.
- The provider returns an audio chunk (MP3 or OGG).
- The chunk is pushed to the client over SignalR.
- The client plays chunks in order, starting the first sentence while later sentences are still being rendered.
Net effect: ~1 second between "agent starts responding" and "you hear the first word."
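The sentence-buffer step can be sketched as a generator that consumes LLM tokens and yields complete sentences. This is a simplified illustration rather than the Gateway's actual code. The function name is hypothetical, and the naive terminator check will also split on abbreviations and decimals such as "e.g." or "3.14".

```python
from typing import Iterable, Iterator

TERMINATORS = {".", "?", "!", "\n"}


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as its terminator arrives."""
    buffer = ""
    for token in tokens:
        for char in token:
            buffer += char
            if char in TERMINATORS:
                sentence = buffer.strip()
                # Skip fragments with no speakable text, e.g. the extra dots in "...".
                if any(c.isalnum() for c in sentence):
                    yield sentence
                buffer = ""
    tail = buffer.strip()
    if any(c.isalnum() for c in tail):
        yield tail  # flush text left over when the response ends without a terminator
```

Because it is a generator, the first sentence can go to the TTS provider while the LLM is still producing later tokens, which is what lets playback start early.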
Voice mode (mobile)
On the mobile app, the Voice tab is a full-screen voice experience:
- Animated orb indicates listening / thinking / speaking
- Long-press to talk, release to send (Manual / push-to-talk), or tap once for hands-free (Auto)
- Interim transcripts render live in the conversation as you speak — the same server transcription path as the Dashboard
- Responses play through device audio with visual waveform feedback
- Tap orb to interrupt / cancel (barge-in)
Voice Center — reachable from the Voice tab and More → Settings — holds all listening and speech settings, including your per-user overrides. Quiet hours and push-notification rules don't apply — voice mode is user-initiated and opt-in per session.
Voice on desktop
The Electron desktop app supports a global push-to-talk keyboard shortcut (default: Ctrl+Shift+Space). Hold the shortcut, speak, and release; the agent processes your audio regardless of which app has focus. The response plays through your default audio device.
Configure the shortcut in Settings → Voice → Push-to-talk shortcut.
Caching
Two layers, both server-side:
- Disk: TTS renders are cached by (provider, voice, language, text-hash). Repeated requests, such as a greeting the agent uses often, don't re-render. The cache lives at ~/.sophon/cache/tts/ with a 30-day TTL and a 1 GB cap.
- Memory: a byte-bounded LRU cache (default 32 MB) holds recently synthesized audio, so repeated lines (greetings, status updates) skip provider round-trips and re-billing entirely. It's cleared on restart; no audio is written to disk by this layer.
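Both ideas can be sketched in a few lines: a cache key built from (provider, voice, language, text-hash) and a byte-bounded LRU. The SHA-256 hash, the colon-separated key format, and the class name are assumptions made for illustration, not Sophon's actual implementation.

```python
import hashlib
from collections import OrderedDict
from typing import Optional


def cache_key(provider: str, voice: str, language: str, text: str) -> str:
    """Build a key from (provider, voice, language, text-hash)."""
    text_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()  # assumed hash choice
    return f"{provider}:{voice}:{language}:{text_hash}"


class ByteBoundedLRU:
    """In-memory LRU cache capped by total audio bytes, not entry count."""

    def __init__(self, max_bytes: int = 32 * 1024 * 1024):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self._entries: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, key: str) -> Optional[bytes]:
        audio = self._entries.get(key)
        if audio is not None:
            self._entries.move_to_end(key)  # mark as most recently used
        return audio

    def put(self, key: str, audio: bytes) -> None:
        if len(audio) > self.max_bytes:
            return  # a clip larger than the whole cache is never stored
        if key in self._entries:
            self.used_bytes -= len(self._entries.pop(key))
        self._entries[key] = audio
        self.used_bytes += len(audio)
        while self.used_bytes > self.max_bytes:
            _, evicted = self._entries.popitem(last=False)  # drop least recently used
            self.used_bytes -= len(evicted)
```

Hashing the text keeps keys a fixed length regardless of response size. Bounding by bytes rather than entry count keeps memory use predictable when clip lengths vary widely.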
Agent tool
Agents have a voice.* tool family (optional — not enabled by default):
- voice.speak(text, voice?): generate speech and play it (in sessions that support playback)
- voice.transcribe(audioUrl): force transcription of an audio file
- voice.list_voices(provider?): enumerate available voices
Enable these in Agents → Tools if you want the agent to proactively speak beyond its chat responses — e.g., "announce this over voice when the build finishes."
Configuration
~/.sophon/config/voice.json covers speech output:
{
  "tts": {
    "provider": "ElevenLabs",
    "voice": "Rachel",
    "speed": 1.0,
    "cachingEnabled": true
  }
}
Listening lives separately in ~/.sophon/config/stt.json; see Configuring listening. Provider credentials are stored in the credential vault. Add keys in Settings → Connections:
- Deepgram API key
- ElevenLabs API key
- OpenAI API key (reused from your LLM config)
- Google Cloud service account JSON
- Azure Speech region + key
- Sophon Managed Speech endpoint + access token
CLI
sophon voice list-voices --provider elevenlabs
sophon voice speak "Hello, world." --provider openai --voice nova
sophon voice config --provider google --voice en-US-Neural2-F
Limits and gotchas
- Microphone permission is required on iOS/Android and in browsers. Browser mic capture requires HTTPS (except localhost).
- Adding, removing, or testing STT providers and changing host-wide listening defaults require the Admin role; per-user overrides are available to everyone.
- Some TTS providers charge per character. Streaming TTS counts all rendered characters, even if you cancel mid-sentence.
- ElevenLabs has rate limits on voice cloning; stock voices are fine at scale.
- If multiple TTS providers are configured, only one is active at a time — picking a new provider in settings is global.
Where to go next
- Mobile App — voice tab and Voice Center specifics
- Dashboard — Voice page and Listening tab
- Connections — configuring provider credentials
- Skills — other audio tools (transcription, format conversion)