Voice (STT / TTS)

Talk to Sophon and listen back — real-time server-side speech-to-text, multi-provider text-to-speech output, and voice session management.

Sophon speaks and listens. Tap the microphone on mobile or in the Dashboard and your words appear on screen as you say them; the agent answers in synthesized speech. Both directions run through the server: mic audio streams to the Gateway in real time, the Gateway bridges to a streaming STT provider, and transcripts flow back live. Listening behaves identically on every client, and provider credentials never leave the host.

How it's wired

Mic → audio stream (real time) → Gateway → streaming STT provider
                                    │
                                    ├── interim transcripts → client (live, as you speak)
                                    └── final transcript → Agent
                                                             │
                                                             ▼
                                                  Response text → TTS (server-side)
                                                                    │
                                                                    ▼
                                                             Audio stream → Speaker

The client's only job is to capture mic audio and play audio back. Transcription, provider selection, caching, and credentials all live server-side — which is also why the Dashboard now has listening without depending on browser speech APIs.

Listening (speech-to-text)

Interim transcripts render live as you speak, then finalize in one of two utterance modes:

Auto (hands-free) — the provider's voice-activity detection ends the utterance on a natural pause; the final transcript is sent to the agent automatically.
Manual (push-to-talk) — you control the boundary: hold to talk, and the utterance finalizes when you send.
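
As a rough illustration of how Auto mode decides an utterance is over, the sketch below counts trailing silence against an endpointing threshold. In Sophon the provider's own voice-activity detection does this work; the `Endpointer` class, its field names, and the 500 ms default are hypothetical and only model the behavior described above.

```python
from dataclasses import dataclass


@dataclass
class Endpointer:
    """Toy model of Auto-mode endpointing (not the provider's real VAD)."""
    endpointing_ms: int = 500  # illustrative value inside the 10-10,000 ms range
    heard_speech: bool = False
    silence_ms: int = 0

    def feed(self, is_speech: bool, frame_ms: int) -> bool:
        """Consume one audio frame; return True when the utterance should end."""
        if is_speech:
            self.heard_speech = True
            self.silence_ms = 0  # any speech resets the pause counter
            return False
        if not self.heard_speech:
            return False  # leading silence never ends an utterance
        self.silence_ms += frame_ms
        return self.silence_ms >= self.endpointing_ms


ep = Endpointer(endpointing_ms=300)
frames = [False, True, True, False, False, True, False, False, False]
results = [ep.feed(is_speech, frame_ms=100) for is_speech in frames]
```

A short pause (two silent frames) does not finalize the utterance; only the final run of three silent frames reaches the 300 ms threshold. Manual mode would skip this check and finalize on release or send instead.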

Six STT providers are supported:

Provider	Notes
Deepgram	Streaming-first; lowest interim-transcript latency
OpenAI	Reuses your OpenAI API key
Azure Speech	Same region + key as Azure TTS
Google Cloud Speech	Same service-account JSON as Google TTS
ElevenLabs	One key covers both STT and TTS
Sophon Managed Speech	Self-hosted: one endpoint + access token serves both STT and TTS, with a built-in health check

Per client:

Client	Listening
Dashboard	New — Voice page Listening tab with provider status and live transcripts
Mobile	Same unified server path (replaces the previous device-native STT)
Desktop (Electron)	Global push-to-talk shortcut, same server path
CLI	Text-only — the CLI is a text interface

Configuring listening

Host-wide defaults live in stt.json under ~/.sophon/config/ and are managed from Settings → Voice → Listening or the REST API — no hand-editing needed. Provider API keys in stt.json are encrypted at rest.

Host setting	Range / default	What it does
Provider list	—	Configured STT providers with credentials
Default provider	—	Handles utterances unless a user overrides
Default language	e.g. `en-US`	Transcription language hint
Interim results	on	Show the transcript live as you speak
Endpointing	10–10,000 ms	Pause length that ends an utterance in Auto mode
Max utterance	1–300 s, default 60	Hard cap on a single utterance

Each user can override the provider, interim results, endpointing, and max utterance in Voice Center → Listening (Dashboard and Mobile). Leave a field blank to inherit the host default.
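
The inheritance rule can be sketched as a simple merge: a blank field (modeled here as `None`) falls back to the host default. The class and field names below are assumptions made for illustration, not Sophon's actual schema, and the host values are made up.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class ListeningSettings:
    provider: str
    interim_results: bool
    endpointing_ms: int
    max_utterance_s: int


@dataclass(frozen=True)
class UserOverrides:
    # None models a field left blank in Voice Center.
    provider: Optional[str] = None
    interim_results: Optional[bool] = None
    endpointing_ms: Optional[int] = None
    max_utterance_s: Optional[int] = None


def resolve(host: ListeningSettings, user: UserOverrides) -> ListeningSettings:
    """Return effective settings: the user's value if set, else the host default."""
    merged = {}
    for f in fields(ListeningSettings):
        override = getattr(user, f.name)
        # Compare against None so an explicit False (interim results off) is honored.
        merged[f.name] = getattr(host, f.name) if override is None else override
    return ListeningSettings(**merged)


host = ListeningSettings(provider="deepgram", interim_results=True,
                         endpointing_ms=500, max_utterance_s=60)
effective = resolve(host, UserOverrides(endpointing_ms=800))
```

Here `effective` keeps the host's provider, interim-results flag, and max utterance, and only the endpointing value changes to 800 ms.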

REST surface:

Endpoint	Purpose
`GET` / `POST` / `DELETE` `/api/stt/providers`	List, add, remove STT providers
`POST /api/stt/providers/{id}/test`	Health-check a provider
`GET /api/stt/providers/{id}/languages`	Enumerate supported languages
`GET` / `PUT` `/api/stt/settings`	Read / update host-wide defaults
`GET` / `PUT` `/api/voice/preferences`	Read / update per-user voice preferences
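
A minimal sketch of calling these endpoints from Python's standard library follows. The endpoint paths come from the table above; the base URL, the bearer-token header, and the helper names are assumptions, and request and response bodies are not documented here, so the sketch passes payloads through untouched.

```python
import json
import urllib.request
from typing import Optional
from urllib.parse import quote

BASE_URL = "http://localhost:8080"  # assumption: replace with your Gateway address


def build_request(method: str, path: str, token: str,
                  body: Optional[dict] = None) -> urllib.request.Request:
    """Build (but do not send) a JSON request for the voice REST surface."""
    headers = {
        "Accept": "application/json",
        "Authorization": f"Bearer {token}",  # assumption: bearer-token auth
    }
    data = None
    if body is not None:
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(BASE_URL + path, data=data,
                                  headers=headers, method=method)


def list_providers(token: str) -> urllib.request.Request:
    return build_request("GET", "/api/stt/providers", token)


def health_check_provider(provider_id: str, token: str) -> urllib.request.Request:
    # Percent-encode the id so it is safe as a single path segment.
    path = f"/api/stt/providers/{quote(provider_id, safe='')}/test"
    return build_request("POST", path, token)


def update_host_settings(settings: dict, token: str) -> urllib.request.Request:
    return build_request("PUT", "/api/stt/settings", token, settings)


req = health_check_provider("deepgram", token="example-token")
# To send it: urllib.request.urlopen(req) returns the HTTP response.
```

Building the request separately from sending it keeps the sketch testable without a running Gateway.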

Fallbacks and budgets

If no STT provider is configured, the mic button shows a status message explaining that listening is unavailable, and you can type instead. No audio is uploaded and no error is thrown.

Utterance caps — each utterance is bounded by the max-utterance duration and a byte-size cap, so a stuck mic can't exhaust resources. Speech that hits the cap is still transcribed and answered.
Barge-in — tap the mic again (Mobile) or click Stop (Dashboard) to cancel mid-utterance. The in-flight stream is cancelled cleanly; an empty transcript is never sent to the agent.
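
The utterance caps can be pictured as two running counters checked on every incoming audio chunk. This is a sketch only: the `UtteranceBudget` name and the 4 MB byte cap are assumptions (the actual byte limit is not stated here), while the 60-second default matches the settings table.

```python
from dataclasses import dataclass


@dataclass
class UtteranceBudget:
    max_seconds: float = 60.0          # host default for max utterance
    max_bytes: int = 4 * 1024 * 1024   # assumption: the real byte cap is not documented
    seconds: float = 0.0
    size: int = 0

    def accept(self, chunk: bytes, chunk_seconds: float) -> bool:
        """Account for one chunk. Return False once either cap would be
        exceeded, meaning the caller should stop reading and finalize."""
        over_time = self.seconds + chunk_seconds > self.max_seconds
        over_size = self.size + len(chunk) > self.max_bytes
        if over_time or over_size:
            return False
        self.seconds += chunk_seconds
        self.size += len(chunk)
        return True


budget = UtteranceBudget(max_seconds=1.0, max_bytes=10_000)
accepted = [budget.accept(b"\x00" * 3200, 0.1) for _ in range(5)]
```

With a 10,000-byte cap and 3,200-byte chunks, the first three chunks are accepted and the rest are refused. Audio accepted before the cap is still finalized and transcribed, which matches the behavior described above.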

Text-to-speech

Six TTS providers are supported out of the box. Configure them in Settings → Voice.

Provider	Quality	Latency	Cost	Notes
ElevenLabs	Best	Medium	Highest	Huge voice library, emotion control
OpenAI TTS	Very good	Low	Medium	6 built-in voices
Google Cloud TTS	Good	Low	Low	WaveNet + Neural2 voices
Azure Speech	Good	Low	Low	Enterprise-friendly SSO
Deepgram	Good	Lowest	Low	Aura voices; streaming-first, pairs with its STT side
Sophon Managed Speech	Good	Low	Self-hosted	Same endpoint + token as its STT side

Each provider has a different voice catalog. You pick:

Provider — ElevenLabs / OpenAI / Google / Azure / Deepgram / Managed
Voice — dropdown pre-filtered by provider (Rachel / Adam / Sarah / …)
Language — auto-detected from response, overridable
Speed — 0.5× to 2.0×

Streaming TTS

For long responses, the agent streams TTS sentence by sentence:

Gateway receives the agent's response token-by-token from the LLM.
A sentence buffer accumulates text until it hits a terminator (., ?, !, newline).
Each completed sentence is handed off to the TTS provider.
The provider returns an audio chunk (MP3 or OGG).
The chunk is pushed to the client over SignalR.
The client plays chunks in order and starts playing the first sentence while later sentences are still being rendered.

Net effect: ~1 second between "agent starts responding" and "you hear the first word."
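
The sentence buffer in the steps above can be sketched as a generator that consumes LLM tokens and yields completed sentences. This is a minimal illustration using the terminators listed above; the function name is hypothetical, and the naive split will also break on abbreviations and decimals such as "e.g." or "3.14", which a production buffer would need to handle.

```python
from typing import Iterable, Iterator

TERMINATORS = (".", "?", "!", "\n")


def sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Accumulate streamed tokens and yield each sentence once it is terminated."""
    buffer = ""
    for token in tokens:
        buffer += token
        while True:
            # Position of the earliest terminator in the buffer, or -1 if none.
            hits = [i for i in (buffer.find(t) for t in TERMINATORS) if i != -1]
            cut = min(hits, default=-1)
            if cut == -1:
                break
            sentence, buffer = buffer[:cut + 1].strip(), buffer[cut + 1:]
            # Skip fragments with nothing speakable, such as a lone "." from "...".
            if any(ch.isalnum() for ch in sentence):
                yield sentence
    tail = buffer.strip()
    if any(ch.isalnum() for ch in tail):
        yield tail  # flush whatever remains when the response ends


chunks = list(sentences(["Hel", "lo there", ". How", " are you", "? Fine"]))
```

Each yielded sentence would be handed to the TTS provider as soon as it is complete, which is what lets playback begin before the full response has been generated.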

Voice mode (mobile)

On the mobile app, the Voice tab is a full-screen voice experience:

Animated orb indicates listening / thinking / speaking
Long-press to talk, release to send (Manual / push-to-talk), or tap once for hands-free (Auto)
Interim transcripts render live in the conversation as you speak — the same server transcription path as the Dashboard
Responses play through device audio with visual waveform feedback
Tap orb to interrupt / cancel (barge-in)

Voice Center — reachable from the Voice tab and More → Settings — holds all listening and speech settings, including your per-user overrides. Quiet hours and push-notification rules don't apply — voice mode is user-initiated and opt-in per session.

Voice on desktop

The Electron desktop app supports a global push-to-talk keyboard shortcut (default: Ctrl+Shift+Space). Hold the shortcut, speak, and release. Because the shortcut is global, it works regardless of which app has focus. The response plays through your default audio device.

Configure the shortcut in Settings → Voice → Push-to-talk shortcut.

Caching

Two layers, both server-side:

Disk — TTS renders are cached by (provider, voice, language, text-hash). Repeated requests — e.g., a greeting the agent uses often — don't re-render. Cache lives at ~/.sophon/cache/tts/ with a 30-day TTL and 1 GB cap.
Memory — a byte-bounded LRU cache (default 32 MB) holds recently synthesized audio, so repeated lines (greetings, status updates) skip provider round-trips and re-billing entirely. It's cleared on restart; no audio is written to disk by this layer.
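
Both layers can be sketched in a few lines: a key built from (provider, voice, language, text-hash), and a least-recently-used store bounded by total bytes rather than entry count. The SHA-256 hash, the key layout, and the class name are assumptions made for illustration; only the key components and the 32 MB default come from the description above.

```python
import hashlib
from collections import OrderedDict
from typing import Optional


def cache_key(provider: str, voice: str, language: str, text: str) -> str:
    """Build a key from provider, voice, language, and a hash of the text."""
    text_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()  # assumption: SHA-256
    return "/".join((provider.lower(), voice, language, text_hash))


class ByteBoundedLRU:
    """In-memory LRU cache that evicts by total audio bytes held."""

    def __init__(self, max_bytes: int = 32 * 1024 * 1024):
        self.max_bytes = max_bytes
        self.used = 0
        self._items: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, key: str) -> Optional[bytes]:
        audio = self._items.get(key)
        if audio is not None:
            self._items.move_to_end(key)  # mark as most recently used
        return audio

    def put(self, key: str, audio: bytes) -> None:
        if len(audio) > self.max_bytes:
            return  # a single clip larger than the cap is never cached
        if key in self._items:
            self.used -= len(self._items.pop(key))
        self._items[key] = audio
        self.used += len(audio)
        while self.used > self.max_bytes:
            _, evicted = self._items.popitem(last=False)  # drop least recently used
            self.used -= len(evicted)


cache = ByteBoundedLRU(max_bytes=10)
cache.put("a", b"aaaa")
cache.put("b", b"bbbb")
cache.get("a")            # touch "a" so "b" becomes the eviction candidate
cache.put("c", b"cccc")   # 12 bytes exceeds the cap, so "b" is evicted
```

Bounding by bytes instead of entry count keeps memory use predictable when clip lengths vary widely.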

Agent tool

Agents have a voice.* tool family (optional — not enabled by default):

voice.speak(text, voice?) — generate speech and play it (in sessions that support playback)
voice.transcribe(audioUrl) — force transcription of an audio file
voice.list_voices(provider?) — enumerate available voices

Enable these in Agents → Tools if you want the agent to proactively speak beyond its chat responses — e.g., "announce this over voice when the build finishes."

Configuration

~/.sophon/config/voice.json covers speech output:

{
  "tts": {
    "provider": "ElevenLabs",
    "voice": "Rachel",
    "speed": 1.0,
    "cachingEnabled": true
  }
}
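
If you script against this file, a small validation pass can catch mistakes before the Gateway reads it. The sketch below checks the fields shown above and the 0.5× to 2.0× speed range from the voice picker; treating `provider` and `voice` as required, and the function name itself, are assumptions rather than documented behavior.

```python
import json

SPEED_RANGE = (0.5, 2.0)  # speed range listed for the voice picker


def load_tts_config(raw: str) -> dict:
    """Parse voice.json content and return its validated "tts" section."""
    tts = json.loads(raw)["tts"]
    for key in ("provider", "voice"):  # assumption: both are required
        if not isinstance(tts.get(key), str) or not tts[key]:
            raise ValueError(f"tts.{key} must be a non-empty string")
    speed = tts.get("speed", 1.0)
    is_number = isinstance(speed, (int, float)) and not isinstance(speed, bool)
    if not is_number or not SPEED_RANGE[0] <= speed <= SPEED_RANGE[1]:
        raise ValueError("tts.speed must be a number between 0.5 and 2.0")
    return tts


sample = '{"tts": {"provider": "ElevenLabs", "voice": "Rachel", "speed": 1.0, "cachingEnabled": true}}'
tts = load_tts_config(sample)
```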

Listening lives separately in ~/.sophon/config/stt.json — see Configuring listening. Provider credentials are stored in the credential vault. Add keys in Settings → Connections:

Deepgram API key
ElevenLabs API key
OpenAI API key (reused from your LLM config)
Google Cloud service account JSON
Azure Speech region + key
Sophon Managed Speech endpoint + access token

CLI

sophon voice list-voices --provider elevenlabs
sophon voice speak "Hello, world." --provider openai --voice nova
sophon voice config --provider google --voice en-US-Neural2-F

Limits and gotchas

Microphone permission is required on iOS/Android and in browsers. Browser mic capture requires HTTPS (except localhost).
Adding, removing, or testing STT providers and changing host-wide listening defaults require the Admin role; per-user overrides are available to everyone.
Some TTS providers charge per character. Streaming TTS counts all rendered characters, even if you cancel mid-sentence.
ElevenLabs has rate limits on voice cloning; stock voices are fine at scale.
Only one TTS provider is active at a time, even if several are configured. Picking a new provider in Settings changes it globally.

Where to go next

Mobile App — voice tab and Voice Center specifics
Dashboard — Voice page and Listening tab
Connections — configuring provider credentials
Skills — other audio tools (transcription, format conversion)
