Voice (STT / TTS)

Talk to Sophon and listen back — real-time server-side speech-to-text, multi-provider text-to-speech output, and voice session management.

Sophon speaks and listens. Tap the microphone on mobile or in the Dashboard and your words appear on screen as you say them; the agent answers in synthesized speech. Both directions run through the server: mic audio streams to the Gateway in real time, the Gateway bridges to a streaming STT provider, and transcripts flow back live. Listening behaves identically on every client, and provider credentials never leave the host.

How it's wired

Mic → audio stream (real time) → Gateway → streaming STT provider
                                    ├── interim transcripts → client (live, as you speak)
                                    └── final transcript → Agent
                                                             ↓
                                                  Response text → TTS (server-side)
                                                             ↓
                                                  Audio stream → Speaker

The client's only job is to capture mic audio and play audio back. Transcription, provider selection, caching, and credentials all live server-side — which is also why the Dashboard now has listening without depending on browser speech APIs.

Listening (speech-to-text)

Interim transcripts render live as you speak, then finalize in one of two utterance modes:

  • Auto (hands-free) — the provider's voice-activity detection ends the utterance on a natural pause; the final transcript is sent to the agent automatically.
  • Manual (push-to-talk) — you control the boundary: hold to talk, and the utterance finalizes when you send.

Six STT providers are supported:

| Provider | Notes |
|---|---|
| Deepgram | Streaming-first; lowest interim-transcript latency |
| OpenAI | Reuses your OpenAI API key |
| Azure Speech | Same region + key as Azure TTS |
| Google Cloud Speech | Same service-account JSON as Google TTS |
| ElevenLabs | One key covers both STT and TTS |
| Sophon Managed Speech | Self-hosted: one endpoint + access token serves both STT and TTS, with a built-in health check |

Per client:

| Client | Listening |
|---|---|
| Dashboard | New: Voice page Listening tab with provider status and live transcripts |
| Mobile | Same unified server path (replaces the previous device-native STT) |
| Desktop (Electron) | Global push-to-talk shortcut, same server path |
| CLI | Text-only (the CLI is a text interface) |

Configuring listening

Host-wide defaults live in stt.json under ~/.sophon/config/ and are managed from Settings → Voice → Listening or the REST API — no hand-editing needed. Provider API keys in stt.json are encrypted at rest.

| Host setting | Range / default | What it does |
|---|---|---|
| Provider list | | Configured STT providers with credentials |
| Default provider | | Handles utterances unless a user overrides |
| Default language | e.g. en-US | Transcription language hint |
| Interim results | on | Show the transcript live as you speak |
| Endpointing | 10–10,000 ms | Pause length that ends an utterance in Auto mode |
| Max utterance | 1–300 s, default 60 | Hard cap on a single utterance |

Each user can override the provider, interim results, endpointing, and max utterance in Voice Center → Listening (Dashboard and Mobile). Leave a field blank to inherit the host default.
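
The inheritance rule can be pictured with a short sketch. This is illustrative only and is not Sophon's implementation: the class and field names are hypothetical, and only the overridable fields, the "blank inherits the host default" rule, and the numeric ranges come from the tables above.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class ListeningSettings:
    """Effective listening settings (hypothetical field names)."""
    provider: str
    endpointing_ms: int            # documented range: 10-10,000 ms
    interim_results: bool = True   # documented default: on
    max_utterance_s: int = 60      # documented range: 1-300 s, default 60


@dataclass(frozen=True)
class UserOverrides:
    """Per-user overrides; None stands for a field left blank."""
    provider: Optional[str] = None
    endpointing_ms: Optional[int] = None
    interim_results: Optional[bool] = None
    max_utterance_s: Optional[int] = None


def resolve(host: ListeningSettings, user: UserOverrides) -> ListeningSettings:
    """Merge user overrides onto host defaults and validate the ranges."""
    merged = {}
    for f in fields(ListeningSettings):
        override = getattr(user, f.name)
        # Compare against None so an explicit False still counts as an override.
        merged[f.name] = getattr(host, f.name) if override is None else override
    if not 10 <= merged["endpointing_ms"] <= 10_000:
        raise ValueError("endpointing must be between 10 and 10,000 ms")
    if not 1 <= merged["max_utterance_s"] <= 300:
        raise ValueError("max utterance must be between 1 and 300 s")
    return ListeningSettings(**merged)
```

For example, a user who only turns off interim results keeps the host's provider, endpointing, and utterance cap.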

REST surface:

| Endpoint | Purpose |
|---|---|
| GET / POST / DELETE /api/stt/providers | List, add, remove STT providers |
| POST /api/stt/providers/{id}/test | Health-check a provider |
| GET /api/stt/providers/{id}/languages | Enumerate supported languages |
| GET / PUT /api/stt/settings | Read / update host-wide defaults |
| GET / PUT /api/voice/preferences | Read / update per-user voice preferences |
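
As a rough illustration of calling these endpoints, the sketch below builds requests with Python's standard library. Only the paths and HTTP methods come from the table; the base URL, the bearer-token header, and the assumption that responses are JSON are placeholders to adapt to your deployment.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "http://localhost:8080"  # hypothetical Gateway address


def stt_request(method: str, path: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a request against the REST surface."""
    return urllib.request.Request(
        url=f"{BASE_URL}{path}",
        method=method,
        headers={
            "Authorization": f"Bearer {token}",  # auth scheme is an assumption
            "Accept": "application/json",
        },
    )


def provider_test_request(provider_id: str, token: str) -> urllib.request.Request:
    """POST /api/stt/providers/{id}/test: health-check a provider."""
    return stt_request("POST", f"/api/stt/providers/{quote(provider_id, safe='')}/test", token)


def provider_languages_request(provider_id: str, token: str) -> urllib.request.Request:
    """GET /api/stt/providers/{id}/languages: enumerate supported languages."""
    return stt_request("GET", f"/api/stt/providers/{quote(provider_id, safe='')}/languages", token)


def send(request: urllib.request.Request):
    """Send a request and decode the body, assuming the server replies with JSON."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

Keeping request construction separate from `send` makes it easy to inspect the URL and method before anything touches the network.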

Fallbacks and budgets

If no STT provider is configured, the mic button shows a graceful status message and you can type instead. No audio is uploaded and no error is thrown.

  • Utterance caps — each utterance is bounded by the max-utterance duration and a byte-size cap, so a stuck mic can't exhaust resources. Speech that hits the cap is still transcribed and answered.
  • Barge-in — tap the mic again (Mobile) or click Stop (Dashboard) to cancel mid-utterance. The in-flight stream is cancelled cleanly; an empty transcript is never sent to the agent.
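
One way to picture these two rules is the sketch below. It is a simplified model, not the Gateway's code: the byte cap value and the audio format (16 kHz, 16-bit mono PCM) are assumptions, since only the 60-second default is documented here.

```python
from typing import Optional


class UtteranceBudget:
    """Tracks one utterance against a duration cap and a byte cap."""

    def __init__(self, max_seconds: float = 60.0,
                 max_bytes: int = 4 * 1024 * 1024,   # assumed value; not documented
                 bytes_per_second: int = 32_000):    # assumed 16 kHz, 16-bit mono PCM
        # The effective limit is whichever cap is reached first.
        self.limit = min(max_bytes, int(max_seconds * bytes_per_second))
        self.received = 0
        self.cancelled = False

    def accept(self, chunk: bytes) -> bool:
        """Return True if the chunk fits; False once a cap is hit or after barge-in."""
        if self.cancelled or self.received + len(chunk) > self.limit:
            return False
        self.received += len(chunk)
        return True

    def cancel(self) -> None:
        """Barge-in: the user stopped the utterance."""
        self.cancelled = True


def transcript_to_send(budget: UtteranceBudget, transcript: str) -> Optional[str]:
    """Return the text to forward to the agent, or None if nothing should be sent."""
    text = transcript.strip()
    if budget.cancelled or not text:
        return None  # cancelled or empty transcripts never reach the agent
    return text      # speech that hit the cap is still forwarded
```

Note that hitting a cap only stops further audio from being accepted; the transcript gathered so far is still forwarded, whereas a cancel discards it.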

Text-to-speech

Six providers are supported out of the box. Configure them in Settings → Voice.

| Provider | Quality | Latency | Cost | Notes |
|---|---|---|---|---|
| ElevenLabs | Best | Medium | Highest | Huge voice library, emotion control |
| OpenAI TTS | Very good | Low | Medium | 6 built-in voices |
| Google Cloud TTS | Good | Low | Low | WaveNet + Neural2 voices |
| Azure Speech | Good | Low | Low | Enterprise-friendly SSO |
| Deepgram | Good | Lowest | Low | Aura voices; streaming-first, pairs with its STT side |
| Sophon Managed Speech | Good | Low | Self-hosted | Same endpoint + token as its STT side |

Each provider has a different voice catalog. You pick:

  • Provider — ElevenLabs / OpenAI / Google / Azure / Deepgram / Managed
  • Voice — dropdown pre-filtered by provider (Rachel / Adam / Sarah / …)
  • Language — auto-detected from response, overridable
  • Speed — 0.5× to 2.0×

Streaming TTS

For long responses, the Gateway streams TTS sentence by sentence:

  1. Gateway receives the agent's response token-by-token from the LLM.
  2. A sentence buffer accumulates text until it hits a terminator (., ?, !, newline).
  3. Each completed sentence is handed off to the TTS provider.
  4. The provider returns an audio chunk (MP3 or OGG).
  5. The chunk is pushed to the client over SignalR.
  6. The client plays chunks in order, so the first sentence starts playing while later sentences are still being rendered.

Net effect: ~1 second between "agent starts responding" and "you hear the first word."
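
The sentence buffer in step 2 can be sketched as a small generator. This is a minimal illustration, not the Gateway's implementation: the function name is hypothetical, and the splitting is deliberately naive (it would also break on the period in "3.14" or in abbreviations).

```python
from typing import Iterable, Iterator

# Terminators listed in step 2 of the pipeline.
TERMINATORS = {".", "?", "!", "\n"}


def sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Accumulate streamed tokens and yield each sentence as soon as it completes."""
    buffer = ""
    for token in tokens:
        for ch in token:
            buffer += ch
            if ch in TERMINATORS:
                sentence = buffer.strip()
                buffer = ""
                if sentence:          # skip blank lines and stray whitespace
                    yield sentence
    tail = buffer.strip()
    if tail:                          # flush text left over when the stream ends
        yield tail
```

Because it is a generator, the first sentence can be handed to the TTS provider while the LLM is still producing the rest of the response.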

Voice mode (mobile)

On the mobile app, the Voice tab is a full-screen voice experience:

  • Animated orb indicates listening / thinking / speaking
  • Long-press to talk, release to send (Manual / push-to-talk), or tap once for hands-free (Auto)
  • Interim transcripts render live in the conversation as you speak — the same server transcription path as the Dashboard
  • Responses play through device audio with visual waveform feedback
  • Tap orb to interrupt / cancel (barge-in)

Voice Center — reachable from the Voice tab and More → Settings — holds all listening and speech settings, including your per-user overrides. Quiet hours and push-notification rules don't apply — voice mode is user-initiated and opt-in per session.

Voice on desktop

The Electron desktop app supports a global push-to-talk keyboard shortcut (default: Ctrl+Shift+Space). Hold the shortcut, speak, and release; the agent processes your audio no matter which app has focus. The response plays through your default audio device.

Configure the shortcut in Settings → Voice → Push-to-talk shortcut.

Caching

Two layers, both server-side:

  • Disk — TTS renders are cached by (provider, voice, language, text-hash). Repeated requests — e.g., a greeting the agent uses often — don't re-render. Cache lives at ~/.sophon/cache/tts/ with a 30-day TTL and 1 GB cap.
  • Memory — a byte-bounded LRU cache (default 32 MB) holds recently synthesized audio, so repeated lines (greetings, status updates) skip provider round-trips and re-billing entirely. It's cleared on restart; no audio is written to disk by this layer.
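
To make the two ideas concrete, here is a sketch of a cache key built from (provider, voice, language, text-hash) and a byte-bounded LRU. It is an illustration under assumptions: the hash algorithm (SHA-256) and the class and function names are hypothetical, and only the key components and the 32 MB default come from the description above.

```python
import hashlib
from collections import OrderedDict
from typing import Optional, Tuple

CacheKey = Tuple[str, str, str, str]


def cache_key(provider: str, voice: str, language: str, text: str) -> CacheKey:
    """Build a (provider, voice, language, text-hash) key; SHA-256 is an assumption."""
    text_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return (provider, voice, language, text_hash)


class ByteBoundedLRU:
    """LRU cache bounded by total payload bytes rather than entry count."""

    def __init__(self, max_bytes: int = 32 * 1024 * 1024):
        self.max_bytes = max_bytes
        self._items = OrderedDict()   # key -> audio bytes, oldest first
        self._size = 0

    def get(self, key: CacheKey) -> Optional[bytes]:
        audio = self._items.get(key)
        if audio is not None:
            self._items.move_to_end(key)   # mark as most recently used
        return audio

    def put(self, key: CacheKey, audio: bytes) -> None:
        if len(audio) > self.max_bytes:
            return                          # larger than the whole budget; skip
        previous = self._items.pop(key, None)
        if previous is not None:
            self._size -= len(previous)
        self._items[key] = audio
        self._size += len(audio)
        while self._size > self.max_bytes:  # evict least recently used entries
            _, evicted = self._items.popitem(last=False)
            self._size -= len(evicted)
```

Bounding by bytes instead of entry count keeps memory use predictable even when a few long responses produce large audio clips.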

Agent tool

Agents have a voice.* tool family (optional — not enabled by default):

  • voice.speak(text, voice?) — generate speech and play it (in sessions that support playback)
  • voice.transcribe(audioUrl) — force transcription of an audio file
  • voice.list_voices(provider?) — enumerate available voices

Enable these in Agents → Tools if you want the agent to proactively speak beyond its chat responses — e.g., "announce this over voice when the build finishes."

Configuration

~/.sophon/config/voice.json covers speech output:

{
  "tts": {
    "provider": "ElevenLabs",
    "voice": "Rachel",
    "speed": 1.0,
    "cachingEnabled": true
  }
}
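
If you script against this file, a small parser such as the sketch below can catch out-of-range values early. It is illustrative only: the class and function names are hypothetical, the fallback values for missing keys are assumptions, and the 0.5 to 2.0 speed range is taken from the voice options listed earlier.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class TtsConfig:
    provider: str
    voice: str
    speed: float
    caching_enabled: bool


def parse_tts_config(raw: str) -> TtsConfig:
    """Parse the "tts" section of voice.json and validate the speed range."""
    tts = json.loads(raw)["tts"]
    speed = float(tts.get("speed", 1.0))          # fallback of 1.0 is an assumption
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be between 0.5 and 2.0")
    return TtsConfig(
        provider=tts["provider"],
        voice=tts["voice"],
        speed=speed,
        caching_enabled=bool(tts.get("cachingEnabled", True)),  # fallback assumed
    )
```

Pass it the file's text, for example the contents read from ~/.sophon/config/voice.json.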

Listening lives separately in ~/.sophon/config/stt.json — see Configuring listening. Provider credentials are stored in the credential vault. Add keys in Settings → Connections:

  • Deepgram API key
  • ElevenLabs API key
  • OpenAI API key (reused from your LLM config)
  • Google Cloud service account JSON
  • Azure Speech region + key
  • Sophon Managed Speech endpoint + access token

CLI

sophon voice list-voices --provider elevenlabs
sophon voice speak "Hello, world." --provider openai --voice nova
sophon voice config --provider google --voice en-US-Neural2-F

Limits and gotchas

  • Microphone permission is required on iOS/Android and in browsers. Browser mic capture requires HTTPS (except localhost).
  • Adding, removing, or testing STT providers and changing host-wide listening defaults require the Admin role; per-user overrides are available to everyone.
  • Some TTS providers charge per character. Streaming TTS counts all rendered characters, even if you cancel mid-sentence.
  • ElevenLabs has rate limits on voice cloning; stock voices are fine at scale.
  • If multiple TTS providers are configured, only one is active at a time; selecting a new provider in Settings applies globally.

Where to go next

  • Mobile App — voice tab and Voice Center specifics
  • Dashboard — Voice page and Listening tab
  • Connections — configuring provider credentials
  • Skills — other audio tools (transcription, format conversion)