Voice (STT / TTS)

Talk to Sophon and listen back — speech-to-text input, multi-provider text-to-speech output, and voice session management.

Sophon speaks and listens. On mobile, you can tap the microphone and have the agent respond in synthesized speech. On desktop (and the Electron app), voice mode turns the agent into a hands-free assistant. The STT side is handled by the client platform; the TTS side is pluggable across four providers so you can pick the voice you like.

How it's wired

Mic → STT (client-side) → transcribed text → Gateway → Agent
                                                         ↓
                                        Response text → TTS (server-side)
                                                         ↓
                                                   Audio stream → Speaker

STT happens on the client to keep latency low and avoid streaming raw audio to the Gateway. TTS happens server-side because voice selection, caching, and provider credentials all live there.

Speech-to-text

Sophon uses platform-native STT on the client:

| Platform | STT backend |
| --- | --- |
| Mobile (iOS/Android) | Expo Speech Recognition (native iOS / Android SDKs under the hood) |
| Desktop (Electron) | Chromium SpeechRecognition API, fallback to Whisper via the Voice skill |
| Dashboard (web) | Chromium SpeechRecognition, permission-gated per browser |
| CLI | Not supported (CLI is a text interface) |

Transcription is streamed — you see your words appear in the chat input as you speak, and the agent starts processing as soon as you pause.

For languages that the platform STT does not support well, the agent can fall back to the voice.transcribe skill, which uses a configured Whisper-compatible provider.

Text-to-speech

Four providers are supported out of the box. Configure them in Settings → Voice.

| Provider | Quality | Latency | Cost | Notes |
| --- | --- | --- | --- | --- |
| ElevenLabs | Best | Medium | Highest | Huge voice library, emotion control |
| OpenAI TTS | Very good | Low | Medium | 6 built-in voices |
| Google Cloud TTS | Good | Low | Low | WaveNet + Neural2 voices |
| Azure Speech | Good | Low | Low | Enterprise-friendly SSO |

Each provider has a different voice catalog. You pick:

  • Provider — ElevenLabs / OpenAI / Google / Azure
  • Voice — dropdown pre-filtered by provider (Rachel / Adam / Sarah / …)
  • Language — auto-detected from response, overridable
  • Speed — 0.5× to 2.0×

Streaming TTS

For long responses, the agent streams TTS sentence by sentence:

  1. Gateway receives the agent's response token-by-token from the LLM.
  2. A sentence buffer accumulates text until it hits a terminator (., ?, !, newline).
  3. Each completed sentence is handed off to the TTS provider.
  4. The provider returns an audio chunk (MP3 or OGG).
  5. The chunk is pushed to the client over SignalR.
  6. The client plays chunks in order — starts playing the first sentence while later sentences are still being rendered.

Net effect: ~1 second between "agent starts responding" and "you hear the first word."
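
To make step 2 concrete, here is a minimal sketch of a sentence buffer. It is an illustration and not Sophon's actual Gateway code: the function name `sentences_from_tokens` is hypothetical, and the only detail taken from this page is the terminator set (., ?, !, newline).

```python
from typing import Iterable, Iterator

# Terminators listed in step 2 above.
TERMINATORS = {".", "?", "!", "\n"}


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each completed sentence as soon as its terminator arrives.

    Hypothetical helper: a real implementation would hand each sentence
    to the TTS provider instead of yielding it.
    """
    buffer = ""
    for token in tokens:
        for ch in token:
            buffer += ch
            if ch in TERMINATORS:
                sentence = buffer.strip()
                if sentence:  # skip empty runs such as a bare newline
                    yield sentence
                buffer = ""
    # Flush whatever is left when the LLM stream ends without a terminator.
    tail = buffer.strip()
    if tail:
        yield tail


tokens = ["Hel", "lo the", "re. How", " are you", "?\nFine"]
print(list(sentences_from_tokens(tokens)))
# → ['Hello there.', 'How are you?', 'Fine']
```

A splitter this naive also breaks on decimals and abbreviations ("3.14", "e.g."), so a production buffer would likely need extra rules. The point here is only that the first sentence can be sent to TTS before the rest of the response exists.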

Voice mode (mobile)

On the mobile app, the Voice tab is a full-screen voice experience:

  • Animated orb indicates listening / thinking / speaking
  • Long-press to talk, release to send (push-to-talk), or tap to start hands-free mode
  • Real-time transcription renders as you speak
  • Responses play through device audio with visual waveform feedback
  • Tap orb to interrupt / cancel

Quiet hours and push-notification rules don't apply — voice mode is user-initiated and opt-in per session.

Voice on desktop

The Electron desktop app supports a global push-to-talk keyboard shortcut (default: Ctrl+Shift+Space). Hold the shortcut, speak, and release; the agent processes your audio regardless of which app has focus. The response plays through your default audio device.

Configure the shortcut in Settings → Voice → Push-to-talk shortcut.

Caching

TTS renders are cached by (provider, voice, language, text-hash). Repeated requests — e.g., a greeting the agent uses often — don't re-render. Cache lives at ~/.sophon/cache/tts/ with a 30-day TTL and 1 GB cap.
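
As a rough illustration of how a key over (provider, voice, language, text-hash) could work, here is a small sketch. Only the key fields, the cache directory, and the 30-day TTL come from this page. The hash function (SHA-256), the field separator, the `.audio` file extension, and all function names are assumptions made for the example.

```python
import hashlib
from pathlib import Path

CACHE_DIR = Path.home() / ".sophon" / "cache" / "tts"  # location from this page
TTL_SECONDS = 30 * 24 * 60 * 60                         # 30-day TTL from this page


def cache_key(provider: str, voice: str, language: str, text: str) -> str:
    """Derive a stable key from (provider, voice, language, text-hash)."""
    # Assumption: SHA-256 is used for the text hash.
    text_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
    # Assumption: fields are joined with a separator that cannot appear in
    # a voice name, then hashed again to get a filesystem-safe key.
    joined = "\x1f".join((provider, voice, language, text_hash))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def cache_path(key: str) -> Path:
    """Map a key to a file; the '.audio' extension is a placeholder."""
    return CACHE_DIR / f"{key}.audio"


def is_fresh(rendered_at: float, now: float) -> bool:
    """True while a render is younger than the 30-day TTL."""
    return now - rendered_at < TTL_SECONDS


key = cache_key("ElevenLabs", "Rachel", "en", "Good morning!")
print(cache_path(key))
```

Because the text is hashed rather than stored in the filename, the same greeting rendered with a different voice or provider gets its own entry, which matches the "repeated requests don't re-render" behavior described above.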

Agent tool

Agents have a voice.* tool family (optional — not enabled by default):

  • voice.speak(text, voice?) — generate speech and play it (in sessions that support playback)
  • voice.transcribe(audioUrl) — force transcription of an audio file
  • voice.list_voices(provider?) — enumerate available voices

Enable these in Agents → Tools if you want the agent to proactively speak beyond its chat responses — e.g., "announce this over voice when the build finishes."

Configuration

~/.sophon/config/voice.json:

{
  "tts": {
    "provider": "ElevenLabs",
    "voice": "Rachel",
    "speed": 1.0,
    "cachingEnabled": true
  },
  "stt": {
    "preferPlatformNative": true,
    "whisperFallback": true
  }
}
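
If you edit this file by hand, a quick sanity check can catch typos before the Gateway reads it. The sketch below is not part of Sophon. The function name `check_voice_config` is hypothetical, the provider identifiers are assumed from the CLI examples ("azure" is a guess by analogy), and the 0.5 to 2.0 speed range is taken from the Speed setting described earlier.

```python
import json

# Assumed identifiers: the CLI examples show "elevenlabs", "openai", and
# "google"; "azure" is a guess by analogy. Matching is case-insensitive
# here because the sample config writes "ElevenLabs".
KNOWN_PROVIDERS = {"elevenlabs", "openai", "google", "azure"}


def check_voice_config(config: dict) -> list[str]:
    """Return a list of problems; an empty list means the config looks sane."""
    problems = []
    tts = config.get("tts", {})
    provider = str(tts.get("provider", ""))
    if provider.lower() not in KNOWN_PROVIDERS:
        problems.append(f"unknown TTS provider: {provider!r}")
    # Assumption: a missing speed is treated as 1.0.
    speed = tts.get("speed", 1.0)
    if not isinstance(speed, (int, float)) or not 0.5 <= speed <= 2.0:
        problems.append(f"speed must be between 0.5 and 2.0, got {speed!r}")
    stt = config.get("stt", {})
    for flag in ("preferPlatformNative", "whisperFallback"):
        if flag in stt and not isinstance(stt[flag], bool):
            problems.append(f"stt.{flag} must be true or false")
    return problems


example = json.loads("""
{
  "tts": {"provider": "ElevenLabs", "voice": "Rachel", "speed": 1.0, "cachingEnabled": true},
  "stt": {"preferPlatformNative": true, "whisperFallback": true}
}
""")
print(check_voice_config(example))  # → []
```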

Provider credentials are stored in the credential vault. Add keys in Settings → Connections:

  • ElevenLabs API key
  • OpenAI API key (reused from your LLM config)
  • Google Cloud service account JSON
  • Azure Speech region + key

CLI

sophon voice list-voices --provider elevenlabs
sophon voice speak "Hello, world." --provider openai --voice nova
sophon voice config --provider google --voice en-US-Neural2-F

Limits and gotchas

  • Mobile STT requires microphone permission on iOS/Android.
  • Expo Go does not support custom STT — you need a development or production build.
  • Browser STT requires HTTPS (except localhost) per browser security policy.
  • Some TTS providers charge per character. Streaming TTS counts all rendered characters, even if you cancel mid-sentence.
  • ElevenLabs has rate limits on voice cloning; stock voices are fine at scale.
  • Only one TTS provider is active at a time, even if several are configured. Picking a new provider in Settings applies globally.
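
To illustrate the per-character billing point above: what counts is the text handed to the provider, not the audio you actually heard. The sketch below is a simplified assumption for illustration only. The helper name is hypothetical, and providers differ in how they count characters (for example, whether markup or whitespace is included).

```python
def billed_characters(rendered_sentences: list[str]) -> int:
    """Count characters for every sentence already sent to the TTS provider.

    Hypothetical helper: playback state is deliberately ignored, because a
    sentence that was rendered is counted even if you interrupt it.
    """
    return sum(len(sentence) for sentence in rendered_sentences)


# Three sentences were rendered; the user tapped the orb during the second.
rendered = ["Build finished.", "All 212 tests passed.", "Deploying now."]
print(billed_characters(rendered))  # → 50
```

Sentences still sitting in the sentence buffer when you cancel were never rendered, so they would not count under this model.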

Where to go next

  • Mobile App — voice tab specifics
  • Connections — configuring provider credentials
  • Skills — other audio tools (transcription, format conversion)