Media Understanding

Inbound audio, images, and video are automatically converted to text so any model can reason over them.

When someone sends a voice note, a photo, or a video through a connected channel, Sophon pre-digests the media into text before the agent reasons about the message. A voice note becomes a transcript, a photo becomes a caption when the active model can't see images, and a video becomes a transcript plus key-frame descriptions. The result: any model — vision-capable or not — can work with anything a user sends.

What gets digested

| Media | When | What the agent receives |
| --- | --- | --- |
| Audio | Always | A transcript of the recording, folded into the message |
| Images | Only when the active model lacks vision | A caption describing the image |
| Video | Always (asynchronous) | Audio-track transcript + captions for sampled key frames, delivered as a follow-up |

Audio

Voice notes and audio files are always transcribed. The transcript is appended to the message text, so the agent reads what was said exactly as if it had been typed.

Images

Sophon checks the active model first. Vision-capable models receive the image natively — no re-processing, no extra provider call. Only when the model is text-only does Sophon caption the image through a vision provider and substitute the description. The caption prompt is configurable via ImageDigestPrompt.
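The routing rule described above can be sketched in a few lines. This is a minimal illustration, not Sophon's implementation: `route_image`, `NativeImage`, `ImageCaption`, and the `caption_fn` callback are hypothetical names introduced here to stand in for the vision-provider call.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class NativeImage:
    """Image passed through untouched for a vision-capable model."""
    data: bytes


@dataclass
class ImageCaption:
    """Text description substituted for the image."""
    text: str


def route_image(
    data: bytes,
    model_has_vision: bool,
    caption_fn: Callable[[bytes], str],
) -> Union[NativeImage, ImageCaption]:
    # Vision-capable model: hand the image over as-is, with no provider call.
    if model_has_vision:
        return NativeImage(data)
    # Text-only model: caption the image and substitute the description.
    return ImageCaption(caption_fn(data))
```

The point of the sketch is that the caption provider is only ever invoked on the text-only branch, which matches the "no extra provider call" behavior described for vision-capable models.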

Video

Video is too heavy to digest inside a single chat turn, so it runs asynchronously: Sophon transcribes the audio track, samples key frames using scene-change detection, captions each frame, and delivers the combined digest as a follow-up once it completes. By default six frames are sampled (VideoFrameCount), and each video processing step may run for up to 15 minutes (VideoMaxDurationSeconds) before the digest falls back to a metadata placeholder. Video digestion requires ffmpeg: set FfmpegPath explicitly or make ffmpeg available on PATH.
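To make the ffmpeg lookup and scene-change sampling concrete, here is a rough sketch of how such a command could be assembled. It is an assumption about one plausible approach, not the command Sophon actually runs: the helper names are hypothetical, and only the ffmpeg `select` filter with its `scene` score, `-vsync vfr`, and `-frames:v` are standard ffmpeg features.

```python
import shutil
from typing import List, Optional


def resolve_ffmpeg(ffmpeg_path: Optional[str] = None) -> Optional[str]:
    """Return the explicit path if one is set, else search PATH; None if absent."""
    if ffmpeg_path:
        return ffmpeg_path
    return shutil.which("ffmpeg")


def keyframe_command(
    ffmpeg: str,
    video: str,
    out_pattern: str,
    frame_count: int = 6,
    scene_threshold: float = 0.4,
) -> List[str]:
    """Build an ffmpeg argv that keeps frames whose scene score exceeds the threshold."""
    return [
        ffmpeg,
        "-i", video,
        # `scene` is ffmpeg's per-frame scene-change score between 0 and 1.
        "-vf", f"select='gt(scene,{scene_threshold})'",
        # Emit only the selected frames (newer ffmpeg builds spell this -fps_mode vfr).
        "-vsync", "vfr",
        # Stop after the requested number of frames.
        "-frames:v", str(frame_count),
        out_pattern,
    ]
```

The defaults mirror VideoFrameCount and SceneChangeThreshold. A caller would check `resolve_ffmpeg(...)` for `None` first and fall back to the metadata placeholder when ffmpeg is missing.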

Channel coverage

Pre-digestion happens at the platform level, so every channel that delivers media attachments inherits it automatically; there is nothing to enable per channel.

Configuration

All options live under the MediaUnderstanding configuration section. The defaults below apply out of the box — media understanding is on without any setup.

| Key | Default | Description |
| --- | --- | --- |
| Enabled | true | Master switch for all media pre-digestion. |
| DigestImages | true | Caption images when the active model lacks vision. |
| DigestAudio | true | Transcribe audio attachments. |
| DigestVideo | true | Digest video attachments asynchronously. |
| MaxImageBytes | 20000000 (20 MB) | Images above this size are skipped with a note. |
| MaxAudioBytes | 25000000 (25 MB) | Audio above this size is skipped with a note. |
| MaxVideoBytes | 200000000 (200 MB) | Video above this size is skipped with a note. |
| VideoFrameCount | 6 | Number of key frames sampled and captioned per video. |
| VideoMaxDurationSeconds | 900 | Maximum time each video processing step may run before the digest falls back to a metadata placeholder. |
| SceneChangeThreshold | 0.4 | Sensitivity of the scene-change detection used to pick key frames. |
| SyncDigestTimeoutSeconds | 45 | How long a synchronous digest (audio, image) may run before falling back to a placeholder. |
| FfmpegPath | unset | Explicit path to the ffmpeg binary. When unset, Sophon looks for ffmpeg on PATH. |
| ImageDigestPrompt | built-in | Prompt used when captioning an image. |
| VideoFrameDigestPrompt | built-in | Prompt used when captioning video key frames. |

Note that channels enforce their own per-attachment caps — often lower than these byte limits — before media ever reaches the digest stage; see Attachment Limits.
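As a way to reason about overrides, the defaults can be modeled as a plain mapping that user settings are layered over. This is only a sketch: the on-disk configuration format is not specified here, `resolve_options` is a hypothetical helper, and `None` stands in for "unset" or "built-in" values.

```python
from typing import Any, Dict, Mapping, Optional

# Keys and defaults copied from the MediaUnderstanding section; None means unset/built-in.
DEFAULTS: Dict[str, Any] = {
    "Enabled": True,
    "DigestImages": True,
    "DigestAudio": True,
    "DigestVideo": True,
    "MaxImageBytes": 20_000_000,
    "MaxAudioBytes": 25_000_000,
    "MaxVideoBytes": 200_000_000,
    "VideoFrameCount": 6,
    "VideoMaxDurationSeconds": 900,
    "SceneChangeThreshold": 0.4,
    "SyncDigestTimeoutSeconds": 45,
    "FfmpegPath": None,
    "ImageDigestPrompt": None,
    "VideoFrameDigestPrompt": None,
}


def resolve_options(overrides: Optional[Mapping[str, Any]] = None) -> Dict[str, Any]:
    """Layer user overrides on top of the defaults, rejecting misspelled keys."""
    overrides = dict(overrides or {})
    unknown = sorted(set(overrides) - set(DEFAULTS))
    if unknown:
        raise KeyError(f"Unknown MediaUnderstanding keys: {unknown}")
    return {**DEFAULTS, **overrides}
```

For example, `resolve_options({"DigestVideo": False, "VideoFrameCount": 4})` leaves every other default in place, which mirrors the "on without any setup" behavior.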

Graceful degradation

Digestion never blocks message delivery. When media can't be converted, the agent receives a metadata placeholder — file name, type, and size — instead of the digested text:

  • ffmpeg unavailable — video (and audio formats that need conversion) fall back to the placeholder.
  • Transcription timeout — if a synchronous digest exceeds SyncDigestTimeoutSeconds, the placeholder is used so the conversation keeps moving.
  • Oversized media — files above the per-type byte limits are skipped, and the note says so.

Because the agent always learns that media arrived, it can tell the user what happened ("I received a 30 MB video, which is over the attachment size limit") rather than silently ignoring the attachment.
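The size-limit and timeout fallbacks can be pictured with a small asyncio sketch. Everything named here (`Attachment`, `placeholder`, `digest_or_placeholder`, and the placeholder wording) is hypothetical and chosen for illustration; only the behavior, returning metadata instead of failing, comes from the description above.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass
class Attachment:
    name: str
    media_type: str
    data: bytes


def placeholder(att: Attachment, reason: str) -> str:
    """Metadata-only stand-in: file name, type, and size, plus why digestion was skipped."""
    return f"[{att.media_type} attachment: {att.name}, {len(att.data)} bytes; {reason}]"


async def digest_or_placeholder(
    att: Attachment,
    digest: Callable[[Attachment], Awaitable[str]],
    max_bytes: int,
    timeout_seconds: float = 45.0,
) -> str:
    # Oversized media is never digested; the note says why.
    if len(att.data) > max_bytes:
        return placeholder(att, "skipped: over size limit")
    try:
        # Bound the digest so a slow provider cannot stall the conversation.
        return await asyncio.wait_for(digest(att), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        return placeholder(att, "digest timed out")
```

In both failure cases the caller still gets a string to fold into the message, so the agent learns that media arrived even when no transcript or caption is available.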

Digested text is user content. Transcripts and captions inherit the same prompt-injection wrapping as any other inbound message — instructions hidden in a voice note or an image are treated as untrusted data, never as commands. See Prompt Injection Defense.

Where to go next

  • Documents — file attachments uploaded through channels are also indexed for search and Q&A
  • Routing, Failover & Budgets — how Sophon picks the active model, including vision-capable ones
  • Channels Overview — connect the channels that feed media into your agent