Media Understanding

Inbound audio, images, and video are automatically converted to text so any model can reason over them.

When someone sends a voice note, a photo, or a video through a connected channel, Sophon pre-digests the media into text before the agent reasons about the message. A voice note becomes a transcript, a photo becomes a caption when the active model can't see images, and a video becomes a transcript plus key-frame descriptions. The result: any model — vision-capable or not — can work with anything a user sends.

What gets digested

Media	When	What the agent receives
Audio	Always	A transcript of the recording, folded into the message
Images	Only when the active model lacks vision	A caption describing the image
Video	Always (asynchronous)	Audio-track transcript + captions for sampled key frames, delivered as a follow-up
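
The table can be read as a small decision rule. The following is a minimal Python sketch of that rule, not Sophon's implementation: `plan_digest` and the strings it returns are hypothetical names chosen for illustration, and the sketch assumes all digest switches are left at their defaults.

```python
from typing import Optional


def plan_digest(media_type: str, model_has_vision: bool) -> Optional[str]:
    """Return the digest step for an attachment, or None when it passes through untouched.

    Hypothetical helper: names and return values are illustrative only.
    """
    if media_type == "audio":
        return "transcribe"  # audio is always transcribed
    if media_type == "image":
        # Vision-capable models receive the image natively, so nothing is digested.
        return None if model_has_vision else "caption"
    if media_type == "video":
        # Video is always digested, but asynchronously, as a follow-up.
        return "transcribe+keyframes (async)"
    return None  # not a media type covered by this page
```

Only the image row depends on the active model; audio and video are handled the same way regardless of vision support.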

Audio

Voice notes and audio files are always transcribed. The transcript is appended to the message text, so the agent reads what was said exactly as if it had been typed.

Images

Sophon checks the active model first. Vision-capable models receive the image natively — no re-processing, no extra provider call. Only when the model is text-only does Sophon caption the image through a vision provider and substitute the description. The caption prompt is configurable via ImageDigestPrompt.

Video

Video is too heavy to digest inside a single chat turn, so it runs asynchronously: Sophon transcribes the audio track, samples key frames using scene-change detection, captions each frame, and delivers the combined digest as a follow-up once processing completes. By default six frames are sampled (VideoFrameCount), and each video processing step may run for up to 15 minutes (VideoMaxDurationSeconds) before the digest falls back to a metadata placeholder. Video digestion requires ffmpeg: set FfmpegPath explicitly or make ffmpeg available on PATH.
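
To make the key-frame step concrete, here is a hedged Python sketch of how scene-change sampling is commonly done with ffmpeg's `select` filter and its `scene` score. The exact command Sophon runs is not documented here, so treat this as an assumption; `resolve_ffmpeg` and `keyframe_command` are hypothetical helper names, and the default arguments simply echo the SceneChangeThreshold and VideoFrameCount defaults.

```python
import shutil
from typing import List, Optional


def resolve_ffmpeg(ffmpeg_path: Optional[str] = None) -> Optional[str]:
    """Prefer an explicit path (as FfmpegPath would supply); otherwise search PATH."""
    return ffmpeg_path or shutil.which("ffmpeg")


def keyframe_command(ffmpeg: str, video: str, out_pattern: str,
                     threshold: float = 0.4, frame_count: int = 6) -> List[str]:
    """Build an ffmpeg argument list that keeps frames whose scene score exceeds the threshold."""
    return [
        ffmpeg, "-i", video,
        # The select filter's `scene` value ranges from 0 to 1; higher means a bigger change.
        "-vf", f"select='gt(scene,{threshold})'",
        # Emit only the selected frames (newer ffmpeg builds prefer `-fps_mode vfr`).
        "-vsync", "vfr",
        # Stop once the requested number of frames has been written.
        "-frames:v", str(frame_count),
        out_pattern,
    ]
```

A call such as `keyframe_command(resolve_ffmpeg() or "ffmpeg", "clip.mp4", "frame_%02d.png")` yields a list that can be passed to `subprocess.run`. A lower threshold selects frames on smaller visual changes; a higher one keeps only hard cuts.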

Channel coverage

Pre-digestion happens at the platform level, so every channel that delivers media attachments inherits it automatically; there is nothing to enable per channel.

Configuration

All options live under the MediaUnderstanding configuration section. The defaults below apply out of the box — media understanding is on without any setup.

Key	Default	Description
`Enabled`	`true`	Master switch for all media pre-digestion.
`DigestImages`	`true`	Caption images when the active model lacks vision.
`DigestAudio`	`true`	Transcribe audio attachments.
`DigestVideo`	`true`	Digest video attachments asynchronously.
`MaxImageBytes`	`20000000` (20 MB)	Images above this size are skipped with a note.
`MaxAudioBytes`	`25000000` (25 MB)	Audio above this size is skipped with a note.
`MaxVideoBytes`	`200000000` (200 MB)	Video above this size is skipped with a note.
`VideoFrameCount`	`6`	Number of key frames sampled and captioned per video.
`VideoMaxDurationSeconds`	`900`	Maximum time each video processing step may run before the digest falls back to a metadata placeholder.
`SceneChangeThreshold`	`0.4`	Sensitivity of the scene-change detection used to pick key frames.
`SyncDigestTimeoutSeconds`	`45`	How long a synchronous digest (audio, image) may run before falling back to a placeholder.
`FfmpegPath`	unset	Explicit path to the `ffmpeg` binary. When unset, Sophon looks for `ffmpeg` on `PATH`.
`ImageDigestPrompt`	built-in	Prompt used when captioning an image.
`VideoFrameDigestPrompt`	built-in	Prompt used when captioning video key frames.

Note that channels enforce their own per-attachment caps — often lower than these byte limits — before media ever reaches the digest stage; see Attachment Limits.
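
As a compact view of these defaults, the sketch below mirrors the table in a Python dataclass and shows how an override and a size check might look. The class, its snake_case field names, and `within_limit` are hypothetical stand-ins, not Sophon's configuration API; only the default values come from the table, and the two prompt keys are omitted because their built-in text is not given.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class MediaUnderstandingOptions:
    """Hypothetical mirror of the MediaUnderstanding section; defaults copied from the table."""
    enabled: bool = True                    # Enabled
    digest_images: bool = True              # DigestImages
    digest_audio: bool = True               # DigestAudio
    digest_video: bool = True               # DigestVideo
    max_image_bytes: int = 20_000_000       # MaxImageBytes (20 MB)
    max_audio_bytes: int = 25_000_000       # MaxAudioBytes (25 MB)
    max_video_bytes: int = 200_000_000      # MaxVideoBytes (200 MB)
    video_frame_count: int = 6              # VideoFrameCount
    video_max_duration_seconds: int = 900   # VideoMaxDurationSeconds
    scene_change_threshold: float = 0.4     # SceneChangeThreshold
    sync_digest_timeout_seconds: int = 45   # SyncDigestTimeoutSeconds
    ffmpeg_path: Optional[str] = None       # FfmpegPath; None means "look on PATH"

    def within_limit(self, media_type: str, size_bytes: int) -> bool:
        """True when the file is not above the per-type byte limit."""
        limits = {
            "image": self.max_image_bytes,
            "audio": self.max_audio_bytes,
            "video": self.max_video_bytes,
        }
        return size_bytes <= limits[media_type]


defaults = MediaUnderstandingOptions()
# Example override: a tighter video cap and fewer frames; everything else stays at its default.
tuned = replace(defaults, max_video_bytes=50_000_000, video_frame_count=4)
```

With the defaults, a 30 MB audio file is over its limit while a 30 MB video is not, which is why the limits are kept per media type.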

Graceful degradation

Digestion never blocks message delivery. When media can't be converted, the agent receives a metadata placeholder — file name, type, and size — instead of the digested text:

ffmpeg unavailable — video (and audio formats that need conversion) fall back to the placeholder.
Transcription timeout — if a synchronous digest exceeds SyncDigestTimeoutSeconds, the placeholder is used so the conversation keeps moving.
Oversized media — files above the per-type byte limits are skipped, and the note says so.
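
The timeout and failure cases above follow a familiar pattern: run the digest with a deadline and substitute a placeholder if it does not finish. The sketch below illustrates that pattern in Python under stated assumptions; `placeholder`, `digest_or_placeholder`, and the placeholder wording are hypothetical, and Sophon's actual mechanism and message format may differ.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable


def placeholder(name: str, media_type: str, size_bytes: int, reason: str) -> str:
    """Metadata stand-in (file name, type, size) used when no digest is available."""
    return f"[attachment: {name}, {media_type}, {size_bytes} bytes; not digested: {reason}]"


def digest_or_placeholder(digest: Callable[[], str], name: str, media_type: str,
                          size_bytes: int, timeout_seconds: float = 45.0) -> str:
    """Run a synchronous digest with a deadline; never raise, always return text."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(digest)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        return placeholder(name, media_type, size_bytes, "timed out")
    except Exception as exc:  # e.g. a missing converter or a provider error
        return placeholder(name, media_type, size_bytes, f"failed ({type(exc).__name__})")
    finally:
        # Do not wait for a slow digest; the caller moves on with the placeholder.
        pool.shutdown(wait=False)
```

Because the function always returns text, the message can be delivered whether or not the digest succeeded, which is the property this section describes.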

Because the agent always learns that media arrived, it can tell the user what happened ("I received a 250 MB video, which is over the attachment size limit") rather than silently ignoring the attachment.

Digested text is user content. Transcripts and captions inherit the same prompt-injection wrapping as any other inbound message — instructions hidden in a voice note or an image are treated as untrusted data, never as commands. See Prompt Injection Defense.

Where to go next

Documents — file attachments uploaded through channels are also indexed for search and Q&A
Routing, Failover & Budgets — how Sophon picks the active model, including vision-capable ones
Channels Overview — connect the channels that feed media into your agent
