Media Generation

Sophon ships three built-in generation tools that turn text prompts into media: image.generate, video.generate, and music.generate. Each one sends the request to one of several external AI providers, downloads the result, and saves it into your documents library so the agent (or you) can reference it, attach it, or push it to the Canvas.

All three are registered as agent tools under Skills & Tools and run at Medium risk, so they pass through the normal approval flow before spending provider credits.

`image.generate`

Generates a still image from a prompt and writes a PNG to the output directory.

Providers — dalle (OpenAI DALL-E), fal (Fal.ai), stability (Stability AI), imagen (Google Imagen), replicate, together (Together AI). Defaults to dalle.
Key parameters — prompt (required), provider, model (e.g. dall-e-3, flux/schnell), and size (256x256 through 1792x1024). style (vivid / natural) and quality (standard / hd) apply to DALL-E 3 only.
Output — a single PNG saved with a timestamped filename into the documents library.
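
To make the parameter rules above concrete, here is a minimal sketch of how an argument payload for image.generate could be assembled. The helper name `build_image_args` is hypothetical, and two behaviors are assumptions of this sketch rather than documented Sophon behavior: style and quality are silently dropped unless the model is explicitly `dall-e-3`, and size is passed through without validation.

```python
from typing import Optional

# Provider identifiers as listed for image.generate.
IMAGE_PROVIDERS = {"dalle", "fal", "stability", "imagen", "replicate", "together"}


def build_image_args(
    prompt: str,
    provider: str = "dalle",  # documented default provider
    model: Optional[str] = None,
    size: Optional[str] = None,
    style: Optional[str] = None,
    quality: Optional[str] = None,
) -> dict:
    """Hypothetical helper: assemble an argument payload for image.generate."""
    if not prompt.strip():
        raise ValueError("prompt is required")
    if provider not in IMAGE_PROVIDERS:
        raise ValueError(f"unknown provider: {provider}")

    args = {"prompt": prompt, "provider": provider}
    if model is not None:
        args["model"] = model
    if size is not None:
        args["size"] = size

    # style and quality apply to DALL-E 3 only. Dropping them for other
    # models (instead of raising) is an assumption of this sketch.
    if provider == "dalle" and model == "dall-e-3":
        if style is not None:
            if style not in ("vivid", "natural"):
                raise ValueError("style must be 'vivid' or 'natural'")
            args["style"] = style
        if quality is not None:
            if quality not in ("standard", "hd"):
                raise ValueError("quality must be 'standard' or 'hd'")
            args["quality"] = quality
    return args
```

For example, `build_image_args("a red fox", provider="fal", model="flux/schnell", style="vivid")` returns a payload without a style key, because that option only applies to DALL-E 3.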

Most providers return image data synchronously. The replicate path is asynchronous: it creates a prediction and polls until the job reports succeeded (or times out), then downloads the resulting image.

`video.generate`

Generates a short video clip from a prompt, or animates a source image.

Providers — runway (RunwayML), luma (Luma Dream Machine), kling (KlingAI). Defaults to runway.
Modes — text-to-video (default) or image-to-video. The image mode requires an imageUrl.
Key parameters — prompt (required), provider, mode, imageUrl, duration (seconds, default 5), and aspectRatio (16:9, 9:16, 1:1).
Output — an MP4 downloaded into the documents library.
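
The mode rule above (image-to-video needs an imageUrl) can be sketched as a small validation step. The helper `build_video_args` is hypothetical, and the mode strings `text-to-video` and `image-to-video` are assumed to be the literal values the tool accepts.

```python
from typing import Optional

VIDEO_PROVIDERS = {"runway", "luma", "kling"}
VIDEO_MODES = {"text-to-video", "image-to-video"}  # assumed literal values
ASPECT_RATIOS = {"16:9", "9:16", "1:1"}


def build_video_args(
    prompt: str,
    provider: str = "runway",      # documented default provider
    mode: str = "text-to-video",   # documented default mode
    image_url: Optional[str] = None,
    duration: int = 5,             # documented default, in seconds
    aspect_ratio: Optional[str] = None,
) -> dict:
    """Hypothetical helper: validate and assemble video.generate arguments."""
    if not prompt.strip():
        raise ValueError("prompt is required")
    if provider not in VIDEO_PROVIDERS:
        raise ValueError(f"unknown provider: {provider}")
    if mode not in VIDEO_MODES:
        raise ValueError(f"unknown mode: {mode}")
    if mode == "image-to-video" and not image_url:
        raise ValueError("imageUrl is required for image-to-video")
    if aspect_ratio is not None and aspect_ratio not in ASPECT_RATIOS:
        raise ValueError(f"unsupported aspectRatio: {aspect_ratio}")

    args = {"prompt": prompt, "provider": provider, "mode": mode, "duration": duration}
    if mode == "image-to-video":
        args["imageUrl"] = image_url
    if aspect_ratio is not None:
        args["aspectRatio"] = aspect_ratio
    return args
```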

Video is always asynchronous. The tool submits the job, receives a task ID, then polls the provider every few seconds until it succeeds or fails. If the job does not complete within the poll window, the tool returns a timeout rather than a file.
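
The submit-then-poll pattern described above can be sketched as a generic loop. This is not Sophon's actual implementation: the function name, the status dictionary shape, the state strings (`succeeded`, `failed`), the 5-second interval, and the 300-second window are all assumptions chosen for illustration.

```python
import time
from typing import Callable


def poll_until_done(
    get_status: Callable[[], dict],
    interval_s: float = 5.0,   # assumed; the tool polls "every few seconds"
    timeout_s: float = 300.0,  # assumed poll window
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll a job until it succeeds, fails, or the poll window elapses."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status["state"] == "succeeded":
            # The caller would download the file from this URL.
            return {"ok": True, "url": status.get("url")}
        if status["state"] == "failed":
            return {"ok": False, "error": status.get("error", "generation failed")}
        if clock() >= deadline:
            # No file is produced; the caller receives a timeout result.
            return {"ok": False, "error": "timeout"}
        sleep(interval_s)
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real waiting; in normal use the defaults apply and `get_status` would wrap the provider's status endpoint for the task ID returned at submission.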

`music.generate`

Generates music or audio from a descriptive prompt and optional lyrics.

Providers — elevenlabs (ElevenLabs), stability (Stable Audio), minimax (MiniMax). Defaults to elevenlabs.
Key parameters — prompt (required), provider, durationSeconds (default 30, clamped to provider limits), instrumental (boolean), lyrics (optional vocal text), and format (mp3 / wav).
Output — an MP3 or WAV saved into the documents library.

Lyrics are honored by ElevenLabs and MiniMax. Stability generates instrumental-only audio, so any supplied lyrics are dropped and the result includes a note saying so. Setting instrumental: true clears lyrics for every provider. The MiniMax path may run asynchronously: it polls the job, with a longer deadline than video.generate uses, and then downloads the finished audio.
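
The lyrics and duration rules above can be summarized in a short sketch. The helper `resolve_music_request` and the shape of its return value are hypothetical, and the per-provider duration limits are passed in by the caller because the actual limits are not stated here.

```python
from typing import Optional, Tuple


def resolve_music_request(
    provider: str,
    limits: Tuple[int, int],          # (min, max) seconds; caller-supplied assumption
    lyrics: Optional[str] = None,
    instrumental: bool = False,
    duration_seconds: int = 30,       # documented default
) -> dict:
    """Hypothetical helper: apply the lyrics and duration rules for music.generate."""
    notes = []
    if instrumental:
        # instrumental: true clears lyrics for every provider.
        lyrics = None
    elif lyrics and provider == "stability":
        # Stability produces instrumental-only audio, so lyrics are dropped
        # and the result carries a note about it.
        lyrics = None
        notes.append("lyrics dropped: stability generates instrumental-only audio")

    low, high = limits
    clamped = max(low, min(duration_seconds, high))  # clamp to provider limits
    return {
        "provider": provider,
        "lyrics": lyrics,
        "durationSeconds": clamped,
        "notes": notes,
    }
```

For instance, with placeholder limits of `(1, 60)`, a 500-second request to minimax is clamped to 60 seconds and keeps its lyrics, while the same lyrics sent to stability come back as `None` with one note attached.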

Summary

Tool	Media type	Providers	Risk
`image.generate`	Image (PNG)	DALL-E, Fal, Stability, Imagen, Replicate, Together	Medium
`video.generate`	Video (MP4)	Runway, Luma, Kling	Medium
`music.generate`	Music/audio (MP3/WAV)	ElevenLabs, Stability, MiniMax	Medium

Where the output goes

Every successful generation is downloaded and written to the agent's output directory, where it joins the Documents library as a normal file. From there the agent can attach it to a reply, reference it in later turns, or open it in the Canvas for preview and editing.
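
As an illustration of the write step, the sketch below saves downloaded bytes into an output directory under a timestamped name. The function `save_media` and the filename pattern (tool name plus a UTC timestamp) are assumptions; the exact naming scheme Sophon uses is not specified here.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def save_media(
    data: bytes,
    output_dir: Path,
    tool: str,
    extension: str,
    now: Optional[datetime] = None,
) -> Path:
    """Hypothetical helper: write generated media under a timestamped filename."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    prefix = tool.replace(".", "-")  # e.g. "image.generate" -> "image-generate"
    output_dir.mkdir(parents=True, exist_ok=True)
    path = output_dir / f"{prefix}-{stamp}.{extension}"
    path.write_bytes(data)
    return path
```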

Each provider needs its own API key. Configure keys under your connections/models settings before invoking a tool. If a key is missing for the selected provider, the tool returns an error instead of generating, so you only pay for providers you have set up.
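
The missing-key behavior can be pictured as a guard that runs before any provider call. This is a minimal sketch under stated assumptions: `generate_if_configured`, the `configured_keys` mapping, and the result shape are hypothetical stand-ins for however Sophon stores keys and reports errors.

```python
from typing import Callable, Mapping


def generate_if_configured(
    provider: str,
    configured_keys: Mapping[str, str],
    generate: Callable[[str, str], str],
) -> dict:
    """Hypothetical guard: return an error instead of generating when the key is missing."""
    api_key = configured_keys.get(provider)
    if not api_key:
        # No provider call is made, so no credits are spent.
        return {"ok": False, "error": f"no API key configured for provider '{provider}'"}
    return {"ok": True, "result": generate(provider, api_key)}
```

Because the check happens before `generate` is invoked, a request for an unconfigured provider never reaches that provider's API.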

Where to go next

Skills & Tools — how built-in tools are registered, approved, and invoked
Canvas — preview and edit generated media alongside the chat
Documents — where generated files are stored and indexed
