Media Generation
Generate images, video, and music from prompts using built-in AI tools.
Sophon ships three built-in generation tools that turn text prompts into media: `image.generate`, `video.generate`, and `music.generate`. Each one sends the request to the external AI provider you select, downloads the result, and saves it into your documents library so the agent (or you) can reference it, attach it, or push it to the Canvas.
All three are registered as agent tools under Skills & Tools and run at Medium risk, so they pass through the normal approval flow before spending provider credits.
image.generate
Generates a still image from a prompt and writes a PNG to the output directory.
- Providers: `dalle` (OpenAI DALL-E), `fal` (Fal.ai), `stability` (Stability AI), `imagen` (Google Imagen), `replicate`, `together` (Together AI). Defaults to `dalle`.
- Key parameters: `prompt` (required), `provider`, `model` (e.g. `dall-e-3`, `flux/schnell`), and `size` (`256x256` through `1792x1024`). `style` (`vivid`/`natural`) and `quality` (`standard`/`hd`) apply to DALL-E 3 only.
- Output: a single PNG saved with a timestamped filename into the documents library.
Most providers return image data synchronously. The replicate path is asynchronous: it creates a prediction and polls until the job reports succeeded (or times out), then downloads the resulting image.
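The asynchronous path described above can be sketched as a small polling loop. This is only an illustration under stated assumptions, not Sophon's actual implementation: `poll_until_done`, the `fetch_status` callback, the job dictionary shape (`status`, `output_url`, `error`), and the interval and timeout defaults are all hypothetical names and values chosen for the example.

```python
import time
from typing import Callable, Optional


def poll_until_done(
    fetch_status: Callable[[], dict],
    interval_s: float = 2.0,    # assumed poll interval, not a documented value
    timeout_s: float = 120.0,   # assumed deadline, not a documented value
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Poll a job until it succeeds, fails, or the deadline passes.

    Returns the output URL on success and None on timeout.
    Raises RuntimeError if the provider reports a failure.
    """
    deadline = clock() + timeout_s
    while True:
        job = fetch_status()  # hypothetical: one status lookup per call
        status = job["status"]
        if status == "succeeded":
            return job["output_url"]
        if status in ("failed", "canceled"):
            raise RuntimeError(f"generation failed: {job.get('error', 'unknown error')}")
        if clock() >= deadline:
            return None  # caller reports a timeout instead of a file
        sleep(interval_s)
```

Passing `sleep` and `clock` as parameters keeps the loop easy to exercise without real waiting; a caller would download the returned URL only when the result is not `None`.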
video.generate
Generates a short video clip from a prompt, or animates a source image.
- Providers: `runway` (RunwayML), `luma` (Luma Dream Machine), `kling` (KlingAI). Defaults to `runway`.
- Modes: `text-to-video` (default) or `image-to-video`. The image mode requires an `imageUrl`.
- Key parameters: `prompt` (required), `provider`, `mode`, `imageUrl`, `duration` (seconds, default 5), and `aspectRatio` (`16:9`, `9:16`, `1:1`).
- Output: an MP4 downloaded into the documents library.
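The parameter rules listed above, in particular that `image-to-video` needs an `imageUrl`, could be checked before a job is submitted. The following is a minimal sketch under assumptions: `VideoArgs` and `validate_video_args` are hypothetical names, the error strings are invented for illustration, and no default is assumed for `aspectRatio` because the documentation does not state one.

```python
from dataclasses import dataclass
from typing import List, Optional

PROVIDERS = ("runway", "luma", "kling")
MODES = ("text-to-video", "image-to-video")
ASPECT_RATIOS = ("16:9", "9:16", "1:1")


@dataclass
class VideoArgs:
    prompt: str
    provider: str = "runway"             # documented default
    mode: str = "text-to-video"          # documented default
    image_url: Optional[str] = None      # imageUrl in the tool's own naming
    duration: int = 5                    # seconds, documented default
    aspect_ratio: Optional[str] = None   # aspectRatio; default not documented


def validate_video_args(args: VideoArgs) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems: List[str] = []
    if not args.prompt.strip():
        problems.append("prompt is required")
    if args.provider not in PROVIDERS:
        problems.append(f"unknown provider: {args.provider}")
    if args.mode not in MODES:
        problems.append(f"unknown mode: {args.mode}")
    if args.mode == "image-to-video" and not args.image_url:
        problems.append("image-to-video requires imageUrl")
    if args.aspect_ratio is not None and args.aspect_ratio not in ASPECT_RATIOS:
        problems.append(f"unsupported aspectRatio: {args.aspect_ratio}")
    return problems
```

Collecting every problem in a list, rather than raising on the first one, lets a caller report all invalid arguments in a single response.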
Video is always asynchronous. The tool submits the job, receives a task ID, then polls the provider every few seconds until it succeeds or fails. If the job does not complete within the poll window, the tool returns a timeout rather than a file.
music.generate
Generates music or audio from a descriptive prompt and optional lyrics.
- Providers: `elevenlabs` (ElevenLabs), `stability` (Stable Audio), `minimax` (MiniMax). Defaults to `elevenlabs`.
- Key parameters: `prompt` (required), `provider`, `durationSeconds` (default 30, clamped to provider limits), `instrumental` (boolean), `lyrics` (optional vocal text), and `format` (`mp3`/`wav`).
- Output: an MP3 or WAV saved into the documents library.
Lyrics are honored by ElevenLabs and MiniMax. Stability generates instrumental-only audio, so any supplied lyrics are dropped and the result notes that. Setting instrumental: true clears lyrics for every provider. The MiniMax path may run asynchronously, polling the job with a longer deadline than video before downloading the finished audio.
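The lyrics rules above reduce to a small decision function. This sketch assumes hypothetical helper names (`resolve_lyrics`, `clamp_duration`) and an invented note string. The per-provider duration limits are not documented here, so the clamp takes them as arguments rather than hard-coding any values.

```python
from typing import Optional, Tuple


def resolve_lyrics(
    provider: str, lyrics: Optional[str], instrumental: bool
) -> Tuple[Optional[str], Optional[str]]:
    """Return (lyrics_to_send, note) following the rules described above."""
    if instrumental:
        # instrumental: true clears lyrics for every provider
        return None, None
    if provider == "stability" and lyrics:
        # Stability is instrumental-only: drop the lyrics and say so
        return None, "lyrics ignored: stability generates instrumental-only audio"
    return (lyrics or None), None


def clamp_duration(requested_s: float, min_s: float, max_s: float) -> float:
    """Clamp a requested duration into a provider's [min_s, max_s] range.

    The limits are supplied by the caller; real values are provider-specific.
    """
    return max(min_s, min(requested_s, max_s))
```

For example, `resolve_lyrics("stability", "la la", False)` yields no lyrics plus a note, while the same call with `"elevenlabs"` passes the lyrics through unchanged.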
Summary
| Tool | Media type | Providers | Risk |
|---|---|---|---|
| image.generate | Image (PNG) | DALL-E, Fal, Stability, Imagen, Replicate, Together | Medium |
| video.generate | Video (MP4) | Runway, Luma, Kling | Medium |
| music.generate | Music/audio (MP3/WAV) | ElevenLabs, Stability, MiniMax | Medium |
Where the output goes
Every successful generation is downloaded and written to the agent's output directory, where it joins the Documents library as a normal file. From there the agent can attach it to a reply, reference it in later turns, or open it in the Canvas for preview and editing.
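As a rough illustration of the save step, the sketch below writes downloaded bytes into an output directory under a timestamped filename. The function name `save_output`, the `tool-YYYYMMDD-HHMMSS.ext` naming pattern, and the use of UTC are assumptions made for the example; the actual filename format is not documented here.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def save_output(
    data: bytes,
    output_dir: Path,
    tool: str,                       # e.g. "image", "video", "music" (hypothetical labels)
    extension: str,                  # e.g. "png", "mp4", "mp3"
    now: Optional[datetime] = None,  # injectable clock for reproducible names
) -> Path:
    """Write generated media to output_dir and return the new file's path."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%d-%H%M%S")  # assumed timestamp format
    output_dir.mkdir(parents=True, exist_ok=True)
    path = output_dir / f"{tool}-{stamp}.{extension}"
    path.write_bytes(data)
    return path
```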
Each provider needs its own API key. Configure keys under your connections/models settings before invoking a tool. If a key is missing for the selected provider, the tool returns an error instead of generating, so you only pay for providers you have set up.
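The missing-key behavior can be pictured as a guard that runs before any provider call. This is a sketch only: `MissingProviderKey`, `require_key`, and the idea of passing configured keys as a plain mapping are hypothetical, and Sophon's real settings storage may look quite different.

```python
from typing import Mapping


class MissingProviderKey(Exception):
    """Raised when the selected provider has no API key configured."""


def require_key(provider: str, configured_keys: Mapping[str, str]) -> str:
    """Return the provider's API key, or fail before any paid request is made."""
    key = configured_keys.get(provider, "").strip()
    if not key:
        raise MissingProviderKey(f"no API key configured for provider '{provider}'")
    return key
```

Because the check happens before the request is built, a provider without a key never receives a call and therefore never incurs a charge.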
Where to go next
- Skills & Tools — how built-in tools are registered, approved, and invoked
- Canvas — preview and edit generated media alongside the chat
- Documents — where generated files are stored and indexed