Extended Thinking & Reasoning

Control how much the model reasons before answering across Anthropic, OpenAI, and Gemini.

Modern reasoning models can spend extra compute "thinking" before they answer, for example by exploring options, checking their own work, or planning multi-step solutions. Each provider exposes this differently: Anthropic has a token budget, OpenAI has a reasoning_effort string, and Google Gemini has a thinking budget. Sophon hides those differences behind a single thinking level, so you pick a depth and Sophon maps it to whatever the active model expects.

Thinking levels

There are four levels:

Level	Meaning
Auto	Preserve each provider's default behavior — Sophon sends no explicit reasoning control
Off	Suppress reasoning where the provider allows it (fastest, cheapest)
Fast	A modest reasoning budget — quick wins on harder prompts without much latency
Full	A large reasoning budget — best for complex, multi-step, or agentic problems

Auto is the default. It is deliberately hands-off: Sophon omits the reasoning parameter entirely so the model behaves exactly as the provider intends out of the box.

How it maps to each provider

Sophon translates the level into the native control for the model you're using:

Provider	Native control	Fast	Full	Off / Auto
Anthropic	`thinking.budget_tokens`	4,096 tokens	12,288 tokens	block omitted
Google Gemini	`thinking_config.thinking_budget`	4,096 tokens	24,576 tokens	budget 0 (Off)
OpenAI	`reasoning_effort`	`low`	`high`	`minimal`
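
The table above can be read as a lookup from level to provider-native parameter. The following is a minimal sketch of that lookup, not Sophon's actual implementation: `ThinkingLevel`, `reasoning_params`, and the lowercase provider keys are hypothetical names, and the sketch assumes that Auto always yields no parameter and that Off maps to `minimal` on OpenAI.

```python
from enum import Enum


class ThinkingLevel(Enum):
    """Hypothetical enum mirroring the four levels described above."""
    AUTO = "auto"
    OFF = "off"
    FAST = "fast"
    FULL = "full"


# Budgets and effort strings copied from the mapping table.
ANTHROPIC_BUDGETS = {ThinkingLevel.FAST: 4096, ThinkingLevel.FULL: 12288}
GEMINI_BUDGETS = {
    ThinkingLevel.OFF: 0,
    ThinkingLevel.FAST: 4096,
    ThinkingLevel.FULL: 24576,
}
# Assumption: Off corresponds to "minimal"; Auto sends nothing.
OPENAI_EFFORTS = {
    ThinkingLevel.OFF: "minimal",
    ThinkingLevel.FAST: "low",
    ThinkingLevel.FULL: "high",
}


def reasoning_params(provider: str, level: ThinkingLevel) -> dict:
    """Return the provider-native reasoning fields for a level (may be empty)."""
    if level is ThinkingLevel.AUTO:
        return {}  # Auto: no explicit reasoning control is sent
    if provider == "anthropic":
        budget = ANTHROPIC_BUDGETS.get(level)
        if budget is None:
            return {}  # Off: the thinking block is omitted
        return {"thinking": {"type": "enabled", "budget_tokens": budget}}
    if provider == "gemini":
        return {"thinking_config": {"thinking_budget": GEMINI_BUDGETS[level]}}
    if provider == "openai":
        return {"reasoning_effort": OPENAI_EFFORTS[level]}
    return {}  # unknown provider: send nothing
```

For example, `reasoning_params("gemini", ThinkingLevel.FULL)` yields a thinking budget of 24,576 tokens, while any provider with `ThinkingLevel.AUTO` yields an empty dict.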

A few provider-specific details Sophon handles for you:

Anthropic requires temperature = 1 whenever thinking is enabled, and max_tokens must exceed the thinking budget — Sophon adjusts both automatically and adds answer headroom on top of the budget.
For OpenAI, Sophon sends reasoning_effort only for reasoning models (the o-series and gpt-5 family). The minimal effort is gpt-5-only, so Sophon omits the parameter for o-series models where it doesn't apply.
Gemini thinking tokens are billed separately and don't count against the output limit. For Gemini 2.5+ models, Auto applies a sensible heuristic (small utility requests skip thinking entirely).
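
The Anthropic and OpenAI adjustments described above could look roughly like this. It is a sketch under stated assumptions: the function names are hypothetical, the 2,048-token answer headroom is an illustrative value (the actual headroom Sophon adds is not specified here), and detecting model families by name prefix is a simplification.

```python
from typing import Optional

ANSWER_HEADROOM = 2048  # hypothetical value; the real headroom is not documented here


def apply_anthropic_thinking(request: dict, budget_tokens: int,
                             headroom: int = ANSWER_HEADROOM) -> dict:
    """Return a copy of the request with thinking enabled and limits adjusted."""
    adjusted = dict(request)
    adjusted["thinking"] = {"type": "enabled", "budget_tokens": budget_tokens}
    # Anthropic requires temperature = 1 whenever thinking is enabled.
    adjusted["temperature"] = 1
    # max_tokens must exceed the budget; leave room for the answer on top.
    adjusted["max_tokens"] = max(request.get("max_tokens", 0),
                                 budget_tokens + headroom)
    return adjusted


def openai_effort_for(model: str, effort: str) -> Optional[str]:
    """Return the effort to send, or None when the parameter should be omitted."""
    # Assumption: model families are recognizable from the name prefix.
    is_gpt5 = model.startswith("gpt-5")
    is_o_series = len(model) > 1 and model[0] == "o" and model[1].isdigit()
    if not (is_gpt5 or is_o_series):
        return None  # not a reasoning model: never send reasoning_effort
    if effort == "minimal" and not is_gpt5:
        return None  # "minimal" is gpt-5-only, so omit it for o-series
    return effort
```

With these assumptions, a request that asked for 1,024 output tokens and a 4,096-token budget ends up with `max_tokens` of 6,144 and `temperature` of 1.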

Only reasoning-capable models expose this. On a model that doesn't support reasoning — for example a standard chat model without a thinking mode — the level is ignored and the request runs normally. Choosing Full does not turn a non-reasoning model into a reasoning one.

Setting the thinking level

In the Dashboard chat, open the Thinking selector in the composer controls and pick Auto, Off, Fast, or Full. The choice rides along with the next message you send, so you can change depth per turn — Full for a thorny architecture question, Off for a quick lookup.

The selected level is sent to the Gateway with your message, parsed into the internal level, and applied by the provider when the model is invoked. Unknown or empty values fall back to Auto.
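
The fallback rule can be illustrated with a small parser. This is a sketch only: `parse_thinking_level` is a hypothetical name, and the lowercase wire values are an assumption about how the level is serialized.

```python
VALID_LEVELS = {"auto", "off", "fast", "full"}  # assumed wire values


def parse_thinking_level(raw) -> str:
    """Normalize a raw level value; unknown or empty input becomes "auto"."""
    if not isinstance(raw, str):
        return "auto"
    value = raw.strip().lower()
    return value if value in VALID_LEVELS else "auto"
```

Here `parse_thinking_level("Full")` returns `"full"`, while `parse_thinking_level("")`, `parse_thinking_level(None)`, and `parse_thinking_level("turbo")` all return `"auto"`.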

Seeing the reasoning trace

When a thinking level surfaces the model's reasoning, Sophon streams it on a separate channel from the answer text and renders it as a collapsible Thinking block above the message. The trace stays open while the response streams, then collapses once the final answer settles — you can expand or collapse it anytime.

Reasoning tokens are tracked as a distinct count. The message footer shows total tokens and, when present, the reasoning portion (for example, 1240 tokens · 380 reasoning). For Gemini, reasoning tokens are included in the billed output count and also reported as a display-only subset, mirroring how OpenAI reports reasoning tokens.
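
As an illustration of the footer format, the following sketch builds the string from two counts. The function name is hypothetical, and it assumes the reasoning portion is shown only when it is greater than zero.

```python
def format_token_footer(total_tokens: int, reasoning_tokens: int = 0) -> str:
    """Build a footer such as "1240 tokens · 380 reasoning"."""
    footer = f"{total_tokens} tokens"
    if reasoning_tokens > 0:
        # The reasoning count is a subset of the total, not an addition to it.
        footer += f" · {reasoning_tokens} reasoning"
    return footer
```

Calling `format_token_footer(1240, 380)` returns `"1240 tokens · 380 reasoning"`, and `format_token_footer(1240)` returns `"1240 tokens"`.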

Cost and billing

Reasoning is not free — thinking tokens are real tokens you pay for. Full can multiply token usage on a single turn, so reserve it for problems that genuinely benefit. Per-provider reasoning-token counts flow into Sophon's usage tracking; see Insights for how token consumption is rolled up across sessions, agents, and providers.

Limits and gotchas

Auto means "do nothing." Sophon does not force reasoning on or off in Auto — the model keeps its native default.
Anthropic thinking + tools. When a request includes tools, Anthropic extended thinking is currently disabled, because Anthropic requires the signed thinking block to be replayed verbatim on the follow-up tool turn — a path Sophon hasn't enabled yet. Thinking applies on non-tool turns.
Not every model qualifies. OpenAI reasoning effort only applies to the o-series and gpt-5 family; Gemini thinking applies to 2.5+ models. Non-reasoning models silently ignore the level.
OpenAI-compatible providers vary. Reasoning controls aren't portable — some compatible endpoints reason automatically and reject an explicit effort, so Sophon only sends it where it's known to be accepted.
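
The Anthropic tools restriction above amounts to a simple gate. A minimal sketch, assuming a hypothetical helper name and lowercase level strings:

```python
def anthropic_thinking_enabled(level: str, tools=None) -> bool:
    """Decide whether to attach a thinking block to an Anthropic request."""
    if tools:
        # Tool turns: thinking stays off because signed thinking blocks
        # would have to be replayed verbatim on the follow-up turn.
        return False
    # Only Fast and Full enable thinking; Auto and Off omit the block.
    return level in ("fast", "full")
```

Under these assumptions, `anthropic_thinking_enabled("full", tools=[{"name": "search"}])` is `False`, while the same level with no tools is `True`.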

Where to go next

Insights — token and reasoning-token tracking across your activity
Models configuration — choosing which model and provider a request uses
