Routing, Failover & Budgets

How Sophon picks a model, fails over between providers, tracks spend, and monitors provider health.

Sophon is bring-your-own-key: it ships with no preconfigured providers. Once you add one or more — Anthropic, OpenAI, a local Ollama endpoint, a Claude Pro subscription — a central provider registry decides which one handles each request, when to fail over, and when a provider is benched for being unhealthy or over budget. This page covers routing, failover, health monitoring, and budgets.

Every provider you add gets a numeric Priority (lower = preferred, default 1) and a Status (Active, Inactive, or Error). Routing only ever considers Active providers, in priority order.

How a request is routed

When the orchestration pipeline needs a model, the registry filters and sorts the providers you've configured:

1. Keep only providers whose Status is Active.
2. Optionally filter by provider type (e.g. force anthropic) or by model capability.
3. Sort by Priority ascending and pick the first match.

Plain routing returns the highest-priority active provider. Capability routing walks each active provider's models — in their own priority order — and returns the first one that satisfies a capability predicate. If nothing matches, it falls back to the highest-priority active provider's default model, so a request never fails just because no provider advertised a niche capability.
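The selection rules above can be sketched as follows. This is a minimal illustration, not Sophon's actual registry code: the `Provider` and `Model` classes, the `route` function, and the per-model `priority` field are hypothetical names, and the sketch assumes a provider's default model is the first one in its own priority order.

```python
from dataclasses import dataclass, field


@dataclass
class Model:
    name: str
    priority: int = 1            # assumed: lower = preferred, as for providers
    supports_vision: bool = False


@dataclass
class Provider:
    name: str
    priority: int = 1            # lower = preferred
    status: str = "Active"       # Active, Inactive, or Error
    models: list = field(default_factory=list)


def route(providers, predicate=None):
    """Return (provider, model), or None when no provider is Active."""
    active = sorted(
        (p for p in providers if p.status == "Active"),
        key=lambda p: p.priority,
    )
    if not active:
        return None
    if predicate is not None:
        # Capability routing: walk each active provider's models in order.
        for provider in active:
            for model in sorted(provider.models, key=lambda m: m.priority):
                if predicate(model):
                    return provider, model
    # Plain routing, and the fallback when no model satisfied the predicate.
    top = active[0]
    default = min(top.models, key=lambda m: m.priority) if top.models else None
    return top, default
```

Calling `route(providers, lambda m: m.supports_vision)` returns the first vision-capable model among active providers, and falls back to the top provider's default model when none advertises vision.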

A per-session model override — set from the Dashboard chat inspector or the CLI /provider command — takes precedence over these rules for that session only. See Model Catalog.

Model capabilities

ModelCapabilities is what capability routing matches against. Each model reports:

Capability	Field	Meaning
Vision	`SupportsVision`	Accepts image input
Function calling	`SupportsFunctionCalling`	Native tool/function calls
Streaming	`SupportsStreaming`	Token-by-token streaming
Reasoning	`SupportsReasoning`	Extended thinking / reasoning tokens
Code generation	`SupportsCodeGeneration`	Tuned for code
Context size	`MaxContextTokens`	Max input window
Output size	`MaxOutputTokens`	Max generated tokens

So a step that needs vision routes to the highest-priority active provider whose model has SupportsVision. For reasoning controls (Off / Fast / Full) and how they map to each provider's thinking budget, see Extended Thinking & Reasoning.

Failover between providers

The non-streaming LLM call runs through the provider failover policy, which builds a chain starting with the chosen provider and appending every other active provider in priority order. Each provider gets up to 3 retries before the chain moves on.
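As a rough sketch of how such a chain could be assembled (the dictionary shape and the `build_failover_chain` name are assumptions made for illustration):

```python
def build_failover_chain(chosen, providers):
    """Return the chosen provider followed by every other Active provider,
    ordered by ascending priority (lower = preferred)."""
    others = [
        p for p in providers
        if p["status"] == "Active" and p["name"] != chosen["name"]
    ]
    others.sort(key=lambda p: p["priority"])
    return [chosen] + others
```

The chosen provider always goes first, even when another active provider has a lower priority number, which is what lets a per-session override lead the chain.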

A failure is retryable when it's an HTTP 429 or 529, or when the error message mentions a rate limit or an overloaded provider. The policy retries with exponential backoff — 250 ms base, doubling, capped at 8 s, ±20% jitter — and honors a Retry-After hint when present. A non-retryable error skips straight to the next provider. If every provider is exhausted, the call throws with a summary of what was attempted.
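The retry classification and backoff schedule can be illustrated with a short sketch. The function names are hypothetical, and two details are assumptions the text does not settle: jitter is applied after the 8 s cap, and a Retry-After hint replaces the computed delay outright.

```python
import random

RETRYABLE_STATUS = {429, 529}
BASE_DELAY_S = 0.25   # 250 ms base
MAX_DELAY_S = 8.0     # 8 s cap
JITTER = 0.20         # +/- 20%


def is_retryable(status_code, message=""):
    """Retry on 429/529, or when the message mentions a rate limit or overload."""
    if status_code in RETRYABLE_STATUS:
        return True
    text = message.lower()
    return "rate limit" in text or "overloaded" in text


def backoff_delay(attempt, retry_after_s=None, rng=random.random):
    """Delay in seconds before retry number `attempt` (0-based)."""
    if retry_after_s is not None:
        return retry_after_s          # assumed: the server hint wins
    delay = min(BASE_DELAY_S * (2 ** attempt), MAX_DELAY_S)
    factor = 1 + JITTER * (2 * rng() - 1)   # uniform in [0.8, 1.2]
    return delay * factor
```

Without jitter the schedule runs 0.25 s, 0.5 s, 1 s, 2 s, 4 s, then stays at 8 s.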

The current policy is a fixed, read-only snapshot:

Setting	Value
Strategy	`priority`
Fallback enabled	yes
Max retries per provider	3
Backoff	250 ms base, 8 s cap, ±20% jitter
Retryable status codes	429, 529
Key rotation	enabled

Streaming failover works the same way, but only for errors that happen before the first chunk arrives. Once tokens start flowing, an error propagates to the caller — Sophon can't silently re-route mid-stream.
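The "before the first chunk" rule can be sketched with a generator. This is a simplified illustration under stated assumptions: `stream_with_failover` is a hypothetical name, each chain entry is a callable that returns an iterator of chunks, and retries and error classification are left out so that any pre-first-chunk error moves to the next provider.

```python
def stream_with_failover(chain):
    """Yield chunks from the first provider that produces a first chunk."""
    errors = []
    for start_stream in chain:
        try:
            stream = iter(start_stream())
            first = next(stream)      # failures here can still fail over
        except StopIteration:
            return                    # provider returned an empty stream
        except Exception as exc:
            errors.append(exc)
            continue
        yield first
        # From here on, errors propagate to the caller: no re-routing mid-stream.
        yield from stream
        return
    raise RuntimeError(f"all providers failed: {errors}")
```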

API-key cooldown

If a provider has more than one API key configured, the policy rotates keys before falling back to the slower backoff path. When a key hits a rate limit, the policy benches it for 60 seconds in a per-key cooldown tracker, picks the next key that isn't cooling down, and retries immediately on that key. A benched key becomes available again once its cooldown window expires. Single-key providers skip rotation and go straight to backoff.
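A per-key cooldown tracker of this kind might look like the sketch below. The class and method names are hypothetical; the clock is injectable only so the behaviour can be checked without waiting 60 seconds.

```python
import time


class KeyCooldownTracker:
    """Tracks which API keys are benched after a rate limit (illustrative)."""

    def __init__(self, keys, cooldown_s=60.0, clock=time.monotonic):
        self._keys = list(keys)
        self._cooldown_s = cooldown_s
        self._clock = clock
        self._benched_until = {}   # key -> time at which it is usable again

    def bench(self, key):
        """Put a rate-limited key on cooldown."""
        self._benched_until[key] = self._clock() + self._cooldown_s

    def is_available(self, key):
        until = self._benched_until.get(key)
        return until is None or self._clock() >= until

    def next_available(self):
        """Return the first key that is not cooling down, or None."""
        for key in self._keys:
            if self.is_available(key):
                return key
        return None
```

When `next_available()` returns `None`, every key is cooling down and the caller would fall through to ordinary backoff.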

Health monitoring

A background health monitor periodically probes every provider that isn't Inactive. It runs each provider's health check and flips Status to Active on success or Error on failure (or on a thrown exception). Because routing and failover only consider Active providers, a provider that goes unhealthy is automatically excluded until a later check brings it back.

Option	Default	Section
`Enabled`	`true`	`Sophon:HealthMonitor`
`IntervalMinutes`	`15`	`Sophon:HealthMonitor`
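One pass of the monitor could be sketched like this. The `run_health_pass` name, the dictionary shape, and the `check` callable are assumptions; scheduling the pass every `IntervalMinutes` is left out.

```python
def run_health_pass(providers, check):
    """Probe every provider that isn't Inactive and update its status in place."""
    for provider in providers:
        if provider["status"] == "Inactive":
            continue                      # Inactive providers are never probed
        try:
            healthy = bool(check(provider))
        except Exception:
            healthy = False               # a thrown exception counts as a failure
        provider["status"] = "Active" if healthy else "Error"
```

Because a provider in Error is still probed, a later successful check is what brings it back into routing.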

Budget tracking

Sophon records token usage per provider after each call and prices it against the model actually used — including per-session overrides — using the flat per-1M-token rates from the Model Catalog. When a rate is wrong or missing, correct it with an entry in the catalog override file (~/.sophon/config/models-catalog.json); the fix flows through to spend tracking automatically. Each provider can set an optional BudgetConfig:

Limit	Field	Window
Token cap	`MaxTokensPerDay`	Daily (UTC)
Cost cap	`MaxCostPerMonth`	Monthly (UTC)
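The per-1M-token pricing described above comes down to simple arithmetic. The sketch below assumes separate input and output rates, which the text does not confirm, and the `cost_usd` name and the rates in the example are illustrative only.

```python
def cost_usd(input_tokens, output_tokens, input_rate_per_1m, output_rate_per_1m):
    """Price a call from flat per-1M-token rates (rates in dollars)."""
    return (
        input_tokens * input_rate_per_1m
        + output_tokens * output_rate_per_1m
    ) / 1_000_000
```

With made-up rates of $3 per 1M input tokens and $15 per 1M output tokens, a call that uses 1,000,000 input and 500,000 output tokens costs (3 + 7.5) = $10.50. A flat-rate subscription provider would use rates of zero and always price at $0.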

The tracker logs a warning at 80% of a limit and again when it is reached. Enforcement is a hard gate in the orchestration pipeline — when a provider is out of budget, the LLM is never called and the user gets a budget_exceeded response instead.
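The warning threshold and the hard gate can be sketched as below. The snake_case field names mirror `MaxTokensPerDay` and `MaxCostPerMonth`, but the class, the functions, and the shape of the returned dictionary are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

WARN_RATIO = 0.8   # warn at 80% of a limit


@dataclass
class BudgetConfig:
    max_tokens_per_day: Optional[int] = None      # daily window (UTC)
    max_cost_per_month: Optional[float] = None    # monthly window (UTC)


def limit_state(used, limit):
    """Classify usage against one limit: ok, warning, or exceeded."""
    if limit is None:
        return "ok"                # no limit configured
    if used >= limit:
        return "exceeded"
    if used >= WARN_RATIO * limit:
        return "warning"
    return "ok"


def guarded_call(budget, tokens_today, cost_this_month, call_llm):
    """Hard gate: never invoke the LLM once any limit is reached."""
    states = (
        limit_state(tokens_today, budget.max_tokens_per_day),
        limit_state(cost_this_month, budget.max_cost_per_month),
    )
    if "exceeded" in states:
        return {"status": "budget_exceeded"}
    return {"status": "ok", "reply": call_llm()}
```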

Subscription providers report $0. Sign-in providers like Anthropic via Claude Pro/Max, OpenAI Codex via ChatGPT, and GitHub Copilot are flat-rate, so their per-token cost is reported as zero. Budgets for these providers therefore limit token volume, not dollars.

For the live view of token and cost trends across all your providers, see Insights. To review and prioritize the providers themselves, see Models & Providers.
