Routing, Failover & Budgets

How Sophon picks a model, fails over between providers, tracks spend, and monitors provider health.

Sophon is bring-your-own-key: it ships with no preconfigured providers. Once you add one or more — Anthropic, OpenAI, a local Ollama endpoint, a Claude Pro subscription — a central provider registry decides which one handles each request, when to fail over, and when a provider is benched for being unhealthy or over budget. This page covers routing, failover, health monitoring, and budgets.

Every provider you add gets a numeric Priority (lower = preferred, default 1) and a Status (Active, Inactive, or Error). Routing only ever considers Active providers, in priority order.

How a request is routed

When the orchestration pipeline needs a model, the registry filters and sorts the providers you've configured:

  1. Keep only providers whose Status is Active.
  2. Optionally filter by provider type (e.g. force anthropic) or by model capability.
  3. Sort by Priority ascending and pick the first match.

Plain routing returns the highest-priority active provider. Capability routing walks each active provider's models — in their own priority order — and returns the first one that satisfies a capability predicate. If nothing matches, it falls back to the highest-priority active provider's default model, so a request never fails just because no provider advertised a niche capability.
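The selection rules above can be sketched in a few lines of Python. This is an illustration only, not Sophon's actual registry code: the `Provider` and `Model` classes, the function names, the per-model `priority` field, and the choice to treat the first configured model as the provider's default are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Model:
    name: str
    priority: int = 1             # hypothetical per-model priority (lower = preferred)
    supports_vision: bool = False


@dataclass
class Provider:
    name: str
    type: str
    status: str = "Active"        # "Active", "Inactive", or "Error"
    priority: int = 1             # lower = preferred
    models: list = field(default_factory=list)  # assumption: first entry is the default model


def active_by_priority(providers, provider_type=None):
    """Keep Active providers, optionally filter by type, sort by Priority ascending."""
    active = [p for p in providers if p.status == "Active"]
    if provider_type is not None:
        active = [p for p in active if p.type == provider_type]
    return sorted(active, key=lambda p: p.priority)


def route(providers, provider_type=None):
    """Plain routing: the highest-priority active provider, or None if there is none."""
    candidates = active_by_priority(providers, provider_type)
    return candidates[0] if candidates else None


def route_by_capability(providers, predicate):
    """Capability routing: first (provider, model) whose model satisfies the predicate.

    Falls back to the top provider's default model when nothing matches.
    """
    candidates = active_by_priority(providers)
    for provider in candidates:
        for model in sorted(provider.models, key=lambda m: m.priority):
            if predicate(model):
                return provider, model
    if not candidates:
        return None
    top = candidates[0]
    return top, (top.models[0] if top.models else None)
```

For example, `route_by_capability(providers, lambda m: m.supports_vision)` returns the first vision-capable model in priority order, and a predicate that nothing satisfies still yields the top provider's default model.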

A per-session model override — set from the Dashboard chat inspector or the CLI /provider command — takes precedence over these rules for that session only. See Model Catalog.

Model capabilities

ModelCapabilities is what capability routing matches against. Each model reports:

| Capability | Field | Meaning |
|---|---|---|
| Vision | SupportsVision | Accepts image input |
| Function calling | SupportsFunctionCalling | Native tool/function calls |
| Streaming | SupportsStreaming | Token-by-token streaming |
| Reasoning | SupportsReasoning | Extended thinking / reasoning tokens |
| Code generation | SupportsCodeGeneration | Tuned for code |
| Context size | MaxContextTokens | Max input window |
| Output size | MaxOutputTokens | Max generated tokens |

So a step that needs vision routes to the highest-priority active provider whose model has SupportsVision. For reasoning controls (Off / Fast / Full) and how they map to each provider's thinking budget, see Extended Thinking & Reasoning.

Failover between providers

The non-streaming LLM call runs through the provider failover policy, which builds a chain starting with the chosen provider and appending every other active provider in priority order. Each provider gets up to 3 retries before the chain moves on.
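As a rough sketch of how such a chain could be assembled (the dictionary shape and the function name are hypothetical, chosen only for this example):

```python
def build_failover_chain(chosen, providers):
    """Chosen provider first, then every other Active provider by ascending priority."""
    others = [
        p for p in providers
        if p["status"] == "Active" and p["name"] != chosen["name"]
    ]
    others.sort(key=lambda p: p["priority"])  # lower number = preferred
    return [chosen] + others
```

Providers in the Error or Inactive state never enter the chain, so a benched provider cannot be picked as a fallback.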

A failure is retryable when it's an HTTP 429 or 529, or when the error message mentions a rate limit or an overloaded provider. The policy retries with exponential backoff — 250 ms base, doubling, capped at 8 s, ±20% jitter — and honors a Retry-After hint when present. A non-retryable error skips straight to the next provider. If every provider is exhausted, the call throws with a summary of what was attempted.
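The retry timing can be illustrated with a small Python sketch. It is not the actual policy code. Three details are assumptions: the exact substrings matched in error messages, that jitter is applied after the 8 s cap, and that a Retry-After hint replaces the computed delay outright.

```python
import random

BASE_DELAY_S = 0.25          # 250 ms base
MAX_DELAY_S = 8.0            # 8 s cap
JITTER = 0.20                # +/-20%
RETRYABLE_STATUS = {429, 529}


def is_retryable(status_code, message=""):
    """Retryable on HTTP 429/529, or when the message mentions rate limiting or overload."""
    text = message.lower()
    return (
        status_code in RETRYABLE_STATUS
        or "rate limit" in text      # assumed wording
        or "overloaded" in text      # assumed wording
    )


def backoff_delay(attempt, retry_after=None, rng=random.random):
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        return retry_after           # assumption: the hint overrides the computed delay
    capped = min(BASE_DELAY_S * (2 ** attempt), MAX_DELAY_S)
    # Scale by a random factor in [0.8, 1.2].
    return capped * (1 + JITTER * (2 * rng() - 1))
```

Before jitter, the delays run 0.25 s, 0.5 s, 1 s, 2 s, 4 s, and then stay at 8 s.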

The current policy is a fixed, read-only snapshot:

| Setting | Value |
|---|---|
| Strategy | priority |
| Fallback enabled | yes |
| Max retries per provider | 3 |
| Backoff | 250 ms base, 8 s cap, ±20% jitter |
| Retryable status codes | 429, 529 |
| Key rotation | enabled |

Streaming failover works the same way, but only for errors that happen before the first chunk arrives. Once tokens start flowing, an error propagates to the caller — Sophon can't silently re-route mid-stream.
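The "before the first chunk" rule can be sketched with a Python generator. This is a simplified illustration: `open_stream` is a hypothetical callable that returns an iterator of chunks for a provider, and per-provider retries and retryable-error classification are left out for brevity.

```python
def stream_with_failover(chain, open_stream):
    """Yield chunks from the first provider in `chain` that delivers a first chunk."""
    errors = []
    for provider in chain:
        try:
            stream = iter(open_stream(provider))
            first = next(stream)
        except StopIteration:
            return                    # the stream opened but produced nothing
        except Exception as exc:
            # Failure before any chunk arrived: safe to move to the next provider.
            errors.append(f"{provider}: {exc}")
            continue
        yield first
        # From here on, errors propagate to the caller; no mid-stream re-routing.
        yield from stream
        return
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Only the first `next()` call sits inside the `try`, which is what limits failover to the pre-first-chunk window.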

API-key cooldown

If a provider has more than one API key configured, the policy rotates keys before falling back to exponential backoff. When a key hits a rate limit, a per-key cooldown tracker benches it for 60 seconds; the policy then picks the next key that isn't cooling down and retries immediately on it. A benched key becomes available again once its 60-second window expires. Single-key providers skip rotation and go straight to backoff.
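A per-key cooldown tracker of this kind might look roughly like the following. The class and method names are hypothetical, and the injectable `clock` parameter exists only to make the sketch easy to test.

```python
import time

COOLDOWN_S = 60.0  # a rate-limited key is benched for 60 seconds


class KeyCooldownTracker:
    def __init__(self, keys, clock=time.monotonic):
        self._keys = list(keys)
        self._clock = clock
        self._cooling_until = {}  # key -> clock time at which it is usable again

    def mark_rate_limited(self, key):
        """Bench a key that just hit a rate limit."""
        self._cooling_until[key] = self._clock() + COOLDOWN_S

    def next_available(self):
        """Return the first key that is not cooling down, or None if all are benched."""
        now = self._clock()
        for key in self._keys:
            until = self._cooling_until.get(key)
            if until is None or until <= now:
                return key
        return None  # every key is cooling down: the caller falls back to backoff
```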

Health monitoring

A background health monitor periodically probes every provider that isn't Inactive. It runs each provider's health check and flips Status to Active on success or Error on failure (or on a thrown exception). Because routing and failover only consider Active providers, a provider that goes unhealthy is automatically excluded until a later check brings it back.

| Option | Default | Section |
|---|---|---|
| Enabled | true | Sophon:HealthMonitor |
| IntervalMinutes | 15 | Sophon:HealthMonitor |
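One pass of the health monitor described above could be sketched as follows. The dictionary shape and the `check` callable are hypothetical; in Sophon each provider runs its own health check.

```python
def run_health_pass(providers, check):
    """Probe every provider that is not Inactive and update its status in place."""
    for provider in providers:
        if provider["status"] == "Inactive":
            continue                  # Inactive providers are never probed
        try:
            healthy = check(provider)
        except Exception:
            healthy = False           # a thrown exception counts as a failed check
        provider["status"] = "Active" if healthy else "Error"
```

Because providers in the Error state are still probed, a later successful check moves them back to Active and into routing again.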

Budget tracking

Sophon records token usage per provider after each call and prices it against the model actually used — including per-session overrides — using the flat per-1M-token rates from the Model Catalog. When a rate is wrong or missing, correct it with an entry in the catalog override file (~/.sophon/config/models-catalog.json); the fix flows through to spend tracking automatically. Each provider can set an optional BudgetConfig:

| Limit | Field | Window |
|---|---|---|
| Token cap | MaxTokensPerDay | Daily (UTC) |
| Cost cap | MaxCostPerMonth | Monthly (UTC) |

The tracker logs a warning at 80% of a limit and again when it is reached. Enforcement is a hard gate in the orchestration pipeline — when a provider is out of budget, the LLM is never called and the user gets a budget_exceeded response instead.
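The pricing arithmetic and the 80% / 100% thresholds can be illustrated with a short sketch. The function names are hypothetical, the separate input and output rates are an assumption about how the catalog prices a model, and the snake_case fields stand in for MaxTokensPerDay and MaxCostPerMonth.

```python
from dataclasses import dataclass
from typing import Optional

WARN_FRACTION = 0.80  # warn at 80% of a limit


@dataclass
class BudgetConfig:
    max_tokens_per_day: Optional[int] = None      # stands in for MaxTokensPerDay
    max_cost_per_month: Optional[float] = None    # stands in for MaxCostPerMonth


def cost_usd(input_tokens, output_tokens, input_rate_per_1m, output_rate_per_1m):
    """Price a call at flat per-1M-token rates (separate in/out rates are an assumption)."""
    return (input_tokens * input_rate_per_1m + output_tokens * output_rate_per_1m) / 1_000_000


def budget_state(used, limit):
    """Return "ok", "warning" (at or above 80%), or "exceeded" (limit reached)."""
    if limit is None:
        return "ok"                   # no limit configured
    if used >= limit:
        return "exceeded"
    if used >= WARN_FRACTION * limit:
        return "warning"
    return "ok"


def gate(budget, tokens_today, cost_this_month):
    """Hard gate: refuse the call if either limit has been reached."""
    states = (
        budget_state(tokens_today, budget.max_tokens_per_day),
        budget_state(cost_this_month, budget.max_cost_per_month),
    )
    return "budget_exceeded" if "exceeded" in states else "allowed"
```

For a subscription provider the rates would be zero, so only the token cap can ever trip the gate.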

Subscription providers report $0. Sign-in providers like Anthropic via Claude Pro/Max, OpenAI Codex via ChatGPT, and GitHub Copilot are flat-rate, so their per-token cost is reported as zero. Budgets there are about token volume, not dollars.

For the live view of token and cost trends across all your providers, see Insights. To review and prioritize the providers themselves, see Models & Providers.