Orchestration Pipeline
The 12-middleware pipeline that processes every message — budgets, routing, tool filtering, planning, and the agentic loop.
Every user message in Sophon runs through the orchestration pipeline — an ASP.NET-style middleware chain in Sophon.Core. It replaced the original flat tool-calling loop with twelve composable middlewares that handle budgets, routing, context window management, tool filtering, planning, parallel execution, approvals, and the LLM call itself.
This page is the conceptual map. If you want the full reference (with feature flags, constants, and test counts), read docs/ORCHESTRATION.md in the Sophon repo.
The two zones
User Message
│
▼
┌─────────────────────────────────────────────┐
│ OUTER ZONE — runs once per message │
│ │
│ 1. SessionEvent Record user message │
│ 2. Budget Check spending │
│ 3. CapabilityRouting Pick provider/model │
│ 4. ContextWindow Token budget mgmt │
│ 5. ToolFilter Select top-10 tools │
│ 6. PromptToolBridge Non-FC tool inject │
│ 7. Planning Multi-step planning │
├─────────────────────────────────────────────┤
│ INNER LOOP — repeats for tool calls │
│ │
│ 8. AgenticLoop │
│ ┌─────────────────────────────────────┐ │
│ │ 9. ParallelExecution │ │
│ │ 10. Approval │ │
│ │ 11. ToolExecution │ │
│ │ 12. LlmInvoker ←── LLM API call │ │
│ └─────────────────────────────────────┘ │
│ ↑ │ │
│ └─── loop if tool ───┘ │
└─────────────────────────────────────────────┘
│
▼
Final Response → Memory → Title Generation
The outer zone prepares the request; the inner loop repeats until the LLM produces a final response (no tool calls) or hits the iteration cap.
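The middleware-chain idea can be sketched in a few lines. This is a minimal illustration in Python, not Sophon's .NET code: `build_pipeline`, `session_event`, and `llm_invoker` are hypothetical names, and the context is a plain dict. Each middleware receives the context plus a `next` handler, and can do work before and after calling it, or skip it to short-circuit.

```python
import asyncio
from typing import Awaitable, Callable, List

Context = dict
Next = Callable[[Context], Awaitable[Context]]
Middleware = Callable[[Context, Next], Awaitable[Context]]


def build_pipeline(middlewares: List[Middleware], terminal: Next) -> Next:
    """Compose middlewares so the first one in the list runs outermost."""
    handler = terminal
    for middleware in reversed(middlewares):
        handler = _bind(middleware, handler)
    return handler


def _bind(middleware: Middleware, nxt: Next) -> Next:
    async def run(ctx: Context) -> Context:
        return await middleware(ctx, nxt)
    return run


async def session_event(ctx: Context, nxt: Next) -> Context:
    # Hypothetical stand-in for middleware 1: record before and after.
    ctx["events"].append("UserMessage")
    ctx = await nxt(ctx)
    ctx["events"].append("AgentMessage")
    return ctx


async def llm_invoker(ctx: Context) -> Context:
    # Hypothetical terminal handler; a real one would call the provider.
    ctx["response"] = "final answer"
    return ctx


def run_demo() -> Context:
    pipeline = build_pipeline([session_event], llm_invoker)
    return asyncio.run(pipeline({"events": []}))
```

Calling `run_demo()` returns a context whose `events` list shows the wrap-around order: the outer middleware runs its first half, the terminal handler runs, then the outer middleware finishes.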
Middleware reference
1. Session event
Records a UserMessage event before the pipeline runs and an AgentMessage event after, so the full conversation is durable.
2. Budget
Checks per-provider spending limits (daily tokens, monthly cost). If exceeded, the pipeline short-circuits with a synthetic budget_exceeded response before any LLM call is made. Usage is recorded after each call.
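A rough sketch of the gate, in Python for illustration only. The type and function names (`BudgetLimits`, `Usage`, `run_with_budget`) are hypothetical, and treating "limit reached" as "exceeded" is an assumption; the source only says the check happens before the call and usage is recorded after it.

```python
from dataclasses import dataclass


@dataclass
class BudgetLimits:
    daily_tokens: int
    monthly_cost: float


@dataclass
class Usage:
    tokens_today: int = 0
    cost_this_month: float = 0.0


def run_with_budget(limits: BudgetLimits, usage: Usage, call_llm):
    """Short-circuit before the LLM call if any limit is already used up."""
    over_tokens = usage.tokens_today >= limits.daily_tokens    # assumption: >=
    over_cost = usage.cost_this_month >= limits.monthly_cost   # assumption: >=
    if over_tokens or over_cost:
        return {"type": "budget_exceeded", "llm_called": False}

    # call_llm is a hypothetical callable returning (tokens, cost, text).
    tokens, cost, text = call_llm()
    usage.tokens_today += tokens       # usage is recorded after the call
    usage.cost_this_month += cost
    return {"type": "message", "text": text, "llm_called": True}
```

Because usage is only recorded after a call, a single call can push usage past the limit; the next call is the one that gets blocked.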
3. Capability routing
Picks the LLM provider and model. If a plan step declared a ModelHint (e.g., "requires reasoning" or "requires code generation"), the router finds a provider that advertises those capabilities. Otherwise it uses the default.
4. Context window
Prevents context overflow. If estimated tokens exceed 80% of the model's max context, the middleware asks an LLM to compact the oldest 60% of the conversation. If compaction times out (default 30 seconds) or errors, it hard-truncates to 70% of the budget — always keeping the system prompt and the most recent messages.
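The thresholds can be made concrete with a small sketch (Python, illustrative only). Several details are assumptions rather than documented behavior: that `messages[0]` is the system prompt, that "the oldest 60%" is counted in messages rather than tokens, and that the hard-truncation budget is measured by a caller-supplied `count_tokens` function. All function names are hypothetical.

```python
def needs_compaction(estimated_tokens: int, max_context: int) -> bool:
    """True when the estimate exceeds 80% of the model's max context."""
    return estimated_tokens * 100 > max_context * 80


def split_for_compaction(messages):
    """Split into (system prompt, oldest 60% to compact, recent 40% to keep)."""
    system, rest = messages[0], messages[1:]   # assumption: system prompt first
    cut = len(rest) * 60 // 100
    return system, rest[:cut], rest[cut:]


def hard_truncate(messages, token_budget: int, count_tokens):
    """Fallback: keep the system prompt plus the newest messages that fit
    within 70% of the budget."""
    system, rest = messages[0], messages[1:]
    used = count_tokens(system)
    kept = []
    for message in reversed(rest):             # walk newest to oldest
        cost = count_tokens(message)
        if (used + cost) * 100 > token_budget * 70:
            break
        kept.append(message)
        used += cost
    kept.reverse()                             # restore chronological order
    return [system] + kept
```

Integer arithmetic is used for the percentage checks so the boundaries are exact: with a 10,000-token context, 8,000 estimated tokens does not trigger compaction but 8,001 does.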
5. Tool filter
Selects the top 10 most relevant tools for the request. It keyword-scores every tool (+1.0 per tag match, +0.5 per description keyword match) and always includes datetime.now and memory.search. This keeps the tool manifest small and reduces LLM confusion on agents with hundreds of tools.
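The scoring rule is simple enough to sketch. The following Python is illustrative, not Sophon's implementation: the `Tool` type, the tokenizer, and the choice to let the two always-included tools count toward the limit of 10 are all assumptions. Only the weights (+1.0 per tag match, +0.5 per description keyword match) and the two pinned tool names come from the description above.

```python
import re
from dataclasses import dataclass, field

ALWAYS_INCLUDED = ("datetime.now", "memory.search")


@dataclass
class Tool:
    name: str
    description: str
    tags: list = field(default_factory=list)


def _words(text: str) -> set:
    # Assumption: lowercase alphanumeric tokens, no stop-word removal.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score_tool(tool: Tool, request: str) -> float:
    words = _words(request)
    tag_hits = sum(1 for tag in tool.tags if tag.lower() in words)
    description_hits = len(words & _words(tool.description))
    return 1.0 * tag_hits + 0.5 * description_hits


def select_tools(tools, request: str, limit: int = 10):
    pinned = [t for t in tools if t.name in ALWAYS_INCLUDED]
    others = [t for t in tools if t.name not in ALWAYS_INCLUDED]
    ranked = sorted(others, key=lambda t: score_tool(t, request), reverse=True)
    return (pinned + ranked)[:limit]   # assumption: pinned tools use up slots
```

For a request like "what is the weather forecast in Paris", a tool tagged `weather` whose description mentions "the forecast" scores 2.0 under this tokenizer (one tag match plus two description matches), while an unrelated email tool scores 0.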
6. Prompt-tool bridge
For providers that don't support native function calling, this middleware injects tool definitions into the system prompt and extracts <tool_call>{"name": "...", "arguments": {...}}</tool_call> XML blocks from the response. Lets you use function calling against models that don't speak it natively.
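The extraction half of the bridge might look roughly like this. It is a Python sketch under assumptions: the function name `extract_tool_calls` is hypothetical, and silently skipping malformed blocks is a design choice made here, not something the source specifies. Only the `<tool_call>` wrapper and the `name`/`arguments` JSON shape come from the description above.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def extract_tool_calls(response_text: str):
    """Return (tool_calls, remaining_text) parsed from a model response."""
    calls = []
    for match in TOOL_CALL_RE.finditer(response_text):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # assumption: malformed blocks are ignored
        if isinstance(payload, dict) and isinstance(payload.get("name"), str):
            calls.append({
                "name": payload["name"],
                "arguments": payload.get("arguments", {}),
            })
    remaining = TOOL_CALL_RE.sub("", response_text).strip()
    return calls, remaining
```

The non-greedy `(.*?)` with `re.DOTALL` lets one response carry several blocks, each possibly spanning multiple lines.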
7. Planning
Decomposes complex requests into a DAG of steps. If the request is complex (by heuristic or LLM vote) and any step is rated ≥ Medium risk, the plan is presented to the user for approval before execution. See Planning.
8. Agentic loop
The main tool-calling loop. It calls the inner pipeline; if the LLM returned tool calls, it executes them and loops again. It stops when the LLM produces a final response or the loop hits MaxToolIterations (default: 100, hard cap: 500). Every 10 iterations, it emits a progress checkpoint.
Built into the loop: a tool loop detector that watches the last 30 tool calls for three patterns:
| Pattern | Detection | Severity |
|---|---|---|
| generic_repeat | Same tool + args called N+ times | ≥ 30: circuit breaker. ≥ 20: critical. ≥ 10: warning |
| poll_no_progress | Same tool called consecutively with identical args | ≥ 5: critical. ≥ 3: warning |
| ping_pong | Alternating A → B → A → B in last 6 calls | Warning |
A circuit breaker stops the loop immediately. Warnings inject a hint into the conversation steering the LLM away.
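The three patterns can be sketched as follows (Python, illustrative only). The class name and return shape are hypothetical, and two readings are assumptions: generic_repeat counts occurrences within the 30-call window, and ping_pong requires all six of the last calls to alternate between two distinct calls.

```python
import json
from collections import deque

WINDOW = 30


class ToolLoopDetector:
    def __init__(self):
        self.calls = deque(maxlen=WINDOW)  # FIFO window of recent calls

    def record(self, tool: str, args: dict):
        """Record a call and return a list of (pattern, severity) findings."""
        key = (tool, json.dumps(args, sort_keys=True))
        self.calls.append(key)
        findings = []

        # generic_repeat: same tool + args anywhere in the window.
        count = sum(1 for call in self.calls if call == key)
        if count >= 30:
            findings.append(("generic_repeat", "circuit_breaker"))
        elif count >= 20:
            findings.append(("generic_repeat", "critical"))
        elif count >= 10:
            findings.append(("generic_repeat", "warning"))

        # poll_no_progress: identical call repeated back to back.
        streak = 0
        for call in reversed(self.calls):
            if call != key:
                break
            streak += 1
        if streak >= 5:
            findings.append(("poll_no_progress", "critical"))
        elif streak >= 3:
            findings.append(("poll_no_progress", "warning"))

        # ping_pong: A, B, A, B, A, B across the last six calls (assumption).
        last = list(self.calls)[-6:]
        if (len(last) == 6 and last[0] != last[1]
                and all(last[i] == last[i % 2] for i in range(6))):
            findings.append(("ping_pong", "warning"))

        return findings
```

Serializing the arguments with `sort_keys=True` makes `{"a": 1, "b": 2}` and `{"b": 2, "a": 1}` compare as the same call.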
9. Parallel execution
If the LLM returned 2+ tool calls in one response, they run concurrently via Task.WhenAll. Results are truncated to 8000 chars (with JSON-aware truncation for arrays), then the pipeline re-invokes itself so the LLM sees all results at once.
10. Approval
For every tool call, looks up the tool's risk level. If ≥ High, it sends an approval request via IApprovalGate (SignalR). The originating channel (WhatsApp, Telegram, etc.) is preserved, so approvals can come back through the same channel. Rejected or timed-out tools are replaced with error messages; approved tools proceed. See Approval Gates.
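A minimal sketch of the gating decision, in Python for illustration. Everything named here is hypothetical: the `Risk` enum (the source only confirms that Medium and High exist and are ordered), the decision strings, and the `request_approval` callable that stands in for the approval gate.

```python
from enum import IntEnum


class Risk(IntEnum):
    # Assumption: level names other than Medium and High are illustrative.
    LOW = 0
    MEDIUM = 1
    HIGH = 2


def resolve_approvals(calls, risk_of, request_approval):
    """Gate each call; rejected or timed-out calls become error results."""
    resolved = []
    for call in calls:
        if risk_of(call["name"]) >= Risk.HIGH:
            # Hypothetical gate returning "approved", "rejected", or "timed_out".
            decision = request_approval(call)
            if decision != "approved":
                resolved.append({"name": call["name"],
                                 "error": f"approval {decision}"})
                continue
        resolved.append({"name": call["name"], "error": None})
    return resolved
```

Calls below the High threshold never reach the gate, so low-risk tools add no approval latency.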
11. Tool execution
Executes tool calls sequentially (the fallback for single calls or when parallel doesn't apply). Approval is already resolved by middleware 10, so this middleware calls the registry with skipApproval: true.
12. LLM invoker (terminal)
The innermost middleware. Builds the LlmRequest (messages + filtered tools + max_tokens=4096 + tool_choice=auto), then either:
- Streaming: if the caller set IsStreamingRequested and the provider supports it, uses StreamCompleteAsync() and forwards chunks via OnStreamChunk (typically SignalR).
- Non-streaming: uses ProviderFailoverPolicy.ExecuteAsync() for retry + fallback, or direct CompleteAsync() otherwise.
Safety mechanisms
Tool loop detection
Covered under the agentic loop (middleware 8): three patterns, three severity levels, one 30-call FIFO window.
Tool result truncation
ToolResultTruncator caps every tool result at 8000 chars. For JSON arrays it binary-searches for the max number of elements that fit and appends [...truncated, showing X/Y items]. For text it truncates at the last word boundary.
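The JSON-aware part is the interesting bit, so here is a sketch of how a binary search over the element count could work. This is Python for illustration, not the ToolResultTruncator source; the function names are hypothetical, and counting the marker text toward the 8000-character cap is an assumption.

```python
import json

MAX_CHARS = 8000


def truncate_result(text: str, limit: int = MAX_CHARS) -> str:
    if len(text) <= limit:
        return text
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        data = None
    if isinstance(data, list):
        return _truncate_array(data, limit)
    return _truncate_text(text, limit)


def _render(items, total: int) -> str:
    return json.dumps(items) + f" [...truncated, showing {len(items)}/{total} items]"


def _truncate_array(items, limit: int) -> str:
    # Binary search for the largest prefix whose rendering fits the limit.
    lo, hi = 0, len(items)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if len(_render(items[:mid], len(items))) <= limit:
            lo = mid
        else:
            hi = mid - 1
    return _render(items[:lo], len(items))


def _truncate_text(text: str, limit: int) -> str:
    # Cut at the last space at or before the limit; hard cut if none exists.
    cut = text.rfind(" ", 0, limit + 1)
    if cut <= 0:
        cut = limit
    return text[:cut]
```

Binary search works here because the rendered length only grows as more elements are included, so "fits" flips from true to false exactly once.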
SSRF guard
HttpRequestTool routes every outbound HTTP call through SsrfGuard, which blocks 127.0.0.0/8, 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16, 169.254.0.0/16, ::1, fc00::/7, and metadata.google.internal.
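A check against that block list can be sketched with Python's standard `ipaddress` module. This is an illustration, not the SsrfGuard implementation: the function name is hypothetical, DNS resolution is left to the caller (the resolved addresses are passed in), and cases such as IPv4-mapped IPv6 addresses are not handled here.

```python
import ipaddress

BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "127.0.0.0/8",     # loopback
        "10.0.0.0/8",      # private
        "172.16.0.0/12",   # private
        "192.168.0.0/16",  # private
        "169.254.0.0/16",  # link-local, includes cloud metadata addresses
        "::1/128",         # IPv6 loopback
        "fc00::/7",        # IPv6 unique local
    )
]
BLOCKED_HOSTS = {"metadata.google.internal"}


def is_blocked(host: str, resolved_ips) -> bool:
    """True if the hostname or any address it resolved to is on the list."""
    if host.lower().rstrip(".") in BLOCKED_HOSTS:
        return True
    for ip_text in resolved_ips:
        ip = ipaddress.ip_address(ip_text)
        # Membership is False when IP and network versions differ.
        if any(ip in network for network in BLOCKED_NETWORKS):
            return True
    return False
```

Checking the resolved addresses rather than only the hostname matters, because a public-looking name can resolve to an internal address.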
Budget enforcement
The budget middleware is a hard gate. If you're out of budget, the LLM doesn't get called — the user sees a budget_exceeded response explaining the limit.
Provider failover
When an LLM call fails with a retryable error (HTTP 429, 529, rate limit, overload), ProviderFailoverPolicy retries with exponential backoff (250 ms base, 8 s cap, ±20% jitter). Before each backoff it tries to rotate API keys — marks the current key as cooling down for 60 seconds, picks the next available key, retries immediately.
If the primary provider exhausts its retries, the policy falls back through all other active providers in priority order. Each provider gets up to 3 retries before the policy moves on.
Streaming has the same failover semantics, but only for errors that happen before the first chunk flows. Once streaming starts, errors propagate to the caller.
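The backoff numbers and the provider fallback order can be sketched together. This Python is illustrative and makes several assumptions: the delay doubles per attempt, jitter is applied after the cap, "up to 3 retries" means one initial attempt plus three retries, and API key rotation is omitted. `call`, `is_retryable`, and `sleep` are hypothetical callables supplied by the caller.

```python
import random

BASE_MS = 250
CAP_MS = 8000
JITTER = 0.20


def backoff_ms(attempt: int, rng=random.random) -> float:
    """Delay before retry number `attempt` (0-based), with +/-20% jitter."""
    raw = min(CAP_MS, BASE_MS * 2 ** attempt)   # assumption: doubling
    factor = 1 + JITTER * (2 * rng() - 1)       # uniform in [0.8, 1.2)
    return raw * factor


def call_with_failover(providers, call, is_retryable, sleep, max_retries=3):
    """Try providers in priority order; retry each on retryable errors."""
    last_error = None
    for provider in providers:
        for attempt in range(max_retries + 1):  # initial try + retries
            try:
                return call(provider)
            except Exception as err:
                if not is_retryable(err):
                    raise
                last_error = err
                if attempt < max_retries:
                    sleep(backoff_ms(attempt) / 1000)
    if last_error is None:
        raise ValueError("no providers configured")
    raise last_error
```

With these constants the un-jittered delays run 250 ms, 500 ms, 1 s, 2 s, 4 s, and then stay at the 8 s cap.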
Background execution
The pipeline doesn't run on the SignalR hub thread. ChatHub.SendMessage enqueues an AgentTask and returns immediately. A BackgroundAgentService (an IHostedService) dequeues tasks, calls the pipeline, and pushes results back over SignalR (TaskStarted, AgentStatus, StreamChunk, TaskCompleted, TaskFailed).
| Tier | Queue backend |
|---|---|
| Personal | In-memory bounded Channel<AgentTask> (capacity 100) |
| Pro | Redis (roadmap) |
| Enterprise | RabbitMQ (roadmap) |
On startup, the service recovers any SessionRun records stuck in Processing (from a crash mid-turn) and marks them Failed.
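The enqueue-and-return pattern for the Personal tier can be approximated with a bounded `asyncio.Queue`, which plays the role of the bounded Channel<AgentTask>. This is a Python sketch, not the BackgroundAgentService code: `worker`, `recover_stuck_runs`, `notify`, and `run_pipeline` are hypothetical, tasks are plain strings, and only three of the SignalR event names are shown.

```python
import asyncio


def recover_stuck_runs(runs):
    """On startup, mark runs left in Processing by a crash as Failed."""
    for run in runs:
        if run["status"] == "Processing":
            run["status"] = "Failed"
    return runs


async def worker(queue, run_pipeline, notify):
    while True:
        task = await queue.get()
        try:
            await notify("TaskStarted", task)
            result = await run_pipeline(task)
            await notify("TaskCompleted", result)
        except Exception as err:
            await notify("TaskFailed", str(err))
        finally:
            queue.task_done()


async def demo():
    events = []

    async def notify(event, payload):
        events.append((event, payload))

    async def run_pipeline(task):
        if task == "bad":
            raise RuntimeError("boom")
        return f"done:{task}"

    queue = asyncio.Queue(maxsize=100)             # bounded, capacity 100
    workers = [asyncio.create_task(worker(queue, run_pipeline, notify))
               for _ in range(3)]                  # like MaxConcurrentTasks = 3
    for task in ("t1", "bad"):
        queue.put_nowait(task)                     # enqueue, return immediately
    await queue.join()                             # wait until both are handled
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return events
```

`put_nowait` raises `asyncio.QueueFull` when the queue is at capacity; whether Sophon rejects or waits in that situation is not stated in the source.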
Configuration
AgentExecutionOptions in appsettings.json:
{
"Sophon": {
"AgentExecution": {
"MaxToolIterations": 100,
"MaxToolIterationsHardLimit": 500,
"PlanMaxSteps": 20,
"MaxExecutionTime": "02:00:00",
"MaxConcurrentTasks": 3
}
}
}
- MaxToolIterations: per-agent override ceiling. MaxToolIterationsHardLimit is absolute.
- PlanMaxSteps: steps per auto-generated plan.
- MaxExecutionTime: wall-clock timeout per agent task.
- MaxConcurrentTasks: background-task concurrency.
Where to go next
- Planning — how complex requests get decomposed into DAGs
- Approval Gates — risk classification and human-in-the-loop gating
- Tiers — which features are available per tier